# Privacy Audit Training

Fine-tuning the Qwen2.5-0.5B-Instruct model with LoRA

**Configuration:** 50 canaries inserted into 10,000 wiki samples (Canary_Ratio ≈ 0.5%)

## 1. Install Dependencies

In [1]:
!pip install -q datasets transformers peft trl accelerate

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m540.5/540.5 kB[0m [31m19.9 MB/s[0m eta [36m0:00:00[0m
[?25h

## 2. Check GPU

In [2]:
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

PyTorch version: 2.10.0+cu128
CUDA available: True
GPU: NVIDIA A100-SXM4-80GB
GPU Memory: 85.1 GB


## 3. Upload Data File

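Upload `wiki_trimmed_with_canary.jsonl` to Colab and place it under `./data/`, the path the training code reads from.

This file should contain ~10,050 samples (10,000 wiki + 50 canaries). The run logged in Section 4 loaded 10,010 examples, so confirm the record count of the file you upload.
Also upload `src/run_metadata.py`, which the Save Model cell imports to record run metadata.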
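
Because the expected sample count matters for the canary ratio, it can help to count the records before training. The following is a minimal sketch, assuming the file is standard JSONL (one JSON object per line); the helper name `count_jsonl_records` is hypothetical and not part of the notebook.

```python
import json
from pathlib import Path


def count_jsonl_records(path):
    """Count non-empty lines in a JSONL file, checking that each one parses as JSON."""
    count = 0
    with Path(path).open("r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"Line {line_number} is not valid JSON: {exc}") from exc
            count += 1
    return count


if __name__ == "__main__":
    # Path taken from the training cell; adjust if the file was uploaded elsewhere.
    data_path = Path("./data/wiki_trimmed_with_canary.jsonl")
    if data_path.exists():
        print(f"{data_path}: {count_jsonl_records(data_path)} records")
    else:
        print(f"{data_path} not found; upload it before running the training cell")
```

If the printed count differs from the expected total, the canary ratio stated in the configuration no longer describes the data that is actually trained on.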

## 4. Training Code

In [3]:
import torch  # already imported in the GPU check cell; repeated so this cell runs on its own
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer, SFTConfig

# ----------------------------------
# Model and Data Configuration
# ----------------------------------
model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # Downloaded from the Hugging Face Hub
train_data_file = "./data/wiki_trimmed_with_canary.jsonl"
output_dir = "qwen2_0p5b_sft"

# ----------------------------------
# 1) Load Tokenizer and Model
# ----------------------------------
print("[INFO] Loading tokenizer and base model...")
tokenizer = AutoTokenizer.from_pretrained(model_name)
print(f"[OK] Tokenizer loaded. Vocab size: {len(tokenizer)}")

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    dtype=torch.bfloat16,  # `torch_dtype` is deprecated in recent transformers releases
)
print("[OK] Model loaded successfully!")

# ----------------------------------
# 2) Load Training Dataset
# ----------------------------------
print("[INFO] Loading training dataset...")
train_dataset = load_dataset("json", data_files=train_data_file, split="train")
print(f"[OK] Dataset loaded. Number of examples: {len(train_dataset)}")

# ----------------------------------
# 3) PEFT/LoRA Configuration
# ----------------------------------
print("[INFO] Configuring LoRA/PEFT...")
lora_config = LoraConfig(
    r=32,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
print("[OK] LoRA configuration applied!")
model.print_trainable_parameters()

# ----------------------------------
# 4) SFT Training Setup (GPU Optimized)
# ----------------------------------
print("[INFO] Setting up SFT Trainer...")
training_args = SFTConfig(
    learning_rate=2e-4,
    num_train_epochs=1,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=1,
    output_dir=output_dir,
    logging_steps=50,
    save_steps=200,
    bf16=True,
    seed=42,  # explicit so it matches the seed recorded in the run metadata (42 is also the default)
    dataloader_num_workers=2,
    dataloader_pin_memory=True,
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
print("[OK] Trainer initialized successfully!")

[INFO] Loading tokenizer and base model...


Error while fetching `HF_TOKEN` secret value from your vault: 'Requesting secret HF_TOKEN timed out. Secrets can only be fetched when running from the Colab UI.'.
You are not authenticated with the Hugging Face Hub in this notebook.
If the error persists, please let us know by opening an issue on GitHub (https://github.com/huggingface/huggingface_hub/issues/new).



[OK] Tokenizer loaded. Vocab size: 151665


`torch_dtype` is deprecated! Use `dtype` instead!



[OK] Model loaded successfully!
[INFO] Loading training dataset...



[OK] Dataset loaded. Number of examples: 10010
[INFO] Configuring LoRA/PEFT...
[OK] LoRA configuration applied!
trainable params: 2,162,688 || all params: 496,195,456 || trainable%: 0.4359
[INFO] Setting up SFT Trainer...



[OK] Trainer initialized successfully!
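
The `trainable params` figure in the output above can be reproduced by hand, which is a quick check that LoRA was attached where expected. The sketch below rests on assumptions the notebook does not state: that PEFT's default target modules for this architecture are `q_proj` and `v_proj` (no `target_modules` is passed to `LoraConfig`), and that the model has 24 layers, a hidden size of 896, and a key/value projection width of 128. Under those assumptions the arithmetic matches the logged count.

```python
def lora_param_count(r, in_features, out_features):
    """LoRA adds A (r x in_features) and B (out_features x r) to each wrapped linear layer."""
    return r * (in_features + out_features)


# Assumed architecture values (not stated in the notebook).
NUM_LAYERS = 24
HIDDEN_SIZE = 896
KV_DIM = 128  # assumed: 2 key/value heads x head dimension 64

# Values taken from the LoraConfig in the training cell.
R = 32
LORA_ALPHA = 16

per_layer = (
    lora_param_count(R, HIDDEN_SIZE, HIDDEN_SIZE)  # q_proj: 896 -> 896
    + lora_param_count(R, HIDDEN_SIZE, KV_DIM)     # v_proj: 896 -> 128
)
trainable = NUM_LAYERS * per_layer

ALL_PARAMS = 496_195_456  # "all params" as printed by print_trainable_parameters()
trainable_pct = 100 * trainable / ALL_PARAMS
scaling = LORA_ALPHA / R  # standard LoRA scaling applied to the low-rank update

print(f"trainable params: {trainable:,}")
print(f"trainable%: {trainable_pct:.4f}")
print(f"LoRA scaling (alpha / r): {scaling}")
```

With `lora_alpha=16` and `r=32` the low-rank update is scaled by 0.5. Many recipes set `lora_alpha` equal to or larger than `r`, so a value below `r` is worth confirming as intentional.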


## 5. Start Training

In [4]:
print("=" * 60)
print("[INFO] Starting fine-tuning...")
print("=" * 60)
trainer.train()

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None, 'pad_token_id': 151643}.


[INFO] Starting fine-tuning...


| Step | Training Loss |
|------|---------------|
| 50   | 2.501942      |
| 100  | 2.398805      |
| 150  | 2.361538      |
| 200  | 2.408564      |
| 250  | 2.375305      |
| 300  | 2.353803      |


TrainOutput(global_step=313, training_loss=2.3989940192371892, metrics={'train_runtime': 341.0938, 'train_samples_per_second': 29.347, 'train_steps_per_second': 0.918, 'total_flos': 1.78985834814336e+16, 'train_loss': 2.3989940192371892})
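
The step count and throughput in `TrainOutput` follow from the dataset size and batch settings. The sketch below reproduces them, assuming a single GPU (the GPU check prints one device, but the notebook does not state the device count) and that the final partial batch is kept rather than dropped.

```python
import math

# Values taken from the notebook's configuration and logged output.
num_examples = 10_010
per_device_train_batch_size = 32
gradient_accumulation_steps = 1
num_train_epochs = 1
train_runtime_seconds = 341.0938

num_devices = 1  # assumption: single-GPU run

effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
# The last batch is smaller than 32, so round up.
steps_per_epoch = math.ceil(num_examples / effective_batch_size)
total_steps = steps_per_epoch * num_train_epochs
samples_per_second = num_examples * num_train_epochs / train_runtime_seconds

print(f"effective batch size: {effective_batch_size}")
print(f"total optimizer steps: {total_steps}")
print(f"samples per second: {samples_per_second:.3f}")
```

With `logging_steps=50`, the last logged loss is at step 300 even though training runs to step 313. With `save_steps=200`, the only step-based checkpoint is `checkpoint-200`.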

## 6. Save Model

In [5]:
print("[INFO] Saving trained model...")
trainer.save_model(output_dir)
print(f"[DONE] Model saved to {output_dir}")

# Record SFT training metadata
import sys
sys.path.insert(0, './src')
from run_metadata import append_metadata
append_metadata({
    'type': 'sft_training',
    'seed': 42,
    'model_path': output_dir,
    'train_data': train_data_file,
    'num_samples': len(train_dataset),
    'hyperparams': {
        'learning_rate': 2e-4,
        'num_train_epochs': 1,
        'per_device_train_batch_size': 32,
        'lora_r': 32,
        'lora_alpha': 16,
    },
})
print('[INFO] Run metadata recorded to reports/run_metadata.jsonl')

[INFO] Saving trained model...
[DONE] Model saved to qwen2_0p5b_sft
[INFO] Run metadata recorded to reports/run_metadata.jsonl
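
The contents of `src/run_metadata.py` are not shown in this notebook. For readers without that file, the following is a hypothetical stand-in, assuming the helper appends one JSON object per line to `reports/run_metadata.jsonl` (the path printed above). The function name `append_metadata_sketch` and the added `timestamp` field are illustrative assumptions, not the project's actual implementation.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_metadata_sketch(record, path="reports/run_metadata.jsonl"):
    """Append one JSON object per line, creating the parent directory if needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    # The timestamp field is an illustrative addition, not taken from the notebook.
    entry = {"timestamp": datetime.now(timezone.utc).isoformat(), **record}
    with target.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return entry
```

Appending to a JSONL file keeps earlier runs intact, so each training run adds one line that can later be loaded with `load_dataset("json", ...)` or read line by line.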


## 7. Download Trained Model

In [6]:
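# Package and download
from google.colab import files
!zip -r {output_dir}.zip {output_dir}/
# Download the archive that zip just created. The original call used the
# hard-coded name 'stage1_sft.zip', which does not exist and raises the
# FileNotFoundError shown at the end of the output below.
files.download(f'{output_dir}.zip')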

  adding: qwen2_0p5b_sft/ (stored 0%)
  adding: qwen2_0p5b_sft/training_args.bin (deflated 53%)
  adding: qwen2_0p5b_sft/chat_template.jinja (deflated 71%)
  adding: qwen2_0p5b_sft/README.md (deflated 44%)
  adding: qwen2_0p5b_sft/checkpoint-200/ (stored 0%)
  adding: qwen2_0p5b_sft/checkpoint-200/training_args.bin (deflated 53%)
  adding: qwen2_0p5b_sft/checkpoint-200/chat_template.jinja (deflated 71%)
  adding: qwen2_0p5b_sft/checkpoint-200/trainer_state.json (deflated 63%)
  adding: qwen2_0p5b_sft/checkpoint-200/README.md (deflated 65%)
  adding: qwen2_0p5b_sft/checkpoint-200/rng_state.pth (deflated 26%)
  adding: qwen2_0p5b_sft/checkpoint-200/adapter_model.safetensors (deflated 7%)
  adding: qwen2_0p5b_sft/checkpoint-200/tokenizer_config.json (deflated 59%)
  adding: qwen2_0p5b_sft/checkpoint-200/optimizer.pt (deflated 8%)
  adding: qwen2_0p5b_sft/checkpoint-200/tokenizer.json (deflated 81%)
  adding: qwen2_0p5b_sft/checkpoint-200/adapter_config.json (deflated 56%)
  adding: qwen2_

FileNotFoundError: Cannot find file: stage1_sft.zip
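
The `FileNotFoundError` above comes from asking for a file name that differs from the archive that was written. One way to avoid this class of mistake is to let a single function both create the archive and return its path. The sketch below uses only the standard library; `package_dir` is a hypothetical helper, not part of the notebook.

```python
import shutil
from pathlib import Path


def package_dir(output_dir):
    """Zip output_dir into '<output_dir>.zip' next to it and return the archive path."""
    src = Path(output_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"Directory not found: {src}")
    archive = shutil.make_archive(
        base_name=str(src),        # the archive is written to '<src>.zip'
        format="zip",
        root_dir=str(src.parent),  # paths inside the zip start at the directory name
        base_dir=src.name,
    )
    return Path(archive)
```

In Colab, the returned path could then be passed straight to the download call, for example `files.download(str(package_dir(output_dir)))`, so the name being downloaded always matches the file that was created.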