# Privacy Audit - DPO Ablation Training (Stage 2)

Train two DPO variants for the canary ablation experiment:
- **Section A**: DPO-no-canary (preference data without canary pairs)
- **Section B**: DPO-with-canary (preference data with canary pairs)

Both variants use the same SFT base model and identical hyperparameters.

**Configuration:** 50 canaries, 100 canary preference pairs (2 per canary)
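
For context on what both variants optimize, here is a minimal sketch of the standard sigmoid DPO loss for a single preference pair. It assumes `src/train_dpo.py` uses this default form with the SFT model as the frozen reference; the function name `dpo_loss` and the value `beta=0.1` are illustrative assumptions, not taken from the training script.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sigmoid DPO loss for one pair, given summed sequence log-probabilities.

    beta=0.1 is an assumed value, not read from the training script.
    """
    # Margin: how much more the policy prefers "chosen" over "rejected"
    # than the frozen reference model does.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    x = beta * margin
    # -log(sigmoid(x)), written in a numerically stable form.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))
```

When the policy equals the reference (as at the start of training), the margin is zero and the loss is `log(2)`, about 0.693, regardless of `beta`.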

**Prerequisites:**
1. Upload the following files to Colab:
   - `data/wiki_trimmed_with_canary.jsonl` (10,050 samples)
   - `data/canary_output.txt` (50 canaries)
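   - `models/stage1_sft/` folder (Section 3 expects it at `./qwen2_0p5b_sft_50`; adjust `SFT_MODEL_DIR` if you upload it under a different name)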
   - `src/prepare_preference_data.py`
   - `src/train_dpo.py`
   - `src/run_metadata.py`

## 1. Install Dependencies

In [1]:
!pip install -q datasets transformers peft trl accelerate

## 2. Check GPU

In [2]:
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    print("Warning: No GPU detected. DPO training requires a GPU.")

PyTorch version: 2.10.0+cu128
CUDA available: True
GPU: NVIDIA A100-SXM4-80GB
GPU Memory: 85.1 GB


## 3. Configure Paths

In [4]:
import os

# Base model (downloaded from HuggingFace)
BASE_MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"

# Uploaded paths (adjust based on your Colab upload location)
SFT_MODEL_DIR = "./qwen2_0p5b_sft_50"
WIKI_FILE = "./data/wiki_trimmed_with_canary.jsonl"
CANARY_FILE = "./data/canary_output.txt"
PREPARE_SCRIPT = "./src/prepare_preference_data.py"
TRAIN_SCRIPT = "./src/train_dpo.py"
METADATA_SCRIPT = "./src/run_metadata.py"

# Output paths
DATA_NO_CANARY = "./data/preference_data_no_canary.jsonl"
DATA_WITH_CANARY = "./data/preference_data_with_canary.jsonl"
OUTPUT_NO_CANARY = "./stage2_dpo_no_canary"
OUTPUT_WITH_CANARY = "./stage2_dpo_with_canary"

# Verify uploaded files
required_files = [
    (SFT_MODEL_DIR, "SFT model directory"),
    (WIKI_FILE, "Wiki data file"),
    (CANARY_FILE, "Canary file"),
    (PREPARE_SCRIPT, "Preference data script"),
    (TRAIN_SCRIPT, "DPO training script"),
]
all_ok = True
for path, desc in required_files:
    exists = os.path.exists(path)
    status = "OK" if exists else "MISSING"
    print(f"  [{status}] {desc}: {path}")
    if not exists:
        all_ok = False

if all_ok:
    print("\nAll files verified!")
else:
    print("\nSome files are missing. Please upload them before proceeding.")

  [OK] SFT model directory: ./qwen2_0p5b_sft_50
  [OK] Wiki data file: ./data/wiki_trimmed_with_canary.jsonl
  [OK] Canary file: ./data/canary_output.txt
  [OK] Preference data script: ./src/prepare_preference_data.py
  [OK] DPO training script: ./src/train_dpo.py

All files verified!
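
The check above only prints a status and lets later cells run even when a file is missing. It also does not cover `METADATA_SCRIPT`, which is defined but not listed in `required_files`. A minimal fail-fast sketch follows; the helper names `find_missing` and `require_files` are hypothetical and not part of the project scripts.

```python
import os

def find_missing(paths):
    """Return the paths that do not exist on disk, in input order."""
    return [p for p in paths if not os.path.exists(p)]

def require_files(paths):
    """Raise FileNotFoundError naming every missing path; return None otherwise."""
    missing = find_missing(paths)
    if missing:
        raise FileNotFoundError("Missing required files: " + ", ".join(missing))

# Example call in this notebook (names come from the configuration cell):
# require_files([SFT_MODEL_DIR, WIKI_FILE, CANARY_FILE,
#                PREPARE_SCRIPT, TRAIN_SCRIPT, METADATA_SCRIPT])
```

Raising an exception stops "Run all" at this cell instead of failing later inside a training subprocess.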


## 4. Prepare Preference Data (Two Variants)

Generate both no-canary and with-canary preference data using the same seed.
Normal preference pairs will be identical across both variants.

In [5]:
# Generate both variants
!python {PREPARE_SCRIPT} --seed 42

[INFO] Loading data...
[INFO] Loaded 10000 wiki texts, 50 canaries
[INFO] Generating no-canary variant (seed=42)...
[INFO] Normal pairs: 1912, Canary pairs: 0, Total: 1912
[DONE] Saved 1912 pairs to data/preference_data_no_canary.jsonl
[INFO] Generating with-canary variant (seed=42)...
[INFO] Normal pairs: 1912, Canary pairs: 100, Total: 2012
[DONE] Saved 2012 pairs to data/preference_data_with_canary.jsonl
[INFO] Verifying data equivalence...
[OK] Normal preference pairs are identical across variants.
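
The script reports the equivalence itself, but the claim can also be checked independently from the two JSONL files. The sketch below compares records by their canonical JSON form and makes no assumption about the record schema; the helper names `load_pairs` and `split_shared_and_extra` are hypothetical.

```python
import json

def load_pairs(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def _canon(record):
    """Serialize a record with sorted keys so key order does not matter."""
    return json.dumps(record, sort_keys=True, ensure_ascii=False)

def split_shared_and_extra(base_pairs, extended_pairs):
    """Split extended_pairs into records also in base_pairs and records that are not."""
    base_keys = {_canon(r) for r in base_pairs}
    shared = [r for r in extended_pairs if _canon(r) in base_keys]
    extra = [r for r in extended_pairs if _canon(r) not in base_keys]
    return shared, extra

# Example call in this notebook:
# shared, extra = split_shared_and_extra(load_pairs(DATA_NO_CANARY),
#                                        load_pairs(DATA_WITH_CANARY))
# print(len(shared), len(extra))
```

If the normal pairs are unique and no canary pair duplicates a normal pair, the counts should match the log: 1912 shared records and 100 extra ones.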


In [6]:
# Verify generated files
import json

for path, label in [(DATA_NO_CANARY, "no-canary"), (DATA_WITH_CANARY, "with-canary")]:
    with open(path) as f:
        lines = f.readlines()
    print(f"{label}: {len(lines)} pairs")
    sample = json.loads(lines[0])
    print(f"  Sample prompt: {sample['prompt'][:80]}...")

no-canary: 1912 pairs
  Sample prompt: Summarize the following text in one sentence:

Yener Yörük (born May 25, 1963 in...
with-canary: 2012 pairs
  Sample prompt: Summarize the following text in one sentence:

Yener Yörük (born May 25, 1963 in...


## 5. Section A: Train DPO-no-canary

Train DPO using preference data **without** canary pairs.
Output: `./stage2_dpo_no_canary/` in Colab (extract to `models/stage2_dpo_no_canary/` after download)

In [7]:
!python {TRAIN_SCRIPT} \
    --preference-data {DATA_NO_CANARY} \
    --output-dir {OUTPUT_NO_CANARY} \
    --sft-model {SFT_MODEL_DIR} \
    --base-model {BASE_MODEL_NAME} \
    --seed 42

Privacy Audit - DPO Training (Stage 2)
[INFO] Treating --base-model as HuggingFace model ID: Qwen/Qwen2.5-0.5B-Instruct

[INFO] Loading tokenizer...
config.json: 100% 659/659 [00:00<00:00, 3.35MB/s]
tokenizer_config.json: 7.30kB [00:00, 22.3MB/s]
vocab.json: 2.78MB [00:00, 118MB/s]
merges.txt: 1.67MB [00:00, 115MB/s]
tokenizer.json: 7.03MB [00:00, 140MB/s]
[OK] Tokenizer loaded. Vocab size: 151665

[INFO] Loading SFT model (Stage 1)...
`torch_dtype` is deprecated! Use `dtype` instead!
model.safetensors: 100% 988M/988M [00:02<00:00, 371MB/s]    
Loading weights: 100% 290/290 [00:00<00:00, 898.51it/s, Materializing param=model.norm.weight]                              
generation_config.json: 100% 242/242 [00:00<00:00, 1.58MB/s]
[INFO] Model params: total=496,195,456, trainable=2,162,688 (0.4359%)
[OK] SFT model loaded!

[INFO] Loading preference dataset from ./data/preference_data_no_canary.jsonl...
Generating train split: 1912 examples [00:00, 128943.46 examples/s]
[OK] Dataset loaded.

In [8]:
# Verify no-canary model output
print("DPO-no-canary model files:")
!ls -la {OUTPUT_NO_CANARY}/

DPO-no-canary model files:
total 19660
drwxr-xr-x 4 root root     4096 Feb 20 01:12 .
drwxr-xr-x 1 root root     4096 Feb 20 01:12 ..
-rw-r--r-- 1 root root      980 Feb 20 01:12 adapter_config.json
-rw-r--r-- 1 root root  8663400 Feb 20 01:12 adapter_model.safetensors
-rw-r--r-- 1 root root     2507 Feb 20 01:12 chat_template.jinja
drwxr-xr-x 2 root root     4096 Feb 20 01:11 checkpoint-100
drwxr-xr-x 2 root root     4096 Feb 20 01:12 checkpoint-120
-rw-r--r-- 1 root root     2427 Feb 20 01:12 README.md
-rw-r--r-- 1 root root      665 Feb 20 01:12 tokenizer_config.json
-rw-r--r-- 1 root root 11421892 Feb 20 01:12 tokenizer.json
-rw-r--r-- 1 root root     6033 Feb 20 01:12 training_args.bin
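
As a quick consistency check, the parameter counts in the training log can be compared with the adapter file size in the listing above. The sketch uses only numbers printed in this notebook; reading the result as 4-byte (float32) storage plus a small file header is an inference, not something the logs state.

```python
total_params = 496_195_456     # "total" from the training log
trainable_params = 2_162_688   # "trainable" from the training log
adapter_bytes = 8_663_400      # size of adapter_model.safetensors in the listing

trainable_pct = 100 * trainable_params / total_params
bytes_per_param = adapter_bytes / trainable_params
# Bytes left over if every trainable parameter takes 4 bytes (assumed float32).
overhead_bytes = adapter_bytes - 4 * trainable_params

print(f"trainable: {trainable_pct:.4f}%")
print(f"bytes per trainable parameter: {bytes_per_param:.3f}")
print(f"bytes beyond 4 per parameter: {overhead_bytes}")
```

The percentage reproduces the 0.4359% reported by the script, and the file is only slightly larger than 4 bytes per trainable parameter, which suggests the saved adapter contains just the trainable weights.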


## 6. Section B: Train DPO-with-canary

Train DPO using preference data **with** canary pairs.
Output: `./stage2_dpo_with_canary/` in Colab (extract to `models/stage2_dpo_with_canary/` after download)

In [9]:
!python {TRAIN_SCRIPT} \
    --preference-data {DATA_WITH_CANARY} \
    --output-dir {OUTPUT_WITH_CANARY} \
    --sft-model {SFT_MODEL_DIR} \
    --base-model {BASE_MODEL_NAME} \
    --seed 42

Privacy Audit - DPO Training (Stage 2)
[INFO] Treating --base-model as HuggingFace model ID: Qwen/Qwen2.5-0.5B-Instruct

[INFO] Loading tokenizer...
[OK] Tokenizer loaded. Vocab size: 151665

[INFO] Loading SFT model (Stage 1)...
`torch_dtype` is deprecated! Use `dtype` instead!
Loading weights: 100% 290/290 [00:00<00:00, 757.36it/s, Materializing param=model.norm.weight]                              
[INFO] Model params: total=496,195,456, trainable=2,162,688 (0.4359%)
[OK] SFT model loaded!

[INFO] Loading preference dataset from ./data/preference_data_with_canary.jsonl...
Generating train split: 2012 examples [00:00, 240172.46 examples/s]
[OK] Dataset loaded. Number of examples: 2012
[INFO] Sample data: {'prompt': 'Summarize the following text in one sentence:\n\nYener Yörük (born May 25, 1963 in Manisa) is a Turkish physician specialising in thoracic surgery, a university professor, and Chancellor (Rector) of the Trakya University, Edirne 2012-2016.\n\nBiograph', 'chosen': 'The tex

In [10]:
# Verify with-canary model output
print("DPO-with-canary model files:")
!ls -la {OUTPUT_WITH_CANARY}/

DPO-with-canary model files:
total 19660
drwxr-xr-x 4 root root     4096 Feb 20 01:16 .
drwxr-xr-x 1 root root     4096 Feb 20 01:13 ..
-rw-r--r-- 1 root root      980 Feb 20 01:16 adapter_config.json
-rw-r--r-- 1 root root  8663400 Feb 20 01:16 adapter_model.safetensors
-rw-r--r-- 1 root root     2507 Feb 20 01:16 chat_template.jinja
drwxr-xr-x 2 root root     4096 Feb 20 01:15 checkpoint-100
drwxr-xr-x 2 root root     4096 Feb 20 01:16 checkpoint-126
-rw-r--r-- 1 root root     2431 Feb 20 01:16 README.md
-rw-r--r-- 1 root root      665 Feb 20 01:16 tokenizer_config.json
-rw-r--r-- 1 root root 11421892 Feb 20 01:16 tokenizer.json
-rw-r--r-- 1 root root     6033 Feb 20 01:16 training_args.bin
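
The final checkpoints differ between the two runs (`checkpoint-120` versus `checkpoint-126`) because the with-canary dataset has 100 more pairs. The sketch below shows arithmetic that reproduces both numbers. The effective batch size of 16 and the single epoch are assumptions inferred from the checkpoint names, since the logs shown here do not state them, and `optimizer_steps` is a hypothetical helper.

```python
import math

def optimizer_steps(num_examples, effective_batch_size, epochs=1):
    """Optimizer steps when the last partial batch of each epoch is kept."""
    return math.ceil(num_examples / effective_batch_size) * epochs

# Dataset sizes come from the data preparation log; 16 is an assumed
# effective batch size (per-device batch size x gradient accumulation).
for label, n in [("no-canary", 1912), ("with-canary", 2012)]:
    print(f"{label}: {optimizer_steps(n, 16)} steps")
```

Because the step counts differ, the two variants share hyperparameters but not the exact number of updates, which is worth keeping in mind when comparing them.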


## 6b. Training Effectiveness Verification

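Verify that DPO training actually modified the LoRA adapter weights (anti-regression check). Each comparison reports the mean absolute difference per tensor, averaged over all tensors present in both adapter files.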

In [11]:
from safetensors.torch import load_file

# Load the adapter weights saved by each stage
sft_weights = load_file(f'{SFT_MODEL_DIR}/adapter_model.safetensors')
nc_weights = load_file(f'{OUTPUT_NO_CANARY}/adapter_model.safetensors')
wc_weights = load_file(f'{OUTPUT_WITH_CANARY}/adapter_model.safetensors')

# (label, first state dict, second state dict)
comparisons = [
    ('SFT vs DPO-NC', sft_weights, nc_weights),
    ('SFT vs DPO-WC', sft_weights, wc_weights),
    ('DPO-NC vs DPO-WC', nc_weights, wc_weights),
]

print('Training Effectiveness Check:')
for name, a, b in comparisons:
    # Mean absolute difference per tensor, for tensors present in both files
    diffs = [(a[k] - b[k]).abs().mean().item() for k in a if k in b]
    avg_diff = sum(diffs) / len(diffs) if diffs else 0
    status = 'OK (weights differ)' if avg_diff > 1e-6 else 'WARNING: weights identical!'
    print(f'  {name}: avg_abs_diff={avg_diff:.6f} -> {status}')

Training Effectiveness Check:
  SFT vs DPO-NC: avg_abs_diff=0.000992 -> OK (weights differ)
  SFT vs DPO-WC: avg_abs_diff=0.000994 -> OK (weights differ)
  DPO-NC vs DPO-WC: avg_abs_diff=0.000335 -> OK (weights differ)
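
The comparison above silently skips tensors whose names appear in only one file, and it reports a difference of 0 when no names overlap at all. A small coverage check run before the comparison makes that case visible. This is a sketch; `key_coverage` is a hypothetical helper that works on any mapping, including the dicts returned by `load_file`.

```python
def key_coverage(a, b):
    """Compare the key sets of two state dicts (any mappings)."""
    keys_a, keys_b = set(a), set(b)
    return {
        "shared": sorted(keys_a & keys_b),
        "only_in_a": sorted(keys_a - keys_b),
        "only_in_b": sorted(keys_b - keys_a),
    }

# Example call in this notebook:
# cov = key_coverage(sft_weights, nc_weights)
# print(len(cov["shared"]), len(cov["only_in_a"]), len(cov["only_in_b"]))
```

If `only_in_a` or `only_in_b` is non-empty, the averaged difference covers only part of the adapter and should be read with that in mind.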


## 7. (Optional) Upload Models to Google Drive

In [34]:
# Uncomment to mount Google Drive and copy models
# from google.colab import drive
# drive.mount('/content/drive')
#
# import shutil
# drive_dest = "/content/drive/MyDrive/privacy-audit/models"
# os.makedirs(drive_dest, exist_ok=True)
#
# shutil.copytree(OUTPUT_NO_CANARY, f"{drive_dest}/stage2_dpo_no_canary", dirs_exist_ok=True)
# shutil.copytree(OUTPUT_WITH_CANARY, f"{drive_dest}/stage2_dpo_with_canary", dirs_exist_ok=True)
# print("Models uploaded to Google Drive!")

## 8. Download Models

Download the trained models to your local machine:
- Right-click the model directories in the Colab file browser to download
- Or use the zip cells below

In [12]:
import shutil

# Zip no-canary model
shutil.make_archive("/content/stage2_dpo_no_canary", 'zip', OUTPUT_NO_CANARY)
print("Created /content/stage2_dpo_no_canary.zip")

# Zip with-canary model
shutil.make_archive("/content/stage2_dpo_with_canary", 'zip', OUTPUT_WITH_CANARY)
print("Created /content/stage2_dpo_with_canary.zip")

print("\nDownload these zip files and extract to:")
print("  models/stage2_dpo_no_canary/")
print("  models/stage2_dpo_with_canary/")

Created /content/stage2_dpo_no_canary.zip
Created /content/stage2_dpo_with_canary.zip

Download these zip files and extract to:
  models/stage2_dpo_no_canary/
  models/stage2_dpo_with_canary/
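
Before downloading, it can help to confirm that each archive is readable and contains the adapter files. The sketch below uses only the standard library; `missing_from_archive` is a hypothetical helper, and the file names in the example call are taken from the directory listings earlier in this notebook.

```python
import zipfile

def missing_from_archive(zip_path, required_names):
    """Return the required member names that are absent from the zip archive.

    Raises zipfile.BadZipFile if any member fails its CRC check.
    """
    with zipfile.ZipFile(zip_path) as zf:
        # testzip() returns the name of the first corrupt member, or None.
        bad = zf.testzip()
        if bad is not None:
            raise zipfile.BadZipFile(f"Corrupt member: {bad}")
        names = set(zf.namelist())
    return [n for n in required_names if n not in names]

# Example call in this notebook:
# print(missing_from_archive("/content/stage2_dpo_no_canary.zip",
#                            ["adapter_config.json", "adapter_model.safetensors"]))
```

An empty list means every required file is present in the archive.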


In [13]:
print("DPO ablation training complete!")
print(f"  No-canary model: {OUTPUT_NO_CANARY}")
print(f"  With-canary model: {OUTPUT_WITH_CANARY}")

DPO ablation training complete!
  No-canary model: ./stage2_dpo_no_canary
  With-canary model: ./stage2_dpo_with_canary
