# Week 2 — Notebook 1: Dataset Preparation & DPO Pair Construction

This notebook covers:
1. Loading and inspecting raw data (~61k rows in the proxy dataset used here)
2. De-identification with standard libraries (no LLMs)
3. Data quality assessment
4. Constructing chosen/rejected pairs for DPO via two strategies:
   - **Weighted sum** scoring
   - **Single-perspective filtering** (isolating one metric delta)
5. Sample efficiency analysis — how much survives aggressive filtering

---
> **GPU requirement:** None — this is CPU/data-engineering work.  
> **Estimated runtime:** ~25 min for the 61k-row proxy dataset, dominated by the Presidio de-identification pass (about 23 min in the run recorded here).

## 0. Environment & Imports

### Colab Setup (skip if running locally)

Run this cell first when opening from Colab.

In [None]:
import sys, os

IN_COLAB = "google.colab" in sys.modules
if IN_COLAB:
    # Install dependencies
    os.system("pip install -q datasets pandas pyarrow presidio-analyzer presidio-anonymizer spacy tqdm")
    os.system("python -m spacy download en_core_web_lg -q")

    # Clone repo so relative paths work
    if not os.path.exists("/content/agentic-ai-learning"):
        os.system("git clone -q https://github.com/amnghd/agentic-ai-learning.git /content/agentic-ai-learning")

    # Set working directory to this notebook's folder
    os.chdir("/content/agentic-ai-learning/Projects/week2/notebooks")
    os.makedirs("../data", exist_ok=True)
    print("Colab setup complete. Working dir:", os.getcwd())
else:
    print("Running locally — no setup needed.")

In [4]:
# Install dependencies if needed (comment out after first run)
# 
# !pip3 install datasets pandas pyarrow presidio-analyzer presidio-anonymizer spacy tqdm
# !python -m spacy download en_core_web_lg


In [5]:
import os
import json
import random
import warnings
from pathlib import Path
from typing import Optional

import numpy as np
import pandas as pd
from datasets import Dataset, DatasetDict, load_dataset
from tqdm.auto import tqdm

warnings.filterwarnings("ignore")
random.seed(42)
np.random.seed(42)

# Paths
DATA_DIR = Path("../data")
DATA_DIR.mkdir(exist_ok=True)

print("Imports OK")


Imports OK


## 1. Load Raw Dataset

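We use **UltraFeedback Binarized** (`HuggingFaceH4/ultrafeedback_binarized`, `train_prefs` split) as a public proxy for preference data.  
To use your own Spark-processed Parquet/JSONL instead, comment out Option A in the cell below and uncomment one of the Option B lines.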

In [6]:
# --- Option A: Load from HuggingFace Hub (public proxy dataset) ---
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")
df = dataset.to_pandas()

# --- Option B: Load your own data ---
# df = pd.read_parquet(DATA_DIR / "rollouts_raw.parquet")
# df = pd.read_json(DATA_DIR / "rollouts_raw.jsonl", lines=True)

print(f"Dataset shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
df.head(3)

Generating train_prefs split: 100%|██████████| 61135/61135 [00:00<00:00, 258697.12 examples/s]
Generating train_sft split: 100%|██████████| 61135/61135 [00:00<00:00, 453055.92 examples/s]
Generating test_prefs split: 100%|██████████| 2000/2000 [00:00<00:00, 213304.04 examples/s]
Generating test_sft split: 100%|██████████| 1000/1000 [00:00<00:00, 162098.71 examples/s]
Generating train_gen split: 100%|██████████| 61135/61135 [00:00<00:00, 390110.38 examples/s]
Generating test_gen split: 100%|██████████| 1000/1000 [00:00<00:00, 191197.70 examples/s]


Dataset shape: (61135, 7)
Columns: ['prompt', 'prompt_id', 'chosen', 'rejected', 'messages', 'score_chosen', 'score_rejected']


| | prompt | prompt_id | chosen | rejected | messages | score_chosen | score_rejected |
|---|---|---|---|---|---|---|---|
| 0 | how can i develop a habit of drawing daily | 086b3e24f29b8956a01059f79c56db35d118a06fb6b844... | [{'content': 'how can i develop a habit of dra... | [{'content': 'how can i develop a habit of dra... | [{'content': 'how can i develop a habit of dra... | 8.5 | 8.5 |
| 1 | how can I transform the getPosition method of ... | 2766cbd1fed7f982d94b031596e771c841668bd8913839... | [{'content': 'how can I transform the getPosit... | [{'content': 'how can I transform the getPosit... | [{'content': 'how can I transform the getPosit... | 6.5 | 6.5 |
| 2 | Given a sentence in French, provide an equival... | 0efb42706b3fcc906f579505c7cc0c4e68a640ab3862b1... | [{'content': 'Given a sentence in French, prov... | [{'content': 'Given a sentence in French, prov... | [{'content': 'Given a sentence in French, prov... | 6.0 | 3.0 |


In [7]:
# Quick stats
print(df.dtypes)
print("\nNull counts:")
print(df.isnull().sum())

prompt             object
prompt_id          object
chosen             object
rejected           object
messages           object
score_chosen      float64
score_rejected    float64
dtype: object

Null counts:
prompt            0
prompt_id         0
chosen            0
rejected          0
messages          0
score_chosen      0
score_rejected    0
dtype: int64


## 2. De-identification (Standard Libraries)

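Using **Microsoft Presidio**: no LLM calls, only rule-based recognizers plus a spaCy NER model, so results are reproducible for a fixed library and model version. Expect some mislabels from the default recognizers: in the test below, the 7-digit phone number is tagged `<DATE_TIME>` rather than `<PHONE_NUMBER>`.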

In [9]:
try:
    from presidio_analyzer import AnalyzerEngine
    from presidio_anonymizer import AnonymizerEngine

    analyzer = AnalyzerEngine()
    anonymizer = AnonymizerEngine()
    PRESIDIO_AVAILABLE = True
    print("Presidio loaded.")
except ImportError:
    PRESIDIO_AVAILABLE = False
    print("Presidio not installed — skipping de-id (install presidio-analyzer presidio-anonymizer spacy + en_core_web_lg)")


def deidentify(text: str) -> str:
    """Replace PII entities with type placeholders, e.g. <PERSON>."""
    if not PRESIDIO_AVAILABLE or not isinstance(text, str):
        return text
    results = analyzer.analyze(text=text, language="en")
    anonymized = anonymizer.anonymize(text=text, analyzer_results=results)
    return anonymized.text


# Test
sample = "Hi, I'm John Smith and my email is john@example.com, phone 555-1234."
print("Before:", sample)
print("After: ", deidentify(sample))

Presidio loaded.
Before: Hi, I'm John Smith and my email is john@example.com, phone 555-1234.
After:  Hi, I'm <PERSON> and my email is <EMAIL_ADDRESS>, phone <DATE_TIME>.
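
The sample output above tags the phone number as `<DATE_TIME>`. One way to make such cases predictable is a regex pre-pass for emails and phone numbers that runs before the NER-based pass. The sketch below uses only the standard library. The patterns and the name `regex_prepass` are illustrative assumptions, not part of Presidio, and the phone pattern covers only simple 7- or 10-digit US-style numbers.

```python
import re

# Hypothetical patterns for illustration; tune them to your own data.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b(?:\d{3}[-.\s])?\d{3}[-.\s]\d{4}\b")


def regex_prepass(text: str) -> str:
    """Replace emails and simple phone numbers with type placeholders."""
    text = EMAIL_RE.sub("<EMAIL_ADDRESS>", text)
    text = PHONE_RE.sub("<PHONE_NUMBER>", text)
    return text


sample = "Hi, I'm John Smith and my email is john@example.com, phone 555-1234."
print(regex_prepass(sample))
```

Running this before `deidentify` would leave names to the NER model while keeping the structured entities under deterministic rules.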


In [10]:
# Apply to the prompt column (adjust column name to your schema)
PROMPT_COL = "prompt"  # change if your column is named differently

if PROMPT_COL in df.columns and PRESIDIO_AVAILABLE:
    tqdm.pandas(desc="De-identifying prompts")
    df["prompt_clean"] = df[PROMPT_COL].progress_apply(deidentify)
    changed = (df["prompt_clean"] != df[PROMPT_COL]).sum()
    print(f"Records modified by de-id: {changed}/{len(df)} ({changed/len(df)*100:.1f}%)")
else:
    df["prompt_clean"] = df.get(PROMPT_COL, "")

De-identifying prompts: 100%|██████████| 61135/61135 [23:06<00:00, 44.11it/s]

Records modified by de-id: 31156/61135 (51.0%)





## 3. Data Quality Assessment

In [11]:
def quality_metrics(df: pd.DataFrame, text_col: str) -> pd.DataFrame:
    """Compute basic quality signals on a text column."""
    d = df.copy()
    d["len_chars"] = d[text_col].str.len()
    d["len_words"] = d[text_col].str.split().str.len()
    d["is_empty"] = d[text_col].str.strip().str.len() == 0
    d["has_url"] = d[text_col].str.contains(r"https?://", na=False)
    d["is_duplicate"] = d.duplicated(subset=[text_col], keep=False)
    return d


df = quality_metrics(df, "prompt_clean")

print("=== Quality Summary ===")
print(f"Total rows        : {len(df):,}")
print(f"Empty prompts     : {df['is_empty'].sum():,}")
print(f"Duplicate prompts : {df['is_duplicate'].sum():,}")
print(f"Contains URL      : {df['has_url'].sum():,}")
print(f"\nPrompt length (words):")
print(df["len_words"].describe().round(1).to_string())

=== Quality Summary ===
Total rows        : 61,135
Empty prompts     : 0
Duplicate prompts : 89
Contains URL      : 38

Prompt length (words):
count    61135.0
mean       106.0
std        146.8
min          1.0
25%         18.0
50%         57.0
75%        122.0
max       2312.0


In [12]:
# Filter out bad rows
before = len(df)
df = df[
    ~df["is_empty"]
    & ~df.duplicated(subset=["prompt_clean"], keep="first")
    & df["len_words"].between(5, 2000)
].reset_index(drop=True)
print(f"Rows after quality filter: {len(df):,} (removed {before - len(df):,})")

Rows after quality filter: 61,007 (removed 128)
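
The filter above removes exact duplicates only, so prompts that differ in case or spacing both survive. A minimal sketch of normalized-key deduplication, using only the standard library, is shown below. The helper names `normalize_key` and `dedup_keep_first` are hypothetical and the normalization is deliberately simple.

```python
import re


def normalize_key(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different prompts collide."""
    return re.sub(r"\s+", " ", text).strip().lower()


def dedup_keep_first(prompts):
    """Return prompts in original order, keeping the first of each normalized key."""
    seen = set()
    kept = []
    for p in prompts:
        key = normalize_key(p)
        if key not in seen:
            seen.add(key)
            kept.append(p)
    return kept


prompts = ["How do I sort a list?", "how do I  sort a list?", "What is DPO?"]
print(dedup_keep_first(prompts))
```

In the notebook, the same idea could be applied by mapping `normalize_key` over `prompt_clean` into a helper column and passing that column to `duplicated(..., keep="first")`.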


## 4. Simulating VJ (Virtual Judge) Scores

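In production these come from your scoring pipeline.  
Here we attach synthetic scores, drawn at random and independent of the response text, to demonstrate the pair-construction logic. The pair counts in Sections 5–7 therefore reflect the simulation parameters, not the quality of the data. Swap in real columns when available.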

In [13]:
# --- If your dataframe already has score columns, skip this cell ---
# Expected schema after this cell (one column per metric and side):
#   score_correctness_{chosen,rejected}      : float [0,1]
#   score_groundedness_{chosen,rejected}     : float [0,1]
#   score_problem_solution_{chosen,rejected} : float [0,1]
#   score_style_{chosen,rejected}            : float [0,1]
#   response_chosen                          : str
#   response_rejected                        : str

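def _extract_text(cell):
    """Return the assistant reply from an UltraFeedback message sequence.

    Each cell holds {"role", "content"} dicts: the user prompt first, the
    assistant response last. `to_pandas()` yields numpy arrays, so accept
    those as well as plain lists.
    """
    if isinstance(cell, (list, tuple, np.ndarray)) and len(cell) > 0:
        item = cell[-1]  # last turn is the response; cell[0] is the prompt
        if isinstance(item, dict):
            return item.get("content", str(item))
        return str(item)
    return str(cell)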


# Extract chosen / rejected text
if "chosen" in df.columns:
    df["response_chosen"]   = df["chosen"].apply(_extract_text)
    df["response_rejected"] = df["rejected"].apply(_extract_text)
else:
    df["response_chosen"]   = "placeholder chosen response"
    df["response_rejected"] = "placeholder rejected response"

# Simulate per-response VJ scores (replace with real scores in production)
rng = np.random.default_rng(42)
n = len(df)
for col, mu_chosen, mu_rejected in [
    ("score_correctness",     0.75, 0.50),
    ("score_groundedness",    0.72, 0.55),
    ("score_problem_solution",0.70, 0.52),
    ("score_style",           0.65, 0.48),
]:
    df[f"{col}_chosen"]   = rng.normal(mu_chosen,   0.12, n).clip(0, 1)
    df[f"{col}_rejected"] = rng.normal(mu_rejected, 0.15, n).clip(0, 1)

print(df[[c for c in df.columns if c.startswith("score_")]].describe().round(3))

       score_chosen  score_rejected  score_correctness_chosen  \
count     61007.000       61007.000                 61007.000   
mean          7.825           5.954                     0.749   
std           1.127           1.986                     0.118   
min           1.000           1.000                     0.223   
25%           7.500           4.000                     0.669   
50%           8.000           6.500                     0.749   
75%           8.500           7.500                     0.830   
max          10.000          10.000                     1.000   

       score_correctness_rejected  score_groundedness_chosen  \
count                   61007.000                  61007.000   
mean                        0.499                      0.720   
std                         0.151                      0.119   
min                         0.000                      0.142   
25%                         0.397                      0.640   
50%                         0.

## 5. Strategy A — Weighted Sum Pair Selection

In [14]:
WEIGHTS = {
    "score_correctness":      0.40,
    "score_groundedness":     0.25,
    "score_problem_solution": 0.25,
    "score_style":            0.10,
}


def weighted_reward(df: pd.DataFrame, suffix: str, weights: dict) -> pd.Series:
    return sum(w * df[f"{k}_{suffix}"] for k, w in weights.items())


df["reward_chosen"]   = weighted_reward(df, "chosen",   WEIGHTS)
df["reward_rejected"] = weighted_reward(df, "rejected", WEIGHTS)
df["reward_delta"]    = df["reward_chosen"] - df["reward_rejected"]

# Ensure chosen > rejected (filter out noise)
DELTA_THRESHOLD = 0.05
df_weighted = df[df["reward_delta"] >= DELTA_THRESHOLD].copy()

print(f"Pairs surviving weighted-sum filter (delta >= {DELTA_THRESHOLD}):")
print(f"  {len(df_weighted):,} / {len(df):,} ({len(df_weighted)/len(df)*100:.1f}%)")
print(df_weighted["reward_delta"].describe().round(3))

Pairs surviving weighted-sum filter (delta >= 0.05):
  56,805 / 61,007 (93.1%)
count    56805.000
mean         0.219
std          0.091
min          0.050
25%          0.151
50%          0.214
75%          0.280
max          0.641
Name: reward_delta, dtype: float64
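
Because the scores are simulated from known normal distributions, the survival rate of the weighted-sum filter can be estimated in closed form. The sketch below copies the means, standard deviations, and weights from the cells above and treats every score as an independent, unclipped normal draw. Ignoring the `clip(0, 1)` step is an assumption, so expect the estimate to differ slightly from the empirical figure.

```python
import math

# Simulation parameters copied from the score-generation and weights cells.
WEIGHTS = {"correctness": 0.40, "groundedness": 0.25, "problem_solution": 0.25, "style": 0.10}
MU_CHOSEN = {"correctness": 0.75, "groundedness": 0.72, "problem_solution": 0.70, "style": 0.65}
MU_REJECTED = {"correctness": 0.50, "groundedness": 0.55, "problem_solution": 0.52, "style": 0.48}
SD_CHOSEN, SD_REJECTED = 0.12, 0.15
DELTA_THRESHOLD = 0.05


def normal_cdf(z: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# A weighted sum of independent normals is normal: means add linearly,
# variances add with squared weights.
mean_delta = sum(w * (MU_CHOSEN[k] - MU_REJECTED[k]) for k, w in WEIGHTS.items())
var_delta = sum(w**2 * (SD_CHOSEN**2 + SD_REJECTED**2) for w in WEIGHTS.values())
sd_delta = math.sqrt(var_delta)
p_survive = 1.0 - normal_cdf((DELTA_THRESHOLD - mean_delta) / sd_delta)

print(f"expected mean delta   : {mean_delta:.4f}")
print(f"expected sd of delta  : {sd_delta:.4f}")
print(f"expected survival rate: {p_survive:.3f}")
```

For these parameters the estimate comes out near 0.93, in line with the 93.1% observed above. That agreement shows the retention figure is a property of the simulation settings, not of the underlying data.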


## 6. Strategy B — Single-Perspective Filtering (Isolating Style Delta)

In [17]:
STYLE_DELTA_MIN    = 0.50   # chosen must beat rejected on style by at least this margin
CORRECTNESS_FLOOR  = 0.78   # chosen must maintain at least this correctness

df["style_delta"] = (
    df["score_style_chosen"] - df["score_style_rejected"]
)

df_style = df[
    (df["style_delta"] >= STYLE_DELTA_MIN)
    & (df["score_correctness_chosen"] >= CORRECTNESS_FLOOR)  # correctness floor!
].copy()

print(f"Pairs surviving single-perspective style filter:")
print(f"  {len(df_style):,} / {len(df):,} ({len(df_style)/len(df)*100:.1f}%)")

# Sanity check: is the rejected group's problem_solution perversely higher?
ps_chosen   = df_style["score_problem_solution_chosen"].mean()
ps_rejected = df_style["score_problem_solution_rejected"].mean()
print(f"\n[Sanity] problem_solution — chosen avg: {ps_chosen:.3f}, rejected avg: {ps_rejected:.3f}")
if ps_rejected > ps_chosen:
    print("  ⚠ WARNING: rejected group has higher problem_solution than chosen — negative correlation present.")
    print("  → See Notebook 04 for remedies (correctness floor, MOO, SFT warm-up).")
else:
    print("  ✓ No negative correlation detected.")

Pairs surviving single-perspective style filter:
  1,006 / 61,007 (1.6%)

[Sanity] problem_solution — chosen avg: 0.691, rejected avg: 0.509
  ✓ No negative correlation detected.
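
The steep drop to 1.6% can also be predicted from the simulation parameters. The sketch below multiplies the probability of clearing the style-delta margin by the probability of clearing the correctness floor. It assumes independent, unclipped normal scores, which matches how the synthetic columns were drawn apart from the `clip(0, 1)` step.

```python
import math


def normal_sf(x: float, mu: float, sd: float) -> float:
    """P(X >= x) for X ~ Normal(mu, sd), via the complementary error function."""
    return 0.5 * math.erfc((x - mu) / (sd * math.sqrt(2.0)))


STYLE_DELTA_MIN, CORRECTNESS_FLOOR = 0.50, 0.78

# style_delta = style_chosen - style_rejected, with the simulated parameters.
style_mu = 0.65 - 0.48
style_sd = math.sqrt(0.12**2 + 0.15**2)

p_style = normal_sf(STYLE_DELTA_MIN, style_mu, style_sd)   # style margin cleared
p_floor = normal_sf(CORRECTNESS_FLOOR, 0.75, 0.12)         # correctness floor cleared
p_both = p_style * p_floor                                 # independent draws

print(f"P(style delta >= {STYLE_DELTA_MIN}): {p_style:.4f}")
print(f"P(correctness >= {CORRECTNESS_FLOOR}): {p_floor:.4f}")
print(f"expected retention: {p_both:.4f}")
```

The product lands at roughly 1.7%, close to the 1.6% observed. Most of the loss comes from the style margin, which only about 4% of simulated pairs clear. The correctness floor then removes roughly 60% of those.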


## 7. Sample Efficiency Summary

In [18]:
funnel = pd.DataFrame([
    {"stage": "Raw data",                  "rows": len(dataset)},
    {"stage": "After quality filter",      "rows": len(df)},
    {"stage": "Weighted-sum pairs",        "rows": len(df_weighted)},
    {"stage": "Single-perspective (style)","rows": len(df_style)},
])
funnel["retention_%"] = (funnel["rows"] / funnel["rows"].iloc[0] * 100).round(1)
print(funnel.to_string(index=False))
print("\nNote: aggressive multi-VJ filtering can reduce 24k → 700 pairs (see Notebook 04).")

                     stage  rows  retention_%
                  Raw data 61135        100.0
      After quality filter 61007         99.8
        Weighted-sum pairs 56805         92.9
Single-perspective (style)  1006          1.6

Note: aggressive multi-VJ filtering can reduce 24k → 700 pairs (see Notebook 04).


## 8. Export DPO-Ready Datasets

In [20]:
def to_dpo_format(df: pd.DataFrame) -> pd.DataFrame:
    """Convert to the prompt/chosen/rejected schema expected by TRL's DPOTrainer."""
    return pd.DataFrame({
        "prompt":   df["prompt_clean"],
        "chosen":   df["response_chosen"],
        "rejected": df["response_rejected"],
    })


for name, subset in [("weighted", df_weighted), ("style", df_style)]:
    out = to_dpo_format(subset)
    path = DATA_DIR / f"dpo_{name}.jsonl"
    out.to_json(path, orient="records", lines=True)
    print(f"Saved {len(out):,} pairs → {path}")

# Also save as HuggingFace Dataset
ds_weighted = Dataset.from_pandas(to_dpo_format(df_weighted))
ds_weighted.save_to_disk(str(DATA_DIR / "dpo_weighted_hf"))
print("HuggingFace dataset saved.")

Saved 56,805 pairs → ../data/dpo_weighted.jsonl
Saved 1,006 pairs → ../data/dpo_style.jsonl


Saving the dataset (1/1 shards): 100%|██████████| 56805/56805 [00:00<00:00, 900266.91 examples/s]

HuggingFace dataset saved.
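
Before handing the exported files to a trainer, a quick structural check can catch degenerate pairs, such as empty fields or identical chosen and rejected text. The sketch below is a standard-library validator for the JSONL layout written above. The function name `validate_dpo_jsonl` and the specific checks are illustrative assumptions, and the demo runs on a throwaway file, not on the real export.

```python
import json
import tempfile
from pathlib import Path

REQUIRED = ("prompt", "chosen", "rejected")


def validate_dpo_jsonl(path):
    """Return (n_records, problems) for a JSONL file of DPO pairs."""
    problems = []
    n = 0
    with open(path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            n += 1
            rec = json.loads(line)
            for key in REQUIRED:
                if not isinstance(rec.get(key), str) or not rec[key].strip():
                    problems.append((lineno, f"missing or empty '{key}'"))
            if rec.get("chosen") == rec.get("rejected"):
                problems.append((lineno, "chosen equals rejected"))
    return n, problems


# Demo on a temporary file with one valid and one degenerate record.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "dpo_demo.jsonl"
    rows = [
        {"prompt": "p1", "chosen": "good answer", "rejected": "weak answer"},
        {"prompt": "p2", "chosen": "same", "rejected": "same"},
    ]
    demo.write_text("\n".join(json.dumps(r) for r in rows) + "\n", encoding="utf-8")
    n_records, problems = validate_dpo_jsonl(demo)

print(n_records, problems)
```

Pointing the same function at `../data/dpo_weighted.jsonl` or `../data/dpo_style.jsonl` would give a cheap sanity check on the real exports.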





---
## Summary

| Dataset | Pairs | Notes |
|---------|-------|-------|
| `dpo_weighted.jsonl` | 56,805 | Weighted-sum reward delta ≥ 0.05 |
| `dpo_style.jsonl`    | 1,006 | Style delta ≥ 0.50 and chosen correctness ≥ 0.78 |

**Next:** `02_finetuning_qlora.ipynb` — SFT warm-up → DPO on these pairs.