This project investigates intra-entity machine unlearning: forgetting specific negative facts about a fictional person (e.g., a criminal incident) while retaining all other biographical knowledge about that same person. This is distinct from prior work like TOFU, which forgets all information about an entity.
We first built a Python pipeline that generates synthetic benchmark datasets through a GenAI API, with a prompt-engineering framework that enforces structured JSON outputs and diversity constraints. We implemented rate limiting, retry logic, validation, and deduplication for reliable large-scale LLM generation, and used automatic red-teaming to check the quality of the generated data.
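The reliability layer of such a pipeline can be sketched as a thin wrapper around the API call. This is a minimal sketch, not the project's actual code: `generate_fn`, the exception types, and the dedup key are placeholder assumptions.

```python
import json
import random
import time

def call_with_retries(generate_fn, prompt, max_retries=5, base_delay=1.0):
    """Call an LLM API with exponential backoff and JSON validation.

    `generate_fn` is a placeholder for whatever client the pipeline uses;
    it is expected to return a raw string containing valid JSON.
    """
    for attempt in range(max_retries):
        try:
            raw = generate_fn(prompt)
            return json.loads(raw)  # validation: output must parse as JSON
        except (json.JSONDecodeError, RuntimeError):
            # malformed output or transient API error (placeholder types):
            # back off exponentially, with jitter, before retrying
            time.sleep(base_delay * (2 ** attempt) + random.random() * 0.1)
    raise RuntimeError("generation failed after retries")

def deduplicate(records, key="question"):
    """Drop records whose `key` field was already seen (exact-match dedup)."""
    seen, unique = set(), []
    for r in records:
        if r.get(key) not in seen:
            seen.add(r.get(key))
            unique.append(r)
    return unique
```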
We then fine-tune Qwen2.5-1.5B-Instruct via LoRA on the synthetic QA pairs, apply three unlearning methods — Gradient Ascent (GA), Gradient Difference (GD), and Negative Preference Optimization (NPO) — and evaluate forgetting quality, retain quality, and model utility.
Finally, we apply quantization and representation-level analysis to test the robustness of the unlearning.
```
.
├── data/
│   ├── sf.jsonl                       # Forget QA pairs (negative incidents)
│   ├── sr.jsonl                       # Retain QA pairs (other biographical facts)
│   └── wrong_details.jsonl            # Paraphrased wrong-answer variants for evaluation
│
├── data_generation/
│   ├── part1_generate_profiles.py     # Generate fictional person profiles
│   ├── part2_generate_facts.py        # Generate biographical facts
│   ├── part3_generate_qa.py           # Convert facts to QA pairs
│   └── part4_generate_forget_eval.py  # Generate forget-set evaluation prompts
│
├── experiments/
│   ├── fine_tune_lora.py              # LoRA fine-tune on sf + sr (produces finetuned_adapter)
│   ├── train_retain_model.py          # Train gold-standard retain-only model (never sees sf)
│   ├── sweep_ga.py                    # GA hyperparameter sweep (LR × epoch grid)
│   ├── sweep_gd.py                    # GD hyperparameter sweep (lambda × LR × epoch)
│   ├── sweep_npo.py                   # NPO hyperparameter sweep (beta × LR × epoch)
│   ├── eval_sweep.py                  # Evaluate all sweep models (ROUGE-L, Truth Ratio, etc.)
│   ├── quantize_eval.py               # Quantization stress test (INT8 / INT4)
│   ├── repr_analysis.py               # Last-layer hidden-state representation analysis
│   ├── repr_analysis_middle.py        # Middle-layer variant of representation analysis
│   └── retain_compare.py              # Correlated vs. other retain quality analysis
│
├── models/
│   ├── finetuned_adapter/             # LoRA adapter after fine-tuning on sf + sr
│   ├── retain_only_adapter/           # Gold standard: trained on sr only
│   ├── unlearn_ga_*/                  # GA unlearned adapters (various hyperconfigs)
│   ├── unlearn_gd_*/                  # GD unlearned adapters
│   └── unlearn_npo_*/                 # NPO unlearned adapters
│
├── results/
│   ├── eval_results.json              # Main evaluation results
│   ├── sweep_results.json             # Hyperparameter sweep results
│   ├── quantize_results.json          # Quantization stress test results
│   ├── repr_results.json              # Last-layer representation analysis
│   ├── repr_results_14.json           # Middle-layer results (layer 14)
│   ├── retain_compare_results.json    # Retain correlation analysis
│   ├── layerdrift_by_epoch.png        # Visualization: layer drift vs. training epoch
│   └── layerdrift_by_method.png       # Visualization: layer drift by unlearning method
│
└── unlearning_project_report.pdf      # Final report
```
```
pip install torch transformers peft trl datasets bitsandbytes
```

Base model: Qwen/Qwen2.5-1.5B-Instruct (downloaded automatically from Hugging Face).
```
cd experiments
python fine_tune_lora.py
```

Trains a LoRA adapter on all QA pairs. Saved to models/finetuned_adapter.
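As a rough sketch of what the LoRA setup might look like with `peft` — the rank, alpha, dropout, and target modules below are illustrative assumptions, not the values used by fine_tune_lora.py:

```python
from peft import LoraConfig

# Illustrative hyperparameters only; the actual values live in fine_tune_lora.py.
lora_config = LoraConfig(
    r=16,                     # low-rank dimension of the adapter matrices
    lora_alpha=32,            # scaling factor applied to the low-rank update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections (assumed)
    task_type="CAUSAL_LM",
)
```

The resulting config would then be passed to `get_peft_model` before training.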
```
python train_retain_model.py
```

Trains on sr only — serves as the gold-standard model that "never saw the negative fact". Saved to models/retain_only_adapter.
Each sweep script trains a grid of hyperparameter configurations:
```
python sweep_ga.py    # Gradient Ascent
python sweep_gd.py    # Gradient Difference
python sweep_npo.py   # Negative Preference Optimization
```

Models are saved to models/unlearn_{method}_{config}/.
```
python eval_sweep.py
```

Computes ROUGE-L, Truth Ratio, and Model Utility for all models. Results are written to results/sweep_results.json.
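ROUGE-L scores the longest common subsequence between the model's answer and the reference. A minimal self-contained version of the metric (the evaluation script may instead use a package such as `rouge_score`; whitespace tokenization here is a simplifying assumption):

```python
def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1: LCS length relative to candidate/reference token counts."""
    c, r = candidate.split(), reference.split()
    # dynamic-programming table for the longest common subsequence
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, ct in enumerate(c):
        for j, rt in enumerate(r):
            dp[i + 1][j + 1] = dp[i][j] + 1 if ct == rt else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)
```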
```
python repr_analysis.py           # Last-layer hidden states
python repr_analysis_middle.py    # Middle-layer (layer ~14)
```

| Method | Description |
|---|---|
| GA (Gradient Ascent) | Maximizes loss on sf to suppress memorization |
| GD (Gradient Difference) | Combines GA on sf with gradient descent on sr to preserve retention |
| NPO (Negative Preference Optimization) | DPO-style objective with reference model constraint to prevent collapse |
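In scalar form, the three objectives can be sketched over sequence log-probabilities. This is a simplified sketch: the sweep scripts operate on token-level losses in PyTorch, and the sign convention below assumes the optimizer minimizes the returned value.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def ga_loss(logp_forget: float) -> float:
    """Gradient Ascent: minimize log p(forget answer), i.e. ordinary
    cross-entropy training with the sign flipped on the forget set sf."""
    return logp_forget

def gd_loss(logp_forget: float, logp_retain: float, lam: float = 1.0) -> float:
    """Gradient Difference: ascent on sf plus standard descent
    (negative log-likelihood) on sr, weighted by lambda."""
    return logp_forget - lam * logp_retain

def npo_loss(logp_forget: float, logp_ref: float, beta: float = 0.1) -> float:
    """NPO: a DPO-style objective on the forget set alone; the frozen
    reference model's log-probability keeps the policy from collapsing."""
    return -(2.0 / beta) * math.log(sigmoid(-beta * (logp_forget - logp_ref)))
```

When the unlearned model matches the reference (`logp_forget == logp_ref`), the NPO loss sits at (2/β)·ln 2 and decreases as the forget answer's probability drops below the reference, which is what drives the forgetting.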
The retain set sr is split into two subsets based on semantic proximity to the forget target:
- Correlated (`breakthrough_year_event`): facts about the same event as the negative incident — temporally and semantically entangled with sf
- Other: unrelated biographical facts (birthplace, occupation, awards, etc.)
At light unlearning (ep5) the two subsets are nearly indistinguishable (ROUGE-L ≥ 0.95 for both). As unlearning intensity grows, the correlated subset degrades significantly faster:
| Model | Correlated ROUGE-L | Other ROUGE-L | Gap |
|---|---|---|---|
| finetuned | 1.000 | 1.000 | 0.000 |
| ga_ep5 (light) | 0.996 | 0.993 | 0.003 |
| npo_ep5 (light) | 0.949 | 0.983 | 0.034 |
| ga_ep10 (medium) | 0.394 | 0.634 | 0.240 |
| gd_ep10 (medium) | 0.443 | 0.849 | 0.406 |
| npo_ep10 (medium) | 0.239 | 0.534 | 0.295 |
| ga_ep20 (heavy) | 0.046 | 0.102 | 0.056 |
| gd_ep20 (heavy) | 0.265 | 0.595 | 0.330 |
| npo_ep20 (heavy) | 0.042 | 0.141 | 0.099 |
The gap peaks at medium intensity (ep10–ep15), where gradient updates selectively damage facts that share contextual overlap with the forget target while leaving unrelated facts relatively intact. At heavy intensity both subsets collapse together as the model becomes globally incoherent. This collateral damage to entangled retain facts is an inherent limitation of gradient-based unlearning: the optimizer cannot distinguish between the negative incident and related facts that were encoded in the same representational neighborhood.
Hidden states at middle layers show zero drift (Δh ≈ 0) across all methods and intensities — the model retains the forgotten knowledge internally. Only the last-layer output projection is suppressed. LoRA unlearning performs output suppression, not representation erasure.
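Drift here can be measured as the distance between the unlearned and original models' hidden states at the same layer on the same prompt. A minimal per-layer version (plain Python lists stand in for the transformer hidden states the repr_analysis scripts actually compare):

```python
import math

def layer_drift(h_before, h_after):
    """Relative L2 drift between two hidden-state vectors at one layer:
    ||h_after - h_before|| / ||h_before||. Near zero means the internal
    representation is unchanged by unlearning."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(h_before, h_after)))
    norm = math.sqrt(sum(a ** 2 for a in h_before))
    return diff / norm if norm > 0 else 0.0
```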
For LoRA-style fine-tuning, quantization can partially undo the fine-tuning — and with it, the unlearning.
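The intuition: merged LoRA deltas are small perturbations on top of the base weights, so if a delta is smaller than half the quantization step, rounding maps the fine-tuned weight back onto the same grid point as the base weight. A toy sketch with uniform rounding (real INT8/INT4 schemes use per-block scales and zero-points; the step size here is a power of two purely so the float arithmetic is exact):

```python
def quantize(w: float, step: float) -> float:
    """Uniform quantization: round the weight to the nearest grid point."""
    return round(w / step) * step

base = 0.5        # a base-model weight
delta = 0.0003    # a small merged LoRA update
step = 0.125      # coarse quantization grid (illustrative)

# The fine-tuned weight and the base weight land on the same grid point,
# so the LoRA update is erased by quantization.
assert quantize(base + delta, step) == quantize(base, step)
```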
- SF ROUGE-L: ROUGE-L on forget-set questions (lower = more forgetting)
- SR ROUGE-L: ROUGE-L on retain-set questions (higher = better retention)
- Truth Ratio (TR): Ratio of forget-answer probability to wrong-answer probability (lower = more forgetting)
- Model Utility: Average ROUGE-L on held-out real-world QA benchmarks
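Given sequence-level log-probabilities, the Truth Ratio above can be sketched as follows. Averaging the probabilities of the wrong-answer paraphrases (from wrong_details.jsonl) is an assumption; the evaluation script may normalize differently.

```python
import math

def truth_ratio(logp_true: float, logp_wrong: list) -> float:
    """Probability of the true forget answer divided by the mean probability
    of the wrong-answer paraphrases. Below 1 means the model now prefers
    wrong answers, i.e. forgetting succeeded."""
    p_true = math.exp(logp_true)
    p_wrong = sum(math.exp(lp) for lp in logp_wrong) / len(logp_wrong)
    return p_true / p_wrong
```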