Intra-Entity Machine Unlearning

Overview

This project investigates intra-entity machine unlearning: forgetting specific negative facts about a fictional person (e.g., a criminal incident) while retaining all other biographical knowledge about that same person. This is distinct from prior work like TOFU, which forgets all information about an entity.

We first built a Python pipeline that generates synthetic benchmark datasets with a GenAI API, with a prompt-engineering framework that enforces structured JSON outputs and diversity constraints. We implemented rate limiting, retry logic, validation, and deduplication for reliable large-scale LLM generation, and used automated red-teaming to check the quality of the generated data.
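
The reliability layer can be sketched as follows. This is an illustrative reconstruction, not the code in data_generation/ (the function names and the dedup key are assumptions): an exponential-backoff retry wrapper around any API call, plus a validator that rejects malformed records and deduplicates on normalized question text.

```python
import time

def call_with_retries(call, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Retry a flaky API call with exponential backoff (1s, 2s, 4s, ...)."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # give up after the last attempt
            sleep(base_delay * (2 ** attempt))

def validate_qa(record, seen):
    """Keep only well-formed, previously unseen QA records."""
    if not isinstance(record, dict):
        return False
    if not record.get("question") or not record.get("answer"):
        return False
    key = record["question"].strip().lower()
    if key in seen:          # deduplicate on normalized question text
        return False
    seen.add(key)
    return True
```

A transient rate-limit error then simply costs a few retries instead of aborting a long generation run.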

Then we fine-tune Qwen2.5-1.5B-Instruct via LoRA on the synthetic QA pairs and apply three unlearning methods: Gradient Ascent (GA), Gradient Difference (GD), and Negative Preference Optimization (NPO). We evaluate each for forgetting quality, retain quality, and model utility.

Finally, we apply quantization and representation-level analysis to test the robustness of unlearning.

Project Structure

.
├── data/
│   ├── sf.jsonl               # forget QA pairs (negative incidents)
│   ├── sr.jsonl               # retain QA pairs (other biographical facts)
│   └── wrong_details.jsonl    # paraphrased wrong-answer variants for evaluation
│
├── data_generation/
│   ├── part1_generate_profiles.py   # Generate fictional person profiles
│   ├── part2_generate_facts.py      # Generate biographical facts
│   ├── part3_generate_qa.py         # Convert facts to QA pairs
│   └── part4_generate_forget_eval.py # Generate forget-set evaluation prompts
│
├── experiments/
│   ├── fine_tune_lora.py      # LoRA fine-tune on sf + sr (produces finetuned_adapter)
│   ├── train_retain_model.py  # Train gold-standard retain-only model (never sees sf)
│   ├── sweep_ga.py            # GA hyperparameter sweep (LR × epoch grid)
│   ├── sweep_gd.py            # GD hyperparameter sweep (lambda × LR × epoch)
│   ├── sweep_npo.py           # NPO hyperparameter sweep (beta × LR × epoch)
│   ├── eval_sweep.py          # Evaluate all sweep models (ROUGE-L, Truth Ratio, etc.)
│   ├── quantize_eval.py       # Quantization stress test (INT8 / INT4)
│   ├── repr_analysis.py       # Last-layer hidden-state representation analysis
│   ├── repr_analysis_middle.py # Middle-layer variant of representation analysis
│   └── retain_compare.py      # Correlated vs. other retain quality analysis
│
├── models/
│   ├── finetuned_adapter/     # LoRA adapter after fine-tuning on sf + sr
│   ├── retain_only_adapter/   # Gold-standard: trained on sr only
│   ├── unlearn_ga_*/          # GA unlearned adapters (various hyperconfigs)
│   ├── unlearn_gd_*/          # GD unlearned adapters
│   └── unlearn_npo_*/         # NPO unlearned adapters
│
├── results/
│   ├── eval_results.json           # Main evaluation results
│   ├── sweep_results.json          # Hyperparameter sweep results
│   ├── quantize_results.json       # Quantization stress test results
│   ├── repr_results.json           # Last-layer representation analysis
│   ├── repr_results_14.json        # Middle-layer results (layer 14)
│   ├── retain_compare_results.json # Retain correlation analysis
│   ├── layerdrift_by_epoch.png     # Visualization: layer drift vs. training epoch
│   └── layerdrift_by_method.png    # Visualization: layer drift by unlearning method
│
└── unlearning_project_report.pdf   # Final report

Setup

pip install torch transformers peft trl datasets bitsandbytes

Base model: Qwen/Qwen2.5-1.5B-Instruct (downloaded automatically from HuggingFace).

Pipeline

1. Fine-tune on forget + retain data

cd experiments
python fine_tune_lora.py

Trains a LoRA adapter on all QA pairs. Saved to models/finetuned_adapter.
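
The adapter step presumably uses peft; a minimal sketch of attaching and saving a LoRA adapter is below. The rank, alpha, dropout, and target modules shown here are illustrative assumptions, not the values in fine_tune_lora.py.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Hypothetical hyperparameters -- the actual values live in fine_tune_lora.py.
config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
model = get_peft_model(model, config)   # wrap base model with trainable LoRA weights
# ... training loop on sf + sr QA pairs ...
model.save_pretrained("models/finetuned_adapter")  # saves adapter weights only
```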

2. Train retain-only oracle model

python train_retain_model.py

Trains on sr only, serving as the gold-standard model that never saw the negative facts. Saved to models/retain_only_adapter.

3. Run unlearning

Each sweep script trains a grid of hyperparameter configurations:

python sweep_ga.py    # Gradient Ascent
python sweep_gd.py    # Gradient Difference
python sweep_npo.py   # Negative Preference Optimization

Models saved to models/unlearn_{method}_{config}/.
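
The grid enumeration behind the sweeps can be sketched with itertools.product. The value ranges below are illustrative assumptions (the real grids are defined in the sweep_*.py scripts); the naming mirrors the models/unlearn_{method}_{config}/ convention.

```python
from itertools import product

# Illustrative grids -- the real ranges live in sweep_ga.py / sweep_gd.py / sweep_npo.py.
GRIDS = {
    "ga":  {"lr": [1e-5, 5e-5], "epochs": [5, 10, 20]},
    "gd":  {"lr": [1e-5, 5e-5], "epochs": [5, 10, 20], "lam": [0.5, 1.0]},
    "npo": {"lr": [1e-5, 5e-5], "epochs": [5, 10, 20], "beta": [0.1, 0.5]},
}

def configs(method):
    """Yield (output_dir_name, config_dict) for every point in the grid."""
    grid = GRIDS[method]
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        cfg = dict(zip(keys, values))
        name = "unlearn_" + method + "_" + "_".join(f"{k}{v}" for k, v in cfg.items())
        yield name, cfg
```

Each (name, cfg) pair then drives one unlearning run and one output directory.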

4. Evaluate

python eval_sweep.py

Computes ROUGE-L, Truth Ratio, and Model Utility for all models. Results written to results/sweep_results.json.
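
ROUGE-L is the longest-common-subsequence F-measure between a reference answer and the model's answer. A minimal whitespace-tokenized version (the evaluation script may use a library implementation with different tokenization):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(reference, candidate):
    """ROUGE-L F-measure on whitespace tokens."""
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_len(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A score of 1.0 means the generated answer reproduces the reference exactly (as in the finetuned row of the results table), and scores near 0 mean almost no token overlap in order.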

5. Representation analysis

python repr_analysis.py         # Last-layer hidden states
python repr_analysis_middle.py  # Middle-layer (layer ~14)

Methods

  • GA (Gradient Ascent): maximizes loss on sf to suppress memorization
  • GD (Gradient Difference): combines GA on sf with gradient descent on sr to preserve retention
  • NPO (Negative Preference Optimization): DPO-style objective with a reference-model constraint to prevent collapse
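
The three objectives can be sketched on scalar per-sequence log-probabilities. This is a toy illustration: the actual scripts operate on token-level losses, and the exact placement of λ and β here is an assumption consistent with the descriptions above.

```python
import math

def ga_loss(logp_forget):
    """Gradient Ascent: minimize +logp on sf (i.e. the negated NLL),
    which drives the forget answer's probability down."""
    return logp_forget

def gd_loss(logp_forget, logp_retain, lam=1.0):
    """Gradient Difference: ascend on the forget loss while
    descending on the retain loss, weighted by lambda."""
    return logp_forget - lam * logp_retain

def npo_loss(logp_forget, logp_forget_ref, beta=0.1):
    """NPO: bounded preference-style loss against a frozen reference model.
    Equals (2/beta)*log(2) when the policy matches the reference, and
    decays toward 0 as the forget answer becomes less likely."""
    margin = beta * (logp_forget - logp_forget_ref)
    return (2.0 / beta) * math.log(1.0 + math.exp(margin))
```

The boundedness of the NPO loss is what prevents the runaway divergence that unbounded gradient ascent can cause.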

Key Results

Retain Quality Is Lower for Data with Strong Knowledge Entanglement

The retain set sr is split into two subsets based on semantic proximity to the forget target:

  • Correlated (breakthrough_year_event): facts about the same event as the negative incident — temporally and semantically entangled with sf
  • Other: unrelated biographical facts (birthplace, occupation, awards, etc.)

At light unlearning (ep5) the two subsets are nearly indistinguishable (ROUGE-L ≥ 0.95 for both). As unlearning intensity grows, the correlated subset degrades significantly faster:

Model               Correlated ROUGE-L   Other ROUGE-L   Gap
finetuned           1.000                1.000           0.000
ga_ep5 (light)      0.996                0.993           0.003
npo_ep5 (light)     0.949                0.983           0.034
ga_ep10 (medium)    0.394                0.634           0.240
gd_ep10 (medium)    0.443                0.849           0.406
npo_ep10 (medium)   0.239                0.534           0.295
ga_ep20 (heavy)     0.046                0.102           0.056
gd_ep20 (heavy)     0.265                0.595           0.330
npo_ep20 (heavy)    0.042                0.141           0.099

The gap peaks at medium intensity (ep10–ep15), where gradient updates selectively damage facts that share contextual overlap with the forget target while leaving unrelated facts relatively intact. At heavy intensity both subsets collapse together as the model becomes globally incoherent. This collateral damage to entangled retain facts is an inherent limitation of gradient-based unlearning: the optimizer cannot distinguish between the negative incident and related facts that were encoded in the same representational neighborhood.

Representation Analysis Finding

Hidden states at middle layers show zero drift (Δh ≈ 0) across all methods and intensities — the model retains the forgotten knowledge internally. Only the last-layer output projection is suppressed. LoRA unlearning performs output suppression, not representation erasure.
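
The drift metric can be illustrated as a relative L2 distance between hidden-state vectors before and after unlearning. This normalization is an assumption for illustration; repr_analysis.py may define Δh differently.

```python
import math

def drift(h_before, h_after):
    """Relative hidden-state drift: ||h_after - h_before|| / ||h_before||.
    Near 0 means the representation is unchanged by unlearning."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(h_after, h_before)))
    base = math.sqrt(sum(b * b for b in h_before))
    return diff / base if base else 0.0
```

Applied per layer, a profile of Δh ≈ 0 everywhere except the final projection is the signature of output suppression rather than erasure.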

Quantization Stress Test

For LoRA-style fine-tuning, quantization partially undoes the fine-tuned updates, so unlearning applied through a LoRA adapter can be partially reversed by quantizing the model.
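
A toy symmetric INT8 quantizer shows why: small weight deltas (like merged LoRA updates) that fall below the quantization step of roughly max|w|/127 round away entirely. This is a simplified per-tensor scheme for illustration, not the bitsandbytes kernels used by quantize_eval.py.

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: w -> round(w / scale)."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map INT8 codes back to floats."""
    return [v * scale for v in q]
```

With a quantization step of about 0.008 on unit-scale weights, per-weight deltas of ~0.002 produce the same INT8 codes as the original weights, erasing the update.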

Metrics

  • SF ROUGE-L: ROUGE-L on forget-set questions (lower = more forgetting)
  • SR ROUGE-L: ROUGE-L on retain-set questions (higher = better retention)
  • Truth Ratio (TR): Ratio of forget-answer probability to wrong-answer probability (lower = more forgetting)
  • Model Utility: Average ROUGE-L on held-out real-world QA benchmarks
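
Under the TR definition above, a minimal sketch on per-answer log-probabilities (whether the script averages wrong-answer probabilities or normalizes per token is an assumption here):

```python
import math

def truth_ratio(logp_true, logp_wrong_list):
    """TR = P(true forget answer) / mean P(paraphrased wrong answers).
    Lower means the model no longer prefers the true answer."""
    p_true = math.exp(logp_true)
    p_wrong = sum(math.exp(lp) for lp in logp_wrong_list) / len(logp_wrong_list)
    return p_true / p_wrong
```

A TR near or below 1 indicates the model now treats the true negative fact as no more likely than plausible wrong variants from wrong_details.jsonl.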
