Evaluating BalDRO for LLM Unlearning: Alternative Forget Objectives and Cross-Model Generalisation

This repository is a fork of BalDRO (Shao et al., WWW 2026), extended as part of a BlueDot AI Safety technical project. The original repo evaluates BalDRO-DV and BalDRO-G on Llama-2-7B using NPO, SimNPO, and SatImp as forget objectives. This fork investigates two additional directions:

Alternative forget objectives. We test WGA (Weighted Gradient Ascent) and TNPO (Token-wise NPO) as drop-in replacements for NPO within the BalDRO-DV framework, evaluating whether the robustness gains transfer to different loss functions.
Cross-model generalisation. We extend experiments beyond Llama-2-7B to four additional models: Llama-3.2-1B-Instruct, Llama-3.1-8B-Instruct, Qwen3-8B, and Mistral-7B-Instruct-v0.3.
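To make the WGA objective concrete, here is a minimal sketch in plain Python. It assumes the commonly described WGA form, in which each forget-token log-likelihood is weighted by that token's own probability raised to a power beta, with the weight treated as a constant. The function name `wga_forget_loss` and the `beta` parameter are illustrative and are not taken from src/trainer/unlearn/wga.py, which operates on model logits rather than precomputed log-probabilities.

```python
import math


def wga_forget_loss(token_logprobs, beta=1.0):
    """Illustrative WGA-style forget loss over one sequence.

    token_logprobs: log p(y_t | y_<t, x) for each forget token.
    beta: exponent on the token probability (hypothetical name).
    Minimising the returned value pushes the token log-likelihoods down.
    """
    # Weight each token by p_t ** beta. In a real trainer this weight
    # would be detached so that no gradient flows through it.
    weights = [math.exp(lp) ** beta for lp in token_logprobs]
    weighted = [w * lp for w, lp in zip(weights, token_logprobs)]
    return sum(weighted) / len(weighted)


if __name__ == "__main__":
    logprobs = [math.log(0.9), math.log(0.5), math.log(0.01)]
    print(wga_forget_loss(logprobs, beta=0.0))  # plain mean log-likelihood
    print(wga_forget_loss(logprobs, beta=1.0))  # confident tokens dominate
```

With beta set to 0 every weight is 1 and the loss reduces to the plain gradient-ascent objective. Larger beta values shrink the contribution of tokens the model already assigns low probability.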

All experiments use the TOFU benchmark (forget01 split) and report forget quality, model utility, and gibberish rate.

What's added in this fork

src/trainer/unlearn/wga.py — WGA and BalDRO-DV + WGA trainers
src/trainer/unlearn/tnpo.py — TNPO and BalDRO-DV + TNPO trainers
scripts/unlearn/tofu/train_tofu_wga_fast.sh — WGA baseline
scripts/unlearn/tofu/train_tofu_drwga_fast.sh — BalDRO-DV + WGA
scripts/unlearn/tofu/train_tofu_tnpo_fast.sh — TNPO baseline
scripts/unlearn/tofu/train_tofu_drtnpo_fast.sh — BalDRO-DV + TNPO (β_DV=2.0)
scripts/unlearn/tofu/train_tofu_drtnpo_fast_bdv0.5.sh — BalDRO-DV + TNPO (β_DV=0.5)
scripts/unlearn/tofu/train_tofu_drnpo_fast.sh — BalDRO-DV + NPO on Llama-2-7B
Per-model fast scripts for NPO and BalDRO-DV + NPO on Llama-3.2-1B, Llama-3.1-8B, Qwen3-8B, and Mistral-7B
Finetuning scripts for Qwen3-8B and Mistral-7B on TOFU (no OpenUnlearning checkpoints exist for these models)
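The two β_DV settings above can be read through the textbook Donsker-Varadhan aggregation, sketched below in plain Python. This is an assumption about the general technique, not a copy of the trainers in this fork: it takes per-sample forget losses, combines them as β · log mean exp(loss / β), and exposes the implied softmax weights. The names `dv_aggregate`, `dv_weights`, and `beta_dv` are hypothetical.

```python
import math


def dv_weights(losses, beta_dv):
    """Softmax weights over per-sample losses at temperature beta_dv."""
    scaled = [loss / beta_dv for loss in losses]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def dv_aggregate(losses, beta_dv):
    """Donsker-Varadhan objective: beta * log(mean(exp(loss / beta)))."""
    scaled = [loss / beta_dv for loss in losses]
    peak = max(scaled)
    mean_exp = sum(math.exp(s - peak) for s in scaled) / len(scaled)
    return beta_dv * (peak + math.log(mean_exp))


if __name__ == "__main__":
    per_sample_losses = [0.2, 1.0, 3.0]
    for beta_dv in (2.0, 0.5):
        print(beta_dv, dv_aggregate(per_sample_losses, beta_dv),
              dv_weights(per_sample_losses, beta_dv))
```

Under this form, a smaller β_DV concentrates weight on the highest-loss samples and moves the objective toward the maximum loss, while a larger β_DV moves it toward the plain average.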

Setup

Setup instructions are unchanged from the original repo. See the original BalDRO repository for environment setup, dataset preparation, and model downloads.

For Llama-3.x models, use the pretrained TOFU checkpoints from OpenUnlearning. For Qwen3-8B and Mistral-7B, run the finetuning scripts in scripts/unlearn/tofu/ before unlearning.

Acknowledgements

This work builds on BalDRO by Shao et al. and OpenUnlearning.

Citation

@inproceedings{shao2026baldro,
  title={{BalDRO}: A distributionally robust optimization based framework for large language model unlearning},
  author={Shao, Pengyang and Zhai, Naixin and Chen, Lei and Yang, Yonghui and Zhu, Fengbin and Yang, Xun and Wang, Meng},
  booktitle={Proceedings of the ACM Web Conference 2026},
  pages={8874--8884},
  year={2026}
}

License

This project is licensed under the MIT License. See the LICENSE file for details.
