
apple/ml-dso

DSO: Direct Steering Optimization for Bias Mitigation

DSO Method Overview: in a visual-assistance scenario for a visually impaired user (left), VLMs often rely on gender stereotypes — such as assuming the man is the doctor — leading to biased responses. Since both candidates are equally valid answers, DSO steers the model toward a 50/50 gender distribution, while preserving the model's broader capabilities.

This software project accompanies the research paper DSO: Direct Steering Optimization for Bias Mitigation (Monteiro Paes, Sivakumar, Wang et al., 2026).

DSO uses reinforcement learning to train lightweight affine adapters (a learned scale and bias applied to each LayerNorm output) on frozen LLMs and VLMs to mitigate bias, while preserving overall model capabilities. The intervention strength λ can be tuned at inference time to balance fairness and performance — no retraining required. Although focused on bias mitigation, DSO is general and can be extended to other domains like toxicity mitigation and AI alignment by implementing a new reward function and dataloader.
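As a rough illustration of the adapter described above (a sketch, not the repository's implementation), the code below applies a learned scale and bias to a LayerNorm output, assuming the strength λ linearly interpolates between the identity (λ = 0, base model) and the full learned transform (λ = 1). The `layer_norm`, `affine_adapter`, and weight values are all hypothetical names for illustration:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Parameter-free LayerNorm over the last axis."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def affine_adapter(h, scale, bias, lam):
    """Apply a learned scale/bias to a LayerNorm output h.

    lam = 0 returns h unchanged (base model);
    lam = 1 applies the full intervention scale * h + bias.
    """
    return (1.0 - lam) * h + lam * (scale * h + bias)

# Illustrative values, not trained weights.
h = layer_norm(np.random.randn(2, 8))
scale = np.full(8, 1.1)
bias = np.full(8, 0.05)

out_base = affine_adapter(h, scale, bias, lam=0.0)  # identical to h
out_full = affine_adapter(h, scale, bias, lam=1.0)  # full intervention
```

Because λ enters only at this final interpolation, it can be changed at inference time without touching the learned `scale` and `bias`.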

Installation

uv venv --python 3.11
source .venv/bin/activate
uv pip install -r requirements.txt

Quick Start

Training interventions:

The following scripts train the linear interventions applied to the LLM or VLM module, depending on your application. Running one of them produces the trained intervention file (.pt) required for inference; you do not need to run both, only the one that matches your target application. The trained interventions are saved as interventions_llm.pt or interventions_vlm.pt in the main folder.

python examples/training_interventions_llm.py   # LLM: coreference resolution (aims to identify which occupation a gendered pronoun refers to in a sentence)
python examples/training_interventions_vlms.py  # VLM: occupation recognition (aims to identify the occupation of a person depicted in an image)

Loading and using trained interventions: The following script loads the trained interventions and applies them to the model at inference time. The intervention_path corresponds to the file produced by the training script (either interventions_llm.pt or interventions_vlm.pt). The intervention strength λ controls how strongly the adapter steers the model: higher values increase bias mitigation but may affect general capability; lower values preserve capability at the cost of less debiasing. Because λ is decoupled from training, it can be swept post-hoc on a validation set to find the optimal fairness/capability operating point without retraining.

intervened_model = HookInterventions(model, get_layernorms, intervention_path="<intervention_path>.pt")

intervened_model.set_strength(0.8)       # λ in [0, 1]: 0 = no intervention (base model), 1 = full intervention; tune for fairness vs. capability trade-off
intervened_model.sparsify(0.1)           # keep only the strongest 10% of the interventions
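The sparsify step can be understood as magnitude-based pruning: rank interventions by how far they deviate from the identity transform and reset the weakest to the identity. A standalone sketch under that assumption (the repository's actual pruning criterion may differ):

```python
import numpy as np

def sparsify(scale, bias, keep_frac=0.1):
    """Keep only the strongest fraction of interventions.

    Strength is measured as deviation from the identity transform
    (scale = 1, bias = 0); weaker entries are reset to the identity.
    """
    strength = np.abs(scale - 1.0) + np.abs(bias)
    k = max(1, int(keep_frac * strength.size))
    cutoff = np.sort(strength.ravel())[-k]  # k-th largest strength
    mask = strength >= cutoff
    return np.where(mask, scale, 1.0), np.where(mask, bias, 0.0)

# 10 entries, keep_frac=0.1 -> only the single strongest survives.
scale = np.array([1.0, 1.5, 0.9, 1.05, 2.0, 1.0, 1.01, 0.5, 1.2, 1.0])
bias = np.zeros(10)
s_sparse, b_sparse = sparsify(scale, bias, keep_frac=0.1)
```

Resetting pruned entries to the identity (rather than zero) matters: an entry with `scale = 1, bias = 0` leaves the LayerNorm output untouched, so pruning falls back to the base model's behavior.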

Examples

The following scripts demonstrate how to train and load interventions for both LLMs and VLMs. The loading scripts print the model output before and after applying the interventions, so you can directly observe how model behavior changes with and without DSO. Note that each loading script requires the trained intervention file produced by the corresponding training script.

| Script | Task | Model type | Requires |
| --- | --- | --- | --- |
| examples/training_interventions_llm.py | Fairness in coreference resolution (SynthBias) | LLM | |
| examples/training_interventions_vlms.py | Fairness in occupation recognition (SocialCounterfactuals) | VLM | |
| examples/loading_interventions_llm.py | Inference with trained interventions | LLM | run training_interventions_llm.py first |
| examples/loading_interventions_vlms.py | Inference with trained interventions | VLM | run training_interventions_vlms.py first |

How to Extend for Other Rewards and Applications

To apply DSO to a new task you need two things: a dataloader and a reward function.

  1. Dataloader — subclass BaseDatasetLoader in data/dataloaders.py and implement _build_datasets() to return train/eval TabularDataset instances. Each sample must include a conversation field (list of chat-template dicts) and any metadata your reward needs.

  2. Reward function — subclass BaseReward in rewards.py and implement __call__ to return a [B] float tensor of per-sample rewards (positive = desired behavior).

from rewards import BaseReward

class MyReward(BaseReward):
    def __call__(self, completions, input_metadata=None):
        # return a [B] float tensor: positive = desired behavior
        ...

trainer = TrainInterventions(model, processor, reward_fn=MyReward())

For complete working reference implementations, see rewards.py (GenderParityRewardForSynthBias, GenderParityRewardForVisionTasks) and data/dataloaders.py (CorefResolutionLoader, Act4MeLoader). These serve as end-to-end examples of how to pair a dataset loader with a reward function for a new task.
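To make the reward contract concrete, here is a hypothetical batch-level gender-parity reward (an illustration of the idea, not the repository's GenderParityRewardForSynthBias; it returns a plain list of floats, whereas the repository's rewards return a [B] torch tensor):

```python
MALE_WORDS = {"he", "him", "his", "man"}
FEMALE_WORDS = {"she", "her", "hers", "woman"}

def gender_parity_rewards(completions):
    """Score each completion by how balanced the batch's gender choices are.

    A batch at exactly 50/50 male/female yields reward 1.0 for every
    sample; a batch that always picks the same gender yields 0.0.
    """
    def label(text):
        tokens = set(text.lower().split())
        if tokens & MALE_WORDS:
            return "male"
        if tokens & FEMALE_WORDS:
            return "female"
        return None  # completion mentions neither gender

    labels = [label(c) for c in completions]
    counted = [l for l in labels if l is not None]
    rewards = []
    for l in labels:
        if l is None or not counted:
            rewards.append(0.0)  # unparseable completion gets no reward
            continue
        freq = counted.count(l) / len(counted)
        rewards.append(1.0 - 2.0 * abs(freq - 0.5))
    return rewards

balanced = gender_parity_rewards(["he is the doctor", "she is the doctor"])
skewed = gender_parity_rewards(["he is the doctor", "he is the nurse"])
```

A perfectly balanced batch scores 1.0 per sample and a fully skewed batch scores 0.0, so maximizing this reward pushes the policy toward the 50/50 distribution described above.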

Running Tests

pytest                          # all tests
pytest tests/test_data.py       # single file

Results: λ vs Bias Trade-off

The plots show how the intervention strength λ controls gender-occupation bias across two models. We compare DSO against ITI (Inference-Time Intervention) and CAA (Contrastive Activation Addition). DSO provides the most stable control: bias decreases monotonically with λ, whereas ITI and CAA exhibit non-monotonic behavior. Moreover, CAA and Prompting do not reduce bias, indicating that these methods are ineffective for bias mitigation in VLMs. Additionally, there is no principled way to modify a prompt to guarantee a monotonic bias reduction. Together, these observations show that DSO enables controllable bias mitigation at inference time, whereas other methods yield unpredictable effects on bias.

Figure: intervention strength λ vs. bias for Qwen 2.5 3B (left) and Gemma 3 4B (right).
Intervention strength λ (x-axis) vs. bias per occupation (y-axis) measured on the SocialCounterfactuals dataset using the occupation identification task. DSO offers better inference-time bias controllability than alternative methods. Prompting appears as a single point since you cannot continuously control the strength of a prompt.

Citation

@article{monteiro2026dso,
  title   = {DSO: Direct Steering Optimization for Bias Mitigation},
  author  = {Monteiro Paes, Lucas and Sivakumar, Nivedha and Wang, Oliver and
             Fedzechkina, Masha and Theobald, Barry-John and Zappella, Luca and
             Apostoloff, Nicholas},
  journal = {arXiv preprint arXiv:2512.15926},
  year    = {2026}
}
