
apple/ml-dso

DSO: Direct Steering Optimization for Bias Mitigation

DSO Method Overview: in a visual-assistance scenario for a visually impaired user (left), VLMs often rely on gender stereotypes — such as assuming the man is the doctor — leading to biased responses. Since both candidates are equally valid answers, DSO steers the model toward a 50/50 gender distribution, while preserving the model's broader capabilities.

This software project accompanies the research paper DSO: Direct Steering Optimization for Bias Mitigation (Monteiro Paes, Sivakumar, Wang et al., 2026).

DSO uses reinforcement learning to train lightweight affine adapters (a learned scale and bias applied to each LayerNorm output) on frozen LLMs and VLMs to mitigate bias, while preserving overall model capabilities. The intervention strength λ can be tuned at inference time to balance fairness and performance — no retraining required. Although focused on bias mitigation, DSO is general and can be extended to other domains like toxicity mitigation and AI alignment by implementing a new reward function and dataloader.
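As a rough illustration of the adapter described above (a sketch, not the repository's implementation), the code below applies a learned scale and bias to a LayerNorm output, assuming the strength λ linearly interpolates between the identity (λ = 0, base model) and the full learned transform (λ = 1). The `layer_norm`, `affine_adapter`, and weight values are all hypothetical names for illustration:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Parameter-free LayerNorm over the last axis."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def affine_adapter(h, scale, bias, lam):
    """Apply a learned scale/bias to a LayerNorm output h.

    lam = 0 returns h unchanged (base model);
    lam = 1 applies the full intervention scale * h + bias.
    """
    return (1.0 - lam) * h + lam * (scale * h + bias)

# Illustrative values, not trained weights.
h = layer_norm(np.random.randn(2, 8))
scale = np.full(8, 1.1)
bias = np.full(8, 0.05)

out_base = affine_adapter(h, scale, bias, lam=0.0)  # identical to h
out_full = affine_adapter(h, scale, bias, lam=1.0)  # full intervention
```

Because λ enters only at this final interpolation, it can be changed at inference time without touching the learned `scale` and `bias`.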

Installation

uv venv --python 3.11
source .venv/bin/activate
uv pip install -r requirements.txt

Quick Start

Training interventions:

The following scripts train the linear interventions applied to the LLM or VLM module, depending on your application. Running one of them produces the trained intervention file (.pt) required for inference; you do not need to run both, only the one that matches your target application. The trained interventions are saved as interventions_llm.pt or interventions_vlm.pt in the main folder.

python examples/training_interventions_llm.py   # LLM: coreference resolution (aims to identify which occupation a gendered pronoun refers to in a sentence)
python examples/training_interventions_vlms.py  # VLM: occupation recognition (aims to identify the occupation of a person depicted in an image)

Loading and using trained interventions: The following script loads the trained interventions and applies them to the model at inference time. The intervention_path corresponds to the file produced by the training script (either interventions_llm.pt or interventions_vlm.pt). The intervention strength λ controls how strongly the adapter steers the model: higher values increase bias mitigation but may affect general capability; lower values preserve capability at the cost of less debiasing. Because λ is decoupled from training, it can be swept post-hoc on a validation set to find the optimal fairness/capability operating point without retraining.

intervened_model = HookInterventions(model, get_layernorms, intervention_path="<intervention_path>.pt")

intervened_model.set_strength(0.8)       # λ in [0, 1]: 0 = no intervention (base model), 1 = full intervention; tune for fairness vs. capability trade-off
intervened_model.sparsify(0.1)           # keep only the strongest 10% of the interventions
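The sparsify step can be understood as magnitude-based pruning: rank interventions by how far they deviate from the identity transform and reset the weakest to the identity. A standalone sketch under that assumption (the repository's actual pruning criterion may differ):

```python
import numpy as np

def sparsify(scale, bias, keep_frac=0.1):
    """Keep only the strongest fraction of interventions.

    Strength is measured as deviation from the identity transform
    (scale = 1, bias = 0); weaker entries are reset to the identity.
    """
    strength = np.abs(scale - 1.0) + np.abs(bias)
    k = max(1, int(keep_frac * strength.size))
    cutoff = np.sort(strength.ravel())[-k]  # k-th largest strength
    mask = strength >= cutoff
    return np.where(mask, scale, 1.0), np.where(mask, bias, 0.0)

# 10 entries, keep_frac=0.1 -> only the single strongest survives.
scale = np.array([1.0, 1.5, 0.9, 1.05, 2.0, 1.0, 1.01, 0.5, 1.2, 1.0])
bias = np.zeros(10)
s_sparse, b_sparse = sparsify(scale, bias, keep_frac=0.1)
```

Resetting pruned entries to the identity (rather than zero) matters: an entry with `scale = 1, bias = 0` leaves the LayerNorm output untouched, so pruning falls back to the base model's behavior.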

Examples

The following scripts demonstrate how to train and load interventions for both LLMs and VLMs. The loading scripts print the model output before and after applying the interventions, so you can directly observe how model behavior changes with and without DSO. Note that each loading script requires the trained intervention file produced by the corresponding training script.

| Script | Task | Model type | Requires |
| --- | --- | --- | --- |
| examples/training_interventions_llm.py | Fairness in coreference resolution (SynthBias) | LLM | |
| examples/training_interventions_vlms.py | Fairness in occupation recognition (SocialCounterfactuals) | VLM | |
| examples/loading_interventions_llm.py | Inference with trained interventions | LLM | run training_interventions_llm.py first |
| examples/loading_interventions_vlms.py | Inference with trained interventions | VLM | run training_interventions_vlms.py first |

How to Extend for Other Rewards and Applications

To apply DSO to a new task you need two things: a dataloader and a reward function.

  1. Dataloader — subclass BaseDatasetLoader in data/dataloaders.py and implement _build_datasets() to return train/eval TabularDataset instances. Each sample must include a conversation field (list of chat-template dicts) and any metadata your reward needs.

  2. Reward function — subclass BaseReward in rewards.py and implement __call__ to return a [B] float tensor of per-sample rewards (positive = desired behavior).

from rewards import BaseReward

class MyReward(BaseReward):
    def __call__(self, completions, input_metadata=None):
        # return a [B] float tensor: positive = desired behavior
        ...

trainer = TrainInterventions(model, processor, reward_fn=MyReward())

For complete working reference implementations, see rewards.py (GenderParityRewardForSynthBias, GenderParityRewardForVisionTasks) and data/dataloaders.py (CorefResolutionLoader, Act4MeLoader). These serve as end-to-end examples of how to pair a dataset loader with a reward function for a new task.
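To make the reward contract concrete, here is a hypothetical batch-level gender-parity reward (an illustration of the idea, not the repository's GenderParityRewardForSynthBias; it returns a plain list of floats, whereas the repository's rewards return a [B] torch tensor):

```python
MALE_WORDS = {"he", "him", "his", "man"}
FEMALE_WORDS = {"she", "her", "hers", "woman"}

def gender_parity_rewards(completions):
    """Score each completion by how balanced the batch's gender choices are.

    A batch at exactly 50/50 male/female yields reward 1.0 for every
    sample; a batch that always picks the same gender yields 0.0.
    """
    def label(text):
        tokens = set(text.lower().split())
        if tokens & MALE_WORDS:
            return "male"
        if tokens & FEMALE_WORDS:
            return "female"
        return None  # completion mentions neither gender

    labels = [label(c) for c in completions]
    counted = [l for l in labels if l is not None]
    rewards = []
    for l in labels:
        if l is None or not counted:
            rewards.append(0.0)  # unparseable completion gets no reward
            continue
        freq = counted.count(l) / len(counted)
        rewards.append(1.0 - 2.0 * abs(freq - 0.5))
    return rewards

balanced = gender_parity_rewards(["he is the doctor", "she is the doctor"])
skewed = gender_parity_rewards(["he is the doctor", "he is the nurse"])
```

A perfectly balanced batch scores 1.0 per sample and a fully skewed batch scores 0.0, so maximizing this reward pushes the policy toward the 50/50 distribution described above.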

Running Tests

pytest                          # all tests
pytest tests/test_data.py       # single file

Results: λ vs Bias Trade-off

The plots show how the intervention strength λ controls gender-occupation bias across two models. We compare DSO against ITI (Inference-Time Intervention) and CAA (Contrastive Activation Addition). DSO provides the most stable control: bias decreases monotonically with λ, whereas ITI and CAA exhibit non-monotonic behavior. Moreover, CAA and Prompting do not reduce bias, indicating that these methods are ineffective for bias mitigation in VLMs. Additionally, there is no principled way to modify a prompt to guarantee a monotonic bias reduction. Together, these observations show that DSO enables controllable bias mitigation at inference time, whereas other methods yield unpredictable effects on bias.

Figure: intervention strength λ vs. bias for Qwen 2.5 3B (left) and Gemma 3 4B (right).
Intervention strength λ (x-axis) vs. bias per occupation (y-axis) measured on the SocialCounterfactuals dataset using the occupation identification task. DSO offers better inference-time bias controllability than alternative methods. Prompting appears as a single point since you cannot continuously control the strength of a prompt.

Citation

@article{monteiro2026dso,
  title   = {DSO: Direct Steering Optimization for Bias Mitigation},
  author  = {Monteiro Paes, Lucas and Sivakumar, Nivedha and Wang, Oliver and
             Fedzechkina, Masha and Theobald, Barry-John and Zappella, Luca and
             Apostoloff, Nicholas},
  journal = {arXiv preprint arXiv:2512.15926},
  year    = {2026}
}
