In a visual-assistance scenario for a visually impaired user (left), VLMs often rely on gender stereotypes — such as assuming the man is the doctor — leading to biased responses. Since both candidates are equally valid answers, DSO steers the model toward a 50/50 gender distribution, while preserving the model's broader capabilities.
This software project accompanies the research paper DSO: Direct Steering Optimization for Bias Mitigation (Monteiro Paes, Sivakumar, Wang et al., 2026).
DSO uses reinforcement learning to train lightweight affine adapters (a learned scale and bias applied to each LayerNorm output) on frozen LLMs and VLMs to mitigate bias, while preserving overall model capabilities. The intervention strength λ can be tuned at inference time to balance fairness and performance — no retraining required. Although focused on bias mitigation, DSO is general and can be extended to other domains like toxicity mitigation and AI alignment by implementing a new reward function and dataloader.
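To make the mechanism concrete, here is a minimal sketch of such an adapter, assuming a simple per-dimension scale-and-bias parameterization blended with the identity by λ (the repository's actual parameterization may differ):

```python
import torch
import torch.nn as nn

class AffineAdapter(nn.Module):
    """Illustrative DSO-style adapter (hypothetical parameterization):
    a learned scale and bias on a LayerNorm output, blended by strength λ."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(hidden_size))  # learned scale
        self.bias = nn.Parameter(torch.zeros(hidden_size))  # learned bias

    def forward(self, h: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
        steered = self.scale * h + self.bias
        # λ = 0 recovers the frozen base model; λ = 1 applies the full intervention.
        return (1.0 - lam) * h + lam * steered
```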
```bash
uv venv --python 3.11
source .venv/bin/activate
uv pip install -r requirements.txt
```

**Training interventions:**
The following scripts train the linear interventions to be applied to the LLM/VLM module (depending on your application). You need to run these scripts to obtain the trained intervention file (.pt) required for inference. You do not need to run both — run the one that matches your target application, i.e., either VLM or LLM. The trained interventions are saved as interventions_llm.pt or interventions_vlm.pt in the main folder.
```bash
python examples/training_interventions_llm.py   # LLM: coreference resolution (identify which occupation a gendered pronoun refers to in a sentence)
python examples/training_interventions_vlms.py  # VLM: occupation recognition (identify the occupation of a person depicted in an image)
```

**Loading and using trained interventions:**
The following script loads the trained interventions and applies them to the model at inference time. The intervention_path corresponds to the file produced by the training script (either interventions_llm.pt or interventions_vlm.pt). The intervention strength λ controls how strongly the adapter steers the model: higher values increase bias mitigation but may affect general capability; lower values preserve capability at the cost of less debiasing. Because λ is decoupled from training, it can be swept post-hoc on a validation set to find the optimal fairness/capability operating point without retraining.
```python
intervened_model = HookInterventions(model, get_layernorms, intervention_path="<intervention_path>.pt")
intervened_model.set_strength(0.8)  # λ in [0, 1]: 0 = no intervention (base model), 1 = full intervention; tune for fairness vs. capability trade-off
intervened_model.sparsify(0.1)      # keep only the strongest 10% of the interventions
```
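Since λ is set purely at inference time, a simple post-hoc sweep on a validation set can locate a good operating point. A minimal sketch, where `eval_bias`, `eval_accuracy`, and `val_set` are hypothetical placeholders (not part of this repo):

```python
# Hypothetical helpers: eval_bias / eval_accuracy / val_set are placeholders.
results = {}
for lam in [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]:
    intervened_model.set_strength(lam)
    results[lam] = (eval_bias(intervened_model, val_set),
                    eval_accuracy(intervened_model, val_set))
# Pick the largest λ whose accuracy stays within tolerance of the λ = 0 baseline.
```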
The following scripts exemplify how to train and load interventions for both LLMs and VLMs. The loading scripts print the model output before and after applying the interventions, allowing you to directly observe how model behavior changes with and without DSO applied. Note that the loading scripts require the trained intervention file produced by the corresponding training script.

| Script | Task | Model type | Requires |
|---|---|---|---|
| `examples/training_interventions_llm.py` | Fairness in coreference resolution (SynthBias) | LLM | — |
| `examples/training_interventions_vlms.py` | Fairness in occupation recognition (SocialCounterfactuals) | VLM | — |
| `examples/loading_interventions_llm.py` | Inference with trained interventions | LLM | run `training_interventions_llm.py` first |
| `examples/loading_interventions_vlms.py` | Inference with trained interventions | VLM | run `training_interventions_vlms.py` first |
To apply DSO to a new task you need two things: a dataloader and a reward function.
- **Dataloader** — subclass `BaseDatasetLoader` in `data/dataloaders.py` and implement `_build_datasets()` to return train/eval `TabularDataset` instances. Each sample must include a `conversation` field (a list of chat-template dicts) and any metadata your reward needs; see the loader sketch below.
- **Reward function** — subclass `BaseReward` in `rewards.py` and implement `__call__` to return a `[B]` float tensor of per-sample rewards (positive = desired behavior).
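A minimal loader sketch, assuming `TabularDataset` is also importable from `data/dataloaders.py` and accepts a list of sample dicts (sample fields other than `conversation` are hypothetical):

```python
from data.dataloaders import BaseDatasetLoader, TabularDataset

class MyTaskLoader(BaseDatasetLoader):
    def _build_datasets(self):
        # Each sample carries a chat-template conversation plus whatever
        # metadata the reward function needs (here, a hypothetical "label").
        samples = [
            {
                "conversation": [
                    {"role": "user", "content": "Who is the doctor in this sentence?"}
                ],
                "label": "doctor",  # hypothetical metadata field
            },
            # ... more samples ...
        ]
        split = len(samples) // 2
        return TabularDataset(samples[:split]), TabularDataset(samples[split:])
```

The reward function follows the same pattern: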
```python
import torch

from rewards import BaseReward

class MyReward(BaseReward):
    def __call__(self, completions, input_metadata=None):
        # Return a [B] float tensor: positive = desired behavior.
        # Illustrative toy scoring; replace with your task's criterion.
        return torch.tensor([1.0 if "doctor" in c else -1.0 for c in completions])

# model and processor come from this repo's training setup.
trainer = TrainInterventions(model, processor, reward_fn=MyReward())
```

For complete working reference implementations, see `rewards.py` (`GenderParityRewardForSynthBias`, `GenderParityRewardForVisionTasks`) and `data/dataloaders.py` (`CorefResolutionLoader`, `Act4MeLoader`). These serve as end-to-end examples of how to pair a dataset loader with a reward function for a new task.
```bash
pytest                     # all tests
pytest tests/test_data.py  # single file
```

The plots show how the intervention strength λ controls gender-occupation bias across two models. We compare DSO against ITI (Inference-Time Intervention) and CAA (Contrastive Activation Addition). DSO provides the most stable control: bias decreases monotonically with λ, whereas ITI and CAA exhibit non-monotonic behavior. Moreover, CAA and prompting do not reduce bias, indicating that these methods are ineffective for bias mitigation in VLMs. Additionally, there is no principled way to modify a prompt to guarantee a monotonic bias reduction. Together, these observations show that DSO enables controllable bias mitigation at inference time, whereas other methods yield unpredictable effects on bias.
```bibtex
@article{monteiro2026dso,
  title   = {DSO: Direct Steering Optimization for Bias Mitigation},
  author  = {Monteiro Paes, Lucas and Sivakumar, Nivedha and Wang, Oliver and
             Fedzechkina, Masha and Theobald, Barry-John and Zappella, Luca and
             Apostoloff, Nicholas},
  journal = {arXiv preprint arXiv:2512.15926},
  year    = {2026}
}
```
