Pau Rodriguez*, Michal Klein, Eleonora Gualdoni, Valentino Maiorca, Arno Blaas, Luca Zappella, Marco Cuturi and Xavier Suau*
This software project accompanies the research paper: LinEAS: End-to-end Learning of Activation Steering with a Distributional Loss, NeurIPS 2025 (bibtex).

- Clone the repository:

  ```bash
  git clone https://github.com/apple/ml-lineas
  cd ml-lineas
  ```

- Install `uv`:

  ```bash
  curl -LsSf https://astral.sh/uv/install.sh | sh
  export PATH="$HOME/.local/bin:$HOME/.cargo/bin:$PATH"  # Ensure uv is in PATH
  source ~/.bashrc  # Reload the shell configuration
  ```

- Install the project / create the environment:

  ```bash
  uv sync
  source .venv/bin/activate
  ```
- Download datasets and models. For ease of explanation, we will use the environment variables `DATA_DIR` and `CACHE_DIR` to point to where the datasets and models are stored. Also, set `HF_TOKEN` if needed.

  ```bash
  # Required for some specific models like Gemma-2 or datasets like TET
  export HF_TOKEN="your_token"
  # Optional
  export DATA_DIR="some/path"
  export CACHE_DIR="some/other/path"
  export HF_HUB_CACHE="another/path"
  ```

  Then call

  ```bash
  python -m lineas.scripts.download_external_data
  ```

  to download external assets to your local `$DATA_DIR`. This will download RTP prompts, the Jigsaw toxicity dataset, and the COCO captions dataset. Note that models are downloaded automatically via Hugging Face. You can set `HF_HUB_CACHE` to point to a specific folder (see the Hugging Face documentation).

- Optionally, run the provided tests to make sure the setup is correct. The first run will download some small models from Hugging Face.

  ```bash
  pytest . -m "not slow"
  ```
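  The `-m "not slow"` filter skips tests marked as slow; dropping the marker presumably runs those too (expect additional downloads):

  ```bash
  # Run the full test suite, including tests marked as slow
  pytest .
  ```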
This repository contains the code for a research paper focusing on controlling model behavior through learned interventions. We provide a pipeline script that enables users to:
- Extract Activations: Obtain activations from specified model layers (see the sketch after this list for the general idea).
- Learn Interventions: Utilize extracted activations to learn interventions that control model behavior.
- Evaluate Intervened Models: Assess the performance of intervened models on various tasks.
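In the repository this is all driven by `pipeline.py` and the Hydra configs. Purely as an illustration of the underlying mechanism (this is not the repo's API; the model name, regex, and storage scheme below are hypothetical), here is a minimal PyTorch sketch that captures activations from modules whose names match a regex, analogous to the `target_module_names` patterns used in the configs:

```python
# Minimal sketch of activation extraction with PyTorch forward hooks.
# NOT the repo's API: model, regex, and storage are made-up examples.
import re

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # hypothetical small model, used only for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

activations: dict[str, torch.Tensor] = {}

def make_hook(name: str):
    def hook(module, inputs, output):
        activations[name] = output.detach()  # store this module's output
    return hook

# Register a hook on every module whose name matches the regex
pattern = re.compile(r".*ln_2")  # GPT-2's post-attention layernorms
for name, module in model.named_modules():
    if pattern.fullmatch(name):
        module.register_forward_hook(make_hook(name))

inputs = tokenizer("A dragon guards the castle.", return_tensors="pt")
with torch.no_grad():
    model(**inputs)

print(f"captured activations from {len(activations)} modules")
```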
Quick summary of the main files in the repository:

- Python scripts:
  - `pipeline.py`: main pipeline for incremental learning of model interventions.
  - `learn_intervention.py`: core functionality for learning interventions from model activations.
- Hydra configuration files (`configs` directory):
  - `text_generation.yaml` and `text_to_image_generation.yaml`: primary config files, specifying:
    - model architecture and layers
    - task parameters (e.g., dataset, batch size)
    - intervention type and settings (e.g., `lineas`)
    - evaluation tasks (e.g., RTP, zero-shot evaluation)
  - Referenced sub-configs:
    - `task_params/fantasy.yaml` (task-specific settings)
    - `model/gpt2.yaml` (model architecture details)
    - `intervention_params/lineas` (intervention-type specific settings; not explicitly listed, implied as part of the config structure)
    - `wandb/lineas.yaml` (WandB logging configuration)
The `lineas` intervention in this repository implements Linear-AcT as defined in our paper Controlling Language and Diffusion Models by Transporting Activations.
```bash
# see lineas/configs/text_generation.yaml for configuration details
python -m lineas.scripts.pipeline \
    "model=gemma-2-2b" \
    "task_params=fantasy" \
    "responses.batch_size=32" \
    "responses.max_batches=1" \
    "wandb.mode=disabled" \
    "interventions.batch_size=32" \
    "intervention_params=lineas" \
    "intervention_params.optimization_params.steps=50" \
    "+model.target_module_names=[.*post.*layernorm]" \
    "text_generation.num_sentences=10" \
    "text_generation.new_seq_len=48" \
    "text_generation.strength_sample_size=2" \
    "device=cuda" \
    "model.dtype=float32"
```
This command will:

- Extract activations from a pre-trained `Gemma-2-2b` model, as specified in `configs/text_generation.yaml`. We collect 1 batch of size 32 (`responses.batch_size=32`, `responses.max_batches=1`); note the fantasy task provides 20 sentences in `data/fantasy.json`. Change to `device=mps` if working on macOS, and keep `device=cuda` when working on a GPU for better speed.
- Use the responses to learn an intervention. We set `intervention_params=lineas` and reduce the steps to 50 to make this example faster, but better performance is achieved with some extra steps (e.g., 1000).
- Generate text with the intervened model. We ask it to generate 10 sentences (`text_generation.num_sentences=10`) at 2 different strengths (`text_generation.strength_sample_size=2`) between 0 and 1 (so 0.0 and 1.0).
- Evaluate the generated text (see `evaluations` in `lineas/configs/task_params/toxicity.yaml` and `lineas/configs/text_generation.yaml`).
Important: `responses.batch_size * responses.max_batches` sets the number of points that define the target distribution, and it is computed offline; in the command above that is 32 × 1 = 32 points. `interventions.batch_size` sets the number of points that define the source distribution, and it is computed online. Always use `interventions.batch_size >= 4` if possible.
Note that we use Hydra as our configuration and argument manager.
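All the `key=value` arguments in the commands here follow standard Hydra override syntax; in particular, a leading `+` (as in `+model.target_module_names=...` above) appends a key that is absent from the base config. Both values below are taken from the example commands in this README:

```bash
# Override an existing config value
python -m lineas.scripts.pipeline "wandb.mode=disabled"

# Prepend '+' to add a key that does not exist in the base config
python -m lineas.scripts.pipeline "+model.target_module_names=[.*post.*layernorm]"
```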
Results will be stored in `results_dir` (set in the config file, or at runtime with `results_dir=<your/results_dir/path>`). Results will also be uploaded to `wandb` if you have set it up (more about the wandb config for this project in `configs/wandb/lineas.yaml`). For task-specific evaluations (e.g., `toxicity`, `text_generation`, `zero_shot`), modify the `evaluation` parameter in `text_generation.yaml` or override it via the command line, and re-run the pipeline.
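For example, a minimal variant of the earlier fantasy run that overrides only the evaluation tasks (other settings fall back to the config defaults):

```bash
python -m lineas.scripts.pipeline \
    "model=gemma-2-2b" \
    "task_params=fantasy" \
    "intervention_params=lineas" \
    "evaluation=[text_generation,zero_shot]" \
    "wandb.mode=disabled"
```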
While in the paper we optimize for 1000 iterations with a learning rate of 1e-5, we have found that 50 iterations and a learning rate of 1e-3 already yield good results for most conditionings. Tested on a single A100 80GB GPU.
```bash
python -m lineas.scripts.pipeline \
    --config-name text_to_image_generation \
    task_params=diffusion_prompts \
    'task_params.src_subsets=["none"]' \
    'task_params.dst_subsets=["pixel"]' \
    'task_params.prompt_subset=["none"]' \
    responses.batch_size=4 \
    responses.max_batches=16 \
    interventions.max_batches=null \
    interventions.batch_size=4 \
    wandb.mode=offline \
    'evaluation=["text_to_image_generation"]' \
    text_to_image_generation.batch_size=4 \
    text_to_image_generation.max_batches=2 \
    text_to_image_generation.create_gif=true \
    intervention_params=lineas \
    intervention_params.optimization_params.steps=50 \
    intervention_params.optimization_params.learning_rate=1e-3 \
    intervention_params.optimization_params.optimizer=Adam \
    model=DMD2 \
    model.unet_with_grads=true \
    device=cuda \
    'model.dtype=${dtype:torch.bfloat16}' \
    intervention_params.optimization_params.criterion=wasserstein \
    'model.module_names=["unet.*norm.*"]'
```
Line by line:

- `--config-name text_to_image_generation` chooses the config file `configs/text_to_image_generation.yaml`.
- `task_params=diffusion_prompts` chooses the task `diffusion_prompts` in `configs/task_params`.
- `task_params.src_subsets=["none"]` and `task_params.dst_subsets=["pixel"]` choose the source and destination datasets respectively.
- `task_params.prompt_subset=["none"]` chooses the prompt dataset used at inference time.
- `responses.batch_size=4` and `responses.max_batches=16` extract 4 responses per batch and run 16 batches (64 samples). We used 32 source and 32 target prompts in the paper.
- `interventions.max_batches=null` will use all extracted responses to learn an intervention.
- `evaluation=["text_to_image_generation"]` generates images after the intervention has been learned. You can also add `clip_score` here.
- `text_to_image_generation.create_gif=true` saves GIF animations of the generated images at different strengths. The strengths used are configured in `configs/text_to_image_generation.yaml` under `text_to_image_generation` with `min_strength`, `max_strength`, and `strength_steps` (the actual strengths will be `np.linspace(min_strength, max_strength, strength_steps)`).
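As a quick illustration of how those strengths are derived (the values below are made up; the real defaults live in `configs/text_to_image_generation.yaml`):

```python
import numpy as np

# Illustrative values, not the repo defaults
min_strength, max_strength, strength_steps = 0.0, 1.0, 5

strengths = np.linspace(min_strength, max_strength, strength_steps)
print(strengths)  # [0.   0.25 0.5  0.75 1.  ]
```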
As before, results will be stored in `results_dir` and uploaded to `wandb` if you have set it up. In `results_dir/generate_with_hooks_diffusion/` you will find the generated images, with a folder for each strength value and guidance scale set up in `text_to_image_generation.yaml`, in the format `{strength:.03f}_{guidance:.03f}/<image_id>.png`.
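For example, assuming a strength of 0.5 and a guidance scale of 7.5 (both made-up values), the corresponding folder would be named as follows:

```python
strength, guidance = 0.5, 7.5  # made-up example values
folder = f"{strength:.03f}_{guidance:.03f}"
print(folder)  # 0.500_7.500
```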
To reproduce experiments related to toxicity mitigation with LLMs, we need some additional external data.

```bash
# Remember to call the following!
# Downloads data to /tmp/lineas (or $DATA_DIR if the env variable is set)
python lineas/scripts/download_external_data.py
```

Then, all you need to do is run a pipeline with a toxicity task. Remember to download the model from Hugging Face to `$CACHE_DIR`.
The following command runs a toxicity evaluation on `qwen2.5-1.5b`, with LinEAS trained on only 32 data points.
```bash
python -m lineas.scripts.pipeline \
    model=qwen2.5-1.5b \
    task_params=toxicity \
    responses.batch_size=32 \
    interventions.batch_size=32 \
    responses.max_batches=1 \
    intervention_params=lineas \
    intervention_params.optimization_params.steps=1000 \
    +model.target_module_names=[.*post.*layernorm] \
    model.dtype=float32 \
    device=cuda \
    intervention_params.optimization_params.optimizer=SGD \
    intervention_params.optimization_params.learning_rate=0.1 \
    intervention_params.optimization_params.criterion=wasserstein \
    wandb.mode=online wandb.project=lineas-tox  # Optional wandb logging
```
The main configuration groups are:

- Model: Specify the model architecture, path, and layer names for intervention.
- Task Params: Define task-specific settings (e.g., dataset, batch size).
- Intervention Params: Configure the intervention type, incremental mode, and hook parameters.
- Evaluation: Choose the evaluation tasks to run after learning interventions.
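As a rough, hypothetical sketch of how these groups might compose in a primary config (the actual files in `configs/` are authoritative; the defaults list below is illustrative only, built from the sub-configs named earlier in this README):

```yaml
# Hypothetical sketch of a primary config such as configs/text_generation.yaml.
# The actual keys and defaults in the repository may differ.
defaults:
  - model: gpt2
  - task_params: fantasy
  - intervention_params: lineas
  - wandb: lineas

device: cuda
evaluation:
  - text_generation
```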
You can customize experiments in several ways:

- (preferred) Override config values via the command line, using `key=value` pairs. For example:

  ```bash
  python -m lineas.scripts.pipeline \
      --config-name text_generation \
      interventions.intervention_params.name=your_new_intervention \
      "evaluation=[rtp,zero_shot]"
  ```

  This approach allows for quick testing of different configurations without modifying the YAML files.
- Change where the intervention is performed:

  The easiest way is to override arguments via the command line, e.g. `model.module_names=['.*layernorm.*']`. Another option is to directly modify the config file, e.g.:

  ```yaml
  model:
    model_path: "path/to/your/model"
    module_names:
      - layer1_regex
      - layer2_regex
  ```

  or modify/add a new model in `configs/model` and reference it in `text_generation.yaml` or `text_to_image_generation.yaml`. To sanity-check which modules a regex will actually match, see the sketch after this list.
Switch to a Different Intervention:
interventions: intervention_params: name: your_intervention_name # Update hook_params if necessary for the new intervention hook_params: key: value
- Modify Evaluation Tasks:

  ```yaml
  evaluation:
    - toxicity
    - zero_shot
    # Add or remove tasks as needed
  ```
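Since the `module_names` / `target_module_names` entries are regexes over module names, it can help to check what a pattern matches before launching a run. A generic PyTorch sketch, not a repo utility (the model and pattern are made-up examples):

```python
import re

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # made-up example model
pattern = re.compile(r".*ln_.*")  # made-up layernorm regex

# List the module names the regex would select
matching = [name for name, _ in model.named_modules() if pattern.fullmatch(name)]
print(f"{len(matching)} modules match, e.g. {matching[:3]}")
```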
```bibtex
@article{rodriguez2025end-to-end,
  title={LinEAS: End-to-end Learning of Activation Steering with a Distributional Loss},
  author={Rodriguez, Pau and Klein, Michal and Gualdoni, Eleonora and Maiorca, Valentino and Blaas, Arno and Zappella, Luca and Cuturi, Marco and Suau, Xavier},
  journal={NeurIPS},
  year={2025}
}
```