traitinterp

Train a linear probe. See what your model is thinking. Steer it.

Live demo | Docs | Methodology

What this does

Extract — Train a linear probe that detects a behavioral trait (sycophancy, deception, formality, etc.)
Monitor — Project hidden states onto that probe token-by-token during generation
Steer — Add the probe direction during inference to amplify or suppress the trait

Trait datasets are model-agnostic. Extract once, apply to any model.

Requirements

Extraction and steering require a GPU. Use --load-in-4bit for quantization to fit larger models.

Model size	Minimum GPU	VRAM
Up to 9B	RTX 3090 / Apple Silicon M1 Pro+	24GB
Up to 14B	RTX 3090/4090	24GB
70B+	A100-80GB (or multi-GPU with TP)	80GB

The visualization dashboard runs anywhere (CPU only).

Quick start

git clone https://github.com/ewernn/traitinterp.git && cd traitinterp
pip install -r requirements.txt
export HF_TOKEN=your_token  # for gated models

Extract your first trait:

python extraction/run_extraction_pipeline.py \
    --experiment my_first_run \
    --traits psychology/sycophancy

This generates responses, vets them with an LLM judge, trains probes across all layers, and evaluates quality. Results land in experiments/my_first_run/.

Monitor traits during generation:

python inference/run_inference_pipeline.py \
    --experiment my_first_run \
    --prompt-set general/baseline

Visualize:

python visualization/serve.py  # http://localhost:8000

How it works

Extraction

Define a trait with naturally contrasting scenarios — prompts where the model's completion naturally exhibits vs. avoids a behavior. No instruction-following, no system prompts. The model doesn't know it's being measured.

datasets/traits/psychology/sycophancy/
├── positive.txt      # scenarios that elicit sycophantic responses
├── negative.txt      # matched scenarios that don't
├── definition.txt    # what sycophancy means + scoring rubric
└── steering.json     # evaluation questions for causal validation

Generate responses on both sets, capture activations, train a linear probe to separate them. The probe direction is your trait vector.

Monitoring

Project hidden states onto the trait vector at every token:

score = hidden_state @ trait_vector

Positive scores = expressing the trait. Negative = avoiding it. Watch it evolve token-by-token as the model generates.

Steering

Add the trait vector to hidden states during generation:

hidden_state += coefficient * trait_vector

An automated coefficient search with LLM-as-judge evaluation finds the strength that maximizes trait expression while maintaining coherence.

Visualization

Interactive dashboard at traitinterp.com (or run locally):

Extraction — probe accuracy heatmaps across layers and methods
Steering — coefficient sweep results, response browser
Dynamics — per-token trait trajectory during generation
Live Chat — real-time monitoring and steering during conversation

Repository structure

trait-interp/
├── datasets/traits/       # Model-agnostic trait definitions (scenarios, rubrics, eval questions)
├── extraction/            # Extract trait vectors: run_extraction_pipeline.py
├── inference/             # Monitor traits: run_inference_pipeline.py
├── steering/              # Validate via steering: run_steering_eval.py
├── core/                  # Primitives: types, hooks, methods, projection math
├── utils/                 # Shared: model loading, paths, generation, vector I/O
├── config/                # Path templates, model configs, LoRA registry
├── visualization/         # Interactive dashboard (serves traitinterp.com)
├── analysis/              # Model comparison, benchmarks, vector analysis
├── experiments/           # Output data (vectors, activations, results)
└── docs/                  # Full documentation

Docs

docs/main.md — Codebase reference and navigation
docs/workflows.md — Practical workflow guide
docs/extraction_guide.md — Complete extraction reference
docs/overview.md — Methodology and key learnings
docs/scenario_design_guide.md — Writing good contrasting scenarios

Attribution

Core extraction logic adapted from safety-research/persona_vectors. Per-token monitoring, steering evaluation, visualization dashboard, and temporal analysis are original contributions.

MIT License

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

traitinterp

What this does

Requirements

Quick start

How it works

Extraction

Monitoring

Steering

Visualization

Repository structure

Docs

Attribution

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 489 Commits
analysis		analysis
config		config
core		core
datasets		datasets
docs		docs
experiments		experiments
extraction		extraction
inference		inference
steering		steering
utils		utils
visualization		visualization
.env.example		.env.example
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

traitinterp

What this does

Requirements

Quick start

How it works

Extraction

Monitoring

Steering

Visualization

Repository structure

Docs

Attribution

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages