Train a linear probe. See what your model is thinking. Steer it.
Three pipelines, one trait definition:
- Extract — train a linear probe that detects a behavioral trait
- Monitor — project hidden states onto the probe token-by-token during generation
- Steer — add the probe direction during inference to amplify or suppress the trait
Trait definitions are model-agnostic — define once, run on any HuggingFace model. Tested on Llama 3.1/3.3, Gemma 2/3, Qwen 2.5/3, Mistral, OLMo, DeepSeek-R1, GPT-OSS, and Kimi K2.
git clone https://github.com/ewernn/traitinterp.git && cd traitinterp
pip install -e .
export HF_TOKEN=your_token # for gated models
Six traits ship in datasets/traits/starter_traits/: assistant_axis, desperate, formality, golden_gate_bridge, sad, sycophancy.
Extract a trait vector:
python extraction/run_extraction_pipeline.py \
--experiment starter \
--traits starter_traits/sycophancy
Generates responses → vets them with an LLM judge → trains probes across all layers → evaluates quality. Output lands in experiments/starter/extraction/.
Monitor during generation:
python inference/run_inference_pipeline.py \
--experiment starter \
--prompt-set starter_prompts/general
Validate causally via steering:
python steering/run_steering_eval.py \
--experiment starter \
--traits starter_traits/sycophancy
Searches for the coefficient that maximizes trait expression while keeping output coherent (LLM-as-judge).
Visualize:
python visualization/serve.py # http://localhost:8000
datasets/traits/starter_traits/sycophancy/
├── positive.{json,jsonl,txt} # scenarios that elicit the trait
├── negative.{json,jsonl,txt} # matched scenarios that don't
├── definition.txt # what the trait means + scoring rubric
└── steering.json # evaluation questions for causal validation
Three scenario-file formats are supported (pick one per polarity; having more than one format for the same polarity is an error):
- .json: cartesian, {"prompts": [...], "system_prompts": [...]}. Expanded at load time to all N×M pairs. Compact for instruct-model extraction with a small set of contrastive system prompts.
- .jsonl: one explicit {"prompt": "...", "system_prompt": "..."} per line.
- .txt: one prompt per line (no system prompt). Used for base-model prefix completion.
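To make the three formats concrete, here is a minimal loader sketch. It is not the repo's loader: `load_scenarios` is a hypothetical name, and the prompt-major ordering of the N×M expansion is an assumption.

```python
import itertools
import json
import tempfile
from pathlib import Path


def load_scenarios(path):
    """Return a list of (prompt, system_prompt) pairs from one scenario file.

    Hypothetical helper for illustration only.
    """
    path = Path(path)
    text = path.read_text(encoding="utf-8")
    if path.suffix == ".json":
        spec = json.loads(text)
        # Cartesian expansion: every prompt paired with every system prompt.
        return list(itertools.product(spec["prompts"], spec["system_prompts"]))
    if path.suffix == ".jsonl":
        rows = [json.loads(line) for line in text.splitlines() if line.strip()]
        return [(row["prompt"], row["system_prompt"]) for row in rows]
    if path.suffix == ".txt":
        # Base-model prefixes carry no system prompt.
        return [(line, None) for line in text.splitlines() if line.strip()]
    raise ValueError(f"unsupported scenario format: {path.suffix}")


# Demo: 3 prompts x 2 system prompts expand to 6 pairs.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "positive.json"
    demo.write_text(
        json.dumps({"prompts": ["a", "b", "c"], "system_prompts": ["x", "y"]}),
        encoding="utf-8",
    )
    pairs = load_scenarios(demo)

print(len(pairs))
```

The cartesian form keeps the file small when a handful of contrastive system prompts is reused across many user prompts; the .jsonl form is the escape hatch when each pairing has to be chosen by hand.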
Capture hidden states on both scenario sets, then fit a linear separator (logistic regression, difference-of-means, etc.). The probe direction is your trait vector.
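As a rough illustration of the difference-of-means option, the sketch below fits a probe on synthetic activations with NumPy. The data, `d_model`, the `mean_diff_probe` name, and the unit-norm normalization are all assumptions for the example, not the repo's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16

# Synthetic stand-ins for captured hidden states: the two scenario sets
# are shifted in opposite directions along one hidden axis.
true_direction = np.zeros(d_model)
true_direction[0] = 1.0
pos_acts = rng.normal(size=(200, d_model)) + 2.0 * true_direction
neg_acts = rng.normal(size=(200, d_model)) - 2.0 * true_direction


def mean_diff_probe(pos, neg):
    """Difference-of-means direction, normalized to unit length (an assumption)."""
    direction = pos.mean(axis=0) - neg.mean(axis=0)
    return direction / np.linalg.norm(direction)


trait_vector = mean_diff_probe(pos_acts, neg_acts)

# Classify by the sign of the projection relative to the midpoint of the two means.
midpoint = 0.5 * (pos_acts.mean(axis=0) + neg_acts.mean(axis=0))
pos_correct = ((pos_acts - midpoint) @ trait_vector > 0).mean()
neg_correct = ((neg_acts - midpoint) @ trait_vector < 0).mean()
accuracy = 0.5 * (pos_correct + neg_correct)
cosine = float(trait_vector @ true_direction)

print(f"accuracy={accuracy:.3f} cosine_to_true_direction={cosine:.3f}")
```

Difference-of-means needs no optimizer and no hyperparameters, which makes it a useful baseline to compare against a logistic-regression probe fit on the same activations.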
score = hidden_state @ trait_vector
Positive = expressing the trait. Negative = avoiding it. One score per token, per layer.
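The projection above, applied to every token at every layer, can be sketched as a single batched dot product. The shapes and the random arrays are placeholders, and keeping one unit-norm vector per layer is an assumption about how the probes are stored.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, n_tokens, d_model = 4, 6, 16

# Placeholder activations and per-layer trait vectors (random, for shape only).
hidden_states = rng.normal(size=(n_layers, n_tokens, d_model))
trait_vectors = rng.normal(size=(n_layers, d_model))
trait_vectors /= np.linalg.norm(trait_vectors, axis=1, keepdims=True)

# scores[l, t] = hidden_states[l, t] @ trait_vectors[l]
scores = np.einsum("ltd,ld->lt", hidden_states, trait_vectors)

print(scores.shape)  # one score per layer, per token
```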
hidden_state += coefficient * trait_vector
An automated coefficient search picks the strength that maximizes trait expression while keeping output coherent.
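A small NumPy sketch of the steering update and of a constrained coefficient search follows. It is illustrative only: `steer`, `toy_trait_score`, `toy_coherence`, the candidate grid, and the coherence threshold are made-up stand-ins, whereas the real pipeline scores generated text with an LLM judge.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16

trait_vector = rng.normal(size=d_model)
trait_vector /= np.linalg.norm(trait_vector)  # unit norm (an assumption)
hidden_state = rng.normal(size=(5, d_model))  # 5 token positions


def steer(hidden, vector, coefficient):
    """Add the scaled trait direction to every token's hidden state."""
    return hidden + coefficient * vector


before = hidden_state @ trait_vector
amplified = steer(hidden_state, trait_vector, 4.0) @ trait_vector
suppressed = steer(hidden_state, trait_vector, -4.0) @ trait_vector
# With a unit-norm vector, each projection shifts by exactly the coefficient.


def toy_trait_score(coefficient):
    """Stand-in for a judged trait score: mean projection after steering."""
    return float((steer(hidden_state, trait_vector, coefficient) @ trait_vector).mean())


def toy_coherence(coefficient):
    """Stand-in for a judged coherence score: drops as steering gets stronger."""
    return 100.0 - 2.0 * coefficient ** 2


candidates = [0.0, 1.0, 2.0, 4.0, 8.0]
coherent = [c for c in candidates if toy_coherence(c) >= 70.0]
best = max(coherent, key=toy_trait_score)

print(best)
```

The search is framed as a constrained maximum, since pushing the coefficient higher always raises the projection but eventually degrades the output; filtering on coherence first keeps the selected strength usable.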
traitinterp/
├── datasets/traits/ # Model-agnostic trait definitions
├── extraction/ # Extract trait vectors
├── inference/ # Monitor traits token-by-token
├── steering/ # Causal validation
├── core/ # Primitives: types, hooks, methods, projection math
├── utils/ # Shared: model loading, paths, vector I/O
├── config/ # Path templates, model configs
├── visualization/ # Interactive dashboard (serves traitinterp.com)
├── analysis/ # Model comparison, benchmarks, vector analysis
├── experiments/ # Output data (vectors, activations, results)
└── docs/ # Full documentation
- docs/main.md — codebase reference and navigation
- docs/extraction_guide.md — extraction pipeline
- docs/inference_guide.md — inference pipeline
- docs/steering_guide.md — steering pipeline
Core extraction logic adapted from safety-research/persona_vectors. Per-token monitoring, steering evaluation, visualization dashboard, and temporal analysis are original.
MIT License.
