# Geometric decomposition of deception in large language models
We show that alignment faking (AF) and sandbagging are geometrically separable in SAE feature space using two orthogonal directions:
- AF axis: fires on strategic deception, near-chance on sandbagging (AUROC 0.999 vs 0.53)
- Truth axis: separates honest from sandbagging across all families (AUROC 0.978/0.957/0.818)
The two axes are nearly orthogonal (cosine similarity = 0.033), indicating they encode largely distinct phenomena.
Validated on gold_106.json — 106 hand-labeled samples never used in training (AF probe AUROC 0.994).
| Experiment | Result |
|---|---|
| AF probe on gold_106 (held-out) | 0.994 AUROC |
| Joint probe on gold_106 | 0.989 AUROC |
| Truth axis → gemini sandbagging | 0.978 AUROC |
| Truth axis → gpt-oss sandbagging | 0.957 AUROC |
| Truth axis → qwen sandbagging | 0.818 AUROC |
| AF/truth direction cosine | 0.033 (near-orthogonal) |
| Layer sweep ceiling (self-read, all 4 layers) | ~0.67 AUROC |
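The orthogonality entry in the table reduces to a cosine similarity between the two fitted probe weight vectors. A minimal sketch (the `.npy` file names are hypothetical, not the repo's actual paths):

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine between two probe directions (1-D weight vectors)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical: load the fitted AF-axis and truth-axis weight vectors,
# then check they are near-orthogonal (the table reports ~0.033).
# af_axis = np.load("models/af_axis.npy")
# truth_axis = np.load("models/truth_axis.npy")
# print(cosine_similarity(af_axis, truth_axis))
```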
- Model: Gemma 3 27B IT (`google/gemma-3-27b-it`)
- SAE: GemmaScope 2 16k medium, layer 40 residual stream (`google/gemma-scope-2-27b-it`)
- Feature pooling: mean-pool over sequence tokens (`MAX_LEN=2048`)
- Probes: logistic regression on standardized SAE features
- Sandbagging data: `vincentoh/sandbagging-agent-traces` (HuggingFace)
  - Families: gemini, gpt-oss, qwen
  - Field: `scratchpad`; labels: `sandbagging`/`honest`
- AF data: `vincentoh/alignment-faking-training` (HuggingFace)
  - Labels: `alignment_faking`/`aligned`
- Held-out eval: `gold_106.json` (106 hand-labeled samples, never trained on)
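The probe recipe above (logistic regression on standardized, mean-pooled SAE features) can be sketched as follows; function names and array shapes are illustrative, not the repo's actual API:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler

def train_probe(feats: np.ndarray, labels: np.ndarray):
    """feats: (n_samples, n_sae_features) mean-pooled SAE activations."""
    scaler = StandardScaler()
    X = scaler.fit_transform(feats)          # standardize each SAE feature
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, labels)
    return scaler, clf

def eval_probe(scaler, clf, feats: np.ndarray, labels: np.ndarray) -> float:
    """AUROC of the probe's decision scores on a (held-out) set."""
    scores = clf.decision_function(scaler.transform(feats))
    return roc_auc_score(labels, scores)
```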
```bash
pip install -r requirements.txt

# Extract sandbagging families (gemini, gpt-oss, qwen) at layer 40, mean-pooled
python3 extract/extract_sandbagging.py

# Extract AF data (alignment-faking-training dataset)
python3 extract/extract_af.py

python3 experiments/exp1_layer_sweep.py         # self-read AUROC across layers 16/31/40/53
python3 experiments/exp2_truth_direction.py     # cross-family truth direction transfer
python3 experiments/exp3_2d_decomposition.py    # AF axis + truth axis joint analysis
python3 experiments/exp4_gold106_validation.py  # held-out validation on gold_106.json

python3 results/validate.py
```

All experiments should produce output matching `results/expected/*.json` within tolerance.
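A tolerance check like the one `results/validate.py` performs can be sketched as a recursive numeric diff over parsed JSON (the actual script may differ):

```python
import math

def within_tolerance(a, b, rtol: float = 1e-3) -> bool:
    """Recursively compare two parsed-JSON values, allowing a relative
    tolerance on numbers so fresh runs can drift slightly from expected."""
    if isinstance(a, dict) and isinstance(b, dict):
        return a.keys() == b.keys() and all(
            within_tolerance(a[k], b[k], rtol) for k in a)
    if isinstance(a, list) and isinstance(b, list):
        return len(a) == len(b) and all(
            within_tolerance(x, y, rtol) for x, y in zip(a, b))
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return math.isclose(a, b, rel_tol=rtol, abs_tol=1e-6)
    return a == b
```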
- GPU required: Step 1 only. Gemma 3 27B in BF16 with CPU offload.
  - Tested: A100 40GB + `max_memory={0: "36GiB", "cpu": "80GiB"}`
  - ~1.82 s/sample at layer 40 (later layers are CPU-offloaded)
- CPU only: Steps 2-3. Pre-extracted `.npy` files in `results/feats/`.
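The offload setup above corresponds to a `from_pretrained` call along these lines (a sketch; the exact class name and dtype argument depend on your transformers version):

```python
import torch
from transformers import AutoProcessor, Gemma3ForConditionalGeneration

# Cap GPU 0 at 36 GiB; accelerate places the remaining layers in CPU RAM.
model = Gemma3ForConditionalGeneration.from_pretrained(
    "google/gemma-3-27b-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_memory={0: "36GiB", "cpu": "80GiB"},
)
processor = AutoProcessor.from_pretrained("google/gemma-3-27b-it")
```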
- Layer path: `model.model.language_model.layers[N]` (`Gemma3ForConditionalGeneration` is multimodal)
- SAE keys: lowercase `w_enc`, `w_dec`, `b_enc`, `threshold` (GemmaScope 2 convention)
- Dataset field: `scratchpad` (not `text` or `trace`)
- Early-exit hook pattern: register a hook at the target layer and raise `_EarlyExit` to avoid executing CPU-offloaded later layers (saves ~50% of extraction time)
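Given those lowercase parameter keys, a GemmaScope-2-style JumpReLU encode plus mean-pooling looks roughly like this (a sketch of the convention, not the repo's loader):

```python
import numpy as np

def sae_encode(resid: np.ndarray, w_enc: np.ndarray, b_enc: np.ndarray,
               threshold: np.ndarray) -> np.ndarray:
    """JumpReLU encoder: pre-activations at or below the learned per-feature
    threshold are zeroed; the rest pass through unchanged."""
    pre = resid @ w_enc + b_enc  # (seq, d_model) @ (d_model, n_feat)
    return np.where(pre > threshold, pre, 0.0)

def mean_pool(feats: np.ndarray) -> np.ndarray:
    """Mean-pool SAE features over sequence tokens -> (n_feat,)."""
    return feats.mean(axis=0)
```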
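The early-exit pattern can be sketched as follows (helper and exception names assumed; the repo's actual implementation may differ):

```python
import torch

class _EarlyExit(Exception):
    """Raised inside a forward hook to abort the forward pass early."""

def capture_layer(model, layer_module, inputs):
    """Run a forward pass but stop right after `layer_module`,
    returning its output without executing any later layers."""
    captured = {}

    def hook(module, args, output):
        # Gemma decoder layers return a tuple; hidden states come first.
        captured["acts"] = output[0] if isinstance(output, tuple) else output
        raise _EarlyExit  # skip CPU-offloaded later layers

    handle = layer_module.register_forward_hook(hook)
    try:
        with torch.no_grad():
            model(**inputs)
    except _EarlyExit:
        pass
    finally:
        handle.remove()
    return captured["acts"]
```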
```
extract/
  extract_sandbagging.py      — GPU: extract SAE feats for sandbagging families
  extract_af.py               — GPU: extract SAE feats for AF data
experiments/
  exp1_layer_sweep.py         — self-read AUROC at layers 16, 31, 40, 53
  exp2_truth_direction.py     — cross-family truth direction (leave-one-family-out)
  exp3_2d_decomposition.py    — 2D AF+truth decomposition, cross-family AUROC
  exp4_gold106_validation.py  — held-out validation
datavis/
  plot_2d.py                  — static 2D scatter (AF axis vs Truth axis)
  plot_3d.py                  — interactive 3D (opens in browser, needs plotly)
  README.md                   — setup and usage instructions
models/
  inference.py                — load saved probe weights, run inference on new text
  *.npy                       — saved probe weights, scalers, biases
results/
  expected/                   — reference outputs for validate.py
  validate.py                 — diff fresh run vs expected
  gold_106.json               — held-out eval set (DO NOT use for training)
```
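Applying a saved probe from `models/` amounts to standardize-then-dot-product. A sketch, where `w`, `b`, `mean`, and `std` stand in for the saved `.npy` arrays (exact file names are not specified here):

```python
import numpy as np

def probe_score(feats: np.ndarray, w: np.ndarray, b: float,
                mean: np.ndarray, std: np.ndarray) -> float:
    """Logistic-regression probe applied to one mean-pooled SAE feature
    vector, reproducing the training-time standardization."""
    z = (feats - mean) / std
    logit = float(z @ w + b)
    return 1.0 / (1.0 + np.exp(-logit))  # probability of the positive class
```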