# Geometric decomposition of deception in large language models
We show that alignment faking (AF) and sandbagging are geometrically separable in SAE feature space using two orthogonal directions:
- AF axis: fires on strategic deception, near-chance on sandbagging (AUROC 0.999 vs 0.53)
- Truth axis: separates honest from sandbagging across all families (AUROC 0.978/0.957/0.818)
The two axes are nearly orthogonal (cosine similarity = 0.033), indicating they encode largely distinct phenomena.
Validated on gold_106.json — 106 hand-labeled samples never used in training (AF probe AUROC 0.994).
| Experiment | Result |
|---|---|
| AF probe on gold_106 (held-out) | 0.994 AUROC |
| Joint probe on gold_106 | 0.989 AUROC |
| Truth axis → gemini sandbagging | 0.978 AUROC |
| Truth axis → gpt-oss sandbagging | 0.957 AUROC |
| Truth axis → qwen sandbagging | 0.818 AUROC |
| AF/truth direction cosine | 0.033 (near-orthogonal) |
| Layer sweep ceiling (self-read, all 4 layers) | ~0.67 AUROC |
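The orthogonality entry in the table reduces to a cosine similarity between the two fitted probe weight vectors. A minimal sketch (the `.npy` file names are hypothetical, not the repo's actual paths):

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine between two probe directions (1-D weight vectors)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical: load the fitted AF-axis and truth-axis weight vectors,
# then check they are near-orthogonal (the table reports ~0.033).
# af_axis = np.load("models/af_axis.npy")
# truth_axis = np.load("models/truth_axis.npy")
# print(cosine_similarity(af_axis, truth_axis))
```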
- Model: Gemma 3 27B IT (`google/gemma-3-27b-it`)
- SAE: GemmaScope 2 16k medium, layer 40 residual stream (`google/gemma-scope-2-27b-it`)
- Feature pooling: mean-pool over sequence tokens (`MAX_LEN=2048`)
- Probes: logistic regression on standardized SAE features
- Sandbagging data: `vincentoh/sandbagging-agent-traces` (HuggingFace)
  - Families: gemini, gpt-oss, qwen
  - Field: `scratchpad`; labels: `sandbagging`/`honest`
- AF data: `vincentoh/alignment-faking-training` (HuggingFace)
  - Labels: `alignment_faking`/`aligned`
- Held-out eval: `gold_106.json` (106 hand-labeled samples, never trained on)
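The probe recipe above (logistic regression on standardized, mean-pooled SAE features) can be sketched as follows; function names and array shapes are illustrative, not the repo's actual API:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler

def train_probe(feats: np.ndarray, labels: np.ndarray):
    """feats: (n_samples, n_sae_features) mean-pooled SAE activations."""
    scaler = StandardScaler()
    X = scaler.fit_transform(feats)          # standardize each SAE feature
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, labels)
    return scaler, clf

def eval_probe(scaler, clf, feats: np.ndarray, labels: np.ndarray) -> float:
    """AUROC of the probe's decision scores on a (held-out) set."""
    scores = clf.decision_function(scaler.transform(feats))
    return roc_auc_score(labels, scores)
```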
```bash
pip install -r requirements.txt

# Extract sandbagging families (gemini, gpt-oss, qwen) at layer 40, mean-pooled
python3 extract/extract_sandbagging.py

# Extract AF data (alignment-faking-training dataset)
python3 extract/extract_af.py

python3 experiments/exp1_layer_sweep.py         # self-read AUROC across layers 16/31/40/53
python3 experiments/exp2_truth_direction.py     # cross-family truth direction transfer
python3 experiments/exp3_2d_decomposition.py    # AF axis + truth axis joint analysis
python3 experiments/exp4_gold106_validation.py  # held-out validation on gold_106.json

python3 results/validate.py
```

All experiments should produce output matching `results/expected/*.json` within tolerance.
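A tolerance check like the one `results/validate.py` performs can be sketched as a recursive numeric diff over parsed JSON (the actual script may differ):

```python
import math

def within_tolerance(a, b, rtol: float = 1e-3) -> bool:
    """Recursively compare two parsed-JSON values, allowing a relative
    tolerance on numbers so fresh runs can drift slightly from expected."""
    if isinstance(a, dict) and isinstance(b, dict):
        return a.keys() == b.keys() and all(
            within_tolerance(a[k], b[k], rtol) for k in a)
    if isinstance(a, list) and isinstance(b, list):
        return len(a) == len(b) and all(
            within_tolerance(x, y, rtol) for x, y in zip(a, b))
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return math.isclose(a, b, rel_tol=rtol, abs_tol=1e-6)
    return a == b
```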
- GPU required: Step 1 only. Gemma 3 27B in BF16 with CPU offload.
  - Tested: A100 40GB + `max_memory={0: "36GiB", "cpu": "80GiB"}`
  - ~1.82 s/sample at layer 40 (later layers are CPU-offloaded)
- CPU only: Steps 2-3. Pre-extracted `.npy` files in `results/feats/`.
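The offload setup above corresponds to a `from_pretrained` call along these lines (a sketch; the exact class name and dtype argument depend on your transformers version):

```python
import torch
from transformers import AutoProcessor, Gemma3ForConditionalGeneration

# Cap GPU 0 at 36 GiB; accelerate places the remaining layers in CPU RAM.
model = Gemma3ForConditionalGeneration.from_pretrained(
    "google/gemma-3-27b-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_memory={0: "36GiB", "cpu": "80GiB"},
)
processor = AutoProcessor.from_pretrained("google/gemma-3-27b-it")
```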
- Layer path: `model.model.language_model.layers[N]` (`Gemma3ForConditionalGeneration` is multimodal)
- SAE keys: lowercase `w_enc`, `w_dec`, `b_enc`, `threshold` (GemmaScope 2 convention)
- Dataset field: `scratchpad` (not `text` or `trace`)
- Early-exit hook pattern: register a hook at the target layer and raise `_EarlyExit` to avoid executing CPU-offloaded later layers (saves ~50% of extraction time)
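Given those lowercase parameter keys, a GemmaScope-2-style JumpReLU encode plus mean-pooling looks roughly like this (a sketch of the convention, not the repo's loader):

```python
import numpy as np

def sae_encode(resid: np.ndarray, w_enc: np.ndarray, b_enc: np.ndarray,
               threshold: np.ndarray) -> np.ndarray:
    """JumpReLU encoder: pre-activations at or below the learned per-feature
    threshold are zeroed; the rest pass through unchanged."""
    pre = resid @ w_enc + b_enc  # (seq, d_model) @ (d_model, n_feat)
    return np.where(pre > threshold, pre, 0.0)

def mean_pool(feats: np.ndarray) -> np.ndarray:
    """Mean-pool SAE features over sequence tokens -> (n_feat,)."""
    return feats.mean(axis=0)
```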
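The early-exit pattern can be sketched as follows (helper and exception names assumed; the repo's actual implementation may differ):

```python
import torch

class _EarlyExit(Exception):
    """Raised inside a forward hook to abort the forward pass early."""

def capture_layer(model, layer_module, inputs):
    """Run a forward pass but stop right after `layer_module`,
    returning its output without executing any later layers."""
    captured = {}

    def hook(module, args, output):
        # Gemma decoder layers return a tuple; hidden states come first.
        captured["acts"] = output[0] if isinstance(output, tuple) else output
        raise _EarlyExit  # skip CPU-offloaded later layers

    handle = layer_module.register_forward_hook(hook)
    try:
        with torch.no_grad():
            model(**inputs)
    except _EarlyExit:
        pass
    finally:
        handle.remove()
    return captured["acts"]
```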
```
extract/
  extract_sandbagging.py      — GPU: extract SAE feats for sandbagging families
  extract_af.py               — GPU: extract SAE feats for AF data
experiments/
  exp1_layer_sweep.py         — self-read AUROC at layers 16, 31, 40, 53
  exp2_truth_direction.py     — cross-family truth direction (leave-one-family-out)
  exp3_2d_decomposition.py    — 2D AF+truth decomposition, cross-family AUROC
  exp4_gold106_validation.py  — held-out validation
datavis/
  plot_2d.py                  — static 2D scatter (AF axis vs Truth axis)
  plot_3d.py                  — interactive 3D (opens in browser, needs plotly)
  README.md                   — setup and usage instructions
models/
  inference.py                — load saved probe weights, run inference on new text
  *.npy                       — saved probe weights, scalers, biases
results/
  expected/                   — reference outputs for validate.py
  validate.py                 — diff fresh run vs expected
  gold_106.json               — held-out eval set (DO NOT use for training)
```
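Applying a saved probe from `models/` amounts to standardize-then-dot-product. A sketch, where `w`, `b`, `mean`, and `std` stand in for the saved `.npy` arrays (exact file names are not specified here):

```python
import numpy as np

def probe_score(feats: np.ndarray, w: np.ndarray, b: float,
                mean: np.ndarray, std: np.ndarray) -> float:
    """Logistic-regression probe applied to one mean-pooled SAE feature
    vector, reproducing the training-time standardization."""
    z = (feats - mean) / std
    logit = float(z @ w + b)
    return 1.0 / (1.0 + np.exp(-logit))  # probability of the positive class
```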