
TruthSayer

Geometric decomposition of deception in large language models.

We show that alignment faking (AF) and sandbagging are geometrically separable in SAE feature space using two orthogonal directions:

  • AF axis: fires on strategic deception, near-chance on sandbagging (AUROC 0.999 vs 0.53)
  • Truth axis: separates honest from sandbagging across all families (AUROC 0.978/0.957/0.818)

The two axes are nearly orthogonal (cosine = 0.033), indicating that they encode distinct phenomena. Results are validated on gold_106.json, a set of 106 hand-labeled samples never used in training (AF probe AUROC 0.994).
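The orthogonality claim boils down to a cosine between the two learned probe directions. A minimal sketch (with hypothetical random vectors standing in for the actual AF and truth axes, which live in the 16k-dimensional SAE feature space):

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two probe directions."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical stand-ins for the learned axes. Note that in high dimensions
# random directions are already nearly orthogonal, so a small cosine should
# be read against that baseline rather than against zero.
rng = np.random.default_rng(0)
af_axis = rng.standard_normal(16384)
truth_axis = rng.standard_normal(16384)
print(cosine(af_axis, truth_axis))  # small in magnitude
```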

Key Results

Experiment                                      Result
AF probe on gold_106 (held-out)                 0.994 AUROC
Joint probe on gold_106                         0.989 AUROC
Truth axis → gemini sandbagging                 0.978 AUROC
Truth axis → gpt-oss sandbagging                0.957 AUROC
Truth axis → qwen sandbagging                   0.818 AUROC
AF/truth direction cosine                       0.033 (near-orthogonal)
Layer sweep ceiling (self-read, all 4 layers)   ~0.67 AUROC

Architecture

  • Model: Gemma 3 27B IT (google/gemma-3-27b-it)
  • SAE: GemmaScope 2 16k medium, layer 40 residual stream (google/gemma-scope-2-27b-it)
  • Feature pooling: mean-pool over sequence tokens (MAX_LEN=2048)
  • Probes: logistic regression on standardized SAE features
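The pooling-plus-probe pipeline above can be sketched in a few lines. This is a numpy-only stand-in, not the repo's code: the plain gradient-descent fit is an illustrative substitute for a standard logistic-regression solver, and the toy data replaces real SAE activations.

```python
import numpy as np

def mean_pool(token_feats: np.ndarray) -> np.ndarray:
    """Mean-pool SAE activations over sequence tokens: (T, D) -> (D,)."""
    return token_feats.mean(axis=0)

def fit_probe(X: np.ndarray, y: np.ndarray, lr: float = 0.1, steps: int = 500):
    """Logistic regression on standardized features via gradient descent.
    Returns weights, bias, and the standardization statistics."""
    mu, sigma = X.mean(axis=0), X.std(axis=0) + 1e-8
    Xs = (X - mu) / sigma
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(Xs @ w + b)))
        w -= lr * (Xs.T @ (p - y)) / len(y)
        b -= lr * (p - y).mean()
    return w, b, mu, sigma
```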

Dataset

  • Sandbagging data: vincentoh/sandbagging-agent-traces (HuggingFace)
    • Families: gemini, gpt-oss, qwen
    • Field: scratchpad, label: sandbagging/honest
  • AF data: vincentoh/alignment-faking-training (HuggingFace)
    • Labels: alignment_faking / aligned
  • Held-out eval: gold_106.json (106 hand-labeled, never trained on)

Reproduction

Requirements

pip install -r requirements.txt

Step 1: Extract SAE features (~2h on A100 40GB)

# Extracts sandbagging families (gemini, gpt-oss, qwen) at layer 40, mean-pooled
python3 extract/extract_sandbagging.py

# Extracts AF data (alignment-faking-training dataset)
python3 extract/extract_af.py

Step 2: Run experiments (CPU, ~10 min)

python3 experiments/exp1_layer_sweep.py        # self-read AUROC across layers 16/31/40/53
python3 experiments/exp2_truth_direction.py    # cross-family truth direction transfer
python3 experiments/exp3_2d_decomposition.py   # AF axis + truth axis joint analysis
python3 experiments/exp4_gold106_validation.py # held-out validation on gold_106.json
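The cross-family transfer in exp2 can be mimicked with a difference-of-means direction and a rank-based AUROC. A sketch under stated assumptions: the difference-of-means fit is a simple stand-in for however the script actually derives the truth direction, and ties in scores are not handled.

```python
import numpy as np

def auroc(scores: np.ndarray, labels: np.ndarray) -> float:
    """Rank-based AUROC: probability a positive outscores a negative (no ties)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = int(pos.sum()), int((~pos).sum())
    u = ranks[pos].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

def truth_direction(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Difference of class means (honest=0 minus sandbagging=1)."""
    return X[y == 0].mean(axis=0) - X[y == 1].mean(axis=0)
```

In a leave-one-family-out loop you would fit the direction on two families, score the held-out family with `-(X @ d)` (higher = more sandbagging-like), and report `auroc` on those scores.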

Step 3: Validate outputs match expected

python3 results/validate.py

All experiments should produce output matching results/expected/*.json within tolerance.
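A tolerance check of this kind presumably reduces to comparing numeric fields against the reference JSON. A sketch (field names and the relative tolerance are assumptions, not taken from validate.py):

```python
import math

def within_tolerance(fresh: dict, expected: dict, rtol: float = 1e-3) -> bool:
    """Compare a fresh result dict against a reference: floats within rtol,
    everything else by exact equality."""
    for key, ref in expected.items():
        val = fresh.get(key)
        if isinstance(ref, float):
            if not isinstance(val, (int, float)) or not math.isclose(val, ref, rel_tol=rtol):
                return False
        elif val != ref:
            return False
    return True
```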

Hardware

  • GPU required: Step 1 only. Runs Gemma 3 27B in BF16 with CPU offload.
    • Tested: A100 40GB + max_memory={0: "36GiB", "cpu": "80GiB"}
    • ~1.82s/sample at layer 40 (later layers are CPU-offloaded)
  • CPU only: Steps 2-3 run against the pre-extracted .npy files in results/feats/.

Technical Notes

  • Layer path: model.model.language_model.layers[N] (Gemma3ForConditionalGeneration is multimodal)
  • SAE keys: lowercase w_enc, w_dec, b_enc, threshold (GemmaScope 2 convention)
  • Dataset field: scratchpad (not text or trace)
  • Early-exit hook pattern: register hook at target layer, raise _EarlyExit to avoid executing CPU-offloaded later layers (saves ~50% time)
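The early-exit hook pattern in the last note can be sketched as follows. The `_EarlyExit` name follows the note; the rest is an assumed shape, not the repo's exact code:

```python
import torch
from torch import nn

class _EarlyExit(Exception):
    """Raised inside a forward hook to abort the forward pass early."""

def capture_layer(model: nn.Module, layer: nn.Module, inputs: torch.Tensor):
    """Run the model, capture the target layer's output, and skip all later
    layers (which in this setup would otherwise execute on CPU offload)."""
    captured = {}

    def hook(module, args, output):
        captured["acts"] = output[0] if isinstance(output, tuple) else output
        raise _EarlyExit

    handle = layer.register_forward_hook(hook)
    try:
        with torch.no_grad():
            model(inputs)
    except _EarlyExit:
        pass
    finally:
        handle.remove()
    return captured["acts"]
```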

Files

extract/
  extract_sandbagging.py   — GPU: extract SAE feats for sandbagging families
  extract_af.py            — GPU: extract SAE feats for AF data
experiments/
  exp1_layer_sweep.py      — self-read AUROC at layers 16, 31, 40, 53
  exp2_truth_direction.py  — cross-family truth direction (leave-one-family-out)
  exp3_2d_decomposition.py — 2D AF+truth decomposition, cross-family AUROC
  exp4_gold106_validation.py — held-out validation
datavis/
  plot_2d.py               — static 2D scatter (AF axis vs Truth axis)
  plot_3d.py               — interactive 3D (opens in browser, needs plotly)
  README.md                — setup and usage instructions
models/
  inference.py             — load saved probe weights, run inference on new text
  *.npy                    — saved probe weights, scalers, biases
results/
  expected/                — reference outputs for validate.py
  validate.py              — diff fresh run vs expected
gold_106.json              — held-out eval set (DO NOT use for training)
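Applying the saved probe weights in models/ presumably amounts to standardize, project, sigmoid. A minimal sketch (the split of the .npy files into weights, bias, and scaler statistics is an assumption):

```python
import numpy as np

def probe_score(feats: np.ndarray, w: np.ndarray, b: float,
                mu: np.ndarray, sigma: np.ndarray) -> float:
    """Score pooled SAE features with a saved linear probe:
    standardize with the training-set statistics, project, sigmoid."""
    z = (feats - mu) / sigma
    return float(1.0 / (1.0 + np.exp(-(z @ w + b))))
```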
