# How We Got Here: A Research Journey

This notebook documents the path from our initial hypothesis to our current findings. The failures along the way were essential: they taught us what NOT to measure, and why.

## December 2025: The Hallucination Hypothesis

### The Original Idea

We started with an ambitious goal: **detect hallucinations by looking at internal activation patterns.**

The hypothesis was intuitive:
- When a model generates confident, grounded text, it should have "clean" activation patterns
- When it hallucinates (generates plausible-sounding nonsense), the internal structure should be different: perhaps more chaotic, less focused

We called this "hallucination biopsy": diagnosing model pathology by examining its internal state.

### Early Results Looked Promising

Initial experiments showed large effect sizes:
- Cohen's d > 1.0 for several metrics
- Statistically significant differences (p < 0.001)
- Clear visual separation in PCA plots
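
For readers unfamiliar with the effect-size measure quoted above, here is a minimal sketch of Cohen's d using a pooled standard deviation. The function name and the sample values are hypothetical and for illustration only; they are not project data, and the project's notebooks may compute d differently.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Cohen's d: difference in means divided by the pooled sample SD."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


# Hypothetical metric values for two groups (illustration only).
hallucinated = [12.0, 14.0, 13.0, 15.0, 16.0]
grounded = [10.0, 11.0, 12.0, 11.0, 11.0]

d = cohens_d(hallucinated, grounded)
print(f"Cohen's d = {d:.2f}")
```

A large d says only that the two groups differ on the metric. It says nothing about why they differ, which is exactly where a confound can hide.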

We were excited. It seemed like we could distinguish hallucinations from grounded outputs.

### Reference

Early notebooks from this phase are in `archive/disproved/`. They document both the promising early results and the subsequent discovery of what went wrong.

## January 2026: The Length Confound Discovery

### The Uncomfortable Realization

When we looked more carefully at our data, we noticed something troubling:

**`n_active` (number of active features) correlated r = 0.96 with text length.**

This is a nearly perfect correlation, and a trivial one: longer inputs activate more features simply because there is more content to process.
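
A check of this kind can be sketched in a few lines. The example below computes Pearson's r between text length and a feature count on synthetic data in which the count is, by construction, driven almost entirely by length. The data, the variable names, and the `pearson_r` helper are assumptions made for illustration; they are not the project's data or code.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)

# Synthetic lengths (in tokens) and a count that grows with length plus noise.
text_length = list(range(10, 210, 10))
n_active = [3.0 * length + rng.gauss(0, 20) for length in text_length]

r = pearson_r(text_length, n_active)
print(f"r(n_active, text_length) = {r:.2f}")
```

Running a check like this on every candidate metric before any group comparison is cheap, and it would have flagged the problem at the start.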

### The Correlation Heatmap

This figure shows the problem clearly:

![Correlation Heatmap](../figures/paper/correlation_heatmap.png)

Key observations:
- **N Active vs Text Length: r = 0.98**, a nearly perfect correlation
- Influence and Concentration show weaker (but still concerning) correlations with length
- Any metric that correlates with length is potentially confounded by it

### What This Meant for Our Hallucination Results

Our "hallucination signal" was mostly a "longer text signal."

In our dataset:
- Hallucinated outputs tended to be longer (models ramble when uncertain)
- Grounded outputs tended to be more concise

We weren't detecting hallucination structure. We were detecting text length.

### The Lesson

> **Raw feature counts are unreliable.** They scale with input length, not semantic content. Any analysis using `n_active` without length control is suspect.

## January 2026: Injection Detection Attempt

### The Pivot

We tried to salvage the approach by pivoting to **prompt injection detection**:
- Adversarial prompts that try to hijack model behavior
- Surely these would have distinct geometric signatures?

### Same Problem

Initial results again looked promising:
- Clear separation between injection and benign prompts
- Effect sizes > 1.0
- p < 0.001

But when we checked:
- **Injection prompts were 74% longer** than benign prompts in our dataset
- More text -> more features -> apparent "signal"

### After Length Control

When we matched prompts by length or residualized against length:
- Effect sizes collapsed from d > 1.0 to d ~ 0.1-0.5
- No metrics remained statistically significant
- "Injection" is not a coherent geometric category
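
One way to do the length matching mentioned above is greedy nearest-length pairing without replacement. The sketch below assumes each prompt is reduced to a `(length, metric)` tuple; the `match_by_length` helper, the tolerance, and the sample values are all hypothetical and are not taken from the archived notebooks.

```python
def match_by_length(group_a, group_b, tolerance=5):
    """Greedily pair each item in group_a with the unused item in group_b
    whose length is closest, keeping the pair only if the lengths differ
    by at most `tolerance`. Items are (length, metric) tuples."""
    remaining = sorted(group_b)
    pairs = []
    for item in sorted(group_a):
        if not remaining:
            break
        best = min(remaining, key=lambda other: abs(other[0] - item[0]))
        if abs(best[0] - item[0]) <= tolerance:
            pairs.append((item, best))
            remaining.remove(best)  # match without replacement
    return pairs


# Hypothetical (length, metric) values, for illustration only.
injection = [(120, 0.9), (80, 0.7), (200, 1.4), (60, 0.5)]
benign = [(58, 0.5), (82, 0.6), (118, 1.0), (40, 0.3)]

pairs = match_by_length(injection, benign)
mean_diff = sum(a[1] - b[1] for a, b in pairs) / len(pairs)
print(f"{len(pairs)} matched pairs, mean metric difference = {mean_diff:.3f}")
```

Unmatched prompts (here, the 200-token injection prompt) are dropped. This costs sample size, but the remaining comparison is no longer driven by length.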

### Reference

The injection detection experiments are documented in `archive/disproved/`. See especially:
- `injection_geometry_truth.ipynb`: the key notebook documenting the confound discovery
- `balanced_injection_geometry.ipynb`: balanced sampling experiments

## February 2026: The Real Question

### Stopping to Think

After two failed hypotheses, we stepped back and asked:

**What CAN we actually measure with activation geometry?**

We had been asking:
- "Can we detect bad outputs?" (hallucinations)
- "Can we detect bad inputs?" (injections)

These questions assume that "badness" has a geometric signature. But maybe it doesn't.

### The Better Question

What if we asked instead:

> **Can we characterize what TYPE of computation the model is doing?**

Not "is this output correct?" but "what cognitive mode is the model in?"

This question is more tractable because:
- Different tasks genuinely require different processing
- Grammar checking is structurally different from multi-hop reasoning
- These differences should manifest in activation topology

## The Pivot That Worked

### Task-Type Diagnostics

We designed a clean experiment:
1. Sample from standard NLP benchmarks (CoLA, WinoGrande, HellaSwag, MRPC, TruthfulQA)
2. Extract attribution graphs for each sample
3. Compute metrics from the graphs
4. **Residualize everything against text length**
5. Test for differences between task types
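
Steps 4 and 5 can be sketched as follows: fit a straight line of each metric against text length, keep the residuals, and compare task types on those residuals. This is a minimal sketch assuming a simple linear fit and Cohen's d as the test statistic. The helper names, group labels, and synthetic data are illustrative assumptions, not the project's actual pipeline (see the methodology notebook for that).

```python
import math
import random
import statistics


def cohens_d(group_a, group_b):
    """Cohen's d with a pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * statistics.variance(group_a)
                  + (n_b - 1) * statistics.variance(group_b)) / (n_a + n_b - 2)
    return (statistics.mean(group_a) - statistics.mean(group_b)) / math.sqrt(pooled_var)


def residualize(values, lengths):
    """Residuals of `values` after an ordinary least-squares fit on `lengths`."""
    mean_x = statistics.mean(lengths)
    mean_y = statistics.mean(values)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(lengths, values))
             / sum((x - mean_x) ** 2 for x in lengths))
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(lengths, values)]


rng = random.Random(0)

# Synthetic data: two task types that differ in typical length (assumption).
lengths = ([rng.randint(20, 60) for _ in range(200)]
           + [rng.randint(60, 140) for _ in range(200)])
labels = ["grammar"] * 200 + ["reasoning"] * 200

# A metric driven purely by length plus noise, mimicking a raw feature count.
n_active = [2.0 * n + rng.gauss(0, 5) for n in lengths]


def split(values):
    """Split a per-sample list into (grammar, reasoning) groups."""
    grammar = [v for v, lab in zip(values, labels) if lab == "grammar"]
    reasoning = [v for v, lab in zip(values, labels) if lab == "reasoning"]
    return grammar, reasoning


raw_grammar, raw_reasoning = split(n_active)
d_raw = cohens_d(raw_reasoning, raw_grammar)

res_grammar, res_reasoning = split(residualize(n_active, lengths))
d_resid = cohens_d(res_reasoning, res_grammar)

print(f"raw d = {d_raw:.2f}, length-controlled d = {d_resid:.2f}")
```

On this synthetic metric the apparent group difference is large before residualization and close to zero after it, which is the pattern an artifact should show. One caveat: when group membership and length are strongly entangled, a pooled linear fit can also absorb part of a real effect, so length-matched sampling is a useful cross-check.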

### The Results

After length control, we found genuine signal:

| Metric | Effect Size | Interpretation |
|--------|-------------|----------------|
| Influence (length-controlled) | d = 1.08 | **Genuine signal** |
| Concentration (length-controlled) | d = 0.87 | **Genuine signal** |
| N_active (length-controlled) | d = 0.07 | Collapses (was artifact) |

### Why This Works

- **Grammar tasks** require focused, precise computation (is THIS word wrong?)
- **Reasoning tasks** require diffuse, exploratory computation (how do these concepts relate?)
- These are real cognitive differences that survive length control

### The Negative Result That's Still Valuable

**TruthfulQA showed no signal (d = 0.05).**

True and false statements produce statistically indistinguishable activation structures under these metrics. This confirms:
- We measure computation TYPE, not output QUALITY
- Hallucination/truthfulness detection via this method is fundamentally impossible

## Lessons Learned

### 1. Always Control for Confounds

Especially text length in any NLP analysis. The correlation between length and feature counts is near-perfect (r = 0.96-0.98). If you don't control for it, you're measuring length.

### 2. Feature Counts Are (Almost) Useless Alone

`n_active` scales with input length. It tells you how long the text is, not how complex the computation is. After residualization, it carries almost no signal (d = 0.07).

### 3. Influence-Based Metrics Are More Robust

Mean influence and concentration capture *how* features interact, not just *how many* are active. These survive length control with strong effect sizes.

### 4. Negative Results Are Valuable

Learning that we CANNOT detect hallucinations is important. It:
- Prevents others from wasting time on the same dead end
- Clarifies the fundamental limitations of activation-based diagnostics
- Redirects effort toward questions that DO have answers

### 5. The Question Matters More Than the Method

Our methodology (extract attribution graphs, compute metrics) was fine. Our QUESTION (detect hallucinations) was wrong.

When we asked the right question (characterize computation types), the same methods produced robust, replicable results.

## Timeline Summary

| Date | Phase | What Happened |
|------|-------|---------------|
| Dec 2025 | Hallucination Hypothesis | Initial promising results (d > 1.0) |
| Jan 2026 | Length Confound Discovery | Found r = 0.96 correlation with length |
| Jan 2026 | Injection Pivot | Same problem: results were confounded by length |
| Feb 2026 | Real Question | Asked about computation types instead |
| Feb 2026 | Task-Type Diagnostics | Found genuine signal (d = 1.08) that survives length control |

---

*Next: [03_methodology.ipynb](03_methodology.ipynb): technical deep-dive into how the metrics work*

*Previous: [01_introduction.ipynb](01_introduction.ipynb): what this project discovers*