# What This Project Discovers

**Language models switch between different internal processing modes depending on the task type, and those modes are measurable.**

## The Question

Can we measure *how* a language model thinks, not just *what* it outputs?

When a model processes text, it activates thousands of internal features: learned representations that encode meaning, syntax, relationships, and reasoning patterns. These features connect to each other in complex attribution graphs.

We asked: **Do different types of tasks produce different internal structures?**

More specifically, can we tell the difference between:
- A model checking grammar
- A model doing multi-step reasoning
- A model detecting paraphrases
- A model assessing truthfulness

...just by looking at how information flows through its internals?

## The Answer

**Yes.** Different task types produce distinct activation topologies.

More precisely:

- **Grammar tasks** (like judging if a sentence is grammatically acceptable) produce *focused* computation: high influence between features, concentrated in a small number of pathways.

- **Reasoning tasks** (like resolving pronoun references or selecting logical continuations) produce *diffuse* computation: lower per-feature influence, spread across many pathways.

This finding is robust: it survives controls for text length and replicates across different samples. The effect sizes are large (Cohen's d = 1.08 for influence, 0.87 for concentration).
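To make the focused-versus-diffuse distinction concrete, here is a minimal sketch on toy data. It is not the project's metric code: the function names `mean_influence` and `top_k_share`, the choice of a top-k share as a stand-in for "concentration", and the edge weights are all assumptions made for illustration. Notebook 03 documents how the actual metrics work.

```python
def mean_influence(edge_weights):
    """Average absolute edge weight: a stand-in for per-feature influence."""
    return sum(abs(w) for w in edge_weights) / len(edge_weights)


def top_k_share(edge_weights, k=3):
    """Fraction of total absolute weight carried by the k strongest edges.

    Values near 1.0 mean a few pathways dominate (focused);
    values near k / len(edge_weights) mean the weight is spread out (diffuse).
    """
    magnitudes = sorted((abs(w) for w in edge_weights), reverse=True)
    total = sum(magnitudes)
    if total == 0:
        return 0.0
    return sum(magnitudes[:k]) / total


# Toy attribution-graph edge weights (invented for illustration only).
focused = [0.9, 0.8, 0.7, 0.05, 0.05]   # a few strong pathways
diffuse = [0.2] * 10                    # many weak pathways

focused_share = top_k_share(focused)
diffuse_share = top_k_share(diffuse)

print("focused:", mean_influence(focused), focused_share)
print("diffuse:", mean_influence(diffuse), diffuse_share)
```

In this toy setup the focused graph scores higher on both quantities, which mirrors the grammar-versus-reasoning contrast described above.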

## Key Finding

After controlling for text length (this is critical; see Notebook 02 for why), we found:

| Metric | What It Measures | Effect Size (Cohen's d) |
|---|---|---|
| **Influence** | Causal strength between features | d = 1.08 (genuine signal) |
| **Concentration** | Focused vs diffuse computation | d = 0.87 (genuine signal) |
| **N_active** | Raw feature count | d = 0.07 (collapses; was a length artifact) |
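The sketch below illustrates why a raw count can show a large effect that vanishes once text length is controlled for. It assumes a pooled-standard-deviation Cohen's d and a simple linear residualization on length. The helper names `cohens_d` and `residualize` and all numbers are hypothetical toy values, and the project's actual length-control procedure (see Notebook 02) may differ.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def residualize(values, lengths):
    """Subtract the ordinary-least-squares linear fit of values on lengths."""
    mx, my = mean(lengths), mean(values)
    sxx = sum((x - mx) ** 2 for x in lengths)
    sxy = sum((x - mx) * (y - my) for x, y in zip(lengths, values))
    slope = sxy / sxx
    return [y - (my + slope * (x - mx)) for x, y in zip(lengths, values)]


# Toy data (invented): group A has short texts, group B has long texts,
# and the metric is driven almost entirely by length.
lengths_a = [10, 12, 14, 16]
lengths_b = [20, 22, 24, 26]
metric_a = [31, 35, 41, 49]
metric_b = [59, 67, 73, 77]

# Raw comparison: the groups look very different.
raw_d = cohens_d(metric_a, metric_b)

# Length-controlled comparison: fit on the pooled data, then split again.
residuals = residualize(metric_a + metric_b, lengths_a + lengths_b)
adj_d = cohens_d(residuals[: len(metric_a)], residuals[len(metric_a):])

print(f"raw d = {raw_d:.2f}, length-controlled d = {adj_d:.2f}")
```

For this toy data the raw effect is large in magnitude while the length-controlled effect is essentially zero, which is the same qualitative behavior as the N_active row in the table.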

### The Pattern

| Task Domain | Influence | Concentration | Interpretation |
|---|---|---|---|
| Grammar (CoLA) | High | High | Focused, precise computation |
| Reasoning (WinoGrande, HellaSwag) | Low | Low | Diffuse, exploratory computation |
| Paraphrase (MRPC) | Medium | Medium | Moderate complexity |
| Truthfulness (TruthfulQA) | **No signal** (d = 0.05) | **No signal** | Can't distinguish true from false |

### What the Truthfulness Result Means

We cannot tell if a model is lying by looking at its internal structure with these metrics. True statements and false statements produce statistically indistinguishable activation patterns.

This is a fundamental limitation: **we measure computation type, not output quality.**

## What This Means

Think of it like measuring physiological signals:

- **Heart rate + brain patterns** can tell you if someone is doing math versus poetry (different cognitive modes)
- They can detect if someone is confused or uncertain (elevated activity, irregular patterns)
- But they **cannot** tell you if the math answer is correct

Similarly, our metrics tell you:
- What *type* of computation the model is doing
- Whether the input is anomalous or adversarial (unusual patterns)
- How "hard" the model is working on a problem

But they **cannot** tell you:
- Whether the output is correct
- Whether the model is hallucinating
- Whether a statement is true or false

## How to Use This Repository

### Directory Overview

```
notebooks/                   <- YOU ARE HERE: start with these
  01_introduction.ipynb      This notebook
  02_the_journey.ipynb       How we got here (the failures that taught us)
  03_methodology.ipynb       Technical deep-dive
  04_results.ipynb           Full analysis with figures
  05_implications.ipynb      What this means for interpretability

experiments/                 Reproducible experiment scripts
figures/                     All generated figures
  paper/                     Publication-quality versions
data/                        Input data and computed metrics
  results/                   Attribution metric JSON files
scripts/                     Modal runners for GPU computation
archive/                     Historical experiments
  disproved/                 Early work superseded by length control
```

### Recommended Reading Order

1. **This notebook**: What we found
2. **02_the_journey**: How we got here (and what didn't work)
3. **03_methodology**: How the metrics work
4. **04_results**: Full analysis with interactive figures
5. **05_implications**: Where this leads

---

*Next: [02_the_journey.ipynb](02_the_journey.ipynb), how we discovered this (and what we got wrong along the way)*