# Part 3: How We Measure Activation Topology

**Series**: Latent Diagnostics Analysis (3 of 5)

This notebook explains the methodology behind our measurements. What tools do we use? What do the numbers mean? How do we avoid confounding variables?

---

## Table of Contents

1. [Attribution Graphs](#1-attribution-graphs)
2. [The Metrics We Extract](#2-the-metrics-we-extract)
3. [Length Control via Residualization](#3-length-control-via-residualization)
4. [Statistical Tests](#4-statistical-tests)

In [1]:
# Standard imports
import json
import numpy as np
import pandas as pd
from pathlib import Path
from scipy import stats
from scipy.stats import linregress

# Paths
DATA_DIR = Path('../data/results')

print("Environment ready.")

Environment ready.


---

## 1. Attribution Graphs

### What Are They?

Attribution graphs are **causal graphs** showing feature-to-feature influence inside a language model. Each node represents an interpretable feature (from transcoders), and each edge represents causal influence between features.

### The Tool: circuit-tracer

We use [circuit-tracer](https://github.com/anthropics/circuit-tracer) from Anthropic, combined with [Goodfire's transcoders](https://github.com/goodfire-ai/goodfire-transcoders) for Gemma 2.

**How it works:**

1. Load a "replacement model" where MLPs are swapped for transcoder features
2. Run text through the model
3. Track which features activate and how they influence each other
4. Output: An **attribution graph** with:
   - Nodes = transcoder features (interpretable units)
   - Edges = causal influence between features

```python
import torch  # needed for the dtype and device arguments below
from circuit_tracer import ReplacementModel, attribute

model = ReplacementModel.from_pretrained(
    "google/gemma-2-2b",  # Base model
    "gemma",              # Transcoder set
    dtype=torch.bfloat16,
    device=torch.device("cuda")
)

text = "The cat sat on the mat."
graph = attribute(text, model)
```

### What We Get Back

| Attribute | Type | What It Contains |
|-----------|------|------------------|
| `graph.active_features` | list | IDs of features that fired |
| `graph.activation_values` | tensor | Activation strength of each feature |
| `graph.adjacency_matrix` | tensor | Causal influence: `adj[i,j]` = how much feature i contributes to feature j |
| `graph.logit_probabilities` | tensor | Output probability distribution |
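
To make the `adj[i,j]` convention concrete, here is a toy sketch in which a hand-written 3×3 NumPy array stands in for `graph.adjacency_matrix`. The values are invented for illustration, and a real graph is a tensor with thousands of features. Rows index the source feature and columns the target, matching the table above.

```python
import numpy as np

# Hypothetical 3-feature graph: adj[i, j] = influence of feature i on feature j.
adj = np.array([
    [0.0, 0.5, 0.0],
    [0.0, 0.0, 2.0],
    [0.0, 0.0, 0.0],
])

# Influence each feature sends out (row sums) and receives (column sums).
outgoing = np.abs(adj).sum(axis=1)
incoming = np.abs(adj).sum(axis=0)

# Locate the single strongest edge as a (source, target) pair.
strongest_source, strongest_target = np.unravel_index(np.abs(adj).argmax(), adj.shape)

print(f"outgoing: {outgoing}")
print(f"incoming: {incoming}")
print(f"strongest edge: feature {strongest_source} -> feature {strongest_target}")
```

In this toy graph, feature 1 is the strongest sender and feature 2 the strongest receiver.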

### Our Contribution

Anthropic built the tools. We contribute:
- **Summary metrics** extracted from these graphs
- **The research question**: Do metrics differ by task type?
- **Statistical analysis**: Length control, bootstrap CIs, permutation tests

---

## 2. The Metrics We Extract

From each attribution graph, we extract summary statistics. These reduce the complex graph to a few interpretable numbers.

In [2]:
# Load the data to see the actual schema
with open(DATA_DIR / 'domain_attribution_metrics.json') as f:
    data = json.load(f)

print(f"Dataset: {data['metadata']['n_samples']} samples")
print("\nSample structure:")
sample = data['samples'][0]
for key, value in sample.items():
    if key != 'text':
        print(f"  {key}: {value}")

Dataset: 210 samples

Sample structure:
  idx: 0
  source: cola
  domain: grammar
  label: acceptable
  n_active: 9993
  mean_activation: 5.25
  max_activation: 157.0
  n_edges: 12796395
  mean_influence: 0.00818366277962923
  max_influence: 79.0
  top_100_concentration: 0.004105022166068677
  max_logit_prob: 0.2119140625
  logit_entropy: 1.640625


### Metric Definitions

| Metric | What It Measures | How It's Computed |
|--------|------------------|-------------------|
| `n_active` | Feature count | Number of features with non-zero activation |
| `mean_activation` | Avg feature strength | Mean of \|activation\| across active features |
| `max_activation` | Peak feature strength | Maximum single feature activation |
| `n_edges` | Connection count | Edges with influence > 0.01 threshold |
| `mean_influence` | Avg edge weight | Mean of \|adj[i,j]\| across all pairs |
| `max_influence` | Peak causal connection | Maximum single edge weight |
| `top_100_concentration` | Influence concentration | Fraction of total influence in top 100 edges |
| `logit_entropy` | Output uncertainty | Entropy of output probability distribution |
| `max_logit_prob` | Output confidence | Probability of most likely next token |
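
The definitions above can be sketched in a few lines of NumPy. This is a minimal illustration, not the extraction code that produced `domain_attribution_metrics.json`: the function name `summarize_graph`, its parameters, and the use of natural-log entropy are assumptions, and plain arrays stand in for the tensors that circuit-tracer returns.

```python
import numpy as np

def summarize_graph(activations, adj, probs, edge_threshold=0.01, top_k=100):
    """Reduce one attribution graph to summary metrics (hypothetical helper)."""
    act = np.abs(np.asarray(activations, dtype=float))
    weights = np.abs(np.asarray(adj, dtype=float)).ravel()
    probs = np.asarray(probs, dtype=float)

    active = act[act > 0]                   # features with non-zero activation
    total = weights.sum()                   # total absolute influence
    top = np.sort(weights)[-top_k:].sum()   # influence carried by the top_k edges
    nonzero = probs[probs > 0]              # skip zeros so log() stays finite

    return {
        "n_active": int(active.size),
        "mean_activation": float(active.mean()) if active.size else 0.0,
        "max_activation": float(act.max()) if act.size else 0.0,
        "n_edges": int((weights > edge_threshold).sum()),
        "mean_influence": float(weights.mean()) if weights.size else 0.0,
        "max_influence": float(weights.max()) if weights.size else 0.0,
        f"top_{top_k}_concentration": float(top / total) if total > 0 else 0.0,
        "max_logit_prob": float(probs.max()),
        "logit_entropy": float(-(nonzero * np.log(nonzero)).sum()),  # nats (assumed)
    }

# Toy inputs: 3 features, 2 non-zero edges, a uniform 2-token output distribution.
toy = summarize_graph(
    activations=[0.0, 2.0, 4.0],
    adj=[[0.0, 0.5, 0.0], [0.0, 0.0, 2.0], [0.0, 0.0, 0.0]],
    probs=[0.5, 0.5],
    top_k=1,
)
for name, value in toy.items():
    print(f"{name}: {value}")
```

With `top_k=1`, the concentration is the share of total influence carried by the single strongest edge (2.0 out of 2.5 here), which shows why the metric is a top-k share rather than a Gini coefficient.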

In [3]:
# Convert to DataFrame for analysis
df = pd.DataFrame(data['samples'])
df['text_length'] = df['text'].str.len()

# Check correlations with text length
metrics = ['n_active', 'mean_influence', 'top_100_concentration', 'mean_activation']

print("Correlation with text length:")
print("=" * 50)
for m in metrics:
    r = np.corrcoef(df[m], df['text_length'])[0, 1]
    robust = "YES" if abs(r) < 0.6 else "NO"
    print(f"  {m:<25} r = {r:+.3f}  Robust: {robust}")

Correlation with text length:
  n_active                  r = +0.982  Robust: NO
  mean_influence            r = -0.797  Robust: NO
  top_100_concentration     r = -0.629  Robust: NO
  mean_activation           r = -0.539  Robust: YES


### The Robustness Table

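| Metric | What It Measures | Robust to Length? (\|r\| < 0.6) |
|--------|------------------|-------------------|
| `n_active` | Feature count | **NO** (r = +0.98) |
| `mean_influence` | Avg edge weight | **NO** (r = -0.80) |
| `top_100_concentration` | Influence concentration | **NO** (r = -0.63) |
| `mean_activation` | Avg feature strength | YES (r = -0.54) |

**Critical insight**: `n_active` is almost perfectly correlated with text length: longer text means more tokens, and more tokens mean more features fire. This metric cannot be used directly for task comparison. `mean_influence` and `top_100_concentration` also exceed the 0.6 cutoff used in the cell above, and `mean_activation` sits just under it, so every metric is length-controlled before tasks are compared.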

---

## 3. Length Control via Residualization

### The Problem

`n_active` scales with text length (r = 0.98). Different tasks have different average text lengths:
- Grammar examples (CoLA): ~40 characters
- Reasoning examples (HellaSwag): ~175 characters

If we compare raw `n_active` between tasks, we're just measuring **length**, not **task differences**.
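
A small simulation makes the confound visible. Everything here is synthetic and assumed for illustration: the length spreads, the slope of 50 features per character, the noise level, and the helper name `pooled_d`. The metric is generated from length alone, so any group difference is purely a length effect.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic lengths centered on the averages quoted above (spreads are assumed).
len_short = rng.normal(40, 10, size=200).clip(min=5)
len_long = rng.normal(175, 30, size=200).clip(min=5)

def fake_n_active(length):
    """Assumed generative model: the metric depends on length only, plus noise."""
    return 50.0 * length + rng.normal(0, 300, size=length.shape)

short, long_ = fake_n_active(len_short), fake_n_active(len_long)

def pooled_d(a, b):
    """Cohen's d with a pooled standard deviation."""
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (
        len(a) + len(b) - 2
    )
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Subtract a line fitted on the pooled data, then compare the groups again.
lengths = np.concatenate([len_short, len_long])
values = np.concatenate([short, long_])
slope, intercept = np.polyfit(lengths, values, 1)
resid = values - (slope * lengths + intercept)

d_raw = pooled_d(short, long_)
d_resid = pooled_d(resid[:200], resid[200:])
print(f"raw d = {d_raw:+.2f}, length-controlled d = {d_resid:+.2f}")
```

The raw comparison reports a very large effect even though the two groups share the same generative rule. Once the linear length trend is removed, the effect shrinks toward zero.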

### The Solution: Residualization

Regress each metric on `text_length` and analyze the residuals instead of the raw values.

**How it works:**
1. Fit a line: `metric = slope * length + intercept`
2. Compute residuals: `residual = actual - predicted`
3. Use residuals for analysis

The residual represents "the part of the metric that isn't explained by length."
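
The reason residuals end up uncorrelated with length follows directly from ordinary least squares. Writing $x$ for text length and $y$ for the metric (notation introduced here for the sketch), the fitted line and residuals are:

$$
\hat{\beta} = \frac{\operatorname{Cov}(x, y)}{\operatorname{Var}(x)}, \qquad
\hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}, \qquad
e_i = y_i - \left(\hat{\alpha} + \hat{\beta}\,x_i\right)
$$

Substituting the slope into the covariance between length and residuals gives:

$$
\operatorname{Cov}(x, e) = \operatorname{Cov}(x, y) - \hat{\beta}\,\operatorname{Var}(x) = 0
$$

This guarantees zero *linear* correlation only. If a metric depends on length nonlinearly, some length dependence can remain in the residuals.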

In [4]:
def residualize(metric, length):
    """
    Remove the effect of text length from a metric.
    
    Parameters:
        metric: array of metric values
        length: array of text lengths
    
    Returns:
        residuals: metric values with length effect removed
    """
    slope, intercept, _, _, _ = linregress(length, metric)
    predicted = slope * length + intercept
    return metric - predicted

# Example: residualize n_active
n_active = df['n_active'].values
lengths = df['text_length'].values

n_active_resid = residualize(n_active, lengths)

# Verify: residuals should have zero correlation with length
r_before = np.corrcoef(n_active, lengths)[0, 1]
r_after = np.corrcoef(n_active_resid, lengths)[0, 1]

print("Residualization example (n_active):")
print(f"  Correlation with length BEFORE: r = {r_before:.3f}")
print(f"  Correlation with length AFTER:  r = {r_after:.6f}")

Residualization example (n_active):
  Correlation with length BEFORE: r = 0.982
  Correlation with length AFTER:  r = -0.000000


In [5]:
# Apply to all metrics
for m in metrics:
    df[f'{m}_resid'] = residualize(df[m].values, df['text_length'].values)

# Verify all residuals have zero correlation with length
print("Verification: All residuals have zero correlation with length")
print("=" * 60)
for m in metrics:
    r = np.corrcoef(df[f'{m}_resid'], df['text_length'])[0, 1]
    print(f"  {m}_resid: r = {r:.10f}")

Verification: All residuals have zero correlation with length
  n_active_resid: r = -0.0000000000
  mean_influence_resid: r = -0.0000000000
  top_100_concentration_resid: r = 0.0000000000
  mean_activation_resid: r = -0.0000000000


---

## 4. Statistical Tests

We use three complementary methods to validate our findings.

### Cohen's d (Effect Size)

**What it measures**: How different two groups are, in standard deviation units.

```
d = (mean1 - mean2) / pooled_standard_deviation
```

**Interpretation**:

| d | Interpretation |
|---|----------------|
| 0.2 | Small effect |
| 0.5 | Medium effect |
| 0.8 | Large effect |
| > 1.0 | Very large effect |
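
Written out, the pooled standard deviation weights each group's sample variance by its degrees of freedom:

$$
s_p = \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}}, \qquad
d = \frac{\bar{x}_1 - \bar{x}_2}{s_p}
$$

As a worked example with made-up numbers, take two equal-sized groups with means 10 and 7 and standard deviations 2 and 4. With $n_1 = n_2$ the weights cancel, so:

$$
s_p = \sqrt{\frac{2^2 + 4^2}{2}} = \sqrt{10} \approx 3.16, \qquad
d = \frac{10 - 7}{3.16} \approx 0.95
$$

That would count as a large effect in the table above.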

In [6]:
def cohens_d(a, b):
    """
    Compute Cohen's d effect size between two groups.
    
    Parameters:
        a, b: arrays of values for each group
    
    Returns:
        d: standardized effect size
    """
    na, nb = len(a), len(b)
    pooled_std = np.sqrt(
        ((na - 1) * np.std(a, ddof=1)**2 + (nb - 1) * np.std(b, ddof=1)**2) / (na + nb - 2)
    )
    if pooled_std == 0:
        return 0
    return (np.mean(a) - np.mean(b)) / pooled_std

# Example: Grammar (CoLA) vs. every other source in the dataset
cola = df[df['source'] == 'cola']
others = df[df['source'] != 'cola']

print("Effect sizes: Grammar (CoLA) vs Reasoning")
print("=" * 50)
for m in metrics:
    d_raw = cohens_d(cola[m].values, others[m].values)
    d_resid = cohens_d(cola[f'{m}_resid'].values, others[f'{m}_resid'].values)
    print(f"  {m:<25} Raw: d={d_raw:+.2f}  Controlled: d={d_resid:+.2f}")

Effect sizes: Grammar (CoLA) vs Reasoning
  n_active                  Raw: d=-2.17  Controlled: d=+0.07
  mean_influence            Raw: d=+3.22  Controlled: d=+1.08
  top_100_concentration     Raw: d=+2.36  Controlled: d=+0.87
  mean_activation           Raw: d=+1.74  Controlled: d=+0.64


### Bootstrap Confidence Intervals

**What it measures**: How precise is our effect size estimate?

**Method**:
1. Resample data with replacement (5000 times)
2. Compute Cohen's d for each resample
3. Take 2.5th and 97.5th percentiles as 95% CI

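**Interpretation**: If the 95% CI excludes zero, the effect is unlikely to be an artifact of which samples happened to be drawn. The width of the interval shows how precisely d is estimated.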

In [7]:
def bootstrap_ci(group1, group2, n_boot=5000, seed=42):
    """
    Bootstrap 95% confidence interval for Cohen's d.
    
    Parameters:
        group1, group2: arrays of values
        n_boot: number of bootstrap resamples
        seed: random seed for reproducibility
    
    Returns:
        (ci_low, ci_high): 95% confidence interval
    """
    np.random.seed(seed)
    boot_ds = []
    
    for _ in range(n_boot):
        sample1 = np.random.choice(group1, size=len(group1), replace=True)
        sample2 = np.random.choice(group2, size=len(group2), replace=True)
        boot_ds.append(cohens_d(sample1, sample2))
    
    return np.percentile(boot_ds, 2.5), np.percentile(boot_ds, 97.5)

# Example: Bootstrap CI for influence
ci_low, ci_high = bootstrap_ci(
    cola['mean_influence_resid'].values,
    others['mean_influence_resid'].values
)
print(f"Influence (length-controlled): 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")

Influence (length-controlled): 95% CI = [0.70, 1.49]
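
`np.random.seed` sets NumPy's global random state, so calling `bootstrap_ci` also changes the numbers that later unseeded `np.random` calls produce. One possible variant, sketched below, keeps the randomness local with `np.random.default_rng`. The names `bootstrap_ci_local` and `pooled_d` and the `level` parameter are hypothetical, the demo data are synthetic, and this variant will not reproduce the exact interval printed above because it uses a different generator.

```python
import numpy as np

def pooled_d(a, b):
    """Cohen's d with a pooled standard deviation; returns 0.0 if there is no spread."""
    pooled_var = ((len(a) - 1) * np.var(a, ddof=1) + (len(b) - 1) * np.var(b, ddof=1)) / (
        len(a) + len(b) - 2
    )
    if pooled_var == 0:
        return 0.0
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

def bootstrap_ci_local(group1, group2, n_boot=5000, seed=42, level=0.95):
    """Percentile bootstrap CI for Cohen's d using a local random generator."""
    rng = np.random.default_rng(seed)  # does not touch the global np.random state
    g1 = np.asarray(group1, dtype=float)
    g2 = np.asarray(group2, dtype=float)
    boot = np.empty(n_boot)
    for i in range(n_boot):
        s1 = rng.choice(g1, size=g1.size, replace=True)
        s2 = rng.choice(g2, size=g2.size, replace=True)
        boot[i] = pooled_d(s1, s2)
    tail = 100 * (1 - level) / 2
    return np.percentile(boot, tail), np.percentile(boot, 100 - tail)

# Synthetic demo: two groups whose true means differ by one standard deviation.
demo_rng = np.random.default_rng(1)
a = demo_rng.normal(1.0, 1.0, size=60)
b = demo_rng.normal(0.0, 1.0, size=60)
low, high = bootstrap_ci_local(a, b, n_boot=2000)
print(f"95% CI for d: [{low:.2f}, {high:.2f}]")
```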


### Permutation Test (Shuffle Test)

**What it measures**: Could the observed effect be due to random chance?

**Method**:
1. Shuffle task labels randomly (1000 times)
2. Compute Cohen's d for each shuffle (null distribution)
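3. p-value = proportion of shuffles with |d| >= observed |d| (two-sided, since the code compares absolute values)

**Interpretation**: If p < 0.05, the effect is statistically significant. With 1000 shuffles the smallest nonzero p-value is 0.001, so a printed `p = 0.0000` means no shuffle matched the observed effect and should be reported as p < 0.001.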

In [8]:
def permutation_test(values, labels, n_permutations=1000, seed=42):
    """
    Permutation test for difference between groups.
    
    Parameters:
        values: array of metric values
        labels: array of group labels
        n_permutations: number of shuffles
    
    Returns:
        observed_d: actual effect size
        p_value: proportion of shuffles >= observed
    """
    rng = np.random.RandomState(seed)
    
    # Observed effect
    grammar_mask = labels == 'cola'
    observed_d = abs(cohens_d(values[grammar_mask], values[~grammar_mask]))
    
    # Null distribution
    null_ds = []
    for _ in range(n_permutations):
        shuffled = rng.permutation(labels)
        shuffled_mask = shuffled == 'cola'
        null_ds.append(abs(cohens_d(values[shuffled_mask], values[~shuffled_mask])))
    
    p_value = np.mean(np.array(null_ds) >= observed_d)
    return observed_d, p_value

# Example: Permutation test for influence
obs_d, p_val = permutation_test(
    df['mean_influence_resid'].values,
    df['source'].values
)
print(f"Influence: observed d = {obs_d:.3f}, p = {p_val:.4f}")

Influence: observed d = 1.079, p = 0.0000
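
A permutation p-value of exactly zero is an artifact of the finite number of shuffles. A common correction counts the observed statistic as one member of the null distribution, giving `(b + 1) / (m + 1)` for `b` exceedances out of `m` shuffles. The sketch below illustrates this with a hypothetical helper, `corrected_p_value`, and a placeholder null distribution. It is not the notebook's `permutation_test`, and the null values are not real results.

```python
import numpy as np

def corrected_p_value(null_stats, observed):
    """Permutation p-value that can never be exactly zero: (b + 1) / (m + 1)."""
    null_stats = np.asarray(null_stats, dtype=float)
    exceed = int(np.sum(null_stats >= observed))  # b: shuffles at least as extreme
    return (exceed + 1) / (null_stats.size + 1)   # m + 1 includes the observed value

# Placeholder null distribution: 1000 evenly spaced |d| values between 0 and 0.5.
null = np.linspace(0.0, 0.5, 1000)

p_extreme = corrected_p_value(null, observed=1.079)  # nothing in the null reaches this
p_trivial = corrected_p_value(null, observed=0.0)    # every null value reaches this

print(f"p (extreme effect) = {p_extreme:.4f}")
print(f"p (no effect)      = {p_trivial:.4f}")
```

With 1000 shuffles and no exceedances, the corrected value is 1/1001, which rounds to 0.0010 instead of 0.0000.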




---

## Summary

### The Pipeline

1. **Input**: Text samples from different task types
2. **Tool**: circuit-tracer + Goodfire transcoders
3. **Output**: Attribution graphs (nodes = features, edges = influence)
4. **Extraction**: Summary metrics (n_active, mean_influence, concentration, etc.)
5. **Length Control**: Residualization removes length confound
6. **Validation**: Cohen's d, bootstrap CIs, permutation tests

### Key Insight

`n_active` is confounded by length (r = 0.98). After length control (absolute effect sizes):
- `n_active` collapses (|d|: 2.17 -> 0.07)
- `mean_influence` persists (|d|: 3.22 -> 1.08)
- `top_100_concentration` persists (|d|: 2.36 -> 0.87)

The real signal is in **influence** and **concentration**, not feature counts.