# Interpretability Tests and Experiments

These preliminary tests showcase mechanistic interpretability (mech interp) techniques, their mathematical underpinnings, and how they apply to the larger project.

This notebook serves as the proof of concept for the "Researcher Layer" of the Clear-box AI web app.
It validates the mathematical operations required for:
1. **Signal Extraction:** Capturing hidden states and attentions.
2. **Logit Attribution:** Understanding next-token prediction via the Logit Lens.
3. **Concept Visualization:** PCA projection of the residual stream.

**Model:** GPT-2 (Small)
**Goal:** Validate logic before porting to TypeScript/WebGPU.

In [1]:
# imports
import torch
import torch.nn.functional as F 
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import numpy as np
import matplotlib.pyplot as plt


In [2]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Running on: {device}")

Running on: cuda


## Engine Layer

This corresponds to `src/engine/ModelManager.ts`.
We use the standard Hugging Face `transformers` API, but explicitly request `output_hidden_states` and `output_attentions` so the forward pass returns its internals.

In [3]:
model_name = "gpt2"
model = GPT2LMHeadModel.from_pretrained(model_name).to(device)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model.eval()

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

In [7]:
def run_inference(prompt: str):
    """
    Run inference and capture internals.
    Equivalent to the Web Worker's `generate` func with full observability.
    """
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    with torch.no_grad():
        outputs = model(
            **inputs,
            # Hugging Face reference for `hidden_states`:
            # https://huggingface.co/docs/transformers/v4.26.0/main_classes/output#transformers.modeling_outputs.BaseModelOutput.hidden_states
            output_hidden_states=True,
            output_attentions=True,
        )
    
    # hidden_states: Tuple of (layer_count + 1) tensors. Shape: [batch, seq, hidden]
    # attentions: Tuple of (layer_count) tensors. Shape: [batch, heads, seq, seq]
    # logits: Shape [batch, seq, vocab]

    return {
        "input_ids": inputs["input_ids"],
        "tokens": [tokenizer.decode([t]) for t in inputs["input_ids"][0]],
        "hidden_states": outputs.hidden_states,
        "attentions": outputs.attentions,
        "logits": outputs.logits
    }

In [8]:
# quick test run of inference func
prompt = "The Eiffel Tower is located in the city of"
data = run_inference(prompt)
print(f"Prompt tokens: {data['tokens']}")
print(f"Total layers captured: {len(data['hidden_states'])}") # should be 13 => Embed + 12 blocks

Prompt tokens: ['The', ' E', 'iff', 'el', ' Tower', ' is', ' located', ' in', ' the', ' city', ' of']
Total layers captured: 13
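
The shapes noted in the comments of `run_inference` can be checked mechanically before porting. The following is a minimal sketch, not part of the original notebook: `expected_shapes` and `check_shapes` are hypothetical helper names, and the head count of 12 used in the usage note is GPT-2 small's configuration rather than something printed above. The check only assumes each tensor exposes a `.shape` attribute, so it should work with PyTorch tensors or NumPy arrays.

```python
from typing import Any, Dict, List, Tuple

Shape = Tuple[int, ...]


def expected_shapes(n_layers: int, n_heads: int, hidden: int, vocab: int,
                    seq_len: int, batch: int = 1) -> Dict[str, Tuple[int, Shape]]:
    """Return (tensor_count, per-tensor shape) for each captured output."""
    return {
        # embedding output plus one entry per transformer block
        "hidden_states": (n_layers + 1, (batch, seq_len, hidden)),
        # one attention map per block
        "attentions": (n_layers, (batch, n_heads, seq_len, seq_len)),
        # a single tensor of next-token scores
        "logits": (1, (batch, seq_len, vocab)),
    }


def check_shapes(data: Dict[str, Any],
                 expected: Dict[str, Tuple[int, Shape]]) -> List[str]:
    """Compare captured outputs to expectations; return mismatch messages."""
    problems = []
    for key, (count, shape) in expected.items():
        value = data[key]
        # hidden_states / attentions are tuples of tensors; logits is one tensor
        tensors = list(value) if isinstance(value, (tuple, list)) else [value]
        if len(tensors) != count:
            problems.append(f"{key}: expected {count} tensors, got {len(tensors)}")
        for i, t in enumerate(tensors):
            if tuple(t.shape) != shape:
                problems.append(f"{key}[{i}]: expected {shape}, got {tuple(t.shape)}")
    return problems
```

With the notebook's `data` dict and the 11-token prompt, `check_shapes(data, expected_shapes(12, 12, 768, 50257, seq_len=11))` should return an empty list if the shape comments in `run_inference` hold.
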


## Logit Attribution

This technique applies the Unembedding Matrix ($W_U$) to the hidden state of *intermediate* layers. It answers: "If we stopped the model at Layer $L$, what token would it predict?"

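Mathematical logic: $\text{Logits}_L = h_L \cdot W_U^T$, where $h_L$ is the hidden state after layer $L$ and $W_U$ is the unembedding matrix, stored as `lm_head.weight` with shape `[vocab, hidden]` = `[50257, 768]`.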

This validates the logic for `src/vis/LogitDistribution.tsx`.
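
The formula above can be sketched on toy data. This is a minimal NumPy sketch under stated assumptions, not the notebook's implementation: `logit_lens`, `layer_norm`, and `softmax` are hypothetical helper names, the matrices are random stand-ins for GPT-2's weights, and `W_U` is assumed to be stored as `[vocab, hidden]` (the layout of `lm_head` printed earlier). Applying the final LayerNorm before unembedding is a common logit-lens practice for GPT-2-style models, but it is an addition to the formula as stated, so it is skipped unless `gamma` and `beta` are passed.

```python
import numpy as np


def layer_norm(x, gamma, beta, eps=1e-5):
    """LayerNorm over the last axis (biased variance, eps as shown for ln_f)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * gamma + beta


def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def logit_lens(hidden_states, W_U, gamma=None, beta=None, top_k=3):
    """
    hidden_states: list of [seq, hidden] arrays, one per layer.
    W_U:           [vocab, hidden] unembedding matrix.
    gamma, beta:   optional final-LayerNorm parameters (assumption: applied
                   before unembedding only when both are provided).
    Returns, per layer, the top_k (token_id, probability) pairs for the
    last sequence position.
    """
    results = []
    for h in hidden_states:
        h_last = h[-1]  # hidden state at the final token position
        if gamma is not None and beta is not None:
            h_last = layer_norm(h_last, gamma, beta)
        logits = h_last @ W_U.T  # Logits_L = h_L . W_U^T
        probs = softmax(logits)
        top = np.argsort(probs)[::-1][:top_k]  # highest probability first
        results.append([(int(i), float(probs[i])) for i in top])
    return results


# Toy usage: 3 "layers", 4 tokens, hidden size 8, vocabulary of 20.
rng = np.random.default_rng(0)
toy_hidden = [rng.normal(size=(4, 8)) for _ in range(3)]
toy_W_U = rng.normal(size=(20, 8))
for layer, top in enumerate(logit_lens(toy_hidden, toy_W_U)):
    print(layer, top)
```

In the notebook's setting, the per-layer arrays would come from `data["hidden_states"]` (dropping the batch dimension) and `W_U` from the model's `lm_head` weights. Token ids would then be decoded with the tokenizer to get readable predictions per layer.
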