# Interpretability Starter Notebook (Colab T4 GPU)

This notebook pins library versions so that Direct Logit Attribution (DLA) results on GPT-2 Small are reproducible. If you do not follow these instructions exactly, your results may diverge from our answer key.

**Colab setup**
1. `Runtime` -> `Change runtime type` -> Hardware accelerator: `T4 GPU`. Runtime version can remain "latest".
2. Run the installation cell below.
3. Restart the runtime if prompted, then run the verification cell.


In [1]:
# Installation cell
!pip install transformer_lens==2.16.1 transformers==4.57.3 torch==2.9.0+cu126 numpy==1.26.4 einops==0.8.1



In [2]:
# Verification cell
import torch
import transformer_lens
import transformers
import einops
import numpy as np

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("einops:", einops.__version__)
print("numpy:", np.__version__)

if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    print("GPU:", gpu_name)
    if "T4" not in gpu_name:
        print("WARNING: Expected a T4 GPU for reproducibility.")
else:
    print("WARNING: No GPU detected. Enable a T4 GPU in Colab.")

torch: 2.9.0+cu126
transformers: 4.57.3
einops: 0.8.1
numpy: 1.26.4
GPU: Tesla T4
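
The verification cell prints versions for a visual comparison against the pins. A stricter check could compare them automatically. The sketch below is one way to do that with the standard library's `importlib.metadata`; the `PINS` dictionary and the `find_mismatches` helper are illustrative names, not part of the original notebook, and the sketch assumes the installed `torch` distribution reports its version with the `+cu126` suffix.

```python
from importlib import metadata

# Pins copied from the installation cell (PINS is an illustrative name).
PINS = {
    "transformer_lens": "2.16.1",
    "transformers": "4.57.3",
    "torch": "2.9.0+cu126",
    "numpy": "1.26.4",
    "einops": "0.8.1",
}


def find_mismatches(pins, get_version=metadata.version):
    """Return {package: (expected, installed)} for every pin that does not match.

    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


for name, (expected, installed) in find_mismatches(PINS).items():
    print(f"WARNING: {name} expected {expected}, found {installed}")
```

If the loop prints nothing, every installed version matches its pin.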


## Proceed with the assignment below

Be sure to use the following settings in your implementation to ensure reproducibility:
- Use `torch.manual_seed(42)` before loading the model.
- Tokenize with `prepend_bos=False`.
- Read each layer's `hook_z` activations and project them through `W_O` manually, looping over layers and heads.
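
For reference, the per-head quantity that the last setting describes can be written as follows. This is a sketch of what the cells below compute, with notation chosen here for illustration: $z_{\ell,h}$ is the `hook_z` vector of head $h$ in layer $\ell$ at the final token position, $W_O^{\ell,h}$ is that head's output projection, and $W_U[:, t]$ is the unembedding column for target token $t$.

$$
\mathrm{DLA}_{\ell,h}(t) \;=\; \bigl(z_{\ell,h}\, W_O^{\ell,h}\bigr) \cdot W_U[:, t],
\qquad z_{\ell,h} \in \mathbb{R}^{d_\text{head}},\quad
W_O^{\ell,h} \in \mathbb{R}^{d_\text{head} \times d_\text{model}},\quad
W_U[:, t] \in \mathbb{R}^{d_\text{model}}
$$

Note that, as computed in this notebook, the head output is not rescaled by the final LayerNorm before the dot product.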


In [3]:
import torch
from transformer_lens import HookedTransformer

torch.manual_seed(42)

model = HookedTransformer.from_pretrained(
    "gpt2-small",
    device="cuda"
)


Loaded pretrained model gpt2-small into HookedTransformer


In [4]:
prompt = "When Mary and John went to the store, John gave a drink to"

tokens = model.to_tokens(
    prompt,
    prepend_bos=False
)

tokens


tensor([[2215, 5335,  290, 1757, 1816,  284,  262, 3650,   11, 1757, 2921,  257,
         4144,  284]], device='cuda:0')

In [5]:
tokenizer = model.tokenizer


In [6]:
tokenizer.encode(" Mary")


[5335]

In [7]:
tokenizer.decode([5335])


' Mary'

In [9]:
prompt = "When Mary and John went to the store, John gave a drink to"

tokens = model.to_tokens(prompt, prepend_bos=False)

logits, cache = model.run_with_cache(tokens)


In [10]:
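mary_token_id = 5335  # token ID for " Mary" (with leading space), as encoded above
W_U_mary = model.W_U[:, mary_token_id]  # unembedding column for " Mary", shape [d_model]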


In [12]:
dla = []

for layer in range(model.cfg.n_layers):
    z = cache[f"blocks.{layer}.attn.hook_z"]  # [1, pos, heads, d_head]

    for head in range(model.cfg.n_heads):
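        z_head = z[0, -1, head]  # final token position (index 13 in this 14-token prompt), shape [d_head]

        W_O = model.blocks[layer].attn.W_O[head]  # [d_head, d_model]
        head_output = z_head @ W_O  # this head's write to the residual stream, shape [d_model]

        # Dot with the " Mary" unembedding column; no final LayerNorm scaling is applied here
        contribution = head_output @ W_U_mary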
        dla.append({
            "layer": layer,
            "head": head,
            "dla": contribution.item()
        })
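
The inner loop over heads can also be expressed as a single contraction. The sketch below checks that equivalence on random NumPy arrays standing in for one layer's `hook_z` slice, `W_O`, and the `W_U` column; the array names are illustrative, and the dimensions assume GPT-2 Small (12 heads, `d_head` 64, `d_model` 768). It does not load the model, so the numbers it produces are unrelated to the answer key.

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, d_head, d_model = 12, 64, 768  # assumed GPT-2 Small dimensions

# Random stand-ins (NOT real model weights or activations).
z_final = rng.standard_normal((n_heads, d_head))       # hook_z at the final position
W_O = rng.standard_normal((n_heads, d_head, d_model))  # one layer's attn.W_O
w_u_tok = rng.standard_normal(d_model)                 # W_U[:, token_id]

# Loop version, mirroring the cell above: project each head, then dot with W_U column.
loop_dla = np.array([(z_final[h] @ W_O[h]) @ w_u_tok for h in range(n_heads)])

# Vectorized version: contract d_head and d_model in one einsum, keeping the head axis.
vec_dla = np.einsum("hd,hdm,m->h", z_final, W_O, w_u_tok)

print(np.allclose(loop_dla, vec_dla))
```

The same `einsum` pattern should carry over to torch tensors, but the explicit loop is kept in the assignment because it matches the settings listed above.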


In [13]:
import pandas as pd

df = pd.DataFrame(dla)
df.sort_values("dla", ascending=False).head()


     layer  head        dla
114      9     6  28.899052
117      9     9  23.605740
122     10     2  20.687529
120     10     0  19.081276
133     11     1  17.294460
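
The first column of the output above is the row's position in the `dla` list, which the loop fills layer by layer and head by head. Assuming 12 heads per layer (GPT-2 Small), a row label can be mapped back to its layer and head, and vice versa. The helper names below are illustrative and do not appear in the notebook.

```python
N_HEADS = 12  # heads per layer in GPT-2 Small (assumed)


def flat_to_layer_head(index, n_heads=N_HEADS):
    """Map a row label in the DLA table to its (layer, head) pair."""
    return divmod(index, n_heads)


def layer_head_to_flat(layer, head, n_heads=N_HEADS):
    """Map a (layer, head) pair to its row label in the DLA table."""
    return layer * n_heads + head


print(flat_to_layer_head(114))   # (9, 6)
print(layer_head_to_flat(10, 7))  # 127
```

This is a quick sanity check that the `layer` and `head` columns agree with the row labels.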


In [14]:
df.sort_values("dla").head(1)


     layer  head        dla
127     10     7 -34.543137


In [15]:
df[df["dla"] > 0]["dla"].sum()


284.94674567878246

In [16]:
(df["dla"] > 0).sum()


97

In [17]:
df[df["layer"] == 9]["dla"].abs().sum()


85.88965598121285
