<a href="https://colab.research.google.com/github/abdelhadidjafer02-beep/GPT-2/blob/main/hooks_ablation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Step 0: Setup & The Proxy Task

We use a prompt that contains a repeated phrase, so the model must look back at earlier context to predict the next token correctly.


In [None]:
# Setup (Standard)
!pip install transformer_lens
import torch
from transformer_lens import HookedTransformer
import transformer_lens.utils as utils

model = HookedTransformer.from_pretrained("gpt2-small")

# The Proxy Task: a prompt whose opening phrase repeats
# If the model sees " A B C ... A", it should predict " B".
# Caveat: this prompt is a common English sentence rather than random tokens,
# so memorized text may contribute alongside the "copying" mechanism.
text = "The quick brown fox jumps over the lazy dog. The quick brown fox"
target = " jumps"

# Measure Baseline (The "Control" Group)
logits = model(text)
prob = torch.softmax(logits[0, -1], dim=0)[model.to_single_token(target)].item()
print(f"Baseline Probability of '{target}': {prob:.2%}")
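The comments in the cell above describe a "Repeated Random" sequence, while the prompt actually used is a natural sentence. As a minimal sketch of what a truly random repeated sequence could look like, the helper below builds a list of token ids in which a random block appears twice. The name `make_repeated_tokens` is hypothetical, and the defaults assume GPT-2's vocabulary size (50257) and its end-of-text id (50256) used as the BOS token.

```python
import random

def make_repeated_tokens(seq_len=20, vocab_size=50257, bos_id=50256, seed=0):
    """Return [BOS] + random block + the same block again (hypothetical helper)."""
    rng = random.Random(seed)
    # Exclude the BOS id and sample without replacement,
    # so no token repeats inside a single block.
    candidates = [t for t in range(vocab_size) if t != bos_id]
    block = rng.sample(candidates, seq_len)
    return [bos_id] + block + block

tokens = make_repeated_tokens(seq_len=20)
half = (len(tokens) - 1) // 2
print(tokens[1:1 + half] == tokens[1 + half:])  # the two copies are identical

# With the model from the cell above (not run here):
#   logits = model(torch.tensor([tokens]))
#   probs = torch.softmax(logits[0, -2], dim=-1)  # prediction made at the second-to-last position
#   print(probs[tokens[-1]].item())               # probability of the true final token
```

Because the block is random, a correct prediction in the second copy can only come from looking back at the first copy, which is the property the proxy task is meant to have.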

Step 1: Formulating a Hypothesis (The "Look")


Before intervening, we look. We suspect attention heads are responsible. We hypothesize that specific heads in the middle layers (Layers 5-7 in GPT-2 Small) act as "induction heads": they look back at the previous instance of the current token to find what came next.

Hypothesis: "If we remove the output of Layer 5, Head 5 (L5H5), the model will forget how to copy."
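One way to "look" before ablating is to score how much attention a head places on the position right after the previous occurrence of the current token. The sketch below is an assumption about how that could be measured, not the notebook's own method; `induction_targets` and `induction_score` are hypothetical helper names.

```python
def induction_targets(tokens):
    """Map each position whose token appeared earlier to the position
    right after that token's most recent earlier occurrence."""
    last_seen = {}
    targets = {}
    for pos, tok in enumerate(tokens):
        if tok in last_seen:
            targets[pos] = last_seen[tok] + 1
        last_seen[tok] = pos
    return targets

def induction_score(pattern, targets):
    """Average attention weight on the induction targets.
    pattern[q][k] is the attention from query position q to key position k."""
    if not targets:
        return 0.0
    return sum(float(pattern[q][k]) for q, k in targets.items()) / len(targets)

# Toy check: in "A B C A B", position 3 (second A) should look at position 1 (first B)
print(induction_targets(["A", "B", "C", "A", "B"]))

# With the model from Step 0 (not run here):
#   _, cache = model.run_with_cache(text)
#   tokens = model.to_tokens(text)[0].tolist()   # includes the prepended BOS token
#   pattern = cache["pattern", 5][0, 5]          # layer 5, batch 0, head 5 -> [query, key]
#   print(induction_score(pattern, induction_targets(tokens)))
```

A score near 1 would be consistent with induction-like behavior for that head on this prompt, while a score near 0 would not. Note that the leading `The` and the later ` The` are probably different GPT-2 tokens, since the tokenizer folds a preceding space into the token, so only the repeated words after it would receive targets.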

Step 2: The Necessity Test (Ablation Hook)

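Here, we do targeted ablation (surgical removal), one head at a time. Rather than zeroing a head, the hook in the next cell overwrites its output with the mean output of all heads in the same layer. The loop repeats this for every head in the model, so L5H5 can be compared against the rest.

Logic: If L5H5 is necessary, ablating it should drop the probability of " jumps" significantly.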
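For comparison, the hypothesis can also be tested on L5H5 alone with a plain zero-ablation. The sketch below is illustrative only; `make_zero_ablation_hook` and `hook_point_name` are hypothetical helpers, and the hook assumes the same `hook_z` activation layout used elsewhere in this notebook.

```python
def make_zero_ablation_hook(head_to_ablate):
    """Build a hook that zeroes one head's output at hook_z (hypothetical helper)."""
    def zero_ablation_hook(value, hook):
        # value shape: [batch, pos, head_index, d_head]
        value[:, :, head_to_ablate, :] = 0.0
        return value
    return zero_ablation_hook

def hook_point_name(layer):
    """Name of the per-head attention output hook for a given layer."""
    return f"blocks.{layer}.attn.hook_z"

# Direct test of the hypothesis, reusing model, text, target and prob from Step 0 (not run here):
#   logits = model.run_with_hooks(
#       text,
#       fwd_hooks=[(hook_point_name(5), make_zero_ablation_hook(5))],
#   )
#   p = torch.softmax(logits[0, -1], dim=0)[model.to_single_token(target)].item()
#   print(f"L5H5 zero-ablated: {p:.2%} (baseline {prob:.2%})")
```

Zero-ablation and mean-ablation can give different answers, because zero may lie far from the activations the later layers normally see. Comparing the two on the same head is a cheap sanity check.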

In [None]:
def make_mean_ablation_hook(head_to_ablate):
    def mean_ablation_hook(value, hook):
        # value shape: [batch, pos, head_index, d_head]
        # To truly ablate a specific head by replacing with a *mean activation* baseline,
        # we replace its output with the mean across ALL heads in that layer
        # for each batch and position. This removes the unique information of the ablated head.
        mean_activation_across_heads = value.mean(dim=2, keepdim=True)
        value[:, :, head_to_ablate, :] = mean_activation_across_heads[:, :, 0, :]
        return value
    return mean_ablation_hook

# Store results
ablation_results = {}

# Loop through all layers and all heads
num_layers = model.cfg.n_layers # GPT2-small has 12 layers
num_heads = model.cfg.n_heads   # GPT2-small has 12 heads per layer

print(f"Baseline Probability of '{target}': {prob:.2%}\n")

for layer_idx in range(num_layers):
    for head_idx in range(num_heads):
        hook_name = f"blocks.{layer_idx}.attn.hook_z"
        specific_ablation_hook = make_mean_ablation_hook(head_idx)

        ablated_logits = model.run_with_hooks(
            text,
            fwd_hooks=[(hook_name, specific_ablation_hook)]
        )

        ablated_prob = torch.softmax(ablated_logits[0, -1], dim=0)[model.to_single_token(target)].item()
        ablation_results[f"L{layer_idx}H{head_idx}"] = ablated_prob

        print(f"Prob after ablating L{layer_idx}H{head_idx}: {ablated_prob:.2%} (Change: {(ablated_prob - prob):.2%})")

# You can also analyze ablation_results dictionary further if needed


Baseline Probability of ' jumps': 47.07%

Prob after ablating L0H0: 42.68% (Change: -4.39%)
Prob after ablating L0H1: 52.80% (Change: 5.73%)
Prob after ablating L0H2: 39.03% (Change: -8.04%)
Prob after ablating L0H3: 57.89% (Change: 10.82%)
Prob after ablating L0H4: 41.43% (Change: -5.64%)
Prob after ablating L0H5: 48.64% (Change: 1.57%)
Prob after ablating L0H6: 59.33% (Change: 12.26%)
Prob after ablating L0H7: 17.29% (Change: -29.78%)
Prob after ablating L0H8: 4.86% (Change: -42.21%)
Prob after ablating L0H9: 52.86% (Change: 5.79%)
Prob after ablating L0H10: 44.71% (Change: -2.36%)
Prob after ablating L0H11: 33.07% (Change: -14.00%)
Prob after ablating L1H0: 41.96% (Change: -5.11%)
Prob after ablating L1H1: 36.00% (Change: -11.07%)
Prob after ablating L1H2: 46.90% (Change: -0.17%)
Prob after ablating L1H3: 36.93% (Change: -10.14%)
Prob after ablating L1H4: 42.53% (Change: -4.54%)
Prob after ablating L1H5: 38.68% (Change: -8.39%)
Prob after ablating L1H6: 51.59% (Change: 4.52%)
... (remaining output truncated)

## Ablation Experiment Results Explanation

The printed results show how mean-ablating each attention head changes the probability assigned to ' jumps'. Each change is the absolute difference from the 47.07% baseline, in percentage points.

**Key Findings:**

*   **Significant Drops:** Ablating some heads caused very large decreases in the probability of predicting ' jumps': `L0H8` (-42.21 points, to 4.86%), `L11H0` (-30.83 points, to 16.24%), and `L0H7` (-29.78 points, to 17.29%). This suggests these heads matter for this prediction. Layer 0 heads feed every later layer, however, so a large drop there does not by itself show that the head implements copying.

*   **Significant Increases:** Conversely, ablating some heads led to a substantial *increase* in the probability: `L2H8` (+31.01 points, to 78.08%), `L2H3` (+20.11 points, to 67.18%), and `L7H10` (+19.87 points, to 66.94%). This indicates these heads might normally suppress the correct prediction, or that they are involved in other, less direct mechanisms.

*   **Mixed Impact:** Many other heads caused smaller positive or negative changes, suggesting they contribute to the task to varying degrees, or their roles are less direct.

This analysis helps us pinpoint the attention heads whose ablation most affects the model's behavior on this proxy task. The original hypothesis focused on the middle layers; while some of those heads are important, the largest drops cited above come from earlier (L0) and later (L11) layers.
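Scanning 144 printed lines by eye is error-prone, so the findings above can be recovered by sorting the results. The sketch below assumes a dictionary shaped like `ablation_results` (head name mapped to probability); `rank_by_change` is a hypothetical helper, and the sample values are copied from the figures reported above rather than recomputed.

```python
def rank_by_change(results, baseline):
    """Return (head, change) pairs sorted from largest drop to largest increase."""
    changes = {head: p - baseline for head, p in results.items()}
    return sorted(changes.items(), key=lambda item: item[1])

# A few values copied from the results reported above;
# in the notebook, pass the full ablation_results dict and prob instead.
sample_results = {
    "L0H8": 0.0486, "L0H7": 0.1729, "L11H0": 0.1624,
    "L2H8": 0.7808, "L2H3": 0.6718, "L7H10": 0.6694,
    "L1H2": 0.4690,
}
baseline = 0.4707

ranked = rank_by_change(sample_results, baseline)

print("Largest drops:")
for head, change in ranked[:3]:
    print(f"  {head}: {change * 100:+.2f} points")

print("Largest increases:")
for head, change in ranked[-3:][::-1]:
    print(f"  {head}: {change * 100:+.2f} points")
```

Sorting by signed change keeps the two lists separate; sorting by absolute change instead would rank heads by overall influence regardless of direction.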