<a href="https://colab.research.google.com/github/abdelhadidjafer02-beep/GPT-2/blob/main/Token_level_patch_Activation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Token-Level Activation Patching

In [7]:
#!pip install transformer_lens

import torch
from transformer_lens import HookedTransformer
import pandas as pd

# Load a small, manageable model
device = "cuda" if torch.cuda.is_available() else "cpu"
model = HookedTransformer.from_pretrained("gpt2-small", device=device)

Loaded pretrained model gpt2-small into HookedTransformer


In [8]:
clean_prompt = "LeBron James plays the sport of"
corrupted_prompt = "LeBron James plays the profession of"
# We change one word to see where the 'sport' concept is injected.

clean_tokens = model.to_tokens(clean_prompt)
corrupted_tokens = model.to_tokens(corrupted_prompt)

clean_logits, clean_cache = model.run_with_cache(clean_tokens)
corrupted_logits, corrupted_cache = model.run_with_cache(corrupted_tokens)

# The token for ' Basketball'
answer_token = model.to_single_token(" Basketball")

In [9]:
clean_logit_basketball = clean_logits[0, -1, answer_token].item()
corrupted_logit_basketball = corrupted_logits[0, -1, answer_token].item()

print(f"Clean logit for ' Basketball': {clean_logit_basketball:.4f}")
print(f"Corrupted logit for ' Basketball': {corrupted_logit_basketball:.4f}")

Clean logit for ' Basketball': 11.6198
Corrupted logit for ' Basketball': 9.7076


In [10]:
# Patch at the final token position (' of'), where the next-token prediction is read out.
# The token that differs between the prompts ('sport' vs. 'profession') sits one position earlier.
token_to_patch_idx = clean_tokens.shape[1] - 1

def patch_residual_stream_token_pos(target_layer):
    def patch_hook(clean_residual, hook):
        # Patch only at the relevant token position (the last one)
        clean_residual[:, token_to_patch_idx, :] = corrupted_cache[hook.name][:, token_to_patch_idx, :]
        return clean_residual

    patched_logits = model.run_with_hooks(
        clean_tokens,
        fwd_hooks=[(f"blocks.{target_layer}.hook_resid_post", patch_hook)]
    )

    return patched_logits[0, -1, answer_token].item()

# Test with the modified function for a specific layer
print(f"Logit for Basketball with token-level patch at Layer 5: {patch_residual_stream_token_pos(5):.4f}")

Logit for Basketball with token-level patch at Layer 5: 10.6327


In [11]:
results_token_patch = []
for layer in range(model.cfg.n_layers):
    logit_score = patch_residual_stream_token_pos(layer)
    results_token_patch.append({"layer": layer, "logit": logit_score})

df_token_patch = pd.DataFrame(results_token_patch)
print(df_token_patch)

    layer      logit
0       0  11.643492
1       1  11.771039
2       2  11.786404
3       3  11.112857
4       4  11.311896
5       5  10.632686
6       6  10.531447
7       7  10.219274
8       8   9.889150
9       9   9.869099
10     10   9.868301
11     11   9.707586


### Explanation of Token-Level Patching Results

Looking at `df_token_patch`:

*   **Initial Layers (0-2):** The logit for ' Basketball' stays at or slightly above `clean_logit_basketball` (11.64 to 11.79, versus 11.62 for the clean run). Patching the final position this early has almost no effect. This is consistent with the remaining layers still attending to the clean ' sport' token, which sits one position before the patched ' of' token, and recovering the clean signal from it.
*   **Mid-Layers (3-7):** The logit drops noticeably, from 11.11 at Layer 3 to 10.22 at Layer 7, with a small rebound at Layer 4 (11.31). This suggests that by these layers the final position already carries some of the 'sport' versus 'profession' distinction, so overwriting it with corrupted activations removes information that the fewer remaining layers only partly restore.
*   **Later Layers (8-11):** The logit continues to decrease, from 9.89 at Layer 8 to `9.707586` at Layer 11, which matches `corrupted_logit_basketball` (9.7076). The match at Layer 11 is expected by construction: the residual stream after the last block at the final position is the only input to the final layer norm and unembedding for that position, so replacing it reproduces the corrupted prediction.
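
To compare layers on a common scale, the patched logits can be rescaled so that 0 means "same as the clean run" and 1 means "same as the corrupted run". The following is a minimal sketch using the values printed above; `corruption_fraction` is a hypothetical helper name introduced here for illustration, not part of `transformer_lens`.

```python
# Values copied from the printed outputs above.
clean_logit = 11.6198
corrupted_logit = 9.7076
patched_logits = [
    11.643492, 11.771039, 11.786404, 11.112857, 11.311896, 10.632686,
    10.531447, 10.219274, 9.889150, 9.869099, 9.868301, 9.707586,
]


def corruption_fraction(patched, clean=clean_logit, corrupted=corrupted_logit):
    """Fraction of the clean-to-corrupted logit gap closed by the patch.

    0.0 means the patch left the clean logit unchanged;
    1.0 means the patch reproduced the corrupted logit.
    Negative values mean the patched logit ended up above the clean one.
    """
    return (clean - patched) / (clean - corrupted)


fractions = [corruption_fraction(p) for p in patched_logits]

for layer, frac in enumerate(fractions):
    print(f"layer {layer:2d}: {frac:+.3f}")
```

On this scale, the early layers sit near zero (slightly negative, since their patched logits exceed the clean logit) and the last layer sits at essentially 1.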

**What does this tell us?**

Unlike the previous full-residual patching, where the logit immediately dropped to the corrupted value, this token-level patching shows a **gradual decrease** (with small upticks at Layers 1, 2, and 4). Because only the final ' of' position is patched, the clean 'sport' token one position earlier remains visible to every layer after the patch. The gradual decline therefore indicates that the information distinguishing 'sport' from 'profession' is moved into the final position over several layers rather than in a single one, with the largest drops appearing when the patch moves from Layer 2 to Layer 3 and from Layer 4 to Layer 5. Patching at Layer 11 leaves no later layer to restore the clean signal, so the prediction at the final position becomes identical to the corrupted run.
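
One rough way to see where the decline is concentrated is to look at how much the patched logit changes when the patch point moves one layer later. The sketch below does this with the values from `df_token_patch`; `step_changes` is a hypothetical name, and reading a large step as "this layer matters most" is an assumption, since the step also reflects how later layers react to the patch.

```python
# Patched logits per layer, copied from the df_token_patch output above.
patched_logits = [
    11.643492, 11.771039, 11.786404, 11.112857, 11.311896, 10.632686,
    10.531447, 10.219274, 9.889150, 9.869099, 9.868301, 9.707586,
]

# step_changes[i] is the change in logit when the patch moves
# from layer i to layer i + 1 (negative means the logit dropped).
step_changes = [
    later - earlier
    for earlier, later in zip(patched_logits[:-1], patched_logits[1:])
]

for i, delta in enumerate(step_changes):
    print(f"layer {i:2d} -> {i + 1:2d}: {delta:+.4f}")

# Layer whose inclusion in the patched region causes the largest drop.
largest_drop_layer = min(range(len(step_changes)), key=step_changes.__getitem__) + 1
print(f"Largest single-step drop occurs when patching reaches layer {largest_drop_layer}")
```

A follow-up experiment could patch attention and MLP outputs separately at the layers with the largest steps to see which component carries the change.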