# **Transformer Interpretability**

In this coding homework, you will:
* Implement a single attention head.
* Implement an induction copy head by combining a previous token head with a copying head.

---

## Refactored Version

This notebook uses the refactored `transformer_interpretability` package, which follows:
- **PEP 8**: Python style conventions
- **PEP 484**: Type hints
- **Google Python Style Guide**: Code organization and docstrings
- **NumPy Style Guide**: Documentation format

See `REFACTORING_REPORT.md` for detailed documentation of changes.

## **Imports & Preliminaries**

In [None]:
import json
import sys
from pathlib import Path

import numpy as np

# Add parent directory to path for local imports
sys.path.insert(0, str(Path.cwd().parent))

from transformer_interpretability import single_attention_head, induction_copy_head
from transformer_interpretability.utils.constants import DEFAULT_SEED, EPSILON

# Set random seed for reproducibility
np.random.seed(DEFAULT_SEED)

## **Question 1: A Single Attention Head**

In this question, you'll implement an attention head.

### Specifications

- The query-key matrices are provided already multiplied together
  (i.e., we provide as input $W_{QK} = W_Q^\top W_K$, called `wqk`).
- The output-value matrices are provided already multiplied together
  (i.e., we provide as input $W_{OV} = W_O^\top W_V$, called `wov`).
- The attention inputs are provided as a list of `d_model`-length vectors,
  called `attn_input`. The list elements correspond to positions in the context.
- The desired output is what the attention head produces at each position:
  one `d_model`-length vector per input position.
- **Causal masking:** When implementing attention, mask out positions that come
  after the current position by setting their attention scores to a very large
  negative value (e.g., `-1e9`, standing in for negative infinity) before
  applying softmax.
- Convert the inputs to NumPy arrays as a first step.
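
The specification above can be read as the following minimal sketch. It is not the package's implementation: the name `causal_attention_sketch` is hypothetical, and the conventions used (scores computed as `x @ wqk @ x.T`, values as `x @ wov.T`, no scaling by the square root of the dimension) are assumptions, chosen because they reproduce the expected output of the basic test in this notebook.

```python
import numpy as np


def causal_attention_sketch(attn_input, wqk, wov):
    """Hypothetical sketch of one causal attention head (not the package code)."""
    x = np.asarray(attn_input, dtype=float)  # (seq_len, d_model)
    wqk = np.asarray(wqk, dtype=float)       # (d_model, d_model)
    wov = np.asarray(wov, dtype=float)       # (d_model, d_model)

    # Assumed convention: scores[i, j] = x_i^T W_QK x_j.
    scores = x @ wqk @ x.T

    # Causal mask: position i may not attend to positions j > i.
    seq_len = x.shape[0]
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -1e9, scores)

    # Hand-written softmax over each row, shifted by the row maximum.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    # Assumed convention: the value at position j is W_OV x_j.
    values = x @ wov.T
    return weights @ values


x = [[0, 1], [1, 1], [1, 2]]
w = [[1, 1], [0, 0]]
# Rows are approximately [1, 0], [1.7311, 0], [2.5752, 0].
print(causal_attention_sketch(x, w, w))
```

The mask is applied before the softmax, so masked positions receive a weight of zero and each row still sums to one over the visible positions.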

### Note

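- Do **not** call a ready-made softmax (for example `scipy.special.softmax`; NumPy itself has no `np.softmax`) when turning the attention scores into attention weights. Implement it yourself.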
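
A hand-written softmax usually subtracts the maximum logit before exponentiating, which leaves the result unchanged but avoids overflow. The sketch below illustrates that idea; `stable_softmax` is a hypothetical name and is not necessarily how the package's `softmax` is written.

```python
import numpy as np


def stable_softmax(logits, axis=-1):
    """Hypothetical numerically stable softmax along `axis`."""
    logits = np.asarray(logits, dtype=float)
    # Subtracting the maximum keeps every exponent <= 0, so np.exp cannot overflow.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


# A naive np.exp(1000.0) overflows to inf; the shifted version stays finite.
print(stable_softmax([1000.0, 1001.0]))
# A masked score of -1e9 ends up with a weight of zero.
print(stable_softmax([1.0, 2.0, -1e9]))
```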

---

### Implementation

The implementation is in `transformer_interpretability/core/attention.py`.

Key functions:
- `single_attention_head(attn_input, wqk, wov)` - Main attention computation
- `softmax(logits, axis)` - Numerically stable softmax
- `create_causal_mask(seq_len)` - Generate causal attention mask

In [None]:
# View the function signature and docstring
help(single_attention_head)

### Test `single_attention_head` - Basic

In [None]:
# Define test inputs
attn_input = [
    [0, 1],
    [1, 1],
    [1, 2],
]
wqk = [
    [1, 1],
    [0, 0],
]
wov = [
    [1, 1],
    [0, 0],
]

expected_output = [
    [1.0, 0.0],
    [1.7310585786300048, 0.0],
    [2.5752103826044417, 0.0],
]

# Run the attention head
output = single_attention_head(attn_input, wqk, wov).tolist()

# Verify output
assert np.isclose(expected_output, output, atol=EPSILON).all(), (
    f"Failed:\nExpected: {expected_output}\nGot: {output}"
)
print("Test case passed ✅")

### Test `single_attention_head` - Full Suite

Run against all provided test cases from the JSON file.

In [None]:
# Load test cases
test_file = Path("tests/test_cases/single_attention_head_test_cases.json")

if test_file.exists():
    with open(test_file) as f:
        test_cases = json.load(f)

    # Run all test cases
    for test_case_id, test_data in test_cases.items():
        attn_input, wqk, wov, expected_output = test_data
        output = single_attention_head(attn_input, wqk, wov).tolist()
        case_num = test_case_id.split()[-1]

        assert np.isclose(expected_output, output, atol=EPSILON).all(), (
            f"Test Case {case_num} Failed:\nExpected: {expected_output}\nGot: {output}"
        )

    print("All test cases passed ✅")
else:
    print(f"⚠️  Test file not found: {test_file}")
    print("Run from the correct directory or clone the course repository.")

## **Question 2: An Induction Copy Head**

In this problem, you will combine a previous token head with a copying head.

### Background

An induction head operates by predicting that previously-seen adjacencies in
the sequence will be seen again. That is, it predicts `ab...a` will be
followed by `b` (for any `a`, `b`).
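
At the token level, the rule described above can be sketched without any attention machinery. The helper `induction_prediction` is hypothetical, and using the most recent earlier occurrence when a token appears more than once is an assumption made for this illustration.

```python
def induction_prediction(tokens):
    """Return the token that followed the most recent earlier occurrence
    of the last token, or None if the last token has not appeared before."""
    current = tokens[-1]
    # Scan earlier positions from right to left, skipping the final position.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None


print(induction_prediction(["a", "b", "c", "d", "a"]))  # b
print(induction_prediction(["a", "b", "c", "d"]))       # None
```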

### Specifications

#### Vocabulary and Embeddings
- Vocabulary size: 4
- Tokens: `a`, `b`, `c`, `d`
- Maximum sequence length: 5
- Embedding: 2-hot encoded with `d_model = 9`
  - First 4 dimensions: 1-hot encoding of token vocabulary
  - Next 5 dimensions: 1-hot encoding of position (0-4)
- The unembeddings are the same as the embeddings. The model produces an
  output vector of length 4 which encodes the logits on `a,b,c,d` respectively.
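
The 2-hot embedding described above can be built as in the following sketch. The helper name `two_hot_embed` is hypothetical; it only illustrates the layout (4 token dimensions followed by 5 position dimensions).

```python
import numpy as np

VOCAB = "abcd"  # token order used for the first 4 dimensions
MAX_LEN = 5     # positions 0-4 fill the next 5 dimensions


def two_hot_embed(sequence):
    """Hypothetical helper: one row per position, shape (len(sequence), 9)."""
    if len(sequence) > MAX_LEN:
        raise ValueError(f"sequence longer than {MAX_LEN} tokens")
    token_ids = [VOCAB.index(token) for token in sequence]
    token_part = np.eye(len(VOCAB))[token_ids]         # 1-hot token
    position_part = np.eye(MAX_LEN)[: len(sequence)]   # 1-hot position
    return np.hstack([token_part, position_part])


# Last row: token `a` (dimension 0) at position 4 (dimension 8).
print(two_hot_embed("abcda"))
```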

#### Induction Head Mechanism
Your implementation should consist of two stages:

1. **Previous Token Head:** Uses position information to let each position
   attend to the position directly before it, so it picks up the preceding token.
2. **Induction Head:** Takes the output from the previous token head and
   copies the token that followed an earlier occurrence of the current token.

Together, these implement the pattern: `a,b,...,a -> b` for all `a`, `b`.
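
The first stage can be sketched as a query-key matrix that only looks at the position dimensions. This is an illustration under assumptions, not the package's `_build_previous_token_matrices`: the name `previous_token_weights`, the layout of `wqk`, and the score convention `x @ wqk @ x.T` are all hypothetical.

```python
import numpy as np

VOCAB_SIZE, MAX_LEN = 4, 5
D_MODEL = VOCAB_SIZE + MAX_LEN


def previous_token_weights(embeddings, attention_strength=10.0):
    """Hypothetical sketch: attention weights of a previous token head."""
    x = np.asarray(embeddings, dtype=float)
    seq_len = x.shape[0]

    # Score a query at position p against a key at position p - 1.
    wqk = np.zeros((D_MODEL, D_MODEL))
    for p in range(1, MAX_LEN):
        wqk[VOCAB_SIZE + p, VOCAB_SIZE + p - 1] = attention_strength

    scores = x @ wqk @ x.T
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -1e9, scores)  # causal mask

    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


token_ids = [0, 1, 2, 3, 0]  # the sequence a, b, c, d, a
x = np.hstack([np.eye(VOCAB_SIZE)[token_ids], np.eye(MAX_LEN)])
# Each row i >= 1 puts almost all of its weight on column i - 1.
print(np.round(previous_token_weights(x), 3))
```

Position 0 has no predecessor, so under the causal mask it can only attend to itself.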

---

### Implementation

The implementation is in `transformer_interpretability/core/induction_heads.py`.

Key functions:
- `induction_copy_head(embeddings, attention_strength)` - Main induction mechanism
- `_build_previous_token_matrices(d_model, attention_strength)` - Construct PTH weights
- `_build_copying_matrices(d_model, vocab_size, attention_strength)` - Construct copy head weights

### Example Walkthrough

For the sequence `[a, b, c, d, a]`, the prediction should upweight `b`.

**Step-by-step:**

1. **Previous Token Head** creates outputs where position $i$ contains
   information about the token at $i - 1$.

2. After deleting position 0 and adding to token embeddings (without position
   info), we have representations that know both "what token is here" and
   "what token came before".

3. **Induction Head** at the final position sees token `a` and looks for an
   earlier position whose predecessor information is also `a`.

4. It finds the position directly after the first `a` (position 0), because
   that position recorded `a` as its predecessor.

5. It copies the token at that position, which is `b`.

6. The output should thus have its highest value for token `b`.
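
The walkthrough above can be traced with a small sketch of the second stage. It is idealized and makes several assumptions: the previous token head's output is replaced by an exact one-position shift, only the final position is computed, and the name `induction_output_sketch` is hypothetical. It is not expected to reproduce the exact numbers returned by `induction_copy_head`.

```python
import numpy as np

VOCAB = "abcd"


def induction_output_sketch(sequence, attention_strength=10.0):
    """Hypothetical sketch of the induction step at the final position."""
    token_ids = [VOCAB.index(token) for token in sequence]
    tok = np.eye(len(VOCAB))[token_ids]  # (seq_len, 4) one-hot tokens

    # Idealized stage-1 output: row i holds the token at i - 1 (row 0 is empty).
    prev = np.vstack([np.zeros((1, len(VOCAB))), tok[:-1]])

    # Query: the final token. Key: the predecessor stored at each position.
    scores = attention_strength * (prev @ tok[-1])
    scores = scores - scores.max()
    weights = np.exp(scores)
    weights = weights / weights.sum()

    # Value: the token at each attended position.
    return weights @ tok, weights


output, weights = induction_output_sketch("abcda")
print(np.round(weights, 4))  # attention concentrates on position 1
print(VOCAB[int(np.argmax(output))])  # b
```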

In [None]:
# View the function signature and docstring
help(induction_copy_head)

### Test `induction_copy_head` - Basic

In [None]:
# Define test inputs: sequence [a, b, c, d, a]
embeddings = [
    [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0],  # a at position 0
    [0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0],  # b at position 1
    [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0],  # c at position 2
    [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0],  # d at position 3
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # a at position 4
]
attention_strength = 10.0

# Run the induction head
output = induction_copy_head(embeddings, attention_strength).tolist()

# Expected: high probability for 'b' (index 1)
expected_output = [0.000045, 0.999864, 0.000045, 0.000045]

# Verify output
assert np.isclose(expected_output, output, atol=EPSILON).all(), (
    f"Failed:\nExpected: {expected_output}\nGot: {output}"
)
print("Test case passed ✅")
print("\nPredicted token probabilities:")
print(f"  a: {output[0]:.6f}")
print(f"  b: {output[1]:.6f}  ← highest (correct!)")
print(f"  c: {output[2]:.6f}")
print(f"  d: {output[3]:.6f}")

### Test `induction_copy_head` - Full Suite

In [None]:
# Load test cases
test_file = Path("tests/test_cases/induction_head_test_cases.json")

if test_file.exists():
    with open(test_file) as f:
        test_cases = json.load(f)

    # Run all test cases
    for test_case_id, test_data in test_cases.items():
        embeddings, attention_strength, expected_output = test_data
        output = induction_copy_head(embeddings, attention_strength).tolist()
        case_num = test_case_id.split()[-1]

        assert np.isclose(expected_output, output, atol=EPSILON).all(), (
            f"Test Case {case_num} Failed:\nExpected: {expected_output}\nGot: {output}"
        )

    print("All test cases passed ✅")
else:
    print(f"⚠️  Test file not found: {test_file}")
    print("Run from the correct directory or clone the course repository.")

---

## Summary

In this notebook, we explored transformer interpretability concepts:

1. **Single Attention Head**: Implemented causal self-attention with:
   - Pre-multiplied QK and OV matrices
   - Causal masking to prevent attending to future positions
   - Numerically stable softmax

2. **Induction Copy Head**: Combined two specialized heads:
   - Previous Token Head: Uses positional encoding to copy predecessor info
   - Copying Head: Matches patterns and copies following tokens

These mechanisms are building blocks for understanding how transformers perform
in-context learning. Olsson et al. (2022), listed below, study induction heads
as a candidate mechanism for it.

### References

- [Attention Is All You Need](https://arxiv.org/abs/1706.03762) - Vaswani et al., 2017
- [A Mathematical Framework for Transformer Circuits](https://transformer-circuits.pub/2021/framework/index.html) - Elhage et al., 2021
- [In-context Learning and Induction Heads](https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html) - Olsson et al., 2022