# Week 1: Introduction to Transformers

## Notebook 01: Attention Mechanisms

Welcome to the agentic-llm-playbook! This notebook introduces attention, the core building block of transformer-based LLMs.

### Learning Objectives
- Understand the scaled dot-product attention formula
- Implement attention from scratch
- Visualize attention weights
- Explore causal masking for autoregressive generation

In [None]:
import torch
import numpy as np
import matplotlib.pyplot as plt
from llm_journey.models import ScaledDotProductAttention
from llm_journey.utils import set_seed

set_seed(42)

## 1. Scaled Dot-Product Attention

The attention mechanism computes a weighted sum of values based on the similarity between queries and keys:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Where:
- $Q$ (Query): What we're looking for
- $K$ (Key): What we're matching against
- $V$ (Value): The actual content to retrieve
- $d_k$: Dimension of the key vectors; dividing the scores by $\sqrt{d_k}$ keeps their magnitude from growing with the dimension
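
To see why the $\sqrt{d_k}$ term is there, the following minimal sketch compares raw and scaled dot products of random vectors. It assumes queries and keys with independent, unit-variance entries, which real learned projections only approximate; the names `raw` and `scaled` are illustrative and not part of `llm_journey`.

```python
import math
import torch

torch.manual_seed(0)

d_k = 64       # key dimension (illustrative choice)
n = 10_000     # number of random query/key pairs

# Assumption: entries are i.i.d. standard normal
q = torch.randn(n, d_k)
k = torch.randn(n, d_k)

# One dot product per (query, key) pair
raw = (q * k).sum(dim=-1)
scaled = raw / math.sqrt(d_k)

raw_var = raw.var().item()
scaled_var = scaled.var().item()

print(f"Variance of raw dot products:    {raw_var:.2f}")
print(f"Variance of scaled dot products: {scaled_var:.2f}")
```

Under this assumption the raw variance is close to `d_k` and the scaled variance is close to 1. Without scaling, large scores would push the softmax toward a near one-hot distribution and shrink its gradients.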

In [None]:
# Create simple query, key, value tensors with shape (batch_size, num_heads, seq_len, d_k)
batch_size = 1
num_heads = 1
seq_len = 5
d_k = 8

query = torch.randn(batch_size, num_heads, seq_len, d_k)
key = torch.randn(batch_size, num_heads, seq_len, d_k)
value = torch.randn(batch_size, num_heads, seq_len, d_k)

print(f"Query shape: {query.shape}")
print(f"Key shape: {key.shape}")
print(f"Value shape: {value.shape}")

In [None]:
# Apply attention
attention = ScaledDotProductAttention(dropout=0.0)
output, weights = attention(query, key, value)

print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {weights.shape}")

## 2. Visualizing Attention Weights

Let's visualize how each query position attends to different key positions.

In [None]:
plt.figure(figsize=(8, 6))
plt.imshow(weights[0, 0].detach().numpy(), cmap='viridis')
plt.colorbar(label='Attention Weight')
plt.title('Attention Weights Heatmap')
plt.xlabel('Key Position')
plt.ylabel('Query Position')
plt.tight_layout()
plt.show()
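
Each row of the heatmap above is a softmax output, so it should be a probability distribution over key positions. A minimal sketch of that property, using random stand-in scores rather than the output of `ScaledDotProductAttention`:

```python
import torch

torch.manual_seed(0)

# Stand-in for the scaled QK^T scores of a 5-token sequence (hypothetical values)
scores = torch.randn(5, 5)

# Softmax over the last dimension normalizes across key positions
demo_weights = torch.softmax(scores, dim=-1)

# Every query row should sum to 1 and contain no negative entries
row_sums = demo_weights.sum(dim=-1)
print(row_sums)
print(f"Minimum weight: {demo_weights.min().item():.4f}")
```

The same check can be applied to `weights[0, 0]` from the cell above, assuming the module normalizes over the last dimension.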

## 3. Causal Masking for Autoregressive Generation

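In autoregressive language modeling, each position predicts the next token from the tokens at or before it, so the model must not "look ahead" at later positions during training. Causal masking enforces this by blocking attention from each query position to every later key position.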
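
One common way to apply such a mask is to set the blocked scores to negative infinity before the softmax, so they receive exactly zero weight. The sketch below illustrates that idea on random stand-in scores; whether `ScaledDotProductAttention` handles its mask argument this way is an assumption, not something this notebook confirms.

```python
import torch

torch.manual_seed(0)

n_tokens = 5
# Stand-in for scaled QK^T scores (hypothetical values)
demo_scores = torch.randn(n_tokens, n_tokens)

# Lower-triangular mask: 1 where attention is allowed, 0 where it is blocked
demo_mask = torch.tril(torch.ones(n_tokens, n_tokens))

# Blocked positions get -inf, which the softmax maps to a weight of exactly 0
masked_scores = demo_scores.masked_fill(demo_mask == 0, float("-inf"))
masked_weights = torch.softmax(masked_scores, dim=-1)

print(masked_weights)
```

The first row has a single allowed position, so it puts all of its weight on token 0; later rows spread their weight over progressively more positions.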

In [None]:
# Create causal mask: lower triangular matrix of ones,
# reshaped to (1, 1, seq_len, seq_len) so it broadcasts over batch and heads
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).unsqueeze(0).unsqueeze(0)

print("Causal mask:")
print(causal_mask[0, 0])

# Apply attention with mask
output_masked, weights_masked = attention(query, key, value, causal_mask)

plt.figure(figsize=(8, 6))
plt.imshow(weights_masked[0, 0].detach().numpy(), cmap='viridis')
plt.colorbar(label='Attention Weight')
plt.title('Causal Masked Attention Weights')
plt.xlabel('Key Position')
plt.ylabel('Query Position')
plt.tight_layout()
plt.show()

## Exercises

1. Modify the attention visualization to show attention patterns for different random seeds
2. Experiment with different sequence lengths and observe how attention patterns change
3. Implement a function to compute attention scores manually and verify they match the module output
4. Create a custom mask that only allows attending to the previous 3 tokens (sliding window attention)

## Next Steps

Continue to Notebook 02 to learn about Multi-Head Attention and how it extends the basic attention mechanism.