# 🤖 Project 2: The Autocomplete Bot

**Objective:** Understand how Large Language Models predict the next token.

## 📖 What You'll Learn

- How LSTMs generate text character-by-character
- What attention mechanisms do in Transformers
- How sampling strategies (temperature, top-k) affect creativity
- The architecture differences between RNNs, LSTMs, and Transformers

## 🎯 Learning Goals

1. Build and train a character-level LSTM text generator
2. Visualize attention weights in a pre-trained Transformer
3. Experiment with different text generation strategies
4. Understand why Transformers revolutionized NLP

## Setup and Imports

In [None]:
# Install required packages (run once)
# !pip install torch transformers bertviz numpy matplotlib pandas

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt
from transformers import GPT2LMHeadModel, GPT2Tokenizer, BertModel, BertTokenizer
from bertviz import head_view
import warnings
warnings.filterwarnings('ignore')

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

# Check if GPU is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"🖥️  Using device: {device}")

## Part 1: Build a Character-Level LSTM

We'll build a simple LSTM that learns to generate text one character at a time.
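
To make "one character at a time" concrete, here is a minimal sketch (not part of the project code) that feeds a toy sequence through `nn.LSTM` step by step while carrying the hidden state forward. The layer sizes and the random input are assumptions chosen only for illustration. Under those assumptions, stepwise processing should match processing the whole sequence in a single call, which is what lets a recurrent model generate text incrementally.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes, chosen only for illustration (not the project's settings)
lstm = nn.LSTM(input_size=4, hidden_size=8, num_layers=1, batch_first=True)
x = torch.randn(1, 5, 4)  # (batch, seq_len, features)

with torch.no_grad():
    # Process the whole sequence in one call
    full_out, _ = lstm(x)

    # Process one time step at a time, passing the (h, c) state along
    hidden = None
    step_outs = []
    for t in range(x.size(1)):
        out_t, hidden = lstm(x[:, t:t + 1, :], hidden)
        step_outs.append(out_t)
    step_out = torch.cat(step_outs, dim=1)

# Both routes should agree up to floating-point tolerance
same = torch.allclose(full_out, step_out, atol=1e-5)
print(f"Stepwise output matches full-sequence output: {same}")
```

The `hidden` tuple is the model's only memory of earlier characters, so everything the LSTM knows about the past has to fit in that fixed-size state.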

### Task 1.1: Prepare Training Data

In [None]:
# Sample training text - AI/ML definitions
text = """machine learning is the study of computer algorithms that improve automatically through experience. 
deep learning is part of machine learning based on artificial neural networks. 
the transformer is a deep learning model that uses self attention mechanisms. 
attention is all you need for modern natural language processing. 
reinforcement learning trains agents to make sequential decisions. 
supervised learning uses labeled data to train predictive models."""

text = text.lower()  # Normalize to lowercase

# Create character-to-index mappings
chars = sorted(list(set(text)))
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for i, ch in enumerate(chars)}

vocab_size = len(chars)
sequence_length = 40  # How many characters to look back

print(f"📚 Text length: {len(text)} characters")
print(f"🔤 Vocabulary size: {vocab_size} unique characters")
print(f"📝 Characters: {''.join(chars)}")
print(f"\n🎯 Sequence length: {sequence_length}")

In [None]:
# Create training sequences
def create_sequences(text, seq_length):
    """
    Create input-output pairs for training.
    Input: sequence of characters
    Output: next character
    """
    X = []  # Input sequences
    y = []  # Target characters (next char)
    
    for i in range(len(text) - seq_length):
        sequence = text[i:i + seq_length]
        target = text[i + seq_length]
        
        # Convert to indices
        X.append([char_to_idx[ch] for ch in sequence])
        y.append(char_to_idx[target])
    
    return np.array(X), np.array(y)

X_train, y_train = create_sequences(text, sequence_length)

print(f"✅ Created {len(X_train)} training sequences")
print(f"\n📊 Example:")
print(f"   Input:  '{text[:sequence_length]}'")
print(f"   Target: '{text[sequence_length]}'")

### Task 1.2: Define the LSTM Model

In [None]:
class CharLSTM(nn.Module):
    """
    Character-level LSTM for text generation.
    
    Architecture:
    1. Embedding layer: Converts character indices to dense vectors
    2. LSTM layer: Processes sequences with memory
    3. Fully connected layer: Outputs probability distribution over characters
    """
    def __init__(self, vocab_size, embedding_dim=64, hidden_dim=128, num_layers=2):
        super(CharLSTM, self).__init__()
        
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers
        
        # Embedding layer: character index -> dense vector
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        
        # LSTM layer(s)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers, 
                           batch_first=True, dropout=0.2)
        
        # Output layer: hidden state -> character probabilities
        self.fc = nn.Linear(hidden_dim, vocab_size)
    
    def forward(self, x, hidden=None):
        # x shape: (batch_size, sequence_length)
        
        # Embed characters
        embedded = self.embedding(x)  # (batch, seq, embedding_dim)
        
        # Pass through LSTM
        if hidden is None:
            lstm_out, hidden = self.lstm(embedded)
        else:
            lstm_out, hidden = self.lstm(embedded, hidden)
        
        # Take the last output for prediction
        last_output = lstm_out[:, -1, :]  # (batch, hidden_dim)
        
        # Project to vocabulary
        logits = self.fc(last_output)  # (batch, vocab_size)
        
        return logits, hidden

# Initialize model
model = CharLSTM(vocab_size, embedding_dim=64, hidden_dim=128, num_layers=2)
model = model.to(device)

print("🏗️  Model Architecture:")
print(model)
print(f"\n📊 Total parameters: {sum(p.numel() for p in model.parameters()):,}")

### Task 1.3: Train the LSTM

In [None]:
# Training configuration
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.005)
num_epochs = 100
batch_size = 32

# Convert to PyTorch tensors
X_train_tensor = torch.LongTensor(X_train).to(device)
y_train_tensor = torch.LongTensor(y_train).to(device)

# Training loop
losses = []
print("🚀 Starting training...\n")

for epoch in range(num_epochs):
    model.train()
    total_loss = 0
    
    # Mini-batch training
    for i in range(0, len(X_train_tensor), batch_size):
        batch_X = X_train_tensor[i:i+batch_size]
        batch_y = y_train_tensor[i:i+batch_size]
        
        # Forward pass
        optimizer.zero_grad()
        outputs, _ = model(batch_X)
        loss = criterion(outputs, batch_y)
        
        # Backward pass
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
    
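    # Divide by the actual number of batches (ceil division),
    # so the final partial batch is counted too
    num_batches = (len(X_train_tensor) + batch_size - 1) // batch_size
    avg_loss = total_loss / num_batches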
    losses.append(avg_loss)
    
    if (epoch + 1) % 10 == 0:
        print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {avg_loss:.4f}")

print("\n✅ Training complete!")

In [None]:
# Plot training loss
plt.figure(figsize=(10, 5))
plt.plot(losses, linewidth=2)
plt.title('Training Loss Over Time', fontsize=14, fontweight='bold')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

### Task 1.4: Generate Text with Different Sampling Strategies

In [None]:
def generate_text(model, seed_text, length=200, temperature=1.0, top_k=None):
    """
    Generate text using the trained LSTM.
    
    Args:
        seed_text: Initial text to start generation
        length: Number of characters to generate
        temperature: Controls randomness (higher = more random)
        top_k: If set, only sample from top-k most likely characters
    """
    model.eval()
    
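    # Lowercase the seed to match the training vocabulary, then left-pad short
    # seeds with spaces so the first context window is sequence_length long
    seed_text = seed_text.lower()
    if len(seed_text) < sequence_length:
        seed_text = seed_text.rjust(sequence_length)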
    
    generated = seed_text
    
    with torch.no_grad():
        for _ in range(length):
            # Get last sequence_length characters
            context = generated[-sequence_length:]
            
            # Convert to indices
            x = torch.LongTensor([[char_to_idx[ch] for ch in context]]).to(device)
            
            # Get predictions
            logits, _ = model(x)
            
            # Apply temperature
            logits = logits / temperature
            probs = torch.softmax(logits, dim=-1).cpu().numpy()[0]
            
            # Apply top-k filtering if specified
            if top_k is not None:
                top_k_indices = np.argpartition(probs, -top_k)[-top_k:]
                top_k_probs = probs[top_k_indices]
                top_k_probs = top_k_probs / top_k_probs.sum()  # Renormalize
                
                next_idx = np.random.choice(top_k_indices, p=top_k_probs)
            else:
                # Sample from full distribution
                next_idx = np.random.choice(len(probs), p=probs)
            
            # Append predicted character
            generated += idx_to_char[next_idx]
    
    return generated

# Test with different settings
seed = "machine learning is"

print("="*80)
print("🎲 Experimenting with Sampling Strategies")
print("="*80)

print(f"\n1️⃣  Low Temperature (t=0.5) - More Deterministic:")
print("-" * 80)
print(generate_text(model, seed, length=150, temperature=0.5))

print(f"\n\n2️⃣  High Temperature (t=1.5) - More Creative:")
print("-" * 80)
print(generate_text(model, seed, length=150, temperature=1.5))

print(f"\n\n3️⃣  Top-K Sampling (k=5) - Balanced:")
print("-" * 80)
print(generate_text(model, seed, length=150, temperature=1.0, top_k=5))

## Part 2: Visualize Transformer Attention

Now let's explore how modern Transformers use attention mechanisms.
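
Before loading a real model, it may help to see the core computation in isolation. The sketch below is a simplified, single-head version of scaled dot-product attention with a causal mask, written in NumPy. The function name `causal_self_attention` and the random `Q`, `K`, `V` matrices are assumptions for illustration; in a real Transformer these matrices come from learned linear projections of the token embeddings, and many heads run in parallel.

```python
import numpy as np

def causal_self_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask.

    Q, K, V: arrays of shape (seq_len, d_k). Illustrative only.
    """
    seq_len, d_k = Q.shape

    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = Q @ K.T / np.sqrt(d_k)  # (seq_len, seq_len)

    # Causal mask: position i may not look at positions j > i
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    # Row-wise softmax (subtract the row max for numerical stability)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    # Each output row is a weighted average of the value vectors
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
output, weights = causal_self_attention(Q, K, V)
print(np.round(weights, 2))
```

Each row of `weights` sums to 1 and is zero above the diagonal, which is the same lower-triangular pattern the GPT-2 heatmap in Task 2.2 shows.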

### Task 2.1: Load Pre-trained GPT-2

In [None]:
# Load GPT-2 (small version)
print("📥 Loading GPT-2...")
gpt2_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
gpt2_model = GPT2LMHeadModel.from_pretrained('gpt2', output_attentions=True)
gpt2_model.eval()

print("✅ GPT-2 loaded successfully!")
print(f"\n📊 Model info:")
print(f"   - Parameters: {sum(p.numel() for p in gpt2_model.parameters()):,}")
print(f"   - Layers: {gpt2_model.config.n_layer}")
print(f"   - Attention heads: {gpt2_model.config.n_head}")
print(f"   - Hidden size: {gpt2_model.config.n_embd}")

### Task 2.2: Visualize Attention Patterns

In [None]:
# Analyze a sentence
sentence = "The AI agent uses vector search to retrieve relevant documents from memory."

# Tokenize
inputs = gpt2_tokenizer(sentence, return_tensors='pt')
tokens = gpt2_tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])

print(f"📝 Sentence: {sentence}")
print(f"\n🔤 Tokens: {tokens}")
print(f"   (Total: {len(tokens)} tokens)")

In [None]:
# Get model outputs with attention
with torch.no_grad():
    outputs = gpt2_model(**inputs)
    attentions = outputs.attentions  # Tuple of attention matrices for each layer

print(f"✅ Attention tensors extracted")
print(f"   - Number of layers: {len(attentions)}")
print(f"   - Attention shape per layer: {attentions[0].shape}")
print(f"     (batch_size, num_heads, sequence_length, sequence_length)")

In [None]:
# Visualize attention from one layer
layer_to_visualize = 6  # 0-based index into the 12 layers, roughly the middle

attention_matrix = attentions[layer_to_visualize][0].mean(dim=0).numpy()  # Average across heads

plt.figure(figsize=(12, 10))
plt.imshow(attention_matrix, cmap='viridis', aspect='auto')
plt.colorbar(label='Attention Weight')
plt.title(f'Attention Patterns - Layer {layer_to_visualize}', fontsize=14, fontweight='bold')
plt.xlabel('Key Position (Source Token)')
plt.ylabel('Query Position (Target Token)')
plt.xticks(range(len(tokens)), tokens, rotation=90)
plt.yticks(range(len(tokens)), tokens)
plt.tight_layout()
plt.show()

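print("\n💡 How to read this:")
print("   - Row i shows how strongly token i attends to each token")
print("   - Brighter = stronger attention")
print("   - GPT-2 is causal, so tokens never attend to later positions (upper triangle is zero)")
print("   - Look for patterns: Does 'agent' attend to 'AI'?")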

### Task 2.3: Generate Text with GPT-2

In [None]:
def generate_with_gpt2(prompt, max_length=50, temperature=1.0, top_k=50, num_return=3):
    """
    Generate text using GPT-2 with different sampling strategies.
    """
    input_ids = gpt2_tokenizer.encode(prompt, return_tensors='pt')
    
    outputs = gpt2_model.generate(
        input_ids,
        max_length=max_length,
        temperature=temperature,
        top_k=top_k,
        num_return_sequences=num_return,
        do_sample=True,
        pad_token_id=gpt2_tokenizer.eos_token_id
    )
    
    # Decode each generated sequence back to text
    results = [
        gpt2_tokenizer.decode(output, skip_special_tokens=True)
        for output in outputs
    ]
    
    return results

# Test generation
prompt = "An AI agent is a system that"

print("="*80)
print(f"🎯 Prompt: '{prompt}'")
print("="*80)

print("\n🌡️  Temperature = 0.7 (Balanced)")
print("-"*80)
results = generate_with_gpt2(prompt, temperature=0.7, num_return=2)
for i, text in enumerate(results, 1):
    print(f"\n{i}. {text}")

print("\n\n🔥 Temperature = 1.5 (Creative)")
print("-"*80)
results = generate_with_gpt2(prompt, temperature=1.5, num_return=2)
for i, text in enumerate(results, 1):
    print(f"\n{i}. {text}")

## 🧪 Comparative Analysis: LSTM vs Transformer

In [None]:
print("="*80)
print("⚖️  LSTM vs Transformer: Key Differences")
print("="*80)

comparison = {
    "Aspect": [
        "Architecture",
        "Parallelization",
        "Long-range Dependencies",
        "Training Speed",
        "Context Understanding",
        "Parameters (typical)"
    ],
    "LSTM": [
        "Recurrent (sequential)",
        "❌ Must process sequentially",
        "⚠️ Limited (vanishing gradient)",
        "Slower (sequential)",
        "Local (limited history)",
        "Thousands to millions"
    ],
    "Transformer": [
        "Attention-based (parallel)",
        "✅ Parallel across positions in training",
        "✅ Excellent (self-attention)",
        "Faster (GPU-friendly)",
        "Global (full sequence)",
        "Millions to billions"
    ]
}

import pandas as pd
df = pd.DataFrame(comparison)
print(df.to_string(index=False))

print("\n💡 Why Transformers Won:")
print("   1. Self-attention allows looking at entire sequence at once")
print("   2. Parallelization enables training on massive datasets")
print("   3. Better at capturing long-range dependencies")
print("   4. Scales effectively with more data and compute")

## 🎯 Challenge Exercises

### Challenge 1: Implement Top-P (Nucleus) Sampling

Instead of top-k, implement top-p sampling, which samples from the smallest set of tokens whose cumulative probability reaches or exceeds p.

In [None]:
def top_p_sampling(probs, p=0.9):
    """
    Nucleus sampling: sample from smallest set of tokens with cumulative prob >= p.
    
    TODO: Implement this function
    Hints:
    - Sort probabilities in descending order
    - Compute cumulative sum
    - Find cutoff where cumsum >= p
    - Sample only from those tokens
    """
    # YOUR CODE HERE
    pass

### Challenge 2: Analyze Attention Heads

Different attention heads learn different patterns. Analyze what different heads focus on.

In [None]:
# TODO: Visualize individual attention heads from GPT-2
# - Pick a layer (e.g., layer 6)
# - Visualize each of the 12 heads separately
# - Do different heads focus on different patterns?
#   (e.g., syntactic vs semantic relationships)

# YOUR CODE HERE

### Challenge 3: Build a Word-Level LSTM

Modify the character-level LSTM to work at the word level instead.

In [None]:
# TODO: Create a word-level LSTM
# Steps:
# 1. Tokenize text into words
# 2. Create word-to-index mappings
# 3. Modify the LSTM to work with word embeddings
# 4. Train and compare to character-level model

# YOUR CODE HERE

## 🎓 Key Takeaways

### What You've Learned:

1. **LSTMs for Text Generation**:
   - Process sequences one step at a time
   - Maintain hidden state (memory)
   - Good for learning local patterns
   - Limited by sequential processing

2. **Transformer Architecture**:
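   - Self-attention: Each token can attend directly to other tokens in the sequence (in causal models like GPT-2, only to itself and earlier tokens)
   - Parallel processing: All positions are processed at once during training, which makes training much faster
   - Positional encoding: Adds the order information that attention alone lacks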
   - Multi-head attention: Learns multiple relationship types

3. **Sampling Strategies**:
   - **Temperature**: Controls randomness
     - Low (0.1-0.7): Focused, closer to deterministic
     - High (above 1.0): More random and diverse; 1.0 leaves the model's distribution unchanged
   - **Top-K**: Sample from K most likely tokens
   - **Top-P (Nucleus)**: Dynamic selection based on cumulative probability

4. **Attention Visualization**:
   - Shows what the model "focuses on"
   - Different heads learn different patterns
   - Helps understand model behavior
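
The temperature and top-k behavior summarized in item 3 can be sketched with plain NumPy. This is an illustrative example rather than the notebook's own sampling code: the helper names `softmax_with_temperature` and `top_k_filter` and the logits are made up for the demonstration.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()  # numerical stability; does not change the result
    exp = np.exp(scaled)
    return exp / exp.sum()

def top_k_filter(probs, k):
    """Keep the k most likely entries, zero out the rest, and renormalize."""
    probs = np.asarray(probs, dtype=np.float64)
    keep = np.argsort(probs)[-k:]  # indices of the k largest probabilities
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four tokens

for t in (0.5, 1.0, 1.5):
    print(f"temperature={t}:", np.round(softmax_with_temperature(logits, t), 3))

print("top-k (k=2):", np.round(top_k_filter(softmax_with_temperature(logits), k=2), 3))
```

Lower temperatures concentrate probability on the highest-scoring token, higher temperatures flatten the distribution, and top-k removes the unlikely tail before sampling.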

### Why This Matters for AI Agents:

- **Agents use LLMs**: Understanding how they generate text helps debug behavior
- **Sampling control**: Agents need predictable outputs in some contexts and creative ones in others
- **Attention as a debugging aid**: Visualizing attention hints at which tokens influence a prediction, though attention weights alone do not fully explain agent decision-making
- **Context windows**: Understanding how models process sequences informs prompt engineering

### Next Steps:

In Phase 2, you'll learn to give these LLMs "hands" by connecting them to tools and external knowledge!

## 📚 Additional Resources

- [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)
- [Attention Is All You Need (Original Paper)](https://arxiv.org/abs/1706.03762)
- [Understanding LSTM Networks](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)
- [Hugging Face Transformers Documentation](https://huggingface.co/docs/transformers/)
- [BertViz: Attention Visualization Tool](https://github.com/jessevig/bertviz)