# Language Model From Scratch - Complete Tutorial

**Author:** Anik Tahabilder  
**Project:** 14 of 22 - Kaggle ML Portfolio  
**Difficulty:** 9/10 | **Learning Value:** 10/10

---

## What Will You Learn?

This tutorial builds language models **from scratch**, progressing from simple to complex:

| Stage | Model | Complexity | Key Concept |
|------|-------|------------|-------------|
| 1 | N-gram Models | Basic | Count-based probability |
| 2 | Neural LM | Intermediate | Word embeddings |
| 3 | Attention | Advanced | Context-aware weights |
| 4 | Transformer | Expert | Self-attention, parallelization |

---

## The Evolution of Language Models

```
┌─────────────────────────────────────────────────────────────────────────┐
│                    LANGUAGE MODEL EVOLUTION                              │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   N-gram          Neural LM         Attention         Transformer       │
│   (1990s)         (2003)            (2014)            (2017)            │
│                                                                         │
│   ┌─────┐         ┌─────┐          ┌─────┐           ┌─────┐           │
│   │Count│  ───>   │ RNN │   ───>   │Attn │   ───>    │ ⚡  │           │
│   │Based│         │LSTM │          │ RNN │           │Self │           │
│   └─────┘         └─────┘          └─────┘           │Attn │           │
│                                                       └─────┘           │
│   P(w|history)    Hidden states    Weighted context   Parallel         │
│   from counts     capture context  dynamic weights    attention        │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
```

---

## Table of Contents

1. [Part 1: What is a Language Model?](#part1)
2. [Part 2: N-gram Language Models](#part2)
3. [Part 3: Word Embeddings](#part3)
4. [Part 4: The Attention Mechanism](#part4)
5. [Part 5: Transformer Architecture](#part5)
6. [Part 6: Building Mini-GPT](#part6)
7. [Part 7: Training & Text Generation](#part7)
8. [Part 8: Summary & Comparison](#part8)

---

<a id='part1'></a>
# Part 1: What is a Language Model?

---

## 1.1 Definition

A **Language Model** assigns probabilities to sequences of words.

```
Given: "The cat sat on the"
Predict: P(mat) = 0.3, P(floor) = 0.2, P(dog) = 0.01, ...
```

## 1.2 Why Language Models Matter

| Application | How LM Helps |
|-------------|-------------|
| **Text Generation** | Sample the next word repeatedly (as in GPT, ChatGPT) |
| **Machine Translation** | Choose fluent translations |
| **Speech Recognition** | Disambiguate similar sounds |
| **Autocomplete** | Predict next word |
| **Spelling Correction** | "teh" → "the" |

## 1.3 The Core Problem

**Goal:** Estimate P(w₁, w₂, ..., wₙ) - probability of a sentence

Using **chain rule**:
```
P(w₁, w₂, w₃, w₄) = P(w₁) × P(w₂|w₁) × P(w₃|w₁,w₂) × P(w₄|w₁,w₂,w₃)
```

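**Problem:** As the sequence grows, the number of possible histories grows exponentially, so most histories never appear in the training data and their probabilities cannot be estimated from counts.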

**Solution:** Make simplifying assumptions (N-gram) or learn representations (Neural)
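
The chain rule and the blow-up in history size can be sketched in a few lines. This is a minimal illustration, assuming made-up conditional probabilities and an assumed vocabulary size of 10,000; none of the numbers are estimated from a corpus. Summing log-probabilities rather than multiplying raw probabilities avoids numerical underflow on long sentences.

```python
import math

# Hypothetical conditional probabilities for "the cat sat down"
# (illustrative values only, not estimated from data).
cond_probs = [
    0.20,  # P(the)
    0.05,  # P(cat | the)
    0.10,  # P(sat | the, cat)
    0.30,  # P(down | the, cat, sat)
]

# Chain rule: the sentence probability is the product of the conditionals.
# Working in log space keeps long products numerically stable.
log_prob = sum(math.log(p) for p in cond_probs)
sentence_prob = math.exp(log_prob)
print(f"P(sentence) = {sentence_prob:.6f}")

# With a vocabulary of V words there are V**k distinct histories of
# length k, each needing its own probability estimate.
V = 10_000  # assumed vocabulary size
for k in (1, 2, 3, 5):
    print(f"history length {k}: {V**k:.1e} possible histories")
```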

In [None]:
# ============================================================
# SETUP AND IMPORTS
# ============================================================
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import defaultdict, Counter
import re
import random
import math
import os
import warnings
warnings.filterwarnings('ignore')

# PyTorch for neural models
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

# Set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Display settings
plt.style.use('seaborn-v0_8-whitegrid')

# Seed every RNG used below (random.choice, np.random.choice, torch)
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)

print("="*70)
print("LANGUAGE MODEL FROM SCRATCH - TUTORIAL")
print("="*70)
print(f"PyTorch: {torch.__version__}")
print(f"Device: {device}")
print("\nAll libraries loaded!")

In [None]:
# ============================================================
# LOAD TEXT DATA
# ============================================================
print("="*70)
print("LOADING TEXT CORPUS")
print("="*70)

# Sample corpus for demonstration
# Using famous quotes and simple sentences for clear learning

corpus = """
The quick brown fox jumps over the lazy dog.
A journey of a thousand miles begins with a single step.
To be or not to be that is the question.
All that glitters is not gold.
The only thing we have to fear is fear itself.
In the beginning was the word and the word was with god.
It was the best of times it was the worst of times.
The cat sat on the mat.
The dog chased the cat around the house.
I think therefore I am.
Knowledge is power.
Time flies like an arrow.
The early bird catches the worm.
Actions speak louder than words.
Practice makes perfect.
The pen is mightier than the sword.
Where there is a will there is a way.
Rome was not built in a day.
When in Rome do as the Romans do.
A picture is worth a thousand words.
The grass is always greener on the other side.
Birds of a feather flock together.
Two heads are better than one.
The apple does not fall far from the tree.
You cannot judge a book by its cover.
Every cloud has a silver lining.
A stitch in time saves nine.
Better late than never.
Curiosity killed the cat.
Fortune favors the bold.
"""

# Preprocessing function
def preprocess_text(text):
    """Clean and tokenize text."""
    # Lowercase
    text = text.lower()
    # Remove punctuation except periods (sentence boundaries)
    text = re.sub(r'[^a-z\s.]', '', text)
    # Split into sentences
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    # Tokenize each sentence
    tokenized = [s.split() for s in sentences]
    return tokenized

sentences = preprocess_text(corpus)

# Create vocabulary
all_words = [word for sent in sentences for word in sent]
vocab = sorted(set(all_words))
word_to_idx = {word: idx for idx, word in enumerate(vocab)}
idx_to_word = {idx: word for word, idx in word_to_idx.items()}

print(f"Number of sentences: {len(sentences)}")
print(f"Total words: {len(all_words)}")
print(f"Vocabulary size: {len(vocab)}")
print(f"\nSample sentences:")
for i, sent in enumerate(sentences[:5]):
    print(f"  {i+1}. {' '.join(sent)}")

print(f"\nFirst 20 vocabulary words:")
print(vocab[:20])

---

<a id='part2'></a>
# Part 2: N-gram Language Models

---

## 2.1 The N-gram Assumption

**Markov Assumption:** The probability of a word depends only on the previous N-1 words.

```
┌────────────────────────────────────────────────────────────────┐
│                        N-GRAM MODELS                           │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│  Sentence: "The cat sat on the mat"                           │
│                                                                │
│  UNIGRAM (N=1): P(word) - Each word independent               │
│  ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐             │
│  │ the │ │ cat │ │ sat │ │ on  │ │ the │ │ mat │             │
│  └─────┘ └─────┘ └─────┘ └─────┘ └─────┘ └─────┘             │
│                                                                │
│  BIGRAM (N=2): P(word | prev_word)                            │
│  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐             │
│  │the→cat │ │cat→sat │ │sat→on  │ │on→the  │ ...           │
│  └─────────┘ └─────────┘ └─────────┘ └─────────┘             │
│                                                                │
│  TRIGRAM (N=3): P(word | prev_2_words)                        │
│  ┌───────────────┐ ┌───────────────┐                          │
│  │the,cat→sat   │ │cat,sat→on    │ ...                       │
│  └───────────────┘ └───────────────┘                          │
│                                                                │
└────────────────────────────────────────────────────────────────┘
```
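
The diagram above can be reproduced in code. The helper below is a hypothetical sketch (`extract_ngrams` is not part of the tutorial's classes) that pads a sentence with `<s>` and `</s>` and lists each (context, word) pair an N-gram model would count.

```python
def extract_ngrams(tokens, n):
    """Return (context, word) pairs, padding the start with n-1 <s> tokens."""
    padded = ["<s>"] * (n - 1) + tokens + ["</s>"]
    pairs = []
    for i in range(n - 1, len(padded)):
        context = tuple(padded[i - n + 1:i])  # the previous n-1 tokens
        pairs.append((context, padded[i]))
    return pairs

tokens = "the cat sat on the mat".split()
for n in (1, 2, 3):
    print(f"n={n}:", extract_ngrams(tokens, n)[:3])
```

For the unigram case the context is the empty tuple, which matches the "each word independent" row of the diagram.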

## 2.2 N-gram Probability Formula

| N-gram | Formula | Example |
|--------|---------|--------|
| Unigram | P(w) = C(w) / N | P(the) = 10/100 |
| Bigram | P(w₂\|w₁) = C(w₁,w₂) / C(w₁) | P(cat\|the) = C(the,cat)/C(the) |
| Trigram | P(w₃\|w₁,w₂) = C(w₁,w₂,w₃) / C(w₁,w₂) | P(sat\|the,cat) |
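
As a quick check of the bigram formula, here is a minimal sketch on the single sentence "the cat sat on the mat". It skips the `<s>`/`</s>` padding for brevity, so it is a simplification of what a full model would do.

```python
from collections import Counter

tokens = "the cat sat on the mat".split()

unigram_counts = Counter(tokens)                  # C(w1)
bigram_counts = Counter(zip(tokens, tokens[1:]))  # C(w1, w2)

def bigram_prob(prev, word):
    """MLE estimate P(word | prev) = C(prev, word) / C(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

# "the" occurs twice: once before "cat", once before "mat"
print(bigram_prob("the", "cat"))
print(bigram_prob("the", "mat"))
print(bigram_prob("cat", "sat"))
```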

In [None]:
# ============================================================
# N-GRAM LANGUAGE MODEL FROM SCRATCH
# ============================================================
print("="*70)
print("N-GRAM LANGUAGE MODEL")
print("="*70)

class NGramLanguageModel:
    """
    N-gram Language Model implementation.
    
    Supports:
    - Unigram (n=1)
    - Bigram (n=2)  
    - Trigram (n=3)
    - Any N-gram
    """
    
    def __init__(self, n=2):
        self.n = n
        self.ngram_counts = defaultdict(Counter)
        self.context_counts = Counter()
        self.vocab = set()
        
    def train(self, sentences):
        """
        Train the N-gram model on sentences.
        
        For each sentence, we:
        1. Add start tokens <s> for context
        2. Count all n-grams
        3. Count all (n-1)-gram contexts
        """
        for sentence in sentences:
            # Pad with n-1 start tokens and one end token
            padded = ['<s>'] * (self.n - 1) + sentence + ['</s>']
            
            # Update vocabulary
            self.vocab.update(sentence)
            
            # Count n-grams
            for i in range(len(padded) - self.n + 1):
                # Context is the first n-1 words
                context = tuple(padded[i:i + self.n - 1])
                # Target is the last word
                word = padded[i + self.n - 1]
                
                self.ngram_counts[context][word] += 1
                self.context_counts[context] += 1
    
    def probability(self, word, context):
        """
        Calculate P(word | context).
        
        P(word | context) = Count(context, word) / Count(context)
        """
        k = self.n - 1
        # Take last n-1 words; for n=1 the context is empty (context[-0:] would keep everything)
        context = tuple(context[-k:]) if k > 0 else ()
        
        if self.context_counts[context] == 0:
            # Unseen context - return uniform probability
            return 1 / len(self.vocab)
        
        count = self.ngram_counts[context][word]
        total = self.context_counts[context]
        
        return count / total if count > 0 else 0
    
    def probability_smoothed(self, word, context, alpha=1.0):
        """
        Calculate P(word | context) with Laplace (add-alpha) smoothing.
        
        P(word | context) = (Count(context, word) + α) / (Count(context) + α × |V|)
        
        This prevents zero probabilities for unseen n-grams.
        """
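        k = self.n - 1
        context = tuple(context[-k:]) if k > 0 else ()  # context[-0:] would keep everything

        # Look up without inserting: indexing a defaultdict with an unseen
        # context would create an empty entry as a side effect
        count = self.ngram_counts[context][word] if context in self.ngram_counts else 0
        total = self.context_counts[context]
        V = len(self.vocab) + 1  # +1 for </s>; <s> is only ever context, never predicted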
        
        return (count + alpha) / (total + alpha * V)
    
    def generate(self, start_words=None, max_length=20, temperature=1.0):
        """
        Generate text using the trained model.
        
        Temperature:
        - Low (0.1): More deterministic, picks high probability words
        - High (2.0): More random, explores diverse options
        """
        if start_words is None:
            # Start with start tokens
            current = ['<s>'] * (self.n - 1)
        else:
            current = ['<s>'] * (self.n - 1) + start_words
        
        generated = list(start_words) if start_words else []
        
        for _ in range(max_length):
            k = self.n - 1
            context = tuple(current[-k:]) if k > 0 else ()  # current[-0:] would keep the whole list

            # Get probability distribution
            if self.context_counts[context] == 0:
                # Unseen context: fall back to a random vocabulary word
                next_word = random.choice(sorted(self.vocab))
            else:
                words = list(self.ngram_counts[context].keys())
                counts = np.array(list(self.ngram_counts[context].values()), dtype=float)
                
                # Apply temperature
                probs = counts ** (1 / temperature)
                probs = probs / probs.sum()
                
                next_word = np.random.choice(words, p=probs)
            
            if next_word == '</s>':
                break
                
            generated.append(next_word)
            current.append(next_word)
        
        return generated
    
    def perplexity(self, sentences):
        """
        Calculate perplexity on test sentences.
        
        Perplexity = 2^(-average log probability)
        Lower perplexity = better model
        """
        log_prob_sum = 0
        word_count = 0
        
        for sentence in sentences:
            padded = ['<s>'] * (self.n - 1) + sentence + ['</s>']
            
            for i in range(self.n - 1, len(padded)):
                context = padded[i - self.n + 1:i]
                word = padded[i]
                
                prob = self.probability_smoothed(word, context)
                log_prob_sum += math.log2(prob) if prob > 0 else -100
                word_count += 1
        
        return 2 ** (-log_prob_sum / word_count)

print("NGramLanguageModel class created!")
print("\nKey methods:")
print("  - train(sentences): Learn n-gram counts")
print("  - probability(word, context): Get P(word|context)")
print("  - generate(start_words): Generate new text")
print("  - perplexity(sentences): Evaluate model")
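
The add-alpha formula used in `probability_smoothed` can be checked by hand. The sketch below uses hypothetical counts over a five-word vocabulary (the helper name `add_alpha_prob` and the counts are assumptions for illustration). It shows that unseen words receive a small non-zero probability and that the smoothed values still sum to 1.

```python
def add_alpha_prob(count, context_total, vocab_size, alpha=1.0):
    """(count + alpha) / (context_total + alpha * vocab_size)"""
    return (count + alpha) / (context_total + alpha * vocab_size)

# Hypothetical follow-up counts for one context; three words were never seen.
follow_counts = {"cat": 3, "dog": 1, "mat": 0, "sat": 0, "on": 0}
total = sum(follow_counts.values())
V = len(follow_counts)

smoothed = {w: add_alpha_prob(c, total, V) for w, c in follow_counts.items()}
for w, p in smoothed.items():
    print(f"P({w} | context) = {p:.3f}")
print("sum =", sum(smoothed.values()))
```

Note how much mass moves to unseen words even in this small case: with a large vocabulary and few observations, add-one smoothing flattens the distribution heavily.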

In [None]:
# ============================================================
# TRAIN AND COMPARE N-GRAM MODELS
# ============================================================
print("="*70)
print("TRAINING N-GRAM MODELS")
print("="*70)

# Train models with different N
models = {}
for n in [1, 2, 3, 4]:
    model = NGramLanguageModel(n=n)
    model.train(sentences)
    models[n] = model
    print(f"\n{n}-gram model trained")
    print(f"  Unique contexts: {len(model.context_counts)}")

# Demonstrate bigram probabilities
print("\n" + "="*50)
print("BIGRAM PROBABILITIES")
print("="*50)

bigram = models[2]

# Show probabilities after "the"
context = ('the',)
print(f"\nAfter 'the', most likely words:")
if context in bigram.ngram_counts:
    sorted_words = sorted(bigram.ngram_counts[context].items(), 
                         key=lambda x: x[1], reverse=True)
    for word, count in sorted_words[:10]:
        prob = bigram.probability(word, ['the'])
        print(f"  P({word}|the) = {count}/{bigram.context_counts[context]} = {prob:.3f}")

# Visualize bigram distribution
print("\n" + "="*50)
print("BIGRAM VISUALIZATION")
print("="*50)

In [None]:
# Visualize top bigrams
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Top context words
ax1 = axes[0]
top_contexts = bigram.context_counts.most_common(15)
contexts, counts = zip(*top_contexts)
context_labels = [' '.join(c) for c in contexts]
ax1.barh(context_labels, counts, color='steelblue', edgecolor='black')
ax1.set_xlabel('Count')
ax1.set_title('Most Common Bigram Contexts', fontweight='bold')
ax1.invert_yaxis()

# Probability distribution after "the"
ax2 = axes[1]
context = ('the',)
if context in bigram.ngram_counts:
    sorted_words = sorted(bigram.ngram_counts[context].items(), 
                         key=lambda x: x[1], reverse=True)[:10]
    words, counts = zip(*sorted_words)
    probs = [c / bigram.context_counts[context] for c in counts]
    ax2.barh(words, probs, color='coral', edgecolor='black')
    ax2.set_xlabel('P(word | "the")')
    ax2.set_title('Probability Distribution After "the"', fontweight='bold')
    ax2.invert_yaxis()

plt.tight_layout()
plt.show()

In [None]:
# ============================================================
# TEXT GENERATION WITH N-GRAMS
# ============================================================
print("="*70)
print("TEXT GENERATION")
print("="*70)

for n in [2, 3, 4]:
    print(f"\n{n}-gram model generations:")
    print("-" * 40)
    for i in range(3):
        generated = models[n].generate(start_words=['the'], max_length=10, temperature=0.8)
        print(f"  {i+1}. {' '.join(generated)}")

# Show effect of temperature
print("\n" + "="*50)
print("EFFECT OF TEMPERATURE")
print("="*50)

trigram = models[3]
for temp in [0.3, 1.0, 2.0]:
    print(f"\nTemperature = {temp}:")
    for i in range(3):
        generated = trigram.generate(start_words=['the'], max_length=8, temperature=temp)
        print(f"  {' '.join(generated)}")
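
To see what temperature does to a distribution in isolation, here is a small sketch that applies the same `counts ** (1 / temperature)` rescaling used in `generate` to a hypothetical set of counts.

```python
import numpy as np

def apply_temperature(counts, temperature):
    """Rescale counts as counts**(1/T) and renormalize to a distribution."""
    scaled = np.asarray(counts, dtype=float) ** (1.0 / temperature)
    return scaled / scaled.sum()

counts = [6, 3, 1]  # hypothetical follow-up counts for one context
for t in (0.3, 1.0, 2.0):
    print(f"T={t}:", apply_temperature(counts, t).round(3))
```

At `T=1.0` the distribution is just the normalized counts. Lower temperatures concentrate mass on the most frequent word, and higher temperatures flatten it.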

In [None]:
# ============================================================
# MODEL EVALUATION - PERPLEXITY
# ============================================================
print("="*70)
print("MODEL EVALUATION - PERPLEXITY")
print("="*70)

print("""
PERPLEXITY EXPLAINED:
=====================
- Measures how "surprised" the model is by test data
- Lower perplexity = better model
- A perplexity of k means the model is as uncertain as if it chose
  uniformly among k words at each position

Formula: PP(W) = 2^(-1/N × Σ log₂ P(wᵢ|context)), where N is the number of predicted tokens
""")

# Calculate perplexity for different N
test_sentences = sentences[::3]  # Every 3rd sentence; these were also used for training, so this is not a held-out test

perplexities = {}
for n in [1, 2, 3, 4]:
    pp = models[n].perplexity(test_sentences)
    perplexities[n] = pp
    print(f"{n}-gram perplexity: {pp:.2f}")

# Visualize
plt.figure(figsize=(8, 5))
plt.bar(list(perplexities.keys()), list(perplexities.values()), 
        color='steelblue', edgecolor='black')
plt.xlabel('N (N-gram order)')
plt.ylabel('Perplexity (lower is better)')
plt.title('Perplexity vs N-gram Order', fontweight='bold')
plt.xticks([1, 2, 3, 4])
plt.show()

print("\nInsight: these sentences were also seen during training, so the numbers")
print("measure fit, not generalization. On held-out text, higher N often does worse")
print("because most long contexts are unseen.")
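
The "uniform choice" reading of perplexity can be verified directly. This is a minimal standalone sketch (separate from the `perplexity` method above) that takes a list of per-token probabilities; the probability values are made up for illustration.

```python
import math

def perplexity(probs):
    """Perplexity = 2 ** (-(1/N) * sum(log2 p_i)) over the token probabilities."""
    avg_log2 = sum(math.log2(p) for p in probs) / len(probs)
    return 2 ** (-avg_log2)

# A model that always assigns 1/8 to the correct word is as uncertain
# as a uniform choice among 8 words.
uniform = [1 / 8] * 5
print(perplexity(uniform))

# More confident (and correct) predictions give lower perplexity.
confident = [0.5, 0.9, 0.8, 0.6, 0.7]
print(perplexity(confident))
```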

---

<a id='part3'></a>
# Part 3: Word Embeddings

---

## 3.1 Problem with N-grams

| Limitation | Description |
|------------|-------------|
| **Sparsity** | Most n-grams never seen in training |
| **No Generalization** | Words are unrelated symbols: "cat" is no closer to "dog" than to "car" |
| **Fixed Context** | Only the previous N-1 words are visible, so long-range dependencies are missed |

## 3.2 Word Embeddings Solution

Instead of treating words as discrete symbols, represent them as **dense vectors**.

```
One-hot (sparse):              Embedding (dense):
"cat" = [0,0,1,0,0,0,...]      "cat" = [0.2, -0.5, 0.8, 0.1, ...]
"dog" = [0,0,0,1,0,0,...]      "dog" = [0.3, -0.4, 0.7, 0.2, ...]
                                        ↑ Similar vectors!
```
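
The difference can be measured with cosine similarity. The sketch below reuses the illustrative vectors from the diagram above (truncated to the components shown); they are example values, not learned embeddings.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# One-hot vectors for two different words never overlap.
cat_onehot = np.array([0, 0, 1, 0, 0, 0])
dog_onehot = np.array([0, 0, 0, 1, 0, 0])

# Dense vectors, using the illustrative values from the diagram.
cat_dense = np.array([0.2, -0.5, 0.8, 0.1])
dog_dense = np.array([0.3, -0.4, 0.7, 0.2])

print("one-hot:", cosine(cat_onehot, dog_onehot))
print("dense:  ", round(cosine(cat_dense, dog_dense), 3))
```

Any two distinct one-hot vectors are orthogonal, so their similarity is always zero. Dense vectors can place related words close together.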

## 3.3 Key Property: Semantic Similarity

```
king - man + woman ≈ queen
Paris - France + Italy ≈ Rome
```

In [None]:
# ============================================================
# WORD EMBEDDINGS FROM SCRATCH
# ============================================================
print("="*70)
print("WORD EMBEDDINGS")
print("="*70)

class SimpleEmbedding:
    """
    Simple word embedding using co-occurrence matrix + SVD.
    
    Steps:
    1. Build word-word co-occurrence matrix
    2. Apply SVD to reduce dimensionality
    3. Use the top-k left singular vectors, scaled by the square root of the singular values, as embeddings
    """
    
    def __init__(self, embedding_dim=50, window_size=2):
        self.embedding_dim = embedding_dim
        self.window_size = window_size
        self.word_to_idx = {}
        self.idx_to_word = {}
        self.embeddings = None
        
    def fit(self, sentences):
        """Build embeddings from sentences."""
        # Build vocabulary
        all_words = [w for sent in sentences for w in sent]
        vocab = sorted(set(all_words))
        self.word_to_idx = {w: i for i, w in enumerate(vocab)}
        self.idx_to_word = {i: w for w, i in self.word_to_idx.items()}
        V = len(vocab)
        
        # Build co-occurrence matrix
        cooccurrence = np.zeros((V, V))
        
        for sentence in sentences:
            for i, word in enumerate(sentence):
                word_idx = self.word_to_idx[word]
                
                # Look at words in window
                start = max(0, i - self.window_size)
                end = min(len(sentence), i + self.window_size + 1)
                
                for j in range(start, end):
                    if i != j:
                        context_idx = self.word_to_idx[sentence[j]]
                        cooccurrence[word_idx, context_idx] += 1
        
        # Dampen large counts with log(1 + x) (loosely inspired by GloVe's use of log counts)
        cooccurrence = np.log1p(cooccurrence)
        
        # SVD
        U, S, Vt = np.linalg.svd(cooccurrence, full_matrices=False)
        
        # Use top-k singular vectors
        k = min(self.embedding_dim, len(S))
        self.embeddings = U[:, :k] * np.sqrt(S[:k])
        
        print(f"Built embeddings: {self.embeddings.shape}")
        
    def get_embedding(self, word):
        """Get embedding vector for a word."""
        if word not in self.word_to_idx:
            return None
        return self.embeddings[self.word_to_idx[word]]
    
    def most_similar(self, word, top_k=5):
        """Find most similar words using cosine similarity."""
        if word not in self.word_to_idx:
            return []
        
        word_vec = self.get_embedding(word)
        
        # Cosine similarity with all words
        similarities = []
        for other_word, idx in self.word_to_idx.items():
            if other_word != word:
                other_vec = self.embeddings[idx]
                # Cosine similarity
                sim = np.dot(word_vec, other_vec) / (np.linalg.norm(word_vec) * np.linalg.norm(other_vec) + 1e-8)
                similarities.append((other_word, sim))
        
        return sorted(similarities, key=lambda x: x[1], reverse=True)[:top_k]

# Train embeddings
emb = SimpleEmbedding(embedding_dim=30, window_size=2)
emb.fit(sentences)

# Show similar words
print("\nMost similar words:")
for word in ['the', 'cat', 'is', 'not']:
    similar = emb.most_similar(word, top_k=5)
    print(f"  '{word}' → {[(w, f'{s:.3f}') for w, s in similar]}")

In [None]:
# Visualize embeddings with t-SNE
print("="*70)
print("EMBEDDING VISUALIZATION")
print("="*70)

from sklearn.manifold import TSNE

# Get embeddings for visualization
words_to_plot = list(emb.word_to_idx.keys())[:50]  # First 50 words in alphabetical order
vectors = np.array([emb.get_embedding(w) for w in words_to_plot])

# t-SNE reduction to 2D
tsne = TSNE(n_components=2, random_state=42, perplexity=min(15, len(words_to_plot)-1))
vectors_2d = tsne.fit_transform(vectors)

# Plot
plt.figure(figsize=(12, 8))
plt.scatter(vectors_2d[:, 0], vectors_2d[:, 1], alpha=0.6, s=100)

for i, word in enumerate(words_to_plot):
    plt.annotate(word, (vectors_2d[i, 0], vectors_2d[i, 1]), fontsize=9)

plt.title('Word Embeddings Visualization (t-SNE)', fontweight='bold', fontsize=14)
plt.xlabel('t-SNE Dimension 1')
plt.ylabel('t-SNE Dimension 2')
plt.tight_layout()
plt.show()

print("\nNote: With a large corpus, similar words cluster together; on this tiny corpus expect noisy clusters.")

---

<a id='part4'></a>
# Part 4: The Attention Mechanism

---

## 4.1 Why Attention?

**Problem with RNNs/LSTMs:**
- Information bottleneck: Entire sequence compressed into fixed-size vector
- Long-range dependencies hard to capture
- Sequential processing (slow)

**Attention Solution:** 
Look at ALL positions and weight them by relevance!

## 4.2 Attention Intuition

```
Query: "What word should come next?"

Sentence: "The cat sat on the ___"

Attention looks at all words and asks:
- "The" → relevance = 0.1
- "cat" → relevance = 0.3  (subject, might be relevant)
- "sat" → relevance = 0.2  (action)
- "on" → relevance = 0.3   (preposition, next word is object)
- "the" → relevance = 0.1

Weighted sum → context vector → predict "mat"
```
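
The "weighted sum → context vector" step is a single matrix product. The sketch below uses the relevance weights from the example above and random 3-dimensional value vectors as hypothetical stand-ins for learned ones.

```python
import numpy as np

words = ["The", "cat", "sat", "on", "the"]
weights = np.array([0.1, 0.3, 0.2, 0.3, 0.1])  # relevance weights from the example

# Hypothetical value vectors, one row per word (random, for illustration only)
rng = np.random.default_rng(0)
values = rng.standard_normal((len(words), 3))

# The context vector is the relevance-weighted sum of the value vectors.
context = weights @ values
print("weights sum to:", weights.sum())
print("context vector shape:", context.shape)
```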

## 4.3 Attention Formula

```
Attention(Q, K, V) = softmax(Q × K^T / √d_k) × V

Where:
- Q (Query): What am I looking for?
- K (Key): What do I contain?
- V (Value): What information do I provide?
- d_k: Dimension of keys (for scaling)
```
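
Why divide by √d_k? For queries and keys with roughly unit-variance components, the dot product has a standard deviation of about √d_k, so unscaled scores grow with the dimension. The sketch below checks this empirically with random vectors (an assumption made for illustration; real queries and keys are learned).

```python
import numpy as np

rng = np.random.default_rng(0)

for d_k in (4, 64, 512):
    # 2,000 random query/key pairs with unit-variance components
    q = rng.standard_normal((2_000, d_k))
    k = rng.standard_normal((2_000, d_k))
    raw = (q * k).sum(axis=1)      # unscaled dot products
    scaled = raw / np.sqrt(d_k)    # scaled as in the formula above
    print(f"d_k={d_k}: raw std={raw.std():.2f}, scaled std={scaled.std():.2f}")
```

After scaling, the spread of the scores stays close to 1 regardless of d_k, which keeps the softmax out of its saturated region.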

In [None]:
# ============================================================
# ATTENTION MECHANISM FROM SCRATCH
# ============================================================
print("="*70)
print("ATTENTION MECHANISM FROM SCRATCH")
print("="*70)

def scaled_dot_product_attention(query, key, value, mask=None):
    """
    Scaled Dot-Product Attention.
    
    Args:
        query: (batch, seq_len, d_k)
        key: (batch, seq_len, d_k)
        value: (batch, seq_len, d_v)
        mask: Optional mask for padding/future tokens
    
    Returns:
        output: (batch, seq_len, d_v)
        attention_weights: (batch, seq_len, seq_len)
    """
    d_k = query.shape[-1]
    
    # Step 1: Compute attention scores
    # Q × K^T → (batch, seq_len, seq_len)
    scores = torch.matmul(query, key.transpose(-2, -1))
    
    # Step 2: Scale by sqrt(d_k)
    # Without scaling, large dot products saturate the softmax and give very small gradients
    scores = scores / math.sqrt(d_k)
    
    # Step 3: Apply mask (optional)
    # Set masked positions to a large negative value (-1e9) so softmax gives ~0 there
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    
    # Step 4: Softmax to get attention weights
    attention_weights = F.softmax(scores, dim=-1)
    
    # Step 5: Weighted sum of values
    output = torch.matmul(attention_weights, value)
    
    return output, attention_weights

# Demonstrate attention
print("\nDemonstrating Attention Mechanism:")
print("-" * 50)

# Example: 1 batch, 5 tokens, 4-dimensional embeddings
batch_size = 1
seq_len = 5
d_model = 4

# Random embeddings for "The cat sat on mat"
torch.manual_seed(42)
x = torch.randn(batch_size, seq_len, d_model)

# In self-attention, Q, K and V all come from the same input.
# Real models first apply separate learned projections; that step is omitted here.
query = key = value = x

output, attn_weights = scaled_dot_product_attention(query, key, value)

print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {attn_weights.shape}")
print(f"\nAttention weights (each row sums to 1):")
print(attn_weights[0].detach().numpy().round(3))

In [None]:
# Visualize attention weights
print("="*70)
print("ATTENTION VISUALIZATION")
print("="*70)

words = ['The', 'cat', 'sat', 'on', 'mat']

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Attention heatmap
ax1 = axes[0]
sns.heatmap(attn_weights[0].detach().numpy(), 
            xticklabels=words, yticklabels=words,
            annot=True, fmt='.2f', cmap='Blues', ax=ax1)
ax1.set_title('Attention Weights', fontweight='bold')
ax1.set_xlabel('Key (attending to)')
ax1.set_ylabel('Query (attending from)')

# Causal mask attention (for language models)
ax2 = axes[1]

# Create causal mask (lower triangular)
causal_mask = torch.tril(torch.ones(seq_len, seq_len))
_, causal_attn = scaled_dot_product_attention(query, key, value, mask=causal_mask)

sns.heatmap(causal_attn[0].detach().numpy(), 
            xticklabels=words, yticklabels=words,
            annot=True, fmt='.2f', cmap='Blues', ax=ax2)
ax2.set_title('Causal Attention (Can\'t See Future)', fontweight='bold')
ax2.set_xlabel('Key')
ax2.set_ylabel('Query')

plt.tight_layout()
plt.show()

print("\nNote: Causal attention only looks at the current and previous tokens (lower triangle).")
print("This is essential for language models - can't see future words!")

In [None]:
# ============================================================
# REAL-WORLD EXAMPLE: COREFERENCE RESOLUTION
# ============================================================
print("="*70)
print("ATTENTION FOR COREFERENCE RESOLUTION")
print("="*70)

print("""
COREFERENCE RESOLUTION:
=======================
When we see a pronoun like "it", attention helps determine what it refers to.

Example sentence:
"The animal didn't cross the street because it was too tired."

Question: What does "it" refer to?
- "it" could refer to "animal" or "street"
- Context tells us: "it was too tired" → animals get tired, streets don't
- So "it" = "animal"

Trained attention heads can learn this kind of link between a pronoun and its referent.
""")

# Simulate attention weights for coreference
coref_words = ['The', 'animal', 'didn\'t', 'cross', 'the', 'street', 'because', 'it', 'was', 'too', 'tired']

# Simulated attention weights when "it" is the query
# "it" should attend strongly to "animal" (0.63) and somewhat to "street" (0.31)
it_attention = np.array([0.00, 0.63, 0.00, 0.02, 0.00, 0.31, 0.02, 0.00, 0.01, 0.00, 0.01])

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar chart of attention weights
ax1 = axes[0]
colors = ['#90EE90' if w in ['animal', 'it'] else '#ADD8E6' if w == 'street' else 'lightgray' 
          for w in coref_words]
bars = ax1.bar(coref_words, it_attention, color=colors, edgecolor='black')
ax1.set_ylabel('Attention Weight')
ax1.set_title('What does "it" attend to?', fontweight='bold', fontsize=12)
ax1.set_xticklabels(coref_words, rotation=45, ha='right')

# Highlight the key relationship
ax1.annotate('', xy=(1, 0.63), xytext=(7, 0.4),
            arrowprops=dict(arrowstyle='->', color='blue', lw=2))
ax1.text(4, 0.55, '"it" → "animal"', fontsize=11, fontweight='bold', color='blue')

# Formula explanation
ax2 = axes[1]
ax2.axis('off')
ax2.text(0.5, 0.9, 'Softmax Attention Calculation', fontsize=14, fontweight='bold', 
         ha='center', transform=ax2.transAxes)

formula_text = '''
For query word "it", attention weights are:

           exp(score_i)
W_it = ─────────────────────
        Σ exp(score_j)

Result for "it":

┌────────┬────────┬────────┬────────┬────────┬────────┐
│  The   │ animal │ didn't │ cross  │  the   │ street │
├────────┼────────┼────────┼────────┼────────┼────────┤
│  0.00  │  0.63  │  0.00  │  0.02  │  0.00  │  0.31  │
└────────┴────────┴────────┴────────┴────────┴────────┘
              ↑                                   ↑
         HIGH weight                         Some weight
      (correct referent)                  (plausible but wrong)

These weights are hand-picked for illustration. A trained model
can learn to put most weight on "animal", the referent that
fits "was too tired" (animate, can be tired).
'''
ax2.text(0.1, 0.7, formula_text, fontsize=10, family='monospace',
         transform=ax2.transAxes, verticalalignment='top')

plt.tight_layout()
plt.show()

print("\nKey Insight:")
print("  - Attention weights sum to 1 (softmax)")
print("  - High weight (0.63) on 'animal' = strong reference")
print("  - Lower weight (0.31) on 'street' = possible but less likely")
print("  - The weights here are simulated; trained transformers can learn similar patterns")

---

<a id='part5'></a>
# Part 5: Transformer Architecture

---

## 5.1 The Transformer ("Attention Is All You Need")

```
┌─────────────────────────────────────────────────────────────────────────┐
│                         TRANSFORMER BLOCK                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│    Input                                                                │
│      │                                                                  │
│      ▼                                                                  │
│  ┌────────────────────┐                                                 │
│  │ Multi-Head         │  ← Self-attention with multiple "perspectives"│
│  │ Self-Attention     │                                                 │
│  └─────────┬──────────┘                                                 │
│            │                                                            │
│      ┌─────┴─────┐                                                      │
│      │  Add & Norm│  ← Residual connection + Layer Normalization       │
│      └─────┬─────┘                                                      │
│            │                                                            │
│  ┌─────────▼─────────┐                                                  │
│  │ Feed-Forward      │  ← Position-wise fully connected                │
│  │ Network           │    FFN(x) = ReLU(xW₁ + b₁)W₂ + b₂              │
│  └─────────┬─────────┘                                                  │
│            │                                                            │
│      ┌─────┴─────┐                                                      │
│      │  Add & Norm│                                                     │
│      └─────┬─────┘                                                      │
│            │                                                            │
│      Output                                                             │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
```

## 5.2 Key Components

| Component | Purpose |
|-----------|--------|
| **Multi-Head Attention** | Multiple attention "heads" capture different relationships |
| **Position Encoding** | Add position information (attention has no inherent order) |
| **Feed-Forward Network** | Add non-linearity, process each position |
| **Layer Normalization** | Stabilize training |
| **Residual Connections** | Help gradient flow, enable deep networks |

In [None]:
# ============================================================
# TRANSFORMER COMPONENTS FROM SCRATCH
# ============================================================
print("="*70)
print("TRANSFORMER COMPONENTS")
print("="*70)

class PositionalEncoding(nn.Module):
    """
    Positional Encoding using sine and cosine functions.
    
    PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    
    This gives each position a unique encoding and allows
    the model to learn relative positions.
    """
    
    def __init__(self, d_model, max_seq_len=5000):
        super().__init__()
        
        # Create positional encoding matrix
        pe = torch.zeros(max_seq_len, d_model)
        position = torch.arange(0, max_seq_len, dtype=torch.float).unsqueeze(1)
        
        # Compute the div term
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * 
                            (-math.log(10000.0) / d_model))
        
        # Apply sin to even indices, cos to odd indices (assumes d_model is even)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        
        # Add batch dimension and register as buffer
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)
        
    def forward(self, x):
        """Add positional encoding to input embeddings."""
        return x + self.pe[:, :x.size(1)]

print("1. PositionalEncoding created")
print("   - Adds position information to embeddings")
print("   - Uses sin/cos functions for smooth position representation")

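To make the sine/cosine formula concrete, here is a minimal pure-Python sketch that computes the encoding for a single position without PyTorch. The helper name `sinusoidal_encoding` is hypothetical (it is not part of the notebook), and it assumes an even `d_model`, which the class above also relies on.

```python
import math

def sinusoidal_encoding(pos, d_model):
    """Encoding vector for one position (hypothetical helper; d_model assumed even)."""
    enc = []
    for two_i in range(0, d_model, 2):
        # The sin/cos pair at indices 2i and 2i+1 shares one frequency
        angle = pos / (10000 ** (two_i / d_model))
        enc.append(math.sin(angle))  # even index 2i
        enc.append(math.cos(angle))  # odd index 2i+1
    return enc

print(sinusoidal_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
print(sinusoidal_encoding(1, 4))
```

At position 0 every sine term is 0 and every cosine term is 1; later positions rotate each pair at a different rate, which is what makes the encodings distinguishable.
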
In [None]:
class MultiHeadAttention(nn.Module):
    """
    Multi-Head Self-Attention.
    
    Instead of one attention, use multiple "heads" that each
    learn different aspects of relationships:
    - Head 1: Might learn syntactic relationships
    - Head 2: Might learn semantic relationships
    - Head 3: Might learn positional relationships
    """
    
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        
        # Linear projections for Q, K, V
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        
        # Output projection
        self.W_o = nn.Linear(d_model, d_model)
        
    def split_heads(self, x):
        """Split the last dimension into (num_heads, d_k)."""
        batch_size, seq_len, _ = x.size()
        x = x.view(batch_size, seq_len, self.num_heads, self.d_k)
        return x.transpose(1, 2)  # (batch, num_heads, seq_len, d_k)
    
    def combine_heads(self, x):
        """Combine heads back."""
        batch_size, _, seq_len, _ = x.size()
        x = x.transpose(1, 2).contiguous()
        return x.view(batch_size, seq_len, self.d_model)
    
    def forward(self, x, mask=None):
        """
        Forward pass.
        
        Args:
            x: (batch, seq_len, d_model)
            mask: Optional attention mask
        """
        # Linear projections
        Q = self.split_heads(self.W_q(x))
        K = self.split_heads(self.W_k(x))
        V = self.split_heads(self.W_v(x))
        
        # Scaled dot-product attention
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        
        attn_weights = F.softmax(scores, dim=-1)
        context = torch.matmul(attn_weights, V)
        
        # Combine heads and project
        output = self.W_o(self.combine_heads(context))
        
        return output, attn_weights

print("\n2. MultiHeadAttention created")
print("   - Multiple attention heads in parallel")
print("   - Each head can learn different relationships")

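The scaled dot-product step with a causal mask can also be traced by hand for a single head. The sketch below is a simplified pure-Python illustration, not the notebook's implementation: the function name `causal_attention_weights` is hypothetical, and instead of filling masked scores with `-1e9` it simply leaves future keys out of the softmax, which gives the same weights up to rounding.

```python
import math

def causal_attention_weights(Q, K):
    """Attention weights for one head; Q and K are lists of seq_len vectors."""
    d_k = len(K[0])
    weights = []
    for i, q in enumerate(Q):
        # Score query i against keys 0..i only (future keys are masked out)
        scores = [sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d_k)
                  for j in range(i + 1)]
        m = max(scores)                       # subtract max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        row = [e / total for e in exps] + [0.0] * (len(K) - i - 1)
        weights.append(row)
    return weights

toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # made-up 3-token, d_k=2 example
for row in causal_attention_weights(toy, toy):
    print([round(w, 3) for w in row])
```

Each row sums to 1, and the entries above the diagonal are exactly 0, so a token never attends to positions that come after it.
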
In [None]:
class FeedForward(nn.Module):
    """
    Position-wise Feed-Forward Network.
    
    FFN(x) = max(0, xW₁ + b₁)W₂ + b₂
    
    - Applied independently to each position
    - Adds non-linearity and increases model capacity
    - Inner dimension typically 4x the model dimension
    """
    
    def __init__(self, d_model, d_ff=None, dropout=0.1):
        super().__init__()
        
        if d_ff is None:
            d_ff = 4 * d_model
        
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x):
        return self.linear2(self.dropout(F.relu(self.linear1(x))))

print("\n3. FeedForward Network created")
print("   - Two linear layers with ReLU activation")
print("   - Expands then contracts dimensions")

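As a small worked example of `FFN(x) = max(0, xW₁ + b₁)W₂ + b₂`, the sketch below applies the formula to one position using plain lists. The helper names and the weight values are made up for illustration, and dropout is left out.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def linear(x, W, b):
    """x: length n_in, W: n_in x n_out nested list, b: length n_out."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]

def ffn(x, W1, b1, W2, b2):
    # Expand to the inner dimension, apply ReLU, then project back
    return linear(relu(linear(x, W1, b1)), W2, b2)

x = [1.0, -2.0]                                   # one position, d_model = 2
W1 = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]  # 2 -> 4
b1 = [0.0, 0.0, 0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]  # 4 -> 2
b2 = [0.5, -0.5]
print(ffn(x, W1, b1, W2, b2))  # [1.5, 1.5]
```

The hidden vector is `[1, -2, -1, 2]` before ReLU and `[1, 0, 0, 2]` after it. Because the same weights are applied to every position separately, the network is called position-wise.
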
In [None]:
class TransformerBlock(nn.Module):
    """
    Single Transformer Block.
    
    Architecture:
    x → MultiHead Attention → Add & Norm → FFN → Add & Norm → output
         └──────────────────────┘       └──────────────┘
              residual                    residual
    """
    
    def __init__(self, d_model, num_heads, d_ff=None, dropout=0.1):
        super().__init__()
        
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = FeedForward(d_model, d_ff, dropout)
        
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x, mask=None):
        # Multi-head attention with residual
        attn_output, attn_weights = self.attention(x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        
        # Feed-forward with residual
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))
        
        return x, attn_weights

print("\n4. TransformerBlock created")
print("   - Combines attention + FFN with residuals")
print("   - Layer normalization for stable training")

# Test the transformer block
print("\n" + "="*50)
print("TESTING TRANSFORMER BLOCK")
print("="*50)

d_model = 64
num_heads = 4
seq_len = 10
batch_size = 2

block = TransformerBlock(d_model, num_heads)
x = torch.randn(batch_size, seq_len, d_model)

# Create causal mask
mask = torch.tril(torch.ones(seq_len, seq_len)).unsqueeze(0).unsqueeze(0)

output, attn = block(x, mask)
print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Attention shape: {attn.shape}")

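The "Add & Norm" step used twice in the block can be sketched for a single position as follows. This is a simplified illustration with hypothetical helper names; it omits the learnable scale and shift of `nn.LayerNorm`, which start at 1 and 0, so it matches the layer only at initialization.

```python
import math

def layer_norm(v, eps=1e-5):
    """Normalize one vector to zero mean and (roughly) unit variance."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)   # biased variance
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def add_and_norm(x, sublayer_out):
    # Residual connection first, then layer normalization (post-norm order)
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])

out = add_and_norm([1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5])
print([round(a, 3) for a in out])
```
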
---

<a id='part6'></a>
# Part 6: Building Mini-GPT

---

## 6.1 GPT Architecture

GPT (Generative Pre-trained Transformer) is a **decoder-only** transformer:

```
┌─────────────────────────────────────────────────────────────┐
│                      MINI-GPT ARCHITECTURE                   │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│    Input Tokens: [The, cat, sat, on, the]                  │
│           │                                                 │
│           ▼                                                 │
│    ┌─────────────────┐                                      │
│    │ Token Embedding │  → Each token → d_model vector      │
│    └────────┬────────┘                                      │
│             │                                               │
│             ▼                                               │
│    ┌─────────────────┐                                      │
│    │ Position Embed  │  → Add position information         │
│    └────────┬────────┘                                      │
│             │                                               │
│             ▼                                               │
│    ┌─────────────────┐                                      │
│    │ Transformer     │                                      │
│    │ Block × N       │  → N stacked transformer blocks     │
│    └────────┬────────┘                                      │
│             │                                               │
│             ▼                                               │
│    ┌─────────────────┐                                      │
│    │ Output Linear   │  → Project to vocabulary size       │
│    └────────┬────────┘                                      │
│             │                                               │
│             ▼                                               │
│    Output Logits: [vocab_size] for each position           │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```

In [None]:
# ============================================================
# MINI-GPT LANGUAGE MODEL
# ============================================================
print("="*70)
print("BUILDING MINI-GPT")
print("="*70)

class MiniGPT(nn.Module):
    """
    Mini-GPT: A simplified GPT-style language model.
    
    Components:
    1. Token embeddings: Map tokens to vectors
    2. Position embeddings: Add position information
    3. Transformer blocks: Self-attention + FFN
    4. Output layer: Map to vocabulary
    """
    
    def __init__(self, vocab_size, d_model=128, num_heads=4, 
                 num_layers=4, max_seq_len=128, dropout=0.1):
        super().__init__()
        
        self.d_model = d_model
        self.max_seq_len = max_seq_len
        
        # Token embedding
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        
        # Positional encoding
        self.pos_encoding = PositionalEncoding(d_model, max_seq_len)
        
        # Transformer blocks
        self.blocks = nn.ModuleList([
            TransformerBlock(d_model, num_heads, dropout=dropout)
            for _ in range(num_layers)
        ])
        
        # Layer normalization
        self.ln_f = nn.LayerNorm(d_model)
        
        # Output projection
        self.output = nn.Linear(d_model, vocab_size)
        
        self.dropout = nn.Dropout(dropout)
        
        # Initialize weights
        self.apply(self._init_weights)
        
    def _init_weights(self, module):
        """Initialize weights (important for training stability)."""
        if isinstance(module, nn.Linear):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)
    
    def forward(self, x, targets=None):
        """
        Forward pass.
        
        Args:
            x: Input token indices (batch, seq_len)
            targets: Target token indices for loss calculation
        
        Returns:
            logits: Output logits (batch, seq_len, vocab_size)
            loss: Cross-entropy loss if targets provided
        """
        batch_size, seq_len = x.shape
        
        # Token embeddings
        x = self.token_embedding(x)  # (batch, seq_len, d_model)
        
        # Add positional encoding
        x = self.pos_encoding(x)
        x = self.dropout(x)
        
        # Create causal mask (can't see future tokens)
        mask = torch.tril(torch.ones(seq_len, seq_len, device=x.device))
        mask = mask.unsqueeze(0).unsqueeze(0)  # (1, 1, seq_len, seq_len)
        
        # Pass through transformer blocks
        for block in self.blocks:
            x, _ = block(x, mask)
        
        # Final layer norm
        x = self.ln_f(x)
        
        # Output projection
        logits = self.output(x)  # (batch, seq_len, vocab_size)
        
        # Calculate loss if targets provided
        loss = None
        if targets is not None:
            loss = F.cross_entropy(
                logits.view(-1, logits.size(-1)),
                targets.view(-1)
            )
        
        return logits, loss
    
    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        """
        Generate new tokens autoregressively.
        
        Args:
            idx: Starting token indices (batch, seq_len)
            max_new_tokens: Number of tokens to generate
            temperature: Sampling temperature
            top_k: If set, only sample from top-k tokens
        """
        for _ in range(max_new_tokens):
            # Crop to max_seq_len
            idx_cond = idx if idx.size(1) <= self.max_seq_len else idx[:, -self.max_seq_len:]
            
            # Get predictions
            logits, _ = self(idx_cond)
            logits = logits[:, -1, :] / temperature  # Last position only
            
            # Top-k sampling
            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = float('-inf')
            
            # Sample
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            
            # Append
            idx = torch.cat([idx, idx_next], dim=1)
        
        return idx

print("MiniGPT class created!")
print("\nArchitecture:")
print("  - Token Embedding → Position Encoding")
print("  - N × Transformer Blocks (Multi-Head Attention + FFN)")
print("  - Layer Norm → Output Projection")
print("\nKey Features:")
print("  - Causal masking (can't see future)")
print("  - Autoregressive generation")
print("  - Temperature + top-k sampling")

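The sampling logic inside `generate` can be checked on a tiny example. The sketch below is a pure-Python approximation with a hypothetical helper name, `top_k_probs`; it returns the probabilities that would be sampled from rather than drawing a token. Like the method above, it keeps any logits tied with the k-th largest value.

```python
import math

def top_k_probs(logits, k=None, temperature=1.0):
    """Probabilities after temperature scaling and optional top-k filtering."""
    scaled = [l / temperature for l in logits]
    if k is not None:
        kth = sorted(scaled, reverse=True)[min(k, len(scaled)) - 1]
        # Everything below the k-th largest logit gets zero probability
        scaled = [s if s >= kth else float('-inf') for s in scaled]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]   # made-up scores for a 3-token vocabulary
print(top_k_probs(logits, k=2))
print(top_k_probs(logits, temperature=0.3))
print(top_k_probs(logits, temperature=1.5))
```

With `k=2` the third token can never be sampled. Lowering the temperature concentrates probability on the highest-scoring token, while raising it flattens the distribution.
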
In [None]:
# Configure and instantiate the model
print("="*70)
print("MODEL CONFIGURATION")
print("="*70)

# Model hyperparameters
config = {
    'vocab_size': len(vocab) + 2,  # +2 for <pad> and <eos>
    'd_model': 64,
    'num_heads': 4,
    'num_layers': 3,
    'max_seq_len': 32,
    'dropout': 0.1
}

model = MiniGPT(**config).to(device)

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"\nModel Configuration:")
for k, v in config.items():
    print(f"  {k}: {v}")

print(f"\nTotal parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")

# Compare to GPT models
print("\nFor reference (GPT model sizes):")
print("  GPT-1: 117M parameters")
print("  GPT-2: 1.5B parameters")
print("  GPT-3: 175B parameters")
print("  GPT-4: not publicly disclosed (~1.7T is an unofficial estimate)")

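The parameter total reported above can be cross-checked by hand. The sketch below adds up the layers defined in `MiniGPT`, assuming every `nn.Linear` has a bias and that the sinusoidal positional encoding contributes nothing because it is a buffer. The function name and the vocabulary size of 100 are hypothetical; the real size depends on `vocab`.

```python
def count_minigpt_params(vocab_size, d_model, num_layers, d_ff=None):
    """Closed-form parameter count for the MiniGPT layout (hypothetical helper)."""
    if d_ff is None:
        d_ff = 4 * d_model
    embedding = vocab_size * d_model
    attention = 4 * (d_model * d_model + d_model)        # W_q, W_k, W_v, W_o
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    norms = 2 * (2 * d_model)                            # two LayerNorms per block
    block = attention + ffn + norms
    final_norm = 2 * d_model
    output = d_model * vocab_size + vocab_size
    return embedding + num_layers * block + final_norm + output

# Assumed vocabulary of 100 tokens with the notebook's d_model and num_layers
print(count_minigpt_params(vocab_size=100, d_model=64, num_layers=3))  # 162980
```
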
---

<a id='part7'></a>
# Part 7: Training & Text Generation

---

In [None]:
# ============================================================
# PREPARE TRAINING DATA
# ============================================================
print("="*70)
print("PREPARING TRAINING DATA")
print("="*70)

# Add special tokens to vocabulary
special_tokens = ['<pad>', '<eos>']
full_vocab = special_tokens + vocab
word_to_idx = {w: i for i, w in enumerate(full_vocab)}
idx_to_word = {i: w for w, i in word_to_idx.items()}

PAD_IDX = word_to_idx['<pad>']
EOS_IDX = word_to_idx['<eos>']

print(f"Vocabulary size: {len(full_vocab)}")
print(f"PAD token index: {PAD_IDX}")
print(f"EOS token index: {EOS_IDX}")

class TextDataset(Dataset):
    """Dataset for language model training."""
    
    def __init__(self, sentences, word_to_idx, seq_len=32):
        self.data = []
        self.seq_len = seq_len
        
        for sentence in sentences:
            # Convert to indices (unknown words fall back to PAD_IDX; no <unk> token is defined)
            indices = [word_to_idx.get(w, PAD_IDX) for w in sentence]
            indices.append(EOS_IDX)
            
            # Pad or truncate
            if len(indices) < seq_len + 1:
                indices = indices + [PAD_IDX] * (seq_len + 1 - len(indices))
            else:
                indices = indices[:seq_len + 1]
            
            self.data.append(indices)
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        seq = self.data[idx]
        x = torch.tensor(seq[:-1], dtype=torch.long)
        y = torch.tensor(seq[1:], dtype=torch.long)
        return x, y

# Create dataset and dataloader
dataset = TextDataset(sentences, word_to_idx, seq_len=config['max_seq_len'])
dataloader = DataLoader(dataset, batch_size=8, shuffle=True)

print(f"\nDataset size: {len(dataset)}")
print(f"Batches: {len(dataloader)}")

# Show sample
x, y = dataset[0]
print(f"\nSample input shape: {x.shape}")
print(f"Sample target shape: {y.shape}")
print(f"\nSample input (first 10): {[idx_to_word[i.item()] for i in x[:10]]}")
print(f"Sample target (first 10): {[idx_to_word[i.item()] for i in y[:10]]}")

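The relationship between inputs and targets is easiest to see on a short sequence. The following sketch reproduces the append-EOS, truncate-or-pad, and shift-by-one steps of `TextDataset` with plain lists. The helper name `make_example` and the token ids are made up, with `<pad>` at index 0 and `<eos>` at index 1 as above.

```python
def make_example(token_ids, seq_len, pad_idx=0, eos_idx=1):
    """Build one (input, target) pair the way TextDataset does."""
    ids = list(token_ids) + [eos_idx]
    ids = ids[:seq_len + 1]                          # truncate if too long
    ids += [pad_idx] * (seq_len + 1 - len(ids))      # pad if too short
    return ids[:-1], ids[1:]                         # target is input shifted by one

x, y = make_example([5, 6, 7], seq_len=5)
print(x)  # [5, 6, 7, 1, 0]
print(y)  # [6, 7, 1, 0, 0]
```

At every position the target is simply the next token of the input, so one sequence yields `seq_len` next-token prediction examples.
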
In [None]:
# ============================================================
# TRAIN THE MODEL
# ============================================================
print("="*70)
print("TRAINING MINI-GPT")
print("="*70)

# Reinitialize model with correct vocab size
config['vocab_size'] = len(full_vocab)
model = MiniGPT(**config).to(device)

# Optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Training loop
num_epochs = 100
losses = []

model.train()
for epoch in range(num_epochs):
    epoch_loss = 0
    
    for batch_x, batch_y in dataloader:
        batch_x = batch_x.to(device)
        batch_y = batch_y.to(device)
        
        # Forward pass
        logits, loss = model(batch_x, batch_y)
        
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # Gradient clipping
        optimizer.step()
        
        epoch_loss += loss.item()
    
    avg_loss = epoch_loss / len(dataloader)
    losses.append(avg_loss)
    
    if (epoch + 1) % 20 == 0:
        print(f"Epoch {epoch+1}/{num_epochs} | Loss: {avg_loss:.4f}")

print(f"\nTraining complete!")
print(f"Final loss: {losses[-1]:.4f}")

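Note that the loss in this loop averages over every position, including padded ones, so the model is also trained to predict `<pad>` after `<eos>`. A common variant, sketched here in pure Python as an assumption rather than the notebook's method, skips padded targets; in PyTorch the `ignore_index` argument of `F.cross_entropy` does the same job. The helper name `masked_cross_entropy` is hypothetical.

```python
import math

def masked_cross_entropy(logits, targets, pad_idx=0):
    """Mean cross-entropy (in nats) over positions whose target is not padding."""
    total, count = 0.0, 0
    for row, t in zip(logits, targets):
        if t == pad_idx:
            continue                                  # padded position: no loss
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[t]                       # -log softmax(row)[t]
        count += 1
    return total / count if count else 0.0

logits = [[0.0, 0.0, 0.0, 0.0], [10.0, 0.0, 0.0, 0.0]]  # made-up scores
targets = [2, 0]                                        # second target is <pad>
print(masked_cross_entropy(logits, targets))            # log(4), about 1.386
```

Only the first position contributes here, and a uniform prediction over four tokens gives a loss of `log(4)`.
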
In [None]:
# Plot training loss
plt.figure(figsize=(10, 5))
plt.plot(losses, color='steelblue', linewidth=2)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training Loss Over Time', fontweight='bold')
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# ============================================================
# TEXT GENERATION WITH MINI-GPT
# ============================================================
print("="*70)
print("TEXT GENERATION WITH MINI-GPT")
print("="*70)

def generate_text(model, start_words, max_tokens=15, temperature=1.0, top_k=10):
    """Generate text from starting words."""
    model.eval()
    
    # Convert start words to indices
    start_indices = [word_to_idx.get(w, PAD_IDX) for w in start_words]
    idx = torch.tensor([start_indices], dtype=torch.long, device=device)
    
    # Generate
    output = model.generate(idx, max_tokens, temperature=temperature, top_k=top_k)
    
    # Convert back to words
    words = [idx_to_word[i.item()] for i in output[0]]
    
    # Stop at EOS or PAD
    result = []
    for w in words:
        if w in ['<eos>', '<pad>']:
            break
        result.append(w)
    
    return ' '.join(result)

# Generate with different prompts
prompts = [
    ['the'],
    ['the', 'cat'],
    ['a'],
    ['is'],
]

print("\nGenerated text (temperature=0.8):")
print("-" * 50)
for prompt in prompts:
    for i in range(2):
        text = generate_text(model, prompt, max_tokens=12, temperature=0.8)
        print(f"  '{' '.join(prompt)}' → {text}")
    print()

# Show effect of temperature
print("\nEffect of temperature:")
print("-" * 50)
for temp in [0.3, 0.7, 1.0, 1.5]:
    text = generate_text(model, ['the'], max_tokens=10, temperature=temp)
    print(f"  Temperature {temp}: {text}")

In [None]:
# ============================================================
# VISUALIZE ATTENTION PATTERNS
# ============================================================
print("="*70)
print("ATTENTION PATTERN VISUALIZATION")
print("="*70)

def get_attention_weights(model, text):
    """Get attention weights for a text sequence."""
    model.eval()
    
    words = text.split()
    indices = [word_to_idx.get(w, PAD_IDX) for w in words]
    x = torch.tensor([indices], dtype=torch.long, device=device)
    
    # Get attention from first block
    with torch.no_grad():
        # Get embeddings
        emb = model.token_embedding(x)
        emb = model.pos_encoding(emb)
        
        # Get attention from first block
        mask = torch.tril(torch.ones(len(words), len(words), device=device))
        mask = mask.unsqueeze(0).unsqueeze(0)
        _, attn = model.blocks[0](emb, mask)
    
    return attn[0].cpu().numpy(), words

# Visualize attention for a sentence
test_text = "the cat sat on the mat"
attn_weights, words = get_attention_weights(model, test_text)

fig, axes = plt.subplots(1, 4, figsize=(16, 4))

for head in range(min(4, attn_weights.shape[0])):
    ax = axes[head]
    sns.heatmap(attn_weights[head], xticklabels=words, yticklabels=words,
                ax=ax, cmap='Blues', annot=True, fmt='.2f', cbar=False)
    ax.set_title(f'Head {head+1}', fontweight='bold')
    ax.set_xlabel('Key')
    if head == 0:
        ax.set_ylabel('Query')

plt.suptitle(f'Attention Patterns for "{test_text}"', fontweight='bold', fontsize=14)
plt.tight_layout()
plt.show()

print("\nEach head can learn a different attention pattern!")
print("  - Some heads might focus on nearby words (local)")
print("  - Some heads might focus on specific word types (global)")

---

<a id='part8'></a>
# Part 8: Summary & Comparison

---

In [None]:
# ============================================================
# MODEL COMPARISON
# ============================================================
print("="*70)
print("LANGUAGE MODEL COMPARISON")
print("="*70)

comparison = pd.DataFrame({
    'Model': ['Unigram', 'Bigram', 'Trigram', 'Mini-GPT'],
    'Parameters': ['~100', '~10K', '~100K', f'{total_params:,}'],
    'Context': ['None', '1 word', '2 words', f'{config["max_seq_len"]} words'],
    'Strengths': [
        'Simple, fast',
        'Local patterns',
        'Better local context',
        'Long-range dependencies'
    ],
    'Weaknesses': [
        'No context',
        'Very limited context',
        'Limited context, sparse',
        'Needs more data'
    ]
})

print(comparison.to_string(index=False))

In [None]:
# Final summary
print("="*70)
print("LANGUAGE MODEL FROM SCRATCH - SUMMARY")
print("="*70)

print("""
WHAT WE LEARNED:
================

1. N-GRAM MODELS:
   ┌─────────────────────────────────────────────────┐
   │ P(word | context) = Count(context, word)        │
   │                     ─────────────────────       │
   │                     Count(context)              │
   └─────────────────────────────────────────────────┘
   - Simple counting approach
   - Fixed context window
   - Sparsity problems

2. WORD EMBEDDINGS:
   - Dense vector representations
   - Capture semantic similarity
   - king - man + woman ≈ queen

3. ATTENTION MECHANISM:
   ┌─────────────────────────────────────────────────┐
   │ Attention(Q, K, V) = softmax(QK^T / √d_k) × V   │
   └─────────────────────────────────────────────────┘
   - Dynamic, content-based weighting
   - Every position can attend to every other (earlier ones only, if causal)
   - Enables long-range dependencies

4. TRANSFORMER:
   ┌──────────────────────────────────┐
   │ Input → Embedding + Position     │
   │       ↓                          │
   │ Multi-Head Attention + FFN (×N)  │
   │       ↓                          │
   │ Output Projection                │
   └──────────────────────────────────┘
   - Parallel processing (no recurrence)
   - Multi-head attention captures different patterns
   - Foundation of GPT, BERT, etc.

KEY EQUATIONS:
==============

Bigram: P(w₂|w₁) = C(w₁,w₂) / C(w₁)

Attention: softmax(QK^T / √d_k) × V

Perplexity: PP = 2^(-1/N × Σlog₂P(wᵢ|context))
""")

print("\nMODERN LLMs BUILT ON THESE FOUNDATIONS:")
print("  - GPT-4: Transformer-based; parameter count not publicly disclosed")
print("  - Claude: Constitutional AI, transformer-based")
print("  - LLaMA: Openly released weights, transformer decoder")
print("  - BERT: Transformer encoder, bidirectional")

print("\n" + "="*70)

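The perplexity equation from the summary can be evaluated directly from the probabilities a model assigns to the correct tokens. This is a minimal sketch with a hypothetical helper name and made-up probabilities.

```python
import math

def perplexity(token_probs):
    """PP = 2^(-1/N * sum(log2 P(w_i | context)))."""
    n = len(token_probs)
    avg_log2 = sum(math.log2(p) for p in token_probs) / n
    return 2 ** (-avg_log2)

print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0
print(perplexity([0.9, 0.8, 0.95]))          # close to 1: confident model
```

A perplexity of 4 means the model is as uncertain as if it were choosing uniformly among four tokens at each step. Equivalently, perplexity is the exponential of the average cross-entropy loss when that loss is measured in nats.
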
## Algorithm & Method Taxonomy

### Language Model Types

| Type | Method | Context | Training |
|------|--------|---------|----------|
| **Statistical** | N-gram | Fixed N-1 words | Count-based |
| **Neural (RNN)** | LSTM/GRU | Sequential, variable | Backprop through time |
| **Neural (Transformer)** | Self-attention | Full sequence | Standard backprop |

### Attention Variants

| Variant | Description | Use Case |
|---------|-------------|----------|
| **Self-attention** | Q, K, V all projected from the same sequence | GPT, BERT |
| **Cross-attention** | Q from decoder, K,V from encoder | Translation |
| **Causal attention** | Mask future tokens | Language modeling |
| **Multi-head** | Multiple parallel attention heads | Better representations |

### Transformer Architectures

| Architecture | Attention Type | Use Case |
|--------------|----------------|----------|
| **Encoder-only** | Bidirectional | BERT (classification) |
| **Decoder-only** | Causal | GPT (generation) |
| **Encoder-Decoder** | Bidirectional encoder, causal decoder with cross-attention | T5 (translation) |

### Sampling Strategies

| Strategy | Description | Effect |
|----------|-------------|--------|
| **Greedy** | Always pick highest prob | Deterministic, repetitive |
| **Temperature** | Scale logits before softmax | Low=focused, High=diverse |
| **Top-k** | Sample from top k tokens | Balanced diversity |
| **Top-p (nucleus)** | Sample from the smallest token set with cumulative probability ≥ p | Adaptive diversity |

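Top-p sampling is the one strategy in the table that `MiniGPT.generate` does not implement. As a rough sketch (the helper name `top_p_filter` and the probabilities are hypothetical), the filtering step keeps the most likely tokens until their cumulative probability reaches `p`, then renormalizes:

```python
def top_p_filter(probs, p=0.9):
    """Return {token_index: renormalized probability} for the nucleus."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:        # nucleus is complete
            break
    return {i: probs[i] / cumulative for i in kept}

# Tokens 0 and 1 together cover 0.8 >= 0.7, so tokens 2 and 3 are dropped
print(top_p_filter([0.5, 0.3, 0.15, 0.05], p=0.7))
```
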
---

## Checklist

- [x] Understand N-gram probability calculation
- [x] Know why embeddings are better than one-hot
- [x] Can explain attention mechanism (Q, K, V)
- [x] Understand transformer architecture
- [x] Know difference between encoder/decoder transformers
- [x] Can implement basic transformer from scratch
- [x] Understand perplexity as evaluation metric

---

**End of Language Model From Scratch Tutorial**