# Week 1: Introduction to Transformers

## Notebook 03: Building a Complete Language Model

This notebook ties everything together to build a complete character-level language model.

### Learning Objectives
- Understand positional encoding
- Build a complete transformer-based language model
- Prepare data for training
- Understand the training loop structure
- Generate text with a trained model

In [None]:


import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from llm_journey.data import SimpleTokenizer, TextDataset
from llm_journey.models import TransformerBlock
from llm_journey.utils import set_seed, count_parameters

set_seed(42)

## 1. Loading and Tokenizing Data

We'll use our tiny corpus for demonstration purposes.

In [None]:
# Load corpus
with open('../data/tiny_corpus.txt', 'r', encoding='utf-8') as f:
    corpus = f.read()

print(f"Corpus length: {len(corpus)} characters")
print(f"First 200 characters:\n{corpus[:200]}...")

# Create tokenizer
tokenizer = SimpleTokenizer(corpus)
print(f"\nVocabulary size: {tokenizer.vocab_size}")
print(f"Vocabulary: {sorted(tokenizer.char_to_idx.keys())}")

In [None]:
# Tokenize the corpus
tokens = tokenizer.encode(corpus)
print(f"Total tokens: {len(tokens)}")
print(f"First 50 tokens: {tokens[:50]}")

## 2. Creating a Dataset and DataLoader
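`TextDataset` comes from the course package and its internals are not shown in this notebook. A common construction for next-token prediction, and the one assumed in this sketch, pairs each window of tokens with the same window shifted one step to the right. The class name `ShiftedWindowDataset` is hypothetical and only illustrates the idea; the real `TextDataset` may differ.

```python
import torch
from torch.utils.data import Dataset


class ShiftedWindowDataset(Dataset):
    """Hypothetical stand-in for TextDataset: targets are inputs shifted by one token."""

    def __init__(self, tokens, seq_length):
        self.tokens = torch.tensor(tokens, dtype=torch.long)
        self.seq_length = seq_length

    def __len__(self):
        # Each sample needs seq_length inputs plus one extra token for the last target
        return max(0, len(self.tokens) - self.seq_length)

    def __getitem__(self, idx):
        x = self.tokens[idx : idx + self.seq_length]
        y = self.tokens[idx + 1 : idx + self.seq_length + 1]
        return x, y


# Toy token ids 0..9 make the one-step shift easy to see
demo = ShiftedWindowDataset(tokens=list(range(10)), seq_length=4)
x, y = demo[0]
print(len(demo), x.tolist(), y.tolist())
```

At every position the target is simply the token that follows the input, which is why input and target batches share the same shape.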

In [None]:
# Create dataset
seq_length = 32
dataset = TextDataset(tokens, seq_length)
print(f"Dataset size: {len(dataset)} sequences")

# Create dataloader
batch_size = 8
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

# Sample a batch
x_batch, y_batch = next(iter(dataloader))
print(f"\nBatch shapes:")
print(f"  Input: {x_batch.shape}")
print(f"  Target: {y_batch.shape}")

## 3. Positional Encoding

Self-attention treats its input as an unordered set, so a transformer has no inherent notion of token position. One way to add positional information is with fixed sinusoidal functions:

$$PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})$$
$$PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})$$

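Here $pos$ is the token's position in the sequence, $i$ indexes the sine/cosine dimension pairs, and $d_{model}$ is the embedding size. Alternatively, we can use learned positional embeddings, one trainable vector per position up to `max_seq_len`, which is what we'll use here for simplicity.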
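For reference, the sinusoidal formulas above can be sketched in a few lines of PyTorch. This is a minimal illustration rather than part of the course package: the function name `sinusoidal_positional_encoding` is made up here, and the code assumes `d_model` is even.

```python
import math

import torch


def sinusoidal_positional_encoding(max_seq_len: int, d_model: int) -> torch.Tensor:
    """Return a (max_seq_len, d_model) table of fixed sinusoidal encodings.

    Assumes d_model is even so that sine and cosine columns pair up.
    """
    if d_model % 2 != 0:
        raise ValueError("d_model must be even")

    position = torch.arange(max_seq_len, dtype=torch.float32).unsqueeze(1)  # (max_seq_len, 1)
    # 1 / 10000^(2i / d_model) for i = 0 .. d_model/2 - 1, computed in log space
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )

    pe = torch.zeros(max_seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions use sine
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions use cosine
    return pe


pe = sinusoidal_positional_encoding(max_seq_len=128, d_model=64)
print(pe.shape)
```

Because the table is fixed, it would typically be added to the token embeddings as a non-trainable buffer instead of being looked up through `nn.Embedding`.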

In [None]:
class SimpleLanguageModel(nn.Module):
    """A simple transformer-based language model."""
    
    def __init__(self, vocab_size, d_model=64, num_heads=4, num_layers=2, d_ff=256, max_seq_len=512, dropout=0.1):
        super().__init__()
        self.d_model = d_model
        
        # Token and position embeddings
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.position_embedding = nn.Embedding(max_seq_len, d_model)
        
        # Transformer blocks
        self.blocks = nn.ModuleList([
            TransformerBlock(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])
        
        # Output projection
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size)
        
    def forward(self, x, mask=None):
        batch_size, seq_len = x.shape
        
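        # Token embeddings + positional embeddings.
        # Learned positions only exist for indices below max_seq_len.
        max_seq_len = self.position_embedding.num_embeddings
        if seq_len > max_seq_len:
            raise ValueError(f"Sequence length {seq_len} exceeds max_seq_len {max_seq_len}")
        positions = torch.arange(seq_len, device=x.device).unsqueeze(0).expand(batch_size, -1)
        x = self.token_embedding(x) + self.position_embedding(positions)
        
        # Apply transformer blocks. For autoregressive training the blocks must not
        # attend to future positions: pass a causal mask unless TransformerBlock
        # already applies one internally.
        for block in self.blocks:
            x = block(x, mask)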
        
        # Final layer norm and projection
        x = self.ln_f(x)
        logits = self.head(x)
        
        return logits

In [None]:
# Create model
model = SimpleLanguageModel(
    vocab_size=tokenizer.vocab_size,
    d_model=64,
    num_heads=4,
    num_layers=2,
    d_ff=256,
    max_seq_len=128,
    dropout=0.1
)

print(f"Model parameters: {count_parameters(model):,}")

# Test forward pass
logits = model(x_batch)
print(f"\nOutput logits shape: {logits.shape}")
print(f"Expected shape: (batch_size={batch_size}, seq_len={seq_length}, vocab_size={tokenizer.vocab_size})")

## 4. Training Loop Structure (Outline)

The loop below shows the structure of training. A useful model needs many more epochs and far more data than our tiny corpus provides.

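```python
# Training loop outline
num_epochs = 3  # illustrative value; tune for your corpus
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

model.train()  # enable dropout
for epoch in range(num_epochs):
    for x_batch, y_batch in dataloader:
        # Forward pass: logits have shape (batch, seq_len, vocab_size).
        # Pass a causal mask here unless TransformerBlock applies one internally.
        logits = model(x_batch)

        # Flatten batch and sequence dimensions so each position is one sample
        loss = criterion(logits.reshape(-1, tokenizer.vocab_size), y_batch.reshape(-1))

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f"Epoch {epoch + 1}: last batch loss = {loss.item():.4f}")
```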
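A quick sanity check for the loop above: a model that predicts a uniform distribution over the vocabulary has a cross-entropy loss of exactly $\ln(\text{vocab\_size})$, so the first-batch loss of an untrained model is typically close to that value. The sketch below demonstrates this with dummy tensors. The vocabulary size of 65 is a made-up placeholder, not the size of our corpus vocabulary.

```python
import math

import torch
import torch.nn as nn

vocab_size = 65  # hypothetical; use tokenizer.vocab_size in the notebook
batch_size, seq_len = 8, 32

# All-zero logits correspond to a uniform distribution over the vocabulary
logits = torch.zeros(batch_size, seq_len, vocab_size)
targets = torch.randint(0, vocab_size, (batch_size, seq_len))

criterion = nn.CrossEntropyLoss()
loss = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))

print(f"uniform-guess loss: {loss.item():.4f}")
print(f"ln(vocab_size):     {math.log(vocab_size):.4f}")
```

If the initial training loss is far above this baseline, the targets or the reshaping are worth double-checking before training for longer.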

## 5. Text Generation (Conceptual)

Once trained, we generate text by:
1. Starting with a prompt (seed text)
2. Encoding the prompt
3. Feeding through the model to get logits
4. Sampling the next token from the distribution
5. Appending the token and repeating
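Step 4 is where temperature comes in. The sketch below uses made-up logits for a three-token vocabulary to show how dividing by the temperature sharpens (values below 1) or flattens (values above 1) the distribution before sampling. The logits and the seed are illustrative only.

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.0])  # made-up scores for a 3-token vocabulary

top_prob = {}
for temperature in (0.5, 1.0, 2.0):
    probs = torch.softmax(logits / temperature, dim=-1)
    top_prob[temperature] = probs[0].item()
    print(f"T={temperature}: {[round(p, 3) for p in probs.tolist()]}")

# Draw one token id from the distribution at T=1.0
torch.manual_seed(0)
probs = torch.softmax(logits, dim=-1)
sample = torch.multinomial(probs, num_samples=1).item()
print(f"sampled token id: {sample}")
```

Lower temperatures concentrate probability on the highest-scoring token, while higher temperatures spread it across the alternatives.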

In [None]:
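def generate(model, tokenizer, prompt, max_length=50, temperature=1.0):
    """Generate text from a prompt (requires trained model).

    temperature must be > 0; lower values make sampling less random.
    """
    model.eval()
    tokens = tokenizer.encode(prompt)
    device = next(model.parameters()).device
    # Learned position embeddings only cover max_seq_len positions
    max_context = model.position_embedding.num_embeddings
    
    with torch.no_grad():
        for _ in range(max_length):
            # Prepare input, keeping only the most recent tokens that fit the context
            x = torch.tensor(tokens[-max_context:], dtype=torch.long, device=device).unsqueeze(0)
            
            # Get logits for the last position and apply temperature
            logits = model(x)
            logits = logits[0, -1, :] / temperature
            
            # Sample next token
            probs = torch.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, 1).item()
            
            tokens.append(next_token)
    
    return tokenizer.decode(tokens)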

# This would work with a trained model:
# generated_text = generate(model, tokenizer, "The quick", max_length=100)
# print(generated_text)

print("Generation function defined (requires trained model to use).")

## Summary

In this notebook, we've covered:
- Data loading and tokenization
- Creating datasets and dataloaders
- Positional encoding
- Building a complete language model
- Training loop structure
- Text generation procedure

## Exercises

1. Implement sinusoidal positional encoding and compare with learned embeddings
2. Add a greedy (argmax) decoding option to `generate` and compare it with temperature-scaled sampling
3. Implement top-k and nucleus (top-p) sampling
4. Train the model for a few epochs and observe the loss curve
5. Generate samples at different temperatures and compare the outputs

## Next Steps

Week 2 will cover training fundamentals, optimization techniques, and scaling considerations. Stay tuned!