# GPT Language Model Training - Word Level
## 50,000 Training Iterations

This notebook implements a word-level GPT model training pipeline with:
- Data preparation and tokenization
- Model architecture definition
- Training loop with validation
- Model saving and inference

**Model Statistics:**
- ~29.21M parameters
- Word-level tokenization
- 50,000 training steps

## Section 1: Import Required Libraries
Import all necessary Python libraries for model training

In [3]:
# Core PyTorch imports for neural network implementation
import torch
import torch.nn as nn
from torch.nn import functional as F

# Utility imports for timing and other operations
import timeit
import numpy as np
import pickle
from typing import List, Tuple, Optional

## Section 2: Hyperparameters Configuration
Define all training and model hyperparameters in one place for easy modification

In [5]:
# ============= Training Hyperparameters =============
# Batch and sequence configuration
batch_size = 64          # Number of sequences processed in parallel
block_size = 256         # Maximum context length for predictions
max_iters = 50000        # Total number of training iterations
eval_interval = 500      # Frequency of validation loss computation
learning_rate = 3e-4     # Initial learning rate for optimizer
eval_iters = 200         # Number of iterations for validation loss estimation

# ============= Model Architecture Hyperparameters =============
n_embd = 384            # Embedding dimension (model width)
n_head = 6              # Number of attention heads
n_layer = 6             # Number of transformer blocks
dropout = 0.2           # Dropout probability for regularization

# ============= System Configuration =============
# Automatically select GPU if available, otherwise use CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
torch.manual_seed(1337)  # Set random seed for reproducibility

print(f"Training device: {device}")
# The exact parameter count is printed in Section 7, after vocab_size is known from the data.

Training device: cpu


## Section 3: Data Loading and Preprocessing
Load the text data and prepare it for training

In [7]:
# ============= Load Training Data =============
# Load the training text (replace 'tiny_story.txt' with your own file path)
with open('tiny_story.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# ============= Word-Level Tokenization =============
# Split text into words and create vocabulary
words = text.split()  # Whitespace splitting: punctuation stays attached to words
vocab = sorted(set(words))  # Sorted list of unique words
vocab_size = len(vocab)

# Create word to index and index to word mappings
stoi = {word: i for i, word in enumerate(vocab)}  # string to integer
itos = {i: word for i, word in enumerate(vocab)}  # integer to string

# Define encoding and decoding functions
def encode(text: str) -> List[int]:
    """Convert text string to list of token indices (unknown words are silently dropped)"""
    return [stoi[word] for word in text.split() if word in stoi]

def decode(indices: List[int]) -> str:
    """Convert list of token indices back to text string"""
    return ' '.join([itos[i] for i in indices if i in itos])

print(f"Vocabulary size: {vocab_size} unique words")
print(f"First 10 words in vocabulary: {vocab[:10]}")

Vocabulary size: 24022 unique words
First 10 words in vocabulary: ['"', '".', '"10!', '"A', '"Abby,', '"Abigail,', '"Abracadabra!"', '"Absolutely!', '"Achoo!"', '"Acting']
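
The vocabulary listing above shows that whitespace splitting keeps punctuation glued to words (`"Abby,` and `"Achoo!"` become their own entries), which inflates the vocabulary. The following is a minimal sketch of one alternative, assuming a regex split suits this dataset. The helper name `split_words` is hypothetical and not part of the notebook.

```python
import re

def split_words(text: str) -> list:
    """Split into word tokens and single punctuation tokens (illustrative only)."""
    # \w+ matches a run of letters/digits/underscores; [^\w\s] matches one punctuation character
    return re.findall(r"\w+|[^\w\s]", text)

sample = '"Achoo!" said Abby, loudly.'
whitespace_tokens = sample.split()
regex_tokens = split_words(sample)

print(whitespace_tokens)
print(regex_tokens)
```

With a split like this, `decode` would also need adjusting, because joining tokens with single spaces would put a space before every punctuation mark.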


In [8]:
# ============= Prepare Training and Validation Data =============
# Encode entire text dataset
data = torch.tensor(encode(text), dtype=torch.long)

# Split data into training (90%) and validation (10%)
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]

print(f"Training tokens: {len(train_data):,}")
print(f"Validation tokens: {len(val_data):,}")
print(f"Total tokens: {len(data):,}")

Training tokens: 899,118
Validation tokens: 99,902
Total tokens: 999,020
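
As a rough sanity check on the training budget, the token counts above can be compared with the number of tokens the training loop will consume. This sketch only does arithmetic on values already shown in the notebook. The "epochs" figure is approximate because randomly sampled windows overlap.

```python
batch_size, block_size, max_iters = 64, 256, 50_000
train_tokens = 899_118  # from the output above

tokens_per_step = batch_size * block_size
tokens_seen = tokens_per_step * max_iters
# Windows overlap, so this is an approximate "passes over the data" figure
approx_epochs = tokens_seen / train_tokens

print(f"tokens per step:  {tokens_per_step:,}")
print(f"tokens seen:      {tokens_seen:,}")
print(f"approx. epochs:   {approx_epochs:.0f}")
```

Roughly 900 passes over fewer than a million tokens suggests watching the gap between train and validation loss closely for overfitting.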


## Section 4: Data Loading Functions
Functions to create batches of data for training

In [9]:
def get_batch(split: str) -> Tuple[torch.Tensor, torch.Tensor]:
    """
    Generate a batch of data for training or validation.
    
    Args:
        split: 'train' or 'val' to select the data split
    
    Returns:
        x: Input sequences of shape (batch_size, block_size)
        y: Target sequences of shape (batch_size, block_size)
    """
    # Select the appropriate dataset
    data = train_data if split == 'train' else val_data
    
    # Generate random starting positions for each sequence in the batch
    # Ensure we don't go past the end of the data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    
    # Stack sequences to create batch tensors
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    
    # Move to the appropriate device (CPU or GPU)
    x, y = x.to(device), y.to(device)
    return x, y

# Test the batch generation
x_sample, y_sample = get_batch('train')
print(f"Input batch shape: {x_sample.shape}")
print(f"Target batch shape: {y_sample.shape}")

Input batch shape: torch.Size([64, 256])
Target batch shape: torch.Size([64, 256])
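
To make the relationship between `x` and `y` concrete, here is a small sketch with plain Python lists instead of tensors. The names `toy_data`, `toy_block_size`, and `start` are illustrative stand-ins, not notebook variables.

```python
# Toy stand-in for the encoded dataset (token ids 10..19)
toy_data = list(range(10, 20))
toy_block_size = 4   # hypothetical small context length for illustration
start = 3            # one sampled starting position

x = toy_data[start : start + toy_block_size]          # inputs
y = toy_data[start + 1 : start + toy_block_size + 1]  # targets, shifted by one

# Each position t is trained to predict y[t] from x[:t+1]
pairs = [(x[: t + 1], y[t]) for t in range(toy_block_size)]
for context, target in pairs:
    print(f"context {context} -> target {target}")

# Largest valid start index keeps the target slice inside the data
max_start = len(toy_data) - toy_block_size - 1
```

This is why `get_batch` draws indices with `torch.randint(len(data) - block_size, ...)`: the upper bound is exclusive, so the largest start still leaves room for the shifted target window.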


In [10]:
@torch.no_grad()
def estimate_loss() -> dict:
    """
    Estimate the average loss on train and validation sets.
    Uses multiple batches to get a more stable estimate.
    
    Returns:
        Dictionary with 'train' and 'val' losses
    """
    out = {}
    model.eval()  # Set model to evaluation mode
    
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        
        # Average loss over multiple batches
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        
        out[split] = losses.mean()
    
    model.train()  # Set model back to training mode
    return out

## Section 5: Model Architecture Components
Define the transformer model components including attention heads, feed-forward networks, and blocks

In [11]:
class Head(nn.Module):
    """
    Single head of self-attention.
    Implements scaled dot-product attention with causal masking.
    """
    
    def __init__(self, head_size: int):
        super().__init__()
        # Linear projections for queries, keys, and values
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        
        # Register causal mask as a buffer (not a parameter)
        # Lower triangular matrix ensures each token attends only to itself and earlier tokens
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        
        # Dropout for regularization
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x):
        B, T, C = x.shape  # Batch size, Time steps, Channels
        
        # Compute queries, keys, values
        q = self.query(x)  # (B, T, head_size)
        k = self.key(x)    # (B, T, head_size)
        v = self.value(x)  # (B, T, head_size)
        
        # Compute attention scores, scaled by 1/sqrt(head_size)
        # @ is matrix multiplication
        wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5  # (B, T, T)
        
        # Apply causal mask (prevent looking at future tokens)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        
        # Apply softmax to get attention weights
        wei = F.softmax(wei, dim=-1)  # (B, T, T)
        wei = self.dropout(wei)
        
        # Weighted aggregation of values
        out = wei @ v  # (B, T, head_size)
        return out
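
The masking and softmax steps in `Head.forward` can be traced by hand on a tiny example. This sketch re-implements just those two steps in plain Python for a 3x3 score matrix. The function name `causal_softmax` is illustrative and does not exist in the notebook or in PyTorch.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where position t may only attend to positions <= t."""
    T = len(scores)
    weights = []
    for t in range(T):
        # Masked (future) positions get -inf, so they contribute exp(-inf) = 0
        row = [scores[t][s] if s <= t else float("-inf") for s in range(T)]
        row_max = max(row)  # subtract the max for numerical stability
        exps = [math.exp(v - row_max) for v in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# With all-zero scores, each row is uniform over the allowed positions
weights = causal_softmax([[0.0] * 3 for _ in range(3)])
for row in weights:
    print([round(w, 3) for w in row])
```

Each row sums to 1 and every entry above the diagonal is exactly 0, which is the property the `tril` buffer enforces.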

In [12]:
class MultiHeadAttention(nn.Module):
    """
    Multiple heads of self-attention running in parallel.
    Concatenates outputs and projects back to embedding dimension.
    """
    
    def __init__(self, num_heads: int, head_size: int):
        super().__init__()
        # Create multiple attention heads
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        
        # Final linear projection after concatenation
        self.proj = nn.Linear(head_size * num_heads, n_embd)
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x):
        # Run each attention head on the same input (sequentially in this loop) and concatenate
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        
        # Project back to embedding dimension
        out = self.dropout(self.proj(out))
        return out

In [13]:
class FeedForward(nn.Module):
    """
    Feed-forward network with one hidden layer.
    Typically expands dimension by 4x then projects back.
    """
    
    def __init__(self, n_embd: int):
        super().__init__()
        self.net = nn.Sequential(
            # Expand dimension by 4x
            nn.Linear(n_embd, 4 * n_embd),
            # Apply ReLU activation
            nn.ReLU(),
            # Project back to embedding dimension
            nn.Linear(4 * n_embd, n_embd),
            # Dropout for regularization
            nn.Dropout(dropout),
        )
    
    def forward(self, x):
        return self.net(x)

In [14]:
class Block(nn.Module):
    """
    Transformer block: communication (attention) followed by computation (FFN).
    Uses residual connections and layer normalization.
    """
    
    def __init__(self, n_embd: int, n_head: int):
        super().__init__()
        head_size = n_embd // n_head
        
        # Multi-head self-attention
        self.sa = MultiHeadAttention(n_head, head_size)
        
        # Feed-forward network
        self.ffwd = FeedForward(n_embd)
        
        # Layer normalizations
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)
    
    def forward(self, x):
        # Apply attention with residual connection
        # Note: This is Pre-LayerNorm architecture
        x = x + self.sa(self.ln1(x))
        
        # Apply feed-forward with residual connection
        x = x + self.ffwd(self.ln2(x))
        
        return x

## Section 6: Complete Model Definition
Assemble all components into the final GPT model

In [15]:
class WordLanguageModel(nn.Module):
    """
    Complete GPT Language Model.
    Combines token embeddings, positional embeddings, transformer blocks,
    and output projection for next-token prediction.
    """
    
    def __init__(self):
        super().__init__()
        
        # Token embedding table: maps vocabulary indices to vectors
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        
        # Positional embedding table: adds position information
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        
        # Stack of transformer blocks
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        
        # Final layer normalization
        self.ln_f = nn.LayerNorm(n_embd)
        
        # Output projection: maps from embedding space to vocabulary
        self.lm_head = nn.Linear(n_embd, vocab_size)
        
        # Initialize weights
        self.apply(self._init_weights)
    
    def _init_weights(self, module):
        """Initialize model weights with appropriate values"""
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
    
    def forward(self, idx, targets=None):
        """
        Forward pass through the model.
        
        Args:
            idx: Input token indices of shape (B, T)
            targets: Target token indices of shape (B, T), optional
        
        Returns:
            logits: Predictions of shape (B, T, vocab_size)
            loss: Cross-entropy loss if targets provided, else None
        """
        B, T = idx.shape
        
        # Get token embeddings: (B, T, C)
        tok_emb = self.token_embedding_table(idx)
        
        # Get positional embeddings: (T, C), created on the same device as the input
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device))
        
        # Combine token and positional embeddings
        x = tok_emb + pos_emb  # (B, T, C)
        
        # Apply transformer blocks
        x = self.blocks(x)  # (B, T, C)
        
        # Apply final layer norm
        x = self.ln_f(x)  # (B, T, C)
        
        # Project to vocabulary size
        logits = self.lm_head(x)  # (B, T, vocab_size)
        
        # Calculate loss if targets provided
        if targets is None:
            loss = None
        else:
            # Reshape for cross-entropy loss
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
        
        return logits, loss
    
    @torch.no_grad()  # Sampling needs no gradients
    def generate(self, idx, max_new_tokens):
        """
        Generate new tokens autoregressively.
        
        Args:
            idx: Starting context of shape (B, T)
            max_new_tokens: Number of new tokens to generate
        
        Returns:
            Extended sequence of shape (B, T + max_new_tokens)
        """
        for _ in range(max_new_tokens):
            # Crop context to maximum block size
            idx_cond = idx[:, -block_size:]
            
            # Get predictions
            logits, loss = self(idx_cond)
            
            # Focus only on the last time step
            logits = logits[:, -1, :]  # (B, C)
            
            # Apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1)  # (B, C)
            
            # Sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)
            
            # Append sampled token to the running sequence
            idx = torch.cat((idx, idx_next), dim=1)  # (B, T+1)
        
        return idx
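
`generate` samples from the full softmax distribution. Two common extensions, which are not part of this notebook, are temperature scaling and top-k filtering. The sketch below shows the idea on a plain list of logits. The function name `sample_next` and its parameters are assumptions made for illustration.

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=None, rng=random):
    """Sample one index from logits after temperature scaling and optional top-k filtering."""
    # Temperature < 1 sharpens the distribution, > 1 flattens it
    scaled = [v / temperature for v in logits]
    if top_k is not None:
        # Keep the k largest logits (ties with the k-th value are also kept)
        cutoff = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [v if v >= cutoff else float("-inf") for v in scaled]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return index, probs

rng = random.Random(1337)
index, probs = sample_next([2.0, 1.0, 0.1, -1.0], temperature=0.8, top_k=2, rng=rng)
print(index, [round(p, 3) for p in probs])
```

In the tensor version, the same two steps would be applied to `logits[:, -1, :]` before the softmax.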

## Section 7: Model Initialization and Information
Create the model instance and display model information

In [16]:
# Create model instance and move to device
model = WordLanguageModel()
m = model.to(device)

# Count and display model parameters
num_params = sum(p.numel() for p in m.parameters())
print(f"Model initialized with {num_params/1e6:.6f} M parameters")

# Display model architecture
print("\nModel Architecture:")
print(f"- Vocabulary Size: {vocab_size}")
print(f"- Embedding Dimension: {n_embd}")
print(f"- Number of Heads: {n_head}")
print(f"- Number of Layers: {n_layer}")
print(f"- Block Size: {block_size}")
print(f"- Dropout Rate: {dropout}")

Model initialized with 29.211862 M parameters

Model Architecture:
- Vocabulary Size: 24022
- Embedding Dimension: 384
- Number of Heads: 6
- Number of Layers: 6
- Block Size: 256
- Dropout Rate: 0.2
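
The printed total can be reproduced by hand from the hyperparameters. This sketch counts the parameters of each component as defined in Sections 5 and 6, using the values shown above. The causal mask is a buffer, not a parameter, so it is not counted.

```python
# Hyperparameters copied from the configuration and outputs above
vocab_size, n_embd, n_head, n_layer, block_size = 24022, 384, 6, 6, 256
head_size = n_embd // n_head

token_emb = vocab_size * n_embd                 # token embedding table
pos_emb = block_size * n_embd                   # positional embedding table

attn_qkv = n_head * 3 * n_embd * head_size      # query/key/value projections, no bias
attn_proj = n_embd * n_embd + n_embd            # output projection with bias
ffn = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
layer_norms = 2 * (2 * n_embd)                  # two LayerNorms, weight + bias each
per_block = attn_qkv + attn_proj + ffn + layer_norms

final_ln = 2 * n_embd                           # final LayerNorm, weight + bias
lm_head = n_embd * vocab_size + vocab_size      # output projection with bias

total = token_emb + pos_emb + n_layer * per_block + final_ln + lm_head
print(f"per block: {per_block:,}")
print(f"total:     {total:,}")
```

The two vocabulary-sized matrices (`token_embedding_table` and `lm_head`) account for about 18.5M of the 29.2M parameters, so the word-level vocabulary dominates the model size.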


## Section 8: Optimizer Setup
Configure the AdamW optimizer for training

In [17]:
# Create AdamW optimizer
# AdamW is Adam with decoupled weight decay: the decay is applied directly to
# the weights instead of being added to the gradient as an L2 penalty
optimizer = torch.optim.AdamW(m.parameters(), lr=learning_rate)

print(f"Optimizer: AdamW")
print(f"Learning Rate: {learning_rate}")
print(f"Total Training Steps: {max_iters:,}")

Optimizer: AdamW
Learning Rate: 0.0003
Total Training Steps: 50,000


## Section 9: Training Loop
Main training loop with periodic evaluation and progress tracking

In [None]:
# ============= Main Training Loop =============
print("Starting training...")
print("="*50)

# Track training start time
import time
start_time = time.time()

# Training loop
for step in range(max_iters):  # 'step' avoids shadowing the built-in iter()
    
    # ========== Evaluation Step ==========
    # Periodically evaluate loss on train and validation sets
    if step % eval_interval == 0 or step == max_iters - 1:
        losses = estimate_loss()
        
        # Calculate elapsed time
        elapsed_time = time.time() - start_time
        elapsed_minutes = int(elapsed_time // 60)
        elapsed_seconds = elapsed_time % 60
        
        # Print progress
        print(f"Step {step:5d}: "
              f"train loss {losses['train']:.4f}, "
              f"val loss {losses['val']:.4f}, "
              f"Elapsed: {elapsed_minutes} min, {elapsed_seconds:.2f} sec")
        
        # Optional: Save checkpoint if validation loss improved
        # You can add checkpoint saving logic here
    
    # ========== Training Step ==========
    # Sample a batch of data
    xb, yb = get_batch('train')
    
    # Forward pass: compute predictions and loss
    logits, loss = model(xb, yb)
    
    # Backward pass: compute gradients
    optimizer.zero_grad(set_to_none=True)  # Clear gradients from previous step
    loss.backward()  # Compute gradients
    
    # Update weights
    optimizer.step()

# ============= Training Complete =============
total_time = time.time() - start_time
print("="*50)
print(f"Training complete!")
print(f"Total training time: {total_time/3600:.2f} hours")
print(f"Final train loss: {losses['train']:.4f}")
print(f"Final validation loss: {losses['val']:.4f}")

Starting training...


## Section 10: Text Generation
Generate sample text using the trained model

In [None]:
# ============= Generate Sample Text =============
print("\nGenerating sample text...")
print("="*50)

# Define the starting prompt
prompt = "Once upon a time"
print(f"Prompt: '{prompt}'\n")

# Encode the prompt
starting_tokens = encode(prompt)
starting_context = torch.tensor([starting_tokens], dtype=torch.long, device=device)

# Generate new tokens (eval mode disables dropout during sampling)
m.eval()
num_tokens_to_generate = 500
generated = m.generate(starting_context, max_new_tokens=num_tokens_to_generate)

# Decode and print the generated text
generated_text = decode(generated[0].tolist())
print("Generated text:")
print("-"*50)
print(generated_text)
print("-"*50)

In [None]:
# ============= Interactive Text Generation =============
# Allow user to input custom prompts

def generate_text(prompt: str, max_tokens: int = 100) -> str:
    """
    Generate text from a custom prompt.
    
    Args:
        prompt: Starting text
        max_tokens: Maximum number of tokens to generate
    
    Returns:
        Generated text string
    """
    # Encode prompt
    tokens = encode(prompt)
    if len(tokens) == 0:
        print("Warning: Prompt contains no known words. Falling back to token index 0.")
        tokens = [0]  # First token in the sorted vocabulary (not a random choice)
    
    context = torch.tensor([tokens], dtype=torch.long, device=device)
    
    # Generate (eval mode disables dropout during sampling)
    m.eval()
    generated = m.generate(context, max_new_tokens=max_tokens)
    
    # Decode and return
    return decode(generated[0].tolist())

# Example usage
custom_prompt = "The king"
custom_output = generate_text(custom_prompt, max_tokens=200)
print(f"\nCustom generation from '{custom_prompt}':")
print(custom_output)

## Section 11: Model Saving
Save the trained model for future use

In [None]:
# ============= Save Model Weights =============
model_path = 'generated_model_50Kiters.pth'
torch.save(model.state_dict(), model_path)
print(f"Model saved to: {model_path}")

# Calculate file size
import os
file_size = os.path.getsize(model_path) / (1024 * 1024)  # Convert to MB
print(f"Model file size: {file_size:.2f} MB")

# ============= Save Vocabulary =============
# Save vocabulary for later use with the model
vocab_path = 'vocabulary.pkl'
with open(vocab_path, 'wb') as f:
    pickle.dump({
        'vocab': vocab,
        'vocab_size': vocab_size,
        'stoi': stoi,
        'itos': itos
    }, f)
print(f"Vocabulary saved to: {vocab_path}")
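
Loading a pickle file can execute arbitrary code, so pickled vocabularies should only be loaded from trusted sources. A plain-text alternative is to store just the ordered word list as JSON and rebuild both mappings on load. This is a sketch under that assumption. The file name `vocabulary.json` and the tiny stand-in vocabulary are hypothetical.

```python
import json
import os
import tempfile

# Tiny stand-in vocabulary; the notebook's real `vocab` has 24,022 entries
vocab = ["Once", "a", "time", "upon"]
stoi = {word: i for i, word in enumerate(vocab)}

# Hypothetical file name, written to a temporary directory for this demo
path = os.path.join(tempfile.mkdtemp(), "vocabulary.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump({"vocab": vocab}, f, ensure_ascii=False)

# Rebuild both mappings from the ordered word list on load
with open(path, "r", encoding="utf-8") as f:
    loaded_vocab = json.load(f)["vocab"]
loaded_stoi = {word: i for i, word in enumerate(loaded_vocab)}
loaded_itos = {i: word for i, word in enumerate(loaded_vocab)}
```

Storing only the list sidesteps a JSON quirk: object keys are always strings, so an `itos` dictionary with integer keys would not round-trip unchanged.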

## Section 12: Model Loading
Load a previously saved model

In [None]:
# ============= Load Saved Model =============
def load_model(model_path: str, vocab_path: str):
    """
    Load a previously trained model and its vocabulary.
    
    Args:
        model_path: Path to the saved model weights
        vocab_path: Path to the saved vocabulary
    
    Returns:
        Loaded model and vocabulary dictionary
    """
    # Load vocabulary
    with open(vocab_path, 'rb') as f:
        vocab_data = pickle.load(f)
    
    # Update global variables (in practice, you'd handle this better)
    global vocab, vocab_size, stoi, itos
    vocab = vocab_data['vocab']
    vocab_size = vocab_data['vocab_size']
    stoi = vocab_data['stoi']
    itos = vocab_data['itos']
    
    # Create and load model
    loaded_model = WordLanguageModel()
    loaded_model.load_state_dict(torch.load(model_path, map_location=device))
    loaded_model = loaded_model.to(device)
    loaded_model.eval()  # Set to evaluation mode
    
    print(f"Model loaded successfully from {model_path}")
    print(f"Vocabulary loaded successfully from {vocab_path}")
    
    return loaded_model, vocab_data

# Example: Load the saved model
# loaded_model, vocab_data = load_model('generated_model_50Kiters.pth', 'vocabulary.pkl')

## Section 13: Model Analysis and Metrics
Analyze model performance and compute various metrics

In [None]:
# ============= Compute Perplexity =============
@torch.no_grad()
def compute_perplexity(model, data_loader, split='val'):
    """
    Compute perplexity on the specified split.
    Perplexity = exp(average_loss)
    Note: data_loader is unused; batches are drawn with get_batch(split).
    """
    model.eval()
    total_loss = 0
    num_batches = 100  # Use a fixed number of batches for consistency
    
    for _ in range(num_batches):
        X, Y = get_batch(split)
        logits, loss = model(X, Y)
        total_loss += loss.item()
    
    avg_loss = total_loss / num_batches
    perplexity = torch.exp(torch.tensor(avg_loss))
    
    return perplexity.item()

# Calculate perplexity
train_perplexity = compute_perplexity(m, None, 'train')
val_perplexity = compute_perplexity(m, None, 'val')

print("\nModel Performance Metrics:")
print("="*50)
print(f"Training Perplexity: {train_perplexity:.2f}")
print(f"Validation Perplexity: {val_perplexity:.2f}")
print(f"Vocabulary Coverage: {vocab_size:,} unique words")
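
A useful reference point for these numbers is the uniform-guess baseline. A model that assigns equal probability to every word has a cross-entropy of ln(vocab_size) and a perplexity equal to the vocabulary size. The sketch below works this out for the vocabulary size reported earlier. The loss value 3.0 is an arbitrary example, not a result from this notebook.

```python
import math

vocab_size = 24022  # from the tokenization output above

# A model that assigns equal probability to every word has loss ln(V)
uniform_loss = math.log(vocab_size)
uniform_perplexity = math.exp(uniform_loss)

def perplexity(avg_loss: float) -> float:
    """Perplexity is the exponential of the mean cross-entropy (in nats)."""
    return math.exp(avg_loss)

print(f"uniform-guess loss:       {uniform_loss:.3f}")
print(f"uniform-guess perplexity: {uniform_perplexity:.0f}")
print(f"perplexity at loss 3.0:   {perplexity(3.0):.1f}")
```

A freshly initialized model would be expected to start near the uniform-guess loss, so a first evaluation far from that value may point to a data or initialization problem.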

In [None]:
# ============= Attention Visualization Helper =============
def get_attention_weights(model, text: str, layer_idx: int = 0, head_idx: int = 0):
    """
    Extract attention weights for visualization.
    
    Args:
        model: Trained model
        text: Input text
        layer_idx: Which transformer layer to visualize
        head_idx: Which attention head to visualize
    
    Returns:
        Attention weights matrix
    """
    # This is a simplified version - you'd need to modify the model
    # to actually extract attention weights during forward pass
    
    tokens = encode(text)
    if len(tokens) == 0:
        print("No valid tokens found in text")
        return None
    
    context = torch.tensor([tokens], dtype=torch.long, device=device)
    
    # Note: This is a placeholder - actual implementation would require
    # modifying the model to return attention weights
    print(f"Attention visualization for layer {layer_idx}, head {head_idx}")
    print(f"Input text: '{text}'")
    print(f"Number of tokens: {len(tokens)}")
    
    return None  # Placeholder

# Example usage
# attention_weights = get_attention_weights(m, "The king and queen", layer_idx=0, head_idx=0)

## Section 14: Training Diagnostics
Additional tools for monitoring and debugging training

In [None]:
# ============= Gradient Statistics =============
def print_gradient_statistics(model):
    """
    Print statistics about gradients to diagnose training issues.
    Useful for detecting vanishing/exploding gradients.
    """
    total_norm = 0
    param_count = 0
    
    print("\nGradient Statistics:")
    print("="*50)
    
    for name, param in model.named_parameters():
        if param.grad is not None:
            param_norm = param.grad.data.norm(2).item()
            total_norm += param_norm ** 2
            param_count += 1
            
            # Print statistics for important layers
            if 'embedding' in name or 'ln' in name or 'lm_head' in name:
                grad_mean = param.grad.data.mean().item()
                grad_std = param.grad.data.std().item()
                print(f"{name[:30]:30s} | norm: {param_norm:.6f} | "
                      f"mean: {grad_mean:.6f} | std: {grad_std:.6f}")
    
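    total_norm = (total_norm ** 0.5)
    print(f"\nTotal gradient norm: {total_norm:.6f}")
    if param_count > 0:
        # Global norm divided by the number of parameter tensors (not a mean of per-tensor norms)
        print(f"Global norm / number of tensors: {total_norm/param_count:.6f}")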
    
    return total_norm

# Example: Check gradients after a training step
# xb, yb = get_batch('train')
# logits, loss = model(xb, yb)
# loss.backward()
# grad_norm = print_gradient_statistics(model)
# optimizer.zero_grad()
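
The total norm computed above is the same quantity that gradient clipping acts on. The sketch below shows the arithmetic with made-up per-parameter norms. The helper names and the `eps` term are assumptions for illustration. In PyTorch, `torch.nn.utils.clip_grad_norm_` performs this kind of rescaling on the actual gradients.

```python
import math

def global_norm(per_param_norms):
    """Combine per-tensor L2 norms into the L2 norm of all gradients together."""
    return math.sqrt(sum(n ** 2 for n in per_param_norms))

def clip_coefficient(total_norm, max_norm, eps=1e-6):
    """Scale factor that brings the global norm down to max_norm (never scales up)."""
    return min(1.0, max_norm / (total_norm + eps))

norms = [3.0, 4.0, 12.0]           # made-up per-parameter gradient norms
total = global_norm(norms)          # sqrt(9 + 16 + 144) = 13.0
coef = clip_coefficient(total, max_norm=1.0)
clipped = [n * coef for n in norms]

print(f"global norm before clipping: {total:.4f}")
print(f"global norm after clipping:  {global_norm(clipped):.4f}")
```

Every gradient is multiplied by the same coefficient, so clipping changes the size of the update but not its direction.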

In [None]:
# ============= Learning Rate Schedule (Optional) =============
def get_lr(it, warmup_iters=2000, lr_decay_iters=50000, min_lr=1e-5):
    """
    Learning rate schedule with warmup and cosine decay.
    
    Args:
        it: Current iteration
        warmup_iters: Number of warmup steps
        lr_decay_iters: Total number of decay steps
        min_lr: Minimum learning rate
    
    Returns:
        Learning rate for current iteration
    """
    # Warmup phase: linearly increase from 0 to learning_rate
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    
    # After decay_iters, use minimum learning rate
    if it > lr_decay_iters:
        return min_lr
    
    # Cosine decay
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    assert 0 <= decay_ratio <= 1
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    
    return min_lr + coeff * (learning_rate - min_lr)

# Visualize learning rate schedule
import math
import matplotlib.pyplot as plt

# Generate learning rates for all iterations
lrs = [get_lr(it) for it in range(max_iters)]

# Plot
plt.figure(figsize=(10, 6))
plt.plot(lrs)
plt.xlabel('Iteration')
plt.ylabel('Learning Rate')
plt.title('Learning Rate Schedule')
plt.grid(True, alpha=0.3)
plt.show()

print(f"Initial LR: {lrs[0]:.6f}")
print(f"Peak LR: {max(lrs):.6f}")
print(f"Final LR: {lrs[-1]:.6f}")
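
A few hand-checkable points make the shape of the schedule easier to verify than the plot alone. This sketch repeats `get_lr` with the same defaults so that it runs on its own, then evaluates it at the start, mid-warmup, the peak, the midpoint of the cosine decay, and the end.

```python
import math

learning_rate = 3e-4  # same value as in the hyperparameter cell

def get_lr(it, warmup_iters=2000, lr_decay_iters=50000, min_lr=1e-5):
    """Same schedule as above, repeated so this snippet runs on its own."""
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    if it > lr_decay_iters:
        return min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (learning_rate - min_lr)

for it in (0, 1000, 2000, 26000, 50000):
    print(f"iteration {it:>6}: lr = {get_lr(it):.6e}")
```

Note that the schedule returns 0 at iteration 0, so the very first step would not update the weights. The training loop in Section 9 uses a constant learning rate. To apply the schedule, the usual PyTorch pattern is to set `param_group['lr']` for each entry of `optimizer.param_groups` before calling `optimizer.step()`.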

## Summary and Next Steps

### What We've Built:
- A complete GPT language model with ~29.21M parameters
- Word-level tokenization system
- Full training pipeline with validation
- Text generation capabilities
- Model saving/loading functionality

### Potential Improvements:
1. **Tokenization**: Consider using BPE or SentencePiece for better vocabulary
2. **Architecture**: Experiment with different model sizes, heads, and layers
3. **Training**: Apply the `get_lr` schedule from Section 14 inside the training loop (it is defined but not yet used) and add gradient clipping
4. **Data**: Use larger and more diverse datasets
5. **Evaluation**: Add more metrics like BLEU scores or human evaluation
6. **Optimization**: Implement mixed precision training for faster computation
7. **Regularization**: Tune weight decay (AdamW already applies its default of 0.01), try dropout variations, or other techniques

### Resources for Further Learning:
- [Attention Is All You Need](https://arxiv.org/abs/1706.03762) - Original Transformer paper
- [GPT-2 Paper](https://openai.com/research/better-language-models) - Language model scaling
- [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/) - Visual guide
- [PyTorch Documentation](https://pytorch.org/docs/stable/index.html) - Framework reference