# Neural Networks Project: WikiText Language Modeling with Transformer

This notebook uses the project's neural network framework (the modules under `src/`) to train a Transformer language model on the WikiText-2 dataset.

## Setup

First, let's import the necessary libraries and set up our environment.

In [None]:
import os
import sys
import torch
import numpy as np
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader, Dataset
from datasets import load_dataset
from transformers import AutoTokenizer

# Add the project root to the path
sys.path.append('..')

# Import project modules
from src.models.transformer_model import TransformerModel
from src.utils.trainer import Trainer
from src.config.config_manager import ConfigManager, get_default_config

# Set random seed for reproducibility
torch.manual_seed(42)
np.random.seed(42)

# Set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

## Load WikiText-2 Dataset

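WikiText-2 is a small language modeling benchmark built from Wikipedia articles. Its training split contains roughly 2 million tokens. We load the `wikitext-2-raw-v1` variant, which keeps the original text without replacing rare words with an unknown-token placeholder.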

In [None]:
# Load the WikiText-2 dataset with the Hugging Face `datasets` library
wikitext_dataset = load_dataset('wikitext', 'wikitext-2-raw-v1')
print(wikitext_dataset)

# Look at the first few examples
print("\nSample text from training set:")
print(wikitext_dataset['train'][0]['text'][:500])

## Create a Tokenizer

We'll use a pre-trained tokenizer from Hugging Face for processing the text data.

In [None]:
# Initialize a tokenizer (using GPT-2 tokenizer as it's good for general text)
tokenizer = AutoTokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 doesn't have a pad token by default

# Define maximum sequence length
max_length = 128

# Get vocabulary size
vocab_size = len(tokenizer)
print(f"Vocabulary size: {vocab_size}")
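
Reusing the EOS token as the pad token means the pad ID alone cannot tell padding apart from a genuine end-of-text token. This only matters if real EOS tokens end up in the data, which the cell above does not do by itself. The sketch below shows one common way to keep them distinguishable: mask by `attention_mask` instead of by token ID. The token IDs are made up for illustration, apart from 50256, which is GPT-2's EOS ID.

```python
import torch

# Hypothetical token IDs: 50256 plays the role of both EOS and padding,
# as it does once tokenizer.pad_token is set to tokenizer.eos_token.
EOS_ID = 50256
input_ids = torch.tensor([[11, 22, 33, EOS_ID, EOS_ID, EOS_ID]])
attention_mask = torch.tensor([[1, 1, 1, 1, 0, 0]])  # the first EOS is real

# Mask by attention_mask: padded positions get -100, the default
# ignore_index of torch.nn.functional.cross_entropy.
labels_by_mask = input_ids.masked_fill(attention_mask == 0, -100)

# Mask by token ID: the real EOS is dropped along with the padding.
labels_by_id = input_ids.masked_fill(input_ids == EOS_ID, -100)

print(labels_by_mask.tolist())
print(labels_by_id.tolist())
```

Whether this distinction is needed depends on how the project's model and loss treat `pad_idx`, which this notebook does not show.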

## Prepare Dataset for Training

Let's create a custom dataset class for handling WikiText data.

In [None]:
class WikiTextDataset(Dataset):
    def __init__(self, dataset_split, tokenizer, max_length=128):
        self.dataset = dataset_split
        self.tokenizer = tokenizer
        self.max_length = max_length
        
        # Filter out empty texts
        self.texts = [text for text in self.dataset['text'] if len(text.strip()) > 0]
        
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        text = self.texts[idx]
        encoding = self.tokenizer(text, max_length=self.max_length, padding='max_length', 
                                 truncation=True, return_tensors='pt')
        
        # Remove batch dimension
        for key in encoding.keys():
            encoding[key] = encoding[key].squeeze(0)
            
        return encoding

# Create datasets
train_dataset = WikiTextDataset(wikitext_dataset['train'], tokenizer, max_length=max_length)
val_dataset = WikiTextDataset(wikitext_dataset['validation'], tokenizer, max_length=max_length)
test_dataset = WikiTextDataset(wikitext_dataset['test'], tokenizer, max_length=max_length)

print(f"Training dataset size: {len(train_dataset)}")
print(f"Validation dataset size: {len(val_dataset)}")
print(f"Test dataset size: {len(test_dataset)}")

## Create DataLoaders

Now, let's create the dataloaders with a custom collate function for handling the transformer inputs and targets.

In [None]:
def collate_batch(batch):
    """Collate function for DataLoader."""
    input_ids = torch.stack([item['input_ids'] for item in batch])
    
    # Next-token prediction: the target is the input shifted left by one position,
    # so the model at position i is trained to predict token i + 1
    target = input_ids[:, 1:].contiguous()
    source = input_ids[:, :-1].contiguous()
    
    return source, target

# Define batch size
batch_size = 16

# Create data loaders
train_loader = DataLoader(
    train_dataset,
    batch_size=batch_size,
    shuffle=True,
    collate_fn=collate_batch,
    num_workers=4
)

val_loader = DataLoader(
    val_dataset,
    batch_size=batch_size,
    shuffle=False,
    collate_fn=collate_batch,
    num_workers=4
)

test_loader = DataLoader(
    test_dataset,
    batch_size=batch_size,
    shuffle=False,
    collate_fn=collate_batch,
    num_workers=4
)
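
To make the source/target split in `collate_batch` concrete, here is a minimal sketch on a single hand-written sequence. The token IDs are arbitrary placeholders, not real vocabulary entries.

```python
import torch

# One toy sequence of five placeholder token IDs
input_ids = torch.tensor([[10, 11, 12, 13, 14]])

# Same slicing as in collate_batch
source = input_ids[:, :-1]  # everything except the last token
target = input_ids[:, 1:]   # everything except the first token

# Each source position is paired with the token that follows it
for src_tok, tgt_tok in zip(source[0].tolist(), target[0].tolist()):
    print(f"input {src_tok} -> predict {tgt_tok}")
```

With `max_length = 128`, the same slicing yields source and target tensors of length 127 per example.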

## Examine a Training Batch

Let's examine a batch of training data to understand its structure.

In [None]:
# Get a batch from the training loader
examples = iter(train_loader)
source, target = next(examples)

print(f"Source shape: {source.shape}")
print(f"Target shape: {target.shape}")

# Show a single example
example_idx = 0
print("\nExample input (source):")
print(tokenizer.decode(source[example_idx]))

print("\nExample target:")
print(tokenizer.decode(target[example_idx]))

## Configure and Create the Model

Let's configure and create our Transformer model for the WikiText language modeling task.

In [None]:
# Start with the default configuration
config_manager = ConfigManager(default_config=get_default_config())

# Update configuration for the WikiText transformer model
config_manager.set('transformer.vocab_size', vocab_size)
config_manager.set('transformer.d_model', 256)  # Embedding dimension
config_manager.set('transformer.nhead', 8)  # Number of attention heads
config_manager.set('transformer.num_encoder_layers', 4)  # Number of encoder layers
config_manager.set('transformer.num_decoder_layers', 4)  # Number of decoder layers
config_manager.set('transformer.dim_feedforward', 1024)  # Dimension of feedforward network
config_manager.set('transformer.max_seq_length', max_length)  # Maximum sequence length
config_manager.set('model.dropout_rate', 0.1)  # Dropout rate
config_manager.set('training.num_epochs', 5)  # Number of epochs (reduced for demonstration)
config_manager.set('training.learning_rate', 2e-4)  # Learning rate
config_manager.set('training.weight_decay', 1e-4)  # Weight decay for regularization

# Read the configuration back only after the updates, so the values used
# below are current even if get_all() returns a copy
config = config_manager.get_all()

# Create model configuration
model_config = {
    'vocab_size': config['transformer']['vocab_size'],
    'd_model': config['transformer']['d_model'],
    'nhead': config['transformer']['nhead'],
    'num_encoder_layers': config['transformer']['num_encoder_layers'],
    'num_decoder_layers': config['transformer']['num_decoder_layers'],
    'dim_feedforward': config['transformer']['dim_feedforward'],
    'dropout': config['model']['dropout_rate'],
    'max_seq_length': config['transformer']['max_seq_length'],
    'pad_idx': tokenizer.pad_token_id
}

# Create the model
model = TransformerModel(model_config)
model = model.to(device)

# Print model summary
print(f"Transformer Model created with {model.get_parameter_count():,} trainable parameters")
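
`get_parameter_count()` belongs to the project's `TransformerModel` and its implementation is not shown here. A plain PyTorch equivalent would presumably look like the sketch below; the helper name `count_trainable_parameters` is hypothetical. The example applies it to an embedding table with GPT-2's vocabulary size (50,257) and `d_model = 256` to show how much of the parameter budget the vocabulary alone can take.

```python
from torch import nn


def count_trainable_parameters(module: nn.Module) -> int:
    """Sum the element counts of all parameters that receive gradients."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


# An embedding table sized like the one this configuration implies:
# 50,257 vocabulary entries, each a 256-dimensional vector.
embedding = nn.Embedding(50257, 256)
print(f"{count_trainable_parameters(embedding):,}")  # 12,865,792
```

If the model also has a separate, untied output projection of the same shape, that cost is paid twice.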

## Prepare for Training

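The `Trainer` is used here with a model that takes a single input and returns flattened logits, so we temporarily wrap the Transformer's `forward`. The wrapper unpacks the `(source, target)` pair and reshapes the output to `(batch_size * seq_len, vocab_size)` for the cross-entropy loss.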

In [None]:
# Save the original forward method
original_forward = model.forward

# Custom forward method for the trainer
def train_forward(x):
    # Unpack source and target from input
    src, tgt = x
    # Call the original forward method
    output = original_forward(src, tgt)
    # Reshape output for cross-entropy loss
    batch_size, seq_len, vocab_size = output.size()
    return output.reshape(batch_size * seq_len, vocab_size)

# Monkey patch the forward method for training
model.forward = train_forward
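
Replacing `forward` on the instance works, but it has to be undone by hand after training. One alternative is a thin wrapper module that leaves the original model untouched. The sketch below illustrates the idea: `FlattenedLogits` and `ToySeq2Seq` are hypothetical names, and the toy model merely stands in for `TransformerModel`, whose internals are not shown in this notebook.

```python
import torch
from torch import nn


class FlattenedLogits(nn.Module):
    """Adapts a (src, tgt) -> (batch, seq, vocab) model to a single-input interface."""

    def __init__(self, inner: nn.Module):
        super().__init__()
        self.inner = inner

    def forward(self, batch):
        src, tgt = batch
        logits = self.inner(src, tgt)
        # Flatten to (batch * seq, vocab), the shape cross-entropy expects
        return logits.reshape(-1, logits.size(-1))


class ToySeq2Seq(nn.Module):
    """Hypothetical stand-in for TransformerModel; ignores tgt for brevity."""

    def __init__(self, vocab_size: int = 20, d_model: int = 8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, src, tgt):
        return self.proj(self.embed(src))


toy = ToySeq2Seq()
wrapped = FlattenedLogits(toy)

src = torch.randint(0, 20, (2, 5))
tgt = torch.randint(0, 20, (2, 5))
print(toy(src, tgt).shape)       # original interface is unchanged
print(wrapped((src, tgt)).shape)  # flattened view for the trainer
```

One trade-off: the wrapper's `state_dict` keys gain an `inner.` prefix, so checkpoints saved through the wrapper would need to be loaded through it as well.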

## Set Up the Trainer

Now let's set up the training configuration and create our trainer.

In [None]:
# Create trainer configuration
trainer_config = {
    'learning_rate': config['training']['learning_rate'],
    'weight_decay': config['training']['weight_decay'],
    'num_epochs': config['training']['num_epochs'],
    'batch_size': batch_size,
    'optimizer': 'adamw',  # Use AdamW optimizer
    'scheduler': 'cosine',  # Use cosine annealing scheduler
    'criterion': 'cross_entropy',  # Use cross-entropy loss
    'clip_grad_norm': 1.0,  # Clip gradients
    'early_stopping_patience': 2,  # Stop training if no improvement after 2 epochs
    'checkpoint_dir': '../checkpoints/wikitext',  # Directory to save model checkpoints
    'save_best_only': True  # Only save the best model
}

# Create directories if they don't exist
os.makedirs(trainer_config['checkpoint_dir'], exist_ok=True)

# Create the trainer
trainer = Trainer(model, trainer_config, device)
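
The project's `Trainer` is not shown in this notebook, so how it interprets these settings is an assumption. In plain PyTorch, `'adamw'`, `'cosine'`, `'cross_entropy'`, and `clip_grad_norm` would typically correspond to something like the following sketch, which trains a tiny stand-in model on random data and schedules the learning rate per epoch.

```python
import torch
from torch import nn

toy_model = nn.Linear(8, 4)  # stand-in for the Transformer
num_epochs, steps_per_epoch = 5, 3

optimizer = torch.optim.AdamW(toy_model.parameters(), lr=2e-4, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)
criterion = nn.CrossEntropyLoss()

lrs = []
for epoch in range(num_epochs):
    for _ in range(steps_per_epoch):
        x = torch.randn(16, 8)          # random inputs
        y = torch.randint(0, 4, (16,))  # random class labels
        optimizer.zero_grad()
        loss = criterion(toy_model(x), y)
        loss.backward()
        # Rescale gradients so their global L2 norm is at most 1.0
        torch.nn.utils.clip_grad_norm_(toy_model.parameters(), max_norm=1.0)
        optimizer.step()
    lrs.append(scheduler.get_last_lr()[0])  # learning rate used this epoch
    scheduler.step()                         # cosine decay, once per epoch

print(lrs)
```

The learning rate starts at `2e-4` and decreases along a cosine curve toward zero over the five epochs.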

## Train the Model

Now we're ready to train our model on WikiText-2.

In [None]:
# Start training
print(f"Starting training for {trainer_config['num_epochs']} epochs...")
stats = trainer.train(train_loader, val_loader)

# Print best results
print(f"\nBest validation accuracy: {stats['best_val_acc']:.2f}%")
print(f"Best validation loss: {stats['best_val_loss']:.4f} (epoch {stats['best_epoch']})")

# Restore the original forward method
model.forward = original_forward
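
With `early_stopping_patience` set to 2, training may end before all five epochs have run. The `Trainer`'s exact rule is not shown here; the sketch below implements one common interpretation, where training stops once the validation loss has failed to improve for `patience` consecutive epochs. The function name and the loss values are made up for illustration.

```python
def early_stop_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop, or None if it never does."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Hypothetical validation losses: improvement stalls after epoch 2
print(early_stop_epoch([5.0, 4.5, 4.6, 4.7, 4.4], patience=2))  # 4
```

In this example the fifth epoch would have improved on the best loss but is never reached, which is the usual cost of a short patience window.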

## Visualize Training Results

Let's visualize how the training and validation metrics changed during training.

In [None]:
# Plot training and validation loss/accuracy
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(range(1, len(stats['train_loss']) + 1), stats['train_loss'], label='Training Loss')
plt.plot(range(1, len(stats['val_loss']) + 1), stats['val_loss'], label='Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training and Validation Loss')
plt.legend()
plt.grid(True)

plt.subplot(1, 2, 2)
plt.plot(range(1, len(stats['train_acc']) + 1), stats['train_acc'], label='Training Accuracy')
plt.plot(range(1, len(stats['val_acc']) + 1), stats['val_acc'], label='Validation Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy (%)')
plt.title('Training and Validation Accuracy')
plt.legend()
plt.grid(True)

plt.tight_layout()
plt.show()

## Calculate Perplexity

Perplexity is a common metric for evaluating language models. It is the exponentiated average negative log-likelihood of a sequence.
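
As a quick sanity check on this definition, a model that assigns equal probability to every token in a vocabulary of size V should have a perplexity of exactly V. The sketch below verifies this with uniform logits over a toy vocabulary of 10 tokens; the targets are arbitrary.

```python
import math

import torch
import torch.nn.functional as F

vocab = 10
logits = torch.zeros(6, vocab)               # uniform prediction at 6 positions
targets = torch.tensor([0, 1, 2, 3, 4, 5])   # arbitrary target tokens

# Summed negative log-likelihood over all positions
nll_sum = F.cross_entropy(logits, targets, reduction="sum")

# Perplexity = exp(average negative log-likelihood per token)
perplexity = math.exp(nll_sum.item() / targets.numel())
print(f"{perplexity:.2f}")
```

Lower is better: a perplexity of 10 means the model is, on average, as uncertain as if it were choosing uniformly among 10 tokens.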

In [None]:
def calculate_perplexity(model, data_loader, device):
    model.eval()
    total_loss = 0
    total_tokens = 0
    
    with torch.no_grad():
        for src, tgt in data_loader:
            src, tgt = src.to(device), tgt.to(device)
            
            # Forward pass
            output = model(src, tgt)
            
            # Sum the loss over non-padding tokens
            # (reshape, unlike view, also handles non-contiguous outputs)
            output_flat = output.reshape(-1, output.size(-1))
            tgt_flat = tgt.reshape(-1)
            loss = torch.nn.functional.cross_entropy(output_flat, tgt_flat, reduction='sum', ignore_index=tokenizer.pad_token_id)
            
            # Count non-padding tokens
            non_pad_mask = tgt_flat != tokenizer.pad_token_id
            num_tokens = non_pad_mask.sum().item()
            
            total_loss += loss.item()
            total_tokens += num_tokens
    
    # Calculate perplexity
    avg_loss = total_loss / total_tokens
    perplexity = np.exp(avg_loss)
    
    return perplexity

# Load the best model
best_model_path = os.path.join(trainer_config['checkpoint_dir'], 'best_model.pt')
model.load(best_model_path)

# Calculate perplexity on validation and test sets
val_perplexity = calculate_perplexity(model, val_loader, device)
test_perplexity = calculate_perplexity(model, test_loader, device)

print(f"Validation Perplexity: {val_perplexity:.2f}")
print(f"Test Perplexity: {test_perplexity:.2f}")

## Generate Text with the Trained Model

Let's use our trained model to generate some text.

In [None]:
def generate_text(model, prompt, tokenizer, max_length=50, temperature=1.0):
    # Tokenize the prompt
    encoding = tokenizer(prompt, return_tensors='pt')
    input_ids = encoding['input_ids'].to(device)
    
    # Generate sequence
    generated_ids = model.generate(input_ids, max_length=max_length, temperature=temperature)
    
    # Decode the generated tokens
    generated_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    
    return generated_text

# Define some prompts for text generation
prompts = [
    "The history of artificial intelligence",
    "Neural networks are",
    "The capital city of France is Paris, which"
]

# Generate text with different temperature settings
temperatures = [0.7, 1.0, 1.3]

model.eval()
for prompt in prompts:
    print(f"\nPrompt: {prompt}")
    for temp in temperatures:
        generated = generate_text(model, prompt, tokenizer, max_length=50, temperature=temp)
        print(f"\nTemperature {temp}:")
        print(generated)
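
`model.generate` is part of the project's model class and its sampling strategy is not shown here. Temperature is conventionally applied by dividing the logits by it before the softmax, so the sketch below assumes that convention. The helper name `sample_next_token` and the logits are hypothetical.

```python
import torch


def sample_next_token(logits, temperature=1.0, generator=None):
    """Sample one token index from temperature-scaled logits."""
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1, generator=generator)


# Toy logits over a three-token vocabulary
logits = torch.tensor([2.0, 1.0, 0.0])

# Lower temperatures concentrate probability on the top-scoring token
for t in (0.7, 1.0, 1.3):
    probs = torch.softmax(logits / t, dim=-1)
    print(t, [round(p, 3) for p in probs.tolist()])

token = sample_next_token(logits, temperature=0.7,
                          generator=torch.Generator().manual_seed(0))
print(token.item())
```

Under this convention, 0.7 yields more conservative continuations than 1.0, and 1.3 spreads probability toward less likely tokens.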

## Evaluate on Specific Examples

Let's evaluate our model on specific examples to analyze its behavior.

In [None]:
def analyze_prediction(model, text, tokenizer, device):
    # Tokenize the text
    encoding = tokenizer(text, return_tensors='pt')
    input_ids = encoding['input_ids'].to(device)
    
    # Shift input and target for language modeling
    src = input_ids[:, :-1]
    tgt = input_ids[:, 1:]
    
    # Get model predictions
    model.eval()
    with torch.no_grad():
        output = model(src, tgt)
    
    # Get the predicted token at each position
    predicted_ids = torch.argmax(output, dim=-1)
    
    # Decode the tokens
    input_tokens = tokenizer.convert_ids_to_tokens(src[0].tolist())
    target_tokens = tokenizer.convert_ids_to_tokens(tgt[0].tolist())
    predicted_tokens = tokenizer.convert_ids_to_tokens(predicted_ids[0].tolist())
    
    # Calculate accuracy
    correct = (predicted_ids == tgt).sum().item()
    total = tgt.numel()
    accuracy = correct / total * 100
    
    return {
        'input_tokens': input_tokens,
        'target_tokens': target_tokens,
        'predicted_tokens': predicted_tokens,
        'accuracy': accuracy
    }

# Define some example texts to analyze
analysis_texts = [
    "The capital of France is Paris.",
    "Neural networks are a subset of machine learning.",
    "In computer science, artificial intelligence refers to intelligence demonstrated by machines."
]

# Analyze predictions
for text in analysis_texts:
    print(f"\nAnalyzing: '{text}'")
    result = analyze_prediction(model, text, tokenizer, device)
    
    print(f"Prediction accuracy: {result['accuracy']:.2f}%")
    
    # Print a table of tokens
    print(f"\n{'Token':10} | {'Target':10} | {'Predicted':10}")
    print("-" * 40)
    for i in range(len(result['input_tokens'])):
        input_token = result['input_tokens'][i]
        target_token = result['target_tokens'][i] if i < len(result['target_tokens']) else 'N/A'
        pred_token = result['predicted_tokens'][i] if i < len(result['predicted_tokens']) else 'N/A'
        
        # Strip GPT-2's byte-level markers: 'Ġ' marks a leading space, 'Ċ' a newline
        # (shown as a literal \n so each table row stays on one line)
        input_token = input_token.replace('Ġ', '').replace('Ċ', '\\n')
        target_token = target_token.replace('Ġ', '').replace('Ċ', '\\n')
        pred_token = pred_token.replace('Ġ', '').replace('Ċ', '\\n')
        
        # Mark correct/incorrect predictions
        mark = "✓" if target_token == pred_token else "✗"
        
        print(f"{input_token:10} | {target_token:10} | {pred_token:10} {mark}")

## Save the Model Configuration

Let's save the model configuration for future reference.

In [None]:
# Save the configuration to a file
os.makedirs('../outputs', exist_ok=True)
config_path = '../outputs/wikitext_transformer_config.yaml'
config_manager.save_config(config_path)
print(f"Configuration saved to {config_path}")

## Conclusion

In this notebook, we have demonstrated how to use the neural networks framework to:

1. Load and preprocess the WikiText-2 dataset from Hugging Face
2. Use a pre-trained tokenizer for text processing
3. Configure and create a Transformer model for language modeling
4. Train the model and evaluate with perplexity
5. Generate text using the trained model
6. Analyze the model's predictions on specific examples

Language modeling is a challenging task that typically requires larger models and more training data for state-of-the-art results. This implementation demonstrates the core concepts and can be extended with more sophisticated architectural elements like adapter layers, larger models, or pre-training and fine-tuning approaches.