# Week 7 - Building and Improving a Simple Language Model

Welcome back! In Week 6, we learned how to prepare textual data for training a language model. We generated input-target pairs using a DataLoader. This week, we'll build upon that foundation to implement and improve a simple neural network language model.

This notebook was created by Qumeng Sun and Lisa Beinborn. It adapts parts from Sebastian Raschka's notebooks accompanying his book "Build a Large Language Model (from Scratch)".


In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import tiktoken
from importlib.metadata import version
import math
import matplotlib.pyplot as plt

print("torch version:", version("torch"))
print("tiktoken version:", version("tiktoken"))

torch version: 2.5.1
tiktoken version: 0.8.0


## 1. Review of data preparation

First, let's revisit how we prepared our data last week. We'll load the text data, tokenize it using the GPT-2 tokenizer, and prepare it for training.

In [2]:
# Load the text data
with open("jane_austen_emma.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

# Initialize the tokenizer
tokenizer = tiktoken.get_encoding("gpt2")

# Tokenize the text
token_ids = tokenizer.encode(raw_text)

print("Total number of tokens:", len(token_ids))
print("First 10 tokens:", token_ids[:10])

Total number of tokens: 229893
First 10 tokens: [198, 44558, 38340, 314, 198, 198, 41481, 314, 628, 198]


## 2. Preparing dataset and dataloader

We'll use the same `GPTDataset` class and `create_dataloader` function that we defined in Week 6 to generate input-target pairs where the target is the input sequence shifted by one token to the right.

In [3]:
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader


class GPTDataset(Dataset):
    def __init__(self, txt, tokenizer, context_length):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Use a sliding window to chunk the book into overlapping sequences of context_length
        for i in range(0, len(token_ids) - context_length):
            input_sequence = token_ids[i:i + context_length]
            
            # The target is the input window shifted by one token to the right
            target_sequence = token_ids[i + 1: i + context_length + 1]

            # input and output are represented as tensors
            self.input_ids.append(torch.tensor(input_sequence))
            self.target_ids.append(torch.tensor(target_sequence))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

def create_dataloader(txt, batch_size=8, context_length=4, shuffle=True, drop_last=True,
                         num_workers=0):

    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

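    # Create dataset
    dataset = GPTDataset(txt, tokenizer, context_length)
    # Randomly split the windows into 80% train, 10% dev, and 10% test.
    # Note: neighbouring windows overlap, so dev and test windows share tokens with training windows.
    train, dev, test = torch.utils.data.random_split(dataset, [0.8, 0.1, 0.1])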
    
    # Create dataloader
    train_dataloader = DataLoader(
        train,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    dev_dataloader = DataLoader(
        dev,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )
    test_dataloader = DataLoader(
        test,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )
    return train_dataloader, dev_dataloader, test_dataloader
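
To make the sliding window concrete, here is a minimal sketch on a toy list of token IDs, written in plain Python rather than with tensors. The helper name `make_pairs` and the toy values are hypothetical; they only mirror the indexing used in `GPTDataset`.

```python
def make_pairs(token_ids, context_length):
    """Return (input, target) windows; each target is its input shifted by one token."""
    pairs = []
    for i in range(len(token_ids) - context_length):
        inputs = token_ids[i:i + context_length]
        targets = token_ids[i + 1:i + context_length + 1]
        pairs.append((inputs, targets))
    return pairs


toy_ids = [10, 11, 12, 13, 14, 15]  # hypothetical token IDs
pairs = make_pairs(toy_ids, context_length=4)
for inputs, targets in pairs:
    print(inputs, "->", targets)
```

A sequence of N tokens yields N - `context_length` pairs, so the six toy tokens give two windows. Consecutive windows overlap in all but one token.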

## 3. Training and evaluating a base model

We'll start by defining and training a simplistic language model to understand the process.

### 3.1. Defining the model

Our base model will consist of:
- **Token Embeddings**: Convert token IDs to dense vectors.
- **Positional Embeddings**: Incorporate positional information.
- **Linear Layer**: Predict the next token in the sequence.

We'll set an appropriate `context_length` during initialization.

In [5]:
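class SimpleLanguageModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_length):
        super().__init__()
        # One learned vector per token ID and one per position
        self.token_embedding = nn.Embedding(vocab_size, embedding_dim)
        self.position_embedding = nn.Embedding(context_length, embedding_dim)
        # Maps each embedding to one score (logit) per vocabulary entry
        self.linear = nn.Linear(embedding_dim, vocab_size)

    def forward(self, x):
        # x: (batch_size, seq_len) token IDs
        positions = torch.arange(0, x.size(1), device=x.device).unsqueeze(0)  # (1, seq_len)
        token_embeds = self.token_embedding(x)                # (batch_size, seq_len, embedding_dim)
        position_embeds = self.position_embedding(positions)  # (1, seq_len, embedding_dim)

        # Position embeddings are broadcast over the batch dimension
        embeddings = token_embeds + position_embeds
        logits = self.linear(embeddings)                      # (batch_size, seq_len, vocab_size)
        return logits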
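
The way token and positional embeddings are combined can be illustrated with tiny lookup tables in plain Python. This is only a sketch: the table values, the two-dimensional vectors, and the helper `embed` are made up for illustration and stand in for the learned `nn.Embedding` weights.

```python
# Toy lookup tables with embedding_dim = 2 (all values are made up)
token_table = {7: [1.0, 0.0], 9: [0.0, 1.0]}            # token ID -> vector
position_table = [[0.1, 0.1], [0.2, 0.2], [0.3, 0.3]]   # position -> vector


def embed(token_ids):
    """Add the token vector and the position vector at every position."""
    return [
        [t + p for t, p in zip(token_table[tok], position_table[pos])]
        for pos, tok in enumerate(token_ids)
    ]


out = embed([7, 9, 7])
print(out[0])  # token 7 at position 0
print(out[2])  # token 7 at position 2: same token, different vector
```

In `SimpleLanguageModel`, the linear layer is then applied to each position's vector separately, so the prediction at a position depends only on the token at that position and on the position index, not on earlier tokens.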

### 3.2. Setting up training parameters

We'll initialize our model with an appropriate `context_length` and prepare for training.

Check the torch documentation for the description of [CrossEntropyLoss](https://pytorch.org/docs/stable/_modules/torch/nn/modules/loss.html#CrossEntropyLoss) and try to understand what it means that it "is equivalent to applying LogSoftmax on an input, followed by NLLLoss."
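
The equivalence quoted above can be checked by hand for a single prediction. The following sketch uses plain Python so that the arithmetic stays visible; the function names and the three made-up logits are assumptions for illustration, not part of the PyTorch API.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax for one list of logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]


def nll(log_probs, target):
    """Negative log-likelihood of the target class."""
    return -log_probs[target]


def cross_entropy(logits, target):
    """Cross-entropy computed directly from the softmax probability of the target."""
    exps = [math.exp(z) for z in logits]
    return -math.log(exps[target] / sum(exps))


toy_logits = [2.0, 0.5, -1.0]  # made-up scores for a 3-token vocabulary
toy_target = 0
two_step = nll(log_softmax(toy_logits), toy_target)
one_step = cross_entropy(toy_logits, toy_target)
print(two_step, one_step)
```

Both routes give the same number. By default, `nn.CrossEntropyLoss` additionally averages this value over all tokens in the batch.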

Check the documentation for the [Adam optimizer](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html) and make sure you understand the role of the `lr` (learning rate) parameter.
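
To get a feeling for `lr`, it can help to write out a single Adam update for one scalar parameter. The sketch below follows the textbook Adam update rule, not PyTorch's source code; the function `adam_step` is hypothetical, and its default values are chosen to match the documented defaults of `torch.optim.Adam`.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (t is the 1-based step count)."""
    m = beta1 * m + (1 - beta1) * grad        # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


steps = []
for grad in (0.01, 100.0):
    new_param, _, _ = adam_step(param=1.0, grad=grad, m=0.0, v=0.0, t=1)
    steps.append(1.0 - new_param)
    print(f"grad={grad}: parameter moved by {steps[-1]:.6f}")
```

In the first step, the parameter moves by roughly `lr` regardless of how large the gradient is, because Adam divides by the running gradient magnitude. The learning rate therefore acts as an approximate upper bound on the size of each update.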

In [7]:
# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

# Parameters
batch_size = 128
context_length = 32  # Context size for training
vocab_size = tokenizer.n_vocab
embedding_dim = 128

# Create the DataLoader
train_dataloader, dev_dataloader, test_dataloader = create_dataloader(
    raw_text, batch_size=batch_size, 
    context_length=context_length, shuffle=True
)

# Initialize the model
model = SimpleLanguageModel(vocab_size, embedding_dim, context_length).to(device)

# Define loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Training loop parameters
num_epochs = 2

Using device: cpu


### 3.3. Training the model

Let's train this very simple model and monitor the loss. Training will take a while.
Make sure you understand every step of the code, at least conceptually, and consult the PyTorch documentation where needed. If training takes too long, first test it with a smaller portion of the dataset and/or fewer epochs.

In [8]:
import matplotlib.pyplot as plt

train_losses = []
perplexities = []

# Go through learning epochs
for epoch in range(num_epochs):
    total_loss = 0
    model.train()
    
    # Read in data in batches
    for batch_idx, (x, y) in enumerate(train_dataloader):
        x = x.to(device)
        y = y.to(device)

        # Reset the gradients
        optimizer.zero_grad()

        # Apply the forward pass
        logits = model(x)

        # Reshape logits and labels
        token_logits = logits.view(-1, vocab_size)
        token_labels = y.view(-1)

        # To understand what is happening during reshaping, print out logits.shape and token_logits.shape
        # and the same for y
        #print(logits.shape, token_logits.shape)
        #print(y.shape, token_labels.shape)
        #print(y[0])
        #print(token_labels[0:10])

        # Calculate the loss
        loss = criterion(token_logits, token_labels)

        # Apply the backward step (calculate the gradients) 
        loss.backward()

        # Adjust the weights
        optimizer.step()

        # Accumulate the loss over batches
        total_loss += loss.item()

        # Monitor progress every twenty batches
        if batch_idx % 20 == 0:
            print(f"Epoch [{epoch+1}/{num_epochs}], Step [{batch_idx}/{len(train_dataloader)}], Loss: {loss.item():.4f}")

    # Calculate average cross-entropy loss and perplexity
    avg_loss = total_loss / len(train_dataloader)
    perplexity = math.exp(avg_loss)
    
    # Monitor developments over learning process
    train_losses.append(avg_loss)
    perplexities.append(perplexity)
    print(f"Epoch [{epoch+1}/{num_epochs}] Average Loss: {avg_loss:.4f}, Perplexity: {perplexity:.2f}")

Epoch [1/2], Step [0/1436], Loss: 11.1604
Epoch [1/2], Step [20/1436], Loss: 10.2425
Epoch [1/2], Step [40/1436], Loss: 9.2030


KeyboardInterrupt: 
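
Perplexity is the exponential of the average cross-entropy loss. A useful reference point is a model that assigns the same probability to every token in the vocabulary: its loss is the logarithm of the vocabulary size, about 10.82 for the 50257 GPT-2 tokens, so the first logged loss of 11.16 is roughly at chance level. The sketch below checks this in plain Python; the variable names and the per-batch losses are made up for illustration.

```python
import math

gpt2_vocab_size = 50257  # size of the GPT-2 vocabulary used by the tokenizer

# A model that assigns probability 1 / vocab_size to every token
uniform_loss = -math.log(1 / gpt2_vocab_size)
uniform_perplexity = math.exp(uniform_loss)
print(f"loss: {uniform_loss:.3f}, perplexity: {uniform_perplexity:.0f}")

# Perplexity is computed from the average loss, as in the training loop
toy_batch_losses = [9.2, 8.7, 8.1]  # made-up per-batch losses
toy_avg_loss = sum(toy_batch_losses) / len(toy_batch_losses)
toy_perplexity = math.exp(toy_avg_loss)
print(f"average loss: {toy_avg_loss:.2f}, perplexity: {toy_perplexity:.1f}")
```

A perplexity of k can be read as the model being as uncertain as if it had to choose uniformly among k tokens.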

In [None]:
# Plotting the Loss and Perplexity

plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(train_losses, label='Training Loss', linestyle='dashed', marker="o")
plt.title('Simple Model - Training Loss over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(perplexities, label='Perplexity', linestyle='dashed', marker="o")
plt.title('Simple Model - Perplexity over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Perplexity')
plt.legend()

plt.show()

### 3.4. Evaluating the model

Now, we'll compute the perplexity of our base model on the development set.

In [None]:
model.eval()
total_loss = 0

with torch.no_grad():
    for x, y in dev_dataloader:
        x = x.to(device)
        y = y.to(device)
        logits = model(x)
        loss = criterion(logits.view(-1, vocab_size), y.view(-1))
        total_loss += loss.item()

avg_loss = total_loss / len(dev_dataloader)
perplexity_simple = math.exp(avg_loss)
print(f"Perplexity of base model: {perplexity_simple:.2f}")

## 4. Training with dropout

To prevent overfitting and improve generalization, we'll test dropout as a regularization strategy. 

### 4.1. Adding dropout

We'll modify our model to include a dropout layer.

In [None]:
class RegularizedLanguageModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_length, dropout=0.2):
        super(RegularizedLanguageModel, self).__init__()
        self.token_embedding = nn.Embedding(vocab_size, embedding_dim)
        self.position_embedding = nn.Embedding(context_length, embedding_dim)
        # New: dropout randomly zeroes embedding entries during training
        self.dropout = nn.Dropout(dropout)
        self.linear = nn.Linear(embedding_dim, vocab_size)
        
    def forward(self, x):
        positions = torch.arange(0, x.size(1), device=x.device).unsqueeze(0)
        token_embeds = self.token_embedding(x)
        position_embeds = self.position_embedding(positions)
        
        embeddings = token_embeds + position_embeds
        embeddings = self.dropout(embeddings)
        logits = self.linear(embeddings)
        return logits
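
What the dropout layer does to the embeddings can be sketched in plain Python. As documented for `nn.Dropout`, each value is zeroed with probability `p` during training and the remaining values are scaled by 1/(1-p); in evaluation mode the input passes through unchanged. The function `dropout` below is a hypothetical re-implementation for illustration only, and the input values are made up.

```python
import random


def dropout(values, p, training, rng):
    """Inverted dropout: zero each value with probability p, scale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return list(values)
    keep_scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * keep_scale for v in values]


rng = random.Random(0)         # fixed seed so the run is repeatable
toy_embedding = [1.0] * 10000  # made-up embedding values
train_out = dropout(toy_embedding, p=0.2, training=True, rng=rng)
eval_out = dropout(toy_embedding, p=0.2, training=False, rng=rng)

dropped = sum(1 for v in train_out if v == 0.0) / len(train_out)
print(f"fraction dropped in training: {dropped:.3f}")
print(f"mean in training: {sum(train_out) / len(train_out):.3f}")
print(f"unchanged in eval: {eval_out == toy_embedding}")
```

The scaling keeps the expected value of each entry the same in training and evaluation, which is why no further adjustment is needed after calling `model.eval()`.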

### 4.2. Retraining the model with dropout

We'll re-initialize the model and optimizer, then retrain.

In [None]:
train_losses_reg = []
perplexities_reg = []

# Re-initialize the model with dropout
model = RegularizedLanguageModel(vocab_size, embedding_dim, context_length, dropout=0.2).to(device)

# Re-initialize the optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Retrain the model
for epoch in range(num_epochs):
    total_loss = 0
    model.train()
    for batch_idx, (x, y) in enumerate(train_dataloader):
        x = x.to(device)
        y = y.to(device)
        
        optimizer.zero_grad()
        logits = model(x)
        
        loss = criterion(logits.view(-1, vocab_size), y.view(-1))
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
        
        if batch_idx % 10 == 0:
            print(f"Epoch [{epoch+1}/{num_epochs}], Step [{batch_idx}/{len(train_dataloader)}], Loss: {loss.item():.4f}")
    avg_loss = total_loss / len(train_dataloader)
    perplexity = math.exp(avg_loss)
    train_losses_reg.append(avg_loss)
    perplexities_reg.append(perplexity)
    print(f"Epoch [{epoch+1}/{num_epochs}] Average Loss: {avg_loss:.4f}, Perplexity: {perplexity:.2f}")

In [None]:
# Plotting loss and perplexity for the model with dropout

plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(train_losses_reg, label='Training Loss', linestyle="dashed", marker="o")
plt.title('Dropout Model - Training Loss over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(perplexities_reg, label='Perplexity', linestyle="dashed", marker="o")
plt.title('Dropout Model - Perplexity over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Perplexity')
plt.legend()

plt.show()

### 4.3. Evaluating the dropout model

Now, we'll compute the perplexity of our modified model on the development set.

In [None]:
# Setting the model to evaluation mode turns off dropout
model.eval()
total_loss = 0

with torch.no_grad():
    for x, y in dev_dataloader:
        x = x.to(device)
        y = y.to(device)
        logits = model(x)
        loss = criterion(logits.view(-1, vocab_size), y.view(-1))
        total_loss += loss.item()

avg_loss = total_loss / len(dev_dataloader)
perplexity_regularized = math.exp(avg_loss)
print(f"Regularized Model Perplexity: {perplexity_regularized:.2f}")

## 5. Improving the Model

Now, try to further improve the model. For example, you could:
- Increase the model depth.
- Increase the embedding dimension.
- Introduce non-linear activation functions.
- Adjust the `context_length`.
- Adjust the parameters of the optimizer. 

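## 6. Generating text

Finally, we can use a trained model to generate text: starting from a prompt, we repeatedly sample the next token from the predicted distribution and append it to the context.
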
In [None]:
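def generate_text(model, tokenizer, start_text, context_length=15, temperature=1.0):
    # Here, context_length is the maximum total length (prompt plus generated tokens).
    # It must not exceed the context_length the model was initialized with,
    # because the positional embedding table has only that many entries.
    model.eval()
    generated = tokenizer.encode(start_text)
    context = torch.tensor(generated, dtype=torch.long, device=device).unsqueeze(0)

    with torch.no_grad():
        # Append one sampled token at a time until the sequence is full
        while context.size(1) < context_length:
            logits = model(context)
            # Use only the prediction at the last position; the temperature rescales the logits
            next_token_logits = logits[0, -1, :] / temperature
            probabilities = torch.softmax(next_token_logits, dim=-1)
            # Sample the next token ID from the predicted distribution
            next_token_id = torch.multinomial(probabilities, num_samples=1)
            context = torch.cat([context, next_token_id.unsqueeze(0)], dim=1)

    generated_text = tokenizer.decode(context[0].tolist())
    return generated_text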

start_text = "Emma was"
generated_text = generate_text(model, tokenizer, start_text, context_length=20)
print("Generated Text:\n")
print(generated_text)
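
The `temperature` parameter of `generate_text` divides the logits before the softmax. The following plain-Python sketch shows the effect on a made-up set of three logits; the helper `softmax_with_temperature` is hypothetical and only mirrors that one step of the function.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Temperatures below 1 concentrate the probability mass on the highest-scoring token, so sampling becomes more predictable; temperatures above 1 flatten the distribution and make the output more varied. To try this with the trained model, pass a different value to `generate_text`, for example `temperature=0.7`.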