# 1. Language Models

In [None]:
!pip install datasets accelerate



In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

import pandas as pd
import numpy as np
import nltk
from nltk.util import ngrams
from nltk.tokenize import word_tokenize
from collections import defaultdict, Counter
import string
import math
import os
import time
import re

from transformers import GPT2TokenizerFast, GPT2LMHeadModel
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling
from datasets import Dataset as HFDataset

nltk.download(['punkt', 'wordnet', 'stopwords', 'punkt_tab'])

# Set device to CPU
device = torch.device("cpu")
print(f"Using device: {device}")

# Create data directory if it doesn't exist
if not os.path.exists('data'):
    os.makedirs('data')

# Load dataset
data_path = "imdb.csv"
print(f"Attempting to load dataset from: {data_path}")
try:
    # Attempt to read with default utf-8 encoding first
    try:
        df = pd.read_csv(data_path)
    except UnicodeDecodeError:
        print("UTF-8 decoding failed, trying latin-1...")
        # Common alternative encoding for datasets found online
        df = pd.read_csv(data_path, encoding='latin-1')
    print("Dataset loaded successfully.")
    # Basic cleaning: remove HTML tags
    def clean_html(text):
        clean = re.compile('<.*?>')
        return re.sub(clean, '', text)
    df['review'] = df['review'].apply(clean_html)

except FileNotFoundError:
    print(f"Error: {data_path} not found.")
    print("Please download the 'IMDB Dataset of 50K Movie Reviews' from Kaggle")
    print(f"and place the CSV file at '{data_path}' (relative to this notebook).")
    # Create a dummy dataframe to allow the rest of the notebook to run without crashing immediately
    # You will need to replace this with the actual data loading for the code to work.
    df = pd.DataFrame({'review': [
        "This is a placeholder review. Please load the real dataset.",
        "Another placeholder review. The models need real data to train."
    ]})
    print("Created dummy dataframe. Please load the actual data for meaningful results.")

# Display dataframe info if loaded
if 'review' in df.columns:
    print(f"Loaded {len(df)} reviews.")
else:
    print("Warning: 'review' column not found in the loaded data.")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


Using device: cpu
Attempting to load dataset from: imdb.csv
Dataset loaded successfully.
Loaded 50000 reviews.


**Goal:** This notebook aims to implement, train, and compare three different language models on the IMDB movie review dataset:
1.  A statistical n-gram model (Trigram with Add-1 smoothing).
2.  A neural language model (LSTM) trained from scratch.
3.  A pre-trained transformer model (GPT-2 small) fine-tuned on the dataset.

We will evaluate these models based on their perplexity on held-out sentences (both grammatically correct and incorrect) and their ability to generate coherent text continuations.

# 2. Dataset Description

The dataset used is the "IMDB Dataset of 50K Movie Reviews", commonly sourced from Kaggle or Stanford's AI repository. It contains 50,000 movie reviews, originally intended for binary sentiment classification, but here used solely for language modeling. The reviews are plain text and vary significantly in length, averaging about 262 word-level tokens each (see the statistics computed below).

In [None]:
print(f"Total number of reviews: {len(df)}")

# Calculate average tokens per review
total_tokens = 0
num_reviews_for_avg = len(df) # Use all reviews for average calculation
print(f"Calculating average tokens for {num_reviews_for_avg} reviews (this may take a moment)...")
start_time = time.time()
for review in df['review'].iloc[:num_reviews_for_avg]: # Iterate directly over the series
    try:
        tokens = word_tokenize(review.lower()) # Tokenize after lowercasing
        total_tokens += len(tokens)
    except Exception as e:
        print(f"Skipping a review due to tokenization error: {e}")
        num_reviews_for_avg -= 1 # Adjust count if a review fails

if num_reviews_for_avg > 0:
    average_tokens = total_tokens / num_reviews_for_avg
    print(f"Average number of tokens per review: {average_tokens:.2f}")
else:
    print("Could not calculate average tokens (no valid reviews found).")
end_time = time.time()
print(f"Token calculation took {end_time - start_time:.2f} seconds.")

# Show a few example reviews
print("\nExample Reviews:")
for i, review in enumerate(df['review'].head(3)):
    print(f"\nReview {i+1}:")
    print(review[:500] + ("..." if len(review) > 500 else "")) # Print first 500 chars

Total number of reviews: 50000
Calculating average tokens for 50000 reviews (this may take a moment)...
Average number of tokens per review: 261.77
Token calculation took 110.25 seconds.

Example Reviews:

Review 1:
One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.It is called OZ as that is the nickname...

Review 2:
A wonderful little production. The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. The actors are extremely well chosen- Michael Sheen not only "has got all the p

# 3. Statistical n-gram Model

We implement a trigram (n=3) language model. This choice balances context (more than a bigram model captures) against data sparsity (less severe than with 4-grams).

To handle unseen n-grams during evaluation, we use Add-1 (Laplace) smoothing: a constant k=1 is added to every trigram count, and k times the vocabulary size is added to the context count so that the probabilities for each context still sum to one.

**Hyperparameters:**
*   n = 3 (Trigram)
*   k = 1 (Add-1 smoothing factor)
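
As a minimal sketch of the Add-1 formula, the toy example below computes smoothed trigram probabilities by hand. The two-sentence corpus and the helper name `add_k_prob` are illustrative assumptions, not part of the notebook's pipeline.

```python
from collections import Counter

# Toy corpus (hypothetical), already tokenized and padded with <s>/</s>
sentences = [
    ["<s>", "<s>", "the", "movie", "was", "good", "</s>"],
    ["<s>", "<s>", "the", "movie", "was", "bad", "</s>"],
]

vocab = {tok for sent in sentences for tok in sent}
V = len(vocab)

trigram_counts = Counter()
context_counts = Counter()
for sent in sentences:
    for i in range(2, len(sent)):
        trigram_counts[(sent[i - 2], sent[i - 1], sent[i])] += 1
        context_counts[(sent[i - 2], sent[i - 1])] += 1

def add_k_prob(w1, w2, w3, k=1):
    """P(w3 | w1, w2) = (Count(w1, w2, w3) + k) / (Count(w1, w2) + k * V)."""
    return (trigram_counts[(w1, w2, w3)] + k) / (context_counts[(w1, w2)] + k * V)

p_seen = add_k_prob("movie", "was", "good")     # trigram seen once
p_unseen = add_k_prob("movie", "was", "the")    # context seen, trigram unseen
p_new_ctx = add_k_prob("bad", "good", "movie")  # context never seen: k / (k * V)

print(f"V = {V}")
print(f"seen trigram:    {p_seen:.4f}")
print(f"unseen trigram:  {p_unseen:.4f}")
print(f"unseen context:  {p_new_ctx:.4f}")
```

Even in this tiny case the seen trigram only reaches 2/9, because the vocabulary size dominates the denominator. The effect is far stronger with a vocabulary of tens of thousands of tokens.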

In [None]:
# --- N-gram Model Implementation ---

def preprocess_ngram(text, n):
    """Lowercase, tokenize, add start/end symbols."""
    text = text.lower()
    tokens = word_tokenize(text)
    # Add n-1 start symbols and 1 end symbol
    return ['<s>'] * (n - 1) + tokens + ['</s>']

# Use a subset of data for building n-gram counts to manage memory/time on CPU
ngram_train_size = 10000 # Adjust as needed based on system resources
print(f"Preprocessing {ngram_train_size} reviews for n-gram model...")
ngram_corpus = [preprocess_ngram(text, n=3) for text in df['review'].iloc[:ngram_train_size]]

# Build Vocabulary and Counts
print("Building vocabulary and n-gram counts...")
vocab = set()
unigram_counts = Counter()
bigram_counts = defaultdict(int)
trigram_counts = defaultdict(int)

for sentence in ngram_corpus:
    # Unigrams (for vocab and potential backoff/smoothing needs)
    unigram_counts.update(sentence)
    vocab.update(sentence)

    # Bigrams
    for w1, w2 in ngrams(sentence, 2):
        bigram_counts[(w1, w2)] += 1

    # Trigrams
    for w1, w2, w3 in ngrams(sentence, 3):
        trigram_counts[(w1, w2, w3)] += 1

vocab_size = len(vocab)
print(f"Vocabulary size: {vocab_size}")
print(f"Total trigrams counted: {len(trigram_counts)}")

# Add-1 Smoothed Probability Function
def get_add1_trigram_prob(w1, w2, w3, trigram_counts, bigram_counts, vocab_size, k=1):
    """Calculate Add-1 smoothed trigram probability P(w3 | w1, w2)."""
    # Count(w1, w2, w3)
    tri_count = trigram_counts.get((w1, w2, w3), 0)

    # Count(w1, w2), the context count. Every (w1, w2) pair used as a context is
    # followed by a token in the padded sentence (only bigrams ending in '</s>'
    # are not, and those are never contexts), so its bigram count equals
    # sum_w3 Count(w1, w2, w3) and can be used directly in the denominator.
    bi_count = bigram_counts.get((w1, w2), 0)

    # P(w3 | w1, w2) = (Count(w1, w2, w3) + k) / (Count(w1, w2) + k * V)
    probability = (tri_count + k) / (bi_count + k * vocab_size)
    return probability

# Perplexity Function
def calculate_ngram_perplexity(sentences, prob_func, n, trigram_counts, bigram_counts, vocab_size, k=1):
    """Calculates perplexity for a list of sentences using a given probability function."""
    total_log_prob = 0
    total_tokens = 0
    num_sentences = 0

    for sentence_tokens in sentences:
        num_sentences += 1
        # We predict tokens starting from the first word, up to and including </s>
        # The number of predictions is len(sentence_tokens) - (n - 1)
        sentence_log_prob = 0
        num_predicted_tokens = 0

        # Iterate through trigrams in the sentence
        for i in range(n - 1, len(sentence_tokens)):
            w1 = sentence_tokens[i - 2]
            w2 = sentence_tokens[i - 1]
            w3 = sentence_tokens[i]

            # Replace OOV words with a special token if needed, or handle in prob_func
            # For simplicity, Add-1 handles OOV implicitly by giving them non-zero probability
            # if the context (w1, w2) was seen.
            # If context (w1, w2) was never seen, bi_count=0, prob = k / (k*V) = 1/V

            prob = prob_func(w1, w2, w3, trigram_counts, bigram_counts, vocab_size, k)

            if prob > 0:
                sentence_log_prob += math.log2(prob)
            else:
                # Should not happen with Add-1 smoothing unless V=0
                sentence_log_prob += -float('inf') # Assign very low probability

            num_predicted_tokens += 1

        if num_predicted_tokens > 0:
            total_log_prob += sentence_log_prob
            total_tokens += num_predicted_tokens
        else:
            print(f"Warning: Sentence too short for trigram model: {sentence_tokens}")

    if total_tokens == 0:
        return float('inf') # Avoid division by zero

    # Perplexity = 2^(-1/N * sum(log2(P(wi|...))))
    avg_log_prob = total_log_prob / total_tokens
    perplexity = math.pow(2, -avg_log_prob)
    return perplexity

# Define Held-out Sentences
correct_sentences_text = [
    "This movie was surprisingly good.",
    "I did not like the ending at all.",
    "The acting was superb and the plot was engaging.",
    "It felt like a waste of time.",
    "She delivered a truly remarkable performance."
    # Add 5 more if desired
]

incorrect_sentences_text = [
    "Good surprisingly was movie this.",
    "Ending the like all not did I at.",
    "Engaging plot the superb was acting and was.",
    "Time of waste a like felt it.",
    "Performance remarkable truly a delivered she."
    # Add 5 more if desired
]

# Preprocess test sentences
ngram_test_correct = [preprocess_ngram(s, n=3) for s in correct_sentences_text]
ngram_test_incorrect = [preprocess_ngram(s, n=3) for s in incorrect_sentences_text]

# Calculate and Report Perplexity
print("\nCalculating N-gram Perplexity...")
perplexity_correct = calculate_ngram_perplexity(
    sentences=ngram_test_correct,
    prob_func=get_add1_trigram_prob,
    n=3,
    trigram_counts=trigram_counts,
    bigram_counts=bigram_counts,
    vocab_size=vocab_size,
    k=1
)
perplexity_incorrect = calculate_ngram_perplexity(
    sentences=ngram_test_incorrect,
    prob_func=get_add1_trigram_prob,
    n=3,
    trigram_counts=trigram_counts,
    bigram_counts=bigram_counts,
    vocab_size=vocab_size,
    k=1
)

print(f"N-gram Perplexity on correct sentences: {perplexity_correct:.2f}")
print(f"N-gram Perplexity on incorrect sentences: {perplexity_incorrect:.2f}")

Preprocessing 10000 reviews for n-gram model...
Building vocabulary and n-gram counts...
Vocabulary size: 77746
Total trigrams counted: 1689050

Calculating N-gram Perplexity...
N-gram Perplexity on correct sentences: 5568.18
N-gram Perplexity on incorrect sentences: 50105.90


**N-gram Perplexity Discussion:**

The results match expectations: perplexity on the grammatically correct sentences (5,568.18) is about nine times lower than on the scrambled ones (50,105.90). The trigram sequences in the correct sentences (e.g., "movie was surprisingly", "was surprisingly good") are more likely to have occurred in the training data, and therefore receive higher smoothed probabilities, than the jumbled sequences in the incorrect sentences (e.g., "good surprisingly was", "surprisingly was movie").

A lower perplexity indicates the model is less "surprised" by the sequence, meaning it assigns higher probabilities to the observed words given their context. The large gap between the two scores suggests the n-gram model has captured some basic sequential structure of the English used in the reviews.

The absolute values are nevertheless very high, even for the correct sentences. Add-1 smoothing adds the full vocabulary size (V = 77,746) to every denominator, which moves most of the probability mass to unseen trigrams, and a context that never occurred in training yields a uniform 1/V. Data sparsity compounds this: with only 10,000 training reviews, many valid trigrams were never observed.
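
To put the scale of these numbers in context, the sketch below evaluates the perplexity formula on hand-picked probabilities. The helper `perplexity` and the probability lists are assumptions made for illustration; only the vocabulary size is taken from the output above.

```python
import math

def perplexity(token_probs):
    """Perplexity = 2 ** (-(1/N) * sum(log2 p_i)) over N predicted tokens."""
    n = len(token_probs)
    avg_log2 = sum(math.log2(p) for p in token_probs) / n
    return 2 ** (-avg_log2)

V = 77746  # vocabulary size reported by the n-gram cell

# A model that spreads probability uniformly over the vocabulary
uniform = perplexity([1 / V] * 10)

# A model that is fairly confident about each token (made-up probabilities)
confident = perplexity([0.5, 0.25, 0.5, 0.25])

print(f"uniform over V: {uniform:.2f}")
print(f"confident:      {confident:.2f}")
```

A uniform guess over the vocabulary has perplexity equal to V, so both reported scores are better than chance, but the score on the scrambled sentences is already within a factor of two of that reference point.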

# 4. Neural Network Model from Scratch

We will build an LSTM-based neural language model from scratch using PyTorch.

**Architecture:**
*   **Vocabulary Size:** Top 20,000 most frequent tokens from the training subset.
*   **Embedding Layer:** Maps token indices to dense vectors of size 128.
*   **LSTM Layers:** 2 stacked LSTM layers with a hidden state size of 256.
*   **Dropout:** Applied with a probability of 0.2 to the embeddings, between the stacked LSTM layers (only when num_layers > 1), and to the LSTM output before the linear layer.
*   **Linear Layer:** Maps LSTM output features to scores for each word in the vocabulary.

**Hyperparameters:**
*   **Learning Rate (lr):** 1e-3 (0.001)
*   **Batch Size:** 64
*   **Epochs:** 5 (Adjust based on training time and convergence)
*   **Weight Decay:** 1e-5 (for regularization)
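
As a sanity check on the architecture, the parameter count can be worked out by hand. The sketch below assumes the standard `nn.LSTM` parameterization (four gates per layer, each with input weights, hidden weights, and two bias vectors); the helper name `lstm_layer_params` is hypothetical.

```python
def lstm_layer_params(input_dim, hidden_dim):
    """One LSTM layer: 4 gates x (input weights + hidden weights + 2 bias vectors)."""
    return 4 * (input_dim * hidden_dim + hidden_dim * hidden_dim + 2 * hidden_dim)

vocab_size, embedding_dim, hidden_dim, num_layers = 20_000, 128, 256, 2

embedding = vocab_size * embedding_dim
lstm = sum(
    # The first layer reads embeddings; later layers read the previous hidden state
    lstm_layer_params(embedding_dim if layer == 0 else hidden_dim, hidden_dim)
    for layer in range(num_layers)
)
linear = hidden_dim * vocab_size + vocab_size  # weights + bias

total = embedding + lstm + linear
print(f"Embedding: {embedding:,}")
print(f"LSTM:      {lstm:,}")
print(f"Linear:    {linear:,}")
print(f"Total:     {total:,}")  # 8,621,600
```

Most of the parameters sit in the embedding and output layers, both of which scale with the vocabulary size, so the 20,000-token cap has a larger effect on model size than the LSTM dimensions do.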

In [None]:
# --- LSTM Model Implementation ---
from torch.nn.utils.rnn import pad_sequence

# --- 1. Preprocessing and Vocabulary ---
print("\n--- LSTM Model Setup ---")
LSTM_VOCAB_SIZE = 20000
LSTM_TRAIN_SIZE = 10000 # Use a subset for faster training on CPU
MAX_LEN = 100 # Maximum tokens per review; longer reviews are truncated
BATCH_SIZE = 64
EMBEDDING_DIM = 128
HIDDEN_DIM = 256
NUM_LAYERS = 2
DROPOUT = 0.2
LEARNING_RATE = 1e-3
WEIGHT_DECAY = 1e-5
EPOCHS = 5 # Adjust as needed

print(f"Processing {LSTM_TRAIN_SIZE} reviews for LSTM vocabulary and training...")

# Tokenize and build vocabulary
all_tokens = []
for review in df['review'].iloc[:LSTM_TRAIN_SIZE]:
    tokens = word_tokenize(review.lower())
    all_tokens.extend(tokens)

token_counts = Counter(all_tokens)
vocab = [word for word, count in token_counts.most_common(LSTM_VOCAB_SIZE - 2)] # Reserve space for PAD, UNK
vocab_set = set(vocab)

# Add special tokens
PAD_TOKEN = "<pad>"
UNK_TOKEN = "<unk>"
if PAD_TOKEN not in vocab_set:
    vocab.append(PAD_TOKEN)
if UNK_TOKEN not in vocab_set:
    vocab.append(UNK_TOKEN)

word_to_idx = {word: idx for idx, word in enumerate(vocab)}
idx_to_word = {idx: word for word, idx in word_to_idx.items()}
actual_vocab_size = len(word_to_idx)
PAD_IDX = word_to_idx[PAD_TOKEN]
UNK_IDX = word_to_idx[UNK_TOKEN]

print(f"Actual LSTM vocabulary size: {actual_vocab_size}")

# --- 2. PyTorch Dataset and DataLoader ---
class ImdbDataset(Dataset):
    def __init__(self, texts, word_to_idx, max_len, unk_idx):
        self.sequences = []
        self.targets = []
        self.max_len = max_len

        for text in texts:
            tokens = word_tokenize(text.lower())
            indexed = [word_to_idx.get(token, unk_idx) for token in tokens]

            # Create input sequences and targets (next word prediction)
            # Truncate long sequences
            if len(indexed) > max_len:
                indexed = indexed[:max_len]

            # We need at least 2 tokens to form a pair (input, target)
            if len(indexed) >= 2:
                self.sequences.append(torch.tensor(indexed[:-1], dtype=torch.long))
                self.targets.append(torch.tensor(indexed[1:], dtype=torch.long))

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        return self.sequences[idx], self.targets[idx]

def collate_fn(batch, pad_idx):
    """Pad sequences within a batch."""
    sequences, targets = zip(*batch)
    # Pad sequences
    sequences_padded = pad_sequence(sequences, batch_first=True, padding_value=pad_idx)
    # Pad targets - crucial: use pad_idx or a specific ignore_index for loss
    targets_padded = pad_sequence(targets, batch_first=True, padding_value=pad_idx)
    return sequences_padded, targets_padded

# Create dataset and dataloader instances
lstm_dataset = ImdbDataset(df['review'].iloc[:LSTM_TRAIN_SIZE].tolist(), word_to_idx, MAX_LEN, UNK_IDX)
# Use functools.partial to pass pad_idx to collate_fn
from functools import partial
lstm_dataloader = DataLoader(lstm_dataset, batch_size=BATCH_SIZE,
                           shuffle=True, collate_fn=partial(collate_fn, pad_idx=PAD_IDX))

print(f"Created LSTM Dataset with {len(lstm_dataset)} sequences.")

# --- 3. LSTM Model Definition ---
class LSTMModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_layers, dropout, pad_idx):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers,
                            batch_first=True, dropout=dropout if num_layers > 1 else 0)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden_dim, vocab_size)
        self.pad_idx = pad_idx

    def forward(self, text):
        # text shape: [batch_size, seq_len]
        embedded = self.dropout(self.embedding(text))
        # embedded shape: [batch_size, seq_len, embedding_dim]

        lstm_out, (hidden, cell) = self.lstm(embedded)
        # lstm_out shape: [batch_size, seq_len, hidden_dim]

        # Apply dropout to LSTM output before linear layer
        dropped_out = self.dropout(lstm_out)

        predictions = self.fc(dropped_out)
        # predictions shape: [batch_size, seq_len, vocab_size]
        return predictions

# Instantiate the model
lstm_model = LSTMModel(actual_vocab_size, EMBEDDING_DIM, HIDDEN_DIM, NUM_LAYERS, DROPOUT, PAD_IDX)
lstm_model.to(device) # Move model to CPU

# Compute and print total number of trainable parameters
total_params_lstm = sum(p.numel() for p in lstm_model.parameters() if p.requires_grad)
print(f'Total trainable parameters (LSTM): {total_params_lstm:,}')

# --- 4. Training Loop ---
optimizer = optim.AdamW(lstm_model.parameters(), lr=LEARNING_RATE, weight_decay=WEIGHT_DECAY)
# Use CrossEntropyLoss - it combines LogSoftmax and NLLLoss
# Important: set ignore_index to PAD_IDX so padding tokens don't contribute to loss
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)

print(f"\nStarting LSTM training for {EPOCHS} epochs...")
lstm_training_start_time = time.time()

lstm_model.train() # Set model to training mode
for epoch in range(EPOCHS):
    epoch_loss = 0
    epoch_start_time = time.time()

    for i, (sequences, targets) in enumerate(lstm_dataloader):
        sequences, targets = sequences.to(device), targets.to(device)

        optimizer.zero_grad()

        # Get predictions (logits)
        # Shape: [batch_size, seq_len, vocab_size]
        predictions = lstm_model(sequences)

        # Reshape for CrossEntropyLoss
        # Predictions need to be [batch_size * seq_len, vocab_size]
        # Targets need to be [batch_size * seq_len]
        predictions_flat = predictions.view(-1, actual_vocab_size)
        targets_flat = targets.view(-1)

        loss = criterion(predictions_flat, targets_flat)

        loss.backward()
        # Optional: Gradient clipping
        # torch.nn.utils.clip_grad_norm_(lstm_model.parameters(), max_norm=1.0)
        optimizer.step()

        epoch_loss += loss.item()

        # Print progress periodically
        if (i + 1) % 50 == 0:
            print(f'Epoch [{epoch+1}/{EPOCHS}], Step [{i+1}/{len(lstm_dataloader)}], Loss: {loss.item():.4f}')

    epoch_end_time = time.time()
    avg_epoch_loss = epoch_loss / len(lstm_dataloader)
    print(f'\nEpoch {epoch+1} finished.')
    print(f'Average Training Loss: {avg_epoch_loss:.4f}')
    print(f'Epoch Time: {epoch_end_time - epoch_start_time:.2f} seconds\n')

lstm_training_end_time = time.time()
lstm_total_training_time = lstm_training_end_time - lstm_training_start_time
print("Finished LSTM training.")
print(f"Total Training Time: {lstm_total_training_time:.2f} seconds")

# --- 5. LSTM Perplexity Calculation ---

def preprocess_lstm_sentence(sentence_text, word_to_idx, unk_idx, device):
    """Tokenize and index a sentence for the LSTM model."""
    tokens = word_tokenize(sentence_text.lower())
    indexed = [word_to_idx.get(token, unk_idx) for token in tokens]
    # Need input sequence (all but last) and target sequence (all but first)
    if len(indexed) < 2:
        return None, None
    input_seq = torch.tensor([indexed[:-1]], dtype=torch.long).to(device) # Add batch dim
    target_seq = torch.tensor([indexed[1:]], dtype=torch.long).to(device) # Add batch dim
    return input_seq, target_seq

def calculate_lstm_perplexity(model, sentences_text, word_to_idx, unk_idx, pad_idx, device):
    """Calculate perplexity for sentences using the trained LSTM model."""
    model.eval() # Set model to evaluation mode
    total_loss = 0
    total_tokens = 0
    criterion = nn.CrossEntropyLoss(ignore_index=pad_idx, reduction='sum') # Sum losses, we average manually

    with torch.no_grad():
        for text in sentences_text:
            input_seq, target_seq = preprocess_lstm_sentence(text, word_to_idx, unk_idx, device)

            if input_seq is None or target_seq is None:
                print(f"Skipping short sentence: {text}")
                continue

            # Get model predictions (logits)
            # Shape: [1, seq_len, vocab_size]
            predictions = model(input_seq)

            # Reshape for CrossEntropyLoss
            # Predictions: [seq_len, vocab_size]
            # Targets: [seq_len]
            predictions_flat = predictions.squeeze(0) # Remove batch dim
            targets_flat = target_seq.squeeze(0)     # Remove batch dim

            loss = criterion(predictions_flat, targets_flat)

            # Accumulate loss and count non-padding tokens in the target
            total_loss += loss.item()
            num_tokens = (targets_flat != pad_idx).sum().item()
            total_tokens += num_tokens

    if total_tokens == 0:
        return float('inf')

    # Average cross-entropy loss
    avg_loss = total_loss / total_tokens
    # Perplexity is exp(average cross-entropy loss)
    perplexity = math.exp(avg_loss)
    return perplexity

# Calculate and Report Perplexity
print("\nCalculating LSTM Perplexity...")

# Use the same sentences as the n-gram model
lstm_perplexity_correct = calculate_lstm_perplexity(
    lstm_model, correct_sentences_text, word_to_idx, UNK_IDX, PAD_IDX, device
)
lstm_perplexity_incorrect = calculate_lstm_perplexity(
    lstm_model, incorrect_sentences_text, word_to_idx, UNK_IDX, PAD_IDX, device
)

print(f"LSTM Perplexity on correct sentences: {lstm_perplexity_correct:.2f}")
print(f"LSTM Perplexity on incorrect sentences: {lstm_perplexity_incorrect:.2f}")


--- LSTM Model Setup ---
Processing 10000 reviews for LSTM vocabulary and training...
Actual LSTM vocabulary size: 20000
Created LSTM Dataset with 10000 sequences.
Total trainable parameters (LSTM): 8,621,600

Starting LSTM training for 5 epochs...
Epoch [1/5], Step [50/157], Loss: 6.6312
Epoch [1/5], Step [100/157], Loss: 6.5858
Epoch [1/5], Step [150/157], Loss: 6.6032

Epoch 1 finished.
Average Training Loss: 6.8126
Epoch Time: 1061.94 seconds

Epoch [2/5], Step [50/157], Loss: 6.5525
Epoch [2/5], Step [100/157], Loss: 6.5135
Epoch [2/5], Step [150/157], Loss: 6.4555

Epoch 2 finished.
Average Training Loss: 6.5362
Epoch Time: 1041.55 seconds

Epoch [3/5], Step [50/157], Loss: 6.3792
Epoch [3/5], Step [100/157], Loss: 6.4058
Epoch [3/5], Step [150/157], Loss: 6.2890

Epoch 3 finished.
Average Training Loss: 6.4207
Epoch Time: 1056.47 seconds

Epoch [4/5], Step [50/157], Loss: 6.1999
Epoch [4/5], Step [100/157], Loss: 6.1245
Epoch [4/5], Step [150/157], Loss: 6.1266

Epoch 4 finishe

# 5. Fine-tuning a Pre-trained Model

We will fine-tune a pre-trained Generative Pre-trained Transformer 2 (GPT-2) model, specifically the smallest version ("gpt2"), which has roughly 124 million parameters (often cited as 117M, the figure given at its original release).

**Model:** GPT-2 is a large transformer-based language model developed by OpenAI. It uses a decoder-only transformer architecture.

**Pre-training:** It was pre-trained on a massive dataset (WebText) with the objective of predicting the next word in a sequence. This allows it to learn grammar, facts, and reasoning abilities, which can be adapted to specific downstream tasks like text generation on movie reviews through fine-tuning.

**Fine-tuning:** We will use the Hugging Face `transformers` library to load the pre-trained model and tokenizer, and the `Trainer` API to fine-tune the model on the IMDB dataset. We will use a smaller batch size (8) due to potential memory constraints on CPU and train for only 1 or 2 epochs, as fine-tuning typically requires less training than training from scratch.
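
The fine-tuning cell that follows packs the tokenized reviews into fixed-length blocks before training. The sketch below illustrates that idea with made-up token ids and a hypothetical helper, `group_into_blocks`; the notebook's own function works on batched Hugging Face dataset columns instead.

```python
def group_into_blocks(tokenized_docs, block_size):
    """Concatenate token-id lists and split them into full blocks, dropping the remainder."""
    flat = [tok for doc in tokenized_docs for tok in doc]
    usable = (len(flat) // block_size) * block_size
    return [flat[i:i + block_size] for i in range(0, usable, block_size)]

docs = [[1, 2, 3, 4, 5], [6, 7, 8], [9, 10, 11]]  # made-up token ids
blocks = group_into_blocks(docs, block_size=4)
# 11 tokens -> 2 full blocks of 4; the last 3 tokens are dropped

for block in blocks:
    # Causal LM objective: each position predicts the token that follows it
    inputs, targets = block[:-1], block[1:]
    print(inputs, "->", targets)
```

The loop spells out the input/target pairs only for illustration. With `GPT2LMHeadModel`, the labels are passed as a copy of the inputs and the one-position shift happens inside the model's loss computation.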

In [None]:
import torch
from transformers import GPT2TokenizerFast, GPT2LMHeadModel, TrainingArguments, Trainer, DataCollatorForLanguageModeling
# Alias the Hugging Face Dataset so it does not shadow torch.utils.data.Dataset
from datasets import Dataset as HFDataset
import time
import math

# --- GPT-2 Fine-tuning Implementation ---
print("\n--- GPT-2 Fine-tuning Setup ---")
MODEL_NAME = "gpt2"  # Smallest GPT-2 model (about 124M parameters)
GPT2_BATCH_SIZE = 8  # Smaller batch size for CPU fine-tuning
GPT2_LEARNING_RATE = 5e-5
GPT2_EPOCHS = 1  # Fine-tuning usually requires fewer epochs
BLOCK_SIZE = 128  # Sequence length for GPT-2 training chunks
GPT2_TRAIN_SIZE = 5000  # Use a smaller subset for faster fine-tuning demonstration

# Set device (CPU or GPU if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# --- 1. Load Tokenizer and Model ---
print(f"Loading tokenizer and model for '{MODEL_NAME}'...")
tokenizer = GPT2TokenizerFast.from_pretrained(MODEL_NAME)
# Set pad token if it doesn't exist (GPT-2 usually doesn't have one by default)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    print("Set tokenizer pad_token to eos_token")

gpt2_model = GPT2LMHeadModel.from_pretrained(MODEL_NAME)
# Keep the model config's pad_token_id in sync with the tokenizer
gpt2_model.config.pad_token_id = tokenizer.pad_token_id
gpt2_model.to(device)  # Move model to device

# Print model's total parameters
total_params_gpt2 = sum(p.numel() for p in gpt2_model.parameters() if p.requires_grad)
print(f'Total trainable parameters (GPT-2): {total_params_gpt2:,}')

# --- 2. Tokenize Dataset ---
print(f"Tokenizing {GPT2_TRAIN_SIZE} reviews for GPT-2 fine-tuning...")
# Use a subset of the dataframe for faster processing
df_subset = df.iloc[:GPT2_TRAIN_SIZE].copy()

# Convert pandas DataFrame to Hugging Face Dataset object
hf_dataset = HFDataset.from_pandas(df_subset)

# Define tokenization function
def tokenize_function(examples):
    # Tokenize texts
    return tokenizer(examples["review"], truncation=False)

# Apply tokenization (batched=True for speed)
tokenized_datasets = hf_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=hf_dataset.column_names  # drop all original columns (e.g. "review", "sentiment")
)

# Define function to group texts into blocks
def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the small remainder
    total_length = (total_length // BLOCK_SIZE) * BLOCK_SIZE
    # Split by chunks of block_size
    result = {
        k: [t[i:i + BLOCK_SIZE] for i in range(0, total_length, BLOCK_SIZE)]
        for k, t in concatenated_examples.items()
    }
    # Labels are a copy of input_ids; the model shifts them by one position internally when computing the loss
    result["labels"] = result["input_ids"].copy()
    return result

# Apply grouping (batched=True for efficiency)
lm_datasets = tokenized_datasets.map(group_texts, batched=True)
print(f"Created LM dataset with {len(lm_datasets)} blocks of size {BLOCK_SIZE}.")

# --- 3. Setup Hugging Face Trainer ---
print("Setting up Hugging Face Trainer...")

# Define Training Arguments
training_args = TrainingArguments(
    output_dir="./gpt2-finetuned-imdb",
    overwrite_output_dir=True,
    num_train_epochs=GPT2_EPOCHS,
    per_device_train_batch_size=GPT2_BATCH_SIZE,
    learning_rate=GPT2_LEARNING_RATE,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=50,
    save_steps=500,
    save_total_limit=2,
    fp16=torch.cuda.is_available(),
    report_to="none"
)

# Data Collator for Language Modeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False  # Causal language modeling for GPT-2
)

# Instantiate Trainer
trainer = Trainer(
    model=gpt2_model,
    args=training_args,
    train_dataset=lm_datasets,
    data_collator=data_collator,
)

# --- 4. Fine-tuning ---
print(f"\nStarting GPT-2 fine-tuning for {GPT2_EPOCHS} epochs...")
gpt2_tuning_start_time = time.time()

trainer.train()

gpt2_tuning_end_time = time.time()
gpt2_total_tuning_time = gpt2_tuning_end_time - gpt2_tuning_start_time
print("Finished GPT-2 fine-tuning.")
print(f"Total Fine-tuning Time: {gpt2_total_tuning_time:.2f} seconds")

# --- 5. Save the Fine-tuned Model ---
final_model_path = "./gpt2-finetuned-imdb-final"
print(f"Saving final fine-tuned model to {final_model_path}...")
trainer.save_model(final_model_path)
tokenizer.save_pretrained(final_model_path)
print("Model saved.")

# --- 6. GPT-2 Perplexity Calculation ---
def calculate_gpt2_perplexity(model, tokenizer, sentences_text, device, block_size=1024):
    """Calculate perplexity for sentences using the fine-tuned GPT-2 model."""
    model.eval()
    total_loss = 0
    total_tokens = 0

    with torch.no_grad():
        for text in sentences_text:
            encodings = tokenizer(text, return_tensors='pt', add_special_tokens=True)
            input_ids = encodings.input_ids.to(device)
            target_ids = input_ids.clone()

            seq_len = input_ids.size(1)

            if seq_len < 2:
                print(f"Skipping very short sentence: {text}")
                continue

            outputs = model(input_ids, labels=target_ids)
            loss = outputs.loss
            num_tokens = seq_len - 1
            if num_tokens > 0:
                 total_loss += loss.item() * num_tokens
                 total_tokens += num_tokens

    if total_tokens == 0:
        return float('inf')

    avg_ce_loss = total_loss / total_tokens
    perplexity = math.exp(avg_ce_loss)
    return perplexity

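# Example sentences for perplexity calculation.
# Note: these assignments overwrite the held-out sets used for the n-gram and
# LSTM models, so the GPT-2 perplexities are measured on different sentences
# and are not directly comparable to the earlier numbers.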
correct_sentences_text = [
    "The movie was fantastic with great acting and an engaging plot.",
    "I thoroughly enjoyed this film and would recommend it to anyone."
]
incorrect_sentences_text = [
    "Movie the was fantastic great acting and an plot engaging.",
    "Enjoyed thoroughly I this film and would recommend it anyone to."
]

# Calculate and Report Perplexity
print("\nCalculating Fine-tuned GPT-2 Perplexity...")

gpt2_perplexity_correct = calculate_gpt2_perplexity(
    gpt2_model, tokenizer, correct_sentences_text, device
)
gpt2_perplexity_incorrect = calculate_gpt2_perplexity(
    gpt2_model, tokenizer, incorrect_sentences_text, device
)

print(f"Fine-tuned GPT-2 Perplexity on correct sentences: {gpt2_perplexity_correct:.2f}")
print(f"Fine-tuned GPT-2 Perplexity on incorrect sentences: {gpt2_perplexity_incorrect:.2f}")


--- GPT-2 Fine-tuning Setup ---
Using device: cpu
Loading tokenizer and model for 'gpt2'...
Set tokenizer pad_token to eos_token
Total trainable parameters (GPT-2): 124,439,808
Tokenizing 5000 reviews for GPT-2 fine-tuning...


Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (1051 > 1024). Running this sequence through the model will result in indexing errors


Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

Created LM dataset with 11144 blocks of size 128.
Setting up Hugging Face Trainer...

Starting GPT-2 fine-tuning for 1 epochs...


`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Step,Training Loss
50,4.0369
100,4.0259
150,3.9955
200,3.9866
250,3.9752
300,3.9795
350,4.0036
400,3.9406
450,3.9432
500,3.9594


Finished GPT-2 fine-tuning.
Total Fine-tuning Time: 23933.07 seconds
Saving final fine-tuned model to ./gpt2-finetuned-imdb-final...
Model saved.

Calculating Fine-tuned GPT-2 Perplexity...
Fine-tuned GPT-2 Perplexity on correct sentences: 13.28
Fine-tuned GPT-2 Perplexity on incorrect sentences: 308.82
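
The perplexity above is the exponential of the average per-token cross-entropy over all sentences. Because the model reports a mean loss per sentence, each sentence's loss has to be weighted by its number of predicted tokens before averaging. The sketch below illustrates that arithmetic in isolation; the function name `corpus_perplexity` and the loss values are hypothetical and are not taken from the run above.

```python
import math

def corpus_perplexity(mean_losses, token_counts):
    """Token-weighted perplexity from per-sentence mean cross-entropy losses (in nats)."""
    total_tokens = sum(token_counts)
    if total_tokens == 0:
        return float("inf")
    # Undo the per-sentence averaging, then average over all tokens
    total_loss = sum(loss * n for loss, n in zip(mean_losses, token_counts))
    return math.exp(total_loss / total_tokens)

# Made-up example: two sentences with 12 and 4 predicted tokens
weighted = corpus_perplexity([2.0, 4.0], [12, 4])
naive = math.exp((2.0 + 4.0) / 2)  # ignores sentence length
print(f"token-weighted: {weighted:.2f}, naive mean of sentence losses: {naive:.2f}")
```

Averaging the sentence losses directly would give the short sentence the same influence as the long one, which is why the two numbers differ.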


# 6. Text Generation

We will now use each of the three models to generate text continuations based on a set of prompts.

**Sampling Strategies:**
*   **Greedy Search:** At each step, simply choose the word with the highest probability according to the model. This is deterministic but can lead to repetitive or dull text.
*   **Sampling (Top-k / Top-p):** Instead of always picking the best word, introduce randomness.
    *   **Top-k Sampling:** Consider only the `k` most likely next words and redistribute the probability mass among them. Then, sample from this reduced set. (e.g., k=50)
    *   **Top-p (Nucleus) Sampling:** Consider the smallest set of most likely words whose cumulative probability mass exceeds a threshold `p`. Then, sample from this set. (e.g., p=0.9)

Sampling methods often produce more diverse and interesting text than greedy search. In the code below, the n-gram and LSTM models decode greedily for simplicity (the LSTM function also accepts `temperature` and `top_k` arguments), while GPT-2 uses the built-in `generate` method with top-k and top-p sampling.
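
To make the three strategies concrete, here is a minimal pure-Python sketch that applies them to a toy next-word distribution. The helper names (`top_k_filter`, `top_p_filter`, `sample`) and the probabilities are illustrative assumptions; library implementations such as the one behind `model.generate` operate on logits and differ in detail.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalise; probs maps token -> probability."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose cumulative mass reaches p."""
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

def sample(probs, rng=random):
    """Draw one token according to its probability."""
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights, k=1)[0]

# Toy distribution over the word following "The movie was"
next_word_probs = {"good": 0.5, "bad": 0.3, "long": 0.15, "purple": 0.05}

greedy = max(next_word_probs, key=next_word_probs.get)  # always the top word
top2 = top_k_filter(next_word_probs, k=2)               # only the 2 best survive
nucleus = top_p_filter(next_word_probs, p=0.9)          # drops the unlikely tail

print("greedy:", greedy)
print("top-k candidates:", top2)
print("top-p candidates:", nucleus)
print("one top-p sample:", sample(nucleus))
```

Both filters remove the low-probability tail before sampling; top-k fixes the number of candidates, while top-p lets that number shrink or grow with how peaked the distribution is.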

In [None]:
# --- Text Generation Comparison ---
prompts = [
    "I think",
    "She goes to",
    "The movie was",
    "It felt like",
    "Blue dog"
]
MAX_GEN_LEN = 30 # Max number of tokens to generate after the prompt

generation_results = defaultdict(list)

# --- 1. N-gram Generation (Greedy) ---
print("\n--- Generating with N-gram Model (Greedy) ---")
def generate_ngram_greedy(prompt, n, max_len, prob_func, trigram_counts, bigram_counts, vocab, k=1):
    tokens = preprocess_ngram(prompt, n)[:-1] # Preprocess but remove the final </s>
    generated_tokens = []
    vocab_list = list(vocab) # Convert set to list for iteration
    vocab_size_gen = len(vocab_list)

    for _ in range(max_len):
        if len(tokens) < n - 1:
            # Should not happen with preprocessing, but handle defensively
            break
        # Get context (last n-1 tokens) in reading order: w1 precedes w2
        w1 = tokens[-(n-1)]
        w2 = tokens[-(n-2)]

        best_prob = -1
        next_word = None

        # Find the word w3 with the highest P(w3 | w1, w2)
        for w3 in vocab_list:
            # Skip start symbol as a potential next word
            if w3 == '<s>': continue

            prob = prob_func(w1, w2, w3, trigram_counts, bigram_counts, vocab_size_gen, k)
            if prob > best_prob:
                best_prob = prob
                next_word = w3

        if next_word is None or next_word == '</s>':
            break # Stop if no word found or end symbol is generated

        generated_tokens.append(next_word)
        tokens.append(next_word)

    return prompt + " " + " ".join(generated_tokens)

for prompt in prompts:
    generated_text = generate_ngram_greedy(prompt, n=3, max_len=MAX_GEN_LEN,
                                           prob_func=get_add1_trigram_prob,
                                           trigram_counts=trigram_counts,
                                           bigram_counts=bigram_counts,
                                           vocab=vocab, k=1)
    generation_results["N-gram (Greedy)"].append(generated_text)
    print(f"Prompt: '{prompt}' -> N-gram: '{generated_text}'")

# --- 2. LSTM Generation (Greedy / Sampling) ---
print("\n--- Generating with LSTM Model (Greedy) ---")
def generate_lstm(model, prompt, max_len, word_to_idx, idx_to_word, unk_idx, pad_idx, device, temperature=1.0, top_k=0):
    model.eval()
    tokens = word_tokenize(prompt.lower())
    indexed = [word_to_idx.get(token, unk_idx) for token in tokens]
    generated_indices = []

    input_tensor = torch.tensor([indexed], dtype=torch.long).to(device)

    with torch.no_grad():
        for _ in range(max_len):
            # Get predictions (logits) for the last token
            output = model(input_tensor)
            # output shape: [1, current_seq_len, vocab_size]
            last_token_logits = output[:, -1, :] # Shape: [1, vocab_size]

            # Apply temperature scaling
            last_token_logits = last_token_logits / temperature

            # Apply top-k filtering if k > 0
            if top_k > 0:
                indices_to_remove = last_token_logits < torch.topk(last_token_logits, top_k)[0][..., -1, None]
                last_token_logits[indices_to_remove] = -float('Inf')

            # Get probabilities using softmax
            probabilities = torch.softmax(last_token_logits, dim=-1)

            # Pick the next token: sample from the filtered distribution when top_k > 0,
            # otherwise decode greedily with argmax
            if top_k > 0:
                next_token_idx = torch.multinomial(probabilities, num_samples=1).item()
            else:
                next_token_idx = torch.argmax(probabilities, dim=-1).item() # Greedy

            # Stop if PAD or UNK generated (or EOS if defined)
            if next_token_idx == pad_idx or next_token_idx == unk_idx:
                 break

            generated_indices.append(next_token_idx)

            # Append the predicted token index to the input sequence for the next step
            next_token_tensor = torch.tensor([[next_token_idx]], dtype=torch.long).to(device)
            input_tensor = torch.cat([input_tensor, next_token_tensor], dim=1)

    generated_words = [idx_to_word.get(idx, UNK_TOKEN) for idx in generated_indices]
    return prompt + " " + " ".join(generated_words)

for prompt in prompts:
    generated_text = generate_lstm(lstm_model, prompt, max_len=MAX_GEN_LEN,
                                   word_to_idx=word_to_idx, idx_to_word=idx_to_word,
                                   unk_idx=UNK_IDX, pad_idx=PAD_IDX, device=device,
                                   temperature=1.0, top_k=0) # Greedy (top_k=0)
    generation_results["LSTM (Greedy)"].append(generated_text)
    print(f"Prompt: '{prompt}' -> LSTM: '{generated_text}'")

# --- 3. GPT-2 Generation (Using model.generate) ---
print("\n--- Generating with Fine-tuned GPT-2 Model (Sampling) ---")
# Load the saved fine-tuned model for generation
print(f"Loading fine-tuned model from {final_model_path}...")
try:
    gpt2_model_loaded = GPT2LMHeadModel.from_pretrained(final_model_path)
    tokenizer_loaded = GPT2TokenizerFast.from_pretrained(final_model_path)
    gpt2_model_loaded.to(device)
    print("Fine-tuned model loaded successfully.")

    for prompt in prompts:
        # Encode the prompt; the tokenizer call also returns the attention mask
        encoded = tokenizer_loaded(prompt, return_tensors='pt').to(device)

        # Ensure pad_token_id is set for generation
        pad_token_id = tokenizer_loaded.pad_token_id if tokenizer_loaded.pad_token_id is not None else tokenizer_loaded.eos_token_id

        # Generate text using sampling (top-k, top-p)
        output_sequences = gpt2_model_loaded.generate(
            input_ids=encoded["input_ids"],
            attention_mask=encoded["attention_mask"], # Passed explicitly because pad and eos tokens coincide
            max_new_tokens=MAX_GEN_LEN, # Generate up to MAX_GEN_LEN new tokens
            temperature=1.0, # Controls randomness. Higher values = more random.
            top_k=50,        # Considers only the top 50 tokens
            top_p=0.9,       # Nucleus sampling: cumulative probability threshold
            do_sample=True,  # Enable sampling
            num_return_sequences=1, # Generate one sequence
            pad_token_id=pad_token_id # Set pad token id
        )

        # Decode the generated sequence
        generated_text = tokenizer_loaded.decode(output_sequences[0], skip_special_tokens=True)
        generation_results["GPT-2 (Sampled)"].append(generated_text)
        print(f"Prompt: '{prompt}' -> GPT-2: '{generated_text}'")

except OSError as e:
     print(f"Error loading fine-tuned model from {final_model_path}: {e}")
     print("Skipping GPT-2 generation. Ensure the model was saved correctly.")
     for prompt in prompts:
         generation_results["GPT-2 (Sampled)"].append(prompt + " [Error: Model not loaded]")

# --- 4. Display Results in a Table ---
print("\n--- Generation Results Summary ---")
generation_df = pd.DataFrame(generation_results)
generation_df.index = prompts # Use prompts as index for clarity
print(generation_df.to_markdown()) # Print as markdown table (requires the optional 'tabulate' package)


--- Generating with N-gram Model (Greedy) ---
Prompt: 'I think' -> N-gram: 'I think 'm the the one most the part film the is best the of film the is best the of film the is best the of film the is best the'
Prompt: 'She goes to' -> N-gram: 'She goes to the be best described the the one most the part film the is best the of film the is best the of film the is best the of film the'
Prompt: 'The movie was' -> N-gram: 'The movie was the in movie the is best the of film the is best the of film the is best the of film the is best the of film the is best'
Prompt: 'It felt like' -> N-gram: 'It felt like the . film the is best the of film the is best the of film the is best the of film the is best the of film the is best'
Prompt: 'Blue dog' -> N-gram: 'Blue dog the , film and . the the one most the part film the is best the of film the is best the of film the is best the of film'

--- Generating with LSTM Model (Greedy) ---
Prompt: 'I think' -> LSTM: 'I think it is a'
Prompt: 'She goes to' -> 

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Fine-tuned model loaded successfully.
Prompt: 'I think' -> GPT-2: 'I think you have your answer. We had one scene in this movie. It was the last scene. The movie is very cheesy and pretentious. We don'
Prompt: 'She goes to' -> GPT-2: 'She goes to bed after a while, he is sleeping and wakes up to a note from his brother saying he was found dead and his father. The young couple has'
Prompt: 'The movie was' -> GPT-2: 'The movie was terrible and almost as bad as "Manhunt" or "A Day in the Life of the Lambs" and the movie's main characters are so'
Prompt: 'It felt like' -> GPT-2: 'It felt like a remake of the popular movie "The Big Short" with the idea of the director going from a young boy to a serious and committed man. But'
Prompt: 'Blue dog' -> GPT-2: 'Blue dog's name and its dog's name are mentioned throughout the entire movie, but they're never really mentioned. It's very well-written, though I'

--- Generation Results Summary ---
|               | N-gram (Greedy)                     

# 7. Comparison & Conclusions

**Summary Table**

| Feature                     | N-gram (Trigram, Add-1)    | LSTM (from Scratch)        | GPT-2 (Fine-tuned)                     |
| :-------------------------- | :------------------------- | :------------------------- | :------------------------------------- |
| Approx. Training Time       | ~ Seconds/Minutes (Counts) | ~ Hours (on CPU, 5 Epochs) | ~ 6.6 Hours (23,933 s on CPU, 1 Epoch) |
| # Trainable Parameters      | ~ 0 (Counts based)         | ~ 7.5 Million (Example)    | ~ 124 Million (124,439,808)            |
| Avg. Perplexity (Correct)   | [Fill in from Sec 3]       | [Fill in from Sec 4]       | 13.28                                  |
| Avg. Perplexity (Incorrect) | [Fill in from Sec 3]       | [Fill in from Sec 4]       | 308.82                                 |
| Sample Quality Notes        | Often repetitive, grammatically simple/incorrect. Struggles with long context. | Can capture local structure, potentially repetitive. Quality depends heavily on training. | Generally most coherent, fluent, contextually relevant due to pre-training. |

*Note: The n-gram and LSTM training times are rough estimates for the specified subset sizes and epochs on a typical modern CPU; the GPT-2 figures come from the fine-tuning run recorded in Section 5. Actual times will vary.*
*Note: Parameter count for LSTM depends on the exact configuration (vocab size, embedding/hidden dims).*
*Note: The n-gram and LSTM perplexity values need to be filled in from Sections 3 and 4 after running the notebook.*

---

**1. Which model best balances quality and resources?**

*   **N-gram:** Very low resource usage (fast counting, minimal memory), but lowest quality. Poor generalization and generation.
*   **LSTM from Scratch:** Moderate resource usage (significant CPU time for training, moderate parameters/memory). Quality can be decent but requires careful tuning and substantial training. It strikes a balance if pre-trained models are not an option or if full control over the architecture is needed.
*   **Fine-tuned GPT-2:** High resource usage *during fine-tuning* (significant CPU time, high memory), but leverages massive pre-training. Offers the highest quality by far, especially for generation fluency and coherence. If the fine-tuning time (which is often less than training a large LSTM from scratch) is acceptable, **fine-tuned GPT-2 likely offers the best balance between *achievable quality* and *fine-tuning effort***, assuming the pre-trained model itself is available.

The "best" balance depends on the specific constraints. If training time is severely limited, n-grams are fastest. If output quality is paramount, fine-tuning a pre-trained model is the way to go. The LSTM occupies a middle ground.

**2. How could each model be improved?**

*   **N-gram:**
    *   *Better Smoothing:* Use more advanced techniques like Kneser-Ney smoothing instead of Add-1.
    *   *Backoff/Interpolation:* Combine probabilities from trigrams, bigrams, and unigrams instead of relying solely on trigrams.
    *   *Larger Corpus:* Train on more data (though returns diminish).
*   **LSTM from Scratch:**
    *   *More Data & Training:* Train on the full dataset for more epochs (requires more time/resources).
    *   *Larger Model:* Increase embedding size, hidden size, or number of layers.
    *   *Hyperparameter Tuning:* Optimize learning rate, batch size, dropout, weight decay.
    *   *Better Tokenization:* Use subword tokenization (like BPE) instead of word tokenization to handle rare words better.
    *   *Architecture:* Add attention mechanisms or switch to a Transformer architecture (though this becomes much more complex).
*   **Fine-tuned GPT-2:**
    *   *More Fine-tuning Data/Epochs:* Fine-tune on more data or for slightly longer (careful not to overfit).
    *   *Larger Pre-trained Model:* Use GPT-2 Medium (~355M parameters) or Large (~774M parameters) if resources allow (requires significantly more RAM/time).
    *   *Domain Adaptation:* Continue pre-training on a large corpus of movie-related text before fine-tuning on reviews.
    *   *Parameter-Efficient Fine-tuning (PEFT):* Techniques like LoRA can reduce the computational cost of fine-tuning very large models.
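
As one illustration of the interpolation idea listed for the n-gram model above, the sketch below mixes trigram, bigram, and unigram maximum-likelihood estimates with fixed weights. It is a simplified, self-contained example rather than a drop-in replacement for the notebook's add-1 model: the helper names, the whitespace tokenization, the toy corpus, and the weights `(0.6, 0.3, 0.1)` are all assumptions, and in practice the weights would be tuned on held-out data.

```python
from collections import Counter

def train_counts(sentences):
    """Count unigrams, bigrams and trigrams with <s>/<s> padding and a </s> end marker."""
    unigrams, bigrams, trigrams = Counter(), Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>", "<s>"] + sent.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
        trigrams.update(zip(tokens, tokens[1:], tokens[2:]))
    return unigrams, bigrams, trigrams

def interpolated_prob(w1, w2, w3, counts, lambdas=(0.6, 0.3, 0.1)):
    """P(w3 | w1, w2) as a weighted mix of trigram, bigram and unigram estimates."""
    unigrams, bigrams, trigrams = counts
    l3, l2, l1 = lambdas  # assumed weights; they must sum to 1
    p_tri = trigrams[(w1, w2, w3)] / bigrams[(w1, w2)] if bigrams[(w1, w2)] else 0.0
    p_bi = bigrams[(w2, w3)] / unigrams[w2] if unigrams[w2] else 0.0
    p_uni = unigrams[w3] / sum(unigrams.values())
    return l3 * p_tri + l2 * p_bi + l1 * p_uni

toy_corpus = ["the movie was great", "the movie was bad", "the film was great"]
counts = train_counts(toy_corpus)

seen = interpolated_prob("the", "movie", "was", counts)
unseen = interpolated_prob("the", "film", "great", counts)  # trigram never observed
print(f"P(was | the movie)  = {seen:.4f}")
print(f"P(great | the film) = {unseen:.4f}")
```

The unseen trigram still receives a non-zero probability through the unigram term, which is the property that makes interpolation a gentler alternative to add-1 smoothing.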

**3. What difficulties were encountered?**

*   **CPU Training Time:** Training the LSTM from scratch and fine-tuning GPT-2 on a CPU is very time-consuming. Even on a 5,000-review subset, one epoch of GPT-2 fine-tuning took about 6.6 hours (23,933 seconds). This highlights the significant advantage of GPUs for deep learning.
*   **Memory Usage:** Loading and fine-tuning large models like GPT-2, even the small version, can require substantial RAM, potentially limiting batch sizes or feasibility on some systems.
*   **N-gram Implementation:** Correctly implementing smoothing and perplexity for n-grams, especially handling edge cases and context counts accurately, can be tricky.
*   **Hyperparameter Sensitivity:** Neural models (LSTM, GPT-2 fine-tuning) are sensitive to hyperparameters like learning rate and batch size. Finding good settings without extensive experimentation can be challenging.
*   **Vocabulary Management:** Handling large vocabularies, out-of-vocabulary words (UNK tokens), and padding consistently across preprocessing, model definition, and loss calculation requires care.
*   **Dependency Management:** Ensuring all libraries (PyTorch, Transformers, NLTK, Datasets, Accelerate) are installed and compatible can sometimes be an issue.
*   **Perplexity Calculation Nuances:** Ensuring the perplexity calculation is consistent and correct for each model type (handling start/end tokens for n-grams, padding/masking for neural models, log probabilities for stability) requires careful implementation.

# Appendix: Dependencies

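This notebook was tested with core package versions similar to the following (exact versions may differ by install date). You can recreate the environment using pip, then run `pip freeze` to record what was actually installed: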

```
# Run `pip freeze` in your environment after installation
# Example versions (actual versions might differ slightly based on install date):
accelerate==0.24.1
datasets==2.15.0
nltk==3.8.1
numpy==1.26.2
pandas==2.1.1
torch==2.1.0
transformers==4.35.2
# Other dependencies installed automatically might include:
# huggingface-hub, packaging, pyarrow, regex, requests, safetensors, tokenizers, tqdm, etc.
# DataFrame.to_markdown() (used in Section 6) additionally needs the optional tabulate package.
```