# AIG230 – Assignment 6 (Part B Starter Notebook)
## Neural Language Model with PyTorch (RNN) – Student Version

This notebook covers **all of Part B**:

- **B1** Numericalization + training examples  
- **B2** Build an RNN Language Model  
- **B3** Train + validate (loss + perplexity)  
- **B4** Test perplexity + text generation  

### Dataset (same as Part A)
- **NLTK Brown corpus**, category: `news`

### Important
This is a **starter notebook**. You must complete the **TODO** blocks.  
Do not delete TODO comments. Add your code underneath them.


In [None]:
# ===== 1) Setup =====
import re
import math
import random
from collections import Counter
from dataclasses import dataclass
from typing import List, Dict, Tuple

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

import nltk
from nltk.corpus import brown

# Download if needed (safe to run multiple times)
nltk.download('brown')

# Reproducibility (optional)
random.seed(42)
torch.manual_seed(42)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device:", device)


In [None]:
# ===== 2) Configuration (edit if needed) =====
# You can edit these hyperparameters as needed; the defaults should work for a basic run.
# These are not fixed rules, just a starting point. Feel free to experiment with different values!
# The Config dataclass is just a convenient way to keep all the settings in one place. You could also use a dictionary or command-line arguments if you prefer.
# The parameters set the dataset category, the data split, the vocabulary cutoff, the model architecture, and the training settings. Changing them can affect both model quality and training time.
@dataclass
class Config:
    category: str = "news"
    train_ratio: float = 0.80
    val_ratio: float = 0.10
    test_ratio: float = 0.10  # not read by the split code; the test set gets the remainder

    min_freq: int = 2       # vocab cutoff (train only)
    seq_len: int = 30       # T
    batch_size: int = 32

    emb_dim: int = 128
    hid_dim: int = 256
    num_layers: int = 1
    dropout: float = 0.0    # use 0.0 if num_layers == 1

    lr: float = 1e-3
    epochs: int = 5
    grad_clip: float = 1.0  # optional but recommended

cfg = Config()

SPECIAL = {"BOS": "<bos>", "EOS": "<eos>", "UNK": "<unk>"}


# 3) Load + preprocess Brown (shared rules)

Preprocessing rules:
- lowercase
- remove punctuation-only tokens
- keep stopwords
- add `<bos>` and `<eos>` to each sentence
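
As a quick illustration of the rules above, here is a minimal sketch on a made-up sentence. The function name `toy_preprocess` and the example tokens are hypothetical and exist only for this illustration; the regular expression is the same punctuation-only pattern used in the code cell for this section.

```python
import re

# Matches tokens made up entirely of non-word characters (e.g. ",", "``", "--").
PUNCT_ONLY = re.compile(r"^\W+$")

def toy_preprocess(tokens):
    """Lowercase, drop punctuation-only tokens, keep stopwords, wrap in <bos>/<eos>."""
    kept = [t.lower() for t in tokens if not PUNCT_ONLY.match(t)]
    return ["<bos>", *kept, "<eos>"]

example = ["The", "jury", "said", ",", "``", "It's", "over", "''", "."]
result = toy_preprocess(example)
print(result)
```

Note that `it's` survives because it contains word characters; only tokens with no letters, digits, or underscores are removed.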


In [None]:
# ===== 3) Load Brown sentences =====
# This section loads the raw sentences from the Brown corpus, preprocesses them by
# lowercasing and removing punctuation-only tokens, and adds special tokens for the
# beginning and end of sentences. The resulting `sents` variable is a list of tokenized
# sentences ready for building the vocabulary and training the model.
raw_sents: List[List[str]] = brown.sents(categories=cfg.category)
print("Raw sentences:", len(raw_sents))
print("Example raw:", raw_sents[0][:20])

_punct_only = re.compile(r"^\W+$")

def preprocess_sentence(tokens: List[str]) -> List[str]:
    out = []
    for tok in tokens:
        tok = tok.lower()
        if _punct_only.match(tok):
            continue
        out.append(tok)
    return [SPECIAL["BOS"], *out, SPECIAL["EOS"]]

sents = [preprocess_sentence(s) for s in raw_sents]
print("Example preprocessed:", sents[0][:25])


# 4) Split train/val/test (by sentence)

Split at sentence boundaries so that no sentence contributes tokens to more than one split (this avoids leakage).


In [None]:
# ===== 4) Split =====
# This section splits the preprocessed sentences into training, validation, and test sets
# based on the ratios in the configuration. It also defines a helper function to count the
# total number of tokens in each split and prints the number of sentences and tokens per set.
n = len(sents)
n_train = int(cfg.train_ratio * n)
n_val = int(cfg.val_ratio * n)
n_test = n - n_train - n_val

train_sents = sents[:n_train]
val_sents = sents[n_train:n_train+n_val]
test_sents = sents[n_train+n_val:]

def num_tokens(slist: List[List[str]]) -> int:
    return sum(len(s) for s in slist)

print("Train:", len(train_sents), "sentences |", num_tokens(train_sents), "tokens")
print("Val  :", len(val_sents),   "sentences |", num_tokens(val_sents),   "tokens")
print("Test :", len(test_sents),  "sentences |", num_tokens(test_sents),  "tokens")


# 5) Build vocabulary (train only)

Words with frequency `< min_freq` become `<unk>`.
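
To make the cutoff concrete, the following is a small sketch on two made-up training sentences. The helper `build_toy_vocab` and the toy data are assumptions for illustration only; the logic mirrors the rule above, with special tokens taking the first ids.

```python
from collections import Counter

def build_toy_vocab(sentences, min_freq):
    """Return (itos, stoi) keeping only tokens seen at least min_freq times."""
    counts = Counter(tok for sent in sentences for tok in sent)
    itos = ["<bos>", "<eos>", "<unk>"]          # special tokens get ids 0, 1, 2
    stoi = {tok: i for i, tok in enumerate(itos)}
    for tok, c in counts.most_common():          # most frequent tokens first
        if tok not in stoi and c >= min_freq:
            stoi[tok] = len(itos)
            itos.append(tok)
    return itos, stoi

toy_train = [["<bos>", "the", "cat", "sat", "<eos>"],
             ["<bos>", "the", "dog", "sat", "<eos>"]]
toy_itos, toy_stoi = build_toy_vocab(toy_train, min_freq=2)

# "cat" and "dog" occur once, so they are not in the vocabulary;
# an unseen word such as "bird" is treated the same way.
unk = toy_stoi["<unk>"]
toy_ids = [toy_stoi.get(tok, unk) for tok in ["<bos>", "the", "bird", "sat", "<eos>"]]
print(toy_itos)
print(toy_ids)
```

Because the vocabulary is built from the training split only, rare training words and words that appear only in validation or test data all map to the same `<unk>` id.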


In [None]:
# ===== 5) Vocabulary =====
# This section builds the vocabulary from the training sentences by counting the frequency
# of each token and keeping only those that meet the minimum frequency threshold in the
# configuration. It also creates mappings from tokens to indices (`stoi`) and from indices
# to tokens (`itos`), and prints some statistics about the vocabulary and the most common
# tokens in the training set.
train_tokens = [tok for sent in train_sents for tok in sent]
counts = Counter(train_tokens)

itos = [SPECIAL["BOS"], SPECIAL["EOS"], SPECIAL["UNK"]]
stoi: Dict[str, int] = {tok: i for i, tok in enumerate(itos)}

for tok, c in counts.most_common():
    if tok in stoi:
        continue
    if c >= cfg.min_freq:
        stoi[tok] = len(itos)
        itos.append(tok)

vocab_size = len(itos)
unk_id = stoi[SPECIAL["UNK"]]

print("min_freq:", cfg.min_freq)
print("vocab_size:", vocab_size)
print("UNK id:", unk_id)
print("\nTop 15 tokens (train):")
for tok, c in counts.most_common(15):
    print(f"{tok:>12}  {c}")


# 6) B1 – Numericalize + create training examples (Option 1)

You will:
1) Convert each sentence into token IDs  
2) Concatenate into one long stream per split  
3) Build a Dataset that returns:
- `x = stream[i : i+T]`
- `y = stream[i+1 : i+T+1]`

Complete the TODO blocks below.
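
The shift between `x` and `y` is easiest to see on a tiny stream. The sketch below uses made-up token ids and a hypothetical helper `make_example`, and it assumes a stride of 1 between consecutive start positions; it is an illustration of the slicing, not the required Dataset implementation.

```python
def make_example(stream, i, T):
    """Return (x, y) where y is x shifted one position to the right."""
    x = stream[i : i + T]
    y = stream[i + 1 : i + T + 1]
    return x, y

toy_stream = [0, 5, 6, 7, 1, 0, 8, 9, 1]  # made-up token ids
T = 4

# A valid start position needs T tokens for x plus one extra token for y.
num_examples = len(toy_stream) - T
x0, y0 = make_example(toy_stream, 0, T)
x_last, y_last = make_example(toy_stream, num_examples - 1, T)
print(num_examples, x0, y0, x_last, y_last)
```

At every position `t`, `y[t]` is the token that follows `x[t]` in the stream, which is exactly the next-token prediction target.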


In [None]:
# ===== 6.1) Numericalize (TODO) =====
# This function converts a list of tokens into a list of token IDs using the `stoi` mapping.
# A token that is not in the vocabulary is mapped to `unk_id`.
# Why is this necessary? Models cannot process raw text; they need numerical input. Token IDs give a compact representation suitable for training, and `unk_id` handles out-of-vocabulary tokens gracefully.
# How does this connect to the embedding layer? The embedding layer takes token IDs as input and maps each one to a dense vector (its embedding),
# which the rest of the network then processes.
def numericalize_sentence(tokens: List[str], stoi: Dict[str, int], unk_id: int) -> List[int]:
    # TODO: return list of token IDs for this sentence
    # Hint: use stoi.get(tok, unk_id)
    raise NotImplementedError

train_ids_sents = [numericalize_sentence(s, stoi, unk_id) for s in train_sents]
val_ids_sents   = [numericalize_sentence(s, stoi, unk_id) for s in val_sents]
test_ids_sents  = [numericalize_sentence(s, stoi, unk_id) for s in test_sents]

print("Example tokens:", train_sents[0][:12])
print("Example ids   :", train_ids_sents[0][:12])


In [None]:
# ===== 6.2) Build streams =====
train_stream = [tid for sent in train_ids_sents for tid in sent]
val_stream   = [tid for sent in val_ids_sents   for tid in sent]
test_stream  = [tid for sent in test_ids_sents  for tid in sent]

print("Train stream length:", len(train_stream))
print("Val   stream length:", len(val_stream))
print("Test  stream length:", len(test_stream))


In [None]:
# ===== 6.3) Dataset (TODO) =====
class NextTokenStreamDataset(Dataset):
    def __init__(self, token_stream: List[int], seq_len: int):
        self.stream = token_stream
        self.T = seq_len

    def __len__(self) -> int:
        # TODO: return number of examples in this stream
        # Need T tokens for x and 1 extra token for y shift.
        raise NotImplementedError

    def __getitem__(self, idx: int) -> Tuple[torch.Tensor, torch.Tensor]:
        # TODO: create x and y slices, convert to torch.long tensors
        # x: stream[idx : idx+T]
        # y: stream[idx+1 : idx+T+1]
        raise NotImplementedError

train_ds = NextTokenStreamDataset(train_stream, cfg.seq_len)
val_ds   = NextTokenStreamDataset(val_stream,   cfg.seq_len)
test_ds  = NextTokenStreamDataset(test_stream,  cfg.seq_len)

print("Train examples:", len(train_ds))


In [None]:
# ===== 6.4) DataLoaders + sanity check (run after TODOs are done) =====
train_loader = DataLoader(train_ds, batch_size=cfg.batch_size, shuffle=True, drop_last=True)
val_loader   = DataLoader(val_ds,   batch_size=cfg.batch_size, shuffle=False, drop_last=True)
test_loader  = DataLoader(test_ds,  batch_size=cfg.batch_size, shuffle=False, drop_last=True)

x_batch, y_batch = next(iter(train_loader))
print("x_batch shape:", x_batch.shape)  # expected: (B, T)
print("y_batch shape:", y_batch.shape)  # expected: (B, T)
print("x_batch dtype:", x_batch.dtype)

print("First 10 x:", x_batch[0][:10].tolist())
print("First 10 y:", y_batch[0][:10].tolist())


# 7) B2 – Build the RNN Language Model

Model requirements:
- Embedding layer
- RNN layer (`nn.RNN`)
- Linear layer to vocab size

Complete the TODO blocks.
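
Before writing the class, it can help to trace tensor shapes through the three layers. The sketch below wires the layers together outside of any `nn.Module`, using toy sizes (`V`, `E`, `H`, `B`, `T`) that are assumptions for illustration and not the notebook's `Config` values.

```python
import torch
import torch.nn as nn

# Toy sizes, chosen only for illustration.
V, E, H, B, T = 50, 8, 16, 4, 6

embedding = nn.Embedding(V, E)
rnn = nn.RNN(input_size=E, hidden_size=H, num_layers=1, batch_first=True)
proj = nn.Linear(H, V)

x = torch.randint(0, V, (B, T))   # (B, T) token ids
emb = embedding(x)                # (B, T, E)
out, h_n = rnn(emb)               # out: (B, T, H), h_n: (num_layers, B, H)
logits = proj(out)                # (B, T, V)

print(emb.shape, out.shape, h_n.shape, logits.shape)
```

With `batch_first=True` the per-timestep outputs `out` keep the batch dimension first, while the final hidden state `h_n` is still laid out as `(num_layers, B, H)`. Only `out` is needed to produce logits at every position.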


In [None]:
# ===== 7) Model (TODO) =====
# The Config class is defined at the top of the notebook. You can adjust the hyperparameters there as needed.
class RNNLanguageModel(nn.Module):
    def __init__(self, vocab_size: int, emb_dim: int, hid_dim: int, num_layers: int = 1, dropout: float = 0.0):
        super().__init__()
        # TODO: define embedding layer
        # TODO: define RNN layer (batch_first=True)
        # TODO: define output projection layer (hid_dim -> vocab_size)
        raise NotImplementedError

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T) token IDs
        # Return logits: (B, T, V)
        # TODO: embed -> rnn -> linear
        raise NotImplementedError

# Instantiate model (after TODOs)
model = RNNLanguageModel(
    vocab_size=vocab_size,
    emb_dim=cfg.emb_dim,
    hid_dim=cfg.hid_dim,
    num_layers=cfg.num_layers,
    dropout=cfg.dropout
).to(device)

# Parameter count
num_params = sum(p.numel() for p in model.parameters())
print("Model parameters:", num_params)


# 8) B3 – Training + validation (loss + perplexity)

You will:
- Define loss (`CrossEntropyLoss`)
- Train for several epochs
- Compute validation perplexity

Complete the TODO blocks.
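
As a sanity check on the relationship between loss and perplexity, consider a model that predicts a uniform distribution: its perplexity should equal the vocabulary size. The sketch below uses toy sizes and made-up targets (all assumptions for illustration) and relies on the default `mean` reduction of `nn.CrossEntropyLoss`.

```python
import math
import torch
import torch.nn as nn

B, T, V = 2, 3, 5                      # toy sizes for illustration
logits = torch.zeros(B, T, V)          # all-zero logits = uniform prediction over V tokens
targets = torch.randint(0, V, (B, T))  # made-up target ids

criterion = nn.CrossEntropyLoss()      # averages over all B*T positions by default
loss = criterion(logits.reshape(B * T, V), targets.reshape(B * T))

# Perplexity is exp of the average per-token cross-entropy (natural log).
ppl = math.exp(loss.item())
print(loss.item(), ppl)
```

When averaging over several batches, weighting each batch's mean loss by its token count (`B * T`) before dividing by the total token count keeps the average correct even if batch sizes differ.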


In [None]:
# ===== 8.1) Loss + optimizer (TODO) =====
# For language modeling with logits of shape (B, T, V),
# CrossEntropyLoss expects (N, V) logits and (N,) targets.
criterion = None  # TODO: set nn.CrossEntropyLoss()
optimizer = None  # TODO: set torch.optim.Adam(model.parameters(), lr=cfg.lr)

print("Ready (criterion, optimizer):", criterion is not None, optimizer is not None)


In [None]:
# ===== 8.2) Helper: compute perplexity (TODO) =====
@torch.no_grad()
def evaluate_perplexity(model: nn.Module, loader: DataLoader) -> float:
    model.eval()
    total_loss = 0.0
    total_tokens = 0

    for x, y in loader:
        x = x.to(device)
        y = y.to(device)

        # TODO:
        # 1) logits = model(x) -> (B, T, V)
        # 2) reshape logits to (B*T, V)
        # 3) reshape y to (B*T,)
        # 4) loss = criterion(...)
        # 5) accumulate total_loss (mean loss * number of tokens in the batch) and total_tokens

        raise NotImplementedError

    avg_loss = total_loss / max(total_tokens, 1)
    ppl = math.exp(avg_loss)
    return ppl


In [None]:
# ===== 8.3) Training loop (TODO) =====
def train_one_epoch(model: nn.Module, loader: DataLoader) -> float:
    model.train()
    running_loss = 0.0
    running_tokens = 0

    for x, y in loader:
        x = x.to(device)
        y = y.to(device)

        # TODO:
        # 1) optimizer.zero_grad()
        # 2) logits = model(x)
        # 3) reshape logits/y and compute loss = criterion(...)
        # 4) loss.backward()
        # 5) optional: clip gradients (e.g. to cfg.grad_clip)
        # 6) optimizer.step()
        # 7) accumulate running_loss (mean loss * number of tokens in the batch) and running_tokens

        raise NotImplementedError

    return running_loss / max(running_tokens, 1)

train_losses = []
val_ppls = []

for epoch in range(1, cfg.epochs + 1):
    train_loss = train_one_epoch(model, train_loader)
    val_ppl = evaluate_perplexity(model, val_loader)

    train_losses.append(train_loss)
    val_ppls.append(val_ppl)

    print(f"Epoch {epoch:02d} | train loss: {train_loss:.4f} | val ppl: {val_ppl:.2f}")


In [None]:
# ===== 8.4) Plot training loss (optional) =====
import matplotlib.pyplot as plt

plt.figure()
plt.plot(range(1, len(train_losses)+1), train_losses)
plt.xlabel("Epoch")
plt.ylabel("Train Loss")
plt.title("Training Loss per Epoch")
plt.show()


# 9) B4 – Test perplexity + text generation

You will:
- Report test perplexity
- Generate text by sampling from the model

Complete the TODO blocks.
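
The effect of temperature on sampling can be seen on a three-token toy vocabulary. The sketch below is one possible way to sample, using made-up logits and a hypothetical helper `toy_sample`; the seeded `torch.Generator` is only there to make the draw repeatable.

```python
import torch

def toy_sample(logits, temperature=1.0, generator=None):
    """Sample one index from softmax(logits / temperature)."""
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1, generator=generator).item()

logits = torch.tensor([2.0, 1.0, 0.1])  # made-up scores for a 3-token vocabulary

sharp = torch.softmax(logits / 0.5, dim=-1)  # temperature < 1: mass concentrates on the top token
flat = torch.softmax(logits / 2.0, dim=-1)   # temperature > 1: distribution moves toward uniform

g = torch.Generator().manual_seed(0)
idx = toy_sample(logits, temperature=1.0, generator=g)
print(sharp, flat, idx)
```

Lower temperatures tend to give safer, more repetitive text, while higher temperatures give more varied but less coherent text; this is worth keeping in mind when commenting on generated samples.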


In [None]:
# ===== 9.1) Test perplexity =====
test_ppl = evaluate_perplexity(model, test_loader)  # will work after TODOs done
print("Test perplexity:", test_ppl)


In [None]:
# ===== 9.2) Text generation (TODO) =====
# We will generate tokens one-by-one:
# - Start with <bos>
# - Predict next token distribution
# - Sample (or argmax)
# - Append token and continue

def sample_next_token(logits_1v: torch.Tensor, temperature: float = 1.0) -> int:
    # logits_1v: (V,) logits for next token
    # TODO:
    # 1) divide logits by temperature
    # 2) convert to probabilities (softmax)
    # 3) sample an index using torch.multinomial
    raise NotImplementedError

@torch.no_grad()
def generate_text(model: nn.Module, stoi: Dict[str, int], itos: List[str],
                  max_new_tokens: int = 50, temperature: float = 1.0) -> str:
    model.eval()

    bos_id = stoi[SPECIAL["BOS"]]
    eos_id = stoi[SPECIAL["EOS"]]

    # Start sequence with <bos>
    generated = [bos_id]

    for _ in range(max_new_tokens):
        # TODO:
        # 1) create input tensor x of shape (1, current_len)
        # 2) logits = model(x) -> (1, current_len, V)
        # 3) take last timestep logits: logits[0, -1, :]
        # 4) sample next token id
        # 5) append; break if eos
        raise NotImplementedError

    # Convert ids to tokens and return as a string
    tokens = [itos[i] for i in generated]
    return " ".join(tokens)

print(generate_text(model, stoi, itos, max_new_tokens=60, temperature=1.0))


# What you submit for Part B

In your final submission, include:
- Your completed code for all TODO blocks
- Your training loss plot
- Validation perplexity per epoch (printed)
- Final test perplexity
- 3 generated samples (30+ tokens each) with brief comments
