# A2 - Language Model

In this assignment, I built a neural language model that learns the structure, context, and style of a given text corpus and generates coherent text continuations.
I implemented a Long Short-Term Memory (LSTM) based language model and evaluated it using perplexity.

### IMPORTS

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import math

import torchtext
from torchtext.vocab import Vocab
from datasets import load_dataset
from collections import Counter
from tqdm import tqdm

device = torch.device("cuda" if torch.cuda.is_available() else "cpu") # Checking for GPU
print("Device:", device)

SEED = 1234 # Setting seed for reproducibility
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True


Device: cpu


# Task 1 

#### Dataset Acquisition

In [18]:
dataset = load_dataset("the-rizz/the-rizz-corpus")
print(dataset)


DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 55422
    })
})


#### Dataset Description

* Dataset chosen: The Rizz Corpus
* Source: HuggingFace Datasets
* Link: https://huggingface.co/datasets/the-rizz/the-rizz-corpus

The Rizz Corpus is a text-rich conversational dataset containing short dialogue-style sentences designed to represent social and persona-based language. It is a collection of informal conversations taken from various social media apps.
It is suitable for language modeling because:

* It contains natural, informal language

* The text is diverse and context-dependent

In [19]:

# Saving dataset as .txt

output_path = "rizz_corpus.txt"

with open(output_path, "w", encoding="utf-8") as f:
    for example in dataset["train"]:
        text = example["text"].strip()
        if text:                     # skip empty lines
            f.write(text + "\n")

print(f"Dataset saved to {output_path}")


Dataset saved to rizz_corpus.txt


# Task 2

#### Dataset Splitting

Since the dataset only provides a single split, I manually divide it into:

* Training set (80%)

* Validation set (10%)

* Test set (10%)

In [20]:
dataset = dataset["train"].train_test_split(test_size=0.2, seed=SEED)
temp = dataset["test"].train_test_split(test_size=0.5, seed=SEED)

train_dataset = dataset["train"]
valid_dataset = temp["train"]
test_dataset  = temp["test"]

print(len(train_dataset), len(valid_dataset), len(test_dataset))


44337 5542 5543
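
The sizes printed above can be checked by hand. The sketch below assumes that `train_test_split` follows the scikit-learn rounding convention (test size rounded up, train size rounded down). The helper `split_sizes` is a hypothetical name and is not part of the `datasets` API.

```python
import math

def split_sizes(n_rows, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test size is rounded up and the train size is rounded down.
    """
    n_test = math.ceil(test_size * n_rows)
    n_train = math.floor((1 - test_size) * n_rows)
    return n_train, n_test

n_train, n_holdout = split_sizes(55422, 0.2)    # first split: 80% / 20%
n_valid, n_test = split_sizes(n_holdout, 0.5)   # second split: 10% / 10%
print(n_train, n_valid, n_test)
```

Under that assumption the arithmetic gives 44337, 5542, and 5543 rows, which matches the printed split sizes.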


# Text Preprocessing

Before training the language model, the text data was cleaned and prepared using the following steps:

Removing empty lines and lowercasing -

* Some entries in the dataset were empty, so those were removed.
* All text was converted to lowercase to keep the vocabulary smaller and more consistent.

Tokenization - 
Each sentence was split into individual tokens using a simple whitespace-based tokenizer. This converts raw text into tokens that the model can learn from. Because the split is on whitespace only, punctuation and the chat markers (`<s>`, `[inst]`, `[/inst]`, `</s>`) stay attached to neighbouring words.

Building the vocabulary - 
A vocabulary was created from the training data only. Words that appeared fewer than 3 times were ignored to reduce noise.

Two special tokens were added:

* `<unk>` for unknown words

* `<eos>` to indicate the end of a sentence

Converting text to numbers (Numericalization) - 
Every word token was mapped to a unique number using the vocabulary. The `<eos>` token was added at the end of each sentence so the model learns when a sentence should stop.

Batching the data -
All word indices were grouped into fixed-size batches so the LSTM could be trained efficiently using sequences of equal length.
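
The steps above can be sketched end to end on a toy corpus. This is a minimal pure-Python illustration and not the notebook's torchtext-based code. The names `build_vocab` and `encode`, the toy sentences, and the choice of `min_freq=2` are assumptions made to keep the example small.

```python
from collections import Counter

def build_vocab(token_lists, min_freq=3, specials=("<unk>", "<eos>")):
    """Build index<->token maps from training tokens, dropping rare words."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    itos = list(specials) + sorted(
        (tok for tok, c in counts.items() if c >= min_freq),
        key=lambda tok: (-counts[tok], tok),
    )
    stoi = {tok: i for i, tok in enumerate(itos)}
    return stoi, itos

def encode(tokens, stoi):
    """Map tokens to ids, append <eos>, and fall back to <unk> for unseen words."""
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokens + ["<eos>"]]

# Toy corpus (hypothetical); min_freq=2 so the example stays small
texts = ["Hi there", "", "hi again", "HI there"]
cleaned = [t.lower() for t in texts if t.strip()]   # drop empty lines, lowercase
tokens = [t.split() for t in cleaned]               # whitespace tokenization
stoi, itos = build_vocab(tokens, min_freq=2)
ids = [encode(toks, stoi) for toks in tokens]
print(itos)
print(ids)
```

Here "again" occurs only once, so it is left out of the vocabulary and is encoded with the `<unk>` index.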

#### Lowercasing

In [21]:
train_texts = [ex["text"].lower() for ex in train_dataset if ex["text"].strip() != ""]
valid_texts = [ex["text"].lower() for ex in valid_dataset if ex["text"].strip() != ""]
test_texts  = [ex["text"].lower() for ex in test_dataset  if ex["text"].strip() != ""]


#### Tokenization

In [22]:
def tokenize(text):
    return text.split()

train_tok = [tokenize(t) for t in train_texts]
valid_tok = [tokenize(t) for t in valid_texts]
test_tok  = [tokenize(t) for t in test_texts]

print(train_tok[0])


['<s>[inst]', 'i', 'am', 'basically', 'a', 'gym', 'rat.', 'but', 'i', 'try', 'to', 'read', 'often', 'too.[/inst]well', 'when', 'i', 'am', 'not', 'flying', 'kites', 'with', 'my', 'niece', 'i', 'am', 'at', 'the', 'gym', 'just', 'not', 'so', 'much</s>']
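
The sample above shows that whitespace splitting leaves chat markers and punctuation fused to neighbouring words (`'<s>[inst]'`, `'too.[/inst]well'`, `'much</s>'`), so each fused form becomes its own vocabulary entry. One possible alternative, which this notebook does not use, is a regex tokenizer that splits them apart. The name `tokenize_with_markers` and the pattern are illustrative assumptions.

```python
import re

# Markers first, then runs of word characters, then single punctuation marks
TOKEN_PATTERN = re.compile(r"<s>|</s>|\[/?inst\]|\w+|[^\w\s]")

def tokenize_with_markers(text):
    """Split lowercased text into markers, words, and punctuation."""
    return TOKEN_PATTERN.findall(text)

sample = "<s>[inst] i try to read often too.[/inst]well when i am at the gym</s>"
print(tokenize_with_markers(sample))
```

Switching tokenizers would change the vocabulary, so perplexities obtained this way would not be directly comparable with the numbers reported in this notebook.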


#### Vocabulary Construction

In [23]:
counter = Counter()
for tokens in train_tok:
    counter.update(tokens)

# applying min_freq = 3
counter = Counter({k: v for k, v in counter.items() if v >= 3})

vocab = Vocab(counter)

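# Special tokens: <unk> and <eos>
# Note: the printed vocabulary shows that Vocab had already added <unk> and <pad>,
# so <unk> appears twice in itos; the rebuilt stoi maps "<unk>" to its last position.
vocab.itos.insert(0, "<unk>")
vocab.itos.insert(1, "<eos>")
vocab.stoi = {tok: i for i, tok in enumerate(vocab.itos)}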

UNK_IDX = vocab.stoi["<unk>"]
EOS_IDX = vocab.stoi["<eos>"]

# Checking vocab size and some tokens
print("Vocab size:", len(vocab))
print(vocab.itos[:10])


Vocab size: 20178
['<unk>', '<eos>', '<unk>', '<pad>', 'i', 'to', 'a', 'you', 'the', 'are']


In [24]:

# Detokenizer

PUNCT_NO_SPACE_BEFORE = {".", ",", "!", "?", ";", ":", "%", ")", "]", "}"}
PUNCT_NO_SPACE_AFTER  = {"(", "[", "{"}

def decode(ids):
    tokens = [vocab.itos[i] for i in ids]
    out = []

    for tok in tokens:
        if not out:
            out.append(tok)
            continue

        prev = out[-1]

        if tok in PUNCT_NO_SPACE_BEFORE:
            out[-1] = prev + tok
        elif prev in PUNCT_NO_SPACE_AFTER:
            out[-1] = prev + tok
        elif tok in {"'", "’"}:
            out[-1] = prev + tok
        elif prev.endswith(("'", "’")):
            out[-1] = prev + tok
        else:
            out.append(" " + tok)

    return "".join(out)


In [None]:

# Sampling Utilities (Top-k / Top-p)

import torch.nn.functional as F

def sample_next_token(probs, top_k=50, top_p=0.9):
    probs = probs.squeeze(0)

    # Top-k filtering
    if top_k is not None and top_k > 0:
        k = min(top_k, probs.numel())
        values, indices = torch.topk(probs, k)
        mask = torch.zeros_like(probs)
        mask[indices] = values
        probs = mask / mask.sum()

    # Top-p (nucleus) filtering
    if top_p is not None and 0 < top_p < 1:
        sorted_probs, sorted_idx = torch.sort(probs, descending=True)
        cumulative = torch.cumsum(sorted_probs, dim=0)
        cutoff = cumulative > top_p
        cutoff[0] = False
        sorted_probs[cutoff] = 0
        probs = torch.zeros_like(probs)
        probs[sorted_idx] = sorted_probs
        probs = probs / probs.sum()

    return torch.multinomial(probs, 1).item() # Sampling from the filtered distribution


#### Numericalization

In [None]:

def numericalize(tokens):
    return [vocab.stoi.get(tok, UNK_IDX) for tok in tokens]

def build_data(token_lists):
    ids = []
    for tokens in token_lists:
        tokens = tokens + ["<eos>"]
        ids.extend(numericalize(tokens))
    return torch.LongTensor(ids)

train_ids = build_data(train_tok)
valid_ids = build_data(valid_tok)
test_ids  = build_data(test_tok)


#### Batch Preparation

In [None]:
# Preparing data for batching
def get_data(data, batch_size):
    num_batches = data.shape[0] // batch_size
    data = data[:num_batches * batch_size]
    return data.view(batch_size, num_batches)

batch_size = 128 # Adjust as required

train_data = get_data(train_ids, batch_size) 
valid_data = get_data(valid_ids, batch_size) 
test_data  = get_data(test_ids,  batch_size)
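
To make the reshaping in `get_data` concrete, here is a small pure-Python sketch of the same idea on a toy stream of token ids. It uses lists instead of tensors, and `reshape_stream` is a hypothetical name.

```python
def reshape_stream(ids, batch_size):
    """Drop the remainder, then cut the stream into batch_size contiguous rows."""
    num_cols = len(ids) // batch_size
    ids = ids[: num_cols * batch_size]
    return [ids[r * num_cols:(r + 1) * num_cols] for r in range(batch_size)]

stream = list(range(10))           # stand-in for a long sequence of token ids
rows = reshape_stream(stream, 3)   # 10 // 3 = 3 columns; the last id is dropped
print(rows)
```

Each row is a contiguous slice of the corpus, which is what allows the hidden state to be carried from one training window to the next along a row.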


# MODELING

#### Model Architecture

The language model is built using an LSTM-based neural network with the following structure:

* Embedding Layer
Converts word indices into dense vectors so the model can understand relationships between words.

* LSTM Layers
The LSTM processes sequences of words and learns the context of a sentence by remembering past information.
Using multiple layers helps the model learn more complex patterns.

* Dropout Layers
Dropout is applied to reduce overfitting and improve generalization on unseen text.

* Output Linear Layer
This layer maps the LSTM outputs to the size of the vocabulary, producing predictions for the next word.

#### Training Process

* The model was trained using cross-entropy loss, which is suitable for next-word prediction tasks.

* Adam optimizer was used to update the model parameters.

* Training was done on fixed-length sequences using truncated backpropagation through time.

* Gradient clipping was applied to avoid unstable training.

* After each epoch, the model was evaluated on the validation set.

The model with the best validation performance was saved and later used for testing and text generation.
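
Gradient clipping by norm rescales all gradients together when their combined L2 norm exceeds a threshold. The following is a simplified pure-Python sketch of that idea on a flat list of numbers. It is not the PyTorch implementation, and `clip_by_global_norm` is a hypothetical helper.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale grads so their L2 norm is at most max_norm; return (clipped, original_norm)."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm          # already small enough, leave unchanged
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm

clipped, norm = clip_by_global_norm([3.0, 4.0], max_norm=0.25)
print(norm, clipped)
```

The direction of the gradient is preserved and only its length is capped, which limits the size of a single update when gradients spike.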

#### LSTM

In [None]:
class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size, emb_dim, hid_dim, num_layers, dropout):
        super().__init__()
        self.hid_dim = hid_dim
        self.num_layers = num_layers

        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(
            emb_dim, hid_dim,
            num_layers=num_layers,
            dropout=dropout,
            batch_first=True
        )
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(hid_dim, vocab_size)
    
    def init_hidden(self, batch_size, device): 
        h = torch.zeros(self.num_layers, batch_size, self.hid_dim).to(device)
        c = torch.zeros(self.num_layers, batch_size, self.hid_dim).to(device)
        return h, c

    def detach_hidden(self, hidden):
        return hidden[0].detach(), hidden[1].detach()
    
    def forward(self, src, hidden):
        emb = self.dropout(self.embedding(src))
        out, hidden = self.lstm(emb, hidden)
        out = self.dropout(out)
        pred = self.fc(out)
        return pred, hidden


In [29]:
# Creates input–target pairs for language modeling by shifting the sequence by one token

def get_batch(data, seq_len, idx):
    src = data[:, idx:idx+seq_len]
    tgt = data[:, idx+1:idx+seq_len+1]
    return src, tgt
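
The one-token shift in `get_batch` can be seen on a single toy row. The sketch below uses a plain list of words instead of a tensor of ids, and `shifted_pair` is a hypothetical name.

```python
def shifted_pair(row, seq_len, idx):
    """Return (inputs, targets) where targets are the inputs shifted left by one."""
    src = row[idx:idx + seq_len]
    tgt = row[idx + 1:idx + seq_len + 1]
    return src, tgt

row = ["how", "are", "you", "today", "<eos>"]
src, tgt = shifted_pair(row, seq_len=4, idx=0)
print(src)
print(tgt)
```

At every position the target is the word that follows the input word, so the model is trained to predict the next token.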


#### Model Training

In [None]:
def train(model, data, optimizer, criterion, batch_size, seq_len, clip, device):
    model.train()
    epoch_loss = 0

    # Trim the columns so that (num_batches - 1) is a multiple of seq_len;
    # the extra column supplies the target for the last input position

    num_batches = data.shape[1]
    data = data[:, :num_batches - (num_batches - 1) % seq_len]
    num_batches = data.shape[1]

    # Initializing hidden state
    
    hidden = model.init_hidden(batch_size, device)

    for idx in range(0, num_batches - 1, seq_len):
        optimizer.zero_grad()
        hidden = model.detach_hidden(hidden)

        src, tgt = get_batch(data, seq_len, idx)
        src, tgt = src.to(device), tgt.to(device)

        pred, hidden = model(src, hidden)
        pred = pred.reshape(batch_size * seq_len, -1)
        tgt = tgt.reshape(-1)

        loss = criterion(pred, tgt)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()

        epoch_loss += loss.item() * seq_len

    return epoch_loss / num_batches
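
The trimming expression used in `train` and `evaluate` can be worked through with small numbers. The sketch below reproduces only the index arithmetic. `window_starts` is a hypothetical name, and the column count of 103 is made up.

```python
def window_starts(num_cols, seq_len):
    """Return (trimmed column count, window start indices) for the given width."""
    trimmed = num_cols - (num_cols - 1) % seq_len
    return trimmed, list(range(0, trimmed - 1, seq_len))

trimmed, starts = window_starts(num_cols=103, seq_len=50)
print(trimmed, starts)
```

With 103 columns, two are dropped so that 101 remain: 100 input positions split into two full windows, plus one extra column that provides the target for the final input.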


In [None]:
# Evaluation function

def evaluate(model, data, criterion, batch_size, seq_len, device):
    model.eval()
    epoch_loss = 0
    
    num_batches = data.shape[1]
    data = data[:, :num_batches - (num_batches - 1) % seq_len]
    num_batches = data.shape[1]

    hidden = model.init_hidden(batch_size, device)

    # Disabling gradient calculation for evaluation

    with torch.no_grad():
        for idx in range(0, num_batches - 1, seq_len):
            hidden = model.detach_hidden(hidden)
            src, tgt = get_batch(data, seq_len, idx)
            src, tgt = src.to(device), tgt.to(device)

            pred, hidden = model(src, hidden)
            pred = pred.reshape(batch_size * seq_len, -1)
            tgt = tgt.reshape(-1)

            loss = criterion(pred, tgt)
            epoch_loss += loss.item() * seq_len

    return epoch_loss / num_batches


In [32]:

# Hyperparameters :

vocab_size = len(vocab)

emb_dim = 1024          # embedding dimension
hid_dim = 1024          # LSTM hidden size
num_layers = 2          # number of stacked LSTM layers
dropout = 0.65          # dropout probability
lr = 1e-3               # initial learning rate for Adam

# Model Initialization

model = LSTMLanguageModel(
    vocab_size, emb_dim, hid_dim, num_layers, dropout
).to(device)

# Setting up optimizer and loss function

optimizer = optim.Adam(model.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss()


# Training Configuration

n_epochs = 20           # number of training epochs
seq_len = 50            # sequence length of each training window (truncated BPTT)
clip = 0.25             # maximum gradient norm for clipping

lr_scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, factor=0.5, patience=0
)

best_valid_loss = float("inf")

# Training Loop

for epoch in range(n_epochs):
    train_loss = train(
        model, train_data, optimizer,
        criterion, batch_size,
        seq_len, clip, device
    )

    valid_loss = evaluate(
        model, valid_data, criterion,
        batch_size, seq_len, device
    )

    lr_scheduler.step(valid_loss)

    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), "best-val-lstm_lm.pt") # saving the best model

    print(f"Epoch {epoch+1}")
    print(f"  Train Perplexity: {math.exp(train_loss):.3f}")
    print(f"  Valid Perplexity: {math.exp(valid_loss):.3f}")


Epoch 1
  Train Perplexity: 166.564
  Valid Perplexity: 61.106
Epoch 2
  Train Perplexity: 67.926
  Valid Perplexity: 47.294
Epoch 3
  Train Perplexity: 55.711
  Valid Perplexity: 41.691
Epoch 4
  Train Perplexity: 49.132
  Valid Perplexity: 38.438
Epoch 5
  Train Perplexity: 44.700
  Valid Perplexity: 36.344
Epoch 6
  Train Perplexity: 41.559
  Valid Perplexity: 34.996
Epoch 7
  Train Perplexity: 39.075
  Valid Perplexity: 33.997
Epoch 8
  Train Perplexity: 37.121
  Valid Perplexity: 33.183
Epoch 9
  Train Perplexity: 35.501
  Valid Perplexity: 32.594
Epoch 10
  Train Perplexity: 34.106
  Valid Perplexity: 32.158
Epoch 11
  Train Perplexity: 32.890
  Valid Perplexity: 31.868
Epoch 12
  Train Perplexity: 31.816
  Valid Perplexity: 31.637
Epoch 13
  Train Perplexity: 30.843
  Valid Perplexity: 31.478
Epoch 14
  Train Perplexity: 29.948
  Valid Perplexity: 31.326
Epoch 15
  Train Perplexity: 29.195
  Valid Perplexity: 31.235
Epoch 16
  Train Perplexity: 28.474
  Valid Perplexity: 31.179
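
`ReduceLROnPlateau` with `factor=0.5` and `patience=0` halves the learning rate as soon as an epoch fails to improve the validation loss. The sketch below is a simplified pure-Python model of that rule. It ignores the scheduler's improvement threshold and cooldown options, `simulate_plateau_schedule` is a hypothetical name, and the loss values are made up for illustration.

```python
def simulate_plateau_schedule(valid_losses, lr, factor=0.5, patience=0):
    """Return the learning rate in effect after each epoch's validation loss."""
    best = float("inf")
    bad_epochs = 0
    history = []
    for loss in valid_losses:
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:       # too many epochs without improvement
            lr *= factor
            bad_epochs = 0
        history.append(lr)
    return history

lrs = simulate_plateau_schedule([3.5, 3.4, 3.45, 3.3, 3.3], lr=1e-3)
print(lrs)
```

In the log above, validation perplexity decreases in every recorded epoch, so under this simplified rule the learning rate would have stayed at its initial value throughout.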


In [33]:

# Saving Vocabulary for Inference / Web App

torch.save(
    {
        "stoi": vocab.stoi,
        "itos": vocab.itos
    },
    "vocab.pt"
)

print("Saved vocab.pt")


Saved vocab.pt


# Testing

In [34]:
# Loading the best model saved during training
model.load_state_dict(torch.load("best-val-lstm_lm.pt", map_location=device))

# Evaluating the trained model on the test dataset
test_loss = evaluate(
    model,          # trained LSTM language model
    test_data,      # test dataset
    criterion,      # loss function (CrossEntropy)
    batch_size,     # batch size used during training
    seq_len,        # sequence length for evaluation
    device          # device selected at the top of the notebook (CPU in this run)
)

# Converting test loss to perplexity for easier interpretation
print(f"Test Perplexity: {math.exp(test_loss):.3f}")



Test Perplexity: 31.911
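
Perplexity is the exponential of the mean per-token cross-entropy (measured in nats), which is why the code applies `math.exp` to the averaged loss. A minimal sketch of that relationship follows; the `perplexity` helper and the toy vocabulary size are assumptions for illustration.

```python
import math

def perplexity(token_losses):
    """Exponential of the mean per-token negative log-likelihood (natural log)."""
    return math.exp(sum(token_losses) / len(token_losses))

# A model that spreads probability uniformly over V words has perplexity V
V = 1000
uniform_losses = [-math.log(1 / V)] * 5
print(perplexity(uniform_losses))

# Mean loss implied by the reported test perplexity
implied_loss = math.log(31.911)
print(implied_loss)
```

A test perplexity of 31.911 therefore corresponds to a mean loss of about 3.46 nats per token. Informally, the model is on average about as uncertain as if it were choosing uniformly among roughly 32 words.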


# Inference

#### Text Generation

In [68]:
def generate(
    prompt,
    max_seq_len,
    temperature,
    model,
    tokenizer_fn,
    vocab,
    device,
    seed=None
):
    if seed is not None:
        torch.manual_seed(seed)

    model.eval()

    tokens = tokenizer_fn(prompt)
    indices = [vocab.stoi.get(t, UNK_IDX) for t in tokens]

    batch_size = 1
    hidden = model.init_hidden(batch_size, device)

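    with torch.no_grad():
        for _ in range(max_seq_len):
            # Note: the full token history is fed at every step while `hidden` is also
            # carried over, so earlier tokens are processed again on each iteration.
            src = torch.LongTensor([indices]).to(device)
            prediction, hidden = model(src, hidden)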

            probs = torch.softmax(
                prediction[:, -1] / temperature,
                dim=-1
            )

            next_token = torch.multinomial(probs, 1).item()

            while next_token == UNK_IDX:
                next_token = torch.multinomial(probs, 1).item()

            if next_token == EOS_IDX:
                break

            indices.append(next_token)

    tokens = [vocab.itos[i] for i in indices]
    tokens = [t for t in tokens if t not in ("<unk>", "<eos>")]

    return tokens


#### Sample Prompt

In [69]:
prompt = "hello there,"
max_seq_len = 30
seed = 0

temperatures = [0.5, 0.7, 0.75, 0.8, 1.0]

for t in temperatures:
    print(f"\nTemperature = {t}")
    output = generate(
        prompt=prompt,
        max_seq_len=max_seq_len,
        temperature=t,
        model=model,
        tokenizer_fn=tokenize,
        vocab=vocab,
        device=device,
        seed=seed
    )
    print(" ".join(output))



Temperature = 0.5
hello there, how are you?</s>

Temperature = 0.7
hello there, how are you?</s>

Temperature = 0.75
hello there, how are you?</s>

Temperature = 0.8
hello there, how are you?</s>

Temperature = 1.0
hello there, where are you from?</s>


#### Description

During inference, the trained LSTM language model is used to generate text autoregressively.  
Given an input prompt, the text is first tokenized and converted into numerical indices using the
trained vocabulary. These indices are then passed through the LSTM model to predict the probability
distribution of the next word.

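At each step, the logits for the last position are divided by the temperature, and the next word is
sampled from the resulting distribution and appended to the input sequence. The top-k / top-p helper
`sample_next_token` defined earlier is an alternative sampling strategy, but `generate` samples
directly with `torch.multinomial`. This process is repeated until the maximum generation length is
reached or an `<eos>` token is produced.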
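
The effect of temperature on the sampling distribution can be illustrated without the model. The sketch below is pure Python, and both the logits and the name `softmax_with_temperature` are made up for the example.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                                # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]                           # made-up scores for three words
p_low = softmax_with_temperature(logits, 0.5)      # sharper distribution
p_one = softmax_with_temperature(logits, 1.0)      # unscaled softmax
print(p_low)
print(p_one)
```

Lower temperatures concentrate probability on the highest-scoring word, which is consistent with the sample outputs above, where the lower temperatures all return the same continuation and only temperature 1.0 differs.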


#### Training Results Summary



The model improves rapidly during the early epochs: validation perplexity drops from
**61.11 (Epoch 1)** to **32.16 by Epoch 10**, indicating that the LSTM quickly learns the basic
structure and patterns of the language. In this phase validation perplexity is lower than training
perplexity, which is expected because dropout is active only during training and the training figure
is averaged over the whole epoch.

From about **Epoch 12** onward, validation perplexity flattens (**31.64** at Epoch 12, **31.18** at
Epoch 16, the last epoch logged), while training perplexity keeps decreasing and falls below the
validation value from Epoch 13. This suggests diminishing returns from additional training and the
onset of mild overfitting. The test perplexity of **31.91** is close to the best validation value, so
overall the results indicate successful convergence.


#### Web Application – Model Interface (Summary)

The web application provides a simple interface for interacting with the trained LSTM-based language
model. It is built using Flask and allows users to enter a text prompt through a web form. The backend
loads the trained PyTorch model and the saved vocabulary to ensure that preprocessing during inference
matches the training setup.

When a user submits a prompt, it is tokenized, converted into numerical indices, and passed to the
LSTM model for autoregressive text generation. The model predicts the next token step by step, applies
temperature-based sampling, and generates a coherent continuation of the input text. The generated
output is converted back to readable text and displayed on the webpage in real time.
