# LSTM Language Models

You are probably all very excited about ChatGPT.  In today's class, we will implement a very simple language model, which is basically what ChatGPT is, but with a simple LSTM.  You will be surprised that it is not difficult at all.

The paper we build on is *Regularizing and Optimizing LSTM Language Models*: https://arxiv.org/abs/1708.02182

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import math
import re
from collections import Counter
from tqdm import tqdm
import pandas as pd
import random



In [2]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cuda


In [3]:
SEED = 1234
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

## 1. Load data
- Dataset: Lord of the Rings Movie Dialogues
- Source: Kaggle, Paul Timothy Mooney
- Theme: Fantasy / LOTR
- Content: Character dialogues from the LOTR trilogy

Why suitable:

- Text-rich, narrative dialogue
- Sequential structure → good for language modeling
- Public, well-known repository

Dataset source: Paul Timothy Mooney, “Lord of the Rings Dataset”, Kaggle. https://www.kaggle.com/datasets/paultimothymooney/lord-of-the-rings-data


In [4]:
df = pd.read_csv("lotr_scripts.csv")
df.head()
print(df.columns)
print(df.head())


Index(['Unnamed: 0', 'char', 'dialog', 'movie'], dtype='object')
   Unnamed: 0     char                                             dialog  \
0           0   DEAGOL  Oh Smeagol Ive got one! , Ive got a fish Smeag...   
1           1  SMEAGOL     Pull it in! Go on, go on, go on, pull it in!     
2           2   DEAGOL                                           Arrghh!    
3           3  SMEAGOL                                          Deagol!     
4           4  SMEAGOL                                          Deagol!     

                     movie  
0  The Return of the King   
1  The Return of the King   
2  The Return of the King   
3  The Return of the King   
4  The Return of the King   


In [5]:
# Drop useless column
dataset = df.drop(columns=["Unnamed: 0"])

# Keep dialog only
texts = dataset["dialog"].dropna().astype(str)

# Basic cleanup
texts = texts.str.replace(r"\s+", " ", regex=True).str.strip()

# Remove very short lines (noise)
texts = texts[texts.str.len() > 20]

print("Number of dialog lines:", len(texts))
texts.head()


Number of dialog lines: 1644


0     Oh Smeagol Ive got one! , Ive got a fish Smeag...
1          Pull it in! Go on, go on, go on, pull it in!
6                          Give us that! Deagol my love
8           Because' , it's my birthday and I wants it.
12    'Murderer' they called us. They cursed us and ...
Name: dialog, dtype: object

In [6]:
# Ensure Python list (not pandas Series)
texts = texts.tolist()

# Shuffle deterministically (seed Python's RNG, which random.shuffle uses)
random.seed(SEED)
random.shuffle(texts)

n = len(texts)
train_end = int(0.8 * n)
valid_end = int(0.9 * n)

train_lines = texts[:train_end]
valid_lines = texts[train_end:valid_end]
test_lines  = texts[valid_end:]

print("Train:", len(train_lines))
print("Valid:", len(valid_lines))
print("Test :", len(test_lines))


Train: 1315
Valid: 164
Test : 165


In [7]:
type(texts), type(train_lines[0])


(list, str)

In [8]:
train_raw = "\n".join(train_lines)
valid_raw = "\n".join(valid_lines)
test_raw  = "\n".join(test_lines)

print(train_raw[:500])


Look Mr Frodo, a doorway! We're almost there!
Merry! Merry? Wake up
It's going to the Great Eye, along with everything else.
Hoi! You get back here! Wait till I get this through you! , Get out of my fields! You'll know the devil if I catch up with you!
He's up to something. , All right then, keep your secrets!
Now come the days of the King. May they be blessed.
They would be small.Only children to your eyes.
and some cabbages and those few bags of potatoes that we lifted last week. And the mushr



In [10]:
print(dataset)

         char                                             dialog  \
0      DEAGOL  Oh Smeagol Ive got one! , Ive got a fish Smeag...   
1     SMEAGOL     Pull it in! Go on, go on, go on, pull it in!     
2      DEAGOL                                           Arrghh!    
3     SMEAGOL                                          Deagol!     
4     SMEAGOL                                          Deagol!     
...       ...                                                ...   
2385   PIPPIN                                            Merry!    
2386  ARAGORN                                            Merry!    
2387    MERRY  He's always followed me everywhere I went sinc...   
2388  ARAGORN  One thing I've learnt about Hobbits: They are ...   
2389    MERRY                     Foolhardy maybe. He's a Took!    

                        movie  
0     The Return of the King   
1     The Return of the King   
2     The Return of the King   
3     The Return of the King   
4     The Return of the

## 2. Preprocessing 
The raw text is first converted to lowercase and tokenized with a basic English tokenizer that replaces every character other than letters, digits, and apostrophes with a space and then splits on whitespace.
A single `<eos>` token is appended to the end of each split's token stream.
A vocabulary is constructed by keeping tokens that appear at least three times in the training set, while rarer tokens are mapped to `<unk>`.
Each token is then converted to a numerical index to form input sequences.
The dataset is split into training, validation, and test sets (80/10/10 by dialogue line) to evaluate generalization.

### Tokenizing

We simply split the raw text into tokens.

In [11]:
def basic_english_tokenizer(text):
    text = text.lower()
    text = re.sub(r"[^a-z0-9']+", " ", text)
    return [t for t in text.strip().split() if t]

tokenizer = basic_english_tokenizer

train_tokens = tokenizer(train_raw)
valid_tokens = tokenizer(valid_raw)
test_tokens  = tokenizer(test_raw)

print(train_tokens[:20])


['look', 'mr', 'frodo', 'a', 'doorway', "we're", 'almost', 'there', 'merry', 'merry', 'wake', 'up', "it's", 'going', 'to', 'the', 'great', 'eye', 'along', 'with']


In [13]:
print(type(train_lines), type(train_tokens))
print(train_lines[0])
print(train_tokens[:20])


<class 'list'> <class 'list'>
Look Mr Frodo, a doorway! We're almost there!
['look', 'mr', 'frodo', 'a', 'doorway', "we're", 'almost', 'there', 'merry', 'merry', 'wake', 'up', "it's", 'going', 'to', 'the', 'great', 'eye', 'along', 'with']


In [14]:
print("Train tokens:", len(train_tokens))
print("Valid tokens:", len(valid_tokens))
print("Test tokens :", len(test_tokens))

Train tokens: 18772
Valid tokens: 2346
Test tokens : 2497


### Numericalizing

We add every word that occurs at least three times in the training set to the vocabulary, because otherwise the vocabulary would be too big.  Instead of torchtext, we use a small `SimpleVocab` class.  We also make sure to add `<unk>` and `<eos>`.

In [15]:
class SimpleVocab:
    def __init__(self, tokens):
        self.itos = list(tokens)
        self.stoi = {tok: i for i, tok in enumerate(self.itos)}
        self.default_index = None

    def __len__(self):
        return len(self.itos)

    def __getitem__(self, token):
        if token in self.stoi:
            return self.stoi[token]
        return self.default_index

    def get_itos(self):
        return self.itos

    def set_default_index(self, idx):
        self.default_index = idx


In [16]:
counts = Counter(train_tokens)

vocab_tokens = ["<unk>", "<eos>"]
for tok, freq in counts.items():
    if freq >= 3:
        vocab_tokens.append(tok)

vocab = SimpleVocab(vocab_tokens)
vocab.set_default_index(vocab["<unk>"])

print("Vocab size:", len(vocab.get_itos()))
print(vocab.get_itos()[:10])


Vocab size: 884
['<unk>', '<eos>', 'look', 'mr', 'frodo', 'a', "we're", 'almost', 'there', 'merry']


In [17]:
def make_data(tokens, vocab, batch_size):
    # Numeric tokens + <eos>
    ids = [vocab[t] for t in tokens] + [vocab["<eos>"]]
    data = torch.LongTensor(ids)
    num_batches = data.shape[0] // batch_size
    # Drop remainder & reshape
    data = data[:num_batches * batch_size]
    return data.view(batch_size, -1)

batch_size = 64

train_data = make_data(train_tokens, vocab, batch_size)
valid_data = make_data(valid_tokens, vocab, batch_size)
test_data  = make_data(test_tokens,  vocab, batch_size)

print(train_data.shape, valid_data.shape, test_data.shape)


torch.Size([64, 293]) torch.Size([64, 36]) torch.Size([64, 39])


In [18]:
print(vocab)

<__main__.SimpleVocab object at 0x7e1509b369c0>


In [19]:
print(vocab.get_itos()[:10])

['<unk>', '<eos>', 'look', 'mr', 'frodo', 'a', "we're", 'almost', 'there', 'merry']


## 3. Prepare the batch loader

### Prepare data

Given "Chaky loves eating at AIT" and "I really love deep learning", each followed by `<eos>`, and a batch size of 3, the token stream is cut into three rows of four tokens each: "Chaky loves eating at", "AIT `<eos>` I really", and "love deep learning `<eos>`".
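The example above can be reproduced with plain Python lists. This is only a minimal sketch: it assumes whitespace tokenization, and `batchify` is a hypothetical helper that is not part of the notebook. It mirrors what `data.view(batch_size, -1)` does to a flat tensor, where each row is a contiguous slice of the stream.

```python
def batchify(tokens, batch_size):
    """Split a flat token stream into batch_size contiguous rows (hypothetical helper)."""
    row_len = len(tokens) // batch_size        # tokens per row
    usable = tokens[:row_len * batch_size]     # drop the remainder that does not fit
    return [usable[i * row_len:(i + 1) * row_len] for i in range(batch_size)]

sentences = ["Chaky loves eating at AIT", "I really love deep learning"]
stream = []
for sentence in sentences:
    stream.extend(sentence.split() + ["<eos>"])  # mark the end of each sentence

rows = batchify(stream, batch_size=3)
for row in rows:
    print(" ".join(row))
```

Each row is later read left to right in windows of `seq_len` tokens, so the batch dimension holds independent stretches of the same text.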

In [23]:
# I used a flat token stream with manual batching and truncated BPTT.

In [24]:
print(train_data.shape)
print(valid_data.shape)
print(test_data.shape)


torch.Size([64, 293])
torch.Size([64, 36])
torch.Size([64, 39])


## 4. Modeling 
We implement a word-level LSTM language model consisting of an embedding layer, a multi-layer LSTM, and a linear output layer.
The model is trained using cross-entropy loss with teacher forcing, where the target sequence is shifted by one token relative to the input.
Truncated backpropagation through time is used to manage long sequences, and gradient clipping is applied to stabilize training.
Model performance is evaluated using perplexity on validation and test sets.
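Perplexity is the exponential of the average per-token cross-entropy (in nats). The following sketch illustrates that relationship with made-up probabilities; `perplexity` is a hypothetical helper, not a function used elsewhere in this notebook.

```python
import math

def perplexity(target_probs):
    """Perplexity from the probabilities a model assigned to the correct next tokens."""
    nll = [-math.log(p) for p in target_probs]   # per-token cross-entropy in nats
    return math.exp(sum(nll) / len(nll))

# A model that spreads its probability uniformly over V words has perplexity V.
uniform = perplexity([1 / 884] * 10)
# A model that always gives the correct token probability 0.5 has perplexity 2.
confident = perplexity([0.5] * 10)
print(round(uniform, 3), round(confident, 3))
```

A useful sanity check follows from this: a perplexity near the vocabulary size means the model is close to guessing uniformly, and a perplexity of exactly 1 would require probability 1 on every correct token.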

<img src="figures/LM.png" width=600>

In [25]:
class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size, emb_dim, hid_dim, num_layers, dropout_rate):
        super().__init__()

        self.num_layers = num_layers
        self.hid_dim = hid_dim

        self.embedding = nn.Embedding(vocab_size, emb_dim)

        self.lstm = nn.LSTM(
            input_size=emb_dim,
            hidden_size=hid_dim,
            num_layers=num_layers,
            dropout=dropout_rate if num_layers > 1 else 0.0,
            batch_first=True
        )

        self.dropout = nn.Dropout(dropout_rate)
        self.fc = nn.Linear(hid_dim, vocab_size)

        self.init_weights()

    def init_weights(self):
        for name, param in self.named_parameters():
            if "weight_ih" in name:
                nn.init.xavier_uniform_(param)
            elif "weight_hh" in name:
                nn.init.orthogonal_(param)
            elif "bias" in name:
                nn.init.zeros_(param)

    def init_hidden(self, batch_size, device):
        h0 = torch.zeros(self.num_layers, batch_size, self.hid_dim).to(device)
        c0 = torch.zeros(self.num_layers, batch_size, self.hid_dim).to(device)
        return h0, c0

    def detach_hidden(self, hidden):
        h, c = hidden
        return h.detach(), c.detach()

    def forward(self, src, hidden=None):
        # src: [batch_size, seq_len]

        embedded = self.dropout(self.embedding(src))
        # embedded: [batch_size, seq_len, emb_dim]

        output, hidden = self.lstm(embedded, hidden)
        # output: [batch_size, seq_len, hid_dim]

        output = self.dropout(output)
        prediction = self.fc(output)
        # prediction: [batch_size, seq_len, vocab_size]

        return prediction, hidden


## 5. Training 

Training follows a very basic procedure.  One note: because the token stream is cut into fixed-length windows, a sequence fed to the model may combine parts of different dialogue lines or cover only part of one, depending on `seq_len`.  For this reason we reset the hidden state only at the start of each epoch and carry it (detached) from one batch to the next, which amounts to assuming that each batch continues the previous one in the original text.

In [26]:
vocab_size = len(vocab.get_itos())
emb_dim = 512        # 400 in the paper
hid_dim = 512        # 1150 in the paper
num_layers = 2       # 3 in the paper
dropout_rate = 0.65
lr = 1e-3

In [27]:
print("vocab_size:", vocab_size, type(vocab_size))
print("emb_dim:", emb_dim, type(emb_dim))
print("hid_dim:", hid_dim, type(hid_dim))
print("num_layers:", num_layers, type(num_layers))
print("dropout_rate:", dropout_rate, type(dropout_rate))


vocab_size: 884 <class 'int'>
emb_dim: 512 <class 'int'>
hid_dim: 512 <class 'int'>
num_layers: 2 <class 'int'>
dropout_rate: 0.65 <class 'float'>


In [28]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device:", device)

model = LSTMLanguageModel(
    vocab_size=vocab_size,
    emb_dim=emb_dim,
    hid_dim=hid_dim,
    num_layers=num_layers,
    dropout_rate=dropout_rate
).to(device)

optimizer = optim.Adam(model.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss()
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {num_params:,}")


Device: cuda
Trainable parameters: 5,108,596


In [29]:
def get_batch(data, seq_len, idx):
    # data: [batch_size, tokens_per_row]
    src    = data[:, idx:idx+seq_len]
    target = data[:, idx+1:idx+seq_len+1]  # target is src shifted ahead by one token
    return src, target
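
To make the one-token shift concrete, here is the same slicing applied to a plain list of words. This is a sketch only: `get_batch_list` is a hypothetical list-based stand-in for `get_batch`, and the sentence is made up for illustration.

```python
def get_batch_list(rows, seq_len, idx):
    """List-based stand-in for get_batch: target is src shifted left by one token."""
    src = [row[idx:idx + seq_len] for row in rows]
    target = [row[idx + 1:idx + seq_len + 1] for row in rows]
    return src, target

rows = [["the", "ring", "must", "be", "destroyed", "<eos>"]]  # one made-up row
src, target = get_batch_list(rows, seq_len=3, idx=0)
print(src)     # words the model reads
print(target)  # word the model should predict at each position
```

At every position the model sees the true previous words rather than its own predictions, which is what teacher forcing means here.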


In [31]:
def train_epoch(model, data, optimizer, criterion, batch_size, seq_len, clip, device):
    model.train()
    epoch_loss = 0.0

    num_tokens = data.size(1)
    num_tokens = num_tokens - (num_tokens - 1) % seq_len
    data = data[:, :num_tokens]

    hidden = model.init_hidden(data.size(0), device)  # one hidden state per row of data

    for idx in tqdm(range(0, num_tokens - 1, seq_len), desc="Training", leave=False):
        optimizer.zero_grad()

        hidden = model.detach_hidden(hidden)

        src, target = get_batch(data, seq_len, idx)
        src, target = src.to(device), target.to(device)

        output, hidden = model(src, hidden)

        output = output.reshape(-1, output.size(-1))
        target = target.reshape(-1)

        loss = criterion(output, target)
        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()

        epoch_loss += loss.item() * seq_len

    return epoch_loss / num_tokens


In [32]:
def evaluate(model, data, criterion, batch_size, seq_len, device):
    model.eval()
    epoch_loss = 0.0

    num_tokens = data.size(1)
    num_tokens = num_tokens - (num_tokens - 1) % seq_len
    data = data[:, :num_tokens]

    hidden = model.init_hidden(batch_size, device)

    with torch.no_grad():
        for idx in range(0, num_tokens - 1, seq_len):
            hidden = model.detach_hidden(hidden)

            src, target = get_batch(data, seq_len, idx)
            src, target = src.to(device), target.to(device)

            output, hidden = model(src, hidden)

            output = output.reshape(-1, output.size(-1))
            target = target.reshape(-1)

            loss = criterion(output, target)
            epoch_loss += loss.item() * seq_len

    return epoch_loss / num_tokens


Here we use a `ReduceLROnPlateau` learning-rate scheduler, which multiplies the learning rate by `factor` when the monitored loss has not improved for more than `patience` epochs.
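
The idea behind such a schedule can be sketched in a few lines of plain Python. This is a simplified imitation, not PyTorch's implementation: `PlateauHalver` is a hypothetical class that ignores options of the real scheduler such as `threshold`, `cooldown`, and `min_lr`, and the loss values are made up.

```python
class PlateauHalver:
    """Simplified stand-in for a reduce-on-plateau schedule (not PyTorch's class)."""

    def __init__(self, lr, factor=0.5, patience=0):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, metric):
        if metric < self.best:          # improvement: remember it, reset the counter
            self.best = metric
            self.bad_epochs = 0
        else:                           # no improvement this epoch
            self.bad_epochs += 1
        if self.bad_epochs > self.patience:
            self.lr *= self.factor      # shrink the learning rate
            self.bad_epochs = 0
        return self.lr

scheduler = PlateauHalver(lr=1e-3, factor=0.5, patience=0)
made_up_valid_losses = [5.0, 4.8, 4.9, 4.7, 4.7]
lrs = [scheduler.step(loss) for loss in made_up_valid_losses]
print(lrs)
```

With `patience=0`, a single epoch without improvement is enough to trigger a reduction, so a validation loss that stays constant shrinks the learning rate at every epoch.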

In [33]:
seq_len = 35
clip = 0.25
batch_size = 64
train_loss = train_epoch(
    model, train_data, optimizer, criterion,
    batch_size, seq_len, clip, device
)

valid_loss = evaluate(
    model, valid_data, criterion,
    batch_size, seq_len, device
)

print("Train perplexity:", math.exp(train_loss))
print("Valid perplexity:", math.exp(valid_loss))


                                                       

Train perplexity: 613.2276437064652
Valid perplexity: 242.4408163514481




In [34]:
n_epochs = 50
seq_len = 50
clip = 0.25

lr_scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, factor=0.5, patience=0
)

best_valid_loss = float('inf')

for epoch in range(n_epochs):
    train_loss = train_epoch(
        model, train_data, optimizer, criterion,
        batch_size, seq_len, clip, device
    )

    valid_loss = evaluate(
        model, valid_data, criterion,
        batch_size, seq_len, device
    )

    lr_scheduler.step(valid_loss)

    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), "best-val-lstm_lm.pt")

    print(f"Epoch {epoch+1}/{n_epochs}")
    print(f"  Train Perplexity: {math.exp(train_loss):.3f}")
    print(f"  Valid Perplexity: {math.exp(valid_loss):.3f}")


                                                       

Epoch 1/50
  Train Perplexity: 250.090
  Valid Perplexity: 1.000


                                               

Epoch 2/50
  Train Perplexity: 225.368
  Valid Perplexity: 1.000


                                               

Epoch 3/50
  Train Perplexity: 215.428
  Valid Perplexity: 1.000


                                               

Epoch 4/50
  Train Perplexity: 211.162
  Valid Perplexity: 1.000


                                               

Epoch 5/50
  Train Perplexity: 210.012
  Valid Perplexity: 1.000


                                               

Epoch 6/50
  Train Perplexity: 210.499
  Valid Perplexity: 1.000


                                               

Epoch 7/50
  Train Perplexity: 210.057
  Valid Perplexity: 1.000


                                               

Epoch 8/50
  Train Perplexity: 209.105
  Valid Perplexity: 1.000


                                               

Epoch 9/50
  Train Perplexity: 208.194
  Valid Perplexity: 1.000


                                               

Epoch 10/50
  Train Perplexity: 209.127
  Valid Perplexity: 1.000


                                               

Epoch 11/50
  Train Perplexity: 208.977
  Valid Perplexity: 1.000


                                               

Epoch 12/50
  Train Perplexity: 208.557
  Valid Perplexity: 1.000


                                               

Epoch 13/50
  Train Perplexity: 207.540
  Valid Perplexity: 1.000


                                               

Epoch 14/50
  Train Perplexity: 208.425
  Valid Perplexity: 1.000


                                               

Epoch 15/50
  Train Perplexity: 209.056
  Valid Perplexity: 1.000


                                               

Epoch 16/50
  Train Perplexity: 207.804
  Valid Perplexity: 1.000


                                               

Epoch 17/50
  Train Perplexity: 209.822
  Valid Perplexity: 1.000


                                               

Epoch 18/50
  Train Perplexity: 209.378
  Valid Perplexity: 1.000


                                               

Epoch 19/50
  Train Perplexity: 208.744
  Valid Perplexity: 1.000


                                               

Epoch 20/50
  Train Perplexity: 208.817
  Valid Perplexity: 1.000


                                               

Epoch 21/50
  Train Perplexity: 209.181
  Valid Perplexity: 1.000


                                               

Epoch 22/50
  Train Perplexity: 209.051
  Valid Perplexity: 1.000


                                               

Epoch 23/50
  Train Perplexity: 208.637
  Valid Perplexity: 1.000


                                               

Epoch 24/50
  Train Perplexity: 209.930
  Valid Perplexity: 1.000


                                               

Epoch 25/50
  Train Perplexity: 208.640
  Valid Perplexity: 1.000


                                               

Epoch 26/50
  Train Perplexity: 210.012
  Valid Perplexity: 1.000


                                               

Epoch 27/50
  Train Perplexity: 209.165
  Valid Perplexity: 1.000


                                               

Epoch 28/50
  Train Perplexity: 210.060
  Valid Perplexity: 1.000


                                               

Epoch 29/50
  Train Perplexity: 210.642
  Valid Perplexity: 1.000


                                               

Epoch 30/50
  Train Perplexity: 209.231
  Valid Perplexity: 1.000


                                               

Epoch 31/50
  Train Perplexity: 208.584
  Valid Perplexity: 1.000


                                               

Epoch 32/50
  Train Perplexity: 207.861
  Valid Perplexity: 1.000


                                               

Epoch 33/50
  Train Perplexity: 208.898
  Valid Perplexity: 1.000


                                               

Epoch 34/50
  Train Perplexity: 208.560
  Valid Perplexity: 1.000


                                               

Epoch 35/50
  Train Perplexity: 209.020
  Valid Perplexity: 1.000


                                               

Epoch 36/50
  Train Perplexity: 209.409
  Valid Perplexity: 1.000


                                               

Epoch 37/50
  Train Perplexity: 207.969
  Valid Perplexity: 1.000


                                               

Epoch 38/50
  Train Perplexity: 208.658
  Valid Perplexity: 1.000


                                               

Epoch 39/50
  Train Perplexity: 208.222
  Valid Perplexity: 1.000


                                               

Epoch 40/50
  Train Perplexity: 208.878
  Valid Perplexity: 1.000


                                               

Epoch 41/50
  Train Perplexity: 208.672
  Valid Perplexity: 1.000


                                               

Epoch 42/50
  Train Perplexity: 209.856
  Valid Perplexity: 1.000


                                               

Epoch 43/50
  Train Perplexity: 209.303
  Valid Perplexity: 1.000


                                               

Epoch 44/50
  Train Perplexity: 208.280
  Valid Perplexity: 1.000


                                               

Epoch 45/50
  Train Perplexity: 208.546
  Valid Perplexity: 1.000


                                               

Epoch 46/50
  Train Perplexity: 209.144
  Valid Perplexity: 1.000


                                               

Epoch 47/50
  Train Perplexity: 208.875
  Valid Perplexity: 1.000


                                               

Epoch 48/50
  Train Perplexity: 208.934
  Valid Perplexity: 1.000


                                               

Epoch 49/50
  Train Perplexity: 208.805
  Valid Perplexity: 1.000


                                               

Epoch 50/50
  Train Perplexity: 208.721
  Valid Perplexity: 1.000




## 6. Testing

In [35]:
model.load_state_dict(torch.load('best-val-lstm_lm.pt',  map_location=device))
test_loss = evaluate(model, test_data, criterion, batch_size, seq_len, device)
print(f'Test Perplexity: {math.exp(test_loss):.3f}')

Test Perplexity: 1.000


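A test perplexity of exactly 1.000 is not a genuine result, and the small dataset is not the direct cause. With `seq_len = 50`, the validation and test tensors have only 36 and 39 columns, fewer than one full sequence. `evaluate` therefore trims the data to a single column, its loop never runs, and it returns a loss of 0, which `math.exp` maps to 1.000. The same holds for every validation perplexity in the training log above.
This has two side effects. The validation loss never improves after the first epoch, so `ReduceLROnPlateau` with `patience=0` keeps halving the learning rate, which likely explains why the training perplexity stalls near 209, and `best-val-lstm_lm.pt` holds the weights from epoch 1. The only meaningful held-out number in this notebook is the validation perplexity of about 242 obtained earlier with `seq_len = 35`. Using a `seq_len` of at most 35 (or a smaller `batch_size`, which gives longer rows) would make the evaluation meaningful.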
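
The number of loop iterations in `evaluate` depends only on the column count of the data and on `seq_len`, so it can be checked without a model. The sketch below uses a hypothetical helper, `evaluation_windows`, that copies the trimming arithmetic from `evaluate` and plugs in the tensor shapes printed earlier (293, 36, and 39 columns).

```python
def evaluation_windows(num_cols, seq_len):
    """Start indices visited by the loop in `evaluate` for a [batch, num_cols] tensor."""
    num_tokens = num_cols - (num_cols - 1) % seq_len   # same trimming as evaluate
    return list(range(0, num_tokens - 1, seq_len))

for name, cols in [("train", 293), ("valid", 36), ("test", 39)]:
    for seq_len in (35, 50):
        starts = evaluation_windows(cols, seq_len)
        print(f"{name:5s} cols={cols:3d} seq_len={seq_len}: {len(starts)} window(s)")
```

When no window is visited, `epoch_loss` stays at 0.0, and `math.exp(0.0)` is 1.0, whatever the model predicts.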

## 7. Real-world inference

Here we take the prompt, tokenize it, encode it, and feed it into the model to get predictions.  We keep only the output at the last position of the sequence, which is the prediction for the next word, divide those logits by a temperature value to adjust the model's confidence, and apply softmax to obtain a probability distribution.

Once we have the softmax distribution, we randomly sample from it to predict the next word.  If we get `<unk>`, we sample again (up to ten times).  Once we get `<eos>`, we stop predicting.

Finally, we decode the predicted indices back to strings.

In [36]:
def generate(prompt, max_seq_len, temperature, model, tokenizer, vocab, device, seed=None):
    if seed is not None:
        torch.manual_seed(seed)

    model.eval()

    tokens = tokenizer(prompt)
    indices = [vocab[t] for t in tokens]

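    hidden = model.init_hidden(1, device)
    # Feed the whole prompt first so the hidden state sees every prompt token;
    # afterwards only the newly sampled token is fed, because the hidden state
    # already summarizes everything before it.
    src = torch.LongTensor([indices]).to(device)

    with torch.no_grad():
        for _ in range(max_seq_len):
            prediction, hidden = model(src, hidden)

            # Next-token distribution, sharpened or flattened by the temperature
            probs = torch.softmax(prediction[:, -1] / temperature, dim=-1)

            for _ in range(10):  # resample up to 10 times to avoid <unk>
                next_token = torch.multinomial(probs, 1).item()
                if next_token != vocab['<unk>']:
                    break

            if next_token == vocab['<eos>']:
                break

            indices.append(next_token)
            src = torch.LongTensor([[next_token]]).to(device)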

    itos = vocab.get_itos()
    return [itos[i] for i in indices]


In [40]:
prompt = 'Bilbo baggins is '
max_seq_len = 30
seed = 0

# The higher the temperature, the more diverse the tokens,
# at the cost of less sensible sentences
temperatures = [0.5, 0.7, 0.75, 0.8, 1.0]
for temperature in temperatures:
    generation = generate(prompt, max_seq_len, temperature, model, tokenizer, 
                          vocab, device, seed)
    print(str(temperature)+'\n'+' '.join(generation)+'\n')

0.5
bilbo baggins is rain day it you to sleep to the of to the of but you no to the we the and the it i have a the the you to i

0.7
bilbo baggins is rain day dangerous down counsel to sleep which world with the will to the of but says no be the we the and the it in must nice think of

0.75
bilbo baggins is rain beer day dangerous down counsel here sleep which world with the will to the of but says no be the we kings and the it ha must nice think

0.8
bilbo baggins is rain beer day dangerous down counsel here sleep which world with the will to the of but says no be he we kings and the it ha must nice think

1.0
bilbo baggins is rain beer day dangerous pippin down counsel here sleep which world with the will to the of but says no be he we kings and the it looks ha must



Lower temperatures sharpen the softmax distribution, favoring high-probability tokens and producing more deterministic text. Higher temperatures flatten the distribution, increasing randomness and lexical diversity.
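
The effect of the temperature can be seen on a toy example. This is a minimal pure-Python sketch: the logits are made up, and `softmax_with_temperature` is a hypothetical helper that mirrors the `torch.softmax(logits / temperature)` step in `generate`.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                           # subtract the max for stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]   # made-up scores for three candidate tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

The most likely token gains probability mass as the temperature drops and loses it as the temperature rises, while the ranking of the tokens stays the same.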

This assignment implemented a word-level LSTM language model trained using truncated backpropagation through time. The generated samples show that the model picks up frequent words and some local word patterns, although they are not yet coherent sentences; the exercise still demonstrates how a recurrent neural network is set up to model sequential dependencies in natural language.

However, several limitations are observed. First, the model operates at the word level with a fixed vocabulary, which leads to reliance on the <unk> token and limits its ability to generalize to unseen words. Second, the LSTM processes sequences sequentially, making training and inference less efficient compared to modern parallel architectures. Long-range dependencies may also be weakened despite the use of multiple layers, as information must pass through many recurrent steps.

Additionally, the model lacks attention mechanisms, which restricts its ability to selectively focus on relevant parts of the context. As a result, generated text may exhibit repetitive patterns or reduced global coherence, especially for longer sequences. The relatively small dataset further limits linguistic diversity and encourages memorization rather than robust generalization.

In modern natural language processing systems, Transformer-based models have largely replaced LSTMs due to their ability to model long-range dependencies more effectively and to leverage parallel computation. Nonetheless, LSTM-based language models remain valuable for educational purposes, as they provide clear insight into sequence modeling, hidden state dynamics, and training techniques such as gradient clipping and truncated backpropagation.