<a href="https://colab.research.google.com/github/fatemehabedin2/AIG/blob/main/Project3_LSTM_BibiAbidin.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project 3: Implementing a Simple Recurrent Neural Network (RNN)

## Introduction

In this project, you will design, implement, and evaluate a simple Recurrent Neural Network (RNN) from scratch. This will involve building the entire pipeline, from data preprocessing to model training and evaluation.

## Objectives

1. Set up TensorFlow or PyTorch environments. You are free to choose your preferred DL platform.
2. Use GPU for training.
3. Create a data loader and implement data preprocessing where needed.
4. Design a Recurrent Neural Network.
5. Train and evaluate your model. Make sure to clearly show loss and accuracy values. Include visualizations too.
6. Answer assessment questions.

I am using the text8 dataset from Kaggle to create a text generator.

In [1]:
import kagglehub
import os

path = kagglehub.dataset_download("gupta24789/text8-word-embedding")

# Check what files are downloaded
print("Dataset downloaded to:", path)
print("Files in dataset:")
print(os.listdir(path))


Dataset downloaded to: /kaggle/input/text8-word-embedding
Files in dataset:
['text8']


In [2]:
import torch

print("PyTorch version:", torch.__version__)
if torch.cuda.is_available():
    print("GPU is available for PyTorch!")
else:
    print("No GPU found for PyTorch.")

PyTorch version: 2.6.0+cu124
GPU is available for PyTorch!


Let’s:

1. Read the file
2. Display a sample of words
3. Confirm the vocabulary size

In [3]:
with open(os.path.join(path, "text8"), "r") as file:
    text = file.read()

print("Length of raw text:", len(text))
print("First 500 characters:", text[:500])

Length of raw text: 713069767
First 500 characters:  anarchism originated as a term of abuse first used against early working class radicals including the diggers of the english revolution and the sans culottes of the french revolution whilst the term is still used in a pejorative way to describe any act that used violent means to destroy the organization of society it has also been taken up as a positive label by self defined anarchists the word anarchism is derived from the greek without archons ruler chief king anarchism as a political philoso


In [None]:
words = text.split()
print("Total number of words:", len(words))

In [None]:
unique_words = sorted(set(words))
print("Vocabulary size:", len(unique_words))

Even though text8 is already lowercased, alphabetic-only, and space-separated, I'll still apply the standard preprocessing steps.

In [4]:
import re

def preprocess_text(text):
    text = text.lower()

    # Remove anything that is not a-z or whitespace by replacing it with "". A ^ at the start of [] negates the character class.
    text = re.sub(r"[^a-z\s]", "", text)

    # Remove extra whitespace
    text = re.sub(r"\s+", " ", text).strip()

    return text


clean_text = preprocess_text(text)
len(clean_text)

713069766
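As a quick sanity check, the sketch below applies the same two regex steps to a short made-up string (the sample sentence is an assumption, not taken from text8). It shows that digits and punctuation are deleted rather than replaced by a space, and that leading and trailing whitespace is stripped, which would be consistent with the cleaned text being one character shorter than the raw text that starts with a space.

```python
import re

def preprocess_text(text):
    # Same steps as the cell above: lowercase, keep a-z and whitespace, collapse spaces
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

sample = "  Hello,   World! It's 2024.\n"   # made-up input for illustration
cleaned = preprocess_text(sample)
print(repr(cleaned))
```

Note that "It's" becomes "its" because the apostrophe is removed without inserting a space.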

In [5]:
words = clean_text.split()

print("First 20 words:", words[:20])
print("Total number of words:", len(words))

First 20 words: ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the', 'diggers', 'of', 'the', 'english']
Total number of words: 124301826


# Word-Level Modeling
I am going to use word-level modeling because each token is a whole word, so the model can learn semantic relationships between words directly. This makes it a more natural fit for language modeling than character-level modeling.
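To make the difference concrete, here is a minimal sketch that tokenizes the same phrase both ways. The phrase is the opening of the text printed earlier; the variable names are only for illustration.

```python
sentence = "anarchism originated as a term of abuse"

word_tokens = sentence.split()   # word-level: one token per word
char_tokens = list(sentence)     # character-level: one token per character, spaces included

print(len(word_tokens), word_tokens[:3])
print(len(char_tokens), char_tokens[:9])
```

The trade-off is that word-level sequences are much shorter, but the vocabulary is far larger: cleaned text8 has only 27 distinct characters (a-z plus space), while the word vocabulary has one entry per distinct word.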

# Vocabulary and Encoding
Build the word2idx and idx2word dictionaries.

word2idx (word ➝ integer) is used during preprocessing, training, and model input creation.

idx2word (integer ➝ word) is used during text generation (predictions) and evaluation / debugging.
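A small round trip on a made-up sentence may help show how the two dictionaries fit together. The names prefixed with toy_ are hypothetical and exist only for this sketch.

```python
toy_words = "the cat sat on the mat".split()

toy_vocab = sorted(set(toy_words))                       # alphabetical, one entry per distinct word
toy_word2idx = {w: i for i, w in enumerate(toy_vocab)}   # word -> integer
toy_idx2word = {i: w for w, i in toy_word2idx.items()}   # integer -> word

encoded = [toy_word2idx[w] for w in toy_words]           # what the model sees
decoded = [toy_idx2word[i] for i in encoded]             # what we read back

print(encoded)
print(decoded == toy_words)
```

Because the vocabulary is sorted alphabetically, the integer IDs carry no frequency or semantic information; they are just lookup keys for the embedding layer.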



In [6]:
from collections import Counter

word_counts = Counter(words)

# Creating sorted vocabulary for indexing
vocab = sorted(word_counts.keys())

word2idx = {word: idx for idx, word in enumerate(vocab)}
idx2word = {idx: word for word, idx in word2idx.items()}

# Encode whole text : full dataset as list of token IDs

encoded_text = [word2idx[word] for word in words]

print("First 20 encoded tokens:", encoded_text[:20])


First 20 encoded tokens: [26983, 534054, 43559, 0, 728186, 524750, 3359, 248070, 771709, 11156, 208483, 811779, 139584, 601589, 339492, 731197, 189025, 524750, 731197, 221580]


In [7]:
len(encoded_text)

124301826

In [7]:
# Use only the first 25M tokens
encoded_text = encoded_text[:25000000]

In [None]:
len(encoded_text)

In [None]:
def create_sequences(data, seq_length):
    inputs = []
    targets = []
    # Slide a window over the data passed in (not the global encoded_text)
    for i in range(len(data) - seq_length):
        inputs.append(data[i:i + seq_length])
        targets.append(data[i + seq_length])
    return inputs, targets

seq_length = 20  # 20 words to predict the 21st
inputs, targets = create_sequences(encoded_text, seq_length)

# Convert to tensors. torch.tensor() copies the data from any iterable (list, NumPy array, etc.) into a new PyTorch tensor.
# dtype=torch.long stores the token IDs as 64-bit integers (int64).
inputs = torch.tensor(inputs, dtype=torch.long)   # 2D tensor: [num_sequences, seq_length]
targets = torch.tensor(targets, dtype=torch.long)

print("Input shape:", inputs.shape)
print("Target shape:", targets.shape)
print("Example input:", inputs[0])
print("Target word index:", targets[0])
print("Target word:", idx2word[targets[0].item()])

After running the code above, my Colab session crashed with the message "Your session crashed after using all available RAM", because the code generates millions of sequences and stores each one as a separate list inside inputs.

Instead of preloading all sequences, I now create them on the fly in the Dataset class, so they are never all held in memory at once.
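A rough back-of-the-envelope estimate shows why preloading fails. The sketch below assumes the 25,000,000-token slice and seq_length = 20 used above, and counts only the final int64 tensor; the intermediate Python lists add pointer and object overhead on top of this.

```python
num_tokens = 25_000_000   # tokens kept after slicing encoded_text
seq_length = 20
bytes_per_id = 8          # torch.long is a 64-bit integer

num_sequences = num_tokens - seq_length
tensor_bytes = num_sequences * seq_length * bytes_per_id   # every window materialised
stream_bytes = num_tokens * bytes_per_id                   # token IDs stored once

print(f"sequences: {num_sequences:,}")
print(f"inputs as int64 tensor: {tensor_bytes / 1e9:.1f} GB")
print(f"token IDs stored once:  {stream_bytes / 1e9:.1f} GB")
```

Each token is copied into roughly 20 overlapping windows, so materialising the windows costs about 20 times the memory of the token stream itself. Slicing windows on demand avoids that duplication.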

# Streaming Dataset + DataLoader

In [8]:
from torch.utils.data import Dataset, DataLoader

class Text8StreamingDataset(Dataset):
    def __init__(self, encoded_text, sequence_length):
        self.data = encoded_text
        self.seq_len = sequence_length

    def __len__(self):
        return len(self.data) - self.seq_len

    def __getitem__(self, idx):
        x = torch.tensor(self.data[idx:idx + self.seq_len], dtype=torch.long)
        y = torch.tensor(self.data[idx + self.seq_len], dtype=torch.long)
        return x, y


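sequence_length = 20
# With shuffle=True the DataLoader picks 32 random indices and calls the dataset's __getitem__(idx) for each,
# giving 32 sequences of 20 words (x_batch: [32, 20], y_batch: [32])
batch_size = 32

# Assumption: the cell that defined train_data and val_data is not shown in this notebook,
# so a contiguous 90/10 split of the encoded tokens is used here as a placeholder.
split_idx = int(0.9 * len(encoded_text))
train_data = encoded_text[:split_idx]
val_data   = encoded_text[split_idx:]

# Create train and validation DataLoaders
train_dataset = Text8StreamingDataset(train_data, sequence_length)
val_dataset   = Text8StreamingDataset(val_data, sequence_length)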

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=0)
val_loader   = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, num_workers=0)     # Colab often crashes when num_workers > 0 in DataLoader.


Input batch shape: torch.Size([32, 20])
Target batch shape: torch.Size([32])
First input sequence (word indices): tensor([ 19646, 733490, 801564, 437113, 253600, 771447, 461099, 529311,  27554,
        253600, 538048, 640409, 307325, 110242, 200118, 259710, 307909, 105481,
        731197, 580487])
First target word: of
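The cell that printed these batch shapes is not shown, so here is a minimal sketch that reproduces the same shapes on made-up data. ToyStreamingDataset is a hypothetical stand-in that uses the same slicing logic as Text8StreamingDataset, and the token IDs are simply 0 to 99.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyStreamingDataset(Dataset):
    """Hypothetical stand-in: builds each (input, target) pair on demand."""
    def __init__(self, token_ids, seq_len):
        self.data = token_ids
        self.seq_len = seq_len

    def __len__(self):
        return len(self.data) - self.seq_len

    def __getitem__(self, idx):
        x = torch.tensor(self.data[idx:idx + self.seq_len], dtype=torch.long)
        y = torch.tensor(self.data[idx + self.seq_len], dtype=torch.long)
        return x, y

toy_ids = list(range(100))   # made-up token IDs
toy_dataset = ToyStreamingDataset(toy_ids, seq_len=20)
toy_loader = DataLoader(toy_dataset, batch_size=32, shuffle=False)

x_batch, y_batch = next(iter(toy_loader))
print("Input batch shape:", x_batch.shape)
print("Target batch shape:", y_batch.shape)
print("Target follows the input window:", y_batch[0].item() == x_batch[0, -1].item() + 1)
```

With shuffle=False the first window starts at index 0, which makes it easy to confirm that each target is the token immediately after its input window.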


# LSTM Language Model

We'll build a basic LSTM that:

1. Takes a batch of word-index sequences of shape [32, 20]
2. Embeds them as dense vectors (like word2vec)
3. Feeds them through LSTM layers
4. Outputs a logit for each word in the vocabulary


In [9]:
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_layers=2):
        # input_size=embedding_dim, hidden_size=hidden_dim, output_size=vocab_size
        super(LSTMLanguageModel, self).__init__()
        # Each word index maps to a trainable dense vector (embedding). The vectors are updated by
        # backpropagation at every optimizer step, together with the rest of the model.
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # batch_first=True: input shape is [batch_size, seq_len, embedding_dim], which matches DataLoader batches
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers, batch_first=True)
        # hidden_dim is the size of the LSTM hidden state at each time step; fc maps it to one logit per word
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        # x: [batch_size, seq_len]
        embedded = self.embedding(x)       # [batch_size, seq_len, embedding_dim]
        lstm_out, _ = self.lstm(embedded)  # lstm_out: [batch_size, seq_len, hidden_dim]
        final_hidden = lstm_out[:, -1, :]  # Take output from last timestep → [batch_size, hidden_dim]
        logits = self.fc(final_hidden)     # → [batch_size, vocab_size]
        return logits


SyntaxError: unterminated string literal (detected at line 2) (ipython-input-9-1404640251.py, line 2)

The output size is vocab_size because the model performs classification over the entire vocabulary: it predicts the next word by selecting from all possible words.
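As an illustration of what "selecting from all possible words" can look like, the sketch below turns one row of logits into a next word. The 5-word vocabulary and the logit values are made up; in the notebook the logits would come from the model and the lookup would use idx2word.

```python
import torch

# Made-up 5-word vocabulary and logits for a single sequence (batch_size = 1)
toy_idx2word = {0: "a", 1: "of", 2: "term", 3: "the", 4: "used"}
logits = torch.tensor([[0.5, 2.0, -1.0, 1.0, 0.0]])   # shape [1, vocab_size]

probs = torch.softmax(logits, dim=-1)                  # probabilities over the vocabulary
greedy_idx = torch.argmax(probs, dim=-1).item()        # most likely next word
sampled_idx = torch.multinomial(probs, num_samples=1).item()  # random draw from the distribution

print("Greedy next word:", toy_idx2word[greedy_idx])
print("Sampled next word:", toy_idx2word[sampled_idx])
```

Greedy selection always returns the highest-scoring word, while sampling can return any word in proportion to its probability, which tends to give more varied generated text.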

# Train the LSTM Language Model
We’ll create the full training loop that:

1. Loads batches of sequences
2. Feeds them to the model
3. Computes loss using CrossEntropyLoss
4. Backpropagates
5. Updates weights using Adam
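To show what the loss in step 3 measures, here is a plain-Python sketch of cross-entropy for a single example: the negative log of the softmax probability assigned to the correct word. The cross_entropy helper and the logit values are illustrative assumptions; nn.CrossEntropyLoss applies the same formula to raw logits and, by default, averages over the batch.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy for one example: -log(softmax(logits)[target])."""
    peak = max(logits)
    shifted = [z - peak for z in logits]                  # subtract the max for numerical stability
    log_sum_exp = math.log(sum(math.exp(z) for z in shifted))
    return -(shifted[target] - log_sum_exp)

vocab_size = 4
uniform_loss = cross_entropy([0.0] * vocab_size, target=2)      # all words equally likely
confident_loss = cross_entropy([0.0, 0.0, 5.0, 0.0], target=2)  # correct word strongly favoured

print(f"uniform logits:   {uniform_loss:.4f} (log(vocab_size) = {math.log(vocab_size):.4f})")
print(f"confident logits: {confident_loss:.4f}")
```

A model that spreads probability evenly scores log(vocab_size), so a first-batch loss near that value is a reasonable sanity check before training starts to bring it down.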

In [None]:
embedding_dim = 128
hidden_dim = 256
vocab_size = len(word2idx)

model = LSTMLanguageModel(vocab_size, embedding_dim, hidden_dim)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)


In [None]:

# Use GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

num_epochs = 3

for epoch in range(num_epochs):
    total_loss = 0
    model.train()  # Set model to training mode

    # Iterate over train_loader (the name dataloader was never defined)
    for batch_idx, (x_batch, y_batch) in enumerate(train_loader):
        x_batch = x_batch.to(device)
        y_batch = y_batch.to(device)

        # Forward pass
        logits = model(x_batch)  # Output: [batch_size, vocab_size]

        # Compute loss
        loss = criterion(logits, y_batch)

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

        # Print progress every 100 batches
        if batch_idx % 100 == 0:
            print(f"Epoch [{epoch+1}/{num_epochs}], Batch [{batch_idx}], Loss: {loss.item():.4f}")

    avg_loss = total_loss / len(train_loader)
    print(f"Epoch [{epoch+1}/{num_epochs}] complete. Avg loss: {avg_loss:.4f}")


In [None]:

num_epochs = 5
train_losses = []
val_losses = []

model.to(device)

for epoch in range(num_epochs):
    model.train()
    train_loss = 0

    for x_batch, y_batch in train_loader:
        x_batch, y_batch = x_batch.to(device), y_batch.to(device)

        logits = model(x_batch)
        loss = criterion(logits, y_batch)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        train_loss += loss.item()

    avg_train_loss = train_loss / len(train_loader)
    train_losses.append(avg_train_loss)

    # Validation loop
    model.eval()
    val_loss = 0
    with torch.no_grad():
        for x_val, y_val in val_loader:
            x_val, y_val = x_val.to(device), y_val.to(device)
            logits = model(x_val)
            loss = criterion(logits, y_val)
            val_loss += loss.item()

    avg_val_loss = val_loss / len(val_loader)
    val_losses.append(avg_val_loss)

    print(f"Epoch {epoch+1}/{num_epochs} | Train Loss: {avg_train_loss:.4f} | Val Loss: {avg_val_loss:.4f}")


In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 5))
plt.plot(train_losses, label='Train Loss')
plt.plot(val_losses, label='Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training vs Validation Loss')
plt.legend()
plt.grid(True)
plt.show()
