# Recurrent Neural Network (RNN) on IMDB Dataset for Sentiment Classification

This project implements a Recurrent Neural Network (RNN) for sentiment classification on the IMDB dataset. It develops two focuses in parallel:
1. Understanding the NLP preprocessing pipeline.
2. Modelling sequential data with RNNs and the algorithms used to train them.

To do:
- <u>Load the data</u>
- <u>Exploratory Data Analysis</u>
- <u>NLP Preprocessing Pipeline</u>
  - <u>Tokenization</u>
  - <u>Lowercasing</u>
  - <u>Stop word removal</u>
  - <u>Remove digits/punctuation</u>
- <u>Create vocab and map tokens to vocab indices</u>
- <u>Split the data into training, validation, and test data</u>
  - <u>Put in DataLoader</u>
- <u>Create the model in PyTorch</u>
- <u>Construct the training and validation loops</u>
- <u>Evaluation and metrics</u>
- Graphs and Analysis
- Implement the model from scratch in NumPy
  - Learned Embedding Layer
  - Forward Propagation
  - Backpropagation Through Time
- Compare implemented model vs. torch version.

## Manual Implementation

In [None]:
## Classes for Implementation
# Fit, Forward propagation, backpropagation through time, loss, embedding layer, pooling over all time steps, gradient descent

## Outside of class
# Weights Initialization, Adam, Norm Gradient Clipping, Minibatching, One_Hot, Derivative of Activation function, Dropout, Inference

In [None]:
class Adam_optimizer():
  def __init__(self):
    pass
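
The optimizer stub above is still empty. As a rough sketch of what an Adam update could look like in NumPy, the block below keeps per-parameter moment estimates and applies bias correction. The class name `AdamSketch`, the `step(params, grads)` interface, and the dictionary layout are illustrative assumptions, not the notebook's planned design.

```python
import numpy as np

class AdamSketch:
    """Hypothetical minimal Adam optimizer for a dict of NumPy parameters."""

    def __init__(self, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = {}  # first-moment (mean) estimates, one per parameter
        self.v = {}  # second-moment (uncentered variance) estimates
        self.t = 0   # number of update steps taken so far

    def step(self, params, grads):
        """Update every array in `params` in place using `grads`."""
        self.t += 1
        for name, g in grads.items():
            m = self.beta1 * self.m.get(name, np.zeros_like(g)) + (1 - self.beta1) * g
            v = self.beta2 * self.v.get(name, np.zeros_like(g)) + (1 - self.beta2) * g ** 2
            self.m[name], self.v[name] = m, v
            # Bias correction compensates for the zero initialisation of m and v
            m_hat = m / (1 - self.beta1 ** self.t)
            v_hat = v / (1 - self.beta2 ** self.t)
            params[name] -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

# Toy usage: minimise f(w) = w^2, whose gradient is 2w
params = {"w": np.array([1.0])}
opt = AdamSketch(lr=0.1)
for _ in range(200):
    opt.step(params, {"w": 2 * params["w"]})
```

The defaults mirror the `betas=(0.9, 0.999)` passed to `optim.Adam` later in the notebook; weight decay is left out of this sketch.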

In [None]:
class RNN():
  def __init__(self):
    pass
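
The `RNN` stub above is a placeholder for the from-scratch model. A minimal sketch of the forward pass it would need (learned embedding lookup, tanh recurrence, mean pooling over all time steps, softmax) might look like the following. The function name `rnn_forward`, the parameter names in `params`, and the toy sizes are assumptions made for illustration; backpropagation through time is not covered here.

```python
import numpy as np

def rnn_forward(token_ids, params):
    """Forward pass for a batch of padded index sequences.

    token_ids: int array of shape (batch, seq_len)
    returns:   class probabilities of shape (batch, num_classes)
    """
    E, W_xh, W_hh, b_h = params["E"], params["W_xh"], params["W_hh"], params["b_h"]
    W_hy, b_y = params["W_hy"], params["b_y"]
    batch, seq_len = token_ids.shape
    h = np.zeros((batch, W_hh.shape[0]))       # initial hidden state
    hidden_sum = np.zeros_like(h)
    for t in range(seq_len):
        x_t = E[token_ids[:, t]]               # embedding lookup: (batch, embd_dim)
        h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)
        hidden_sum += h
    pooled = hidden_sum / seq_len              # mean pooling over time steps
    logits = pooled @ W_hy + b_y
    logits -= logits.max(axis=1, keepdims=True)  # stabilise the softmax
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

# Toy sizes, far smaller than the notebook's vocabulary of 20000
rng = np.random.default_rng(0)
vocab_size, embd_dim, hidden_size, num_classes = 50, 8, 16, 2
params = {
    "E": rng.normal(0, 0.1, (vocab_size, embd_dim)),
    "W_xh": rng.normal(0, 0.1, (embd_dim, hidden_size)),
    "W_hh": rng.normal(0, 0.1, (hidden_size, hidden_size)),
    "b_h": np.zeros(hidden_size),
    "W_hy": rng.normal(0, 0.1, (hidden_size, num_classes)),
    "b_y": np.zeros(num_classes),
}
toy_batch = rng.integers(0, vocab_size, size=(4, 10))
probs = rnn_forward(toy_batch, params)
print(probs.shape)
```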

## Loading the data

In [1]:
import torch
import torch.nn as nn
import itertools
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from torch.utils.data import DataLoader, Dataset
from torch import optim
import os
import re
import nltk
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from collections import Counter

In [2]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("lakshmi25npathi/imdb-dataset-of-50k-movie-reviews")

Using Colab cache for faster access to the 'imdb-dataset-of-50k-movie-reviews' dataset.


In [3]:
imdb_reviews = pd.read_csv(path + '/IMDB Dataset.csv')

## Exploratory Data Analysis

In [4]:
imdb_reviews.head()

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


In [5]:
len(imdb_reviews)

50000

The dataset consists of 50,000 movie reviews, categorized as either positive or negative.

## NLP Preprocessing Pipeline
- <u>Tokenization: segmenting text into a list of tokens (representations of words)</u>
- <u>Lowercasing</u>
- <u>Stop word removal</u>
- <u>Remove digits/punctuation</u>

Then create vocabulary and map tokens to vocab indices.

In [6]:
reviews = imdb_reviews['review']
reviews[0]

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa

In [7]:
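# Word tokenization, lowercasing, and removal of digits, punctuation, and HTML <br /> tags.
# The raw string (r"...") avoids Python's invalid-escape warning for "\s".
# Matches are deleted rather than replaced by a space, so the words on either side of a
# tag or of "..." merge into one token (e.g. 'methe' and 'wordit' in the output below).
reviews = [re.sub(r"<?br\s*/?>|[0-9.,;:~<>@?]+", '', review.lower()).split(' ') for review in reviews]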
print(reviews[0])


['one', 'of', 'the', 'other', 'reviewers', 'has', 'mentioned', 'that', 'after', 'watching', 'just', '', 'oz', 'episode', "you'll", 'be', 'hooked', 'they', 'are', 'right', 'as', 'this', 'is', 'exactly', 'what', 'happened', 'with', 'methe', 'first', 'thing', 'that', 'struck', 'me', 'about', 'oz', 'was', 'its', 'brutality', 'and', 'unflinching', 'scenes', 'of', 'violence', 'which', 'set', 'in', 'right', 'from', 'the', 'word', 'go', 'trust', 'me', 'this', 'is', 'not', 'a', 'show', 'for', 'the', 'faint', 'hearted', 'or', 'timid', 'this', 'show', 'pulls', 'no', 'punches', 'with', 'regards', 'to', 'drugs', 'sex', 'or', 'violence', 'its', 'is', 'hardcore', 'in', 'the', 'classic', 'use', 'of', 'the', 'wordit', 'is', 'called', 'oz', 'as', 'that', 'is', 'the', 'nickname', 'given', 'to', 'the', 'oswald', 'maximum', 'security', 'state', 'penitentary', 'it', 'focuses', 'mainly', 'on', 'emerald', 'city', 'an', 'experimental', 'section', 'of', 'the', 'prison', 'where', 'all', 'the', 'cells', 'have', '
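
The output above contains merged tokens such as `'methe'` and `'wordit'`, which come from deleting `<br />` tags and punctuation without leaving a separator. One possible variant, sketched below, substitutes a space instead and splits on any whitespace. The names `TAG_RE`, `NOISE_RE`, and `tokenize` are hypothetical, and switching to this tokenizer would change the vocabulary and every result reported later in the notebook.

```python
import re

TAG_RE = re.compile(r"<br\s*/?>")            # HTML line breaks
NOISE_RE = re.compile(r"[0-9.,;:~<>@?]+")    # same character class as the cell above

def tokenize(text):
    """Lowercase, strip tags and noise characters, and split on whitespace."""
    text = TAG_RE.sub(" ", text.lower())     # a space keeps neighbouring words apart
    text = NOISE_RE.sub(" ", text)
    return text.split()                      # split() without arguments drops empty strings

sample = "happened with me.<br /><br />The first thing"
tokens = tokenize(sample)
print(tokens)  # ['happened', 'with', 'me', 'the', 'first', 'thing']
```

Because `split()` already discards empty strings, the later `word != ''` filter would become redundant with this approach.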

In [8]:
# Stop word removal
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Show ten stop words (a set is unordered, so these are arbitrary, not the most common)
list(stop_words)[:10]

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


['than', 'then', 'until', 'she', 'was', "it'd", 'own', 'or', 'its', 'hadn']

In [9]:
# Remove stop words and empty strings
cleaned_reviews = [[word for word in review if word not in stop_words and word != ''] for review in reviews]
cleaned_reviews[0]

['one',
 'reviewers',
 'mentioned',
 'watching',
 'oz',
 'episode',
 'hooked',
 'right',
 'exactly',
 'happened',
 'methe',
 'first',
 'thing',
 'struck',
 'oz',
 'brutality',
 'unflinching',
 'scenes',
 'violence',
 'set',
 'right',
 'word',
 'go',
 'trust',
 'show',
 'faint',
 'hearted',
 'timid',
 'show',
 'pulls',
 'punches',
 'regards',
 'drugs',
 'sex',
 'violence',
 'hardcore',
 'classic',
 'use',
 'wordit',
 'called',
 'oz',
 'nickname',
 'given',
 'oswald',
 'maximum',
 'security',
 'state',
 'penitentary',
 'focuses',
 'mainly',
 'emerald',
 'city',
 'experimental',
 'section',
 'prison',
 'cells',
 'glass',
 'fronts',
 'face',
 'inwards',
 'privacy',
 'high',
 'agenda',
 'em',
 'city',
 'home',
 'manyaryans',
 'muslims',
 'gangstas',
 'latinos',
 'christians',
 'italians',
 'irish',
 'moreso',
 'scuffles',
 'death',
 'stares',
 'dodgy',
 'dealings',
 'shady',
 'agreements',
 'never',
 'far',
 'awayi',
 'would',
 'say',
 'main',
 'appeal',
 'show',
 'due',
 'fact',
 'goes',
 

In [10]:
print(cleaned_reviews[0])

['one', 'reviewers', 'mentioned', 'watching', 'oz', 'episode', 'hooked', 'right', 'exactly', 'happened', 'methe', 'first', 'thing', 'struck', 'oz', 'brutality', 'unflinching', 'scenes', 'violence', 'set', 'right', 'word', 'go', 'trust', 'show', 'faint', 'hearted', 'timid', 'show', 'pulls', 'punches', 'regards', 'drugs', 'sex', 'violence', 'hardcore', 'classic', 'use', 'wordit', 'called', 'oz', 'nickname', 'given', 'oswald', 'maximum', 'security', 'state', 'penitentary', 'focuses', 'mainly', 'emerald', 'city', 'experimental', 'section', 'prison', 'cells', 'glass', 'fronts', 'face', 'inwards', 'privacy', 'high', 'agenda', 'em', 'city', 'home', 'manyaryans', 'muslims', 'gangstas', 'latinos', 'christians', 'italians', 'irish', 'moreso', 'scuffles', 'death', 'stares', 'dodgy', 'dealings', 'shady', 'agreements', 'never', 'far', 'awayi', 'would', 'say', 'main', 'appeal', 'show', 'due', 'fact', 'goes', 'shows', 'dare', 'forget', 'pretty', 'pictures', 'painted', 'mainstream', 'audiences', 'forg

In [11]:
def one_hot(Y, num_classes):
  Y = np.array([1 if y == 'positive' else 0 for y in Y ])
  y = np.zeros(shape=(Y.shape[0], num_classes))
  instances_for_indexing = np.arange(0, Y.shape[0])
  y[instances_for_indexing, Y] = 1
  return y

In [12]:
# One-hot labels
num_classes = 2
imdb_sentiments = imdb_reviews['sentiment']
imdb_sentiments = one_hot(imdb_sentiments.values, num_classes)
imdb_sentiments[:10]

array([[0., 1.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [0., 1.]])
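
In the array above, column 1 marks a positive review and column 0 a negative one. Since the training loop later compares `argmax` predictions with the labels, integer class indices are the form ultimately needed. The sketch below shows the round trip between the two representations on made-up labels; the variable names are illustrative only.

```python
import numpy as np

labels = np.array(["positive", "negative", "positive"])  # toy labels, not from the dataset

class_idx = (labels == "positive").astype(np.int64)  # 1 = positive, 0 = negative
one_hot_rows = np.eye(2)[class_idx]                  # same column layout as one_hot() above
recovered = one_hot_rows.argmax(axis=1)              # back to class indices

print(class_idx)      # [1 0 1]
print(one_hot_rows)
print(recovered)      # [1 0 1]
```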

In [13]:
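# Store one integer class index per review (1 = positive, 0 = negative): the training
# loop's loss targets and accuracy comparison use class indices, not one-hot rows.
imdb_reviews['sentiment'] = imdb_sentiments.argmax(axis=1)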

In [14]:
print(cleaned_reviews[:2])

[['one', 'reviewers', 'mentioned', 'watching', 'oz', 'episode', 'hooked', 'right', 'exactly', 'happened', 'methe', 'first', 'thing', 'struck', 'oz', 'brutality', 'unflinching', 'scenes', 'violence', 'set', 'right', 'word', 'go', 'trust', 'show', 'faint', 'hearted', 'timid', 'show', 'pulls', 'punches', 'regards', 'drugs', 'sex', 'violence', 'hardcore', 'classic', 'use', 'wordit', 'called', 'oz', 'nickname', 'given', 'oswald', 'maximum', 'security', 'state', 'penitentary', 'focuses', 'mainly', 'emerald', 'city', 'experimental', 'section', 'prison', 'cells', 'glass', 'fronts', 'face', 'inwards', 'privacy', 'high', 'agenda', 'em', 'city', 'home', 'manyaryans', 'muslims', 'gangstas', 'latinos', 'christians', 'italians', 'irish', 'moreso', 'scuffles', 'death', 'stares', 'dodgy', 'dealings', 'shady', 'agreements', 'never', 'far', 'awayi', 'would', 'say', 'main', 'appeal', 'show', 'due', 'fact', 'goes', 'shows', 'dare', 'forget', 'pretty', 'pictures', 'painted', 'mainstream', 'audiences', 'for

In [15]:
# Map tokens to indices
def build_vocab(reviews, vocab_size=20000):
  vocab = {}

  # Special tokens
  unknown = '<UNK>'
  padding = '<PAD>'

  # Map Tokens to Vocabulary
  flattened_list = [word for review in reviews for word in review]

  # Count words
  counted_words = Counter(flattened_list) # 303150 in total

  # Keep only the most frequent words
  most_common_words = counted_words.most_common(n=vocab_size-2) # -2 accounts for <UNK> and <PAD> tokens


  vocab[padding] = 0
  vocab[unknown] = 1

  for i, word in enumerate(most_common_words):
    vocab[word[0]] = i+2

  return vocab

vocab = build_vocab(cleaned_reviews, vocab_size=20000)
dict(itertools.islice(vocab.items(), 50))


{'<PAD>': 0,
 '<UNK>': 1,
 'movie': 2,
 'film': 3,
 'one': 4,
 'like': 5,
 'good': 6,
 'would': 7,
 'even': 8,
 'really': 9,
 'time': 10,
 'see': 11,
 'story': 12,
 '-': 13,
 'much': 14,
 'get': 15,
 'well': 16,
 'great': 17,
 'also': 18,
 'people': 19,
 'bad': 20,
 'first': 21,
 'make': 22,
 'made': 23,
 'could': 24,
 'way': 25,
 'movies': 26,
 'think': 27,
 'characters': 28,
 'watch': 29,
 'many': 30,
 'films': 31,
 'seen': 32,
 'two': 33,
 'never': 34,
 'character': 35,
 'acting': 36,
 'little': 37,
 'know': 38,
 'love': 39,
 'plot': 40,
 'best': 41,
 'show': 42,
 'ever': 43,
 'life': 44,
 'better': 45,
 'still': 46,
 'say': 47,
 'scene': 48,
 'end': 49}

In [16]:
# Convert reviews into vocab indices
for review in cleaned_reviews:
  for i in range(len(review)):
    if review[i] in vocab:
      review[i] = vocab[review[i]]
    else:
      review[i] = vocab['<UNK>']

print(cleaned_reviews[0])

[4, 1831, 919, 55, 3806, 282, 2983, 112, 456, 453, 8123, 21, 57, 2966, 3806, 5234, 15266, 50, 454, 173, 112, 549, 53, 1580, 42, 7912, 5483, 11295, 42, 2255, 5697, 5394, 1328, 273, 454, 3573, 248, 220, 1, 338, 3806, 11672, 222, 17202, 6742, 2395, 948, 1, 2384, 1244, 1, 433, 4517, 2348, 1069, 7075, 2873, 12730, 303, 1, 17678, 210, 4832, 9778, 433, 241, 1, 8531, 1, 15267, 5074, 8302, 2243, 1, 1, 229, 8735, 7128, 13202, 8303, 1, 34, 124, 1, 7, 47, 158, 1135, 42, 527, 87, 150, 156, 2883, 683, 79, 1158, 4095, 2380, 1047, 683, 1243, 683, 1, 841, 85, 21, 282, 43, 100, 2966, 1426, 2020, 47, 1381, 163, 1280, 1114, 3806, 86, 9779, 210, 1873, 1906, 454, 454, 7487, 1, 4763, 13696, 2713, 1, 6743, 13696, 382, 481, 15, 141, 16, 9394, 605, 674, 6743, 504, 1069, 1, 527, 419, 871, 1812, 1069, 1, 55, 3806, 94, 292, 3439, 2991, 1, 15, 1075, 3723, 377]
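
The index `1` appears repeatedly in the encoded review above, meaning those tokens fell outside the 20000-word vocabulary and were mapped to `<UNK>`. A quick way to sanity-check an encoding is to decode it again and measure the share of unknown tokens. The sketch below uses a tiny made-up vocabulary; `encode`, `decode`, and `toy_vocab` are hypothetical helpers, not part of the notebook.

```python
toy_vocab = {"<PAD>": 0, "<UNK>": 1, "movie": 2, "film": 3, "good": 4}
index_to_token = {idx: tok for tok, idx in toy_vocab.items()}  # inverse mapping

def encode(tokens, vocab):
    """Map tokens to indices, falling back to <UNK> for out-of-vocabulary words."""
    return [vocab.get(tok, vocab["<UNK>"]) for tok in tokens]

def decode(indices, inverse_vocab):
    """Map indices back to tokens."""
    return [inverse_vocab[i] for i in indices]

encoded = encode(["good", "movie", "wordit"], toy_vocab)
decoded = decode(encoded, index_to_token)
unk_rate = encoded.count(toy_vocab["<UNK>"]) / len(encoded)

print(encoded)   # [4, 2, 1]
print(decoded)   # ['good', 'movie', '<UNK>']
```

Unlike the in-place loop above, `encode` returns a new list, so the original tokens stay available for inspection.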


In [17]:
# Pad or truncate every review to a fixed length
sequence_length = 200

for i in range(len(cleaned_reviews)):
  review = cleaned_reviews[i][:sequence_length]  # truncate long reviews
  amount_of_padding = sequence_length - len(review)
  cleaned_reviews[i] = review + [vocab['<PAD>']] * amount_of_padding  # right-pad short ones
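
A fixed length of 200 trades truncated reviews against wasted padding. One way to judge such a choice is to look at the distribution of review lengths before padding. The sketch below does this on made-up token counts; the `coverage` helper and the numbers are illustrative and are not statistics of the IMDB data.

```python
import numpy as np

def coverage(lengths, sequence_length):
    """Fraction of sequences that fit without being truncated."""
    lengths = np.asarray(lengths)
    return float((lengths <= sequence_length).mean())

toy_lengths = [12, 85, 150, 230, 410]   # made-up token counts per review

fits = coverage(toy_lengths, 200)
p80 = float(np.percentile(toy_lengths, 80))
print(fits)  # 0.6
print(p80)   # length below which roughly 80% of these toy reviews fall
```

In the notebook, the lengths would have to be collected from `cleaned_reviews` before the padding cell runs, since every review has length 200 afterwards.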


In [18]:
imdb_reviews['review'] = cleaned_reviews

## Train/Validation/Test Split

In [19]:
# Convert elements to tensors
cleaned_reviews = [torch.tensor(review, dtype=torch.long) for review in cleaned_reviews]

In [20]:
# Stack into one tensor of shape (num_reviews, sequence_length)
cleaned_reviews_tensors = torch.stack(cleaned_reviews)

In [21]:
# Convert labels to a tensor (via the underlying NumPy array rather than the Series)
imdb_sentiments_labels = torch.tensor(imdb_reviews['sentiment'].values, dtype=torch.long)

In [22]:
# Train/Validation/Test Split: 64% / 16% / 20%, taken in dataset order
split = 0.8  # fraction used for training + validation
n = len(cleaned_reviews_tensors)
train_end = int(n * 0.64)
val_end = int(n * split)

X_train, y_train = cleaned_reviews_tensors[:train_end], imdb_sentiments_labels[:train_end]
X_val, y_val = cleaned_reviews_tensors[train_end:val_end], imdb_sentiments_labels[train_end:val_end]
X_test, y_test = cleaned_reviews_tensors[val_end:], imdb_sentiments_labels[val_end:]
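
The split above slices the data in file order, so each subset reflects whatever ordering the CSV happens to have. If that ordering were not random, a shuffled split would be safer. A minimal sketch using a seeded NumPy permutation follows; the function name `shuffled_split_indices` and the seed are assumptions, and this is an alternative to the notebook's split rather than what it does.

```python
import numpy as np

def shuffled_split_indices(n, train_frac=0.64, val_frac=0.16, seed=0):
    """Return disjoint, shuffled index arrays for train/validation/test."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n)                     # random order of 0..n-1
    train_end = round(n * train_frac)
    val_end = round(n * (train_frac + val_frac))
    return order[:train_end], order[train_end:val_end], order[val_end:]

train_idx, val_idx, test_idx = shuffled_split_indices(50_000)
print(len(train_idx), len(val_idx), len(test_idx))  # 32000 8000 10000
```

The index arrays could then be used to index both the review tensor and the label tensor, which keeps each review paired with its label.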

In [23]:
class IMDBMovieDataset(Dataset):
  def __init__(self, X, y):
    self.X = X
    self.y = y

  def __len__(self):
    return len(self.X)

  def __getitem__(self, idx):
    return self.X[idx], self.y[idx]

training_dataset = IMDBMovieDataset(X_train, y_train)
validation_dataset = IMDBMovieDataset(X_val, y_val)
test_dataset = IMDBMovieDataset(X_test, y_test)

## DataLoader

In [24]:
train_dataloader = DataLoader(training_dataset, batch_size=32, shuffle=True)
# Shuffling is only needed for training; evaluation metrics do not depend on batch order
validation_dataloader = DataLoader(validation_dataset, batch_size=32, shuffle=False)
test_dataloader = DataLoader(test_dataset, batch_size=32, shuffle=False)

## Model

In [25]:
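class RNN(nn.Module):
  def __init__(self, seq_len=200, embd_dim=64, hidden_size=32, num_classes=2, vocab_size=20000):
    super().__init__()
    # seq_len is not used by any layer; it is kept so existing calls keep working
    self.embd = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embd_dim)
    # nn.RNN's own dropout argument only acts between stacked layers, so it has no
    # effect with num_layers=1; dropout is applied through self.dropout instead
    self.rnn = nn.RNN(embd_dim, hidden_size, batch_first=True, nonlinearity='tanh', num_layers=1)
    self.output = nn.Linear(hidden_size, num_classes)
    self.dropout = nn.Dropout(0.3)

  def forward(self, X):
    X = self.embd(X)              # (batch, seq_len, embd_dim)
    out, hidden = self.rnn(X)     # out: (batch, seq_len, hidden_size)
    out = self.dropout(out)
    pooled = out.mean(dim=1)      # mean over all time steps, padded positions included
    out = self.output(pooled)     # (batch, num_classes) logits

    return out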
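
The model averages hidden states over all 200 positions, including those that only hold `<PAD>`. A possible refinement is to average over real tokens only by masking padded positions. The sketch below assumes padding index 0 as in the vocabulary above; `masked_mean` is a hypothetical helper and is not used by the notebook's model.

```python
import torch

def masked_mean(rnn_out, token_ids, pad_id=0):
    """Average hidden states over non-padding positions only.

    rnn_out:   (batch, seq_len, hidden_size) outputs of the recurrent layer
    token_ids: (batch, seq_len) integer inputs that produced rnn_out
    """
    mask = (token_ids != pad_id).unsqueeze(-1).to(rnn_out.dtype)  # (batch, seq_len, 1)
    summed = (rnn_out * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)  # avoid division by zero for all-padding rows
    return summed / counts

token_ids = torch.tensor([[5, 7, 0, 0], [3, 4, 6, 2]])
rnn_out = torch.ones(2, 4, 3)
rnn_out[0, 2:] = 100.0   # values at padded positions, which should be ignored
pooled = masked_mean(rnn_out, token_ids)
print(pooled)            # every entry is 1.0
```

Inside `forward`, this would replace `out.mean(dim=1)` and require keeping a reference to the original index tensor before it is overwritten by the embedding.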

## Training Loop

In [27]:
# Number of epochs
num_epochs = 4

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

# Model
model = RNN(seq_len=200, hidden_size=64, num_classes=2)
model.to(device)

# Optimizer
optimizer = optim.Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.999), weight_decay=1e-4)

# Loss Function
loss_fn = nn.CrossEntropyLoss()

for epoch in range(num_epochs):
  batch_loss = 0
  batch_accuracy = 0

  model.train()
  # Training loop
  for i, data in enumerate(train_dataloader):
    X_batch, y_batch = data
    X_batch = X_batch.to(device)
    y_batch = y_batch.to(device)

    # Zero gradients for each batch
    optimizer.zero_grad()

    # Predictions
    outputs = model(X_batch)

    y = torch.argmax(outputs, dim=1)

    # Calculate loss
    loss = loss_fn(outputs, y_batch)

    # Backpropagate loss
    loss.backward()

    # Clip gradients once they exist and before the update; a single call made
    # before training starts has no gradients to act on
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=5, norm_type=2)

    # Adjust learning weights
    optimizer.step()

    batch_loss += loss.item()
    batch_accuracy += torch.sum(y == y_batch).item()

  val_batch_loss = 0
  val_batch_accuracy = 0

  model.eval()
  with torch.no_grad():
    for i, data in enumerate(validation_dataloader):
      X_val_batch, y_val_batch = data
      X_val_batch = X_val_batch.to(device)
      y_val_batch = y_val_batch.to(device)

      # Predictions
      outputs = model(X_val_batch)
      preds = torch.argmax(outputs, dim=-1)

      # Calculate loss
      loss = loss_fn(outputs, y_val_batch)

      val_batch_loss += loss.item()
      val_batch_accuracy += torch.sum(preds == y_val_batch).item()


  train_epoch_loss = batch_loss / len(train_dataloader)
  train_epoch_accuracy = batch_accuracy / len(train_dataloader.dataset)
  val_epoch_loss = val_batch_loss / len(validation_dataloader)
  val_epoch_accuracy = val_batch_accuracy / len(validation_dataloader.dataset)

  print("The epoch {} train loss: {:.2f}".format(epoch, train_epoch_loss))
  print("The epoch {} train accuracy: {:.2f}".format(epoch, train_epoch_accuracy*100))
  print("The epoch {} val loss: {:.2f}".format(epoch, val_epoch_loss))
  print("The epoch {} val accuracy: {:.2f}".format(epoch, val_epoch_accuracy*100))

cuda




The epoch 0 train loss: 0.65
The epoch 0 train accuracy: 60.67
The epoch 0 val loss: 0.58
The epoch 0 val accuracy: 69.56
The epoch 1 train loss: 0.51
The epoch 1 train accuracy: 76.71
The epoch 1 val loss: 0.45
The epoch 1 val accuracy: 79.39
The epoch 2 train loss: 0.39
The epoch 2 train accuracy: 84.08
The epoch 2 val loss: 0.39
The epoch 2 val accuracy: 83.51
The epoch 3 train loss: 0.32
The epoch 3 train accuracy: 87.33
The epoch 3 val loss: 0.34
The epoch 3 val accuracy: 85.70


## Evaluation

In [33]:
# Evaluate model on test set
X_test = X_test.to(device)
y_test = y_test.to(device)

model.eval()
with torch.no_grad():  # gradients are not needed for evaluation
  outputs = model(X_test)

  # Get predictions
  preds = torch.argmax(outputs, dim=-1)

  # Calculate loss
  test_loss = loss_fn(outputs, y_test).item()

# Get accuracy
test_accuracy = torch.sum(preds == y_test).item() / len(X_test)

print("The test accuracy is {:.2f}.".format(test_accuracy*100))
print("The test loss is {:.2f}.".format(test_loss))

The test accuracy is 86.13.
The test loss is 0.34.
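
Accuracy alone does not show whether errors fall mostly on positive or negative reviews. As a sketch of the remaining "Evaluation and metrics" step, the block below derives confusion-matrix counts, precision, recall, and F1 for the positive class from toy predictions. The `binary_metrics` helper and the toy arrays are illustrative and are not results from the model above.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Confusion-matrix counts and derived metrics, treating class 1 as positive."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "accuracy": (tp + tn) / len(y_true),
            "precision": precision, "recall": recall, "f1": f1}

toy_true = [1, 0, 1, 1, 0, 0]   # made-up labels
toy_pred = [1, 0, 0, 1, 1, 0]   # made-up predictions
metrics = binary_metrics(toy_true, toy_pred)
print(metrics)
```

With the notebook's tensors, the inputs would presumably be `y_test.cpu().numpy()` and `preds.cpu().numpy()`.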


In [28]:
p