# Recurrent Neural Network (RNN) on IMDB Dataset for Sentiment Classification

This project implements a Recurrent Neural Network (RNN) for sentiment classification on the IMDB dataset. It has two parallel goals:
1. Understanding the NLP preprocessing pipeline.
2. Modelling sequential data with RNNs and the algorithms behind them.

To do:
- <u>Load the data</u>
- <u>Exploratory Data Analysis</u>
- <u>NLP Preprocessing Pipeline</u>
  - <u>Tokenization</u>
  - <u>Lowercasing</u>
  - <u>Stop word removal</u>
  - <u>Remove digits/punctuation</u>
- Create vocab and map tokens to vocab indices
- Split the data into training, validation, and test data
- Create the model in PyTorch
- Construct the training and validation loops
- Evaluation and metrics
- Implement the model from scratch in NumPy
  - Learned Embedding Layer
  - Forward Propagation
  - Backpropagation Through Time
- Compare implemented model vs. torch version.

## Manual Implementation

## Loading the data

In [97]:
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from torch.utils.data import DataLoader
from torch import optim
import os
import re
import nltk
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from collections import Counter

In [98]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("lakshmi25npathi/imdb-dataset-of-50k-movie-reviews")

Using Colab cache for faster access to the 'imdb-dataset-of-50k-movie-reviews' dataset.


In [99]:
imdb_reviews = pd.read_csv(path + '/IMDB Dataset.csv')

## Exploratory Data Analysis

In [100]:
imdb_reviews.head()

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


In [101]:
len(imdb_reviews)

50000

The dataset consists of 50,000 movie reviews, categorized as either positive or negative.
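
Before preprocessing, it can help to check how the two classes are distributed and how long the reviews are. The sketch below does this on a tiny hand-made frame; `toy_reviews` and its three rows are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for imdb_reviews; these three rows are invented for illustration.
toy_reviews = pd.DataFrame({
    "review": ["A wonderful little film", "Dull and far too long", "Great cast, great story"],
    "sentiment": ["positive", "negative", "positive"],
})

# Number of reviews per class; a balanced dataset shows roughly equal counts.
class_counts = toy_reviews["sentiment"].value_counts()

# Whitespace-separated word count per review, useful when choosing a sequence length.
review_lengths = toy_reviews["review"].str.split().str.len()

print(class_counts)
print(review_lengths.describe())
```

Running the same two calls on `imdb_reviews` should show whether the classes are balanced and give a rough idea of a sensible padding length.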

## NLP Preprocessing Pipeline
- <u>Tokenization: segmenting text into a list of tokens (representations of words)</u>
- <u>Lowercasing</u>
- <u>Stop word removal</u>
- <u>Remove digits/punctuation</u>

Then create vocabulary and map tokens to vocab indices.

In [102]:
reviews = imdb_reviews['review']
reviews[0]

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa

In [103]:
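# Word tokenization, lowercasing, and removal of digits, punctuation, and HTML <br /> tags.
# A raw string (r"...") keeps "\s" from being treated as an invalid string escape.
# Note: matches are replaced with an empty string, so words separated only by a tag or
# by punctuation are merged (e.g. 'methe' and 'wordit' in the output below).
reviews = [re.sub(r"<?br\s*/?>|[0-9.,;:~<>@?]+", '', review.lower()).split(' ') for review in reviews]
print(reviews[0])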


['one', 'of', 'the', 'other', 'reviewers', 'has', 'mentioned', 'that', 'after', 'watching', 'just', '', 'oz', 'episode', "you'll", 'be', 'hooked', 'they', 'are', 'right', 'as', 'this', 'is', 'exactly', 'what', 'happened', 'with', 'methe', 'first', 'thing', 'that', 'struck', 'me', 'about', 'oz', 'was', 'its', 'brutality', 'and', 'unflinching', 'scenes', 'of', 'violence', 'which', 'set', 'in', 'right', 'from', 'the', 'word', 'go', 'trust', 'me', 'this', 'is', 'not', 'a', 'show', 'for', 'the', 'faint', 'hearted', 'or', 'timid', 'this', 'show', 'pulls', 'no', 'punches', 'with', 'regards', 'to', 'drugs', 'sex', 'or', 'violence', 'its', 'is', 'hardcore', 'in', 'the', 'classic', 'use', 'of', 'the', 'wordit', 'is', 'called', 'oz', 'as', 'that', 'is', 'the', 'nickname', 'given', 'to', 'the', 'oswald', 'maximum', 'security', 'state', 'penitentary', 'it', 'focuses', 'mainly', 'on', 'emerald', 'city', 'an', 'experimental', 'section', 'of', 'the', 'prison', 'where', 'all', 'the', 'cells', 'have', '
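
The token list above contains merged words such as `'methe'` and `'wordit'`, because the matched tags and punctuation are replaced with an empty string rather than a space. A minimal sketch of one way to avoid this follows. The helper `tokenize` and the choice to keep only lowercase letters and apostrophes are assumptions for illustration, not the cleaning rule used in this notebook (for instance, it also drops characters like `-` that the original pattern keeps).

```python
import re

# Matches <br>, <br/> and <br /> tags.
_TAG = re.compile(r"<br\s*/?>")
# Matches any run of characters that are not lowercase letters, apostrophes, or whitespace.
_NOISE = re.compile(r"[^a-z'\s]+")

def tokenize(text):
    """Lowercase, replace tags and non-letter runs with a space, split on whitespace."""
    text = _TAG.sub(" ", text.lower())
    # str.split() with no argument collapses repeated whitespace, so no empty tokens remain.
    return _NOISE.sub(" ", text).split()

sample = "They are right, as this is exactly what happened with me.<br /><br />The first thing"
print(tokenize(sample))
```

Tags are removed in a separate pass first; otherwise the punctuation pattern could consume the `<` of a tag that directly follows a period and leave `br` behind as a token.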

In [104]:
# Stop word removal
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Ten arbitrary stop words (a set is unordered, so these are not the ten most common)
list(stop_words)[:10]

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['what',
 'these',
 'if',
 'during',
 "he'll",
 'doing',
 'some',
 'too',
 "weren't",
 "i'll"]

In [105]:
# Remove stop words and empty strings
cleaned_reviews = [[word for word in review if word not in stop_words and word != ''] for review in reviews]
cleaned_reviews[0]

['one',
 'reviewers',
 'mentioned',
 'watching',
 'oz',
 'episode',
 'hooked',
 'right',
 'exactly',
 'happened',
 'methe',
 'first',
 'thing',
 'struck',
 'oz',
 'brutality',
 'unflinching',
 'scenes',
 'violence',
 'set',
 'right',
 'word',
 'go',
 'trust',
 'show',
 'faint',
 'hearted',
 'timid',
 'show',
 'pulls',
 'punches',
 'regards',
 'drugs',
 'sex',
 'violence',
 'hardcore',
 'classic',
 'use',
 'wordit',
 'called',
 'oz',
 'nickname',
 'given',
 'oswald',
 'maximum',
 'security',
 'state',
 'penitentary',
 'focuses',
 'mainly',
 'emerald',
 'city',
 'experimental',
 'section',
 'prison',
 'cells',
 'glass',
 'fronts',
 'face',
 'inwards',
 'privacy',
 'high',
 'agenda',
 'em',
 'city',
 'home',
 'manyaryans',
 'muslims',
 'gangstas',
 'latinos',
 'christians',
 'italians',
 'irish',
 'moreso',
 'scuffles',
 'death',
 'stares',
 'dodgy',
 'dealings',
 'shady',
 'agreements',
 'never',
 'far',
 'awayi',
 'would',
 'say',
 'main',
 'appeal',
 'show',
 'due',
 'fact',
 'goes',
 

In [106]:
print(cleaned_reviews[0])

['one', 'reviewers', 'mentioned', 'watching', 'oz', 'episode', 'hooked', 'right', 'exactly', 'happened', 'methe', 'first', 'thing', 'struck', 'oz', 'brutality', 'unflinching', 'scenes', 'violence', 'set', 'right', 'word', 'go', 'trust', 'show', 'faint', 'hearted', 'timid', 'show', 'pulls', 'punches', 'regards', 'drugs', 'sex', 'violence', 'hardcore', 'classic', 'use', 'wordit', 'called', 'oz', 'nickname', 'given', 'oswald', 'maximum', 'security', 'state', 'penitentary', 'focuses', 'mainly', 'emerald', 'city', 'experimental', 'section', 'prison', 'cells', 'glass', 'fronts', 'face', 'inwards', 'privacy', 'high', 'agenda', 'em', 'city', 'home', 'manyaryans', 'muslims', 'gangstas', 'latinos', 'christians', 'italians', 'irish', 'moreso', 'scuffles', 'death', 'stares', 'dodgy', 'dealings', 'shady', 'agreements', 'never', 'far', 'awayi', 'would', 'say', 'main', 'appeal', 'show', 'due', 'fact', 'goes', 'shows', 'dare', 'forget', 'pretty', 'pictures', 'painted', 'mainstream', 'audiences', 'forg

In [107]:
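def one_hot(Y, num_classes):
  # Encode labels as class indices: 'positive' -> 1, anything else -> 0
  class_indices = np.array([1 if y == 'positive' else 0 for y in Y])
  encoded = np.zeros(shape=(class_indices.shape[0], num_classes))
  # For each row, set the column given by that row's class index to 1
  row_indices = np.arange(0, class_indices.shape[0])
  encoded[row_indices, class_indices] = 1
  return encoded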

In [108]:
# One-hot labels
num_classes = 2
imdb_sentiments = imdb_reviews['sentiment']
imdb_sentiments = one_hot(imdb_sentiments.values, num_classes)
imdb_sentiments[:10]

array([[0., 1.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [0., 1.]])

In [109]:
imdb_reviews['review'] = cleaned_reviews

In [110]:
print(cleaned_reviews[:2])

[['one', 'reviewers', 'mentioned', 'watching', 'oz', 'episode', 'hooked', 'right', 'exactly', 'happened', 'methe', 'first', 'thing', 'struck', 'oz', 'brutality', 'unflinching', 'scenes', 'violence', 'set', 'right', 'word', 'go', 'trust', 'show', 'faint', 'hearted', 'timid', 'show', 'pulls', 'punches', 'regards', 'drugs', 'sex', 'violence', 'hardcore', 'classic', 'use', 'wordit', 'called', 'oz', 'nickname', 'given', 'oswald', 'maximum', 'security', 'state', 'penitentary', 'focuses', 'mainly', 'emerald', 'city', 'experimental', 'section', 'prison', 'cells', 'glass', 'fronts', 'face', 'inwards', 'privacy', 'high', 'agenda', 'em', 'city', 'home', 'manyaryans', 'muslims', 'gangstas', 'latinos', 'christians', 'italians', 'irish', 'moreso', 'scuffles', 'death', 'stares', 'dodgy', 'dealings', 'shady', 'agreements', 'never', 'far', 'awayi', 'would', 'say', 'main', 'appeal', 'show', 'due', 'fact', 'goes', 'shows', 'dare', 'forget', 'pretty', 'pictures', 'painted', 'mainstream', 'audiences', 'for

In [111]:
# Build a vocabulary that maps tokens to indices
def build_vocab(reviews, vocab_size=20000):
  # Special tokens: index 0 is reserved for padding, index 1 for out-of-vocabulary words
  padding = '<PAD>'
  unknown = '<UNK>'
  vocab = {padding: 0, unknown: 1}

  # Flatten all reviews into a single list of tokens
  flattened_list = [word for review in reviews for word in review]

  # Count how often each token occurs
  counted_words = Counter(flattened_list) # 303150 in total

  # Keep only the vocab_size most frequent tokens
  most_common_words = counted_words.most_common(n=vocab_size)

  # Regular tokens start at index 2, after the two special tokens
  for i, (word, _count) in enumerate(most_common_words, start=2):
    vocab[word] = i

  return vocab

vocab = build_vocab(cleaned_reviews, vocab_size=20000)
vocab


{'<PAD>': 0,
 '<UNK>': 1,
 'movie': 2,
 'film': 3,
 'one': 4,
 'like': 5,
 'good': 6,
 'would': 7,
 'even': 8,
 'really': 9,
 'time': 10,
 'see': 11,
 'story': 12,
 '-': 13,
 'much': 14,
 'get': 15,
 'well': 16,
 'great': 17,
 'also': 18,
 'people': 19,
 'bad': 20,
 'first': 21,
 'make': 22,
 'made': 23,
 'could': 24,
 'way': 25,
 'movies': 26,
 'think': 27,
 'characters': 28,
 'watch': 29,
 'many': 30,
 'films': 31,
 'seen': 32,
 'two': 33,
 'never': 34,
 'character': 35,
 'acting': 36,
 'little': 37,
 'know': 38,
 'love': 39,
 'plot': 40,
 'best': 41,
 'show': 42,
 'ever': 43,
 'life': 44,
 'better': 45,
 'still': 46,
 'say': 47,
 'scene': 48,
 'end': 49,
 'scenes': 50,
 'something': 51,
 'man': 52,
 'go': 53,
 'back': 54,
 'watching': 55,
 'real': 56,
 'thing': 57,
 'actors': 58,
 'years': 59,
 'makes': 60,
 'actually': 61,
 'find': 62,
 'another': 63,
 'nothing': 64,
 'funny': 65,
 'going': 66,
 'lot': 67,
 'look': 68,
 'work': 69,
 'though': 70,
 'every': 71,
 'new': 72,
 'old': 7

In [112]:
# Convert reviews into vocab indices (in place); out-of-vocabulary tokens map to <UNK>
unk_index = vocab['<UNK>']
for review in cleaned_reviews:
  review[:] = [vocab.get(token, unk_index) for token in review]

print(cleaned_reviews[0])

[4, 1831, 919, 55, 3806, 282, 2983, 112, 456, 453, 8123, 21, 57, 2966, 3806, 5234, 15266, 50, 454, 173, 112, 549, 53, 1580, 42, 7912, 5483, 11295, 42, 2255, 5697, 5394, 1328, 273, 454, 3573, 248, 220, 1, 338, 3806, 11672, 222, 17202, 6742, 2395, 948, 1, 2384, 1244, 1, 433, 4517, 2348, 1069, 7075, 2873, 12730, 303, 1, 17678, 210, 4832, 9778, 433, 241, 1, 8531, 1, 15267, 5074, 8302, 2243, 1, 1, 229, 8735, 7128, 13202, 8303, 1, 34, 124, 1, 7, 47, 158, 1135, 42, 527, 87, 150, 156, 2883, 683, 79, 1158, 4095, 2380, 1047, 683, 1243, 683, 1, 841, 85, 21, 282, 43, 100, 2966, 1426, 2020, 47, 1381, 163, 1280, 1114, 3806, 86, 9779, 210, 1873, 1906, 454, 454, 7487, 1, 4763, 13696, 2713, 1, 6743, 13696, 382, 481, 15, 141, 16, 9394, 605, 674, 6743, 504, 1069, 1, 527, 419, 871, 1812, 1069, 1, 55, 3806, 94, 292, 3439, 2991, 1, 15, 1075, 3723, 377]
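
To make the vocabulary lookup easier to follow, here is a small self-contained sketch on an invented three-review corpus. It mirrors the logic of `build_vocab` but returns new index lists instead of overwriting the token lists; the `encode` helper and `toy_corpus` are hypothetical names introduced only for this example.

```python
from collections import Counter

def build_vocab(token_lists, vocab_size):
    """Map the vocab_size most frequent tokens to indices 2.., reserving 0 and 1."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    vocab = {"<PAD>": 0, "<UNK>": 1}
    for index, (token, _count) in enumerate(counts.most_common(vocab_size), start=2):
        vocab[token] = index
    return vocab

def encode(tokens, vocab):
    """Return a new list of indices; tokens missing from vocab map to <UNK>."""
    return [vocab.get(tok, vocab["<UNK>"]) for tok in tokens]

# Invented toy corpus: 'good' occurs 3 times, 'film' twice, the rest once.
toy_corpus = [["good", "film"], ["bad", "film"], ["good", "good", "story"]]
toy_vocab = build_vocab(toy_corpus, vocab_size=2)

print(toy_vocab)                                     # {'<PAD>': 0, '<UNK>': 1, 'good': 2, 'film': 3}
print(encode(["good", "film", "story"], toy_vocab))  # [2, 3, 1]
```

With `vocab_size=2`, only `good` and `film` get their own indices, so `story` falls back to index 1. The same mechanism produces the many `1` entries in the encoded review above.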


In [117]:
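# Pad or truncate every review to a fixed length (in place)
sequence_length = 200

for review in cleaned_reviews:
  if len(review) > sequence_length:
    # Truncate in place; `review = review[:200]` would only rebind the loop variable
    del review[sequence_length:]
  # Add padding
  else:
    amount_of_padding = sequence_length - len(review)
    padding = [vocab['<PAD>']] * amount_of_padding
    review.extend(padding)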

print(cleaned_reviews[0])


[4, 1831, 919, 55, 3806, 282, 2983, 112, 456, 453, 8123, 21, 57, 2966, 3806, 5234, 15266, 50, 454, 173, 112, 549, 53, 1580, 42, 7912, 5483, 11295, 42, 2255, 5697, 5394, 1328, 273, 454, 3573, 248, 220, 1, 338, 3806, 11672, 222, 17202, 6742, 2395, 948, 1, 2384, 1244, 1, 433, 4517, 2348, 1069, 7075, 2873, 12730, 303, 1, 17678, 210, 4832, 9778, 433, 241, 1, 8531, 1, 15267, 5074, 8302, 2243, 1, 1, 229, 8735, 7128, 13202, 8303, 1, 34, 124, 1, 7, 47, 158, 1135, 42, 527, 87, 150, 156, 2883, 683, 79, 1158, 4095, 2380, 1047, 683, 1243, 683, 1, 841, 85, 21, 282, 43, 100, 2966, 1426, 2020, 47, 1381, 163, 1280, 1114, 3806, 86, 9779, 210, 1873, 1906, 454, 454, 7487, 1, 4763, 13696, 2713, 1, 6743, 13696, 382, 481, 15, 141, 16, 9394, 605, 674, 6743, 504, 1069, 1, 527, 419, 871, 1812, 1069, 1, 55, 3806, 94, 292, 3439, 2991, 1, 15, 1075, 3723, 377, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
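
An alternative to editing the lists in place is to copy each sequence into a pre-filled array, which guarantees that every row has exactly the target length. The following is a minimal sketch; `pad_or_truncate` and the toy sequences are assumptions for illustration.

```python
import numpy as np

def pad_or_truncate(sequences, length, pad_index=0):
    """Return an int64 array of shape (len(sequences), length)."""
    # Start from an array filled with the padding index.
    batch = np.full((len(sequences), length), pad_index, dtype=np.int64)
    for row, seq in enumerate(sequences):
        kept = seq[:length]              # drop anything beyond the target length
        batch[row, :len(kept)] = kept    # shorter rows keep their trailing padding
    return batch

# Invented toy sequences: one short, one very short, one too long.
toy = [[4, 7, 9], [5], [8, 2, 6, 3, 1]]
padded = pad_or_truncate(toy, length=4)
print(padded)
```

A rectangular integer array like this can be converted to a tensor in one step, which is harder to do when rows differ in length.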


## Train/Validation/Test Split

In [121]:
split = 0.64
val_split = 0.16

In [123]:
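# Sequential split without shuffling: 64% training, 16% validation, 20% test
total_reviews = len(cleaned_reviews)
train_end = int(total_reviews*split)
val_end = int(total_reviews*(split+val_split))

training_data = cleaned_reviews[:train_end]
validation_data = cleaned_reviews[train_end:val_end]
test_data = cleaned_reviews[val_end:]

# TODO: check stratification (class balance in each split).
# The labels in imdb_sentiments still need to be split with the same indices.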

len(training_data), len(validation_data), len(test_data)

(32000, 8000, 10000)
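
The slices above split only the reviews, and the class balance of each part is not yet checked. One way to split features and labels together while preserving class proportions is scikit-learn's `train_test_split` with its `stratify` argument. The sketch below uses invented toy arrays (`X`, `y`) and a fixed `random_state`; it is an illustration, not the split used for the results in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins: 100 "padded reviews" of length 5 and balanced 0/1 labels (invented).
X = np.arange(500).reshape(100, 5)
y = np.array([0, 1] * 50)

# Hold out 20% for testing, then 20% of the remainder for validation.
# This reproduces the 64/16/20 proportions used above.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.2, stratify=y_rest, random_state=0)

print(len(X_train), len(X_val), len(X_test))
print(y_train.mean(), y_val.mean(), y_test.mean())
```

Because both calls receive the labels, each review stays paired with its sentiment, and the printed means show the share of positive labels in each split.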

In [None]:

#X_train, X_test, y_train, y_test = train_test_split(imdb_reviews['review'], imdb_reviews['sentiment'], train_size=0.8, shuffle=True)
#X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, train_size=0.8, shuffle=True)

## DataLoader

In [None]:
#train_dataloader = DataLoader(training_data, batch_size=64, shuffle=True)
#validation_dataloader = DataLoader(validation_data, batch_size=64, shuffle=True)
#test_dataloader = DataLoader(test_data, batch_size=64, shuffle=True)
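
The commented-out loaders above would receive only lists of token indices, without labels. With PyTorch's default collate behaviour, a batch of plain Python lists is typically combined position by position rather than into a single `(batch, sequence)` tensor. A common pattern is to wrap feature and label tensors in a `TensorDataset` first. The sketch below uses random toy tensors; `features`, `labels`, and the sizes are invented for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins: 10 padded reviews of length 6 and their class indices (invented).
features = torch.randint(low=0, high=100, size=(10, 6))
labels = torch.tensor([0, 1] * 5)

# TensorDataset pairs row i of features with element i of labels.
dataset = TensorDataset(features, labels)
loader = DataLoader(dataset, batch_size=4, shuffle=True)

for batch_features, batch_labels in loader:
    # Each batch is a (batch_size, sequence_length) tensor plus a (batch_size,) label tensor.
    print(batch_features.shape, batch_labels.shape)
```

Shuffling is usually only needed for the training loader; validation and test loaders can keep `shuffle=False` so that evaluation runs are reproducible.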

## Model