# Sentiment Classification Using RNNs

* Given the IMDB Movie Review Dataset, create an RNN model that predicts whether the given review is negative or positive.
* You need to create your Dataset, Dataloader and Model. Keep your code modular and avoid hardcoding parameters so that you can experiment more easily.
* Plot loss and accuracy for each epoch of the training loop. Try using wandb to log training and validation losses and accuracies.
* Use tqdm to track the progress of each training epoch.

### 1. RNN Model
#### 1.1 Build a Dataset from the IMDB Movie Review Dataset using only reviews with a word count between 100 and 500. Preprocess the text of the reviews and create a word-to-index mapping so that any review can be represented as a list of numbers.
#### 1.2 Create Dataloaders for the train, test and validation datasets with appropriate batch sizes.
#### 1.3 Create the Model class for the RNN Model. Create functions for running model training and testing.

In [None]:
!pip install datasets torchmetrics

In [None]:
import numpy as np
import matplotlib.pyplot as plt

from datasets import load_dataset
import re
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer

import torch
from torch import nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence
from torchmetrics import Accuracy

from tqdm import tqdm

In [None]:
SEED = 1234

# set seed for all possible random functions to ensure reproducibility
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)
torch.backends.cudnn.deterministic=True

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using {device} device")

In [None]:
# Load the IMDB review dataset. You can get it from the Hugging Face Hub
imdb_dataset = load_dataset("imdb")

In [None]:
# Split the train set into train and validation sets using an 80-20 split. Use the labels
# to ensure that the ratio of samples per label is maintained (a stratified split)
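
The stratified split described above can be sketched in plain Python, so that it runs without downloading anything. The helper name `stratified_split` and the toy labels are assumptions for illustration only. Hugging Face `datasets` also provides `Dataset.train_test_split` with a `stratify_by_column` argument, which may be simpler if your installed version supports it.

```python
import random
from collections import defaultdict

def stratified_split(labels, val_fraction=0.2, seed=1234):
    """Return (train_indices, val_indices), keeping each label's share in both parts."""
    rng = random.Random(seed)

    # Group example indices by their label
    by_label = defaultdict(list)
    for i, label in enumerate(labels):
        by_label[label].append(i)

    train_idx, val_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_val = int(round(len(indices) * val_fraction))
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Toy example: 10 negative (0) and 10 positive (1) labels
toy_labels = [0] * 10 + [1] * 10
train_idx, val_idx = stratified_split(toy_labels)
```

With the real data, the returned index lists could be passed to `imdb_dataset["train"].select(...)` to build the two subsets.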

In [None]:
def clean(text, tokenizer):
  # Perform text preprocessing:
  # 1. Remove numbers OR replace them with a "num" token
  # 2. Convert all characters to lowercase.
  # 3. Tokenize the sentence into words
  # You can use RegexpTokenizer from NLTK.

  # You will experiment with stemming/lemmatization down the line
  # so you can skip that for now

  return text
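
One possible shape for the preprocessing steps listed above, as a minimal sketch. `clean_sketch` and `SimpleWordTokenizer` are hypothetical names: the small tokenizer class is only a stand-in with the same `tokenize` method as NLTK's `RegexpTokenizer`, so the snippet runs without NLTK. Stripping `<br />` tags is an extra, optional step that the task description does not ask for.

```python
import re

# Stand-in exposing the same .tokenize() method as NLTK's RegexpTokenizer;
# in the notebook you would pass RegexpTokenizer(r"\w+") instead.
class SimpleWordTokenizer:
    def tokenize(self, text):
        return re.findall(r"\w+", text)

def clean_sketch(text, tokenizer):
    text = re.sub(r"<br\s*/?>", " ", text)  # optional: drop HTML line breaks found in the reviews
    text = re.sub(r"\d+", " num ", text)    # replace every run of digits with a "num" token
    text = text.lower()                     # lowercase everything
    return tokenizer.tokenize(text)         # split into word tokens

tokens = clean_sketch("This IS 1 example sentence", SimpleWordTokenizer())
# tokens == ['this', 'is', 'num', 'example', 'sentence']
```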

In [None]:
clean("This IS 1 example sentence", RegexpTokenizer(r'\w+'))

In [None]:
# Create a word-to-index dictionary so that each word in the training set
# has a number associated with it. This lets us represent each sentence
# as a series of numbers. Start the index at 1 instead of 0. The number
# 0 will be used to denote padding, so that each sentence can have the
# same length.
# Keep track of the index, since it will be used to represent new words
# that were not part of the training vocabulary.
# Also, make sure not to add words from sentences whose word count
# is outside the allowed range.

tokenizer = RegexpTokenizer(r'\w+')

def get_word2idx(corpus):
  idx = 1
  word2idx = {}
  for sentence in tqdm(corpus, total=len(corpus), desc="Creating word2idx"):
    # process sentence
    sentence = clean(sentence, tokenizer)

    # drop sentences greater than maxlen or less than minlen

    # for each word in sentence, check for entry in word2idx

  return idx, word2idx
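
The vocabulary-building logic described in the comments above could look roughly like this. It is a sketch on already tokenized toy sentences; `build_word2idx` and the tiny `minlen`/`maxlen` values are assumptions chosen so the example stays readable.

```python
def build_word2idx(tokenized_corpus, minlen=100, maxlen=500):
    word2idx = {}
    idx = 1  # 0 is reserved for padding
    for tokens in tokenized_corpus:
        # Skip sentences whose word count is outside [minlen, maxlen]
        if not (minlen <= len(tokens) <= maxlen):
            continue
        for word in tokens:
            if word not in word2idx:
                word2idx[word] = idx
                idx += 1
    # idx is now the first unused index, which can stand for unknown words
    return idx, word2idx

toy_corpus = [["good", "movie"], ["bad", "movie", "really", "bad"], ["ok"]]
unk_idx, word2idx = build_word2idx(toy_corpus, minlen=2, maxlen=4)
# word2idx == {'good': 1, 'movie': 2, 'bad': 3, 'really': 4}, unk_idx == 5
```

The one-word sentence is dropped by the length filter, so "ok" never enters the vocabulary.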

In [None]:
# Build a Dataset object to store each sentence as a tensor of numbers
# along with the label. Make sure to add padding so that the tensor
# for each sentence is of the same length. This will allow us to train
# the model in batches.

class IMDBDataset(Dataset):
  def __init__(self, dataset, split : str, minlen : int = 100, maxlen : int = 500):
    self.count = 0 # total sentences you finally pick

    # count total number of lines (do not name this variable `len`,
    # which would shadow the built-in function)
    total = len(dataset[split])

    input_data = []
    target_data = []

    for idx, sentence in tqdm(enumerate(dataset[split]["text"]), total=total, desc=f"Transforming input text [{split}]"):
      # process sentence

      # drop sentences greater than maxlen or less than minlen

      # replace words with their index


      self.count += 1

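    # pad the sentences with 0s to a common length (pad_sequence pads up to
    # the longest sentence in the list, which is at most maxlen after filtering)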
    self.inputs = pad_sequence(input_data, batch_first = True)
    self.targets = torch.tensor(target_data)

  def __len__(self) -> int:
    return self.count

  def __getitem__(self, index : int):
    return self.inputs[index], self.targets[index]
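
To make the "replace words with their index" and padding steps concrete, here is a small sketch on toy data. The `encode` helper, the toy vocabulary and `unk_idx` are assumptions; only `pad_sequence` comes from the notebook's imports.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def encode(tokens, word2idx, unk_idx):
    # Words missing from the training vocabulary map to the unknown-word index
    return torch.tensor([word2idx.get(w, unk_idx) for w in tokens], dtype=torch.long)

word2idx = {"good": 1, "movie": 2, "bad": 3}
unk_idx = 4
reviews = [["good", "movie"], ["bad", "bad", "movie", "sequel"]]

encoded = [encode(r, word2idx, unk_idx) for r in reviews]
# Pads with 0 on the right, up to the longest sequence in the list
padded = pad_sequence(encoded, batch_first=True, padding_value=0)
# padded.tolist() == [[1, 2, 0, 0], [3, 3, 2, 4]]
```

Note that `pad_sequence` does not pad to a fixed `maxlen`. If the train, validation and test tensors must share one width, that has to be handled separately.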

In [None]:
# create the train dataset using the word2idx dictionary built using the train set
train_ds = IMDBDataset(imdb_dataset, "train", minlen = 100, maxlen = 500)
# create the validation and test dataset using the word2idx dictionary built using the train set



In [None]:
len(train_ds), len(val_ds), len(test_ds)

In [None]:
# create dataloaders using the dataset
params = {
    'batch_size':32,
    'shuffle': True,
    'num_workers': 2
}

train_dataloader = DataLoader(train_ds, **params)
val_dataloader = DataLoader(val_ds, **params)
test_dataloader = DataLoader(test_ds, **params)

In [None]:
# create a model
class RNNModel(nn.Module):
  def __init__(self, vocab_size, hidden_size, embedding_dim, num_classes):
    # call the init method of the parent
    super().__init__()

    # define the layers


  def forward(self, X):

    # run forward pass through the model

    return logits
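
A minimal sketch of what such a model could look like, assuming an embedding layer, a single `nn.RNN` layer and a linear classifier. The class name `RNNSketch` and the small sizes in the usage lines are illustrative choices, not the required design.

```python
import torch
from torch import nn

class RNNSketch(nn.Module):
    def __init__(self, vocab_size, hidden_size, embedding_dim, num_classes):
        super().__init__()
        # padding_idx=0 keeps the padding embedding fixed at zero
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.rnn = nn.RNN(embedding_dim, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, X):
        embedded = self.embedding(X)       # (batch, seq_len, embedding_dim)
        outputs, h_n = self.rnn(embedded)  # outputs: (batch, seq_len, hidden), h_n: (1, batch, hidden)
        logits = self.fc(h_n[-1])          # classify from the final hidden state
        return logits

model = RNNSketch(vocab_size=50, hidden_size=16, embedding_dim=8, num_classes=2)
batch = torch.randint(1, 50, (4, 12))      # 4 fake reviews of 12 token indices each
logits = model(batch)                      # shape (4, 2)
```

Because the sequences are not packed here, the final hidden state is computed after any trailing padding positions as well.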

In [None]:
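# Hyperparameters
hidden_size = 256
embedding_dim = 128
learning_rate = 1e-3
epochs = 5
num_classes = 2 # negative / positive

# TODO: set vocab_size so that it covers the padding index (0), every word
# in word2idx, and the index reserved for unknown words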

# create the model
model = RNNModel(vocab_size, hidden_size, embedding_dim, num_classes).to(device)

# create optimizer

print(model)

In [None]:
# Create a model training loop
def train_model():
  train_losses, val_losses, val_accuracy = [], [], []

  for epoch in range(epochs):
    ## TRAINING STEP
    model.train()
    # train
    for input_batch, output_batch in tqdm(train_dataloader, total = len(train_dataloader), desc = "Training"):
      pass # TODO: forward pass, loss, backward pass, optimizer step

    # Log metrics

    ## VALIDATION STEP
    model.eval()
    # run validation
    for input_batch, output_batch in tqdm(val_dataloader, total = len(val_dataloader), desc = "Validation"):
      pass # TODO: forward pass and metric computation (no gradient updates)

    # Log metrics

    # store best model

  return train_losses, val_losses, val_accuracy
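
The training step inside the loop above usually follows the pattern below. This is a sketch on a toy model and random data; `train_one_epoch`, the toy model and all sizes are assumptions, and tqdm and logging are left out to keep it short.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def train_one_epoch(model, loader, loss_fn, optimizer, device="cpu"):
    model.train()
    total_loss, total_items = 0.0, 0
    for inputs, targets in loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()                   # clear gradients from the previous batch
        loss = loss_fn(model(inputs), targets)  # forward pass and loss
        loss.backward()                         # backward pass
        optimizer.step()                        # parameter update
        total_loss += loss.item() * inputs.size(0)
        total_items += inputs.size(0)
    return total_loss / total_items             # mean loss per example

# Toy setup: random token indices and random binary labels
torch.manual_seed(0)
toy_model = nn.Sequential(nn.Embedding(20, 8, padding_idx=0), nn.Flatten(), nn.Linear(8 * 6, 2))
toy_data = TensorDataset(torch.randint(1, 20, (32, 6)), torch.randint(0, 2, (32,)))
toy_loader = DataLoader(toy_data, batch_size=8, shuffle=True)
optimizer = torch.optim.Adam(toy_model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

epoch_losses = [train_one_epoch(toy_model, toy_loader, loss_fn, optimizer) for _ in range(3)]
```

Weighting each batch loss by its batch size gives a per-example average even when the last batch is smaller.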

In [None]:
# Create a model testing loop
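
A testing loop can be sketched as a single evaluation function that is reused for both the validation and the test set. The name `evaluate`, the toy model and the random data are assumptions. Accuracy is computed by hand here, although `torchmetrics.Accuracy` from the imports could be used instead.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

@torch.no_grad()  # no gradients are needed during evaluation
def evaluate(model, loader, loss_fn, device="cpu"):
    model.eval()
    total_loss, correct, total = 0.0, 0, 0
    for inputs, targets in loader:
        inputs, targets = inputs.to(device), targets.to(device)
        logits = model(inputs)
        total_loss += loss_fn(logits, targets).item() * inputs.size(0)
        correct += (logits.argmax(dim=1) == targets).sum().item()
        total += inputs.size(0)
    return total_loss / total, correct / total  # mean loss, accuracy in [0, 1]

# Toy setup: an untrained model on random data
torch.manual_seed(0)
toy_model = nn.Sequential(nn.Embedding(20, 8, padding_idx=0), nn.Flatten(), nn.Linear(8 * 6, 2))
toy_data = TensorDataset(torch.randint(1, 20, (16, 6)), torch.randint(0, 2, (16,)))
toy_loader = DataLoader(toy_data, batch_size=4)

test_loss, test_acc = evaluate(toy_model, toy_loader, nn.CrossEntropyLoss())
```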


In [None]:
# train the model
train_losses, val_losses, val_accuracy = train_model()

In [None]:
# plot training and validation losses

In [None]:
# plot validation accuracy
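
One way to draw both plots from the lists returned by `train_model`, as a sketch. `plot_history` is a hypothetical helper, and the numbers passed to it below are placeholders that only demonstrate the call. They are not real results.

```python
import matplotlib.pyplot as plt

def plot_history(train_losses, val_losses, val_accuracy):
    epochs_axis = range(1, len(train_losses) + 1)
    fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(10, 4))

    # Left panel: training and validation loss per epoch
    ax_loss.plot(epochs_axis, train_losses, label="train loss")
    ax_loss.plot(epochs_axis, val_losses, label="validation loss")
    ax_loss.set_xlabel("epoch")
    ax_loss.set_ylabel("loss")
    ax_loss.legend()

    # Right panel: validation accuracy per epoch
    ax_acc.plot(epochs_axis, val_accuracy, label="validation accuracy")
    ax_acc.set_xlabel("epoch")
    ax_acc.set_ylabel("accuracy")
    ax_acc.legend()

    fig.tight_layout()
    return fig

# Placeholder numbers only, to show the call
fig = plot_history([0.70, 0.62, 0.55], [0.69, 0.64, 0.60], [0.52, 0.61, 0.66])
```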

In [None]:
# find the classification accuracy on test set


#### 1.4 Incorporate stemming/lemmatization into the text preprocessing using the NLTK library. What changes do you observe in accuracy?
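
As a starting point, stemming can be applied to the token list produced by the preprocessing step. The sketch below uses NLTK's `PorterStemmer`, which needs no extra downloads, and `stem_tokens` is a hypothetical helper name. The `WordNetLemmatizer` imported earlier works similarly through its `lemmatize` method, but it requires the WordNet corpus (`nltk.download("wordnet")`).

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_tokens(tokens):
    # Reduce each token to its stem so that inflected forms share one vocabulary entry
    return [stemmer.stem(t) for t in tokens]

stemmed = stem_tokens(["movies", "running", "watched"])
# stemmed == ['movi', 'run', 'watch']
```

Stems such as "movi" are not always real words. That is acceptable here because they only serve as keys in the word-to-index mapping.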

#### 1.5 In the Model class, experiment with using only the last output versus the mean of all outputs of the RNN layer. What changes do you observe?
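
The two pooling options can be sketched as follows. The example assumes that padding uses index 0 and sits at the end of each sequence, as produced by `pad_sequence`, and it ignores padded positions in both variants. The layer sizes and the toy batch are illustrative.

```python
import torch
from torch import nn

torch.manual_seed(0)
embedding = nn.Embedding(20, 8, padding_idx=0)
rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)

# Two toy sequences; the first has 3 real tokens followed by 2 padding zeros
X = torch.tensor([[3, 5, 7, 0, 0],
                  [2, 4, 6, 8, 9]])
outputs, _ = rnn(embedding(X))                 # (batch, seq_len, hidden)

mask = (X != 0)                                # True at real tokens
lengths = mask.sum(dim=1)                      # number of real tokens per sequence

# Option A: output at the last real token of each sequence
batch_idx = torch.arange(X.size(0))
last_output = outputs[batch_idx, lengths - 1]  # (batch, hidden)

# Option B: mean over real tokens only
mask_f = mask.unsqueeze(-1).float()            # (batch, seq_len, 1)
mean_output = (outputs * mask_f).sum(dim=1) / lengths.unsqueeze(-1)
```

Taking `outputs[:, -1]` directly would instead read the output at a padding position for the shorter sequence, which is worth keeping in mind when comparing the two variants.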

### 2. Hyperparameter Tuning
#### 2.1 Starting from the best configuration found in the above experiments, try 5 different hyperparameter configurations. You can change the size of the embedding layer, the size of the hidden state, and the batch size in the dataloader.
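
A simple way to organize the five runs is to keep the configurations in a list and loop over them. The values below are arbitrary examples rather than recommendations, and `sweep` is a hypothetical helper that expects a function which trains a model for one configuration and returns a score such as the best validation accuracy.

```python
configs = [
    {"embedding_dim": 64,  "hidden_size": 128, "batch_size": 32},
    {"embedding_dim": 128, "hidden_size": 128, "batch_size": 32},
    {"embedding_dim": 128, "hidden_size": 256, "batch_size": 64},
    {"embedding_dim": 256, "hidden_size": 256, "batch_size": 64},
    {"embedding_dim": 256, "hidden_size": 512, "batch_size": 128},
]

def sweep(configs, run_fn):
    """Call run_fn(config) for each config and return (config, score) pairs, best first."""
    results = [(config, run_fn(config)) for config in configs]
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```

In the notebook, `run_fn` would rebuild the dataloaders and the model from the given dictionary before training, which is easier if no parameter is hardcoded.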


### 3. After RNNs
#### 3.1 Keeping all other parameters the same, replace the RNN layer with an LSTM layer using nn.LSTM. What changes do you observe? Explain why the LSTM layer would affect performance.
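
The swap itself is small, as the sketch below suggests. The main difference in code is that `nn.LSTM` returns the hidden state and the cell state together as a tuple. `LSTMSketch` and the sizes in the usage lines are illustrative assumptions.

```python
import torch
from torch import nn

class LSTMSketch(nn.Module):
    def __init__(self, vocab_size, hidden_size, embedding_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, X):
        embedded = self.embedding(X)
        # nn.LSTM returns (outputs, (h_n, c_n)); c_n is the extra cell state
        outputs, (h_n, c_n) = self.lstm(embedded)
        return self.fc(h_n[-1])            # classify from the final hidden state

model = LSTMSketch(vocab_size=50, hidden_size=16, embedding_dim=8, num_classes=2)
logits = model(torch.randint(1, 50, (4, 12)))  # shape (4, 2)
```

The gates and the separate cell state are designed to help information and gradients survive over long sequences, which is one plausible reason for a difference on reviews of 100 to 500 words.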