# IMDB Sentiment Analysis in PyTorch

The IMDB dataset contains movie reviews, each labeled with the reviewer's sentiment toward the movie (either positive or negative). Our task is to create and train a model that can derive the sentiment from the review text.

This dataset is often regarded as the `MNIST of sequence modelling`, because the task is considered relatively simple and well suited for beginners.

Below we import functions from the `torchtext` library. Similar to `torchvision`, `torchtext` is a specialized library and, as you can probably guess, it has a lot of utility functions and classes for working with text.

In [1]:
import pandas as pd
import kaggle

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset

# utilities to work with text
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import vocab

Similar to `torchvision`, `torchtext` has a lot of built-in datasets, including the IMDB dataset. The problem is that `torchtext` uses [TorchData](https://github.com/pytorch/data), a new PyTorch library for dealing with data loading. This library is still in beta and will undergo many changes. We will therefore download the data from Kaggle and create the classical `Dataset` and `DataLoader` objects. Once `TorchData` is stable, we will update our notebooks.

In [2]:
# download the data
!kaggle datasets download -p ../datasets -d lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

imdb-dataset-of-50k-movie-reviews.zip: Skipping, found more recently modified local copy (use --force to force download)


In [3]:
# unzip the data
!rm -rf ../datasets/imdb
!unzip -d ../datasets/imdb ../datasets/imdb-dataset-of-50k-movie-reviews.zip

Archive:  ../datasets/imdb-dataset-of-50k-movie-reviews.zip
  inflating: ../datasets/imdb/IMDB Dataset.csv  


We use pandas to read in the csv file.

In [4]:
df = pd.read_csv('../datasets/imdb/IMDB Dataset.csv')

There are just two columns. The `review` column contains the text of the review, which serves as our feature, while the `sentiment` column contains the binary class: positive or negative.

In [5]:
df.head()

                                              review sentiment
0  One of the other reviewers has mentioned that ...   positive
1  A wonderful little production. <br /><br />The...   positive
2  I thought this was a wonderful way to spend ti...   positive
3  Basically there's a family where a little boy ...   negative
4  Petter Mattei's "Love in the Time of Money" is...   positive


When we look at the reviews, we realize that the text contains HTML tags, as the text was probably scraped from the internet. While the data is not completely clean, it is still good enough for our purposes.
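If the leftover markup ever turned out to matter, one option would be to strip it before tokenization. The sketch below is not part of this notebook's pipeline: the helper name `strip_html_breaks` is hypothetical, the sample string is made up, and the pattern only covers the `<br />` variants, which are the only tags visible in the reviews shown here.

```python
import re

# Matches <br>, <br/> and <br /> in any letter case.
# Assumption: line-break tags are the only markup we want to remove.
BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)


def strip_html_breaks(text: str) -> str:
    """Replace HTML line-break tags with a space and collapse repeated whitespace."""
    without_tags = BR_TAG.sub(" ", text)
    return re.sub(r"\s+", " ", without_tags).strip()


# made-up sample in the style of the reviews
sample = "Slow start.<br /><br />Great ending."
print(strip_html_breaks(sample))
# Slow start. Great ending.
```

With a pandas DataFrame, such a helper could be applied via `df["review"].map(strip_html_breaks)` before the data is split.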

In [6]:
# example for positive review
df.review[0]

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa

In [7]:
# example for negative review
df.review[3]

"Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.<br /><br />OK, first of all when you're going to make a film you must Decide if its a thriller or a drama! As a drama the movie is watchable. Parents are divorcing & arguing like in real life. And then we have Jake with his closet which totally ruins all the film! I expected to see a BOOGEYMAN similar movie, and instead i watched a drama with some meaningless thriller spots.<br /><br />3 out of 10 just for the well playing parents & descent dialogs. As for the shots with Jake: just ignore them."

We split the data into a training set with 30,000 samples and validation and test sets with 10,000 samples each.

In [8]:
train_df = df[:30000]
val_df = df[30000:40000]
test_df = df[40000:]

The sentiment labels are distributed roughly equally within each of the three splits, so we don't need to apply stratified sampling.

In [9]:
train_df["sentiment"].value_counts()

positive    15015
negative    14985
Name: sentiment, dtype: int64

In [10]:
val_df["sentiment"].value_counts()

negative    5022
positive    4978
Name: sentiment, dtype: int64

In [11]:
test_df["sentiment"].value_counts()

positive    5007
negative    4993
Name: sentiment, dtype: int64

The `Dataset` object is relatively simple. We extract the reviews and the sentiments as numpy arrays and return the review-sentiment pair at the requested index.

In [12]:
class IMDBDataset(Dataset):
    def __init__(self, df):
        self.length = len(df)
        self.reviews = df["review"].to_numpy()
        self.sentiments = df["sentiment"].to_numpy()
    
    def __len__(self):
        return self.length
    
    def __getitem__(self, idx):
        return self.reviews[idx], self.sentiments[idx]

In [13]:
train_dataset = IMDBDataset(train_df)
val_dataset = IMDBDataset(val_df)
test_dataset = IMDBDataset(test_df)

Now it is time to utilize the `torchtext` functions. The `get_tokenizer(tokenizer)` function returns a tokenizer object. When we provide `basic_english` as a parameter, the generated tokenizer lowercases the text, separates punctuation from words and splits the result on whitespace. In principle we could provide tokenizers from specialized libraries like [spaCy](https://spacy.io/), but we will keep it simple and stick to the `basic_english` tokenizer.

In [14]:
# create a tokenizer
tokenizer = get_tokenizer("basic_english")

In [15]:
# test tokenizer
tokenizer("How are you doing today?")

['how', 'are', 'you', 'doing', 'today', '?']
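To get a feeling for what such a tokenizer does, here is a rough approximation in plain Python. It is not the `torchtext` implementation: the function name `simple_tokenize` is hypothetical, and the real `basic_english` tokenizer applies its own normalization rules, so results can differ for apostrophes, quotes and other punctuation.

```python
import re
from typing import List


def simple_tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens and single punctuation marks."""
    # \w+ matches runs of letters, digits and underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text.lower())


print(simple_tokenize("How are you doing today?"))
# ['how', 'are', 'you', 'doing', 'today', '?']
```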

The `torchtext` library provides the `Vocab` class, which turns a token into its corresponding index. The `vocab` function can construct such a vocabulary for us if we provide an `OrderedDict` with tokens as keys and the number of occurrences of those tokens in our dataset as values.

We will use the `Counter` class to count the tokens.

In [16]:
# create a vocabulary
# vocab expects an OrderedDict that maps each token to its count
from collections import Counter, OrderedDict

In [17]:
counter = Counter()
for review, _ in train_dataset:
    counter.update(tokenizer(review))

Altogether we are faced with 112,119 distinct tokens. We will reduce that number in a separate step.

In [18]:
print(len(counter))

112119


Finally we sort the counts and transform those into an `OrderedDict`.

In [19]:
sorted_by_freq_tuples = sorted(counter.items(), key=lambda x: x[1], reverse=True)
ordered_dict = OrderedDict(sorted_by_freq_tuples)

For example, we can look up how often the token `the` occurs in the training set.

In [20]:
ordered_dict.get("the")

399138

We provide a couple of arguments to the `vocab` function. We add two special tokens for padding and unknown words and insert those at the beginning of the vocabulary, so that they have index 0 and 1 respectively. We set the default index to 1, which maps all unknown words to `<unk>`.

We also want to reduce our vocabulary and only keep those words that appear at least 5 times in the training dataset.

In [21]:
imdb_vocab = vocab(ordered_dict, min_freq=5, specials=["<pad>", "<unk>"], special_first=True)
imdb_vocab.set_default_index(1)

This reduces the number of tokens by more than a factor of 3.

In [22]:
# reduced number of tokens
len(imdb_vocab)

33031

The `Vocab` object takes a list of tokens and returns the corresponding indices.

In [23]:
imdb_vocab(["what", "are", "you", "doing"])

[54, 30, 25, 401]
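The logic behind such a vocabulary can be sketched in a few lines of plain Python. This is only an illustration of the idea (special tokens first, frequency threshold, fallback index for unknown tokens), not how `torchtext` implements it. The names `build_vocab` and `lookup` and the tiny corpus are made up for this example.

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"


def build_vocab(token_lists, min_freq=1):
    """Map tokens to indices: special tokens first, then tokens by descending frequency."""
    counter = Counter()
    for tokens in token_lists:
        counter.update(tokens)
    itos = [PAD, UNK]  # index 0 is padding, index 1 is the unknown token
    for token, count in counter.most_common():
        if count >= min_freq:
            itos.append(token)
    return {token: idx for idx, token in enumerate(itos)}


def lookup(stoi, tokens):
    """Return the indices of the tokens, falling back to the <unk> index for unseen ones."""
    return [stoi.get(token, stoi[UNK]) for token in tokens]


corpus = [["good", "movie"], ["bad", "movie"], ["good", "good", "plot"]]
stoi = build_vocab(corpus, min_freq=2)
print(stoi)
# {'<pad>': 0, '<unk>': 1, 'good': 2, 'movie': 3}
print(lookup(stoi, ["good", "plot", "movie"]))
# [2, 1, 3]
```

The token `plot` occurs only once, so it falls below the threshold and is mapped to the index of `<unk>`.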

Now it is time to talk about the collate function that can be provided to a `DataLoader`. This function takes the list of samples drawn from the dataset as input and generates a batch of tensors. Usually this is done automatically, but that is only possible when all samples are of equal length. The reviews that we deal with are of different lengths, so we have to provide a custom collate function. We transform the reviews into tokens and turn those into indices. In order to make the sequences equally long, we pad the shorter ones, which means we add values of 0 at the end of those sequences. We also return the length of each sequence in a tensor, because the lengths will become important during training.

In [24]:
# similar to the implementation in Sebastian Raschka's book: Machine Learning with PyTorch and Scikit-Learn.
def collate_fn(batch):
    token_ls, sentiment_ls, len_ls = [], [], []
    for review, sentiment in batch:
        tokens = [imdb_vocab[token] for token in tokenizer(review)]
        sentiment_idx = 1 if sentiment == 'positive' else 0
        token_ls.append(torch.tensor(tokens, dtype=torch.int64))
        len_ls.append(len(tokens))
        sentiment_ls.append(sentiment_idx)
    return nn.utils.rnn.pad_sequence(token_ls, batch_first=True), torch.tensor(sentiment_ls), torch.tensor(len_ls)

Let's see what we end up with, using a small batch.

In [25]:
dl = DataLoader(train_dataset, batch_size=4, shuffle=False, collate_fn=collate_fn)
next(iter(dl))

(tensor([[  35,    7,    2,  ..., 4164,  516,    3],
         [   6,  386,  126,  ...,    0,    0,    0],
         [  13,  204,   14,  ...,    0,    0,    0],
         [ 673,   46,    9,  ...,    0,    0,    0]]),
 tensor([1, 1, 1, 0]),
 tensor([374, 177, 187, 158]))
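The padding step itself is easy to reproduce in plain Python. The following sketch uses a hypothetical helper `pad_batch` and made-up index sequences. It pads on the right with zeros, which is the same layout that `pad_sequence` with `batch_first=True` produces as a tensor.

```python
def pad_batch(sequences, pad_value=0):
    """Right-pad every sequence to the length of the longest one in the batch."""
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    padded = [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths


# three made-up sequences of token indices with different lengths
batch = [[35, 7, 2, 516, 3], [6, 386], [13, 204, 14]]
padded, lengths = pad_batch(batch)

for row in padded:
    print(row)
# [35, 7, 2, 516, 3]
# [6, 386, 0, 0, 0]
# [13, 204, 14, 0, 0]
print(lengths)
# [5, 2, 3]
```

Because the padding value 0 is also the index of `<pad>` in our vocabulary, padded positions can later be recognized by their index alone.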

In [26]:
BATCH_SIZE=32
NUM_EMBEDDINGS=len(imdb_vocab)
EMBEDDING_DIM=10
LSTM_HIDDEN_SIZE=128

In [27]:
train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_fn)
val_dataloader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn)
test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn)

Below we implement our model. For the most part there are not many surprises. We turn the word indices into embeddings, use those embeddings as inputs into the 2-layer LSTM neural network and use a fully connected layer for the classification.

The `pack_padded_sequence` function is probably something new to you. Because we padded our sequences, every sequence in a batch becomes as long as the longest sequence in that batch. Without further care, the LSTM would keep processing padding values long after a short review has ended, which can lead to the vanishing gradients problem. Packing makes sure that the LSTM layers only traverse each sequence up to the padded values. For that we provide the lengths that we calculated in our collate function. In fact, if you remove `pack_padded_sequence` from the code below, you will notice that your model won't be able to improve. An alternative strategy would be to cut the sequences at a fixed length, let's say 300 words.

In [28]:
class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(num_embeddings=NUM_EMBEDDINGS, embedding_dim=EMBEDDING_DIM, padding_idx=0)
        self.lstm = nn.LSTM(input_size=EMBEDDING_DIM, hidden_size=LSTM_HIDDEN_SIZE, num_layers=2, batch_first=True)
        self.fc = nn.Linear(in_features=LSTM_HIDDEN_SIZE, out_features=1)
    
    def forward(self, x, sizes):
        x = self.embedding(x)
        x = nn.utils.rnn.pack_padded_sequence(
            x, sizes, enforce_sorted=False, batch_first=True
        )
        _, (h_n, _) = self.lstm(x)
        # we take the hidden values from the last (top) layer
        x = h_n[-1, ...]
        x = self.fc(x)
        
        return x
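The effect of packing on the final hidden state can be checked on a toy example. The sketch below uses a small single-layer LSTM with made-up dimensions and random inputs, not the model from this notebook. It compares the hidden state of a short sequence processed alone with the hidden state of the same sequence inside a padded batch, once with and once without packing.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

lstm = nn.LSTM(input_size=3, hidden_size=4, batch_first=True)

# two toy sequences of "embedded tokens" with true lengths 5 and 2
long_seq = torch.randn(5, 3)
short_seq = torch.randn(2, 3)

# pad the short sequence with zero vectors, as an embedding with padding_idx=0 would
batch = torch.zeros(2, 5, 3)
batch[0] = long_seq
batch[1, :2] = short_seq
lengths = torch.tensor([5, 2])

with torch.no_grad():
    # reference: the short sequence on its own, without any padding
    _, (h_ref, _) = lstm(short_seq.unsqueeze(0))

    # padded batch without packing: the LSTM also steps through the zeros
    _, (h_padded, _) = lstm(batch)

    # packed batch: the recurrence stops at the true length of each sequence
    packed = nn.utils.rnn.pack_padded_sequence(
        batch, lengths, batch_first=True, enforce_sorted=False
    )
    _, (h_packed, _) = lstm(packed)

matches_with_packing = torch.allclose(h_packed[-1, 1], h_ref[-1, 0], atol=1e-5)
matches_without_packing = torch.allclose(h_padded[-1, 1], h_ref[-1, 0], atol=1e-5)
print(matches_with_packing, matches_without_packing)
```

With packing, the hidden state of the short sequence should agree with the reference. Without packing, it is computed after three additional steps over padding and should therefore differ.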

The rest of the notebook relies on functions that we have used many times before.

In [29]:
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [30]:
def track_performance(dataloader, model, criterion):
    # switch to evaluation mode
    model.eval()
    num_samples = 0
    num_correct = 0
    loss_sum = 0
    
    # no need to calculate gradients
    with torch.inference_mode():
        for batch_idx, (features, labels, sizes) in enumerate(dataloader):
            features = features.to(DEVICE)
            labels = labels.to(DEVICE).view(-1, 1).float()
            logits = model(features, sizes)
            probs = torch.sigmoid(logits)
                        
            predictions = (probs > 0.5).float()
            num_correct += (predictions == labels).sum().item()
            
            loss = criterion(logits, labels)
            loss_sum += loss.cpu().item()
            num_samples += len(features)
    
    # we return the summed batch-mean losses divided by the number of samples
    # (roughly the mean loss divided by the batch size) and the accuracy
    return loss_sum/num_samples, num_correct/num_samples
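Note that `nn.BCEWithLogitsLoss` averages over the batch by default, so `loss_sum` in the function above adds up batch means. Dividing that sum by the number of samples yields a value that is roughly the per-sample loss divided by the batch size. That is fine for comparing epochs, but it is not a per-sample average. If you prefer the latter, you can weight each batch mean by its batch size. The helper `mean_loss_per_sample` and the numbers below are made up for illustration.

```python
def mean_loss_per_sample(batch_mean_losses, batch_sizes):
    """Combine per-batch mean losses into one mean over all samples."""
    total = sum(loss * size for loss, size in zip(batch_mean_losses, batch_sizes))
    return total / sum(batch_sizes)


# toy values: two full batches of 32 samples and a final batch of 16
batch_means = [0.70, 0.60, 0.50]
batch_sizes = [32, 32, 16]

weighted = mean_loss_per_sample(batch_means, batch_sizes)
# for comparison: sum of batch means divided by the sample count, as in the function above
scaled = sum(batch_means) / sum(batch_sizes)

print(f"per-sample mean: {weighted:.4f} | sum of batch means / samples: {scaled:.4f}")
# per-sample mean: 0.6200 | sum of batch means / samples: 0.0225
```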

In [31]:
def train(num_epochs, train_dataloader, val_dataloader, model, criterion, optimizer, scheduler=None):
    history = {"train_loss": [], "val_loss": [], "train_acc": [], "val_acc": []}
    
    model.to(DEVICE)
    
    for epoch in range(num_epochs):
        for batch_idx, (features, labels, sizes) in enumerate(train_dataloader):
            model.train()
            features = features.to(DEVICE)
            labels = labels.to(DEVICE).view(-1, 1).float()
            
            # Empty the gradients
            optimizer.zero_grad()
            
            # Forward Pass
            logits = model(features, sizes)
            
            # Calculate Loss
            loss = criterion(logits, labels)
            
            # Backward Pass
            loss.backward()
            
            # Gradient Descent
            optimizer.step()
            
        train_loss, train_acc = track_performance(train_dataloader, model, criterion)
        val_loss, val_acc = track_performance(val_dataloader, model, criterion)

        if scheduler:
            scheduler.step(val_acc)

        history["train_loss"].append(train_loss)
        history["val_loss"].append(val_loss)
        history["train_acc"].append(train_acc)
        history["val_acc"].append(val_acc)

        print(f'Epoch: {epoch+1:>2}/{num_epochs} | Train Loss: {train_loss:.5f} | Val Loss: {val_loss:.5f} | Train Acc: {train_acc:.3f} | Val Acc: {val_acc:.3f}')
    return history            


In [32]:
model = Model()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer,
                                                       factor=0.1,
                                                       mode='max',
                                                       patience=2,
                                                       verbose=True)

criterion = nn.BCEWithLogitsLoss()
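Our model returns raw logits, which is why we use a loss that combines the sigmoid and the binary cross-entropy in one step. The sketch below shows a common numerically stable way to compute this loss for a single logit in plain Python and compares it with the naive two-step version. It is meant as an illustration of the math and makes no claim about how PyTorch implements the loss internally. The function names are made up.

```python
import math


def sigmoid(logit):
    return 1.0 / (1.0 + math.exp(-logit))


def naive_bce(logit, target):
    """Binary cross-entropy computed from the probability (can overflow for extreme logits)."""
    p = sigmoid(logit)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def bce_with_logits(logit, target):
    """Binary cross-entropy computed directly from the logit in a stable form."""
    # log(1 + exp(x)) - x * y, rewritten so that exp never receives a large positive argument
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


for logit, target in [(2.0, 1.0), (-1.5, 1.0), (0.3, 0.0)]:
    print(f"{bce_with_logits(logit, target):.6f} {naive_bce(logit, target):.6f}")

# the stable version still works where the naive one would fail
print(bce_with_logits(-1000.0, 1.0))
# 1000.0
```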

In [33]:
history = train(10, train_dataloader, val_dataloader, model, criterion, optimizer, scheduler)

Epoch:  1/10 | Train Loss: 0.02334 | Val Loss: 0.02347 | Train Acc: 0.535 | Val Acc: 0.532
Epoch:  2/10 | Train Loss: 0.01936 | Val Loss: 0.01952 | Train Acc: 0.656 | Val Acc: 0.652
Epoch:  3/10 | Train Loss: 0.01368 | Val Loss: 0.01430 | Train Acc: 0.807 | Val Acc: 0.797
Epoch:  4/10 | Train Loss: 0.01046 | Val Loss: 0.01169 | Train Acc: 0.857 | Val Acc: 0.837
Epoch:  5/10 | Train Loss: 0.00931 | Val Loss: 0.01103 | Train Acc: 0.885 | Val Acc: 0.858
Epoch:  6/10 | Train Loss: 0.00786 | Val Loss: 0.00996 | Train Acc: 0.901 | Val Acc: 0.861
Epoch:  7/10 | Train Loss: 0.00613 | Val Loss: 0.00931 | Train Acc: 0.925 | Val Acc: 0.880
Epoch:  8/10 | Train Loss: 0.00511 | Val Loss: 0.00894 | Train Acc: 0.940 | Val Acc: 0.886
Epoch:  9/10 | Train Loss: 0.00426 | Val Loss: 0.00901 | Train Acc: 0.953 | Val Acc: 0.889
Epoch: 10/10 | Train Loss: 0.00363 | Val Loss: 0.00930 | Train Acc: 0.961 | Val Acc: 0.886
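To see how the scheduler reacts to these numbers, we can replay the validation accuracies through a simplified version of the plateau rule. The function `simulate_reduce_on_plateau` is a hypothetical stand-in: it only models `mode='max'`, `factor` and `patience`, and ignores the threshold, cooldown and minimum learning rate options of the real `ReduceLROnPlateau`.

```python
def simulate_reduce_on_plateau(metrics, lr, factor=0.1, patience=2):
    """Simplified 'max' mode rule: cut lr after more than `patience` epochs without a new best."""
    best = float("-inf")
    bad_epochs = 0
    lrs = []
    for value in metrics:
        if value > best:
            best = value
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:
            lr *= factor
            bad_epochs = 0
        lrs.append(lr)
    return lrs


# validation accuracies from the training log above
val_acc = [0.532, 0.652, 0.797, 0.837, 0.858, 0.861, 0.880, 0.886, 0.889, 0.886]
logged_lrs = simulate_reduce_on_plateau(val_acc, lr=0.001)
print(logged_lrs)

# made-up plateau: three epochs in a row without a new best trigger a reduction
plateau_lrs = simulate_reduce_on_plateau([0.80, 0.79, 0.79, 0.78, 0.81], lr=0.001)
print(plateau_lrs)
```

Under this simplified rule the learning rate stays at 0.001 for all ten epochs of our run, because the validation accuracy fails to reach a new best only once, in the final epoch. In the made-up plateau the rate is reduced in the fourth epoch.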


In [34]:
test_loss, test_acc = track_performance(test_dataloader, model, criterion)

Our test accuracy is about 88.5%. While there are better implementations, this is not a bad result.

In [35]:
print(f'Test Loss: {test_loss:.5f} | Test Acc: {test_acc:.3f}')

Test Loss: 0.00945 | Test Acc: 0.885
