# Training an RNN to recognize movie review sentiment

This notebook follows a tutorial in the wonderful book Machine Learning with PyTorch and Scikit-Learn by Raschka, Liu, and Mirjalili (2022, Packt Publishing). In this notebook, we do the following:

1. Use the PyTorch library (https://pytorch.org/) to construct and train a recurrent neural network (RNN) on the IMDB dataset built into torchtext.datasets: https://pytorch.org/text/stable/datasets.html#imdb
2. Deploy the trained model as an interactive web app using the Gradio library (https://gradio.app/).

Note that the Gradio app can be found by visiting my Hugging Face page: https://huggingface.co/spaces/etweedy/movie_review

First we import our libraries and select the GPU as the device if one is available.

In [2]:
%%capture
! pip install torchtext==0.13.0
! pip install torchdata==0.4.0

import torch
import torchdata
from torchtext.datasets import IMDB
from torch import nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

We load the data into training and validation sets (the validation set here is the IMDB 'test' split).  Each of these is a list of tuples of the form (sentiment, review), where the sentiment is either 'pos' or 'neg' and review is a string containing a review for a movie.  The review strings can contain HTML tags, as demonstrated below.

In [3]:
ds_train = list(IMDB(split='train'))
ds_val = list(IMDB(split='test'))

In [11]:
ds_train[0]

('neg',
 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

In [14]:
import re
from collections import Counter, OrderedDict

This tokenizer function does a few things:
1. Removes HTML tags, which are common in this dataset.
2. Finds any emoticons with eyes like : or ; or = , with or without a nose like -, and with mouth like ) or ( or P or D.
3. Lowercases the text, replaces each run of non-word characters with a single space, and appends the emoticons found in step 2 in standardized form (with any nose - removed).
4. Splits into a list of token strings.

In [12]:
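def tokenizer(text):
    # Strip HTML tags such as <br />.
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons. Note: the search runs on lowercased text, so the
    # uppercase D and P alternatives never match as written.
    emoticons = re.findall(
        r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower()
    )
    # Replace runs of non-word characters with a space, then append the
    # emoticons with their noses removed.
    text = (re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', ''))
    tokenized = text.split()
    return tokenized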
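To see what the tokenizer produces, here is a minimal sketch that applies the same three regular-expression steps to a made-up review. The name demo_tokenizer and the sample string are illustrative assumptions and are not part of the original notebook.

```python
import re

def demo_tokenizer(text):
    # Same steps as the tokenizer above, repeated so this cell runs on its own.
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    return text.split()

# Hypothetical sample review containing an HTML tag and two emoticons.
sample = "This movie was <b>great</b> :-) but too long :("
sample_tokens = demo_tokenizer(sample)
print(sample_tokens)
# ['this', 'movie', 'was', 'great', 'but', 'too', 'long', ':)', ':(']
```

The emoticons are moved to the end of the token list, so their position in the review is lost; only their presence is kept.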

We build a collections.Counter which records each unique token from the tokenized training reviews together with the number of times it appears across the whole training set.  The resulting vocabulary contains 75977 distinct tokens.  We then look at the final review in ds_train and its token list (after the loop finishes, tokens still holds the tokens of that last review).

In [15]:
token_counts = Counter()
for label, line in ds_train:
    tokens = tokenizer(line)
    token_counts.update(tokens)

print('Vocab-size:',len(token_counts))

Vocab-size: 75977


In [18]:
ds_train[-1]

('pos',
 'The story centers around Barry McKenzie who must go to England if he wishes to claim his inheritance. Being about the grossest Aussie shearer ever to set foot outside this great Nation of ours there is something of a culture clash and much fun and games ensue. The songs of Barry McKenzie(Barry Crocker) are highlights.')

In [19]:
print(tokens)

['the', 'story', 'centers', 'around', 'barry', 'mckenzie', 'who', 'must', 'go', 'to', 'england', 'if', 'he', 'wishes', 'to', 'claim', 'his', 'inheritance', 'being', 'about', 'the', 'grossest', 'aussie', 'shearer', 'ever', 'to', 'set', 'foot', 'outside', 'this', 'great', 'nation', 'of', 'ours', 'there', 'is', 'something', 'of', 'a', 'culture', 'clash', 'and', 'much', 'fun', 'and', 'games', 'ensue', 'the', 'songs', 'of', 'barry', 'mckenzie', 'barry', 'crocker', 'are', 'highlights']
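As a small illustration of the counting step, the sketch below runs the same Counter.update pattern on two made-up token lists. The names toy_reviews and toy_counts are hypothetical and chosen only for this example.

```python
from collections import Counter

# Two hypothetical, already-tokenized reviews.
toy_reviews = [
    ["great", "movie", "great", "cast"],
    ["boring", "movie"],
]

toy_counts = Counter()
for toy_tokens in toy_reviews:
    # update() adds one to the count of every token in the list.
    toy_counts.update(toy_tokens)

print("Vocab-size:", len(toy_counts))   # number of distinct tokens
print(toy_counts["great"], toy_counts["movie"], toy_counts["boring"])
```

The length of the Counter is the number of distinct tokens, which is how the vocabulary size above is obtained.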


We then sort the items in token_counts in descending order of frequency and convert them into a torchtext Vocab object.  Finally, we insert the padding token at index 0 and the placeholder token for unknown words at index 1, and make index 1 the default for any token not in the vocabulary.

In [20]:
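from torchtext.vocab import vocab as build_vocab

In [21]:
sorted_by_freq_tuples = sorted(token_counts.items(),key=lambda x: x[1],reverse=True)

In [22]:
ordered_dict = OrderedDict(sorted_by_freq_tuples)

In [11]:
# Build the Vocab under a separate name so the imported factory function is not shadowed
# (re-running this cell would otherwise fail).
vocab = build_vocab(ordered_dict)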

In [12]:
vocab.insert_token('<pad>',0)
vocab.insert_token('<unk>',1)
vocab.set_default_index(1)
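The resulting token-to-index mapping can be pictured with a plain-Python stand-in. This is only a sketch of the behavior described above, not torchtext's implementation; the names toy_counts, itos, stoi, and lookup are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical token counts.
toy_counts = Counter({"the": 5, "movie": 3, "great": 2})

# Sort tokens by descending frequency, as in the notebook.
sorted_tokens = [tok for tok, _ in sorted(toy_counts.items(), key=lambda x: x[1], reverse=True)]

# Reserve index 0 for padding and index 1 for unknown tokens.
itos = ["<pad>", "<unk>"] + sorted_tokens
stoi = {tok: idx for idx, tok in enumerate(itos)}

def lookup(token):
    # Unseen tokens fall back to the <unk> index, mirroring set_default_index(1).
    return stoi.get(token, stoi["<unk>"])

codes = [lookup(tok) for tok in ["the", "movie", "was", "great"]]
print(codes)  # [2, 3, 1, 4]
```

Because the two special tokens occupy indices 0 and 1, the most frequent real token gets index 2, and the total size is the number of distinct tokens plus two.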

The label pipeline converts 'pos' to 1 and 'neg' to 0.
The text pipeline converts a string to the list of vocab codes for the tokens in tokenizer(string).

The collate_batch function will serve as our collate_fn for the DataLoader we'll define.  Given a batch of samples, collate_batch does a few things:
1. Creates label_list, a list of target labels 1 or 0 for the batch samples.
2. Creates lengths, a list of the token counts of each sample in the batch.
3. Creates text_list, a list of lists of vocab codes of tokens in tokenized samples from the batch.
4. Uses pad_sequence() to pad all lists in text_list with 0 (the pad token code) so that all sequences in the batch are the same length, namely max(lengths).

In [13]:
text_pipeline=lambda x: [vocab[token] for token in tokenizer(x)]
label_pipeline=lambda x: 1. if x=='pos' else 0.

In [14]:
def collate_batch(batch):
    label_list,text_list,lengths=[],[],[]
    for _label,_text in batch:
        label_list.append(label_pipeline(_label))
        processed_text=torch.tensor(text_pipeline(_text),dtype=torch.int64)
        text_list.append(processed_text)
        lengths.append(processed_text.size(0))
    label_list = torch.tensor(label_list)
    lengths = torch.tensor(lengths)
    padded_text_list=nn.utils.rnn.pad_sequence(text_list,batch_first=True)
    return padded_text_list , label_list, lengths
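The padding step can be illustrated without PyTorch. The helper below is a hypothetical plain-Python stand-in for what pad_sequence with batch_first=True does to a batch of encoded reviews; pad_to_longest and the toy values are assumptions made for this sketch.

```python
def pad_to_longest(sequences, pad_value=0):
    # Right-pad every sequence with pad_value up to the longest length in the batch.
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]

# Hypothetical batch of three encoded reviews of different lengths.
toy_batch = [[11, 7, 35, 457], [7, 35], [11]]
toy_lengths = [len(seq) for seq in toy_batch]
toy_padded = pad_to_longest(toy_batch)

print(toy_lengths)  # [4, 2, 1]
print(toy_padded)   # [[11, 7, 35, 457], [7, 35, 0, 0], [11, 0, 0, 0]]
```

The original lengths are returned alongside the padded batch so that the model can later ignore the padded positions.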

Finally we create training and validation DataLoaders with batch size of 32.

In [15]:
from torch.utils.data import DataLoader

In [16]:
batch_size=32
dl_train = DataLoader(ds_train,batch_size=batch_size,shuffle=True,collate_fn=collate_batch)
dl_val = DataLoader(ds_val,batch_size=batch_size,shuffle=True,collate_fn=collate_batch)

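Next we define the training and evaluation functions.  Each call runs a single epoch over the given DataLoader and returns the average accuracy and average loss per sample.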

In [17]:
def train(dataloader):
    model.train()
    total_acc,total_loss=0,0
    for text_batch, label_batch, lengths in dataloader:
        text_batch = text_batch.to(device)
        label_batch = label_batch.to(device)
        lengths = lengths.to(device)
        opt.zero_grad()
        pred=model(text_batch,lengths)[:,0]
        loss = loss_fn(pred,label_batch)
        loss.backward()
        opt.step()
        total_acc += ((pred>0.5).float() == label_batch).float().sum().item()
        total_loss += loss.item()*label_batch.size(0)
    return total_acc/len(dataloader.dataset),total_loss/len(dataloader.dataset)

In [18]:
def evaluate(dataloader):
    model.eval()
    total_acc,total_loss=0,0
    with torch.no_grad():
        for text_batch, label_batch, lengths in dataloader:
            text_batch = text_batch.to(device)
            label_batch = label_batch.to(device)
            lengths = lengths.to(device)
            pred=model(text_batch,lengths)[:,0]
            loss = loss_fn(pred,label_batch)
            total_acc += ((pred>0.5).float() == label_batch).float().sum().item()
            total_loss += loss.item()*label_batch.size(0)
    return total_acc/len(dataloader.dataset),total_loss/len(dataloader.dataset)
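Both functions multiply each batch's mean loss by the batch size before summing, then divide by the dataset size. A short sketch with made-up numbers shows why this matters when the last batch is smaller than the rest; the values in toy_batches are hypothetical.

```python
# Hypothetical (mean_loss, batch_size) pairs; the last batch is smaller.
toy_batches = [(0.70, 32), (0.60, 32), (0.20, 8)]

n_samples = sum(size for _, size in toy_batches)

# Per-sample average, as computed in train() and evaluate().
weighted = sum(loss * size for loss, size in toy_batches) / n_samples

# Naive average of batch means, which over-weights the small final batch.
naive = sum(loss for loss, _ in toy_batches) / len(toy_batches)

print(f"per-sample average: {weighted:.4f}")
print(f"average of batch means: {naive:.4f}")
```

With a batch size of 32, any dataset whose size is not a multiple of 32 ends with a short batch, so the weighted form is the one that gives a true per-sample average.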

Next we define the RNN class that we'll use.  It consists of:
1. An input embedding layer.
2. A Long Short-Term Memory (LSTM) layer, which is a drop-in replacement for the ordinary RNN layer that helps to mitigate the vanishing/exploding gradient issues that can plague ordinary RNNs while still allowing the network to capture some long-term dependencies.
3. Three fully connected (FC) layers, with ReLU activations and Dropout layers in between to reduce the risk of overfitting.
4. A sigmoid output activation - we'll classify the output as 0 or 1 based on a threshold of 0.5.

In [19]:
class RNN(nn.Module):
    def __init__(self,vocab_size,embed_dim,rnn_hidden_size,fc_hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size,embed_dim,padding_idx=0)
        self.rnn = nn.LSTM(embed_dim,rnn_hidden_size,batch_first=True)
        self.fc1 = nn.Linear(rnn_hidden_size,fc_hidden_size)
        self.relu1 = nn.ReLU()
        self.dropout1 = nn.Dropout(p=0.5)
        self.fc2 = nn.Linear(fc_hidden_size,fc_hidden_size)
        self.relu2 = nn.ReLU()
        self.dropout2 = nn.Dropout(p=0.5)
        self.fc3 = nn.Linear(fc_hidden_size,1)
        self.sigmoid = nn.Sigmoid()
        
    def forward(self,text,lengths):
        out = self.embedding(text)
        out = nn.utils.rnn.pack_padded_sequence(out,lengths.cpu().numpy(),enforce_sorted=False,batch_first=True)
        out,(hidden,cell) = self.rnn(out)
        out = hidden[-1,:,:]
        out=self.fc1(out)
        out=self.relu1(out)
        out=self.dropout1(out)
        out=self.fc2(out)
        out=self.relu2(out)
        out=self.dropout2(out)
        out=self.fc3(out)
        out = self.sigmoid(out)
        return out

We initialize our RNN instance with embedding dimension 20 and all hidden layers of size 64.

In [20]:
vocab_size=len(vocab)
embed_dim=20
rnn_hidden_size=64
fc_hidden_size=64

torch.manual_seed(1)
model = RNN(vocab_size,embed_dim,rnn_hidden_size,fc_hidden_size)
model.to(device)

RNN(
  (embedding): Embedding(75979, 20, padding_idx=0)
  (rnn): LSTM(20, 64, batch_first=True)
  (fc1): Linear(in_features=64, out_features=64, bias=True)
  (relu1): ReLU()
  (dropout1): Dropout(p=0.5, inplace=False)
  (fc2): Linear(in_features=64, out_features=64, bias=True)
  (relu2): ReLU()
  (dropout2): Dropout(p=0.5, inplace=False)
  (fc3): Linear(in_features=64, out_features=1, bias=True)
  (sigmoid): Sigmoid()
)
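As a rough sanity check on the printed architecture, the number of trainable parameters can be worked out by hand. The sketch below assumes PyTorch's usual layout for a single-layer, unidirectional nn.LSTM (four gates, each with input weights, recurrent weights, and two bias vectors); the variable names are illustrative.

```python
# Sizes taken from the printed model above.
vocab_size, embed_dim, rnn_hidden, fc_hidden = 75979, 20, 64, 64

# Embedding: one embed_dim-sized vector per vocabulary entry.
embedding_params = vocab_size * embed_dim

# LSTM: four gates, each with input weights, recurrent weights, and two biases.
lstm_params = 4 * (embed_dim * rnn_hidden + rnn_hidden * rnn_hidden + 2 * rnn_hidden)

# Linear layers: weight matrix plus bias vector.
fc1_params = rnn_hidden * fc_hidden + fc_hidden
fc2_params = fc_hidden * fc_hidden + fc_hidden
fc3_params = fc_hidden * 1 + 1

total_params = embedding_params + lstm_params + fc1_params + fc2_params + fc3_params
print(embedding_params, lstm_params, fc1_params, fc2_params, fc3_params)
print(total_params)
```

Under these assumptions the embedding table accounts for the large majority of the parameters, which is typical when the vocabulary is large relative to the hidden sizes.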

We use binary cross-entropy loss (the ordinary BCELoss, since the model already applies a sigmoid to its output).  We optimize with Adam; note that 0.001 is a typical baseline learning rate for Adam, but we scale up to 0.01 because we included dropout layers in our RNN - they typically allow for a faster learning rate.

In [21]:
loss_fn = nn.BCELoss()
opt = torch.optim.Adam(model.parameters(),lr=0.01)
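For intuition about what the loss measures, here is a plain-Python sketch of the sigmoid and the mean binary cross-entropy formula, together with the 0.5 threshold used for accuracy. It is a hand-written illustration of the formula, not PyTorch's implementation, and the toy logits and targets are made up.

```python
import math

def sigmoid(z):
    # Squash a raw score into a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def bce(probs, targets):
    # Mean of -(y*log(p) + (1-y)*log(1-p)) over the batch.
    total = 0.0
    for p, y in zip(probs, targets):
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(probs)

# Hypothetical raw scores and their true labels (1 = positive, 0 = negative).
toy_logits = [2.0, -1.0, 0.0]
toy_targets = [1.0, 0.0, 1.0]

toy_probs = [sigmoid(z) for z in toy_logits]
toy_loss = bce(toy_probs, toy_targets)

# Same thresholding rule as in train() and evaluate().
toy_preds = [1.0 if p > 0.5 else 0.0 for p in toy_probs]
print(toy_preds)  # [1.0, 0.0, 0.0]
print(f"{toy_loss:.4f}")
```

A prediction of exactly 0.5 is classified as negative under the strict > 0.5 rule, and it contributes log(2) to the loss regardless of the true label.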

In [None]:
num_epochs=10
torch.manual_seed(1)
for epoch in range(num_epochs):
    acc_train,loss_train = train(dl_train)
    acc_val,loss_val = evaluate(dl_val)
    print(f'Epoch {epoch} --- training accuracy: {acc_train:.4f} --- validation accuracy: {acc_val:.4f}')

Epoch 0 --- training accuracy: 0.6946 --- validation accuracy: 0.8254
Epoch 1 --- training accuracy: 0.8875 --- validation accuracy: 0.8689
Epoch 2 --- training accuracy: 0.9345 --- validation accuracy: 0.8683
Epoch 3 --- training accuracy: 0.9570 --- validation accuracy: 0.8531
Epoch 4 --- training accuracy: 0.9716 --- validation accuracy: 0.8580
Epoch 5 --- training accuracy: 0.9806 --- validation accuracy: 0.8526
Epoch 6 --- training accuracy: 0.9842 --- validation accuracy: 0.8541
Epoch 7 --- training accuracy: 0.9862 --- validation accuracy: 0.8530
Epoch 8 --- training accuracy: 0.9864 --- validation accuracy: 0.8545


Save our model weights and vocabulary.

In [None]:
torch.save(model.state_dict(),'imdb_rnn_drop_model_lr_1e-2_weights.pth')

In [None]:
torch.save(vocab,'imdb_rnn_drop_model_lr_1e-2_vocab.pth')

In [27]:
def predict(text):
    # Encode the review and wrap it as a batch of size 1.
    processed_text = torch.tensor(text_pipeline(text), dtype=torch.int64)
    lengths = torch.tensor([processed_text.size(0)])
    padded_text_list = nn.utils.rnn.pad_sequence([processed_text], batch_first=True)
    padded_text_list = padded_text_list.to(device)
    lengths = lengths.to(device)
    # Disable dropout and gradient tracking for inference.
    model.eval()
    with torch.no_grad():
        pred = model(padded_text_list, lengths)

    return 'Positive' if pred.item() > 0.5 else 'Negative'

Finally, we implement a little Gradio web app that we can use to interact with our model through the predict function above.  The app asks the user to enter a movie review in a text box and returns a prediction of 'Positive' or 'Negative' sentiment.  The code below generates a locally hosted app, but see the following blog post for a nice tutorial on deploying your web app on Hugging Face:
https://huggingface.co/blog/gradio-spaces

In [161]:
%%capture
! pip install gradio
import gradio as gr

In [162]:
title = "Write a movie review"
description = "Enter a review for a movie you've seen.  This tool will try to guess whether your review is positive or negative."
gr.Interface(fn=predict, 
             inputs="text",
             outputs="label",
             title = title,
             description = description,
              ).launch()

Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.


