# Deep Learning - RNN Sentiment Analysis
---

## Author : Amir Atapour-Abarghouei, amir.atapour-abarghouei@durham.ac.uk

This notebook will provide an example that shows the implementation of a simple RNN for sentiment analysis in PyTorch.

Copyright (c) 2022 Amir Atapour-Abarghouei, UK.

License : LGPL - http://www.gnu.org/licenses/lgpl.html

Since [TorchData](https://github.com/pytorch/data) depends on nightly builds of PyTorch, we will use an earlier version of [TorchText](https://pytorch.org/text/stable/index.html) to avoid any versioning issues. Outside the Google Colab environment, you can use the newer version of TorchText, which is cleaner and has removed a lot of unnecessary abstraction.
 

In [None]:
# !pip install -U torch==1.8.0 torchtext==0.9.0
!pip install -U torchtext==0.11.0

# Restart the runtime so that the newly installed torchtext is picked up
exit()

print('done!')

We are going to take advantage of the power of RNNs to build a model that is capable of detecting sentiment (i.e., whether a sentence is positive or negative) using PyTorch and TorchText. This will be done on movie reviews, using the [IMDb dataset](http://ai.stanford.edu/~amaas/data/sentiment/).

Let's start by importing what we need:

In [None]:
import torch
import torchtext
import torch.nn as nn

print(f'torch version: {torch.__version__}')
print(f'torchtext version: {torchtext.__version__}')

from torchtext.legacy import data
from torchtext.legacy import datasets

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print(f"Device is {device}!")

In the earlier version of TorchText we are using, `Field` is one of the main concepts; it defines how the data should be processed. In the sentiment classification task we are addressing here, each data point consists of the review string (the main body of the text) and the sentiment, either "pos" or "neg" (the label).

The parameters of a `Field` specify how the data should be processed. Here, we use the `TEXT` field to define how the text is processed, and the `LABEL` field handles the label.

*N.B.* None of this faffing is needed in the new version of TorchText.

Our `TEXT` field is given the argument `tokenize='spacy'`. This sets the tokenisation method, which separates our text string into "tokens".

In [None]:
TEXT = data.Field(tokenize = 'spacy', tokenizer_language = 'en_core_web_sm')
LABEL = data.LabelField(dtype = torch.float)
print('done!')
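
To make the idea of tokenisation concrete, here is a minimal sketch. It uses a regular expression as a stand-in for spaCy, an assumption made so that the snippet runs without downloading a language model; `simple_tokenise` is a hypothetical helper, and spaCy's actual tokeniser handles many more cases (contractions, URLs and so on).

```python
import re

def simple_tokenise(text):
    # Hypothetical stand-in for the spaCy tokeniser: each run of word
    # characters becomes one token, and each punctuation mark is its own token.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenise("This film is great, truly great!")
print(tokens)
# ['This', 'film', 'is', 'great', ',', 'truly', 'great', '!']
```

The model never sees the raw string, only the sequence of tokens produced at this step.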

Now, we can get the dataset, which is already built into TorchText. The following command will download the IMDB dataset:

In [None]:
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

print(f'\nThere are {len(train_data)} data points in the training set!')
print(f'There are {len(test_data)} data points in the testing set!')

Let's check one of the data samples:

In [None]:
print(vars(train_data.examples[0]))

Now is the time to build a vocabulary, which maps each token to an integer index (conceptually, a one-hot vector for each token). To keep things small and efficient, we are going to cap the vocabulary at the 10,000 most frequent words.

In [None]:
MAX_VOCAB_SIZE = 10_000

TEXT.build_vocab(train_data, max_size = MAX_VOCAB_SIZE)
LABEL.build_vocab(train_data)

print(f"TEXT vocabulary: {len(TEXT.vocab)}")
print(f"LABEL vocabulary: {len(LABEL.vocab)}")

Note that the two extra entries in the `TEXT` vocabulary are the special `<unk>` (unknown word) and `<pad>` (padding) tokens.
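
The sketch below illustrates, in plain Python, how a capped vocabulary of this kind can be built and why two extra entries appear. `build_vocab` and `numericalise` are hypothetical helpers written for this illustration, not TorchText functions, and the tie-breaking between equally frequent tokens is not meant to match TorchText.

```python
from collections import Counter

def build_vocab(tokenised_texts, max_size):
    # Count how often each token occurs across the whole corpus.
    counts = Counter(tok for text in tokenised_texts for tok in text)
    # Indices 0 and 1 are reserved for the special tokens; after that,
    # only the max_size most frequent tokens are kept.
    itos = ["<unk>", "<pad>"] + [tok for tok, _ in counts.most_common(max_size)]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

def numericalise(tokens, stoi):
    # Any token outside the vocabulary is mapped to the <unk> index.
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokens]

# Made-up toy corpus, already tokenised.
corpus = [["a", "great", "film"], ["a", "dull", "film"], ["a", "great", "cast"]]
itos, stoi = build_vocab(corpus, max_size=3)

print(len(itos))                                    # 5: three kept tokens plus two specials
print(numericalise(["a", "great", "plot"], stoi))   # [2, 3, 0]
```

With `max_size = 10_000`, the same logic gives a vocabulary of 10,002 entries.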

The next step is to create iterators for our data. Let's use a `BucketIterator`, which returns batches in which the examples are of similar length, minimizing the amount of padding per example.

Once again, note that with the newer versions of torchtext, life is a lot easier.

In [None]:
batch_size = 64

train_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, test_data), 
    batch_size = batch_size,
    device = device)

print('Data iterators have been created!')
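
To see why grouping examples of similar length helps, the following sketch counts how many pad tokens are needed with and without sorting by length. It is a simplification under stated assumptions: `padded_cost` is a hypothetical helper, the lengths are made up, and the real `BucketIterator` also shuffles data rather than sorting the whole dataset.

```python
def padded_cost(lengths, batch_size):
    # Number of pad tokens needed when consecutive examples are grouped
    # into batches and each example is padded to the longest in its batch.
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total

# Made-up review lengths (in tokens): a mix of short and long reviews.
lengths = [5, 120, 7, 118, 6, 121, 8, 119]

unsorted_pad = padded_cost(lengths, batch_size=4)
bucketed_pad = padded_cost(sorted(lengths), batch_size=4)

print(unsorted_pad)   # 460 pad tokens when short and long reviews are mixed
print(bucketed_pad)   # 12 pad tokens when similar lengths are batched together
```

Fewer pad tokens means less wasted computation, and fewer meaningless steps for the RNN to process at the end of each sequence.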

Now, we will create our network, a very simple architecture made up of an embedding layer, an RNN layer and a linear layer, along with the optimiser and the loss function:

In [None]:
# build architecture:
class RNN(nn.Module):
  def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):
    super().__init__()
      
    self.embedding = nn.Embedding(input_dim, embedding_dim)  
    self.rnn = nn.RNN(embedding_dim, hidden_dim)
    self.fc = nn.Linear(hidden_dim, output_dim)
      
  def forward(self, text):

    embedded = self.embedding(text)
    output, hidden = self.rnn(embedded)
    
    assert torch.equal(output[-1,:,:], hidden.squeeze(0))
    
    return self.fc(hidden.squeeze(0))

# a few parameters to set for our model:
input_dim = len(TEXT.vocab)
embed_dim = 100
hidden_dim = 256
output_dim = 1

model = RNN(input_dim, embed_dim, hidden_dim, output_dim)
model = model.to(device)

print(model)
print('Model has been created!')
print(f'Model has {sum(p.numel() for p in model.parameters() if p.requires_grad)} parameters!')

# create the optimiser:
optimiser = torch.optim.SGD(model.parameters(), lr=1e-3)

# loss function:
criterion = nn.BCEWithLogitsLoss()
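
To show what the three layers compute, here is a rough pure-Python sketch of the forward pass for a single review. It assumes toy sizes, random weights, a single bias vector in the recurrent layer (PyTorch's `nn.RNN` keeps two) and no bias in the final linear layer; all names in it are hypothetical and it is not how PyTorch implements these layers internally.

```python
import math
import random

random.seed(0)

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

vocab_size, embed_dim, hidden_dim = 6, 4, 3      # toy sizes, not the notebook's

embedding = rand_matrix(vocab_size, embed_dim)   # one row per token index
w_ih = rand_matrix(hidden_dim, embed_dim)        # input-to-hidden weights
w_hh = rand_matrix(hidden_dim, hidden_dim)       # hidden-to-hidden weights
b_h = [0.0] * hidden_dim                         # simplification: one bias vector
w_fc = rand_matrix(1, hidden_dim)                # linear layer with a single output

def forward(token_ids):
    hidden = [0.0] * hidden_dim                  # the hidden state starts at zero
    for idx in token_ids:
        x = embedding[idx]                       # embedding layer: a table lookup
        pre = [a + b + c for a, b, c in
               zip(matvec(w_ih, x), matvec(w_hh, hidden), b_h)]
        hidden = [math.tanh(p) for p in pre]     # h_t = tanh(W_ih x_t + W_hh h_{t-1} + b)
    return matvec(w_fc, hidden)[0]               # logit from the final hidden state

logit = forward([2, 5, 1, 3])                    # a made-up sequence of token indices
print(logit)
```

Only the final hidden state reaches the linear layer, which is why the `forward` method in the cell above passes `hidden` rather than the full `output` to `self.fc`.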

In order to keep track of how our model performs, we need to calculate the accuracy of our predictions. This helper function will help us compute the accuracy:

In [None]:
# Function to calculate accuracy:
def binary_accuracy(preds, y):

    # squash the logits to (0, 1) with a sigmoid, then round to 0 or 1
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float()
    acc = correct.sum() / len(correct)
    return acc
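
As a worked example, the sketch below follows a few made-up logits through to a loss value and an accuracy in plain Python. The formula in `bce_with_logits` is a standard numerically stable way of writing binary cross-entropy on logits; it is shown as an illustration of what the loss measures, not as PyTorch's actual implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_with_logits(logit, target):
    # Stable form of -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(logit).
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

logits = [2.0, -1.5, 0.3, -0.2]   # made-up model outputs
labels = [1.0, 0.0, 0.0, 0.0]     # made-up targets

# A probability above 0.5 is the same as a positive logit.
preds = [1.0 if sigmoid(z) > 0.5 else 0.0 for z in logits]
accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
mean_loss = sum(bce_with_logits(z, y) for z, y in zip(logits, labels)) / len(labels)

print(preds)       # [1.0, 0.0, 1.0, 0.0]
print(accuracy)    # 0.75
print(mean_loss)
```

The third example is the only mistake: its logit is slightly positive while its label is 0, so it counts against the accuracy and contributes the largest loss term.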

And now the main training loop:

In [None]:
epochs = 5

# main loop:
for epoch in range(epochs):

  epoch_loss = 0
  epoch_acc = 0

  for batch in train_iterator:
      
    # reset the gradients of all parameters
    optimiser.zero_grad()

    # model output:  
    predictions = model(batch.text).squeeze(1)

    # calculate loss:
    loss = criterion(predictions, batch.label)
    # calculate accuracy:
    acc = binary_accuracy(predictions, batch.label)

    # backward:
    loss.backward()
    optimiser.step()

    epoch_loss += loss.item()/len(train_iterator)
    epoch_acc += acc.item()/len(train_iterator)
    
  print(f'Epoch: {epoch+1:02} || Train Loss: {epoch_loss:.3f} - Train Acc: {epoch_acc*100:.2f}%')
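
The running totals in the loop above amount to an average of the per-batch values. The sketch below shows this with made-up numbers, and also shows that such an average differs slightly from a per-sample average when the last batch is smaller than the others; the batch losses and sizes are assumptions for illustration only.

```python
batch_losses = [0.70, 0.68, 0.66, 0.71]   # made-up mean loss of each batch
batch_sizes = [64, 64, 64, 40]            # made-up sizes; the last batch is smaller
num_batches = len(batch_losses)

# Same accumulation pattern as in the training loop.
epoch_loss = 0.0
for loss in batch_losses:
    epoch_loss += loss / num_batches

# Per-sample average, weighting each batch by the number of examples in it.
sample_weighted = (sum(l * n for l, n in zip(batch_losses, batch_sizes))
                   / sum(batch_sizes))

print(round(epoch_loss, 4))        # 0.6875
print(round(sample_weighted, 4))
```

For monitoring training, the difference between the two is usually negligible.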

Now that the training is complete, we should evaluate the model:

In [None]:
eval_loss = 0
eval_acc = 0

# switch to evaluation mode (good practice, although this model has no dropout)
model.eval()

with torch.no_grad():

  for batch in test_iterator:

    # model output:
    predictions = model(batch.text).squeeze(1)

    # calculate loss and accuracy:
    loss = criterion(predictions, batch.label)
    acc = binary_accuracy(predictions, batch.label)
    eval_loss += loss.item() / len(test_iterator)
    eval_acc += acc.item() / len(test_iterator)

print(f'Test Loss: {eval_loss:.3f} - Test Acc: {eval_acc*100:.2f}%')

Don't be surprised: the results are awful. There are lots of ways to improve what we have done here, so try to see if you can improve things.