<img src='data/images/section-notebook-header.png' />

# Recurrent Neural Networks (RNNs): Sentiment Analysis

Sentiment analysis, also known as opinion mining, is an NLP task that involves determining the sentiment, emotion, or subjective tone expressed in a piece of text. It aims to automatically analyze and classify the sentiment conveyed by the text as, for example *positive*, *negative*, *neutral*, or sometimes more fine-grained sentiments. The goal of sentiment analysis is to extract and understand the subjective information present in text data, enabling automated systems to comprehend people's opinions, attitudes, and emotions towards various topics, products, services, or events.

Sentiment analysis can be considered a text classification task because its goal is to classify or categorize a piece of text into predefined sentiment classes or labels. In sentiment analysis, the text can be a sentence, a document, a review, a tweet, or any other form of textual input. Text classification refers to the process of assigning predefined categories or labels to a given text based on its content. In the case of sentiment analysis, the predefined categories are sentiment classes (e.g., *positive*, *negative*, *neutral*).

By treating sentiment analysis as a text classification problem, various classification algorithms and techniques can be leveraged to build models that can effectively analyze and categorize the sentiment in textual data. Text classification techniques, including machine learning algorithms like Naive Bayes, Support Vector Machines (SVM), Random Forests, and deep learning models such as Convolutional Neural Networks (CNNs) or **Recurrent Neural Networks (RNNs)**, can be employed for sentiment analysis. These models take the text as input, extract relevant features, and make predictions about the sentiment label.

In this notebook, we train a sentiment classifier based on an RNN model using movie reviews as the training data. As usual, the goal is not to achieve state-of-the-art results but to systematically go through the main steps to solve a classification task using RNNs.

## Setting up the Notebook

### Required Packages

In [None]:
%reload_ext autoreload
%autoreload 2

In [None]:
import numpy as np
import random

from tqdm import tqdm

from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

We utilize PyTorch as our deep learning framework of choice by importing the `torch` package.

In [None]:
import torch
import torch.nn as nn
import torchtext
from torch.utils.data import Dataset, DataLoader

We also need to import some custom implementations of classes and methods. This makes it easier to reuse these classes and methods and keeps the notebook clean.

In [None]:
# Custom BatchSampler
from src.sampler import BaseDataset, EqualLengthsBatchSampler
# Core implementation of RNN classifier and Attention
from src.rnn import RnnTextClassifier, DotAttention
# Some utility classes and methods
from src.utils import Dict2Class, plot_training_results

### Checking/Setting the Computation Device

PyTorch allows training neural networks on supported GPUs to significantly speed up the training process. If you have a supported GPU, feel free to utilize it.

In [None]:
use_cuda = torch.cuda.is_available()

# Use this line below to enforce the use of the CPU (in case you don't have a supported GPU)
# With this small dataset and simple model you won't see a difference anyway
#use_cuda = False

device = torch.device("cuda:0" if use_cuda else "cpu")

print("Available device: {}".format(device))

---

## Preparing the Dataset

While RNNs can handle sequences of arbitrary length -- as long as all sequences in the same batch have the same length -- it is often practical to limit the maximum sequence length. This is not only for computational reasons; it also becomes increasingly difficult to propagate meaningful gradients back during Backpropagation Through Time (BPTT).

For a sentence dataset, this is hardly an issue, since individual sentences are usually not overly long. However, movie reviews generally consist of several sentences. Note that by limiting ourselves to the first `MAX_LENGTH` words we assume that the main sentiment is expressed at the beginning of the review. If we assume that we should focus on the end of a review, we should consider the last `MAX_LENGTH` words.

In the code cell below, we set `MAX_LENGTH` to 100, but feel free to play with this value. When loading the data from the files, we directly cut all sequences longer than `MAX_LENGTH` down to the specified value. This also means that we won't have to check the sequence lengths anymore when training or evaluating a model (compared to the CNN model). In practice, `MAX_LENGTH` would be an interesting hyperparameter to optimize, but this is out of scope here.

In [None]:
MAX_LENGTH = 100
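
The two truncation strategies mentioned above differ only in the slice that is applied to each sequence. The snippet below is a minimal sketch using a made-up sequence of token indices and a tiny `MAX_LENGTH` for readability; the names `first_tokens` and `last_tokens` are only used for this illustration.

```python
MAX_LENGTH = 5  # tiny value for illustration only; the notebook uses 100

# Hypothetical vectorized review (the token indices are made up)
sequence = [11, 12, 13, 14, 15, 16, 17, 18]

# Keep the beginning of the review (what this notebook does)
first_tokens = sequence[:MAX_LENGTH]

# Keep the end of the review (alternative assumption)
last_tokens = sequence[-MAX_LENGTH:]

print(first_tokens)
print(last_tokens)
```

Sequences that are already shorter than `MAX_LENGTH` are left unchanged by both slices.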

### Load Data from File

We already preprocessed and vectorized the [Large Movie Review Dataset](https://ai.stanford.edu/~amaas/data/sentiment/) in the previous notebook. We essentially only need to load the generated files. Let's start with the vocabulary. Recall, the `vocabulary` is a `vocab` object from the `torchtext` package, allowing us to map words/tokens to their unique integer identifiers and back.

In [None]:
vocabulary = torch.load('data/corpora/imdb-reviews/vectorized-rnn-sa/imdb-rnn-sa-20000.vocab')

vocab_size = len(vocabulary)

print('Size of vocabulary:\t{}'.format(vocab_size))

Now we can load the vectorized reviews, which are split across 2 files: one for the training set, the other for the test set. The only additional steps we perform below are to cut each review to a length of `MAX_LENGTH`, if needed, and to shuffle both the training and test set. Shuffling is good practice in general, and strongly recommended here since we know that both files first list all positive and then all negative reviews. Without shuffling, most training batches would contain only positive or only negative reviews, which typically results in poor training performance.

In [None]:
samples_train, samples_test = [], []

with open('data/corpora/imdb-reviews/vectorized-rnn-sa/imdb-rnn-sa-reviews-20000-train.txt') as file:
    for line in file:
        name, label = line.split(',')
        # Convert name to a sequence of integers
        sequence = [ int(index) for index in name.split() ]
        # Add (sequence,label) pair to list of samples
        samples_train.append((sequence[:MAX_LENGTH], int(label.strip())))
        
with open('data/corpora/imdb-reviews/vectorized-rnn-sa/imdb-rnn-sa-reviews-20000-test.txt') as file:    
    for line in file:
        name, label = line.split(',')
        # Convert name to a sequence of integers
        sequence = [ int(index) for index in name.split() ]
        # Add (sequence,label) pair to list of samples
        samples_test.append((sequence[:MAX_LENGTH], int(label.strip())))
        
random.shuffle(samples_train)
random.shuffle(samples_test)
        
print('Number of training samples: {}'.format(len(samples_train)))
print('Number of test samples: {}'.format(len(samples_test)))

### Create Training & Test Set

Since the dataset comes in 2 files reflecting the training and test data, we can directly convert the dataset into the respective lists.

In [None]:
X_train = [ torch.LongTensor(seq) for (seq, _) in samples_train ]
X_test  = [ torch.LongTensor(seq) for (seq, _) in samples_test ]

y_train = [ label for (_, label) in samples_train ]
y_test  = [ label for (_, label) in samples_test ]

# We can directly convert the vector of labels to a tensor
y_train = torch.LongTensor(y_train)
y_test  = torch.LongTensor(y_test)

Note that `X_train` and `X_test` are themselves not tensors but lists of tensors, since a single tensor would require that all sequences have the same length. While we ensured that no sequence is longer than 100 words/tokens, there can still be reviews shorter than that. As such, `X_train` and `X_test` are not yet ready to be fed into a neural network. We will address this next.
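
To see why sequences of different lengths cannot simply be combined into one tensor, consider the following minimal sketch. The two short sequences are made up for illustration; `torch.stack` only succeeds if all tensors have the same shape.

```python
import torch

# Two hypothetical vectorized reviews of different lengths
sequences = [torch.LongTensor([4, 8, 15]), torch.LongTensor([16, 23])]

try:
    torch.stack(sequences)
    stacked_ok = True
except RuntimeError as error:
    # Stacking fails because the tensors differ in size
    stacked_ok = False
    print("Cannot stack:", error)

# Sequences of the same length can be stacked into a (batch_size, seq_len) tensor
same_length = [torch.LongTensor([4, 8, 15]), torch.LongTensor([16, 23, 42])]
batch = torch.stack(same_length)
print(batch.shape)
```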

### Create Data Loaders

We first create a simple class called `BaseDataset` extending [`Dataset`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset). This class only stores our `inputs` and `targets` and needs to implement the `__len__()` and `__getitem__()` methods. Since our class extends the abstract class [`Dataset`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset), we can use an instance later to create a [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader). Without going into too much detail, this approach not only allows for cleaner code but also supports loading data in parallel and optimizing the data transfer between the CPU and GPU, which is critical when processing very large amounts of data. It is therefore the recommended best practice.

The [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) class takes a `Dataset` object as input and handles splitting the dataset into batches. The class `EqualLengthsBatchSampler` analyzes the input sequences to organize all sequences into groups of sequences of the same length. Then, each batch is sampled from a single group, ensuring that all sequences in the batch have the same length. In the following, we use a batch size of 256, although you can easily go higher since all sequences are limited to at most `MAX_LENGTH` tokens.

In [None]:
batch_size = 256

dataset_train = BaseDataset(X_train, y_train)
sampler_train = EqualLengthsBatchSampler(batch_size, X_train, y_train)
loader_train = DataLoader(dataset_train, batch_sampler=sampler_train, shuffle=False, drop_last=False)

dataset_test = BaseDataset(X_test, y_test)
sampler_test = EqualLengthsBatchSampler(batch_size, X_test, y_test)
loader_test = DataLoader(dataset_test, batch_sampler=sampler_test, shuffle=False, drop_last=False)
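
The idea of grouping sequences by length can be sketched in a few lines of plain Python. Note that this is only an illustration of the principle and not the actual implementation of `EqualLengthsBatchSampler` in `src/sampler.py`; the function name `equal_length_batches` and the toy sequences are assumptions made for this example.

```python
import random
from collections import defaultdict

def equal_length_batches(sequences, batch_size, seed=0):
    """Return lists of sample indices; all samples in a list have the same length."""
    # Group the sample indices by sequence length
    groups = defaultdict(list)
    for index, sequence in enumerate(sequences):
        groups[len(sequence)].append(index)

    rng = random.Random(seed)
    batches = []
    for indices in groups.values():
        # Shuffle within a group, then cut the group into chunks of at most batch_size
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batches.append(indices[start:start + batch_size])

    # Shuffle the order of the batches so lengths are mixed across training steps
    rng.shuffle(batches)
    return batches

# Toy data: three sequences of length 2 and two sequences of length 3
sequences = [[1, 2], [3, 4, 5], [6, 7], [8, 9], [10, 11, 12]]
batches = equal_length_batches(sequences, batch_size=2)

for batch in batches:
    print([sequences[i] for i in batch])
```

A list of index lists like `batches` is the kind of object that can be passed to a `DataLoader` via its `batch_sampler` argument.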

We can now iterate over all batches using a simple loop as follows:

In [None]:
for X, y in loader_train:
    print('X.shape:', X.shape)
    print('y.shape:', y.shape)
    break # We don't need to see all batches here

The shape of `X` reflects the number of samples (i.e., reviews) in the batch, and the length of all sequences in the batch. Note that the number of samples is most of the time much smaller than our specified batch size of 256 (see above). This is because we enforce that all sequences need to have the same length, and if there are fewer than 256 sequences of the same length in our dataset, the corresponding batch won't be full.

**Side note:** There are standard techniques to have batches with sequences of initially different lengths (keyword: *padding*). Here, however, for convenience, we use the approach of packing sequences of the same length in the same batches. Also, it is easy to see that the chance of an underfull batch reduces for larger datasets.
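
For comparison, the padding alternative mentioned in the side note could look roughly as follows. This is a minimal sketch that is not used anywhere in this notebook; it assumes that index `0` is reserved as the padding index, which may not match the vocabulary used here.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Two hypothetical vectorized reviews of different lengths
sequences = [torch.LongTensor([4, 8, 15]), torch.LongTensor([16, 23])]

# Pad all sequences to the length of the longest one (assumption: 0 is the padding index)
padded = pad_sequence(sequences, batch_first=True, padding_value=0)

# The original lengths are usually kept so the model can ignore the padded positions
lengths = torch.tensor([len(s) for s in sequences])

print(padded)
print(lengths)
```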

---

## Training & Evaluating the RNN Model

With the training and test set prepared, we are now ready to build our RNN-based sentiment classification model.

### Auxiliary Methods

#### Evaluation

The code cell below implements the method `evaluate()` to, well, evaluate our model. Apart from the model itself, the method also receives the data loader as an input parameter. This allows us later to use both `loader_train` and `loader_test` to compute the f1 score on the training and the test set using the same method.

The method is very generic and is not specific to the dataset. It simply loops over all batches of the data loader, computes the log probabilities, uses these log probabilities to derive the predicted class labels, and compares the predictions with the ground truth to return the f1 score. This means that this method could be used "as is" or easily be adapted for all kinds of classification tasks (incl. tasks with more than 2 classes, where `f1_score` needs a suitable `average` setting such as `"macro"`).

In [None]:
def evaluate(model, loader):
    
    y_true, y_pred = [], []
    
    with tqdm(total=len(loader)) as progress_bar:

        for X_batch, y_batch in loader:
            batch_size, seq_len = X_batch.shape[0], X_batch.shape[1]
            
            # Move the batch to the correct device
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)
            
            # Initialize the first hidden state h0 (and move to device)
            hidden = model.init_hidden(batch_size)

            if type(hidden) is tuple:
                hidden = (hidden[0].to(device), hidden[1].to(device))  # LSTM
            else:
                hidden = hidden.to(device)  # RNN, GRU
                    
            # Use model to compute log probabilities for each class
            log_probs = model(X_batch, hidden)

            # Pick class with the highest log probability
            y_batch_pred = torch.argmax(log_probs, dim=1)

            y_true += list(y_batch.cpu().numpy())
            y_pred += list(y_batch_pred.cpu().numpy())
            
            # Update progress bar
            progress_bar.update(batch_size)

    # Return final f1 score
    return f1_score(y_true, y_pred)
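
The core of `evaluate()`, going from log probabilities to predicted labels to an f1 score, can be illustrated with a handful of made-up values. The probabilities and labels below are purely hypothetical.

```python
import torch
from sklearn.metrics import f1_score

# Hypothetical log probabilities for 4 samples and 2 classes
log_probs = torch.log(torch.tensor([[0.9, 0.1],
                                    [0.2, 0.8],
                                    [0.6, 0.4],
                                    [0.3, 0.7]]))
y_true = torch.tensor([0, 1, 1, 1])

# Pick the class with the highest log probability for each sample
y_pred = torch.argmax(log_probs, dim=1)

# Binary f1 score w.r.t. the positive class (label 1); this is the default setting
score = f1_score(y_true.numpy(), y_pred.numpy())

# For more than 2 classes, an averaging strategy has to be specified, e.g., "macro"
score_macro = f1_score(y_true.numpy(), y_pred.numpy(), average="macro")

print(y_pred, score, score_macro)
```

Here the third sample is misclassified, so the precision for the positive class is 1.0 and the recall is 2/3, which gives a binary f1 score of 0.8.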

#### Training (single epoch)

Similar to the method `evaluate()`, we also implement a method `train_epoch()` to wrap all the steps required for training. This has the advantage that we can simply call `train_epoch()` multiple times to proceed with the training. Apart from the model, this method has the following input parameters:

* `optimizer`: the optimizer specifies how the computed gradients are used to update the weights; in the lecture, we only covered the basic Stochastic Gradient Descent, but there are much more efficient alternatives available

* `criterion`: this is the loss function; "criterion" is just very common terminology in the PyTorch documentation and tutorials

The heart of the method is the snippet described as PyTorch Magic. It consists of the following 3 lines of code:

* `optimizer.zero_grad()`: After each training step for a batch, we have to set the gradients back to zero for the next batch

* `loss.backward()`: Calculating all gradients using backpropagation

* `optimizer.step()`: Update all weights using the gradients and the method of the specific optimizer

In [None]:
def train_epoch(model, loader, optimizer, criterion):
    
    # Initialize epoch loss (cumulative loss for all batches)
    epoch_loss = 0.0

    with tqdm(total=len(loader)) as progress_bar:

        for X_batch, y_batch in loader:
            batch_size, seq_len = X_batch.shape[0], X_batch.shape[1]

            # Move the batch to the correct device
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)

            # Initialize the first hidden state h0 (and move to device)
            hidden = model.init_hidden(batch_size)

            if type(hidden) is tuple:
                hidden = (hidden[0].to(device), hidden[1].to(device))  # LSTM
            else:
                hidden = hidden.to(device)  # RNN, GRU            
            
            log_probs = model(X_batch, hidden)

            # Calculate loss
            loss = criterion(log_probs, y_batch)
            
            ### Pytorch magic! ###
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # Keep track of overall epoch loss
            epoch_loss += loss.item()

            progress_bar.update(batch_size)
            
    return epoch_loss
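
To see the three "PyTorch magic" lines in isolation, here is a minimal sketch on a tiny stand-in model with random data. The linear model, the data, and the learning rate are assumptions chosen for illustration and have nothing to do with the RNN classifier used in this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny stand-in model that outputs log probabilities for 2 classes
model = nn.Sequential(nn.Linear(4, 2), nn.LogSoftmax(dim=1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.NLLLoss()

# Random toy batch: 8 samples with 4 features each, and random binary labels
X = torch.randn(8, 4)
y = torch.randint(0, 2, (8,))

loss_before = criterion(model(X), y).item()

for _ in range(20):
    loss = criterion(model(X), y)
    optimizer.zero_grad()  # reset the gradients from the previous step
    loss.backward()        # compute the gradients via backpropagation
    optimizer.step()       # update the weights using the gradients

loss_after = criterion(model(X), y).item()

print(loss_before, loss_after)
```

Without the call to `optimizer.zero_grad()`, the gradients of consecutive steps would be summed up, since PyTorch accumulates gradients by default.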

#### Training (multiple epochs)

The `train()` method combines the training and evaluation of a model epoch by epoch. The method keeps track of the loss, the training score, and the test score for each epoch. This allows us to plot the results later; see below. Notice the calls of `model.train()` and `model.eval()` to set the model into the correct "mode". This is needed since our model contains a Dropout layer. For more details, check out this [Stackoverflow post](https://stackoverflow.com/questions/51433378/what-does-model-train-do-in-pytorch).

In [None]:
def train(model, loader_train, loader_test, optimizer, criterion, num_epochs, verbose=False):
    
    results = []
    
    print("Total Training Time (total number of epochs: {})".format(num_epochs))
    for epoch in range(1, num_epochs+1):
        model.train()
        epoch_loss = train_epoch(model, loader_train, optimizer, criterion)
        model.eval()
        f1_train = evaluate(model, loader_train)
        f1_test = evaluate(model, loader_test)

        results.append((epoch_loss, f1_train, f1_test))
        
        if verbose is True:
            print("[Epoch {}] loss:\t{:.3f}, f1 train: {:.3f}, f1 test: {:.3f} ".format(epoch, epoch_loss, f1_train, f1_test))
            
    return results
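
The effect of `model.train()` and `model.eval()` on a Dropout layer can be checked directly. The following is a minimal sketch using a standalone `nn.Dropout` layer instead of the full model.

```python
import torch
import torch.nn as nn

dropout = nn.Dropout(p=0.5)
x = torch.ones(1, 1000)

# Training mode: each value is zeroed with probability p; the rest is scaled by 1/(1-p)
dropout.train()
out_train = dropout(x)

# Evaluation mode: Dropout is disabled and the input passes through unchanged
dropout.eval()
out_eval = dropout(x)

print(out_train.unique())
print(torch.equal(out_eval, x))
```

This is why evaluating a model without calling `model.eval()` first would yield noisy and typically worse predictions.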

### Building the Model

#### Create Model Instance

The class `RnnTextClassifier` implements an RNN-based classifier in a flexible manner. Among others, the following parameters can be set:

* Which recurrent cell to use: `nn.RNN`, `nn.GRU`, or `nn.LSTM`

* The number of stacked recurrent layers

* Whether the recurrence is performed bi-directional or not

* The number and size of the subsequent linear layers

* ... and other various parameters.

You can and should check out the implementation of the class `RnnTextClassifier` in the file `src/rnn.py`. While the code might look a bit complex at first glance, most of the complexity is purely due to the flexibility the class provides. For example, the hidden state for `nn.LSTM` is different compared to `nn.RNN` and `nn.GRU`, and the code has to accommodate both cases. If we fixed the overall architecture of the model, the class `RnnTextClassifier` would probably comprise only half the amount of code.
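
The difference in the hidden state can be seen directly from the outputs of the PyTorch layers. The snippet below is a minimal sketch with arbitrary toy sizes; it only uses `nn.GRU` and `nn.LSTM` and not the class `RnnTextClassifier`.

```python
import torch
import torch.nn as nn

# Arbitrary toy sizes for illustration
batch_size, seq_len, embed_size, hidden_size = 3, 5, 8, 16
inputs = torch.randn(batch_size, seq_len, embed_size)

gru = nn.GRU(embed_size, hidden_size, num_layers=2, bidirectional=True, batch_first=True)
lstm = nn.LSTM(embed_size, hidden_size, num_layers=2, bidirectional=True, batch_first=True)

_, gru_hidden = gru(inputs)
_, lstm_hidden = lstm(inputs)

# GRU (and RNN): a single tensor of shape (num_layers * num_directions, batch, hidden)
print(type(gru_hidden).__name__, tuple(gru_hidden.shape))

# LSTM: a tuple (hidden state, cell state), both with the same shape as above
print(type(lstm_hidden).__name__, tuple(lstm_hidden[0].shape), tuple(lstm_hidden[1].shape))
```

This is the reason why `evaluate()` and `train_epoch()` check whether `hidden` is a tuple before moving it to the device.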

In [None]:
params = {
    "vocab_size": vocab_size,                  # the size of the vocabulary determines the input size of the embedding
    "embed_size": 300,                         # needs to be 300 if we want to use the pretrained word embeddings
    "rnn_cell": "GRU",                         # in practice, GRU or LSTM usually outperform a vanilla RNN
    "rnn_num_layers": 2,                       # 1 or 2 layers are most common; more layers rarely bring any benefit
    "rnn_bidirectional": True,                 # if True, we go over each sequence from both directions
    "rnn_hidden_size": 512,                    # size of the hidden state
    "rnn_dropout": 0.5,                        # only relevant if rnn_num_layers > 1
    "dot_attention": False,                    # if True, use attention
    "linear_hidden_sizes": [128, 64],          # list of sizes of subsequent hidden layers; can be [] (empty)!
    "linear_dropout": 0.5,                     # if hidden linear layers are used, we can also include Dropout
    "output_size": 2                           # we only have two sentiment classes
}

# Define model parameters
params = Dict2Class(params)
# Create model   
rnn = RnnTextClassifier(params).to(device)
# Define optimizer
optimizer = torch.optim.Adam(rnn.parameters(), lr=0.0001)
# Define loss function
criterion = nn.NLLLoss()
# Print the model
print(rnn)
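
The loss function `nn.NLLLoss` expects log probabilities as input, which matches the output of our model. As a side note, applying `nn.NLLLoss` to log-softmax outputs yields the same value as applying `nn.CrossEntropyLoss` to the raw scores (logits). The following minimal sketch with made-up logits illustrates this.

```python
import torch
import torch.nn as nn

# Hypothetical raw scores (logits) for 2 samples and 2 classes
logits = torch.tensor([[2.0, -1.0],
                       [0.5, 1.5]])
targets = torch.tensor([0, 1])

# Variant 1: log-softmax followed by the negative log-likelihood loss
log_probs = torch.log_softmax(logits, dim=1)
nll = nn.NLLLoss()(log_probs, targets)

# Variant 2: cross-entropy loss applied directly to the logits
ce = nn.CrossEntropyLoss()(logits, targets)

print(nll.item(), ce.item())
```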

#### Set Pretrained Word Embeddings (optional)

If we want to use pre-trained word embeddings, e.g., Word2Vec, this is the moment to do so. A source for pre-trained word embeddings is [this site](http://vectors.nlpl.eu/repository/). When downloading a file containing pre-trained word embeddings, there are some things to consider:

* Most obviously, the pre-trained embeddings should match the language (here: English).

* The pretrained embeddings should match the preprocessing steps. For example, we lemmatized our dataset for this notebook (at least by default, maybe you have changed it). So we need embeddings trained over a lemmatized dataset as well.

* The pretrained embeddings have to match the size of our embedding layer. So if we create an embedding layer of size 300, we have to use pretrained embeddings of the same size.

* The files with the pretrained embeddings are too large to ship with the notebooks, so you have to download them separately :)

First, we need to load the pretrained embeddings from the file; here I used [this file](http://vectors.nlpl.eu/repository/20/5.zip) (lemmatized, size: 300):

In [None]:
pretrained_vectors = torchtext.vocab.Vectors("data/embeddings/model.txt")

Now we have over 270k pretrained word embeddings, but we only have 20k words in our vocabulary. So we need to create an embedding -- which is basically just a $20k \times 300$ matrix containing the respective 20k pretrained word embeddings for our vocabulary.

In [None]:
pretrained_embedding = pretrained_vectors.get_vecs_by_tokens(vocabulary.get_itos())
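
Conceptually, this step looks up the pretrained vector for each token of the vocabulary and stacks the results into a matrix. The following minimal sketch mimics this with a plain dictionary; the tokens, the vectors, and the embedding size of 4 are made up, and tokens without a pretrained vector are assumed to be mapped to a zero vector (which, to the best of my knowledge, is also the default behavior of `torchtext`).

```python
import torch

embed_size = 4  # toy size; the notebook uses 300

# Hypothetical pretrained vectors for a few words
pretrained = {
    "good": torch.ones(embed_size),
    "bad": -torch.ones(embed_size),
}

# Hypothetical vocabulary as a list mapping index -> token
itos = ["<unk>", "good", "bad", "movie"]

# One row per vocabulary token; unknown tokens get a zero vector (assumption)
rows = [pretrained.get(token, torch.zeros(embed_size)) for token in itos]
embedding_matrix = torch.stack(rows)

print(embedding_matrix.shape)
```

The row order follows the vocabulary indices, so row `i` holds the vector of the token with index `i`.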

Now we can set the weights of the embedding layer of our model to the pretrained weights.

In [None]:
rnn.embedding.weight.data = pretrained_embedding

Lastly, we can decide if we want the pretrained embeddings to remain fixed or whether we want to update them during training. By setting `.requires_grad = False`, we "freeze" the layer, i.e., no gradients are computed for the embedding weights and they are **not** updated during training. You should observe that if we freeze the embedding layer, the training and test f1 scores will remain quite similar; otherwise the training f1 score will go towards 1.0, indicating overfitting. Simply try both settings and compare the results.

In [None]:
#rnn.embedding.weight.requires_grad = False
rnn.embedding.weight.requires_grad = True
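
The effect of freezing can be verified on a small stand-in model. The sketch below uses `nn.Embedding.from_pretrained()`, an alternative way to initialize and freeze an embedding layer in one step; the random "pretrained" matrix and the linear layer are assumptions made for this example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical pretrained embedding matrix: 10 tokens, embedding size 4
pretrained = torch.randn(10, 4)

# freeze=True sets requires_grad=False for the embedding weights
embedding = nn.Embedding.from_pretrained(pretrained.clone(), freeze=True)
linear = nn.Linear(4, 2)
linear_before = linear.weight.detach().clone()

# Only pass the trainable parameters to the optimizer
all_params = list(embedding.parameters()) + list(linear.parameters())
optimizer = torch.optim.SGD([p for p in all_params if p.requires_grad], lr=0.1)

# One training step on a made-up batch with a dummy loss
tokens = torch.tensor([[1, 2, 3]])
loss = linear(embedding(tokens)).sum()
optimizer.zero_grad()
loss.backward()
optimizer.step()

print(torch.equal(embedding.weight, pretrained))     # embedding is unchanged
print(torch.equal(linear.weight, linear_before))     # linear layer was updated
```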

Since the embedding weights still reside on the CPU, we move the model to the respective device again so that the model and all data are indeed on the same device.

In [None]:
rnn.to(device)

#### Evaluate Untrained Model

Let's first see how our model performs when untrained, i.e., with the initial random weights.

In [None]:
f1_test = evaluate(rnn, loader_test)

print('F1 score for untrained model: {:.3f}'.format(f1_test))

Since our dataset is perfectly balanced w.r.t. the 2 class labels (50% positive, 50% negative), and assuming that an untrained model behaves like a random guesser, we would expect an f1 score of around 0.5. Of course, depending on the random initialization, our model might perform a bit better or worse than a random guess. In principle, even an f1 score of 0.0 is possible -- for example, this can happen if the weights are initialized in such a way that all predictions are of the negative class.
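
These baseline values can be checked with a few lines of code. The snippet below is a minimal sketch on a made-up balanced label vector; it compares a random guesser with the two degenerate classifiers that always predict the same class.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical balanced ground truth: 50 negative (0) and 50 positive (1) labels
y_true = np.array([0, 1] * 50)

# Random guesser: the f1 score fluctuates around 0.5
rng = np.random.default_rng(0)
random_guess = rng.integers(0, 2, size=len(y_true))
f1_random = f1_score(y_true, random_guess)

# Always predicting the negative class: no true positives, so the f1 score is 0
all_negative = np.zeros_like(y_true)
f1_all_negative = f1_score(y_true, all_negative, zero_division=0)

# Always predicting the positive class: precision 0.5, recall 1.0
all_positive = np.ones_like(y_true)
f1_all_positive = f1_score(y_true, all_positive)

print(f1_random, f1_all_negative, f1_all_positive)
```

Note that always predicting the positive class yields an f1 score of 2/3, since the binary f1 score only considers the positive class.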

### Full Training (and evaluation after each epoch)

Using the auxiliary methods and all components (i.e., loss function, optimizer) defined above, we can finally train our model by calling the method `train()`. Note that you can run the code cell below multiple times to continue the training for a further 10 epochs. Each epoch will print 3 progress bars:

* training over training set

* evaluating over training set

* evaluating over test set

After each epoch, a print statement will show the current loss as well as the latest f1 scores for the training and test set.

In [None]:
num_epochs = 10

results = train(rnn, loader_train, loader_test, optimizer, criterion, num_epochs, verbose=True)

### Plotting the Results

Since the method `train()` returns the losses and f1 scores for each epoch, we can use this data to visualize how the loss and the f1 scores change over time, i.e., after each epoch. In `src.utils` you can find the method `plot_training_results()` to plot the losses and f1 scores (training + test) over time.

In [None]:
plot_training_results(results)

The result and the plot will heavily depend on the exact parameter setting and whether the pretrained word embeddings get updated or not. In general, however, you should always see the loss going down and (at least) the training f1 score going up. Usually the test f1 score will also go up, at least in the beginning. Of course, if you increase the number of epochs, you are likely to see signs of overfitting with the test f1 score starting to go down again.
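
Besides plotting, the `results` list can also be inspected programmatically, for example, to find the epoch after which the test f1 score starts to drop. The following is a minimal sketch; the numbers are made up purely for illustration and are **not** actual results of this notebook.

```python
# Made-up results in the format returned by train(): (epoch_loss, f1_train, f1_test)
results = [
    (50.1, 0.71, 0.70),
    (41.3, 0.80, 0.78),
    (33.0, 0.88, 0.81),
    (25.4, 0.94, 0.80),
    (18.9, 0.98, 0.79),
]

# Split the list of tuples into three separate sequences
losses, f1_train, f1_test = zip(*results)

# Epoch (1-based) with the highest test f1 score
best_epoch = max(range(len(f1_test)), key=lambda i: f1_test[i]) + 1

print("Best epoch: {} (f1 test: {:.3f})".format(best_epoch, f1_test[best_epoch - 1]))
```

In this made-up example, the training f1 score keeps increasing while the test f1 score peaks in epoch 3, which is the typical pattern of overfitting described above.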

---

## Summary

Recurrent Neural Networks (RNNs) can be effectively used to train models for sentiment analysis, which involves determining the sentiment or emotion expressed in a piece of text. RNNs are particularly suitable for this task because they can capture sequential information and dependencies within the text. Here's a general approach to training an RNN model for sentiment analysis:

* **Data Preparation:** Prepare your dataset by splitting it into training and testing sets. Each input sample should consist of a text sequence (e.g., a sentence or a paragraph) and its corresponding sentiment label (positive or negative).

* **Text Preprocessing:** Perform necessary preprocessing steps on the text data, such as tokenization (breaking text into individual words), removing stopwords, and converting words to lowercase. We performed this and the previous step in a separate notebook.

* **Word Embeddings:** Transform the preprocessed words into word embeddings. You can use pretrained word embeddings like Word2Vec or GloVe, or train your own embeddings specific to your dataset.

* **Sequence Padding:** Pad or truncate the text sequences to a fixed length, ensuring that all input sequences have the same length. This is necessary to create uniform input for the RNN model. In this notebook, we used a custom approach to organize our batches in such a way that all sequences in the batch are guaranteed to have the same lengths.

* **Model Architecture:** Define the architecture of your RNN model. A common choice is the Long Short-Term Memory (LSTM) network, which is a variant of RNN that can effectively capture long-term dependencies.

* **Model Building:** Construct your RNN model using frameworks like TensorFlow or PyTorch. The model typically consists of an embedding layer to convert word indices to word vectors, followed by one or more LSTM layers, and finally, a dense layer for sentiment classification.

* **Training:** Train the RNN model on the training data using techniques like backpropagation and gradient descent. The model learns to capture the patterns and relationships between words in the text sequences and their corresponding sentiments.

* **Evaluation:** Evaluate the trained model on the testing data to assess its performance. Common evaluation metrics for sentiment analysis include accuracy, precision, recall, and F1 score.

* **Inference:** Once the model is trained and evaluated, you can use it to predict the sentiment of new, unseen text data. The model takes a text sequence as input, processes it through the learned network, and outputs the predicted sentiment label.
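
The model building and inference steps above can be sketched with a deliberately tiny model. Note that this is not the class `RnnTextClassifier` used in this notebook: the class `TinyGruClassifier`, the toy vocabulary, and all sizes are assumptions made for illustration, and since the model is untrained, its prediction is meaningless.

```python
import torch
import torch.nn as nn

class TinyGruClassifier(nn.Module):
    """Hypothetical minimal classifier: embedding -> GRU -> linear -> log-softmax."""

    def __init__(self, vocab_size, embed_size, hidden_size, output_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.gru = nn.GRU(embed_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, output_size)

    def forward(self, inputs):
        embedded = self.embedding(inputs)   # (batch, seq_len, embed_size)
        _, hidden = self.gru(embedded)      # hidden: (1, batch, hidden_size)
        logits = self.out(hidden[-1])       # (batch, output_size)
        return torch.log_softmax(logits, dim=1)

# Hypothetical toy vocabulary mapping tokens to indices
vocab = {"<unk>": 0, "great": 1, "movie": 2, "boring": 3}

model = TinyGruClassifier(vocab_size=len(vocab), embed_size=8, hidden_size=16, output_size=2)
model.eval()

# Vectorize a new text; unknown words are mapped to the <unk> index
tokens = "great movie".split()
indices = torch.LongTensor([[vocab.get(token, vocab["<unk>"]) for token in tokens]])

# No gradients are needed for inference
with torch.no_grad():
    log_probs = model(indices)

predicted_label = torch.argmax(log_probs, dim=1).item()
print(log_probs, predicted_label)
```

In a real application, the new text would of course have to go through the same preprocessing and vectorization steps as the training data.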

It's worth noting that the above steps provide a high-level overview, and there can be variations and additional considerations depending on the specific requirements of your sentiment analysis task. However, this general approach using RNNs serves as a foundation for building sentiment analysis models that can effectively analyze and classify sentiment in textual data.