<img src="data/images/div/lecture-notebook-header.png" />

# Sentiment Analysis -- Recurrent Neural Networks (RNNs)

Recurrent Neural Networks (RNNs) are a class of neural networks designed to effectively handle sequential data by retaining memory or context of previous inputs. Unlike feedforward neural networks that process data in fixed-size input vectors, RNNs have loops within their architecture, allowing them to maintain and utilize information about previous inputs while processing the current input.

RNNs are composed of units (often called cells) that maintain a hidden state. This hidden state acts as a memory that retains information about previous inputs in the sequence. Each unit performs computations based on the current input and its previous hidden state, allowing them to capture temporal dependencies in sequential data.
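
To make the role of the hidden state concrete, here is a minimal sketch of a single vanilla RNN cell written in plain NumPy. The sizes, the weight names (`W_xh`, `W_hh`, `b_h`), and the random toy inputs are illustrative assumptions and are not taken from the model used later in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

embed_size, hidden_size = 4, 3   # illustrative sizes (assumptions)

# Randomly initialized parameters of a single vanilla RNN cell
W_xh = rng.normal(scale=0.1, size=(hidden_size, embed_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """One recurrence step: combine the current input with the previous hidden state."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# A toy "sentence" of 5 word vectors
sequence = rng.normal(size=(5, embed_size))

h = np.zeros(hidden_size)        # initial hidden state h0
for x_t in sequence:
    h = rnn_step(x_t, h)         # h now depends on all inputs seen so far

print(h.shape)
```

The same weights are reused at every step; only the hidden state `h` changes as the sequence is consumed.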

For text classification tasks, RNNs can be used in various ways:

* **Word-level RNNs:** Each word in a text sequence is fed into the RNN step by step. The hidden state of the RNN unit at each step incorporates information about the previous words in the sequence. This way, the RNN learns to capture the context and dependencies between words in the text.

* **Sequence-to-Sequence RNNs:** These models take an entire sequence as input and produce another sequence as output. In text classification, this could involve using an RNN to read an entire sentence or document and outputting a sentiment label or category.

* **Sentiment Analysis:** In sentiment analysis, RNNs can be employed to classify the sentiment of text documents (positive, negative, neutral). The RNN processes the words or sequences of words in the document, learning patterns and relationships to determine the sentiment expressed.

However, vanilla RNNs suffer from issues like vanishing or exploding gradients, which can hinder their ability to capture long-term dependencies in text. To address these limitations, variants of RNNs, such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU), have been developed. These architectures have gating mechanisms that control the flow of information, allowing them to better capture long-range dependencies in text sequences.
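
The vanishing and exploding gradient problem can be illustrated with a deliberately simplified toy: a scalar, purely linear recurrence `h_t = w * h_{t-1}`. This is an assumption made for illustration only (a real RNN has weight matrices and nonlinearities), but it shows why the gradient with respect to an early state shrinks or grows exponentially with the number of time steps.

```python
def gradient_factor(w, num_steps):
    """Compute d h_T / d h_0 for the toy linear recurrence h_t = w * h_{t-1}."""
    factor = 1.0
    for _ in range(num_steps):
        factor *= w   # chain rule: one factor of w per time step
    return factor

for w in (0.5, 1.0, 1.5):
    print("w = {}: gradient factor after 20 steps = {:.3e}".format(w, gradient_factor(w, 20)))
```

With `w = 0.5` the factor drops below `1e-6` after 20 steps, while with `w = 1.5` it exceeds 3000. Only a weight of exactly 1 keeps the gradient stable in this toy setting.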

In summary, RNNs, including LSTM and GRU variants, are powerful for text classification tasks because they can capture sequential dependencies, understand context, and make predictions based on the order and structure of text data. Their ability to retain memory and handle sequential information makes them well-suited for tasks where understanding the context of words or phrases is essential, such as sentiment analysis, named entity recognition, machine translation, and more.

## Setting up the Notebook

### Required Packages

In [None]:
%reload_ext autoreload
%autoreload 2

In [None]:
import numpy as np
from tqdm import tqdm
import random
from collections import Counter, OrderedDict

from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

import torch
import torch.nn as nn
import torchtext
from torch.utils.data import Dataset, DataLoader
from torchtext.vocab import vocab

# Custom BatchSampler
from src.sampler import EqualLengthsBatchSampler
from src.utils import Dict2Class, plot_training_results
from src.rnn import RnnType, RnnTextClassifier, DotAttentionClassification

import spacy
spacy.prefer_gpu()
# We use spaCy for preprocessing, but we only need the tokenizer and lemmatizer
# (for a large real-world dataset that would help with the performance)
nlp = spacy.load("en_core_web_sm", disable=['ner', 'parser'])

### Checking/Setting the Device

PyTorch allows you to train neural networks on a supported GPU, which can significantly speed up the training process. If you have a supported GPU, feel free to use it. For this notebook, however, it is not needed, as our dataset is small and our network model is very simple. In fact, training may be just as fast on the CPU here, since initializing memory on the GPU and moving the data to the GPU involves some overhead.

In [None]:
use_cuda = torch.cuda.is_available()

# Use this line below to enforce the use of the CPU (in case you don't have a supported GPU)
# With this small dataset and simple model you won't see a difference anyway
#use_cuda = False

device = torch.device("cuda:0" if use_cuda else "cpu")

print("Available device: {}".format(device))

---

## Generate Dataset

### Sentence Polarity Dataset

The [sentence polarity dataset](https://www.kaggle.com/datasets/nltkdata/sentence-polarity) is a well-known dataset commonly used for sentiment analysis and text classification tasks in NLP. It consists of sentences or short texts labeled with their corresponding sentiment polarity (positive or negative). This dataset is often used to train and evaluate models that aim to classify text into positive or negative sentiment categories. It serves as a benchmark for sentiment analysis tasks and provides a standardized dataset for researchers and practitioners to compare and evaluate the performance of different algorithms and techniques.

There are several versions and variations of the sentence polarity dataset available, created for different purposes and domains. One of the popular versions is the Movie Review Dataset, also known as the Pang and Lee dataset, created by Bo Pang and Lillian Lee. This dataset contains movie reviews from the website IMDb, with each review labeled as positive or negative. The sentence polarity dataset enables researchers and developers to build and test sentiment analysis models that can automatically determine the sentiment expressed in text, allowing applications such as sentiment monitoring, opinion mining, and customer feedback analysis.

For this notebook, we have already prepared the dataset by combining the two files containing the positive and negative sentences into a single file. The polarity of each sentence is denoted by a label: `1` for positive and `-1` for negative. This makes handling the data a bit simpler and keeps the notebook a bit cleaner.

### Auxiliary Method

The method `preprocess()` tokenizes a given text. In this case, we not only tokenize but also lemmatize and lowercase all tokens. In practice, the exact list of preprocessing steps depends on the task, but this is what we do here. Notice that we do not, for example, remove stopwords; this is mainly to avoid reducing the vocabulary size too much.

In [None]:
def preprocess(text):
    return [token.lemma_.lower() for token in nlp(text)]

preprocess("This is a test to see if the TOKENIZER does its job.")

### Read Files & Compute Word Frequencies

The first step is to go through the whole corpus and count the number of occurrences of each token. 10k sentences is basically nothing these days, but large-scale data is not the focus of this notebook, and the steps would be exactly the same.

In [None]:
token_counter = Counter()

targets_polarity = []

with tqdm(total=10662) as pbar:
    
    # Loop over each sentence (1 sentence per line)
    with open('data/datasets/sentence-polarities/sentence-polarities.csv', encoding='utf-8') as file:
        for line in file:
            line = line.strip()
            parts = line.split('\t')
            sentence, label = parts[0], int(parts[1])
            # Update token counts
            for token in preprocess(sentence):
                token_counter[token] += 1            
            # Add label to targets list
            targets_polarity.append(label)
            # Update progress bar
            pbar.update(1)

### Create Vocabulary

To create our `vocab` object, we first sort all tokens by their frequency. With fewer than 20k tokens, our "full" vocabulary is still rather small, but we nevertheless limit the vocabulary here to the 10,000 most frequent tokens.


In [None]:
# Sort by word frequency
token_counter_sorted = sorted(token_counter.items(), key=lambda x: x[1], reverse=True)

print("Number of tokens: {}".format(len(token_counter_sorted)))

In [None]:
TOP_TOKENS = 10000

token_counter_sorted = token_counter_sorted[:TOP_TOKENS]

print("Number of tokens: {}".format(len(token_counter_sorted)))

In [None]:
token_ordered_dict = OrderedDict(token_counter_sorted)

# Define list of "special" tokens
SPECIALS = ["<PAD>", "<UNK>", "<SOS>", "<EOS>"]

vocabulary = vocab(token_ordered_dict, specials=SPECIALS)

vocabulary.set_default_index(vocabulary["<UNK>"])

print("Number of tokens: {}".format(len(vocabulary)))
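
To illustrate what the vocabulary does conceptually, the following sketch rebuilds the token-to-index mapping with plain Python dictionaries. The helper names `build_vocab()` and `lookup_indices()` as well as the toy sentence are hypothetical and only mimic the behavior used above (special tokens first, then tokens by frequency, and `<UNK>` as the fallback index); they are not the torchtext implementation.

```python
from collections import Counter

SPECIALS = ["<PAD>", "<UNK>", "<SOS>", "<EOS>"]

def build_vocab(token_counts, top_tokens):
    """Map the special tokens and the most frequent tokens to integer indices."""
    most_common = [tok for tok, _ in token_counts.most_common(top_tokens)]
    return {tok: idx for idx, tok in enumerate(SPECIALS + most_common)}

def lookup_indices(token2idx, tokens):
    """Convert tokens to indices; unknown tokens fall back to <UNK>."""
    unk = token2idx["<UNK>"]
    return [token2idx.get(tok, unk) for tok in tokens]

# Toy corpus: "the" occurs 4 times, "be" 3 times, "movie" twice, the rest once
counts = Counter("the movie be good the movie be thin the end be the".split())
token2idx = build_vocab(counts, top_tokens=3)

print(token2idx)
print(lookup_indices(token2idx, ["the", "movie", "be", "boring"]))  # [4, 6, 5, 1]
```

The token "boring" is not among the top 3 tokens and is therefore mapped to index 1, the index of `<UNK>`.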

### Save Dataset

Lastly, we save all the data for later use, and so we don't have to recompute it every time.

#### Vectorize and Save Dataset

In [None]:
output_file = open("data/datasets/sentence-polarities/polarity-dataset-vectors-{}.txt".format(TOP_TOKENS), "w")

with tqdm(total=10662) as pbar:
    
    # Loop over each sentence (1 sentence per line)
    with open('data/datasets/sentence-polarities/sentence-polarities.csv', encoding='utf-8') as file:
        for line in file:
            line = line.strip()
            parts = line.split('\t')
            sentence, label = parts[0], int(parts[1])
            # Convert labels from -1/1 to 0/1
            label = int((label + 1) / 2)
            # Convert sentence into sequence of word indices
            vector = vocabulary.lookup_indices(preprocess(sentence))
            # Write converted sequence and label to file
            output_file.write("{}\t{}\n".format(" ".join([str(idx) for idx in vector]), label))
            # Update progress bar
            pbar.update(1)

output_file.flush()
output_file.close()   

#### Save Metadata

In [None]:
vocabulary_file_name = "data/datasets/sentence-polarities/polarity-corpus-{}.vocab".format(TOP_TOKENS)

torch.save(vocabulary, vocabulary_file_name)

---

## Prepare Dataset for Training

While RNNs allow for arbitrary lengths -- as long as all sequences in the same batch are of the same length -- it is often practical to limit the maximum length of sequences. This is not only for computational reasons; it also gets more and more difficult to propagate meaningful gradients back during Backpropagation Through Time (BPTT).

For the sentence dataset, this is hardly an issue, since individual sentences are usually not overly long. However, movie reviews consist of several sentences. Note that by limiting ourselves to the first `MAX_LENGTH` words, we assume that the main sentiment is expressed at the beginning of the review. If we instead assumed that the end of a review matters most, we should keep the last `MAX_LENGTH` words.

In the code cell below, we set `MAX_LENGTH` to 100, but feel free to play with this value. When loading the data from the files, we directly cut all sequences longer than `MAX_LENGTH` down to the specified value. This also means that we won't have to check the sequence lengths anymore when training or evaluating a model (compared to CNN).

In [None]:
MAX_LENGTH = 100
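
The two truncation strategies discussed above (keeping the first or the last `MAX_LENGTH` tokens) differ only in the slice that is applied. The helper `truncate()` below is a hypothetical illustration with a deliberately small limit; the notebook itself only keeps the first tokens when loading the data.

```python
def truncate(sequence, max_length, keep="first"):
    """Cut a sequence down to at most max_length items (max_length must be > 0)."""
    if keep == "first":
        return sequence[:max_length]
    if keep == "last":
        return sequence[-max_length:]
    raise ValueError("keep must be 'first' or 'last'")

toy_sequence = list(range(10, 18))   # 8 word indices: 10, 11, ..., 17

print(truncate(toy_sequence, 5, keep="first"))  # [10, 11, 12, 13, 14]
print(truncate(toy_sequence, 5, keep="last"))   # [13, 14, 15, 16, 17]
print(truncate([1, 2], 5))                      # [1, 2] (short sequences are unchanged)
```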

### Dataset A: Sentence Polarity

#### Load Data from File

In [None]:
vocabulary = torch.load("data/datasets/sentence-polarities/polarity-corpus-10000.vocab")

vocab_size = len(vocabulary)

print("Size of vocabulary:\t{}".format(vocab_size))

In [None]:
sequences, targets = [], []

with open("data/datasets/sentence-polarities/polarity-dataset-vectors-10000.txt") as file:
    for line in file:
        line = line.strip()
        # The input sequences and class labels are separated by a tab
        sequence, label = line.split("\t")
        # Convert sequence string to a list of integers (reflecting the indices in the vocabulary)
        sequence = [int(idx) for idx in sequence.split()]
        # Convert each sequence into a tensor
        sequence = torch.LongTensor(sequence[:MAX_LENGTH])
        # Add sequence and label to the respective lists
        sequences.append(sequence)
        targets.append(int(label))
        
# As targets is just a list of class labels, we can directly convert it into a tensor
targets = torch.LongTensor(targets)

### Create Training & Test Set

To evaluate any classifier, we need to split our dataset into a training and a test set. With the method `train_test_split()` this is very easy to do; this method also shuffles the dataset by default, which is important for this example, since the dataset file is ordered with all positive sentences coming first. In the example below, we set the size of the test set to 50% (`test_size=0.5`).

In [None]:
X_train, X_test, y_train, y_test = train_test_split(sequences, targets, test_size=0.5, shuffle=True, random_state=0)

print("Number of training samples:\t{}".format(len(X_train)))
print("Number of test samples:\t\t{}".format(len(X_test)))

### Create Dataset Class

We first create a simple [`Dataset`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset). This class only stores our `inputs` and `targets` and needs to implement the `__len__()` and `__getitem__()` methods. Since our class extends the abstract class [`Dataset`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset), we can use an instance later to create a [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader).

Without going into too much detail, this approach not only allows for cleaner code but also supports parallel processing on many CPUs or on the GPU, and helps to optimize data transfer between the CPU and GPU, which is critical when processing very large amounts of data. It is therefore the recommended best practice.

In [None]:
class BaseDataset(Dataset):

    def __init__(self, inputs, targets):
        self.inputs = inputs
        self.targets = targets

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, index):
        if self.targets is None:
            return np.asarray(self.inputs[index])
        else:
            return np.asarray(self.inputs[index]), np.asarray(self.targets[index])

### Create Data Loaders

The [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) class takes a `Dataset` object as input and handles splitting the dataset into batches. The class `EqualLengthsBatchSampler` analyzes the input sequences and organizes them into groups of sequences of the same length. Each batch is then sampled from a single group, ensuring that all sequences in the batch have the same length. In the following, we use a batch size of 512, although you can easily go higher since we are dealing only with sentences.

In [None]:
batch_size = 512

dataset_train = BaseDataset(X_train, y_train)
sampler_train = EqualLengthsBatchSampler(batch_size, X_train, y_train)
loader_train = DataLoader(dataset_train, batch_sampler=sampler_train, shuffle=False, drop_last=False)

dataset_test = BaseDataset(X_test, y_test)
sampler_test = EqualLengthsBatchSampler(batch_size, X_test, y_test)
loader_test = DataLoader(dataset_test, batch_sampler=sampler_test, shuffle=False, drop_last=False)
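
The implementation of `EqualLengthsBatchSampler` lives in `src/sampler.py` and is not shown in this notebook. The following is only a rough sketch of the underlying idea, under the assumption that it groups sample indices by sequence length and then forms batches within each group; the function name `equal_length_batches()` and its parameters are hypothetical.

```python
import random
from collections import defaultdict

def equal_length_batches(sequences, batch_size, seed=0):
    """Return lists of sample indices; each list refers to sequences of the same length."""
    # Group the indices of all sequences by sequence length
    groups = defaultdict(list)
    for idx, seq in enumerate(sequences):
        groups[len(seq)].append(idx)

    rng = random.Random(seed)
    batches = []
    for indices in groups.values():
        rng.shuffle(indices)   # shuffle within each length group
        for start in range(0, len(indices), batch_size):
            batches.append(indices[start:start + batch_size])
    rng.shuffle(batches)       # shuffle the order of the batches
    return batches

toy = [[1, 2], [3, 4, 5], [6, 7], [8, 9, 10], [11, 12], [13]]
batches = equal_length_batches(toy, batch_size=2)

for batch in batches:
    print([toy[i] for i in batch])
```

Because all sequences within a batch share the same length, no padding is needed when stacking them into a single tensor. The trade-off is that some batches end up smaller than `batch_size`.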

## Train & Evaluate Model

### Auxiliary Methods

#### Evaluate

The code cell below implements the method `evaluate()` to evaluate our model. Apart from the model itself, the method also receives the data loader as an input parameter. This allows us later to use both `loader_train` and `loader_test` to compute the training and test scores using the same method.

The method is very generic and is not specific to the dataset. It simply loops over all batches of the data loader, computes the log probabilities, uses these log probabilities to derive the predicted class labels, and compares the predictions with the ground truth to return the F1 score. This means that this method could be used "as is" or easily be adapted for all kinds of classification tasks (incl. tasks with more than 2 classes).

In [None]:
def evaluate(model, loader):
    
    y_true, y_pred = [], []
    
    with tqdm(total=len(loader)) as pbar:

        for X_batch, y_batch in loader:
            batch_size, seq_len = X_batch.shape[0], X_batch.shape[1]
            
            # Move the batch to the correct device
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)
            
            # Initialize the first hidden state h0 (and move to device)
            hidden = model.init_hidden(batch_size)

            if type(hidden) is tuple:
                hidden = (hidden[0].to(device), hidden[1].to(device))  # LSTM
            else:
                hidden = hidden.to(device)  # RNN, GRU
                    
            log_probs = model(X_batch, hidden)

            y_batch_pred = torch.argmax(log_probs, dim=1)

            y_true += list(y_batch.cpu())
            y_pred += list(y_batch_pred.cpu())
            
            # The progress bar counts batches (total=len(loader)), so advance by one per batch
            pbar.update(1)

    return f1_score(y_true, y_pred)
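
For reference, the F1 score returned by `evaluate()` can be reproduced by hand for the binary case. The sketch below assumes class `1` is the positive class, which matches the default behavior of `f1_score()` (`average="binary"`, `pos_label=1`). The function name `f1_binary()` and the toy labels are made up for illustration.

```python
def f1_binary(y_true, y_pred, positive=1):
    """Compute the F1 score of the positive class from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

# 3 true positives, 1 false positive, 1 false negative
print(f1_binary(y_true, y_pred))  # 0.75
```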

#### Train Model (single epoch)

Similar to the method `evaluate()`, we also implement a method `train_epoch()` to wrap all the steps required for training. This has the advantage that we can simply call `train_epoch()` multiple times to proceed with the training. Apart from the model and the data loader, this method has the following input parameters:

* `optimizer`: the optimizer specifies how the computed gradients are used to update the weights; in the lecture, we only covered basic Stochastic Gradient Descent, but there are much more efficient alternatives available

* `criterion`: this is the loss function; "criterion" is just very common terminology in the PyTorch documentation and tutorials

The heart of the method is the snippet marked as PyTorch magic. It consists of the following 3 lines of code:

* `optimizer.zero_grad()`: Before computing new gradients, we have to set the gradients of the previous batch back to zero, since PyTorch accumulates gradients by default

* `loss.backward()`: Calculating all gradients using backpropagation

* `optimizer.step()`: Update all weights using the gradients and the method of the specific optimizer
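
The three calls listed above can be tried in isolation on a toy problem. The sketch below uses a single scalar parameter `w`, a made-up quadratic loss, and plain SGD (all assumptions for illustration; the notebook itself uses Adam) so that every number can be checked by hand.

```python
import torch

# A single trainable parameter and a plain SGD optimizer
w = torch.tensor(0.0, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

def loss_fn(w):
    return (w - 3.0) ** 2          # minimal at w = 3

optimizer.zero_grad()              # reset gradients
loss = loss_fn(w)
loss.backward()                    # gradient: 2 * (0 - 3) = -6
grad_after_backward = w.grad.item()
optimizer.step()                   # update: w = 0 - 0.1 * (-6) = 0.6
w_after_step = w.item()

# Without zero_grad(), a second backward() ADDS to the existing gradient:
# -6 + 2 * (0.6 - 3) = -10.8
loss_fn(w).backward()
accumulated_grad = w.grad.item()

print(grad_after_backward, w_after_step, accumulated_grad)  # approximately -6, 0.6, -10.8
```

The accumulated value shows why `optimizer.zero_grad()` is needed for each batch: otherwise the update would be based on the sum of the gradients of all previous batches.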

In [None]:
def train_epoch(model, loader, optimizer, criterion):
    
    # Initialize epoch loss (cumulative loss of all batches)
    epoch_loss = 0.0

    with tqdm(total=len(loader)) as pbar:

        for X_batch, y_batch in loader:
            batch_size, seq_len = X_batch.shape[0], X_batch.shape[1]

            # Move the batch to the correct device
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)

            # Initialize the first hidden state h0 (and move to device)
            hidden = model.init_hidden(batch_size)

            if type(hidden) is tuple:
                hidden = (hidden[0].to(device), hidden[1].to(device))  # LSTM
            else:
                hidden = hidden.to(device)  # RNN, GRU            
            
            log_probs = model(X_batch, hidden)

            # Calculate loss
            loss = criterion(log_probs, y_batch)
            
            ### Pytorch magic! ###
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # Keep track of overall epoch loss
            epoch_loss += loss.item()

            # The progress bar counts batches (total=len(loader)), so advance by one per batch
            pbar.update(1)
            
    return epoch_loss
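
The check `type(hidden) is tuple` in the code above reflects a difference between the PyTorch recurrent layers: `nn.RNN` and `nn.GRU` use a single hidden state tensor, while `nn.LSTM` uses a tuple of hidden state and cell state. The following sketch shows the expected shapes using the PyTorch layers directly; the sizes are arbitrary, and how `RnnTextClassifier.init_hidden()` creates these tensors internally is not shown in this notebook.

```python
import torch
import torch.nn as nn

batch_size, seq_len, embed_size, hidden_size = 3, 7, 8, 16
num_layers, num_directions = 2, 2   # 2 stacked layers, bidirectional

x = torch.randn(batch_size, seq_len, embed_size)

gru = nn.GRU(embed_size, hidden_size, num_layers=num_layers,
             bidirectional=True, batch_first=True)
lstm = nn.LSTM(embed_size, hidden_size, num_layers=num_layers,
               bidirectional=True, batch_first=True)

# Initial states have shape (num_layers * num_directions, batch_size, hidden_size)
h0 = torch.zeros(num_layers * num_directions, batch_size, hidden_size)
c0 = torch.zeros_like(h0)

out_gru, h_gru = gru(x, h0)                        # GRU: single hidden state
out_lstm, (h_lstm, c_lstm) = lstm(x, (h0, c0))     # LSTM: (hidden state, cell state)

print(out_gru.shape, h_gru.shape)
print(out_lstm.shape, h_lstm.shape, c_lstm.shape)
```

Note that the hidden state keeps the layout `(layers * directions, batch, hidden)` even when `batch_first=True`; only the input and output tensors put the batch dimension first.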

#### Train Model (multiple epochs)

The `train()` method combines the training and evaluation of a model epoch by epoch. The method keeps track of the loss, the training score, and the test score for each epoch. This allows us to plot the results later; see below. Notice the calls of `model.train()` and `model.eval()` to set the model into the correct "mode". This is needed since our model contains a Dropout layer. For more details, check out this [Stack Overflow post](https://stackoverflow.com/questions/51433378/what-does-model-train-do-in-pytorch).

In [None]:
def train(model, loader_train, loader_test, optimizer, criterion, num_epochs, verbose=False):
    
    results = []
    
    print("Total Training Time (total number of epochs: {})".format(num_epochs))
    #for epoch in tqdm(range(1, num_epochs+1)):
    for epoch in range(1, num_epochs+1):
        model.train()
        epoch_loss = train_epoch(model, loader_train, optimizer, criterion)
        model.eval()
        f1_train = evaluate(model, loader_train)
        f1_test = evaluate(model, loader_test)

        results.append((epoch_loss, f1_train, f1_test))
        
        if verbose is True:
            print("[Epoch {}] loss:\t{:.3f}, f1 train: {:.3f}, f1 test: {:.3f} ".format(epoch, epoch_loss, f1_train, f1_test))
            
    return results
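
The effect of `model.train()` and `model.eval()` can be observed on a standalone Dropout layer. This is a small sketch with a toy input of ones, independent of the classifier used in this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

dropout = nn.Dropout(p=0.5)
x = torch.ones(1000)

dropout.train()             # training mode: randomly zero out elements
out_train = dropout(x)      # surviving elements are scaled by 1 / (1 - p) = 2

dropout.eval()              # evaluation mode: dropout is switched off
out_eval = dropout(x)       # output is identical to the input

print(out_train.unique())
print(torch.equal(out_eval, x))  # True
```

In training mode, roughly half of the values are set to zero and the rest are scaled up. In evaluation mode the layer passes its input through unchanged, which is why predictions are deterministic after calling `model.eval()`.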

### Basic RNN Model

The class `RnnTextClassifier` implements an RNN-based classifier in a flexible manner. Using different parameter settings, one can set:

* Which recurrent cell to use: nn.RNN, nn.GRU, or nn.LSTM

* The number of stacked recurrent layers

* Whether the recurrence is performed bi-directional or not

* The number and size of the subsequent linear layers

* ... and various other parameters.

In [None]:
params = {
    "vocab_size": vocab_size,
    "embed_size": 300,
    "rnn_type": RnnType.GRU,
    "rnn_num_layers": 2,
    "rnn_bidirectional": True,
    "rnn_hidden_size": 512,
    "rnn_dropout": 0.5,      # only relevant if rnn_num_layers > 1
    "dot_attention": False,
    "linear_hidden_sizes": [128, 64],
    "linear_dropout": 0.5,
    "output_size": 2
}

# Define model parameters
params = Dict2Class(params)
# Create model   
rnn = RnnTextClassifier(params).to(device)
# Define optimizer
optimizer = torch.optim.Adam(rnn.parameters(), lr=0.0001)
# Define loss function
criterion = nn.NLLLoss()

print(rnn)
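
The loss function `nn.NLLLoss` expects log probabilities as input, which is why the output of the model is called `log_probs` throughout this notebook. The internals of `RnnTextClassifier` are not shown here, so the sketch below simply assumes a log-softmax over the class scores as the last step. It illustrates, with random toy scores, that this combination is equivalent to applying `nn.CrossEntropyLoss` to the raw scores.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

logits = torch.randn(4, 2)                 # toy raw scores: 4 samples, 2 classes
targets = torch.tensor([0, 1, 1, 0])       # toy class labels

log_probs = F.log_softmax(logits, dim=1)   # assumed last step of the model

loss_nll = nn.NLLLoss()(log_probs, targets)
loss_ce = nn.CrossEntropyLoss()(logits, targets)

# NLLLoss picks the log probability of the correct class and averages the negatives
loss_manual = -log_probs[torch.arange(4), targets].mean()

print(loss_nll.item(), loss_ce.item(), loss_manual.item())
```

All three values are identical up to floating-point precision.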

#### Evaluate Untrained Model

Let's first see how our model performs when untrained, i.e., with the initial random weights.

In [None]:
evaluate(rnn, loader_test)

### Full Training (and evaluation after each epoch)

In [None]:
num_epochs = 20

#train(basic_rnn_classifier, loader, num_epochs, verbose=True)
results = train(rnn, loader_train, loader_test, optimizer, criterion, num_epochs, verbose=True)

In `src.utils` you can find the method `plot_training_results()` to plot the losses and F1 scores (training + test) over time.

In [None]:
plot_training_results(results)

---

## Summary

Recurrent Neural Networks (RNNs) have emerged as a formidable tool for text classification tasks like sentiment analysis due to their intrinsic ability to understand sequential data and capture dependencies among words or characters within text sequences. RNNs, unlike traditional feedforward networks, maintain a memory state that allows them to retain information from previous inputs while processing the current input. This unique architecture enables them to capture temporal dependencies and contextual information crucial for understanding the sentiment or meaning conveyed in text.

In sentiment analysis, RNNs excel at grasping the sequential nature of language, discerning nuances in meaning, and identifying sentiment-bearing words or phrases within sentences or documents. By processing text sequentially, RNNs effectively consider the order of words and their relationships, thus grasping the context necessary for accurate classification.

However, traditional RNNs are prone to issues like vanishing or exploding gradients, limiting their ability to capture long-range dependencies effectively. To mitigate these shortcomings, variants like Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) have been developed. These variants incorporate gating mechanisms that regulate the flow of information, allowing them to better retain relevant context over longer sequences.

Overall, RNNs, including their specialized LSTM and GRU variants, stand out in text classification tasks like sentiment analysis by leveraging sequential information to comprehend context, relationships, and dependencies among words or characters within text data. Their strength lies in their capacity to handle sequences, making them a potent choice for tasks where understanding the sequential nature of language is crucial.