<img src="data/images/lecture-notebook-header.png" />

# Sentiment Analysis -- Convolutional Neural Networks (CNNs)

Convolutional Neural Networks (CNNs) are a type of deep learning model primarily used for analyzing visual imagery. However, they've also been adapted for text classification tasks.

CNNs are composed of layers that detect patterns within data by using convolutional operations. These layers consist of filters or kernels that slide across input data, capturing spatial hierarchies of features. In image processing, these features might represent edges, textures, or more complex structures.

When applied to text, CNNs can be used to detect patterns in sequences of words. Text data is converted into numerical representations like word embeddings. These embeddings capture the relationships between words, enabling the CNN to analyze sequences of these representations. Here's a basic overview of how CNNs can be used for text classification:

* **Input Encoding:** Text documents are tokenized into words or characters and converted into numerical representations (word embeddings, character embeddings, etc.). We already handled this step in the Data Preparation notebook.

* **Convolutional Layers:** Similar to image processing, convolutional layers in a text-based CNN apply filters over sequences of word embeddings to detect patterns or features. These filters slide across the sequence, capturing local patterns.

* **Pooling Layers:** After convolutions, pooling layers (e.g., max pooling) downsample the output, extracting the most important information and reducing dimensionality.

* **Fully Connected Layers:** These layers take the processed features and perform classification, often using techniques like softmax for multi-class classification, assigning probabilities to different classes.

* **Training:** The network is trained using labeled data (text documents with their corresponding categories or labels) to optimize its parameters (weights and biases) via backpropagation and gradient descent.

CNNs can capture local and hierarchical patterns in text data, learning representations that can discern different classes or categories in documents. They're particularly effective for tasks where local word order matters, such as sentiment analysis, topic classification, or spam detection. While recurrent neural networks (RNNs) and transformers are also popular for text analysis, CNNs offer advantages in computational efficiency and parallel processing, making them suitable for certain text classification tasks, especially when considering local contextual information.
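The steps listed above can be sketched as a very small PyTorch model. This is only an illustration under stated assumptions: the class name `TinyTextCnn`, the kernel sizes, and all layer sizes are hypothetical and are not taken from the `CnnSentenceClassifier` or `CnnTextClassifier` classes used later in this notebook.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyTextCnn(nn.Module):
    """Hypothetical minimal text CNN (not the classifier from src.cnn)."""

    def __init__(self, vocab_size, embed_size, num_classes,
                 kernel_sizes=(2, 3, 4), out_channels=8):
        super().__init__()
        # Input encoding: token indices -> word embeddings
        self.embedding = nn.Embedding(vocab_size, embed_size, padding_idx=0)
        # One 1D convolution per kernel size; each slides over the word positions
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_size, out_channels, kernel_size=k) for k in kernel_sizes]
        )
        # Fully connected layer mapping the pooled features to the classes
        self.fc = nn.Linear(out_channels * len(kernel_sizes), num_classes)

    def forward(self, x):
        # (batch, seq_len) -> (batch, seq_len, embed) -> (batch, embed, seq_len)
        emb = self.embedding(x).transpose(1, 2)
        # Convolution + ReLU, then the maximum over all positions (1-max pooling)
        pooled = [F.relu(conv(emb)).max(dim=2).values for conv in self.convs]
        features = torch.cat(pooled, dim=1)
        # Log probabilities over the classes
        return F.log_softmax(self.fc(features), dim=1)


torch.manual_seed(0)
model = TinyTextCnn(vocab_size=50, embed_size=16, num_classes=2)
batch = torch.randint(1, 50, (4, 12))  # 4 sequences with 12 token indices each
log_probs = model(batch)
print(log_probs.shape)  # torch.Size([4, 2])
```

Each kernel size acts like an n-gram detector of the corresponding width, which is why several kernel sizes are typically combined.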

## Setting up the Notebook

### Required Packages

In [None]:
%reload_ext autoreload
%autoreload 2

In [None]:
import numpy as np
from tqdm import tqdm
import random

from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

import torch
import torch.nn as nn
import torchtext
from torch.utils.data import Dataset, DataLoader

# Custom BatchSampler
from src.sampler import EqualLengthsBatchSampler
from src.utils import Dict2Class, plot_training_results
from src.cnn import CnnSentenceClassifier, CnnTextClassifier

### Checking/Setting the Device

PyTorch can train neural networks on a supported GPU, which significantly speeds up the training process. If you have a supported GPU, feel free to use it. For this notebook, however, it is certainly not needed, as our dataset is small and our network model is very simple. In fact, training on the CPU is already fast here, since initializing memory on the GPU and moving the data to the GPU involves some overhead.

In [None]:
use_cuda = torch.cuda.is_available()

# Use this line below to enforce the use of the CPU (in case you don't have a supported GPU)
# With this small dataset and simple model you won't see a difference anyway
#use_cuda = False

device = torch.device("cuda:0" if use_cuda else "cpu")

print("Available device: {}".format(device))

## Generate Dataset

### Dataset A: Sentence Polarity

#### Load Data from File

In [None]:
vocabulary = torch.load("data/datasets/sentence-polarities/polarity-corpus-10000.vocab")

vocab_size = len(vocabulary)

print("Size of vocabulary:\t{}".format(vocab_size))

In [None]:
sequences, targets = [], []

with open("data/datasets/sentence-polarities/polarity-dataset-vectors-10000.txt") as file:
    for line in file:
        line = line.strip()
        # The input sequences and class labels are separated by a tab
        sequence, label = line.split("\t")
        # Convert sequence string to a list of integers (reflecting the indices in the vocabulary)
        sequence = [ int(idx) for idx in sequence.split()]
        # Convert each sequence into a tensor
        sequence = torch.LongTensor(sequence)
        # Add sequence and label to the respective lists
        sequences.append(sequence)
        targets.append(int(label))
        
# As targets is just a list of class labels, we can directly convert it into a tensor
targets = torch.LongTensor(targets)

#### Create Training & Test Set

To evaluate any classifier, we need to split our dataset into a training and a test set. The method `train_test_split()` makes this very easy. It also shuffles the dataset by default, which is important for this example, since the dataset file is ordered with all positive sentences coming first. In the example below, we set the size of the test set to 50% (`test_size=0.5`).

In [None]:
X_train, X_test, y_train, y_test = train_test_split(sequences, targets, test_size=0.5, shuffle=True, random_state=0)

print("Number of training samples:\t{}".format(len(X_train)))
print("Number of test samples:\t\t{}".format(len(X_test)))

### Dataset B: IMDb Movie Reviews

#### Load Data from File

In [None]:
vocabulary = torch.load("data/datasets/imdb-reviews/imdb-corpus-20000.vocab")

vocab_size = len(vocabulary)

print("Size of vocabulary:\t{}".format(vocab_size))

In [None]:
samples_train, samples_test = [], []

with open("data/datasets/imdb-reviews/imdb-dataset-train-vectors-20000.txt") as file:
    for line in file:
        name, label = line.split('\t')
        # Convert name to a sequence of integers
        sequence = [ int(index) for index in name.split() ]
        # Add (sequence,label) pair to list of samples
        samples_train.append((sequence, int(label.strip())))
        
with open("data/datasets/imdb-reviews/imdb-dataset-test-vectors-20000.txt") as file:
    for line in file:
        name, label = line.split('\t')
        # Convert name to a sequence of integers
        sequence = [ int(index) for index in name.split() ]
        # Add (sequence,label) pair to list of samples
        samples_test.append((sequence, int(label.strip())))
        
random.shuffle(samples_train)
random.shuffle(samples_test)
        
print("Number of training samples: {}".format(len(samples_train)))
print("Number of test samples: {}".format(len(samples_test)))

#### Create Training & Test Set

Since the dataset comes in 2 files reflecting the training and test data, we can directly convert the dataset into the respective lists.

In [None]:
X_train = [ torch.LongTensor(seq) for (seq, _) in samples_train ]
X_test  = [ torch.LongTensor(seq) for (seq, _) in samples_test ]

y_train = [ label for (_, label) in samples_train ]
y_test  = [ label for (_, label) in samples_test ]

# We can directly convert the vector of labels to a tensor
y_train = torch.LongTensor(y_train)
y_test  = torch.LongTensor(y_test)

### Create Dataset Class

We first create a simple [`Dataset`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset). This class only stores our `inputs` and `targets` and needs to implement the `__len__()` and `__getitem__()` methods. Since our class extends the abstract class [`Dataset`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset), we can use an instance later to create a [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader).

Without going into too much detail, this approach not only allows for cleaner code but also supports parallel processing on multiple CPUs or on the GPU, and helps optimize data transfer between the CPU and GPU, which is critical when processing very large amounts of data. It is therefore the recommended best practice.

In [None]:
class BaseDataset(Dataset):

    def __init__(self, inputs, targets):
        self.inputs = inputs
        self.targets = targets

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, index):
        if self.targets is None:
            return np.asarray(self.inputs[index])
        else:
            return np.asarray(self.inputs[index]), np.asarray(self.targets[index])

### Create Data Loaders

The [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) class takes a `Dataset` object as input and handles splitting the dataset into batches. The class `EqualLengthsBatchSampler` analyzes the input sequences and organizes them into groups of sequences of the same length. Each batch is then sampled from a single group, ensuring that all sequences in the batch have the same length. In the following, we use a batch size of 256, although you can easily go higher since we are dealing only with sentences.

In [None]:
batch_size = 256

dataset_train = BaseDataset(X_train, y_train)
sampler_train = EqualLengthsBatchSampler(batch_size, X_train, y_train)
loader_train = DataLoader(dataset_train, batch_sampler=sampler_train, shuffle=False, drop_last=False)

dataset_test = BaseDataset(X_test, y_test)
sampler_test = EqualLengthsBatchSampler(batch_size, X_test, y_test)
loader_test = DataLoader(dataset_test, batch_sampler=sampler_test, shuffle=False, drop_last=False)
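The grouping idea behind the batch sampler can be sketched in plain Python. This is a simplified, hypothetical helper (`equal_length_batches` is not part of `src.sampler`), and the actual `EqualLengthsBatchSampler` may differ in how it shuffles and yields batches.

```python
import random
from collections import defaultdict


def equal_length_batches(sequences, batch_size, seed=0):
    """Hypothetical helper: return lists of indices whose sequences share one length."""
    # Group the sequence indices by sequence length
    groups = defaultdict(list)
    for idx, seq in enumerate(sequences):
        groups[len(seq)].append(idx)

    rng = random.Random(seed)
    batches = []
    for indices in groups.values():
        # Shuffle within a group, then cut the group into chunks of batch_size
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batches.append(indices[start:start + batch_size])

    # Shuffle the order of the batches so lengths are mixed across an epoch
    rng.shuffle(batches)
    return batches


toy_sequences = [[1, 2], [3, 4, 5], [6, 7], [8, 9, 10], [11, 12], [13]]
batches = equal_length_batches(toy_sequences, batch_size=2)
for batch in batches:
    print([toy_sequences[i] for i in batch])
```

Because every batch contains sequences of a single length, no padding is needed within a batch, and each batch can be stacked directly into one tensor.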

## Train & Evaluate Model

### Auxiliary Methods

#### Evaluate

The code cell below implements the method `evaluate()` to, well, evaluate our model. Apart from the model itself, the method also receives the data loader as input parameter. This allows us later to use both `loader_train` and `loader_test` to evaluate the training and test loss using the same method.

The method is very generic and is not specific to the dataset. It simply loops over all batches of the data loader, computes the log probabilities, uses these log probabilities to derive the predicted class labels, and compares the predictions with the ground truth to return the F1 score. This means the method can be used "as is" or easily be adapted for all kinds of classification tasks (incl. tasks with more than 2 classes).
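For reference, the binary F1 score can be computed by hand as the harmonic mean of precision and recall for the positive class. The sketch below uses a hypothetical helper `binary_f1` and made-up labels; it assumes class `1` is the positive class, which matches the default settings of scikit-learn's `f1_score`.

```python
def binary_f1(y_true, y_pred, positive=1):
    """Hypothetical helper: F1 score for the positive class of a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined)
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 2 true positives, 1 false positive, 1 false negative
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
score = binary_f1(y_true, y_pred)
print(round(score, 3))  # 0.667
```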

The method has 2 additional input parameters:

* `fixed_seq_len`: Most CNN-based models assume inputs of a fixed size. We therefore need to specify this size so batches can be padded or cut accordingly. We have to set this parameter when using the class `CnnTextClassifier`.

* `min_seq_len`: Specifies the minimum length of a sequence; shorter sequences get padded up to this length. We need to ensure that a sequence is not shorter than the largest kernel, otherwise there will be an error. We only need this parameter if `fixed_seq_len=None` and we use a CNN model with 1-Max Pooling, which ensures equal shapes for all batches. The class `CnnSentenceClassifier` uses 1-Max Pooling.
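The padding to a minimum length can be sketched as follows. The helper name `pad_to_min_len` is hypothetical; the point of the sketch is that the pad width should never be negative, since `torch.nn.functional.pad` in constant mode crops the tensor when given a negative width.

```python
import torch
import torch.nn.functional as F


def pad_to_min_len(batch, min_len, pad_value=0):
    """Hypothetical helper: right-pad a (batch, seq_len) tensor up to min_len."""
    seq_len = batch.shape[1]
    # Clamp at zero so that batches longer than min_len are left untouched
    missing = max(0, min_len - seq_len)
    return F.pad(batch, (0, missing), mode="constant", value=pad_value)


short_batch = torch.tensor([[5, 7], [3, 9]])       # sequence length 2
long_batch = torch.tensor([[1, 2, 3, 4, 5, 6]])    # sequence length 6

padded_short = pad_to_min_len(short_batch, min_len=4)
padded_long = pad_to_min_len(long_batch, min_len=4)

print(padded_short)       # tensor([[5, 7, 0, 0], [3, 9, 0, 0]])
print(padded_long.shape)  # torch.Size([1, 6])
```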


In [None]:
def evaluate(model, loader, fixed_seq_len=None, min_seq_len=None):
    
    y_true, y_pred = [], []
    
    with tqdm(total=len(loader)) as pbar:

        for X_batch, y_batch in loader:
            batch_size, seq_len = X_batch.shape[0], X_batch.shape[1]
            
            if fixed_seq_len is not None:
                X_batch = create_fixed_length_batch(X_batch, fixed_seq_len)            
                
            if min_seq_len is not None:
                # Only pad batches shorter than min_seq_len (a negative pad width would crop the batch)
                X_batch = torch.nn.functional.pad(X_batch, (0, max(0, min_seq_len-seq_len)), mode="constant", value=0)

            # Move the batch to the correct device
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)

            log_probs = model(X_batch)                

            y_batch_pred = torch.argmax(log_probs, dim=1)

            y_true += y_batch.cpu().tolist()
            y_pred += y_batch_pred.cpu().tolist()
            
            # The progress bar counts batches (total=len(loader)), so advance by one per batch
            pbar.update(1)

    return f1_score(y_true, y_pred)

#### Train Model (single epoch)

Similar to the method `evaluate()`, we also implement a method `train_epoch()` to wrap all the steps required for training a single epoch. This has the advantage that we can simply call `train_epoch()` multiple times to proceed with the training. Apart from the model and the data loader, this method has the following input parameters:

* `optimizer`: the optimizer specifies how the computed gradients are used to update the weights; in the lecture, we only covered basic Stochastic Gradient Descent, but there are much more efficient alternatives available

* `criterion`: this is the loss function; "criterion" is just very common terminology in the PyTorch documentation and tutorials

The heart of the method is the snippet marked as PyTorch magic. It consists of the following 3 lines of code:

* `optimizer.zero_grad()`: PyTorch accumulates gradients by default, so we have to reset them to zero before computing the gradients for the current batch

* `loss.backward()`: Calculating all gradients using backpropagation

* `optimizer.step()`: Update all weights using the gradients and the method of the specific optimizer
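These three calls can be tried out in isolation on a toy problem. The sketch below uses a hypothetical linear model and random data, not the CNN and dataset from this notebook; it only shows that one pass through the three lines changes the weights.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy model that outputs log probabilities for 2 classes
model = nn.Sequential(nn.Linear(4, 2), nn.LogSoftmax(dim=1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.NLLLoss()

# Random toy batch: 8 samples with 4 features each
X = torch.randn(8, 4)
y = torch.randint(0, 2, (8,))

weights_before = model[0].weight.detach().clone()

loss = criterion(model(X), y)

optimizer.zero_grad()  # reset gradients left over from a previous step
loss.backward()        # compute gradients via backpropagation
optimizer.step()       # update the weights using the gradients

weights_changed = not torch.equal(weights_before, model[0].weight.detach())
print(weights_changed)  # True
```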

In [None]:
def train_epoch(model, loader, optimizer, criterion, fixed_seq_len=None, min_seq_len=None):
    
    # Initialize epoch loss (cumulative loss of all batches)
    epoch_loss = 0.0

    with tqdm(total=len(loader)) as pbar:

        for X_batch, y_batch in loader:
            batch_size, seq_len = X_batch.shape[0], X_batch.shape[1]

            if fixed_seq_len is not None:
                X_batch = create_fixed_length_batch(X_batch, fixed_seq_len)            
                
            if min_seq_len is not None:
                # Only pad batches shorter than min_seq_len (a negative pad width would crop the batch)
                X_batch = torch.nn.functional.pad(X_batch, (0, max(0, min_seq_len-seq_len)), mode="constant", value=0)
            
                
            # Move the batch to the correct device
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)

            log_probs = model(X_batch)                

            # Calculate loss
            loss = criterion(log_probs, y_batch)
            
            ### Pytorch magic! ###
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # Keep track of overall epoch loss
            epoch_loss += loss.item()

            # The progress bar counts batches (total=len(loader)), so advance by one per batch
            pbar.update(1)
            
    return epoch_loss

#### Train Model (multiple epochs)

The `train()` method combines the training and evaluation of a model epoch by epoch. The method keeps track of the loss, the training score, and the test score for each epoch. This allows us to plot the results later; see below. Notice the calls of `model.train()` and `model.eval()` to set the model into the correct "mode". This is needed since our model contains a Dropout layer. For more details, check out this [Stack Overflow post](https://stackoverflow.com/questions/51433378/what-does-model-train-do-in-pytorch).
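The effect of the two modes on a Dropout layer can be seen in a small standalone sketch. It uses a bare `nn.Dropout` layer on a tensor of ones rather than the models from this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

dropout = nn.Dropout(p=0.5)
x = torch.ones(1, 1000)

# Training mode: roughly half of the values are zeroed,
# the remaining values are scaled by 1 / (1 - p) = 2
dropout.train()
out_train = dropout(x)

# Evaluation mode: dropout is disabled and the input passes through unchanged
dropout.eval()
out_eval = dropout(x)

print(sorted(out_train.unique().tolist()))  # [0.0, 2.0]
print(torch.equal(out_eval, x))             # True
```

Forgetting `model.eval()` before evaluation would therefore make the reported scores noisy, since random units would still be dropped.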


In [None]:
def train(model, loader_train, loader_test, optimizer, criterion, num_epochs, fixed_seq_len=None, min_seq_len=None, verbose=False):
    
    results = []
    
    print("Total Training Time (total number of epochs: {})".format(num_epochs))
    #for epoch in tqdm(range(1, num_epochs+1)):
    for epoch in range(1, num_epochs+1):        
        model.train()
        epoch_loss = train_epoch(model, loader_train, optimizer, criterion, fixed_seq_len=fixed_seq_len, min_seq_len=min_seq_len)
        model.eval()
        acc_train = evaluate(model, loader_train, fixed_seq_len=fixed_seq_len, min_seq_len=min_seq_len)
        acc_test = evaluate(model, loader_test, fixed_seq_len=fixed_seq_len, min_seq_len=min_seq_len)

        results.append((epoch_loss, acc_train, acc_test))
        
        if verbose is True:
            print("[Epoch {}] loss:\t{:.3f}, f1 train: {:.3f}, f1 test: {:.3f} ".format(epoch, epoch_loss, acc_train, acc_test))
            
    return results

### Basic CNN Model

We consider this model basic since the architecture is hard-coded and since the model uses 1-Max Pooling for convenience. As such, this model is arguably not suitable for long sequences, as 1-Max Pooling would throw away too much information. More specifically, the model implements the architecture presented in the lecture (see the figure below). However, we increase the size of the embeddings to 100 (although you can change that).


<img src="data/images/CNN-modeling-on-text-Zhang-and-Wallace-2015.jpg" />
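Why 1-Max Pooling is convenient can be seen by comparing output shapes for inputs of different lengths. The sketch below uses hypothetical layer sizes (16-dimensional embeddings, 8 filters of width 3) and random tensors in place of real embeddings.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

conv = nn.Conv1d(in_channels=16, out_channels=8, kernel_size=3)
pool = nn.MaxPool1d(kernel_size=2)

shapes = {}
for seq_len in (10, 25):
    # Random stand-in for a batch of embedded sequences: (batch, embed, seq_len)
    emb = torch.randn(4, 16, seq_len)
    feature_map = conv(emb)                      # (4, 8, seq_len - 2)
    one_max = feature_map.max(dim=2).values      # 1-max pooling: (4, 8)
    regular = pool(feature_map)                  # regular max pooling: length depends on seq_len
    shapes[seq_len] = (tuple(one_max.shape), tuple(regular.shape))
    print(seq_len, shapes[seq_len])

# 10 ((4, 8), (4, 8, 4))
# 25 ((4, 8), (4, 8, 11))
```

With 1-Max Pooling the feature vector has the same size for every sequence length, so the following linear layer always receives inputs of the same shape. With regular max pooling the output length varies with the input length, which is why a fixed input length is needed in that case.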

In [None]:
# Create model   
cnn = CnnSentenceClassifier(vocab_size, 2, 100).to(device)
# Define optimizer
optimizer = torch.optim.Adam(cnn.parameters(), lr=0.001)
# Define loss function
criterion = nn.NLLLoss()

print(cnn)
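The choice of `nn.NLLLoss()` fits a model whose last layer returns log probabilities, which is what the variable name `log_probs` in the methods above suggests. Under that assumption, the sketch below shows with made-up logits that log-softmax followed by `NLLLoss` gives the same value as `CrossEntropyLoss` applied to the raw logits.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Made-up raw scores (logits) for 5 samples and 2 classes
logits = torch.randn(5, 2)
targets = torch.tensor([0, 1, 1, 0, 1])

# Variant 1: log-softmax in the model, NLLLoss as criterion
log_probs = F.log_softmax(logits, dim=1)
loss_nll = nn.NLLLoss()(log_probs, targets)

# Variant 2: raw logits from the model, CrossEntropyLoss as criterion
loss_ce = nn.CrossEntropyLoss()(logits, targets)

print(torch.allclose(loss_nll, loss_ce))  # True
```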

#### Evaluate Untrained Model

Let's first see how our model performs when untrained, i.e., with the initial random weights.

In [None]:
evaluate(cnn, loader_test, min_seq_len=4)

### Full Training (and evaluation after each epoch)

In [None]:
num_epochs = 20

results = train(cnn, loader_train, loader_test, optimizer, criterion, num_epochs, min_seq_len=4, verbose=True)

In `src.utils` you can find the method `plot_training_results()` to plot the losses and F1 scores (training + test) over time.

In [None]:
plot_training_results(results)

---

## Advanced CNN Text Classifier

The more advanced CNN-based classifier uses normal Max Pooling instead of 1-Max Pooling. As such, we need to ensure that all sequences are indeed of the same length. The method below takes a batch of sequences, pads the batch if it is too short, and cuts it if it is too long.

In [None]:
def create_fixed_length_batch(sequences, length):
    
    # Pad sequences w.r.t. longest sequences
    sequences_padded = torch.nn.utils.rnn.pad_sequence(sequences,  batch_first=True, padding_value=0)

    # Get the current sequence length
    max_seq_len = sequences_padded.shape[1]
    
    if max_seq_len > length:
        # Cut sequences if too long
        return sequences_padded[:,:length]
    else:
        # Pad sequences if too short
        return torch.nn.functional.pad(sequences_padded, (0, length-max_seq_len), mode="constant", value=0)

The classifier is also more flexible in the sense that it lets you specify a whole range of parameters such as the size of the kernels, the number of output channels, the Max Pooling parameters, and so on. It also allows you to customize the number of hidden layers that map the output of the convolutional and pooling layers to the output layer. Feel free to have a look at the implementation of class `CnnTextClassifier`. It looks quite verbose, but most of the code is needed to allow for this flexibility, compared to hard-coding the number of layers and their sizes (or other parameters).

In [None]:
FIXED_LENGTH = 100

params = {
    "seq_len": FIXED_LENGTH,
    "in_channels": 1,
    "vocab_size": vocab_size,
    "embed_size": 300,
    "conv_kernel_sizes": [2,3,4],
    "out_channels": 10,
    "conv_stride": 1,
    "conv_padding": 1,
    "maxpool_kernel_size": 2,
    "maxpool_padding": 0,
    "linear_sizes": [64],
    "linear_dropout": 0.5,
    "output_size": 2
}

# Define model parameters
params = Dict2Class(params)
# Create model  
cnn = CnnTextClassifier(params).to(device)
# Define optimizer
optimizer = torch.optim.Adam(cnn.parameters(), lr=0.001)
# Define loss function
criterion = nn.NLLLoss()

print(cnn)
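To get a feeling for how these parameters interact, the output lengths can be worked out by hand. The sketch below is only an approximation under explicit assumptions: it assumes each convolution slides along the sequence dimension only, that max pooling uses a stride equal to its kernel size (the PyTorch default), and that the pooled outputs of all kernel sizes are flattened and concatenated. The helper functions are hypothetical, and the actual `CnnTextClassifier` may compute its sizes differently.

```python
def conv_out_len(length, kernel_size, stride=1, padding=0):
    """Standard output length of a 1D convolution (dilation 1)."""
    return (length + 2 * padding - kernel_size) // stride + 1


def pool_out_len(length, kernel_size, padding=0):
    """Output length of 1D max pooling with stride equal to the kernel size."""
    return (length + 2 * padding - kernel_size) // kernel_size + 1


seq_len, out_channels = 100, 10

total_features = 0
for k in [2, 3, 4]:
    conv_len = conv_out_len(seq_len, k, stride=1, padding=1)
    pooled_len = pool_out_len(conv_len, kernel_size=2, padding=0)
    total_features += out_channels * pooled_len
    print(k, conv_len, pooled_len)

# 2 101 50
# 3 100 50
# 4 99 49

print(total_features)  # 1490
```

Under these assumptions, the first linear layer would receive 1490 input features. This number depends directly on `seq_len`, which is why all batches have to be brought to the same fixed length.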

#### Set Pretrained Word Embeddings (optional)

If we want to use pre-trained word embeddings, e.g., Word2Vec, this is the moment to do so. A source for pre-trained word embeddings is [this site](http://vectors.nlpl.eu/repository/). When downloading a file containing pre-trained word embeddings, there are some things to consider:

* Most obviously, the pre-trained embeddings should match the language (here: English).

* The pretrained embeddings should match the preprocessing steps. For example, we lemmatized our dataset for this notebook (at least by default, maybe you have changed it). So we need embeddings trained over a lemmatized dataset as well.

* The pretrained embeddings have to match the size of our embedding layer. So if we create an embedding layer of size 300, we have to use pretrained embeddings of the same size.

* The files with the pretrained embeddings are too large to ship with the notebooks, so you have to download them separately :)

First, we need to load the pretrained embeddings from the file; here I used [this file](http://vectors.nlpl.eu/repository/20/5.zip) (lemmatized, size: 300):


In [None]:
pretrained_vectors = torchtext.vocab.Vectors("data/embeddings/model.txt")

Now we have over 270k pretrained word embeddings, but we only have 20k words in our vocabulary. So we need to create an embedding -- which is basically just a $20k \times 300$ matrix containing the respective 20k pretrained word embeddings for our vocabulary.

In [None]:
pretrained_embedding = pretrained_vectors.get_vecs_by_tokens(vocabulary.get_itos())
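Conceptually, this lookup builds one row per vocabulary entry. The sketch below mimics it with a plain dictionary and a toy vocabulary; all names and values are hypothetical stand-ins. Words without a pretrained vector keep a zero vector here, which, as far as I know, is also the default behavior of `torchtext.vocab.Vectors` for unknown tokens.

```python
import torch

# Hypothetical stand-in for the pretrained vectors (embedding size 3)
pretrained = {
    "good": torch.tensor([0.1, 0.2, 0.3]),
    "bad": torch.tensor([-0.1, -0.2, -0.3]),
}

# Hypothetical stand-in for vocabulary.get_itos(): index -> token
itos = ["<pad>", "<unk>", "good", "movie", "bad"]
embed_size = 3

# One row per vocabulary entry; rows stay zero if no pretrained vector exists
matrix = torch.zeros(len(itos), embed_size)
for idx, token in enumerate(itos):
    if token in pretrained:
        matrix[idx] = pretrained[token]

print(matrix.shape)  # torch.Size([5, 3])
print(matrix[2])     # tensor([0.1000, 0.2000, 0.3000])
```

The row order has to follow the vocabulary indices, since the embedding layer looks up row `i` for token index `i`.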

Now we can set the weights of the embedding layer of our model to the pretrained weights.

In [None]:
cnn.embedding.weight.data = pretrained_embedding

Lastly, we can decide if we want the pretrained embeddings to remain fixed or whether we want to update them during training. By setting `.requires_grad = False`, we "freeze" the layer: no gradients are computed for the embedding weights, so the optimizer does **not** update them during training.

In [None]:
cnn.embedding.weight.requires_grad = False
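The effect of freezing can be checked on a toy example. The sketch below uses a hypothetical small embedding and linear layer instead of the notebook's model, and verifies that one optimizer step leaves the frozen embedding untouched while the linear layer is updated.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

embedding = nn.Embedding(10, 4)
linear = nn.Linear(4, 2)

# Freeze the embedding layer
embedding.weight.requires_grad = False

# The optimizer still receives all parameters, as in the notebook
params = list(embedding.parameters()) + list(linear.parameters())
optimizer = torch.optim.Adam(params, lr=0.1)

emb_before = embedding.weight.detach().clone()
lin_before = linear.weight.detach().clone()

# One training step on a toy batch of token indices
tokens = torch.tensor([[1, 2, 3], [4, 5, 6]])
loss = linear(embedding(tokens).mean(dim=1)).sum()
optimizer.zero_grad()
loss.backward()
optimizer.step()

emb_unchanged = torch.equal(emb_before, embedding.weight)
lin_unchanged = torch.equal(lin_before, linear.weight)
print(emb_unchanged)  # True
print(lin_unchanged)  # False
```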

Since the new embedding weights still reside on the CPU, we move the model to the respective device again so that the model and all data are indeed on the same device.

In [None]:
cnn.to(device)

#### Evaluate Untrained Model

Let's first see how our model performs when untrained, i.e., before any training (apart from the optional pretrained embeddings, all weights are still random).

In [None]:
evaluate(cnn, loader_test, fixed_seq_len=FIXED_LENGTH)

### Full Training (and evaluation after each epoch)

In [None]:
num_epochs = 20

results = train(cnn, loader_train, loader_test, optimizer, criterion, num_epochs, fixed_seq_len=FIXED_LENGTH, verbose=True)

In `src.utils` you can find the method `plot_training_results()` to plot the losses and F1 scores (training + test) over time.

In [None]:
plot_training_results(results)

---

## Summary

Convolutional Neural Networks (CNNs), primarily known for their excellence in image processing, have found a remarkable application in text classification tasks like sentiment analysis. When adapted to process text data, CNNs leverage their ability to detect local patterns and hierarchies of features, proving effective in analyzing sequences of words.

In text classification, CNNs operate by treating text as a 1-dimensional sequence, where each word or character corresponds to a position in the sequence. Convolutional layers slide filters or kernels across these sequences, detecting local features or patterns in groups of words. These filters capture information such as word combinations or phrases that are indicative of specific sentiments or categories.

CNNs' strength lies in their capability to learn hierarchical representations of text. Lower-level filters might detect simple patterns like word sequences or n-grams, while higher-level filters capture more complex structures. Pooling layers then downsample these features, extracting the most relevant information while reducing dimensionality.

One of the advantages of CNNs for text classification tasks is their ability to learn from local context, identifying essential phrases or combinations of words without being sensitive to the order of the entire sequence. This makes them effective for tasks where local context matters more than the global sequence structure, as in sentiment analysis. Moreover, CNNs exhibit computational efficiency and can process text in parallel, making them suitable for handling large-scale text data efficiently. Overall, their capacity to discern intricate patterns within text data and extract relevant features has made CNNs a powerful and competitive choice for various text classification tasks, including sentiment analysis.