<img src="data/images/div/lecture-notebook-header.png" />

# Neural Networks (MLP)

A Multi-Layer Perceptron (MLP) is one of the most common neural network models used in the field of deep learning. Often referred to as a "vanilla" neural network, an MLP is simpler than the complex models of today's era (e.g., CNN, RNN, Transformer). However, the techniques it introduced have paved the way for further advanced neural networks. An MLP consists of interconnected neurons (i.e., Logistic Regression units) transferring information to each other. 

The MLP is a feedforward neural network, which means that the data is transmitted from the input layer to the output layer in the forward direction. All neurons of one layer are connected to all the neurons in the next layer. The connections between the layers are assigned weights. The weight of a connection specifies its importance. This concept is the backbone of an MLP's learning process.
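To make the idea of weighted connections concrete, here is a minimal sketch of one fully connected layer in plain Python. The function names (`sigmoid`, `dense_layer`) and all numbers are made up for illustration; the sigmoid activation is chosen only to mirror the "Logistic Regression unit" analogy above and is not the activation used later in this notebook.

```python
import math

def sigmoid(z):
    # Logistic function, squashing any real value into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def dense_layer(inputs, weights, biases):
    # Each row of `weights` holds the incoming connection weights of one neuron
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        # Weighted sum of all inputs plus the bias of this neuron
        z = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(sigmoid(z))
    return outputs

x = [1.0, 0.0, -1.0]             # 3 input features
W = [[0.5, -0.2, 0.1],           # weights of neuron 1
     [0.0, 0.0, 0.0]]            # weights of neuron 2 (ignores all inputs)
b = [0.0, 0.0]

h = dense_layer(x, W, b)         # activations of the 2 neurons
```

A neuron whose weights are all zero ignores its inputs entirely, which is one way to see why the weight of a connection expresses its importance.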

**Important:** This notebook does not serve as a proper introductory tutorial for PyTorch. It only provides a minimal example of implementing a basic MLP text classifier, glossing over many details. If you are totally new to PyTorch, it's highly recommended to check out the tutorials on the [official PyTorch page](https://pytorch.org/tutorials/).

## Setting up the Notebook

### Required packages

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn import metrics

from tqdm import tqdm

In [None]:
import torch
import torch.nn as nn

from torch.utils.data import TensorDataset, DataLoader

### Checking/Setting the Device

PyTorch allows neural networks to be trained on a supported GPU, which can significantly speed up the training process. If you have a supported GPU, feel free to utilize it. However, for this notebook it's certainly not needed, as our dataset is small and our network model is very simple. In fact, the training is fast on the CPU here, and a GPU may bring little benefit, since initializing memory on the GPU and moving the data to the GPU involves some overhead.


In [None]:
use_cuda = torch.cuda.is_available()

# Use this line below to enforce the use of the CPU (in case you don't have a supported GPU)
# With this small dataset and simple model you won't see a difference anyway
use_cuda = False

device = torch.device("cuda:0" if use_cuda else "cpu")

print("Available device: {}".format(device))

## Preparing the Data

For this notebook, we use a simple dataset for sentiment classification. This dataset consists of 10,662 sentences, where 50% of the sentences are labeled 1 (positive), and 50% of the sentences are labeled -1 (negative).

### Loading Sentence/Label Pairs from File

In [None]:
sentences, labels = [], []

with open("data/datasets/sentence-polarities/sentence-polarities.csv") as file:
    for line in file:
        line = line.strip()
        sentence, label = line.split("\t")
        sentences.append(sentence)
        labels.append(int((int(label)+1)/2))
        
print("Total number of sentences: {}".format(len(sentences)))
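The expression `int((int(label)+1)/2)` in the loop above maps the raw labels -1 and 1 to the class indices 0 and 1. This matters later because loss functions such as `nn.NLLLoss` expect class indices starting at 0. The helper name `to_class_index` below is hypothetical and only serves to illustrate the mapping:

```python
def to_class_index(raw_label):
    # -1 -> (−1 + 1) / 2 = 0   (negative)
    #  1 -> ( 1 + 1) / 2 = 1   (positive)
    return int((int(raw_label) + 1) / 2)

negative_index = to_class_index("-1")
positive_index = to_class_index("1")
```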

### Create Training & Test Set

To evaluate any classifier, we need to split our dataset into a training and a test set. With the method `train_test_split()` this is very easy to do; this method also shuffles the dataset by default, which is important for this example, since the dataset file is ordered with all positive sentences coming first. In the example below, we set the size of the test set to 20%.

In [None]:
# Split sentences and labels into training and test set with a test set size of 20%
sentences_train, sentences_test, labels_train, labels_test = train_test_split(sentences, labels, test_size=0.2, random_state=0)

# We can directly convert the numerical class labels from lists to numpy arrays
y_train = np.asarray(labels_train, dtype=np.int8)
y_test = np.asarray(labels_test, dtype=np.int8)

print("Size of training set: {}".format(len(sentences_train)))
print("Size of test set: {}".format(len(sentences_test)))

### TF-IDF Feature Extraction

To serve as valid input for our network, we need to convert our sentences into document vectors. For this, we use, as always, a scikit-learn vectorizer. In the code cell below, you can try the TF-IDF vectorizer or the Count vectorizer. The final results will be more or less the same. Still, feel free to play with the choice of the vectorizer as well as with the different input parameters (e.g., `ngram_range` or `max_features`).


In [None]:
# Create Document-Term Matrix for different n-gram sizes
vectorizer = TfidfVectorizer(ngram_range=(1, 1), max_features=10000)
#vectorizer = CountVectorizer(ngram_range=(1, 1), max_features=10000)

X_train = vectorizer.fit_transform(sentences_train)
X_test = vectorizer.transform(sentences_test)
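Note that the vectorizer is fitted on the training sentences only, and the test sentences are merely transformed. The toy example below (with made-up sentences and hypothetical variable names) sketches the consequence: words that appear only in the test data are not part of the vocabulary and are ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_train = ["good movie", "bad movie"]
toy_test = ["good film"]   # "film" never occurs in the training data

toy_vectorizer = TfidfVectorizer()

# Learn the vocabulary (and IDF weights) from the training data only
toy_X_train = toy_vectorizer.fit_transform(toy_train)

# Reuse the learned vocabulary for the test data
toy_X_test = toy_vectorizer.transform(toy_test)

# The vocabulary consists of the 3 distinct training words
vocabulary = sorted(toy_vectorizer.vocabulary_)
```

Both matrices have the same number of columns, which is what allows the same network to process training and test samples.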

### Create Tensors

Both `X_train` and `X_test` are now matrices containing the document vectors of our training and test set. However, right now, `X_train` and `X_test` are sparse matrices, i.e., representations that only store the non-zero values. For further use with PyTorch, we have to perform 2 additional steps:

* Convert the sparse representation to a dense (i.e., full/normal) representation using `.todense()`; the output will be dense NumPy matrices

* Convert the NumPy matrices to tensors. `Tensor` is the data object used by PyTorch; tensors look, feel, and handle basically the same as NumPy arrays.

In [None]:
# The default Tensor stores float values
X_train = torch.Tensor(X_train.todense())
X_test = torch.Tensor(X_test.todense())

# Our labels are integers, hence we use LongTensor
# (that's required, otherwise we would get an error later)
y_train = torch.LongTensor(y_train)
y_test = torch.LongTensor(y_test)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
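As a side note on the conversion above, the following sketch shows the same two steps on a tiny, made-up sparse matrix. It uses `.toarray()` and `torch.tensor(..., dtype=...)` as one possible alternative that makes the data types explicit; the variable names are hypothetical and this is not meant to replace the cell above.

```python
import numpy as np
import torch
from scipy.sparse import csr_matrix

# Toy sparse matrix standing in for the vectorizer output (values are made up)
X_sparse = csr_matrix(np.array([[0.0, 1.5, 0.0], [2.0, 0.0, 0.0]]))

# .toarray() returns a plain ndarray, whereas .todense() returns a numpy.matrix
X_dense = X_sparse.toarray()

# torch.tensor(...) copies the data; float32 is PyTorch's default float type
X_tensor = torch.tensor(X_dense, dtype=torch.float32)

# Class labels are stored as 64-bit integers (the same type as LongTensor)
y_tensor = torch.tensor([0, 1], dtype=torch.long)
```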

### Create PyTorch Datasets

Training a neural network is usually not done by computing the gradient w.r.t. the whole dataset, as most of the time the dataset is way too large to fit into memory. It might also slow down training since the gradients w.r.t. the whole dataset can be very small (although this could be addressed by increasing the learning rate). In practice, the training is basically always done using batches, i.e., much smaller subsets of the data. While we can take `X_train` and `X_test` and implement our own loops to create batches, PyTorch comes with a series of convenient utility classes to simplify things.

The first utility class we use is the [`Dataset`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset) class; more specifically, since `Dataset` is an abstract class, we use the [`TensorDataset`](https://pytorch.org/docs/stable/data.html#torch.utils.data.TensorDataset) class to wrap our tensors and make each sample retrievable by indexing the tensors along the first dimension. This will be used by the data loaders below.

In [None]:
dataset_train = TensorDataset(X_train, y_train)
dataset_test = TensorDataset(X_test, y_test)

### Create Data Loaders

The [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) class takes a `Dataset` object as input and handles splitting the dataset into batches. As such, a data loader also has `batch_size` as an input parameter. In the following, we use a batch size of 64, although you can easily go higher since we are dealing with only sentences.


In [None]:
batch_size = 64

loader_train = DataLoader(dataset_train, batch_size=batch_size, shuffle=True)
loader_test = DataLoader(dataset_test, batch_size=batch_size, shuffle=True)

We can use the data loaders to loop over all batches and use them for training and testing our model. The code cell below shows the general idea.

In [None]:
for X_batch, y_batch in loader_train:
    print(X_batch.shape)
    print(y_batch.shape)
    break

Appreciate how much additional code we would have written if we had implemented this batching on our own. Well, it actually wouldn't be that much, but using these utility classes makes our code much cleaner and less prone to error. These utility classes also allow models to be trained in a distributed setting; but this is not important at the moment.
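For comparison, a hand-written batching loop could look roughly like the sketch below. It uses plain Python lists instead of tensors, and the function name `iterate_batches` as well as the toy data are hypothetical; it only illustrates the shuffling and slicing that a data loader takes care of.

```python
import random

def iterate_batches(samples, labels, batch_size, shuffle=True, seed=None):
    # Shuffle the indices (not the data itself) so samples and labels stay aligned
    indices = list(range(len(samples)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    # Slice the index list into consecutive chunks of size batch_size
    for start in range(0, len(indices), batch_size):
        batch_idx = indices[start:start + batch_size]
        yield [samples[i] for i in batch_idx], [labels[i] for i in batch_idx]

toy_samples = [[float(i)] for i in range(10)]   # 10 samples with 1 feature each
toy_labels = [i % 2 for i in range(10)]

batch_sizes = [
    len(X_b) for X_b, y_b in iterate_batches(toy_samples, toy_labels, batch_size=4, seed=0)
]
```

With 10 samples and a batch size of 4, the last batch is smaller than the others, which is also the default behavior of `DataLoader`.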

With respect to preparing our data, we are now ready to build and train a neural network model.

---

## Build Network

In this notebook, we replicate the exact model architecture we used as an example on the lecture slides: We have 3 hidden layers and 2 outputs. Note that we could implement the model with just 1 output since we are trying to solve a binary classification task. However, using 1 output with the Binary Cross Entropy Loss and using 2 outputs with the Multiclass Cross Entropy Loss are equivalent for this purpose. Hence, we simply implement the model exactly as visualized below:

<img src="data/images/examples/example-network-mlp.png">

Of course, our input layer doesn't have just 3 features but 10,000 (assuming that's the value of `max_features` of the vectorizer).

### Basic Implementation

The code cell below shows the most straightforward implementation of our network architecture. To make the model a bit more flexible, we have `vocab_size` (vocabulary size) as an input parameter. This means that any time we change the `max_features` parameter of the vectorizer, we can define our model using the new value. In this implementation, we use [`nn.ReLU`](https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html) (Rectified Linear Unit) as the activation function for each hidden unit, which has become a common activation of choice. Still, feel free to try out other activation functions (e.g., [`nn.Tanh`](https://pytorch.org/docs/stable/generated/torch.nn.Tanh.html#torch.nn.Tanh)).

For the activation function for the output layer we use [`nn.LogSoftmax`](https://pytorch.org/docs/stable/generated/torch.nn.LogSoftmax.html#torch.nn.LogSoftmax) -- instead of [`nn.Softmax`](https://pytorch.org/docs/stable/generated/torch.nn.Softmax.html#torch.nn.Softmax) -- which applies $\log(\mathrm{Softmax}(x))$ to its input. This is very common in practice as it typically makes the computation numerically more stable, since Softmax probabilities can be very small, particularly in the case of many class labels. While we have only 2 classes here, and Softmax will probably do just fine, using LogSoftmax is simply good practice. Of course, the outputs are no longer probabilities but log probabilities. This will affect the choice of the loss function; see below.
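The stability argument can be illustrated in plain Python. The sketch below is not how PyTorch implements `nn.LogSoftmax` internally; it is a minimal version of the common log-sum-exp formulation, with a hypothetical function name and made-up inputs.

```python
import math

def log_softmax(logits):
    # Subtracting the maximum keeps all exponents <= 0 and avoids overflow
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    # log(softmax(x_i)) = x_i - log(sum_j exp(x_j))
    return [v - log_sum_exp for v in logits]

logits = [0.0, -1000.0]

# The Softmax probability of the 2nd class underflows to exactly 0.0 ...
naive_prob = math.exp(logits[1]) / (math.exp(logits[0]) + math.exp(logits[1]))

# ... so taking its logarithm afterwards would fail, while the
# log probability itself is a perfectly ordinary number
log_probs = log_softmax(logits)
```

Here `naive_prob` is `0.0`, so `math.log(naive_prob)` would raise an error, whereas `log_probs[1]` is simply `-1000.0`.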


In [None]:
class SimpleNet1(nn.Module):
    
    def __init__(self, vocab_size):
        super().__init__()
        self.vocab_size = vocab_size
        # Define 1st fully connected (i.e., linear) hidden layer
        self.fc1 = nn.Linear(self.vocab_size, 4)
        self.relu1 = nn.ReLU()
        # Define 2nd fully connected (i.e., linear) hidden layer
        self.fc2 = nn.Linear(4, 3)
        self.relu2 = nn.ReLU()
        # Define 3rd fully connected (i.e., linear) hidden layer
        self.fc3 = nn.Linear(3, 3)
        self.relu3 = nn.ReLU()
        # Define output layer (which is also a linear layer)
        self.out = nn.Linear(3, 2)        
        # Define log softmax layer
        self.log_softmax = nn.LogSoftmax(dim=1)
        
    def forward(self, X):
        out = self.fc1(X)
        out = self.relu1(out)
        out = self.fc2(out)
        out = self.relu2(out)
        out = self.fc3(out)
        out = self.relu3(out)
        out = self.out(out)
        log_probs = self.log_softmax(out)
        return log_probs

To "visualize" the network, we can create and print an instance of the class `SimpleNet1`. The output should reflect the model architecture shown in the image above.

The command `.to(device)` "moves" the instance to the selected device (e.g., the CPU or GPU). In general, both the model and the data need to reside on the same device. So if the model is on the GPU, we also need to move the data to the GPU later. Consistently using `.to(device)` on the model and the data (see below) ensures that there will be no mismatch.


In [None]:
# Create the model and move to device
classifier = SimpleNet1(X_train.shape[1]).to(device)

print(classifier)
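As a quick sanity check on the printed architecture, we can count the trainable parameters by hand: each linear layer has one weight per input/output pair plus one bias per output. The sketch below assumes a vocabulary size of 10,000 (the value of `max_features` used above); the function name `count_parameters` is hypothetical.

```python
def count_parameters(layer_sizes):
    total = 0
    # Pair up consecutive layer sizes: (10000, 4), (4, 3), (3, 3), (3, 2)
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out   # weights + biases
    return total

layer_sizes = [10000, 4, 3, 3, 2]
num_params = count_parameters(layer_sizes)   # 40004 + 15 + 12 + 8 = 40039
```

If `X_train.shape[1]` is indeed 10,000, the expression `sum(p.numel() for p in classifier.parameters())` should report the same number. Almost all parameters sit in the first layer, which connects the large input vector to only 4 hidden units.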

### Modified Implementation

In the implementation above, we used the most basic approach by having 2 main parts:

* Defining all the layers in the `__init__()` method

* Push the data through all layers in the `forward()` method

While this is perfectly fine and for more complex models essentially required, here we can use some additional utility classes to simplify our code. Note how we push our data step-by-step (i.e., sequentially) through each layer. This means we can create a model component using [`nn.Sequential`](https://pytorch.org/docs/stable/generated/torch.nn.Sequential.html) that contains all our layers -- here, the model component is basically the complete model. Thus, in the `forward()` method, we only need to give the component the data, and the component will automatically push the data through all the layers.

In [None]:
class SimpleNet2(nn.Module):
    
    def __init__(self, vocab_size):
        super().__init__()
        self.vocab_size = vocab_size
        
        self.net = nn.Sequential(
            nn.Linear(self.vocab_size, 4),
            nn.ReLU(),
            nn.Linear(4, 3),
            nn.ReLU(),
            nn.Linear(3, 3),
            nn.ReLU(),
            nn.Linear(3, 2),
            nn.LogSoftmax(dim=1)
        )
        
    def forward(self, X):
        log_probs = self.net(X)
        return log_probs

This is a neat way to save lines of code and make the implementation more readable and maintainable. Of course, if we create and print an instance of this class, the output should match the one from `SimpleNet1` with respect to the core layers. Since we use `nn.Sequential` here, the output is not exactly the same, though.

In [None]:
# Create the model and move to device
classifier = SimpleNet2(X_train.shape[1]).to(device)

print(classifier)

---

## Train & Evaluate Model

With the data prepared and the model architecture defined, we can now train and evaluate a model to build our sentiment classifier.

### Evaluate

The code cell below implements the method `evaluate()` to, well, evaluate our model. Apart from the model itself, the method also receives the data loader as input parameter. This allows us later to use both `loader_train` and `loader_test` to evaluate the model on the training and test data using the same method.

The method is very generic and is not specific to the dataset. It simply loops over all batches of the data loader, computes the log probabilities, uses these log probabilities to derive the predicted class labels, and compares the predictions with the ground truth to return the f1 score. This means that the method could be used "as is" or easily be adapted for all kinds of classification tasks (incl. tasks with more than 2 classes).

In [None]:
def evaluate(model, loader):

    # Set model to "eval" mode (not needed here, but a good practice)
    model.eval()

    # Collect predictions and ground truth for all samples across all batches
    y_pred, y_test = [], []

    # Disable gradient tracking during evaluation (saves memory and time)
    with torch.no_grad(), tqdm(total=len(loader)) as pbar:
        
        # Loop over each batch in the data loader
        for X_batch, y_batch in loader:

            # Move data to device
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)

            # Push batch through network to get log probabilities for each sample in batch
            log_probs = model(X_batch)                
            
            # The predicted labels are the index of the highest log probability (for each sample)
            y_batch_pred = torch.argmax(log_probs, dim=1)

            # Add predictions and ground truth for current batch (as plain Python ints)
            y_test += y_batch.detach().cpu().tolist()
            y_pred += y_batch_pred.detach().cpu().tolist()

            pbar.update(1)

    # Set model to "train" mode (not needed here, but a good practice)
    model.train()            
            
    # Return the f1 score as the output result
    return metrics.f1_score(y_test, y_pred)
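To see what the last two steps of `evaluate()` compute, the following sketch redoes them by hand on a few made-up log probabilities. The function `f1_binary` is a hypothetical, simplified stand-in for `metrics.f1_score` in the binary case.

```python
def f1_binary(y_true, y_pred):
    # Count true positives, false positives, and false negatives for class 1
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # The f1 score is the harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# Made-up log probabilities for 4 samples and 2 classes
toy_log_probs = [[-0.1, -2.4], [-1.6, -0.2], [-0.7, -0.7], [-3.0, -0.05]]

# Predicted label = index of the largest log probability (ties go to the first index)
toy_pred = [row.index(max(row)) for row in toy_log_probs]
toy_true = [0, 1, 1, 1]

toy_f1 = f1_binary(toy_true, toy_pred)
```

In this toy case, precision is 1.0 and recall is 2/3, which gives an f1 score of 0.8.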

For a quick test, let's evaluate the newly created model. Of course, we didn't train our model, but it will still make predictions based on the initial weights.

In [None]:
print(evaluate(classifier, loader_test))

### Train Model (and evaluate after each epoch)

Similar to the method `evaluate()`, we also implement a method `train()` to wrap all the steps required for training. This has the advantage that we can simply call `train()` multiple times to proceed with the training. Apart from the model, this method has the following input parameters:

* `loader_train` and `loader_test`: this allows us to compute the f1 score over the training data and the test data after each epoch; we can later visualize the changes in the f1 scores

* `optimizer`: the optimizer specifies how the computed gradients are used to update the weights; in the lecture, we only covered the basic Stochastic Gradient Descent, but there are much more efficient alternatives available

* `criterion`: this is the loss function; "criterion" is just very common terminology in the PyTorch documentation and tutorials

* `num_epochs`: the number of epochs -- i.e., the number of times we want to train over all samples in our dataset

The heart of the method is the snippet marked as "PyTorch Magic". It consists of the following 3 lines of code:

* `optimizer.zero_grad()`: After each training step for a batch, we have to set the gradients back to zero for the next batch

* `loss.backward()`: Calculating all gradients using backpropagation

* `optimizer.step()`: Update all weights using the gradients and the method of the specific optimizer
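The reason the gradients need to be reset is that PyTorch accumulates them across calls to `backward()`. The sketch below shows this on a single made-up scalar weight; it resets the gradient manually with `w.grad.zero_()` for illustration, which is the job `optimizer.zero_grad()` performs for all parameters of a model.

```python
import torch

# A single trainable "weight"
w = torch.tensor(1.0, requires_grad=True)

# d(2w)/dw = 2, so the gradient after the first backward pass is 2
(2.0 * w).backward()
grad_first = w.grad.item()

# Without a reset, the second backward pass ADDS to the stored gradient
(2.0 * w).backward()
grad_accumulated = w.grad.item()

# Resetting the gradient before the next step
w.grad.zero_()
grad_after_reset = w.grad.item()
```

Without the reset, each update would be based on the sum of the gradients of all previous batches rather than on the current batch alone.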

In [None]:
def train(model, loader_train, loader_test, optimizer, criterion, num_epochs):

    losses, f1_train, f1_test = [], [], []
    
    # Set model to "train" mode (not needed here, but a good practice)
    model.train()

    # Run all epochs
    for epoch in range(1, num_epochs+1):

        # Initialize epoch loss (cumulative loss for all batches)
        epoch_loss = 0.0

        with tqdm(total=len(loader_train)) as pbar:

            # Loop over each batch in the data loader
            for X_batch, y_batch in loader_train:

                # Move data to device
                X_batch, y_batch = X_batch.to(device), y_batch.to(device)

                # Push batch through network to get log probabilities for each sample in batch
                log_probs = model(X_batch)

                # Calculate loss
                loss = criterion(log_probs, y_batch)

                ### PyTorch Magic! ###
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

                # Keep track of overall epoch loss
                epoch_loss += loss.item()

                pbar.update(1)
        
        # Keep track of all epoch losses
        losses.append(epoch_loss)
        
        # Compute f1 score for both TRAINING and TEST data
        f1_tr = evaluate(model, loader_train)
        f1_te = evaluate(model, loader_test)
        f1_train.append(f1_tr)
        f1_test.append(f1_te)

        print("Loss:\t{:.3f}, f1 train: {:.3f}, f1 test: {:.3f} (epoch {})".format(epoch_loss, f1_tr, f1_te, epoch))
     
    # Return all losses and f1 scores (all = for each epoch)
    return losses, f1_train, f1_test        

Before we can actually train the model, we need to instantiate the `criterion` (i.e., the loss function) and the `optimizer`. Since our model returns log probabilities, we need to use the [`nn.NLLLoss()`](https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html), i.e., the Negative Log Likelihood Loss. This is basically the same as the Multiclass Cross Entropy Loss but using log probabilities instead of probabilities.
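The relationship between the two losses can be checked on a small, made-up batch: applying `nn.NLLLoss` to the output of `nn.LogSoftmax` should give the same value as applying `nn.CrossEntropyLoss` directly to the raw outputs of the last linear layer. The tensors below are illustrative values only.

```python
import torch
import torch.nn as nn

# Made-up raw outputs (logits) for 2 samples and 2 classes, plus their true labels
logits = torch.tensor([[2.0, -1.0], [0.5, 1.5]])
targets = torch.tensor([0, 1])

# Variant used in this notebook: LogSoftmax in the model + NLLLoss as criterion
loss_nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), targets)

# Alternative: CrossEntropyLoss, which expects raw logits
loss_ce = nn.CrossEntropyLoss()(logits, targets)

same_loss = torch.allclose(loss_nll, loss_ce)
```

In other words, a model without the final `nn.LogSoftmax` layer would be paired with `nn.CrossEntropyLoss`, whereas a model like ours is paired with `nn.NLLLoss`.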

For the optimizer, we pick the widely used [`Adam`](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html) optimizer. Adam is an extension of stochastic gradient descent that is used in many deep learning applications, such as computer vision and natural language processing. Adam was first introduced in 2014. The name is derived from "adaptive moment estimation": the optimizer uses estimates of the first and second moments of the gradient to adapt the learning rate for each weight of the neural network.
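To make the "moment estimation" a bit more tangible, here is a minimal sketch of a single Adam update for one scalar weight in plain Python. The function name `adam_step` is hypothetical, the default values mirror the commonly used ones, and details of the actual `torch.optim.Adam` implementation (e.g., weight decay) are left out.

```python
import math

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # 1st moment: exponential moving average of the gradients
    m = beta1 * m + (1 - beta1) * grad
    # 2nd moment: exponential moving average of the squared gradients
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction, since m and v start at 0 (t is the step count, starting at 1)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # The step is scaled individually by the 2nd moment estimate
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# One update for a weight of 1.0 with a made-up gradient of 0.5
w_new, m_new, v_new = adam_step(w=1.0, grad=0.5, m=0.0, v=0.0, t=1, lr=0.1)
```

In the very first step, the bias-corrected moments reduce to the gradient and its square, so the weight moves by almost exactly the learning rate (here from 1.0 to roughly 0.9), regardless of the magnitude of the gradient.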

In [None]:
# Create the model and move to device
classifier = SimpleNet2(X_train.shape[1]).to(device)

# Define loss function
criterion = nn.NLLLoss()

# Define optimizer (you can try, but the basic Stochastic Gradient Descent (SGD) is actually not great)
optimizer = torch.optim.Adam(classifier.parameters(), lr=0.0001)

Now we finally have everything in place to train the model. For this, we only need to call `train()` in the code cell below. Note that you can run the code cell below multiple times to continue the training for a further `num_epochs` epochs. Each epoch will print 3 progress bars:

* training over training set

* evaluating over training set

* evaluating over test set

After each epoch, a print statement will show the current loss as well as the latest f1 scores for the training and test set.

In [None]:
num_epochs = 100

losses, f1_train, f1_test = train(classifier, loader_train, loader_test, optimizer, criterion, num_epochs)

Since the method `train()` returns the losses and f1 scores for each epoch, we can use this data to visualize how the loss and the f1 scores change over time, i.e., after each epoch. The code cell below creates the corresponding plot.

In [None]:
x = list(range(1, len(losses)+1))

# Convert losses to numpy array
losses = np.asarray(losses)
# Normalize losses so they match the scale in the plot (we are only interested in the trend of the losses!)
losses = losses/np.max(losses)

plt.figure()

plt.plot(x, losses, lw=3)
plt.plot(x, f1_train, lw=3)
plt.plot(x, f1_test, lw=3)

font_axes = {'family':'serif','color':'black','size':16}

plt.gca().set_xticks(x)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)

plt.xlabel("Epoch", fontdict=font_axes)
plt.ylabel("F1 Score", fontdict=font_axes)
plt.legend(['Loss', 'F1 (train)', 'F1 (test)'], loc='lower left', fontsize=16)
plt.tight_layout()
plt.show()

From the plot, we can observe several things:

* The loss goes down! This essentially means that our model is learning. This can be very useful as an incorrectly implemented model may not throw an error and yet fail to train properly (i.e., the loss does not go down). So it's a good sanity check when implementing and using a new model.

* The f1 score over the training data reaches 1.0 (at least after more epochs). This means that the model learns to correctly predict the label for all training samples. Of course, this is not what we are interested in, but it (a) again tells us that the model is generally learning, and (b) hints that our model might be overfitting.

* The f1 score over the test data almost plateaus after Epoch 4 and even tends to decrease a bit later. This indicates that the model starts to overfit after Epoch 6.

**Important:** This plot showing the trends for the loss and f1 scores (or other metrics) can look very different depending on the dataset, the network architecture, and the hyperparameters. While in this simple example all trends seem to smoothly converge, this is not the case in general. For example, the f1 score for the test data might see a much steeper drop after peaking, which would even more clearly indicate the model is starting to overfit.

---

## Summary

In this notebook, we built our first text classifier using PyTorch. In more detail, we trained and evaluated a Multi-Layer Perceptron to build a classifier for sentiment analysis. Building and training more complex architectures might require more code, but most of the essential steps have been covered here. For example, we already made use of important utility classes such as `Dataset` and `DataLoader` that help us prepare our datasets for training with batches.

Appreciate how PyTorch hides all the nitty-gritty details of the training process, most importantly the computation of all gradients and the updates of the weights. While implementing backpropagation from scratch is not too difficult for simple MLP architectures, for more advanced architectures covered later this would become quite demanding.
