<img src='data/images/section-notebook-header.png' />

# Vanilla RNN Implementation

The Recurrent Neural Network (RNN) architecture is a type of neural network specifically designed to process sequential data. Unlike feedforward neural networks, which process input data independently, RNNs have connections that allow information to be passed from previous steps to the current step, enabling them to capture temporal dependencies and patterns in sequences. Recall the slide from the lecture:

<img src='data/images/lecture-slide-01.png' width="90%" />

The key feature of the RNN architecture is the presence of recurrent connections that create loops in the network, allowing information to persist and flow through time. This looping mechanism allows RNNs to maintain an internal state or memory, which can capture context and information from previous steps in the sequence. This makes RNNs particularly useful for Natural Language Processing (NLP) tasks for several reasons:

* **Sequence Modeling:** RNNs excel at modeling sequential data, making them well-suited for tasks that involve understanding and generating sequences of text. They can effectively process and capture dependencies in sequences, such as word order, sentence structure, and context, which are crucial in NLP tasks like language modeling, machine translation, and sentiment analysis.

* **Variable-Length Input:** NLP often deals with variable-length input, such as sentences or documents of different lengths. RNNs can handle variable-length sequences by iteratively processing each element in the sequence, regardless of its length. This flexibility makes RNNs highly adaptable to NLP tasks where input length varies, allowing them to process text at the word, character, or sentence level.

* **Contextual Information:** RNNs have the ability to maintain a hidden state that captures context and information from previous steps. This contextual information is valuable in NLP tasks that rely on understanding and generating language, as it allows the model to consider the entire sequence and make informed predictions. For example, in machine translation, the hidden state can encode the context of the source sentence, helping the model generate accurate translations.

* **Long-Term Dependencies:** RNNs can theoretically capture long-term dependencies in sequences by propagating information through the recurrent connections. This capability is important for tasks that involve understanding relationships between distant words or phrases. However, traditional RNNs can struggle with capturing long-term dependencies due to the vanishing or exploding gradient problem. To address this, variants such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) were introduced, which have improved memory and gating mechanisms to alleviate these issues.

Overall, the recurrent neural network architecture, including its variants like LSTM and GRU, has proven to be highly effective in NLP tasks. Their ability to model sequential data, handle variable-length input, capture context, and capture long-term dependencies makes them a powerful choice for tasks involving language understanding, generation, and sequence-to-sequence mapping.

In this notebook, we build a simple RNN-based classifier. The input of the model is a last name and the prediction is the nationality associated with that name. What makes this model so easy is that we do not consider sequences of words but sequences of characters -- and compared to words, there is only a small, limited number of characters. Most conveniently, there is no need for word embeddings; we represent each character as a one-hot vector. This model is directly adopted from a [PyTorch tutorial](https://pytorch.org/tutorials/intermediate/char_rnn_generation_tutorial.html); while the dataset is part of the notebook, it is also available [here](https://download.pytorch.org/tutorial/data.zip).

Compared to the PyTorch tutorial, we do not use an existing RNN layer provided by PyTorch such as [`nn.RNN`](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html), [`nn.GRU`](https://pytorch.org/docs/stable/generated/torch.nn.GRU.html), or [`nn.LSTM`](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html). Instead, we implement our own RNN using the most basic architecture as we saw in the lecture; see below:

<img src='data/images/lecture-slide-02.png' width="50%" />

Of course, we still use components such as linear layers and activation functions provided by PyTorch. The goal is to implement the recurrent nature *"by hand"* to get a better understanding about the intuition behind RNNs. Hence the name Vanilla RNN: the most basic recurrent architecture, including a very simple dataset.
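
To make the idea of implementing the recurrence *"by hand"* more tangible, here is a minimal sketch of what such a model could look like. It is an assumption for illustration only: the class name `VanillaRNNSketch`, the layer names `i2h` and `i2o`, and the choice of `tanh` are hypothetical, and the implementation actually used in this notebook may differ.

```python
import torch
import torch.nn as nn


class VanillaRNNSketch(nn.Module):
    """Hypothetical minimal RNN cell (illustration only, not the notebook's model)."""

    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.hidden_size = hidden_size
        # Both linear layers read the concatenation [x_t, h_{t-1}]
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(input_size + hidden_size, output_size)
        self.log_softmax = nn.LogSoftmax(dim=1)

    def forward(self, x, hidden):
        # x: (batch, input_size), hidden: (batch, hidden_size)
        combined = torch.cat((x, hidden), dim=1)
        # New hidden state from the current input and the previous hidden state
        hidden = torch.tanh(self.i2h(combined))
        # Log-probabilities over the classes for the current step
        output = self.log_softmax(self.i2o(combined))
        return output, hidden

    def init_hidden(self, batch_size):
        # The first hidden state h0 is simply a zero vector
        return torch.zeros(batch_size, self.hidden_size)


# The recurrence itself is an ordinary Python loop: one call per sequence element
rnn = VanillaRNNSketch(input_size=4, hidden_size=8, output_size=3)
sequence = torch.eye(4)  # toy sequence of 4 one-hot vectors
hidden = rnn.init_hidden(1)
for t in range(sequence.size(0)):
    output, hidden = rnn(sequence[t].unsqueeze(0), hidden)

print(output.shape, hidden.shape)
```

The only state carried from one step to the next is `hidden`; the output of the last step is what a classifier would use as its prediction for the whole sequence.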

## Setting up the Notebook

### Import Required Packages

In [None]:
import os
import numpy as np
import pandas as pd
from tqdm import tqdm

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn import metrics

We utilize PyTorch as our deep learning framework of choice by importing the `torch` package.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

The implementation of our model can be found in the file `src/rnn.py`.

In [None]:
from src.rnn import VanillaRNN

### Checking/Setting the Computation Device

PyTorch allows you to train neural networks on supported GPUs to significantly speed up the training process. If you have a supported GPU, feel free to utilize it.

In [None]:
use_cuda = torch.cuda.is_available()

# Use this line below to enforce the use of the CPU (in case you don't have a supported GPU)
# With this small dataset and simple model you won't see a difference anyway
#use_cuda = False

device = torch.device("cuda:0" if use_cuda else "cpu")

print("Available device: {}".format(device))

---

## Load & Prepare Data

The corpus of last names can be found in `data/corpora/names` or can be downloaded [here](https://download.pytorch.org/tutorial/data.zip). This folder contains a set of files where the file name indicates the nationality of the names contained in the respective file. For example, the file `German.txt` contains all the names labeled as German.

### Read Datafiles

First, let's read all the files, extract the names and store them in a list `names`. The list `nationalities` keeps track of the nationality of that name derived from the file name.

In [None]:
names, nationalities = [], []

with os.scandir('data/corpora/names/') as it:
    for file_name in it:
        # The nationality is the file name without its extension (e.g., "German.txt" => "german")
        nationality = file_name.name.split('.')[0].lower()
        with open(file_name, encoding='utf-8') as file:
            for line in file:
                # Each line contains exactly one name
                names.append(line.strip().lower())
                nationalities.append(nationality)

print("Number of data samples: {}".format(len(names)))

### Generate Pandas Dataframe

Based on these two lists, we create a Pandas Dataframe which makes its further use as a training dataset for our classifier a bit easier.

**Side note:** Compared to the notebooks about the MLP architecture, we do not use any utility classes such as [`Dataset`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset) and [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader). In this notebook, we do not train the model using mini-batches but use one data sample (i.e., a name + label) at a time. The reason is that all the sequences in a batch need to have the same length. Of course, names -- like text data in general -- can be of different lengths. This is a general problem and there are various approaches to mitigate it. After all, in practice we want to use mini-batches of larger sizes for performance reasons. However, the dataset here is small so performance is not critical. In later notebooks, we see and discuss how to handle sequences of different lengths.

In [None]:
# Create dataframe with two columns
df = pd.DataFrame(columns=['name', 'nationality'])
df['name'] = names
df['nationality'] = nationalities

# Convert string value of nationality to an integer value of range 0..(#nationalities-1)
df.nationality, nationalities_mapping = df.nationality.factorize()

print('Number of different nationalities/labels: {}'.format(len(nationalities_mapping)))
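
For readers unfamiliar with `factorize()`, the following small sketch shows what it returns on a toy series. The nationality values here are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy labels (hypothetical values, only for illustration)
toy = pd.Series(['german', 'english', 'german', 'italian'])

# codes: one integer per row; uniques: the distinct labels in order of first appearance
codes, uniques = toy.factorize()

print(codes)               # integer label for each row
print(list(uniques))       # lookup table from integer back to string
print(uniques[codes[1]])   # map the label of the second row back to its name
```

The second return value plays the role of `nationalities_mapping` above: indexing it with a predicted class label gives back the nationality as a string.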

To see if the whole notebook is working just fine, you can first consider only a random sample of the overall dataset. If everything is working as expected, feel free to come back to this cell and edit it to use the full dataset. It is not important here, since the focus is not on building a state-of-the-art model but on better understanding the RNN architecture.

As usual, we also split our dataset into a training and test set.

In [None]:
# It's generally always good to shuffle the dataset
# For testing, you can also first use smaller samples
#df = df.sample(frac=1.0).reset_index(drop=True)
df = df.sample(frac=0.1).reset_index(drop=True)

# Split the dataset into training and test data
df_train, df_test = train_test_split(df, test_size=0.2)

# Let's see what the training data looks like.
df_train.head(10)

We can also quickly check the sizes of the training and test set.

In [None]:
print('Shape of training data: {}'.format(df_train.shape))
print('Shape of test data: {}'.format(df_test.shape))

Lastly, from an evaluation perspective, it is also useful to know whether our dataset is reasonably balanced or not. For this, we can create a bar plot that shows the number of names for each nationality. The code cell below accomplishes this. Without going into the details, Pandas makes calculating the number of names per nationality very easy.

In [None]:
y = df.groupby('nationality')['name'].count().tolist()
x = list(range(len(y)))

plt.figure()
plt.bar(x, y, width=0.9)
plt.xticks(x, nationalities_mapping, rotation=45)
plt.ylabel("number of names")
plt.show()

We can see that our dataset is highly unbalanced with some nationalities over-represented, particularly Russian and English names. At the very least, this tells us that accuracy would not be a suitable metric to use in our evaluation. We therefore use the f1 score; see below.

### Auxiliary Preparation Steps

We first need to identify the vocabulary including its size. Here, the vocabulary is simply the set of unique characters across all names. The size of the vocabulary, of course, determines the size of the input of our model.

In [None]:
# Sort the characters so that the character-to-index mapping is the same in every run
vocab = sorted(set(''.join(names)))
vocab_size = len(vocab)

print('Vocabulary:')
print(vocab)
print()
print('Size of vocabulary: {}'.format(len(vocab)))

From the output above you can see that we have some odd characters in our vocabulary, e.g., `/` or other special symbols. In practice, we would need to check if indeed all characters are valid and potentially conduct data cleaning steps to fix that. However, here we simply assume that all characters are indeed valid and part of our dataset -- which might indeed be true!

We already know that neural networks don't work on characters or strings. We therefore need to vectorize them, i.e., to convert each name into a 2d tensor (or matrix). Each name is a sequence of characters, and we represent each character as a one-hot encoded vector. Again, this is a perfectly valid approach since:

* The vocabulary is small, so the one-hot vectors are also small

* It is reasonable to treat characters as nominal data without any notion of similarity between them (in contrast to words, where we would like to represent similar words using similar vectors).

The code cell below defines the method `name_to_tensor()` to convert a name into the corresponding 2d tensor, where the first dimension reflects the length of the name in terms of the number of characters, and the second dimension reflects the size of the one-hot vector determined by the size of the vocabulary (cf. line `tensor = torch.zeros(len(name), vocab_size)`). By iterating over each character in the given name, we set the corresponding position in the one-hot vector to 1.

In [None]:
# Turn a name into a tensor of shape (sequence_length, vocab_size)
def name_to_tensor(name):
    # Create tensor of the right shape with all elements being 0
    tensor = torch.zeros(len(name), vocab_size)
    # Create one-hot vector for each character by setting the corresponding element to 1
    for li, letter in enumerate(name):
        tensor[li][vocab.index(letter)] = 1
    return tensor

print(name_to_tensor('jones').size())

As you can see, the tensor for "jones" has a shape of `(5, 58)` as "jones" has 5 letters where each letter is represented by a one-hot encoded vector of the size of the vocabulary (here: 58).
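
As a side note, the same one-hot tensor can also be produced without an explicit Python loop over tensor elements. The sketch below uses `torch.nn.functional.one_hot` together with a dictionary lookup; the names `toy_vocab`, `char_to_index`, and `name_to_tensor_vectorized` are hypothetical, and a tiny toy vocabulary is used so that the example runs on its own.

```python
import torch
import torch.nn.functional as F

# Hypothetical toy vocabulary built from a single name: ['e', 'j', 'n', 'o', 's']
toy_vocab = sorted(set("jones"))
# Dictionary lookup instead of repeatedly searching the vocabulary list
char_to_index = {ch: i for i, ch in enumerate(toy_vocab)}


def name_to_tensor_vectorized(name):
    # Look up the index of each character, then one-hot encode all indices at once
    indices = torch.tensor([char_to_index[ch] for ch in name])
    # one_hot() returns integers; convert to float to match torch.zeros() above
    return F.one_hot(indices, num_classes=len(toy_vocab)).float()


tensor = name_to_tensor_vectorized("jones")
print(tensor.size())
```

For a vocabulary as small as the one in this notebook, the difference in speed is negligible; the loop-based version is kept because it makes each step explicit.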

---

## Building the Model

Now that the data is ready, we can build and train our model. For our classifier, we use the most basic (vanilla) RNN as covered in the lecture. You can check the file `src/rnn.py` for the implementation of class `VanillaRNN`. We have only one additional parameter to specify, which is the dimension of the hidden state, i.e., `hidden_size`. Since this is a rather simple task, we would probably get away with a smaller hidden state. Feel free to lower `hidden_size` to see when it affects the quality of the resulting classifier.

The output size is the number of different nationalities in the dataset, which we can directly infer from the length of `nationalities_mapping` (which is 18 here).

In [None]:
hidden_size = 256
output_size = len(nationalities_mapping)

model = VanillaRNN(vocab_size, hidden_size, output_size).to(device)

# We can also print the model
print(model)

The next section defines two methods: `predict()` predicts the nationality for an individual name, and `evaluate()` evaluates the model over a dataset by calculating its f1 score, which we use because the dataset is imbalanced (see above).

---

## Training & Evaluating the Model

### Evaluation

Similar to the MLP notebook, we first implement a series of auxiliary methods. The first methods handle the evaluation of the model. The code cell below implements the method `evaluate()` to, well, evaluate our model. Apart from the model itself, the method also receives a pandas `DataFrame` as input parameter. This allows us later to use both `df_train` and `df_test` to compute the f1 score on the training and test data using the same method.

Note that we use a separate method `predict()` to actually input the vectorized name into the model and get the prediction. This makes the code a bit cleaner since -- compared to a simple MLP -- "running" the model actually requires looping over each character (more specifically, its one-hot vector). Of course, we could have moved the code implemented by the method `predict()` also directly in the `forward()` method of class `VanillaRNN`. But actually seeing here the loop that reflects the recurrent nature of the RNN is more instructive.

In [None]:
def predict(model, name_tensor):    
    with torch.no_grad():
        # Initialize the first hidden state h0
        hidden = model.init_hidden(1).to(device)
        # Iterate over all characters in the name and give the current character and the last hidden state to the model
        for i in range(name_tensor.size()[0]):
            output, hidden = model(name_tensor[i], hidden)
        # The predicted class is the index in the output vector with the largest value (log-prob)
        _, prediction = torch.max(output, dim=1)
    # Return the class as a simple integer value
    return prediction.cpu().detach().numpy()[0]


def evaluate(model, df):
    # Set model to "eval" mode (not needed here, but a good practice)
    model.eval()
    
    y_pred = []
    y_true = df['nationality'].tolist()
    num_samples = df.shape[0]
    
    # Iterate over all samples in the given data
    with tqdm(total=num_samples) as progress_bar:
        for _, row in df.iterrows():
            # Prepare the data sample
            name, nationality = row['name'], row['nationality']
            # Vectorize name and move tensor to device
            name_tensor = name_to_tensor(name).to(device)
            # Use model to predict the class
            prediction = predict(model, name_tensor)
            # Add prediction to list
            y_pred.append(prediction)
            # Update progress bar
            progress_bar.update(1)
            
    # Set model to "train" mode (not needed here, but a good practice)
    model.train()              
            
    # Return f1 score (here: micro)
    return metrics.f1_score(y_true, y_pred, average='micro')

Let's evaluate the model without any training.

In [None]:
evaluate(model, df_test)

As expected, the score is very low. Note that the micro-averaged f1 score computed by `evaluate()` coincides with the accuracy for single-label multi-class problems like this one. As a sanity check, recall that we have 18 different nationalities (i.e., class labels). This means that a model that is just randomly guessing -- and assuming the class labels are well balanced -- would be correct with a probability of $1/18 \approx 0.0556$. Of course, we already know that our dataset is highly imbalanced, so the score for the untrained model can deviate considerably from $1/18$.
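
To illustrate how the choice of averaging interacts with class imbalance, here is a small from-scratch sketch that computes the micro- and macro-averaged f1 score for a toy example. The helper `f1_scores()` and the toy labels are hypothetical; the computation is intended to mirror what `metrics.f1_score()` does with `average='micro'` and `average='macro'`.

```python
def f1_scores(y_true, y_pred):
    """Return (micro f1, macro f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in labels}
    fp = {c: 0 for c in labels}
    fn = {c: 0 for c in labels}
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom > 0 else 0.0

    # Micro: pool all counts first, then compute one f1 score
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute f1 per class, then take the unweighted mean
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Imbalanced toy data: a "model" that always predicts the majority class 0
y_true = [0] * 8 + [1, 2]
y_pred = [0] * 10

micro, macro = f1_scores(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("micro f1: {:.3f}, macro f1: {:.3f}, accuracy: {:.3f}".format(micro, macro, accuracy))
```

In this toy case the micro-averaged score equals the accuracy, while the macro-averaged score is much lower because the two minority classes are never predicted correctly. If the per-class performance matters, `average='macro'` is therefore the stricter choice.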

### Training (and evaluation after each epoch)

The code cell below implements the method to train our model using the basic loop structure we have already seen in the MLP notebook, which performs the same steps for each batch: forward pass, loss calculation, backpropagation, and parameter update. Note that our batch size is 1, i.e., we use each individual sample (i.e., name + label) to update the trainable parameters. The main reason is that our inputs (i.e., the names) are generally of different lengths, and all samples within a batch need to be of the same length. There are two ways around this:

* Pad all sequences in a batch to the length of the longest sequence in the batch

* Generate batches in such a way that you insert only sequences of equal length

Both are standard techniques but just out of the scope of this notebook.
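
Just to give a first impression of the padding approach, here is a minimal sketch using `torch.nn.utils.rnn.pad_sequence`. The helper `toy_one_hot()` and the toy vocabulary size are hypothetical; the sketch only shows how sequences of different lengths end up in one tensor, not how to train with them.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

toy_vocab_size = 6  # hypothetical vocabulary size for this example


def toy_one_hot(length):
    # Build a toy "name" of the given length as a (length, toy_vocab_size) one-hot tensor
    indices = torch.arange(length) % toy_vocab_size
    return torch.eye(toy_vocab_size)[indices]


# Three sequences of different lengths
batch = [toy_one_hot(3), toy_one_hot(5), toy_one_hot(2)]
lengths = torch.tensor([seq.size(0) for seq in batch])

# Pad with all-zero vectors up to the longest sequence: (batch, max_len, vocab)
padded = pad_sequence(batch, batch_first=True)

# Boolean mask marking the real (non-padded) positions of each sequence
mask = torch.arange(padded.size(1)).unsqueeze(0) < lengths.unsqueeze(1)

print(padded.shape)
print(mask.sum(dim=1))
```

The mask (or the list of original lengths) is needed later so that the padded positions do not influence the prediction or the loss.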

In [None]:
def train(model, df_train, df_test, optimizer, criterion, num_epochs):

    losses, f1_train, f1_test = [], [], []    
    
    # Run all epochs
    for epoch in range(1, num_epochs+1):
        # Shuffle training data so that the samples are seen in a new order in each epoch (good practice)
        df = df_train.sample(frac=1).reset_index(drop=True)
        # Initialize loss for whole epoch
        epoch_loss = 0.0

        with tqdm(total=len(df)) as progress_bar:
            # Iterate over the SHUFFLED training data
            for _, row in df.iterrows():
                # Prepare the data sample
                name, nationality = row['name'], row['nationality']

                # Convert name and nationality to tensors and move them to the GPU (if available)
                name = name_to_tensor(name).to(device)
                nationality = torch.tensor([nationality], dtype=torch.long).to(device)

                # Initialize the first hidden state h0
                hidden = model.init_hidden(1).to(device)

                # Iterate over all characters in the name and give the current character and the last hidden state to the model
                for i in range(name.size()[0]):
                    output, hidden = model(name[i], hidden)

                # Calculate loss
                loss = criterion(output, nationality)

                # Let PyTorch do its magic to calculate the gradients and update all trainable parameters
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

                # Keep track of overall epoch loss
                epoch_loss += loss.item()

                # Update progress bar
                progress_bar.update(1)

        # Keep track of all epoch losses
        losses.append(epoch_loss)

        # Compute f1 score for both TRAINING and TEST data
        f1_tr = evaluate(model, df_train)
        f1_te = evaluate(model, df_test)
        f1_train.append(f1_tr)
        f1_test.append(f1_te)
        
        print("Loss:\t{:.3f}, f1 train: {:.3f}, f1 test: {:.3f} (epoch {})".format(epoch_loss, f1_tr, f1_te, epoch))
        
    # Return all losses and f1 scores (all = for each epoch)
    return losses, f1_train, f1_test

The last steps before training our model are as usual to first define the loss function (i.e. criterion) as well as the optimizer. Since our model returns log probabilities, we have to use the Negative Log Likelihood Loss ([`nn.NLLLoss()`](https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html)). We also use the [`Adam`](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html) optimizer.
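
The link between log probabilities and `nn.NLLLoss()` can be checked with a small sketch: applying `nn.NLLLoss()` to the output of a log-softmax gives the same value as applying `nn.CrossEntropyLoss()` directly to the raw scores (logits). The random logits and the target class below are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy raw scores for 1 sample and 18 classes, plus a made-up target class
logits = torch.randn(1, 18)
target = torch.tensor([3])

# Variant 1: log-softmax inside the "model", NLLLoss as the criterion
log_probs = nn.LogSoftmax(dim=1)(logits)
loss_nll = nn.NLLLoss()(log_probs, target)

# Variant 2: raw logits from the "model", CrossEntropyLoss as the criterion
loss_ce = nn.CrossEntropyLoss()(logits, target)

print(loss_nll.item(), loss_ce.item())
```

Both values agree, and for a single sample the loss is simply the negative log probability that the model assigns to the correct class.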

In [None]:
# Create the model and move it to the device
model = VanillaRNN(vocab_size, hidden_size, output_size).to(device)

# Define loss function
criterion = nn.NLLLoss()

# Define optimizer (you can try, but the basic (Stochastic) Gradient Descent is actually not great)
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)

Now we have all the parameters to call `train()` in the code cell below. Note that you can run the code cell below multiple times to continue the training for a further 10 epochs. Each epoch will print 3 progress bars:

* training over training set

* evaluating over training set

* evaluating over test set

After each epoch, a print statement will show the current loss as well as the latest f1 scores for the training and test set.

In [None]:
num_epochs = 10

losses, f1_train, f1_test = train(model, df_train, df_test, optimizer, criterion, num_epochs)

### Plotting the Results

Since the method `train()` returns the losses and f1 scores for each epoch, we can use this data to visualize how the loss and the f1 scores change over time, i.e., after each epoch. The code cell below creates the corresponding plot.

In [None]:
x = list(range(1, len(losses)+1))

# Convert losses to numpy array
losses = np.asarray(losses)
# Normalize losses so they match the scale in the plot (we are only interested in the trend of the losses!)
losses = losses/np.max(losses)

plt.figure()

plt.plot(x, losses, lw=3)
plt.plot(x, f1_train, lw=3)
plt.plot(x, f1_test, lw=3)

font_axes = {'family':'serif','color':'black','size':16}

plt.gca().set_xticks(x)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)

plt.xlabel("Epoch", fontdict=font_axes)
plt.ylabel("F1 Score", fontdict=font_axes)
plt.legend(['Loss', 'F1 (train)', 'F1 (test)'], loc='lower left', fontsize=16)
plt.tight_layout()
plt.show()

### Making Predictions

In practice, we also want to use our trained model to predict the nationality for a given name. In the code cell below, pick your name of choice and let the model predict the nationality.

In [None]:
name = 'chris'

# Convert name to a tensor and move it to the GPU (if available)
name_tensor = name_to_tensor(name).to(device)
        
# Use model to predict the class
prediction = predict(model, name_tensor)

# Convert class label to nationality
print(nationalities_mapping[prediction])

---

## Summary

Basic recurrent neural networks (RNNs) are a type of neural network architecture designed to process sequential data by capturing dependencies and patterns over time. They have a looping mechanism that allows information to persist and flow through the network, making them well-suited for tasks involving sequences, such as Natural Language Processing (NLP) and time series analysis.

In a basic RNN, each step of the sequence is processed one at a time. At each step, the RNN takes an input, which could be a word, a character, or a feature vector, and combines it with the hidden state from the previous step. This combination is passed through an activation function, typically a hyperbolic tangent or sigmoid function, to produce an output and update the hidden state. The updated hidden state is then used in the next step, allowing the network to capture information and dependencies from earlier steps.

RNNs have been successfully applied to various NLP tasks. They can perform tasks such as language modeling, where the network learns to predict the next word in a sequence given the previous words. RNNs are also used in machine translation, sentiment analysis, named entity recognition, and many other NLP applications. By leveraging their sequential processing capabilities and capturing context and dependencies over time, basic RNNs and their variants have become essential tools in NLP, enabling more sophisticated language understanding, generation, and sequence-to-sequence mapping.

The topic of RNNs is of course much wider than covered in this notebook. For example, one limitation of basic RNNs is the vanishing or exploding gradient problem. During backpropagation, the gradients can either become too small, causing the network to have difficulty learning long-term dependencies, or become too large, leading to instability during training. To address this, more advanced variants like Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) were introduced. These variants incorporate gating mechanisms to control the flow of information and better capture long-term dependencies, making them widely used in practice.
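
The vanishing and exploding gradient problem can be illustrated with a deliberately simplified sketch. It assumes a scalar, purely linear recurrence $h_t = w \cdot h_{t-1}$ (no input and no activation function), which is far simpler than a real RNN; the function name `gradient_factor()` is hypothetical. Under this assumption, the gradient of the last hidden state with respect to the first one is just $w$ multiplied by itself once per step.

```python
def gradient_factor(w, num_steps):
    """Gradient of h_T w.r.t. h_0 for the scalar linear recurrence h_t = w * h_{t-1}."""
    factor = 1.0
    for _ in range(num_steps):
        # Each step back in time multiplies the gradient by the recurrent weight
        factor *= w
    return factor


for w in (0.5, 1.0, 1.5):
    print("w = {}: gradient factor after 50 steps = {:.3e}".format(w, gradient_factor(w, 50)))
```

A recurrent weight slightly below 1 makes the gradient shrink towards zero over many steps, while a weight slightly above 1 makes it grow very quickly. In a real RNN the same effect arises from repeated multiplication with the recurrent weight matrix and the derivative of the activation function.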