# Recurrent Neural Networks
## IMD1107 - Natural Language Processing
### [Dr. Elias Jacob de Menezes Neto](https://docente.ufrn.br/elias.jacob)

## Key Points

- Recurrent Neural Networks (RNNs) process sequential data by maintaining hidden states across time steps, allowing them to capture dependencies and relationships between words in a sentence.

- RNNs face challenges with vanishing and exploding gradients when dealing with long sequences. The vanishing gradient problem occurs when gradients become extremely small during backpropagation, preventing effective learning. The exploding gradient problem happens when gradients become very large, leading to unstable training.

- Long Short-Term Memory (LSTM) networks address the limitations of RNNs by introducing a memory cell and gates (input, output, forget) that regulate information flow. This allows LSTMs to effectively capture long-term dependencies.

- The LSTM architecture consists of an input gate that controls what new information is added to the cell state, a forget gate that determines what information to discard, an output gate that decides what information from the cell state is used to compute the hidden state, and a memory cell that maintains long-term information.

- Bidirectional LSTM networks process sequences in both forward and backward directions, capturing both past and future contexts. This additional context helps improve the performance of sequence processing tasks.

- Gated Recurrent Units (GRUs) simplify the LSTM architecture by combining the hidden and cell states into a single hidden state and using only two gates (update and reset). GRUs are computationally more efficient than LSTMs while still effectively handling long-term dependencies.

- The case study demonstrated the application of RNN, LSTM, and GRU architectures on a name gender classification task using a dataset from the 2010 Brazilian Census. The Bidirectional LSTM achieved the highest accuracy, followed by the Bidirectional GRU, LSTM, and Vanilla RNN.


## Learning Goals

By the end of this class, you will be able to:

1) **Explain the fundamental principles of Recurrent Neural Networks (RNNs) and how they process sequential data by maintaining hidden states across time steps.**

2) **Compare and contrast the architectures of Vanilla RNNs, Long Short-Term Memory (LSTM) networks, and Gated Recurrent Units (GRUs), highlighting their mechanisms for addressing the vanishing gradient problem and capturing long-term dependencies.**

3) **Implement and train Recurrent Neural Network models, including Vanilla RNN, LSTM, and Bidirectional LSTM/GRU, using PyTorch for a sequence classification task, such as name gender prediction.**

4) **Evaluate the performance of different recurrent network architectures using appropriate metrics like accuracy, and interpret classification reports and confusion matrices to understand model strengths and weaknesses.**

5) **Discuss the tradeoffs between different recurrent network architectures in terms of model complexity, number of parameters, computational efficiency, and accuracy, and justify the selection of a particular architecture for specific NLP tasks.**

# Recurrent Networks for NLP


When processing textual data, it's crucial to consider the dependencies and relationships between words in a sentence. The semantics of a sentence can change profoundly based on the order and selection of words.

Consider these two similar sentences:
> "A bomba explodiu no jornal."
>
> "A notícia do jornal explodiu como uma bomba"

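Although both sentences share most of their words ("bomba", "explodiu", "jornal"), rearranging them produces a completely different meaning and emotional impact. The first ("The bomb exploded at the newspaper") reports a literal explosion, while the second ("The newspaper's story exploded like a bomb") is a figure of speech. Context plays a vital role, especially when a sentence's overall meaning depends on what was said or happened previously.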

*Recurrent Neural Networks* (RNNs) give a neural network the ability to remember previous words in a sentence, so it can capture patterns that depend on the order in which tokens appear. This is the fundamental premise of RNNs.

## How RNNs Maintain State Across Time

RNNs operate on the principle of maintaining state across time. Although this might seem complicated at first, it essentially means giving the network context for its current operation based on what it has already seen.

A standard feed-forward network treats each input independently. In an RNN, the output of the hidden layer at time step 't' is provided as an additional input at the next step 't+1', along with the fresh data supplied at 't+1'. In simpler terms, you inform the network about what happened earlier alongside what is happening "now".

This concept forms the basis of RNNs—which learn and remember over time, enabling them to better capture patterns within sequences. Understanding this is key to exploiting the power of RNNs for text analysis and other sequential data processing tasks.
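
The recurrence described above can be sketched in a few lines. This is a minimal illustration, assuming the usual `tanh` update rule and arbitrary small sizes. The names `W_xh`, `W_hh`, `b_h` and `rnn_step` are hypothetical and are not part of the model trained later in this notebook.

```python
import torch

torch.manual_seed(0)

input_size, hidden_size = 4, 3  # illustrative sizes only

# One set of weights, reused at every time step
W_xh = torch.randn(hidden_size, input_size) * 0.1
W_hh = torch.randn(hidden_size, hidden_size) * 0.1
b_h = torch.zeros(hidden_size)


def rnn_step(x_t, h_prev):
    """Combine the current input with the previous hidden state."""
    return torch.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)


sequence = torch.randn(5, input_size)  # 5 time steps
h = torch.zeros(hidden_size)  # initial state: no history yet
states = []
for x_t in sequence:
    h = rnn_step(x_t, h)  # what happened earlier + what is happening "now"
    states.append(h)

states = torch.stack(states)
print(states.shape)  # torch.Size([5, 3])
```

Each row of `states` is the hidden state after one more time step, and the last row summarizes the whole sequence.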

## Visualizing an RNN

You can visualize a recurrent net as shown in the figure below:

<p align="center">
<img src="images/rnn_unrolled.png" alt="" style="width: 50%; height: 50%"/>
</p>


Look at the left side. The circles are entire feedforward network layers composed of one or more neurons. The output of the hidden layer emerges from the network as normal, but it’s also set aside to be passed back in as an input to itself along with the normal input from the next time step. This feedback is represented with an arc from the output of a layer back into its own input.

An easier way to see this process—and it’s more commonly shown this way—is by unrolling the net. The right side of the image above shows the network stood on its head with two unfoldings of the time variable (t), showing layers for t+1 and t+2.

Each time step is represented by a column of neurons in the unrolled version of the very same neural network. It’s like looking at a screenplay or video frame of the neural net for each sample in time. The network to the right is the future version of the network on the left. The output of a hidden layer at one time step (t) is fed back into the hidden layer along with input data for the next time step (t+1) to the right. Repeat. This diagram shows two iterations of this unfolding, so three columns of neurons for t=0, t=1, and t=2.

All the vertical paths in this visualization are clones, or views of the same neurons. They are the single network represented on a timeline. This visualization is helpful when talking about how information flows through the network forward and backward during backpropagation. But when looking at the three unfolded networks, remember that they’re all different snapshots of the same network with a single set of weights.

### Structure of RNN: Feedforward Network Layers

Viewing the left side of the image above, you'll notice circles that represent layers in a feedforward network, with each layer comprising one or more neurons. The output of the hidden layer not only moves forward through the network but also feeds back into the input of its originating layer.

This recurrent feedback is illustrated by an arc looping from the layer's output back to its own input.

### Unfolding Time Variable for Better Visualization

To better visualize this process, we can 'unroll' the network over time. This technique, represented on the right side of the image, essentially flips the network on its head, revealing the progress of the network over two stages of the time variable (t), namely t+1 and t+2.

Each time step 't' is denoted as a column of neurons in the unrolled version of our network. It can be thought of as watching successive frames of a movie, where each frame represents the state of the network at a given moment in time.

### Cloned Views of Same Neurons

In this representation, all vertical paths are clones or different views of the same set of neurons; they depict the same neural network captured at various points along a timeline.

While this kind of representation makes it easier to follow the information flow (forward, and backward during backpropagation), remember that these multiple 'unfolded' networks are merely snapshots of the same single network at successive time steps, all sharing one set of weights.


> Recognizing an unrolled RNN as sequential instances of the same network operating over time is crucial for understanding how RNNs capture and utilize temporal information from sequences. This comprehension forms the basis of effectively utilizing the power of RNNs for sequence data processing tasks.
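
One way to check the "same network at every time step" idea is to unroll a PyTorch `nn.RNN` by hand. The sketch below assumes a single-layer RNN with illustrative sizes. It copies the weights of `nn.RNN` into an `nn.RNNCell` and applies that one cell step by step, so both should produce the same hidden states.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

rnn = nn.RNN(input_size=4, hidden_size=3)  # processes a whole sequence
cell = nn.RNNCell(input_size=4, hidden_size=3)  # processes one time step

# Copy the RNN's weights into the cell so both use the same parameters
with torch.no_grad():
    cell.weight_ih.copy_(rnn.weight_ih_l0)
    cell.weight_hh.copy_(rnn.weight_hh_l0)
    cell.bias_ih.copy_(rnn.bias_ih_l0)
    cell.bias_hh.copy_(rnn.bias_hh_l0)

x = torch.randn(6, 1, 4)  # [seq_len, batch_size, input_size]

with torch.no_grad():
    output, h_n = rnn(x)

    # Manual unrolling: the same cell (same weights) at every time step
    h = torch.zeros(1, 3)
    unrolled = []
    for t in range(x.size(0)):
        h = cell(x[t], h)
        unrolled.append(h)
    unrolled = torch.stack(unrolled)

matches = torch.allclose(output, unrolled, atol=1e-5)
print(matches)
```

If `matches` is true, the six "columns" of the unrolled picture are indeed one cell with one set of weights, applied six times.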

## Training our first RNN

### Our case study

Let's discuss an interesting scenario. Assume you are employed at the Ombudsman Office of our University. A major part of your role involves addressing students' complaints and correspondingly communicating with the related departments. As a measure to enhance the quality of your communication, you have decided to use gender-appropriate pronouns based on the person's first name.

One important goal is to avoid incorrectly gendering specific roles or offices within the university (e.g., referring to the President's office as "Gabinete do Reitor" even when the President is a woman, which happened between 2011 and 2019 at UFRN).

To achieve this objective, we will be using a [dataset collected by IBGE during 2010 Census](https://brasil.io/dataset/genero-nomes/nomes/). This dataset contains a total of 90,104 names, out of which 49,274 are female and 40,830 are male. To ensure accuracy, any names that could be associated with both genders, such as "Elias", "Ivani" or "Alison", have been excluded from our analysis.

> Interestingly, during my data exploration, I found out that there were 189,315 men and 1,387 women with the same name as mine. I would never have imagined that women could be named Elias!

Our aim here is to develop a Recurrent Neural Network (RNN) that can read a name letter by letter, and predict the probability of the name being either masculine or feminine. For the purpose of this project:

- Each lowercase letter will be considered a token
- The vocabulary will comprise the 26 alphabet letters
- Any accented letters will be converted to their non-accented versions

We will divide our dataset into two parts: 80% for training and 20% for validation.

> Please note that while we recognize and respect the existence of non-binary gender identities, for the purposes of this exercise, we will be employing a binary classification model due to dataset limitations. Our dataset solely contains names which are classified as either masculine or feminine, hence we are confined to two classes. Maybe in the future we can work on a more inclusive model!

With that said, let's start by importing the necessary libraries and loading our dataset.

In [1]:
# Import the unicodedata library for Unicode character database
import unicodedata

# Import PyTorch library for deep learning
import torch

# Import neural network module from PyTorch
import torch.nn as nn

# Import functional interface for neural networks from PyTorch
import torch.nn.functional as F

# Import optimization algorithms from PyTorch
import torch.optim as optim

# Import pandas library for data manipulation and analysis
import pandas as pd

# Import random module for generating random numbers
import random

# Import accuracy_score, classification_report, and confusion_matrix from sklearn for evaluation metrics
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Import datetime and timedelta from datetime module for handling date and time
from datetime import datetime, timedelta

In [2]:
df = pd.read_csv("data/names_gender.csv")
df

Unnamed: 0,name,label,label_str
0,silmari,1,F
1,jovanilde,1,F
2,yorrana,1,F
3,nakita,1,F
4,tiarle,0,M
...,...,...,...
90098,edsmar,0,M
90099,altenice,1,F
90100,arthemis,1,F
90101,mielly,1,F


In [3]:
import unicodedata


def normalize_name(name):
    # Step 1: Normalize the Unicode string
    # NFKD stands for Normalization Form Compatibility Decomposition
    # This step separates combined characters into base character and diacritical marks
    normalized = unicodedata.normalize("NFKD", name)

    # Step 2: Encode to ASCII and ignore non-ASCII characters
    # This effectively removes accents and other diacritical marks
    ascii_encoded = normalized.encode("ASCII", "ignore")

    # Step 3: Decode the ASCII bytes back into a Python string
    # (ASCII is a subset of UTF-8, so decoding as UTF-8 is safe here)
    utf8_decoded = ascii_encoded.decode("utf-8")

    # Step 4: Convert to lowercase
    # This ensures consistent casing across all names
    lowercased = utf8_decoded.lower()

    return lowercased


# Example usage:
# This will output 'jose', removing the accent from the 'é'
print(normalize_name("José"))  # Output: jose
print(normalize_name("Café"))  # Output: cafe
print(normalize_name("Über"))  # Output: uber

jose
cafe
uber


In [4]:
df

Unnamed: 0,name,label,label_str
0,silmari,1,F
1,jovanilde,1,F
2,yorrana,1,F
3,nakita,1,F
4,tiarle,0,M
...,...,...,...
90098,edsmar,0,M
90099,altenice,1,F
90100,arthemis,1,F
90101,mielly,1,F


In [5]:
print(df.shape)
df["name"] = df["name"].apply(normalize_name)
df.drop_duplicates(inplace=True, subset=["name"], keep=False)
print(df.shape)

(90103, 3)
(90103, 3)


In [6]:
df.query('name == "jose"')

Unnamed: 0,name,label,label_str
58546,jose,0,M


In [7]:
df.query('name == "maria"')

Unnamed: 0,name,label,label_str
30451,maria,1,F


In [8]:
# This will be our simple tokenizer. It will split the names into letters


def custom_tokenizer_letters(text):
    text = normalize_name(text)

    # Convert the normalized text into a list of individual characters
    # This approach treats each letter as a separate token
    return list(text)


# Example usage of the tokenizer
print(custom_tokenizer_letters("José"))

# Expected output: ['j', 'o', 's', 'e']

['j', 'o', 's', 'e']


In [9]:
df

Unnamed: 0,name,label,label_str
0,silmari,1,F
1,jovanilde,1,F
2,yorrana,1,F
3,nakita,1,F
4,tiarle,0,M
...,...,...,...
90098,edsmar,0,M
90099,altenice,1,F
90100,arthemis,1,F
90101,mielly,1,F


In [10]:
# We reserve two tokens: <pad> for padding and <unk> for any unknown letter.
PAD_TOKEN = "<pad>"
UNK_TOKEN = "<unk>"

# First, split the data into training and validation sets (80/20)
# Here we use a random shuffle and split based on indices.
df = df.sample(frac=1, random_state=271828).reset_index(drop=True)
split_idx = int(len(df) * 0.8)
df_train = df.iloc[:split_idx].reset_index(drop=True)
df_valid = df.iloc[split_idx:].reset_index(drop=True)

# Build the vocabulary from training names. We loop over each name, tokenize it,
# and collect all unique letters.
letters = set()
for name in df_train["name"]:
    tokens = custom_tokenizer_letters(name)
    letters.update(tokens)

In [11]:
# Create a sorted list of letters.
letters = sorted(list(letters))
letters

['a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z']

In [12]:
# Build vocabulary dictionary: index mapping for each letter.
# Reserve index 0 for PAD and index 1 for UNK.
vocab = {PAD_TOKEN: 0, UNK_TOKEN: 1}
for i, letter in enumerate(letters, start=2):
    vocab[letter] = i
vocab

{'<pad>': 0,
 '<unk>': 1,
 'a': 2,
 'b': 3,
 'c': 4,
 'd': 5,
 'e': 6,
 'f': 7,
 'g': 8,
 'h': 9,
 'i': 10,
 'j': 11,
 'k': 12,
 'l': 13,
 'm': 14,
 'n': 15,
 'o': 16,
 'p': 17,
 'q': 18,
 'r': 19,
 's': 20,
 't': 21,
 'u': 22,
 'v': 23,
 'w': 24,
 'x': 25,
 'y': 26,
 'z': 27}

In [13]:
# Create an inverse mapping (optional)
itos = {i: s for s, i in vocab.items()}
print("Vocabulary:", vocab)
print("Vocabulary size:", len(vocab))

Vocabulary: {'<pad>': 0, '<unk>': 1, 'a': 2, 'b': 3, 'c': 4, 'd': 5, 'e': 6, 'f': 7, 'g': 8, 'h': 9, 'i': 10, 'j': 11, 'k': 12, 'l': 13, 'm': 14, 'n': 15, 'o': 16, 'p': 17, 'q': 18, 'r': 19, 's': 20, 't': 21, 'u': 22, 'v': 23, 'w': 24, 'x': 25, 'y': 26, 'z': 27}
Vocabulary size: 28


In [14]:
# Utility Functions for Conversion


def tokenize_and_convert(name, vocab):
    """
    Given a name string, tokenize into letters and convert each letter into its corresponding index.
    Unknown letters are replaced with the index for <unk>.
    """
    tokens = custom_tokenizer_letters(name)
    indices = [vocab.get(token, vocab[UNK_TOKEN]) for token in tokens]
    return torch.tensor(indices, dtype=torch.long)


def label_to_int(label_str):
    """
    Convert label string to integer.
    We'll assume that in the CSV the label is given as "F" for female and "M" for male.
    (If your CSV uses other encodings, adjust this function accordingly.)
    Here we map 'M' -> 0 and 'F' -> 1.
    """
    return 0 if label_str.strip().upper() == "M" else 1

In [15]:
# PyTorch Dataset
from torch.utils.data import Dataset


class NamesDataset(Dataset):
    def __init__(self, df, vocab):
        """
        Args:
          df: DataFrame with columns 'name' and 'label'
          vocab: letter to index dictionary
        """
        self.names = df["name"].tolist()
        self.labels = df["label"].tolist()
        self.vocab = vocab

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        name = self.names[idx]
        # Convert name to list of token indices
        name_tensor = tokenize_and_convert(name, self.vocab)
        # Also get the label as an integer.
        label_tensor = torch.tensor(self.labels[idx], dtype=torch.long)
        # Return the tensor, its length, and the label.
        return name_tensor, len(name_tensor), label_tensor


# Create datasets
train_dataset = NamesDataset(df_train, vocab)
valid_dataset = NamesDataset(df_valid, vocab)

In [16]:
df_train

Unnamed: 0,name,label,label_str
0,francileda,1,F
1,biaka,1,F
2,artemira,1,F
3,oberico,0,M
4,eliziene,1,F
...,...,...,...
72077,tindaro,0,M
72078,diuliene,1,F
72079,damaris,1,F
72080,divonzir,0,M


In [17]:
# Also create a list of (name, label) tuples in string/int format for later evaluation.
names_train = [
    (name, str(label)) for name, label in zip(df_train["name"], df_train["label"])
]
names_valid = [
    (name, str(label)) for name, label in zip(df_valid["name"], df_valid["label"])
]

In [18]:
print("Train dataset size", len(train_dataset))
print("Valid dataset size", len(valid_dataset))

Train dataset size 72082
Valid dataset size 18021


In [19]:
train_dataset[0]

(tensor([ 7, 19,  2, 15,  4, 10, 13,  6,  5,  2]), 10, tensor(1))
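
To sanity-check the encoding, the indices can be mapped back to letters. The sketch below rebuilds the same vocabulary (0 for `<pad>`, 1 for `<unk>`, letters from 2 onward) so that it runs on its own. The `decode` helper is a hypothetical name introduced only for this illustration.

```python
import string

# Rebuild the same mapping as above: 0 = <pad>, 1 = <unk>, letters start at 2
vocab = {"<pad>": 0, "<unk>": 1}
for i, letter in enumerate(string.ascii_lowercase, start=2):
    vocab[letter] = i
itos = {i: s for s, i in vocab.items()}


def decode(indices):
    """Map token indices back to a string, skipping padding."""
    return "".join(itos[i] for i in indices if i != vocab["<pad>"])


indices = [7, 19, 2, 15, 4, 10, 13, 6, 5, 2]  # values shown in the output above
print(decode(indices))  # francileda
```

The decoded string matches the first row of `df_train`, which confirms that the dataset returns names in the expected order.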

In [20]:
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader


# Custom Collate Function
def collate_fn(batch):
    """
    Expects a list of tuples: (name_tensor, name_length, label_tensor)
    This function pads the name tensors to the length of the longest name and returns:
      names_padded: tensor [max_seq_length, batch_size]
      lengths: tensor of sequence lengths
      labels: tensor of labels
    """
    # Separate out the fields.
    name_tensors, lengths, labels = zip(*batch)
    # Pad the sequences. (With batch_first=False the output has shape: [max_seq_len, batch_size])
    names_padded = pad_sequence(
        name_tensors, batch_first=False, padding_value=vocab[PAD_TOKEN]
    )
    # Convert lengths and labels to tensors.
    lengths = torch.tensor(lengths, dtype=torch.long)
    labels = torch.stack(labels)
    return names_padded, lengths, labels


# Create DataLoader objects.
BATCH_SIZE = 32
train_loader = DataLoader(
    train_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_fn
)
valid_loader = DataLoader(
    valid_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn
)
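
To make the padded layout concrete, here is a small standalone sketch of what `pad_sequence` produces with `batch_first=False`. The three encoded names are illustrative examples chosen for this sketch, using the letter indices from the vocabulary above.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three names already converted to indices, with lengths 4, 3 and 2
batch = [
    torch.tensor([11, 16, 20, 6]),  # "jose"
    torch.tensor([2, 15, 2]),  # "ana"
    torch.tensor([13, 22]),  # "lu"
]

padded = pad_sequence(batch, batch_first=False, padding_value=0)
print(padded.shape)  # torch.Size([4, 3]) -> [max_seq_len, batch_size]
print(padded)
```

Each column holds one name and shorter names are filled with the `<pad>` index (0) at the end. This is why the training loop reads the batch size from `names.size(1)` rather than `names.size(0)`.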

In [22]:
random.choices(names_valid, k=10)

[('amarantino', '0'),
 ('sebila', '1'),
 ('anaili', '1'),
 ('mauricius', '0'),
 ('estefesson', '0'),
 ('deonize', '1'),
 ('antonios', '0'),
 ('eronisa', '1'),
 ('vicencio', '0'),
 ('melquias', '0')]

In [23]:
random.choices(names_train, k=10)

[('atalirio', '0'),
 ('dionato', '0'),
 ('arioneide', '1'),
 ('leicia', '1'),
 ('claonice', '1'),
 ('jilmario', '0'),
 ('tharcila', '1'),
 ('meize', '1'),
 ('marsol', '0'),
 ('nesmar', '0')]

In [26]:
train_dataset[2]

(tensor([ 2, 19, 21,  6, 14, 10, 19,  2]), 8, tensor(1))

In [27]:
# Import the Counter class from the collections module
from collections import Counter

# Use a list comprehension to extract the gender labels from the training and validation datasets
# The Counter class is then used to count the frequency of each label (0 for male, 1 for female)
# The resulting counts are printed to the console
print(Counter([label.item() for _, _, label in train_loader.dataset]))
print(Counter([label.item() for _, _, label in valid_loader.dataset]))

Counter({1: 39345, 0: 32737})
Counter({1: 9929, 0: 8092})
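
These counts also give a useful reference point for later results. As a rough sketch, the accuracy of a trivial classifier that always predicts the most frequent validation label can be computed directly from the numbers printed above:

```python
# Validation label counts printed above: 1 = female, 0 = male
valid_counts = {1: 9929, 0: 8092}

total = sum(valid_counts.values())
majority_label = max(valid_counts, key=valid_counts.get)
baseline_accuracy = valid_counts[majority_label] / total

print(f"Always predicting {majority_label}: {baseline_accuracy:.1%} accuracy")
```

Any trained model should comfortably beat this majority-class baseline of roughly 55%.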


In [28]:
len(train_dataset), len(valid_dataset)

(72082, 18021)

In [30]:
# Check if a GPU is available for PyTorch
# torch.cuda.is_available() returns True if a GPU is available, otherwise False
# If a GPU is available, set the device to 'cuda' to utilize the GPU for computations
# If a GPU is not available, set the device to 'cpu' to use the CPU for computations
device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

In [31]:
import torch.nn as nn


# Define a class for our RNN model
class NameRNN(nn.Module):
    def __init__(self, hidden_size, embedding_dim, vocab_size, pad_idx):
        super().__init__()

        # Define an embedding layer to convert our letters to vectors
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)

        # Define an RNN layer to encode our sequence of letters
        self.encoder = nn.RNN(
            input_size=embedding_dim,  # Size of each input vector
            hidden_size=hidden_size,  # Number of features in the hidden state
            num_layers=2,  # Number of recurrent layers
            dropout=0.3,  # Dropout probability for the RNN layers
            bidirectional=False,  # Whether the RNN is bidirectional
        )

        # Define a dropout layer to prevent overfitting
        self.dropout = nn.Dropout(0.3)

        # Define a classifier layer to output our final prediction
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size, hidden_size // 2),  # First linear layer
            self.dropout,  # Dropout layer
            nn.ReLU(),  # Activation function
            nn.Linear(
                hidden_size // 2, 2
            ),  # Second linear layer, output size is 2 for binary classification
        )

    def forward(self, seq, name_len):
        # Convert our sequence of letters to a sequence of vectors using the embedding layer
        embedded = self.embedding(seq)
        embedded = self.dropout(embedded)

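        # Encode the embedded sequence using the RNN layer.
        # Note: name_len is not used here, so the sequence is not packed and, for
        # shorter names in a batch, the final hidden state also covers the <pad> steps.
        output, hidden = self.encoder(embedded)

        # Pass the final hidden state of the last layer through the classifier
        preds = self.classifier(hidden[-1])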

        return preds
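
Note that `forward` receives `name_len` but never uses it, so for shorter names in a padded batch the final hidden state is computed after the trailing `<pad>` steps. One common option is to pack the batch with `pack_padded_sequence`. The sketch below is a standalone illustration under that assumption and is not the model trained in this notebook. Switching to it would change the results reported below.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_sequence

torch.manual_seed(0)

embedding = nn.Embedding(28, 25, padding_idx=0)
encoder = nn.RNN(input_size=25, hidden_size=50, num_layers=2)
encoder.eval()

names = [torch.tensor([11, 16, 20, 6]), torch.tensor([13, 22])]  # "jose", "lu"
lengths = torch.tensor([len(n) for n in names])
padded = pad_sequence(names, batch_first=False, padding_value=0)  # [4, 2]

with torch.no_grad():
    embedded = embedding(padded)  # [4, 2, 25]

    # Packing tells the RNN where each sequence really ends.
    # lengths must be a CPU tensor; enforce_sorted=False allows any order.
    packed = pack_padded_sequence(embedded, lengths.cpu(), enforce_sorted=False)
    _, hidden_packed = encoder(packed)

    # Without packing, the short name keeps being updated on <pad> steps
    _, hidden_padded = encoder(embedded)

    # Reference: run the short name alone, with no padding at all
    _, hidden_alone = encoder(embedding(names[1]).unsqueeze(1))

same_as_alone = torch.allclose(hidden_packed[-1, 1], hidden_alone[-1, 0], atol=1e-5)
affected_by_padding = not torch.allclose(
    hidden_padded[-1, 1], hidden_alone[-1, 0], atol=1e-5
)
print(same_as_alone, affected_by_padding)
```

With packing, the hidden state of the short name matches the one obtained by running it alone. Without packing, it depends on how much padding the batch happened to add.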

In [32]:
pad_idx = vocab[PAD_TOKEN]
pad_idx

0

In [33]:
model_rnn = NameRNN(
    hidden_size=50,  # Hidden state size of the RNN
    embedding_dim=25,  # Dimension of the embedding vectors
    vocab_size=len(vocab),  # Vocabulary size
    pad_idx=pad_idx,  # Padding index for the embedding layer
)

In our example, we've selected an arbitrary hidden size of 50 and an embedding dimension of 25. These choices were made largely to minimize computation time. Adjusting them changes both computational speed and model accuracy, so you are encouraged to experiment with different values for your specific use case.

#### Rule of Thumb for Model Complexity

A helpful guiding principle when designing models is to ensure that the complexity of your model aligns appropriately with your data's innate structure. The objective is to achieve a balance where:

1. Your model isn't too complex for your data (Overfitting), and
2. It's not too simple relative to your data (Underfitting).

Let's understand what this means:

##### Overfitting - High Variance and Low Bias

When a model is overly complex, it tends to "memorize" the training data rather than "learning" from it, causing poor generalization when faced with new, unseen data. This scenario is referred to as **overfitting** the data and results in a model with high variance and low bias.

##### Underfitting - Low Variance and High Bias

Conversely, if a model is too simple, it fails to capture even the fundamental patterns of the training data. Such models tend to produce inaccurate predictions on both seen and unseen data. This situation is known as **underfitting** and leads to a model with low variance and high bias.

#### Balancing Between Bias and Variance

Balancing between overfitting and underfitting is often described as managing the trade-off between bias and variance. The key is to find a sweet spot where the model is just complex enough to learn useful patterns from the training data but also retains the ability to generalize effectively to unseen data.

While building your model, it's essential to keep this concept in mind: experiment with different parameters, monitor how they affect the performance of your model, and fine-tune them for optimal results.

In [34]:
# Function to count the number of trainable parameters in a model
def count_parameters(model):
    # Sum the number of elements (numel) for each parameter in the model
    # Only include parameters that require gradients (i.e., are trainable)
    n_parameters = sum(p.numel() for p in model.parameters() if p.requires_grad)

    # Print the total number of trainable parameters in a human-readable format
    print(f"The model has {n_parameters:,} trainable parameters")

    # Return the total number of trainable parameters
    return n_parameters


# Count the number of trainable parameters in the model_rnn instance
n_parameters = count_parameters(model_rnn)

The model has 10,977 trainable parameters
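
The figure of 10,977 can be reproduced by hand from the layer sizes defined in `NameRNN`. The sketch below is plain arithmetic, assuming the standard PyTorch parameter layout (each `nn.RNN` layer has an input-to-hidden matrix, a hidden-to-hidden matrix, and two bias vectors). The helpers `rnn_layer` and `linear` are hypothetical names used only here.

```python
vocab_size, embedding_dim, hidden_size = 28, 25, 50

# Embedding table: one vector per token
embedding = vocab_size * embedding_dim  # 700


def rnn_layer(n_in, n_hidden):
    # W_ih and W_hh, plus two bias vectors (b_ih and b_hh)
    return n_hidden * n_in + n_hidden * n_hidden + 2 * n_hidden


# Two stacked layers: the second one takes the first layer's hidden state as input
rnn = rnn_layer(embedding_dim, hidden_size) + rnn_layer(hidden_size, hidden_size)


def linear(n_in, n_out):
    # Weight matrix plus bias vector
    return n_in * n_out + n_out


classifier = linear(hidden_size, hidden_size // 2) + linear(hidden_size // 2, 2)

total = embedding + rnn + classifier
print(f"{embedding} + {rnn} + {classifier} = {total:,}")
# 700 + 8950 + 1327 = 10,977
```

Most of the parameters sit in the two recurrent layers, so `hidden_size` is the setting with the largest effect on model size.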


In [39]:
def train(
    epochs,
    model,
    optimizer,
    criterion,
    train_loader,
    valid_loader,
    device,
    checkpoint_fname,
    verbose=True,
):
    if verbose:
        count_parameters(model)
    start_time = datetime.now()
    model = model.to(device)
    criterion = criterion.to(device)
    best_valid_loss = float("inf")
    for epoch in range(1, epochs + 1):
        model.train()
        training_loss = 0.0
        for batch in train_loader:
            optimizer.zero_grad()
            names, name_len, labels = batch
            # Move to device
            names, name_len, labels = (
                names.to(device),
                name_len.to(device),
                labels.to(device),
            )
            outputs = model(names, name_len)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            # Multiply by the number of examples in the batch (i.e. batch size)
            training_loss += loss.item() * names.size(1)
        training_loss /= len(train_loader.dataset)
        model.eval()
        valid_loss = 0.0
        with torch.no_grad():
            for batch in valid_loader:
                names, name_len, labels = batch
                names, name_len, labels = (
                    names.to(device),
                    name_len.to(device),
                    labels.to(device),
                )
                outputs = model(names, name_len)
                loss = criterion(outputs, labels)
                valid_loss += loss.item() * names.size(1)
        valid_loss /= len(valid_loader.dataset)
        if valid_loss < best_valid_loss:
            best_valid_loss = valid_loss
            torch.save(model.state_dict(), checkpoint_fname)
            if verbose:
                print(
                    f"Epoch {epoch} - Training Loss: {training_loss:.4f} - Valid Loss: {valid_loss:.4f} - New Best"
                )
        else:
            if verbose:
                print(
                    f"Epoch {epoch} - Training Loss: {training_loss:.4f} - Valid Loss: {valid_loss:.4f}"
                )
    elapsed_time = datetime.now() - start_time
    if verbose:
        print(f"Time elapsed: {elapsed_time}")
        print(f"Mean time per epoch: {elapsed_time / epochs}")

In [40]:
import torch.nn.functional as F

# Define a function to predict the gender label for a given name using a trained model
# The function takes four arguments:
# - name: the name to predict the label for
# - model: the trained model to use for prediction
# - vocab: the letter-to-index dictionary used to encode the name
# - device: the device to use for computation (default is 'cpu')


def predict(name, model, vocab, device="cpu"):
    # Convert the input name into token indices and then a tensor.
    name_tensor = tokenize_and_convert(name, vocab)
    name_len = torch.tensor([len(name_tensor)])
    # Add a batch dimension (tensor shape [seq_len] becomes [seq_len, 1]).
    name_tensor = name_tensor.unsqueeze(1)
    name_tensor = name_tensor.to(device)
    name_len = name_len.to(device)
    logits = model(name_tensor, name_len)
    probabilities = F.softmax(logits, dim=1)
    # We use a small dictionary that maps {0: 'M', 1: 'F'}
    result_dict = {0: "M", 1: "F"}
    label = result_dict[probabilities.argmax().item()]
    return [label, probabilities.max().item()]

In [41]:
# Import the Path class from the pathlib module to handle file system paths
from pathlib import Path

# Define the path where the model checkpoints will be saved
checkpoint_path = Path("./outputs/rnns/")

# Create the directory (and any necessary parent directories) if it doesn't already exist
checkpoint_path.mkdir(parents=True, exist_ok=True)

# Define the optimizer to use for training the model
# Here, we use the Adam optimizer with a learning rate of 3e-4
optimizer = optim.Adam(model_rnn.parameters(), lr=3e-4)

# Define the loss function to use for training the model
# Here, we use the CrossEntropyLoss, which is suitable for classification tasks
criterion = nn.CrossEntropyLoss()
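
As a quick illustration of what this loss computes, the sketch below compares `nn.CrossEntropyLoss` with a manual calculation. The logits are made-up values for two hypothetical names. The loss takes raw logits (not probabilities) and averages the negative log-probability assigned to the correct class.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical logits for two names; columns are [M, F]
logits = torch.tensor([[2.0, -1.0], [0.5, 1.5]])
labels = torch.tensor([0, 1])  # first name is M, second is F

criterion = nn.CrossEntropyLoss()
loss = criterion(logits, labels)

# Same value by hand: mean of -log(softmax probability of the true class)
probabilities = F.softmax(logits, dim=1)
true_class_probs = probabilities[torch.arange(len(labels)), labels]
manual_loss = -torch.log(true_class_probs).mean()

print(loss.item(), manual_loss.item())
```

This is also why the model's classifier ends in a plain `nn.Linear` layer without a softmax: the loss applies the softmax internally during training, and `predict` applies it explicitly at inference time.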

In [42]:
# Define the filename for saving the best model checkpoint
# The checkpoint will be saved in the previously defined checkpoint_path directory
checkpoint_fname = checkpoint_path / "bestRNN.pt"

# Train the model for 30 epochs
# Arguments:
# - epochs: Number of epochs to train the model
# - model: The model to train (model_rnn)
# - optimizer: The optimizer to use for training (Adam optimizer)
# - criterion: The loss function to use for training (CrossEntropyLoss)
# - train_loader: The iterator for the training dataset
# - valid_loader: The iterator for the validation dataset
# - device: The device to use for computation (e.g., 'cpu' or 'cuda')
# - checkpoint_fname: The filename for saving the best model checkpoint
train(
    30,
    model_rnn,
    optimizer,
    criterion,
    train_loader,
    valid_loader,
    device,
    checkpoint_fname,
)

The model has 10,977 trainable parameters
Epoch 1 - Training Loss: 0.3567 - Valid Loss: 0.2522 - New Best
Epoch 2 - Training Loss: 0.2523 - Valid Loss: 0.2093 - New Best
Epoch 3 - Training Loss: 0.2263 - Valid Loss: 0.1956 - New Best
Epoch 4 - Training Loss: 0.2144 - Valid Loss: 0.1963
Epoch 5 - Training Loss: 0.2045 - Valid Loss: 0.1760 - New Best
Epoch 6 - Training Loss: 0.1954 - Valid Loss: 0.1735 - New Best
Epoch 7 - Training Loss: 0.1904 - Valid Loss: 0.1629 - New Best
Epoch 8 - Training Loss: 0.1838 - Valid Loss: 0.1624 - New Best
Epoch 9 - Training Loss: 0.1787 - Valid Loss: 0.1548 - New Best
Epoch 10 - Training Loss: 0.1747 - Valid Loss: 0.1551
Epoch 11 - Training Loss: 0.1713 - Valid Loss: 0.1488 - New Best
Epoch 12 - Training Loss: 0.1684 - Valid Loss: 0.1476 - New Best
Epoch 13 - Training Loss: 0.1656 - Valid Loss: 0.1452 - New Best
Epoch 14 - Training Loss: 0.1623 - Valid Loss: 0.1502
Epoch 15 - Training Loss: 0.1604 - Valid Loss: 0.1468
Epoch 16 - Training Loss: 0.1602 - V

In [43]:
# Create a new instance of the NameRNN model with the same hyperparameters as the trained model
# This ensures that the model architecture matches the one used during training
model_rnn_inference = NameRNN(
    hidden_size=50,  # Hidden state size of the RNN
    embedding_dim=25,  # Dimension of the embedding vectors
    vocab_size=len(vocab),  # Vocabulary size
    pad_idx=pad_idx,  # Padding index for the embedding layer
)

# Load the trained model's state dictionary into the new model instance
# This will set the model parameters to the best found during training
model_rnn_inference.load_state_dict(torch.load(checkpoint_fname, weights_only=False))

# Set the model to evaluation mode
# This disables dropout and other training-specific behaviors
model_rnn_inference.eval()

# Move the model to the CPU for inference
# Names are classified one at a time below, so a GPU is not required
model_rnn_inference = model_rnn_inference.to("cpu")

In [45]:
predict("joana", model_rnn_inference, vocab)

['F', 0.9914980530738831]

In [46]:
predict("joão", model_rnn_inference, vocab)

['M', 0.995205283164978]

In [47]:
predict("maria", model_rnn_inference, vocab)

['F', 0.9976400136947632]

In [48]:
predict("marcos", model_rnn_inference, vocab)

['M', 0.9918715953826904]

In [49]:
# Loop through the first 50 names in the training dataset
for i in names_train[:50]:
    # Print the (name, label) tuple followed by the predicted gender and its probability
    # The predict function takes the name, the trained model, and the vocabulary as arguments
    print(i, predict(i[0], model_rnn_inference, vocab))

('francileda', '1') ['F', 0.9985161423683167]
('biaka', '1') ['F', 0.9904242157936096]
('artemira', '1') ['F', 0.9970412850379944]
('oberico', '0') ['M', 0.9997696280479431]
('eliziene', '1') ['F', 0.9995582699775696]
('hannha', '1') ['F', 0.9961289167404175]
('jocela', '1') ['F', 0.9954225420951843]
('dejaina', '1') ['F', 0.9992408752441406]
('avimar', '0') ['M', 0.8921133875846863]
('kathen', '1') ['F', 0.548227846622467]
('maua', '0') ['F', 0.979082465171814]
('udo', '0') ['M', 0.9996289014816284]
('thalila', '1') ['F', 0.999546468257904]
('joseri', '0') ['M', 0.7859906554222107]
('lurivaldo', '0') ['M', 0.9991143345832825]
('zaiara', '1') ['F', 0.9980607628822327]
('periclis', '0') ['M', 0.8826977014541626]
('jaquiline', '1') ['F', 0.9952273368835449]
('izenaide', '1') ['F', 0.9973949193954468]
('luzinette', '1') ['F', 0.998856782913208]
('uerlan', '0') ['M', 0.9329925775527954]
('dharlan', '0') ['M', 0.9563585519790649]
('armelindo', '0') ['M', 0.9952217936515808]
('hersilio', '0'

In [50]:
# Define a dictionary to map label indices to their corresponding gender labels
# '1' corresponds to 'F' (Female) and '0' corresponds to 'M' (Male)
label_mapping = {"1": "F", "0": "M"}

# Create a list of true gender labels for the validation dataset
# Iterate over the names in the validation dataset and map the label index to the gender label using label_mapping
true_valid = [label_mapping[i[1]] for i in names_valid]

# Create a list of predicted gender labels for the validation dataset
# Iterate over the names in the validation dataset and use the predict function to get the predicted gender label
# The predict function takes the name, the trained model, and the vocabulary as arguments
pred_valid = [predict(i[0], model_rnn_inference, vocab)[0] for i in names_valid]

In [57]:
# Initialize an empty list to store the names where the model's prediction is incorrect
failed = []

# Loop through the validation dataset
for i in range(len(true_valid)):
    # Compare the true gender label with the predicted gender label
    if true_valid[i] != pred_valid[i]:
        # If the labels do not match, add the name to the failed list
        error_name = names_valid[i][0]
        failed.append((error_name, true_valid[i], pred_valid[i]))

In [58]:
random.choices(failed, k=10)

[('carem', 'F', 'M'),
 ('marlui', 'F', 'M'),
 ('michico', 'F', 'M'),
 ('alderige', 'M', 'F'),
 ('caici', 'M', 'F'),
 ('clautenes', 'F', 'M'),
 ('cleacir', 'M', 'F'),
 ('kenad', 'M', 'F'),
 ('itagibe', 'M', 'F'),
 ('yaeco', 'F', 'M')]

In [59]:
len(failed)

1239

In [60]:
print("Accuracy: ", accuracy_score(true_valid, pred_valid))
print(f"Classification Report:\n {classification_report(true_valid, pred_valid)}")
print(f"Confusion Matrix:\n {confusion_matrix(true_valid, pred_valid)}")

Accuracy:  0.9312468786415848
Classification Report:
               precision    recall  f1-score   support

           F       0.94      0.93      0.94      9929
           M       0.92      0.93      0.92      8092

    accuracy                           0.93     18021
   macro avg       0.93      0.93      0.93     18021
weighted avg       0.93      0.93      0.93     18021

Confusion Matrix:
 [[9268  661]
 [ 578 7514]]
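
The figures in the classification report can be recomputed by hand from the confusion matrix, which is a useful check on how to read it. The sketch below assumes the scikit-learn layout (rows are true labels, columns are predictions, both sorted as `F`, `M`). `metrics_from_confusion` is a helper introduced here for illustration.

```python
def metrics_from_confusion(matrix):
    """Compute accuracy plus per-class precision and recall.

    matrix[i][j] counts samples whose true class is i and predicted class is j.
    """
    n_classes = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[k][k] for k in range(n_classes))
    precision, recall = [], []
    for k in range(n_classes):
        predicted_k = sum(matrix[i][k] for i in range(n_classes))  # column sum
        true_k = sum(matrix[k])  # row sum
        precision.append(matrix[k][k] / predicted_k)
        recall.append(matrix[k][k] / true_k)
    return correct / total, precision, recall


# Confusion matrix of the vanilla RNN on the validation set (order: F, M)
rnn_confusion = [[9268, 661], [578, 7514]]
accuracy, precision, recall = metrics_from_confusion(rnn_confusion)
errors = rnn_confusion[0][1] + rnn_confusion[1][0]  # off-diagonal entries

print(f"accuracy = {accuracy:.4f}, errors = {errors}")
print(f"precision (F, M) = {precision[0]:.2f}, {precision[1]:.2f}")
print(f"recall    (F, M) = {recall[0]:.2f}, {recall[1]:.2f}")
```

The off-diagonal entries add up to 1239, the same number as `len(failed)`, and the rounded precision and recall values agree with the report.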


## Long Short-Term Memory (LSTM) Networks: An In-Depth Overview


### Challenges with Recurrent Neural Networks (RNNs)

Traditional RNNs maintain a state across multiple time steps, which helps process sequential data. However, they often struggle with **long-term dependencies** because of two main issues:

- **Vanishing Gradient Problem**  
  During training, gradients (derivatives) are propagated backward through the time steps. Each step multiplies the gradient by the recurrent weight matrix and by the derivative of the activation function. When these factors are consistently smaller than 1 in magnitude, the gradient shrinks exponentially with the number of steps. This results in near-zero gradients, making it hard for the network to capture dependencies that are far apart in the sequence.

  > **Key Point:**  
  In backpropagation through time, the repeated multiplication of small gradient values leads to a vanishing effect, reducing the network's ability to learn from long-distance connections.

- **Exploding Gradient Problem**  
  In contrast, when the factors multiplied at each step are consistently larger than 1 in magnitude, the gradient can grow exponentially with the number of steps, leading to very large values. This makes the training process unstable.

  > **Important Note:**  
  Both issues arise from the chain rule in gradient calculation, where repeated multiplication can either excessively diminish or excessively magnify the gradient.


**Illustrative Example of Long-Term Dependencies:**  
Consider the sentence:  

> "O cachorro passou o dia brincando .......... estava cansado."  

Here, the word "estava" depends on "cachorro" even when they are many words apart. Standard RNNs often struggle to capture such distant relationships due to the vanishing and exploding gradient problems.
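
The shrinking and growing of gradients described above can be sketched numerically. The example assumes a deliberately simplified scalar recurrence $h_t = w \cdot h_{t-1}$, with no activation function and a single weight instead of a matrix, so the gradient of $h_T$ with respect to $h_0$ is just $w^T$. The function name is illustrative.

```python
def gradient_through_time(w, steps):
    """Gradient of h_T with respect to h_0 for the toy recurrence h_t = w * h_(t-1).

    The chain rule contributes one factor of w per time step.
    """
    grad = 1.0
    for _ in range(steps):
        grad *= w
    return grad


for w in (0.9, 1.0, 1.1):
    grads = [gradient_through_time(w, steps) for steps in (10, 50, 100)]
    print(f"w = {w}: " + ", ".join(f"{g:.3e}" for g in grads))
```

After 100 steps, a weight of 0.9 leaves a gradient on the order of $10^{-5}$, while a weight of 1.1 produces one on the order of $10^{4}$. A real RNN multiplies by a matrix and an activation derivative at each step, but the exponential behavior is the same.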


### Introduction to LSTM Networks

LSTM networks were designed to overcome these challenges in RNNs. They excel in capturing long-term dependencies by using a more sophisticated cell structure with dedicated mechanisms to regulate information flow. The concept was introduced in 1997 by [Hochreiter and Schmidhuber](https://www.bioinf.jku.at/publications/older/2604.pdf).


### LSTM Network Architecture

An LSTM network is composed of a memory cell along with three critical gating mechanisms: the **input gate**, **output gate**, and **forget gate**. The unique design of these gates allows the network to effectively store and manage information over long sequences.


<p align="center">
  <img src="images/lstm.jpeg" alt="" style="width: 50%; height: 50%"/>
</p>


#### The Memory Cell

- **Cell State:**  
  Represented by a thick line in the cell diagram, the cell state serves as the internal memory. It carries information forward through time steps, with its value updated by various gates.

- **Candidate Cell State Calculation:**  
  The candidate information for updating the cell state is computed using:
  $$
  \tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)
  $$
  - $ \tilde{C}_t $: Candidate cell state at time $ t $
  - $ \tanh $: Activation function compressing values between -1 and 1
  - $ W_C $: Weight matrix for the candidate information
  - $ [h_{t-1}, x_t] $: Concatenation of the previous hidden state and current input
  - $ b_C $: Bias vector

- **Updating the Cell State:**  
  The current cell state is updated using the equation:
  $$
  C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t
  $$
  - $ C_t $: New cell state at time $ t $
  - $ f_t $: Forget gate output (determines what to discard from the previous cell state)
  - $ i_t $: Input gate output (determines how much of the candidate state to add)
  - $ \odot $: Element-wise multiplication

#### The Hidden State

The hidden state contains processed information passed to the next time step. It is calculated based on the updated cell state:
$$
h_t = o_t \odot \tanh(C_t)
$$
- $ h_t $: Hidden state at time $ t $
- $ o_t $: Output gate output (controls which parts of the cell state are output)
- $ \tanh(C_t) $: Non-linear transformation of the cell state

---

### LSTM Gates in Detail

Each gate in the LSTM has a distinct role, executed via a combination of a sigmoid activation and a pointwise multiplication operation. The sigmoid activation outputs values between 0 and 1, which act as filtering factors.

#### The Input Gate

- **Purpose:**  
  Regulates the amount of new information to be added to the cell state.
  
- **Computation:**
  $$
  i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)
  $$
  - $ i_t $: Input gate output at time $ t $
  - $ \sigma $: Sigmoid function (outputs between 0 and 1)
  - $ W_i $: Weight matrix for the input gate
  - $ b_i $: Bias vector

- **Function:**  
  Values near 0 prevent new information from entering the cell, while values near 1 allow it.

#### The Output Gate

- **Purpose:**  
  Controls what information from the cell state is used to compute the next hidden state.

- **Computation:**
  $$
  o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)
  $$
  - $ o_t $: Output gate output at time $ t $
  - $ W_o $: Weight matrix for the output gate
  - $ b_o $: Bias vector

- **Function:**  
  Regulates the amount of cell state information that contributes to the hidden state $ h_t $.

#### The Forget Gate

- **Purpose:**  
  Determines which parts of the previous cell state should be discarded.

- **Computation:**
  $$
  f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)
  $$
  - $ f_t $: Forget gate output at time $ t $
  - $ W_f $: Weight matrix for the forget gate
  - $ b_f $: Bias vector

- **Function:**  
  A value close to 0 eliminates information from the previous cell state, while a value near 1 retains it.
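
The gate and state equations of this section can be put together into a single time step. The sketch below uses plain Python lists so that each equation maps to one line; it is meant for reading, not for speed. The parameter dictionary, its key names, and the toy values are assumptions made for illustration.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(W, v, b):
    """Compute W @ v + b for a list-of-rows matrix W and list vectors v, b."""
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step following the equations in this section.

    p holds W_f, W_i, W_o, W_C (each of shape hidden x (hidden + input))
    and the bias vectors b_f, b_i, b_o, b_C.
    """
    hx = h_prev + x_t  # list concatenation: [h_{t-1}, x_t]
    f_t = [sigmoid(a) for a in affine(p["W_f"], hx, p["b_f"])]  # forget gate
    i_t = [sigmoid(a) for a in affine(p["W_i"], hx, p["b_i"])]  # input gate
    o_t = [sigmoid(a) for a in affine(p["W_o"], hx, p["b_o"])]  # output gate
    c_tilde = [math.tanh(a) for a in affine(p["W_C"], hx, p["b_C"])]  # candidate
    # New cell state: keep part of the old state, add part of the candidate
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    # New hidden state: gated, squashed view of the cell state
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


# Toy setup (illustrative values): hidden size 2, input size 1, all weights zero
hidden, inputs = 2, 1
zeros = [[0.0] * (hidden + inputs) for _ in range(hidden)]
params = {
    "W_f": zeros, "W_i": zeros, "W_o": zeros, "W_C": zeros,
    "b_f": [10.0, 10.0],    # forget gate close to 1: keep the old cell state
    "b_i": [-10.0, -10.0],  # input gate close to 0: admit almost nothing new
    "b_o": [0.0, 0.0],
    "b_C": [0.0, 0.0],
}
h_t, c_t = lstm_step([1.0], [0.0, 0.0], [0.5, -0.5], params)
print(c_t)  # stays very close to the previous cell state [0.5, -0.5]
```

With the forget gate saturated near 1 and the input gate near 0, the cell state passes through the step almost unchanged, which is the mechanism that lets an LSTM carry information across many time steps.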



In [61]:
class NameLSTM(nn.Module):
    def __init__(self, hidden_size, embedding_dim, vocab_size, pad_idx):
        super().__init__()

        # Define an embedding layer to convert our letters to vectors
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)

        # Define an LSTM layer to encode our sequence of letters
        self.encoder = nn.LSTM(
            input_size=embedding_dim,
            hidden_size=hidden_size,
            num_layers=2,
            dropout=0.3,
            bidirectional=False,
        )

        # Define a dropout layer to prevent overfitting
        self.dropout = nn.Dropout(0.3)

        # Define a classifier layer to output our final prediction
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size, hidden_size // 2),  # input size is the hidden size
            self.dropout,
            nn.ReLU(),
            nn.Linear(
                hidden_size // 2, 2
            ),  # output size is 2 because we're predicting binary gender
        )

    def forward(self, seq, name_len):
        # Convert our sequence of letters to a sequence of vectors using the embedding layer
        embedded = self.embedding(seq)
        embedded = self.dropout(embedded)

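        # Encode the embedded sequence using the LSTM layer
        # Note: name_len is not used, so the sequence is not packed and the final
        # hidden state is taken after the last position of the (possibly padded) input
        output, (hidden, cell) = self.encoder(embedded)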

        # Pass the final hidden state through the classifier layer to get our prediction
        preds = self.classifier(hidden[-1])

        return preds

In [62]:
# Define the filename for saving the best LSTM model checkpoint
checkpoint_fname = checkpoint_path / "bestLSTM.pt"

# Create an instance of the NameLSTM model with specified hyperparameters
model_lstm = NameLSTM(
    hidden_size=50,  # Hidden state size of the LSTM
    embedding_dim=25,  # Dimension of the embedding vectors
    vocab_size=len(vocab),  # Vocabulary size
    pad_idx=pad_idx,  # Padding index for the embedding layer
)

# Define the optimizer to use for training the model
# Here, we use the Adam optimizer with a learning rate of 3e-4
optimizer = optim.Adam(model_lstm.parameters(), lr=3e-4)

# Define the loss function to use for training the model
# Here, we use the CrossEntropyLoss, which is suitable for classification tasks
criterion = nn.CrossEntropyLoss()

# Train the model for 30 epochs
# Arguments:
# - epochs: Number of epochs to train the model
# - model: The model to train (model_lstm)
# - optimizer: The optimizer to use for training (Adam optimizer)
# - criterion: The loss function to use for training (CrossEntropyLoss)
# - train_loader: The iterator for the training dataset
# - valid_loader: The iterator for the validation dataset
# - device: The device to use for computation (e.g., 'cpu' or 'cuda')
# - checkpoint_fname: The filename for saving the best model checkpoint
train(
    30,
    model_lstm,
    optimizer,
    criterion,
    train_loader,
    valid_loader,
    device,
    checkpoint_fname,
)

The model has 37,827 trainable parameters
Epoch 1 - Training Loss: 0.3191 - Valid Loss: 0.2216 - New Best
Epoch 2 - Training Loss: 0.2257 - Valid Loss: 0.1897 - New Best
Epoch 3 - Training Loss: 0.2002 - Valid Loss: 0.1758 - New Best
Epoch 4 - Training Loss: 0.1840 - Valid Loss: 0.1589 - New Best
Epoch 5 - Training Loss: 0.1733 - Valid Loss: 0.1511 - New Best
Epoch 6 - Training Loss: 0.1654 - Valid Loss: 0.1489 - New Best
Epoch 7 - Training Loss: 0.1590 - Valid Loss: 0.1516
Epoch 8 - Training Loss: 0.1543 - Valid Loss: 0.1353 - New Best
Epoch 9 - Training Loss: 0.1465 - Valid Loss: 0.1420
Epoch 10 - Training Loss: 0.1426 - Valid Loss: 0.1421
Epoch 11 - Training Loss: 0.1384 - Valid Loss: 0.1303 - New Best
Epoch 12 - Training Loss: 0.1357 - Valid Loss: 0.1280 - New Best
Epoch 13 - Training Loss: 0.1325 - Valid Loss: 0.1183 - New Best
Epoch 14 - Training Loss: 0.1286 - Valid Loss: 0.1180 - New Best
Epoch 15 - Training Loss: 0.1261 - Valid Loss: 0.1186
Epoch 16 - Training Loss: 0.1260 - V

In [63]:
# Create a new instance of the NameLSTM model with the same hyperparameters as the trained model
model_lstm_inference = NameLSTM(
    hidden_size=50,  # Hidden state size of the LSTM
    embedding_dim=25,  # Dimension of the embedding vectors
    vocab_size=len(vocab),  # Vocabulary size
    pad_idx=pad_idx,  # Padding index for the embedding layer
)

# Load the trained model's state dictionary into the new model instance
# This will set the model parameters to the best found during training
model_lstm_inference.load_state_dict(torch.load(checkpoint_fname))

# Set the model to evaluation mode
# This disables dropout and other training-specific behaviors
model_lstm_inference.eval()

# Move the model to the CPU for inference
# Names are classified one at a time below, so a GPU is not required
model_lstm_inference = model_lstm_inference.to("cpu")

In [None]:
label_mapping = {
    "1": "F",
    "0": "M",
}  # Map label indices to gender labels ('1' -> 'F', '0' -> 'M')

# Create a list of true gender labels for the validation dataset
# Iterate over the names in the validation dataset and map the label index to the gender label using label_mapping
true_valid = [label_mapping[i[1]] for i in names_valid]

# Create a list of predicted gender labels for the validation dataset
# Iterate over the names in the validation dataset and use the predict function to get the predicted gender label
pred_valid = [predict(i[0], model_lstm_inference, vocab)[0] for i in names_valid]

In [66]:
print("Accuracy: ", accuracy_score(true_valid, pred_valid))
print(f"Classification Report:\n {classification_report(true_valid, pred_valid)}")
print(f"Confusion Matrix:\n {confusion_matrix(true_valid, pred_valid)}")

Accuracy:  0.947949614338827
Classification Report:
               precision    recall  f1-score   support

           F       0.95      0.95      0.95      9929
           M       0.94      0.94      0.94      8092

    accuracy                           0.95     18021
   macro avg       0.95      0.95      0.95     18021
weighted avg       0.95      0.95      0.95     18021

Confusion Matrix:
 [[9474  455]
 [ 483 7609]]


## Bidirectional Recurrent Networks

In the previous section, we saw how LSTM networks capture long-term dependencies within sequential data. However, standard LSTM networks are unidirectional: they process information in a single direction, typically from the beginning to the end of the sequence. This is suitable for many applications, but it is limiting when context from both past and future elements is needed to understand a specific point in the sequence.

Consider the Portuguese sentence provided earlier:

> "O cachorro passou o dia brincando .......... estava cansado."

As highlighted, determining the correct form of the verb "estava" (was) benefits significantly from understanding both preceding context like "cachorro" (dog) and succeeding context like "cansado" (tired). A conventional LSTM, processing the sentence from left to right, cannot use "cansado" when it computes the hidden state for "estava", because it has not read "cansado" yet.

To address this limitation, **bidirectional recurrent networks**, particularly Bidirectional LSTMs (Bi-LSTMs), are introduced. These architectures overcome the unidirectional constraint by processing the input sequence in both forward and reverse directions.


### Understanding Bidirectional LSTM Architecture

A Bidirectional LSTM network is composed of two independent LSTM layers operating in parallel. One LSTM layer processes the input sequence in the **forward direction** (from the start to the end), just like a standard LSTM.  Simultaneously, a second LSTM layer processes the input sequence in the **backward direction** (from the end to the start).

<p align="center">
<img src="images/bidirectional_lstm.webp" alt="" style="width: 50%; height: 50%"/>
</p>

It's crucial to note that while we focus on LSTMs, the concept of bidirectionality is applicable to all forms of recurrent neural networks, including simpler RNNs and GRUs. Any recurrent network can be configured to operate bidirectionally to gain access to both past and future context.

### Deep Dive into the Dual Processing Mechanism

Imagine a bidirectional network as having two "reading heads" scanning the input sequence simultaneously from opposite ends.

- **Forward Layer:** This layer functions identically to a standard LSTM. It reads the input sequence $x = (x_1, x_2, ..., x_T)$ in its natural order, from $x_1$ to $x_T$. At each time step $t$, the forward layer LSTM computes a hidden state $\overrightarrow{h}_t$ based on the current input $x_t$ and the previous forward hidden state $\overrightarrow{h}_{t-1}$. This layer is effective at capturing dependencies based on preceding words or elements in the sequence. We can represent the forward pass computation as:

    $$
    \overrightarrow{h}_t = \text{LSTM}_{\text{forward}}(x_t, \overrightarrow{h}_{t-1})
    $$

- **Backward Layer:** This is the distinguishing feature of bidirectional networks.  This layer processes the sequence in reverse order, effectively reading from $x_T$ back to $x_1$.  At each time step $t$, the backward layer LSTM calculates a hidden state $\overleftarrow{h}_t$ based on the input $x_t$ and the *next* backward hidden state $\overleftarrow{h}_{t+1}$.  This allows the network to incorporate information from the *future* of the sequence relative to the current position. The backward pass computation can be formulated as:

    $$
    \overleftarrow{h}_t = \text{LSTM}_{\text{backward}}(x_t, \overleftarrow{h}_{t+1})
    $$

After both the forward and backward layers have processed the entire sequence, their outputs are combined to produce the final output for each time step. This combination, often achieved through **concatenation** or **addition**, merges the past and future context representations. For example, if we choose concatenation, the output $h_t$ at time step $t$ is given by:

$$
h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]
$$

where $ [;] $ denotes concatenation.
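
The two passes and the concatenation step can be sketched without any deep learning library. The example below replaces the LSTM with a toy scalar cell so that the data flow stays visible. `toy_cell` and `bidirectional_states` are illustrative names, and a real bidirectional layer would use separate learned parameters for each direction.

```python
import math


def toy_cell(x_t, h_prev):
    """Stand-in for an LSTM step: any function of (input, previous state) works here."""
    return math.tanh(0.5 * x_t + 0.5 * h_prev)


def bidirectional_states(xs, forward_cell, backward_cell, h0=0.0):
    """Return [(forward_h_t, backward_h_t)] for every position t of the sequence."""
    forward, h = [], h0
    for x_t in xs:  # reads x_1 ... x_T
        h = forward_cell(x_t, h)
        forward.append(h)

    backward, h = [], h0
    for x_t in reversed(xs):  # reads x_T ... x_1
        h = backward_cell(x_t, h)
        backward.append(h)
    backward.reverse()  # realign so that backward[t] belongs to position t

    # Pairing the two states plays the role of the concatenation [forward; backward]
    return list(zip(forward, backward))


xs = [1.0, -2.0, 0.5, 3.0]
states = bidirectional_states(xs, toy_cell, toy_cell)
for t, (fwd, bwd) in enumerate(states, start=1):
    print(f"t = {t}: forward = {fwd:+.3f}, backward = {bwd:+.3f}")
```

At the first position the forward state has seen only $x_1$, while the backward state already summarizes the whole sequence. At the last position the roles are reversed.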

This dual processing mechanism allows bidirectional networks to consider the entire context surrounding each element in the sequence. Because the backward layer reads from the end, the complete sequence must be available before processing starts, which makes these networks best suited to tasks where the full input is known in advance. Examples include:

- **Language Translation:**  Understanding the context on both sides of a word in the source sentence is vital for selecting the most appropriate translation.
- **Text Generation:**  A bidirectional layer can encode a prompt or source text that is already complete. The generation step itself still runs left to right, because the words that follow have not been produced yet.
- **Speech Recognition:**  Recognizing spoken words accurately often depends on the surrounding phonetic context, both before and after the word in question.

Fundamentally, bidirectional networks provide a more complete understanding of sequential data by looking in both directions, overcoming a key limitation of traditional unidirectional recurrent networks.

In [72]:
class NameLSTMBidir(nn.Module):
    def __init__(self, hidden_size, embedding_dim, vocab_size, pad_idx):
        super().__init__()

        # Define an embedding layer to convert our letters to vectors
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)

        # Define a bidirectional LSTM layer to encode our sequence of letters
        self.encoder = nn.LSTM(
            input_size=embedding_dim,
            hidden_size=hidden_size,
            num_layers=2,
            dropout=0.3,
            bidirectional=True,
        )

        # Define a dropout layer to prevent overfitting
        self.dropout = nn.Dropout(0.3)

        # Define a classifier layer to output our final prediction
        self.classifier = nn.Sequential(
            nn.Linear(
                hidden_size * 2, hidden_size // 2
            ),  # input size is the hidden size times 2 because of bidirectionality
            self.dropout,
            nn.ReLU(),
            nn.Linear(
                hidden_size // 2, 2
            ),  # output size is 2 because we're predicting binary gender
        )

    def forward(self, seq, name_len):
        # Convert our sequence of letters to a sequence of vectors using the embedding layer
        embedded = self.embedding(seq)
        embedded = self.dropout(embedded)

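        # Encode the embedded sequence using the bidirectional LSTM layer
        # Note: name_len is not used, so the sequence is not packed and both
        # directions also read any padded positions
        output, (hidden, cell) = self.encoder(embedded)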

        # Concatenate the final hidden states from both directions
        hidden = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)

        # Pass the concatenated hidden state through the classifier layer to get our prediction
        preds = self.classifier(hidden)

        return preds
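
The training logs report 37,827 trainable parameters for `NameLSTM` and 94,877 for `NameLSTMBidir`. As a sanity check on both architectures, these totals can be derived by hand. The sketch below assumes PyTorch's parameter layout for `nn.LSTM` (four gate blocks per direction, with two bias vectors each) and a vocabulary of 28 symbols. The vocabulary size is not stated in this section; 28 is the value consistent with the reported totals. All function names are illustrative.

```python
def lstm_param_count(input_size, hidden_size, num_layers, bidirectional):
    """Count weights and biases of a stacked LSTM laid out like torch.nn.LSTM."""
    directions = 2 if bidirectional else 1
    total = 0
    for layer in range(num_layers):
        # Layers after the first read the concatenated outputs of all directions
        layer_input = input_size if layer == 0 else hidden_size * directions
        # 4 blocks (input, forget, candidate, output), each with input weights,
        # recurrent weights, and two bias vectors
        per_direction = 4 * (
            layer_input * hidden_size + hidden_size * hidden_size + 2 * hidden_size
        )
        total += directions * per_direction
    return total


def linear_param_count(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


def name_model_param_count(vocab_size, embedding_dim, hidden_size, bidirectional):
    """Embedding + 2-layer LSTM encoder + 2-layer classifier, as in the classes above."""
    directions = 2 if bidirectional else 1
    embedding = vocab_size * embedding_dim
    encoder = lstm_param_count(embedding_dim, hidden_size, 2, bidirectional)
    classifier = linear_param_count(
        hidden_size * directions, hidden_size // 2
    ) + linear_param_count(hidden_size // 2, 2)
    return embedding + encoder + classifier


VOCAB_SIZE = 28  # assumption: inferred from the reported totals, not stated in the text

print(name_model_param_count(VOCAB_SIZE, 25, 50, bidirectional=False))  # 37827
print(name_model_param_count(VOCAB_SIZE, 25, 50, bidirectional=True))  # 94877
```

Most of the increase in the bidirectional model comes from the second LSTM layer, whose input doubles from 50 to 100 features in each of its two directions.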

In [68]:
# Define the filename for saving the best bidirectional LSTM model checkpoint
checkpoint_fname = checkpoint_path / "bestLSTMbidir.pt"

# Create an instance of the NameLSTMBidir model with specified hyperparameters
model_lstm_bidir = NameLSTMBidir(
    hidden_size=50,  # Hidden state size of the LSTM
    embedding_dim=25,  # Dimension of the embedding vectors
    vocab_size=len(vocab),  # Vocabulary size
    pad_idx=pad_idx,  # Padding index for the embedding layer
)

# Define the optimizer to use for training the model
# Here, we use the Adam optimizer with a learning rate of 3e-4
optimizer = optim.Adam(model_lstm_bidir.parameters(), lr=3e-4)

# Define the loss function to use for training the model
# Here, we use the CrossEntropyLoss, which is suitable for classification tasks
criterion = nn.CrossEntropyLoss()

# Train the model for 30 epochs
# Arguments:
# - epochs: Number of epochs to train the model
# - model: The model to train (model_lstm_bidir)
# - optimizer: The optimizer to use for training (Adam optimizer)
# - criterion: The loss function to use for training (CrossEntropyLoss)
# - train_loader: The iterator for the training dataset
# - valid_loader: The iterator for the validation dataset
# - device: The device to use for computation (e.g., 'cpu' or 'cuda')
# - checkpoint_fname: The filename for saving the best model checkpoint
train(
    30,
    model_lstm_bidir,
    optimizer,
    criterion,
    train_loader,
    valid_loader,
    device,
    checkpoint_fname,
)

The model has 94,877 trainable parameters
Epoch 1 - Training Loss: 0.3085 - Valid Loss: 0.2127 - New Best
Epoch 2 - Training Loss: 0.2157 - Valid Loss: 0.1766 - New Best
Epoch 3 - Training Loss: 0.1865 - Valid Loss: 0.1625 - New Best
Epoch 4 - Training Loss: 0.1710 - Valid Loss: 0.1617 - New Best
Epoch 5 - Training Loss: 0.1613 - Valid Loss: 0.1417 - New Best
Epoch 6 - Training Loss: 0.1520 - Valid Loss: 0.1316 - New Best
Epoch 7 - Training Loss: 0.1446 - Valid Loss: 0.1277 - New Best
Epoch 8 - Training Loss: 0.1387 - Valid Loss: 0.1211 - New Best
Epoch 9 - Training Loss: 0.1336 - Valid Loss: 0.1242
Epoch 10 - Training Loss: 0.1277 - Valid Loss: 0.1146 - New Best
Epoch 11 - Training Loss: 0.1247 - Valid Loss: 0.1256
Epoch 12 - Training Loss: 0.1201 - Valid Loss: 0.1150
Epoch 13 - Training Loss: 0.1154 - Valid Loss: 0.1135 - New Best
Epoch 14 - Training Loss: 0.1117 - Valid Loss: 0.1072 - New Best
Epoch 15 - Training Loss: 0.1101 - Valid Loss: 0.1062 - New Best
Epoch 16 - Training Loss:

In [69]:
# Create a new instance of the NameLSTMBidir model with the same hyperparameters as the trained model
model_lstm_bidir_inference = NameLSTMBidir(
    hidden_size=50,  # Hidden state size of the LSTM
    embedding_dim=25,  # Dimension of the embedding vectors
    vocab_size=len(vocab),  # Vocabulary size
    pad_idx=pad_idx,  # Padding index for the embedding layer
)

# Load the trained model's state dictionary into the new model instance
# This will set the model parameters to the best found during training
model_lstm_bidir_inference.load_state_dict(torch.load(checkpoint_fname))

# Set the model to evaluation mode
# This disables dropout and other training-specific behaviors
model_lstm_bidir_inference.eval()

# Move the model to the CPU for inference
# Names are classified one at a time below, so a GPU is not required
model_lstm_bidir_inference = model_lstm_bidir_inference.to("cpu")

In [None]:
label_mapping = {
    "1": "F",
    "0": "M",
}  # Map label indices to gender labels ('1' -> 'F', '0' -> 'M')

# Create a list of true gender labels for the validation dataset
# Iterate over the names in the validation dataset and map the label index to the gender label using label_mapping
true_valid = [label_mapping[i[1]] for i in names_valid]

# Create a list of predicted gender labels for the validation dataset
# Iterate over the names in the validation dataset and use the predict function to get the predicted gender label
pred_valid = [predict(i[0], model_lstm_bidir_inference, vocab)[0] for i in names_valid]

In [71]:
print("Accuracy: ", accuracy_score(true_valid, pred_valid))
print(f"Classification Report:\n {classification_report(true_valid, pred_valid)}")
print(f"Confusion Matrix:\n {confusion_matrix(true_valid, pred_valid)}")

Accuracy:  0.9589367959602686
Classification Report:
               precision    recall  f1-score   support

           F       0.95      0.97      0.96      9929
           M       0.96      0.94      0.95      8092

    accuracy                           0.96     18021
   macro avg       0.96      0.96      0.96     18021
weighted avg       0.96      0.96      0.96     18021

Confusion Matrix:
 [[9651  278]
 [ 462 7630]]


## Gated Recurrent Unit (GRU) Networks

GRU networks are one of the main alternatives to LSTM networks for managing long-term dependencies in sequential data. They simplify the recurrent neural network (RNN) structure by reducing the number of gates and combining the memory components, which can lead to computational efficiency and simpler training.

- **Traditional RNNs:**  
  Suffer from vanishing and exploding gradient issues, making it difficult to learn long-term dependencies.

- **LSTM Networks:**  
  Address these issues by introducing a cell state along with three gating mechanisms (input, output, and forget gates). This design improves the learning of longer sequences but introduces a higher computational cost.

- **GRU Networks:**  
  Introduced in 2014 by Kyunghyun Cho et al. in [*Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation*](https://arxiv.org/abs/1406.1078), GRUs simplify the LSTM design by merging the cell state and hidden state into a single hidden state and by using only two gates (update and reset).
  - **Efficiency:** The reduced complexity means fewer parameters per unit, which can lead to faster computation and lower memory usage.
  - **Functionality:** The update gate balances retaining previous information against replacing it with new data, while the reset gate controls how much of the previous hidden state is used when computing the new candidate state.

<p align="center">
<img src="images/gru.png" alt="" style="width: 50%; height: 50%"/>
</p>

This streamlined approach makes GRU networks an attractive option when resources are limited or when tasks do not demand the full complexity offered by LSTM architectures.

## GRU Architecture

The GRU is built around a single hidden state that keeps the memory of the system. It employs two gating mechanisms:

1. **Update Gate ($z_t$)**
2. **Reset Gate ($r_t$)**

Each gate’s output lies between 0 and 1, determining the degree to which information is retained or discarded in each time step.


### Mathematical Formulation

Let:  
- $ \mathbf{x}_t $ be the input at time $ t $,  
- $ \mathbf{h}_{t-1} $ be the previous hidden state, and  
- $ \mathbf{h}_t $ be the new hidden state.

The computations in a GRU cell are as follows:

1. **Update Gate:**

   $$
   z_t = \sigma(W_z \mathbf{x}_t + U_z \mathbf{h}_{t-1})
   $$

   - $W_z$ and $U_z$ are weight matrices.
   - $\sigma(\cdot)$ is the sigmoid function.

2. **Reset Gate:**

   $$
   r_t = \sigma(W_r \mathbf{x}_t + U_r \mathbf{h}_{t-1})
   $$

   - $W_r$ and $U_r$ are weight matrices.

3. **Candidate Hidden State:**

   $$
   \tilde{\mathbf{h}}_t = \tanh(W \mathbf{x}_t + U (r_t \odot \mathbf{h}_{t-1}))
   $$

   - $W$ and $U$ are weight matrices.
   - $\odot$ denotes element-wise multiplication.

4. **Final Hidden State:**

   $$
   \mathbf{h}_t = (1 - z_t) \odot \mathbf{h}_{t-1} + z_t \odot \tilde{\mathbf{h}}_t
   $$

   The update gate $z_t$ controls the balance between retaining the past hidden state and updating with the new candidate hidden state, $\tilde{\mathbf{h}}_t$.
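
The four equations above translate directly into a single time step. The sketch below uses plain Python lists and, like the formulas in this section, omits bias terms. The parameter dictionary, its key names, and the toy values are assumptions made for illustration.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix M and a list vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def gru_step(x_t, h_prev, p):
    """One GRU time step following the four equations above (no bias terms)."""
    # Update gate: z_t = sigmoid(W_z x_t + U_z h_{t-1})
    z_t = [sigmoid(a + b) for a, b in zip(matvec(p["W_z"], x_t), matvec(p["U_z"], h_prev))]
    # Reset gate: r_t = sigmoid(W_r x_t + U_r h_{t-1})
    r_t = [sigmoid(a + b) for a, b in zip(matvec(p["W_r"], x_t), matvec(p["U_r"], h_prev))]
    # Candidate: h~_t = tanh(W x_t + U (r_t * h_{t-1}))
    gated_h = [r * h for r, h in zip(r_t, h_prev)]
    h_tilde = [math.tanh(a + b) for a, b in zip(matvec(p["W"], x_t), matvec(p["U"], gated_h))]
    # Final state: h_t = (1 - z_t) * h_{t-1} + z_t * h~_t
    h_t = [(1 - z) * h + z * g for z, h, g in zip(z_t, h_prev, h_tilde)]
    return h_t, z_t, r_t


# Toy setup (illustrative values): hidden size 2, input size 1
zeros_hh = [[0.0, 0.0], [0.0, 0.0]]
params = {
    "W_z": [[-20.0], [20.0]],  # unit 0 gets z close to 0, unit 1 gets z close to 1
    "U_z": zeros_hh,
    "W_r": [[0.0], [0.0]],
    "U_r": zeros_hh,
    "W": [[1.0], [1.0]],
    "U": zeros_hh,
}
h_t, z_t, r_t = gru_step([1.0], [0.3, 0.3], params)
print(h_t)  # unit 0 keeps roughly 0.3, unit 1 moves to roughly tanh(1.0)
```

The first unit, with $z_t$ near 0, keeps its previous value, while the second unit, with $z_t$ near 1, is overwritten by the candidate state. Note that PyTorch's documentation for `nn.GRU` writes the final equation with the roles of $z_t$ and $1 - z_t$ swapped and includes bias terms, so gate values read from a trained PyTorch model have the opposite interpretation.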



## Explanation of the Key Components

- **The Hidden State:**  
  In GRUs, there is no separate cell state. The hidden state $ \mathbf{h}_t $ combines both the memory and the output information. This unified approach simplifies the model architecture and reduces the number of parameters.

- **The Update Gate:**  
  - **Role:** Determines how much of the previous hidden state is carried forward into the new hidden state.
  - **Effect:**  
    - If $ z_t $ is close to 0, the network prefers to keep most of the previous information.
    - If $ z_t $ is close to 1, the network gives preference to the candidate hidden state, meaning it updates the state significantly with new information.

- **The Reset Gate:**  
  - **Role:** Determines how much of the previous hidden state should be reset before computing the candidate hidden state.
  - **Effect:**  
    - If $ r_t $ is close to 0, it enables the unit to discard most past information, focusing on the new input.
    - If $ r_t $ is close to 1, it allows the network to use the previous state information more fully when calculating $\tilde{\mathbf{h}}_t$.
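
The extreme cases described above can be checked with a scalar toy example in which the gate values are set by hand instead of being learned. The helper `scalar_gru_step` and the weights `w` and `u` are hypothetical names used only for this sketch.

```python
import math


def scalar_gru_step(x, h_prev, z, r, w=1.0, u=1.0):
    """Scalar GRU step with hand-picked gate values z (update) and r (reset)."""
    h_cand = math.tanh(w * x + u * (r * h_prev))  # candidate hidden state
    h_new = (1 - z) * h_prev + z * h_cand          # blend old state and candidate
    return h_new, h_cand


x, h_prev = 0.5, 0.9

# z = 0: the previous hidden state is carried forward unchanged
keep, _ = scalar_gru_step(x, h_prev, z=0.0, r=1.0)

# z = 1: the new hidden state is exactly the candidate
overwrite, cand = scalar_gru_step(x, h_prev, z=1.0, r=1.0)

# r = 0: the candidate ignores h_prev and depends only on the input
_, cand_reset = scalar_gru_step(x, h_prev, z=1.0, r=0.0)

print(keep, overwrite, cand_reset)
```

In a trained GRU the gates are vectors with values strictly between 0 and 1, so each hidden unit sits somewhere between these extremes.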

> **Note:** GRUs and LSTMs often perform similarly on tasks such as machine translation, text generation, and speech recognition. The decision between the two can depend on the complexity of the task and computational resources available. GRUs tend to train faster due to their simplified structure, while LSTMs might better capture longer dependencies in more complex tasks.


In [73]:
class NameGRUBidir(nn.Module):
    def __init__(self, hidden_size, embedding_dim, vocab_size, pad_idx):
        super().__init__()

        # Define an embedding layer to convert our letters to vectors
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)

        # Define a bidirectional GRU layer to encode our sequence of letters
        self.encoder = nn.GRU(
            input_size=embedding_dim,
            hidden_size=hidden_size,
            num_layers=2,
            dropout=0.3,
            bidirectional=True,
        )

        # Define a dropout layer to prevent overfitting
        self.dropout = nn.Dropout(0.3)

        # Define a classifier layer to output our final prediction
        self.classifier = nn.Sequential(
            nn.Linear(
                hidden_size * 2, hidden_size // 2
            ),  # input size is the hidden size times 2 because of bidirectionality
            self.dropout,
            nn.ReLU(),
            nn.Linear(
                hidden_size // 2, 2
            ),  # output size is 2 because we're predicting binary gender
        )

    def forward(self, seq, name_len):
        # Convert our sequence of letters to a sequence of vectors using the embedding layer
        embedded = self.embedding(seq)
        embedded = self.dropout(embedded)

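        # Encode the embedded sequence using the bidirectional GRU layer.
        # Note: name_len is not used here, so the padded batch is fed to the GRU as-is
        # (no packing); `output` holds the per-step outputs, `hidden` the final states.
        output, hidden = self.encoder(embedded)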

        # Concatenate the final hidden states from both directions
        hidden = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)

        # Pass the concatenated hidden state through the classifier layer to get our prediction
        preds = self.classifier(hidden)

        return preds
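
To see why `hidden[-2]` and `hidden[-1]` are the right slices to concatenate, the sketch below builds a standalone `nn.GRU` with the same layer settings (dropout left at 0 so the check is deterministic) and inspects the shapes it returns. The random input and the sizes `seq_len` and `batch_size` are assumptions for illustration; the input is laid out as `(seq_len, batch, features)` because the encoder above is created without `batch_first=True`.

```python
import torch
from torch import nn

torch.manual_seed(0)
seq_len, batch_size, embedding_dim, hidden_size = 7, 4, 25, 50

gru = nn.GRU(
    input_size=embedding_dim,
    hidden_size=hidden_size,
    num_layers=2,
    bidirectional=True,
)
gru.eval()

x = torch.randn(seq_len, batch_size, embedding_dim)  # stand-in for embedded letters
with torch.no_grad():
    output, h_n = gru(x)

print(output.shape)  # (seq_len, batch, 2 * hidden_size): last layer, every time step
print(h_n.shape)     # (num_layers * 2, batch, hidden_size): final state per layer and direction

# h_n is ordered layer by layer, forward direction first, so the last two
# entries belong to the top layer: forward, then backward.
last_forward, last_backward = h_n[-2], h_n[-1]
name_vector = torch.cat((last_forward, last_backward), dim=1)
print(name_vector.shape)  # (batch, 2 * hidden_size), the classifier input
```

The forward state in `h_n[-2]` matches the forward half of `output` at the last time step, and the backward state in `h_n[-1]` matches the backward half of `output` at the first time step, since the backward pass finishes at the start of the sequence.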

In [74]:
# Define the filename for saving the best bidirectional GRU model checkpoint
checkpoint_fname = checkpoint_path / "bestGRUBidir.pt"

# Create an instance of the NameGRUBidir model with specified hyperparameters
model_gru_bidir = NameGRUBidir(
    hidden_size=50,  # Hidden state size of the GRU
    embedding_dim=25,  # Dimension of the embedding vectors
    vocab_size=len(vocab),  # Vocabulary size
    pad_idx=pad_idx,  # Padding index for the embedding layer
)

# Define the optimizer to use for training the model
# Here, we use the Adam optimizer with a learning rate of 3e-4
optimizer = optim.Adam(model_gru_bidir.parameters(), lr=3e-4)

# Define the loss function to use for training the model
# Here, we use the CrossEntropyLoss, which is suitable for classification tasks
criterion = nn.CrossEntropyLoss()

# Train the model for 30 epochs
# Arguments:
# - epochs: Number of epochs to train the model
# - model: The model to train (model_gru_bidir)
# - optimizer: The optimizer to use for training (Adam optimizer)
# - criterion: The loss function to use for training (CrossEntropyLoss)
# - train_loader: The iterator for the training dataset
# - valid_loader: The iterator for the validation dataset
# - device: The device to use for computation (e.g., 'cpu' or 'cuda')
# - checkpoint_fname: The filename for saving the best model checkpoint
train(
    30,
    model_gru_bidir,
    optimizer,
    criterion,
    train_loader,
    valid_loader,
    device,
    checkpoint_fname,
)

The model has 71,977 trainable parameters
Epoch 1 - Training Loss: 0.3147 - Valid Loss: 0.2220 - New Best
Epoch 2 - Training Loss: 0.2094 - Valid Loss: 0.1782 - New Best
Epoch 3 - Training Loss: 0.1801 - Valid Loss: 0.1590 - New Best
Epoch 4 - Training Loss: 0.1675 - Valid Loss: 0.1432 - New Best
Epoch 5 - Training Loss: 0.1565 - Valid Loss: 0.1394 - New Best
Epoch 6 - Training Loss: 0.1499 - Valid Loss: 0.1271 - New Best
Epoch 7 - Training Loss: 0.1426 - Valid Loss: 0.1242 - New Best
Epoch 8 - Training Loss: 0.1355 - Valid Loss: 0.1166 - New Best
Epoch 9 - Training Loss: 0.1296 - Valid Loss: 0.1126 - New Best
Epoch 10 - Training Loss: 0.1250 - Valid Loss: 0.1135
Epoch 11 - Training Loss: 0.1209 - Valid Loss: 0.1042 - New Best
Epoch 12 - Training Loss: 0.1167 - Valid Loss: 0.1104
Epoch 13 - Training Loss: 0.1120 - Valid Loss: 0.1016 - New Best
Epoch 14 - Training Loss: 0.1095 - Valid Loss: 0.0990 - New Best
Epoch 15 - Training Loss: 0.1053 - Valid Loss: 0.1013
Epoch 16 - Training Loss:

In [None]:
# Create a new instance of the NameGRUBidir model with the same hyperparameters as the trained model
# This ensures that the model architecture matches the one used during training
model_gru_bidir_inference = NameGRUBidir(
    hidden_size=50,  # Hidden state size of the GRU
    embedding_dim=25,  # Dimension of the embedding vectors
    vocab_size=len(vocab),  # Vocabulary size
    pad_idx=pad_idx,  # Padding index for the embedding layer
)

# Load the trained model's state dictionary into the new model instance
# This will set the model parameters to the best found during training
model_gru_bidir_inference.load_state_dict(
    torch.load(checkpoint_fname, weights_only=False)
)

# Set the model to evaluation mode
# This disables dropout and other training-specific behaviors
model_gru_bidir_inference.eval()

# Move the model to the CPU for inference
# Predicting one name at a time is cheap, so a GPU is not needed for this step
model_gru_bidir_inference = model_gru_bidir_inference.to("cpu")

In [76]:
label_mapping = {
    "1": "F",
    "0": "M",
}  # Map label indices to gender labels ('1' -> 'F', '0' -> 'M')

# Create a list of true gender labels for the validation dataset
# Iterate over the names in the validation dataset and map the label index to the gender label using label_mapping
true_valid = [label_mapping[i[1]] for i in names_valid]

# Create a list of predicted gender labels for the validation dataset
# Iterate over the names in the validation dataset and use the predict function to get the predicted gender label
pred_valid = [predict(i[0], model_gru_bidir_inference, vocab)[0] for i in names_valid]

In [77]:
print("Accuracy: ", accuracy_score(true_valid, pred_valid))
print(f"Classification Report:\n {classification_report(true_valid, pred_valid)}")
print(f"Confusion Matrix:\n {confusion_matrix(true_valid, pred_valid)}")

Accuracy:  0.9607125020809056
Classification Report:
               precision    recall  f1-score   support

           F       0.96      0.97      0.96      9929
           M       0.96      0.95      0.96      8092

    accuracy                           0.96     18021
   macro avg       0.96      0.96      0.96     18021
weighted avg       0.96      0.96      0.96     18021

Confusion Matrix:
 [[9620  309]
 [ 399 7693]]
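
The figures in the classification report can be recovered from the confusion matrix alone. The sketch below copies the matrix printed above and recomputes accuracy, precision, and recall in plain Python; it relies on scikit-learn's convention that rows are true labels and columns are predicted labels, in sorted label order (`F`, then `M`). The variable names are chosen for this example only.

```python
# Confusion matrix copied from the output above.
# Rows: true label, columns: predicted label, both in the order F, M.
confusion = [
    [9620, 309],
    [399, 7693],
]
labels = ["F", "M"]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(labels)))
accuracy = correct / total

metrics = {}
for i, label in enumerate(labels):
    true_positives = confusion[i][i]
    predicted_as_label = sum(row[i] for row in confusion)  # column sum
    actually_label = sum(confusion[i])                      # row sum (support)
    metrics[label] = {
        "precision": true_positives / predicted_as_label,
        "recall": true_positives / actually_label,
        "support": actually_label,
    }

print(f"accuracy={accuracy:.4f}")
for label, m in metrics.items():
    print(f"{label}: precision={m['precision']:.2f} recall={m['recall']:.2f} support={m['support']}")
```

Read this way, the 309 off-diagonal entries in the first row are female names predicted as male, and the 399 in the second row are male names predicted as female.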


# Model Comparison

| Architecture | Accuracy | Number of Parameters |
|-------------------------|----------|----------------------|
| Vanilla RNN | 93.12% | 10,977 |
| LSTM | 94.79% | 37,827 |
| Bidirectional LSTM | 95.89% | 94,877 |
| Bidirectional GRU | 96.07% | 71,977 |

## Analysis

- **Accuracy**:
    - The **Bidirectional GRU** model achieves the highest accuracy at 96.07%, narrowly ahead of the **Bidirectional LSTM** at 95.89% (a gap of 0.18 percentage points on this validation set).
    - The **LSTM** model outperforms the **Vanilla RNN**, with an accuracy of 94.79% compared to 93.12%.

- **Number of Parameters**:
    - The **Vanilla RNN** has the fewest parameters (10,977), making it the most lightweight model.
    - The **Bidirectional LSTM** has the highest number of parameters (94,877), yet this extra capacity does not translate into the highest accuracy on this task.
    - The **Bidirectional GRU** has fewer parameters (71,977) than the **Bidirectional LSTM**, but more than the **LSTM** (37,827) and **Vanilla RNN**.
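
The parameter counts of the two bidirectional models can be reproduced by hand, which makes the GRU/LSTM gap concrete: a GRU layer stores three blocks of weights (update gate, reset gate, candidate) where an LSTM layer stores four. The sketch below follows the layer layout of `NameGRUBidir` and PyTorch's convention of two bias vectors per recurrent layer and direction. Two things are assumptions rather than facts from the notebook: a vocabulary of 28 symbols (the value that makes the totals match the reported counts) and that the Bidirectional LSTM uses the same layout with an LSTM encoder. The function names are hypothetical.

```python
def recurrent_params(input_size, hidden_size, num_layers, bidirectional, gate_blocks):
    """Parameters of a stacked recurrent encoder (gate_blocks: 3 for GRU, 4 for LSTM)."""
    directions = 2 if bidirectional else 1
    total = 0
    layer_input = input_size
    for _ in range(num_layers):
        per_direction = gate_blocks * (
            hidden_size * layer_input    # input-to-hidden weights
            + hidden_size * hidden_size  # hidden-to-hidden weights
            + 2 * hidden_size            # two bias vectors
        )
        total += directions * per_direction
        layer_input = hidden_size * directions  # next layer sees both directions
    return total


def model_params(vocab_size, embedding_dim, hidden_size, gate_blocks):
    """Embedding + 2-layer bidirectional encoder + two-layer classifier."""
    embedding = vocab_size * embedding_dim
    encoder = recurrent_params(embedding_dim, hidden_size, 2, True, gate_blocks)
    half = hidden_size // 2
    classifier = (2 * hidden_size * half + half) + (half * 2 + 2)
    return embedding + encoder + classifier


VOCAB_SIZE = 28  # assumption: inferred so that the totals match the reported counts

gru_total = model_params(VOCAB_SIZE, embedding_dim=25, hidden_size=50, gate_blocks=3)
lstm_total = model_params(VOCAB_SIZE, embedding_dim=25, hidden_size=50, gate_blocks=4)
print(gru_total)   # 71977
print(lstm_total)  # 94877
```

Under these assumptions the whole difference of 22,900 parameters comes from the fourth gate block in each LSTM layer and direction; the embedding and classifier are identical in both models.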

## Considerations

- The **Bidirectional GRU** provides the best accuracy with fewer parameters than the Bidirectional LSTM, making it an efficient choice for this task.
- The **LSTM** improves accuracy by about 1.7 percentage points over the **Vanilla RNN** (94.79% vs 93.12%), at the cost of roughly 3.4 times as many parameters (37,827 vs 10,977).
- The **Vanilla RNN** is the simplest model with the fewest parameters but also the lowest accuracy.

We have now covered the foundations of Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, and Gated Recurrent Units (GRUs). We examined the strengths and limitations of RNNs and saw how LSTMs and GRUs improve on them to better handle long-term dependencies.

Understanding these three architectures is crucial for anyone interested in Natural Language Processing (NLP). NLP tasks often involve sequential data in which temporal dependencies are key, whether the goal is to identify the sentiment of a customer review or to translate a passage from one language to another. RNNs, LSTMs, and GRUs give us powerful tools for making sense of this kind of data.

However, remember that the choice of architecture should be dictated by the specifics of your task. While RNNs can suffice for simpler, short-term dependencies, LSTMs and GRUs are better suited to more complex tasks or longer sequences. No model is a silver bullet, though: experimenting, iterating, and continuous learning are part of the process.

Next class we will start to work with a new type of neural network: Transformers.

## Takeaways

- Understanding RNNs, LSTMs, and GRUs is crucial for natural language processing tasks that involve sequential data and require capturing dependencies over time.

- The choice of architecture depends on the specific task and its complexity. RNNs can suffice for simpler, short-term dependencies, while LSTMs and GRUs are better suited for more complex tasks or longer sequences.

- Bidirectional networks offer additional context by processing sequences in both forward and backward directions. In the case study, both bidirectional models were more accurate than the unidirectional LSTM and the Vanilla RNN.

- Experimenting with different architectures, hyperparameters, and iterative model evaluation is essential to find the optimal solution for a given task.

- Comprehending the strengths and limitations of each architecture and considering the tradeoffs between model complexity, computational efficiency, and performance is crucial when selecting the appropriate model.

# Questions

1. What are the main types of recurrent neural networks discussed in this class?

2. What is the vanishing gradient problem and how do LSTMs address it?

3. How does a bidirectional LSTM network process sequences differently from a standard LSTM?

4. What are the key components of an LSTM cell and what are their functions?

5. How does a GRU simplify the architecture of an LSTM?

6. What was the case study used to demonstrate the application of these different network architectures?

7. Which model achieved the highest accuracy on the name gender classification task?

8. How many trainable parameters did the bidirectional LSTM model have compared to the vanilla RNN?

9. What are some of the tradeoffs to consider when choosing between RNNs, LSTMs and GRUs?

10. Why are recurrent networks particularly useful for natural language processing tasks?

`Answers are commented inside this cell`

<!-- 1. Vanilla RNNs, LSTMs, and GRUs. RNNs are the simplest, processing sequences with hidden states. LSTMs introduce a memory cell and gates to better handle long-term dependencies. GRUs simplify LSTMs by combining hidden and cell states and using only update and reset gates.

2. Vanishing gradients occur when gradients become extremely small during backpropagation through long sequences, preventing effective learning. LSTMs address this with a memory cell that maintains long-term information and gates (input, output, forget) that regulate information flow.

3. A standard LSTM reads the sequence in one direction only. A bidirectional LSTM processes the sequence in both forward and backward directions, capturing both past and future contexts, which gives it a more complete view of the sequence.

4. The memory cell maintains long-term information. The input gate controls what new information is added to the cell state. The forget gate determines what information to discard from the cell state. The output gate decides what information from the cell state is used to compute the hidden state.

5. GRUs combine the hidden and cell states into a single hidden state and use only two gates (update and reset), making them computationally more efficient than LSTMs while still handling long-term dependencies effectively.

6. A name gender classification task using a dataset from the 2010 Brazilian Census, containing 90,104 names (49,274 female and 40,830 male). Names that could be associated with both genders were excluded. The data was split into 80% training and 20% validation sets.

7. According to the model comparison table, the Bidirectional GRU achieved the highest accuracy (96.07%), followed by the Bidirectional LSTM (95.89%), LSTM (94.79%), and Vanilla RNN (93.12%).

8. The Bidirectional LSTM had 94,877 trainable parameters, compared to 10,977 for the Vanilla RNN, roughly 8.6 times as many.

9. Vanilla RNNs are the simplest and have the fewest parameters, but they struggle with vanishing and exploding gradients on long sequences. LSTMs handle long-term dependencies better at the cost of more parameters and computation. GRUs use fewer gates and parameters than LSTMs and tend to train faster while still handling long-term dependencies. The right choice depends on task complexity, sequence length, and available computational resources.

10. Language is sequential: the meaning of a word depends on the words around it. Recurrent networks maintain a hidden state across time steps, which allows them to capture dependencies and relationships between words in a sentence. -->