# Recurrent Neural Networks
## IMD0187 - Natural Language Processing
### [Dr. Elias Jacob de Menezes Neto](https://sigaa.ufrn.br/sigaa/public/docente/portal.jsf?siape=2353000)

## Summary

### Keypoints

- Recurrent Neural Networks (RNNs) process sequential data by maintaining hidden states across time steps, allowing them to capture dependencies and relationships between words in a sentence.

- RNNs face challenges with vanishing and exploding gradients when dealing with long sequences. The vanishing gradient problem occurs when gradients become extremely small during backpropagation, preventing effective learning. The exploding gradient problem happens when gradients become very large, leading to unstable training.

- Long Short-Term Memory (LSTM) networks address the limitations of RNNs by introducing a memory cell and gates (input, output, forget) that regulate information flow. This allows LSTMs to effectively capture long-term dependencies.

- The LSTM architecture consists of an input gate that controls what new information is added to the cell state, a forget gate that determines what information to discard, an output gate that decides what information from the cell state is used to compute the hidden state, and a memory cell that maintains long-term information.

- Bidirectional LSTM networks process sequences in both forward and backward directions, capturing both past and future contexts. This additional context helps improve the performance of sequence processing tasks.

- Gated Recurrent Units (GRUs) simplify the LSTM architecture by combining the hidden and cell states into a single hidden state and using only two gates (update and reset). GRUs are computationally more efficient than LSTMs while still effectively handling long-term dependencies.

- The case study demonstrated the application of RNN, LSTM, and GRU architectures on a name gender classification task using a dataset from the 2010 Brazilian Census. The Bidirectional LSTM achieved the highest accuracy, followed by the Bidirectional GRU, LSTM, and Vanilla RNN.

### Takeaways

- Understanding RNNs, LSTMs, and GRUs is crucial for natural language processing tasks that involve sequential data and require capturing dependencies over time.

- The choice of architecture depends on the specific task and its complexity. RNNs can suffice for simpler, short-term dependencies, while LSTMs and GRUs are better suited for more complex tasks or longer sequences.

- Bidirectional networks offer additional context by processing sequences in both forward and backward directions, leading to improved performance in sequence processing tasks.

- Experimenting with different architectures, hyperparameters, and iterative model evaluation is essential to find the optimal solution for a given task.

- Comprehending the strengths and limitations of each architecture and considering the tradeoffs between model complexity, computational efficiency, and performance is crucial when selecting the appropriate model.

- Mastering these foundational concepts and techniques provides a strong basis for advancing skills in natural language processing and tackling more complex tasks such as sentiment analysis, machine translation, and text generation.

> Side Note: This class contains a set of very vanilla implementations with tools that we don't usually use in production. For example, we use the `torchtext` library to load the dataset and create the vocabulary. In a real-world scenario, we would use a more efficient and scalable approach, such as tokenization with `transformers` or `spaCy` and data loading with `DataLoader` from PyTorch. Please bear in mind that the goal of this class is to provide a basic understanding of RNNs, LSTMs, and GRUs, not to showcase the most efficient implementation.

# Recurrent Networks for NLP


When processing textual data, it's crucial to consider the dependencies and relationships between words in a sentence. The semantics of a sentence can change profoundly based on the order and selection of words.

Consider these two similar sentences:
> "A bomba explodiu no jornal."
>
> "A notícia do jornal explodiu como uma bomba"

Despite having analogous structure, interchanging just one adjective leads to a completely different meaning and emotional impact on the reader. Context plays a vital role, especially when a sentence's overall meaning can be greatly influenced by what has been said or happened previously.

*Recurrent Neural Networks* (RNNs) provide neural networks with the capability to memorize previous words within a statement, enabling them to better capture and understand patterns that appear when certain tokens appear sequentially relative to other tokens. This is the fundamental premise of RNNs.

## How RNNs Maintain State Across Time

RNNs operate on the principle of maintaining state across time. While initially, it might seem complicated, it's essentially about giving the network a context for its current operation based on historical data.

For each input fed into a standard feed-forward network, the output from one time step 't' is provided as an additional input for the next step 't+1', along with the fresh data being supplied at 't+1'. In simpler terms, you inform the network about what happened earlier alongside what is happening "now".

This concept forms the basis of RNNs—which learn and remember over time, enabling them to better capture patterns within sequences. Understanding this is key to exploiting the power of RNNs for text analysis and other sequential data processing tasks.

## Visualizing a RNN

You can visualize a recurrent net as shown in figure below:

<p align="center">
<img src="images/rnn_unrolled.png" alt="" style="width: 50%; height: 50%"/>
</p>


Look at the left side. The circles are entire feedforward network layers composed of one or more neurons. The output of the hidden layer emerges from the network as normal, but it’s also set aside to be passed back in as an input to itself along with the normal input from the next time step. This feedback is represented with an arc from the output of a layer back into its own input.

An easier way to see this process—and it’s more commonly shown this way—is by unrolling the net. The right side of the image above shows the network stood on its head with two unfoldings of the time variable (t), showing layers for t+1 and t+2.

Each time step is represented by a column of neurons in the unrolled version of the very same neural network. It’s like looking at a screenplay or video frame of the neural net for each sample in time. The network to the right is the future version of the network on the left. The output of a hidden layer at one time step (t) is fed back into the hidden layer along with input data for the next time step (t+1) to the right. Repeat. This diagram shows two iterations of this unfolding, so three columns of neurons for t=0, t=1, and t=2.

All the vertical paths in this visualization are clones, or views of the same neurons. They are the single network represented on a timeline. This visualization is helpful when talking about how information flows through the network forward and backward during backpropagation. But when looking at the three unfolded networks, remember that they’re all different snapshots of the same network with a single set of weights.

### Structure of RNN: Feedforward Network Layers

Viewing the left side of the image above, you'll notice circles that represent layers in a feedforward network, with each layer comprising one or more neurons. The output of the hidden layer not only moves forward through the network but also feeds back into the input of its originating layer.

This recurrent feedback is illustrated by an arc looping from the layer's output back to its own input.

### Unfolding Time Variable for Better Visualization

To better visualize this process, we can 'unroll' the network over time. This technique, represented on the right side of the image, essentially flips the network on its head, revealing the progress of the network over two stages of the time variable (t), namely t+1 and t+2.

Each time step 't' is denoted as a column of neurons in the unrolled version of our network. It can be thought of as watching successive frames of a movie, where each frame represents the state of the network at a given moment in time.

### Cloned Views of Same Neurons

In this representation, all vertical paths are clones or different views of the same set of neurons; they depict the same neural network captured at various points along a timeline.

While this kind of representation simplifies comprehension of information flow (both forward and backward during backpropagation), it's essential to remember when looking at these multiple 'unfolded' networks: they are merely simultaneous snapshots of the same single network maintaining a consistent set of weights.


> Recognizing an unrolled RNN as sequential instances of the same network operating over time is crucial for understanding how RNNs capture and utilize temporal information from sequences. This comprehension forms the basis of effectively utilizing the power of RNNs for sequence data processing tasks.

## Training our first RNN

### Our case study

Let's discuss an interesting scenario. Assume you are employed at the Ombudsman Office of our University. A major part of your role involves addressing students' complaints and correspondingly communicating with the related departments. As a measure to enhance the quality of your communication, you have decided to use gender-appropriate pronouns based on the person's first name.

One important goal is to avoid incorrectly gendering specific roles or offices within the university (e.g., naming the President's office as "Gabinete do Reitor" even when the President is a female, which happened between 2011 and 2019 at UFRN).

To achieve this objective, we will be using a [dataset collected by IBGE during 2010 Census](https://brasil.io/dataset/genero-nomes/nomes/). This dataset contains a total of 90,104 names, out of which 49,274 are female and 40,830 are male. To ensure accuracy, any names that could be associated with both genders, such as "Elias", "Ivani" or "Alison", have been excluded from our analysis.

> Interestingly, during my data exploration, I found out that there were 189,315 men and 1,387 women with the same name as mine. I'd never imagine women could be named Elias!

Our aim here is to develop a Recurrent Neural Network (RNN) that can read a name letter by letter, and predict the probability of the name being either masculine or feminine. For the purpose of this project:

- Each lowercase letter will be considered a token
- The vocabulary will comprise the 26 alphabet letters
- Any accented letters will be converted to their non-accented versions

We will divide our dataset into two parts: 80% for training and 20% for validation.

> Please note that while we recognize and respect the existence of non-binary gender identities, for the purposes of this exercise, we will be employing a binary classification model due to dataset limitations. Our dataset solely contains names which are classified as either masculine or feminine, hence we are confined to two classes. Maybe in the future we can work on a more inclusive model!

With that said, let's start by importing the necessary libraries and loading our dataset.

In [1]:
# Import the unicodedata library for Unicode character database
import unicodedata

# Import PyTorch library for deep learning
import torch

# Import neural network module from PyTorch
import torch.nn as nn

# Import functional interface for neural networks from PyTorch
import torch.nn.functional as F

# Import optimization algorithms from PyTorch
import torch.optim as optim

# Import data module from torchtext for handling text data
# Note: You may need to install this library at version 0.6.0 using pip install torchtext==0.6.0
from torchtext import data

# Import torchtext library for text processing
import torchtext

# Import pandas library for data manipulation and analysis
import pandas as pd

# Import random module for generating random numbers
import random

# Import accuracy_score, classification_report, and confusion_matrix from sklearn for evaluation metrics
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Import datetime and timedelta from datetime module for handling date and time
from datetime import datetime, timedelta

In [2]:
df = pd.read_csv('data/names_gender.csv')
df

Unnamed: 0,name,label,label_str
0,silmari,1,F
1,jovanilde,1,F
2,yorrana,1,F
3,nakita,1,F
4,tiarle,0,M
...,...,...,...
90098,edsmar,0,M
90099,altenice,1,F
90100,arthemis,1,F
90101,mielly,1,F


In [3]:
import unicodedata

def normalize_name(name):
    # Step 1: Normalize the Unicode string
    # NFKD stands for Normalization Form Compatibility Decomposition
    # This step separates combined characters into base character and diacritical marks
    normalized = unicodedata.normalize('NFKD', name)
    
    # Step 2: Encode to ASCII and ignore non-ASCII characters
    # This effectively removes accents and other diacritical marks
    ascii_encoded = normalized.encode('ASCII', 'ignore')
    
    # Step 3: Decode back to UTF-8
    # This converts the bytes object back into a string
    utf8_decoded = ascii_encoded.decode('utf-8')
    
    # Step 4: Convert to lowercase
    # This ensures consistent casing across all names
    lowercased = utf8_decoded.lower()
    
    return lowercased

# Example usage:
# This will output 'jose', removing the accent from the 'é'
print(normalize_name('José'))  # Output: jose 
print(normalize_name('Café'))  # Output: cafe
print(normalize_name('Über'))  # Output: uber


jose
cafe
uber


In [4]:
df

Unnamed: 0,name,label,label_str
0,silmari,1,F
1,jovanilde,1,F
2,yorrana,1,F
3,nakita,1,F
4,tiarle,0,M
...,...,...,...
90098,edsmar,0,M
90099,altenice,1,F
90100,arthemis,1,F
90101,mielly,1,F


In [5]:
print(df.shape)
df['name'] = df['name'].apply(normalize_name)
df.drop_duplicates(inplace=True, subset=['name'], keep=False)
print(df.shape)

(90103, 3)
(90103, 3)


In [6]:
df.query('name == "jose"')

Unnamed: 0,name,label,label_str
58546,jose,0,M


In [7]:
df.query('name == "maria"')

Unnamed: 0,name,label,label_str
30451,maria,1,F


In [8]:
# This will be our simple tokenizer. It will split the names into letters

def custom_tokenizer_letters(text):
    text = normalize_name(text)
    
    # Convert the normalized text into a list of individual characters
    # This approach treats each letter as a separate token
    return list(text)

# Example usage of the tokenizer
print(custom_tokenizer_letters('José'))

# Expected output: ['j', 'o', 's', 'e']


['j', 'o', 's', 'e']


In [9]:
# Import the LabelField and Field classes from the torchtext.data module
from torchtext import data

# Define a LabelField to represent the gender label for each name
GENDER = data.LabelField()

# Define a Field to represent the name text, using our custom tokenizer and including the length of each name
NAME = data.Field(tokenize=custom_tokenizer_letters, lower=True, include_lengths=True)

# Create a list of tuples representing the fields in our dataset, with the name field first and the gender label field second
fields = [('name', NAME), ('label', GENDER)]

In [10]:
# Import the TabularDataset class from the torchtext.data module
from torchtext.data import TabularDataset

# Create a TabularDataset from our CSV file, using the fields we defined earlier
dset = TabularDataset(path='data/names_gender.csv', format='CSV', fields=fields, skip_header=True)

In [11]:
# Split the dataset into training and validation sets, with a split ratio of 80/20
# The split is stratified based on the gender label, so that each set has roughly the same proportion of male and female names
# The random seed is set to ensure reproducibility
(train_dataset, valid_dataset) = dset.split(split_ratio=[0.8, 0.2], stratified=True, strata_field='label', random_state=random.seed(271828))

In [12]:
# Create an empty list to store the name and label tuples
names_valid = list()

# Iterate over each example in the validation dataset
for ex in valid_dataset.examples:
    # Join the list of letters in the name field to create a single string
    n = ''.join(ex.name)
    # Get the label field value (either 0 or 1)
    l = ex.label
    # Append a tuple of the name and label to the list
    names_valid.append((n, l))

In [13]:
random.choices(names_valid, k=10)

[('vannia', '1'),
 ('taiele', '1'),
 ('dicineia', '1'),
 ('jemyson', '0'),
 ('mrizete', '1'),
 ('vanuzia', '1'),
 ('larinha', '1'),
 ('luiton', '0'),
 ('eiiti', '0'),
 ('aldemberg', '0')]

In [14]:
# Create an empty list to store the name and label tuples
names_train = list()

# Iterate over each example in the training dataset
for ex in train_dataset.examples:
    # Join the list of letters in the name field to create a single string
    n = ''.join(ex.name)
    # Get the label field value (either 0 or 1)
    l = ex.label
    # Append a tuple of the name and label to the list
    names_train.append((n, l))

In [15]:
random.choices(names_train, k=10)

[('ermson', '0'),
 ('ystefane', '1'),
 ('dianina', '1'),
 ('euricelia', '1'),
 ('edneli', '1'),
 ('erigleidson', '0'),
 ('quenedy', '0'),
 ('kallinca', '1'),
 ('angerlandia', '1'),
 ('rimer', '0')]

In [16]:
# Import the Counter class from the collections module
from collections import Counter

# Use a list comprehension to extract the gender labels from the training and validation datasets
# The Counter class is then used to count the frequency of each label (0 for male, 1 for female)
# The resulting counts are printed to the console
print(Counter([ex.label for ex in train_dataset.examples]))
print(Counter([ex.label for ex in valid_dataset.examples]))

Counter({'1': 39419, '0': 32663})
Counter({'1': 9855, '0': 8166})


In [17]:
len(train_dataset), len(valid_dataset)

(72082, 18021)

In [18]:
# Set the maximum size for the vocabulary
vocab_size = 50

# Build the vocabulary for the 'NAME' field using the training dataset
# Limit the vocabulary size to the specified maximum
NAME.build_vocab(train_dataset, max_size=vocab_size)

# Build the vocabulary for the 'GENDER' field using the training dataset
# No size limit is specified for this vocabulary
GENDER.build_vocab(train_dataset)

# Get the length of the vocabulary for the 'NAME' field
len(NAME.vocab) # 26 letters + 1 for unknown + 1 for padding

28

In [19]:
NAME.vocab.freqs.most_common(5)

[('a', 71152), ('i', 66096), ('e', 63120), ('l', 47180), ('n', 46335)]

In [20]:
GENDER.vocab.stoi # stoi: string to index

defaultdict(None, {'1': 0, '0': 1})

In [21]:
# Let's check our vocabulary
for i in range(len(NAME.vocab)):
    print(i, NAME.vocab.itos[i])

0 <unk>
1 <pad>
2 a
3 i
4 e
5 l
6 n
7 r
8 o
9 d
10 s
11 c
12 m
13 t
14 u
15 v
16 j
17 y
18 g
19 h
20 z
21 b
22 k
23 f
24 w
25 p
26 q
27 x


In [22]:
# Check if a GPU is available for PyTorch
# torch.cuda.is_available() returns True if a GPU is available, otherwise False
# If a GPU is available, set the device to 'cuda' to utilize the GPU for computations
# If a GPU is not available, set the device to 'cpu' to use the CPU for computations
device = 'cuda' if torch.cuda.is_available() else 'cpu'


In [23]:
# Use the BucketIterator class from the torchtext.data module to create iterators for the training and validation datasets
# The iterators will be used to generate batches of data during training and validation
# The batch size is set to 32, and the device is set to the specified device (e.g. 'cpu' or 'cuda')
# The sort_key argument specifies the function to use for sorting examples within each batch (in this case, the length of the name field)
# The sort_within_batch argument specifies whether to sort examples within each batch (in this case, True)

train_iter, valid_iter = data.BucketIterator.splits((train_dataset, valid_dataset),
                                                  batch_size = 32,
                                                  device = device,
                                                  sort_key = lambda x: len(x.name),
                                                  sort_within_batch = True)

In [24]:
import torch.nn as nn

# Define a class for our RNN model
class NameRNN(nn.Module):
    def __init__(self, hidden_size, embedding_dim, vocab_size, pad_idx):
        super().__init__()
        
        # Define an embedding layer to convert our letters to vectors
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        
        # Define an RNN layer to encode our sequence of letters
        self.encoder = nn.RNN(
            input_size=embedding_dim,  # Size of each input vector
            hidden_size=hidden_size,   # Number of features in the hidden state
            num_layers=2,              # Number of recurrent layers
            dropout=0.3,               # Dropout probability for the RNN layers
            bidirectional=False        # Whether the RNN is bidirectional
        )

        # Define a dropout layer to prevent overfitting
        self.dropout = nn.Dropout(0.3)

        # Define a classifier layer to output our final prediction
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size, hidden_size // 2), # First linear layer
            self.dropout,                             # Dropout layer
            nn.ReLU(),                                # Activation function
            nn.Linear(hidden_size // 2, 2)            # Second linear layer, output size is 2 for binary classification
        )

    def forward(self, seq, name_len):
        # Convert our sequence of letters to a sequence of vectors using the embedding layer
        embedded = self.embedding(seq)
        embedded = self.dropout(embedded)
        
        # Encode the embedded sequence using the RNN layer
        packed_output, hidden = self.encoder(embedded)
        
        # Pass the final hidden state through the classifier layer to get our prediction
        preds = self.classifier(hidden[-1])
        
        return preds

In [25]:
pad_idx = NAME.vocab.stoi[NAME.pad_token]
pad_idx

1

In [26]:
model_rnn = NameRNN(
    hidden_size=50,            # Hidden state size of the RNN
    embedding_dim=25,          # Dimension of the embedding vectors
    vocab_size=len(NAME.vocab),# Vocabulary size based on the 'NAME' field
    pad_idx=pad_idx            # Padding index for the embedding layer
)

In our example, we've selected an arbitrary hidden size of 50 and an embedding dimension of 25. These choices were made largely to minimize computation time; however, adjusting these parameters can yield different results in both computational speed and model accuracy. Hence, it's encouraged that you experiment with varying these numbers based on your specific use-cases.

#### Rule of Thumb for Model Complexity

A helpful guiding principle when designing models is to ensure that the complexity of your model aligns appropriately with your data's innate structure. The objective is to achieve a balance where:

1. Your model isn't too complex for your data (Overfitting), and
2. It's not too simple relative to your data (Underfitting).

Let's understand what this means:

##### Overfitting - High Variance and Low Bias

When a model is overly complex, it tends to "memorize" the training data rather than "learning" from it, causing poor generalization when faced with new, unseen data. This scenario is referred to as **overfitting** the data and results in a model with high variance and low bias.

##### Underfitting - Low Variance and High Bias

Conversely, if a model is too uncomplicated, it will fail even in capturing the fundamental patterns of the training data. Such underutilizing models are prone to consistently generate inaccurate predictions across all data types, both seen and unseen. This sub-optimal situation is known as **underfitting** and leads to a model with low variance and high bias.

#### Balancing Between Bias and Variance

Balancing between overfitting and underfitting is often described as managing the trade-off between bias and variance. The key is to find a sweet spot where the model is just complex enough to learn useful patterns from the training data but also retains the ability to generalize effectively to unseen data.

While building your model, it's essential to keep this concept in mind: experiment with different parameters, monitor how they affect the performance of your model, and fine-tune them for optimal results.

In [27]:
# Function to count the number of trainable parameters in a model
def count_parameters(model):
    # Sum the number of elements (numel) for each parameter in the model
    # Only include parameters that require gradients (i.e., are trainable)
    n_parameters = sum(p.numel() for p in model.parameters() if p.requires_grad)
    
    # Print the total number of trainable parameters in a human-readable format
    print(f"The model has {n_parameters:,} trainable parameters")
    
    # Return the total number of trainable parameters
    return n_parameters

# Count the number of trainable parameters in the model_rnn instance
n_parameters = count_parameters(model_rnn)

The model has 10,977 trainable parameters


In [28]:
# Define the training function
def train(epochs, model, optimizer, criterion, train_iterator, valid_iterator, device, checkpoint_fname, verbose=True):
    if verbose:
        # Print the number of trainable parameters
        n_parameters = count_parameters(model)

    # Set a timer to measure training duration
    start_time = datetime.now()

    # Move the model and loss function to the specified device (CPU or GPU)
    model = model.to(device)
    criterion = criterion.to(device)

    # Initialize the best validation loss to infinity
    best_valid_loss = float('inf')
    
    # Loop over the specified number of epochs
    for epoch in range(1, epochs + 1):
        
        # Initialize the training loss for this epoch
        training_loss = 0.0
        
        # Set the model to training mode
        model.train()
        
        # Loop over the training data in batches
        for batch_idx, batch in enumerate(train_iterator):
            # Zero the gradients to prevent accumulation
            optimizer.zero_grad()
            
            # Get the name and name length from the batch
            nome, name_len = batch.name
            
            # Make a prediction using the model and calculate the loss
            predict = model(nome, name_len).squeeze(1)
            loss = criterion(predict, batch.label)
            
            # Backpropagate the loss and update the model parameters
            loss.backward()
            optimizer.step()
            
            # Add the batch loss to the total training loss
            training_loss += loss.data.item() * batch.name[0].size(0)
        
        # Calculate the average training loss for this epoch
        training_loss /= len(train_iterator)
        
        # Set the model to evaluation mode
        model.eval()
        
        # Initialize the validation loss for this epoch
        valid_loss = 0.0
        
        # Loop over the validation data in batches
        for batch_idx, batch in enumerate(valid_iterator):
            # Get the name and name length from the batch
            nome, name_len = batch.name
            
            # Make a prediction using the model and calculate the loss
            predict = model(nome, name_len).squeeze(1)
            loss = criterion(predict, batch.label)
            
            # Add the batch loss to the total validation loss
            valid_loss += loss.data.item() * batch.name[0].size(0)
        
        # Calculate the average validation loss for this epoch
        valid_loss /= len(valid_iterator)
        
        # If the validation loss is better than the previous best, save the model and print a message
        if valid_loss < best_valid_loss:
            best_valid_loss = valid_loss
            torch.save(model.state_dict(), checkpoint_fname)
            if verbose:
                print(f'Epoch {epoch} - Training Loss: {training_loss:.4f} - Valid Loss: {valid_loss:.4f} - New Best')
        else:
            if verbose:
                print(f'Epoch {epoch} - Training Loss: {training_loss:.4f} - Valid Loss: {valid_loss:.4f}')
    
    if verbose:
        # Print the total time elapsed and the mean time per epoch
        elapsed_time = datetime.now() - start_time
        print(f'Time elapsed: {elapsed_time}')
        print(f'Mean time per epoch: {elapsed_time / epochs}')

In [29]:
import torch.nn.functional as F

# Define a function to predict the gender label for a given name using a trained model
# The function takes three arguments:
# - nome: the name to predict the label for
# - model: the trained model to use for prediction
# - device: the device to use for computation (default is 'cpu')
def predict(nome, model, device='cpu'):
    # Preprocess the name using the NAME field's preprocess method and get its length
    # The NAME.process method converts the name into a tensor and calculates its length
    name, name_len = NAME.process([NAME.preprocess(nome)])
    
    # Move the preprocessed name and its length to the specified device (CPU or GPU)
    name = name.to(device)
    name_len = name_len.to(device)
    
    # Pass the preprocessed name and its length through the model to get the logits (unnormalized scores) for each label
    logits = model(name, name_len)
    
    # Apply the softmax function to the logits to get the predicted probabilities for each label
    result = F.softmax(logits, dim=1)
    
    # Create a dictionary to map the label indices to their corresponding gender labels ('0' -> 'M', '1' -> 'F')
    result_dict = {'1': 'F', '0': 'M'}
    
    # Get the index of the label with the highest predicted probability
    # Use the result dictionary to map the index to its corresponding gender label
    label = GENDER.vocab.itos[result.argmax().item()]
    label = result_dict[label]
    
    # Return a list containing the predicted gender label and its corresponding probability
    return [label, result.max().item()]

In [30]:
# Import the Path class from the pathlib module to handle file system paths
from pathlib import Path

# Define the path where the model checkpoints will be saved
checkpoint_path = Path('./outputs/rnns/')

# Create the directory (and any necessary parent directories) if it doesn't already exist
checkpoint_path.mkdir(parents=True, exist_ok=True)

# Define the optimizer to use for training the model
# Here, we use the Adam optimizer with a learning rate of 3e-4
optimizer = optim.Adam(model_rnn.parameters(), lr=3e-4)

# Define the loss function to use for training the model
# Here, we use the CrossEntropyLoss, which is suitable for classification tasks
criterion = nn.CrossEntropyLoss()

In [31]:
# Define the filename for saving the best model checkpoint
# The checkpoint will be saved in the previously defined checkpoint_path directory
checkpoint_fname = checkpoint_path / 'bestRNN.pt'

# Train the model for 30 epochs
# Arguments:
# - epochs: Number of epochs to train the model
# - model: The model to train (model_rnn)
# - optimizer: The optimizer to use for training (Adam optimizer)
# - criterion: The loss function to use for training (CrossEntropyLoss)
# - train_iterator: The iterator for the training dataset
# - valid_iterator: The iterator for the validation dataset
# - device: The device to use for computation (e.g., 'cpu' or 'cuda')
# - checkpoint_fname: The filename for saving the best model checkpoint
train(30, model_rnn, optimizer, criterion, train_iter, valid_iter, device, checkpoint_fname)


The model has 10,977 trainable parameters
Epoch 1 - Training Loss: 2.1690 - Valid Loss: 1.4515 - New Best
Epoch 2 - Training Loss: 1.6889 - Valid Loss: 1.2581 - New Best
Epoch 3 - Training Loss: 1.5247 - Valid Loss: 1.1709 - New Best
Epoch 4 - Training Loss: 1.4318 - Valid Loss: 1.1036 - New Best
Epoch 5 - Training Loss: 1.3322 - Valid Loss: 1.0420 - New Best
Epoch 6 - Training Loss: 1.2609 - Valid Loss: 0.9848 - New Best
Epoch 7 - Training Loss: 1.2197 - Valid Loss: 0.9557 - New Best
Epoch 8 - Training Loss: 1.1665 - Valid Loss: 0.9519 - New Best
Epoch 9 - Training Loss: 1.1393 - Valid Loss: 0.9095 - New Best
Epoch 10 - Training Loss: 1.1084 - Valid Loss: 0.9200
Epoch 11 - Training Loss: 1.0781 - Valid Loss: 0.8862 - New Best
Epoch 12 - Training Loss: 1.0588 - Valid Loss: 0.9175
Epoch 13 - Training Loss: 1.0438 - Valid Loss: 0.8816 - New Best
Epoch 14 - Training Loss: 1.0230 - Valid Loss: 0.8635 - New Best
Epoch 15 - Training Loss: 1.0127 - Valid Loss: 0.8467 - New Best
Epoch 16 - Tra

In [32]:
# Create a new instance of the NameRNN model with the same hyperparameters as the trained model
# This ensures that the model architecture matches the one used during training
model_rnn_inference = NameRNN(
    hidden_size=50,            # Hidden state size of the RNN
    embedding_dim=25,          # Dimension of the embedding vectors
    vocab_size=len(NAME.vocab),# Vocabulary size based on the 'NAME' field
    pad_idx=pad_idx            # Padding index for the embedding layer
)

# Load the trained model's state dictionary into the new model instance
# This will set the model parameters to the best found during training
model_rnn_inference.load_state_dict(torch.load(checkpoint_fname))

# Set the model to evaluation mode
# This disables dropout and other training-specific behaviors
model_rnn_inference.eval()

# Send the model to the CPU for inference
# This ensures that the model runs on the CPU, which is typically used for inference
model_rnn_inference = model_rnn_inference.to('cpu')

  model_rnn_inference.load_state_dict(torch.load(checkpoint_fname))


In [33]:
GENDER.vocab.itos

['1', '0']

In [34]:
predict('joana', model_rnn_inference)

['F', 0.9952084422111511]

In [35]:
predict('joão', model_rnn_inference)

['M', 0.9944537281990051]

In [36]:
predict('maria', model_rnn_inference)

['F', 0.9999452829360962]

In [37]:
predict('marcos', model_rnn_inference)

['M', 0.9954324960708618]

In [38]:
# Loop through the first 50 names in the training dataset
for i in names_train[:50]:
    # Print the name and its predicted gender label and probability
    # The predict function takes the name, the trained model, and the device ('cpu') as arguments
    print(i, predict(i[0], model_rnn_inference))

('adamis', '0') ['M', 0.8871570825576782]
('francimauro', '0') ['M', 0.9950956106185913]
('izalto', '0') ['M', 0.999893069267273]
('haim', '0') ['M', 0.9748303294181824]
('cleidinei', '0') ['M', 0.76402348279953]
('juvam', '0') ['M', 0.9981265664100647]
('aislan', '0') ['M', 0.9315415620803833]
('josr', '0') ['M', 0.9899489879608154]
('irames', '0') ['M', 0.539656937122345]
('isaelton', '0') ['M', 0.9999747276306152]
('reilon', '0') ['M', 0.9994834661483765]
('nilssom', '0') ['M', 0.9999818801879883]
('lorino', '0') ['M', 0.9989786148071289]
('leidio', '0') ['M', 0.9997301697731018]
('geolson', '0') ['M', 0.9999927282333374]
('atamil', '0') ['M', 0.9715228080749512]
('joaildo', '0') ['M', 0.9999790191650391]
('eisenhower', '0') ['M', 0.9930551648139954]
('laerto', '0') ['M', 0.9995064735412598]
('matteus', '0') ['M', 0.9738569259643555]
('levair', '0') ['M', 0.7796001434326172]
('auires', '0') ['M', 0.6104703545570374]
('vanjo', '0') ['M', 0.9983975291252136]
('olario', '0') ['M', 0.99

In [39]:
# Define a dictionary to map label indices to their corresponding gender labels
# '1' corresponds to 'F' (Female) and '0' corresponds to 'M' (Male)
label_mapping = {'1': 'F', '0': 'M'}

# Create a list of true gender labels for the validation dataset
# Iterate over the names in the validation dataset and map the label index to the gender label using label_mapping
true_valid = [label_mapping[i[1]] for i in names_valid]

# Create a list of predicted gender labels for the validation dataset
# Iterate over the names in the validation dataset and use the predict function to get the predicted gender label
# The predict function takes the name, the trained model, and the device ('cpu') as arguments
pred_valid = [predict(i[0], model_rnn_inference)[0] for i in names_valid]

In [40]:
# Initialize an empty list to store the names where the model's prediction is incorrect
failed = []

# Loop through the validation dataset
for i in range(len(true_valid)):
    # Compare the true gender label with the predicted gender label
    if true_valid[i] != pred_valid[i]:
        # If the labels do not match, add the name to the failed list
        failed.append(names_valid[i])

In [41]:
random.choices(failed, k=10)

[('niran', '0'),
 ('nubian', '1'),
 ('geremia', '0'),
 ('hironi', '1'),
 ('arisma', '0'),
 ('hisaco', '1'),
 ('rabech', '1'),
 ('neudi', '0'),
 ('cheron', '1'),
 ('cirone', '0')]

In [42]:
len(failed)

792

In [43]:
print('Accuracy: ', accuracy_score(true_valid, pred_valid))
print(f'Classification Report:\n {classification_report(true_valid, pred_valid)}')
print(f'Confusion Matrix:\n {confusion_matrix(true_valid, pred_valid)}')

Accuracy:  0.9560512735142334
Classification Report:
               precision    recall  f1-score   support

           F       0.96      0.96      0.96      9855
           M       0.95      0.95      0.95      8166

    accuracy                           0.96     18021
   macro avg       0.96      0.96      0.96     18021
weighted avg       0.96      0.96      0.96     18021

Confusion Matrix:
 [[9442  413]
 [ 379 7787]]


# Long-Short Term Memory (LSTM) Networks: A Thorough Overview

Let's get into the fascinating world of LSTM Networks.

## Identifying the Dilemma: Dealing with Long-Term Dependencies

In our earlier cells, we examined how Recurrent Neural Networks (RNNs) utilize the concept of state retention over successive time periods to process sequential data. Nevertheless, RNNs often encounter challenges while dealing with long-term dependencies due to complex problems such as the 'vanishing gradient'.

### Unfolding the Vanishing Gradient Problem

The vanishing gradient problem is a troublesome circumstance where the gradients (derivatives) of the loss function, associated with the parameters (weights and biases) of the neural network, start reducing exponentially as the number of layers in the network increases. This issue becomes particularly prevalent in RNNs because calculating the gradient involves a method known as 'backpropagation through time'.

> Backpropagation is an algorithmic methodology used for training neural networks. It modifies the weights of the network in alignment with the direction of the loss function's gradient until this loss function has been minimized as much as possible.

The root cause behind the vanishing gradient problem rests within the repeated multiplication of gradients through the chain of network layers. As the backpropagation process occurs, the gradient, while being propagated backward through time, is continuously multiplied by the weight matrix from each successive layer. If these weights are smaller than 1, an exponential decrease can be observed in the gradient as it passes through the network, leading to a 'vanish' or near-zero gradient.

### Decoding the Exploding Gradient Phenomenon

On the other side of the spectrum from the vanishing gradient, we find the exploding gradient problem. This situation arises when the gradient of the loss function, relative to the network parameters, begins to increase exponentially with the growing number of layers. Just like its counterpart, this problem is quite common in RNNs, given that the gradient computation involves backpropagation through time.

The root cause of the exploding gradient problem mirrors its vanishing equivalent — it's also linked to the continuous multiplication of gradients across the network layers. Here, if the weight matrix is larger than 1, the gradient may inflate exponentially as it traverses back through the network during backpropagation, leading to an 'exploding' or excessively large gradient.

### Clarifying the Concept of Long-Term Dependencies

These challenging vanishing and exploding gradient issues present notable obstacles for RNNs aiming to capture long-term dependencies. Long-term dependencies are inter-relationships between elements in a sequence that are separated by substantial distances. Let's consider the following sentence:

> "O cachorro passou o dia brincando .......... estava cansado."

In this case, the word "estava" relies on the word "cachorro", even though they are significantly apart in the sentence. Unfortunately, due to the vanishing and exploding gradient problems, RNNs often fail to effectively capture these kinds of long-term dependencies.

## Introduction to Long-Short Term Memory (LSTM) Networks

This is where Long-Short Term Memory (LSTM) networks come into play. Being a special type of RNN, LSTM networks are adept at successfully capturing long-term dependencies. This unique capability can be traced back to [the introduction of LSTM networks in 1997 by Hochreiter and Schmidhuber](https://www.bioinf.jku.at/publications/older/2604.pdf). Since then, they have found widespread use across diverse areas such as speech recognition, language modeling, and machine translation.

### Discovering the LSTM Network Architecture

An LSTM network consists of a specialized 'cell' accompanied by three regulating 'gates'—an input gate, an output gate, and a forget gate. The role of the cell is to preserve the state of the network across time, while the gates control the flow of information in and out of this cell.

<p align="center">
<img src="images/lstm.jpeg" alt="" style="width: 50%; height: 50%"/>
</p>


#### Components of the LSTM Cell:

##### Input Data
These are the input data at time $( t )$, feeding into the LSTM cell.

##### Hidden State
This is the hidden state vector that carries information over time and is updated at each time step.

1. **Compute Hidden State**: <br>

The formula for computing the new hidden state is: $$ h_t = o_t \cdot \tanh(C_t) $$

2. **Components:**
- $ h_t $: new hidden state at time $ t $.
- $ o_t $: output vector at time $ t $.
- $ \tanh $: tanh function, which limits values between -1 and 1.
- $ C_t $: new cell state at time $ t $.

3. **Function:**
- The new hidden state is obtained by combining the output vector with the tanh of the new cell state, allowing the processed information to be passed to the next time step.


##### Cell State
The thick black line running along the top of the cell. It is the primary "memory" mechanism of the LSTM and can carry information over many time steps.

1. **Candidate Cell State**: <br>

The formula for the new candidate information is: $$ \tilde{C_t} = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) $$

2. **Components:**
- $ \tilde{C_t} $: new candidate information vector at time $ t $.
- $ \tanh $: tanh function, which limits values between -1 and 1.
- $ W_C $: weight matrix for the new candidate information.
- $ h_{t-1} $: hidden state of the cell at time $ t-1 $.
- $ x_t $: input at time $ t $.
- $ b_C $: bias vector for the new candidate information.

3. **Function:**
- This formula computes the new candidate information that might be stored in the cell state. The tanh function produces a vector of values between -1 and 1, representing the possible new information to be stored.

4. **Update Cell State**:<br>

The formula for updating the cell state is: $$ C_t = f_t \cdot C_{t-1} + i_t \cdot \tilde{C_t} $$

5. **Components:**
- $ C_t $: new cell state at time $ t $.
- $ f_t $: forget vector at time $ t $.
- $ C_{t-1} $: cell state at time $ t-1 $.
- $ i_t $: input vector at time $ t $.
- $ \tilde{C_t} $: new candidate information vector at time $ t $.

6. **Function:**
- The new cell state is a weighted combination of the previous cell state (modified by the forget gate) and the new information (modified by the input gate).

#### The Input Gate

The input gate regulates the inflow of information from the input layer to the cell. It consists of a sigmoid activation layer adjoined by a pointwise multiplication operation. The sigmoid function returns values between zero and one, which are multiplied with the network's input. If the sigmoid function outputs a zero, it blocks all incoming information. In contrast, an output of one allows all information to enter the cell. Thus, the input gate acts as a guard, protecting the cell from irrelevant or noise data.

The formula for the input gate is:
$$ i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) $$

**Components:**
- $ i_t $: input vector at time $ t $.
- $ \sigma $: sigmoid function, which limits values between 0 and 1.
- $ W_i $: weight matrix for the input gate.
- $ h_{t-1} $: hidden state of the cell at time $ t-1 $.
- $ x_t $: input at time $ t $.
- $ b_i $: bias vector for the input gate.

**Function:**
- This gate decides which new information will be stored in the cell state. The sigmoid function produces a vector of values between 0 and 1, where values close to 0 indicate ignoring and values close to 1 indicate storing the information.


#### The Output Gate

Like its input counterpart, the output gate controls the flow of information but from the cell to the hidden state, also comprising a sigmoid activation layer and a pointwise multiplication operation. The output gate uses the returned value from the sigmoid function (between zero and one) to regulate the amount of information flowing to the hidden state—zero halts all information flow, while one permits full information transfer. This mechanism enables the output gate to secure the hidden state from receiving any unnecessary or misleading information.

The formula for the output gate is:
$$ o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) $$

**Components:**
- $ o_t $: output vector at time $ t $.
- $ \sigma $: sigmoid function, which limits values between 0 and 1.
- $ W_o $: weight matrix for the output gate.
- $ h_{t-1} $: hidden state of the cell at time $ t-1 $.
- $ x_t $: input at time $ t $.
- $ b_o $: bias vector for the output gate.

**Function:**
- This gate decides which information from the cell state will be used to compute the next hidden state.


#### The Forget Gate

Finally, the forget gate manages how much information should be discarded from the cell. It comprises a sigmoid activation layer and pointwise multiplication operation. Here, the numbers generated by the sigmoid function, ranging from zero to one, decide how much information from the cell state should be kept, with zero forgetting all information and one maintaining all information. This forget gate acts as a protective barrier, helping the cell dispose of any outdated or irrelevant information that could potentially hamper the overall learning process.

The formula for the forget gate is: $$ f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) $$

**Components:**
- $ f_t $: forget vector at time $ t $.
- $ \sigma $: sigmoid function, which limits values between 0 and 1.
- $ W_f $: weight matrix for the forget gate.
- $ h_{t-1} $: hidden state of the cell at time $ t-1 $.
- $ x_t $: input at time $ t $.
- $ b_f $: bias vector for the forget gate.

**Function:**
- This gate decides which information from the previous cell state ($ C_{t-1} $) should be forgotten. The sigmoid function produces a vector of values between 0 and 1, where values close to 0 indicate forgetting and values close to 1 indicate keeping the information.

In [44]:
class NameLSTM(nn.Module):
    def __init__(self, hidden_size, embedding_dim, vocab_size, pad_idx):
        super().__init__()
        
        # Define an embedding layer to convert our letters to vectors
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        
        # Define an LSTM layer to encode our sequence of letters
        self.encoder = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_size, num_layers=2, dropout=0.3, bidirectional=False)

        # Define a dropout layer to prevent overfitting
        self.dropout = nn.Dropout(0.3)

        # Define a classifier layer to output our final prediction
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size, hidden_size // 2), # input size is the hidden size
            self.dropout,
            nn.ReLU(),
            nn.Linear(hidden_size // 2, 2) # output size is 2 because we're predicting binary gender
        )

    def forward(self, seq, name_len):
        # Convert our sequence of letters to a sequence of vectors using the embedding layer
        embedded = self.embedding(seq)
        embedded = self.dropout(embedded)
        
        # Encode the embedded sequence using the LSTM layer
        packed_output, (hidden, cell) = self.encoder(embedded)
        
        # Pass the final hidden state through the classifier layer to get our prediction
        preds = self.classifier(hidden[-1])
        
        return preds

In [45]:
# Define the filename for saving the best LSTM model checkpoint
checkpoint_fname = checkpoint_path / 'bestLSTM.pt'

# Create an instance of the NameLSTM model with specified hyperparameters
model_lstm = NameLSTM(
    hidden_size=50,            # Hidden state size of the LSTM
    embedding_dim=25,          # Dimension of the embedding vectors
    vocab_size=len(NAME.vocab),# Vocabulary size based on the 'NAME' field
    pad_idx=pad_idx            # Padding index for the embedding layer
)

# Define the optimizer to use for training the model
# Here, we use the Adam optimizer with a learning rate of 3e-4
optimizer = optim.Adam(model_lstm.parameters(), lr=3e-4)

# Define the loss function to use for training the model
# Here, we use the CrossEntropyLoss, which is suitable for classification tasks
criterion = nn.CrossEntropyLoss()

# Train the model for 30 epochs
# Arguments:
# - epochs: Number of epochs to train the model
# - model: The model to train (model_lstm)
# - optimizer: The optimizer to use for training (Adam optimizer)
# - criterion: The loss function to use for training (CrossEntropyLoss)
# - train_iterator: The iterator for the training dataset
# - valid_iterator: The iterator for the validation dataset
# - device: The device to use for computation (e.g., 'cpu' or 'cuda')
# - checkpoint_fname: The filename for saving the best model checkpoint
train(30, model_lstm, optimizer, criterion, train_iter, valid_iter, device, checkpoint_fname)

The model has 37,827 trainable parameters


Epoch 1 - Training Loss: 2.0513 - Valid Loss: 1.3478 - New Best
Epoch 2 - Training Loss: 1.4648 - Valid Loss: 1.2067 - New Best
Epoch 3 - Training Loss: 1.3246 - Valid Loss: 1.0500 - New Best
Epoch 4 - Training Loss: 1.2106 - Valid Loss: 1.0675
Epoch 5 - Training Loss: 1.1522 - Valid Loss: 0.9960 - New Best
Epoch 6 - Training Loss: 1.0795 - Valid Loss: 0.9147 - New Best
Epoch 7 - Training Loss: 1.0326 - Valid Loss: 0.8470 - New Best
Epoch 8 - Training Loss: 0.9871 - Valid Loss: 0.8467 - New Best
Epoch 9 - Training Loss: 0.9631 - Valid Loss: 0.8202 - New Best
Epoch 10 - Training Loss: 0.9260 - Valid Loss: 0.7860 - New Best
Epoch 11 - Training Loss: 0.8951 - Valid Loss: 0.7780 - New Best
Epoch 12 - Training Loss: 0.8781 - Valid Loss: 0.7749 - New Best
Epoch 13 - Training Loss: 0.8523 - Valid Loss: 0.7712 - New Best
Epoch 14 - Training Loss: 0.8340 - Valid Loss: 0.7270 - New Best
Epoch 15 - Training Loss: 0.7998 - Valid Loss: 0.7192 - New Best
Epoch 16 - Training Loss: 0.7958 - Valid Loss

In [46]:
# Create a new instance of the NameLSTM model with the same hyperparameters as the trained model
model_lstm_inference = NameLSTM(
    hidden_size=50,            # Hidden state size of the LSTM
    embedding_dim=25,          # Dimension of the embedding vectors
    vocab_size=len(NAME.vocab),# Vocabulary size based on the 'NAME' field
    pad_idx=pad_idx            # Padding index for the embedding layer
)

# Load the trained model's state dictionary into the new model instance
# This will set the model parameters to the best found during training
model_lstm_inference.load_state_dict(torch.load(checkpoint_fname))

# Set the model to evaluation mode
# This disables dropout and other training-specific behaviors
model_lstm_inference.eval()

# Send the model to the CPU for inference
# This ensures that the model runs on the CPU, which is typically used for inference
model_lstm_inference = model_lstm_inference.to('cpu')

  model_lstm_inference.load_state_dict(torch.load(checkpoint_fname))


In [47]:
label_mapping = {'1': 'F', '0': 'M'}  # Map label indices to gender labels ('1' -> 'F', '0' -> 'M')

# Create a list of true gender labels for the validation dataset
# Iterate over the names in the validation dataset and map the label index to the gender label using label_mapping
true_valid = [label_mapping[i[1]] for i in names_valid]

# Create a list of predicted gender labels for the validation dataset
# Iterate over the names in the validation dataset and use the predict function to get the predicted gender label
# The predict function takes the name, the trained model, and the device ('cpu') as arguments
pred_valid = [predict(i[0], model_lstm_inference)[0] for i in names_valid]

In [48]:
print('Accuracy: ', accuracy_score(true_valid, pred_valid))
print(f'Classification Report:\n {classification_report(true_valid, pred_valid)}')
print(f'Confusion Matrix:\n {confusion_matrix(true_valid, pred_valid)}')

Accuracy:  0.9669829643194051
Classification Report:
               precision    recall  f1-score   support

           F       0.97      0.97      0.97      9855
           M       0.96      0.97      0.96      8166

    accuracy                           0.97     18021
   macro avg       0.97      0.97      0.97     18021
weighted avg       0.97      0.97      0.97     18021

Confusion Matrix:
 [[9542  313]
 [ 282 7884]]


## Bidirectional Networks

In previous sections, we established how LSTM networks effectively capture long-term dependencies. Nonetheless, traditional LSTM networks possess one built-in limitation - they are unidirectional, meaning they only consider past information while making predictions about the future. This quality is valuable in numerous instances, but may not suffice in certain scenarios.

Consider the sentence in Portuguese:

> "O cachorro passou o dia brincando .......... estava cansado."

In this context, the word "estava" (was) depends on both the preceding word "cachorro" (dog), as well as the subsequent word "cansado" (tired). Given that a standard LSTM network only takes past information into account, it would struggle to capture the long-term dependency between "estava" and "cansado".

To circumvent this issue, we introduce a specialized type of architecture called a bidirectional LSTM network.

### Understanding Bidirectional LSTM Networks

A bidirectional LSTM network comprises two separate LSTM layers, each processing data sequences in opposing directions—one from the past towards the future, and the other from the future backwards to the past. The outputs from these two LSTM layers are then amalgamated to generate the final output.

<p align="center">
<img src="images/bidirectional_lstm.webp" alt="" style="width: 50%; height: 50%"/>
</p>

While we have been focusing on LSTM networks, it's important to clarify that bidirectionality is not exclusive to LSTMs. In fact, all types of recurrent networks can be configured to work in a bidirectional manner.

### Comprehending the Dual Processing Mechanism

A bidirectional network can be visualized as two distinct networks operating concurrently, each processing sequences in different directions.

One part of the network processes sequences from the past to the future, thus capturing the past dependencies just like a regular LSTM. This 'forward' layer scans the input sequence in the natural order, from the first element to the last, unearthing any dependencies that look ahead.

The other part, a 'backward' layer, processes sequences from future to past, enabling it to take future information into account. In essence, this layer reads the input sequence backwards, starting from the last element and moving to the first one, capturing dependencies that look backward.

Upon completion of processing, the outputs from these two parts of the network are combined—often concatenated or added—to produce the final output. This dual nature imparts bidirectional networks with their unique ability to perceive patterns considering both past and future contexts, rendering them particularly useful in tasks such as language translation, text generation, speech recognition, where understanding the full context is crucial.

In [49]:
class NameLSTMBidir(nn.Module):
    def __init__(self, hidden_size, embedding_dim, vocab_size, pad_idx):
        super().__init__()
        
        # Define an embedding layer to convert our letters to vectors
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        
        # Define a bidirectional LSTM layer to encode our sequence of letters
        self.encoder = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_size, num_layers=2, dropout=0.3, bidirectional=True)

        # Define a dropout layer to prevent overfitting
        self.dropout = nn.Dropout(0.3)

        # Define a classifier layer to output our final prediction
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size * 2, hidden_size // 2), # input size is the hidden size times 2 because of bidirectionality
            self.dropout,
            nn.ReLU(),
            nn.Linear(hidden_size // 2, 2) # output size is 2 because we're predicting binary gender
        )

    def forward(self, seq, name_len):
        # Convert our sequence of letters to a sequence of vectors using the embedding layer
        embedded = self.embedding(seq)
        embedded = self.dropout(embedded)
        
        # Encode the embedded sequence using the bidirectional LSTM layer
        packed_output, (hidden, cell) = self.encoder(embedded)
        
        # Concatenate the final hidden states from both directions
        hidden = torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1)
        
        # Pass the concatenated hidden state through the classifier layer to get our prediction
        preds = self.classifier(hidden)
        
        return preds

In [50]:
# Define the filename for saving the best bidirectional LSTM model checkpoint
checkpoint_fname = checkpoint_path / 'bestLSTMbidir.pt'

# Create an instance of the NameLSTMBidir model with specified hyperparameters
model_lstm_bidir = NameLSTMBidir(
	hidden_size=50,            # Hidden state size of the LSTM
	embedding_dim=25,          # Dimension of the embedding vectors
	vocab_size=len(NAME.vocab),# Vocabulary size based on the 'NAME' field
	pad_idx=pad_idx            # Padding index for the embedding layer
)

# Define the optimizer to use for training the model
# Here, we use the Adam optimizer with a learning rate of 3e-4
optimizer = optim.Adam(model_lstm_bidir.parameters(), lr=3e-4)

# Define the loss function to use for training the model
# Here, we use the CrossEntropyLoss, which is suitable for classification tasks
criterion = nn.CrossEntropyLoss()

# Train the model for 30 epochs
# Arguments:
# - epochs: Number of epochs to train the model
# - model: The model to train (model_lstm_bidir)
# - optimizer: The optimizer to use for training (Adam optimizer)
# - criterion: The loss function to use for training (CrossEntropyLoss)
# - train_iterator: The iterator for the training dataset
# - valid_iterator: The iterator for the validation dataset
# - device: The device to use for computation (e.g., 'cpu' or 'cuda')
# - checkpoint_fname: The filename for saving the best model checkpoint
train(30, model_lstm_bidir, optimizer, criterion, train_iter, valid_iter, device, checkpoint_fname)

The model has 94,877 trainable parameters
Epoch 1 - Training Loss: 1.8746 - Valid Loss: 1.3111 - New Best
Epoch 2 - Training Loss: 1.3256 - Valid Loss: 1.0331 - New Best
Epoch 3 - Training Loss: 1.1756 - Valid Loss: 0.9460 - New Best
Epoch 4 - Training Loss: 1.0809 - Valid Loss: 0.9016 - New Best
Epoch 5 - Training Loss: 1.0114 - Valid Loss: 0.8668 - New Best
Epoch 6 - Training Loss: 0.9444 - Valid Loss: 0.8015 - New Best
Epoch 7 - Training Loss: 0.8902 - Valid Loss: 0.8120
Epoch 8 - Training Loss: 0.8422 - Valid Loss: 0.7544 - New Best
Epoch 9 - Training Loss: 0.8007 - Valid Loss: 0.7071 - New Best
Epoch 10 - Training Loss: 0.7699 - Valid Loss: 0.7160
Epoch 11 - Training Loss: 0.7373 - Valid Loss: 0.6608 - New Best
Epoch 12 - Training Loss: 0.7142 - Valid Loss: 0.6514 - New Best
Epoch 13 - Training Loss: 0.6893 - Valid Loss: 0.6241 - New Best
Epoch 14 - Training Loss: 0.6655 - Valid Loss: 0.6308
Epoch 15 - Training Loss: 0.6431 - Valid Loss: 0.6001 - New Best
Epoch 16 - Training Loss:

In [51]:
# Create a new instance of the NameLSTMBidir model with the same hyperparameters as the trained model
model_lstm_bidir_inference = NameLSTMBidir(
    hidden_size=50,            # Hidden state size of the LSTM
    embedding_dim=25,          # Dimension of the embedding vectors
    vocab_size=len(NAME.vocab),# Vocabulary size based on the 'NAME' field
    pad_idx=pad_idx            # Padding index for the embedding layer
)

# Load the trained model's state dictionary into the new model instance
# This will set the model parameters to the best found during training
model_lstm_bidir_inference.load_state_dict(torch.load(checkpoint_fname))

# Set the model to evaluation mode
# This disables dropout and other training-specific behaviors
model_lstm_bidir_inference.eval()

# Send the model to the CPU for inference
# This ensures that the model runs on the CPU, which is typically used for inference
model_lstm_bidir_inference = model_lstm_bidir_inference.to('cpu')

  model_lstm_bidir_inference.load_state_dict(torch.load(checkpoint_fname))


In [52]:
label_mapping = {'1': 'F', '0': 'M'}  # Map label indices to gender labels ('1' -> 'F', '0' -> 'M')

# Create a list of true gender labels for the validation dataset
# Iterate over the names in the validation dataset and map the label index to the gender label using label_mapping
true_valid = [label_mapping[i[1]] for i in names_valid]

# Create a list of predicted gender labels for the validation dataset
# Iterate over the names in the validation dataset and use the predict function to get the predicted gender label
# The predict function takes the name, the trained bidirectional LSTM model, and the device ('cpu') as arguments
pred_valid = [predict(i[0], model_lstm_bidir_inference)[0] for i in names_valid]

In [53]:
print('Accuracy: ', accuracy_score(true_valid, pred_valid))
print(f'Classification Report:\n {classification_report(true_valid, pred_valid)}')
print(f'Confusion Matrix:\n {confusion_matrix(true_valid, pred_valid)}')

Accuracy:  0.974862660229732
Classification Report:
               precision    recall  f1-score   support

           F       0.97      0.98      0.98      9855
           M       0.98      0.97      0.97      8166

    accuracy                           0.97     18021
   macro avg       0.97      0.97      0.97     18021
weighted avg       0.97      0.97      0.97     18021

Confusion Matrix:
 [[9662  193]
 [ 260 7906]]


# Gated Recurrent Unit (GRU) Networks

Just as we have explored LSTM networks, another prominent variation of RNNs that has seen widespread adoption due to its efficient handling of long-term dependencies is the Gated Recurrent Unit (GRU).

While LSTMs effectively address the limitations of traditional RNNs by tackling the vanishing and exploding gradient problems, they come with a complex architecture that can be computationally challenging to deal with. This complexity primarily stems from the need to calculate and store three different kinds of gates (input, output, forget) at every time step.

To reconcile this, GRUs were introduced in 2014 by Kyunghyun Cho et al. in their paper ["Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation"](https://arxiv.org/abs/1406.1078).

The GRU architecture simplifies LSTM by merging the cell state and hidden state into a single entity known as the hidden state and employing only two types of gating units: update and reset gate. This reduction in complexity results in fewer calculations and memory requirements while maintaining competitive performance.

<p align="center">
<img src="images/gru.png" alt="" style="width: 50%; height: 50%"/>
</p>

### The Hidden State

In GRUs, there is no explicit cell state like in LSTMs. Instead, the hidden state alone stores the network's historical information from previous time steps. This economical design allows for efficient memory management and computational savings.

### The Update Gate

The update gate essentially governs how much of the past information should be kept or 'updated'. It takes the current input and previous hidden state, applies some transformations, and generates an output between zero and one. If the update gate outputs zero, the GRU disregards all past information and writes entirely new information. In contrast, if the update gate outputs one, it retains all past information.

### The Reset Gate

The reset gate is in charge of deciding how much past information is discarded before the new input is processed. Like the update gate, the reset gate accepts the current input and previous hidden state, applies some transformations, and produces an output between zero and one. A value close to zero means 'forget a significant portion of information', while a value close to one indicates 'keep most of the information'.


> GRUs have been found remarkably effective for numerous tasks such as machine translation, text generation, and speech recognition. They retain the strengths of LSTM networks in handling long-term dependencies while offering a more efficient, less computationally-intensive network architecture.
>
> However, the choice between LSTMs and GRUs largely depends on the specific task at hand. Some empirical studies suggest that while GRUs train faster and perform comparably to LSTMs on simpler tasks, LSTMs often have the edge on tasks that require more complex learning or longer sequence modeling.

In [54]:
class NameGRUBidir(nn.Module):
    def __init__(self, hidden_size, embedding_dim, vocab_size, pad_idx):
        super().__init__()
        
        # Define an embedding layer to convert our letters to vectors
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        
        # Define a bidirectional GRU layer to encode our sequence of letters
        self.encoder = nn.GRU(input_size=embedding_dim, hidden_size=hidden_size, num_layers=2, dropout=0.3, bidirectional=True)

        # Define a dropout layer to prevent overfitting
        self.dropout = nn.Dropout(0.3)

        # Define a classifier layer to output our final prediction
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size * 2, hidden_size // 2), # input size is the hidden size times 2 because of bidirectionality
            self.dropout,
            nn.ReLU(),
            nn.Linear(hidden_size // 2, 2) # output size is 2 because we're predicting binary gender
        )

    def forward(self, seq, name_len):
        # Convert our sequence of letters to a sequence of vectors using the embedding layer
        embedded = self.embedding(seq)
        embedded = self.dropout(embedded)
        
        # Encode the embedded sequence using the bidirectional GRU layer
        packed_output, hidden = self.encoder(embedded)
        
        # Concatenate the final hidden states from both directions
        hidden = torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1)
        
        # Pass the concatenated hidden state through the classifier layer to get our prediction
        preds = self.classifier(hidden)
        
        return preds

In [55]:
# Define the filename for saving the best bidirectional GRU model checkpoint
checkpoint_fname = checkpoint_path / 'bestGRUBidir.pt'

# Create an instance of the NameGRUBidir model with specified hyperparameters
model_gru_bidir = NameGRUBidir(
    hidden_size=50,            # Hidden state size of the GRU
    embedding_dim=25,          # Dimension of the embedding vectors
    vocab_size=len(NAME.vocab),# Vocabulary size based on the 'NAME' field
    pad_idx=pad_idx            # Padding index for the embedding layer
)

# Define the optimizer to use for training the model
# Here, we use the Adam optimizer with a learning rate of 3e-4
optimizer = optim.Adam(model_gru_bidir.parameters(), lr=3e-4)

# Define the loss function to use for training the model
# Here, we use the CrossEntropyLoss, which is suitable for classification tasks
criterion = nn.CrossEntropyLoss()

# Train the model for 30 epochs
# Arguments:
# - epochs: Number of epochs to train the model
# - model: The model to train (model_gru_bidir)
# - optimizer: The optimizer to use for training (Adam optimizer)
# - criterion: The loss function to use for training (CrossEntropyLoss)
# - train_iterator: The iterator for the training dataset
# - valid_iterator: The iterator for the validation dataset
# - device: The device to use for computation (e.g., 'cpu' or 'cuda')
# - checkpoint_fname: The filename for saving the best model checkpoint
train(30, model_gru_bidir, optimizer, criterion, train_iter, valid_iter, device, checkpoint_fname)

The model has 71,977 trainable parameters
Epoch 1 - Training Loss: 1.9348 - Valid Loss: 1.3102 - New Best
Epoch 2 - Training Loss: 1.3564 - Valid Loss: 1.0898 - New Best
Epoch 3 - Training Loss: 1.1990 - Valid Loss: 0.9930 - New Best
Epoch 4 - Training Loss: 1.0809 - Valid Loss: 0.9091 - New Best
Epoch 5 - Training Loss: 1.0194 - Valid Loss: 0.8808 - New Best
Epoch 6 - Training Loss: 0.9599 - Valid Loss: 0.8324 - New Best
Epoch 7 - Training Loss: 0.8946 - Valid Loss: 0.7953 - New Best
Epoch 8 - Training Loss: 0.8572 - Valid Loss: 0.7326 - New Best
Epoch 9 - Training Loss: 0.8118 - Valid Loss: 0.7091 - New Best
Epoch 10 - Training Loss: 0.7896 - Valid Loss: 0.7419
Epoch 11 - Training Loss: 0.7469 - Valid Loss: 0.6820 - New Best
Epoch 12 - Training Loss: 0.7172 - Valid Loss: 0.6809 - New Best
Epoch 13 - Training Loss: 0.6912 - Valid Loss: 0.6330 - New Best
Epoch 14 - Training Loss: 0.6659 - Valid Loss: 0.6246 - New Best
Epoch 15 - Training Loss: 0.6594 - Valid Loss: 0.6065 - New Best
Epo

In [56]:
# Create a new instance of the NameGRUBidir model with the same hyperparameters as the trained model
# This ensures that the model architecture matches the one used during training
model_gru_bidir_inference = NameGRUBidir(
    hidden_size=50,            # Hidden state size of the GRU
    embedding_dim=25,          # Dimension of the embedding vectors
    vocab_size=len(NAME.vocab),# Vocabulary size based on the 'NAME' field
    pad_idx=pad_idx            # Padding index for the embedding layer
)

# Load the trained model's state dictionary into the new model instance
# This will set the model parameters to the best found during training
model_gru_bidir_inference.load_state_dict(torch.load(checkpoint_fname))

# Set the model to evaluation mode
# This disables dropout and other training-specific behaviors
model_gru_bidir_inference.eval()

# Send the model to the CPU for inference
# This ensures that the model runs on the CPU, which is typically used for inference
model_gru_bidir_inference = model_gru_bidir_inference.to('cpu')

  model_gru_bidir_inference.load_state_dict(torch.load(checkpoint_fname))


In [57]:
label_mapping = {'1': 'F', '0': 'M'}  # Map label indices to gender labels ('1' -> 'F', '0' -> 'M')

# Create a list of true gender labels for the validation dataset
# Iterate over the names in the validation dataset and map the label index to the gender label using label_mapping
true_valid = [label_mapping[i[1]] for i in names_valid]

# Create a list of predicted gender labels for the validation dataset
# Iterate over the names in the validation dataset and use the predict function to get the predicted gender label
# The predict function takes the name, the trained bidirectional GRU model, and the device ('cpu') as arguments
pred_valid = [predict(i[0], model_gru_bidir_inference)[0] for i in names_valid]

In [58]:
print('Accuracy: ', accuracy_score(true_valid, pred_valid))
print(f'Classification Report:\n {classification_report(true_valid, pred_valid)}')
print(f'Confusion Matrix:\n {confusion_matrix(true_valid, pred_valid)}')

Accuracy:  0.9731979357416347
Classification Report:
               precision    recall  f1-score   support

           F       0.97      0.98      0.98      9855
           M       0.98      0.96      0.97      8166

    accuracy                           0.97     18021
   macro avg       0.97      0.97      0.97     18021
weighted avg       0.97      0.97      0.97     18021

Confusion Matrix:
 [[9691  164]
 [ 319 7847]]


# Model Comparison

| Architecture | Accuracy | Number of Parameters |
|-------------------------|----------|----------------------|
| Vanilla RNN | 95.57% | 10,977 |
| LSTM | 96.61% | 37,827 |
| Bidirectional LSTM | 97.37% | 94,877 |
| Bidirectional GRU | 97.30% | 71,977 |

## Analysis

- **Accuracy**:
- The **Bidirectional LSTM** model achieves the highest accuracy at 97.37%, followed closely by the **Bidirectional GRU** at 97.28%.
- The **LSTM** model outperforms the **Vanilla RNN**, with an accuracy of 96.61% compared to 95.57%.

- **Number of Parameters**:
- The **Vanilla RNN** has the fewest parameters (10,977), making it the most lightweight model.
- The **Bidirectional LSTM** has the highest number of parameters (94,877), indicating a more complex and potentially more powerful model.
- The **Bidirectional GRU** has fewer parameters (71,977) than the **Bidirectional LSTM**, but more than the **LSTM** (37,827) and **Vanilla RNN**.

## Considerations

- The **Bidirectional LSTM** provides the best accuracy but at the cost of having the highest number of parameters.
- The **Bidirectional GRU** offers a slightly lower accuracy than the Bidirectional LSTM but with fewer parameters, which might be a good trade-off depending on the application.
- The **LSTM** improves accuracy significantly over the **Vanilla RNN** without an excessive increase in parameters.
- The **Vanilla RNN** is the simplest model with the fewest parameters but also the lowest accuracy.

# Wrapping Up: Unleashing the Power of RNNs, LSTMs, and GRUs in NLP

We have now covered an extensive journey, exploring the foundations and insights into the world of Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, and Gated Recurrent Units (GRU). We've examined the strengths and limitations of RNNs and learned how LSTMs and GRUs have significantly improved upon these foundations to better handle long-term dependencies.

Understanding these three forms of neural networks is crucial for anyone interested in Natural Language Processing (NLP). In NLP tasks, we often deal with sequenced data where understanding temporal dependencies is key, be it understanding the sentiment behind a customer review or translating a passage from one language to another. RNNs, LSTMs, and GRUs offer us powerful tools to make sense of this complex, sequential data.

However, remember that the choice of architecture should be dictated by the specifics of your task. While RNNs can suffice for simpler, short-term dependencies, LSTMs and GRUs are better suited for more complex tasks or longer sequences. But no model can serve as a silver bullet. Experimenting, iterating, and continuous learning are part of the process.

Next class we will start to work with a new type of neural network: Transformers.

# Questions

1. What are the main types of recurrent neural networks discussed in this class?

2. What is the vanishing gradient problem and how do LSTMs address it?

3. How does a bidirectional LSTM network process sequences differently from a standard LSTM?

4. What are the key components of an LSTM cell and what are their functions?

5. How does a GRU simplify the architecture of an LSTM?

6. What was the case study used to demonstrate the application of these different network architectures?

7. Which model achieved the highest accuracy on the name gender classification task?

8. How many trainable parameters did the bidirectional LSTM model have compared to the vanilla RNN?

9. What are some of the tradeoffs to consider when choosing between RNNs, LSTMs and GRUs?

10. Why are recurrent networks particularly useful for natural language processing tasks?

`Answers are commented inside this cell`

<!-- 1. RNNs are the simplest, processing sequences with hidden states. LSTMs introduce a memory cell and gates to better handle long-term dependencies. GRUs simplify LSTMs by combining hidden and cell states and using only update and reset gates.

2. RNNs process sequential data by maintaining a hidden state that captures information from previous time steps. They face challenges with vanishing and exploding gradients when dealing with long sequences.

3. Vanishing gradients occur when gradients become extremely small during backpropagation, preventing effective learning. Exploding gradients happen when gradients become very large, leading to unstable training.

4. LSTMs have a memory cell that maintains long-term information, and gates (input, output, forget) that regulate information flow, allowing them to capture long-term dependencies effectively.

5. The input gate controls what new information is added to the cell state. The forget gate determines what information to discard from the cell state. The output gate decides what information from the cell state is used to compute the hidden state.

6. Bidirectional LSTMs process sequences in both forward and backward directions, capturing both past and future contexts. This allows them to have a more complete understanding of the sequence.

7. GRUs combine the hidden and cell states into a single hidden state and use only two gates (update and reset), making them computationally more efficient than LSTMs while still handling long-term dependencies effectively.

8. The dataset used was from the 2010 Brazilian Census, containing 90,104 names (49,274 female and 40,830 male). Names that could be associated with both genders were excluded. The data was split into 80% training and 20% validation sets.

9. Bidirectional networks process sequences in both forward and backward directions, allowing them to capture both past and future contexts. This additional context helps improve the performance of sequence processing tasks.

10. The Bidirectional LSTM achieved the highest accuracy (97.37%), followed by the Bidirectional GRU (97.28%), LSTM (96.61%), and Vanilla RNN (95.57%). The Bidirectional LSTM had the most parameters, while the Vanilla RNN had the fewest. -->