<a href="https://colab.research.google.com/github/chineidu/NLP-Tutorial/blob/main/notebook/01a_text_processing/05-RNNs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RNNs (Recurrent Neural Networks)

## Basic RNN
## Long Short-Term Memory (LSTM)

In [1]:
!pip -q install torch==1.13.0 torchtext==0.14.0 \
  torchdata==0.5.0 portalocker==2.5.0

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m890.1/890.1 MB[0m [31m1.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.0/2.0 MB[0m [31m71.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m4.5/4.5 MB[0m [31m36.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m317.1/317.1 MB[0m [31m4.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m21.0/21.0 MB[0m [31m23.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m849.3/849.3 kB[0m [31m41.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m557.1/557.1 MB[0m [31m3.7 MB/s[0m eta [36m0:00:00[0m
[?25h[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following de

In [2]:
# %load_ext watermark
# %watermark -v -p numpy,pandas,polars,mlxtend,omegaconf --conda

In [3]:
# Built-in library
from pathlib import Path
import re
import json
from typing import Any, Literal, Optional, TypedDict, Union
import logging
import warnings

# Standard imports
import numpy as np
import numpy.typing as npt
from pprint import pprint
import pandas as pd
import polars as pl

# Visualization
import matplotlib.pyplot as plt

# NumPy settings
np.set_printoptions(precision=4)

# Pandas settings
pd.options.display.max_rows = 1_000
pd.options.display.max_columns = 1_000
pd.options.display.max_colwidth = 600

# Polars settings
pl.Config.set_fmt_str_lengths(1_000)
pl.Config.set_tbl_cols(n=1_000)
pl.Config.set_tbl_rows(n=200)

warnings.filterwarnings("ignore")

# # Black code formatter (Optional)
# %load_ext lab_black

# # auto reload imports
# %load_ext autoreload
# %autoreload 2

## Sequential Data

- `Sequential data` refers to data that is ordered in a particular sequence or time series.
- Each data point in a sequential dataset is `dependent` on the `previous data points`, and the `order` of the data is crucial for understanding the information it contains.
- Time series data is sequential data where the order of examples is determined by time. Examples include stock prices and audio recordings.
- Not all sequential data is time series. e.g. Text and DNA sequences are ordered but not time-based.

<hr>

## Recurrent Neural Networks (RNNs)

- `RNN` is a type of neural network designed to process sequential data.
- Unlike feedforward neural networks that process data in a single pass, RNNs can process data across multiple time steps, making them ideal for tasks involving sequences like text, speech, and time series data.

### Key characteristics of RNNs

- **Sequential Processing**: RNNs process data sequentially, allowing them to consider the context of previous inputs when making predictions.
- **Internal Memory**: RNNs have an internal memory, often represented as a hidden state, that stores information about past inputs. This memory allows the network to maintain context and make predictions based on the entire sequence.
- **Feedback Loop**: RNNs have a feedback loop that connects the output of a layer back to its input, creating a cyclic structure. This allows the network to learn long-term dependencies and capture patterns in sequential data.

[![image.png](https://i.postimg.cc/3wSPP3Qn/image.png)](https://postimg.cc/cK393yHn)

<br>

[![image.png](https://i.postimg.cc/dtPzMWgP/image.png)](https://postimg.cc/Dm6CLc2B)

### Common Types of Sequencing Tasks

- **Many-to-one**: Input is a sequence, output is a fixed-size vector or scalar. Example: Sentiment analysis.
- **One-to-many**: Input is standard, output is a sequence. e.g. in image captioning, a single image is given as input to a model, which then produces a sequence of words to describe the image (image caption).
- **Many-to-many**: Both input and output are sequences. Can be further divided based on alignment.

### Hidden Recurrence vs Output Recurrence

- **Hidden recurrence**: The recurrent connection is from the hidden layer to itself.
- **Output recurrence**: The recurrent connection is from the output layer to either the hidden layer or itself.

#### Two Types of Output Recurrence

- **Output-to-hidden**: Connection from output layer at previous time step to hidden layer at current time step.
- **Output-to-output**: Connection from output layer at previous time step to output layer at current time step.

### Load Data

#### Dependencies

```sh
pip install torchtext
pip install torchdata
pip install portalocker
```

In [4]:
import torch
from torch import nn
from torch.utils.data.dataset import random_split

# from torchtext.datasets import IMDB

from torch.utils.data.dataset import Subset


if not torch.cuda.is_available():
    print("[Warning]: This code may be very slow on CPU")

In [5]:
def tokenizer(text: str) -> list[str]:
    """
    Tokenize the input text by removing HTML tags, extracting emoticons,
    and splitting the text into individual tokens.

    Parameters
    ----------
    text : str
        The input text to be tokenized.

    Returns
    -------
    list[str]
        A list of tokens extracted from the input text.
    """
    text = re.sub("<[^>]*>", "", text)
    emoticons: list[str] = re.findall("(?::|;|=)(?:-)?(?:\)|\(|D|P)", text.lower())
    text = re.sub("[\W]+", " ", text.lower()) + " ".join(emoticons).replace("-", "")
    tokenized: list[str] = text.split()
    return tokenized

In [None]:
# # Load data
# train_dataset: list[tuple[int, str]] = list(IMDB(split="train"))
# test_dataset: list[tuple[int, str]] = list(IMDB(split="test"))


# print(f"Train dataset length: {len(train_dataset):,}")
# print(f"Test dataset length: {len(test_dataset):,}")

### Save Data To Disk

- I'm unable to use `torchtext` on colab so I have to save the data locally.

In [None]:
# # Convert the datasets to a JSON-serializable format
# train_dataset_json: list[dict[str, Any]] = [
#     {"label": label, "text": text} for label, text in train_dataset
# ]
# test_dataset_json: list[dict[str, Any]] = [
#     {"label": label, "text": text} for label, text in test_dataset
# ]

# # Save the datasets to disk
# sp: str = "../../data/IMDB"

# with open(f"{sp}/train_dataset.json", "w") as train_file:
#     json.dump(train_dataset_json, train_file)

# with open(f"{sp}/test_dataset.json", "w") as test_file:
#     json.dump(test_dataset_json, test_file)

In [6]:
from google.colab import drive

drive.mount("/content/drive")

Mounted at /content/drive


In [7]:
# Load the datasets from disk
sp: str = "/content/drive/MyDrive/My doc/Deep Learning/Data/IMDB"

with open(f"{sp}/train_dataset.json", "r") as train_file:
    train_dataset_json = json.load(train_file)

with open(f"{sp}/test_dataset.json", "r") as test_file:
    test_dataset_json = json.load(test_file)

# Convert the JSON-formatted datasets back to the original format
train_dataset: list[tuple[int, str]] = [
    (item["label"], item["text"]) for item in train_dataset_json
]
test_dataset: list[tuple[int, str]] = [
    (item["label"], item["text"]) for item in test_dataset_json
]

# Verify the loaded datasets
print(f"Train dataset length: {len(train_dataset):,}")
print(f"Test dataset length: {len(test_dataset):,}")

Train dataset length: 25,000
Test dataset length: 25,000


In [8]:
train_dataset[:2]

[(1,
  'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far betw

In [9]:
# Split the dataset into train and validation
torch.manual_seed(123)

train_dataset: Subset
valid_dataset: Subset


train_dataset, valid_dataset = random_split(dataset=train_dataset, lengths=[0.85, 0.15])
# train_dataset, valid_dataset = random_split(
#     dataset=train_dataset, lengths=[20000, 5000]
# )

In [10]:
from collections import Counter, OrderedDict


device: torch.device | str = torch.device(
    "cuda" if torch.cuda.is_available() else "cpu"
)


# Find unique tokens (words)
token_counts: Counter = Counter()


for label, line in train_dataset:
    tokens = tokenizer(line)
    token_counts.update(tokens)


print(f"Vocab-size: {len(token_counts):,}")

Vocab-size: 71,011


In [11]:
from itertools import islice


dict(islice(token_counts.items(), 0, 3))

{'seeing': 1801, 'as': 39774, 'i': 74206}

In [12]:
from torchtext.vocab import vocab
from torchtext import __version__ as torchtext_version
from pkg_resources import parse_version


def create_vocabulary(token_counts: dict[str, int]) -> vocab:
    """
    Create a vocabulary from token counts.

    Parameters
    ----------
    token_counts : dict[str, int]
        A dictionary mapping tokens to their frequency counts.

    Returns
    -------
    vocab
        A vocabulary object with special tokens and default index set.

    Notes
    -----
    The vocabulary is created by sorting tokens by frequency in descending order,
    inserting special tokens '<pad>' and '<unk>', and setting a default index for
    out-of-vocabulary tokens.
    """
    # Sort tokens by frequency in descending order
    sorted_tokens: list[tuple[str, int]] = sorted(
        token_counts.items(), key=lambda x: x[1], reverse=True
    )

    # Create vocab object
    vocabulary: vocab = vocab(OrderedDict(sorted_tokens))

    # Insert special tokens
    vocabulary.insert_token("<pad>", 0)
    vocabulary.insert_token("<unk>", 1)

    # Set default index for OOV tokens
    vocabulary.set_default_index(1)

    return vocabulary


def encode_texts(text: str) -> list[int]:
    """
    Encode a text string into a list of integer token indices.

    Parameters
    ----------
    text : str
        The input text to be encoded.

    Returns
    -------
    list[int]
        A list of integer token indices representing the encoded text.

    Notes
    -----
    This function uses a global `vocabulary` object to map tokens to their
    corresponding integer indices. The input text is first tokenized using
    a global `tokenizer` function before being encoded.
    """
    global vocabulary

    enc_text: list[int] = [vocabulary[token] for token in tokenizer(text)]
    return enc_text


def encode_labels(label: int | str) -> float:
    """
    Encode labels into binary values.

    Parameters
    ----------
    label : int or str
        The input label to be encoded. Can be either an integer or a string.

    Returns
    -------
    float
        The encoded label as a float value (0.0 or 1.0).

    Notes
    -----
    For torchtext versions > 0.10:
        - 1 represents a negative review
        - 2 represents a positive review
    For torchtext versions <= 0.10:
        - "neg" represents a negative review
        - "pos" represents a positive review
    """
    # Transform labels into 0 or 1
    if parse_version(torchtext_version) > parse_version("0.10"):
        # 1 ~ negative, 2 ~ positive review
        enc_label: float = 1.0 if label == 2 else 0.0
    else:
        enc_label: float = 1.0 if label == "pos" else 0.0

    return enc_label

In [13]:
# Assuming token_counts is defined elsewhere
vocabulary: vocab = create_vocabulary(token_counts)

# Test the vocabulary
test_tokens: list[str] = ["this", "is", "an", "example", "thisTokenDoesNotExist"]
print([vocabulary[token] for token in test_tokens])


# Test the encode_texts
print(encode_texts("this is an example thisTokenDoesNotExist"))

[11, 7, 35, 458, 1]
[11, 7, 35, 458, 1]


In [14]:
# Define the functions for transformation
def collate_fn(
    batch: list[tuple[int, str]]
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    """
    Collate function for processing batches of data.

    Parameters
    ----------
    batch : list[tuple[int, str]]
        A list of tuples containing label (int) and text (str) pairs.

    Returns
    -------
    tuple[torch.Tensor, torch.Tensor, torch.Tensor]
        A tuple containing:
        - padded_texts: torch.Tensor of shape (batch_size, max_seq_length)
        - labels: torch.Tensor of shape (batch_size,)
        - lengths: torch.Tensor of shape (batch_size,)
    """
    labels: list[int] = []
    texts: list[torch.Tensor] = []
    lengths: list[int] = []

    for _label, _text in batch:
        labels.append(encode_labels(_label))
        tok_text: torch.Tensor = torch.tensor(encode_texts(_text), dtype=torch.int64)
        texts.append(tok_text)
        lengths.append(tok_text.size(0))

    # Convert to tensor
    labels_tensor: torch.Tensor = torch.tensor(labels)
    lengths_tensor: torch.Tensor = torch.tensor(lengths)
    padded_texts: torch.Tensor = nn.utils.rnn.pad_sequence(
        texts, batch_first=True, padding_value=0.0
    )

    return padded_texts.to(device), labels_tensor.to(device), lengths_tensor.to(device)

#### Test The Collate Function

- Test with a small batch

In [15]:
from torch.utils.data import DataLoader


dataloader = DataLoader(
    train_dataset, batch_size=4, shuffle=False, collate_fn=collate_fn
)
encoded_batch, label_batch, length_batch = next(iter(dataloader))
print(f"{encoded_batch = }")
print(f"{label_batch = }")
print(f"{length_batch = }")
print(f"{encoded_batch.shape = }")

encoded_batch = tensor([[ 318,   15,   10,  ...,    0,    0,    0],
        [  48,   52,   51,  ...,  598,    2, 1591],
        [4793, 8188,  127,  ...,    0,    0,    0],
        [  15,    4, 9725,  ...,    0,    0,    0]], device='cuda:0')
label_batch = tensor([0., 1., 0., 1.], device='cuda:0')
length_batch = tensor([156, 864, 835, 541], device='cuda:0')
encoded_batch.shape = torch.Size([4, 864])


<br>

#### Batch the Datasets

- Create dataloaders

In [16]:
batch_size: int = 32

train_dl: DataLoader = DataLoader(
    train_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_fn
)
valid_dl: DataLoader = DataLoader(
    valid_dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_fn
)
test_dl: DataLoader = DataLoader(
    test_dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_fn
)

<br>

### Create Embedding Layers for Sentence Encoding

- input_dim: number of words, i.e. maximum integer index + 1.

- output_dim:

- input_length: the length of (padded) sequence

<br>

```text
for example, 'This is an example' -> [0, 0, 0, 0, 0, 0, 3, 1, 8, 9]
=> input_lenght is 10
When calling the layer, takes integr values as input,
the embedding layer convert each interger into float vector of size [output_dim]

If input shape is [BATCH_SIZE], output shape will be [BATCH_SIZE, output_dim]
If input shape is [BATCH_SIZE, 10], output shape will be [BATCH_SIZE, 10, output_dim]
```

### Convert The Text To Vectors

[![image.png](https://i.postimg.cc/Y066d3ms/image.png)](https://postimg.cc/N202MRD6)

<br>

- There are two ways of converting the texts to vectors.

1. **One-hot encoding**:
  - Converts word indices into sparse vectors of zeros and ones.
  - Vector size equals the vocabulary size (can be very large).
  - Results in high-dimensional, sparse feature space, leading to the curse of dimensionality.

2. **Embedding**:
  - Maps each word to a fixed-size, real-valued vector.
  - Embedding dimension is much smaller than the vocabulary size.
  - Reduces feature space dimensionality, mitigating the curse of dimensionality.
  - Learns salient word features through model optimization, improving efficiency.

In [17]:
embedding = nn.Embedding(
    num_embeddings=10,  # vocab size
    embedding_dim=3,  # num of feats per token
    padding_idx=0,
)

# a batch of 2 samples of 4 indices each
text_encoded_input = torch.LongTensor([[1, 2, 4, 5], [4, 3, 2, 0]])
output: torch.Tensor = embedding(text_encoded_input)
print(f"{output = }")

# batch_size, seq_len, embedding_dim
print(f"{output.shape = }")

output = tensor([[[-1.3074,  1.0545, -0.1409],
         [-0.8768, -0.1416, -0.5148],
         [-1.3336, -0.3185,  0.0877],
         [-0.3443,  1.2392,  1.1727]],

        [[-1.3336, -0.3185,  0.0877],
         [ 0.5286, -0.7448,  0.5654],
         [-0.8768, -0.1416, -0.5148],
         [ 0.0000,  0.0000,  0.0000]]], grad_fn=<EmbeddingBackward0>)
output.shape = torch.Size([2, 4, 3])


<hr><br>

### Building An RNN Model

- **RNN layers**:
  - nn.RNN(input_size, hidden_size, num_layers=1)
  - nn.LSTM(..)
  - nn.GRU(..)
  - nn.RNN(input_size, hidden_size, num_layers=1, bidirectional=True)

<br>

#### Vanilla RNN

- Simplest form of RNN.
- Suffers from the vanishing gradient problem, making it difficult to learn long-term dependencies.

#### LSTM (Long Short-Term Memory)

- Introduces gates (input, forget, output) to control the flow of information, addressing the vanishing gradient problem.
  - Input Gate: Controls how much of the new information should be added to the cell state.
  - Forget Gate: Controls how much of the previous cell state should be retained.
  - Output Gate: Controls how much of the current cell state should be output.
  - Cell State: A memory unit that carries information across time steps.
- Can learn long-term dependencies effectively.
- More complex than vanilla RNN.

#### GRU (Gated Recurrent Unit)

- A simpler variant of LSTM with fewer gates (reset, update).
- Offers similar performance to LSTM but is computationally less expensive.
- Also addresses the vanishing gradient problem.

<br>

### PyTorch Syntax

```py
# Fully connected neural network with one hidden layer
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.rnn = nn.RNN(
            input_size,
            hidden_size,
            num_layers=2,
            batch_first=True,
        )
        # self.gru = nn.GRU(input_size, hidden_size, num_layers, batch_first=True)
        # self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        _, hidden = self.rnn(x)
        # Selects the hidden state from the last layer and the last time step.
        out = hidden[-1, :, :]
        out = self.fc(out)
        return out

# Usage
model = RNN(64, 32)
model(torch.randn(5, 3, 64))
```

In [18]:
class RNN(nn.Module):
    """
    Recurrent Neural Network with one hidden layer.

    Parameters
    ----------
    input_size : int
        The number of expected features in the input x.
    hidden_size : int
        The number of features in the hidden state h.

    Attributes
    ----------
    rnn : nn.RNN
        The RNN layer.
    fc : nn.Linear
        The fully connected output layer.
    """

    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        self.rnn = nn.RNN(
            input_size,
            hidden_size,
            num_layers=2,
            batch_first=True,
        )
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Forward pass of the RNN.

        Parameters
        ----------
        x : torch.Tensor
            Input tensor of shape (batch_size, seq_len, input_size).

        Returns
        -------
        torch.Tensor
            Output tensor of shape (batch_size, 1).
        """
        # hidden state: (num_layers, batch_size, hidden_size)
        _, hidden = self.rnn(x)
        out: torch.Tensor = hidden[-1, :, :]
        out = self.fc(out)
        return out

In [19]:
model = RNN(input_size=64, hidden_size=32)

print(model)
# shape of hidden state: (num_layers, batch_size, hidden_size)
# (2, 5, 32). num_layer=2 (from the model architecture)
# hidden[-1, :, :] => (5, 32) (from the model architecture)
# output shape: (5, 32) @ (32, 1) => (5, 1)
input_tensor: torch.Tensor = torch.randn(5, 3, 64)
print(f"{input_tensor.shape = }\n")
output: torch.Tensor = model(input_tensor)

print(f"{output}\n")
print(f"{output.shape = }")

RNN(
  (rnn): RNN(64, 32, num_layers=2, batch_first=True)
  (fc): Linear(in_features=32, out_features=1, bias=True)
)
input_tensor.shape = torch.Size([5, 3, 64])

tensor([[ 0.5008],
        [ 0.1696],
        [ 0.4209],
        [ 0.1581],
        [-0.0279]], grad_fn=<AddmmBackward0>)

output.shape = torch.Size([5, 1])


### Build An RNN Model For Sentiment Analysis

In [20]:
class RNNModelConfig(TypedDict):
    vocab_size: int
    embed_dim: int
    rnn_hidden_size: int
    fc_hidden_size: int


class RNN(nn.Module):
    def __init__(self, config):
        super().__init__()

        self.embedding = nn.Embedding(
            num_embeddings=config["vocab_size"],
            embedding_dim=config["embed_dim"],
            padding_idx=0,
        )
        self.rnn = nn.LSTM(
            input_size=config["embed_dim"],
            hidden_size=config["rnn_hidden_size"],
            batch_first=True,
        )
        self.fc1 = nn.Linear(config["rnn_hidden_size"], config["fc_hidden_size"])
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(config["fc_hidden_size"], 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, text, lengths):
        out = self.embedding(text)
        out = nn.utils.rnn.pack_padded_sequence(
            out, lengths.cpu().numpy(), enforce_sorted=False, batch_first=True
        )
        out, (hidden, cell) = self.rnn(out)
        out = hidden[-1, :, :]
        out = self.fc1(out)
        out = self.relu(out)
        out = self.fc2(out)
        out = self.sigmoid(out)
        return out


# Updated!
class RNN(nn.Module):
    """
    Recurrent Neural Network (RNN) model for text classification.

    Parameters
    ----------
    config : RNNModelConfig
        Configuration object containing model parameters.

    Attributes
    ----------
    embedding : nn.Embedding
        Embedding layer for text input.
    rnn : nn.LSTM
        LSTM layer for sequence processing.
    out_layer : nn.Sequential
        Output layer for classification.
    """

    def __init__(self, config: RNNModelConfig) -> None:
        super().__init__()

        self.embedding: nn.Embedding = nn.Embedding(
            num_embeddings=config["vocab_size"],
            embedding_dim=config["embed_dim"],
            padding_idx=0,
        )
        self.rnn: nn.LSTM = nn.LSTM(
            input_size=config["embed_dim"],
            hidden_size=config["rnn_hidden_size"],
            batch_first=True,
        )
        self.out_layer: nn.Sequential = nn.Sequential(
            nn.Linear(config["rnn_hidden_size"], config["fc_hidden_size"]),
            nn.ReLU(),
            nn.Linear(config["fc_hidden_size"], 1),
            nn.Sigmoid(),
        )

    def forward(self, text: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
        """
        Forward pass of the RNN model.

        Parameters
        ----------
        text : torch.Tensor
            Input tensor of shape (batch_size, seq_length).
        lengths : torch.Tensor
            Tensor of shape (batch_size,) containing the length of each sequence in the batch.

        Returns
        -------
        torch.Tensor
            Output tensor of shape (batch_size, 1) containing classification probabilities.
        """
        out: torch.Tensor = self.embedding(text)
        out: torch.nn.utils.rnn.PackedSequence = nn.utils.rnn.pack_padded_sequence(
            out, lengths.cpu().numpy(), enforce_sorted=False, batch_first=True
        )
        out: tuple[
            torch.nn.utils.rnn.PackedSequence, tuple[torch.Tensor, torch.Tensor]
        ] = self.rnn(out)
        out, (hidden, cell) = out
        out: torch.Tensor = hidden[-1, :, :]
        out: torch.Tensor = self.out_layer(out)
        return out

In [21]:
model_config: RNNModelConfig = {
    "vocab_size": len(vocabulary),
    "embed_dim": 20,
    "rnn_hidden_size": 64,
    "fc_hidden_size": 64,
}

model_config

{'vocab_size': 71013,
 'embed_dim': 20,
 'rnn_hidden_size': 64,
 'fc_hidden_size': 64}

In [22]:
torch.manual_seed(1)

# Initialize the model
model = RNN(model_config)
model = model.to(device)

In [23]:
# Test the model
text: str = "Thank you Jesus for loving me"
input_text: list[int] = encode_texts(text)
input_tensor: torch.Tensor = torch.tensor([input_text], dtype=torch.int64).to(device)
text_length: torch.Tensor = torch.tensor([input_tensor.size(0)]).to(device)
print(f"{input_tensor = } | {text_length = }\n")

model(input_tensor, text_length)

input_tensor = tensor([[1301,   21, 1847,   16, 1723,   71]], device='cuda:0') | text_length = tensor([1], device='cuda:0')



tensor([[0.5349]], device='cuda:0', grad_fn=<SigmoidBackward0>)

In [24]:
def train_model(
    dataloader: torch.utils.data.DataLoader, model: nn.Module, lr: float = 0.001
) -> tuple[float, float]:
    """
    Train the model using the provided dataloader.

    Parameters
    ----------
    dataloader : torch.utils.data.DataLoader
        The dataloader containing the training data.
    model : nn.Module
        The neural network model to be trained.
    lr : float, optional
        The learning rate for the optimizer (default is 0.001).

    Returns
    -------
    tuple[float, float]
        A tuple containing the average accuracy and average loss.
    """
    loss_fn: nn.BCELoss = nn.BCELoss()
    optimizer: torch.optim.Adam = torch.optim.Adam(model.parameters(), lr=lr)

    model.train()
    total_acc: float = 0
    total_loss: float = 0

    for text_batch, label_batch, lengths in dataloader:
        optimizer.zero_grad()

        # Forward pass
        pred: torch.Tensor = model(text_batch, lengths)[:, 0]
        loss: torch.Tensor = loss_fn(pred, label_batch)

        # Backward pass
        loss.backward()
        optimizer.step()

        # Update metrics
        total_acc += ((pred >= 0.5).float() == label_batch).float().sum().item()
        total_loss += loss.item() * label_batch.size(0)

    avg_acc: float = total_acc / len(dataloader.dataset)
    avg_loss: float = total_loss / len(dataloader.dataset)

    return (avg_acc, avg_loss)


def evaluate_model(
    dataloader: torch.utils.data.DataLoader, model: nn.Module
) -> tuple[float, float]:
    """
    Evaluate the model using the provided dataloader.

    Parameters
    ----------
    dataloader : torch.utils.data.DataLoader
        The dataloader containing the evaluation data.
    model : nn.Module
        The neural network model to be evaluated.

    Returns
    -------
    tuple[float, float]
        A tuple containing the average accuracy and average loss.
    """
    loss_fn: nn.BCELoss = nn.BCELoss()

    model.eval()
    total_acc: float = 0
    total_loss: float = 0

    with torch.no_grad():
        for text_batch, label_batch, lengths in dataloader:
            pred: torch.Tensor = model(text_batch, lengths)[:, 0]
            loss: torch.Tensor = loss_fn(pred, label_batch)
            total_acc += ((pred >= 0.5).float() == label_batch).float().sum().item()
            total_loss += loss.item() * label_batch.size(0)

    avg_acc: float = total_acc / len(dataloader.dataset)
    avg_loss: float = total_loss / len(dataloader.dataset)
    return (avg_acc, avg_loss)

In [25]:
num_epochs: int = 10

torch.manual_seed(1)

for epoch in range(num_epochs):
    acc_train, loss_train = train_model(train_dl, model=model)
    acc_valid, loss_valid = evaluate_model(valid_dl, model=model)
    print(
        f"Epoch {epoch+1} | accuracy: {acc_train:.4f} | val_accuracy: {acc_valid:.4f}"
    )

Epoch 1 | accuracy: 0.5990 | val_accuracy: 0.6864
Epoch 2 | accuracy: 0.7187 | val_accuracy: 0.6816
Epoch 3 | accuracy: 0.7805 | val_accuracy: 0.7963
Epoch 4 | accuracy: 0.8403 | val_accuracy: 0.8293
Epoch 5 | accuracy: 0.8710 | val_accuracy: 0.8512
Epoch 6 | accuracy: 0.8960 | val_accuracy: 0.8643
Epoch 7 | accuracy: 0.9094 | val_accuracy: 0.7643
Epoch 8 | accuracy: 0.9272 | val_accuracy: 0.8808
Epoch 9 | accuracy: 0.9229 | val_accuracy: 0.8483
Epoch 10 | accuracy: 0.9448 | val_accuracy: 0.8805


In [26]:
acc_test, _ = evaluate_model(test_dl, model=model)
print(f"test_accuracy: {acc_test:.4f}")

test_accuracy: 0.8584


<br>

#### Try Out Bidirectional Recurrent Layer

- The bidirectional RNN layer makes `two passes` over each input sequence: a `forward` pass and a `reverse` or backward pass (note that this is not to be confused with the forward and backward passes in the context of backpropagation).

- The resulting hidden states of these forward and backward passes are usually `concatenated` into a single hidden state.
- Other merge modes include `summation`, `multiplication` (multiplying the results of the two passes), and `averaging` (taking the average of the two).

- We can also try other types of recurrent layers, such as the regular RNN. However, as it turns out, a model built with regular recurrent layers won’t be able to reach a good predictive performance (even on the training data).
  - For example, if you try replacing the bidirectional LSTM layer in the previous code with a `unidirectional nn.RNN` (instead of nn.LSTM) layer and train the model on full-length sequences, you may observe that the loss will not even decrease during training.
  - The reason is that the sequences in this dataset are too long, so a model with an RNN layer cannot learn the long-term dependencies and may suffer from vanishing or exploding gradient problems.

In [27]:
class RNNModelConfig(TypedDict):
    vocab_size: int
    embed_dim: int
    rnn_hidden_size: int
    fc_hidden_size: int
    merge_strategy: Literal["avg", "concat", "mul", "sum"]


class RNN(nn.Module):
    """
    Recurrent Neural Network (RNN) model for text classification.

    Parameters
    ----------
    config : RNNModelConfig
        Configuration object containing model parameters.

    Attributes
    ----------
    embedding : nn.Embedding
        Embedding layer for text input.
    rnn : nn.LSTM
        LSTM layer for sequence processing.
    out_layer : nn.Sequential
        Output layer for classification.
    """

    def __init__(self, config: RNNModelConfig) -> None:
        super().__init__()

        self.embedding: nn.Embedding = nn.Embedding(
            num_embeddings=config["vocab_size"],
            embedding_dim=config["embed_dim"],
            padding_idx=0,
        )
        self.rnn: nn.LSTM = nn.LSTM(
            input_size=config["embed_dim"],
            hidden_size=config["rnn_hidden_size"],
            batch_first=True,
            bidirectional=True,  # NEW!
        )
        # In a bidirectional LSTM, setting bidirectional=True allows the LSTM to process
        # the input sequence in both forward and backward directions, resulting in two
        # sets of hidden states. For a bidirectional LSTM, the hidden state size is doubled
        #  (rnn_hidden_size * 2) because it concatenates the forward and backward hidden states.
        # Consequently, the input size to the output layer must also be rnn_hidden_size * 2 to
        # accommodate the concatenated hidden states.
        self.out_layer: nn.Sequential = nn.Sequential(
            nn.Linear(config["rnn_hidden_size"] * 2, config["fc_hidden_size"]),
            nn.ReLU(),
            nn.Linear(config["fc_hidden_size"], 1),
            nn.Sigmoid(),
        )
        self.merge_strategy: str = config["merge_strategy"]

    def forward(self, text: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
        """
        Forward pass of the RNN model.

        Parameters
        ----------
        text : torch.Tensor
            Input tensor of shape (batch_size, seq_length).
        lengths : torch.Tensor
            Tensor of shape (batch_size,) containing the length of each sequence in the batch.

        Returns
        -------
        torch.Tensor
            Output tensor of shape (batch_size, 1) containing classification probabilities.
        """
        out: torch.Tensor = self.embedding(text)
        out: torch.nn.utils.rnn.PackedSequence = nn.utils.rnn.pack_padded_sequence(
            out, lengths.cpu().numpy(), enforce_sorted=False, batch_first=True
        )
        _: torch.Tensor
        hidden: torch.Tensor
        _, (hidden, _) = self.rnn(out)
        if self.merge_strategy == "avg":
            out = (hidden[-2, :, :] + hidden[-1, :, :]) / 2
        elif self.merge_strategy == "concat":
            out = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)
        elif self.merge_strategy == "mul":
            out = torch.mul(hidden[-2, :, :], hidden[-1, :, :])
        elif self.merge_strategy == "sum":
            out = torch.add(hidden[-2, :, :], hidden[-1, :, :])
        else:
            raise ValueError("merge_strategy must be one of 'avg', 'concat', 'mul', 'sum'")

        out = self.out_layer(out)
        return out

In [28]:
model_config: RNNModelConfig = {
    "vocab_size": len(vocabulary),
    "embed_dim": 20,
    "rnn_hidden_size": 64,
    "fc_hidden_size": 64,
    "merge_strategy": "concat",
}

model_config

{'vocab_size': 71013,
 'embed_dim': 20,
 'rnn_hidden_size': 64,
 'fc_hidden_size': 64,
 'merge_strategy': 'concat'}

In [29]:
torch.manual_seed(1)

# Initialize the model
model = RNN(model_config)
model = model.to(device)


num_epochs: int = 10

torch.manual_seed(1)

for epoch in range(num_epochs):
    acc_train, loss_train = train_model(train_dl, model=model)
    acc_valid, loss_valid = evaluate_model(valid_dl, model=model)
    print(
        f"Epoch {epoch+1} | accuracy: {acc_train:.4f} | val_accuracy: {acc_valid:.4f}"
    )

Epoch 1 | accuracy: 0.6166 | val_accuracy: 0.7203
Epoch 2 | accuracy: 0.7620 | val_accuracy: 0.5272
Epoch 3 | accuracy: 0.8078 | val_accuracy: 0.8117
Epoch 4 | accuracy: 0.8624 | val_accuracy: 0.8379
Epoch 5 | accuracy: 0.8937 | val_accuracy: 0.8531
Epoch 6 | accuracy: 0.9106 | val_accuracy: 0.8429
Epoch 7 | accuracy: 0.9240 | val_accuracy: 0.8669
Epoch 8 | accuracy: 0.9414 | val_accuracy: 0.8760
Epoch 9 | accuracy: 0.9534 | val_accuracy: 0.8819
Epoch 10 | accuracy: 0.9656 | val_accuracy: 0.8901


In [30]:
acc_test, _ = evaluate_model(test_dl, model=model)
print(f"test_accuracy: {acc_test:.4f}")

test_accuracy: 0.8614
