# RNNs (Recurrent Neural Networks)

## Basic RNN
## Long Short-Term Memory (LSTM)

In [1]:
%load_ext watermark
%watermark -v -p numpy,pandas,polars,mlxtend,omegaconf --conda

Python implementation: CPython
Python version       : 3.11.8
IPython version      : 8.22.2

numpy    : 1.26.4
pandas   : 2.2.1
polars   : 0.20.18
mlxtend  : 0.23.1
omegaconf: 2.3.0

conda environment: torch_p11



In [2]:
# Built-in library
from pathlib import Path
import re
import json
from typing import Any, Literal, Optional, Union
import logging
import warnings

# Standard imports
import numpy as np
import numpy.typing as npt
from pprint import pprint
import pandas as pd
import polars as pl

# Visualization
import matplotlib.pyplot as plt

# NumPy settings
np.set_printoptions(precision=4)

# Pandas settings
pd.options.display.max_rows = 1_000
pd.options.display.max_columns = 1_000
pd.options.display.max_colwidth = 600

# Polars settings
pl.Config.set_fmt_str_lengths(1_000)
pl.Config.set_tbl_cols(n=1_000)
pl.Config.set_tbl_rows(n=200)

warnings.filterwarnings("ignore")

# Black code formatter (Optional)
%load_ext lab_black

# auto reload imports
%load_ext autoreload
%autoreload 2

## Sequential Data

- `Sequential data` is data whose elements appear in a meaningful order, so rearranging them changes or destroys the information.
- Each data point in a sequential dataset typically `depends` on the `previous data points`, so the `order` of the data is crucial for understanding the information it contains.
- Time series data is sequential data where the order of examples is determined by time. Examples include stock prices and audio recordings.
- Not all sequential data is time series. For example, text and DNA sequences are ordered but not time-based.
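
To make the role of order concrete, here is a minimal sketch using two made-up sentences (they are illustrative and not taken from the dataset). An order-free summary such as word counts cannot tell them apart, while the ordered sequences differ.

```python
from collections import Counter

# Two sentences built from the same words in a different order.
sentence_a = "the dog bit the man".split()
sentence_b = "the man bit the dog".split()

# An order-free summary (word counts) cannot tell them apart ...
same_counts = Counter(sentence_a) == Counter(sentence_b)

# ... while the ordered sequences differ.
same_order = sentence_a == sentence_b

print(same_counts)  # True
print(same_order)   # False
```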

<hr>

## Recurrent Neural Networks (RNNs)

- `RNN` is a type of neural network designed to process sequential data.
- Unlike feedforward neural networks, which treat each input independently, RNNs process a sequence one time step at a time and carry information forward between steps. This makes them well suited to tasks involving text, speech, and time series data.

### Key characteristics of RNNs

- **Sequential Processing**: RNNs process data one element at a time, so each prediction can take the preceding inputs into account.
- **Internal Memory**: RNNs keep an internal memory, the hidden state, that summarizes past inputs. This memory lets the network maintain context and base its predictions on everything seen so far in the sequence.
- **Feedback Loop**: The hidden state computed at one time step is fed back into the same layer at the next time step, creating a cyclic structure. In principle this lets the network capture dependencies across the whole sequence. In practice, plain RNNs struggle with long-range dependencies because gradients vanish or explode during training, which is what motivates the LSTM.
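
The hidden-state update described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the model trained in this notebook: the sizes, the random weights, and the names `W_xh`, `W_hh`, and `b_h` are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

input_size, hidden_size, seq_len = 3, 4, 5

# Hypothetical parameters; a trained network would learn these.
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden (feedback loop)
b_h = np.zeros(hidden_size)

x = rng.normal(size=(seq_len, input_size))  # one sequence with 5 time steps
h = np.zeros(hidden_size)                   # initial hidden state (the "memory")

hidden_states = []
for x_t in x:
    # The new state depends on the current input AND the previous state.
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    hidden_states.append(h)

hidden_states = np.stack(hidden_states)
print(hidden_states.shape)  # (5, 4)
```

The same weights are reused at every time step; only the hidden state changes as the sequence is consumed.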

[![image.png](https://i.postimg.cc/3wSPP3Qn/image.png)](https://postimg.cc/cK393yHn)

<br>

[![image.png](https://i.postimg.cc/dtPzMWgP/image.png)](https://postimg.cc/Dm6CLc2B)

### Common Types of Sequencing Tasks

- **Many-to-one**: Input is a sequence, output is a fixed-size vector or scalar. Example: Sentiment analysis.
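- **One-to-many**: Input is a single fixed-size item (not a sequence), output is a sequence. Example: image captioning, where a single image is given to the model, which then produces a sequence of words describing it.
- **Many-to-many**: Both input and output are sequences. This category splits further by alignment: synchronized (one output per input step, e.g. labeling each frame of a video) or delayed (the output starts only after the whole input has been read, e.g. machine translation).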
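
As a rough sketch of how these task types differ in practice, the snippet below reads the outputs off a stand-in array of per-step hidden states. The random values, the sizes, and the output layer `W_out` are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, hidden_size, num_classes = 6, 4, 2

# Stand-in for the per-step hidden states an RNN would produce.
hidden_states = rng.normal(size=(seq_len, hidden_size))
W_out = rng.normal(size=(num_classes, hidden_size))  # hypothetical output layer

# Many-to-one: read only the final hidden state, e.g. one prediction per review.
many_to_one = W_out @ hidden_states[-1]

# Many-to-many (synchronized): read every hidden state, one output per input step.
many_to_many = hidden_states @ W_out.T

print(many_to_one.shape)   # (2,)
print(many_to_many.shape)  # (6, 2)
```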

### Hidden Recurrence vs Output Recurrence

- **Hidden recurrence**: The recurrent connection is from the hidden layer to itself.
- **Output recurrence**: The recurrent connection is from the output layer to either the hidden layer or itself.

#### Two Types of Output Recurrence

- **Output-to-hidden**: Connection from output layer at previous time step to hidden layer at current time step.
- **Output-to-output**: Connection from output layer at previous time step to output layer at current time step.
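
The three kinds of recurrence can be written side by side. The notation here is an assumption introduced for illustration: $x^{(t)}$, $h^{(t)}$, and $o^{(t)}$ are the input, hidden, and output activations at time step $t$, the $W$ matrices and $b$ vectors are learned parameters, and $\phi_h$, $\phi_o$ are activation functions.

$$
\begin{aligned}
\text{Hidden recurrence:} \quad & h^{(t)} = \phi_h\left(W_{xh}\, x^{(t)} + W_{hh}\, h^{(t-1)} + b_h\right) \\
\text{Output-to-hidden:} \quad & h^{(t)} = \phi_h\left(W_{xh}\, x^{(t)} + W_{oh}\, o^{(t-1)} + b_h\right) \\
\text{Output-to-output:} \quad & o^{(t)} = \phi_o\left(W_{ho}\, h^{(t)} + W_{oo}\, o^{(t-1)} + b_o\right), \quad h^{(t)} = \phi_h\left(W_{xh}\, x^{(t)} + b_h\right)
\end{aligned}
$$

In each case exactly one term carries information from step $t-1$ to step $t$; the variants differ only in which layer that term comes from and which layer receives it.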

### Load Data

#### Dependencies

```sh
pip install torchtext
pip install torchdata
pip install portalocker
```

In [3]:
import torch
from torch import nn
from torch.utils.data.dataset import Subset, random_split
from torchtext.datasets import IMDB

In [4]:
def tokenizer(text: str) -> list[str]:
    """
    Tokenize the input text by removing HTML tags, extracting emoticons,
    and splitting the text into individual tokens.

    Parameters
    ----------
    text : str
        The input text to be tokenized.

    Returns
    -------
    list[str]
        A list of tokens extracted from the input text.
    """
    # Raw strings keep the regex escapes (\), \(, \W) from being read as string escapes
    text = re.sub(r"<[^>]*>", "", text)
    emoticons: list[str] = re.findall(r"(?::|;|=)(?:-)?(?:\)|\(|D|P)", text.lower())
    text = re.sub(r"[\W]+", " ", text.lower()) + " ".join(emoticons).replace("-", "")
    tokenized: list[str] = text.split()
    return tokenized
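
A quick way to see what the tokenizer does is to run it on a short string. The sketch below repeats the same three steps in compact form so that it runs on its own; the sample sentence is made up for illustration.

```python
import re


def tokenizer(text: str) -> list[str]:
    # Same steps as the function above, repeated so the snippet runs on its own.
    text = re.sub(r"<[^>]*>", "", text)  # strip HTML tags
    emoticons = re.findall(r"(?::|;|=)(?:-)?(?:\)|\(|D|P)", text.lower())
    # Replace runs of non-word characters, then append the emoticons without noses.
    text = re.sub(r"[\W]+", " ", text.lower()) + " ".join(emoticons).replace("-", "")
    return text.split()


tokens = tokenizer("<br />This movie was GREAT :-) ")
print(tokens)  # ['this', 'movie', 'was', 'great', ':)']
```

The HTML tag is dropped, the text is lowercased, punctuation is removed, and the emoticon is moved to the end with its nose (`-`) stripped.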

In [5]:
# Load the IMDB train and test splits
train_dataset: list[tuple[int, str]] = list(IMDB(split="train"))
test_dataset: list[tuple[int, str]] = list(IMDB(split="test"))

In [6]:
train_dataset[:2]

[(1,
  'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far betw

In [7]:
# Split the dataset into train and validation
torch.manual_seed(123)

train_dataset: Subset
valid_dataset: Subset

train_dataset, valid_dataset = random_split(dataset=train_dataset, lengths=[0.85, 0.15])
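
When `lengths` is given as fractions, `random_split` is documented to take the floor of each fraction times the dataset size and then distribute any leftover items round-robin. The helper below is a hypothetical pure-Python sketch of that arithmetic only (`fractional_lengths` is not part of PyTorch and does no shuffling).

```python
import math


def fractional_lengths(n_items: int, fractions: list[float]) -> list[int]:
    """Hypothetical helper mirroring how fractional `lengths` are resolved."""
    lengths = [math.floor(n_items * frac) for frac in fractions]
    # Hand out any leftover items one at a time, round-robin.
    remainder = n_items - sum(lengths)
    for i in range(remainder):
        lengths[i % len(lengths)] += 1
    return lengths


# The IMDB training split holds 25,000 reviews.
print(fractional_lengths(25_000, [0.85, 0.15]))  # [21250, 3750]
print(fractional_lengths(10, [0.85, 0.15]))      # [9, 1]
```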

In [8]:
device: torch.device | str = torch.device(
    "cuda" if torch.cuda.is_available() else "cpu"
)


from collections import Counter, OrderedDict

# Find unique tokens (words)
token_counts: Counter = Counter()


for label, line in train_dataset:
    tokens = tokenizer(line)
    token_counts.update(tokens)


print(f"Vocab-size: {len(token_counts):,}")

Vocab-size: 71,011


In [9]:
from itertools import islice


dict(islice(token_counts.items(), 0, 3))

{'seeing': 1801, 'as': 39774, 'i': 74206}

In [10]:
from torchtext.vocab import Vocab, vocab
from torchtext import __version__ as torchtext_version
from packaging.version import parse as parse_version
import torchtext


def create_vocabulary(token_counts: dict[str, int]) -> Vocab:
    """
    Create a vocabulary from token counts.

    Parameters
    ----------
    token_counts : dict[str, int]
        A dictionary mapping tokens to their frequency counts.

    Returns
    -------
    Vocab
        A vocabulary object with special tokens and default index set.

    Notes
    -----
    The vocabulary is created by sorting tokens by frequency in descending order,
    inserting special tokens '<pad>' and '<unk>', and setting a default index for
    out-of-vocabulary tokens.
    """
    # Sort tokens by frequency in descending order
    sorted_tokens: list[tuple[str, int]] = sorted(
        token_counts.items(), key=lambda x: x[1], reverse=True
    )

    # Create the Vocab object (`vocab` is the factory function, `Vocab` the class)
    vocabulary: Vocab = vocab(OrderedDict(sorted_tokens))

    # Insert special tokens
    vocabulary.insert_token("<pad>", 0)
    vocabulary.insert_token("<unk>", 1)

    # Set default index for OOV tokens
    vocabulary.set_default_index(1)

    return vocabulary


def encode_texts(text: str) -> list[int]:
    """
    Encode a text string into a list of integer token indices.

    Parameters
    ----------
    text : str
        The input text to be encoded.

    Returns
    -------
    list[int]
        A list of integer token indices representing the encoded text.

    Notes
    -----
    This function uses a global `vocabulary` object to map tokens to their
    corresponding integer indices. The input text is first tokenized using
    a global `tokenizer` function before being encoded.
    """
    global vocabulary

    enc_text: list[int] = [vocabulary[token] for token in tokenizer(text)]
    return enc_text


def encode_labels(label: int | str) -> float:
    """
    Encode labels into binary values.

    Parameters
    ----------
    label : int or str
        The input label to be encoded. Can be either an integer or a string.

    Returns
    -------
    float
        The encoded label as a float value (0.0 or 1.0).

    Notes
    -----
    For torchtext versions > 0.10:
        - 1 represents a negative review
        - 2 represents a positive review
    For torchtext versions <= 0.10:
        - "neg" represents a negative review
        - "pos" represents a positive review
    """
    # Transform labels into 0 or 1
    if parse_version(torchtext_version) > parse_version("0.10"):
        # 1 ~ negative, 2 ~ positive review
        enc_label: float = 1.0 if label == 2 else 0.0
    else:
        enc_label: float = 1.0 if label == "pos" else 0.0

    return enc_label

In [11]:
# Build the vocabulary from the token counts computed above
vocabulary = create_vocabulary(token_counts)

# Test the vocabulary
test_tokens: list[str] = ["this", "is", "an", "example", "thisTokenDoesNotExist"]
print([vocabulary[token] for token in test_tokens])


# Test the encode_texts
print(encode_texts("this is an example"))

[11, 7, 35, 458, 1]
[11, 7, 35, 458]
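
To show what the vocabulary does without depending on torchtext, here is a pure-Python sketch of the same idea: sort tokens by frequency, reserve index 0 for `<pad>` and index 1 for `<unk>`, and map unseen tokens to 1. The toy corpus and the names `token_to_index`, `UNK_INDEX`, and `encode` are assumptions for illustration.

```python
from collections import Counter

token_counts = Counter("the movie was good the plot was thin the end".split())

# Most frequent tokens first; ties keep first-seen order because sorted() is stable.
sorted_tokens = sorted(token_counts.items(), key=lambda item: item[1], reverse=True)

# Indices 0 and 1 are reserved for the special tokens.
token_to_index = {"<pad>": 0, "<unk>": 1}
for token, _count in sorted_tokens:
    token_to_index[token] = len(token_to_index)

UNK_INDEX = token_to_index["<unk>"]


def encode(text: str) -> list[int]:
    # Unknown tokens fall back to the <unk> index, like set_default_index(1).
    return [token_to_index.get(token, UNK_INDEX) for token in text.split()]


print(encode("the plot was amazing"))  # [2, 6, 3, 1]
```

Frequent tokens receive small indices, and the out-of-vocabulary word `amazing` maps to 1, matching the behavior of the last test token in the cell above.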


In [12]:
# Define the functions for transformation
def collate_fn(
    batch: list[tuple[int, str]]
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    """
    Collate function for processing batches of data.

    Parameters
    ----------
    batch : list[tuple[int, str]]
        A list of tuples containing label (int) and text (str) pairs.

    Returns
    -------
    tuple[torch.Tensor, torch.Tensor, torch.Tensor]
        A tuple containing:
        - padded_texts: torch.Tensor of shape (batch_size, max_seq_length)
        - labels: torch.Tensor of shape (batch_size,)
        - lengths: torch.Tensor of shape (batch_size,)
    """
    labels: list[float] = []
    texts: list[torch.Tensor] = []
    lengths: list[int] = []

    for _label, _text in batch:
        labels.append(encode_labels(_label))
        tok_text: torch.Tensor = torch.tensor(encode_texts(_text), dtype=torch.int64)
        texts.append(tok_text)
        lengths.append(tok_text.size(0))

    # Convert to tensor
    labels_tensor: torch.Tensor = torch.tensor(labels)
    lengths_tensor: torch.Tensor = torch.tensor(lengths)
    # Right-pad shorter sequences with 0, the index reserved for <pad>
    padded_texts: torch.Tensor = nn.utils.rnn.pad_sequence(
        texts, batch_first=True, padding_value=0.0
    )

    return padded_texts.to(device), labels_tensor.to(device), lengths_tensor.to(device)

#### Take a Small Batch

In [13]:
from torch.utils.data import DataLoader


dataloader = DataLoader(
    train_dataset, batch_size=4, shuffle=False, collate_fn=collate_fn
)
text_batch, label_batch, length_batch = next(iter(dataloader))
print(f"{text_batch = }")
print(f"{label_batch = }")
print(f"{length_batch = }")
print(f"{text_batch.shape = }")

text_batch = tensor([[ 318,   15,   10,  ...,    0,    0,    0],
        [  48,   52,   51,  ...,  598,    2, 1591],
        [4793, 8188,  127,  ...,    0,    0,    0],
        [  15,    4, 9725,  ...,    0,    0,    0]])
label_batch = tensor([0., 1., 0., 1.])
length_batch = tensor([156, 864, 835, 541])
text_batch.shape = torch.Size([4, 864])
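
The batch is 864 columns wide because every sequence is right-padded to the length of the longest one in the batch. A pure-Python sketch of that padding step is shown below; `pad_batch` is a hypothetical stand-in for illustration, not the function used by the `DataLoader`.

```python
def pad_batch(
    sequences: list[list[int]], pad_index: int = 0
) -> tuple[list[list[int]], list[int]]:
    """Hypothetical pure-Python stand-in for padding a batch of token ids."""
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    # Right-pad every sequence with the <pad> index up to the longest one.
    padded = [seq + [pad_index] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths


batch = [[11, 7, 35, 458], [48, 52], [15, 4, 9725]]
padded, lengths = pad_batch(batch)
print(padded)   # [[11, 7, 35, 458], [48, 52, 0, 0], [15, 4, 9725, 0]]
print(lengths)  # [4, 2, 3]
```

Keeping the original lengths alongside the padded batch lets later steps ignore the padding positions.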


#### Batch the Datasets

In [14]:
batch_size: int = 32

train_dl: DataLoader = DataLoader(
    train_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_fn
)
valid_dl: DataLoader = DataLoader(
    valid_dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_fn
)
test_dl: DataLoader = DataLoader(
    test_dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_fn
)