# Introduction

Long short-term memory (LSTM) is a recurrent neural network architecture
designed to capture long-term dependencies in sequential data. It is
particularly useful in time-series analysis, language modeling, speech
recognition, and other sequence-modeling tasks. PyTorch is a popular
open-source machine learning framework that provides built-in modules,
such as `nn.LSTM`, for building and training LSTMs.

In this notebook, we will explore the basics of LSTMs in PyTorch. We
will start with the theoretical background, including the architecture
of LSTMs and how they work. Then, we will dive into practical
implementation, including data preprocessing, building the LSTM model,
training the model, and evaluating the model’s performance. Finally, we
will discuss some advanced techniques for LSTMs, such as bidirectional
LSTMs, stacked LSTMs, and attention mechanisms.

# Theoretical Background

## Recurrent Neural Networks (RNNs)

Before we dive into LSTMs, let’s first take a look at recurrent neural
networks (RNNs). RNNs are a type of neural network architecture that is
designed to handle sequential data. They process sequences of vectors
(inputs) by recursively applying the same set of weights to each input
vector. An RNN is composed of a loop that processes each input vector
and updates its internal state based on the input vector and the
previous state. The output of the RNN at each step is a function of the
current internal state.
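
The recurrence described above can be sketched in a few lines. This is a minimal illustration, assuming a tanh nonlinearity and toy sizes; the names `rnn_step`, `W_xh`, `W_hh`, and `b_h` are hypothetical and are not used elsewhere in this notebook.

``` python
import torch

torch.manual_seed(0)

input_size, hidden_size, seq_len = 4, 3, 5

# One shared set of weights is reused at every time step.
W_xh = torch.randn(hidden_size, input_size) * 0.1
W_hh = torch.randn(hidden_size, hidden_size) * 0.1
b_h = torch.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """Compute the new state from the current input and the previous state."""
    return torch.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

xs = torch.randn(seq_len, input_size)  # a toy sequence of input vectors
h = torch.zeros(hidden_size)           # initial internal state
states = []
for x_t in xs:
    h = rnn_step(x_t, h)
    states.append(h)

states = torch.stack(states)
print(states.shape)  # torch.Size([5, 3])
```

Here the output at each step is simply the hidden state itself; a real model would usually apply another layer on top of it.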

The standard RNN architecture suffers from a fundamental problem called
the vanishing gradient problem. During backpropagation through time, the
gradient is multiplied by the recurrent Jacobian at every step, so it
can shrink exponentially with the length of the sequence. Early inputs
then have almost no influence on the weight updates, making it difficult
for the model to learn long-term dependencies. LSTMs were introduced to
address this problem.
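
The effect can be observed numerically. The sketch below is an illustration under stated assumptions, not a general proof: it uses a tanh recurrence without inputs, and it rescales the recurrent matrix `W` so that its spectral norm is 0.5, a value chosen only to make the shrinkage easy to see. The helper name `grad_norm_wrt_initial_state` is hypothetical.

``` python
import torch

torch.manual_seed(0)
hidden_size = 8

# Assumption: rescale W so its largest singular value is 0.5.
W = torch.randn(hidden_size, hidden_size)
W = 0.5 * W / torch.linalg.matrix_norm(W, ord=2)

def grad_norm_wrt_initial_state(num_steps):
    """Norm of d(sum of final state) / d(initial state) after num_steps."""
    h0 = torch.randn(hidden_size, requires_grad=True)
    h = h0
    for _ in range(num_steps):
        h = torch.tanh(W @ h)
    h.sum().backward()
    return h0.grad.norm().item()

norms = {T: grad_norm_wrt_initial_state(T) for T in (1, 10, 50)}
for T, value in norms.items():
    print(f"steps={T:2d}  gradient norm={value:.3e}")
```

Because the derivative of tanh is at most 1, each step scales the gradient by at most 0.5 in this setup, so after 50 steps the signal from the initial state is numerically negligible.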

## LSTMs

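LSTMs were first proposed by Hochreiter and Schmidhuber in 1997 as a
variant of RNNs. They are designed to capture long-term dependencies by
introducing a memory cell controlled by gating mechanisms. The widely
used formulation described here has three gates: the input gate, the
output gate, and the forget gate. The forget gate was not part of the
original 1997 design; it was added later by Gers, Schmidhuber, and
Cummins.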

The memory cell is the core component of LSTMs. It can remember
information for an extended period and selectively choose which
information to forget or remember using the gating mechanisms. The input
gate controls the amount of new information that enters the memory cell.
The forget gate controls the amount of old information that is discarded
from the memory cell. The output gate controls the amount of information
that is outputted from the memory cell.

The LSTM architecture is illustrated in the figure below.

<figure>
<img
src="https://raw.githubusercontent.com/abulbasar/data-science-notebooks/master/images/lstm.png"
alt="LSTM Architecture" />
<figcaption aria-hidden="true">LSTM Architecture</figcaption>
</figure>

The equations that govern the state updates in an LSTM are given below:

$$
\begin{aligned}
i_t &= \sigma(W_{i} x_t + U_{i} h_{t-1} + b_i) \\
f_t &= \sigma(W_{f} x_t + U_{f} h_{t-1} + b_f) \\
o_t &= \sigma(W_{o} x_t + U_{o} h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_{c} x_t + U_{c} h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

where $x_t$ is the input at time step $t$, $h_t$ is the hidden state at
time step $t$, $c_t$ is the memory cell at time step $t$, $\tilde{c}_t$
is the candidate cell state, and $i_t$, $f_t$, and $o_t$ are the input,
forget, and output gates, respectively. $W$ and $U$ are the learnable
input and recurrent weight matrices, and $b$ is the bias vector.
$\sigma$ is the sigmoid activation function, and $\odot$ denotes
element-wise multiplication.
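
To connect these equations to PyTorch, the sketch below writes out one LSTM step by hand and compares it with `nn.LSTMCell`. The function name `manual_lstm_step` is hypothetical. The code relies on two documented details of `nn.LSTMCell`: its stacked weights are ordered input gate, forget gate, candidate, output gate, and it stores two bias vectors whose sum plays the role of $b$ above.

``` python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, hidden_size, batch = 5, 4, 2
cell = nn.LSTMCell(input_size, hidden_size)

def manual_lstm_step(x_t, h_prev, c_prev, cell):
    # Split the stacked parameters into the four per-gate blocks.
    W_i, W_f, W_c, W_o = cell.weight_ih.chunk(4, dim=0)
    U_i, U_f, U_c, U_o = cell.weight_hh.chunk(4, dim=0)
    b_i, b_f, b_c, b_o = (cell.bias_ih + cell.bias_hh).chunk(4, dim=0)

    i_t = torch.sigmoid(x_t @ W_i.T + h_prev @ U_i.T + b_i)   # input gate
    f_t = torch.sigmoid(x_t @ W_f.T + h_prev @ U_f.T + b_f)   # forget gate
    o_t = torch.sigmoid(x_t @ W_o.T + h_prev @ U_o.T + b_o)   # output gate
    c_tilde = torch.tanh(x_t @ W_c.T + h_prev @ U_c.T + b_c)  # candidate
    c_t = f_t * c_prev + i_t * c_tilde
    h_t = o_t * torch.tanh(c_t)
    return h_t, c_t

x_t = torch.randn(batch, input_size)
h_prev = torch.randn(batch, hidden_size)
c_prev = torch.randn(batch, hidden_size)

with torch.no_grad():
    h_manual, c_manual = manual_lstm_step(x_t, h_prev, c_prev, cell)
    h_ref, c_ref = cell(x_t, (h_prev, c_prev))

print(torch.allclose(h_manual, h_ref, atol=1e-5))
print(torch.allclose(c_manual, c_ref, atol=1e-5))
```

The inputs are stored as row vectors with a batch dimension, which is why the weights appear transposed relative to the equations.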

# Practical Implementation

Now that we have a good understanding of the theory behind LSTMs, let’s
dive into the practical implementation of an LSTM model in PyTorch. We
will use the MNIST dataset to classify handwritten digits, treating each
28x28 image as a sequence of 28 rows with 28 pixels each.

## Data Preprocessing

We will start by loading the MNIST dataset and preprocessing it. This
dataset contains images of handwritten digits from 0 to 9.

``` python
import torch
from torchvision import datasets, transforms

# Define transformations
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])

# Load training and testing datasets
trainset = datasets.MNIST('~/.pytorch/MNIST_data/', download=True, train=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)

testset = datasets.MNIST('~/.pytorch/MNIST_data/', download=True, train=False, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=64, shuffle=False)
```

We use `transforms.ToTensor()` to convert the images to PyTorch tensors
and then normalize the pixel values using `transforms.Normalize()`. We
then load the training and testing datasets and create data loaders to
iterate over the data during training and testing.
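
`transforms.Normalize((0.5,), (0.5,))` computes `(x - mean) / std` per channel. Since `ToTensor()` scales pixels to the range [0, 1], the result lies in [-1, 1]. The short check below reproduces that arithmetic on a stand-in tensor rather than on real MNIST data, so it runs without downloading anything.

``` python
import torch

mean, std = 0.5, 0.5

# Darkest, middle, and brightest pixel values after ToTensor().
pixels = torch.tensor([0.0, 0.5, 1.0])
print((pixels - mean) / std)  # tensor([-1.,  0.,  1.])

# A stand-in single-channel 28x28 image with values in [0, 1).
image = torch.rand(1, 28, 28)
normalized = (image - mean) / std
print(normalized.shape)  # torch.Size([1, 28, 28])
```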

## Building the LSTM Model

Next, we will define the LSTM model architecture using PyTorch’s
`nn.LSTM()` module. We will use one LSTM layer with 128 hidden units,
followed by a fully connected layer with 10 outputs for the 10 classes.

``` python
import torch.nn as nn

class MNISTLSTM(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(MNISTLSTM, self).__init__()

        self.hidden_size = hidden_size
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):

        # Initialize hidden state and cell state
        h0 = torch.zeros(1, x.size(0), self.hidden_size).to(x.device)
        c0 = torch.zeros(1, x.size(0), self.hidden_size).to(x.device)

        # Forward pass through LSTM layer
        out, _ = self.lstm(x, (h0, c0))

        # Only use the last output of the sequence
        out = out[:, -1, :]

        # Forward pass through the fully connected layer
        out = self.fc(out)

        return out

# Create the model
input_size = 28 # Features per time step: each image row has 28 pixels
hidden_size = 128 # Number of hidden units in the LSTM layer
num_classes = 10 # There are 10 classes (digits 0 through 9)
model = MNISTLSTM(input_size, hidden_size, num_classes)

# Print the model's architecture
print(model)
```

We define the `MNISTLSTM` class that inherits from PyTorch’s
`nn.Module`. In the constructor, we define an LSTM layer with 128 hidden
units and a fully connected layer with 10 outputs for the 10 classes.
During the forward pass, we first initialize the hidden state and cell state
of the LSTM layer to zeros. We then pass the input sequence through the
LSTM layer and use only the last output of the sequence. Finally, we
pass the last output through the fully connected layer to get the output
logits.
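
It can help to check the tensor shapes involved in the forward pass. The sketch below runs a random stand-in batch through an `nn.LSTM` configured like the one above, with a small batch size chosen only for illustration.

``` python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=28, hidden_size=128, batch_first=True)
x = torch.randn(4, 28, 28)  # (batch, time steps, features per step)

with torch.no_grad():
    out, (h_n, c_n) = lstm(x)  # omitted initial states default to zeros

print(out.shape)  # torch.Size([4, 28, 128])
print(h_n.shape)  # torch.Size([1, 4, 128])

# For a single-layer, unidirectional LSTM, the last output step
# is the final hidden state.
print(torch.allclose(out[:, -1, :], h_n[-1]))
```

Because `nn.LSTM` uses zero initial states when none are given, creating `h0` and `c0` explicitly in `MNISTLSTM` is equivalent to calling `self.lstm(x)`; spelling them out mainly documents the expected shapes.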

## Training the Model

Now that we have defined the LSTM model, let’s train it on the MNIST
dataset. We will use the cross-entropy loss and the Adam optimizer for
training. We will train the model for 5 epochs, which means that we will
iterate over the training dataset 5 times.

``` python
epochs = 5
lr = 0.001
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

# Train the model
for epoch in range(epochs):
    running_loss = 0
    for images, labels in trainloader:

        # Drop the channel dimension: (batch, 1, 28, 28) -> (batch, 28, 28)
        images = images.view(images.shape[0], 28, 28)

        # Zero the parameter gradients
        optimizer.zero_grad()

        # Forward pass, backward pass, and optimization
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

    print('Epoch {}/{} Loss: {}'.format(epoch+1, epochs, running_loss/len(trainloader)))
```

In each epoch, we iterate over the training dataset and perform a
forward pass, backward pass, and optimization step for every batch. We
drop the channel dimension of the images so that each one becomes a
sequence with 28 time steps (one for each row of the image) and 28
features per step. We then compute the cross-entropy loss between the
predicted logits and the true labels and update the model’s weights
using the Adam optimizer.
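
The reshape used in the training loop can be verified on a random stand-in batch with the same shape that the MNIST data loader yields:

``` python
import torch

images = torch.randn(64, 1, 28, 28)  # (batch, channel, height, width)
sequences = images.view(images.shape[0], 28, 28)

print(sequences.shape)  # torch.Size([64, 28, 28])

# The view only removes the singleton channel dimension.
print(torch.equal(sequences, images.squeeze(1)))  # True

# Time step t of a sequence is row t of the original image.
print(torch.equal(sequences[0, 3], images[0, 0, 3]))  # True
```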

## Evaluating the Model

Now that we have trained the model, let’s evaluate its performance on
the testing dataset. We will compute the accuracy and the macro-averaged
precision, recall, and F1 score of the model.

``` python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = []
y_pred = []

# Switch to evaluation mode and disable gradient computation for inference
model.eval()
with torch.no_grad():
    for images, labels in testloader:

        # Drop the channel dimension so each image is a sequence of 28 rows
        images = images.view(images.shape[0], 28, 28)

        # Predict the class labels
        outputs = model(images)
        _, predicted = torch.max(outputs, 1)

        # Collect the true and predicted labels for evaluation
        y_true.extend(labels.numpy())
        y_pred.extend(predicted.numpy())

# Calculate evaluation metrics
acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred, average='macro')
rec = recall_score(y_true, y_pred, average='macro')
f1 = f1_score(y_true, y_pred, average='macro')

print('Accuracy: {:.2f}%  Precision: {:.2f}  Recall: {:.2f}  F1 Score: {:.2f}'.format(
    acc * 100, prec, rec, f1))
```

We iterate over the testing dataset and perform inference on each image
by passing it through the trained model. We compute the accuracy,
precision, recall, and F1 score of the model using scikit-learn’s
metrics functions.
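
With `average='macro'`, scikit-learn computes each metric per class and then takes the unweighted mean over classes. The sketch below reproduces that calculation for precision and recall in plain Python on a tiny made-up label set; the function name `macro_precision_recall` and the toy labels are hypothetical and exist only for illustration.

``` python
def macro_precision_recall(y_true, y_pred, num_classes):
    """Unweighted mean of per-class precision and recall."""
    precisions, recalls = [], []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        predicted_c = sum(1 for p in y_pred if p == c)
        actual_c = sum(1 for t in y_true if t == c)
        # Use 0.0 when a class is never predicted or never present.
        precisions.append(tp / predicted_c if predicted_c else 0.0)
        recalls.append(tp / actual_c if actual_c else 0.0)
    return sum(precisions) / num_classes, sum(recalls) / num_classes

toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]

toy_acc = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
toy_prec, toy_rec = macro_precision_recall(toy_true, toy_pred, num_classes=3)
print(f"accuracy={toy_acc:.4f} precision={toy_prec:.4f} recall={toy_rec:.4f}")
```

Macro averaging gives every digit class the same weight regardless of how many test images it has, which is reasonable for MNIST because its classes are roughly balanced.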

# Advanced Techniques

So far, we have covered the basics of LSTMs in PyTorch: the
architecture, implementation, training, and evaluation. Several
extensions can improve their performance, including bidirectional LSTMs,
stacked LSTMs, and attention mechanisms.

## Bidirectional LSTMs

Bidirectional LSTMs (BiLSTMs) are a variant of LSTMs that process
sequential data in both forward and backward directions. They introduce
a second set of hidden states that process the sequence in reverse
order. The output at each time step is a concatenation of the forward
and backward hidden states. BiLSTMs are useful when the entire sequence
is available at prediction time, as in sequence classification or
tagging, because each time step can then draw on both past and future
context. They are not suitable for predicting the next time step, since
the future inputs are not yet known.
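
In PyTorch, a BiLSTM is obtained by passing `bidirectional=True` to `nn.LSTM`. The sketch below uses a random stand-in batch and a small hidden size, both chosen only for illustration, to show how the output shapes change.

``` python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_size = 16
bilstm = nn.LSTM(input_size=28, hidden_size=hidden_size,
                 batch_first=True, bidirectional=True)
x = torch.randn(4, 28, 28)  # (batch, time steps, features per step)

with torch.no_grad():
    out, (h_n, _) = bilstm(x)

print(out.shape)  # torch.Size([4, 28, 32])
print(h_n.shape)  # torch.Size([2, 4, 16])

# The forward direction finishes at the last time step,
# the backward direction finishes at the first one.
forward_final = out[:, -1, :hidden_size]
backward_final = out[:, 0, hidden_size:]
print(torch.allclose(forward_final, h_n[0]))
print(torch.allclose(backward_final, h_n[1]))

# One common choice of summary vector for classification.
summary = torch.cat([h_n[0], h_n[1]], dim=1)
print(summary.shape)  # torch.Size([4, 32])
```

Note that `out[:, -1, :]` is a weaker summary for a BiLSTM: its backward half has only seen the last time step. A classifier built on a BiLSTM would also need its linear layer to accept `2 * hidden_size` inputs.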

## Stacked LSTMs

Stacked LSTMs place multiple LSTM layers on top of each other. Each
layer takes the sequence of hidden states produced by the layer below as
its input. The added depth lets the model learn more abstract features
of the sequence, at the cost of more parameters and slower training.
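
In PyTorch, stacking is controlled by the `num_layers` argument of `nn.LSTM`. The following sketch, again on a random stand-in batch with an illustrative hidden size, shows the resulting shapes.

``` python
import torch
import torch.nn as nn

torch.manual_seed(0)
stacked = nn.LSTM(input_size=28, hidden_size=16, num_layers=2, batch_first=True)
x = torch.randn(4, 28, 28)  # (batch, time steps, features per step)

with torch.no_grad():
    out, (h_n, c_n) = stacked(x)

print(out.shape)  # torch.Size([4, 28, 16])
print(h_n.shape)  # torch.Size([2, 4, 16])

# `out` contains the hidden states of the top layer only, so its last
# step matches the final hidden state of the last layer.
print(torch.allclose(out[:, -1, :], h_n[-1]))
```

If initial states are passed explicitly, their first dimension must equal `num_layers`, so a model that hard-codes `torch.zeros(1, ...)` for `h0` and `c0` would need to be adjusted.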

## Attention Mechanisms

Attention mechanisms let a model selectively focus on parts of the input
sequence when making predictions. In the context of LSTMs, attention can
assign a different weight to the hidden state at each time step based on
its relevance, instead of relying only on the final hidden state. This
can improve accuracy, and the learned weights give some insight into
which time steps the model relied on.
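
As a rough sketch of the idea, the module below scores each LSTM output with a single linear layer, normalizes the scores with a softmax over time, and returns the weighted sum. This is only one simple form of attention among many; the class name `AttentionPooling` and the sizes are hypothetical choices for illustration.

``` python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Hypothetical helper: weighted sum of LSTM outputs over time."""

    def __init__(self, hidden_size):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, lstm_out):
        # lstm_out: (batch, time, hidden)
        scores = self.score(lstm_out).squeeze(-1)   # (batch, time)
        weights = torch.softmax(scores, dim=1)      # sums to 1 over time
        context = (weights.unsqueeze(-1) * lstm_out).sum(dim=1)
        return context, weights                     # (batch, hidden), (batch, time)

torch.manual_seed(0)
lstm = nn.LSTM(input_size=28, hidden_size=16, batch_first=True)
pool = AttentionPooling(hidden_size=16)
x = torch.randn(4, 28, 28)  # random stand-in batch

with torch.no_grad():
    out, _ = lstm(x)
    context, weights = pool(out)

print(context.shape)  # torch.Size([4, 16])
print(weights.shape)  # torch.Size([4, 28])
```

In a classifier, `context` would replace the last-step output as the input to the fully connected layer, and `weights` could be inspected to see which image rows received the most attention.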

# Conclusion

In this notebook, we have covered the basics of LSTMs in PyTorch. We
started with the theoretical background, including the architecture of
LSTMs and how they work. We then dove into practical implementation,
including data preprocessing, building the LSTM model, training the
model, and evaluating the model’s performance. Finally, we discussed
some advanced techniques for LSTMs, such as bidirectional LSTMs, stacked
LSTMs, and attention mechanisms.