**Plan**

**1. Basics of sequential data**

**2. Introduction to RNNs**

**3. Long Short-Term Memory (LSTM) networks**

**4. Gated Recurrent Unit (GRU)**



# **Basics of sequential data**

Sequential data is data in which the order of the elements carries meaning. In PyTorch, sequential data is typically modeled with recurrent neural networks (RNNs), Long Short-Term Memory networks (LSTMs), and Gated Recurrent Units (GRUs). The key concepts for working with sequential data in PyTorch are outlined below.

**Key Concepts**

1. **Recurrent Neural Networks (RNNs)**:
    - RNNs are designed to handle sequential data by maintaining a hidden state that captures information about previous elements in the sequence.
    - Basic structure: `hidden_state[t] = f(hidden_state[t-1], input[t])`

2. **Long Short-Term Memory Networks (LSTMs)**:
    - LSTMs mitigate the vanishing gradient problem of plain RNNs by introducing a memory cell and three gates (input gate, forget gate, output gate).
    - These gates control the flow of information, allowing the network to retain information over longer sequences.

3. **Gated Recurrent Units (GRUs)**:
    - GRUs are a simplified version of LSTMs with fewer gates (update gate, reset gate).
    - They often perform comparably to LSTMs while using fewer parameters, which makes them cheaper to compute.

4. **Embedding Layers**:
    - Embeddings are used to convert input sequences of tokens (like words) into dense vectors of fixed size.
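
To make the embedding step concrete, here is a minimal sketch of token ids passing through `nn.Embedding` and then into an RNN. The vocabulary size, dimensions, and token ids are made-up values chosen only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size = 100     # hypothetical vocabulary size
embedding_dim = 8    # size of each dense vector
hidden_size = 16     # RNN hidden state size

embedding = nn.Embedding(vocab_size, embedding_dim)
rnn = nn.RNN(embedding_dim, hidden_size, batch_first=True)

# Two sequences of four token ids each (an integer dtype is required)
token_ids = torch.tensor([[1, 5, 9, 2],
                          [7, 3, 0, 4]])

embedded = embedding(token_ids)   # (batch, seq_len, embedding_dim)
out, hn = rnn(embedded)           # h0 defaults to zeros when omitted

print(embedded.shape)  # torch.Size([2, 4, 8])
print(out.shape)       # torch.Size([2, 4, 16])
```

The embedding layer is a trainable lookup table, so its vectors are learned together with the recurrent layer during training.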

# **Introduction to RNNs**

Recurrent Neural Networks (RNNs) are a class of neural networks that are particularly effective for processing sequential data. They are widely used in tasks where the order of the data points matters, such as time series prediction, language modeling, and speech recognition.

**Key Concepts of RNNs**

1. **Sequential Data**:
   - In sequential data, each data point is dependent on the previous ones. Examples include sentences (sequences of words), time series data (sequences of values), and videos (sequences of frames).

2. **Recurrent Connections**:
   - Unlike traditional feedforward neural networks, RNNs have connections that loop back on themselves. This allows information to persist, making them suitable for handling sequences of data.

3. **Hidden State**:
   - The hidden state in an RNN acts as a memory that captures information from previous time steps. It is updated at each time step based on the current input and the previous hidden state.
   - Equation: $ h_t = f(W_{hx} x_t + W_{hh} h_{t-1} + b_h) $, where $ h_t $ is the hidden state at time $ t $, $ x_t $ is the input at time $ t $, $ W_{hx} $ and $ W_{hh} $ are weight matrices, $ b_h $ is a bias term, and $ f $ is a nonlinear activation (typically $ \tanh $).

4. **Output**:
   - The output at each time step can be computed using the hidden state.
   - Equation: $ y_t = g(W_{hy} h_t + b_y) $, where $ y_t $ is the output at time $ t $, $ W_{hy} $ is a weight matrix, $ b_y $ is a bias term, and $ g $ is an output activation chosen for the task (for example, softmax for classification).
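
The hidden-state recurrence above can be sketched by hand and checked against `nn.RNN`. This is a minimal sketch that assumes $ f = \tanh $ (the default nonlinearity of `nn.RNN`) and reuses the layer's own weights; the sizes are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

input_size, hidden_size, seq_len, batch = 4, 3, 5, 2
rnn = nn.RNN(input_size, hidden_size, batch_first=True)  # tanh by default

x = torch.randn(batch, seq_len, input_size)
h = torch.zeros(batch, hidden_size)                       # h_0

# PyTorch stores W_hx as weight_ih_l0 and W_hh as weight_hh_l0,
# and splits b_h into two biases that are simply added together.
W_hx, W_hh = rnn.weight_ih_l0, rnn.weight_hh_l0
b_h = rnn.bias_ih_l0 + rnn.bias_hh_l0

manual_states = []
with torch.no_grad():
    for t in range(seq_len):
        # h_t = tanh(W_hx x_t + W_hh h_{t-1} + b_h), applied to a whole batch
        h = torch.tanh(x[:, t] @ W_hx.T + h @ W_hh.T + b_h)
        manual_states.append(h)
    manual_out = torch.stack(manual_states, dim=1)  # (batch, seq_len, hidden)
    reference_out, _ = rnn(x)

print(torch.allclose(manual_out, reference_out, atol=1e-5))
```

The same weights are reused at every time step, which is what distinguishes the recurrence from a stack of independent feedforward layers.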

In [56]:
import torch
import torch.nn as nn

In [57]:
# Hyperparameters
input_size = 10    # Input feature size
hidden_size = 20   # Hidden state size
num_layers = 1     # Number of RNN layers
output_size = 5    # Output size (e.g., number of classes)
sequence_length = 5 # Length of input sequences
batch_size = 3     # Batch size

# Create RNN layer
rnn = nn.RNN(input_size, hidden_size, num_layers, batch_first=True)

In [58]:
# Dummy input sequence (batch_size x sequence_length x input_size)
input_seq = torch.randn(batch_size, sequence_length, input_size)  # Batch size 3, sequence length 5, input size 10

In [59]:
# Initialize hidden state with zeros (num_layers, batch_size, hidden_size)
h0 = torch.zeros(num_layers, batch_size, hidden_size)

# Forward propagate through RNN
out, hn = rnn(input_seq, h0)

# Output shapes
print(f"Input shape: {input_seq.shape}")      # Expected: (3, 5, 10)
print(f"RNN output shape: {out.shape}")       # Expected: (3, 5, 20)
print(f"Hidden state shape: {hn.shape}")      # Expected: (1, 3, 20) -> (num_layers, batch, hidden)
print(f"Output type: {type(out)}")            # Expected: <class 'torch.Tensor'>
print(f"Hidden state type: {type(hn)}")       # Expected: <class 'torch.Tensor'>

Input shape: torch.Size([3, 5, 10])
RNN output shape: torch.Size([3, 5, 20])
Hidden state shape: torch.Size([1, 3, 20])
Output type: <class 'torch.Tensor'>
Hidden state type: <class 'torch.Tensor'>
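
The two return values are related: for a unidirectional RNN, the last time step of `out` equals the final hidden state of the top layer. The sketch below checks this and adds a linear readout for the output equation $ y_t = g(W_{hy} h_t + b_y) $. The `readout` layer is an illustrative addition, with $ g $ taken to be the identity.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

input_size, hidden_size, output_size = 10, 20, 5
rnn = nn.RNN(input_size, hidden_size, num_layers=1, batch_first=True)
readout = nn.Linear(hidden_size, output_size)  # plays the role of W_hy and b_y

input_seq = torch.randn(3, 5, input_size)      # (batch, seq_len, input_size)
out, hn = rnn(input_seq)

# Last time step of `out` vs. final hidden state of the last layer
same = torch.allclose(out[:, -1, :], hn[-1])

y_all = readout(out)              # one output per time step: (3, 5, 5)
y_last = readout(out[:, -1, :])   # one output per sequence:  (3, 5)
print(same, y_all.shape, y_last.shape)
```

Using only the last time step suits sequence-level predictions, while the per-step outputs suit tasks such as tagging, where every element needs its own label.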


# **Long Short-Term Memory (LSTM) networks**

Long Short-Term Memory (LSTM) networks are a type of recurrent neural network (RNN) designed to learn long-term dependencies in sequential data. PyTorch provides them through `torch.nn.LSTM`, which is used in the example below.

**LSTM Equations**

An LSTM unit consists of a cell, an input gate, a forget gate, and an output gate. These gates control the flow of information and help maintain long-term dependencies.

1. **Forget Gate**: Decides what information to discard from the cell state.
   $$
   f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)
   $$

2. **Input Gate**: Decides which new information to store in the cell state.
   $$
   i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)
   $$
   $$
   \tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)
   $$

3. **Cell State Update**: Combines the previous cell state and the new candidate values.
   $$
   C_t = f_t * C_{t-1} + i_t * \tilde{C}_t
   $$

4. **Output Gate**: Decides what the next hidden state should be.
   $$
   o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)
   $$
   $$
   h_t = o_t * \tanh(C_t)
   $$

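**Legend**
- $ x_t $: Input at time step $ t $
- $ h_t $: Hidden state at time step $ t $
- $ C_t $: Cell state at time step $ t $
- $ \tilde{C}_t $: Candidate cell state at time step $ t $
- $ f_t $, $ i_t $, $ o_t $: Forget, input, and output gate activations
- $ [h_{t-1}, x_t] $: Concatenation of the previous hidden state and the current input
- $ * $: Element-wise (Hadamard) product
- $ \sigma $: Sigmoid function
- $ \tanh $: Hyperbolic tangent function
- $ W $ and $ b $: Weight matrices and biases for respective gates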

**Summary**

The LSTM architecture is designed to overcome the limitations of standard RNNs in capturing long-term dependencies by introducing a more complex cell state and three gates (forget, input, output) that regulate the flow of information. This makes LSTMs particularly effective for tasks involving long sequences of data.
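
The gate equations above can be traced in code for a single time step. This sketch borrows the weights of an `nn.LSTMCell` so the hand-written step can be compared with PyTorch's result. It relies on PyTorch stacking the gate weights in the order input, forget, candidate, output. The sizes are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

input_size, hidden_size, batch = 4, 3, 2
cell = nn.LSTMCell(input_size, hidden_size)

x_t = torch.randn(batch, input_size)
h_prev = torch.randn(batch, hidden_size)
C_prev = torch.randn(batch, hidden_size)

with torch.no_grad():
    # PyTorch keeps separate input and hidden weights; concatenating them
    # column-wise gives the single W that acts on [h_{t-1}, x_t].
    W = torch.cat([cell.weight_hh, cell.weight_ih], dim=1)
    b = cell.bias_ih + cell.bias_hh
    hx = torch.cat([h_prev, x_t], dim=1)

    # Rows of W are stacked as: input, forget, candidate, output
    i_pre, f_pre, g_pre, o_pre = (hx @ W.T + b).chunk(4, dim=1)

    f_t = torch.sigmoid(f_pre)            # forget gate
    i_t = torch.sigmoid(i_pre)            # input gate
    C_tilde = torch.tanh(g_pre)           # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde    # cell state update
    o_t = torch.sigmoid(o_pre)            # output gate
    h_t = o_t * torch.tanh(C_t)           # new hidden state

    h_ref, C_ref = cell(x_t, (h_prev, C_prev))

print(torch.allclose(h_t, h_ref, atol=1e-5), torch.allclose(C_t, C_ref, atol=1e-5))
```

The cell state is updated only by an element-wise scaling and an addition, which is why gradients can flow through it over many steps more easily than through the repeated matrix products of a plain RNN.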

In [60]:
import torch
import torch.nn as nn
import torch.optim as optim

In [61]:
class LSTMModel(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, num_layers=1):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        # batch_first=True: inputs are (batch, seq_len, input_size)
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # Zero initial hidden and cell states, created on the input's device
        state_shape = (self.num_layers, x.size(0), self.hidden_size)
        h0 = torch.zeros(state_shape, device=x.device)
        c0 = torch.zeros(state_shape, device=x.device)
        out, _ = self.lstm(x, (h0, c0))   # out: (batch, seq_len, hidden_size)
        # Predict from the hidden state at the last time step
        out = self.fc(out[:, -1, :])
        return out


In [63]:
import numpy as np

# Example data
X_train = np.random.rand(100, 10, 1)  # 100 sequences, each of length 10 with 1 feature
y_train = np.random.rand(100, 1)      # 100 target values

X_train = torch.from_numpy(X_train).float()
y_train = torch.from_numpy(y_train).float()

In [64]:
input_size = 1
hidden_size = 50
output_size = 1
num_layers = 2

model = LSTMModel(input_size, hidden_size, output_size, num_layers)
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)


In [None]:
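num_epochs = 100
model.train()  # training mode (matters for layers such as dropout)
for epoch in range(num_epochs):
    optimizer.zero_grad()               # clear gradients from the previous step
    outputs = model(X_train)            # full-batch forward pass
    loss = criterion(outputs, y_train)
    loss.backward()                     # backpropagation through time
    optimizer.step()

    if (epoch+1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')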

In [None]:
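model.eval()
with torch.no_grad():
    # No separate test set exists here, so this measures the loss on the training data
    train_output = model(X_train)
    train_loss = criterion(train_output, y_train)
    print(f'Training-set loss (eval mode): {train_loss.item():.4f}')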
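
The evaluation above reuses the training tensors, so it does not measure generalization. A minimal sketch of a held-out evaluation follows. It assumes a synthetic task (predicting the mean of each sequence) so that there is something to learn. `TinyLSTM`, the 80/20 split, and the hyperparameters are hypothetical choices, not part of the original example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class TinyLSTM(nn.Module):  # hypothetical name, a smaller cousin of LSTMModel
    def __init__(self, input_size=1, hidden_size=16, output_size=1):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        out, _ = self.lstm(x)          # initial states default to zeros
        return self.fc(out[:, -1, :])  # predict from the last time step

# Synthetic data (an assumption for illustration): target is the sequence mean
X = torch.rand(100, 10, 1)
y = X.mean(dim=1)                      # shape (100, 1)

# Hold out the last 20 sequences; the model never trains on them
X_train, y_train = X[:80], y[:80]
X_test, y_test = X[80:], y[80:]

model = TinyLSTM()
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

model.train()
for epoch in range(50):
    optimizer.zero_grad()
    loss = criterion(model(X_train), y_train)
    loss.backward()
    optimizer.step()

model.eval()
with torch.no_grad():
    test_loss = criterion(model(X_test), y_test)
print(f"Held-out test loss: {test_loss.item():.4f}")
```

With purely random targets, as in the example data above, a held-out loss would stay near the variance of the targets, because there is no pattern to learn.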

# **Gated Recurrent Unit (GRU)**

Gated Recurrent Units (GRUs) are a variant of Recurrent Neural Networks (RNNs) designed to capture long-term dependencies more effectively than vanilla RNNs, while being simpler and computationally more efficient than Long Short-Term Memory (LSTM) networks.
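
One way to see the difference in complexity is to count trainable parameters for single-layer modules of the same size. The helper `count_parameters` and the sizes below are illustrative choices.

```python
import torch.nn as nn

def count_parameters(module: nn.Module) -> int:
    """Total number of elements across all parameters of a module."""
    return sum(p.numel() for p in module.parameters())

input_size, hidden_size = 10, 20
layers = {
    "RNN": nn.RNN(input_size, hidden_size),
    "GRU": nn.GRU(input_size, hidden_size),
    "LSTM": nn.LSTM(input_size, hidden_size),
}
for name, layer in layers.items():
    print(f"{name}: {count_parameters(layer)} parameters")
# RNN: 640 parameters
# GRU: 1920 parameters
# LSTM: 2560 parameters
```

The GRU has three weight blocks per layer (reset, update, candidate) where the LSTM has four, so it carries 3/4 of the LSTM's parameters at the same hidden size.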

**GRU Equations**

A GRU unit consists of a reset gate and an update gate, which control the flow of information and help maintain long-term dependencies.

1. **Reset Gate**: Decides what part of the previous hidden state to forget.
   $$
   r_t = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r)
   $$

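2. **Update Gate**: Decides how much of the previous hidden state is replaced by the candidate state. With the convention used below, $ z_t $ close to 0 keeps the previous hidden state and $ z_t $ close to 1 overwrites it.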
   $$
   z_t = \sigma(W_z \cdot [h_{t-1}, x_t] + b_z)
   $$

3. **Candidate Hidden State**: Combines the reset gate with the previous hidden state and current input to create a candidate hidden state.
   $$
   \tilde{h}_t = \tanh(W_h \cdot [r_t * h_{t-1}, x_t] + b_h)
   $$

4. **Final Hidden State**: Interpolates between the previous hidden state and the candidate hidden state using the update gate.
   $$
   h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t
   $$

**Legend**

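- $ x_t $: Input at time step $ t $
- $ h_t $: Hidden state at time step $ t $
- $ \tilde{h}_t $: Candidate hidden state at time step $ t $
- $ r_t $, $ z_t $: Reset and update gate activations
- $ [h_{t-1}, x_t] $: Concatenation of the previous hidden state and the current input
- $ * $: Element-wise (Hadamard) product
- $ \sigma $: Sigmoid function
- $ \tanh $: Hyperbolic tangent function
- $ W $ and $ b $: Weight matrices and biases for respective gates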

**Summary**

The GRU architecture simplifies the LSTM by combining the forget and input gates into a single update gate, and the cell state and hidden state into a single hidden state. This makes GRUs less complex and computationally more efficient while still being effective at capturing long-term dependencies.
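
The four GRU equations above can be written out directly for one time step. This is a minimal sketch with randomly initialized, hypothetical weights stored in a `params` dictionary. It follows the equations as stated here, whereas `nn.GRU` uses a slightly different parameterization: it swaps the roles of $ z_t $ and $ 1 - z_t $ and applies the reset gate after the hidden-state matrix product.

```python
import torch

torch.manual_seed(0)

def gru_step(x_t, h_prev, params):
    """One GRU step following the four equations above."""
    hx = torch.cat([h_prev, x_t], dim=1)                  # [h_{t-1}, x_t]
    r_t = torch.sigmoid(hx @ params["W_r"].T + params["b_r"])
    z_t = torch.sigmoid(hx @ params["W_z"].T + params["b_z"])
    rhx = torch.cat([r_t * h_prev, x_t], dim=1)           # [r_t * h_{t-1}, x_t]
    h_tilde = torch.tanh(rhx @ params["W_h"].T + params["b_h"])
    return (1 - z_t) * h_prev + z_t * h_tilde

input_size, hidden_size, batch = 4, 3, 2
params = {}
for gate in ("r", "z", "h"):
    params[f"W_{gate}"] = torch.randn(hidden_size, hidden_size + input_size) * 0.1
    params[f"b_{gate}"] = torch.zeros(hidden_size)

x_t = torch.randn(batch, input_size)
h_prev = torch.randn(batch, hidden_size)

h_t = gru_step(x_t, h_prev, params)
print(h_t.shape)  # torch.Size([2, 3])

# With a very negative update-gate bias, z_t is ~0 and the state is carried over
params["b_z"] = torch.full((hidden_size,), -50.0)
h_kept = gru_step(x_t, h_prev, params)
print(torch.allclose(h_kept, h_prev, atol=1e-6))
```

The second check illustrates the interpolation in the final equation: when the update gate closes, the previous hidden state passes through unchanged, which is how a GRU can hold information across many steps.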

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

In [None]:
class GRUModel(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, num_layers=1):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        # batch_first=True: inputs are (batch, seq_len, input_size)
        self.gru = nn.GRU(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # Zero initial hidden state, created on the input's device
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size, device=x.device)
        out, _ = self.gru(x, h0)          # out: (batch, seq_len, hidden_size)
        # Predict from the hidden state at the last time step
        out = self.fc(out[:, -1, :])
        return out

In [None]:
import numpy as np

# Example data
X_train = np.random.rand(100, 10, 1)  # 100 sequences, each of length 10 with 1 feature
y_train = np.random.rand(100, 1)      # 100 target values

X_train = torch.from_numpy(X_train).float()
y_train = torch.from_numpy(y_train).float()

In [None]:
input_size = 1
hidden_size = 50
output_size = 1
num_layers = 2

model = GRUModel(input_size, hidden_size, output_size, num_layers)
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)


In [None]:
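num_epochs = 100
model.train()  # training mode (matters for layers such as dropout)
for epoch in range(num_epochs):
    optimizer.zero_grad()               # clear gradients from the previous step
    outputs = model(X_train)            # full-batch forward pass
    loss = criterion(outputs, y_train)
    loss.backward()                     # backpropagation through time
    optimizer.step()

    if (epoch+1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')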

In [None]:
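model.eval()
with torch.no_grad():
    # No separate test set exists here, so this measures the loss on the training data
    train_output = model(X_train)
    train_loss = criterion(train_output, y_train)
    print(f'Training-set loss (eval mode): {train_loss.item():.4f}')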