# 🔄 Notebook 03: RNNs and LSTMs

**Week 3-4: Deep Learning & NLP Foundations**  
**Gen AI Masters Program**

---

## 📋 Objectives

By the end of this notebook, you will master:
1. ✅ Recurrent Neural Networks (RNN) architecture
2. ✅ Sequence modeling and time series
3. ✅ Vanishing gradient problem
4. ✅ Long Short-Term Memory (LSTM) networks
5. ✅ Gated Recurrent Units (GRU)
6. ✅ Bidirectional RNNs
7. ✅ Real-world sequence prediction

**Estimated Time:** 3-4 hours

---

## 📚 Why RNNs?

CNNs work great for images, but what about **sequential data**?
- 📈 Time series (sensor readings, stock prices)
- 📝 Text (sentences, documents)
- 🎵 Audio (speech, music)
- 🏭 **Manufacturing logs** (our focus!)

**Key Insight**: RNNs have **memory** - they remember previous inputs!

Let's dive in! 🚀

In [None]:
# Import libraries
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

# Set style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (14, 6)

# Set random seeds
torch.manual_seed(42)
np.random.seed(42)

# Check GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"✅ Using device: {device}")
print(f"PyTorch version: {torch.__version__}")

## 1️⃣ Understanding RNN Architecture

### How RNNs Work

Unlike feedforward networks, RNNs process sequences **step by step**:

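```
h(t) = tanh(W_hh * h(t-1) + W_xh * x(t) + b_h)
y(t) = W_hy * h(t) + b_y
```

Where:
- `h(t)`: Hidden state at time t
- `x(t)`: Input at time t
- `y(t)`: Output at time t
- `W_xh`, `W_hh`, `W_hy`: Input-to-hidden, hidden-to-hidden, and hidden-to-output weights
- `b_h`, `b_y`: Hidden and output biases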

The **hidden state** carries information forward!
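
As a sanity check on the update rule, the sketch below applies one step by hand and compares it with PyTorch's `nn.RNNCell`. Note that `nn.RNNCell` keeps two bias vectors (`bias_ih` and `bias_hh`), and their sum plays the role of the single hidden bias in the formula. The sizes and random inputs are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

input_size, hidden_size = 3, 5               # arbitrary sizes for illustration
cell = nn.RNNCell(input_size, hidden_size)   # uses tanh by default

x_t = torch.randn(1, input_size)             # input at time t (batch of 1)
h_prev = torch.zeros(1, hidden_size)         # h(t-1), starting from zeros

with torch.no_grad():
    # One step by hand: tanh(W_xh x(t) + W_hh h(t-1) + bias)
    h_manual = torch.tanh(
        x_t @ cell.weight_ih.T + cell.bias_ih
        + h_prev @ cell.weight_hh.T + cell.bias_hh
    )
    # The same step computed by the library
    h_torch = cell(x_t, h_prev)

print("Manual step matches nn.RNNCell:", torch.allclose(h_manual, h_torch, atol=1e-6))
```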

In [None]:
# Simple RNN from scratch
class SimpleRNN:
    def __init__(self, input_size, hidden_size, output_size):
        """
        Simple RNN implementation
        
        Args:
            input_size: Dimension of input features
            hidden_size: Dimension of hidden state
            output_size: Dimension of output
        """
        # Initialize weights
        self.hidden_size = hidden_size
        
        # W_xh: input to hidden
        self.W_xh = np.random.randn(input_size, hidden_size) * 0.01
        
        # W_hh: hidden to hidden (recurrent)
        self.W_hh = np.random.randn(hidden_size, hidden_size) * 0.01
        
        # W_hy: hidden to output
        self.W_hy = np.random.randn(hidden_size, output_size) * 0.01
        
        # Biases
        self.b_h = np.zeros((1, hidden_size))
        self.b_y = np.zeros((1, output_size))
    
    def forward(self, inputs):
        """
        Forward pass through RNN
        
        Args:
            inputs: Sequence of inputs (seq_len, input_size)
        
        Returns:
            outputs: Sequence of outputs
            hidden_states: All hidden states
        """
        h = np.zeros((1, self.hidden_size))  # Initial hidden state
        outputs = []
        hidden_states = [h]
        
        for x in inputs:
            x = x.reshape(1, -1)
            
            # Update hidden state
            h = np.tanh(np.dot(x, self.W_xh) + np.dot(h, self.W_hh) + self.b_h)
            
            # Compute output
            y = np.dot(h, self.W_hy) + self.b_y
            
            outputs.append(y)
            hidden_states.append(h)
        
        return np.array(outputs), np.array(hidden_states)

# Test SimpleRNN
rnn = SimpleRNN(input_size=3, hidden_size=5, output_size=2)

# Create a simple sequence
sequence = np.array([
    [1.0, 0.5, 0.3],  # t=0
    [0.8, 0.6, 0.4],  # t=1
    [0.9, 0.7, 0.2],  # t=2
    [0.7, 0.8, 0.5],  # t=3
])

outputs, hidden_states = rnn.forward(sequence)

print("🔄 Simple RNN Test")
print("="*60)
print(f"Input sequence shape: {sequence.shape}")
print(f"Output shape: {outputs.shape}")
print(f"Hidden states shape: {hidden_states.shape}")
print(f"\nFirst output: {outputs[0]}")
print(f"Last output: {outputs[-1]}")
print("\n✅ RNN processes sequences step-by-step!")

## 2️⃣ The Vanishing Gradient Problem

### Why Simple RNNs Fail

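RNNs struggle with **long sequences** because backpropagation through time multiplies the gradient by roughly the same recurrent factor at every step:
- If that factor is below 1, gradients shrink exponentially (vanishing)
- If it is above 1, gradients grow exponentially (exploding)
- Either way, the network can't reliably learn long-term dependencies

**Solution**: LSTM & GRU use gating mechanisms that largely mitigate vanishing gradients!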

In [None]:
# Demonstrate vanishing gradient
def compute_gradient_flow(sequence_length):
    """
    Simulate gradient magnitude through time
    """
    W = 0.5  # Weight < 1 causes vanishing
    gradient = 1.0
    gradients = [gradient]
    
    for t in range(sequence_length):
        gradient *= W  # Gradient flows backward
        gradients.append(gradient)
    
    return gradients

# Compute for different sequence lengths
seq_lengths = [10, 20, 30, 40, 50]
plt.figure(figsize=(14, 6))

for seq_len in seq_lengths:
    grads = compute_gradient_flow(seq_len)
    plt.plot(range(len(grads)), grads, marker='o', label=f'Seq Len = {seq_len}', linewidth=2)

plt.axhline(y=0.01, color='r', linestyle='--', label='Vanishing Threshold', linewidth=2)
plt.xlabel('Time Steps Backward', fontweight='bold', fontsize=12)
plt.ylabel('Gradient Magnitude', fontweight='bold', fontsize=12)
plt.title('Vanishing Gradient Problem in RNNs (W=0.5)', fontweight='bold', fontsize=14)
plt.legend()
plt.yscale('log')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("⚠️ Vanishing Gradient Problem:")
print("   • Gradients shrink exponentially")
print("   • Can't learn from distant past")
print("   • Need LSTM/GRU to solve this!")
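
The scalar simulation above is a simplification. One hedged way to see the same effect in an actual recurrence is to let autograd measure how strongly the final hidden state depends on the initial one. The sketch below uses a small untrained `nn.RNNCell` with PyTorch's default initialization. The hidden size, step counts, and zero inputs are illustrative assumptions, and the exact numbers depend on the seed.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

hidden_size = 16                          # illustrative choice
cell = nn.RNNCell(input_size=1, hidden_size=hidden_size)
h0_init = torch.randn(1, hidden_size)

def grad_norm_wrt_h0(seq_len):
    """Norm of d(sum of final hidden state) / d(initial hidden state)."""
    h0 = h0_init.clone().requires_grad_(True)
    h = h0
    for _ in range(seq_len):
        # Zero inputs, so only the recurrent path carries the signal
        h = cell(torch.zeros(1, 1), h)
    h.sum().backward()
    return h0.grad.norm().item()

steps = (1, 5, 10, 20, 40)
norms = [grad_norm_wrt_h0(T) for T in steps]
for T, g in zip(steps, norms):
    print(f"T={T:>2}  gradient norm = {g:.3e}")
```

With an untrained cell like this one, the norm typically drops by orders of magnitude as `T` grows, which mirrors the `W=0.5` curve in the plot.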

## 3️⃣ LSTM Networks

### LSTM Architecture

LSTM uses **gates** to control information flow:

1. **Forget Gate**: What to forget from cell state
2. **Input Gate**: What new info to add
3. **Output Gate**: What to output

```
f(t) = σ(W_f * [h(t-1), x(t)] + b_f)  # Forget gate
i(t) = σ(W_i * [h(t-1), x(t)] + b_i)  # Input gate
C̃(t) = tanh(W_C * [h(t-1), x(t)] + b_C)  # Candidate
C(t) = f(t) * C(t-1) + i(t) * C̃(t)  # Cell state
o(t) = σ(W_o * [h(t-1), x(t)] + b_o)  # Output gate
h(t) = o(t) * tanh(C(t))  # Hidden state
```
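
To connect these equations to PyTorch, the sketch below computes one LSTM step by hand and compares it with `nn.LSTMCell`. It relies on PyTorch's documented layout, where the four gate blocks are stacked in the order input, forget, candidate, output. The sizes and random tensors are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

input_size, hidden_size = 4, 8           # arbitrary sizes for illustration
cell = nn.LSTMCell(input_size, hidden_size)

x_t = torch.randn(2, input_size)         # batch of 2 inputs at time t
h_prev = torch.randn(2, hidden_size)     # h(t-1)
c_prev = torch.randn(2, hidden_size)     # C(t-1)

with torch.no_grad():
    # All four gate pre-activations at once, stacked as i, f, g, o
    # (g is the candidate, written C̃ in the equations above)
    gates = (x_t @ cell.weight_ih.T + cell.bias_ih
             + h_prev @ cell.weight_hh.T + cell.bias_hh)
    i, f, g, o = gates.chunk(4, dim=1)
    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
    g = torch.tanh(g)

    c_manual = f * c_prev + i * g         # C(t) = f(t) * C(t-1) + i(t) * C̃(t)
    h_manual = o * torch.tanh(c_manual)   # h(t) = o(t) * tanh(C(t))

    h_torch, c_torch = cell(x_t, (h_prev, c_prev))

print("Hidden state matches:", torch.allclose(h_manual, h_torch, atol=1e-6))
print("Cell state matches:  ", torch.allclose(c_manual, c_torch, atol=1e-6))
```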

In [None]:
# LSTM with PyTorch
class LSTMModel(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, output_size, dropout=0.2):
        """
        LSTM model for sequence prediction
        
        Args:
            input_size: Number of input features
            hidden_size: Number of LSTM units
            num_layers: Number of LSTM layers
            output_size: Number of output features
            dropout: Dropout probability
        """
        super(LSTMModel, self).__init__()
        
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        
        # LSTM layer
        self.lstm = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0
        )
        
        # Fully connected layer
        self.fc = nn.Linear(hidden_size, output_size)
    
    def forward(self, x):
        # Initialize hidden and cell states
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(x.device)
        c0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(x.device)
        
        # Forward propagate LSTM
        out, (hn, cn) = self.lstm(x, (h0, c0))
        
        # Take the output from the last time step
        out = self.fc(out[:, -1, :])
        
        return out

# Create LSTM model
lstm_model = LSTMModel(
    input_size=10,
    hidden_size=64,
    num_layers=2,
    output_size=1
)

print("🧠 LSTM Model Architecture")
print("="*60)
print(lstm_model)

# Count parameters
total_params = sum(p.numel() for p in lstm_model.parameters())
print(f"\nTotal parameters: {total_params:,}")

# Test forward pass
test_input = torch.randn(32, 20, 10)  # (batch, seq_len, features)
output = lstm_model(test_input)
print(f"\nInput shape: {test_input.shape}")
print(f"Output shape: {output.shape}")
print("\n✅ LSTM maps each input sequence to one prediction!")

## 4️⃣ Real-World Example: Predictive Maintenance for Manufacturing Copilot

### Use Case
Our **Manufacturing Copilot** needs to anticipate potential equipment failures. We will build an LSTM model that predicts a machine's next `temperature` reading from a sequence of its recent sensor readings (`temperature`, `vibration`, `pressure`, `rpm`). An accurate temperature forecast can help the copilot issue warnings before a machine overheats.

In [None]:
# Generate synthetic manufacturing sensor data
def generate_sensor_data(n_samples=1000):
    """
    Generate synthetic time series sensor data
    """
    time = np.arange(n_samples)
    
    # Base temperature with a periodic cycle (period = 100 time steps)
    base_temp = 70 + 10 * np.sin(2 * np.pi * time / 100)
    
    # Add trend (equipment degradation)
    trend = 0.01 * time
    
    # Add noise
    noise = np.random.normal(0, 2, n_samples)
    
    # Occasional spikes (anomalies)
    spikes = np.zeros(n_samples)
    spike_indices = np.random.choice(n_samples, size=20, replace=False)
    spikes[spike_indices] = np.random.uniform(5, 15, 20)
    
    # Final temperature
    temperature = base_temp + trend + noise + spikes
    
    # Additional features
    vibration = 0.5 * temperature + np.random.normal(0, 5, n_samples)
    pressure = 100 + 0.3 * temperature + np.random.normal(0, 3, n_samples)
    rpm = 1500 + 2 * temperature + np.random.normal(0, 50, n_samples)
    
    data = pd.DataFrame({
        'time': time,
        'temperature': temperature,
        'vibration': vibration,
        'pressure': pressure,
        'rpm': rpm
    })
    
    return data

# Generate data
sensor_data = generate_sensor_data(n_samples=1000)

print("🏭 Manufacturing Sensor Data")
print("="*60)
print(sensor_data.head(10))
print(f"\nShape: {sensor_data.shape}")
print(f"\nStatistics:")
print(sensor_data.describe())

In [None]:
# Visualize sensor data
fig, axes = plt.subplots(2, 2, figsize=(16, 10))

# Temperature
axes[0, 0].plot(sensor_data['time'], sensor_data['temperature'], linewidth=1.5, color='red')
axes[0, 0].set_title('Temperature Over Time', fontweight='bold', fontsize=12)
axes[0, 0].set_xlabel('Time Steps')
axes[0, 0].set_ylabel('Temperature (°F)')
axes[0, 0].grid(True, alpha=0.3)

# Vibration
axes[0, 1].plot(sensor_data['time'], sensor_data['vibration'], linewidth=1.5, color='blue')
axes[0, 1].set_title('Vibration Over Time', fontweight='bold', fontsize=12)
axes[0, 1].set_xlabel('Time Steps')
axes[0, 1].set_ylabel('Vibration (Hz)')
axes[0, 1].grid(True, alpha=0.3)

# Pressure
axes[1, 0].plot(sensor_data['time'], sensor_data['pressure'], linewidth=1.5, color='green')
axes[1, 0].set_title('Pressure Over Time', fontweight='bold', fontsize=12)
axes[1, 0].set_xlabel('Time Steps')
axes[1, 0].set_ylabel('Pressure (PSI)')
axes[1, 0].grid(True, alpha=0.3)

# RPM
axes[1, 1].plot(sensor_data['time'], sensor_data['rpm'], linewidth=1.5, color='purple')
axes[1, 1].set_title('RPM Over Time', fontweight='bold', fontsize=12)
axes[1, 1].set_xlabel('Time Steps')
axes[1, 1].set_ylabel('RPM')
axes[1, 1].grid(True, alpha=0.3)

plt.suptitle('Manufacturing Equipment Sensor Readings', fontweight='bold', fontsize=14)
plt.tight_layout()
plt.show()

In [None]:
# Prepare data for LSTM
def create_sequences(data, seq_length):
    """
    Create sequences for LSTM training
    
    Args:
        data: Input features (DataFrame or array)
        seq_length: Length of input sequences
    
    Returns:
        X: Input sequences
        y: Target values
    """
    X, y = [], []
    
    for i in range(len(data) - seq_length):
        X.append(data[i:i+seq_length])
        y.append(data[i+seq_length, 0])  # Predict temperature
    
    return np.array(X), np.array(y)

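# Normalize data (fit the scaler on the first 80% of rows only,
# so test-set statistics don't leak into training)
features = ['temperature', 'vibration', 'pressure', 'rpm']
train_rows = int(len(sensor_data) * 0.8)
scaler = MinMaxScaler()
scaler.fit(sensor_data[features].iloc[:train_rows])
scaled_data = scaler.transform(sensor_data[features])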

# Create sequences
seq_length = 20  # Use 20 time steps to predict next value
X, y = create_sequences(scaled_data, seq_length)

print("📊 Sequence Data Preparation")
print("="*60)
print(f"Sequence length: {seq_length}")
print(f"Number of features: {len(features)}")
print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False  # Don't shuffle time series!
)

print(f"\nTrain set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")

# Convert to PyTorch tensors
X_train_tensor = torch.FloatTensor(X_train)
y_train_tensor = torch.FloatTensor(y_train)
X_test_tensor = torch.FloatTensor(X_test)
y_test_tensor = torch.FloatTensor(y_test)

# Create data loaders
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

print(f"\n✅ Data ready for LSTM training!")
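
To make the sliding-window step concrete, here is a toy version of the same loop on six rows of made-up data. The array `toy` and its values are purely illustrative. Column 0 stands in for temperature, matching the target column used above.

```python
import numpy as np

# Toy data: 6 time steps, 2 features; column 0 is the target feature
toy = np.array([[t, 10 * t] for t in range(6)], dtype=float)
seq_length = 3

X_toy, y_toy = [], []
for i in range(len(toy) - seq_length):
    X_toy.append(toy[i:i + seq_length])      # rows i .. i+seq_length-1
    y_toy.append(toy[i + seq_length, 0])     # column 0 of the row right after the window
X_toy, y_toy = np.array(X_toy), np.array(y_toy)

print("X shape:", X_toy.shape)               # (windows, seq_length, features)
print("y shape:", y_toy.shape)
print("First window (col 0):", X_toy[0][:, 0], "-> target:", y_toy[0])
```

Six rows with a window of three yield three (window, target) pairs. Each target is the value that immediately follows its window.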

In [None]:
# Create and train LSTM model
model = LSTMModel(
    input_size=4,  # 4 features
    hidden_size=64,
    num_layers=2,
    output_size=1,
    dropout=0.2
).to(device)

criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
epochs = 50
train_losses = []

print("🔄 Training LSTM Model...")
print("="*60)

for epoch in range(epochs):
    model.train()
    epoch_loss = 0
    
    for batch_X, batch_y in train_loader:
        batch_X = batch_X.to(device)
        batch_y = batch_y.to(device)
        
        # Forward pass
        outputs = model(batch_X)
        loss = criterion(outputs.squeeze(-1), batch_y)  # (batch, 1) -> (batch,), safe even for a batch of 1
        
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        epoch_loss += loss.item()
    
    avg_loss = epoch_loss / len(train_loader)
    train_losses.append(avg_loss)
    
    if (epoch + 1) % 10 == 0:
        print(f"Epoch [{epoch+1}/{epochs}], Loss: {avg_loss:.6f}")

print("\n✅ Training Complete!")

In [None]:
# Plot training loss
plt.figure(figsize=(12, 5))
plt.plot(range(1, epochs+1), train_losses, linewidth=2, color='blue', marker='o', markersize=4)
plt.title('LSTM Training Loss', fontweight='bold', fontsize=14)
plt.xlabel('Epoch', fontweight='bold')
plt.ylabel('MSE Loss', fontweight='bold')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Evaluate model
model.eval()
with torch.no_grad():
    X_test_device = X_test_tensor.to(device)
    predictions = model(X_test_device).cpu().numpy().squeeze(-1)
    actuals = y_test

# Calculate metrics
mse = np.mean((predictions - actuals) ** 2)
rmse = np.sqrt(mse)
mae = np.mean(np.abs(predictions - actuals))

print("📊 Model Performance")
print("="*60)
print(f"MSE:  {mse:.6f}")
print(f"RMSE: {rmse:.6f}")
print(f"MAE:  {mae:.6f}")

# Plot predictions vs actuals
plt.figure(figsize=(16, 6))
plt.plot(range(len(actuals)), actuals, label='Actual Temperature', linewidth=2, alpha=0.7)
plt.plot(range(len(predictions)), predictions, label='Predicted Temperature', linewidth=2, alpha=0.7)
plt.title('LSTM Temperature Prediction', fontweight='bold', fontsize=14)
plt.xlabel('Time Steps', fontweight='bold')
plt.ylabel('Normalized Temperature', fontweight='bold')
plt.legend(fontsize=12)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Zoom in on a portion
zoom_start, zoom_end = 0, 100
plt.figure(figsize=(16, 6))
plt.plot(range(zoom_start, zoom_end), actuals[zoom_start:zoom_end], 
         label='Actual', linewidth=2, marker='o', markersize=4, alpha=0.7)
plt.plot(range(zoom_start, zoom_end), predictions[zoom_start:zoom_end], 
         label='Predicted', linewidth=2, marker='s', markersize=4, alpha=0.7)
plt.title('LSTM Predictions (Zoomed View)', fontweight='bold', fontsize=14)
plt.xlabel('Time Steps', fontweight='bold')
plt.ylabel('Normalized Temperature', fontweight='bold')
plt.legend(fontsize=12)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\n✅ LSTM successfully predicts equipment temperature!")
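
The metrics and plots above are in normalized units. To report predictions in the original temperature scale, one option is to invert the scaling for the temperature column only. The sketch below uses a hypothetical stand-in array `raw` instead of the notebook's data. It relies on `MinMaxScaler` computing `X_scaled = X * scale_ + min_` per column.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Hypothetical stand-in for the four sensor columns; temperature is column 0
raw = np.column_stack([
    rng.uniform(60, 100, 50),     # temperature
    rng.uniform(20, 60, 50),      # vibration
    rng.uniform(110, 140, 50),    # pressure
    rng.uniform(1500, 1800, 50),  # rpm
])

scaler = MinMaxScaler().fit(raw)
scaled_temp = scaler.transform(raw)[:, 0]   # stands in for normalized model predictions

# Invert a single column without rebuilding the full 4-feature array
temp_col = 0
restored = (scaled_temp - scaler.min_[temp_col]) / scaler.scale_[temp_col]

print("Round trip recovers original values:", np.allclose(restored, raw[:, 0]))
```

The same two attributes could be applied to `predictions` and `actuals` before plotting, so the y-axis reads in degrees instead of normalized units.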

## 5️⃣ GRU: Simplified LSTM

### GRU Architecture

GRU simplifies LSTM with fewer gates:
- **Reset Gate**: How much past to forget
- **Update Gate**: How much past to keep
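
As a rough illustration of how the two gates combine, the sketch below computes one GRU step by hand and compares it with `nn.GRUCell`. It follows PyTorch's documented formulation, where the blocks are stacked in the order reset, update, candidate, and the new state is `(1 - z) * n + z * h(t-1)`. The sizes and random tensors are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

input_size, hidden_size = 4, 8           # arbitrary sizes for illustration
cell = nn.GRUCell(input_size, hidden_size)

x_t = torch.randn(2, input_size)         # batch of 2 inputs at time t
h_prev = torch.randn(2, hidden_size)     # h(t-1)

with torch.no_grad():
    # Input and hidden contributions, each stacked as r (reset), z (update), n (candidate)
    gi = x_t @ cell.weight_ih.T + cell.bias_ih
    gh = h_prev @ cell.weight_hh.T + cell.bias_hh
    i_r, i_z, i_n = gi.chunk(3, dim=1)
    h_r, h_z, h_n = gh.chunk(3, dim=1)

    r = torch.sigmoid(i_r + h_r)          # reset gate: how much past feeds the candidate
    z = torch.sigmoid(i_z + h_z)          # update gate: how much past to keep
    n = torch.tanh(i_n + r * h_n)         # candidate state
    h_manual = (1 - z) * n + z * h_prev   # blend candidate with previous state

    h_torch = cell(x_t, h_prev)

print("Manual step matches nn.GRUCell:", torch.allclose(h_manual, h_torch, atol=1e-6))
```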

**Advantages**:
- Fewer parameters (faster training)
- Often similar performance to LSTM
- Easier to tune
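
The parameter saving can be estimated before building any model. The sketch below assumes PyTorch's layout for unidirectional `nn.LSTM` and `nn.GRU`: each layer holds an input weight, a recurrent weight, and two bias vectors, repeated once per gate block (4 for LSTM, 3 for GRU). The helper names `recurrent_params` and `with_head` are made up for this example.

```python
def recurrent_params(input_size, hidden_size, num_layers, gate_blocks):
    """Parameter count of a stacked, unidirectional LSTM (4 blocks) or GRU (3 blocks)."""
    total = 0
    for layer in range(num_layers):
        # The first layer reads the inputs; deeper layers read the previous layer's hidden state
        in_features = input_size if layer == 0 else hidden_size
        per_block = (in_features * hidden_size    # input weights
                     + hidden_size * hidden_size  # recurrent weights
                     + 2 * hidden_size)           # two bias vectors
        total += gate_blocks * per_block
    return total

def with_head(recurrent, hidden_size, output_size):
    """Add the final nn.Linear(hidden_size, output_size) layer."""
    return recurrent + hidden_size * output_size + output_size

lstm_count = with_head(recurrent_params(4, 64, 2, gate_blocks=4), 64, 1)
gru_count = with_head(recurrent_params(4, 64, 2, gate_blocks=3), 64, 1)

print(f"LSTM: {lstm_count:,}")
print(f"GRU:  {gru_count:,}")
print(f"Reduction: {100 * (1 - gru_count / lstm_count):.1f}%")
```

These counts can be checked against the `sum(p.numel() ...)` values printed for the notebook's models with the same sizes.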

In [None]:
# GRU Model
class GRUModel(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, output_size, dropout=0.2):
        super(GRUModel, self).__init__()
        
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        
        # GRU layer
        self.gru = nn.GRU(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0
        )
        
        # Fully connected layer
        self.fc = nn.Linear(hidden_size, output_size)
    
    def forward(self, x):
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(x.device)
        out, hn = self.gru(x, h0)
        out = self.fc(out[:, -1, :])
        return out

# Compare LSTM vs GRU
lstm_model = LSTMModel(input_size=4, hidden_size=64, num_layers=2, output_size=1)
gru_model = GRUModel(input_size=4, hidden_size=64, num_layers=2, output_size=1)

lstm_params = sum(p.numel() for p in lstm_model.parameters())
gru_params = sum(p.numel() for p in gru_model.parameters())

print("⚖️ LSTM vs GRU Comparison")
print("="*60)
print(f"LSTM parameters: {lstm_params:,}")
print(f"GRU parameters:  {gru_params:,}")
print(f"\nGRU has {lstm_params - gru_params:,} fewer parameters ({100*(1-gru_params/lstm_params):.1f}% reduction)")
print("\n✅ GRU is more efficient!")

## 6️⃣ Bidirectional RNNs

### Why Bidirectional?

Sometimes **future context** helps understand the past!

Example: "The animal didn't cross the street because it was too **tired**"
- Need to see "tired" to understand "it" refers to the animal

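Bidirectional RNNs process sequences in **both directions**. This requires the whole sequence to be available up front (e.g., a complete sentence). For forecasting, "future" context only means later steps inside the input window.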

In [None]:
# Bidirectional LSTM
class BiLSTM(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, output_size):
        super(BiLSTM, self).__init__()
        
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        
        # Bidirectional LSTM
        self.lstm = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            bidirectional=True  # Key difference!
        )
        
        # FC layer (hidden_size * 2 because bidirectional)
        self.fc = nn.Linear(hidden_size * 2, output_size)
    
    def forward(self, x):
        # num_directions = 2 for bidirectional
        h0 = torch.zeros(self.num_layers * 2, x.size(0), self.hidden_size).to(x.device)
        c0 = torch.zeros(self.num_layers * 2, x.size(0), self.hidden_size).to(x.device)
        
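        out, (hn, cn) = self.lstm(x, (h0, c0))
        # Concatenate the top layer's final forward (hn[-2]) and backward (hn[-1])
        # hidden states. Using out[:, -1, :] would give the backward direction
        # only one time step of context.
        out = self.fc(torch.cat((hn[-2], hn[-1]), dim=1))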
        
        return out

# Create bidirectional model
bi_lstm = BiLSTM(input_size=4, hidden_size=64, num_layers=2, output_size=1)

print("↔️ Bidirectional LSTM")
print("="*60)
print(bi_lstm)
print(f"\nParameters: {sum(p.numel() for p in bi_lstm.parameters()):,}")
print("\n✅ BiLSTM uses both past AND future context!")
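
One detail worth checking with a bidirectional `nn.LSTM` is where each direction's final state sits in the output tensor. The sketch below uses a single-layer model with arbitrary sizes. It shows that the forward direction finishes at the last time step, while the backward direction finishes at the first.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

hidden_size = 8                            # arbitrary size for illustration
lstm = nn.LSTM(input_size=4, hidden_size=hidden_size,
               batch_first=True, bidirectional=True)

x = torch.randn(3, 6, 4)                   # (batch, seq_len, features)
with torch.no_grad():
    out, (hn, cn) = lstm(x)

# out has shape (batch, seq_len, 2 * hidden_size):
# forward features in [:hidden_size], backward features in [hidden_size:]
fwd_final = out[:, -1, :hidden_size]       # forward pass has read everything at the last step
bwd_final = out[:, 0, hidden_size:]        # backward pass has read everything at the first step

print("Forward final state is hn[0]: ", torch.allclose(fwd_final, hn[0]))
print("Backward final state is hn[1]:", torch.allclose(bwd_final, hn[1]))

# A full-sequence summary combines both final states
summary = torch.cat((hn[0], hn[1]), dim=1)
print("Summary shape:", tuple(summary.shape))
```

For stacked layers, the top layer's final states are `hn[-2]` (forward) and `hn[-1]` (backward).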

## 🎉 Summary

Congratulations! You've mastered RNNs and LSTMs!

### Key Concepts
- ✅ RNN architecture and hidden states
- ✅ Vanishing gradient problem
- ✅ LSTM gates (forget, input, output)
- ✅ GRU (simplified LSTM)
- ✅ Bidirectional RNNs
- ✅ Sequence prediction

### What You Built
1. 🔄 Simple RNN from scratch
2. 🧠 LSTM temperature predictor
3. ⚡ GRU model
4. ↔️ Bidirectional LSTM

### RNN Applications
- 🏭 **Manufacturing**: Equipment monitoring, predictive maintenance
- 📈 **Finance**: Stock prediction, fraud detection
- 🎵 **Audio**: Speech recognition, music generation
- 📝 **Text**: Language modeling, translation

### Comparison Table

| Model | Parameters | Speed | Long-term Memory | Use Case |
|-------|-----------|-------|------------------|----------|
| Simple RNN | Low | Fast | Poor | Short sequences |
| LSTM | High | Slow | Excellent | Long sequences |
| GRU | Medium | Medium | Very Good | Balanced |
| BiLSTM | Highest | Slowest | Best | Full context needed |

### Next Steps
Continue to **Notebook 04: Transformers** to learn the architecture that revolutionized NLP!

<div align="center">
<b>RNNs & LSTMs mastered! Ready for Transformers! 🚀</b>
</div>