# 📝 Notebook 03: Recurrent Neural Networks (RNNs) & LSTMs

**Week 3-4: Deep Learning & NLP Foundations**  
**Gen AI Masters Program**

---

## 📋 Objectives

Welcome to the world of sequential data! So far, we've worked with data where each input is independent, like images or tabular data. But what about data where the order is crucial?
-   **Text & Language**: The order of words defines meaning. "Man bites dog" is very different from "Dog bites man."
-   **Time Series**: Stock prices, weather patterns, and sensor readings are all sequences where past values influence future ones.
-   **Audio & Video**: These are fundamentally sequences of audio samples or image frames.

This is where **Recurrent Neural Networks (RNNs)** shine. Unlike the networks we've seen before, RNNs have a "memory" in the form of a **hidden state** that allows them to persist information across time steps, making them ideal for processing sequences.

By the end of this notebook, you will be able to:
1.  **Understand the RNN Architecture**: Grasp the concept of a recurrent loop and how the hidden state acts as the network's memory.
2.  **Identify the Vanishing/Exploding Gradient Problem**: Learn about the key challenge that makes it difficult for simple RNNs to learn long-range dependencies.
3.  **Master the Long Short-Term Memory (LSTM) Cell**: Dive into the LSTM architecture, a specialized RNN that uses a system of "gates" (Forget, Input, and Output) to regulate information flow and mitigate the vanishing gradient problem.
4.  **Build an LSTM for Sentiment Analysis**: Apply your knowledge to a real-world Natural Language Processing (NLP) task. We will build and train an LSTM model from scratch using PyTorch to classify movie reviews from the IMDb dataset as positive or negative.
5.  **Implement a Text Processing Pipeline**: Learn the essential steps to prepare text data for a neural network, including tokenization, building a vocabulary, numericalization, and padding.

**Estimated Time:** 3-4 hours

---

## 📚 What are RNNs and LSTMs?

An RNN processes a sequence by iterating through its elements one by one. At each step, it takes the current input element and its hidden state from the previous step, and uses them to compute the new hidden state. This recurrent formula allows information to propagate through the sequence.
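
The recurrence described above can be sketched in a few lines of plain Python. This is a toy illustration with a scalar hidden state and hand-picked weights (`w_x`, `w_h`, and `b` are hypothetical values, not learned parameters), and it is not how PyTorch implements `nn.RNN` internally:

```python
import math

def rnn_step(x_t, h_prev, w_x=0.5, w_h=0.8, b=0.0):
    """One step of a vanilla RNN with a scalar input and a scalar hidden state."""
    # The new hidden state mixes the current input with the previous hidden state.
    return math.tanh(w_x * x_t + w_h * h_prev + b)

def run_rnn(sequence, h0=0.0):
    """Iterate over the sequence, carrying the hidden state from step to step."""
    h = h0
    hidden_states = []
    for x_t in sequence:
        h = rnn_step(x_t, h)
        hidden_states.append(h)
    return hidden_states

# The same inputs in a different order produce a different final hidden state,
# which is what makes the network sensitive to ordering.
inputs = [1.0, 0.5, -1.0]
forward = run_rnn(inputs)
backward = run_rnn(list(reversed(inputs)))
print(forward[-1], backward[-1])
```

The same weights are reused at every step; only the hidden state changes as the sequence is consumed.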

However, simple RNNs struggle to "remember" information from many steps back, because gradients tend to shrink (or blow up) as they are propagated back through many time steps. This is the vanishing/exploding gradient problem. **Long Short-Term Memory (LSTM)** networks were designed to mitigate it. LSTMs have a more complex internal structure, including a separate **cell state** and three gates that control what information is stored, updated, and read from the memory. This makes them much better than simple RNNs at capturing long-range dependencies in data.
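
For reference, the LSTM's gating mechanism is commonly written as follows. This is the standard textbook formulation, with notation chosen here for illustration: $x_t$ is the input, $h_t$ the hidden state, $c_t$ the cell state, $\sigma$ the sigmoid function, and $\odot$ element-wise multiplication. Specific implementations may arrange the weights differently.

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
g_t &= \tanh(W_g x_t + U_g h_{t-1} + b_g) && \text{(candidate cell values)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

Because $c_t$ is updated additively, gradients can flow along the cell state across many steps without passing through a squashing nonlinearity at every step, which is the usual intuition for why LSTMs handle long-range dependencies better than simple RNNs.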

Let's dive in and build one for our sentiment analysis task! 🚀

## 🚀 Agenda

Our journey into the world of recurrent networks will be structured as follows:

1.  **Setting the Stage**: We'll import the necessary libraries, including `torch` and `torchtext`, and configure our device for GPU acceleration.
2.  **Data Preparation for NLP**: We'll dive into the essential pipeline for processing text data. This is a critical skill for any NLP task.
    *   **Loading Data**: Use `torchtext` to load the IMDb movie review dataset.
    *   **Tokenization**: Break down sentences into individual words (tokens).
    *   **Building a Vocabulary**: Create a numerical mapping for all unique words.
    *   **Numericalization & Padding**: Convert sentences into integer sequences of a fixed length to be fed into the model.
3.  **Building the LSTM Model**: We'll construct our sentiment analysis model layer by layer in PyTorch.
    *   **Embedding Layer**: Learn dense vector representations for words.
    *   **LSTM Layer**: The core of our network for processing the sequence of word embeddings.
    *   **Fully Connected Layer**: The final classifier that makes the sentiment prediction.
4.  **Training the Model**: We'll define the loss function (`BCEWithLogitsLoss`) and optimizer (`Adam`), and write a complete training and evaluation loop to train the model on our data.
5.  **Using the Model for Inference**: We'll load our trained model and use it to predict the sentiment of new, unseen movie reviews.
6.  **Conceptual Extensions**: We'll briefly discuss powerful variations like **GRUs** and **Bidirectional LSTMs** to provide context for further learning.

In [None]:
# --- 1. Set up the Environment ---

# For this notebook, we'll be using `torchtext` to handle the IMDb dataset.
# `torchtext` provides convenient tools for text processing in PyTorch.
# The `portalocker` library is a dependency for `torchtext` data utilities,
# ensuring that data loading is handled correctly.
!pip install torchtext portalocker --quiet

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset, TensorDataset
from torch.nn.utils.rnn import pad_sequence

# Import torchtext components
from torchtext.datasets import IMDB
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

import numpy as np
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt
import seaborn as sns

# --- Configuration ---

# Set a seed for reproducibility
SEED = 42
torch.manual_seed(SEED)
np.random.seed(SEED)

# Set the default device
# Training on a GPU is significantly faster for deep learning models.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"✅ Using device: {device.upper()}")

# --- Plotting Style ---
plt.style.use("seaborn-v0_8-whitegrid")
plt.rcParams["figure.figsize"] = (10, 6)
sns.set_palette("colorblind")

## 💬 Part 2: Preparing the IMDb Dataset

Before we can train a model, we need to convert our raw text data into a structured, numerical format that the network can understand. This preprocessing pipeline is a fundamental part of almost any NLP project.

Here are the key steps we'll take:

1.  **Load the Data**: We'll use `torchtext` to stream the IMDb dataset, which consists of 25,000 training reviews and 25,000 testing reviews. Each review is labeled as either positive (`2`) or negative (`1`). We will map these to `1` and `0`.

2.  **Tokenization**: We'll break down each review (which is a long string) into a list of individual words or "tokens." This process, called tokenization, is the first step in making sense of the text. We'll use a basic English tokenizer that splits text by spaces and punctuation.

3.  **Build a Vocabulary**: We need to create a "vocabulary," which is a dictionary that maps every unique word in our dataset to a unique integer. This allows us to represent our sentences as sequences of numbers. To keep things manageable, we'll limit the vocabulary to `10,000` entries: the two special tokens plus the most frequent words. Words not in this vocabulary will be mapped to a special "unknown" token (`<unk>`).

4.  **Numericalization & Padding**:
    *   **Numericalization**: We'll use the vocabulary to convert our tokenized sentences into numerical sequences.
    *   **Padding/Truncation**: RNNs require the sequences within a single batch to have the same length. We will enforce a `MAX_LEN` for all sequences. Reviews longer than this will be truncated, and shorter ones will be "padded" with a special padding token (`<pad>`) until they reach the desired length.
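
Before using `torchtext`, it can help to see the four steps above on a toy example. The following is a minimal pure-Python sketch: the helper names (`tokenize`, `build_toy_vocab`, `encode`) and the whitespace tokenizer are assumptions made for illustration and are not part of `torchtext`.

```python
from collections import Counter

UNK, PAD = "<unk>", "<pad>"

def tokenize(text):
    """Very rough tokenizer: lowercase and split on whitespace (illustrative only)."""
    return text.lower().split()

def build_toy_vocab(texts, max_tokens):
    """Map the special tokens and the most frequent words to integer indices."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    # Reserve room for the two special tokens, which get indices 0 and 1.
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return {tok: idx for idx, tok in enumerate([UNK, PAD] + most_common)}

def encode(text, vocab, max_len):
    """Numericalize, then truncate or pad to exactly max_len indices."""
    ids = [vocab.get(tok, vocab[UNK]) for tok in tokenize(text)][:max_len]
    return ids + [vocab[PAD]] * (max_len - len(ids))

reviews = ["great great great movie", "boring movie"]
toy_vocab = build_toy_vocab(reviews, max_tokens=4)
encoded = [encode(r, toy_vocab, max_len=3) for r in reviews]
print(toy_vocab)  # {'<unk>': 0, '<pad>': 1, 'great': 2, 'movie': 3}
print(encoded)    # [[2, 2, 2], [0, 3, 1]]
```

The first review is truncated to three tokens, while the second is padded, and the out-of-vocabulary word "boring" maps to the `<unk>` index.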

Let's execute this pipeline. This is often the most time-consuming part of an NLP project, but `torchtext` makes it surprisingly straightforward.

In [None]:
# --- 2.1. Load Data and Define Tokenizer ---
print("Loading IMDb dataset and defining tokenizer...")
# `IMDB` returns iterators for the training and test datasets.
# An iterator is a memory-efficient way to access data one item at a time.
train_iter, test_iter = IMDB(split=('train', 'test'))
# We'll use a basic English tokenizer from torchtext that splits sentences by spaces and punctuation.
tokenizer = get_tokenizer('basic_english')

# --- 2.2. Build the Vocabulary ---
# The vocabulary maps words to integer indices. We build it from the training data.
VOCAB_SIZE = 10000  # We will only consider the top 10,000 most frequent words.
UNK_TOKEN = "<unk>"   # Special token for unknown words (words not in our vocabulary).
PAD_TOKEN = "<pad>"   # Special token for padding sequences to a fixed length.

def yield_tokens(data_iter):
    """
    A helper generator function to yield tokens from the raw dataset iterator.
    This is passed to the vocabulary builder.
    """
    for _, text in data_iter:
        yield tokenizer(text)

# Create a fresh training iterator so that `yield_tokens` starts from the first review.
train_iter, _ = IMDB(split=('train', 'test'))

print(f"Building vocabulary with top {VOCAB_SIZE} words...")
# `build_vocab_from_iterator` handles the process of counting word frequencies
# and creating the integer mapping.
vocab = build_vocab_from_iterator(
    yield_tokens(train_iter),
    specials=[UNK_TOKEN, PAD_TOKEN], # Add our special tokens to the vocabulary.
    max_tokens=VOCAB_SIZE,           # Limit the vocabulary size.
    special_first=True               # Ensure special tokens get indices 0 and 1.
)
# Set the default index for unknown words. If the vocabulary encounters a word
# it hasn't seen, it will be mapped to the index of our UNK_TOKEN.
vocab.set_default_index(vocab[UNK_TOKEN])

pad_index = vocab[PAD_TOKEN] # Get the integer index for the padding token.

print(f"✅ Vocabulary created. Size: {len(vocab)}")
print(f"   - Index for '<unk>': {vocab[UNK_TOKEN]}")
print(f"   - Index for '<pad>': {vocab[PAD_TOKEN]}")
print(f"   - Example mapping: 'hello' -> {vocab['hello']}")


# --- 2.3. Define the Data Processing and Batching Pipeline ---
MAX_LEN = 250    # Maximum sequence length. Longer reviews will be truncated, shorter ones padded.
BATCH_SIZE = 64

def create_dataloader(data_iter, is_train=True):
    """
    A function to convert the raw data iterator into a batched and padded DataLoader.
    """
    all_texts, all_labels = [], []
    
    print(f"Processing data... (This may take a moment)")
    for label, text in tqdm(data_iter, desc="Numericalizing and Padding"):
        # Convert text to a list of integer indices using the vocabulary.
        numericalized_text = vocab(tokenizer(text))
        
        # Truncate the sequence if it's longer than MAX_LEN.
        if len(numericalized_text) > MAX_LEN:
            numericalized_text = numericalized_text[:MAX_LEN]
            
        all_texts.append(torch.tensor(numericalized_text, dtype=torch.int64))
        
        # The original labels are 1 (neg) and 2 (pos). We map them to 0 and 1.
        all_labels.append(label - 1)

    # `pad_sequence` takes a list of tensors of different lengths and stacks them
    # into a single tensor, padding each one to the length of the longest sequence
    # in the list. Because we truncated above, that length is at most MAX_LEN.
    # `batch_first=True` makes the output shape [num_reviews, sequence_length].
    padded_texts = pad_sequence(all_texts, batch_first=True, padding_value=pad_index)
    
    # Create a standard PyTorch TensorDataset and DataLoader.
    dataset = TensorDataset(padded_texts, torch.tensor(all_labels, dtype=torch.float32))
    return DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=is_train)

# --- Create the DataLoaders ---
# We need to get fresh iterators since they get consumed during processing.
train_iter, test_iter = IMDB(split=('train', 'test'))

train_dataloader = create_dataloader(train_iter, is_train=True)
test_dataloader = create_dataloader(test_iter, is_train=False)

print("\n✅ Data preparation complete.")
print(f"   - Number of training batches: {len(train_dataloader)}")
print(f"   - Number of testing batches:  {len(test_dataloader)}")

# --- Inspect a Single Batch ---
texts, labels = next(iter(train_dataloader))
print(f"\nShape of a single batch of texts: {texts.shape}")
print(f"Shape of a single batch of labels: {labels.shape}")
print(f"Data type of texts: {texts.dtype}")
print(f"Data type of labels: {labels.dtype}")
print("\nThe first review in the batch (as integers):")
print(texts[0])

## 🧠 Part 3: Building the LSTM Model

Now for the exciting part! We'll construct our sentiment analysis model using an **LSTM**. Our model will be a `nn.Module` subclass, and its architecture will consist of three main layers:

1.  **Embedding Layer (`nn.Embedding`):**
    *   This is the first layer in most neural NLP models. Its job is to transform the integer indices (our numericalized words) into dense vectors of a fixed size (`embedding_dim`).
    *   These vectors are called **word embeddings**. Unlike sparse one-hot encodings, embeddings are dense and can capture semantic relationships between words (e.g., the vectors for "king" and "queen" might be very close to each other in the embedding space).
    *   Most importantly, these embedding vectors are **learnable parameters**. The network will learn the optimal representation for each word in our vocabulary to best solve the sentiment analysis task. It starts with random vectors and adjusts them during training via backpropagation.

2.  **LSTM Layer (`nn.LSTM`):**
    *   This is the recurrent core of our model. It will process the sequence of embedding vectors one by one.
    *   It maintains a **hidden state** and a **cell state**, which are updated at each time step, allowing it to remember information from earlier in the sequence.
    *   We'll configure it with a few key parameters:
        *   `input_size`: The size of each input vector (our `embedding_dim`).
        *   `hidden_size`: The dimension of the hidden and cell states.
        *   `num_layers`: We can stack multiple LSTM layers on top of each other to create a deeper model, which can learn more complex patterns.
        *   `batch_first=True`: This tells the layer to expect input tensors with the shape `[batch_size, seq_len, feature_dim]`, which matches our `DataLoader`'s output.
        *   `dropout`: We'll add dropout between the LSTM layers (if `num_layers > 1`) to help prevent overfitting.

3.  **Fully Connected Layer (`nn.Linear`):**
    *   The LSTM layer outputs a sequence of hidden states, one for each time step. For our classification task, we are primarily interested in the **final hidden state** of the last layer, as it represents a summary of the entire review.
    *   We'll pass this final hidden state to a linear layer, which will map it from the `hidden_dim` to a single output value (`output_dim=1`). This single value is our raw prediction, or "logit."

The output logit will then be passed to our loss function (`BCEWithLogitsLoss`), which will internally apply a sigmoid function to get a probability and calculate the loss.
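
To make the flow of tensor shapes through the three layers concrete, here is a small sketch with toy sizes. The dimensions below are arbitrary values picked for illustration, not the hyperparameters used later in this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes, chosen only for illustration.
vocab_size, embedding_dim, hidden_dim, n_layers = 20, 8, 16, 2
batch_size, seq_len = 4, 6

embedding = nn.Embedding(vocab_size, embedding_dim)
lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=n_layers, batch_first=True)
fc = nn.Linear(hidden_dim, 1)

token_ids = torch.randint(0, vocab_size, (batch_size, seq_len))  # [4, 6]
embedded = embedding(token_ids)                                  # [4, 6, 8]
lstm_out, (hidden, cell) = lstm(embedded)                        # lstm_out: [4, 6, 16]
logits = fc(hidden[-1]).squeeze(1)                               # [4]

print(embedded.shape, lstm_out.shape, hidden.shape, logits.shape)

# For a unidirectional LSTM, the last time step of `lstm_out` is the
# final hidden state of the top layer.
print(torch.allclose(lstm_out[:, -1, :], hidden[-1]))
```

Note that `hidden` holds one final state per layer, which is why the classifier reads `hidden[-1]` (the top layer) rather than the whole tensor.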

In [None]:
# --- 3. Define the LSTM Model ---

class SentimentLSTM(nn.Module):
    """
    An LSTM-based neural network for sentiment classification.
    """
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, dropout):
        """
        Initializes the layers of the model.
        
        Args:
            vocab_size (int): The size of the vocabulary.
            embedding_dim (int): The dimension of the word embeddings.
            hidden_dim (int): The size of the LSTM's hidden state.
            output_dim (int): The dimension of the output (1 for binary classification).
            n_layers (int): The number of stacked LSTM layers.
            dropout (float): The dropout probability for regularization.
        """
        super().__init__()
        
        # 1. Embedding Layer
        # This layer maps integer word indices to dense, learnable vectors.
        # `padding_idx` tells the layer to ignore the padding token during training
        # and not update its embedding.
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_index)
        
        # 2. LSTM Layer
        # This is the recurrent core that processes the sequence of embeddings.
        self.lstm = nn.LSTM(
            input_size=embedding_dim,
            hidden_size=hidden_dim,
            num_layers=n_layers,
            dropout=dropout if n_layers > 1 else 0, # Dropout is only applied between LSTM layers.
            batch_first=True  # Crucial: ensures input format is [batch, seq_len, features].
        )
        
        # 3. Dropout Layer
        # A general dropout layer applied to the LSTM's output for further regularization.
        self.dropout = nn.Dropout(dropout)
        
        # 4. Fully Connected (Linear) Layer
        # This layer maps the final LSTM hidden state to the desired output dimension.
        self.fc = nn.Linear(hidden_dim, output_dim)
        
    def forward(self, text):
        """
        Defines the forward pass of the model.
        
        Args:
            text (Tensor): A batch of text sequences, shape [batch_size, seq_len].
        
        Returns:
            Tensor: The model's raw output (logits), shape [batch_size].
        """
        # text shape: [batch_size, seq_len]
        
        # Pass text through the embedding layer.
        embedded = self.embedding(text)
        # embedded shape: [batch_size, seq_len, embedding_dim]
        
        # Pass the embeddings through the LSTM.
        # `lstm_out` contains the top layer's hidden state for every time step.
        # `hidden` and `cell` are the final hidden and cell states of every layer.
        # lstm_out shape: [batch_size, seq_len, hidden_dim]
        # hidden shape: [n_layers, batch_size, hidden_dim]; cell has the same shape.
        lstm_out, (hidden, cell) = self.lstm(embedded)
        
        # We are interested in the final hidden state of the *last* layer, as it summarizes the whole sequence.
        # `hidden` has shape [n_layers, batch_size, hidden_dim], so we take the last layer's state with `hidden[-1]`.
        final_hidden_state = self.dropout(hidden[-1])
        # final_hidden_state shape: [batch_size, hidden_dim]
        
        # Pass the final hidden state through the fully connected layer to get the final prediction.
        output = self.fc(final_hidden_state)
        # output shape: [batch_size, output_dim]
        
        # We squeeze the output to remove the last dimension, resulting in a shape of [batch_size].
        return output.squeeze(1)

# --- Instantiate the Model ---
# Define the hyperparameters for our model. These values are a good starting point.
EMBEDDING_DIM = 100   # The size of the dense word vectors.
HIDDEN_DIM = 256      # The number of features in the LSTM's hidden state.
OUTPUT_DIM = 1        # The output dimension (1 for binary classification: a single logit).
N_LAYERS = 2          # The number of stacked LSTM layers.
DROPOUT = 0.5         # The dropout rate for regularization.

# Create an instance of our LSTM model.
model = SentimentLSTM(
    vocab_size=len(vocab),
    embedding_dim=EMBEDDING_DIM,
    hidden_dim=HIDDEN_DIM,
    output_dim=OUTPUT_DIM,
    n_layers=N_LAYERS,
    dropout=DROPOUT
).to(device) # Move the model to the configured device (GPU or CPU).

print("✅ LSTM Model created and moved to device.")

# --- Count Model Parameters ---
def count_parameters(model):
    """A helper function to count the number of trainable parameters in a model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'\nThe model has {count_parameters(model):,} trainable parameters.')
print("\nModel Architecture:")
print(model)

## 🚀 Part 4: Training the Model

With our data prepared and our model built, it's time to start the training process. This involves defining our optimization strategy and then creating a loop to feed data to the model and update its weights.

Here's our game plan:

1.  **Define the Optimizer:** We'll use the **Adam** optimizer, a robust and popular choice for training deep learning models. It will be responsible for updating our model's weights based on the calculated gradients, using an adaptive learning rate.

2.  **Choose the Loss Function:** Since our output is a single logit for a binary classification problem, a natural choice is **`BCEWithLogitsLoss`**. It combines a `Sigmoid` activation and the Binary Cross-Entropy loss in a single, numerically stable function, and it takes the raw, un-squashed outputs (logits) from our model, which is exactly what our `SentimentLSTM` provides.

3.  **Create the Training Loop:** We'll loop through our training data for a set number of **epochs**. In each epoch, we will:
    *   Iterate through the batches of data in our `train_dataloader`.
    *   For each batch, perform the core training steps:
        *   Set the model to training mode (`model.train()`).
        *   Clear any old gradients from the last step (`optimizer.zero_grad()`).
        *   Move the data batch to the correct device (GPU/CPU).
        *   Make a **forward pass**: feed the batch to the model to get predictions (logits).
        *   Calculate the **loss**: compare the model's predictions with the true labels using our criterion.
        *   Perform a **backward pass**: compute the gradients of the loss with respect to all model parameters (`loss.backward()`).
        *   **Update the weights**: instruct the optimizer to take a step, updating the weights based on the computed gradients (`optimizer.step()`).
        *   Calculate the accuracy for the batch to monitor performance as we train.

4.  **Create the Evaluation Loop:** After each epoch of training, we'll evaluate our model's performance on the unseen test data. This is crucial for checking how well the model is generalizing and for spotting overfitting. The steps are similar to the training loop but without any gradient computation or weight updates.
    *   Set the model to evaluation mode (`model.eval()`).
    *   Wrap the entire loop in a `with torch.no_grad():` block to disable gradient calculation, which saves memory and computation.
    *   Calculate the average loss and accuracy over the entire test set.
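
To see why a combined sigmoid-plus-cross-entropy loss is described as numerically stable, the following pure-Python sketch compares a naive two-step computation with an algebraically equivalent stable form. The function names are hypothetical, and the stable formula shown is one common way to write it, not necessarily the exact code path PyTorch uses.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_naive(logit, target):
    """Sigmoid followed by binary cross-entropy, computed in two separate steps."""
    p = sigmoid(logit)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

def bce_with_logits(logit, target):
    """Algebraically equivalent form that never evaluates log(0)."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

# For moderate logits the two versions agree.
for logit, target in [(2.0, 1.0), (-1.5, 1.0), (0.3, 0.0)]:
    print(logit, target, bce_naive(logit, target), bce_with_logits(logit, target))

# With an extreme logit the naive version breaks down (sigmoid rounds to 1.0,
# so log(1 - p) would be log(0)), while the stable form stays finite.
print(sigmoid(50.0), bce_with_logits(50.0, 0.0))
```

A confidently wrong prediction is exactly where the loss matters most, so a formulation that stays finite for large logits is the safer default.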

Let's write the code to bring this to life.

In [None]:
# --- 4.1. Define Optimizer, Loss Function, and Accuracy Metric ---

# We'll use the Adam optimizer, a popular and effective default choice.
optimizer = optim.Adam(model.parameters())

# We use BCEWithLogitsLoss, which combines a Sigmoid layer and BCELoss in one.
# This is numerically more stable than using a plain Sigmoid followed by BCELoss.
# It expects the raw logits from our model as input.
criterion = nn.BCEWithLogitsLoss().to(device)

def binary_accuracy(preds, y):
    """
    Calculates accuracy for binary classification.
    
    Args:
        preds (Tensor): The raw logits from the model.
        y (Tensor): The true labels (0 or 1).
        
    Returns:
        Tensor: The accuracy as a fraction between 0 and 1 (a 0-dim tensor).
    """
    # Apply sigmoid to get probabilities and round to get predictions (0 or 1).
    rounded_preds = torch.round(torch.sigmoid(preds))
    # Check if predictions are equal to the true labels.
    correct = (rounded_preds == y).float()
    # Calculate the mean accuracy.
    acc = correct.sum() / len(correct)
    return acc

# --- 4.2. Define Training and Evaluation Functions ---

def train_one_epoch(model, iterator, optimizer, criterion):
    """Defines a single training epoch."""
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()  # Set the model to training mode.
    
    for batch_texts, batch_labels in tqdm(iterator, desc="Training"):
        # Move data to the correct device.
        batch_texts, batch_labels = batch_texts.to(device), batch_labels.to(device)
        
        optimizer.zero_grad()  # Clear old gradients.
        
        # Forward pass.
        predictions = model(batch_texts)
        
        # Calculate loss and accuracy.
        loss = criterion(predictions, batch_labels)
        acc = binary_accuracy(predictions, batch_labels)
        
        # Backward pass and optimization.
        loss.backward()
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

def evaluate_one_epoch(model, iterator, criterion):
    """Defines the evaluation loop."""
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()  # Set the model to evaluation mode.
    
    with torch.no_grad():  # Disable gradient calculations for efficiency.
        for batch_texts, batch_labels in tqdm(iterator, desc="Evaluating"):
            batch_texts, batch_labels = batch_texts.to(device), batch_labels.to(device)
            
            predictions = model(batch_texts)
            
            loss = criterion(predictions, batch_labels)
            acc = binary_accuracy(predictions, batch_labels)
            
            epoch_loss += loss.item()
            epoch_acc += acc.item()
            
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

# --- 4.3. Run the Training Loop ---
N_EPOCHS = 5
best_valid_loss = float('inf')

history = {'train_loss': [], 'train_acc': [], 'val_loss': [], 'val_acc': []}

print("🚀 Starting training...")

for epoch in range(N_EPOCHS):
    
    # Train the model for one epoch.
    train_loss, train_acc = train_one_epoch(model, train_dataloader, optimizer, criterion)
    
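    # Evaluate on the test set, which doubles as our validation set in this notebook.
    # (Picking the best checkpoint on test data makes the reported accuracy optimistic.)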
    valid_loss, valid_acc = evaluate_one_epoch(model, test_dataloader, criterion)
    
    # Store history.
    history['train_loss'].append(train_loss)
    history['train_acc'].append(train_acc)
    history['val_loss'].append(valid_loss)
    history['val_acc'].append(valid_acc)
    
    # Check if this is the best model we've seen so far based on validation loss.
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        # Save the model's state dictionary. This saves the learned weights.
        torch.save(model.state_dict(), 'best_model.pt')
    
    print(f'\nEpoch: {epoch+1:02}')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

print("\n✅ Training complete.")
print(f"Best validation loss: {best_valid_loss:.3f}")
print("The best model has been saved to 'best_model.pt'")

## 🎬 Part 5: Using the Model for Inference

Now that we have a trained model, let's use it for its intended purpose: predicting the sentiment of new, unseen sentences. We'll create a function that encapsulates the entire inference pipeline, from raw text to a sentiment prediction.

The steps for inference must mirror the preprocessing steps used for training:
1.  **Load the Best Model:** We'll load the weights from `best_model.pt`, which we saved during training as it had the lowest validation loss.
2.  **Set to Evaluation Mode:** It's crucial to call `model.eval()` so that dropout (and any other layer that behaves differently during training) switches to its inference behavior.
3.  **Preprocessing the Input Sentence:** The new sentence must be processed in the exact same way as the training data. This includes:
    *   **Tokenizing** the sentence into a list of words.
    *   **Numericalizing** the tokens using our existing vocabulary to get a list of integer indices.
    *   **Converting** the list of indices to a PyTorch tensor.
    *   **Adding a "batch" dimension**, as the model was trained on batches of sentences and therefore expects a batch dimension (e.g., `[1, seq_len]`).
4.  **Prediction:**
    *   Pass the processed tensor through the model within a `torch.no_grad()` block.
    *   Apply the **sigmoid function** to the output logit to get a probability score between 0 and 1.
    *   Interpret the score to classify the sentiment as "Positive" or "Negative" based on a 0.5 threshold.

In [None]:
# --- 5.1. Load the Best Performing Model ---
print("Loading the best model from 'best_model.pt'...")
# We load the saved weights into the existing `model` instance.
# `map_location=device` lets the checkpoint load even if it was saved on a different device.
model.load_state_dict(torch.load('best_model.pt', map_location=device))
model.to(device) # Ensure the model is on the correct device.
print("✅ Model loaded successfully.")

# --- 5.2. Create the Prediction Function ---
def predict_sentiment(sentence):
    """
    Predicts the sentiment of a given sentence using the trained model.
    
    Args:
        sentence (str): The input sentence (e.g., a movie review).
        
    Returns:
        tuple: A tuple containing the predicted sentiment ("Positive" or "Negative")
               and the raw probability score.
    """
    model.eval()  # Set the model to evaluation mode (disables dropout, etc.).
    
    # 1. Tokenize and Numericalize the input sentence.
    tokenized = tokenizer(sentence)
    indexed = vocab(tokenized)
    
    # 2. Convert to a Tensor and Add the Batch Dimension.
    # The model expects a batch of data, so we use `unsqueeze(0)` to add a
    # batch dimension of size 1, changing the shape from [seq_len] to [1, seq_len].
    tensor = torch.LongTensor(indexed).unsqueeze(0).to(device)
    
    # 3. Get the Prediction from the Model.
    # We use `torch.no_grad()` to ensure no gradients are calculated.
    with torch.no_grad():
        prediction = model(tensor)
    
    # 4. Convert the Raw Logit to a Probability.
    # The model outputs a raw logit, which we pass through a sigmoid function
    # to get a probability score between 0 and 1.
    probability = torch.sigmoid(prediction).item()
    
    # 5. Classify and Return the Result.
    sentiment = "Positive" if probability > 0.5 else "Negative"
    return sentiment, probability

# --- 5.3. Test with Example Sentences ---
print("\n--- Testing with a positive review ---")
positive_review = "This movie was fantastic! The acting was superb and the plot was gripping. I would watch it again in a heartbeat."
sentiment, prob = predict_sentiment(positive_review)
print(f"Sentence:  '{positive_review}'")
print(f"Sentiment: {sentiment}")
print(f"Probability (Positive): {prob:.4f}\n")

print("--- Testing with a negative review ---")
negative_review = "I was really disappointed with this film. It was boring, the story made no sense, and I almost fell asleep."
sentiment, prob = predict_sentiment(negative_review)
print(f"Sentence:  '{negative_review}'")
print(f"Sentiment: {sentiment}")
print(f"Probability (Positive): {prob:.4f}\n")

print("--- Testing with a neutral/ambiguous review ---")
ambiguous_review = "The movie was okay, I guess. Some parts were good, others not so much. I'm not sure if I would recommend it."
sentiment, prob = predict_sentiment(ambiguous_review)
print(f"Sentence:  '{ambiguous_review}'")
print(f"Sentiment: {sentiment}")
print(f"Probability (Positive): {prob:.4f}")

## 💡 Part 6: Conceptual Extensions - GRUs and Bidirectional RNNs

While our LSTM model is very powerful, it's helpful to be aware of a few common and powerful variations in the world of recurrent networks.

### 1. Gated Recurrent Unit (GRU)

The **GRU** is a newer and slightly simpler alternative to the LSTM. It was designed to solve the same vanishing gradient problem but with a more streamlined architecture.

-   **Fewer Gates**: A GRU has only two gates (a **Reset Gate** and an **Update Gate**), whereas an LSTM has three (Forget, Input, Output).
-   **Combined Cell and Hidden State**: It doesn't have a separate cell state like the LSTM; it only uses the hidden state to transfer information.

**Advantages**:
-   **Fewer Parameters**: Because it's simpler, a GRU has fewer weights to train. This can make it **faster** and more computationally efficient.
-   **Similar Performance**: For many tasks, GRUs perform just as well as LSTMs, and their simplicity can make them easier to work with. They are a very popular choice in modern NLP.
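
The "fewer parameters" claim can be checked with simple arithmetic. The sketch below assumes PyTorch's parameter layout for unidirectional `nn.LSTM` and `nn.GRU` layers with biases enabled: each layer has an input-to-hidden matrix, a hidden-to-hidden matrix, and two bias vectors, stacked once per gate-like block. The helper `recurrent_params` is hypothetical and counts only the recurrent layers, not any classifier on top.

```python
def recurrent_params(input_size, hidden_size, num_layers, n_gate_blocks):
    """Count weights and biases in a stacked recurrent layer.

    `n_gate_blocks` is 4 for an LSTM (three gates plus the candidate cell values)
    and 3 for a GRU (two gates plus the candidate hidden state).
    """
    total = 0
    layer_input = input_size
    for _ in range(num_layers):
        per_block = hidden_size * (layer_input + hidden_size) + 2 * hidden_size
        total += n_gate_blocks * per_block
        layer_input = hidden_size  # deeper layers read the output of the layer below
    return total

lstm_count = recurrent_params(input_size=4, hidden_size=64, num_layers=2, n_gate_blocks=4)
gru_count = recurrent_params(input_size=4, hidden_size=64, num_layers=2, n_gate_blocks=3)
print(lstm_count, gru_count, gru_count / lstm_count)  # 51200 38400 0.75
```

Under these assumptions a GRU always has three quarters of the recurrent parameters of an LSTM with the same sizes.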

### 2. Bidirectional RNNs

When processing a sequence, our standard LSTM only looks at the past and present—the hidden state at time `t` only contains information from time steps `1, 2, ..., t`.

But what if future context is also important? Consider the sentence:
> "The man who was previously sitting next to the window **stood** up."

To interpret the verb "stood," the model benefits from what comes *after* it ("up") as well as from the subject that came before.

A **Bidirectional RNN** addresses this by processing the sequence in two directions at once:
1.  A **forward RNN** reads the sequence from left to right.
2.  A **backward RNN** reads the sequence from right to left.

The hidden states from both directions are then concatenated at each time step. This gives the model a richer, more complete understanding of the context surrounding every word in the sequence.

**When to use it?** Bidirectional models are the standard for many NLP tasks like sentiment analysis, named entity recognition, and question answering, where the full context of a word is crucial for making a correct prediction. You can easily make a PyTorch LSTM or GRU bidirectional by setting the `bidirectional=True` flag.
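
The concatenation is easiest to see in the tensor shapes. The sketch below runs a single-layer bidirectional `nn.LSTM` on random data (sizes chosen arbitrarily) and checks where each direction's final state lives in the output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_size = 8
lstm = nn.LSTM(input_size=4, hidden_size=hidden_size,
               batch_first=True, bidirectional=True)

x = torch.randn(2, 5, 4)  # (batch, seq_len, features)
with torch.no_grad():
    out, (hn, cn) = lstm(x)

print(out.shape)  # torch.Size([2, 5, 16]): forward and backward halves concatenated
print(hn.shape)   # torch.Size([2, 2, 8]): (num_layers * 2, batch, hidden)

# The forward direction finishes at the LAST time step...
forward_final = out[:, -1, :hidden_size]
# ...while the backward direction finishes at the FIRST time step.
backward_final = out[:, 0, hidden_size:]

# One vector per sequence that has seen every token in both directions.
sequence_summary = torch.cat((hn[0], hn[1]), dim=1)
print(sequence_summary.shape)  # torch.Size([2, 16])
```

A practical consequence: in `out[:, -1, :]` the backward half has only seen the final token, so for classification it is usually better to combine the two final hidden states from `hn`.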

In [None]:
# GRU Model
class GRUModel(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, output_size, dropout=0.2):
        super(GRUModel, self).__init__()
        
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        
        # GRU layer
        self.gru = nn.GRU(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0
        )
        
        # Fully connected layer
        self.fc = nn.Linear(hidden_size, output_size)
    
    def forward(self, x):
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(x.device)
        out, hn = self.gru(x, h0)
        out = self.fc(out[:, -1, :])
        return out

# Compare LSTM vs GRU
lstm_model = LSTMModel(input_size=4, hidden_size=64, num_layers=2, output_size=1)
gru_model = GRUModel(input_size=4, hidden_size=64, num_layers=2, output_size=1)

lstm_params = sum(p.numel() for p in lstm_model.parameters())
gru_params = sum(p.numel() for p in gru_model.parameters())

print("⚖️ LSTM vs GRU Comparison")
print("="*60)
print(f"LSTM parameters: {lstm_params:,}")
print(f"GRU parameters:  {gru_params:,}")
print(f"\nGRU has {lstm_params - gru_params:,} fewer parameters ({100*(1-gru_params/lstm_params):.1f}% reduction)")
print("\n✅ GRU needs fewer parameters for the same hidden size!")

## 6️⃣ Bidirectional RNNs

### Why Bidirectional?

Sometimes **future context** is needed to interpret an earlier word.

Example: "The animal didn't cross the street because it was too **tired**."
- The model has to see "tired" before it can resolve that "it" refers to the animal rather than the street.

Bidirectional RNNs process sequences in **both directions**.
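
A small experiment makes the difference concrete. The sketch below changes only the last time step of a random input and checks whether the representation of the *first* time step reacts. The sizes and the `+ 1.0` perturbation are arbitrary assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(1, 5, 4)       # (batch, seq_len, features)
x_changed = x.clone()
x_changed[:, -1, :] += 1.0     # alter only the final time step

uni = nn.LSTM(input_size=4, hidden_size=8, batch_first=True)
bi = nn.LSTM(input_size=4, hidden_size=8, batch_first=True, bidirectional=True)

with torch.no_grad():
    uni_a, _ = uni(x)
    uni_b, _ = uni(x_changed)
    bi_a, _ = bi(x)
    bi_b, _ = bi(x_changed)

# Did the output at time step 0 change when a later step changed?
uni_first_changed = not torch.allclose(uni_a[:, 0], uni_b[:, 0])
bi_first_changed = not torch.allclose(bi_a[:, 0], bi_b[:, 0])

print("Unidirectional, step 0 affected by a later step:", uni_first_changed)
print("Bidirectional,  step 0 affected by a later step:", bi_first_changed)
```

The unidirectional model cannot react, because step 0 is computed before the changed input is ever seen. In the bidirectional model, the backward pass carries that change all the way to step 0.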

In [None]:
# Bidirectional LSTM
class BiLSTM(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, output_size):
        super(BiLSTM, self).__init__()
        
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        
        # Bidirectional LSTM
        self.lstm = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            bidirectional=True  # Key difference!
        )
        
        # FC layer (hidden_size * 2 because bidirectional)
        self.fc = nn.Linear(hidden_size * 2, output_size)
    
    def forward(self, x):
        # num_directions = 2 for bidirectional
        h0 = torch.zeros(self.num_layers * 2, x.size(0), self.hidden_size).to(x.device)
        c0 = torch.zeros(self.num_layers * 2, x.size(0), self.hidden_size).to(x.device)
        
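        out, (hn, cn) = self.lstm(x, (h0, c0))
        # out[:, -1, :] would pair a forward state that has seen the whole sequence
        # with a backward state that has seen only the last step. Use the top layer's
        # final hidden states instead: hn[-2] is forward, hn[-1] is backward.
        out = self.fc(torch.cat((hn[-2], hn[-1]), dim=1))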
        
        return out

# Create bidirectional model
bi_lstm = BiLSTM(input_size=4, hidden_size=64, num_layers=2, output_size=1)

print("↔️ Bidirectional LSTM")
print("="*60)
print(bi_lstm)
print(f"\nParameters: {sum(p.numel() for p in bi_lstm.parameters()):,}")
print("\n✅ BiLSTM uses both past AND future context!")

## 🎉 Summary

Congratulations! You've mastered RNNs and LSTMs!

### Key Concepts
- ✅ RNN architecture and hidden states
- ✅ Vanishing gradient problem
- ✅ LSTM gates (forget, input, output)
- ✅ GRU (simplified LSTM)
- ✅ Bidirectional RNNs
- ✅ Sequence prediction

### What You Built
1. 🔄 Simple RNN from scratch
2. 🧠 LSTM temperature predictor
3. ⚡ GRU model
4. ↔️ Bidirectional LSTM

### RNN Applications
- 🏭 **Manufacturing**: Equipment monitoring, predictive maintenance
- 📈 **Finance**: Stock prediction, fraud detection
- 🎵 **Audio**: Speech recognition, music generation
- 📝 **Text**: Language modeling, translation

### Comparison Table

| Model | Parameters | Speed | Long-term Memory | Use Case |
|-------|-----------|-------|------------------|----------|
| Simple RNN | Low | Fast | Poor | Short sequences |
| LSTM | High | Slow | Excellent | Long sequences |
| GRU | Medium | Medium | Very Good | Balanced |
| BiLSTM | Highest | Slowest | Excellent, in both directions | Full sequence available up front |

### Next Steps
Continue to **Notebook 04: Transformers** to learn the architecture that revolutionized NLP!

<div align="center">
<b>RNNs & LSTMs mastered! Ready for Transformers! 🚀</b>
</div>