# ✨ Notebook 04: The Transformer Architecture

**Week 3-4: Deep Learning & NLP Foundations**  
**Gen AI Masters Program**

---

## 📋 Objectives

Welcome to one of the most significant breakthroughs in the history of deep learning: the **Transformer architecture**. Introduced in the 2017 paper "Attention Is All You Need," Transformers have completely revolutionized Natural Language Processing (NLP) and now form the backbone of modern Large Language Models (LLMs) like GPT, BERT, and T5.

In the previous notebook, we saw how RNNs and LSTMs process sequences one token at a time. While effective, this sequential nature has two major drawbacks:
1.  **It's Slow**: The computation for step `t` depends on the output of step `t-1`, so the time steps of a sequence **cannot be processed in parallel**. This becomes a major bottleneck for very long sequences.
2.  **Long-Range Dependencies**: While LSTMs were designed to handle long-range dependencies, they still struggle to maintain context over extremely long distances. Information can get diluted as it passes through the recurrent chain.

Transformers solve these problems by getting rid of recurrence entirely and relying on a powerful mechanism called **self-attention**.

By the end of this notebook, you will be able to:
1.  **Understand the Core Concept of Self-Attention**: Grasp the intuition behind how self-attention allows the model to weigh the importance of all other words in a sequence when processing a specific word, looking at the entire sentence at once.
2.  **Implement Positional Encodings**: Understand why positional information is critical in a non-recurrent model and implement the sine/cosine positional encoding scheme from the original paper.
3.  **Build a Transformer Encoder**: Construct a Transformer-based classifier in PyTorch, assembling it from the built-in `nn.TransformerEncoderLayer` and `nn.TransformerEncoder` modules.
4.  **Apply Transformers to a Practical Task**: Apply your Transformer model to the same **IMDb sentiment classification task** from the previous notebook, allowing for a direct comparison between the LSTM and Transformer approaches.

**Estimated Time:** 3-4 hours

---

## 📚 What is a Transformer?

A Transformer is a neural network architecture that, like an RNN, is designed to handle sequential data. However, it processes the entire sequence at once rather than one element at a time.

The key innovation is the **self-attention mechanism**. For any given word, self-attention allows the model to directly look at and draw information from all other words in the sequence. It calculates "attention scores" that determine how much focus to place on other words when representing the current word. This allows it to build highly contextualized representations and capture complex relationships, no matter how far apart the words are.
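
To make the idea of attention scores concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. The projection matrices `W_q`, `W_k`, `W_v` and the toy sizes are illustrative assumptions (random rather than learned); the PyTorch layers used later in this notebook add multiple heads, dropout, and trained weights.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over one sequence X of shape [seq_len, d_model]."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # [seq_len, seq_len] attention scores
    weights = softmax(scores, axis=-1)   # each row is a distribution over all tokens
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
X = rng.normal(size=(seq_len, d_model))  # stand-in token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

output, weights = self_attention(X, W_q, W_k, W_v)
print(output.shape)          # (4, 8)
print(weights.sum(axis=-1))  # every row sums to 1 (up to floating-point error)
```

Row `t` of `weights` says how much token `t` draws from every other token, regardless of how far apart they are in the sequence.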

By stacking these self-attention layers, Transformers can build incredibly rich and deep understandings of language, which is why they have become the dominant architecture in modern NLP. Let's build one! 🚀

## 🚀 Agenda

Our exploration of the Transformer architecture will be structured as follows:

1.  **Setting the Stage**: We'll import the necessary libraries and configure our environment.
2.  **Data Preparation**: We will use the **exact same** IMDb dataset and preprocessing pipeline as the previous notebook. This allows for a fair, direct comparison between the performance of an LSTM and a Transformer on the same task.
3.  **Building the Transformer Model**: We'll construct our model piece by piece, understanding each component's role.
    *   **Positional Encoding**: We'll implement the clever sine/cosine positional encoding scheme to give the model a sense of word order.
    *   **The Full `SentimentTransformer`**: We'll assemble the complete model, incorporating an embedding layer, our positional encoding, and an `nn.TransformerEncoder` that stacks several `nn.TransformerEncoderLayer` modules.
4.  **Training the Model**: We'll define the optimizer and loss function and write the training and evaluation loops.
5.  **Inference**: We'll use our trained Transformer to predict the sentiment of new, unseen movie reviews.

In [None]:
# --- 1. Set up the Environment ---

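# We'll be using the same libraries as the previous notebook.
# Note: torchtext is no longer actively developed, so the installed torch version
# must be one that your torchtext release supports.
!pip install torchtext portalocker --quiet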

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset, TensorDataset
from torch.nn.utils.rnn import pad_sequence

# Import torchtext components
from torchtext.datasets import IMDB
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

import numpy as np
import math # For the positional encoding calculation
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt
import seaborn as sns

# --- Configuration ---

# Set a seed for reproducibility
SEED = 42
torch.manual_seed(SEED)
np.random.seed(SEED)

# Set the default device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"✅ Using device: {device.upper()}")

# --- Plotting Style ---
plt.style.use("seaborn-v0_8-whitegrid")
plt.rcParams["figure.figsize"] = (10, 6)
sns.set_palette("colorblind")

## 💬 Part 2: Preparing the IMDb Dataset

This step is **identical** to the one in our previous LSTM notebook. A robust model architecture is only as good as the data it's fed, so we'll follow the same rigorous preprocessing pipeline. Using the same data setup is crucial for making a fair comparison between the LSTM and Transformer models.

**Reminder of the Pipeline:**
1.  **Load Data**: Fetch the IMDb training and test sets using `torchtext`.
2.  **Tokenization**: Break sentences into words (tokens) using a basic English tokenizer.
3.  **Build Vocabulary**: Create a word-to-index mapping from our training data, capped at 10,000 entries (the most frequent words plus the `<unk>` and `<pad>` special tokens).
4.  **Numericalize & Pad**: Convert sentences to sequences of integers and pad or truncate them to a fixed length (`MAX_LEN`) so they can be processed in batches.
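
The steps above can be traced on a toy corpus without any external library. This is only a sketch: the whitespace tokenizer, the tiny vocabulary cap, and the helper names `build_toy_vocab` and `encode` are illustrative assumptions, not torchtext's implementation.

```python
from collections import Counter

def tokenize(text):
    # Crude stand-in for a real tokenizer: lowercase and split on whitespace.
    return text.lower().split()

def build_toy_vocab(texts, max_tokens):
    # Reserve index 0 for unknown words and index 1 for padding,
    # then fill the remaining slots with the most frequent tokens.
    counts = Counter(tok for t in texts for tok in tokenize(t))
    itos = ["<unk>", "<pad>"] + [w for w, _ in counts.most_common(max_tokens - 2)]
    return {w: i for i, w in enumerate(itos)}

def encode(text, vocab, max_len):
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokenize(text)]
    ids = ids[:max_len]                                   # truncate long inputs
    return ids + [vocab["<pad>"]] * (max_len - len(ids))  # pad short inputs on the right

corpus = ["a great movie", "a great cast", "a dull plot"]
toy_vocab = build_toy_vocab(corpus, max_tokens=4)

print(toy_vocab)                                                     # {'<unk>': 0, '<pad>': 1, 'a': 2, 'great': 3}
print(encode("a great film", toy_vocab, max_len=5))                  # [2, 3, 0, 1, 1]
print(encode("a great movie with a great cast", toy_vocab, max_len=5))  # [2, 3, 0, 0, 2]
```

Words outside the capped vocabulary ("film", "movie", "with") all collapse to the `<unk>` index, and both outputs have the same length, which is what lets them be stacked into one batch tensor.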

Let's execute the code. It will look very familiar!

In [None]:
# --- 2.1. Load Data, Define Tokenizer, and Build Vocabulary ---
print("Loading IMDb dataset and defining tokenizer...")
train_iter, test_iter = IMDB(split=('train', 'test'))
tokenizer = get_tokenizer('basic_english')

# Define constants for vocabulary and special tokens
VOCAB_SIZE = 10000
UNK_TOKEN = "<unk>"
PAD_TOKEN = "<pad>"

def yield_tokens(data_iter):
    """Helper generator function to yield tokens from the dataset."""
    for _, text in data_iter:
        yield tokenizer(text)

# We need a fresh iterator to build the vocabulary, as it gets consumed.
train_iter_for_vocab, _ = IMDB(split=('train', 'test'))

print(f"Building vocabulary with top {VOCAB_SIZE} words...")
vocab = build_vocab_from_iterator(
    yield_tokens(train_iter_for_vocab),
    specials=[UNK_TOKEN, PAD_TOKEN],
    max_tokens=VOCAB_SIZE,
    special_first=True
)
vocab.set_default_index(vocab[UNK_TOKEN])

pad_index = vocab[PAD_TOKEN]

print(f"✅ Vocabulary created. Size: {len(vocab)}")
print(f"   - Index for '<pad>': {pad_index}")
print(f"   - Example mapping: 'transformer' -> {vocab['transformer']}")

In [None]:
# --- 2.2. Define Data Processing and Create DataLoaders ---
MAX_LEN = 250    # Maximum sequence length
BATCH_SIZE = 32  # Batch size for the DataLoader

def create_dataloader(data_iter, is_train=True):
    """
    A function to convert the raw data iterator into a batched and padded DataLoader.
    """
    all_texts, all_labels = [], []
    
    print("Processing data... (This may take a moment)")
    for label, text in tqdm(data_iter, desc="Numericalizing and Padding"):
        numericalized_text = vocab(tokenizer(text))
        
        if len(numericalized_text) > MAX_LEN:
            numericalized_text = numericalized_text[:MAX_LEN]
            
        all_texts.append(torch.tensor(numericalized_text, dtype=torch.int64))
        # Labels arrive as 1/2 from this dataset; shift them to 0/1 for BCEWithLogitsLoss.
        all_labels.append(label - 1)

    padded_texts = pad_sequence(all_texts, batch_first=True, padding_value=pad_index)
    
    dataset = TensorDataset(padded_texts, torch.tensor(all_labels, dtype=torch.float32))
    return DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=is_train)

# --- Create the DataLoaders ---
# Get fresh iterators before processing
train_iter, test_iter = IMDB(split=('train', 'test'))

train_dataloader = create_dataloader(train_iter, is_train=True)
test_dataloader = create_dataloader(test_iter, is_train=False)

print("\n✅ Data preparation complete.")
print(f"   - Number of training batches: {len(train_dataloader)}")
print(f"   - Number of testing batches:  {len(test_dataloader)}")

# --- Inspect a Single Batch ---
texts, labels = next(iter(train_dataloader))
print(f"\nShape of a single batch of texts: {texts.shape}")
print(f"Shape of a single batch of labels: {labels.shape}")

## 🧠 Part 3: Building the Transformer Model

Now, let's construct the components of our Transformer-based sentiment classifier. We'll build it up piece by piece, starting with the solution to the "orderless" problem.

### 3.1. Positional Encoding

The first challenge with removing recurrence is that the model no longer has any inherent sense of word order. Self-attention treats its input as an unordered set of tokens, so without extra information a Transformer would build the same per-word representations for "I love this movie" and "this movie I love", just arranged in a different order.
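
The order-blindness of self-attention can be checked directly. The NumPy sketch below makes simplifying assumptions: a single attention head, random (not learned) projection matrices, and random stand-in embeddings. Permuting the input tokens merely permutes the output rows, so nothing in the result depends on where a token sat.

```python
import numpy as np

def attend(X, W_q, W_k, W_v):
    # Single-head self-attention with no positional information.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))          # stand-in embeddings for 4 tokens
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))

perm = [2, 3, 0, 1]                  # e.g. "I love this movie" -> "this movie I love"
out_original = attend(X, W_q, W_k, W_v)
out_permuted = attend(X[perm], W_q, W_k, W_v)

# The permuted input yields exactly the permuted output rows.
print(np.allclose(out_permuted, out_original[perm]))  # True
```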

**Solution:** We inject **Positional Encodings** into the input embeddings. These are vectors that provide information about the absolute position of a token within the sequence. The original paper used a clever method with sine and cosine functions of different frequencies.

The formula for the positional encoding `PE` at position `pos`, where each index `i` selects one sine/cosine pair of embedding dimensions (`2i` and `2i+1`), is:
$$ PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) $$
$$ PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) $$

Where:
-   `pos` is the position of the token in the sequence (0, 1, 2, ...).
-   `i` is the index of a sine/cosine pair, running from 0 to `d_model/2 - 1`: dimension `2i` of the embedding gets the sine and dimension `2i+1` gets the cosine.
-   `d_model` is the total dimension of the embedding (our `embed_dim`).

This method has two great properties:
1.  It produces a unique encoding for each time-step.
2.  It allows the model to easily learn to attend to relative positions, since for any fixed offset `k`, `PE_{pos+k}` can be represented as a linear function of `PE_{pos}`.
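
The second property can be checked numerically. For a single frequency `w`, the pair `(sin(w*pos), cos(w*pos))` at offset `k` is a rotation of the pair at `pos`, and the rotation matrix depends only on `k`. The sketch below uses only the standard library; `d_model = 128` and the chosen `i`, `pos`, and `k` are arbitrary illustrative values.

```python
import math

d_model = 128
i = 5                                    # sine/cosine pair index (dimensions 2i and 2i+1)
w = 1.0 / (10000 ** (2 * i / d_model))   # frequency of this pair

def pe_pair(pos):
    # The two positional-encoding entries that share frequency w.
    return (math.sin(w * pos), math.cos(w * pos))

pos, k = 17, 6
s, c = pe_pair(pos)

# Rotation by the angle w*k; note that it does not depend on pos.
rot = [[math.cos(w * k),  math.sin(w * k)],
       [-math.sin(w * k), math.cos(w * k)]]
shifted = (rot[0][0] * s + rot[0][1] * c,
           rot[1][0] * s + rot[1][1] * c)

target = pe_pair(pos + k)
print(all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(shifted, target)))  # True
```

Because the same matrix works for every `pos`, a linear layer can in principle learn to shift attention by a fixed offset.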

We will implement this as an `nn.Module`. The positional encodings are calculated once and then added to the input embeddings during the forward pass. They are not learnable parameters.

In [None]:
# --- 3.1. Implement Positional Encoding ---

class PositionalEncoding(nn.Module):
    """
    Injects positional information into the input embeddings.
    This is a fixed, non-learnable component.
    """
    def __init__(self, embed_dim: int, dropout: float = 0.1, max_len: int = 5000):
        """
        Args:
            embed_dim (int): The dimensionality of the embeddings.
            dropout (float): The dropout probability.
            max_len (int): The maximum possible sequence length.
        """
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)

        # Create a positional encoding matrix of shape (max_len, embed_dim).
        pe = torch.zeros(max_len, embed_dim)
        
        # Create a tensor representing the positions (0, 1, 2, ..., max_len-1).
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        
        # Create the division term for the sine and cosine functions.
        # This term creates a geometric progression of frequencies.
        div_term = torch.exp(torch.arange(0, embed_dim, 2).float() * (-math.log(10000.0) / embed_dim))
        
        # Calculate the positional encodings using the sine/cosine formula.
        # Note: this slicing assumes embed_dim is even.
        pe[:, 0::2] = torch.sin(position * div_term) # Apply to even indices.
        pe[:, 1::2] = torch.cos(position * div_term) # Apply to odd indices.
        
        # Add a batch dimension to the positional encoding matrix so it can be
        # easily added to the batch of embeddings. Shape becomes [1, max_len, embed_dim].
        pe = pe.unsqueeze(0)
        
        # Register 'pe' as a buffer. A buffer is part of the model's state but is not
        # considered a parameter to be trained. It will be moved to the correct device
        # along with the model.
        self.register_buffer('pe', pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        The forward pass adds the positional encodings to the input embeddings.
        
        Args:
            x: The input embeddings, shape [batch_size, seq_len, embedding_dim].
        
        Returns:
            The embeddings with added positional information.
        """
        # Add the positional encoding to the input tensor.
        # We only need the encodings up to the sequence length of the current batch.
        x = x + self.pe[:, :x.size(1), :]
        return self.dropout(x)

# --- Visualize the Positional Encodings ---
plt.figure(figsize=(12, 6))
pos_encoding_visualization = PositionalEncoding(embed_dim=128, max_len=500)
pe_matrix = pos_encoding_visualization.pe.squeeze().numpy()

plt.imshow(pe_matrix, cmap='viridis', aspect='auto')
plt.title("Visualization of Positional Encodings Matrix")
plt.xlabel("Embedding Dimension")
plt.ylabel("Position in Sequence")
plt.colorbar(label="Encoding Value")
plt.show()

print("💡 Analysis: The unique pattern for each position allows the model to distinguish between words at different locations.")

### 3.2. Assembling the Full Classifier Model

With positional information in place, we can build the main model. A Transformer Encoder is a stack of identical layers. Each layer has two main sub-components:

1.  **A Multi-Head Self-Attention Mechanism:** This is the core of the Transformer. It allows the model to weigh the importance of different words in the input sequence when producing a representation for a specific word. Instead of building this from scratch, we'll use PyTorch's highly optimized `nn.TransformerEncoderLayer`, which contains this mechanism internally.

2.  **A Position-wise Feed-Forward Network:** This is a simple fully connected network that is applied independently to each position (i.e., to each word's representation) in the sequence.

Each of these sub-layers has a residual connection around it, followed by a layer normalization. This is a standard practice that helps stabilize the training of deep networks. The full operation is: `output = LayerNorm(x + Sublayer(x))`.
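
As a minimal NumPy sketch of that formula: `layer_norm` below omits the learnable scale and shift that `nn.LayerNorm` has, and `toy_sublayer` is a hypothetical stand-in for the attention or feed-forward sub-layer.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector (last axis) to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def toy_sublayer(x, W):
    # Hypothetical stand-in for self-attention or the feed-forward network.
    return np.tanh(x @ W)

def residual_block(x, W):
    # Add the residual first, then normalize: LayerNorm(x + Sublayer(x)).
    return layer_norm(x + toy_sublayer(x, W))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))        # [seq_len, d_model]
W = rng.normal(size=(8, 8))
out = residual_block(x, W)

print(out.shape)                                        # (4, 8)
print(np.allclose(out.mean(axis=-1), 0.0, atol=1e-6))   # True
```

The residual path `x + ...` gives gradients a direct route through the stack, and the normalization keeps each token vector on a comparable scale from layer to layer.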

Our complete sentiment classifier will stack these components:

1.  **Input Embedding Layer (`nn.Embedding`):** Converts input word indices into dense vectors.
2.  **Positional Encoding Layer (`PositionalEncoding`):** Injects information about word order into the embeddings.
3.  **Transformer Encoder (`nn.TransformerEncoder`):** A stack of `nn.TransformerEncoderLayer`s that processes the sequence and builds rich, context-aware representations for every token.
4.  **Classification Head (`nn.Linear`):** A final linear layer that takes the encoder output at the **first position** of the sequence and maps it to a single logit. We do not prepend a dedicated classification token here; the first word's output simply serves as the summary of the review. This works because self-attention lets that position gather information from every non-padding token in the sequence.

In [None]:
# --- 3.2. Define the Transformer Classifier ---

class SentimentTransformer(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_heads, num_layers, hidden_dim, dropout=0.1, max_len=5000):
        """
        Initializes the Transformer model for sentiment classification.
        
        Args:
            vocab_size (int): The size of the vocabulary.
            embed_dim (int): The dimensionality of the embeddings.
            num_heads (int): The number of attention heads in the multi-head attention mechanism.
            num_layers (int): The number of stacked Transformer encoder layers.
            hidden_dim (int): The dimensionality of the feed-forward network inside the encoder.
            dropout (float): The dropout probability.
            max_len (int): The maximum sequence length for positional encoding.
        """
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_index)
        self.pos_encoder = PositionalEncoding(embed_dim, dropout, max_len)
        
        # A single Transformer Encoder Layer.
        # `d_model` is the same as `embed_dim`.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, 
            nhead=num_heads, 
            dim_feedforward=hidden_dim, 
            dropout=dropout,
            batch_first=True # Crucial: ensures input format is [batch, seq_len, features].
        )
        
        # A stack of identical encoder layers.
        self.transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        
        # The final linear layer for classification.
        self.classifier = nn.Linear(embed_dim, 1)
        
        self.embed_dim = embed_dim

    def forward(self, src):
        """
        Defines the forward pass.
        
        Args:
            src (Tensor): The input tensor of word indices, shape [batch_size, seq_len].
        
        Returns:
            Tensor: The raw output logits, shape [batch_size].
        """
        # src shape: [batch_size, seq_len]
        
        # Create a padding mask.
        # The mask should be `True` for positions that should be ignored (i.e., padding tokens).
        # The attention mechanism will not attend to these positions.
        src_key_padding_mask = (src == pad_index)
        # src_key_padding_mask shape: [batch_size, seq_len]

        # 1. Get embeddings and add positional encoding.
        # We scale the embeddings by sqrt(embed_dim) as recommended in the original paper.
        embedded = self.embedding(src) * math.sqrt(self.embed_dim)
        pos_encoded = self.pos_encoder(embedded)
        
        # 2. Pass through the Transformer Encoder.
        # The mask ensures that the attention mechanism ignores padding tokens.
        transformer_output = self.transformer_encoder(pos_encoded, src_key_padding_mask=src_key_padding_mask)
        # transformer_output shape: [batch_size, seq_len, embed_dim]
        
        # 3. Perform classification.
        # We use the output corresponding to the first token (at index 0) for classification.
        # Due to self-attention, this token's representation contains information from the whole sequence.
        cls_output = transformer_output[:, 0, :]
        # cls_output shape: [batch_size, embed_dim]
        
        output = self.classifier(cls_output)
        # output shape: [batch_size, 1]
        
        return output.squeeze(1)

# --- Instantiate the Model ---
EMBED_DIM = 128       # Dimensionality of the embeddings.
NUM_HEADS = 4         # Number of attention heads. Must be a divisor of EMBED_DIM.
NUM_LAYERS = 2        # Number of Transformer Encoder layers to stack.
HIDDEN_DIM = 512      # Hidden dimension of the feed-forward network.
DROPOUT = 0.2

model = SentimentTransformer(
    vocab_size=len(vocab),
    embed_dim=EMBED_DIM,
    num_heads=NUM_HEADS,
    num_layers=NUM_LAYERS,
    hidden_dim=HIDDEN_DIM,
    dropout=DROPOUT,
    max_len=MAX_LEN
).to(device)

# --- Count Model Parameters ---
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print("✅ Transformer Model created and moved to device.")
print(f'The model has {count_parameters(model):,} trainable parameters.')
print("\nModel Architecture:")
print(model)

## 🚀 Part 4: Training the Transformer Model

The training and evaluation process will be very similar to the one we used for the LSTM model, demonstrating the modularity of PyTorch workflows.

1.  **Optimizer:** We'll use the **Adam** optimizer. The learning rate for Transformers is often set to be smaller than for LSTMs, so we'll start with `1e-4`.
2.  **Loss Function:** We'll again use `BCEWithLogitsLoss`, which is perfect for our binary classification task and is numerically stable.
3.  **Training & Evaluation Loops:** We will reuse our training and evaluation functions (`train_one_epoch` and `evaluate_one_epoch`) to handle loss calculation and accuracy monitoring for each epoch. We'll also save the best-performing model based on validation loss.
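
To see what "numerically stable" means for a loss computed from logits, the sketch below compares a naive sigmoid-then-log computation with the standard rearrangement `max(z, 0) - z*y + log(1 + exp(-|z|))`. It is a plain-Python illustration of the idea, not PyTorch's actual implementation.

```python
import math

def bce_naive(z, y):
    # Apply the sigmoid first, then take logs; breaks down for large |z|.
    p = 1.0 / (1.0 + math.exp(-z))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def bce_with_logits(z, y):
    # Algebraically the same loss, but it never evaluates log(0).
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

# For moderate logits the two agree.
print(math.isclose(bce_naive(2.0, 1.0), bce_with_logits(2.0, 1.0)))  # True

# For an extreme logit the sigmoid rounds to exactly 1.0 and the naive version fails.
try:
    bce_naive(40.0, 0.0)
except ValueError as err:
    print("naive version failed:", err)

print(bce_with_logits(40.0, 0.0))  # approximately 40
```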

Let's set up the training components and start the process.

In [None]:
# --- 4.1. Define Optimizer, Loss, and Accuracy Functions ---
optimizer = optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.BCEWithLogitsLoss().to(device)

def binary_accuracy(preds, y):
    """Calculates accuracy for binary classification."""
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float()
    acc = correct.sum() / len(correct)
    return acc

# --- 4.2. Define Training and Evaluation Functions ---
# These functions are identical to the ones from the LSTM notebook.
def train_one_epoch(model, iterator, optimizer, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.train()
    
    for batch_texts, batch_labels in tqdm(iterator, desc="Training"):
        batch_texts, batch_labels = batch_texts.to(device), batch_labels.to(device)
        
        optimizer.zero_grad()
        predictions = model(batch_texts)
        loss = criterion(predictions, batch_labels)
        acc = binary_accuracy(predictions, batch_labels)
        
        loss.backward()
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

def evaluate_one_epoch(model, iterator, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.eval()
    
    with torch.no_grad():
        for batch_texts, batch_labels in tqdm(iterator, desc="Evaluating"):
            batch_texts, batch_labels = batch_texts.to(device), batch_labels.to(device)
            
            predictions = model(batch_texts)
            loss = criterion(predictions, batch_labels)
            acc = binary_accuracy(predictions, batch_labels)
            
            epoch_loss += loss.item()
            epoch_acc += acc.item()
            
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

### 4.3. Run the Training Loop

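Now we'll execute the main training loop for a few epochs. We'll keep track of the best validation loss to save the best version of our model. For simplicity, the IMDb test set doubles as the validation set here, so the reported validation metrics are a slightly optimistic estimate of performance on truly unseen data; holding out a slice of the training set would be the stricter choice.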

Note that training a Transformer can be more computationally intensive per epoch than an LSTM due to the self-attention mechanism's quadratic complexity with respect to sequence length. However, it often converges in fewer epochs. We'll train for just **3 epochs** here.
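
A back-of-the-envelope calculation shows where the quadratic cost comes from. Each attention head builds a `seq_len × seq_len` score matrix per example. The numbers below reuse this notebook's `NUM_HEADS` and `BATCH_SIZE` values, and the helper `attention_scores` is a hypothetical name for illustration. It counts score entries only, not total FLOPs or memory.

```python
def attention_scores(seq_len, num_heads, batch_size):
    # One score per (query, key) pair, per head, per example.
    return batch_size * num_heads * seq_len * seq_len

for seq_len in (250, 500, 1000):
    n = attention_scores(seq_len, num_heads=4, batch_size=32)
    print(f"seq_len={seq_len:>4}: {n:>12,} scores per layer")

# seq_len= 250:    8,000,000 scores per layer
# seq_len= 500:   32,000,000 scores per layer
# seq_len=1000:  128,000,000 scores per layer
```

Doubling the sequence length quadruples the number of scores, whereas an LSTM's work per sequence grows only linearly with its length.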

In [None]:
# --- 4.3. Run the Training Loop ---
N_EPOCHS = 3 # Transformers can converge in fewer epochs, but each epoch is computationally heavier.
best_valid_loss = float('inf')

history = {'train_loss': [], 'train_acc': [], 'val_loss': [], 'val_acc': []}

print("🚀 Starting Transformer model training...")

for epoch in range(N_EPOCHS):
    
    train_loss, train_acc = train_one_epoch(model, train_dataloader, optimizer, criterion)
    valid_loss, valid_acc = evaluate_one_epoch(model, test_dataloader, criterion)
    
    # Store history
    history['train_loss'].append(train_loss)
    history['train_acc'].append(train_acc)
    history['val_loss'].append(valid_loss)
    history['val_acc'].append(valid_acc)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'best_transformer_model.pt')
    
    print(f'\nEpoch: {epoch+1:02}')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

print("\n✅ Training complete.")
print(f"Best validation loss: {best_valid_loss:.3f}")
print("The best model has been saved to 'best_transformer_model.pt'")

## 🎬 Part 5: Use the Model for Inference

Just like with our LSTM, let's use our trained Transformer to predict the sentiment of new sentences. The process is identical: load the best model, preprocess the input sentence (tokenize, numericalize, batch), and pass it to the model to get a prediction.

A single sentence does not need padding at inference: `unsqueeze(0)` simply adds a batch dimension of size 1, and the model accepts any sequence length up to `MAX_LEN`. We must still truncate longer inputs to `MAX_LEN`, because the positional-encoding table was built with `max_len=MAX_LEN` and has no entries for later positions.

In [None]:
# --- 5.1. Load the Best Model and Define Prediction Function ---
print("Loading the best Transformer model from 'best_transformer_model.pt'...")
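# map_location lets a checkpoint saved on GPU load on a CPU-only machine as well.
model.load_state_dict(torch.load('best_transformer_model.pt', map_location=device))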
model.to(device)
print("✅ Model loaded.")

def predict_sentiment_transformer(sentence):
    """
    Predicts the sentiment of a sentence using the trained Transformer model.
    """
    model.eval()
    
    # Preprocess the sentence
    tokenized = tokenizer(sentence)
    indexed = vocab(tokenized)
    
    # Ensure the sequence is not longer than the model's max length
    if len(indexed) > MAX_LEN:
        indexed = indexed[:MAX_LEN]
        
    # Convert to tensor and add batch dimension
    tensor = torch.LongTensor(indexed).unsqueeze(0).to(device)
    
    # Get prediction
    with torch.no_grad():
        prediction = model(tensor)
        
    # Convert to probability and classify
    probability = torch.sigmoid(prediction).item()
    sentiment = "Positive" if probability > 0.5 else "Negative"
    return sentiment, probability

# --- 5.2. Test with Example Sentences ---
print("\n--- Testing with a positive review ---")
positive_review = "This is one of the best films I have ever seen, a true masterpiece of cinema. The acting, direction, and storyline were all flawless."
sentiment, prob = predict_sentiment_transformer(positive_review)
print(f"Sentence:  '{positive_review}'")
print(f"Sentiment: {sentiment}")
print(f"Probability (Positive): {prob:.4f}\n")

print("--- Testing with a negative review ---")
negative_review = "A complete waste of time. The plot was predictable, the acting was terrible, and I wouldn't recommend it to anyone."
sentiment, prob = predict_sentiment_transformer(negative_review)
print(f"Sentence:  '{negative_review}'")
print(f"Sentiment: {sentiment}")
print(f"Probability (Positive): {prob:.4f}\n")

print("--- Comparing with the LSTM notebook's ambiguous review ---")
ambiguous_review = "The movie was okay, I guess. Some parts were good, others not so much. I'm not sure if I would recommend it."
sentiment, prob = predict_sentiment_transformer(ambiguous_review)
print(f"Sentence:  '{ambiguous_review}'")
print(f"Sentiment: {sentiment}")
print(f"Probability (Positive): {prob:.4f}")