# Understanding and Implementing Transformers: A Step-by-Step Guide

## 1. Introduction to Transformers

Transformers have revolutionized Natural Language Processing (NLP) and have applications in various domains. This notebook will guide you through understanding and implementing a simple transformer from scratch.

<p align="center">
  <img src="transformer_architecture.png" alt="Transformer Architecture Diagram" style="width:40%; height: 40%;">
  <br>
  <em>Figure 1: Transformer Architecture Diagram, Taken From "<a href="https://arxiv.org/abs/1706.03762">Attention Is All You Need</a>"</em>
</p>


## 2. Setting Up Our Environment

First, let's import the necessary libraries:

In [4]:
import numpy as np
import torch  # PyTorch (TensorFlow is a common alternative)
import torch.nn as nn
import torch.optim as optim
import math

# Set a random seed for reproducibility
torch.manual_seed(42)

<torch._C.Generator at 0x1178e42b0>

In [5]:
# spaCy (en_core_web_lg) --> 300 dimensions
# OpenAI text-embedding-3-large --> 3072 dimensions

## 3. Understanding Embeddings

Before we dive into the transformer architecture, let's start with a fundamental concept: word embeddings.

### What are Word Embeddings?

Word embeddings are dense vector representations of words. Instead of using sparse, one-hot encoded vectors, we represent each word as a dense vector of floating-point numbers. These vectors are learned from data and capture semantic relationships between words.

<div style="display: flex; align-items: center; justify-content: center;">
  <div>
    <img src="word_embeddings.png" alt="Word Embeddings Visualization" style="width:auto; height: auto;">
    <br>
    <em>Figure 2: Words Represented by Vectors</em>
  </div>
  <div style="margin-left: 20px;"> <!-- Adds some space between the images -->
    <img src="word_embeddings_similarity.png" alt="Word Embeddings Similarity" style="width:auto; height: auto;">
    <br>
    <em>Figure 3: Words Closer to Each Other are More Similar</em>
  </div>
</div>
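
One common way to quantify "closer" is cosine similarity. The sketch below is only an illustration: the three vectors are hand-picked, hypothetical values, not embeddings from any trained model, and `cosine_similarity` is a helper written for this example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional embeddings, chosen by hand for illustration only
toy_embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

sim_king_queen = cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"])
sim_king_apple = cosine_similarity(toy_embeddings["king"], toy_embeddings["apple"])
print(f"king vs queen: {sim_king_queen:.3f}")
print(f"king vs apple: {sim_king_apple:.3f}")
```

With these toy values, the related pair scores much higher than the unrelated pair, which is the behavior Figure 3 depicts.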

### Mathematical Representation

Mathematically, an embedding layer can be thought of as a lookup table. If we have a vocabulary of size V and we want to embed each word into a D-dimensional space, we can represent this as a matrix E of shape (V, D).

For a given word index i, its embedding vector $e_i$ is the i-th row of E:

$$e_i = E[i, :]$$
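
To connect this to one-hot vectors: selecting row i of E gives the same result as multiplying the one-hot vector for i by E. The plain-Python sketch below checks this on a tiny matrix; `E_rows`, `one_hot`, and `vector_times_matrix` are names made up for this illustration.

```python
# Embedding matrix of shape (V=3, D=3) stored as a list of rows; values are illustrative
E_rows = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
]

def one_hot(index, size):
    """Vector of zeros with a single 1.0 at position `index`."""
    return [1.0 if j == index else 0.0 for j in range(size)]

def vector_times_matrix(vec, matrix):
    """Row vector (length V) times matrix (V x D) -> row vector (length D)."""
    num_cols = len(matrix[0])
    return [sum(vec[r] * matrix[r][c] for r in range(len(matrix)))
            for c in range(num_cols)]

i = 2
via_lookup = E_rows[i]                                          # e_i = E[i, :]
via_one_hot = vector_times_matrix(one_hot(i, len(E_rows)), E_rows)
print(via_lookup, via_one_hot)
```

The lookup avoids the multiplication entirely, which is why embedding layers are implemented as table lookups rather than matrix products.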

#### Example:
Let's say we have a small vocabulary of 5 words: ["hello", "world", "transformer", "example", "embedding"]

We want to represent each word with a 3-dimensional vector. Our embedding matrix E might look like this:

In [6]:
E = torch.tensor([
    [0.1, 0.2, 0.3],  # "hello"
    [0.4, 0.5, 0.6],  # "world"
    [0.7, 0.8, 0.9],  # "transformer"
    [1.0, 1.1, 1.2],  # "example"
    [1.3, 1.4, 1.5]   # "embedding"
])

In [7]:
# To get the embedding for "transformer" (index 2):
transformer_embedding = E[2, :]  # Result: tensor([0.7, 0.8, 0.9])

In [8]:
len(transformer_embedding)
# Real embeddings are usually much larger than 3 dimensions:
# spaCy's en_core_web_md and en_core_web_lg vectors, for example, have 300

3

In [9]:
class SimpleEmbedding(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)

    def forward(self, x):
        return self.embedding(x)

In [10]:
vocab_size = 1000000  # Size of our vocabulary
embedding_dim = 300  # Dimension of the embedding

In [11]:
embed_layer = SimpleEmbedding(vocab_size, embedding_dim)

In [12]:
# sample input tensor (batch_size=1, sequence_length=3)
sample_input = torch.tensor([[3,5,9]]) # "I love Transformers"

In [13]:
embedded_output = embed_layer(sample_input)
print(f"Input shape: {sample_input.shape}")
print(f"Embedded output shape: {embedded_output.shape}")  # (batch_size, seq_len, embedding_dim)

Input shape: torch.Size([1, 3])
Embedded output shape: torch.Size([1, 3, 300])


In [14]:
import spacy
nlp = spacy.load("en_core_web_lg")

embedded_output_spacy = nlp("Transformer")

In [15]:
len(embedded_output_spacy.vector)

300

In [16]:
embedded_output[0]  # All three token embeddings of the first sequence; "Transformers" alone is embedded_output[0, 2]

tensor([[-2.3908e+00,  3.2225e-01,  1.8754e+00,  1.1043e+00, -5.2238e-01,
         -7.4018e-01,  1.6236e-01, -2.3700e-01,  5.0993e-01,  1.6706e+00,
          1.5921e+00, -4.1619e-01,  1.8619e+00, -1.0779e+00,  8.8486e-01,
         -8.3421e-01,  1.0301e+00, -8.6810e-01, -5.7016e-01,  3.2332e-01,
          1.1285e+00, -1.2123e+00,  2.6024e+00, -9.5724e-02, -8.1148e-02,
          1.2587e+00,  8.6913e-01, -9.6094e-01,  5.1823e-02, -3.2848e-01,
         -2.2472e+00, -4.4790e-01,  4.2347e-01, -3.8746e-01, -2.2964e-01,
         -4.0709e-01,  8.7030e-01, -1.0553e+00, -1.3284e+00,  7.0607e-01,
          3.5730e-01,  5.8928e-01,  9.1878e-01,  6.6628e-01,  2.4651e-01,
          1.3287e-01,  1.2191e-01,  4.7809e-01,  2.7613e-01, -5.8957e-01,
          5.6918e-01, -7.9110e-01, -1.9897e-01, -1.3616e+00, -5.1936e-01,
          7.6482e-02,  3.4005e-01,  1.4557e+00, -3.4610e-01, -2.6338e-01,
         -4.4770e-01, -7.2882e-01, -1.6066e-01, -3.2064e-01, -6.3077e-01,
         -7.8877e-01,  1.3062e+00, -9.

In this example, we've created a simple embedding layer. Each word (represented by an integer) is mapped to a vector of size `embedding_dim`.

## 4. Positional Encoding


Next, let's understand and implement positional encoding. This is crucial for transformers to understand the order of sequences.

### Why do we need Positional Encoding?

Unlike recurrent neural networks (RNNs), transformers process all words in a sequence simultaneously. This parallelization is great for efficiency, but it means the model loses information about the order of words. Positional encoding solves this by adding position-dependent patterns to the input embeddings.

### Mathematical Representation

For a position pos and dimension i in the embedding, the positional encoding PE is defined as:

$$PE_{(pos,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
$$PE_{(pos,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

Where $d_{model}$ is the dimensionality of the model's embeddings.
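
As a sanity check on the two formulas, here is a minimal plain-Python sketch that evaluates them for a single position. The function name `positional_encoding` and the tiny `d_model = 4` are choices made for this illustration only.

```python
import math

def positional_encoding(pos, d_model):
    """Return the d_model-dimensional sinusoidal encoding for one position."""
    pe = [0.0] * d_model
    for k in range(0, d_model, 2):          # k plays the role of 2i (even dimension index)
        angle = pos / (10000 ** (k / d_model))
        pe[k] = math.sin(angle)             # even dimensions use sine
        if k + 1 < d_model:
            pe[k + 1] = math.cos(angle)     # odd dimensions use cosine
    return pe

pe0 = positional_encoding(0, 4)
pe1 = positional_encoding(1, 4)
print([round(v, 4) for v in pe0])
print([round(v, 4) for v in pe1])
```

Position 0 always encodes to alternating 0 and 1 (since sin(0) = 0 and cos(0) = 1), while later positions rotate each sine/cosine pair at a frequency that decreases with the dimension index.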

#### Example:
Let's calculate the positional encoding for the word "transformer" in a sentence, assuming it's at position 2 (0-indexed) and we're using a 300-dimensional model to match our embeddings:


In [17]:
# "The good transformer (Optimus) beat the bad transformer (Megatron)"

import math

d_model = 300
pos = 2

pe = torch.zeros(1, d_model)
for i in range(0, d_model, 2):
    # i is already the even dimension index (the formula's 2i), so the exponent is i / d_model
    angle = pos / (10000 ** (i / d_model))
    pe[0, i] = math.sin(angle)
    pe[0, i+1] = math.cos(angle)

print("Positional encoding for 'transformer' (first 4 of 300 dims):", pe[0, :4])

Positional encoding for 'transformer' (first 4 of 300 dims): tensor([ 0.9093, -0.4161,  0.9523, -0.3051])

This encoding is unique for position 2 and will be different for other positions, allowing the model to distinguish word positions.

<p align="center">
  <img src="positional_encoding.png" alt="Positional Encoding" style="width:60%; height: 60%;">
  <br>
  <em>Figure 4: Positional Encoding of "I am a robot"</em>
</p>

In [18]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_seq_length=5000):
        super().__init__()

        # Create a long enough 'pe' matrix
        pe = torch.zeros(max_seq_length, d_model)

        # Create a vector of shape (max_seq_length, 1)
        position = torch.arange(0, max_seq_length, dtype=torch.float).unsqueeze(1)

        # Create a vector of shape (d_model/2)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))

        # Apply sine to even indices
        pe[:, 0::2] = torch.sin(position * div_term)
        # Apply cosine to odd indices
        pe[:, 1::2] = torch.cos(position * div_term)

        # Add a batch dimension
        pe = pe.unsqueeze(0)

        # Register pe as a buffer (won't be considered a model parameter)
        self.register_buffer('pe', pe)

    def forward(self, x):
        # Add positional encoding to the input
        return x + self.pe[:, :x.size(1)]

In [19]:
# input <-- 300 dimensions
# positional encoding <-- 300 dimensions

# input + positional encoding --> encoder

In [20]:
d_model = 300  # Should match the embedding dimension
pos_encoder = PositionalEncoding(d_model)

In [21]:
# Use our previous embedded output
positional_encoded = pos_encoder(embedded_output)
print(f"Positional encoded output shape: {positional_encoded.shape}")

Positional encoded output shape: torch.Size([1, 3, 300])


In [22]:
print(f"First tensor of embedded_output:\n{embedded_output[0]}")
print(f"First tensor of positional_encoded:\n{positional_encoded[0]}")

First tensor of embedded_output:
tensor([[-2.3908e+00,  3.2225e-01,  1.8754e+00,  1.1043e+00, -5.2238e-01,
         -7.4018e-01,  1.6236e-01, -2.3700e-01,  5.0993e-01,  1.6706e+00,
          1.5921e+00, -4.1619e-01,  1.8619e+00, -1.0779e+00,  8.8486e-01,
         -8.3421e-01,  1.0301e+00, -8.6810e-01, -5.7016e-01,  3.2332e-01,
          1.1285e+00, -1.2123e+00,  2.6024e+00, -9.5724e-02, -8.1148e-02,
          1.2587e+00,  8.6913e-01, -9.6094e-01,  5.1823e-02, -3.2848e-01,
         -2.2472e+00, -4.4790e-01,  4.2347e-01, -3.8746e-01, -2.2964e-01,
         -4.0709e-01,  8.7030e-01, -1.0553e+00, -1.3284e+00,  7.0607e-01,
          3.5730e-01,  5.8928e-01,  9.1878e-01,  6.6628e-01,  2.4651e-01,
          1.3287e-01,  1.2191e-01,  4.7809e-01,  2.7613e-01, -5.8957e-01,
          5.6918e-01, -7.9110e-01, -1.9897e-01, -1.3616e+00, -5.1936e-01,
          7.6482e-02,  3.4005e-01,  1.4557e+00, -3.4610e-01, -2.6338e-01,
         -4.4770e-01, -7.2882e-01, -1.6066e-01, -3.2064e-01, -6.3077e-01,
     

## 5. Attention Mechanism

Now, let's dive into the core of the transformer: the attention mechanism.

### What is Attention?

Attention allows the model to focus on different parts of the input when producing each part of the output. In the context of transformers, we use self-attention, where the model attends to different parts of a single sequence.

### Mathematical Representation of Scaled Dot-Product Attention

Given query Q, key K, and value V matrices, the attention is computed as:

$$Attention(Q, K, V) = softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Where $d_k$ is the dimension of the key vectors.
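
The original paper motivates the division by $\sqrt{d_k}$ by noting that large dot products push the softmax into a region where one weight dominates. The sketch below illustrates that effect with hypothetical scores; the numbers and the helper `softmax` are assumptions for this example, not values from a real model.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64
raw_scores = [8.0, 16.0, 24.0]                      # hypothetical unscaled dot products
scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]

unscaled_weights = softmax(raw_scores)
scaled_weights = softmax(scaled_scores)
print("unscaled:", [round(w, 4) for w in unscaled_weights])
print("scaled:  ", [round(w, 4) for w in scaled_weights])
```

Without scaling, almost all of the weight lands on a single key; after scaling, the weights stay spread out enough for the other keys to contribute.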

#### Example:
Let's compute attention for a simple case with 2 words and 3-dimensional embeddings:


In [23]:
# Q is 2x3
# Kt = torch.tensor([[1, 2],
#                    [2, 1],
#                    [1, 0]])
# Kt is 3x2

# QKt is (2x3)x(3x2) = (2x2)
# V needs to be (2 x something)
# V is 2x2

# attention_weights is 2x2
# attention_weights x V is (2x2)x(2x2) = (2x2)

In [24]:
Q = torch.tensor([[1, 0, 1],  # Query for word 1
                  [0, 1, 1]]) # Query for word 2
K = torch.tensor([[1, 2, 1],  # Key for word 1
                  [2, 1, 0]]) # Key for word 2
V = torch.tensor([[0.5, 0.8],  # Value for word 1
                  [0.2, 0.3]]) # Value for word 2

d_k = 3

In [25]:
QKt = torch.matmul(Q, K.transpose(0, 1))
scaled = QKt / math.sqrt(d_k)
attention_weights = torch.softmax(scaled, dim=-1)
output = torch.matmul(attention_weights, V)

In [26]:
value = torch.tensor([[2, 4, 5]], dtype=torch.float32)
result = torch.softmax(value, dim=1)
print(result)

# softmax(z) = e^z / sum(e^z)
# e^value/sum(e^value)

result_from_scratch = torch.exp(value)/torch.sum(torch.exp(value), dim=1, keepdim=True)
result_from_scratch

tensor([[0.0351, 0.2595, 0.7054]])


tensor([[0.0351, 0.2595, 0.7054]])

In [27]:
print("Attention weights:", attention_weights)
print("Output:", output)

Attention weights: tensor([[0.5000, 0.5000],
        [0.7604, 0.2396]])
Output: tensor([[0.3500, 0.5500],
        [0.4281, 0.6802]])


This example shows how each word attends to both itself and the other word, with the attention weights determining how much information to gather from each word.
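
Because every token attends to every token, the weight matrix has one entry per (query, key) pair. A rough sketch of how that count grows with sequence length (the helper `num_attention_scores` is made up for this illustration and counts a single head in a single layer):

```python
def num_attention_scores(num_tokens):
    """One score per (query token, key token) pair, for one head in one layer."""
    return num_tokens * num_tokens

for n in (2, 12, 1_000, 100_000):
    print(f"{n:>7} tokens -> {num_attention_scores(n):,} scores")
```

Doubling the sequence length quadruples the number of scores, which is why long inputs are expensive for standard attention.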

In [28]:
# GPT-4o --> 128K-token context window
# Claude --> 200K-token context window (up to 1M for some models)

# Attention cost grows as O(n^2), where n is the number of tokens
# 12 tokens --> 12 x 12 = 144 attention scores

<p align="center">
  <img src="attention.gif" alt="Attention" style="width:60%; height: 60%;">
  <br>
  <em>Figure 5: Attention in Action</em>
</p>

### 5.1 Scaled Dot-Product Attention

Let's implement this:

In [29]:
def scaled_dot_product_attention(query, key, value, mask=None):
    """
    Compute the scaled dot-product attention.
    
    Args:
    - query: tensor of shape (..., seq_len_q, depth)
    - key: tensor of shape (..., seq_len_k, depth)
    - value: tensor of shape (..., seq_len_v, depth_v)
    - mask: optional tensor of shape (..., seq_len_q, seq_len_k)

    Returns:
    - output: weighted sum of values
    - attention_weights: attention weights
    """

    # Compute dot product of query with keys
    matmul_qk = torch.matmul(query, key.transpose(-2, -1))

    # Scale matmul_qk
    depth = query.size(-1)
    scaled_attention_logits = matmul_qk / math.sqrt(depth)

    # Apply mask (if provided)
    if mask is not None:
        scaled_attention_logits += (mask * -1e9)

    # Apply softmax to get attention weights
    attention_weights = torch.softmax(scaled_attention_logits, dim=-1)

    # Compute weighted sum of values
    output = torch.matmul(attention_weights, value)

    return output, attention_weights

In [30]:
seq_len, d_k = 3, 300
query = torch.rand(1, seq_len, d_k)
key = torch.rand(1, seq_len, d_k)
value = torch.rand(1, seq_len, d_k)

In [31]:
output, attention_weights = scaled_dot_product_attention(query, key, value)
print(f"Attention output shape: {output.shape}")
print(f"Attention weights shape: {attention_weights.shape}")

Attention output shape: torch.Size([1, 3, 300])
Attention weights shape: torch.Size([1, 3, 3])


### 5.2 Multi-Head Attention

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.

Mathematically, multi-head attention first projects Q, K, and V h times (where h is the number of heads) with different learned projections. Then, it performs attention on each of these projected versions of Q, K, and V. Finally, the results are concatenated and once again projected.

$$MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W^O$$
where 
$$head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)$$
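
In practice, a common trick (an assumption here, since the formula itself does not prescribe it) is to compute one projection of width $d_{model}$ and slice it into h chunks of size $d_k = d_{model} / h$, one per head. The plain-Python sketch below shows only that slicing and the final concatenation; `split_heads` and `concat_heads` are illustrative helpers.

```python
def split_heads(vector, num_heads):
    """Slice a d_model-long vector into num_heads chunks of length d_k."""
    d_model = len(vector)
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_k = d_model // num_heads
    return [vector[h * d_k:(h + 1) * d_k] for h in range(num_heads)]

def concat_heads(heads):
    """Inverse of split_heads: join the per-head chunks back together."""
    return [value for head in heads for value in head]

projected = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]   # d_model = 8
heads = split_heads(projected, num_heads=2)             # two heads with d_k = 4
print(heads)
print(concat_heads(heads) == projected)
```

Each head then runs attention on its own chunk, and the concatenated result is passed through $W^O$.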

<p align="center">
  <img src="multi-head_attention.png" alt="Multi-Head Attention" style="width:30%; height: 30%;">
  <br>
  <em>Figure 6: Multi-head Attention</em>
</p>

In [32]:
!pip3 install bertviz



In [33]:
from transformers import AutoTokenizer, AutoModel, utils
from bertviz import model_view
utils.logging.set_verbosity_error()  # Suppress standard warnings

model_name = "microsoft/xtremedistil-l12-h384-uncased"  # Find popular HuggingFace models here: https://huggingface.co/models
input_text = "I love transformers"  
model = AutoModel.from_pretrained(model_name, output_attentions=True)  # Configure model to return attention values
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer.encode(input_text, return_tensors='pt')  # Tokenize input text
outputs = model(inputs)  # Run model
attention = outputs[-1]  # Retrieve attention from model outputs
tokens = tokenizer.convert_ids_to_tokens(inputs[0])  # Convert input ids to token strings
model_view(attention, tokens)  # Display model view



<IPython.core.display.Javascript object>

Now, let's implement this:

In [34]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        # Linear layers for Q, K, V projections
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)

        # Final output projection
        self.W_o = nn.Linear(d_model, d_model)

    def split_heads(self, x, batch_size):
        """Split the last dimension into (num_heads, depth)."""
        return x.view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)

        # Linear projections
        q = self.W_q(query)
        k = self.W_k(key)
        v = self.W_v(value)

        # Split heads
        q = self.split_heads(q, batch_size)
        k = self.split_heads(k, batch_size)
        v = self.split_heads(v, batch_size)

        # Scaled dot-product attention
        scaled_attention, attention_weights = scaled_dot_product_attention(q, k, v, mask)

        # Concatenate heads
        concat_attention = scaled_attention.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)

        # Final linear projection
        output = self.W_o(concat_attention)

        return output, attention_weights

In [35]:
d_model, num_heads = 300, 4
mha = MultiHeadAttention(d_model, num_heads)

In [36]:
# Use our previous positional encoded output
mha_output, mha_attention_weights = mha(positional_encoded, positional_encoded, positional_encoded)
print(f"Multi-head attention output shape: {mha_output.shape}")
print(f"Multi-head attention weights shape: {mha_attention_weights.shape}")

Multi-head attention output shape: torch.Size([1, 3, 300])
Multi-head attention weights shape: torch.Size([1, 4, 3, 3])


## 6. Position-wise Feed-Forward Networks

Each encoder and decoder layer also contains a fully connected feed-forward network, placed after the attention sub-layer. This network is applied to each position separately and identically.

### Mathematical Representation

The position-wise feed-forward network consists of two linear transformations with a ReLU activation in between:

$$FFN(x) = max(0, xW_1 + b_1)W_2 + b_2$$

Where $W_1$, $W_2$, $b_1$, and $b_2$ are learnable parameters.
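
To make "separately and identically" concrete, the sketch below applies the same FFN to each position of a tiny sequence in plain Python. The weights use illustrative values with d_model = 2 and d_ff = 3, and the helpers `linear`, `relu`, and `ffn` are written for this example only.

```python
def linear(x, W, b):
    """Compute xW + b for a row vector x and a matrix W of shape (len(x), len(b))."""
    return [sum(x[r] * W[r][c] for r in range(len(x))) + b[c] for c in range(len(b))]

def relu(x):
    return [max(0.0, v) for v in x]

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, xW1 + b1)W2 + b2"""
    return linear(relu(linear(x, W1, b1)), W2, b2)

# Illustrative weights: d_model = 2, d_ff = 3
W1 = [[1.0, -1.0, 0.5],
      [0.0,  2.0, -0.5]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0]]
b2 = [0.1, -0.1]

sequence = [[1.0, 2.0],     # position 0
            [1.0, 2.0],     # position 1 (same vector as position 0)
            [-1.0, 0.5]]    # position 2
outputs = [ffn(x, W1, b1, W2, b2) for x in sequence]
print(outputs)
```

Positions 0 and 1 hold the same vector and therefore produce the same output: the FFN has no access to neighboring positions, so any mixing across positions has to come from attention.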

<p align="center">
  <img src="position-wise_feed-forward_network.png" alt="Position-wise Feed-Forward Network" style="width:40%; height: 40%;">
  <br>
  <em>Figure 7: Illustration of a Position-wise Feed-Forward Network</em>
</p>


### Example:
Let's apply a feed-forward network to a single word embedding:


In [37]:
import torch.nn.functional as F

# word embedding
x = torch.tensor([0.5, -0.2, 0.1, 0.8])

# First linear transformation
W1 = torch.tensor([[0.1, 0.2],
                   [-0.1, 0.1],
                   [0.3, -0.2],
                   [0.2, 0.1]])
b1 = torch.tensor([0.01, 0.02])

# Second linear transformation
W2 = torch.tensor([[1.0, -0.5, 0.8, 0.2],
                   [0.5, 0.3, -0.2, 0.4]])
b2 = torch.tensor([0.03, -0.01, 0.02, 0.01])

In [38]:
# Apply FFN
hidden = F.relu(torch.matmul(x, W1) + b1)
output = torch.matmul(hidden, W2) + b2

In [39]:
print("FFN input:", x)
print("FFN output:", output)

FFN input: tensor([ 0.5000, -0.2000,  0.1000,  0.8000])
FFN output: tensor([ 0.3800, -0.0970,  0.2040,  0.1280])


### Implementation:

This implementation uses nn.Linear layers instead of explicit matrix multiplications.
While it may not look exactly like the formula FFN(x) = max(0, xW_1 + b_1)W_2 + b_2,
it is functionally equivalent.

$$FFN(x) = max(0, xW_1 + b_1)W_2 + b_2$$

In [82]:
# Layer 1 (Linear Layer) --> y = xW1 + b1
# Relu --> max(0, x)
# Layer 2 (Linear Layer) --> y = xW2 + b2

# step 1: max(0, xW1 + b1)
# step 2: max(0, xW1 + b1)W2 + b2

In [83]:
class PositionWiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.relu = nn.ReLU()

    def forward(self, x):
        # First linear transformation
        x = self.fc1(x)
        # ReLU activation
        x = self.relu(x)
        # Second linear transformation
        return self.fc2(x)

In [85]:
d_model, d_ff = 300, 64
ff_network = PositionWiseFeedForward(d_model, d_ff)
ff_output = ff_network(mha_output)
print(f"Feed-forward network output shape: {ff_output.shape}")

Feed-forward network output shape: torch.Size([1, 3, 300])


## 7. Layer Normalization

Layer normalization is a crucial component in transformers, helping to stabilize the learning process and reduce training time.

### Mathematical Representation

For a vector $x = (x_1, x_2, ..., x_H)$, layer normalization is defined as:

$$LN(x) = \alpha \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

Where:
- $\mu$ is the mean of the elements in $x$
- $\sigma$ is the standard deviation of the elements in $x$
- $\alpha$ and $\beta$ are learnable parameters
- $\epsilon$ is a small constant for numerical stability
- $\odot$ represents element-wise multiplication
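
The formula can be evaluated by hand in a few lines of plain Python. This is a minimal sketch that assumes $\sigma^2$ is the population variance (dividing by H); some implementations divide by the sample standard deviation instead, which gives slightly different numbers. The function name `layer_norm` is chosen for this illustration.

```python
import math

def layer_norm(x, alpha, beta, eps=1e-5):
    """LN(x) = alpha * (x - mu) / sqrt(sigma^2 + eps) + beta, element-wise."""
    n = len(x)
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n          # population variance (divide by n)
    return [a * (v - mu) / math.sqrt(var + eps) + b
            for v, a, b in zip(x, alpha, beta)]

x = [2.0, -1.0, 3.0, 0.0]
y = layer_norm(x, alpha=[1.0] * 4, beta=[0.0] * 4)
print([round(v, 4) for v in y])
```

With $\alpha = 1$ and $\beta = 0$, the output has mean 0 and variance very close to 1, whatever the scale of the input.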

<p align="center">
  <img src="./layer_normalization.png" alt="Layer Normalization" style="width:auto; height: auto;">
  <br>
  <em>Figure 8: Illustration of Layer Normalization</em>
</p>

### Example:
Let's apply layer normalization to a simple feature vector:

In [86]:
x = torch.tensor([2.0, -1.0, 3.0, 0.0])

# Learnable parameters
alpha = torch.tensor([1.0, 1.0, 1.0, 1.0])
beta = torch.tensor([0.0, 0.0, 0.0, 0.0])

# Compute mean and (population) variance, as in the formula above
mean = x.mean()
var = x.var(unbiased=False)

# Apply layer normalization
epsilon = 1e-5
normalized = alpha * (x - mean) / torch.sqrt(var + epsilon) + beta

In [87]:
print("Original vector:", x)
print("Normalized vector:", normalized)

Original vector: tensor([ 2., -1.,  3.,  0.])
Normalized vector: tensor([ 0.6325, -1.2649,  1.2649, -0.6325])


### Implementation:

In [89]:
class LayerNorm(nn.Module):
    def __init__(self, features, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(features))
        self.beta = nn.Parameter(torch.zeros(features))
        self.eps = eps

    def forward(self, x):
        # Compute mean and (population) variance over the feature dimension
        mean = x.mean(-1, keepdim=True)
        var = x.var(-1, unbiased=False, keepdim=True)

        # Normalize, then scale (gamma is alpha in the formula) and shift
        return self.gamma * (x - mean) / torch.sqrt(var + self.eps) + self.beta

In [90]:
layer_norm = LayerNorm(d_model)
normalized_output = layer_norm(ff_output)
print(f"Normalized output shape: {normalized_output.shape}")

Normalized output shape: torch.Size([1, 3, 300])


## 8. Encoder Layer

Now that we have all the components, let's put them together to create an encoder layer.

In [91]:
# Dropout --> regularization, which prevents overfitting
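
For intuition, here is a plain-Python sketch of the inverted-dropout behavior that `nn.Dropout` implements: during training each element is zeroed with probability p and the survivors are scaled by 1 / (1 - p); at evaluation time the input passes through unchanged. The function `dropout` and its `rng` argument are written for this illustration and are not PyTorch's API.

```python
import random

def dropout(values, p, rng, training=True):
    """Zero each value with probability p; scale the kept ones by 1 / (1 - p)."""
    if not training or p == 0.0:
        return list(values)                 # evaluation mode: identity
    keep_scale = 1.0 / (1.0 - p)
    return [v * keep_scale if rng.random() >= p else 0.0 for v in values]

rng = random.Random(0)                      # fixed seed so the sketch is repeatable
activations = [1.0] * 10
dropped = dropout(activations, p=0.1, rng=rng)
unchanged = dropout(activations, p=0.1, rng=rng, training=False)
print(dropped)
print(unchanged)
```

The rescaling keeps the expected value of each activation the same in training and evaluation, so no correction is needed at inference time.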

In [92]:
class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.mha = MultiHeadAttention(d_model, num_heads)
        self.ffn = PositionWiseFeedForward(d_model, d_ff)
        self.layernorm1 = LayerNorm(d_model)
        self.layernorm2 = LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Multi-head attention
        attn_output, _ = self.mha(x, x, x, mask)
        attn_output = self.dropout1(attn_output)
        out1 = self.layernorm1(x + attn_output)  # Add & Norm

        # Feed forward
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output)
        out2 = self.layernorm2(out1 + ffn_output)  # Add & Norm

        return out2

In [93]:
encoder_layer = EncoderLayer(d_model, num_heads, d_ff)
encoder_output = encoder_layer(positional_encoded)
print(f"Encoder layer output shape: {encoder_output.shape}")

Encoder layer output shape: torch.Size([1, 3, 300])


In [None]:
# class BERT(nn.Module):
#     # TODO: add 12-24 encoder blocks

<p align="center">
  <img src="./bert.png" alt="Bert Architecture" style="width:auto; height: auto;">
  <br>
  <em>Figure 9: BERT is an "Encoder-only" Transformer Architecture</em>
</p>

## 9. Decoder Layer

The decoder layer is similar to the encoder layer but includes an additional multi-head attention layer that attends to the output of the encoder.
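
In this additional attention layer, the queries come from the decoder while the keys (and values) come from the encoder, so the score matrix is generally not square. A small plain-Python sketch with made-up vectors (the helper `attention_scores` and all numbers are illustrative assumptions):

```python
def attention_scores(queries, keys):
    """Unscaled dot-product scores: one row per query, one column per key."""
    return [[sum(q_i * k_i for q_i, k_i in zip(q, k)) for k in keys]
            for q in queries]

decoder_states = [[1.0, 0.0], [0.0, 1.0]]                  # 2 target positions (queries)
encoder_states = [[1.0, 1.0], [2.0, 0.0], [0.0, 3.0]]      # 3 source positions (keys)
scores = attention_scores(decoder_states, encoder_states)
print(scores)   # 2 rows (target positions) x 3 columns (source positions)
```

Each target position thus gets its own distribution over all source positions once the softmax is applied row by row.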

<p align="center">
  <img src="./gpt.png" alt="GPT Architecture" style="width:50%; height: 50%;">
  <br>
  <em>Figure 10: GPT is a "Decoder-only" Transformer Architecture</em>
</p>

In [48]:
class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.mha1 = MultiHeadAttention(d_model, num_heads)
        self.mha2 = MultiHeadAttention(d_model, num_heads)
        self.ffn = PositionWiseFeedForward(d_model, d_ff)
        self.layernorm1 = LayerNorm(d_model)
        self.layernorm2 = LayerNorm(d_model)
        self.layernorm3 = LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
        self.dropout3 = nn.Dropout(dropout)

    def forward(self, x, enc_output, look_ahead_mask=None, padding_mask=None):
        # Self attention
        attn1, _ = self.mha1(x, x, x, look_ahead_mask)
        attn1 = self.dropout1(attn1)
        out1 = self.layernorm1(attn1 + x)

        # Multi-head attention using encoder output as Key and Value
        attn2, _ = self.mha2(out1, enc_output, enc_output, padding_mask)
        attn2 = self.dropout2(attn2)
        out2 = self.layernorm2(attn2 + out1)

        # Feed forward
        ffn_output = self.ffn(out2)
        ffn_output = self.dropout3(ffn_output)
        out3 = self.layernorm3(ffn_output + out2)

        return out3

In [94]:
decoder_layer = DecoderLayer(d_model, num_heads, d_ff)
decoder_output = decoder_layer(positional_encoded, encoder_output)
print(f"Decoder layer output shape: {decoder_output.shape}")

Decoder layer output shape: torch.Size([1, 3, 300])


## 10. Full Encoder

The full encoder consists of multiple encoder layers stacked on top of each other. It also includes the initial embedding layer and positional encoding.


In [95]:
class Encoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, max_seq_length, dropout=0.1):
        super().__init__()
        self.d_model = d_model
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model, max_seq_length)
        self.encoder_layers = nn.ModuleList([EncoderLayer(d_model, num_heads, d_ff, dropout) 
                                             for _ in range(num_layers)])
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Embedding and positional encoding
        x = self.embedding(x) * math.sqrt(self.d_model)
        x = self.pos_encoding(x)
        x = self.dropout(x)

        # Pass through each encoder layer
        for layer in self.encoder_layers:
            x = layer(x, mask)

        return x

In [96]:
vocab_size = 10000
num_layers = 12
max_seq_length = 512

encoder = Encoder(vocab_size, d_model, num_layers, num_heads, d_ff, max_seq_length)
sample_input = torch.randint(0, vocab_size, (64, 30))  # Batch of 64, sequence length of 30
encoder_output = encoder(sample_input)
print(f"Encoder output shape: {encoder_output.shape}")

Encoder output shape: torch.Size([64, 30, 300])
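
To get a feel for the size of this stack, the sketch below counts trainable parameters by hand. It assumes the building blocks defined in this notebook: four `nn.Linear` projections with biases per attention block, two `nn.Linear` layers with biases in the feed-forward block, two layer norms with a scale and a shift vector each, and a positional encoding stored as a buffer (so it contributes no parameters). The function name is made up for this illustration.

```python
def count_encoder_parameters(vocab_size, d_model, d_ff, num_layers):
    """Hand count of trainable parameters for the encoder stack described above."""
    embedding = vocab_size * d_model
    attention = 4 * (d_model * d_model + d_model)            # W_q, W_k, W_v, W_o with biases
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    layer_norms = 2 * (2 * d_model)                           # two norms, scale and shift each
    per_layer = attention + feed_forward + layer_norms
    return embedding + num_layers * per_layer

total = count_encoder_parameters(vocab_size=10000, d_model=300, d_ff=64, num_layers=12)
print(f"{total:,} trainable parameters")
```

With these settings the embedding table accounts for 3,000,000 of the parameters, and each of the 12 layers adds 401,164 more.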


In [97]:
# Structurally similar to the encoder stack of a BERT model 🙂

## 11. Full Decoder

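The full decoder, like the encoder, consists of multiple decoder layers stacked on top of each other, preceded by an embedding layer and positional encoding. The output layer that maps decoder states to vocabulary scores is added in the full Transformer model in Section 12.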

In [98]:
class Decoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, max_seq_length, dropout=0.1):
        super().__init__()
        self.d_model = d_model
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model, max_seq_length)
        self.decoder_layers = nn.ModuleList([DecoderLayer(d_model, num_heads, d_ff, dropout) 
                                             for _ in range(num_layers)])
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_output, look_ahead_mask=None, padding_mask=None):
        # Embedding and positional encoding
        x = self.embedding(x) * math.sqrt(self.d_model)
        x = self.pos_encoding(x)
        x = self.dropout(x)

        # Pass through each decoder layer
        for layer in self.decoder_layers:
            x = layer(x, enc_output, look_ahead_mask, padding_mask)

        return x

In [99]:
decoder = Decoder(vocab_size, d_model, num_layers, num_heads, d_ff, max_seq_length)
sample_target = torch.randint(0, vocab_size, (64, 20))  # Batch of 64, sequence length of 20
decoder_output = decoder(sample_target, encoder_output)
print(f"Decoder output shape: {decoder_output.shape}")

Decoder output shape: torch.Size([64, 20, 300])


## 12. Transformer

Now, let's put everything together to create the full Transformer model.

In [100]:
class Transformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, num_layers, num_heads, d_ff, max_seq_length, dropout=0.1):
        super().__init__()
        self.encoder = Encoder(src_vocab_size, d_model, num_layers, num_heads, d_ff, max_seq_length, dropout)
        self.decoder = Decoder(tgt_vocab_size, d_model, num_layers, num_heads, d_ff, max_seq_length, dropout)
        self.final_layer = nn.Linear(d_model, tgt_vocab_size)

    def forward(self, src, tgt, src_mask=None, tgt_mask=None, src_padding_mask=None, tgt_padding_mask=None):
        enc_output = self.encoder(src, src_mask)
        dec_output = self.decoder(tgt, enc_output, tgt_mask, src_padding_mask)
        return self.final_layer(dec_output)

    def encode(self, src, src_mask=None):
        return self.encoder(src, src_mask)

    def decode(self, tgt, memory, tgt_mask=None, memory_mask=None):
        return self.decoder(tgt, memory, tgt_mask, memory_mask)

In [101]:
src_vocab_size = 10000
tgt_vocab_size = 10000

transformer = Transformer(src_vocab_size, tgt_vocab_size, d_model, num_layers, num_heads, d_ff, max_seq_length)

src = torch.randint(0, src_vocab_size, (64, 30))  # Batch of 64, source sequence length of 30
tgt = torch.randint(0, tgt_vocab_size, (64, 20))  # Batch of 64, target sequence length of 20

output = transformer(src, tgt)
print(f"Transformer output shape: {output.shape}")

Transformer output shape: torch.Size([64, 20, 10000])
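
The last dimension of this output holds one unnormalized score (logit) per vocabulary entry. One simple way to turn the logits at a position into a predicted token is a softmax followed by an argmax (greedy decoding). The sketch below uses a hypothetical 4-token vocabulary; `softmax` and `greedy_pick` are helpers written for this illustration, and greedy decoding is only one of several possible strategies.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(logits):
    """Return (token_id, probability) of the highest-scoring vocabulary entry."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]

toy_logits = [0.5, 2.0, -1.0, 0.1]     # hypothetical scores over a 4-token vocabulary
token_id, prob = greedy_pick(toy_logits)
print(token_id, round(prob, 4))
```

For the untrained model above, the resulting predictions are essentially random; they only become meaningful after training.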


## 13. Understanding and Implementing Masks in Transformers

Masks play a crucial role in the Transformer architecture. They serve two main purposes:

1. Padding Mask: To handle variable-length sequences in a batch.
2. Look-ahead Mask: To prevent the decoder from looking at future tokens during training (this helps with generation).

Let's dive into each type of mask and then implement them.

<p align="center">
  <img src="./masks.png" alt="Look Ahead Mask and Padding Mask" style="width:80%; height: 80%; background-color:white;">
  <br>
  <em>Figure 11: Masks in Transformers</em>
</p>

### 13.1 Padding Mask

In natural language processing tasks, we often work with sequences of different lengths. To process these in batches, we pad shorter sequences to match the length of the longest sequence in the batch. However, we don't want our model to pay attention to these padding tokens.
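
The padding step itself can be sketched in plain Python. This assumes token id 0 is reserved for padding; `PAD_ID`, `pad_batch`, and `is_padding` are names made up for this illustration.

```python
PAD_ID = 0  # assumption: token id 0 is reserved for padding

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

def is_padding(batch, pad_id=PAD_ID):
    """True at padded positions, False at real tokens."""
    return [[tok == pad_id for tok in seq] for seq in batch]

batch = pad_batch([[1, 2, 3], [4, 5], [6]])
flags = is_padding(batch)
print(batch)
print(flags)
```

The boolean flags carry exactly the information a padding mask needs: which positions attention should ignore.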

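The padding mask is a binary mask where:
- 1 indicates a padding token (to be masked out)
- 0 indicates a real token

This is the convention `scaled_dot_product_attention` expects: it adds `mask * -1e9` to the scores, so positions marked 1 receive a near-zero attention weight.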

### 13.2 Look-ahead Mask

In the decoder, we need to prevent it from looking at future tokens during training. This is because during inference, the model won't have access to future tokens. The look-ahead mask ensures that prediction for position i can depend only on the known outputs at positions less than i.
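
The effect on the attention weights can be sketched in plain Python: adding a large negative number to the scores of future positions drives their softmax weights to zero. The helpers `softmax` and `causal_attention_weights` and the uniform toy scores are assumptions for this illustration. Row i keeps input positions 0 through i; in the original paper the decoder inputs are shifted right by one position, so these correspond to outputs produced before step i.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(scores, neg=-1e9):
    """Add a large negative value to future positions (column j > row i), then softmax each row."""
    size = len(scores)
    masked = [[scores[i][j] if j <= i else scores[i][j] + neg for j in range(size)]
              for i in range(size)]
    return [softmax(row) for row in masked]

scores = [[0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0]]      # uniform toy scores
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 4) for w in row])
```

Each row still sums to 1, but all of the weight is distributed over the current and earlier positions only.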

The look-ahead mask is a triangular matrix where:
- 1 indicates positions that can be attended to
- 0 indicates positions that should be masked out
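
To make the triangular structure concrete, here is a small sketch that applies a look-ahead mask to uniform (all-zero) attention scores. The scores are hypothetical; with uniform scores, each row of the resulting weights simply spreads evenly over the positions it is allowed to see.

```python
import torch

size = 4

# 1 on and below the diagonal: position i may attend to positions 0..i
look_ahead_mask = torch.tril(torch.ones(size, size))

# Hypothetical uniform scores, so the effect of the mask is easy to read off
scores = torch.zeros(size, size)
scores = scores.masked_fill(look_ahead_mask == 0, float("-inf"))
weights = torch.softmax(scores, dim=-1)

print(weights)
```

Row 0 puts all of its weight on position 0, row 1 splits it evenly between positions 0 and 1, and so on. Everything above the diagonal is exactly zero.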

### 13.3 Implementing the Mask Functions

Now, let's implement the functions to create these masks:

In [56]:
def create_padding_mask(seq):
    """
    Create a padding mask for the input sequence.

    Args:
    - seq: Input tensor of shape (batch_size, seq_len)

    Returns:
    - mask: Padding mask of shape (batch_size, 1, 1, seq_len),
            with 1 for real tokens and 0 for padding tokens
    """
    # Mark real tokens with 1 and padding tokens with 0 (assuming 0 is the padding token)
    mask = (seq != 0).float()

    # Add extra dimensions so the mask broadcasts over heads and query positions
    return mask.unsqueeze(1).unsqueeze(2)

In [57]:
def create_look_ahead_mask(size):
    """
    Create a look-ahead mask for the decoder.

    Args:
    - size: Size of the square matrix

    Returns:
    - mask: Look-ahead mask of shape (size, size),
            with 1 for positions that can be attended to and 0 for future positions
    """
    # Lower-triangular matrix: position i may attend to positions 0..i
    return torch.tril(torch.ones(size, size))

In [58]:
def create_masks(src, tgt):
    """
    Create all necessary masks for the Transformer model.

    All masks use the convention described above (1 = attend, 0 = masked out),
    which assumes the attention layers mask out positions where the mask is 0.

    Args:
    - src: Source sequence tensor of shape (batch_size, src_seq_len)
    - tgt: Target sequence tensor of shape (batch_size, tgt_seq_len)

    Returns:
    - src_mask: Source padding mask
    - tgt_mask: Combined target padding and look-ahead mask
    """
    # Source padding mask
    src_mask = create_padding_mask(src)

    # Target padding mask
    tgt_padding_mask = create_padding_mask(tgt)

    # Target look-ahead mask
    tgt_look_ahead_mask = create_look_ahead_mask(tgt.size(1))

    # Combine the target masks: a position can be attended to only if both masks allow it
    tgt_mask = torch.min(tgt_padding_mask, tgt_look_ahead_mask.unsqueeze(0))

    return src_mask, tgt_mask

In [59]:
src = torch.tensor([[1, 2, 3, 0, 0], [4, 5, 0, 0, 0]])  # Batch of 2, max length 5
tgt = torch.tensor([[1, 2, 3, 4, 0], [5, 6, 0, 0, 0]])  # Batch of 2, max length 5

src_mask, tgt_mask = create_masks(src, tgt)

print("Source mask shape:", src_mask.shape)
print("Target mask shape:", tgt_mask.shape)

print("\nSource mask for first sequence:")
print(src_mask[0].squeeze())

print("\nTarget mask for first sequence:")
print(tgt_mask[0].squeeze())

Source mask shape: torch.Size([2, 1, 1, 5])
Target mask shape: torch.Size([2, 1, 5, 5])

Source mask for first sequence:
tensor([1., 1., 1., 0., 0.])

Target mask for first sequence:
tensor([[1., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0.],
        [1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 0.]])


## 14. Training the Transformer

To train the Transformer, we need to define a loss function and an optimizer. For sequence-to-sequence tasks, we typically use cross-entropy loss.


In [60]:
# Define loss function and optimizer
criterion = nn.CrossEntropyLoss(ignore_index=0)  # 0 is the padding index
optimizer = optim.Adam(transformer.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)
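
To see what `ignore_index=0` does, the following sketch compares the loss over a sequence that contains padding targets with the loss over its real tokens only. The logits and targets are made up, and the `demo_` names are just placeholders so that the notebook's own `criterion` is left untouched.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
demo_vocab_size = 6

# Hypothetical logits for 4 target positions, and targets where the last two are padding (id 0)
demo_logits = torch.randn(4, demo_vocab_size)
demo_targets = torch.tensor([2, 5, 0, 0])

# With ignore_index=0, padding positions contribute nothing to the loss
demo_criterion = nn.CrossEntropyLoss(ignore_index=0)
loss_with_padding = demo_criterion(demo_logits, demo_targets)

# Same loss computed by hand over the two real positions only
loss_real_only = nn.CrossEntropyLoss()(demo_logits[:2], demo_targets[:2])

print(torch.allclose(loss_with_padding, loss_real_only))
```

The two values agree because the mean is taken only over the positions that are not ignored.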

In [61]:
def train_step(src, tgt):
    transformer.train()
    optimizer.zero_grad()

    # Teacher forcing: the decoder reads all tokens except the last
    # and is trained to predict all tokens except the first
    tgt_input = tgt[:, :-1]
    tgt_output = tgt[:, 1:]

    # Build the masks from the decoder input so their size matches its length
    src_mask, tgt_mask = create_masks(src, tgt_input)

    # Forward pass
    output = transformer(src, tgt_input, src_mask, tgt_mask)

    # Calculate loss (padding targets are skipped via ignore_index in the criterion)
    loss = criterion(output.reshape(-1, tgt_vocab_size), tgt_output.reshape(-1))

    # Backward pass and optimize
    loss.backward()
    optimizer.step()

    return loss.item()

In [62]:
# TODO: create a data loader for training
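
As one possible starting point for that data loader, here is a minimal sketch built on `torch.utils.data.DataLoader` with a custom collate function that pads each batch with 0. The toy `pairs` list and its token ids (including the use of 1 and 2 as start and end markers) are hypothetical; a real pipeline would tokenize an actual parallel corpus.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Hypothetical toy data: (source ids, target ids) pairs of varying length
pairs = [
    (torch.tensor([5, 6, 7]), torch.tensor([1, 8, 9, 2])),
    (torch.tensor([10, 11]), torch.tensor([1, 12, 2])),
    (torch.tensor([13, 14, 15, 16]), torch.tensor([1, 17, 2])),
]

def collate_batch(batch):
    """Pad every sequence in the batch with 0 up to the longest one."""
    src_seqs, tgt_seqs = zip(*batch)
    src = pad_sequence(list(src_seqs), batch_first=True, padding_value=0)
    tgt = pad_sequence(list(tgt_seqs), batch_first=True, padding_value=0)
    return src, tgt

loader = DataLoader(pairs, batch_size=2, shuffle=False, collate_fn=collate_batch)

for src_batch, tgt_batch in loader:
    print(src_batch.shape, tgt_batch.shape)
    # In the notebook, each batch could then be passed to train_step(src_batch, tgt_batch)
```

Padding per batch rather than to a global maximum keeps the tensors as small as each batch allows.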

This completes our implementation of the Transformer model. We've covered all the major components, from the basic building blocks like multi-head attention and positional encoding, to the full encoder and decoder structures, and finally the complete Transformer architecture.

Remember that this is a basic implementation and there are many optimizations and variations that can be applied in practice. Some areas for further exploration include:

1. Implementing more sophisticated decoding strategies (e.g., beam search)
2. Adding regularization techniques (e.g., label smoothing)
3. Experimenting with different attention mechanisms
4. Implementing transformer variants like BERT, GPT, or T5

Happy transforming!

## 14. Using GPT from Hugging Face

GPT (Generative Pre-trained Transformer) is a family of language models that use the decoder part of the transformer architecture. Let's use a GPT-2 model from Hugging Face to generate text.

### 14.1 Setting Up GPT-2

First, we need to install the transformers library and import the necessary modules:

In [103]:
!pip install transformers
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

Now, let's load a pre-trained GPT-2 model and its associated tokenizer:

In [113]:
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

### 14.2 Generating Text with GPT-2

To generate text, we'll first tokenize an input prompt, then use the model to generate a sequence of tokens, and finally decode these tokens back into text.

In [110]:
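def generate_text(prompt, max_length=100):
    # Encode the input prompt into token ids plus an attention mask
    inputs = tokenizer(prompt, return_tensors="pt")

    # Generate text by sampling from the model's next-token distribution
    output = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=max_length,       # total length in tokens, prompt included
        num_return_sequences=1,
        no_repeat_ngram_size=2,      # never repeat the same 2-gram
        top_k=50,                    # consider only the 50 most likely tokens
        top_p=0.95,                  # restrict further to the top 95% of probability mass
        do_sample=True,              # sample instead of always taking the most likely token
        temperature=0.7,             # values below 1 sharpen the distribution
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no dedicated padding token
    )

    # Decode the generated tokens
    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
    return generated_text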
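
The `temperature`, `top_k`, and `top_p` arguments all reshape the next-token distribution before a token is sampled. The function below is a simplified sketch of that idea for a single decoding step, written in plain PyTorch. `sample_next_token` is a hypothetical helper for illustration, not the Hugging Face implementation, and it leaves out features such as `no_repeat_ngram_size`.

```python
import torch

def sample_next_token(logits, temperature=0.7, top_k=50, top_p=0.95):
    """Sample one token id from a 1-D tensor of logits (illustrative sketch)."""
    # Temperature: values below 1 sharpen the distribution, values above 1 flatten it
    logits = logits / temperature

    # Top-k: keep only the k highest-scoring tokens
    k = min(top_k, logits.size(-1))
    kth_value = torch.topk(logits, k).values[-1]
    logits = logits.masked_fill(logits < kth_value, float("-inf"))

    # Top-p: keep the smallest set of tokens whose cumulative probability reaches top_p
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    sorted_probs = torch.softmax(sorted_logits, dim=-1)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Drop a token if the probability mass before it already reaches top_p
    remove = (cumulative - sorted_probs) >= top_p
    sorted_logits = sorted_logits.masked_fill(remove, float("-inf"))

    # Sample from the renormalized distribution and map back to vocabulary ids
    probs = torch.softmax(sorted_logits, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)
    return sorted_idx[choice].item()

# Toy vocabulary of 5 tokens with made-up logits
torch.manual_seed(0)
toy_logits = torch.tensor([5.0, 4.0, 0.0, -1.0, -2.0])
print([sample_next_token(toy_logits, top_k=2) for _ in range(10)])
```

With `top_k=2`, only token ids 0 and 1 can ever be sampled from the toy logits, however many times the function is called.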

In [111]:
prompt = "In a world where AI has become commonplace,"
generated_text = generate_text(prompt)
print(generated_text)

In a world where AI has become commonplace, it seems like we're not really talking about AI at all. That's not to say that AI is impossible, or even remotely possible, but it's still a pretty big problem. AI isn't just a technical concept, of course. It's also a very real possibility. The problem is that human beings don't have the capacity to understand it. They don"t know much about it, so they don`t really know what to do about


How would you approach building a sentiment analysis model, and what NLP techniques or tools would you use to achieve accurate sentiment classification?

1. Utilise a pre-trained deep learning model like BERT for sentiment analysis without any additional pre-processing.
2. Tokenise the text using spaCy, remove common words, and use a basic machine learning model for sentiment analysis.
3. Use regular expressions to match positive and negative keywords in the text and classify based on the keyword count.
4. Manually read and label each message to assign sentiment labels, then train a model on this labelled dataset.

* 3 --> done in Monday's lecture (not very good at all)
* 4 --> heck no
* 1 --> got a great model, but text fed into it still needs to be processed
    * for example: input is "I love transformers", then you still need to convert it to a vector for sentiment analysis
* 2 --> most correct. Even more correct: utilize a pre-trained deep learning model to both tokenize and classify.

## 15. Using BERT from Hugging Face

BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based model that uses the encoder part of the transformer architecture. Let's use a BERT model from Hugging Face for a text classification task.

### 15.1 Setting Up BERT

In [121]:
from transformers import BertTokenizer, BertForSequenceClassification
import torch
import torch.nn.functional as F

Now, let's load a pre-trained BERT model and its associated tokenizer:

In [122]:
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)

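Here, we're using `BertForSequenceClassification`, which is BERT with an additional classification layer on top. We set `num_labels=2` for binary classification. Note that the `bert-base-uncased` checkpoint only contains the pre-trained encoder: the classification layer is randomly initialized, so the model needs to be fine-tuned on labelled sentiment data before its predictions are meaningful.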

### 15.2 Text Classification with BERT

Let's create a function to classify text sentiment using our BERT model:

In [123]:
def classify_sentiment(text):
    # Tokenize input
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)

    # Get model outputs
    with torch.no_grad():
        outputs = model(**inputs)

    # Apply softmax to get probabilities
    probs = F.softmax(outputs.logits, dim=-1)

    # Get predicted class (we treat 0 as negative and 1 as positive;
    # this mapping only becomes meaningful once the classification head is fine-tuned)
    predicted_class = torch.argmax(probs, dim=-1).item()

    return "Positive" if predicted_class == 1 else "Negative", probs[0][predicted_class].item()

In [128]:
text = "I love how this transformer model works! It's amazing!"
sentiment, confidence = classify_sentiment(text)
print(f"Sentiment: {sentiment}, Confidence: {confidence:.2f}")

Sentiment: Positive, Confidence: 0.55
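
A confidence of 0.55 is close to chance for two classes, which is what we would expect from a classification layer that has not been fine-tuned. The sketch below uses made-up logits that are small and close together, as an untrained layer tends to produce, to show how such values turn into a near 50/50 prediction.

```python
import torch
import torch.nn.functional as F

# Hypothetical logits for [negative, positive], chosen for illustration only
logits = torch.tensor([[0.1, 0.3]])

# Softmax turns the logits into probabilities that sum to 1
probs = F.softmax(logits, dim=-1)
confidence, predicted_class = probs.max(dim=-1)

print(f"Predicted class: {predicted_class.item()}, Confidence: {confidence.item():.2f}")
```

A fine-tuned model would usually produce logits that are much further apart on a clearly positive sentence, giving a confidence close to 1.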
