# Exploring a forward pass through a Transformer (with the Beatles)

Number of lines of code: 189

Number of words: 2498

**Introduction**

This tutorial explores the "Attention is All You Need" paper by Vaswani et al., available [here](https://arxiv.org/abs/1706.03762), which introduced the Transformer architecture to natural language processing (Vaswani et al., 2017). We will focus on the decoder component of the Transformer, guiding you through what goes on under the hood. This exploration is based on Andrej Karpathy's practical implementation, detailed in [this Colab Notebook](https://colab.research.google.com/drive/1JMLa53HDuA-i7ZBmqV7ZnA3c_fvtXnx-?usp=sharing).

At the time of this paper's release, Recurrent Neural Networks and Convolutional Neural Networks were the primary frameworks for tasks like language modeling and machine translation. Recurrent models in particular process tokens one after another, which restricts parallelization during training, especially with longer sequences.

In response, the paper introduced the Transformer model, which dispenses with recurrence and convolution and relies on attention instead. This approach was inspired by the success of early encoder-decoder architectures that employed attention mechanisms, such as the work by [(Luong et al., 2015)](https://arxiv.org/abs/1508.04025). Attention allows the Transformer to relate different parts of a sequence regardless of how far apart they are, greatly enhancing parallel processing capabilities. This shift in architecture not only increased computational efficiency but also established new performance benchmarks in machine translation, the field for which the Transformer was initially showcased.

<div style="text-align: center;">
    <img src="https://machinelearningmastery.com/wp-content/uploads/2021/08/attention_research_1.png" alt="Transformer Model Diagram" title="Transformer Model Overview" width="30%" height="auto"/>
</div>


The Transformer model features an encoder and a decoder, both employing stacked self-attention and fully connected layers. Each encoder layer includes multi-head self-attention and a feed-forward layer, enhanced with residual connections and layer normalization. The decoder mirrors this setup but adds a masked multi-head self-attention layer, which ensures each position attends only to itself and earlier positions, and an attention layer over the encoder output. For machine translation tasks, the encoder-decoder framework is essential: the encoder represents the source text, which the decoder uses together with its own previously generated tokens to autoregressively generate subsequent tokens in the target language.

In contrast, tasks that involve generating text without translation, such as language modeling, can effectively utilize just the decoder component. This approach is exemplified by early state-of-the-art models like [GPT-2](https://huggingface.co/learn/nlp-course/chapter1/6). Therefore, this tutorial concentrates on the decoder aspect of the architecture, with a specific focus on implementing attention and feed-forward modules.

We use a dataset of [Beatles lyrics](https://www.kaggle.com/datasets/jenlooper/beatles-lyrics). Given enough training time, this would produce a model able to autoregressively generate text in the style of The Beatles.


In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import math

In [2]:
device='cpu'

# Section 1: Tokenizer

The initial step in building a transformer involves tokenizing our text, which means converting characters into numerical values that the model can process. While tokenization often uses sub-words, this tutorial employs character-level tokenization, where each character in our Beatles lyrics string is mapped to a unique number.
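
As a minimal sketch of character-level tokenization, the snippet below maps a short made-up string (an assumption for illustration, not the Beatles dataset) to integers and back. The names `toy_text`, `toy_stoi` and `toy_itos` are hypothetical and are not used elsewhere in this tutorial.

```python
# Toy example: character-level tokenization on a short, made-up string
toy_text = "love me do"
toy_chars = sorted(set(toy_text))                       # unique characters, sorted
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}    # character -> integer
toy_itos = {i: ch for ch, i in toy_stoi.items()}        # integer -> character

toy_ids = [toy_stoi[ch] for ch in toy_text]             # encode
toy_roundtrip = "".join(toy_itos[i] for i in toy_ids)   # decode

print(toy_chars)
print(toy_ids)
print(toy_roundtrip)
```

Decoding the encoded ids should return the original string, which is the property the real `encode` and `decode` functions rely on.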

In [3]:
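beatles_txt_pth = './beatles.txt'
# Use a context manager so the file is closed after reading;
# UTF-8 is assumed here because the lyrics contain characters such as ö and ü
with open(beatles_txt_pth, "r", encoding="utf-8") as beatles_txt_ob:
    beatles_lyrics = beatles_txt_ob.read()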

In [5]:
# Extract unique characters and define vocabulary size
chars = sorted(set(beatles_lyrics))
vocab_size=len(chars)
print(f"Unique Characters: {''.join(chars)}\nNumber of Unique Characters:{len(chars)}")
# Create bi-directional mappings between characters and integers
char_to_int = {ch: i for i, ch in enumerate(chars)}
int_to_char = {i: ch for i, ch in enumerate(chars)}
print(f"For example Token 50 represents '{int_to_char[50]}'")

Unique Characters: 
 !&'(),-./0123456789:;<=>?ABCDEFGHIJKLMNOPQRSTUVWY[]abcdefghijklmnopqrstuvwxyzöü‘’“”
Number of Unique Characters:85
For example Token 50 represents 'Y'


Next, we must encode our entire `beatles_lyrics` string and turn it into a tensor, which we then split into training and validation tensors. These tensors will be used to create our training and validation loaders.

In [6]:
# Define encoding and decoding functions
def encode(string_text):
    return [char_to_int[ch] for ch in string_text]
def decode(encoded_text):
    return ''.join(int_to_char[i] for i in encoded_text)
# Encode the entire Beatles lyrics dataset into a torch.Tensor
data = torch.tensor(encode(beatles_lyrics), dtype=torch.long)
split_point = int(0.9 * len(data)) # Split data into 90% training and 10% validation
train_data = data[:split_point]  # Training data slice
val_data = data[split_point:]    # Validation data slice

The `CharacterDataset` class processes a dataset using a specified `block_size`, which defines the sequence length for training samples. The `__len__` method subtracts `block_size` from the dataset length so that every sequence, including its shifted target, is complete. The `__getitem__` method fetches an input sequence (`x`) and a target sequence (`y`), the latter being the input shifted by one token for next-token prediction.

We set `block_size` to 8, representing the maximum context length for generating sequences autoregressively. The DataLoaders (`train_loader` and `val_loader`) are configured with a `batch_size` of 4; only the training loader shuffles its data.

In [7]:
class CharacterDataset(Dataset):
    def __init__(self, data, block_size):
        self.data = data
        self.block_size = block_size
    def __len__(self):
        return len(self.data) - self.block_size # Subtract block_size to avoid overflow, ensure last sequence has full length
    def __getitem__(self, idx): # idx is the start index of the sequence and target sequence is shifted by one character
        return (self.data[idx:idx+self.block_size], self.data[idx+1:idx+self.block_size+1])
block_size = 8 # Define block size, meaning the sequence length is 8
batch_size = 4
# Instantiate the Dataset and define DataLoader for training and validation datasets
train_dataset = CharacterDataset(train_data, block_size)
val_dataset = CharacterDataset(val_data, block_size)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
# Print the first 3 elements in Train Dataset.
for ith in range(3):
    x, y = train_dataset[ith]
    x_decoded = decode([i.item() for i in x])
    y_decoded = decode([i.item() for i in y])
    print(f"Index {ith} element in train data\nx: {x} represents \"{x_decoded}\"\ny: {y} represents \"{y_decoded}\"\n----")

Index 0 element in train data
x: tensor([27,  1, 56, 53, 77,  1, 61, 66]) represents "A day in"
y: tensor([ 1, 56, 53, 77,  1, 61, 66,  1]) represents " day in "
----
Index 1 element in train data
x: tensor([ 1, 56, 53, 77,  1, 61, 66,  1]) represents " day in "
y: tensor([56, 53, 77,  1, 61, 66,  1, 72]) represents "day in t"
----
Index 2 element in train data
x: tensor([56, 53, 77,  1, 61, 66,  1, 72]) represents "day in t"
y: tensor([53, 77,  1, 61, 66,  1, 72, 60]) represents "ay in th"
----


Hence, each batch contains four input sequences and four target sequences. This means the shape of our initial input tensor is `(4, 8)`. With our loaders now prepared, we're ready to conduct a forward pass. The specific batch we'll use is defined below.

In [8]:
torch.manual_seed(0)  # Ensure reproducibility
x_example, y_example = next(iter(train_loader))
print(f"Tutorial x input tensor with batch_size=4 and block_size=8:\n{x_example}\nSize: {x_example.shape}")
print(f"Corresponding target tensor:\n{y_example}")

Tutorial x input tensor with batch_size=4 and block_size=8:
tensor([[57,  1, 72, 67,  1, 63, 66, 67],
        [60, 53, 72,  1, 75, 57,  1, 55],
        [54, 70, 67, 75, 66,  7,  1, 77],
        [ 1, 64, 61, 72, 72, 64, 57,  1]])
Size: torch.Size([4, 8])
Corresponding target tensor:
tensor([[ 1, 72, 67,  1, 63, 66, 67, 75],
        [53, 72,  1, 75, 57,  1, 55, 53],
        [70, 67, 75, 66,  7,  1, 77, 57],
        [64, 61, 72, 72, 64, 57,  1, 59]])


# Section 2: Token and Position embeddings

In the next step, we extract embeddings for each token from an embedding table.

1. In this token embedding table, each row corresponds to a different token, and each column within a row represents a feature of that token's embedding. Thus, the entire row forms the token's embedding vector with a dimensionality equal to `n_embd`, which we set to 32. Hence the token embedding table has 85 rows (one per token in the vocabulary) and 32 columns. PyTorch randomly initializes the token embedding table, which mimics conditions at the start of training.
2. For each token in our input data, the corresponding embedding is plucked from the table and appended to our tensor in its third dimension. Hence the shape of our input tensor is now `(4, 8, 32)`. This means that each individual sequence within each batch has shape `(8, 32)`. This can be seen below for clarity.
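
The lookup described above can be sketched as plain row indexing into the embedding weight matrix. This is a minimal illustration rather than the tutorial's own cell; the names `table` and `idx` are hypothetical, and the small batch of token ids is made up.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size_demo, n_embd_demo = 85, 32            # values used in this tutorial
table = nn.Embedding(vocab_size_demo, n_embd_demo)
print(table.weight.shape)                         # one row per token, one column per feature

idx = torch.tensor([[57, 1, 72], [60, 53, 72]])   # a made-up (2, 3) batch of token ids
looked_up = table(idx)                            # embedding lookup, shape (2, 3, 32)
indexed = table.weight[idx]                       # plain row indexing gives the same tensor
print(looked_up.shape, torch.equal(looked_up, indexed))
```

The embedding layer adds a trailing dimension of size `n_embd` to whatever index tensor it receives, which is why a `(4, 8)` input becomes `(4, 8, 32)`.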


In [10]:
n_embd = 32 # Set embedding dimension
torch.manual_seed(0)
token_embedding_table = nn.Embedding(num_embeddings=vocab_size, embedding_dim=n_embd) # Initialize the embedding layer with random weights
token_embedding = token_embedding_table(x_example) # Apply the embedding to the input tensor
print(f"Size of token_embedding is {token_embedding.shape}")
print(f"The first sequence of 8 tokens looks like the following:\n{token_embedding[0]}\nSize of the tensor above is {token_embedding[0].shape}.")

Size of token_embedding is torch.Size([4, 8, 32])
The first sequence of 8 tokens looks like the following:
tensor([[ 3.5992e-02, -8.7966e-01, -9.8009e-01,  1.6861e+00,  2.3678e-01,
          1.5649e+00, -2.2334e-01,  1.7531e-01,  8.5940e-02,  3.8752e-01,
         -1.1794e+00,  1.5783e+00, -2.0817e-01, -4.8517e-01, -4.3715e-02,
         -1.1596e-01,  7.9778e-01, -2.4252e-01, -4.7606e-01, -3.7957e-01,
         -1.1423e-02, -4.5123e-01,  1.0632e+00,  2.4969e-01, -5.4549e-02,
          8.1292e-01, -9.5756e-01,  1.2139e+00, -5.7249e-01,  7.9329e-02,
         -1.1229e+00, -1.4157e+00],
        [-6.1358e-01,  3.1593e-02, -4.9268e-01,  2.4841e-01,  4.3970e-01,
          1.1241e-01,  6.4079e-01,  4.4116e-01, -1.0231e-01,  7.9244e-01,
         -2.8967e-01,  5.2507e-02,  5.2286e-01,  2.3022e+00, -1.4689e+00,
         -1.5867e+00, -6.7309e-01,  8.7283e-01,  1.0554e+00,  1.7784e-01,
         -2.3034e-01, -3.9175e-01,  5.4329e-01, -3.9516e-01, -4.4622e-01,
          7.4402e-01,  1.5210e+00,  3.4105e

The next step involves incorporating positional embeddings. Similar to the token embeddings, we utilize a `position_embedding_table`. Using `torch.arange` to generate indices from 0 up to `sequence_length-1`, we expand these indices across the batch size to ensure each input in the batch receives a corresponding position embedding. The resulting shapes of the positional embeddings match those of the token embeddings.

In [11]:
batch_size, sequence_length = x_example.shape # x_example shape is [batch_size, sequence_length]
position_embedding_table = nn.Embedding(num_embeddings=sequence_length, embedding_dim=n_embd) # Create a position embedding table
positions = torch.arange(sequence_length, device=device).expand(batch_size, -1)
pos_emb = position_embedding_table(positions) # Retrieve position embeddings using the indices
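
As a side note, expanding the position indices across the batch is one of two equivalent options. The sketch below, which uses a random stand-in for the token embeddings and hypothetical names (`tok`, `pos_table`), suggests that looking up the positions once and relying on broadcasting gives the same sum.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, T, C = 4, 8, 32
tok = torch.randn(B, T, C)                        # stand-in for the token embeddings
pos_table = nn.Embedding(T, C)

# Variant used in the tutorial: one row of position indices per sequence
pos_expanded = pos_table(torch.arange(T).expand(B, -1))   # (B, T, C)
# Alternative: look up once and let broadcasting repeat it over the batch
pos_single = pos_table(torch.arange(T))                    # (T, C)

same = torch.equal(tok + pos_expanded, tok + pos_single)
print(pos_expanded.shape, pos_single.shape, same)
```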

Before feeding the input batch into the Transformer, we combine the token and position embeddings through element-wise addition, which is possible because both have shape `(4, 8, 32)`. After this addition, we apply layer normalization along the embedding dimension. This process merges the semantic and positional data of the tokens, producing an integrated input tensor `x` of size `(4, 8, 32)`.

In [13]:
x_pre_norm = token_embedding + pos_emb
layer_norm = nn.LayerNorm(n_embd) # Applying layer normalization pytorch module to the combined embeddings
x = layer_norm(x_pre_norm)
print(f"Example token embedding in x before layer norm:\n{x_pre_norm[0][0]}\n\nExample token embedding in x  after layer norm:\n{x[0][0]}\n")

Example token embedding in x before layer norm:
tensor([ 0.6451,  0.6672, -1.5585,  3.8984,  1.3934,  0.4970,  0.2501,  0.6093,
         0.5185, -1.8517, -0.3389,  1.7525, -0.5139, -0.5475,  0.7616,  0.2294,
         0.6061, -0.4820, -2.4072, -1.4543, -0.1333, -0.4006,  0.3217,  0.0206,
        -0.4155,  1.0770,  0.5796,  2.8171, -1.8143, -0.3091, -1.7327, -1.3206],
       grad_fn=<SelectBackward0>)

Example token embedding in x  after layer norm:
tensor([ 0.4568,  0.4735, -1.2140,  2.9235,  1.0241,  0.3445,  0.1573,  0.4297,
         0.3608, -1.4363, -0.2893,  1.2965, -0.4220, -0.4474,  0.5452,  0.1416,
         0.4272, -0.3978, -1.8575, -1.1350, -0.1334, -0.3360,  0.2116, -0.0167,
        -0.3473,  0.7843,  0.4071,  2.1037, -1.4080, -0.2667, -1.3461, -1.0336],
       grad_fn=<SelectBackward0>)
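
To make the normalization step concrete, the following sketch reproduces what `nn.LayerNorm` computes at initialization, when its learnable scale is 1 and its shift is 0. The tensor `h` is a random stand-in for the summed embeddings, and the variable names are hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
h = torch.randn(4, 8, 32)                         # stand-in for token + position embeddings
ln = nn.LayerNorm(32)                             # weight starts at 1, bias at 0

# Normalize each token's 32 features to zero mean and unit variance
mean = h.mean(dim=-1, keepdim=True)
var = h.var(dim=-1, keepdim=True, unbiased=False) # LayerNorm uses the biased variance
manual = (h - mean) / torch.sqrt(var + ln.eps)

print(torch.allclose(ln(h), manual, atol=1e-5))
print(manual[0, 0].mean().item(), manual[0, 0].std(unbiased=False).item())
```

Each token is normalized independently of the other tokens and of the other sequences in the batch.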



# Section 3: Self-Attention

## Section 3a: One head of Self Attention


Next, we explore a single head of self-attention. Self-attention allows the model to weigh the importance of different parts of the input data relative to each other. Here, we'll explore how this process is implemented in our `Head` class and how it processes input to produce an output.

### Attention

Attention is described as:

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$


**Query (Q)**:
- The query vector represents a specific item or part of the input data which is actively seeking information. It can be thought of as a question posed by a particular part of the input data. In the context of self-attention, each token in the sequence generates a query vector that seeks to find out how much it should pay attention to other parts of the input data.

**Key (K)**:
- The key vector represents aspects of the input data to be queried against. Each part of the input data generates a key that corresponds to it. The compatibility of a query with different keys determines the attention or focus level on various parts of the input. This relationship is determined through the scaled dot product between the two vectors.

**Value (V)**:
- The value vector actually contains the data from the input tokens that will be aggregated into the output. The amount of attention a query pays to a particular key determines the weighting of the corresponding value in the output.

**Explaining the equation**:
1. **Dot Product of Queries and Keys**: The attention scores are calculated by taking the dot product of the query with all keys, quantifying the similarity between tokens. The 'similarity' practically means how much each token of the input sequence influences others, guiding how much attention is allocated to each token.
2. **Scaling**: The dot products are divided by the square root of the key dimension, $\sqrt{d_k}$. For large $d_k$ the dot products grow large in magnitude, which pushes the softmax into regions with extremely small gradients; the scaling counteracts this.
3. **Softmax Application**: The softmax function is applied to the scaled dot product scores to convert each row into values between 0 and 1 that sum to 1, effectively representing probabilities. This ensures that the more similar the query is to a key, the higher the attention weight.
4. **Multiplication by Values**: Finally, the weighted softmax scores are applied to the value vectors. Each value vector represents the embedding of a token, encapsulating its contextual information. Multiplying these vectors by the softmax scores effectively weights the importance of each token's contribution. This process highlights which aspects of the embeddings are most relevant, ensuring that the model focuses on the most informative parts of the input data during further processing.
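
The four steps above can be sketched in a few lines. This is a minimal illustration without masking or dropout, using random tensors; the function name `attention` and the tensor sizes are assumptions chosen for readability, not part of the tutorial's `Head` class.

```python
import math
import torch
import torch.nn.functional as F

def attention(q, k, v):
    """Scaled dot-product attention without masking or dropout."""
    d_k = k.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # steps 1 and 2: dot products, then scaling
    weights = F.softmax(scores, dim=-1)                  # step 3: each row becomes a distribution
    return weights @ v, weights                          # step 4: weighted sum of the values

torch.manual_seed(0)
q = torch.randn(1, 5, 8)    # (batch, tokens, d_k)
k = torch.randn(1, 5, 8)
v = torch.randn(1, 5, 8)
out, weights = attention(q, k, v)
print(out.shape, weights.shape)
print(weights.sum(dim=-1))  # every row of weights sums to 1
```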



In [14]:
class Head(nn.Module):#One Head of Self Attention
    def __init__(self,head_size,dropout):
        super().__init__()
        self.key=nn.Linear(n_embd,head_size,bias=False) #takes a tensor of length n_embd and projects it to head_size
        self.query=nn.Linear(n_embd,head_size,bias=False)
        self.value=nn.Linear(n_embd,head_size,bias=False)
        self.register_buffer('tril',torch.tril(torch.ones(block_size,block_size))) #Implements masking
        self.dropout=nn.Dropout(dropout)
        #Initialize values for Tutorial demonstration purposes
        self.key_vector = None
        self.query_vector = None
        self.unmasked_weights = None
        self.masked_weights = None
        self.softmax_weights = None
        self.dropout_weights = None
        self.aggregated_values = None
    def forward(self,x):
        B,T,C=x.shape #B=batch_size, T=block_size, C=n_embd
        k=self.key(x)
        self.key_vector = k.clone()
        q=self.query(x)
        self.query_vector = q.clone()
        transpose_k=torch.transpose(k,1,2)
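        # Note: scores are scaled by C**-0.5 with C = n_embd (32); the formula above scales by sqrt(d_k), where d_k = head_size (8)
        wei= q@transpose_k * C**-0.5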
        self.unmasked_weights = wei.clone()
        wei=wei.masked_fill(self.tril[:T,:T]==0, float('-inf'))
        self.masked_weights = wei.clone()
        wei=F.softmax(wei,dim=-1)
        self.softmax_weights = wei.clone()
        wei=self.dropout(wei)
        self.dropout_weights = wei.clone()
        v=self.value(x)
        self.aggregated_values = v.clone()
        return wei@v

We will provide step-by-step outputs of the computations in the attention head. For clarity, however, we will only display the matrices of the first sequence in the batch. Each matrix will always contain 8 rows, one per token, with tensor shapes that vary throughout the forward pass.

The input to our `Head` class is the tensor `x` defined before, shape `(4, 8, 32)`. Since we are only looking at the first batch sequence, this tensor is `(8, 32)`.



In [15]:
print(f"Shape of the input tensor x: {x.shape}\nTensor of the first batch of the input tensor x:\n{x[0]}\nShape of first batch of the input tensor x: {x[0].shape}")

Shape of the input tensor x: torch.Size([4, 8, 32])
Tensor of the first batch of the input tensor x:
tensor([[ 0.4568,  0.4735, -1.2140,  2.9235,  1.0241,  0.3445,  0.1573,  0.4297,
          0.3608, -1.4363, -0.2893,  1.2965, -0.4220, -0.4474,  0.5452,  0.1416,
          0.4272, -0.3978, -1.8575, -1.1350, -0.1334, -0.3360,  0.2116, -0.0167,
         -0.3473,  0.7843,  0.4071,  2.1037, -1.4080, -0.2667, -1.3461, -1.0336],
        [-0.8623,  0.7000, -0.4710,  1.5819, -0.2941, -0.9924,  0.4580,  1.0163,
         -0.8155,  0.6536, -0.6400,  1.3092,  0.0311,  2.1863, -1.5589, -1.2425,
          0.1511,  0.7807,  0.1113, -0.0855,  0.2623,  0.5294, -0.0995, -0.8968,
         -1.0212, -0.3187,  0.9326,  1.2058, -0.9879, -1.5286,  1.5585, -1.6531],
        [ 1.0980,  1.0847, -0.4264,  1.8365, -0.4578, -0.2888,  0.7239, -1.4357,
          0.6711,  0.1065, -0.2511,  0.3691, -0.3241,  0.3483,  0.2968,  1.6777,
         -0.8705, -1.0669, -1.8113, -0.1857,  0.1032,  2.0222, -1.7120, -1.1176,
      

In [16]:
num_heads = 4
head_size = n_embd // num_heads
dropout = 0.2
head = Head(head_size,dropout) # Create an instance of the Head
output = head(x) # Forward pass of x through the head

**Linear Projections**:
   - **Key and Query Transformations**: The input tensor `x` first undergoes two separate linear transformations to produce keys (`k`) and queries (`q`). These transformations project the input tensor from an embedding dimension of `n_embd` to a smaller dimension called `head_size`, where `head_size = n_embd / num_heads` and `num_heads` is another hyperparameter: the number of attention heads in our multi-head attention class. We will go into the details of this later. For now, let `num_heads = 4`, which implies `head_size = 32 / 4 = 8`.
    - **Example**: Our embedding dimension (`n_embd`) is 32 and our `head_size` is 8. Since the input tensor `x` has a shape of `(4, 8, 32)`, the linear projection applies the same `(32, 8)` weight matrix to every sequence in the batch. Consequently, each sequence is projected to a new shape of `(8, 8)`, so the keys (`k`) and queries (`q`) each have the shape `(4, 8, 8)`.
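
The projection can be checked with a short sketch. It uses a random stand-in for `x` and hypothetical names (`x_demo`, `proj`). One detail worth noting is that `nn.Linear` stores its weight as `(out_features, in_features)`, so the `(32, 8)` matrix mentioned above corresponds to the transpose of the stored weight.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x_demo = torch.randn(4, 8, 32)                  # (batch, tokens, n_embd)
proj = nn.Linear(32, 8, bias=False)             # n_embd -> head_size

print(proj.weight.shape)                        # stored as (out_features, in_features)
k_demo = proj(x_demo)                           # projected keys, shape (4, 8, 8)
k_manual = x_demo @ proj.weight.T               # the same projection written out
print(k_demo.shape, torch.allclose(k_demo, k_manual, atol=1e-6))
```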
  

In [18]:
print("Shape of Key vector: ", head.key_vector.shape,"\nKey vector of first batch sequence:", head.key_vector[0],"\nShape of Key vector of first batch sequence:", head.key_vector[0].shape)
print("Shape of Query vector: ", head.query_vector.shape,"\nQuery vector of first batch sequence:", head.query_vector[0],"\nShape of Query vector of first batch sequence:", head.query_vector[0].shape)

Shape of Key vector:  torch.Size([4, 8, 8]) 
Key vector of first batch sequence: tensor([[ 2.2918e-01,  3.3472e-01, -2.7254e-01,  4.7843e-01,  6.8547e-01,
          9.1177e-01,  2.4008e-01, -1.1264e+00],
        [ 1.9337e-01,  3.3117e-01,  1.4469e-01,  8.1046e-01, -8.5312e-01,
         -8.7008e-02,  7.1845e-01, -7.1209e-02],
        [ 6.8281e-01, -4.2687e-01, -4.9447e-01,  5.3350e-01,  1.8807e-01,
          1.3283e+00, -3.9638e-04,  6.1308e-02],
        [ 6.6354e-01,  1.1197e-01,  3.9720e-01,  5.7539e-01, -3.3714e-01,
          4.7082e-03,  3.1838e-01,  8.2698e-01],
        [ 6.1830e-01, -3.4168e-01, -2.0562e-01,  4.6383e-01,  2.0837e-01,
         -8.4284e-01,  1.1451e+00, -1.9244e-01],
        [-7.0144e-01,  5.8017e-02, -6.6834e-01, -3.2742e-02,  2.5882e-01,
          4.3821e-01, -9.6909e-01, -5.1473e-01],
        [ 3.3577e-01,  4.5456e-01,  9.7191e-01, -8.0475e-01, -1.4307e-01,
         -2.5993e-02, -1.3257e+00,  5.5653e-01],
        [ 1.6220e+00,  3.0541e-01,  4.6095e-01,  2.6286e-0

**Calculating Attention Scores**:
Now we calculate the attention weight matrix, i.e. the weights we want to eventually apply to our Value matrix. This corresponds to the following part of the attention equation: $$\left(\frac{QK^T}{\sqrt{d_k}}\right)$$

   - **Transpose**: Before calculating the dot product, the keys tensor `k` is transposed to align the dimensions properly for matrix multiplication with the query tensor `q`.
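   - **Dot Product and Scaling**: Attention scores are computed by matrix-multiplying the queries `q` with the transposed keys `k` and then scaling the result. The formula above divides by $\sqrt{d_k}$, where $d_k$ is the key dimension (`head_size = 8`). Note that the `Head` class as written multiplies by `C**-0.5`, where `C` is the embedding dimension (`n_embd = 32`), so the scores printed below are divided by $\sqrt{32}$ rather than $\sqrt{8}$.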
   - **Shape of Scores**: The resulting tensor, denoted as `wei` in our class, which holds the attention scores, has a shape of `(4, 8, 8)`.
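
The motivation for scaling can be illustrated numerically. The sketch below assumes query and key components drawn independently with unit variance, which is an idealization rather than a property of the tutorial's tensors; under that assumption the variance of a raw dot product is about `d_k`, and dividing by $\sqrt{d_k}$ brings it back to about 1.

```python
import torch

torch.manual_seed(0)
n, d_k = 10_000, 8
q = torch.randn(n, d_k)                         # unit-variance query components
k = torch.randn(n, d_k)                         # unit-variance key components

raw = (q * k).sum(dim=-1)                       # n independent dot products
scaled = raw / d_k ** 0.5                       # divide by sqrt(d_k)

print(raw.var().item())                         # roughly d_k
print(scaled.var().item())                      # roughly 1
```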



In [19]:
print("Shape of Unmasked weight vector:", head.unmasked_weights.shape, "\nUnmasked weight vector of first batch sequence:", head.unmasked_weights[0], "\nShape of Unmasked weight vector of first batch sequence:", head.unmasked_weights[0].shape)

Shape of Unmasked weight vector: torch.Size([4, 8, 8]) 
Unmasked weight vector of first batch sequence: tensor([[-0.0157,  0.1203, -0.1613, -0.0254,  0.1581, -0.0918, -0.1198, -0.0993],
        [-0.0512,  0.2311, -0.1993,  0.0427, -0.0958, -0.1471,  0.0581, -0.1980],
        [-0.0927, -0.0873, -0.1183, -0.0933, -0.2811,  0.0696,  0.3176, -0.0025],
        [-0.0748, -0.1653, -0.0545,  0.1389, -0.1644, -0.0476,  0.5207,  0.4320],
        [-0.0595,  0.2116, -0.3924,  0.0513, -0.1412, -0.0188,  0.1741, -0.1577],
        [ 0.3095,  0.2778,  0.1183, -0.0383,  0.2893,  0.0640, -0.5482, -0.0684],
        [-0.0412, -0.0842,  0.2894,  0.0219,  0.1082, -0.0765, -0.1445,  0.0122],
        [-0.0758,  0.0190,  0.0490,  0.0639, -0.4415,  0.0852,  0.4859,  0.1974]],
       grad_fn=<SelectBackward0>) 
Shape of Unmasked weight vector of first batch sequence: torch.Size([8, 8])


**Masking, Normalization, and Dropout**:

- **Masking**: To prevent the model from accessing future tokens, we apply a masking operation using a lower triangular matrix (`tril`), setting attention scores for future tokens to negative infinity (`-inf`). This ensures that during the softmax operation, as seen below, the influence of future tokens is zeroed out, since $e^{-\infty} = 0$.

- **Softmax**: Post-masking with (`-inf`), the softmax function normalizes the scores across each row to form a probability distribution by using the formula:
  $$\text{Softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$$

- **Dropout**: This step introduces regularization by randomly setting a fraction of the softmax output elements to zero (here `dropout = 0.2`), reducing overfitting. During training, the remaining elements are scaled up by `1 / (1 - dropout)` to maintain the expected activation level, so a row of weights no longer necessarily sums to 1 after dropout.
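
The three operations above can be sketched on a small `4 x 4` score matrix. The scores are random, and the dropout probability of 0.5 is an assumption chosen so that the effect is easy to see; the tutorial itself uses 0.2.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
T, p = 4, 0.5
scores = torch.randn(T, T)
tril = torch.tril(torch.ones(T, T))

masked = scores.masked_fill(tril == 0, float("-inf"))   # hide future positions
weights = F.softmax(masked, dim=-1)                     # rows sum to 1, future entries are 0
print(weights)

drop = nn.Dropout(p)                                    # a freshly created module is in training mode
dropped = drop(weights)
# Every entry is either zeroed or scaled by 1 / (1 - p)
ok = (dropped == 0) | torch.isclose(dropped, weights / (1 - p))
print(dropped)
print(ok.all().item())
```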



In [20]:
print("Shape of Masked weight vector:\n", head.masked_weights.shape,"\nMasked weight vector of first batch sequence:\n", head.masked_weights[0],"\nShape of Masked weight vector of first batch sequence:", head.masked_weights[0].shape,
      "\n\nSoftmax applied to Masked weight vector of first batch sequence:\n", head.softmax_weights[0],
      "\n\nApplying Dropout to the matrix above:", head.dropout_weights[0])

Shape of Masked weight vector:
 torch.Size([4, 8, 8]) 
Masked weight vector of first batch sequence:
 tensor([[-0.0157,    -inf,    -inf,    -inf,    -inf,    -inf,    -inf,    -inf],
        [-0.0512,  0.2311,    -inf,    -inf,    -inf,    -inf,    -inf,    -inf],
        [-0.0927, -0.0873, -0.1183,    -inf,    -inf,    -inf,    -inf,    -inf],
        [-0.0748, -0.1653, -0.0545,  0.1389,    -inf,    -inf,    -inf,    -inf],
        [-0.0595,  0.2116, -0.3924,  0.0513, -0.1412,    -inf,    -inf,    -inf],
        [ 0.3095,  0.2778,  0.1183, -0.0383,  0.2893,  0.0640,    -inf,    -inf],
        [-0.0412, -0.0842,  0.2894,  0.0219,  0.1082, -0.0765, -0.1445,    -inf],
        [-0.0758,  0.0190,  0.0490,  0.0639, -0.4415,  0.0852,  0.4859,  0.1974]],
       grad_fn=<SelectBackward0>) 
Shape of Masked weight vector of first batch sequence: torch.Size([8, 8]) 

Softmax applied to Masked weight vector of first batch sequence:
 tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,

**Creating Value Tensor, Applying Attention, and Output**
- The input tensor `x` is transformed through a linear module separate from those for keys (`k`) and queries (`q`), producing a value tensor with shape `(4, 8, 8)`.
- The attention weights `wei`, of shape `(4, 8, 8)`, are then matrix-multiplied with this value tensor (`wei @ v`).
- The final output from the `Head` class is therefore a tensor in which each token's row is a weighted combination of the value vectors of the tokens it attends to, with an output shape of `(4, 8, 8)`.
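
To illustrate what `wei @ v` does, the sketch below works on a single sequence of four tokens with random values and no dropout (both assumptions for illustration). Each output row is a weighted average of the value rows up to and including that position.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, head_size_demo = 4, 8
v = torch.randn(T, head_size_demo)                       # one value vector per token
scores = torch.randn(T, T)
mask = torch.tril(torch.ones(T, T)) == 0                 # True above the diagonal
wei = F.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)

out = wei @ v                                            # (T, head_size)
# Row 2 of the output, written out as a weighted sum of value rows 0..2
manual_row2 = wei[2, 0] * v[0] + wei[2, 1] * v[1] + wei[2, 2] * v[2]
print(torch.allclose(out[0], v[0]))                      # the first token only sees itself
print(torch.allclose(out[2], manual_row2, atol=1e-6))
```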

In [21]:
print("Shape of Value tensor: ", head.aggregated_values.shape,
      "\nValue tensor of first batch sequence:\n", head.aggregated_values[0], "\nShape of Value tensor of first batch sequence: ", head.aggregated_values[0].shape,
      "\n\nNow to get final output we simply perform wei@value.", "\nShape of Output tensor: ", output.shape,
      "\nOutput tensor of first batch sequence:\n", output[0], "\nShape of Output tensor of first batch sequence: ", output[0].shape)


Shape of Value tensor:  torch.Size([4, 8, 8]) 
Value tensor of first batch sequence:
 tensor([[-0.4804,  0.3331,  0.8757, -0.8767,  0.6813, -0.8240,  0.5069, -0.8390],
        [ 0.7429,  0.4216,  0.4471, -0.2993,  0.4419, -0.3827, -0.6488, -1.1264],
        [ 0.1403,  0.6073,  0.9829, -0.5851,  1.2902,  0.1106,  0.0901, -0.4593],
        [ 0.3313,  0.0216,  0.2271, -0.5255,  0.0763,  0.4787, -0.9156,  0.3430],
        [ 1.0087,  0.2155,  0.5385, -0.5209,  0.4164, -0.9702,  0.2153, -0.1101],
        [ 0.6059,  0.0109, -0.3678, -0.3594, -0.8629, -0.4883,  0.2519,  0.6358],
        [-0.4756, -0.2780, -1.0006,  0.5887,  0.3727, -0.4774, -1.0189, -0.8238],
        [ 0.8161, -0.5589, -0.8094, -0.7569,  0.8272, -0.3293, -0.1516,  0.3642]],
       grad_fn=<SelectBackward0>) 
Shape of Value tensor of first batch sequence:  torch.Size([8, 8]) 

Now to get final output we simply perform wei@value. 
Shape of Output tensor:  torch.Size([4, 8, 8]) 
Output tensor of first batch sequence:
 tensor([[ 0

## Section 3b: Multi-Head attention

In this section, we apply the `Head` module `num_heads` times in order to employ Multi-Head Attention.


In [22]:
class MultiHeadAttention(nn.Module):
    def __init__(self,num_heads,head_size,dropout):
        super().__init__()
        self.heads=nn.ModuleList([Head(head_size,dropout) for _ in range(num_heads)])
        self.proj=nn.Linear(n_embd,n_embd)
        self.dropout=nn.Dropout(dropout)
        self.individual_heads=None #For tutorial purposes
    def forward(self,x):
        self.individual_heads= [h(x) for h in self.heads]
        out=torch.cat(self.individual_heads,dim=-1)
        out=self.dropout(self.proj(out))
        return out

In [23]:
# Apply the multi-head attention module
multihead=MultiHeadAttention(num_heads,head_size,dropout)
output_multihead = multihead(x)

Here’s how the Multihead Attention module operates on `x`:

1. Each of the four heads processes the input tensor `x` independently, outputting a new tensor of shape `(4, 8, 8)` that focuses on different aspects of the input. We confirm these dimensions for each head's output.

2. The outputs from all heads are concatenated along the third dimension to form a unified tensor with a shape of `(4, 8, 32)`.

3. This tensor undergoes a linear transformation with a `(32, 32)` weight matrix, maintaining the shape `(4, 8, 32)`. A dropout module is then applied to the tensor for regularization before producing the final output.
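
The concatenation in step 2 can be sketched with random stand-ins for the four head outputs (hypothetical names, not the tutorial's `multihead` object). Each head ends up in its own slice of eight feature columns, and the projection then mixes information across those slices.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
head_outputs = [torch.randn(4, 8, 8) for _ in range(4)]  # stand-ins for four head outputs
concat = torch.cat(head_outputs, dim=-1)                  # (4, 8, 32)

# Head i occupies feature columns [8*i, 8*(i+1)) of the concatenated tensor
print(torch.equal(concat[..., 8:16], head_outputs[1]))

proj = nn.Linear(32, 32)                                  # mixes information across heads
print(proj(concat).shape)
```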

In [24]:
for i, head in enumerate(multihead.individual_heads):
    print(f"Head {i+1}:", head.shape)
print("\nConcatenate the four heads in the third dimension creating the following output shape from the Multi-Head Attention module:", output_multihead.shape)

Head 1: torch.Size([4, 8, 8])
Head 2: torch.Size([4, 8, 8])
Head 3: torch.Size([4, 8, 8])
Head 4: torch.Size([4, 8, 8])

Concatenate the four heads in the third dimension creating the following output shape from the Multi-Head Attention module: torch.Size([4, 8, 32])


**Residual Connection Implementation and Layer Normalization**:

Residual connections improve gradient flow during backpropagation, mitigating the vanishing gradient issue in deep networks. This is implemented by adding the output tensor `output_multihead` of size `(4, 8, 32)` to the original tensor `x`. This addition is followed by a layer normalization to consolidate the combined outputs.


In [25]:
x= x+output_multihead
norm=nn.LayerNorm(n_embd)
x=norm(x)

# Section 4: Feed Forward Layer

The following cell contains the Feed Forward module, which is essentially two linear layers with a non-linear activation function between them (a multilayer perceptron).

In [26]:
class FeedForward(nn.Module): # Two linear layers with a ReLU non-linearity in between
    def __init__(self, n_embd, dropout_rate):
        super().__init__()
        self.expand_linear = nn.Linear(n_embd, 4 * n_embd)
        self.compress_linear = nn.Linear(4 * n_embd, n_embd)
        self.dropout = nn.Dropout(dropout_rate)
        #Initialize values for Tutorial demonstration purposes
        self.x_initial = None
        self.x_expanded_dim = None
        self.x_post_relu = None
        self.x_compressed_dim = None
    def forward(self, x):
        self.x_initial = x.clone()
        x = self.expand_linear(x)
        self.x_expanded_dim = x.clone()
        x = F.relu(x)
        self.x_post_relu = x.clone()
        x = self.compress_linear(x)
        self.x_compressed_dim = x.clone()
        x = self.dropout(x)
        return x

In [27]:
ffw= FeedForward(n_embd, dropout)
ffw_x_output=ffw(x)

Using the output `x` from the previous layer, the module performs the following steps, which are also demonstrated in the code below:

1. **Dimension Expansion**: A linear layer first expands the dimensionality of `x` from `n_embd` to `4 x n_embd`. The factor of 4 follows the paper, whose base model uses an inner feed-forward dimension of 2048 with a model dimension of 512.
2. **Activation Function**: A ReLU (Rectified Linear Unit) activation is applied, defined as $\text{ReLU}(x) = \max(0, x)$. This function sets all negative elements to zero and leaves non-negative values unchanged, introducing non-linearity into the model. The impact of ReLU is seen in the comparison below of tensors before and after its application.
3. **Dimension Compression**: Another linear layer then compresses the dimensionality back from `4 x n_embd` to `n_embd`, aligning the output dimensions with those of the input to the feedforward block.
4. A dropout layer follows.
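The four steps above can also be sketched as a single `nn.Sequential`. This is an illustrative restatement, not the tutorial's `FeedForward` class: the random input is a stand-in, and dropout is set to `0.0` as an assumption so the output is deterministic. It also counts the parameters the feed-forward block contributes for `n_embd = 32`.

```python
import torch
import torch.nn as nn

n_embd = 32
ffn = nn.Sequential(
    nn.Linear(n_embd, 4 * n_embd),   # expand: 32 -> 128
    nn.ReLU(),                       # zero out negative activations
    nn.Linear(4 * n_embd, n_embd),   # compress: 128 -> 32
    nn.Dropout(0.0),                 # dropout disabled here (assumption for this sketch)
)

x = torch.randn(4, 8, n_embd)        # stand-in input of shape (B, T, C)
hidden = ffn[1](ffn[0](x))           # activations after expansion and ReLU
out = ffn(x)

n_params = sum(p.numel() for p in ffn.parameters())
print(hidden.shape, out.shape)
print(n_params)                      # 32*128 + 128 + 128*32 + 32 = 8352
```

The linear layers act only on the last dimension, so the same weights are applied to every token position independently.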

In [28]:
print(f"Input tensor shape: {ffw.x_initial[0].shape}\nLinear transformation of input tensor causing tensor of shape: {ffw.x_expanded_dim[0].shape}")
print(f"First sequence of batch matrix with increased dim:\n{ffw.x_expanded_dim[0]}\n\nApply relu to above matrix:\n{ffw.x_post_relu[0]}")
print(f"\nApply another linear transformation reducing dim back to original dim: {ffw.x_compressed_dim.shape}\nFinally we apply a dropout layer to the compressed_dim tensor.")

Input tensor shape: torch.Size([8, 32])
Linear transformation of input tensor causing tensor of shape: torch.Size([8, 128])
First sequence of batch matrix with increased dim:
tensor([[-4.7745e-01,  1.2739e+00, -6.0470e-01,  ...,  1.0150e-01,
          2.6331e-01,  4.0180e-01],
        [ 1.5667e-01,  7.5898e-01, -9.1495e-01,  ...,  9.9886e-01,
          9.5105e-02, -5.3086e-02],
        [ 4.7425e-01,  4.4758e-01, -9.2756e-01,  ...,  2.8581e-01,
          9.6989e-01, -9.7188e-01],
        ...,
        [-8.6118e-01, -2.8129e-01,  1.1455e+00,  ..., -3.6359e-01,
         -1.2507e-01, -2.0360e-01],
        [-1.7872e-04, -2.4844e-02,  1.1270e+00,  ..., -2.8387e-01,
         -1.2985e-01,  9.3607e-02],
        [-3.6850e-01,  2.4188e-02,  2.0080e-01,  ...,  4.5271e-02,
          5.1141e-01, -7.5126e-02]], grad_fn=<SelectBackward0>)

Apply relu to above matrix:
tensor([[0.0000, 1.2739, 0.0000,  ..., 0.1015, 0.2633, 0.4018],
        [0.1567, 0.7590, 0.0000,  ..., 0.9989, 0.0951, 0.0000],
        [

This is followed by implementing another residual connection.

In [29]:
x= x+ffw_x_output
print(f"x shape: {x.shape}")

x shape: torch.Size([4, 8, 32])


This marks the completion of one transformer Block. Typically, this Block is repeated several times, determined by `n_layer`, a hyperparameter. The output from the first Block, the tensor `x`, serves as the input for the subsequent Block, and so on.
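As a rough illustration of stacking, the sketch below chains `n_layer` blocks so that each block's output feeds the next. `TinyBlock` is a hypothetical, simplified block built on PyTorch's `nn.MultiheadAttention`, not the tutorial's own attention code. The values `n_head = 4` and `n_layer = 3` are assumptions chosen for the example.

```python
import torch
import torch.nn as nn

class TinyBlock(nn.Module):
    """Hypothetical, simplified decoder block (not the tutorial's exact code)."""
    def __init__(self, n_embd, n_head):
        super().__init__()
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.norm = nn.LayerNorm(n_embd)
        self.ffn = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), nn.ReLU(), nn.Linear(4 * n_embd, n_embd)
        )

    def forward(self, x):
        T = x.size(1)
        # True above the diagonal marks future positions that may not be attended to.
        causal_mask = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1
        )
        attn_out, _ = self.attn(x, x, x, attn_mask=causal_mask, need_weights=False)
        x = self.norm(x + attn_out)      # residual connection, then layer norm
        return x + self.ffn(x)           # second residual connection

n_embd, n_head, n_layer = 32, 4, 3       # n_head and n_layer are assumed values
blocks = nn.Sequential(*[TinyBlock(n_embd, n_head) for _ in range(n_layer)])
blocks.eval()

x = torch.randn(4, 8, n_embd)            # stand-in input of shape (B, T, C)
out = blocks(x)
print(out.shape)                         # same shape as the input
```

Because every block maps `(B, T, n_embd)` to `(B, T, n_embd)`, any number of blocks can be chained without reshaping.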



# Section 5: Calculate Logits and Loss

In this tutorial, we let `n_layer = 1`. Therefore, the final representation of our input batch is the tensor `x` above. Before we proceed to calculate the logits and loss, we must apply a final layer norm to `x`.

In [30]:
final_norm=nn.LayerNorm(n_embd)
x=final_norm(x)

Next, we apply a linear transformation that expands each token's embedding from `n_embd = 32` to `vocab_size = 85`. This maps each token's representation into a space where each dimension holds an unnormalized score (a logit) for the corresponding vocabulary token being next in the sequence. Below, the batch goes from shape `(4, 8, 32)` to `(4, 8, 85)`, and we examine the first token vector of the first sequence, now with 85 dimensions.

In [31]:
LM_head=nn.Linear(n_embd, vocab_size)
x_lm_head=LM_head(x)
print(f"Shape of x after transformation {x_lm_head.shape}")
print(f"Shape of the first sequence of the batch {x_lm_head[0].shape}")
print(f"Shape of the first token in the first sequence of the batch {x_lm_head[0][0].shape}")
print(f"The first token in the first sequence of the batch:\n{x_lm_head[0][0]}")

Shape of x after transformation torch.Size([4, 8, 85])
Shape of the first sequence of the batch torch.Size([8, 85])
Shape of the first token in the first sequence of the batch torch.Size([85])
The first token in the first sequence of the batch:
tensor([ 0.3369,  0.4889,  0.8188, -1.2534, -0.2159,  0.3670,  0.0453, -0.3863,
         0.8111,  0.8527, -0.4549,  0.1976,  0.7870, -0.5994,  0.1839, -0.2460,
        -0.6910,  0.2241,  0.1366, -0.6425, -0.9364,  0.8674,  0.5115,  0.4715,
        -0.0195, -1.1206, -0.7382, -0.2078, -0.3084,  0.3226,  0.0551, -0.4556,
        -0.9575,  0.4610, -0.4741, -0.6724,  0.5813, -1.2560, -0.2587,  0.5787,
         0.4229, -0.7130,  0.1701,  1.0722,  0.3288, -0.4900,  1.6441,  0.1121,
         0.7037,  0.6791,  1.3215,  0.5854, -0.8697, -0.3246, -0.5486,  0.6922,
         0.0255,  1.3960, -0.0765, -0.2021,  0.3018,  0.2524, -1.0117,  0.2877,
         0.5397,  0.3793, -0.0650,  1.1037, -0.2339,  0.0717,  0.4808,  0.3143,
         0.0666,  0.1947,  1.3404, 

The transformed values are logits, not probabilities, so we apply softmax to convert them into a probability distribution over the 85 characters that sums to 1. In the output below no character dominates: the values are scattered around the uniform value of 1/85 ≈ 0.012, which is what we expect on the initial forward pass, before the model has learned anything.

In [32]:
token_1_probabilities = F.softmax(x_lm_head[0][0], dim=0)
token_1_probabilities

tensor([0.0120, 0.0139, 0.0194, 0.0024, 0.0069, 0.0123, 0.0089, 0.0058, 0.0192,
        0.0200, 0.0054, 0.0104, 0.0188, 0.0047, 0.0103, 0.0067, 0.0043, 0.0107,
        0.0098, 0.0045, 0.0033, 0.0203, 0.0143, 0.0137, 0.0084, 0.0028, 0.0041,
        0.0069, 0.0063, 0.0118, 0.0090, 0.0054, 0.0033, 0.0135, 0.0053, 0.0044,
        0.0153, 0.0024, 0.0066, 0.0152, 0.0130, 0.0042, 0.0101, 0.0250, 0.0119,
        0.0052, 0.0442, 0.0096, 0.0173, 0.0169, 0.0320, 0.0153, 0.0036, 0.0062,
        0.0049, 0.0171, 0.0088, 0.0345, 0.0079, 0.0070, 0.0116, 0.0110, 0.0031,
        0.0114, 0.0147, 0.0125, 0.0080, 0.0258, 0.0068, 0.0092, 0.0138, 0.0117,
        0.0091, 0.0104, 0.0326, 0.0170, 0.0215, 0.0092, 0.0046, 0.0144, 0.0043,
        0.0085, 0.0145, 0.0093, 0.0316], grad_fn=<SoftmaxBackward0>)
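For intuition about what softmax computes, the following sketch reproduces it by hand on a random logit vector. The vector is a stand-in for `x_lm_head[0][0]`, so the numbers will differ from those above. Subtracting the maximum before exponentiating is a common numerical-stability trick and leaves the result unchanged.

```python
import torch

torch.manual_seed(0)
vocab_size = 85
logits = torch.randn(vocab_size)             # stand-in for x_lm_head[0][0]

# Softmax by hand: exponentiate, then divide by the sum of the exponentials.
shifted = logits - logits.max()              # avoids overflow; result is unchanged
probs_manual = torch.exp(shifted) / torch.exp(shifted).sum()
probs_torch = torch.softmax(logits, dim=0)

print(torch.allclose(probs_manual, probs_torch, atol=1e-6))
print(probs_torch.sum().item())              # approx. 1.0
print(1 / vocab_size)                        # uniform baseline, approx. 0.0118
```

Softmax preserves the ordering of the logits, so the most likely character is the one with the largest logit.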

In [33]:
print(f"The following is the target tensor for the x tensor we are working with:\n{y_example}")
print(f"The target classification of the first token of the first sequence in the batch is:\n{y_example[0][0]}")

The following is the target tensor for the x tensor we are working with:
tensor([[ 1, 72, 67,  1, 63, 66, 67, 75],
        [53, 72,  1, 75, 57,  1, 55, 53],
        [70, 67, 75, 66,  7,  1, 77, 57],
        [64, 61, 72, 72, 64, 57,  1, 59]])
The target classification of the first token of the first sequence in the batch is:
1


Now, we will calculate the cross-entropy loss between our model's prediction for the first token (`token_1_probabilities`) and its target classification, which is `1`. Cross-entropy is the standard loss for classification tasks: it quantifies the difference between two probability distributions, the predicted probabilities and the actual distribution of the target labels.

**Cross-Entropy Loss Equation:**

The general formula for cross-entropy loss is:

$\text{Loss} = -\sum_{c=1}^{M} y_{o,c} \log(p_{o,c})$

where $M$ is the number of classes, $y_{o,c}$ is a binary indicator (0 or 1) showing whether class $c$ is the correct classification for observation $o$, and $p_{o,c}$ is the predicted probability of observation $o$ belonging to class $c$.

For instances where the target class is `1`, and assuming a one-hot encoded target, the equation simplifies to:

$\text{Loss} = -\log(p_{o,1})$

Here, $p_{o,1}$ represents the model's predicted probability that observation $o$ belongs to class `1`. This simplification occurs because the indicator $y_{o,c}$ is 1 for the correct class and 0 for all others, so every other term in the sum vanishes.
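The simplification can be checked with a small worked example in plain Python. The three-class distribution below is made up for illustration and is not taken from the model. The last lines compute the loss a perfectly uniform prediction over 85 characters would give, which is a useful reference point for an untrained model.

```python
import math

probs = [0.7, 0.2, 0.1]      # hypothetical predicted distribution over M = 3 classes
target = 1                   # index of the correct class
one_hot = [1.0 if c == target else 0.0 for c in range(len(probs))]

# Full sum over all classes versus the one-term shortcut.
full_sum = -sum(y * math.log(p) for y, p in zip(one_hot, probs))
shortcut = -math.log(probs[target])
print(full_sum, shortcut)    # both equal -log(0.2), approx. 1.609

# Loss if all 85 characters were predicted as equally likely.
uniform_loss = -math.log(1 / 85)
print(uniform_loss)          # log(85), approx. 4.443
```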

When calling `torch.nn.functional.cross_entropy`, we pass the logits directly (the model's outputs before softmax is applied) because the function computes the softmax internally.

The code below calculates the loss for the first token of the first sequence by comparing the logits (`x_lm_head[0][0]`) directly against the actual target (`y_example[0][0]`), with both tensors unsqueezed to add the batch dimension of the function's batched `(N, C)` logits and `(N)` targets layout.



In [34]:
loss_token_1 = F.cross_entropy(x_lm_head[0][0].unsqueeze(0), y_example[0][0].unsqueeze(0))
print(f"Loss for token 1: {loss_token_1.item()}")

Loss for token 1: 4.2735772132873535


For further intuition, see how the loss is the same when calculating it using the math module.

In [35]:
loss_token_1_with_equation=-math.log(token_1_probabilities[1]) #Index 1 because the target was 1
loss_token_1_with_equation

4.273577026684474

To compute the loss for the entire batch, we compare each input token to its target by reshaping the tensors for compatibility with the cross-entropy loss function:

- **Reshaping Logits and Targets**: The logits tensor `x_lm_head` is reshaped from `(B, T, C)` to `(B*T, C)`, and the target tensor `y_example` from `(B, T)` to `(B*T)`. This transformation treats each token as an independent instance, aligning logits with their corresponding targets. The rows of `y_example` are concatenated end to end into a single one-dimensional tensor during this process. This restructuring doesn't change the data but ensures each prediction is accurately paired with its label, facilitating direct comparison and loss calculation.

In [36]:
B,T,C= x_lm_head.shape
logits=x_lm_head.view(B*T,C)
print(f"The Logits shape: {logits.shape}")
targets= y_example.view(B*T)
print(f"y_example tensor before reorganization:\n{y_example}")
print(f"y_example tensor (targets) after reorganization:\n{targets}")
print(f"The Targets shape: {targets.shape}")
loss=F.cross_entropy(logits,targets)
print(f"Loss for the first batch after the first forward pass: {loss.item()}")

The Logits shape: torch.Size([32, 85])
y_example tensor before reorganization:
tensor([[ 1, 72, 67,  1, 63, 66, 67, 75],
        [53, 72,  1, 75, 57,  1, 55, 53],
        [70, 67, 75, 66,  7,  1, 77, 57],
        [64, 61, 72, 72, 64, 57,  1, 59]])
y_example tensor (targets) after reorganization:
tensor([ 1, 72, 67,  1, 63, 66, 67, 75, 53, 72,  1, 75, 57,  1, 55, 53, 70, 67,
        75, 66,  7,  1, 77, 57, 64, 61, 72, 72, 64, 57,  1, 59])
The Targets shape: torch.Size([32])
Loss for the first batch after the first forward pass: 4.8207879066467285
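To see what the batched call computes, the sketch below uses random stand-ins for `x_lm_head` and `y_example` (assumptions, so the loss value will differ from the one above). With the default `reduction='mean'`, the batch loss equals the average of the per-token losses, and flattening keeps tokens in row-major order.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, C = 4, 8, 85
logits_btc = torch.randn(B, T, C)              # stand-in for x_lm_head
targets_bt = torch.randint(0, C, (B, T))       # stand-in for y_example

batch_loss = F.cross_entropy(logits_btc.view(B * T, C), targets_bt.view(B * T))

# The same quantity computed token by token, then averaged.
per_token = torch.stack([
    F.cross_entropy(logits_btc[b, t].unsqueeze(0), targets_bt[b, t].unsqueeze(0))
    for b in range(B) for t in range(T)
])
print(batch_loss.item(), per_token.mean().item())

# Flattening is row-major: flat index b*T + t is token t of sequence b.
flat_targets = targets_bt.view(B * T)
print(flat_targets[1 * T + 2].item() == targets_bt[1, 2].item())
```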


**Conclusion**

This concludes a complete forward pass through a standard transformer decoder network. The next step is backpropagation, where gradients are computed to update the weights, followed by subsequent iterations of training to refine the model parameters based on the loss function.
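As a hedged sketch of what that next step looks like in PyTorch, the loop below runs a few forward, backward, and update iterations on a single fixed batch. The model is a deliberately tiny stand-in (an embedding followed by a linear head, with no attention). The optimizer, learning rate, and step count are assumptions for illustration, not settings from the tutorial.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, n_embd = 85, 32
# Hypothetical stand-in model: embedding followed directly by the LM head.
model = nn.Sequential(nn.Embedding(vocab_size, n_embd), nn.Linear(n_embd, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)   # assumed learning rate

xb = torch.randint(0, vocab_size, (4, 8))      # random input token ids (B, T)
yb = torch.randint(0, vocab_size, (4, 8))      # random targets (B, T)

losses = []
for step in range(20):
    logits = model(xb)                          # forward pass: (B, T, vocab_size)
    loss = F.cross_entropy(logits.view(-1, vocab_size), yb.view(-1))
    optimizer.zero_grad(set_to_none=True)       # clear gradients from the previous step
    loss.backward()                             # backpropagation: compute gradients
    optimizer.step()                            # update the weights
    losses.append(loss.item())

print(losses[0], losses[-1])                    # the loss on this batch should drop
```

Repeating these three calls over many different batches is what turns the single forward pass described here into training.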

This tutorial aimed to provide an understanding of how a transformer is constructed by demonstrating the setup of the training dataset and detailing a forward pass through the model with a single input batch. Observing the changes in tensor dimensions and the computation of attention offers valuable insight into how the model operates. While we cannot train a full Beatles GPT model here due to space constraints, this is the foundation.

# References

Karpathy, A. (2023). Building a GPT. [online] Available at: https://colab.research.google.com/drive/1JMLa53HDuA-i7ZBmqV7ZnA3c_fvtXnx-?usp=sharing.

PyTorch (n.d.). Cross Entropy Loss - documentation. [online] pytorch. Available at: https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html.

PyTorch (n.d.). Softmax - documentation. [online] pytorch. Available at: https://pytorch.org/docs/stable/generated/torch.nn.Softmax.html#softmax.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L. and Polosukhin, I. (2017). Attention Is All You Need. [online] arXiv.org. Available at: https://arxiv.org/abs/1706.03762.

Hugging Face (n.d.). Decoder models - Hugging Face NLP Course. [online] huggingface.co. Available at: https://huggingface.co/learn/nlp-course/chapter1/6.