# Build an LLM from scratch (i.e. ChatGPT model)

## Chapter 1 - Understanding LLMs

- LLMs 'understand' human language in that they can generate text that appears coherent/ contextually relevant<br>
<br>

- The 'large' in LLM refers to the large number of parameters and size of the dataset the model is trained on
- LLMs have a 'transformer' architecture, which allows them to pay 'selective' attention to parts of an input
- LLMs fall under the category of deep learning, which is a subset of machine learning, which is itself a subset of AI
- LLMs learn patterns from the data they are trained on, rather than relying on manually written rules<br>
<br>

- LLMs are best used for tasks that involve parsing and generating text in specialised fields like medicine/ law<br>
<br>

- Most modern LLMs are implemented using PyTorch - domain specific ones can outperform general ones like ChatGPT
- Custom LLMs can also be smaller and deployed on laptops/ phones, whereas bigger ones are more costly to run
- Creating an LLM involves 2 phases - pre-training (using large datasets) and fine-tuning (using narrower datasets)
- Pre-training uses raw, unlabelled text, and gives the LLM some simple capabilities like text completion
- Fine-tuning can be 'instruction' or 'classification', which uses 'QnA' style data or 'labelled' data respectively<br>
<br>

- The original transformer architecture was developed to translate English into German and French texts
- Transformers consist of:
    - An 'encoder' that processes the input text into numerical representations
    
    - A 'decoder' that processes the numerical representations and generates output text
- LLMs have a 'self-attention' mechanism that allows the model to weight the importance of different parts of an input
- Generative pre-trained transformers (GPT, like ChatGPT) use this mechanism to perform:
    - Zero-shot learning, where tasks are carried out without any prior examples

    - Few-shot learning, where the user provides a few examples as part of the input<br>
<br>

- Pre-trained models of current ChatGPT versions are versatile and good for being fine-tuned for specific purposes<br>
<br>

- GPT models were pre-trained on a next-word prediction task, which predicts the next word in a text based on the previous words
- This training is a form of self-labeling, where the structure of the data itself is the label (i.e. the predicted word)
- The GPT architecture only contains the 'decoder' part of the transformer; such models are called 'autoregressive'
    - These models incorporate their previous outputs as inputs for future predictions
- GPT models are also interesting in that they can perform tasks that they were not trained for
    - Language translation is an 'emergent' capability of GPT, showing that diverse tasks do not require diverse models<br>
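
The self-labelling idea can be sketched in a few lines of plain Python. This is only an illustration: whitespace splitting stands in for a real tokeniser, and the function name `next_word_pairs` is made up for this note.

```python
# Toy illustration of self-labelling: the targets come from the text itself.
# Whitespace splitting stands in for a real tokeniser (an assumption for brevity).
def next_word_pairs(text):
    words = text.split()
    # each prefix of the sentence is an input; the word that follows is its label
    return [(words[:i], words[i]) for i in range(1, len(words))]

pairs = next_word_pairs("LLMs learn to predict one word at a time")
for context, target in pairs:
    print(" ".join(context), "--->", target)
```

No human labelling is needed: every position in the text supplies its own target.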
<br>

- Building an LLM first requires data preparation and implementing the architecture (both can be done low-cost)<br>

## Chapter 2 - Working with Text data

- Pre-training the LLM involves preparing text data by splitting it into individual word and subword 'tokens'
- These are then encoded into vector representations (i.e. lists containing numbers)<br>
<br>

- Text embedding is the process of converting text into numerical vectors (done so as LLMs cannot process raw text)
- Word embeddings can have more than 1 dimension (i.e. more than 1 number in a list), more dimensions = more computation<br>
<br>

- Tokenising is the process of splitting text into tokens, where each word or punctuation mark is a single token<br>
<br>

The following code reads in a short story, "The Verdict", to be used as text data for the tokenisation process

In [1]:
file_path = "the-verdict.txt"

with open(file_path, "r", encoding="utf-8") as f:
    raw_text = f.read()

print("Total number of characters:", len(raw_text))
print("\nFirst 99 characters: " + str(raw_text[:99]))

Total number of characters: 20479

First 99 characters: I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


### 1. Split the text into singular words and punctuations (i.e. singular tokens)

- Python has a regular expression library that can be used to split text into singular words

In [None]:
import re

# using the regex library to split the text on punctuation, double dashes, and whitespace
# (the capturing group keeps the punctuation as tokens; empty and whitespace-only items are dropped)
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]

# each word or punctuation mark counts as a single token
print("Total number of tokens: " + str(len(preprocessed)))
print("\nFirst 10 individual words/ tokens: " + str(preprocessed[:10]))


Total number of tokens: 4690

First 10 individual words/ tokens: ['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius']
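
A small sketch of why the capturing group in the pattern matters, using a made-up sample sentence. Without the group, `re.split` discards the separators; with it, the punctuation comes back as list items and can be kept as tokens.

```python
import re

sample = "Hello, world. Is this-- a test?"

# without a capturing group, the separators are discarded
no_group = [t for t in re.split(r'[,.:;?_!"()\']|--|\s', sample) if t.strip()]

# with a capturing group, the separators are returned as list items too
with_group = [t.strip() for t in re.split(r'([,.:;?_!"()\']|--|\s)', sample) if t.strip()]

print(no_group)
print(with_group)
```

The whitespace separators are also returned by the second call, but they are removed by the `if t.strip()` filter.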


### 2. Convert the tokens (retrieved words and punctuations) into token IDs

- The unique tokens are sorted (punctuation and capitalised words come first) and stored in a vocabulary, where each token has a unique ID

In [None]:
# using the sorted function to sort the words alphabetically
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)

print("Total vocab size: " + str(vocab_size))

# assigning an integer to each token for the vocab
vocab = {token:integer for integer, token in enumerate(all_words)}

print("\nTokens 21 - 25 of the vocab:")
for i, item in enumerate(vocab.items()):
    if i > 20 and i <= 25:
        print(item)

Total vocab size: 1130

Tokens 21 - 25 of the vocab:
('Burlington', 21)
('But', 22)
('By', 23)
('Carlo', 24)
('Chicago', 25)


### 3. Apply the vocabulary to convert new text data into token IDs

- This implements the encode (text to token IDs) and decode (token IDs to text) steps of a tokeniser, which can be combined in a tokeniser class

- However, this will only be able to tokenise text that is within the vocabulary

In [None]:
# creating a tokeniser class that combines the previous steps of splitting and sorting the text
class Tokeniser:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s, i in vocab.items()}

    # converts the input text into tokens for the vocab
    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]

        ids = [self.str_to_int[s] for s in preprocessed]
        return ids
    
    # converts token ids into text
    def decode(self, ids):
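        text = " ".join([self.int_to_str[i] for i in ids])
        # removes the space before punctuation; apostrophes are not in this set, so "It ' s" keeps its spaces
        text = re.sub(r'\s+([,.:;?!"()\\])', r'\1', text)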

        return text

tokeniser = Tokeniser(vocab)
text = """"It's the last he painted, you know,"
       Mrs. Gisburn said with pardonable pride."""
ids = tokeniser.encode(text)

print("Token ids of the words in the text: " + str(ids))
print("\nDecoded result of the tokenised words: " + str(tokeniser.decode(ids)))

Token ids of the words in the text: [1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]

Decoded result of the tokenised words: " It ' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.
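
The vocabulary limitation can be seen with a tiny stand-in vocabulary. The sketch below is self-contained and uses hypothetical names (`toy_vocab`, `toy_encode`); it follows the same splitting and lookup steps as the class above, on a one-line corpus.

```python
import re

# tiny hypothetical vocabulary, built the same way as above from a one-line corpus
corpus = "the last he painted, you know."
tokens = [t.strip() for t in re.split(r'([,.:;?_!"()\']|--|\s)', corpus) if t.strip()]
toy_vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}

def toy_encode(text):
    parts = [t.strip() for t in re.split(r'([,.:;?_!"()\']|--|\s)', text) if t.strip()]
    return [toy_vocab[p] for p in parts]  # raises KeyError for unseen tokens

print(toy_encode("you know the last."))

try:
    toy_encode("you know the first.")
    failed_token = None
except KeyError as err:
    failed_token = err.args[0]

print("Token missing from the vocabulary:", failed_token)
```

A plain dictionary lookup has no fallback, so any word outside the training text stops the encoding.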


### 4. Modify the tokeniser to handle unknown words

- Tokens can be added to the vocabulary to represent unknown words and text separations

- The encode function of the tokeniser class can be modified to tokenise these special cases in the text data

In [None]:
# adding new tokens to the vocab to process unknown words and text separations
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])
vocab = {token:integer for integer, token in enumerate(all_tokens)}

print("New amount of tokens in vocab: " + str(len(vocab.items())))

# replacing the previous encode function with a new one that replaces unknown words in the text with "<|unk|>"
def new_encode(self, text):
    preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    preprocessed = [item.strip() for item in preprocessed if item.strip()]
    preprocessed = [item if item in self.str_to_int
                    else "<|unk|>" for item in preprocessed]

    ids = [self.str_to_int[s] for s in preprocessed]
    return ids

Tokeniser.encode = new_encode

text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."
text = " <|endoftext|> ".join((text1, text2))

print("\nJoined sample text with separation: " + text)

new_tokeniser = Tokeniser(vocab)

print("\nDecoded encoded text: " + new_tokeniser.decode(new_tokeniser.encode(text)))

New amount of tokens in vocab: 1132

Joined sample text with separation: Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.

Decoded encoded text: <|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.


*Unknown words can also be handled by tokenising them through Byte Pair Encoding (BPE)*

- A BPE tokeniser breaks down unknown words into subwords which exist within the vocab and have token ids

- When the tokeniser is built, BPE starts from single characters and repeatedly merges the most frequent adjacent pairs into larger subwords, which grows the vocab

*This was used to tokenise the training data for models like GPT-2 and GPT-3, the original models behind ChatGPT*
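
The merge step can be sketched on a handful of words. This is a simplified illustration, not the GPT-2 tokeniser: real BPE works on bytes and a far larger corpus, and the helper names `most_frequent_pair` and `merge_pair` are made up for this note.

```python
from collections import Counter

def most_frequent_pair(words):
    # count adjacent symbol pairs across all words
    pairs = Counter()
    for symbols in words:
        pairs.update(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    # replace every occurrence of the pair with one merged symbol
    merged_words = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged_words.append(out)
    return merged_words

# start from single characters and apply two merges
words = [list(w) for w in ["low", "lower", "lowest", "slow"]]
learned_merges = []
for _ in range(2):
    pair = most_frequent_pair(words)
    learned_merges.append(pair)
    words = merge_pair(words, pair)

print("Merges learned:", learned_merges)
print("Words after merging:", words)
```

After two merges the frequent sequence "low" has become a single symbol, while rarer endings such as "e", "s", "t" remain separate pieces that can still spell out unseen words.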

In [None]:
import tiktoken

# using the tiktoken library to encode and decode text
tokeniser = tiktoken.get_encoding("gpt2")

text = (
    "Hello, do you like tea? <|endoftext|> In the sunlit terraces"
    "of someunknownPlace."
)
ids = tokeniser.encode(text, allowed_special={"<|endoftext|>"})

print("Tiktoken tokeniser encoded ids: " + str(ids))

words = tokeniser.decode(ids)

print("\nTiktoken tokeniser decoded words: " + words)

Tiktoken tokeniser encoded ids: [15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 1659, 617, 34680, 27271, 13]

Tiktoken tokeniser decoded words: Hello, do you like tea? <|endoftext|> In the sunlit terracesof someunknownPlace.


### 5. Generate input-target pairs from the text data using a 'sliding window'

- LLMs are pretrained by predicting the next word in a text, where the predicted words are taken in as input each time

- The 'sliding window' takes a group of words in a text for each prediction (i.e. a single context), and moves across it to continue predicting

In [None]:
the_verdict = raw_text
enc_text = tokeniser.encode(the_verdict)

print("Number of tokens in the verdict, as tokenised by the BPE tokeniser: " + str(len(enc_text)))

enc_sample = enc_text[50:]

# the context size is the number of tokens the LLM will process at a single time
context_size = 4
x = enc_sample[:context_size]
y = enc_sample[1: context_size + 1]

print("\nIllustration of how the sliding window works:")
print(f"x:  {x}")
print(f"y:       {y}\n")

for i in range(1, context_size + 1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(tokeniser.decode(context), " --->", tokeniser.decode([desired]))

Number of tokens in the verdict, as tokenised by the BPE tokeniser: 5145

Illustration of how the sliding window works:
x:  [290, 4920, 2241, 287]
y:       [4920, 2241, 287, 257]

 and  --->  established
 and established  --->  himself
 and established himself  --->  in
 and established himself in  --->  a


### 6. Implement a more efficient way of iterating over the text data and returning the input-target pairs

- PyTorch is a library that can return data as 'tensors' (i.e. multidimensional arrays)

- The input tensor contains the text data the LLM sees and the target tensor contains the targets the LLM predicts

In [None]:
import torch

from torch.utils.data import Dataset, DataLoader

# creating a class to initialise a text dataset
class GPTDataset(Dataset):
    def __init__(self, txt, tokeniser, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        token_ids = tokeniser.encode(txt)

        # assigning input and target ids for the text input (the target id will be one after the input id)
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1:i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)
    
    def __getitem__(self, index):
        return self.input_ids[index], self.target_ids[index]
    
# creating a function to create a dataloader
def create_dataloader(txt, batch_size=4, max_length=256,
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):
    tokeniser = tiktoken.get_encoding("gpt2")
    dataset = GPTDataset(txt, tokeniser, max_length, stride)
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader

# this creates a dataloader that looks at 4 tokens per input, with targets shifted by one token, and (as stride=1) moves 1 token ahead for the next input
dataloader = create_dataloader(the_verdict, batch_size=1, max_length=4, stride=1, shuffle=False)
data_iter = iter(dataloader)
first_batch = next(data_iter)

print("First input and target of the LLM: " + str(first_batch))

second_batch = next(data_iter)

print("\nSecond input and target of the LLM: " + str(second_batch))

First input and target of the LLM: [tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]

Second input and target of the LLM: [tensor([[ 367, 2885, 1464, 1807]]), tensor([[2885, 1464, 1807, 3619]])]
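
The effect of `max_length` and `stride` can be checked without PyTorch. The sketch below reuses the loop bounds from `GPTDataset` on stand-in token IDs; the function name `sliding_windows` is hypothetical.

```python
def sliding_windows(token_ids, max_length, stride):
    # same loop bounds as GPTDataset above: every window needs one extra token for its target
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[i:i + max_length]
        targets = token_ids[i + 1:i + max_length + 1]
        pairs.append((inputs, targets))
    return pairs

toy_ids = list(range(10))  # stand-in token IDs 0..9

overlapping = sliding_windows(toy_ids, max_length=4, stride=1)
disjoint = sliding_windows(toy_ids, max_length=4, stride=4)

print(len(overlapping), overlapping[:2])
print(len(disjoint), disjoint)
```

A stride of 1 gives heavily overlapping windows, while a stride equal to `max_length` gives windows that do not overlap, so each token is used as an input at most once.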


### 7. Convert the token IDs into embeddings (i.e. vectors of values that are random at first and learned during training)

- Each token in the vocab is assigned a vector of randomly initialised values, with one value per output dimension (3 for this example)

- When any token is read in as input, its vector is looked up by token ID in the embedding layer and returned

In [40]:
# creates a tensor/ array with 4 ids
input_ids = torch.tensor([2, 3, 5, 1])
vocab_size = 6
output_dim = 3

# pytorch assigns random values to each of the output dimensions of each token
torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

print("Embeddings of tokens in vocab:\n", embedding_layer.weight)

print("\nEmbeddings for token 3/ row 4 in layer:\n", embedding_layer(torch.tensor([3])))

print("\nEmbeddings for the 4 tokens in input_ids:\n", embedding_layer(input_ids))

Embeddings of tokens in vocab:
 Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)

Embeddings for token 3/ row 4 in layer:
 tensor([[-0.4015,  0.9666, -1.1481]], grad_fn=<EmbeddingBackward0>)

Embeddings for the 4 tokens in input_ids:
 tensor([[ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-2.8400, -0.7849, -1.4096],
        [ 0.9178,  1.5810,  1.3010]], grad_fn=<EmbeddingBackward0>)
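
The lookup can be reproduced by hand. The sketch below copies the rounded weight rows from the output above into a plain list and shows that an embedding lookup is just row indexing, which gives the same result as multiplying a one-hot vector by the weight matrix. The function names are made up for this note.

```python
# rows copied from the embedding_layer.weight output above (rounded to 4 decimals)
weight = [
    [0.3374, -0.1778, -0.1690],
    [0.9178, 1.5810, 1.3010],
    [1.2753, -0.2010, -0.1606],
    [-0.4015, 0.9666, -1.1481],
    [-1.1589, 0.3255, -0.6315],
    [-2.8400, -0.7849, -1.4096],
]

def lookup(token_id):
    # an embedding lookup is row indexing into the weight matrix
    return weight[token_id]

def one_hot_matmul(token_id):
    # the same result expressed as a one-hot vector times the weight matrix
    one_hot = [1.0 if i == token_id else 0.0 for i in range(len(weight))]
    return [sum(one_hot[r] * weight[r][c] for r in range(len(weight)))
            for c in range(len(weight[0]))]

print(lookup(3))
print(one_hot_matmul(3))
print([lookup(i) for i in [2, 3, 5, 1]])
```

Row indexing avoids the wasted multiplications by zero, which is why an embedding layer is used instead of a one-hot input followed by a linear layer.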


### 8. Change the token embeddings to account for different positions in a text input

- When a text input has duplicate words, the same token embeddings are returned for both occurrences, so the model cannot tell their positions apart

- To avoid this, the original token embeddings can be modified to indicate their positions (absolute positional embeddings)

In [10]:
# using the vocab size of the GPT-2 BPE tokeniser (50,257) and an embedding size of 256 as an example
vocab_size = 50257
output_dim = 256
token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

# using 'the-verdict' text as the text input for the dataloader
max_length = 4
dataloader = create_dataloader(the_verdict, batch_size=8, max_length=max_length, stride=max_length, shuffle=False)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)

# the 'batch_size' is the number of rows and the 'max_length' is the number of tokens in each row
print("Token IDs:\n", inputs)
print("\nInputs shape:\n", inputs.shape)

# this maps each token id to a vector of 256 values (the gpt-3 model used 12,288 :O)
token_embeddings = token_embedding_layer(inputs)

print("\nToken embeddings shape:\n", token_embeddings.shape)

# to modify the embeddings to indicate their position, a new layer must be created
context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)
pos_embeddings = pos_embedding_layer(torch.arange(context_length))

print("\nPosition embeddings shape:\n", pos_embeddings.shape)

# add the 2 layers together to create the final input embeddings for the LLM
input_embeddings = token_embeddings + pos_embeddings

print("\nFinal input embeddings shape:\n", input_embeddings.shape)

Token IDs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Inputs shape:
 torch.Size([8, 4])

Token embeddings shape:
 torch.Size([8, 4, 256])

Position embeddings shape:
 torch.Size([4, 256])

Final input embeddings shape:
 torch.Size([8, 4, 256])


The final embeddings tensor has three dimensions, where the:

- First dimension: contains the input rows of the batch (8 in the code above)

- Second dimension: each row contains multiple tokens (4 in the code above)

- Third dimension: each token contains multiple embedding values (256 in the code above)
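
The addition of token and positional embeddings can be sketched with tiny nested lists. All values and sizes below are hypothetical (2 rows, 3 tokens, 2 embedding values); the point is that the same positional vector is added at a given position in every row, which is what PyTorch's broadcasting does in the cell above.

```python
batch_size, context_length, output_dim = 2, 3, 2

# hypothetical toy values: token_embeddings[b][t] is the vector for token t of row b
token_embeddings = [
    [[1.0, 1.0], [2.0, 2.0], [1.0, 1.0]],   # first and third tokens are identical
    [[3.0, 3.0], [3.0, 3.0], [4.0, 4.0]],
]
# one vector per position, shared by every row of the batch
pos_embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

input_embeddings = [
    [[tok + pos for tok, pos in zip(token_embeddings[b][t], pos_embeddings[t])]
     for t in range(context_length)]
    for b in range(batch_size)
]

print(input_embeddings[0][0], input_embeddings[0][2])
```

The first and third tokens of the first row start with identical token embeddings but end up with different input embeddings, because their positions differ.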

**The initial text data is now prepared for processing by the main LLM modules!**

## Chapter 3 - Coding Attention Mechanisms

- Attention mechanisms are part of the LLM architecture, and take token embeddings as input

- These can be implemented iteratively, going from a simplified -> normal -> causal -> multi-head<br>
<br>

- Pre-LLM architectures did not have attention mechanisms and could not process long sequences of text

- They could not process the grammatical structure and context of text, affecting capabilities like word translation

- An encoder-decoder recurrent neural network (RNN) can be used to address this problem:
    1. The encoder takes in input text and gives it a 'hidden state', which carries the context for the sentence

    2. The hidden state of the text is updated as each word is processed, eventually creating a final hidden state

    3. The decoder takes the final hidden state to translate the sentence one word at a time

- However, encoder-decoder RNNs cannot access earlier hidden states (only the final one), causing a loss of context

- This drove the development of attention mechanisms, which can 'pay attention' to all tokens and contexts of an input<br>
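
The bottleneck of the encoder can be sketched with a scalar toy RNN. This is a deliberately minimal illustration: the weights `w_x` and `w_h` are hypothetical, and each "word" is a single number rather than a vector.

```python
import math

def rnn_encode(inputs, w_x=0.5, w_h=0.8):
    # scalar toy RNN: hypothetical weights, one number per "word"
    hidden = 0.0
    states = []
    for x in inputs:
        # the new hidden state mixes the current input with the previous state
        hidden = math.tanh(w_x * x + w_h * hidden)
        states.append(hidden)
    return states

states = rnn_encode([1.0, -2.0, 0.5, 3.0])
final_state = states[-1]

print("All hidden states:", states)
print("What the decoder receives:", final_state)
```

Every intermediate state is computed, but only the last one is handed to the decoder, so the whole input has to be squeezed into that one value.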
<br>

- The transformer architecture uses an attention mechanism similar to the Bahdanau mechanism for RNNs

- The mechanism can access all input tokens selectively and give some more weight than others, affecting output

- In essence, 'self-attention' allows each word in the input to consider the relevance of all the other words in the same input<br>
<br>

- Implementing a simplified attention mechanism starts with a sample text input with a smaller embedding dimension (3)

In [None]:
import torch

# the inputs are token embeddings for the tokens: "Your", "journey", "starts", "with", "one", "step"
inputs = torch.tensor(
    [[0.43, 0.15, 0.89],
     [0.55, 0.87, 0.66],
     [0.57, 0.85, 0.64],
     [0.22, 0.58, 0.33],
     [0.77, 0.25, 0.10],
     [0.05, 0.80, 0.55]]
)

### 1. Calculate the attention scores of each of the tokens in relation to one of the tokens (i.e. the second one)

- The dot product between the token embeddings of the second token and that of the other tokens is calculated

- This represents how closely two vectors are aligned, where a higher dot product = higher attention between two tokens

In [41]:
# the query token is the token that is focused on, and is the foundation of the calculation of the scores of the other tokens
query = inputs[1] # query token is the second token
attn_scores_2 = torch.empty(inputs.shape[0]) # taking the number of tokens from the inputs (i.e. 6)
for i, x_i in enumerate(inputs):
    attn_scores_2[i] = torch.dot(x_i, query)

print("Attention scores for each of the 6 tokens:\n", attn_scores_2)

Attention scores for each of the 6 tokens:
 tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865])
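
The dot product is only an element-wise multiplication followed by a sum, so two of the scores above can be checked by hand. The variable names below follow the token labels in the input comment; `dot` is a small helper written for this note.

```python
def dot(u, v):
    # multiply matching elements and add the results
    return sum(a * b for a, b in zip(u, v))

journey = [0.55, 0.87, 0.66]  # second token, used as the query
your    = [0.43, 0.15, 0.89]  # first token
starts  = [0.57, 0.85, 0.64]  # third token

print(round(dot(your, journey), 4))
print(round(dot(starts, journey), 4))
```

"starts" gets a higher score than "Your" because its embedding points in nearly the same direction as the query's embedding.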


### 2. Normalise each of the attention scores so that they add up to 1

- This is a convention that is useful for interpretation and maintaining training stability

In [None]:
# dividing each of the scores by the total sum, so that the resulting weights add up to 1
attn_weights_2 = attn_scores_2 / attn_scores_2.sum()

print("Attention weights: " + str(attn_weights_2))
print("Sum: " + str(attn_weights_2.sum()))

# extra: pytorch's softmax function copes better with negative and extreme values, and always returns positive weights
attn_weights_2 = torch.softmax(attn_scores_2, dim=0)

print("\nAttention weights: " + str(attn_weights_2))
print("Sum: " + str(attn_weights_2.sum()))

Attention weights: tensor([0.1455, 0.2278, 0.2249, 0.1285, 0.1077, 0.1656])
Sum: tensor(1.0000)

Attention weights: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
Sum: tensor(1.)
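
A plain-Python sketch of why softmax is preferred over dividing by the sum. The scores in `mixed` are made up; the `softmax` helper subtracts the maximum before exponentiating, which is a common way to keep the computation stable and is assumed here rather than taken from the notes.

```python
import math

def sum_normalise(scores):
    total = sum(scores)
    return [s / total for s in scores]

def softmax(scores):
    # subtracting the max leaves the result unchanged but keeps exp() from overflowing
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

mixed = [2.0, -1.0, 0.5]
print(sum_normalise(mixed))      # contains a negative "weight"
print(softmax(mixed))            # all positive, sums to 1
print(softmax([1000.0, 999.0]))  # math.exp(1000.0) on its own would raise OverflowError
```

With negative scores, dividing by the sum can produce negative weights, which cannot be read as proportions; softmax always gives positive weights that sum to 1.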


### 3. Calculate the context vector for the second token

- This multiplies the token embeddings with their corresponding attention weights and sums the results

- Context vectors provide an enriched representation of each token in relation to other tokens

In [None]:
# multiplying the attention weights of each of the tokens with their token embeddings
query = inputs[1]
context_vec_2 = torch.zeros(query.shape)
for i,x_i in enumerate(inputs):
    context_vec_2 += attn_weights_2[i] * x_i

print("Context vector for second token: " + str(context_vec_2))

Context vector for second token: tensor([0.4419, 0.6515, 0.5683])
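
The three steps so far (scores, weights, context vector) fit in a few lines of plain Python. This sketch repeats them for the second token without PyTorch, using the same six input embeddings.

```python
import math

inputs = [
    [0.43, 0.15, 0.89],
    [0.55, 0.87, 0.66],
    [0.57, 0.85, 0.64],
    [0.22, 0.58, 0.33],
    [0.77, 0.25, 0.10],
    [0.05, 0.80, 0.55],
]
query = inputs[1]  # the second token

# step 1: attention scores are dot products with the query
scores = [sum(a * b for a, b in zip(x, query)) for x in inputs]

# step 2: softmax turns the scores into weights that sum to 1
exps = [math.exp(s) for s in scores]
weights = [e / sum(exps) for e in exps]

# step 3: the context vector is the weighted sum of all input embeddings
context = [sum(w * x[d] for w, x in zip(weights, inputs)) for d in range(3)]

print([round(c, 4) for c in context])
```

The result should agree with the PyTorch output above up to rounding, since PyTorch computes in 32-bit floats and Python in 64-bit floats.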


### 4. Replicate this process for all other tokens (tokens 1, 3 - 6)

- Compute the attention scores, normalise them into attention weights, and use the weights to compute the context vectors

- The 'dim' parameter of the softmax function denotes the dimension that the normalisation will take place:
    - A parameter of '-1' simply refers to the last dimension of the tensor (like where -1 refers to the last element in a list)

In [None]:
# creating an empty frame for the tokens to calculate their attention scores
attn_scores = torch.empty(6, 6)
for i, x_i in enumerate(inputs):
    for j, x_j in enumerate(inputs):
        attn_scores[i, j] = torch.dot(x_i, x_j)

# as for loops are slow, the scores can also be calculated using pytorch's matrix multiplication function
attn_scores = inputs @ inputs.T

print("Attention scores for each of the tokens in relation to the others:\n", attn_scores)

attn_weights = torch.softmax(attn_scores, dim=-1)

print("\nNormalised attention weights:\n", attn_weights)

context_vecs = attn_weights @ inputs

print("\nContext vectors for each of the tokens:\n", context_vecs)

Attention scores for each of the tokens in relation to the others:
 tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])

Normalised attention weights:
 tensor([[0.2098, 0.2006, 0.1981, 0.1242, 0.1220, 0.1452],
        [0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581],
        [0.1390, 0.2369, 0.2326, 0.1242, 0.1108, 0.1565],
        [0.1435, 0.2074, 0.2046, 0.1462, 0.1263, 0.1720],
        [0.1526, 0.1958, 0.1975, 0.1367, 0.1879, 0.1295],
        [0.1385, 0.2184, 0.2128, 0.1420, 0.0988, 0.1896]])

Context vectors for each of the tokens:
 tensor([[0.4421, 0.5931, 0.5790],
        [0.4419, 0.6515, 0.5683],
        [0.4431, 0.6496, 0.5671],
        [0.4304, 0.6298, 0.5510],
        [0.4671, 0.5910

**This completes the simplified self-attention mechanism!**

- This can now be extended into the "scaled dot-product" self-attention mechanism used by the original transformer and GPT models

- The main difference between the two is the addition of trainable weight matrices, which are updated during the training phase

### 1. Add weight matrices that project each token embedding into query, key, and value vectors

- Three weight matrices (W_query, W_key, W_value) are multiplied with a token embedding to give its query, key, and value vectors; for one query token, only its own query vector is needed, along with the key and value vectors of all tokens

- Start by calculating the vectors for the query token (i.e. second token for the example), and then the keys and values for the rest

The query, key, and value terms are taken from the domain of information retrieval and simply mean to search, store, and retrieve information

In [43]:
x_2 = inputs[1] # the second token
dim_in = inputs.shape[1] # the input dimensions for each of the tokens (3)
dim_out = 2 # the output dimensions for each of the tokens (2)

torch.manual_seed(123)

# this assigns random values to the query, key, and value weights, based on the number of dimensions required for input and output
W_query = torch.nn.Parameter(torch.rand(dim_in, dim_out), requires_grad=False)
W_key   = torch.nn.Parameter(torch.rand(dim_in, dim_out), requires_grad=False)
W_value = torch.nn.Parameter(torch.rand(dim_in, dim_out), requires_grad=False)

# matrix multiplication of the second token's embedding with the q, k, and v weight matrices gives its q, k, and v vectors
query_2 = x_2 @ W_query
key_2   = x_2 @ W_key
value_2 = x_2 @ W_value

print("Query weight for the query token (token 2):\n", query_2)

keys = inputs @ W_key
values = inputs @ W_value

print("\nKey weight matrices for all tokens:\n", keys)
print("\nValue weight matrices for all tokens:\n", values)

Query weight for the query token (token 2):
 tensor([0.4306, 1.4551])

Key weight matrices for all tokens:
 tensor([[0.3669, 0.7646],
        [0.4433, 1.1419],
        [0.4361, 1.1156],
        [0.2408, 0.6706],
        [0.1827, 0.3292],
        [0.3275, 0.9642]])

Value weight matrices for all tokens:
 tensor([[0.1855, 0.8812],
        [0.3951, 1.0037],
        [0.3879, 0.9831],
        [0.2393, 0.5493],
        [0.1492, 0.3346],
        [0.3221, 0.7863]])


### 2. Compute the attention scores for each of the tokens by getting the dot product of the query and key vectors

- As this example is with respect to the query token (second token), the query value used will be that of the second token

In [None]:
# the attention scores are a matrix multiplication between the query vector of the second token and the key vectors of each of the 6 tokens
attn_scores_2 = query_2 @ keys.T

print("Attention scores for all the tokens: " + str(attn_scores_2))

# the attention weights are the softmax of the scores, after dividing them by the square root of the key dimension
dim_key = keys.shape[-1]
attn_weights_2 = torch.softmax(attn_scores_2 / dim_key**0.5, dim=-1)

print("Attention weights for all the tokens: " + str(attn_weights_2))

Attention scores for all the tokens: tensor([1.2705, 1.8524, 1.8111, 1.0795, 0.5577, 1.5440])
Attention weights for all the tokens: tensor([0.1500, 0.2264, 0.2199, 0.1311, 0.0906, 0.1820])
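
A sketch of what the division by the square root of the key dimension does. The raw scores and the key dimension of 64 are made up to exaggerate the effect; the example above only has a key dimension of 2.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# hypothetical raw scores, as might arise when the key dimension is large
raw_scores = [8.0, 4.0, 2.0, 1.0]
dim_key = 64  # the scaling factor is sqrt(64) = 8

unscaled = softmax(raw_scores)
scaled = softmax([s / math.sqrt(dim_key) for s in raw_scores])

print("Without scaling:", [round(w, 3) for w in unscaled])
print("With scaling:   ", [round(w, 3) for w in scaled])
```

Without scaling, nearly all of the weight lands on one token; with scaling, the ranking is kept but the weights are spread more evenly, which keeps gradients from becoming very small during training.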


### 3. Compute the context vector for the query token using the attention weights of all tokens

- Each of the value vectors (2 dimensions) is multiplied by its attention weight (a single number) and the results are summed together

In [None]:
# calculating the context vector by multiplying the attention weights and value weights
context_vec_2 = attn_weights_2 @ values

print("Context vector for the second token: " + str(context_vec_2))

Context vector for the second token: tensor([0.3061, 0.8210])


### 4. Organise the code into a compact self-attention class for easy implementation

- This class is derived from pytorch's nn.Module, which tracks the trainable parameters and runs the 'forward' function when the class instance is called on inputs

In [None]:
import torch.nn as nn

# the class is derived from the nn module, which has its own functions to automatically process the parameters
class SelfAttention(nn.Module):
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.W_query = nn.Parameter(torch.rand(dim_in, dim_out))
        self.W_key   = nn.Parameter(torch.rand(dim_in, dim_out))
        self.W_value = nn.Parameter(torch.rand(dim_in, dim_out))

    # the forward function is called when inputs are given to the class
    def forward(self, x):
        keys = x @ self.W_key
        queries = x @ self.W_query
        values = x @ self.W_value

        attn_scores = queries @ keys.T
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        context_vecs = attn_weights @ values

        return context_vecs

torch.manual_seed(123)
attention = SelfAttention(dim_in, dim_out)

print("Context vectors for all tokens:\n", attention(inputs))

Context vectors for all tokens:
 tensor([[0.2996, 0.8053],
        [0.3061, 0.8210],
        [0.3058, 0.8203],
        [0.2948, 0.7939],
        [0.2927, 0.7891],
        [0.2990, 0.8040]], grad_fn=<MmBackward0>)


### 5. Alter the self-attention class to use pytorch's native 'linear' layers

- The 'linear' layer has a more optimised weight initialisation scheme, making the model training more stable

In [None]:
class SelfAttentionNew(nn.Module):
    def __init__(self, dim_in, dim_out, qkv_bias=False):
        super().__init__()
        # replacing the Parameter layer with the Linear layer
        self.W_query = nn.Linear(dim_in, dim_out,bias=qkv_bias)
        self.W_key   = nn.Linear(dim_in, dim_out,bias=qkv_bias)
        self.W_value = nn.Linear(dim_in, dim_out,bias=qkv_bias)

    def forward(self, x):
        # replacing the matrix multiplication with the native linear calculation 
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        attn_scores = queries @ keys.T
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        context_vecs = attn_weights @ values

        return context_vecs
    
torch.manual_seed(789)
attention_new = SelfAttentionNew(dim_in, dim_out)

# the new context vectors will be different to the previous ones
print("Context vectors for all tokens:\n", attention_new(inputs))

Context vectors for all tokens:
 tensor([[-0.0739,  0.0713],
        [-0.0748,  0.0703],
        [-0.0749,  0.0702],
        [-0.0760,  0.0685],
        [-0.0763,  0.0679],
        [-0.0754,  0.0693]], grad_fn=<MmBackward0>)


**This completes the self-attention mechanism used in many LLMs and GPT models!**

- This can now be extended into a 'causal' self-attention mechanism, which is a specialised form of self-attention

    - Instead of considering the entire input sequence at once, this model only considers current and previous tokens

    - The tokens ahead of the current token are masked, and the attention weights of the unmasked tokens are normalised again

### 1. Mask the computed attention weights for the inputs (6 tokens)

- The attention weights are first masked accordingly, and normalised such that the unmasked ones add up to 1

In [21]:
# compute the attention scores, then attention weights from the query and key weights
queries = attention_new.W_query(inputs)
keys = attention_new.W_key(inputs)

attn_scores = queries @ keys.T
attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)

print("Attention weights for the inputs:\n", attn_weights)

# for each token, mask the values of future tokens (i.e. for token 1, mask tokens 2 - 6/ for token 2, mask tokens 3 - 6, etc.)
context_length = attn_scores.shape[0]
mask_simple = torch.tril(torch.ones(context_length, context_length))

print("\nMasked tokens for each token:\n", mask_simple)

# renormalise the masked weights so that the unmasked weights in each row add up to 1

masked_simple = attn_weights * mask_simple
row_sums = masked_simple.sum(dim=-1, keepdim=True)
norm_mask_simple = masked_simple / row_sums

print("\nNormalised masked tokens for each token:\n", norm_mask_simple)

Attention weights for the inputs:
 tensor([[0.1921, 0.1646, 0.1652, 0.1550, 0.1721, 0.1510],
        [0.2041, 0.1659, 0.1662, 0.1496, 0.1665, 0.1477],
        [0.2036, 0.1659, 0.1662, 0.1498, 0.1664, 0.1480],
        [0.1869, 0.1667, 0.1668, 0.1571, 0.1661, 0.1564],
        [0.1830, 0.1669, 0.1670, 0.1588, 0.1658, 0.1585],
        [0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
       grad_fn=<SoftmaxBackward0>)

Masked tokens for each token:
 tensor([[1., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1., 1.]])

Normalised masked tokens for each token:
 tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5517, 0.4483, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3800, 0.3097, 0.3103, 0.0000, 0.0000, 0.0000],
        [0.2758, 0.2460, 0.2462, 0.2319, 0.0000, 0.0000],
        [0.2175, 0.1983, 0.1984, 0.1888, 0.1971, 0.0000],
        [0.
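
The mask-then-renormalise steps above can also be written as a single softmax, by filling the future positions with `-inf` before the softmax is applied. The following is a minimal sketch of that equivalence, assuming random stand-in scores rather than the `attn_scores` from the notebook (the names `scores`, `d_k` and the seed are illustrative only):

```python
import torch

torch.manual_seed(0)

# stand-in attention scores for 6 tokens (hypothetical values, not the notebook's)
scores = torch.randn(6, 6)
d_k = 2  # assumed key dimension, used only for scaling
n = scores.shape[0]

# approach 1: softmax first, then zero out future tokens and renormalise each row
weights = torch.softmax(scores / d_k**0.5, dim=-1)
masked = weights * torch.tril(torch.ones(n, n))
renormalised = masked / masked.sum(dim=-1, keepdim=True)

# approach 2: set future positions to -inf, then apply softmax once
future = torch.triu(torch.ones(n, n), diagonal=1).bool()
inf_masked = torch.softmax(scores.masked_fill(future, -torch.inf) / d_k**0.5, dim=-1)

# both approaches should give the same causal attention weights
print(torch.allclose(renormalised, inf_masked, atol=1e-6))
```

The second approach needs one fewer pass over the weights, since `exp(-inf) = 0` removes the future tokens inside the softmax itself.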

### 2. Use a 'dropout' technique to ignore some attention weights

- Dropout randomly zeroes some attention weights during training, which prevents the model from becoming overly reliant on any particular set of them (it is switched off at inference time)

- The following example shows how dropouts are applied on attention weights

In [None]:
torch.manual_seed(123)

# a dropout rate of 0.5 zeroes each weight independently with probability 0.5 (i.e. on average, about half of the weights are dropped)
dropout = torch.nn.Dropout(0.5)
example = torch.ones(6, 6)

# the dropout sets some weights to 0, and scales the remaining weights by 1 / (1 - 0.5) = 2
print("Example before dropout:\n", example)
print("\nExample after dropout:\n", dropout(example))
print("\nDropout used on previous attention weights:\n", dropout(attn_weights))

Example before dropout:
 tensor([[1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1.]])

Example after dropout:
 tensor([[2., 2., 0., 2., 2., 0.],
        [0., 0., 0., 2., 0., 2.],
        [2., 2., 2., 2., 0., 2.],
        [0., 2., 2., 0., 0., 2.],
        [0., 2., 0., 2., 0., 2.],
        [0., 2., 2., 2., 2., 0.]])

Dropout used on previous attention weights:
 tensor([[0.3843, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.3324, 0.0000, 0.3329, 0.2955],
        [0.0000, 0.3318, 0.3325, 0.2996, 0.3328, 0.2961],
        [0.0000, 0.0000, 0.3337, 0.3142, 0.0000, 0.3128],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.3317, 0.3169],
        [0.3869, 0.3327, 0.0000, 0.3084, 0.3331, 0.3058]],
       grad_fn=<MulBackward0>)
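
Dropout behaves differently depending on whether the module is in training or evaluation mode. The sketch below illustrates this with the same `torch.nn.Dropout(0.5)` layer used above; the variable names `dropped` and `unchanged` are illustrative and not part of the original notebook:

```python
import torch

torch.manual_seed(123)

dropout = torch.nn.Dropout(0.5)
example = torch.ones(6, 6)

# training mode (the default): each element is zeroed with probability 0.5,
# and the surviving elements are scaled by 1 / (1 - 0.5) = 2
dropped = dropout(example)
print("Values after dropout in training mode:", dropped.unique())

# evaluation mode: dropout passes the input through unchanged
dropout.eval()
unchanged = dropout(example)
print("Unchanged in evaluation mode:", torch.equal(unchanged, example))
```

Calling `.eval()` on a model that contains dropout layers switches all of them off in the same way.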


### 3. Incorporate the previous steps into a compact causal attention class

- This class is similar to the previous self-attention class but incorporates the masking and dropout steps

In [None]:
class CasualAttention(nn.Module):
    def __init__(self, dim_in, dim_out, context_length, dropout, qkv_bias=False):
        super().__init__()
        self.dim_out = dim_out
        self.dropout = nn.Dropout(dropout)
        self.register_buffer('mask', torch.triu(torch.ones(context_length, context_length), diagonal=1))

        self.W_query = nn.Linear(dim_in, dim_out,bias=qkv_bias)
        self.W_key   = nn.Linear(dim_in, dim_out,bias=qkv_bias)
        self.W_value = nn.Linear(dim_in, dim_out,bias=qkv_bias)
    
    def forward(self, x):
        batch, num_tokens, dim_in = x.shape
        
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

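        # dot-product scores per batch item: (batch, num_tokens, num_tokens)
        attn_scores = queries @ keys.transpose(1, 2)
        # set future positions to -inf so that the softmax assigns them a weight of 0
        attn_scores.masked_fill_(self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)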

        context_vecs = attn_weights @ values

        return context_vecs

# stacking the inputs twice to show that the causal attention module can handle a batch of 2 sequences at once
batch = torch.stack((inputs, inputs), dim=0)
context_length = batch.shape[1]
casual_attention = CasualAttention(dim_in, dim_out, context_length, 0.0)
context_vecs = casual_attention(batch)

print("Context vectors for 2 batches of 6 tokens:\n", context_vecs)

Context vectors for 2 batches of 6 tokens:
 tensor([[[-0.6053,  0.6575],
         [-0.5503,  0.6564],
         [-0.5313,  0.6538],
         [-0.4500,  0.5680],
         [-0.4254,  0.5241],
         [-0.3933,  0.5011]],

        [[-0.6053,  0.6575],
         [-0.5503,  0.6564],
         [-0.5313,  0.6538],
         [-0.4500,  0.5680],
         [-0.4254,  0.5241],
         [-0.3933,  0.5011]]], grad_fn=<UnsafeViewBackward0>)
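
One way to check that the masking works as intended is to change a later token and confirm that the context vectors of the earlier tokens stay the same. The sketch below does this with a small standalone function rather than the class above; `causal_attention`, the random inputs and the random weight matrices are all hypothetical stand-ins:

```python
import torch

torch.manual_seed(0)

def causal_attention(x, w_q, w_k, w_v):
    # hypothetical helper: x is (num_tokens, dim_in), each weight is (dim_in, dim_out)
    queries, keys, values = x @ w_q, x @ w_k, x @ w_v
    scores = queries @ keys.T
    n = x.shape[0]
    # True above the diagonal, i.e. at the positions of future tokens
    future = torch.triu(torch.ones(n, n), diagonal=1).bool()
    scores = scores.masked_fill(future, -torch.inf)
    weights = torch.softmax(scores / keys.shape[-1]**0.5, dim=-1)
    return weights @ values

x = torch.rand(6, 3)
w_q, w_k, w_v = (torch.rand(3, 2) for _ in range(3))

before = causal_attention(x, w_q, w_k, w_v)

# replace only the last token and recompute
x_changed = x.clone()
x_changed[-1] = torch.rand(3)
after = causal_attention(x_changed, w_q, w_k, w_v)

print("Earlier tokens unaffected:", torch.allclose(before[:-1], after[:-1]))
print("Last token unaffected:", torch.allclose(before[-1], after[-1]))
```

Without the mask, every context vector would depend on the last token, so the first check would fail.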


**This completes the causal self-attention mechanism!**

- This can finally be extended into a multi-head attention mechanism, which employs the causal attention over multiple "heads"

- A single head computes one set of attention weights; multiple heads compute several sets in parallel, each with its own query, key and value projections

### 1. Create a wrapper class that runs multiple instances of the causal attention class

- Each head returns one context vector per input token, and the context vectors from all heads are concatenated along the last dimension

In [None]:
class MultiHeadAttentionWrapper(nn.Module):
    def __init__(self, dim_in, dim_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        self.heads = nn.ModuleList([CasualAttention(dim_in, dim_out, context_length, dropout, qkv_bias)
                                   for _ in range(num_heads)])
    
    def forward(self, x):
        return torch.cat([head(x) for head in self.heads], dim=-1)
    
# with 2 heads of dim_out = 2 each, the concatenated context vectors have 2 x 2 = 4 dimensions
torch.manual_seed(123)
context_length = batch.shape[1] # number of tokens is 6 for each batch
dim_in, dim_out = 3, 2

multi_head = MultiHeadAttentionWrapper(dim_in, dim_out, context_length, 0.0, num_heads=2)
context_vecs = multi_head(batch)

print("Multi-head context vectors for 2 batches of 6 tokens:\n", context_vecs)

Multi-head context vectors for 2 batches of 6 tokens:
 tensor([[[-0.4519,  0.2216,  0.4772,  0.1063],
         [-0.5874,  0.0058,  0.5891,  0.3257],
         [-0.6300, -0.0632,  0.6202,  0.3860],
         [-0.5675, -0.0843,  0.5478,  0.3589],
         [-0.5526, -0.0981,  0.5321,  0.3428],
         [-0.5299, -0.1081,  0.5077,  0.3493]],

        [[-0.4519,  0.2216,  0.4772,  0.1063],
         [-0.5874,  0.0058,  0.5891,  0.3257],
         [-0.6300, -0.0632,  0.6202,  0.3860],
         [-0.5675, -0.0843,  0.5478,  0.3589],
         [-0.5526, -0.0981,  0.5321,  0.3428],
         [-0.5299, -0.1081,  0.5077,  0.3493]]], grad_fn=<CatBackward0>)
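
The 4 columns in the output come from concatenating the 2 columns of each head along the last dimension. A minimal sketch of just that concatenation step, using constant-filled stand-in tensors instead of real head outputs (`head_outputs` and `combined` are illustrative names):

```python
import torch

batch_size, num_tokens, dim_out, num_heads = 2, 6, 2, 2

# stand-ins for the per-head outputs, each of shape (batch, num_tokens, dim_out);
# head 0 is filled with 0s and head 1 with 1s so the columns are easy to tell apart
head_outputs = [torch.full((batch_size, num_tokens, dim_out), float(h)) for h in range(num_heads)]

combined = torch.cat(head_outputs, dim=-1)
print(combined.shape)   # torch.Size([2, 6, 4])
print(combined[0, 0])   # tensor([0., 0., 1., 1.])
```

The first `dim_out` columns of each row belong to the first head and the next `dim_out` columns to the second head.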


### 2. Combine the causal attention class and multi-head wrapper into a single class

- The previous multi-head wrapper class simply instantiates and combines several causal attention objects

- This can be made more efficient by projecting the queries, keys and values once, then reshaping the projected tensors to split them into multiple heads

In [None]:
class MultiHeadAttention(nn.Module):
    def __init__(self, dim_in, dim_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert (dim_out % num_heads == 0), \
            "dim_out must be divisible by num_heads"
    
        self.dim_out = dim_out
        self.num_heads = num_heads
        self.head_dim = dim_out // num_heads

        self.W_query = nn.Linear(dim_in, dim_out,bias=qkv_bias)
        self.W_key   = nn.Linear(dim_in, dim_out,bias=qkv_bias)
        self.W_value = nn.Linear(dim_in, dim_out,bias=qkv_bias)
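        # the output projection is applied to the combined heads, so its input size is dim_out (not dim_in)
        self.out_proj = nn.Linear(dim_out, dim_out)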
        self.dropout = nn.Dropout(dropout)
        self.register_buffer('mask', torch.triu(torch.ones(context_length, context_length), diagonal=1))

    def forward(self, x):
        batch, num_tokens, dim_in = x.shape
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

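        # split the last dimension into heads: (batch, num_tokens, dim_out) -> (batch, num_tokens, num_heads, head_dim)
        keys = keys.view(batch, num_tokens, self.num_heads, self.head_dim)
        queries = queries.view(batch, num_tokens, self.num_heads, self.head_dim)
        values = values.view(batch, num_tokens, self.num_heads, self.head_dim)

        # move the head axis ahead of the token axis: (batch, num_heads, num_tokens, head_dim)
        keys = keys.transpose(1, 2)
        queries = queries.transpose(1, 2)
        values = values.transpose(1, 2)

        # scaled dot-product attention per head, with the causal mask applied before the softmax
        attn_scores = queries @ keys.transpose(2, 3)
        attn_scores.masked_fill_(self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # merge the heads back: (batch, num_heads, num_tokens, head_dim) -> (batch, num_tokens, dim_out)
        context_vecs = (attn_weights @ values).transpose(1, 2)
        context_vecs = context_vecs.contiguous().view(batch, num_tokens, self.dim_out)
        context_vecs = self.out_proj(context_vecs)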

        return context_vecs
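
The `view` and `transpose` calls in the class above are what split one projected tensor into separate heads. The sketch below isolates that reshaping on a small stand-in tensor (`projected` plays the role of something like `self.W_key(x)`; the sizes are assumptions chosen for readability):

```python
import torch

batch, num_tokens, num_heads, head_dim = 2, 6, 2, 2
dim_out = num_heads * head_dim

# stand-in for a projected tensor of shape (batch, num_tokens, dim_out)
projected = torch.arange(batch * num_tokens * dim_out, dtype=torch.float32)
projected = projected.view(batch, num_tokens, dim_out)

# split the last dimension into heads, then move the head axis ahead of the token axis
heads = projected.view(batch, num_tokens, num_heads, head_dim).transpose(1, 2)
print(heads.shape)  # torch.Size([2, 2, 6, 2])

# head h holds columns [h * head_dim, (h + 1) * head_dim) of the projection
print(torch.equal(heads[:, 0], projected[..., :head_dim]))
print(torch.equal(heads[:, 1], projected[..., head_dim:]))

# transposing back and flattening the last two axes restores the original layout
restored = heads.transpose(1, 2).contiguous().view(batch, num_tokens, dim_out)
print(torch.equal(restored, projected))
```

The `.contiguous()` call is needed because `transpose` returns a non-contiguous view, and `view` only works on tensors whose memory layout matches the requested shape.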

        