<a href="https://colab.research.google.com/github/elyager/LLMs-from-scratch/blob/main/LLM_from_scratch_personal_notes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Requirements

In [2]:
!pip install tiktoken
import torch
print("PyTorch version:", torch.__version__)

Collecting tiktoken
  Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Installing collected packages: tiktoken
Successfully installed tiktoken-0.9.0
PyTorch version: 2.6.0+cu124


### Vocabulary

- **vocab_size:** the size of the vocabulary: the number of unique tokens available, plus any extensions (such as special tokens). Here it is determined by the BPE tokenizer.
- **output_dim:** the number of dimensions of each token embedding. Together the dimensions describe a word or concept; more dimensions can capture more detail.
- **max_length:** the maximum number of tokens per sequence.
- **batch_size:** how many sequences (text samples) each batch contains. Larger batch sizes can speed up training but require more memory.
- **stride:** how many tokens the window advances before the next sequence starts, which determines how much consecutive sequences overlap. With a stride smaller than `max_length`, the model sees each token in several contexts during training, which helps it learn relationships between words.
- **shuffle:** whether the sequences are presented in random order in each epoch. This prevents the model from learning patterns based on the order of the data, which helps it generalize instead of overfitting to a specific ordering.
- **drop_last:** whether to discard the last batch when it has fewer samples than `batch_size`. With a small dataset it is better not to drop it.
- **num_workers:** the number of worker processes used to load data. More workers can speed up loading; `0` loads the data in the main process.
- **attention score:** measures how "similar" one token vector is to another using the dot product, which is one kind of similarity function. A higher attention score means the two vectors are more similar.
  - **attention weight:** the attention scores normalized with softmax.
- **context vector:** an embedding vector that incorporates context from all the other input vectors. It is the sum of all input vectors in the sequence, each multiplied by its attention weight.
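
The way `max_length`, `stride`, `batch_size` and `drop_last` interact can be sketched in plain Python. This is only an illustration under simple assumptions, not the loader used later in the notebook: `sliding_windows` is a hypothetical helper and the token IDs are made-up integers.

```python
import math

def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs; each target is its input shifted by one token."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        input_chunk = token_ids[i:i + max_length]
        target_chunk = token_ids[i + 1:i + max_length + 1]
        pairs.append((input_chunk, target_chunk))
    return pairs

token_ids = list(range(10))  # made-up token IDs 0..9

# stride=1: consecutive windows share max_length - 1 tokens
overlapping = sliding_windows(token_ids, max_length=4, stride=1)
# stride=max_length: consecutive windows do not overlap
disjoint = sliding_windows(token_ids, max_length=4, stride=4)

print(len(overlapping), overlapping[:2])
print(len(disjoint), disjoint)

# drop_last decides what happens to an incomplete final batch
batch_size = 4
batches_if_dropped = len(overlapping) // batch_size
batches_if_kept = math.ceil(len(overlapping) / batch_size)
print(batches_if_dropped, batches_if_kept)
```

With a small dataset the difference between the last two numbers can be a sizeable share of the data, which is why keeping the last batch is often preferable there.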

# Chapter 1 & 2

### Load the text for training (our corpus)

In [3]:
import os
import urllib.request

if not os.path.exists("the-verdict.txt"):
    url = ("https://raw.githubusercontent.com/rasbt/"
           "LLMs-from-scratch/main/ch02/01_main-chapter-code/"
           "the-verdict.txt")
    file_path = "the-verdict.txt"
    urllib.request.urlretrieve(url, file_path)

with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

print("Total number of character:", len(raw_text))
print(raw_text[:99])

Total number of character: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


### BytePair Encoding (BPE)


In [4]:
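import importlib.metadata  # import the submodule explicitly; a bare `import importlib` does not guarantee it is loaded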
import tiktoken

print("tiktoken version:", importlib.metadata.version("tiktoken"))

tiktoken version: 0.9.0


In [5]:
tokenizer = tiktoken.get_encoding("gpt2")
enc_text = tokenizer.encode(raw_text, allowed_special={"<|endoftext|>"})
enc_text.append(tokenizer.eot_token)

# First 10 tokens from raw_text
first_10_token_ids = enc_text[:10]
decoded_tokens = [tokenizer.decode([token_id]) for token_id in first_10_token_ids]
delimited_tokens = ' |-|'.join(decoded_tokens)
print(delimited_tokens)
print(enc_text[:10])
print(f'\n Total of tokens: {len(enc_text)}')

I |-| H |-|AD |-| always |-| thought |-| Jack |-| G |-|is |-|burn |-| rather
[40, 367, 2885, 1464, 1807, 3619, 402, 271, 10899, 2138]

 Total of tokens: 5146


### Dataset loader (creating tokenIDs for inputs and targets)

In [6]:
from torch.utils.data import Dataset, DataLoader

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})
        token_ids.append(tokenizer.eot_token)

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

# dataset = GPTDatasetV1(raw_text, tokenizer, max_length=4, stride=1)
# print(dataset.input_ids)
# print(dataset.target_ids)

In [7]:
def create_dataloader_v1(txt, batch_size=4, max_length=256,
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):

    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader

### Use DataLoader

In [8]:
dataloader = create_dataloader_v1(
    raw_text, batch_size=1, max_length=4, stride=1, shuffle=False
)

# for batch in dataloader:
#     input, target = batch
#     print(input, target)

data_iter = iter(dataloader)

first_batch = next(data_iter)
print(first_batch) # input and target
second_batch = next(data_iter)
print(second_batch) # input and target

[tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]
[tensor([[ 367, 2885, 1464, 1807]]), tensor([[2885, 1464, 1807, 3619]])]


In [9]:
dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=4, shuffle=False)

data_iter = iter(dataloader)
# First batch
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)
print("Inputs:\n", tokenizer.decode(inputs[0].tolist()))
print("\nTargets:\n", tokenizer.decode(targets[0].tolist()))

# Second batch
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)
print("Inputs:\n", tokenizer.decode(inputs[0].tolist()))
print("\nTargets:\n", tokenizer.decode(targets[0].tolist()))

Inputs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Targets:
 tensor([[  367,  2885,  1464,  1807],
        [ 3619,   402,   271, 10899],
        [ 2138,   257,  7026, 15632],
        [  438,  2016,   257,   922],
        [ 5891,  1576,   438,   568],
        [  340,   373,   645,  1049],
        [ 5975,   284,   502,   284],
        [ 3285,   326,    11,   287]])
Inputs:
 I HAD always

Targets:
  HAD always thought
Inputs:
 tensor([[  287,   262,  6001,   286],
        [  465, 13476,    11,   339],
        [  550,  5710,   465, 12036],
        [   11,  6405,   257,  5527],
        [27075,    11,   290,  4920],
        [ 2241,   287,   257,  4489],
        [   64,   319,   262, 34686],
        [41976,    13,   357, 10915]])

Ta

### Create our token embedding layer

In [10]:
vocab_size = 50257
output_dim = 3 # 256 is a more common starting point

token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
print(token_embedding_layer.weight.shape)
print(token_embedding_layer.weight)

torch.Size([50257, 3])
Parameter containing:
tensor([[ 0.0914, -0.9365,  0.7240],
        [ 0.2457,  0.1994, -0.9671],
        [-0.6384, -1.9128,  1.3959],
        ...,
        [ 0.1865, -1.0996,  0.0158],
        [-0.3695, -1.5027, -2.1622],
        [-0.5088,  0.7261,  1.1936]], requires_grad=True)


#### Load our dataset to get the inputs

In [11]:
max_length = 4
dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=max_length,
    stride=max_length, shuffle=False
)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)

print("Token IDs:\n", inputs) # we take the first batch and ignore the targets for now
print("\nInputs shape:\n", inputs.shape)

Token IDs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Inputs shape:
 torch.Size([8, 4])


### Create token embeddings

In [12]:
token_embeddings = token_embedding_layer(inputs)
# each token is now a vector with output_dim dimensions instead of a single token ID
print(token_embeddings.shape)

# print the embeddings to see what they look like
print(token_embeddings)

torch.Size([8, 4, 3])
tensor([[[-0.8138, -0.7311,  1.2164],
         [-0.8966, -0.8683,  1.7272],
         [-0.0507, -0.7772,  0.4739],
         [-0.4029,  0.7053,  0.4074]],

        [[ 0.2102, -0.9068,  0.8955],
         [-0.3159, -1.1545, -0.6932],
         [-0.7584, -0.7269,  0.4727],
         [-1.9021, -2.1596, -1.0677]],

        [[-0.1855,  0.4751,  0.8388],
         [ 0.1405, -1.5706,  0.7147],
         [-0.5403, -0.6365, -0.0687],
         [-0.9725, -0.0561,  0.2927]],

        [[-0.7529, -0.0139, -0.9845],
         [ 1.4724, -0.6392,  1.7239],
         [ 1.4560, -1.2420, -0.0319],
         [-0.5403, -0.6365, -0.0687]],

        [[ 1.8129,  1.0460,  2.0427],
         [ 0.5189,  0.9129, -0.7249],
         [ 0.8845, -1.0674, -2.6862],
         [ 1.4724, -0.6392,  1.7239]],

        [[ 0.5230, -0.2953, -1.4464],
         [-1.2002,  1.1369, -0.0891],
         [-0.5462,  1.8767, -0.4185],
         [-1.0630, -1.2027,  1.0012]],

        [[-1.5859, -0.3160, -0.4833],
         [-1.201

### Create absolute positional embeddings

In [13]:
pos_embedding_layer = torch.nn.Embedding(max_length, output_dim)

# positions [0, 1, 2, 3]: the index of each token within a sequence of context length 4
pos_embeddings = pos_embedding_layer(torch.arange(max_length))
print(pos_embeddings.shape)

# print the embeddings to see what they look like
print(pos_embeddings)

torch.Size([4, 3])
tensor([[-0.4791,  0.4190, -2.3272],
        [-0.4893, -0.4136,  0.2769],
        [-1.1482,  0.3328,  1.1950],
        [ 0.6695,  0.2243,  0.1151]], grad_fn=<EmbeddingBackward0>)


### Create input embeddings

In [14]:
input_embeddings = token_embeddings + pos_embeddings
print(f'{token_embeddings.shape} + {pos_embeddings.shape} = {input_embeddings.shape}')

# print the embeddings to see what they look like
print(input_embeddings)

torch.Size([8, 4, 3]) + torch.Size([4, 3]) = torch.Size([8, 4, 3])
tensor([[[-1.2929, -0.3120, -1.1107],
         [-1.3859, -1.2819,  2.0041],
         [-1.1990, -0.4444,  1.6688],
         [ 0.2666,  0.9296,  0.5224]],

        [[-0.2689, -0.4878, -1.4316],
         [-0.8052, -1.5681, -0.4163],
         [-1.9066, -0.3941,  1.6677],
         [-1.2326, -1.9353, -0.9527]],

        [[-0.6646,  0.8941, -1.4884],
         [-0.3488, -1.9842,  0.9916],
         [-1.6885, -0.3037,  1.1263],
         [-0.3030,  0.1682,  0.4077]],

        [[-1.2321,  0.4051, -3.3116],
         [ 0.9831, -1.0528,  2.0008],
         [ 0.3078, -0.9092,  1.1631],
         [ 0.1292, -0.4122,  0.0464]],

        [[ 1.3338,  1.4651, -0.2844],
         [ 0.0297,  0.4993, -0.4481],
         [-0.2637, -0.7346, -1.4913],
         [ 2.1419, -0.4150,  1.8390]],

        [[ 0.0439,  0.1237, -3.7736],
         [-1.6895,  0.7234,  0.1877],
         [-1.6944,  2.2095,  0.7765],
         [-0.3935, -0.9784,  1.1162]],

        [

### What about more dimensions?

In [15]:
def more_more_more_dimensions(dim):
  vocab_size = 50257
  output_dim = dim

  token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
  token_embeddings = token_embedding_layer(inputs)
  print(token_embeddings.shape)
  print(token_embeddings)

  pos_embedding_layer = torch.nn.Embedding(max_length, output_dim)
  pos_embeddings = pos_embedding_layer(torch.arange(max_length))
  print(pos_embeddings.shape)
  print(pos_embeddings)


  input_embeddings = token_embeddings + pos_embeddings
  print(f'{token_embeddings.shape} + {pos_embeddings.shape} = {input_embeddings.shape}')
  print(input_embeddings)

more_more_more_dimensions(256)  #256 is a more common starting point

torch.Size([8, 4, 256])
tensor([[[ 1.3295e+00, -2.6935e-01,  2.8294e-01,  ...,  7.5636e-01,
          -2.6170e+00, -1.4374e+00],
         [ 9.8064e-01,  1.8127e-01, -8.6219e-01,  ..., -4.2655e-01,
          -8.7672e-01,  6.3201e-01],
         [ 6.1898e-01,  9.3566e-02, -3.4799e-01,  ...,  6.5005e-01,
           8.5892e-01,  1.4344e+00],
         [-1.4821e-02, -1.4379e+00, -3.6418e-01,  ..., -2.7496e-01,
          -1.1600e-01,  1.0902e+00]],

        [[-4.4345e-01, -1.3204e-03,  1.2373e+00,  ..., -8.0514e-01,
          -8.6535e-02, -3.1204e-01],
         [ 1.6229e+00,  1.2113e+00,  1.2216e+00,  ...,  1.1292e+00,
           1.5319e+00,  2.4835e-02],
         [-2.3544e-01,  7.7411e-02,  1.5386e-01,  ...,  4.1874e-01,
          -3.5889e-01,  3.7214e-01],
         [ 5.3323e-02, -1.2211e+00,  4.0893e-01,  ...,  2.3007e+00,
          -1.2084e+00,  1.5010e+00]],

        [[ 1.2954e-01,  1.1265e+00,  8.4160e-01,  ..., -1.7333e+00,
           2.1296e-01, -3.5150e-01],
         [ 3.8380e-01,  2.4

# Chapter 3

## Preparing data to work with

### Get input embeddings

In [16]:
def get_input_embeddings(my_raw_text):
  dataloader = create_dataloader_v1(my_raw_text, batch_size=1, max_length=6, stride=6, shuffle=False)

  data_iter = iter(dataloader)
  # First and only batch
  inputs, targets = next(data_iter)
  # print(inputs)
  # print(targets)  # ignore the targets

  vocab_size = 50257
  output_dim = 3

  token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
  token_embeddings = token_embedding_layer(inputs)
  context_length = 6
  pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)
  pos_embeddings = pos_embedding_layer(torch.arange(context_length))

  input_embeddings = token_embeddings + pos_embeddings
  # print(f'{token_embeddings.shape} + {pos_embeddings.shape} = {input_embeddings.shape}')
  # print(input_embeddings)
  return input_embeddings

In [17]:
my_raw_text = "Your journey starts with one step"
small_input_embeddings = get_input_embeddings(my_raw_text)[0]
print(small_input_embeddings)

tensor([[ 2.9353,  2.4711, -2.9080],
        [ 0.0640,  0.8755, -1.3891],
        [-0.0383,  0.5499, -3.9638],
        [ 0.9477, -0.1366, -0.3135],
        [-0.1208,  0.2234,  1.2193],
        [ 1.2950,  0.7151, -1.4663]], grad_fn=<SelectBackward0>)


In [18]:
# tensor([
#     [ 0.4113,  1.3397, -1.2234], Your    (x^1)
#     [-1.8881, -0.0679, -1.1267], journey (x^2)
#     [-0.2323, -2.2089, -1.6685], starts  (x^3)
#     [ 0.5615,  1.2698,  2.5768], with    (x^4)
#     [-0.9290, -0.0227,  0.6467], one     (x^5)
#     [ 0.5691, -2.0627, -3.2411]  step    (x^6)
# ])

### Forcing values to fit between 0 and 1

In [19]:
min_val = small_input_embeddings.min()
max_val = small_input_embeddings.max()
scaled_embeddings = (small_input_embeddings - min_val) / (max_val - min_val)
rounded_embeddings = torch.round(scaled_embeddings * 100) / 100
print(rounded_embeddings)

tensor([[1.0000, 0.9300, 0.1500],
        [0.5800, 0.7000, 0.3700],
        [0.5700, 0.6500, 0.0000],
        [0.7100, 0.5500, 0.5300],
        [0.5600, 0.6100, 0.7500],
        [0.7600, 0.6800, 0.3600]], grad_fn=<DivBackward0>)


## Simple self-attention

### Step 1 - Compute unnormalized attention scores

In [20]:
query = rounded_embeddings[1]  # the 2nd input token is the query
# allocate an uninitialized tensor with one slot per input token (6)
attn_scores_2 = torch.empty(rounded_embeddings.shape[0])

# fill the tensor with dot products (multiply element-wise, then sum)
for i, x_i in enumerate(rounded_embeddings):
    print(f'dot product of {x_i} against journey {query}')
    attn_scores_2[i] = torch.dot(x_i, query) # dot product (transpose not necessary here since they are 1-dim vectors)

print(attn_scores_2)

dot product of tensor([1.0000, 0.9300, 0.1500], grad_fn=<UnbindBackward0>) against journey tensor([0.5800, 0.7000, 0.3700], grad_fn=<SelectBackward0>)
dot product of tensor([0.5800, 0.7000, 0.3700], grad_fn=<UnbindBackward0>) against journey tensor([0.5800, 0.7000, 0.3700], grad_fn=<SelectBackward0>)
dot product of tensor([0.5700, 0.6500, 0.0000], grad_fn=<UnbindBackward0>) against journey tensor([0.5800, 0.7000, 0.3700], grad_fn=<SelectBackward0>)
dot product of tensor([0.7100, 0.5500, 0.5300], grad_fn=<UnbindBackward0>) against journey tensor([0.5800, 0.7000, 0.3700], grad_fn=<SelectBackward0>)
dot product of tensor([0.5600, 0.6100, 0.7500], grad_fn=<UnbindBackward0>) against journey tensor([0.5800, 0.7000, 0.3700], grad_fn=<SelectBackward0>)
dot product of tensor([0.7600, 0.6800, 0.3600], grad_fn=<UnbindBackward0>) against journey tensor([0.5800, 0.7000, 0.3700], grad_fn=<SelectBackward0>)
tensor([1.2865, 0.9633, 0.7856, 0.9929, 1.0293, 1.0500], grad_fn=<CopySlices>)


### Step 2 - Normalize the attention scores to sum up to 1

In [21]:
def softmax_naive(x):
    return torch.exp(x) / torch.exp(x).sum(dim=0)

attn_weights_2_naive = softmax_naive(attn_scores_2)

print("Attention weights:", attn_weights_2_naive)
print("Sum:", attn_weights_2_naive.sum())

Attention weights: tensor([0.2156, 0.1561, 0.1307, 0.1608, 0.1667, 0.1702],
       grad_fn=<DivBackward0>)
Sum: tensor(1.0000, grad_fn=<SumBackward0>)


In [22]:
# Fooling around with dimensions on vectors
my_tensor = torch.ones(2,3,4) # I always start with the last dimension which is [-1]
print(my_tensor)
print(my_tensor.shape)
print(my_tensor.shape[-1]) # last dimension
print(my_tensor.shape[0]) # first dimension
print(my_tensor.shape[1])
print(my_tensor.shape[2]) # same as [-1]
# print(my_tensor.shape[3]) # IndexError

tensor([[[1., 1., 1., 1.],
         [1., 1., 1., 1.],
         [1., 1., 1., 1.]],

        [[1., 1., 1., 1.],
         [1., 1., 1., 1.],
         [1., 1., 1., 1.]]])
torch.Size([2, 3, 4])
4
2
3
4


In [23]:
# using PyTorch's softmax function
attn_weights_2 = torch.softmax(attn_scores_2, dim=0)

print("Attention weights:", attn_weights_2)
print("Sum:", attn_weights_2.sum()) # sums to 1 (100%)


Attention weights: tensor([0.2156, 0.1561, 0.1307, 0.1608, 0.1667, 0.1702],
       grad_fn=<SoftmaxBackward0>)
Sum: tensor(1., grad_fn=<SumBackward0>)
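
One reason to prefer a library softmax over the naive formula is numerical stability. The following plain-Python sketch (not PyTorch code; `softmax_stable` and the extreme scores are illustrative assumptions) shows the usual trick of subtracting the maximum before exponentiating. In plain Python the naive version raises `OverflowError` for very large inputs; in PyTorch the analogous failure would show up as `inf / inf = nan`.

```python
import math

def softmax_naive(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(xs):
    # Subtracting the maximum does not change the result mathematically,
    # but keeps every exponent <= 0, so exp() cannot overflow.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Attention scores printed in Step 1 (they depend on the random initialization)
small = [1.2865, 0.9633, 0.7856, 0.9929, 1.0293, 1.0500]
print([round(w, 4) for w in softmax_stable(small)])

large = [1000.0, 1001.0, 1002.0]  # made-up extreme scores
try:
    softmax_naive(large)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True
print(naive_overflowed, softmax_stable(large))
```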


### Step 3 - Compute the context vector $z^{(2)}$

In [24]:
query = rounded_embeddings[1] # 2nd input token is the query

context_vec_2 = torch.zeros(query.shape)
attn_weights_sum = 0
for i,x_i in enumerate(rounded_embeddings):
    print(f'{attn_weights_2[i]} * {x_i}')
    context_vec_2 += attn_weights_2[i]*x_i
    attn_weights_sum += attn_weights_2[i]

print(attn_weights_sum)
print(context_vec_2)

0.21561072766780853 * tensor([1.0000, 0.9300, 0.1500], grad_fn=<UnbindBackward0>)
0.15606530010700226 * tensor([0.5800, 0.7000, 0.3700], grad_fn=<UnbindBackward0>)
0.1306568682193756 * tensor([0.5700, 0.6500, 0.0000], grad_fn=<UnbindBackward0>)
0.16075389087200165 * tensor([0.7100, 0.5500, 0.5300], grad_fn=<UnbindBackward0>)
0.16671313345432281 * tensor([0.5600, 0.6100, 0.7500], grad_fn=<UnbindBackward0>)
0.17020007967948914 * tensor([0.7600, 0.6800, 0.3600], grad_fn=<UnbindBackward0>)
tensor(1., grad_fn=<AddBackward0>)
tensor([0.7174, 0.7005, 0.3616], grad_fn=<AddBackward0>)
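
The three steps above (scores, weights, context vector) can be reproduced without PyTorch to make the arithmetic explicit. This is a minimal sketch that hard-codes the rounded embeddings printed earlier; those values come from a random initialization, so a fresh run of the notebook will produce different numbers. The helper names `dot` and `softmax` are illustrative.

```python
import math

# Rounded embeddings as printed earlier in the notebook
embeddings = [
    [1.00, 0.93, 0.15],
    [0.58, 0.70, 0.37],
    [0.57, 0.65, 0.00],
    [0.71, 0.55, 0.53],
    [0.56, 0.61, 0.75],
    [0.76, 0.68, 0.36],
]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

query = embeddings[1]  # the 2nd token is the query

# Step 1: one dot product per input token
scores = [dot(x_i, query) for x_i in embeddings]
# Step 2: normalize the scores so they sum to 1
weights = softmax(scores)
# Step 3: weighted sum of the input vectors, dimension by dimension
context = [
    sum(w * x_i[d] for w, x_i in zip(weights, embeddings))
    for d in range(len(query))
]

print([round(s, 4) for s in scores])
print([round(w, 4) for w in weights])
print([round(c, 4) for c in context])
```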


### Compute all attention scores and weights

In [25]:
attn_scores = torch.empty(6, 6)

for i, x_i in enumerate(rounded_embeddings):
    for j, x_j in enumerate(rounded_embeddings):
        attn_scores[i, j] = torch.dot(x_i, x_j)

print(attn_scores)

tensor([[1.8874, 1.2865, 1.1745, 1.3010, 1.2398, 1.4464],
        [1.2865, 0.9633, 0.7856, 0.9929, 1.0293, 1.0500],
        [1.1745, 0.7856, 0.7474, 0.7622, 0.7157, 0.8752],
        [1.3010, 0.9929, 0.7622, 1.0875, 1.1306, 1.1044],
        [1.2398, 1.0293, 0.7157, 1.1306, 1.2482, 1.1104],
        [1.4464, 1.0500, 0.8752, 1.1044, 1.1104, 1.1696]],
       grad_fn=<CopySlices>)


We can achieve the same result more efficiently with a matrix multiplication:

In [26]:
# Using the matrix transpose (remember: matrix multiplication is rows times columns)
print(rounded_embeddings.shape)
print(rounded_embeddings.T.shape)
print('-------------------')
print(rounded_embeddings)
print('-------------------')
print(rounded_embeddings.T)

torch.Size([6, 3])
torch.Size([3, 6])
-------------------
tensor([[1.0000, 0.9300, 0.1500],
        [0.5800, 0.7000, 0.3700],
        [0.5700, 0.6500, 0.0000],
        [0.7100, 0.5500, 0.5300],
        [0.5600, 0.6100, 0.7500],
        [0.7600, 0.6800, 0.3600]], grad_fn=<DivBackward0>)
-------------------
tensor([[1.0000, 0.5800, 0.5700, 0.7100, 0.5600, 0.7600],
        [0.9300, 0.7000, 0.6500, 0.5500, 0.6100, 0.6800],
        [0.1500, 0.3700, 0.0000, 0.5300, 0.7500, 0.3600]],
       grad_fn=<PermuteBackward0>)


In [27]:
attn_scores = rounded_embeddings @ rounded_embeddings.T
print(attn_scores)

tensor([[1.8874, 1.2865, 1.1745, 1.3010, 1.2398, 1.4464],
        [1.2865, 0.9633, 0.7856, 0.9929, 1.0293, 1.0500],
        [1.1745, 0.7856, 0.7474, 0.7622, 0.7157, 0.8752],
        [1.3010, 0.9929, 0.7622, 1.0875, 1.1306, 1.1044],
        [1.2398, 1.0293, 0.7157, 1.1306, 1.2482, 1.1104],
        [1.4464, 1.0500, 0.8752, 1.1044, 1.1104, 1.1696]],
       grad_fn=<MmBackward0>)


Apply softmax

In [28]:
# dim=-1 applies softmax along the last dimension, so each row sums to 1
attn_weights = torch.softmax(attn_scores, dim=-1)
print(attn_weights)

tensor([[0.2658, 0.1458, 0.1303, 0.1479, 0.1391, 0.1710],
        [0.2156, 0.1561, 0.1307, 0.1608, 0.1667, 0.1702],
        [0.2291, 0.1553, 0.1494, 0.1517, 0.1448, 0.1698],
        [0.2087, 0.1534, 0.1218, 0.1686, 0.1760, 0.1715],
        [0.1928, 0.1562, 0.1142, 0.1729, 0.1945, 0.1694],
        [0.2262, 0.1522, 0.1278, 0.1607, 0.1617, 0.1715]],
       grad_fn=<SoftmaxBackward0>)


In [29]:
row_0_sum = sum([0.1403, 0.1365, 0.1915, 0.1552, 0.1659, 0.2106])  # hand-copied example row; these values do not match row 0 printed above
print(row_0_sum)
print("All row sums:", attn_weights.sum(dim=-1))

1.0
All row sums: tensor([1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000],
       grad_fn=<SumBackward1>)


### Compute All context vectors

In [30]:
all_context_vecs = attn_weights @ rounded_embeddings
print(all_context_vecs)
print("Previous 2nd context vector:", context_vec_2)

tensor([[0.7376, 0.7165, 0.3381],
        [0.7174, 0.7005, 0.3616],
        [0.7221, 0.7060, 0.3419],
        [0.7157, 0.6974, 0.3712],
        [0.7089, 0.6918, 0.3852],
        [0.7223, 0.7036, 0.3584]], grad_fn=<MmBackward0>)
Previous 2nd context vector: tensor([0.7174, 0.7005, 0.3616], grad_fn=<AddBackward0>)


## Self Attention

**Weight parameters** are learned coefficients that define the network connections, while **attention weights** are dynamic, context-specific values.

In [31]:
x_2 = rounded_embeddings[1] # second input element
d_in = rounded_embeddings.shape[1] # the input embedding size, taken from the embeddings, dim=3
d_out = 2 # the output embedding size, dim=2

torch.manual_seed(123)

W_query = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_key   = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_value = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)

print(W_query)
print(W_key)
print(W_value)

Parameter containing:
tensor([[0.2961, 0.5166],
        [0.2517, 0.6886],
        [0.0740, 0.8665]])
Parameter containing:
tensor([[0.1366, 0.1025],
        [0.1841, 0.7264],
        [0.3153, 0.6871]])
Parameter containing:
tensor([[0.0756, 0.1966],
        [0.3164, 0.4017],
        [0.1186, 0.8274]])


In [32]:
query_2 = x_2 @ W_query # _2 because it's with respect to the 2nd input element
key_2 = x_2 @ W_key
value_2 = x_2 @ W_value

print(query_2)

# calculate keys and values vector for all inputs
keys = rounded_embeddings @ W_key
values = rounded_embeddings @ W_value

print("keys.shape:", keys.shape)
print("values.shape:", values.shape)

tensor([0.3753, 1.1022], grad_fn=<SqueezeBackward4>)
keys.shape: torch.Size([6, 2])
values.shape: torch.Size([6, 2])


In [33]:
keys_2 = keys[1]
attn_score_22 = query_2.dot(keys_2)
print(attn_score_22)

tensor(1.0281, grad_fn=<DotBackward0>)


In [34]:
attn_scores_2 = query_2 @ keys.T # All attention scores for given query
print(attn_scores_2)

tensor([1.1044, 1.0281, 0.6589, 1.0591, 1.2793, 1.0315],
       grad_fn=<SqueezeBackward4>)


The difference from the earlier version is that we now scale the attention scores by dividing them by the square root of the key dimension, $\sqrt{d_k}$ (i.e., `d_k**0.5`), before applying softmax:

Imagine you have two vectors, and their dot product results in a large value. When this large value is passed through the softmax function, it might dominate the probabilities, making the attention mechanism less sensitive to other relevant parts of the input. Scaling helps to mitigate this issue by preventing any single dot product from becoming overly influential.

In [35]:
d_k = keys.shape[1] # dimension of keys
attn_weights_2 = torch.softmax(attn_scores_2 / d_k**0.5, dim=-1)
print(attn_weights_2)

tensor([0.1746, 0.1654, 0.1274, 0.1691, 0.1976, 0.1658],
       grad_fn=<SoftmaxBackward0>)
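
To see what the scaling does, here is a small plain-Python sketch with made-up numbers. It assumes a key dimension of 64 and three hypothetical dot products; with higher-dimensional vectors the dot products tend to grow in magnitude, and softmax over large values puts nearly all the weight on one token.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64                      # assumed key dimension
raw_scores = [8.0, 4.0, 0.0]  # made-up dot products

# Same scores, with and without dividing by sqrt(d_k)
unscaled_weights = softmax(raw_scores)
scaled_weights = softmax([s / d_k**0.5 for s in raw_scores])

print([round(w, 4) for w in unscaled_weights])  # one weight dominates
print([round(w, 4) for w in scaled_weights])    # weights are more evenly spread
```

The ordering of the weights is the same in both cases; only how sharply they concentrate changes.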


In [36]:
context_vec_2 = attn_weights_2 @ values
print(context_vec_2)

tensor([0.3161, 0.7322], grad_fn=<SqueezeBackward4>)


### Compact SelfAttention Class

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/18.webp" width="400px">

- We can streamline the implementation above by using PyTorch's `nn.Linear` layers instead of `nn.Parameter(torch.rand(...))`; a linear layer is equivalent to a matrix multiplication when its bias units are disabled.
- Bias terms in the query, key, and value projections are often omitted in attention implementations, so `qkv_bias` defaults to `False` here.
- Another big advantage of using `nn.Linear` over the manual `nn.Parameter(torch.rand(...))` approach is that `nn.Linear` has a preferred weight initialization scheme, which leads to more stable model training.
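
The equivalence between a bias-free linear layer and a matrix multiplication can be checked by hand. The sketch below is a plain-Python stand-in, not PyTorch code: `linear` and `matmul_vec` are hypothetical helpers, and `linear` assumes the weight is stored with shape `(d_out, d_in)` and applied as `x @ weight.T`, which is the layout `nn.Linear` uses. The input and weight values are the ones printed earlier in the notebook.

```python
def linear(x, weight, bias=None):
    """Plain-Python stand-in for a linear layer: y = x @ weight.T (+ bias)."""
    out = []
    for j, row in enumerate(weight):  # one row of weights per output unit
        y_j = sum(x_k * w_k for x_k, w_k in zip(x, row))
        if bias is not None:
            y_j += bias[j]
        out.append(y_j)
    return out

def matmul_vec(x, w):
    """Vector x (d_in) times matrix w (d_in x d_out), as in `x_2 @ W_query`."""
    d_out = len(w[0])
    return [sum(x[k] * w[k][j] for k in range(len(x))) for j in range(d_out)]

x_2 = [0.58, 0.70, 0.37]  # second rounded embedding
W_query = [[0.2961, 0.5166],
           [0.2517, 0.6886],
           [0.0740, 0.8665]]  # shape (d_in, d_out), printed earlier

# Transpose to the (d_out, d_in) layout expected by `linear`
weight = [list(col) for col in zip(*W_query)]

print(matmul_vec(x_2, W_query))
print(linear(x_2, weight))  # same numbers when bias is disabled
```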

In [37]:
import torch.nn as nn

In [50]:
class SelfAttention_v2(nn.Module):

    def __init__(self, d_in, d_out, qkv_bias=False):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)

    def forward(self, x):
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        attn_scores = queries @ keys.T
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)

        context_vec = attn_weights @ values
        return context_vec

In [51]:
torch.manual_seed(789)
sa_v2 = SelfAttention_v2(d_in, d_out)
print(sa_v2(rounded_embeddings))

tensor([[0.0044, 0.2247],
        [0.0053, 0.2258],
        [0.0059, 0.2266],
        [0.0052, 0.2258],
        [0.0050, 0.2255],
        [0.0051, 0.2256]], grad_fn=<MmBackward0>)


### Hiding future words with causal attention (one step back)

In [40]:
# Reuse data from previous section
queries = sa_v2.W_query(rounded_embeddings)
keys = sa_v2.W_key(rounded_embeddings)
attn_scores = queries @ keys.T

attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
print(attn_weights)

tensor([[0.1561, 0.1632, 0.1462, 0.1800, 0.1864, 0.1680],
        [0.1585, 0.1636, 0.1502, 0.1777, 0.1820, 0.1680],
        [0.1602, 0.1648, 0.1541, 0.1746, 0.1787, 0.1675],
        [0.1584, 0.1632, 0.1498, 0.1784, 0.1821, 0.1682],
        [0.1577, 0.1627, 0.1483, 0.1797, 0.1833, 0.1684],
        [0.1579, 0.1634, 0.1492, 0.1784, 0.1829, 0.1681]],
       grad_fn=<SoftmaxBackward0>)


Setting the scores of future tokens (the entries above the diagonal) to negative infinity makes their probabilities exactly zero in the subsequent softmax, since $e^{-\infty} = 0$.

In [41]:
context_length = attn_scores.shape[-1]
mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)
masked = attn_scores.masked_fill(mask.bool(), -torch.inf)
print(masked)

tensor([[ 0.0553,    -inf,    -inf,    -inf,    -inf,    -inf],
        [ 0.0413,  0.0866,    -inf,    -inf,    -inf,    -inf],
        [ 0.0350,  0.0750, -0.0206,    -inf,    -inf,    -inf],
        [ 0.0401,  0.0828, -0.0388,  0.2085,    -inf,    -inf],
        [ 0.0420,  0.0860, -0.0450,  0.2268,  0.2548,    -inf],
        [ 0.0439,  0.0919, -0.0362,  0.2163,  0.2516,  0.1318]],
       grad_fn=<MaskedFillBackward0>)


In [42]:
attn_weights = torch.softmax(masked / keys.shape[-1]**0.5, dim=-1)
print(attn_weights)

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.4920, 0.5080, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3344, 0.3440, 0.3215, 0.0000, 0.0000, 0.0000],
        [0.2437, 0.2512, 0.2305, 0.2746, 0.0000, 0.0000],
        [0.1896, 0.1956, 0.1783, 0.2161, 0.2204, 0.0000],
        [0.1579, 0.1634, 0.1492, 0.1784, 0.1829, 0.1681]],
       grad_fn=<SoftmaxBackward0>)
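
The masking trick can also be traced in plain Python. This is a sketch with made-up scores, not the notebook's tensors; it relies on `math.exp(-math.inf)` evaluating to `0.0`, so masked positions receive zero weight while each row still sums to 1.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Made-up 3x3 attention scores (row = query token, column = key token)
scores = [
    [0.1, 0.4, 0.2],
    [0.3, 0.2, 0.5],
    [0.6, 0.1, 0.3],
]

n = len(scores)
# Keep entries on or below the diagonal; hide future positions with -inf
masked = [
    [scores[i][j] if j <= i else -math.inf for j in range(n)]
    for i in range(n)
]
weights = [softmax(row) for row in masked]

for row in weights:
    print([round(w, 4) for w in row])
```

Because the mask is applied before softmax, no renormalization step is needed afterwards.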


### Masking additional attention weights with dropout

In [43]:
torch.manual_seed(123)
dropout = torch.nn.Dropout(0.5) # dropout rate of 50%
example = torch.ones(6, 6) # create a matrix of ones

print(dropout(example))
print(dropout(attn_weights))

tensor([[2., 2., 2., 2., 2., 2.],
        [0., 2., 0., 0., 0., 0.],
        [0., 0., 2., 0., 2., 0.],
        [2., 2., 0., 0., 0., 2.],
        [2., 0., 0., 0., 0., 2.],
        [0., 2., 0., 0., 0., 0.]])
tensor([[2.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.6431, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.5024, 0.0000, 0.5491, 0.0000, 0.0000],
        [0.0000, 0.3912, 0.3566, 0.4322, 0.4408, 0.0000],
        [0.3159, 0.3268, 0.0000, 0.0000, 0.3659, 0.3361]],
       grad_fn=<MulBackward0>)
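
The values of 2.0 in the output above come from rescaling: with a dropout rate of `p`, the surviving entries are multiplied by `1 / (1 - p)` so that the expected value stays the same. A minimal plain-Python sketch of this idea follows; the `dropout` helper is hypothetical, and its random pattern will not match PyTorch's even with the same seed.

```python
import random

def dropout(values, p, rng):
    """Zero each value with probability p and scale survivors by 1 / (1 - p)."""
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * scale for v in values]

rng = random.Random(123)  # arbitrary seed for reproducibility
row = [1.0] * 6
dropped = dropout(row, p=0.5, rng=rng)
print(dropped)  # every entry is either 0.0 or 2.0

# Averaged over many draws, the mean stays close to the original value of 1.0
many = dropout([1.0] * 20000, p=0.5, rng=rng)
print(sum(many) / len(many))
```

Note that `nn.Dropout` is only active in training mode; in evaluation mode it passes its input through unchanged.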


### Causal Attention Class

In [44]:
class CausalAttention(nn.Module):

    def __init__(self, d_in, d_out, context_length,
                 dropout, qkv_bias=False):
        super().__init__()
        self.d_out = d_out
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout) # New
        self.register_buffer('mask', torch.triu(torch.ones(context_length, context_length), diagonal=1)) # New

    def forward(self, x):
        b, num_tokens, d_in = x.shape # New batch dimension b
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        attn_scores = queries @ keys.transpose(1, 2) # Changed transpose
        # `:num_tokens` accounts for cases where the number of tokens in the batch is smaller than the supported context_length
        attn_scores.masked_fill_(  # New, _ ops are in-place
            self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1
        )
        attn_weights = self.dropout(attn_weights) # New

        context_vec = attn_weights @ values
        return context_vec
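
Side note on `register_buffer`: the mask is stored as a buffer rather than a parameter, so it is not trained, but it still appears in `state_dict()` and moves with the module on `.to(device)`. A small sketch, assuming a toy module (`WithBuffer` is a made-up name, not part of the notes):

```python
import torch
import torch.nn as nn

class WithBuffer(nn.Module):
    """Toy module (hypothetical) with one trainable layer and one buffer."""

    def __init__(self, context_length):
        super().__init__()
        self.linear = nn.Linear(3, 2)
        # Same kind of upper-triangular mask as in CausalAttention
        self.register_buffer(
            "mask",
            torch.triu(torch.ones(context_length, context_length), diagonal=1),
        )

m = WithBuffer(context_length=4)

param_names = [name for name, _ in m.named_parameters()]
buffer_names = [name for name, _ in m.named_buffers()]
state_keys = list(m.state_dict().keys())

print("parameters:", param_names)   # only the linear layer's weight and bias
print("buffers:", buffer_names)     # the mask
print("state_dict keys:", state_keys)
```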

In [55]:
torch.manual_seed(123)

batch = torch.stack((rounded_embeddings, rounded_embeddings), dim=0)
print(batch.shape) # one batch of 2 sequences, 6 tokens each, and each token has embedding dimension 3
print(batch)
context_length = batch.shape[1]
ca = CausalAttention(d_in, d_out, context_length, 0.0)

context_vecs = ca(batch)

print("context_shape", context_vecs.shape)
print("context_vecs:", context_vecs)

torch.Size([2, 6, 3])
tensor([[[1.0000, 0.9300, 0.1500],
         [0.5800, 0.7000, 0.3700],
         [0.5700, 0.6500, 0.0000],
         [0.7100, 0.5500, 0.5300],
         [0.5600, 0.6100, 0.7500],
         [0.7600, 0.6800, 0.3600]],

        [[1.0000, 0.9300, 0.1500],
         [0.5800, 0.7000, 0.3700],
         [0.5700, 0.6500, 0.0000],
         [0.7100, 0.5500, 0.5300],
         [0.5600, 0.6100, 0.7500],
         [0.7600, 0.6800, 0.3600]]], grad_fn=<StackBackward0>)
2
context_shape torch.Size([2, 6, 2])
context_vecs: tensor([[[-0.8476, -0.4664],
         [-0.7296, -0.3522],
         [-0.6553, -0.3511],
         [-0.6575, -0.2942],
         [-0.6557, -0.2441],
         [-0.6602, -0.2456]],

        [[-0.8476, -0.4664],
         [-0.7296, -0.3522],
         [-0.6553, -0.3511],
         [-0.6575, -0.2942],
         [-0.6557, -0.2441],
         [-0.6602, -0.2456]]], grad_fn=<UnsafeViewBackward0>)
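
To see what the causal mask does on its own, here is a minimal sketch with random scores (not the real embeddings from the notes, and `num_tokens = 4` is an arbitrary choice). Positions above the diagonal are filled with `-inf`, so softmax gives them weight 0 and every row still sums to 1:

```python
import torch

torch.manual_seed(123)

num_tokens = 4
attn_scores = torch.randn(num_tokens, num_tokens)  # stand-in for queries @ keys.T

# 1s above the diagonal mark the "future" tokens that must be hidden
mask = torch.triu(torch.ones(num_tokens, num_tokens), diagonal=1)

# Out-of-place masked_fill so the original scores stay untouched
masked_scores = attn_scores.masked_fill(mask.bool(), -torch.inf)
attn_weights = torch.softmax(masked_scores, dim=-1)

print(mask)
print(attn_weights)
print("row sums:", attn_weights.sum(dim=-1))
```

The first row has a single non-zero weight, since the first token can only attend to itself.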


## Multi-Head Self-Attention

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/26.webp" width="400px">

In [60]:
# Multiple heads extract different types of information; every head uses differently initialized weights
# Each head produces context vectors of dimension head_dim, which are concatenated at the end

class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert (d_out % num_heads == 0), \
            "d_out must be divisible by num_heads"

        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads # Reduce the projection dim to match desired output dim

        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)  # Linear layer to combine head outputs
        self.dropout = nn.Dropout(dropout)
        self.register_buffer(
            "mask",
            torch.triu(torch.ones(context_length, context_length), diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, d_in = x.shape # b is the batch size
        print(x.shape)
        keys = self.W_key(x) # Shape: (b, num_tokens, d_out)
        queries = self.W_query(x)
        values = self.W_value(x)
        # We implicitly split the matrix by adding a `num_heads` dimension
        # Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
        values = values.view(b, num_tokens, self.num_heads, self.head_dim)
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)

        # Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
        print('--------------------')
        print(f'{keys.shape} vs {keys.transpose(1,2).shape}')
        print('--------------------')
        keys = keys.transpose(1, 2)
        queries = queries.transpose(1, 2)
        values = values.transpose(1, 2)

        # Compute scaled dot-product attention (aka self-attention) with a causal mask
        attn_scores = queries @ keys.transpose(2, 3)  # Dot product for each head

        # Original mask truncated to the number of tokens and converted to boolean
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]

        # Use the mask to fill attention scores
        attn_scores.masked_fill_(mask_bool, -torch.inf)

        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Shape: (b, num_tokens, num_heads, head_dim)
        context_vec = (attn_weights @ values).transpose(1, 2)

        # Combine heads, where self.d_out = self.num_heads * self.head_dim
        context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
        context_vec = self.out_proj(context_vec) # optional projection

        return context_vec

torch.manual_seed(123)

batch_size, context_length, d_in = batch.shape
d_out = 2
mha = MultiHeadAttention(d_in, d_out, context_length, 0.0, num_heads=2)

context_vecs = mha(batch)

print(context_vecs)
print("context_vecs.shape:", context_vecs.shape)

torch.Size([2, 6, 3])
--------------------
torch.Size([2, 6, 2, 1]) vs torch.Size([2, 2, 6, 1])
--------------------
tensor([[[0.2288, 0.1973],
         [0.2357, 0.2711],
         [0.2233, 0.3080],
         [0.2367, 0.3146],
         [0.2476, 0.3221],
         [0.2479, 0.3197]],

        [[0.2288, 0.1973],
         [0.2357, 0.2711],
         [0.2233, 0.3080],
         [0.2367, 0.3146],
         [0.2476, 0.3221],
         [0.2479, 0.3197]]], grad_fn=<ViewBackward0>)
context_vecs.shape: torch.Size([2, 6, 2])
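
As a sanity check on the `view` / `transpose` head splitting, the sketch below repeats the same steps on random tensors (stand-ins for the projected queries, keys and values, with made-up sizes) and compares the result with PyTorch's built-in `F.scaled_dot_product_attention` using `is_causal=True`. This assumes PyTorch 2.0 or newer. It is only a comparison for intuition, not how the class above is implemented:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(123)

b, num_tokens, num_heads, head_dim = 2, 6, 2, 4
d_out = num_heads * head_dim

# Random stand-ins for W_query(x), W_key(x), W_value(x)
queries = torch.randn(b, num_tokens, d_out)
keys = torch.randn(b, num_tokens, d_out)
values = torch.randn(b, num_tokens, d_out)

def split_heads(t):
    # (b, num_tokens, d_out) -> (b, num_heads, num_tokens, head_dim)
    return t.view(b, num_tokens, num_heads, head_dim).transpose(1, 2)

q, k, v = split_heads(queries), split_heads(keys), split_heads(values)

# Manual causal attention, one score matrix per head
attn_scores = q @ k.transpose(2, 3)
mask = torch.triu(torch.ones(num_tokens, num_tokens), diagonal=1).bool()
attn_scores = attn_scores.masked_fill(mask, -torch.inf)
attn_weights = torch.softmax(attn_scores / head_dim**0.5, dim=-1)

# Merge heads back: (b, num_heads, num_tokens, head_dim) -> (b, num_tokens, d_out)
manual = (attn_weights @ v).transpose(1, 2).contiguous().view(b, num_tokens, d_out)

# Built-in fused version, same causal masking and 1/sqrt(head_dim) scaling
fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)
fused = fused.transpose(1, 2).contiguous().view(b, num_tokens, d_out)

print("manual.shape:", manual.shape)
print("matches built-in:", torch.allclose(manual, fused, atol=1e-5))
```

The `.contiguous()` call is needed because `transpose` only changes strides, and `view` requires a contiguous memory layout.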
