## V1 

In [1]:
import torch
import torch.nn as nn
from torch.nn import functional as F

### V1 - Introduction
- Attention Is All You Need
- Transformer
- Transformer decoder
- Andrej Karpathy
- GPT (Generative Pre-trained Transformer): a decoder-only Transformer
- This is a very simplified version; some steps and layers are skipped
- An introductory overview of how ChatGPT-style large language models are trained. This tutorial is motivated by a video by Andrej Karpathy
  
  
  

### V2 - Tokenization
Tokenization is the process of converting a sequence of characters into a sequence of tokens. In this script, the given text is read from an input.txt file. Every unique character in this text is treated as a token, leading to a vocabulary of unique characters. The script then provides utilities (encode and decode functions) to convert strings to their tokenized representations and vice versa.

In [2]:
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()
    
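chars = sorted(list(set(text)))  # vocabulary derived from the text
# Override with a hardcoded character order; this is the vocabulary actually used below
chars = list("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!$&',-.3:;? \n")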
vocab_size = len(chars)
all_chars = ''.join(chars)
print(all_chars)
print(vocab_size)
stoi = {ch:i for i, ch in enumerate(chars) }
itos = {i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join(itos[i] for i in l)
print(encode("hii there"))
print(decode(encode("hii there")))

data = torch.tensor(encode(text), dtype = torch.long)
n = int(0.9*len(data))
train_data = data[:n]
val_data = data[n:]

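def get_batch(split):
    # block_size, batch_size, and device are globals defined in the next cell
    data = train_data if split == 'train' else val_data
    # Random start offsets, one per sequence in the batch
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    # Targets are the inputs shifted one position to the right
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x.to(device), y.to(device)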


abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!$&',-.3:;? 

65
[7, 8, 8, 63, 19, 7, 4, 17, 4]
hii there
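
To make the input/target relationship in `get_batch` concrete, here is a minimal sketch on toy data. The names `toy_data`, `toy_block_size`, and the fixed `start` index are assumptions made for illustration; the notebook itself samples random start offsets from the encoded text.

```python
import torch

# Toy token ids standing in for the encoded text (hypothetical values).
toy_data = torch.arange(10)
toy_block_size = 4

start = 2  # fixed start index instead of a random one, for reproducibility
x_example = toy_data[start:start + toy_block_size]
y_example = toy_data[start + 1:start + toy_block_size + 1]

print(x_example.tolist())  # [2, 3, 4, 5]
print(y_example.tolist())  # [3, 4, 5, 6]

# At every position t, the target is the token that follows x_example[t],
# so one sequence of length block_size yields block_size training examples.
for t in range(toy_block_size):
    context = x_example[:t + 1].tolist()
    target = y_example[t].item()
    print(f"context {context} -> target {target}")
```

Each row of `y` is simply the matching row of `x` shifted by one token, which is what lets the model learn next-token prediction at every position at once.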


### V3 - Parameters, Embedding and Positional Encoding
Parameters: this section lists the hyperparameters used for training and for the model, such as the batch size, the maximum number of iterations, and the learning rate.

Embeddings are a way of representing categorical data, like words or characters, as continuous vectors. In this script, each character is embedded into a continuous vector space using an embedding layer.

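Positional embeddings give the model information about where each token sits in the sequence. They are crucial for Transformers, which would otherwise have no notion of token order. In this script they are learned: an embedding table holds one vector per position (0 to T-1), and that vector is added to the token embedding at the same position.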
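
The following sketch shows why the positional term matters, using the token ids printed earlier for "hii " (`[7, 8, 8, 63]`). The `toy_*` names and table variables are assumptions for illustration, and the tables are randomly initialized rather than trained.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy sizes, chosen only for illustration.
toy_vocab_size, toy_block_size, toy_n_embd = 65, 4, 4

tok_table = nn.Embedding(toy_vocab_size, toy_n_embd)
pos_table = nn.Embedding(toy_block_size, toy_n_embd)

idx = torch.tensor([[7, 8, 8, 63]])           # (B=1, T=4) token ids
tok = tok_table(idx)                           # (1, 4, 4)
pos = pos_table(torch.arange(idx.shape[1]))    # (4, 4)
emb = tok + pos                                # broadcast over the batch -> (1, 4, 4)

print(tok.shape, pos.shape, emb.shape)

# Positions 1 and 2 hold the same token (id 8), so their token embeddings match,
# but the summed embeddings differ because the position vectors differ.
same_token = torch.equal(tok[0, 1], tok[0, 2])
same_after_pos = torch.equal(emb[0, 1], emb[0, 2])
print(same_token, same_after_pos)
```

Without the positional term, the two `i` characters would be indistinguishable to the attention layers that follow.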

In [3]:
batch_size = 1
block_size = 4
max_iters = 10000
learning_rate = 3e-3
device = 'cuda' if torch.cuda.is_available() else ('mps' if torch.backends.mps.is_available() else 'cpu')
eval_interval = 1
n_embd = 4
dropout = 0.2

token_embedding_table = nn.Embedding(vocab_size, n_embd).to(device)
positional_embedding_table = nn.Embedding(block_size, n_embd).to(device)
xb, yb = get_batch('train')
B, T = xb.shape
tok_emb = token_embedding_table(xb.to(device))
pos_emb = positional_embedding_table(torch.arange(T, device=xb.device))
x = tok_emb + pos_emb


In [4]:
dropout_layer = nn.Dropout(dropout).to(device)
tril = torch.tril(torch.ones(block_size, block_size)).to(device)
ln_f = nn.LayerNorm(n_embd).to(device)

### V4 - Multihead Attention Layer 1 Head 1
Multi-head attention lets the model attend to different parts of the input sequence when producing each output position. It computes several sets ("heads") of key, query, and value projections, runs attention in each head, and combines the results. This script defines a Transformer with 2 layers, each with 2 attention heads.

Multihead Attention Layer 1 Head 1 & Head 2
These are the first and second heads of the multi-head attention mechanism in the first layer. Each head computes keys, queries, and values as linear projections of the input, derives attention weights from the scaled, causally masked query-key scores, and uses those weights to form a weighted average of the values.
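
As a small illustration of what the causal mask does to the attention weights, here is a sketch of a single head on random inputs. The `*_toy` names and the random `q`, `k`, `v` tensors are assumptions standing in for real projections, and the scores are scaled by the per-head dimension, following the formulation in Attention Is All You Need.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
T_toy, head_size_toy = 4, 2

# Random stand-ins for the query, key, and value projections of one head.
q = torch.randn(1, T_toy, head_size_toy)
k = torch.randn(1, T_toy, head_size_toy)
v = torch.randn(1, T_toy, head_size_toy)

# Scaled dot-product scores, shape (1, T, T).
scores = (q @ k.transpose(-2, -1)) * head_size_toy ** -0.5

# Lower-triangular mask: position t may only look at positions <= t.
mask = torch.tril(torch.ones(T_toy, T_toy))
scores = scores.masked_fill(mask == 0, float('-inf'))

weights = F.softmax(scores, dim=-1)  # each row sums to 1, future positions get 0
head_out = weights @ v               # (1, T, head_size)

print(weights[0])
print(head_out.shape)
```

The first row of `weights` is always `[1, 0, 0, 0]`, since the first token can only attend to itself.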

In [5]:
# Hardcoded keys, queries, and values for 2 layers with 2 heads each
key_1_1 = nn.Linear(n_embd, n_embd // 2, bias=False).to(device)  # projects n_embd -> head size (n_embd // 2)
query_1_1 = nn.Linear(n_embd, n_embd // 2, bias=False).to(device)
value_1_1 = nn.Linear(n_embd, n_embd // 2, bias=False).to(device)
k1 = key_1_1(x)
q1 = query_1_1(x)
v1 = value_1_1(x)
wei1 = (q1 @ k1.transpose(-2, -1)) * (k1.shape[-1]**-0.5)  # scale by the head size (d_k), not n_embd
wei1 = F.softmax(wei1.masked_fill(tril[:T, :T] == 0, float('-inf')), dim=-1)
out1 = wei1 @ v1

### V4 - Multihead Attention Layer 1 Head 2

In [6]:
key_1_2 = nn.Linear(n_embd, n_embd // 2, bias=False).to(device)
query_1_2 = nn.Linear(n_embd, n_embd // 2, bias=False).to(device)
value_1_2 = nn.Linear(n_embd, n_embd // 2, bias=False).to(device)
k2 = key_1_2(x)
q2 = query_1_2(x)
v2 = value_1_2(x)
wei2 = (q2 @ k2.transpose(-2, -1)) * (k2.shape[-1]**-0.5)  # scale by the head size (d_k), not n_embd
wei2 = F.softmax(wei2.masked_fill(tril[:T, :T] == 0, float('-inf')), dim=-1)
out2 = wei2 @ v2
out = torch.cat([out1, out2], dim=-1)

### V5 - Layer 1 Add and Norm - Feedforward - Add and Norm
After the attention outputs for each head are computed, they are concatenated and passed through a linear projection. The result is added to the layer input (a residual connection) and normalized. A position-wise feedforward network follows, with its own Add and Norm step. The residual connections and normalization stabilize the activations and aid in training deeper models.
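
A minimal sketch of one Add and Norm step, assuming random tensors in place of the real layer input and sublayer output (`x_in`, `sublayer_out`, and `n_embd_toy` are illustrative names):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_embd_toy = 4

x_in = torch.randn(1, 4, n_embd_toy)          # layer input, shape (B, T, C)
sublayer_out = torch.randn(1, 4, n_embd_toy)  # stand-in for attention or feedforward output

ln = nn.LayerNorm(n_embd_toy)
y = ln(x_in + sublayer_out)  # Add (residual), then Norm

# LayerNorm normalizes each token's feature vector independently.
print(y.mean(dim=-1))                 # close to 0 for every token
print(y.var(dim=-1, unbiased=False))  # close to 1 for every token
```

With the default initialization (scale 1, shift 0), each token vector leaves the step with roughly zero mean and unit variance across its features.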

In [7]:
proj_1 = nn.Linear(n_embd, n_embd).to(device)

ffwd_1_fc1 = nn.Linear(n_embd, 4 * n_embd).to(device)
ffwd_1_relu = nn.ReLU().to(device)
ffwd_1_fc2 = nn.Linear(4 * n_embd, n_embd).to(device)
ffwd_1_dropout = nn.Dropout(dropout).to(device)

ln_1_1 = nn.LayerNorm(n_embd).to(device)
ln_1_2 = nn.LayerNorm(n_embd).to(device)

x = x + dropout_layer(proj_1(out))
x = ln_1_1(x)

x_ffwd_1 = ffwd_1_fc1(x)
x_ffwd_1 = ffwd_1_relu(x_ffwd_1)
x_ffwd_1 = ffwd_1_fc2(x_ffwd_1)
x_ffwd_1 = ffwd_1_dropout(x_ffwd_1)

x = x + x_ffwd_1
x = ln_1_2(x)

### V6 - Multihead Attention Layer 2 Head 1
Similar to the first layer, this defines the multihead attention mechanism for the second transformer layer.

In [8]:
key_2_1 = nn.Linear(n_embd, n_embd // 2, bias=False).to(device)
query_2_1 = nn.Linear(n_embd, n_embd // 2, bias=False).to(device)
value_2_1 = nn.Linear(n_embd, n_embd // 2, bias=False).to(device)

k1 = key_2_1(x)
q1 = query_2_1(x)
v1 = value_2_1(x)
wei1 = (q1 @ k1.transpose(-2, -1)) * (k1.shape[-1]**-0.5)  # scale by the head size (d_k), not n_embd
wei1 = F.softmax(wei1.masked_fill(tril[:T, :T] == 0, float('-inf')), dim=-1)
out1 = wei1 @ v1

### V6 - Multihead Attention Layer 2 Head 2

In [9]:
key_2_2 = nn.Linear(n_embd, n_embd // 2, bias=False).to(device)
query_2_2 = nn.Linear(n_embd, n_embd // 2, bias=False).to(device)
value_2_2 = nn.Linear(n_embd, n_embd // 2, bias=False).to(device)

k2 = key_2_2(x)
q2 = query_2_2(x)
v2 = value_2_2(x)
wei2 = (q2 @ k2.transpose(-2, -1)) * (k2.shape[-1]**-0.5)  # scale by the head size (d_k), not n_embd
wei2 = F.softmax(wei2.masked_fill(tril[:T, :T] == 0, float('-inf')), dim=-1)
out2 = wei2 @ v2
out = torch.cat([out1, out2], dim=-1)

### V7 - Layer 2 Add and Norm - Feedforward - Add and Norm
Just as in the first layer, the attention outputs of the two heads in the second layer are concatenated, projected, added to the layer input, and normalized. The result then passes through a feedforward network followed by a second add and normalize step.

In [10]:
proj_2 = nn.Linear(n_embd, n_embd).to(device)

ffwd_2_fc1 = nn.Linear(n_embd, 4 * n_embd).to(device)
ffwd_2_relu = nn.ReLU().to(device)
ffwd_2_fc2 = nn.Linear(4 * n_embd, n_embd).to(device)
ffwd_2_dropout = nn.Dropout(dropout).to(device)

ln_2_1 = nn.LayerNorm(n_embd).to(device)
ln_2_2 = nn.LayerNorm(n_embd).to(device)

x = x + dropout_layer(proj_2(out))
x = ln_2_1(x)
x_ffwd_2 = ffwd_2_fc1(x)
x_ffwd_2 = ffwd_2_relu(x_ffwd_2)
x_ffwd_2 = ffwd_2_fc2(x_ffwd_2)
x_ffwd_2 = ffwd_2_dropout(x_ffwd_2)


x = x + x_ffwd_2
x = ln_2_2(x)
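
Both layers repeat the same pattern, so it could be wrapped in a reusable module. The sketch below is one possible refactor, not the notebook's own code: the class names `Head` and `Block` and the toy input are assumptions, and the attention scores are scaled by the per-head dimension.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F


class Head(nn.Module):
    """One causal self-attention head (hypothetical helper)."""

    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        T = x.shape[1]
        k, q, v = self.key(x), self.query(x), self.value(x)
        wei = (q @ k.transpose(-2, -1)) * k.shape[-1] ** -0.5
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        return F.softmax(wei, dim=-1) @ v


class Block(nn.Module):
    """Attention -> Add & Norm -> feedforward -> Add & Norm (post-norm)."""

    def __init__(self, n_embd, n_head, block_size, dropout):
        super().__init__()
        head_size = n_embd // n_head
        self.heads = nn.ModuleList([Head(n_embd, head_size, block_size) for _ in range(n_head)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.ffwd = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd), nn.Dropout(dropout),
        )
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        att = torch.cat([h(x) for h in self.heads], dim=-1)  # concatenate heads
        x = self.ln1(x + self.drop(self.proj(att)))          # Add & Norm
        return self.ln2(x + self.ffwd(x))                    # feedforward, Add & Norm


# Two stacked layers with two heads each, using the notebook's toy sizes.
blocks = nn.Sequential(*[Block(n_embd=4, n_head=2, block_size=4, dropout=0.2) for _ in range(2)])
x_toy = torch.randn(1, 4, 4)  # (B, T, n_embd)
y_toy = blocks(x_toy)
print(y_toy.shape)
```

A side benefit of this layout is that `blocks.parameters()` collects every weight automatically, instead of listing each layer by hand when building the optimizer.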

### V8 - Final Feedforward
The output from the last transformer layer is passed through a linear layer (often called the language-model head) to produce one logit per vocabulary token at every position. These logits are used to predict the next token in a sequence, which is what makes this a language model.
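
To illustrate how logits become a next-token prediction, the sketch below converts the logits at the last position into probabilities and samples one token id. The random `logits_toy` tensor is an assumption standing in for real model output, and sampling with `torch.multinomial` is one common choice rather than something this notebook does.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
vocab_size_toy = 65

# Stand-in for model output: (B, T, vocab_size) logits.
logits_toy = torch.randn(1, 4, vocab_size_toy)

last = logits_toy[:, -1, :]                          # logits at the final position, (B, vocab)
probs = F.softmax(last, dim=-1)                      # probabilities over the vocabulary
next_id = torch.multinomial(probs, num_samples=1)    # sampled token id, (B, 1)

print(probs.sum(dim=-1))
print(next_id.shape)
```

Appending `next_id` to the input and repeating the forward pass is the usual way to generate text one token at a time.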

In [12]:
lm_head = nn.Linear(n_embd, vocab_size).to(device)
logits = lm_head(x)

### V9 - Loss and backpropagation

In [14]:
optimizer = torch.optim.AdamW([
    *token_embedding_table.parameters(),
    *positional_embedding_table.parameters(),
    *key_1_1.parameters(), *key_1_2.parameters(), *key_2_1.parameters(), *key_2_2.parameters(),
    *query_1_1.parameters(), *query_1_2.parameters(), *query_2_1.parameters(), *query_2_2.parameters(),
    *value_1_1.parameters(), *value_1_2.parameters(), *value_2_1.parameters(), *value_2_2.parameters(),
    *proj_1.parameters(), *proj_2.parameters(),
    *ffwd_1_fc1.parameters(), *ffwd_1_fc2.parameters(),
    *ffwd_2_fc1.parameters(), *ffwd_2_fc2.parameters(),
    *ln_1_1.parameters(), *ln_1_2.parameters(), *ln_2_1.parameters(), *ln_2_2.parameters(),
    *ln_f.parameters(),  # note: ln_f is never applied in the forward pass above, so it receives no gradient
    *lm_head.parameters()
], lr=learning_rate)

loss = F.cross_entropy(logits.view(-1, vocab_size), yb.view(-1).to(device))
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(loss)

tensor(4.3645, device='mps:0', grad_fn=<NllLossBackward0>)
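
As a sanity check on the printed loss, a model that assigns equal probability to all 65 characters has a cross-entropy of ln(65), about 4.174. The sketch below reproduces that baseline with all-zero logits; the `uniform_logits` and `targets` tensors are illustrative assumptions.

```python
import math

import torch
from torch.nn import functional as F

vocab_size_toy = 65

# All-zero logits give a uniform distribution over the vocabulary.
uniform_logits = torch.zeros(4, vocab_size_toy)
targets = torch.tensor([7, 8, 8, 63])  # any valid token ids give the same loss here

baseline = F.cross_entropy(uniform_logits, targets)
print(baseline.item(), math.log(vocab_size_toy))
```

The loss of 4.3645 printed above is close to this baseline, which is what one would expect from a randomly initialized model after a single step on a single batch.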
