# NanoGPT
This notebook is a summary of Andrej Karpathy's [Let's build GPT: from scratch, in code, spelled out.](https://youtu.be/kCc8FmEb1nY?si=NAudUTMqId6D-oxA) lecture as part of the Neural Networks: Zero to Hero series.

As in previous videos, we will be using a character-level language model. We will train a model to generate Shakespeare-like text by training it on the `tinyshakespeare` dataset.

Note that there is a tradeoff between vocabulary size and sequence length: a larger vocabulary lets each token cover more characters, so the same text encodes to fewer tokens. We use character-level tokens, so we have a small vocabulary and long sequences.
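To make the sequence-length side of the tradeoff concrete, the sketch below tokenizes one sentence two ways: per character and per whitespace-separated word. The sample sentence and the whitespace "word tokenizer" are assumptions for illustration only. On the vocabulary side, the character vocabulary of this dataset is capped at 65 symbols, whereas a word-level vocabulary keeps growing as more text is seen.

```python
# Illustrative sketch: the sample text and the whitespace word split are assumptions
sample = "First Citizen: Before we proceed any further, hear me speak."

char_tokens = list(sample)    # one token per character
word_tokens = sample.split()  # one token per whitespace-separated word

print("character-level sequence length:", len(char_tokens))
print("word-level sequence length:", len(word_tokens))
```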

## Setup code

In [1]:
# download the dataset
#!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

In [2]:
with open('nanogpt/input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [3]:
print("length of dataset is: ", len(text))

length of dataset is:  1115394


In [4]:
print(text[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



In [5]:
# find vocab of input file (tokens are character-level)
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65


In [6]:
# implement encoder and decoder
# encoder: takes strings and converts them to integers
# decoder: takes integers and converts them back to the original string

stoi = {ch : i for i, ch in enumerate(chars)}
itos = {i : ch for i, ch in enumerate(chars)}
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: "".join([itos[i] for i in l])

In [7]:
print(encode("hii there"))
print(decode(encode("hii there")))

[46, 47, 47, 1, 58, 46, 43, 56, 43]
hii there


In [8]:
# convert data into a tensor
import torch
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:1000])

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59,  1, 49, 52, 53, 61,  1, 15, 39, 47, 59, 57,  1, 25, 39, 56, 41,
      

In [9]:
# split data into train and validation
n = int(0.9 * len(data)) # use first 90% as train
train_data = data[:n]
val_data = data[n:] # remaining 10% is held out for validation

## A single block contains multiple examples

When we feed in data to the transformer, we do not feed in the entire dataset at once because it is computationally prohibitive.
Instead, we feed in subsequences of the data of a fixed size, known as blocks.

In our case of pre-training using next-token prediction, a single block actually contains multiple examples, as shown below. This is computationally efficient, and it also exposes the model to contexts of every length from 1 up to `block_size`.

In [10]:
block_size = 8
train_data[:block_size+1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

In [11]:
x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} the target is {target}")

when input is tensor([18]) the target is 47
when input is tensor([18, 47]) the target is 56
when input is tensor([18, 47, 56]) the target is 57
when input is tensor([18, 47, 56, 57]) the target is 58
when input is tensor([18, 47, 56, 57, 58]) the target is 1
when input is tensor([18, 47, 56, 57, 58,  1]) the target is 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]) the target is 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target is 58


In [12]:
torch.manual_seed(1337)
batch_size = 4
block_size = 8

def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == "train" else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

xb, yb = get_batch("train")
print("inputs:")
print(xb.shape)
print(xb)
print("targets:")
print(yb.shape)
print(yb)

print("---------")


inputs:
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
targets:
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
---------


## Implementation
We will implement a simple bigram prototype without attention in this notebook. The goal is to get familiar with the PyTorch tensor manipulation tricks.

In [13]:
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)
        
    def forward(self, idx, targets=None):
        # idx and targets are both (B,T) tensor of integers
        logits = self.token_embedding_table(idx) # (B,T,C) tensor
        
        # NOTE: the shape of the returned logits is different depending on whether targets is provided
        if targets is None:
            loss = None
        else:
            # flatten batch and time so the shapes match what F.cross_entropy expects:
            # logits of shape (N, C) with the class dimension second, targets of shape (N,)
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)

            loss = F.cross_entropy(logits, targets)
        
        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        # this function extends this to (B, T+1), (B, T+2), ...
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1)
            # sample from distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append samples
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx
    
m = BigramLanguageModel(vocab_size)
logits, loss = m(xb, yb)
print(logits.shape)
print(loss)

print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))

torch.Size([32, 65])
tensor(4.8786, grad_fn=<NllLossBackward0>)

SKIcLT;AcELMoTbvZv C?nq-QE33:CJqkOKH-q;:la!oiywkHjgChzbQ?u!3bLIgwevmyFJGUGp
wnYWmnxKWWev-tDqXErVKLgJ
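
As a sanity check on the initial loss, a model that assigns equal probability to each of the 65 characters would score a cross-entropy of $\ln 65 \approx 4.17$. The sketch below computes that baseline. The untrained model's 4.8786 is higher because its randomly initialized logits are not uniform.

```python
import math

vocab_size = 65  # number of distinct characters found in the dataset above
uniform_loss = -math.log(1.0 / vocab_size)  # cross-entropy of a uniform prediction
print(f"{uniform_loss:.4f}")
```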


In [14]:
# create a PyTorch optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

In [15]:
batch_size = 32
for steps in range(100000):
    # sample a batch of data
    xb, yb = get_batch('train')
    # evaluate loss
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    
print(loss.item())

2.5319576263427734


In [16]:
print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))



Ofows ht IUS:
S:

ING flvenje ssutefr,
M:
War cl igagimous pray whars:
Panalit I It aithit terised 


## The mathematical trick in self-attention

In the self-attention of a decoder, each token should attend only to itself and to previous tokens (not future ones), since the task is to predict the future.

Let's begin with the simplest form of communicating with data from the past: taking the mean of all the previous tokens along with itself. This is a very weak and lossy form of interaction, but we will use it as a starting point. There are multiple ways to implement this:

In [17]:
# toy example:
torch.manual_seed(1337)
B,T,C = 4,8,2 # batch, time, channels
x = torch.randn(B,T,C)
x.shape

torch.Size([4, 8, 2])

In [18]:
# Method 1: for loop
# we want x[b,t] = mean_{i<=t} x[b,i]
xbow = torch.zeros((B,T,C)) # bow stands for "bag of words"
for b in range(B):
    for t in range(T):
        xprev = x[b,:t+1] # (t+1, C)
        xbow[b,t] = torch.mean(xprev, 0)

print(x[0])
print(xbow[0])

tensor([[ 0.1808, -0.0700],
        [-0.3596, -0.9152],
        [ 0.6258,  0.0255],
        [ 0.9545,  0.0643],
        [ 0.3612,  1.1679],
        [-1.3499, -0.5102],
        [ 0.2360, -0.2398],
        [-0.9211,  1.5433]])
tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]])


Row `t` of `xbow[b]` is the average, taken along the time (vertical) axis, of the tokens up to and including position `t`. The for-loop implementation above is inefficient; the same result can be computed with a single matrix multiplication by a lower triangular matrix of 1s whose rows are normalized to sum to 1.

In [19]:
# Method 2: lower triangular matrix of 1's
wei = torch.tril(torch.ones(T,T))
wei /= wei.sum(1, keepdim=True)
wei

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])

In [20]:
xbow2 = wei @ x # (T, T) @ (B, T, C) -> broadcasting makes this (B, T, T) @ (B, T, C) = (B, T, C)

In [21]:
# assert xbows are equal
torch.allclose(xbow, xbow2)

True

The third method (which is the one we will end up using) is similar to the second method. We create a matrix of zeros, set the upper triangular portion to be `-inf` and take softmax to obtain a matrix similar to `tril` above. The benefit of this approach is that it can be extended such that `wei` is data-dependent and represents the affinity between tokens.

In [22]:
# Method 3: softmax on weights to obtain matrix
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=1)
wei

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])
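
As a quick check, the softmax-derived weights can be applied to `x` and compared against the explicit loop from Method 1. The sketch below is self-contained (it recreates the toy tensor), and the name `xbow3` is introduced here purely for illustration.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(1337)
B, T, C = 4, 8, 2
x = torch.randn(B, T, C)

# Method 1: explicit running mean over positions 0..t
xbow = torch.zeros((B, T, C))
for b in range(B):
    for t in range(T):
        xbow[b, t] = x[b, :t+1].mean(0)

# Method 3: mask future positions with -inf, then softmax each row
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T, T)).masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
xbow3 = wei @ x  # (T, T) @ (B, T, C) -> (B, T, C)

# a small absolute tolerance absorbs float32 rounding differences
print(torch.allclose(xbow, xbow3, atol=1e-6))
```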

## Self-attention

Above, when we attended to previous context, we did so by averaging each of the previous tokens uniformly.

However, some tokens may be of more interest than others, so we want to have a data-dependent weight matrix. Self-attention solves this.

In self-attention, each token emits two vectors by forwarding through independent linear layers: a query vector and a key vector. Roughly speaking, a query vector for a token represents "what I am looking for" and a key vector represents "what information I contain." The weight at which a token attends to another token is determined by the dot product of the query vector of the attending token and the key vector of the other token. There is also a value vector which is also obtained by forwarding through another linear layer. The value vector can be interpreted as information that is communicated (whereas the information of the original vector can be thought of as private). 

### Notes about attention:
- What does self-attention mean? Keys, queries, values all come from the same vector `x`, so it attends to itself. If the source of the key and value vectors and the source of the query vector differs, it is known as cross-attention. 
- Attention is a communication mechanism. Think of each token as a node in a directed graph, in which each node holds some information and aggregates information from all nodes that point to it. In our causal setup, each node is pointed to only by itself and by earlier tokens, but attention in general places no constraint on the connectivity of the graph.
- There is no notion of space or ordering in attention by default. To provide positional information, we use positional embeddings (found in `v2.py`).
- Encoder vs Decoder
  - Encoder block: allow all the nodes to talk to each other (e.g. sentiment analysis)
  - Decoder block: only allow node to attend to previous nodes (e.g. next token prediction)
- "Scaled attention": divide `wei` by $\sqrt{d_k}$, where $d_k$ is the head size. If the queries and keys have unit variance, this keeps the variance of `wei` close to 1, so the attention weights stay evenly diffused. `wei` feeds into softmax, and softmax produces peaky outputs (close to one-hot, where one value is large and the rest are near zero) when its inputs are large in magnitude, so we want to keep the variance of `wei` constrained, especially at initialization.
- Self-attention allows tokens within the input to talk to each other. Usually a linear layer is placed immediately after an attention layer to give the tokens time to process the information from attention.
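
The effect of the $\sqrt{d_k}$ scaling can be checked numerically. The sketch below assumes unit-variance random `q` and `k` (stand-ins for the projected queries and keys at initialization) and compares the variance of the raw and scaled dot products. It also shows how multiplying softmax inputs by a larger factor (8 here, chosen arbitrarily) makes the output peaky.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(1337)
B, T, head_size = 32, 8, 16
q = torch.randn(B, T, head_size)  # assumed unit-variance queries
k = torch.randn(B, T, head_size)  # assumed unit-variance keys

wei_raw = q @ k.transpose(-2, -1)       # variance grows with head_size
wei_scaled = wei_raw * head_size**-0.5  # variance brought back to roughly 1

print(wei_raw.var().item(), wei_scaled.var().item())

# softmax sharpens toward one-hot as its inputs grow in magnitude
logits = torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])
diffuse = F.softmax(logits, dim=-1)
peaky = F.softmax(logits * 8, dim=-1)
print(diffuse)
print(peaky)
```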

In [23]:
torch.manual_seed(1337)
B, T, C = 4, 8, 32 # batch, time, channels
x = torch.randn(B,T,C)

# let's see a single Head perform self-attention
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k = key(x) # (B, T, 16)
q = query(x) # (B, T, 16)

wei = q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) -> (B, T, T)

tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float('-inf')) # assign -inf where tril holds 0 so that we do not give any weight to future tokens
wei = F.softmax(wei, dim=-1)

v = value(x)
out = wei @ v

out.shape

torch.Size([4, 8, 16])

In [24]:
wei[0]

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1574, 0.8426, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2088, 0.1646, 0.6266, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5792, 0.1187, 0.1889, 0.1131, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0294, 0.1052, 0.0469, 0.0276, 0.7909, 0.0000, 0.0000, 0.0000],
        [0.0176, 0.2689, 0.0215, 0.0089, 0.6812, 0.0019, 0.0000, 0.0000],
        [0.1691, 0.4066, 0.0438, 0.0416, 0.1048, 0.2012, 0.0329, 0.0000],
        [0.0210, 0.0843, 0.0555, 0.2297, 0.0573, 0.0709, 0.2423, 0.2391]],
       grad_fn=<SelectBackward0>)
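
The masking step is what separates a decoder head from an encoder head. The sketch below wraps the computation above in a hypothetical helper `single_head` with a `causal` flag (both names are assumptions for illustration, and the $\sqrt{d_k}$ scaling from the notes is applied as well). It then checks that changing the last token leaves the earlier outputs untouched only when the mask is on.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(1337)
B, T, C, head_size = 4, 8, 32, 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

def single_head(x, causal=True):
    # hypothetical helper: one scaled attention head over x of shape (B, T, C)
    k, q, v = key(x), query(x), value(x)
    wei = q @ k.transpose(-2, -1) * head_size**-0.5  # (B, T, T)
    if causal:
        seq_len = x.shape[1]
        tril = torch.tril(torch.ones(seq_len, seq_len))
        wei = wei.masked_fill(tril == 0, float('-inf'))  # hide future tokens
    wei = F.softmax(wei, dim=-1)
    return wei @ v  # (B, T, head_size)

x = torch.randn(B, T, C)
x_changed = x.clone()
x_changed[:, -1] = torch.randn(B, C)  # alter only the last token

# decoder-style: earlier positions cannot see the changed last token
dec_same = torch.allclose(single_head(x)[:, :-1], single_head(x_changed)[:, :-1])
# encoder-style: every position sees the changed token
enc_same = torch.allclose(single_head(x, causal=False)[:, :-1],
                          single_head(x_changed, causal=False)[:, :-1])
print(dec_same, enc_same)
```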

## Optimizations for Training Deep Networks

Our final model in `v2.py` is quite large, which requires us to add some optimizations for the network to train properly.

1. Skip / Residual connections
- The input of a block is added to its output
- Acts like a highway for gradients to flow from targets to inputs
- Residual blocks are initialized so that they contribute little at the beginning, but come online as training progresses

2. LayerNorm
- Similar to batchnorm, but normalizes along rows (across the features of a single example) instead of along columns (across the batch)

3. Dropout
- Randomly prevents some nodes from communicating during training to reduce overfitting
- The nodes that are shut off change on every forward pass, so you can think of this as training an ensemble of subnetworks, which are merged together at test time
- Usually placed after linear layers
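
The three ideas can be seen together in one small module. The sketch below is a minimal illustration, not the code in `v2.py`: the class name `ResidualFeedForward`, the pre-norm placement of LayerNorm, the 4x hidden width, and the dropout rate of 0.2 are all assumptions chosen for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(1337)

class ResidualFeedForward(nn.Module):
    """Hypothetical block: x + Dropout(Linear(ReLU(Linear(LayerNorm(x)))))."""
    def __init__(self, n_embd, dropout=0.2):
        super().__init__()
        self.ln = nn.LayerNorm(n_embd)  # normalizes each token's feature vector
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),        # active only in training mode
        )

    def forward(self, x):
        # skip connection: the input is added to the branch output
        return x + self.net(self.ln(x))

B, T, n_embd = 4, 8, 32
x = torch.randn(B, T, n_embd)
block = ResidualFeedForward(n_embd)

# LayerNorm: every (b, t) feature vector ends up with mean ~0 and variance ~1
normed = block.ln(x)
print(normed.mean(-1).abs().max().item(), normed.var(-1, unbiased=False).mean().item())

# Dropout: random in train mode, a no-op in eval mode
block.train()
train_same = torch.allclose(block(x), block(x))  # different units dropped each pass
block.eval()
eval_same = torch.allclose(block(x), block(x))   # deterministic in eval mode
print(train_same, eval_same)
```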
