<a href="https://colab.research.google.com/github/ckkissane/deep_learning_curriculum/blob/scaling-laws/solutions/1_Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Implement a decoder-only transformer language model.

Here are some first-principles questions to answer:

## What is different architecturally from the Transformer, vs a normal RNN, like an LSTM? (Specifically, how are recurrence and time managed?)

Transformer:
* Non-sequential: all positions of a sequence are processed at once using multi-headed attention layers, which allows for parallel computation
* Positional encodings are used so that the transformer can capture sequential information

RNN:
* Sequential processing: sequences are processed one token at a time using recurrent layers, so computation cannot be parallelized across time steps
* No positional encoding: order is implicit in the recurrence, because each hidden state is computed from the previous one. This can cause issues with long sequences, as information from older inputs tends to be lost
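
To make the contrast concrete, here is a toy sketch in plain Python (scalar "hidden states", no PyTorch). The weights `w_h` and `w_x` and both function names are made up for illustration. The recurrent loop has to run step by step, while each attention output depends only on the inputs, so positions can be computed in any order.

```python
import math

def rnn_states(xs, w_h=0.5, w_x=1.0):
    """Toy scalar RNN: h_t = tanh(w_h * h_{t-1} + w_x * x_t)."""
    h, states = 0.0, []
    for x in xs:  # inherently sequential: step t needs h from step t-1
        h = math.tanh(w_h * h + w_x * x)
        states.append(h)
    return states

def attention_output(xs, i):
    """Toy scalar self-attention at position i (query = key = value = x)."""
    scores = [xs[i] * x_j for x_j in xs]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return sum(w / total * x_j for w, x_j in zip(weights, xs))

xs = [0.1, -0.4, 0.7, 0.2]
states = rnn_states(xs)

# No output depends on another output, so the evaluation order is irrelevant.
forward = [attention_output(xs, i) for i in range(len(xs))]
backward = [attention_output(xs, i) for i in reversed(range(len(xs)))][::-1]
print(forward == backward)  # True
```

Note that `attention_output` never uses the position of a token, only its value, which is why the transformer needs positional encodings.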

## Attention is defined as, Attention(Q,K,V) = softmax(QK^T/sqrt(d_k))V. What are the dimensions for Q, K, and V? Why do we use this setup? What other combinations could we do with (Q,K) that also output weights?

The dimensions are:
* Q: (seq_len, d_k)
* K: (seq_len, d_k)
* V: (seq_len, d_v)

1. d_k represents the dimension of the vectors representing the queries / keys. 
2. d_v is the dimension of the vectors representing the values.
3. Since there are query, key, and value vectors for each token in the sequence, it's natural to pack them into matrices for more efficient computation. That's why we have seq_len rows for each matrix. 
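
The shapes can be checked with a small sketch in plain Python using nested lists. The sizes below are arbitrary, and `matmul`, `transpose`, and `shape` are illustrative helpers rather than anything from the notebook.

```python
def matmul(a, b):
    """Multiply an (n, m) matrix by an (m, p) matrix, both as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def shape(a):
    return (len(a), len(a[0]))

seq_len, d_k, d_v = 5, 3, 2  # arbitrary illustrative sizes
Q = [[1.0] * d_k for _ in range(seq_len)]
K = [[1.0] * d_k for _ in range(seq_len)]
V = [[1.0] * d_v for _ in range(seq_len)]

scores = matmul(Q, transpose(K))  # one score per (query, key) pair
out = matmul(scores, V)           # softmax omitted: it does not change shapes

print(shape(scores))  # (5, 5)
print(shape(out))     # (5, 2)
```

Q and K must share d_k so that the dot products are defined, while d_v is free to differ.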


Other combinations we could do with (Q, K) that output weights:
* Additive attention computes the compatibility function using a feed-forward network with a single hidden layer

However, "dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code."
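
As a rough sketch of the two compatibility functions side by side, in plain Python: the weights `W_q`, `W_k`, and `v` below are hypothetical placeholders (in a real model they would be learned), and the additive form assumed here is `v . tanh(W_q q + W_k k)`.

```python
import math

def dot_score(q, k):
    """Scaled dot-product compatibility: q . k / sqrt(d_k)."""
    return sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(len(q))

def additive_score(q, k, W_q, W_k, v):
    """Additive compatibility: v . tanh(W_q q + W_k k), one hidden layer."""
    hidden = [
        math.tanh(sum(w * qi for w, qi in zip(wq_row, q)) +
                  sum(w * ki for w, ki in zip(wk_row, k)))
        for wq_row, wk_row in zip(W_q, W_k)
    ]
    return sum(vi * hi for vi, hi in zip(v, hidden))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
# Hypothetical weights, chosen only so the example runs.
W_q = [[1.0, 0.0], [0.0, 1.0]]
W_k = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, 1.0]

dot_weights = softmax([dot_score(q, k) for k in keys])
add_weights = softmax([additive_score(q, k, W_q, W_k, v) for k in keys])
print(dot_weights)
print(add_weights)
```

Both produce a valid distribution over keys. The dot-product version needs no extra parameters, and all scores for a sequence reduce to a single matrix multiplication.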

## Are the dense layers different at each multi-head attention block? Why or why not?

Yes

Here are some ideas why:
* Intuitively, the point of stacking layers is so that each layer can transform the data independently of each other, resulting in a more expressive model
* The W^Q, W^K, W^V layers learn representations for the query, key, and values. The model will likely benefit from the flexibility of learning different representations for each block
* It's been empirically observed that [more learnable parameters lead to better performance](https://arxiv.org/abs/2001.08361)

## Why do we have so many skip connections, especially connecting the input of an attention function to the output? Intuitively, what if we didn't?

In the [ResNet paper](https://arxiv.org/abs/1512.03385?context=cs), it was observed that some deep neural networks perform worse than their shallow counterparts. Adding skip connections empirically seemed to solve this issue. 
The intuition is that adding skip connections allows layers to learn the identity mapping more easily. 
"To the extreme, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers."

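If we didn't include these skip connections, very deep transformer models would likely be harder to optimize and could perform worse than shallower ones, since gradients would have to pass through every attention and feed-forward sublayer and could vanish or explode along the way.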
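
One way to see this, as a sketch: write a sublayer with a skip connection as $y = x + F(x)$, where $F$ stands for the attention or feed-forward function (the notation is assumed here, not taken from the exercise). Then

$$
y = x + F(x) \quad\Longrightarrow\quad \frac{\partial y}{\partial x} = I + \frac{\partial F}{\partial x}
$$

The identity term gives gradients a direct path around every sublayer, and setting $F(x) = 0$ recovers the identity mapping exactly.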

## Now we'll actually implement the code. Make sure each of these is completely correct - it's very easy to get the small details wrong. Implement the positional embedding function first.

In [None]:
import torch
import torch.nn.functional as F
from torch import optim
from torch import nn
from torch import einsum
from torch.utils.data import Dataset
from torch.utils.data.dataloader import DataLoader
import random
import numpy as np
import math
import copy
from tqdm import tqdm
import re

In [None]:
def set_seed(seed):
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)

set_seed(3407)

In [None]:
# I use learned positional embeddings rather than the fixed sinusoidal encodings from Attention Is All You Need.
# Learned embeddings are common in decoder-only models such as GPT-2,
# and they are simpler to implement in PyTorch.
class PositionalEmbedding(nn.Module):
    def __init__(self, max_position_embeddings, hidden_size):
        super().__init__()
        self.pos_embedding = nn.Embedding(max_position_embeddings, hidden_size)
    
    def forward(self, pos):
        return self.pos_embedding(pos)

# sanity check
config = dict(hidden_size=32, max_position_embeddings=8, seq_len=5)
pos_emb = PositionalEmbedding(config['max_position_embeddings'], config['hidden_size'])

pos = torch.arange(config['seq_len'])
print(f"pos: {pos}")
pos_embeddings = pos_emb(pos)
print(f"pos_embeddings: {pos_embeddings.shape}")

pos: tensor([0, 1, 2, 3, 4])
pos_embeddings: torch.Size([5, 32])
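
For comparison, here is a minimal sketch of the fixed sinusoidal encodings mentioned in the comment above, in plain Python. The function name `sinusoidal_encoding` is illustrative and not used elsewhere in this notebook.

```python
import math

def sinusoidal_encoding(seq_len, d_model):
    """Fixed encodings: PE[pos][2i]   = sin(pos / 10000^(2i / d_model)),
                        PE[pos][2i+1] = cos(pos / 10000^(2i / d_model))."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):  # i runs over the even indices 2i
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = sinusoidal_encoding(seq_len=5, d_model=32)
print(len(pe), len(pe[0]))  # 5 32
print(pe[0][:4])            # [0.0, 1.0, 0.0, 1.0]
```

These have the same shape as the learned embeddings above but no trainable parameters.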


## Then implement the function which calculates attention, given (Q,K,V) as arguments.

In [None]:
def attention(query, key, value):
    "Compute 'Scaled Dot Product Attention'"
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    p_attn = scores.softmax(dim=-1)
    return torch.matmul(p_attn, value), p_attn

# sanity check
test_q = torch.randn(1, 5, 64) # (batch_size, seq_len, d_k)
test_k = torch.randn(1, 5, 64) # (batch_size, seq_len, d_k)
test_v = torch.randn(1, 5, 64) # (batch_size, seq_len, d_v)

out = attention(test_q, test_k, test_v)
print("out shape:", out[0].shape)
print(f"p_attn: {out[1]}")

out shape: torch.Size([1, 5, 64])
p_attn: tensor([[[0.2474, 0.0444, 0.3023, 0.2104, 0.1955],
         [0.3195, 0.4795, 0.0223, 0.0933, 0.0853],
         [0.2972, 0.1110, 0.1023, 0.3373, 0.1522],
         [0.6928, 0.1399, 0.0174, 0.1139, 0.0360],
         [0.4379, 0.0779, 0.1950, 0.1982, 0.0909]]])
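
The division by sqrt(d_k) is there because dot products grow in magnitude with d_k, which pushes the softmax into a region where it is nearly one-hot and its gradients are very small. A small sketch in plain Python, with made-up scores:

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64
raw_scores = [8.0, 4.0, 0.0]  # illustrative unscaled q . k values
scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]  # [1.0, 0.5, 0.0]

raw_probs = softmax(raw_scores)
scaled_probs = softmax(scaled_scores)
print([round(p, 3) for p in raw_probs])
print([round(p, 3) for p in scaled_probs])
```

Without scaling, almost all of the weight lands on the first key. With scaling, the distribution stays much softer.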


## Now implement the masking function.

In [None]:
def mask_scores(attn_scores):
    # Causal mask: query position q must not attend to key positions k > q.
    seq_len = attn_scores.shape[-2]
    device = attn_scores.device
    q_ind = torch.arange(seq_len, device=device).unsqueeze(1)
    k_ind = torch.arange(seq_len, device=device).unsqueeze(0)
    mask = q_ind < k_ind
    # A large negative score becomes (numerically) zero probability after softmax.
    return attn_scores.masked_fill(mask, -1e9)

#sanity check 
test_scores = torch.randn(1, 4, 4) # (batch_size, seq_len, seq_len)
mask_scores(test_scores)

tensor([[[ 1.2899e-01, -1.0000e+09, -1.0000e+09, -1.0000e+09],
         [-1.0423e+00, -1.8570e-01, -1.0000e+09, -1.0000e+09],
         [ 1.1555e+00,  5.7537e-01, -3.6150e-01, -1.0000e+09],
         [-1.4767e+00,  2.6147e-01,  1.4466e+00,  2.0954e+00]]])
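
The same masking logic can be sketched in plain Python to see which entries are hidden and what happens to them after the softmax. `causal_mask` and `masked_softmax_row` are illustrative names, and the scores are made up.

```python
import math

NEG_INF = -1e9  # same large negative constant that mask_scores uses

def causal_mask(seq_len):
    """mask[q][k] is True where key position k lies in the future of query q."""
    return [[q < k for k in range(seq_len)] for q in range(seq_len)]

def masked_softmax_row(scores, mask_row):
    filled = [NEG_INF if masked else s for s, masked in zip(scores, mask_row)]
    m = max(filled)
    exps = [math.exp(s - m) for s in filled]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(4)
for row in mask:
    print("".join("x" if masked else "." for masked in row))

# Query 1 may attend to keys 0 and 1 only.
weights = masked_softmax_row([0.3, -0.2, 1.5, 0.7], mask[1])
print(weights)
```

The masked positions receive exactly zero weight, because `exp` of such a large negative number underflows to 0.0.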

In [None]:
# rewrite attn to always use mask for decoder-only model
def masked_attention(query, key, value):
    "Compute 'Scaled Dot Product Attention'"
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    scores = mask_scores(scores)
    p_attn = scores.softmax(dim=-1)
    return torch.matmul(p_attn, value), p_attn

# sanity check
test_q = torch.randn(1, 5, 64) # (batch_size, seq_len, d_k)
test_k = torch.randn(1, 5, 64) # (batch_size, seq_len, d_k)
test_v = torch.randn(1, 5, 64) # (batch_size, seq_len, d_v)

out = masked_attention(test_q, test_k, test_v)
print("out shape:", out[0].shape)
print(f"p_attn: {out[1]}")

out shape: torch.Size([1, 5, 64])
p_attn: tensor([[[1.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.4210, 0.5790, 0.0000, 0.0000, 0.0000],
         [0.1711, 0.3003, 0.5287, 0.0000, 0.0000],
         [0.5162, 0.1221, 0.1815, 0.1802, 0.0000],
         [0.2538, 0.0574, 0.1897, 0.3095, 0.1897]]])
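
A direct way to test causality is to change a later position and confirm that earlier outputs do not move. The sketch below does this in plain Python on nested lists (single head, no batch dimension). It masks by only iterating over visible keys, which is equivalent to filling future scores with a large negative number. The function name and the inputs are made up for illustration.

```python
import math

def masked_attention_lists(Q, K, V):
    """Causal scaled dot-product attention on nested lists."""
    d_k = len(Q[0])
    out = []
    for q_idx, q in enumerate(Q):
        # Only keys at positions <= q_idx are visible to this query.
        scores = [sum(a * b for a, b in zip(q, K[k_idx])) / math.sqrt(d_k)
                  for k_idx in range(q_idx + 1)]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * V[k_idx][j] for k_idx, w in enumerate(weights))
                    for j in range(len(V[0]))])
    return out

Q = [[0.1, 0.2], [0.3, -0.1], [0.5, 0.4]]
K = [[0.2, 0.1], [-0.3, 0.6], [0.7, 0.0]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
before = masked_attention_lists(Q, K, V)

# Change only the last position's key and value.
K2 = K[:-1] + [[9.0, 9.0]]
V2 = V[:-1] + [[-5.0, 5.0]]
after = masked_attention_lists(Q, K2, V2)

print(before[:2] == after[:2])  # True: earlier positions cannot see the change
print(before[0] == V[0])        # True: position 0 attends only to itself
```

The second check mirrors the first row of `p_attn` above, where all of the weight sits on position 0.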


## Put it all together to form an entire attention block.

In [None]:
class MaskedMultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model):
        super().__init__()
        assert d_model % h == 0
        # We assume d_v always equals d_k
        self.d_k = d_model // h
        self.h = h
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.output_proj = nn.Linear(d_model, d_model)
        self.attn = None

    def forward(self, query, key, value):
        batch_size = query.size(0)

        # 1) Do all the linear projections in batch from d_model => h x d_k
        query = self.q_proj(query).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)
        key = self.k_proj(key).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)
        value = self.v_proj(value).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)

        # 2) Apply attention on all the projected vectors in batch.
        x, self.attn = masked_attention(query, key, value)

        # 3) "Concat" using a view and apply a final linear.
        x = (
            x.transpose(1, 2)
            .contiguous()
            .view(batch_size, -1, self.h * self.d_k)
        )
        out = self.output_proj(x)
        return out

#sanity check
test_query = torch.randn(1, 5, 24) # (batch_size, seq_len, d_model)
test_key = torch.randn(1, 5, 24) # (batch_size, seq_len, d_model)
test_value = torch.randn(1, 5, 24) # (batch_size, seq_len, d_model)

multi_head_attn = MaskedMultiHeadedAttention(h=4, d_model=24)
out = multi_head_attn(test_query, test_key, test_value)
print(f"out: {out.shape}")

out: torch.Size([1, 5, 24])
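
The `view(batch_size, -1, self.h, self.d_k)` call cuts each d_model-sized vector into h contiguous chunks of size d_k, and the final `view` concatenates them back. A plain-Python sketch of that bookkeeping for a single vector, with illustrative helper names:

```python
def split_heads(vec, h):
    """Split one d_model-sized vector into h contiguous chunks of size d_k."""
    assert len(vec) % h == 0
    d_k = len(vec) // h
    return [vec[i * d_k:(i + 1) * d_k] for i in range(h)]

def merge_heads(heads):
    """Concatenate the per-head chunks back into one d_model-sized vector."""
    return [x for head in heads for x in head]

vec = list(range(24))          # d_model = 24, as in the sanity check above
heads = split_heads(vec, h=4)  # 4 heads, d_k = 6
print([len(head) for head in heads])  # [6, 6, 6, 6]
print(merge_heads(heads) == vec)      # True
```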


## Finish the whole architecture.

In [None]:
class DecoderBlock(nn.Module):
    def __init__(
        self,
        hidden_size: int,
        layer_norm_epsilon: float,
        dropout: float,
        num_heads: int
    ):
        super().__init__()
        self.ln1 = nn.LayerNorm(hidden_size, eps=layer_norm_epsilon)
        self.attn = MaskedMultiHeadedAttention(num_heads, hidden_size)
        self.ln2 = nn.LayerNorm(hidden_size, eps=layer_norm_epsilon)
        self.linear1 = nn.Linear(hidden_size, hidden_size * 4)
        self.linear2 = nn.Linear(hidden_size * 4, hidden_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        h = self.ln1(x)  # pre-LN: normalize once and reuse for query, key, and value
        x = x + self.attn(h, h, h)
        x = x + self.dropout(self.linear2(F.gelu(self.linear1(self.ln2(x)))))
        return x

#sanity check
test_input = torch.randn(1, 5, 24) # (batch_size, seq_len, d_model)
dec_block = DecoderBlock(hidden_size=24, layer_norm_epsilon=1e-4, dropout=0.1, num_heads=4)
out = dec_block(test_input)
print(f"out: {out.shape}")

out: torch.Size([1, 5, 24])


In [None]:
class DecoderOnlyTransformer(nn.Module):
    def __init__(
        self,
        num_layers,
        num_heads,
        vocab_size,
        hidden_size,
        max_position_embeddings,
        dropout,
        layer_norm_epsilon
    ):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, hidden_size)
        self.pos_embedding = nn.Embedding(max_position_embeddings, hidden_size)
        self.dropout = nn.Dropout(dropout)
        self.blocks = nn.Sequential(
            *[
                DecoderBlock(hidden_size, layer_norm_epsilon, dropout, num_heads)
                for _ in range(num_layers)
            ]
        )
        self.ln = nn.LayerNorm(hidden_size, eps=layer_norm_epsilon)

        print(f"number of parameters: {sum(p.numel() for p in self.parameters())}")

    def forward(self, input_ids):
        batch_size, seq_len = input_ids.shape
        # position ids 0..seq_len-1, created directly on the input's device
        pos = torch.arange(seq_len, device=input_ids.device)
        enc = self.dropout(self.token_embedding(input_ids) + self.pos_embedding(pos))
        enc = self.blocks(enc)
        enc = self.ln(enc)
        # weight tying: reuse the token embedding matrix as the output projection
        logits = torch.einsum("bnl, vl -> bnv", enc, self.token_embedding.weight)
        return logits

# sanity check
config = dict(num_layers=2, num_heads=4, vocab_size=100, hidden_size=64,
                max_position_embeddings=32, dropout=0.0, layer_norm_epsilon=1e-4)
x = torch.randint(0, config['vocab_size'], (1, 5))

dec_only_transformer = DecoderOnlyTransformer(**config)
output = dec_only_transformer(x)
print(f"output: {output.shape}")

number of parameters: 108544
output: torch.Size([1, 5, 100])
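
As an extra check on the architecture, the printed parameter count can be reproduced by hand. The helper below is a sketch written for this notebook's `DecoderOnlyTransformer` only. It assumes every `nn.Linear` has a bias, every `nn.LayerNorm` has a scale and a shift, and the output logits reuse the token embedding matrix.

```python
def count_parameters(num_layers, vocab_size, hidden_size, max_position_embeddings):
    """Closed-form parameter count for the DecoderOnlyTransformer defined above."""
    def linear(n_in, n_out):
        return n_in * n_out + n_out  # weight + bias

    layer_norm = 2 * hidden_size                      # scale + shift
    attention = 4 * linear(hidden_size, hidden_size)  # q, k, v, output projections
    mlp = linear(hidden_size, 4 * hidden_size) + linear(4 * hidden_size, hidden_size)
    block = 2 * layer_norm + attention + mlp

    embeddings = vocab_size * hidden_size + max_position_embeddings * hidden_size
    # No separate output layer: logits reuse token_embedding.weight.
    return embeddings + num_layers * block + layer_norm

print(count_parameters(num_layers=2, vocab_size=100, hidden_size=64,
                       max_position_embeddings=32))  # 108544
```

The count does not depend on `num_heads`, since the heads only partition the same projection matrices.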


## To check you have the attention mask set up correctly, train your model on a toy task, such as reversing a random sequence of tokens. The model should be able to predict the second sequence, but not the first.

In [None]:
class ReverseDataset(Dataset):
    """
    Reverses sequences up to some number of digits in the inputs. Recall
    that all GPT cares about are sequences of integers, and completing them according to
    patterns in the data. Therefore, we have to somehow encode reversals
    as a sequence of integers.
    
    As a few examples, for a 3 digit sequence:
    - reverse([1,2,3]) = [3, 2, 1] becomes the sequence [1, 2, 3, 3, 2, 1]
    - reverse([6, 8, 8]) = [8, 8, 6] becomes the sequence [6, 8, 8, 8, 8, 6]
    etc.
    
    We only train GPT on the final n digits, because the first n-1 targets
    are just input digits, which are always given. So when we give GPT an exam later,
    we will e.g. feed it the sequence [0, 6, 3, 9], which encodes that we'd like
    to reverse [0, 6, 3, 9], and hope that the model completes the integer sequence with [9, 3, 6, 0]
    in 4 sequential steps.
    """
    def __init__(self, ndigit):
        self.ndigit = ndigit
        self.vocab_size = 10 # 10 possible digits 0..9
        self.block_size = 2 * ndigit - 1
        
        self.size = 10**self.ndigit # total number of possible combinations

    def __len__(self):
        return self.size

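    def __getitem__(self, idx):
        # idx is ignored: every call draws a fresh random sequence
        inp = torch.randint(self.vocab_size, size=(self.ndigit,), dtype=torch.long)
        sol = torch.flip(inp,(-1,))
        cat = torch.cat((inp, sol), dim=0)
        x = cat[:-1].clone()
        y = cat[1:].clone()
        # mask targets that are input digits; -100 is CrossEntropyLoss's default ignore_index
        y[: self.ndigit - 1] = -100
        return x, y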

In [None]:
# create a dataset for e.g. 4-digit sequence reversals
ndigit = 4
train_dataset = ReverseDataset(ndigit=ndigit)

In [None]:
train_dataset[0]

(tensor([7, 5, 4, 6, 6, 4, 5]),
 tensor([-100, -100, -100,    6,    4,    5,    7]))
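
The construction of `x` and `y` can be traced by hand with plain lists. `make_example` is an illustrative stand-in for `__getitem__` that takes the input digits as an argument instead of sampling them.

```python
IGNORE_INDEX = -100  # default ignore_index of nn.CrossEntropyLoss

def make_example(digits):
    """Build (x, y) for the reversal task from a list of input digits."""
    n = len(digits)
    cat = digits + digits[::-1]
    x = cat[:-1]
    y = cat[1:]
    # The first n - 1 targets are just the next input digits, so they are masked.
    y = [IGNORE_INDEX] * (n - 1) + y[n - 1:]
    return x, y

x, y = make_example([7, 5, 4, 6])
print(x)  # [7, 5, 4, 6, 6, 4, 5]
print(y)  # [-100, -100, -100, 6, 4, 5, 7]
```

This reproduces the sample shown above: the loss is computed only at the four positions where the model has to emit the reversed digits.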

In [None]:
batch_size=512//2
train_loader = DataLoader(
    train_dataset, shuffle=True, pin_memory=True, batch_size=batch_size
)

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("device:", device)

device: cuda


In [None]:
model = DecoderOnlyTransformer(
    num_layers=2,
    num_heads=4,
    vocab_size=train_dataset.vocab_size,
    hidden_size=128,
    max_position_embeddings=train_dataset.block_size,
    dropout=0.1,
    layer_norm_epsilon=1e-5,
).to(device).train()

number of parameters: 398976


In [None]:
loss_fn = nn.CrossEntropyLoss()

optimizer = optim.Adam(model.parameters(), lr=6e-4)
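
`nn.CrossEntropyLoss` skips targets equal to its default `ignore_index` of -100 and averages over the remaining positions, which is why the masked targets do not contribute to training. A plain-Python sketch of that behaviour, with a hypothetical `cross_entropy` helper and made-up logits:

```python
import math

def cross_entropy(logits, targets, ignore_index=-100):
    """Mean negative log-likelihood over positions whose target is not ignore_index."""
    losses = []
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # masked position: contributes nothing to the loss
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        losses.append(log_sum_exp - row[target])
    return sum(losses) / len(losses)

# Three positions over a 4-token vocabulary; the first target is masked.
logits = [[5.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
targets = [-100, 2, 1]
loss = cross_entropy(logits, targets)
print(round(loss, 4))  # 1.3863, i.e. ln(4) for uniform predictions
```

The confident logits at the masked first position have no effect on the result.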

In [None]:
max_epochs = 10

for epoch in range(max_epochs):
    pbar = tqdm(enumerate(train_loader), total=len(train_loader))
    for it, (x, y) in pbar:
        x = x.to(device)
        y = y.to(device)

        optimizer.zero_grad()
        
        logits = model(x)
        loss = loss_fn(logits.view(-1, logits.size(-1)), y.view(-1))
        loss.backward()

        optimizer.step()

        pbar.set_description(f"epoch {epoch} iter {it}: train loss {loss.item():.5f}")

epoch 0 iter 39: train loss 3.07595: 100%|██████████| 40/40 [00:04<00:00,  8.18it/s]
epoch 1 iter 39: train loss 2.44274: 100%|██████████| 40/40 [00:01<00:00, 35.21it/s]
epoch 2 iter 39: train loss 1.97242: 100%|██████████| 40/40 [00:01<00:00, 33.08it/s]
epoch 3 iter 39: train loss 1.38969: 100%|██████████| 40/40 [00:01<00:00, 32.54it/s]
epoch 4 iter 39: train loss 0.01131: 100%|██████████| 40/40 [00:01<00:00, 33.51it/s]
epoch 5 iter 39: train loss 0.00005: 100%|██████████| 40/40 [00:01<00:00, 33.42it/s]
epoch 6 iter 39: train loss 0.00082: 100%|██████████| 40/40 [00:01<00:00, 33.53it/s]
epoch 7 iter 39: train loss 0.00013: 100%|██████████| 40/40 [00:01<00:00, 33.09it/s]
epoch 8 iter 39: train loss 0.00311: 100%|██████████| 40/40 [00:01<00:00, 33.31it/s]
epoch 9 iter 39: train loss 0.00018: 100%|██████████| 40/40 [00:01<00:00, 34.34it/s]


In [None]:
@torch.no_grad()
def sample(model, x, steps, block_size):
    """
    take a conditioning sequence of indices in x (of shape (b,t)) and predict the next token in
    the sequence, feeding the predictions back into the model each time.
    """
    model.eval()
    for k in range(steps):
        x_cond = x if x.size(1) <= block_size else x[:, -block_size:] # crop context if needed
        logits = model(x_cond)
        # pluck the logits at the final step
        logits = logits[:, -1, :]
        # apply softmax to convert to probabilities
        probs = F.softmax(logits, dim=-1)
        # take the most likely
        _, ix = torch.topk(probs, k=1, dim=-1)
        # append to the sequence and continue
        x = torch.cat((x, ix), dim=1)

    return x

In [None]:
# test: reverse [1, 2, 3, 4] -> [4, 3, 2, 1]
inp = torch.tensor([[1, 2, 3, 4]]).to(device)
y = sample(model, inp, 4, train_dataset.block_size)[0]
y

tensor([1, 2, 3, 4, 4, 3, 2, 1], device='cuda:0')

In [None]:
# it shouldn't be able to predict the first sequence
inp = torch.tensor([[1]]).to(device)
y = sample(model, inp, 7, train_dataset.block_size)[0]
y

tensor([1, 1, 1, 1, 1, 1, 1, 1], device='cuda:0')

## Finally, train your model on the [complete works of William Shakespeare](https://www.gutenberg.org/files/100/100-0.txt). Tokenize the corpus by splitting at word boundaries (re.split(r"\b", ...)).

In [None]:
# you'll need to upload this file to your colab session
with open('100-0.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [None]:
class WordDataset(Dataset):
    """
    arrange data and targets so that the first i elements of x
    will be asked to predict the i-th element of y. Notice that
    the eventual language model will actually make block_size
    individual predictions at the same time based on this data,
    so we are being clever and amortizing the cost of the forward
    pass of the network. So for example if block_size is 4, then
    we could e.g. sample a chunk of text "w1 w2 w3 w4 w5", the integers in
    x will correspond to "w1 w2 w3 w4" and in y will be "w2 w3 w4 w5". This will
    then actually "multitask" 4 separate examples at the same time
    in the language model:
    - given just "w1", please predict "w2" as next
    - given "w1 w2" please predict "w3" next
    - given "w1 w2 w3" predict "w4" next
    - given "w1 w2 w3 w4" predict "w5" next
    """
    def __init__(self, data, block_size):
        words = re.split(r"\b", data)
        vocab = sorted(list(set(words)))
        data_size, vocab_size = len(words), len(vocab)
        print('data has %d words, %d unique.' % (data_size, vocab_size))
        
        self.stoi = {word: i for i, word in enumerate(vocab)}
        self.itos = {i: word for i, word in enumerate(vocab)}
        self.block_size = block_size
        self.vocab_size = vocab_size
        self.data = words
    
    def __len__(self):
        return len(self.data) - self.block_size

    def __getitem__(self, idx):
        # grab a chunk of (block_size + 1) words from the data
        chunk = self.data[idx:idx + self.block_size + 1]
        # encode every word to an integer
        dix = [self.stoi[s] for s in chunk]
        x = torch.tensor(dix[:-1], dtype=torch.long)
        y = torch.tensor(dix[1:], dtype=torch.long)
        return x, y

In [None]:
block_size = 128
train_dataset = WordDataset(text, block_size) 

data has 1987763 words, 34541 unique.
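
To see what this tokenization produces, here is a small standalone sketch on a short string (Python 3.7 or later, where `re.split` splits on empty matches). The variable names mirror `WordDataset`, but the snippet is independent of it.

```python
import re

text = "Words, words, words."
tokens = re.split(r"\b", text)
print(tokens)  # ['', 'Words', ', ', 'words', ', ', 'words', '.']

# Same vocabulary construction as WordDataset, on the toy text.
vocab = sorted(set(tokens))
stoi = {tok: i for i, tok in enumerate(vocab)}
ids = [stoi[tok] for tok in tokens]

block_size = 4
x, y = ids[:block_size], ids[1:block_size + 1]  # y is x shifted by one token
print(x, y)
```

Whitespace and punctuation between words become tokens of their own, and the split can produce an empty-string token at the start. Joining the tokens with `''.join` restores the original text, which is what the generation cell relies on.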


In [None]:
batch_size = 64
train_loader = DataLoader(
    train_dataset, shuffle=True, pin_memory=True, batch_size=batch_size
)

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("device:", device)

device: cuda


In [None]:
model = DecoderOnlyTransformer(
    num_layers=8,
    num_heads=8,
    vocab_size=train_dataset.vocab_size,
    hidden_size=512,
    max_position_embeddings=train_dataset.block_size,
    dropout=0.1,
    layer_norm_epsilon=1e-5
).to(device).train()

number of parameters: 42970624


In [None]:
loss_fn = nn.CrossEntropyLoss()

optimizer = optim.Adam(model.parameters(), lr=6e-4)

In [None]:
max_epochs = 1

for epoch in range(max_epochs):
    pbar = tqdm(enumerate(train_loader), total=len(train_loader))
    for it, (x, y) in pbar:
        x = x.to(device)
        y = y.to(device)

        optimizer.zero_grad()
        
        logits = model(x)
        loss = loss_fn(logits.view(-1, logits.size(-1)), y.view(-1))
        loss.backward()

        optimizer.step()

        pbar.set_description(f"epoch {epoch} iter {it}: train loss {loss.item():.5f}")

epoch 0 iter 31056: train loss 0.81490: 100%|██████████| 31057/31057 [3:23:02<00:00,  2.55it/s]


In [None]:
def top_k_logits(logits, k):
    v, ix = torch.topk(logits, k)
    out = logits.clone()
    out[out < v[:, [-1]]] = -float('Inf')
    return out
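
The effect of top-k filtering is easiest to see on a tiny example. Below is a plain-Python sketch of the same idea for a single list of logits. `top_k_filter` is an illustrative name, and like `top_k_logits` it keeps any logit that ties with the k-th largest.

```python
import math

def top_k_filter(logits, k):
    """Keep the k largest logits; set the rest to -inf so softmax gives them 0."""
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else -math.inf for x in logits]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 0.5, 1.0, -1.0]
filtered = top_k_filter(logits, k=2)
probs = softmax(filtered)
print(filtered)  # [2.0, -inf, 1.0, -inf]
print(probs)
```

Only the two largest logits keep non-zero probability, so sampling can never pick the filtered tokens.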

In [None]:
@torch.no_grad()
def sample(model, x, steps, temperature=1.0, sample=False, top_k=None):
    """
    take a conditioning sequence of indices in x (of shape (b,t)) and predict the next token in
    the sequence, feeding the predictions back into the model each time
    """
    model.eval()
    for k in range(steps):
        x_cond = x if x.size(1) <= block_size else x[:, -block_size:] # crop context if needed
        logits = model(x_cond)
        # pluck the logits at the final step and scale by temperature
        logits = logits[:, -1, :] / temperature
        # optionally crop the logits to only the top k options
        if top_k is not None:
            logits = top_k_logits(logits, top_k)
        # apply softmax to convert to probabilities
        probs = F.softmax(logits, dim=-1)
        # sample from the distribution or take the most likely
        if sample:
            ix = torch.multinomial(probs, num_samples=1)
        else:
            _, ix = torch.topk(probs, k=1, dim=-1)
        # append to the sequence and continue
        x = torch.cat((x, ix), dim=1)

    return x
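
Dividing the logits by `temperature` before the softmax controls how peaked the sampling distribution is. A minimal plain-Python sketch with made-up logits (`softmax_with_temperature` is an illustrative helper):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 0.5)    # below 1: more peaked
neutral = softmax_with_temperature(logits, 1.0)  # plain softmax
flat = softmax_with_temperature(logits, 2.0)     # above 1: closer to uniform
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in neutral])
print([round(p, 3) for p in flat])
```

Lower temperatures concentrate probability on the most likely token, and higher temperatures spread it out.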

In [None]:
context = " O God, O God! "
x = torch.tensor([train_dataset.stoi[s] for s in re.split(r"\b", context)], dtype=torch.long)[None,...].to(device)
y = sample(model, x, 2000, temperature=1.0, sample=True, top_k=10)[0]
completion = ''.join([train_dataset.itos[int(i)] for i in y])
print(completion)

 O God, O God! O this, O villain!
O momentary churl!—duck, what bones are!
Within my brows! How! ’Tis good she.
How dost thou mean a woman’s negligence,
To take her life? She’s dead again;
Who, witty, so much, so much to do her good,
Lies senseless offences death. Hie her therefore!
Who is’t I say that she did say I?

SECOND GENTLEMAN.
I was sorry “Thomas heir,
Will hold away her, let them be patient.
The subject is already, not with a general touch;
But the worst is free that did the feast.

FIRST MURDERER.
Ay, and you shall put upon the marriage-bed.

HAMLET.
Take my part, or I mean to think where it is.
Imagine it so,—and so farewell yours,—
For she’s a false thing, the academes of cream,
That ’scuse on curtsies, guarded; but as a dove.
I wonder in every cabin I shall not
Outgo it so much conceit. I tell him plainly,
For that’s true enough to disgrace, I will ease
With full three thousand crowns; yet I hold one too,
And yet by degrees, methinks I love mine honour,
A father’s love, I