# Introduction
Here we build a GPT-like model from scratch, based on two papers: [Attention is All You Need](https://arxiv.org/abs/1706.03762), which proposed the **transformer** architecture, and [GPT-3](https://arxiv.org/abs/2005.14165). GPT is a language model that, given an input sequence, simply predicts the next token.

# Libraries

In [1]:
%matplotlib inline
%config IPCompleter.use_jedi=False

In [2]:
import math
import random
import matplotlib.pyplot as plt

In [3]:
###BIGRAM###
#################
### Libraries ###
#################

In [4]:
###BIGRAM###
import torch
import torch.nn as nn
import torch.nn.functional as F
device = 'cuda' if torch.cuda.is_available() else 'cpu'  # is_available must be called; the bare function is always truthy
device = 'cpu'  # force CPU for this notebook

# Data
We load the tiny-Shakespeare dataset and process it so that it can be used to train a model that generates Shakespeare-like text.

### Reading and Inspecting

In [5]:
###BIGRAM###
############
### Data ###
############

In [6]:
###BIGRAM###
# Reading Data
with open("../../data/tinyshakespeare.txt", "r", encoding="utf-8") as f:
    text = f.read()

In [7]:
print("Number of Characters: ", len(text))

Number of Characters:  1115394


In [8]:
# First 300 characters
print(text[:300])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us


In [9]:
###BIGRAM###
# All unique characters
chars = sorted(list(set(text)))
vocab_size = len(chars)

In [10]:
print(f"Vocab Size: {vocab_size}")
print(f"Vocab Chars: {''.join(chars)}")

Vocab Size: 65
Vocab Chars: 
 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz


### Building Vocabulary and Encoder/Decoder
We use a simple character-level tokenizer here. In practice, most language models use a subword tokenizer instead; OpenAI, for example, uses a **BPE** (byte-pair encoding) tokenizer:

In [11]:
# OpenAI encoder/decoder
import tiktoken
enc = tiktoken.get_encoding("gpt2")
enc.decode(enc.encode("hello world")) == "hello world"

True

We will be using a simple tokenizer that uses characters rather than word-chunks to make things easier to understand.

In [12]:
###BIGRAM###
# Building the vocabulary
ctoi = {s:i for i,s in enumerate(chars)}
itoc = {i:s for s,i in ctoi.items()}

In [13]:
# Printing vocabulary
print(ctoi)
print(itoc)

{'\n': 0, ' ': 1, '!': 2, '$': 3, '&': 4, "'": 5, ',': 6, '-': 7, '.': 8, '3': 9, ':': 10, ';': 11, '?': 12, 'A': 13, 'B': 14, 'C': 15, 'D': 16, 'E': 17, 'F': 18, 'G': 19, 'H': 20, 'I': 21, 'J': 22, 'K': 23, 'L': 24, 'M': 25, 'N': 26, 'O': 27, 'P': 28, 'Q': 29, 'R': 30, 'S': 31, 'T': 32, 'U': 33, 'V': 34, 'W': 35, 'X': 36, 'Y': 37, 'Z': 38, 'a': 39, 'b': 40, 'c': 41, 'd': 42, 'e': 43, 'f': 44, 'g': 45, 'h': 46, 'i': 47, 'j': 48, 'k': 49, 'l': 50, 'm': 51, 'n': 52, 'o': 53, 'p': 54, 'q': 55, 'r': 56, 's': 57, 't': 58, 'u': 59, 'v': 60, 'w': 61, 'x': 62, 'y': 63, 'z': 64}
{0: '\n', 1: ' ', 2: '!', 3: '$', 4: '&', 5: "'", 6: ',', 7: '-', 8: '.', 9: '3', 10: ':', 11: ';', 12: '?', 13: 'A', 14: 'B', 15: 'C', 16: 'D', 17: 'E', 18: 'F', 19: 'G', 20: 'H', 21: 'I', 22: 'J', 23: 'K', 24: 'L', 25: 'M', 26: 'N', 27: 'O', 28: 'P', 29: 'Q', 30: 'R', 31: 'S', 32: 'T', 33: 'U', 34: 'V', 35: 'W', 36: 'X', 37: 'Y', 38: 'Z', 39: 'a', 40: 'b', 41: 'c', 42: 'd', 43: 'e', 44: 'f', 45: 'g', 46: 'h', 47: 'i',

In [14]:
###BIGRAM###
# Building encoder/decoder
encode = lambda s: [ctoi[c] for c in s]
decode = lambda l: ''.join([itoc[i] for i in l])

In [15]:
# Testing encoder/decoder
print(encode("hii there"))
print(decode(encode("hii there")))

[46, 47, 47, 1, 58, 46, 43, 56, 43]
hii there


### Tokenizing
Using our simple tokenizer we tokenize the entire dataset.

In [16]:
###BIGRAM###
# Tokenizing dataset
data = torch.tensor(encode(text), dtype=torch.long)

In [17]:
# Printing example
print(data.shape, data.dtype)
print(data[:50])

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56])


### Train/Valid Split

In [18]:
###BIGRAM###
# Train/Valid Split
n = int(0.9*len(data))
data_train = data[:n]
data_valid = data[n:]

### Creating dataset
When building our model, we would like it to be able to generate text from a context as short as one character and as long as **block_size** characters.

In [19]:
# Example of a sample
block_size = 8
x = data_train[:block_size]
y = data_train[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} target is: {target}")

when input is tensor([18]) target is: 47
when input is tensor([18, 47]) target is: 56
when input is tensor([18, 47, 56]) target is: 57
when input is tensor([18, 47, 56, 57]) target is: 58
when input is tensor([18, 47, 56, 57, 58]) target is: 1
when input is tensor([18, 47, 56, 57, 58,  1]) target is: 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]) target is: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) target is: 58


We create a function for getting random batches from the data.

In [20]:
###BIGRAM###
# Gets Random Batches
def get_batch(split: str, batch_size: int, block_size: int) -> tuple:
    """
    Description:
        Generates a batch of data of inputs x and targets y.
    Inputs:
        split: train or valid split
        batch_size: How many independent sequences will be processed in parallel
        block_size: Maximum context length
    Outputs:
        x, y: a tuple with xs and ys
    """
    data = data_train if split == 'train' else data_valid
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x,y = x.to(device), y.to(device)
    return x, y

In [21]:
# Testing function
batch_size = 4 
block_size = 8 

xb, yb = get_batch('train', batch_size, block_size)
print('inputs:')
print(xb.shape)
print(xb)
print('outputs:')
print(yb.shape)
print(yb)

for b in range(batch_size):
    for t in range(block_size):
        context = xb[b, :t+1]
        target = yb[b, t]
        print(f"when input is {context.tolist()} target is: {target}")

inputs:
torch.Size([4, 8])
tensor([[53, 56, 51, 57,  1, 58, 46, 39],
        [ 1, 50, 43, 58,  1, 51, 63,  1],
        [58, 46,  2,  0,  0, 16, 33, 23],
        [40, 53, 58, 46,  1, 53, 44,  1]])
outputs:
torch.Size([4, 8])
tensor([[56, 51, 57,  1, 58, 46, 39, 58],
        [50, 43, 58,  1, 51, 63,  1, 53],
        [46,  2,  0,  0, 16, 33, 23, 17],
        [53, 58, 46,  1, 53, 44,  1, 63]])
when input is [53] target is: 56
when input is [53, 56] target is: 51
when input is [53, 56, 51] target is: 57
when input is [53, 56, 51, 57] target is: 1
when input is [53, 56, 51, 57, 1] target is: 58
when input is [53, 56, 51, 57, 1, 58] target is: 46
when input is [53, 56, 51, 57, 1, 58, 46] target is: 39
when input is [53, 56, 51, 57, 1, 58, 46, 39] target is: 58
when input is [1] target is: 50
when input is [1, 50] target is: 43
when input is [1, 50, 43] target is: 58
when input is [1, 50, 43, 58] target is: 1
when input is [1, 50, 43, 58, 1] target is: 51
when input is [1, 50, 43, 58, 1, 51] t

# Neural Network: Part I
Now we start feeding the data into a neural network. We begin with a bigram model similar to the one we built previously.

In [22]:
# A batch
print(xb)
print(yb)

tensor([[53, 56, 51, 57,  1, 58, 46, 39],
        [ 1, 50, 43, 58,  1, 51, 63,  1],
        [58, 46,  2,  0,  0, 16, 33, 23],
        [40, 53, 58, 46,  1, 53, 44,  1]])
tensor([[56, 51, 57,  1, 58, 46, 39, 58],
        [50, 43, 58,  1, 51, 63,  1, 53],
        [46,  2,  0,  0, 16, 33, 23, 17],
        [53, 58, 46,  1, 53, 44,  1, 63]])


In [23]:
###BIGRAM###
#############
### Model ###
#############

In [24]:
###BIGRAM###
class BigramLanguageModel(nn.Module):
    
    def __init__(self, vocab_size):
        """ Creating Embedding Table """
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)
    
    def forward(self, idx: torch.tensor, targets: torch.tensor=None) -> tuple:
        """ Calculating the Loss """
        logits = self.token_embedding_table(idx) # (BATCH, TIME, CHANNEL)
        if targets is None:
            loss = None
        else:
            # Flatten to (B*T, C) logits and (B*T) targets, since F.cross_entropy expects the class dimension second
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
        
        return logits, loss

    def generate(self, idx: torch.tensor, max_new_tokens: int) -> torch.tensor:
        """ Generates Tokens Using a Sliding Window """
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx
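
The sampling step inside `generate` can be illustrated without PyTorch. The following is a minimal sketch rather than the notebook's implementation: `softmax` and `sample_next` are hypothetical helper names, and `random.choices` stands in for `torch.multinomial`.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(logits, rng=random):
    """Draw one token index in proportion to its softmax probability."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

probs = softmax([1.0, 1.0, 1.0])        # equal logits give a uniform distribution
peaked = sample_next([0.0, 0.0, 50.0])  # index 2 holds almost all of the probability mass
```

Because the next token is sampled rather than chosen as the most likely one, repeated calls can produce different text from the same starting context.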

In [25]:
###BIGRAM###
@torch.no_grad()
def estimate_loss():
    """
    Description:
        Estimates losses on train and valid
    Outputs:
        out: Mean loss across eval_iters items
    """
    out = {}
    model.eval()
    for split in ['train', 'valid']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X,Y = get_batch(split, batch_size, block_size)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

In [26]:
# Expected loss 
-math.log(1/vocab_size)

4.174387269895637
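
This value is the loss we would expect from a model that guesses uniformly over the vocabulary. As a short derivation, writing $V$ for the vocabulary size (a symbol introduced here only for illustration): if every character receives probability $1/V$, the cross-entropy of any target is

$$
\mathcal{L} = -\ln\frac{1}{V} = \ln V = \ln 65 \approx 4.174
$$

An untrained model whose initial loss lies well above this value is probably starting with logits that are far from uniform.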

In [27]:
###BIGRAM###
# Creating model
model = BigramLanguageModel(vocab_size)
model = model.to(device)

In [28]:
# Running forward pass of model
logits, loss = model(xb, yb)
print(logits.shape)
print(loss)

torch.Size([32, 65])
tensor(4.8196, grad_fn=<NllLossBackward0>)


In [29]:
# Generating some text
model_input = torch.zeros((1,1), dtype=torch.long) # Input token 0, which is \n
model_output = model.generate(model_input, max_new_tokens=100)[0].tolist()
print(f"Generated Text: \n {decode(model_output)}")

Generated Text: 
 
dKezqas!wAjlvCaWm?
$iNAyg
pzn&YBxWW?.RpJRLZ$Eavx&DbZtWcSZ hnF!aO:d!&smxERrp
GnQiJAb cpGMYTE,3&Sa:,Et


# Training Model: Part I
Here we use the Adam optimizer (in its AdamW variant) instead of stochastic gradient descent, which we used earlier. The optimizer determines how the parameters are updated from their gradients. Before, we simply updated them in the following way:

`p.data += -lr * 0.01 * p.grad`

Now the optimizer instead keeps track of the gradient history, so that it can build up momentum in a consistent direction and converge faster.
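
The difference between the two update rules can be sketched on a toy problem. The example below is an illustration under stated assumptions, not the notebook's training code: the one-parameter loss `(p - 3)^2`, the function names, and the hyperparameters (commonly used Adam defaults) are all chosen here for demonstration, and the weight decay that AdamW adds is left out.

```python
import math

def grad(p):
    """Gradient of the toy loss f(p) = (p - 3)^2."""
    return 2.0 * (p - 3.0)

def run_sgd(p, lr=0.1, steps=500):
    for _ in range(steps):
        p -= lr * grad(p)  # step straight down the current gradient
    return p

def run_adam(p, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    m = 0.0  # running average of gradients (the momentum term)
    v = 0.0  # running average of squared gradients (per-parameter scale)
    for t in range(1, steps + 1):
        g = grad(p)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # correct the bias from starting at zero
        v_hat = v / (1 - beta2 ** t)
        p -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return p

p_sgd = run_sgd(0.0)
p_adam = run_adam(0.0)
```

Both runs end close to the minimum at `p = 3`. The point of the sketch is the extra state Adam carries: `m` and `v` summarise the gradient history, which plain gradient descent ignores.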

In [30]:
###BIGRAM###
################
### Training ###
################

In [31]:
###BIGRAM###
# Hyperparameters
batch_size = 32
block_size = 8
max_steps = 10000
eval_interval = 300
learning_rate = 1e-2
eval_iters = 200

In [32]:
###BIGRAM###
# Creating PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

In [33]:
###BIGRAM###
# Training loop
for step in range(max_steps):
    
    # Once in a while evaluate loss on train and valid sets
    if step % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {step}: train loss {losses['train']:.4f}, valid loss {losses['valid']:.4f}")
    
    # Sample a batch of data
    xb, yb = get_batch('train', batch_size, block_size)
    
    # Evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

step 0: train loss 4.5905, valid loss 4.5717
step 300: train loss 2.7891, valid loss 2.8027
step 600: train loss 2.5396, valid loss 2.5569
step 900: train loss 2.4952, valid loss 2.5058
step 1200: train loss 2.4872, valid loss 2.5027
step 1500: train loss 2.4839, valid loss 2.4951
step 1800: train loss 2.4714, valid loss 2.4960
step 2100: train loss 2.4589, valid loss 2.4892
step 2400: train loss 2.4727, valid loss 2.4947
step 2700: train loss 2.4614, valid loss 2.5005
step 3000: train loss 2.4592, valid loss 2.4945
step 3300: train loss 2.4623, valid loss 2.4881
step 3600: train loss 2.4519, valid loss 2.4919
step 3900: train loss 2.4648, valid loss 2.4795
step 4200: train loss 2.4615, valid loss 2.4736
step 4500: train loss 2.4570, valid loss 2.4932
step 4800: train loss 2.4598, valid loss 2.4930
step 5100: train loss 2.4619, valid loss 2.4840
step 5400: train loss 2.4586, valid loss 2.4868
step 5700: train loss 2.4543, valid loss 2.4773
step 6000: train loss 2.4607, valid loss 2.486

In [34]:
###BIGRAM###
##################
### Generating ###
##################

In [35]:
###BIGRAM###
# Generating some text
model_input = torch.zeros((1,1), dtype=torch.long) # Input token 0, which is \n
model_output = model.generate(model_input, max_new_tokens=300)[0].tolist()
print(f"Generated Text: \n {decode(model_output)}")

Generated Text: 
 
PS:
Catethar roor STo haves.
AD:
IORULeas paze enilo t w
Thenty,
QUp, vomyoul?

Rorse shate:

TAs, winth,
Whabumpofof cu al heduch whehasacepitr totro myo t bu she imaif que is yoknsothe own h; giler ueyofoneyor friest th aby hath he
beselens thed t iend tes clat bof ty I dd wofernind O f gr toudo, 


# Self-Attention: Part I
We will now build out the network by adding self-attention. We start out by making a minimal example illustrating what self-attention is.

Example:  
Because we are predicting the next token, a token may attend only to itself and to the tokens before it, never to those after it.

* Tokens: abcdef
 * a: Can attend only to itself
 * b: Can attend only to a and b
 * c: Can attend to a, b and c
 * .......

The information from a varying number of previous tokens can be combined in various ways, of which the simplest is to average them. Averaging throws away a lot of information, but we start this way for simplicity.

In [36]:
# Example
B,T,C = 4,8,2 # batch, time, channels
x = torch.randn(B,T,C)
x.shape

torch.Size([4, 8, 2])

In [37]:
# Version 1: Calculating average of all previous tokens
xbow = torch.zeros((B,T,C))
for b in range(B): 
    for t in range(T):
        xprev = x[b,:t+1]
        xbow[b,t] = torch.mean(xprev, 0)

In [38]:
# Inspecting a sample
print("input sample:\n", x[0])
print("output sample:\n", xbow[0])
print("token 1 average:", x[0][0])
print("token 2 average:", (x[0][0] + x[0][1])/2)
print("token 3 average:", (x[0][0] + x[0][1] + x[0][2])/3)

input sample:
 tensor([[-0.9394,  2.5976],
        [-0.2369, -0.4511],
        [ 0.4705,  0.9367],
        [ 0.6360,  0.1003],
        [-0.0450, -1.6891],
        [ 1.7757, -0.2522],
        [ 0.1259,  0.6234],
        [-1.1404,  0.4708]])
output sample:
 tensor([[-0.9394,  2.5976],
        [-0.5881,  1.0733],
        [-0.2353,  1.0278],
        [-0.0174,  0.7959],
        [-0.0230,  0.2989],
        [ 0.2768,  0.2071],
        [ 0.2553,  0.2665],
        [ 0.0808,  0.2921]])
token 1 average: tensor([-0.9394,  2.5976])
token 2 average: tensor([-0.5881,  1.0733])
token 3 average: tensor([-0.2353,  1.0278])


For loops are slow, so we now compute the same averages much faster using [matrix multiplication](http://matrixmultiplication.xyz/). The approach is shown here via an example.

In [39]:
a = torch.tril(torch.ones(3,3))
a = a/torch.sum(a,1,keepdim=True);print("a:");print(a)
b = torch.randint(0,10,(3,2)).float();print("b:");print(b)
c = a @ b;print("c:");print(c)

a:
tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
b:
tensor([[1., 2.],
        [9., 2.],
        [5., 2.]])
c:
tensor([[1., 2.],
        [5., 2.],
        [5., 2.]])


Replacing the for-loop with matrix multiplications.

In [40]:
# Version 2: Calculating average of all previous tokens
wei = torch.tril(torch.ones(T,T))
wei = wei/torch.sum(wei,1,keepdim=True)
xbow2 = wei @ x
torch.allclose(xbow, xbow2)

True

Because we will be implementing a more advanced attention mechanism, we create a third version that computes the same averages using softmax.

In [41]:
# Version 3: Calculating average of all previous tokens
tril = torch.tril(torch.ones(T,T)) 
wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
xbow3 = wei @ x
torch.allclose(xbow, xbow3)

True
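
Why the softmax version reproduces a plain average can be checked on a single row. This is a minimal plain-Python sketch (the helper name `masked_softmax_row` is hypothetical): masked positions are set to `-inf`, whose exponential is 0, so equal scores over the visible positions give equal weights.

```python
import math

def masked_softmax_row(scores, n_visible):
    """Softmax over one row in which only the first n_visible entries may be attended to."""
    masked = [s if j < n_visible else float("-inf") for j, s in enumerate(scores)]
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

T = 4
# All-zero scores: row t spreads its weight evenly over positions 0..t.
uniform = [masked_softmax_row([0.0] * T, t + 1) for t in range(T)]
# Non-zero scores: the weights are no longer uniform, but masked positions stay at 0.
learned = masked_softmax_row([2.0, 0.0, 0.0, 0.0], 3)
```

The second call hints at what comes next: once the scores are computed from the data instead of being fixed at zero, each token can weight the earlier tokens differently.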

# Neural Network: Part II
Here we make some adjustments to the BigramLanguageModel as well as add a positional embedding and implement the self-attention block.

In [42]:
###GPTHEAD###
#################
### Libraries ###
#################
import torch
import torch.nn as nn
import torch.nn.functional as F
device = 'cuda' if torch.cuda.is_available() else 'cpu'  # is_available must be called; the bare function is always truthy
device = 'cpu'  # force CPU for this notebook

In [43]:
###GPTHEAD###
#######################
### Hyperparameters ###
#######################
batch_size = 32
block_size = 8
max_steps = 10000
eval_interval = 300
learning_rate = 1e-3
eval_iters = 200
n_embed = 32 

In [44]:
###GPTHEAD###
############
### Data ###
############

# Reading Data
with open("../../data/tinyshakespeare.txt", "r", encoding="utf-8") as f:
    text = f.read()

# All unique characters
chars = sorted(list(set(text)))
vocab_size = len(chars)

# Building the vocabulary
ctoi = {s:i for i,s in enumerate(chars)}
itoc = {i:s for s,i in ctoi.items()}

# Building encoder/decoder
encode = lambda s: [ctoi[c] for c in s]
decode = lambda l: ''.join([itoc[i] for i in l])

# Tokenizing dataset
data = torch.tensor(encode(text), dtype=torch.long)

# Train/Valid Split
n = int(0.9*len(data))
data_train = data[:n]
data_valid = data[n:]

# Gets Random Batches
def get_batch(split: str, batch_size: int, block_size: int) -> tuple:
    """
    Description:
        Generates a batch of data of inputs x and targets y.
    Inputs:
        split: train or valid split
        batch_size: How many independent sequences will be processed in parallel
        block_size: Maximum context length
    Outputs:
        x, y: a tuple with xs and ys
    """
    data = data_train if split == 'train' else data_valid
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x,y = x.to(device), y.to(device)
    return x, y

To the language model we add a linear layer and positional embeddings. We also complete the self-attention implementation, which is explained in more detail [here](https://jalammar.github.io/illustrated-transformer/).

In [45]:
# Version 4: Self-Attention Head
# Sample
B,T,C = 4,8,32
x = torch.randn(B,T,C)

In [46]:
# Single Head of Self-Attention
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k = key(x) # (B, T, head_size)
q = query(x) # (B, T, head_size)
v = value(x) # (B, T, head_size)
wei = q @ k.transpose(-2, -1) # (B, T, head_size) @ (B, head_size, T) --> (B, T, T)

In [47]:
# Self-Attention
tril = torch.tril(torch.ones(T,T)) 
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
out = wei @ v;out.shape

torch.Size([4, 8, 16])
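
The same computation can be followed step by step on plain Python lists. The following is a sketch for a single sequence, with hypothetical helper names and small toy sizes; unlike the cell above, it already divides the scores by the square root of the head size.

```python
import math
import random

random.seed(0)

def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

def project(W, x):
    """Apply a bias-free linear layer: one output per row of W."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def causal_attention(tokens, Wq, Wk, Wv):
    q = [project(Wq, t) for t in tokens]
    k = [project(Wk, t) for t in tokens]
    v = [project(Wv, t) for t in tokens]
    head_size = len(q[0])
    weights, out = [], []
    for t in range(len(tokens)):
        # Score position t against positions 0..t only; later positions stay masked.
        scores = [sum(a * b for a, b in zip(q[t], k[j])) / math.sqrt(head_size)
                  for j in range(t + 1)]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        row = [e / total for e in exps] + [0.0] * (len(tokens) - t - 1)
        weights.append(row)
        # The output is the attention-weighted sum of the value vectors.
        out.append([sum(row[j] * v[j][c] for j in range(len(tokens)))
                    for c in range(head_size)])
    return out, weights

T, C, head_size = 3, 4, 2
tokens = rand_matrix(T, C)
Wq, Wk, Wv = (rand_matrix(head_size, C) for _ in range(3))
out, weights = causal_attention(tokens, Wq, Wk, Wv)
```

Each row of `weights` sums to 1 and is zero to the right of the diagonal, so the first token can only attend to itself.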

Now we write the attention code for one head into a class.

In [48]:
###GPTHEAD###
#############
### Model ###
#############
class Head(nn.Module):
    """ One head of self-attention """
    
    def __init__(self, head_size):
        """ Creating three linear layers and a mask """
        super().__init__()
        self.key = nn.Linear(n_embed, head_size, bias=False)
        self.query = nn.Linear(n_embed, head_size, bias=False)
        self.value = nn.Linear(n_embed, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
    
    def forward(self, x):
        """ Attention calculation """
        B,T,C = x.shape
        k = self.key(x)
        q = self.query(x)
        
        # Compute attention scores
        wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5 # scale by 1/sqrt(head_size), as in the paper, so the scores keep variance near 1
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        
        v = self.value(x)
        out = wei @ v
        return out
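
The scaling of the attention scores is there to keep their variance near 1, so that the softmax does not saturate. A small simulation can make this plausible. It is a sketch that assumes queries and keys with independent, unit-variance components; the function name and sample count are chosen here for illustration.

```python
import random
import statistics

random.seed(0)

def score_variance(dim, scale, n_samples=2000):
    """Empirical variance of scaled dot products between random unit-variance vectors."""
    scores = []
    for _ in range(n_samples):
        q = [random.gauss(0, 1) for _ in range(dim)]
        k = [random.gauss(0, 1) for _ in range(dim)]
        scores.append(scale * sum(a * b for a, b in zip(q, k)))
    return statistics.pvariance(scores)

head_size = 16
raw = score_variance(head_size, 1.0)                  # grows with the dimension (about head_size)
scaled = score_variance(head_size, head_size ** -0.5) # stays close to 1
```

Under these assumptions the unscaled variance grows linearly with the vector length, which is why the factor is tied to the dimension of the query and key vectors.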

In [49]:
###GPTHEAD###
class BigramLanguageModel(nn.Module):
    
    def __init__(self):
        """ Creating Layers """
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embed)
        self.position_embedding_table = nn.Embedding(block_size, n_embed)
        self.sa_head = Head(n_embed)
        self.lm_head = nn.Linear(n_embed, vocab_size)
    
    def forward(self, idx: torch.tensor, targets: torch.tensor=None) -> tuple:
        """ Calculating the Loss """
        B, T = idx.shape
        tok_emb = self.token_embedding_table(idx) # (B,T,embed_size)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb
        x = self.sa_head(x)
        logits = self.lm_head(x) # (B,T,vocab_size)
        
        if targets is None:
            loss = None
        else:
            # Flatten to (B*T, C) logits and (B*T) targets, since F.cross_entropy expects the class dimension second
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
        
        return logits, loss

    def generate(self, idx: torch.tensor, max_new_tokens: int) -> torch.tensor:
        """ Generates Tokens Using a Sliding Window """
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx if idx.size(1) <= block_size else idx[:, -block_size:]
            
            # get the predictions
            logits, loss = self(idx_cond)
            
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx
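
The effect of adding the positional embedding to the token embedding can be sketched with two small lookup tables. The sizes and the helper name `embed` below are assumptions for illustration; in the model both tables are learned `nn.Embedding` layers.

```python
import random

random.seed(0)

vocab_size, block_size, n_embed = 5, 4, 3
token_table = [[random.gauss(0, 1) for _ in range(n_embed)] for _ in range(vocab_size)]
position_table = [[random.gauss(0, 1) for _ in range(n_embed)] for _ in range(block_size)]

def embed(idx):
    """Look up each token and add the vector of the position at which it occurs."""
    return [
        [tok + pos for tok, pos in zip(token_table[token], position_table[t])]
        for t, token in enumerate(idx)
    ]

# The same token (id 2) at three positions yields three different input vectors.
x = embed([2, 2, 2])
```

Without the positional term the three rows would be identical, and attention would have no way to tell where in the context a token occurred.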

In [50]:
###GPTHEAD###
# Function for estimating loss
@torch.no_grad()
def estimate_loss():
    """
    Description:
        Estimates losses on train and valid
    Outputs:
        out: Mean loss across eval_iters items
    """
    out = {}
    model.eval()
    for split in ['train', 'valid']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X,Y = get_batch(split, batch_size, block_size)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

In [51]:
###GPTHEAD###
# Creating model
model = BigramLanguageModel()
model = model.to(device)

In [52]:
###GPTHEAD###
################
### Training ###
################

# Creating PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

# Training loop
for step in range(max_steps):
    
    # Once in a while evaluate loss on train and valid sets
    if step % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {step}: train loss {losses['train']:.4f}, valid loss {losses['valid']:.4f}")
    
    # Sample a batch of data
    xb, yb = get_batch('train', batch_size, block_size)
    
    # Evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

step 0: train loss 4.1273, valid loss 4.1317
step 300: train loss 2.9050, valid loss 2.9306
step 600: train loss 2.6385, valid loss 2.6372
step 900: train loss 2.5488, valid loss 2.5465
step 1200: train loss 2.5070, valid loss 2.5043
step 1500: train loss 2.4769, valid loss 2.4812
step 1800: train loss 2.4551, valid loss 2.4669
step 2100: train loss 2.4348, valid loss 2.4565
step 2400: train loss 2.4283, valid loss 2.4313
step 2700: train loss 2.4173, valid loss 2.4030
step 3000: train loss 2.4181, valid loss 2.4233
step 3300: train loss 2.4094, valid loss 2.4091
step 3600: train loss 2.3986, valid loss 2.4132
step 3900: train loss 2.3962, valid loss 2.4054
step 4200: train loss 2.4019, valid loss 2.4105
step 4500: train loss 2.4050, valid loss 2.3967
step 4800: train loss 2.3939, valid loss 2.3939
step 5100: train loss 2.3943, valid loss 2.3997
step 5400: train loss 2.3942, valid loss 2.3814
step 5700: train loss 2.3791, valid loss 2.3926
step 6000: train loss 2.3810, valid loss 2.380

In [53]:
###GPTHEAD###
##################
### Generating ###
##################
# Generating some text
model_input = torch.zeros((1,1), dtype=torch.long) # Input token 0, which is \n
model_output = model.generate(model_input, max_new_tokens=300)[0].tolist()
print(f"Generated Text: \n {decode(model_output)}")

Generated Text: 
 
CHAn fimad hat wongor hes; in be she ha hre, inoongo what? ativearee I cod,

SI brenggle wime?

Thal:
KI sagrooun hour, the eand, atrkenngeer, my mge peard wies,
BRe:
MI: heay,
An, coun ars,

Wit, owwe!

INUMKIHARS:
I:
A:
Nell.
OMO:
r'd, owto spivete tapfe hare pore iveat otre by loum,
A wor;
AILABu


# Self-Attention: Part II
Now we add multiple attention heads to our model. Multi-head attention runs the data through several attention heads in parallel and then concatenates the results. We implement this here, but first comes all the boilerplate code.

In [54]:
###GPTMULTIHEAD###
#################
### Libraries ###
#################
import torch
import torch.nn as nn
import torch.nn.functional as F
device = 'cuda' if torch.cuda.is_available() else 'cpu'  # is_available must be called; the bare function is always truthy
device = 'cpu'  # force CPU for this notebook

In [55]:
###GPTMULTIHEAD###
#######################
### Hyperparameters ###
#######################
batch_size = 32
block_size = 8
max_steps = 10000
eval_interval = 300
learning_rate = 1e-3
eval_iters = 200
n_embed = 32 

In [56]:
###GPTMULTIHEAD###
############
### Data ###
############

# Reading Data
with open("../../data/tinyshakespeare.txt", "r", encoding="utf-8") as f:
    text = f.read()

# All unique characters
chars = sorted(list(set(text)))
vocab_size = len(chars)

# Building the vocabulary
ctoi = {s:i for i,s in enumerate(chars)}
itoc = {i:s for s,i in ctoi.items()}

# Building encoder/decoder
encode = lambda s: [ctoi[c] for c in s]
decode = lambda l: ''.join([itoc[i] for i in l])

# Tokenizing dataset
data = torch.tensor(encode(text), dtype=torch.long)

# Train/Valid Split
n = int(0.9*len(data))
data_train = data[:n]
data_valid = data[n:]

# Gets Random Batches
def get_batch(split: str, batch_size: int, block_size: int) -> tuple:
    """
    Description:
        Generates a batch of data of inputs x and targets y.
    Inputs:
        split: train or valid split
        batch_size: How many independent sequences will be processed in parallel
        block_size: Maximum context length
    Outputs:
        x, y: a tuple with xs and ys
    """
    data = data_train if split == 'train' else data_valid
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x,y = x.to(device), y.to(device)
    return x, y

In [57]:
###GPTMULTIHEAD###
#############
### Model ###
#############
class Head(nn.Module):
    """ One head of self-attention """
    
    def __init__(self, head_size):
        """ Creating three linear layers and a mask """
        super().__init__()
        self.key = nn.Linear(n_embed, head_size, bias=False)
        self.query = nn.Linear(n_embed, head_size, bias=False)
        self.value = nn.Linear(n_embed, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
    
    def forward(self, x):
        """ Attention calculation """
        B,T,C = x.shape
        k = self.key(x)
        q = self.query(x)
        
        # Compute attention scores
        wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5 # scale by 1/sqrt(head_size), as in the paper, so the scores keep variance near 1
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        
        v = self.value(x)
        out = wei @ v
        return out

Here we create the multi-head class, which simply creates several independent attention heads and concatenates their results.

In [134]:
###GPTMULTIHEAD###
class MultiHeadAttention(nn.Module):
    """ Multiple heads of self-attention in parallel """
    
    def __init__(self, num_heads, head_size):
        """ Multiple heads in parallel """
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        
    def forward(self, x):
        """ Calculating and Concatenating Results """
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        return out
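
The channel arithmetic behind the concatenation can be sketched for a single token vector. In this illustration each head is replaced by a hypothetical stand-in (a fixed random projection from `n_embed` to `head_size` values), since only the shapes matter here.

```python
import random

random.seed(0)

n_embed, num_heads = 32, 4
head_size = n_embed // num_heads  # 8 channels per head

def make_head(in_dim, out_dim):
    """Stand-in for an attention head: a fixed random projection of one token vector."""
    W = [[random.gauss(0, 1) for _ in range(in_dim)] for _ in range(out_dim)]
    return lambda x: [sum(w * xi for w, xi in zip(row, x)) for row in W]

heads = [make_head(n_embed, head_size) for _ in range(num_heads)]
x = [random.gauss(0, 1) for _ in range(n_embed)]

per_head = [h(x) for h in heads]                 # num_heads vectors of length head_size
combined = [c for vec in per_head for c in vec]  # concatenate along the channel axis
```

With 4 heads of size 8, the concatenated output again has 32 channels, so it can be passed to a layer that expects `n_embed` inputs.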

Finally we make minor adjustments to the BigramLanguageModel such that it uses the multi-headed attention during training and inference.

In [136]:
###GPTMULTIHEAD###
class BigramLanguageModel(nn.Module):
    
    def __init__(self):
        """ Creating Layers """
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embed)
        self.position_embedding_table = nn.Embedding(block_size, n_embed)
        self.sa_heads = MultiHeadAttention(4, n_embed//4)
        self.lm_head = nn.Linear(n_embed, vocab_size)
    
    def forward(self, idx: torch.tensor, targets: torch.tensor=None) -> tuple:
        """ Calculating the Loss """
        B, T = idx.shape
        tok_emb = self.token_embedding_table(idx) # (B,T,embed_size)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb
        x = self.sa_heads(x)
        logits = self.lm_head(x) # (B,T,vocab_size)
        
        if targets is None:
            loss = None
        else:
            # Flatten to (B*T, C) logits and (B*T) targets, since F.cross_entropy expects the class dimension second
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
        
        return logits, loss

    def generate(self, idx: torch.tensor, max_new_tokens: int) -> torch.tensor:
        """ Generates Tokens Using a Sliding Window """
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx if idx.size(1) <= block_size else idx[:, -block_size:]
            
            # get the predictions
            logits, loss = self(idx_cond)
            
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx
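
The cropping in `generate` is needed because the position embedding table only has `block_size` rows, so the model cannot be given a longer context. The sliding window can be sketched in plain Python; `toy_next_token` is a hypothetical stand-in for the model that merely records how much context it was given.

```python
block_size = 8
seen_lengths = []

def toy_next_token(context):
    """Hypothetical stand-in for the model: records the context length and returns a token id."""
    seen_lengths.append(len(context))
    return (context[-1] + 1) % 65

def generate(idx, max_new_tokens):
    idx = list(idx)
    for _ in range(max_new_tokens):
        idx_cond = idx[-block_size:]  # keep at most the last block_size tokens
        idx.append(toy_next_token(idx_cond))
    return idx

tokens = generate([0], max_new_tokens=20)
```

The full sequence keeps growing, while the context handed to the model grows only until it reaches `block_size` and then slides forward one token at a time.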

The training loop and evaluation are the same.

In [146]:
###GPTMULTIHEAD###
# Function for estimating loss
@torch.no_grad()
def estimate_loss():
    """
    Description:
        Estimates losses on train and valid
    Outputs:
        out: Mean loss across eval_iters items
    """
    out = {}
    model.eval()
    for split in ['train', 'valid']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X,Y = get_batch(split, batch_size, block_size)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

In [138]:
###GPTMULTIHEAD###
# Creating model
model = BigramLanguageModel()
model = model.to(device)

In [139]:
###GPTMULTIHEAD###
################
### Training ###
################

# Creating PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

# Training loop
for step in range(max_steps):
    
    # Once in a while evaluate loss on train and valid sets
    if step % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {step}: train loss {losses['train']:.4f}, valid loss {losses['valid']:.4f}")
    
    # Sample a batch of data
    xb, yb = get_batch('train', batch_size, block_size)
    
    # Evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

step 0: train loss 4.2183, valid loss 4.2240
step 300: train loss 2.8098, valid loss 2.8290
step 600: train loss 2.6040, valid loss 2.6045
step 900: train loss 2.5177, valid loss 2.5117
step 1200: train loss 2.4568, valid loss 2.4699
step 1500: train loss 2.4259, valid loss 2.4373
step 1800: train loss 2.4052, valid loss 2.3980
step 2100: train loss 2.3732, valid loss 2.3774
step 2400: train loss 2.3624, valid loss 2.3747
step 2700: train loss 2.3413, valid loss 2.3543
step 3000: train loss 2.3261, valid loss 2.3280
step 3300: train loss 2.3031, valid loss 2.3290
step 3600: train loss 2.3078, valid loss 2.3350
step 3900: train loss 2.2907, valid loss 2.3128
step 4200: train loss 2.2878, valid loss 2.3171
step 4500: train loss 2.2854, valid loss 2.3060
step 4800: train loss 2.2698, valid loss 2.2992
step 5100: train loss 2.2627, valid loss 2.2857
step 5400: train loss 2.2507, valid loss 2.2821
step 5700: train loss 2.2471, valid loss 2.2895
step 6000: train loss 2.2573, valid loss 2.279

In [140]:
###GPTMULTIHEAD###
##################
### Generating ###
##################
# Generating some text
model_input = torch.zeros((1,1), dtype=torch.long) # Input token 0, which is \n
model_output = model.generate(model_input, max_new_tokens=300)[0].tolist()
print(f"Generated Text: \n {decode(model_output)}")

Generated Text: 
 
lif't mines.
F Nrais is staigh,
Head
F:
Wentce mousesty,
No tue king ducer, thyaw the, Gem inetimy sals Bubperuck! aik war:
Welly heompe of thourry fack toour ncrd?

GHOR:
Wilt fad beance, willied ben ow, repplard, with,
TSeirk.

ISANT: My, onind.
I Geaven so of thond
Agive; you grot lichy shang thi


In [None]:
#https://www.youtube.com/watch?v=kCc8FmEb1nY&t=82m

# Exporting Code
Here we export code labeled with ###BIGRAM###, ###GPTHEAD### and so on to scripts.

In [141]:
!python3.10 ../../helpers/ipynb_to_py.py 1.\ Building\ GPT-like\ Model.ipynb "###BIGRAM###" ../../modules/GPT/bigram.py

INFO: Cells with label ###BIGRAM### extracted from 1. Building GPT-like Model.ipynb


In [144]:
!python3.10 ../../helpers/ipynb_to_py.py 1.\ Building\ GPT-like\ Model.ipynb "###GPTHEAD###" ../../modules/GPT/GPThead.py

INFO: Cells with label ###GPTHEAD### extracted from 1. Building GPT-like Model.ipynb


In [148]:
!python3.10 ../../helpers/ipynb_to_py.py 1.\ Building\ GPT-like\ Model.ipynb "###GPTMULTIHEAD###" ../../modules/GPT/GPTmultihead.py

INFO: Cells with label ###GPTMULTIHEAD### extracted from 1. Building GPT-like Model.ipynb
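
The source of `ipynb_to_py.py` is not shown here, so the following is only a guess at the idea behind label-based export, not the helper's actual implementation. It assumes the standard notebook JSON layout (a `cells` list whose entries have `cell_type` and `source`), and the function name `extract_labeled_cells` is hypothetical.

```python
import json

def extract_labeled_cells(notebook, label):
    """Return the source of every code cell whose first line starts with the label."""
    sources = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        if isinstance(source, list):  # notebooks may store source as a list of lines
            source = "".join(source)
        if source.lstrip().startswith(label):
            sources.append(source)
    return sources

# Tiny in-memory notebook, round-tripped through JSON to mimic reading an .ipynb file.
demo_notebook = json.loads(json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Title"]},
        {"cell_type": "code", "source": ["###BIGRAM###\n", "import torch"]},
        {"cell_type": "code", "source": ["print('not exported')"]},
    ]
}))
script = "\n\n".join(extract_labeled_cells(demo_notebook, "###BIGRAM###"))
```

Tagging cells this way keeps exploratory cells (printing, plotting, testing) out of the exported scripts while the notebook remains the single source of the code.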
