# Introduction
Here we build a GPT-like model from scratch, based on two papers: [Attention is All You Need](https://arxiv.org/abs/1706.03762), which proposed the **transformer** architecture, and [GPT-3](https://arxiv.org/abs/2005.14165). GPT is a language model that, given an input sequence, simply predicts the next token.

# Libraries

In [1]:
%matplotlib inline
%config IPCompleter.use_jedi=False

In [2]:
import math
import random
import matplotlib.pyplot as plt

In [3]:
###BIGRAM###
#################
### Libraries ###
#################

In [4]:
###BIGRAM###
import torch
import torch.nn as nn
import torch.nn.functional as F
device = 'cuda' if torch.cuda.is_available() else 'cpu'  # is_available must be called
device = 'cpu'  # force CPU for this notebook

# Data
We load the tiny-Shakespeare dataset and process it so that it can be used to train a model that generates Shakespeare-like text.

### Reading and Inspecting

In [5]:
###BIGRAM###
############
### Data ###
############

In [6]:
###BIGRAM###
# Reading Data
with open("../../data/tinyshakespeare.txt", "r", encoding="utf-8") as f:
    text = f.read()

In [7]:
print("Number of Characters: ", len(text))

Number of Characters:  1115394


In [8]:
# First 300 characters
print(text[:300])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us


In [9]:
###BIGRAM###
# All unique characters
chars = sorted(list(set(text)))
vocab_size = len(chars)

In [10]:
print(f"Vocab Size: {vocab_size}")
print(f"Vocab Chars: {''.join(chars):}")

Vocab Size: 65
Vocab Chars: 
 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz


### Building Vocabulary and Encoder/Decoder
We use a character-level tokenizer here. In practice, sub-word tokenizers are used instead; OpenAI, for example, uses a **BPE** (byte-pair encoding) tokenizer:

In [11]:
# OpenAI encoder/decoder
import tiktoken
enc = tiktoken.get_encoding("gpt2")
enc.decode(enc.encode("hello world")) == "hello world"

True

We will be using a simple tokenizer that uses characters rather than word-chunks to make things easier to understand.

In [12]:
###BIGRAM###
# Building the vocabulary
ctoi = {s:i for i,s in enumerate(chars)}
itoc = {i:s for s,i in ctoi.items()}

In [13]:
# Printing vocabulary
print(ctoi)
print(itoc)

{'\n': 0, ' ': 1, '!': 2, '$': 3, '&': 4, "'": 5, ',': 6, '-': 7, '.': 8, '3': 9, ':': 10, ';': 11, '?': 12, 'A': 13, 'B': 14, 'C': 15, 'D': 16, 'E': 17, 'F': 18, 'G': 19, 'H': 20, 'I': 21, 'J': 22, 'K': 23, 'L': 24, 'M': 25, 'N': 26, 'O': 27, 'P': 28, 'Q': 29, 'R': 30, 'S': 31, 'T': 32, 'U': 33, 'V': 34, 'W': 35, 'X': 36, 'Y': 37, 'Z': 38, 'a': 39, 'b': 40, 'c': 41, 'd': 42, 'e': 43, 'f': 44, 'g': 45, 'h': 46, 'i': 47, 'j': 48, 'k': 49, 'l': 50, 'm': 51, 'n': 52, 'o': 53, 'p': 54, 'q': 55, 'r': 56, 's': 57, 't': 58, 'u': 59, 'v': 60, 'w': 61, 'x': 62, 'y': 63, 'z': 64}
{0: '\n', 1: ' ', 2: '!', 3: '$', 4: '&', 5: "'", 6: ',', 7: '-', 8: '.', 9: '3', 10: ':', 11: ';', 12: '?', 13: 'A', 14: 'B', 15: 'C', 16: 'D', 17: 'E', 18: 'F', 19: 'G', 20: 'H', 21: 'I', 22: 'J', 23: 'K', 24: 'L', 25: 'M', 26: 'N', 27: 'O', 28: 'P', 29: 'Q', 30: 'R', 31: 'S', 32: 'T', 33: 'U', 34: 'V', 35: 'W', 36: 'X', 37: 'Y', 38: 'Z', 39: 'a', 40: 'b', 41: 'c', 42: 'd', 43: 'e', 44: 'f', 45: 'g', 46: 'h', 47: 'i',

In [14]:
###BIGRAM###
# Building encoder/decoder
encode = lambda s: [ctoi[c] for c in s]
decode = lambda l: ''.join([itoc[i] for i in l])

In [15]:
# Testing encoder/decoder
print(encode("hii there"))
print(decode(encode("hii there")))

[46, 47, 47, 1, 58, 46, 43, 56, 43]
hii there


### Tokenizing
Using our simple tokenizer we tokenize the entire dataset.

In [16]:
###BIGRAM###
# Tokenizing dataset
data = torch.tensor(encode(text), dtype=torch.long)

In [17]:
# Printing example
print(data.shape, data.dtype)
print(data[:50])

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56])


### Train/Valid Split

In [18]:
###BIGRAM###
# Train/Valid Split
n = int(0.9*len(data))
data_train = data[:n]
data_valid = data[n:]

### Creating dataset
When building our model, we would like it to be able to generate text from a context as short as one character, and up to a context of size **block_size**.

In [19]:
# Example of a sample
block_size = 8
x = data_train[:block_size]
y = data_train[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} target is: {target}")

when input is tensor([18]) target is: 47
when input is tensor([18, 47]) target is: 56
when input is tensor([18, 47, 56]) target is: 57
when input is tensor([18, 47, 56, 57]) target is: 58
when input is tensor([18, 47, 56, 57, 58]) target is: 1
when input is tensor([18, 47, 56, 57, 58,  1]) target is: 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]) target is: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) target is: 58


We create a function for getting random batches from the data.

In [20]:
###BIGRAM###
# Gets Random Batches
def get_batch(split: str, batch_size: int, block_size: int) -> tuple:
    """
    Description:
        Generates a batch of data of inputs x and targets y.
    Inputs:
        split: 'train' or 'valid' split
        batch_size: How many independent sequences will be processed in parallel
        block_size: Maximum context length
    Outputs:
        x, y: a tuple with xs and ys
    """
    data = data_train if split == 'train' else data_valid
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x,y = x.to(device), y.to(device)
    return x, y

In [21]:
# Testing function
batch_size = 4 
block_size = 8 

xb, yb = get_batch('train', batch_size, block_size)
print('inputs:')
print(xb.shape)
print(xb)
print('outputs:')
print(yb.shape)
print(yb)

for b in range(batch_size):
    for t in range(block_size):
        context = xb[b, :t+1]
        target = yb[b, t]
        print(f"when input is {context.tolist()} target is: {target}")

inputs:
torch.Size([4, 8])
tensor([[39, 41, 55, 59, 39, 47, 52, 58],
        [43, 57, 58,  8,  0,  0, 16, 33],
        [59, 43,  5, 42,  1, 51, 63,  1],
        [46, 43,  1, 39, 54, 54, 50, 39]])
outputs:
torch.Size([4, 8])
tensor([[41, 55, 59, 39, 47, 52, 58, 43],
        [57, 58,  8,  0,  0, 16, 33, 23],
        [43,  5, 42,  1, 51, 63,  1, 44],
        [43,  1, 39, 54, 54, 50, 39, 59]])
when input is [39] target is: 41
when input is [39, 41] target is: 55
when input is [39, 41, 55] target is: 59
when input is [39, 41, 55, 59] target is: 39
when input is [39, 41, 55, 59, 39] target is: 47
when input is [39, 41, 55, 59, 39, 47] target is: 52
when input is [39, 41, 55, 59, 39, 47, 52] target is: 58
when input is [39, 41, 55, 59, 39, 47, 52, 58] target is: 43
when input is [43] target is: 57
when input is [43, 57] target is: 58
when input is [43, 57, 58] target is: 8
when input is [43, 57, 58, 8] target is: 0
when input is [43, 57, 58, 8, 0] target is: 0
when input is [43, 57, 58, 8, 0,

# Neural Network: Part I
Now we will start feeding the data into a neural network. We start with a bigram model similar to the one we built previously.

In [22]:
# A batch
print(xb)
print(yb)

tensor([[39, 41, 55, 59, 39, 47, 52, 58],
        [43, 57, 58,  8,  0,  0, 16, 33],
        [59, 43,  5, 42,  1, 51, 63,  1],
        [46, 43,  1, 39, 54, 54, 50, 39]])
tensor([[41, 55, 59, 39, 47, 52, 58, 43],
        [57, 58,  8,  0,  0, 16, 33, 23],
        [43,  5, 42,  1, 51, 63,  1, 44],
        [43,  1, 39, 54, 54, 50, 39, 59]])


In [23]:
###BIGRAM###
#############
### Model ###
#############

In [24]:
###BIGRAM###
class BigramLanguageModel(nn.Module):
    
    def __init__(self, vocab_size):
        """ Creating Embedding Table """
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)
    
    def forward(self, idx: torch.tensor, targets: torch.tensor=None) -> tuple:
        """ Calculating the Loss """
        logits = self.token_embedding_table(idx) # (BATCH, TIME, CHANNEL)
        if targets is None:
            loss = None
        else:
            # F.cross_entropy expects (N, C) logits and (N,) targets, so we flatten batch and time
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
        
        return logits, loss

    def generate(self, idx: torch.tensor, max_new_tokens: int) -> torch.tensor:
        """ Generates Tokens Using a Sliding Window """
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

In [25]:
###BIGRAM###
@torch.no_grad()
def estimate_loss():
    """
    Description:
        Estimates losses on train and valid
    Outputs:
        out: Mean loss across eval_iters items
    """
    out = {}
    model.eval()
    for split in ['train', 'valid']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X,Y = get_batch(split, batch_size, block_size)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

In [26]:
# Expected loss 
-math.log(1/vocab_size)

4.174387269895637
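
The value above is the loss of a model that assigns the same probability to each of the 65 characters, that is -ln(1/65) = ln(65). As a minimal sketch, assuming all-zero logits stand in for such a uniform model (the targets are arbitrary example values), the same number can be reproduced with `F.cross_entropy`:

```python
import math
import torch
import torch.nn.functional as F

vocab_size = 65  # size of the character vocabulary found above

# All-zero logits correspond to a uniform distribution over the vocabulary
logits = torch.zeros(4, vocab_size)
targets = torch.tensor([0, 10, 20, 64])  # arbitrary example targets

uniform_loss = F.cross_entropy(logits, targets).item()
print(uniform_loss, math.log(vocab_size))
```

A randomly initialized model will typically start somewhat above this baseline, because its logits are random rather than exactly uniform.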

In [27]:
###BIGRAM###
# Creating model
model = BigramLanguageModel(vocab_size)
model = model.to(device)

In [28]:
# Running forward pass of model
logits, loss = model(xb, yb)
print(logits.shape)
print(loss)

torch.Size([32, 65])
tensor(4.6371, grad_fn=<NllLossBackward0>)


In [29]:
# Generating some text
model_input = torch.zeros((1,1), dtype=torch.long) # Input token 0, which is \n
model_output = model.generate(model_input, max_new_tokens=100)[0].tolist()
print(f"Generated Text: \n {decode(model_output)}")

Generated Text: 
 
maYacbM?ZFyRcx-
DDK
qLNUkHW;jYYtV.aPKJd!o$helg$eTE-A: XdQv3;$SHUuropObpoPxhehPtlFFurx&bnju zO'M'ZCVp


# Training Model: Part I
Here we use the Adam optimizer (specifically PyTorch's AdamW variant) instead of stochastic gradient descent, which we used earlier. The optimizer determines how the parameters are updated from their gradients. Previously we simply updated them in the following way:

p.data += -lr * 0.01 * p.grad. 

Now the optimizer instead keeps track of the gradient history, so that it can build momentum in a consistent direction and converge faster.
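
To make the difference concrete, here is a rough sketch of the two update rules on a single scalar parameter with the toy objective f(p) = p². The helper names `sgd_step` and `adam_step` are hypothetical and written only for illustration; the Adam part follows the standard moment-estimate formulation and leaves out the weight decay that AdamW adds.

```python
import math

def sgd_step(p, grad, lr):
    """Plain gradient descent: move against the current gradient only."""
    return p - lr * grad

def adam_step(p, grad, state, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam-style update: uses running averages of the gradient history."""
    state["t"] += 1
    # Running average of gradients (momentum) and of squared gradients
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    # Bias correction for the zero-initialized averages
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return p - lr * m_hat / (math.sqrt(v_hat) + eps)

p_sgd = p_adam = 5.0
state = {"t": 0, "m": 0.0, "v": 0.0}
for _ in range(200):
    p_sgd = sgd_step(p_sgd, 2 * p_sgd, lr=0.01)            # gradient of p**2 is 2p
    p_adam = adam_step(p_adam, 2 * p_adam, state, lr=0.1)
print(p_sgd, p_adam)
```

The `state` dictionary is the "gradient history" mentioned above: it persists between steps, whereas the plain update only ever looks at the current gradient.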

In [30]:
###BIGRAM###
################
### Training ###
################

In [31]:
###BIGRAM###
# Hyperparameters
batch_size = 32
block_size = 8
max_steps = 10000
eval_interval = 300
learning_rate = 1e-2
eval_iters = 200

In [32]:
###BIGRAM###
# Creating PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

In [33]:
###BIGRAM###
# Training loop
for step in range(max_steps):
    
    # Once in a while evaluate loss on train and valid sets
    if step % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {step}: train loss {losses['train']:.4f}, valid loss {losses['valid']:.4f}")
    
    # Sample a batch of data
    xb, yb = get_batch('train', batch_size, block_size)
    
    # Evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

step 0: train loss 4.6793, valid loss 4.6836
step 300: train loss 2.8199, valid loss 2.8321
step 600: train loss 2.5559, valid loss 2.5617
step 900: train loss 2.4975, valid loss 2.5172
step 1200: train loss 2.4851, valid loss 2.5030
step 1500: train loss 2.4629, valid loss 2.4922
step 1800: train loss 2.4685, valid loss 2.4906
step 2100: train loss 2.4721, valid loss 2.4764
step 2400: train loss 2.4656, valid loss 2.4942
step 2700: train loss 2.4647, valid loss 2.4891
step 3000: train loss 2.4662, valid loss 2.4872
step 3300: train loss 2.4668, valid loss 2.4809
step 3600: train loss 2.4565, valid loss 2.4901
step 3900: train loss 2.4676, valid loss 2.4935
step 4200: train loss 2.4640, valid loss 2.4891
step 4500: train loss 2.4588, valid loss 2.4906
step 4800: train loss 2.4501, valid loss 2.4853
step 5100: train loss 2.4582, valid loss 2.4909
step 5400: train loss 2.4679, valid loss 2.4805
step 5700: train loss 2.4519, valid loss 2.4919
step 6000: train loss 2.4581, valid loss 2.477

In [34]:
###BIGRAM###
##################
### Generating ###
##################

In [35]:
###BIGRAM###
# Generating some text
model_input = torch.zeros((1,1), dtype=torch.long) # Input token 0, which is \n
model_output = model.generate(model_input, max_new_tokens=300)[0].tolist()
print(f"Generated Text: \n {decode(model_output)}")

Generated Text: 
 

ARIIOMy nd finorearedeether g eay sw thaustetire ES: beasg winsous
Vofreawhe on are Andite, ere,
Whawh thmellliseaqu iorare:
Y it thomeorou, med MENTEYo ay I e sth:
ICICo,
An thitit yeucemok, a anithomior ouceamantu shole, r my KEMNourered s s lainthacudytogendwen:
Whil
TE fllon, men se I thid mee 


# Self-Attention: Part I
We will now build out the network by adding self-attention. We start out by making a minimal example illustrating what self-attention is.

Example:  
Because we are predicting the future, a token may only attend to itself and to the tokens before it:

* Tokens: abcdef
 * a: Cannot attend to any other characters than itself
 * b: Can attend only to a and b
 * c: Can attend to a, b and c
 * .......
 
There are various ways to combine a varying number of tokens; the simplest is to average them. This throws away a lot of information, but we start this way for simplicity.
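
The attention pattern in the list above can be sketched as a lower-triangular mask. This is only an illustration on the six example tokens; the names `visible` and `avg_weights` are made up for this sketch.

```python
import torch

tokens = list("abcdef")
T = len(tokens)

# Lower-triangular mask: row t has ones for positions 0..t
mask = torch.tril(torch.ones(T, T))

# For every token, list the tokens it is allowed to attend to
visible = {
    tokens[t]: [tokens[s] for s in range(T) if mask[t, s] == 1]
    for t in range(T)
}
for tok, ctx in visible.items():
    print(f"{tok} can attend to {ctx}")

# Normalizing each row turns the mask into averaging weights
avg_weights = mask / mask.sum(dim=1, keepdim=True)
```

Each row of `avg_weights` sums to one, so multiplying it with the token vectors yields the average of the current and all previous tokens.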

In [36]:
# Example
B,T,C = 4,8,2 # batch, time, channels
x = torch.randn(B,T,C)
x.shape

torch.Size([4, 8, 2])

In [37]:
# Version 1: Calculating average of all previous tokens
xbow = torch.zeros((B,T,C))
for b in range(B): 
    for t in range(T):
        xprev = x[b,:t+1]
        xbow[b,t] = torch.mean(xprev, 0)

In [38]:
# Inspecting a sample
print("input sample:\n", x[0])
print("output sample:\n", xbow[0])
print("token 1 average:", x[0][0])
print("token 2 average:", (x[0][0] + x[0][1])/2)
print("token 3 average:", (x[0][0] + x[0][1] + x[0][2])/3)

input sample:
 tensor([[-1.9903,  0.7617],
        [ 0.8435, -0.6335],
        [-0.6518, -0.7260],
        [ 3.1971,  1.0723],
        [ 0.1188,  1.3957],
        [-0.7644,  0.0765],
        [ 1.4463, -0.0962],
        [-0.3653,  1.2909]])
output sample:
 tensor([[-1.9903,  0.7617],
        [-0.5734,  0.0641],
        [-0.5995, -0.1993],
        [ 0.3496,  0.1186],
        [ 0.3035,  0.3740],
        [ 0.1255,  0.3245],
        [ 0.3142,  0.2644],
        [ 0.2292,  0.3927]])
token 1 average: tensor([-1.9903,  0.7617])
token 2 average: tensor([-0.5734,  0.0641])
token 3 average: tensor([-0.5995, -0.1993])


For loops are slow, so now we will do it a lot faster using [matrix multiplication](http://matrixmultiplication.xyz/). Here the approach is shown via an example.

In [39]:
a = torch.tril(torch.ones(3,3))
a = a/torch.sum(a,1,keepdim=True);print("a:");print(a)
b = torch.randint(0,10,(3,2)).float();print("b:");print(b)
c = a @ b;print("c:");print(c)

a:
tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
b:
tensor([[1., 0.],
        [3., 6.],
        [4., 6.]])
c:
tensor([[1.0000, 0.0000],
        [2.0000, 3.0000],
        [2.6667, 4.0000]])


Replacing the for-loop with matrix multiplications.

In [40]:
# Version 2: Calculating average of all previous tokens
wei = torch.tril(torch.ones(T,T))
wei = wei / wei.sum(1, keepdim=True)  # each row averages over the current and previous tokens
xbow2 = wei @ x  # (T, T) @ (B, T, C) --> (B, T, C)
torch.allclose(xbow, xbow2)

True

Because we will be implementing a more advanced attention mechanism, we create a third version that computes the same result using softmax.

In [41]:
# Version 3: Calculating average of all previous tokens
tril = torch.tril(torch.ones(T,T)) 
wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
xbow3 = wei @ x;xbow3
torch.allclose(xbow, xbow3)

True
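
The softmax form is useful because the zeros in `wei` can be replaced by arbitrary scores. As a small sketch, assuming random scores as a stand-in for the learned affinities that come later, the masked softmax still gives each row a valid weighting over the current and previous positions only:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T = 4
tril = torch.tril(torch.ones(T, T))

# Hypothetical non-zero affinities between positions (random here, learned later)
scores = torch.randn(T, T)
scores = scores.masked_fill(tril == 0, float('-inf'))  # hide future positions
weights = F.softmax(scores, dim=-1)                    # rows sum to 1
print(weights)
```

The weights are no longer uniform, so each position can weigh some earlier tokens more than others, while future positions still receive exactly zero weight.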

# Neural Network: Part II
Here we make some adjustments to the BigramLanguageModel as well as add a positional embedding and implement the self-attention block.

In [42]:
###GPTHEAD###
#################
### Libraries ###
#################
import torch
import torch.nn as nn
import torch.nn.functional as F
device = 'cuda' if torch.cuda.is_available() else 'cpu'  # is_available must be called
device = 'cpu'  # force CPU for this notebook

In [43]:
###GPTHEAD###
#######################
### Hyperparameters ###
#######################
batch_size = 32
block_size = 8
max_steps = 10000
eval_interval = 300
learning_rate = 1e-3
eval_iters = 200
n_embed = 32 

In [44]:
###GPTHEAD###
############
### Data ###
############

# Reading Data
with open("../../data/tinyshakespeare.txt", "r", encoding="utf-8") as f:
    text = f.read()

# All unique characters
chars = sorted(list(set(text)))
vocab_size = len(chars)

# Building the vocabulary
ctoi = {s:i for i,s in enumerate(chars)}
itoc = {i:s for s,i in ctoi.items()}

# Building encoder/decoder
encode = lambda s: [ctoi[c] for c in s]
decode = lambda l: ''.join([itoc[i] for i in l])

# Tokenizing dataset
data = torch.tensor(encode(text), dtype=torch.long)

# Train/Valid Split
n = int(0.9*len(data))
data_train = data[:n]
data_valid = data[n:]

# Gets Random Batches
def get_batch(split: str, batch_size: int, block_size: int) -> tuple:
    """
    Description:
        Generates a batch of data of inputs x and targets y.
    Inputs:
        split: 'train' or 'valid' split
        batch_size: How many independent sequences will be processed in parallel
        block_size: Maximum context length
    Outputs:
        x, y: a tuple with xs and ys
    """
    data = data_train if split == 'train' else data_valid
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x,y = x.to(device), y.to(device)
    return x, y

We add a linear layer and positional embeddings to the language model. We also complete the self-attention implementation, which is explained in more detail [here](https://jalammar.github.io/illustrated-transformer/).
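
Before the attention part, the embedding step can be sketched on its own. This is a shape-only illustration using the sizes from this notebook; the variable names are chosen for the sketch and the layers are untrained.

```python
import torch
import torch.nn as nn

vocab_size, block_size, n_embed = 65, 8, 32  # values used in this notebook
B, T = 4, 8

token_embedding = nn.Embedding(vocab_size, n_embed)     # what the token is
position_embedding = nn.Embedding(block_size, n_embed)  # where the token is
lm_head = nn.Linear(n_embed, vocab_size)                # maps back to vocabulary logits

idx = torch.randint(vocab_size, (B, T))
tok_emb = token_embedding(idx)                 # (B, T, n_embed)
pos_emb = position_embedding(torch.arange(T))  # (T, n_embed)
x = tok_emb + pos_emb                          # broadcasts to (B, T, n_embed)
logits = lm_head(x)                            # (B, T, vocab_size)
print(x.shape, logits.shape)
```

The linear layer is needed because the embedding width `n_embed` no longer equals `vocab_size`, so the embeddings cannot be read directly as logits.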

In [45]:
# Version 4: Self-Attention Head
# Sample
B,T,C = 4,8,32
x = torch.randn(B,T,C)

In [46]:
# Single Head of Self-Attention
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k = key(x) # (B, T, head_size)
q = query(x) # (B, T, head_size)
v = value(x) # (B, T, head_size)
wei = q @ k.transpose(-2, -1) # (B, T, head_size) @ (B, head_size, T) --> (B, T, T)

In [47]:
# Self-Attention
tril = torch.tril(torch.ones(T,T)) 
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
out = wei @ v;out.shape

torch.Size([4, 8, 16])

Now we write the attention code for one head into a class.

In [48]:
###GPTHEAD###
#############
### Model ###
#############
class Head(nn.Module):
    """ One head of self-attention """
    
    def __init__(self, head_size):
        """ Creating three linear layers and a mask """
        super().__init__()
        self.key = nn.Linear(n_embed, head_size, bias=False)
        self.query = nn.Linear(n_embed, head_size, bias=False)
        self.value = nn.Linear(n_embed, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
    
    def forward(self, x):
        """ Attention calculation """
        B,T,C = x.shape
        k = self.key(x)
        q = self.query(x)
        
        # Compute attention scores, scaled by 1/sqrt(head_size) as in the paper
        wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5 # keeps the variance of the scores near 1
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        
        v = self.value(x)
        out = wei @ v
        return out
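
The effect of the scaling factor can be checked numerically. The following is a rough sketch, assuming unit-variance random queries and keys and an arbitrary `head_size` of 64 chosen only for illustration:

```python
import torch

torch.manual_seed(0)
head_size = 64  # hypothetical key/query dimension for illustration
q = torch.randn(1000, 16, head_size)
k = torch.randn(1000, 16, head_size)

raw = q @ k.transpose(-2, -1)   # each score sums head_size products, so variance grows
scaled = raw * head_size**-0.5  # scaling brings the variance back to about 1
print(raw.var().item(), scaled.var().item())
```

Without the scaling, large scores would push the softmax towards a near one-hot distribution, so each token would attend to almost a single other token.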

In [49]:
###GPTHEAD###
class BigramLanguageModel(nn.Module):
    
    def __init__(self):
        """ Creating Layers """
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embed)
        self.position_embedding_table = nn.Embedding(block_size, n_embed)
        self.sa_head = Head(n_embed)
        self.lm_head = nn.Linear(n_embed, vocab_size)
    
    def forward(self, idx: torch.tensor, targets: torch.tensor=None) -> tuple:
        """ Calculating the Loss """
        B, T = idx.shape
        tok_emb = self.token_embedding_table(idx) # (B,T,embed_size)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb
        x = self.sa_head(x)
        logits = self.lm_head(x) # (B,T,vocab_size)
        
        if targets is None:
            loss = None
        else:
            # F.cross_entropy expects (N, C) logits and (N,) targets, so we flatten batch and time
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
        
        return logits, loss

    def generate(self, idx: torch.tensor, max_new_tokens: int) -> torch.tensor:
        """ Generates Tokens Using a Sliding Window """
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx if idx.size(1) <= block_size else idx[:, -block_size:]
            
            # get the predictions
            logits, loss = self(idx_cond)
            
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx
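
The cropping step in `generate` is needed because `position_embedding_table` only has `block_size` rows, so the model cannot embed positions beyond that. A minimal sketch of the crop on a toy sequence (the token ids are just `0..11` for readability):

```python
import torch

block_size = 8
# A running sequence of 12 token ids (batch of 1), longer than block_size
idx = torch.arange(12).unsqueeze(0)  # shape (1, 12)

# Keep only the last block_size tokens as context
idx_cond = idx if idx.size(1) <= block_size else idx[:, -block_size:]
print(idx_cond)
```

The full sequence `idx` keeps growing during generation; only the context fed to the model is limited to the most recent `block_size` tokens.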

In [50]:
###GPTHEAD###
# Function for estimating loss
@torch.no_grad()
def estimate_loss():
    """
    Description:
        Estimates losses on train and valid
    Outputs:
        out: Mean loss across eval_iters items
    """
    out = {}
    model.eval()
    for split in ['train', 'valid']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X,Y = get_batch(split, batch_size, block_size)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

In [51]:
###GPTHEAD###
# Creating model
model = BigramLanguageModel()
model = model.to(device)

In [52]:
###GPTHEAD###
################
### Training ###
################

# Creating PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

# Training loop
for step in range(max_steps):
    
    # Once in a while evaluate loss on train and valid sets
    if step % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {step}: train loss {losses['train']:.4f}, valid loss {losses['valid']:.4f}")
    
    # Sample a batch of data
    xb, yb = get_batch('train', batch_size, block_size)
    
    # Evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

step 0: train loss 4.2140, valid loss 4.2121
step 300: train loss 2.8634, valid loss 2.8531
step 600: train loss 2.6230, valid loss 2.6169
step 900: train loss 2.5330, valid loss 2.5386
step 1200: train loss 2.4936, valid loss 2.5163
step 1500: train loss 2.4777, valid loss 2.4781
step 1800: train loss 2.4550, valid loss 2.4716
step 2100: train loss 2.4433, valid loss 2.4556
step 2400: train loss 2.4320, valid loss 2.4438
step 2700: train loss 2.4164, valid loss 2.4408
step 3000: train loss 2.4226, valid loss 2.4274
step 3300: train loss 2.4227, valid loss 2.4269
step 3600: train loss 2.4070, valid loss 2.4214
step 3900: train loss 2.3984, valid loss 2.4262
step 4200: train loss 2.3968, valid loss 2.4151
step 4500: train loss 2.3927, valid loss 2.4037
step 4800: train loss 2.3840, valid loss 2.4040
step 5100: train loss 2.3821, valid loss 2.4080
step 5400: train loss 2.3850, valid loss 2.3969
step 5700: train loss 2.3812, valid loss 2.4014
step 6000: train loss 2.3860, valid loss 2.387

In [53]:
###GPTHEAD###
##################
### Generating ###
##################
# Generating some text
model_input = torch.zeros((1,1), dtype=torch.long) # Input token 0, which is \n
model_output = model.generate(model_input, max_new_tokens=300)[0].tolist()
print(f"Generated Text: \n {decode(model_output)}")

Generated Text: 
 
Whome rant, orthileakisd wonethe en,
As wheranen mout at ult was fin y-che borveve, at lcot meset y urd, ars of warve itunamp!
An a t I Hencad bous be's haleat by
I mier's we se youlll by ath caset, wer hanul wand ayou t
IMit thoure hith.
HAncesysel, tred, my s, half akacatonndel pos, moughm
Thidean


# Self-Attention: Part II
Now we add multiple attention heads to our model. Multi-head attention runs the data through several attention heads in parallel and concatenates the results. We implement this below, after repeating the boilerplate code.

In [54]:
###GPTMULTIHEAD###
#################
### Libraries ###
#################
import torch
import torch.nn as nn
import torch.nn.functional as F
device = 'cuda' if torch.cuda.is_available() else 'cpu'  # is_available must be called
device = 'cpu'  # force CPU for this notebook

In [55]:
###GPTMULTIHEAD###
#######################
### Hyperparameters ###
#######################
batch_size = 32
block_size = 8
max_steps = 10000
eval_interval = 300
learning_rate = 1e-3
eval_iters = 200
n_embed = 32 

In [56]:
###GPTMULTIHEAD###
############
### Data ###
############

# Reading Data
with open("../../data/tinyshakespeare.txt", "r", encoding="utf-8") as f:
    text = f.read()

# All unique characters
chars = sorted(list(set(text)))
vocab_size = len(chars)

# Building the vocabulary
ctoi = {s:i for i,s in enumerate(chars)}
itoc = {i:s for s,i in ctoi.items()}

# Building encoder/decoder
encode = lambda s: [ctoi[c] for c in s]
decode = lambda l: ''.join([itoc[i] for i in l])

# Tokenizing dataset
data = torch.tensor(encode(text), dtype=torch.long)

# Train/Valid Split
n = int(0.9*len(data))
data_train = data[:n]
data_valid = data[n:]

# Gets Random Batches
def get_batch(split: str, batch_size: int, block_size: int) -> tuple:
    """
    Description:
        Generates a batch of data of inputs x and targets y.
    Inputs:
        split: 'train' or 'valid' split
        batch_size: How many independent sequences will be processed in parallel
        block_size: Maximum context length
    Outputs:
        x, y: a tuple with xs and ys
    """
    data = data_train if split == 'train' else data_valid
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x,y = x.to(device), y.to(device)
    return x, y

In [57]:
###GPTMULTIHEAD###
#############
### Model ###
#############
class Head(nn.Module):
    """ One head of self-attention """
    
    def __init__(self, head_size):
        """ Creating three linear layers and a mask """
        super().__init__()
        self.key = nn.Linear(n_embed, head_size, bias=False)
        self.query = nn.Linear(n_embed, head_size, bias=False)
        self.value = nn.Linear(n_embed, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
    
    def forward(self, x):
        """ Attention calculation """
        B,T,C = x.shape
        k = self.key(x)
        q = self.query(x)
        
        # Compute attention scores, scaled by 1/sqrt(head_size) as in the paper
        wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5 # keeps the variance of the scores near 1
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        
        v = self.value(x)
        out = wei @ v
        return out

Here we create the multi-head class, which simply creates several independent heads and concatenates their results.

In [58]:
###GPTMULTIHEAD###
class MultiHeadAttention(nn.Module):
    """ Multiple heads of self-attention in parallel """
    
    def __init__(self, num_heads, head_size):
        """ Multiple heads in parallel """
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        
    def forward(self, x):
        """ Calculating and Concatenating Results """
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        return out
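
The concatenation only restores the embedding width if `num_heads * head_size` equals `n_embed`. A shape-only sketch, using random tensors as stand-ins for the outputs of the individual heads:

```python
import torch

B, T = 4, 8
n_embed, num_heads = 32, 4
head_size = n_embed // num_heads  # 8 channels per head

# Stand-ins for the outputs of the individual heads, each (B, T, head_size)
head_outputs = [torch.randn(B, T, head_size) for _ in range(num_heads)]

out = torch.cat(head_outputs, dim=-1)  # (B, T, num_heads * head_size)
print(out.shape)
```

With 4 heads of size 8, the output has the same 32 channels as a single head of size 32, but each head can learn a different attention pattern.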

Finally we make minor adjustments to the BigramLanguageModel such that it uses the multi-headed attention during training and inference.

In [59]:
###GPTMULTIHEAD###
class BigramLanguageModel(nn.Module):
    
    def __init__(self):
        """ Creating Layers """
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embed)
        self.position_embedding_table = nn.Embedding(block_size, n_embed)
        self.sa_heads = MultiHeadAttention(4, n_embed//4)
        self.lm_head = nn.Linear(n_embed, vocab_size)
    
    def forward(self, idx: torch.tensor, targets: torch.tensor=None) -> tuple:
        """ Calculating the Loss """
        B, T = idx.shape
        tok_emb = self.token_embedding_table(idx) # (B,T,embed_size)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb
        x = self.sa_heads(x)
        logits = self.lm_head(x) # (B,T,vocab_size)
        
        if targets is None:
            loss = None
        else:
            # F.cross_entropy expects (N, C) logits and (N,) targets, so we flatten batch and time
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
        
        return logits, loss

    def generate(self, idx: torch.tensor, max_new_tokens: int) -> torch.tensor:
        """ Generates Tokens Using a Sliding Window """
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx if idx.size(1) <= block_size else idx[:, -block_size:]
            
            # get the predictions
            logits, loss = self(idx_cond)
            
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

The training loop and evaluation are the same.

In [60]:
###GPTMULTIHEAD###
# Function for estimating loss
@torch.no_grad()
def estimate_loss():
    """
    Description:
        Estimates losses on train and valid
    Outputs:
        out: Mean loss across eval_iters items
    """
    out = {}
    model.eval()
    for split in ['train', 'valid']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X,Y = get_batch(split, batch_size, block_size)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

In [61]:
###GPTMULTIHEAD###
# Creating model
model = BigramLanguageModel()
model = model.to(device)

In [62]:
###GPTMULTIHEAD###
################
### Training ###
################

# Creating PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

# Training loop
for step in range(max_steps):
    
    # Once in a while evaluate loss on train and valid sets
    if step % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {step}: train loss {losses['train']:.4f}, valid loss {losses['valid']:.4f}")
    
    # Sample a batch of data
    xb, yb = get_batch('train', batch_size, block_size)
    
    # Evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

step 0: train loss 4.3186, valid loss 4.3102
step 300: train loss 2.8007, valid loss 2.8097
step 600: train loss 2.6013, valid loss 2.5953
step 900: train loss 2.5277, valid loss 2.5124
step 1200: train loss 2.4651, valid loss 2.4687
step 1500: train loss 2.4275, valid loss 2.4369
step 1800: train loss 2.3942, valid loss 2.4200
step 2100: train loss 2.3778, valid loss 2.3951
step 2400: train loss 2.3644, valid loss 2.3788
step 2700: train loss 2.3460, valid loss 2.3672
step 3000: train loss 2.3289, valid loss 2.3503
step 3300: train loss 2.3180, valid loss 2.3472
step 3600: train loss 2.2994, valid loss 2.3298
step 3900: train loss 2.2971, valid loss 2.3305
step 4200: train loss 2.2800, valid loss 2.3143
step 4500: train loss 2.2730, valid loss 2.3195
step 4800: train loss 2.2759, valid loss 2.2976
step 5100: train loss 2.2641, valid loss 2.2924
step 5400: train loss 2.2364, valid loss 2.2921
step 5700: train loss 2.2404, valid loss 2.2723
step 6000: train loss 2.2363, valid loss 2.282
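As a quick sanity check on the first line of the log, the loss of a model that assigns equal probability to every character can be computed by hand. The sketch below assumes the vocabulary size of 65 reported earlier in the notebook; `uniform_loss` is a name introduced here for illustration only.

```python
import math

vocab_size = 65  # character vocabulary size reported earlier in the notebook

# Cross-entropy of a uniform prediction over the vocabulary: -ln(1 / vocab_size)
uniform_loss = -math.log(1.0 / vocab_size)
print(f"Uniform-guess loss: {uniform_loss:.4f}")  # about 4.174
```

The step 0 losses in the log sit close to this value, which is what one would expect from an untrained model.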

In [63]:
###GPTMULTIHEAD###
##################
### Generating ###
##################
# Generating some text
model_input = torch.zeros((1,1), dtype=torch.long, device=device) # Input token 0, which is \n
model_output = model.generate(model_input, max_new_tokens=300)[0].tolist()
print(f"Generated Text: \n {decode(model_output)}")

Generated Text: 
 
Gite?

Thasefardwe thor's orke the,-ht ce an 'fll cals me. Nurve thou thehaprichis mach yous blo Gray sing bray, siggsire mary swe teras det; his loow? as bavil tall rett swen fall I Ond nonot ist law's slare.

KING I'TO:
Whear?

Of I st:
Hot may.

Nock? dadd Butwosen.

QUEENGRE: there no,, jus; tha


# Adding Non-Linearity
So far we have only added attention, without any activation functions between the layers. Here we add non-linearities, but first comes all the boilerplate code.
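To see why an activation function matters, note that two linear layers applied back to back are equivalent to a single linear layer. The sketch below is only an illustration with made-up sizes; `lin1`, `lin2` and the other names are hypothetical and not part of the model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lin1 = nn.Linear(32, 32, bias=False)
lin2 = nn.Linear(32, 32, bias=False)
x = torch.randn(4, 32)

# Two stacked linear layers ...
stacked = lin2(lin1(x))
# ... equal one linear map with weight W2 @ W1, since nn.Linear computes x @ W.T
merged = x @ (lin2.weight @ lin1.weight).T
linear_collapses = torch.allclose(stacked, merged, atol=1e-5)

# With a ReLU in between, the stack no longer reduces to that single matrix
with_relu = lin2(torch.relu(lin1(x)))
relu_differs = not torch.allclose(with_relu, merged, atol=1e-5)

print(linear_collapses, relu_differs)
```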

In [64]:
###GPTNONLINEARITY###
#################
### Libraries ###
#################
import torch
import torch.nn as nn
import torch.nn.functional as F
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device = 'cpu' # Overridden: this notebook runs on the CPU

In [65]:
###GPTNONLINEARITY###
#######################
### Hyperparameters ###
#######################
batch_size = 32
block_size = 8
max_steps = 10000
eval_interval = 300
learning_rate = 1e-3
eval_iters = 200
n_embed = 32 

In [66]:
###GPTNONLINEARITY###
############
### Data ###
############

# Reading Data
with open("../../data/tinyshakespeare.txt", "r", encoding="utf-8") as f:
    text = f.read()

# All unique characters
chars = sorted(list(set(text)))
vocab_size = len(chars)

# Building the vocabulary
ctoi = {s:i for i,s in enumerate(chars)}
itoc = {i:s for s,i in ctoi.items()}

# Building encoder/decoder
encode = lambda s: [ctoi[c] for c in s]
decode = lambda l: ''.join([itoc[i] for i in l])

# Tokenizing dataset
data = torch.tensor(encode(text), dtype=torch.long)

# Train/Valid Split
n = int(0.9*len(data))
data_train = data[:n]
data_valid = data[n:]

# Gets Random Batches
def get_batch(split: str, batch_size: int, block_size: int) -> tuple:
    """
    Description:
        Generates a batch of data of inputs x and targets y.
    Inputs:
        split: 'train' or 'valid' split
        batch_size: How many independent sequences will be processed in parallel
        block_size: Maximum context length
    Outputs:
        x, y: a tuple with xs and ys
    """
    data = data_train if split == 'train' else data_valid
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x,y = x.to(device), y.to(device)
    return x, y

In [67]:
###GPTNONLINEARITY###
#############
### Model ###
#############
class Head(nn.Module):
    """ One head of self-attention """
    
    def __init__(self, head_size):
        """ Creating three linear layers and a mask """
        super().__init__()
        self.key = nn.Linear(n_embed, head_size, bias=False)
        self.query = nn.Linear(n_embed, head_size, bias=False)
        self.value = nn.Linear(n_embed, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
    
    def forward(self, x):
        """ Attention calculation """
        B,T,C = x.shape
        k = self.key(x)
        q = self.query(x)
        
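        # Compute attention scores, scaled to keep their variance in check before the softmax.
        # Note: C is n_embed here, whereas the paper scales by the key dimension (head_size).
        wei = q @ k.transpose(-2, -1) * C**-0.5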
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        
        v = self.value(x)
        out = wei @ v
        return out
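The scaling comment in `Head` refers to the variance argument from the attention paper: a dot product of two unit-variance vectors of dimension d has variance of about d, so the scores are multiplied by d^-0.5. The sketch below checks this numerically; `head_size = 16` and the sample count are arbitrary values chosen for illustration.

```python
import torch

torch.manual_seed(0)
head_size = 16                      # hypothetical key/query dimension
q = torch.randn(1000, head_size)    # unit-variance queries
k = torch.randn(1000, head_size)    # unit-variance keys

raw = q @ k.T                       # each entry sums head_size products
scaled = raw * head_size**-0.5      # scaled dot-product scores

raw_var = raw.var().item()          # close to head_size
scaled_var = scaled.var().item()    # close to 1
print(f"raw variance: {raw_var:.2f}, scaled variance: {scaled_var:.2f}")
```

Keeping the scores near unit variance matters because a softmax over widely spread values tends towards a one-hot distribution.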

In [68]:
###GPTNONLINEARITY###
class MultiHeadAttention(nn.Module):
    """ Multiple heads of self-attention in parallel """
    
    def __init__(self, num_heads, head_size):
        """ Multiple heads in parallel """
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        
    def forward(self, x):
        """ Calculating and Concatenating Results """
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        return out

Here we add the FeedForward class, which has a linear layer followed by the ReLU non-linearity.

In [69]:
###GPTNONLINEARITY###
class FeedForward(nn.Module):
    """ A linear layer followed by a non-linearity """

    def __init__(self, n_embed):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embed, n_embed),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)

In [70]:
###GPTNONLINEARITY###
class BigramLanguageModel(nn.Module):
    
    def __init__(self):
        """ Creating Layers """
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embed)
        self.position_embedding_table = nn.Embedding(block_size, n_embed)
        self.sa_heads = MultiHeadAttention(4, n_embed//4)
        self.ffw = FeedForward(n_embed)
        self.lm_head = nn.Linear(n_embed, vocab_size)
    
    def forward(self, idx: torch.tensor, targets: torch.tensor=None) -> tuple:
        """ Calculating the Loss """
        B, T = idx.shape
        tok_emb = self.token_embedding_table(idx) # (B,T,embed_size)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb
        x = self.sa_heads(x)
        x = self.ffw(x)
        logits = self.lm_head(x) # (B,T,vocab_size)
        
        if targets is None:
            loss = None
        else:
            # F.cross_entropy expects the class dimension second, so flatten (B, T, C) to (B*T, C)
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
        
        return logits, loss

    def generate(self, idx: torch.tensor, max_new_tokens: int) -> torch.tensor:
        """ Generates Tokens Using a Sliding Window """
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx if idx.size(1) <= block_size else idx[:, -block_size:]
            
            # get the predictions
            logits, loss = self(idx_cond)
            
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

In [71]:
###GPTNONLINEARITY###
# Function for estimating loss
@torch.no_grad()
def estimate_loss():
    """
    Description:
        Estimates losses on train and valid
    Outputs:
        out: Mean loss across eval_iters items
    """
    out = {}
    model.eval()
    for split in ['train', 'valid']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X,Y = get_batch(split, batch_size, block_size)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

In [72]:
###GPTNONLINEARITY###
# Creating model
model = BigramLanguageModel()
model = model.to(device)

In [73]:
###GPTNONLINEARITY###
################
### Training ###
################

# Creating PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

# Training loop
for step in range(max_steps):
    
    # Once in a while evaluate loss on train and valid sets
    if step % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {step}: train loss {losses['train']:.4f}, valid loss {losses['valid']:.4f}")
    
    # Sample a batch of data
    xb, yb = get_batch('train', batch_size, block_size)
    
    # Evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

step 0: train loss 4.1700, valid loss 4.1720
step 300: train loss 2.7750, valid loss 2.7878
step 600: train loss 2.5964, valid loss 2.6003
step 900: train loss 2.5114, valid loss 2.5038
step 1200: train loss 2.4553, valid loss 2.4620
step 1500: train loss 2.4258, valid loss 2.4423
step 1800: train loss 2.3927, valid loss 2.4084
step 2100: train loss 2.3843, valid loss 2.3806
step 2400: train loss 2.3601, valid loss 2.3699
step 2700: train loss 2.3435, valid loss 2.3580
step 3000: train loss 2.3271, valid loss 2.3368
step 3300: train loss 2.3180, valid loss 2.3239
step 3600: train loss 2.3051, valid loss 2.3175
step 3900: train loss 2.2926, valid loss 2.3072
step 4200: train loss 2.2750, valid loss 2.3021
step 4500: train loss 2.2735, valid loss 2.2964
step 4800: train loss 2.2559, valid loss 2.2718
step 5100: train loss 2.2501, valid loss 2.2915
step 5400: train loss 2.2489, valid loss 2.2761
step 5700: train loss 2.2311, valid loss 2.2711
step 6000: train loss 2.2245, valid loss 2.270

In [74]:
###GPTNONLINEARITY###
##################
### Generating ###
##################
# Generating some text
model_input = torch.zeros((1,1), dtype=torch.long, device=device) # Input token 0, which is \n
model_output = model.generate(model_input, max_new_tokens=300)[0].tolist()
print(f"Generated Text: \n {decode(model_output)}")

Generated Text: 
 
FPRDI:
Gould; the your'd bereak.

DUCEN NO:
Wir for hid mie my
They tor the, bus,
As wit his thoumy burd whouske.

CABESTIN
No nou of I st se
Whe wou.

KE MET:
A't. 'BEY:
I thryess theme.

Mome, andou,
Ages.

FERBIZIS:
He, seU ME vop lillt harsast 'st pill not, lad wo's pisus!

BUCILONTESBEO:
Wo ait


# Creating Transformer Block
We simply wrap the attention and feed-forward code in a block, so that we can easily stack multiple transformer layers, as the model below does with four blocks.
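Stacking works because a block maps a tensor of shape (B, T, C) to a tensor of the same shape, so the output of one block can feed the next. The sketch below shows the pattern with a hypothetical stand-in, `TinyBlock`, which is not the notebook's `Block` class; the sizes are also made up.

```python
import torch
import torch.nn as nn

class TinyBlock(nn.Module):
    """ Hypothetical stand-in for a transformer block: (B, T, C) -> (B, T, C) """

    def __init__(self, n_embed):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_embed, n_embed), nn.ReLU())

    def forward(self, x):
        return self.net(x)

n_embed, n_layer = 32, 4
# Because each block preserves the shape, any number of them can be chained
blocks = nn.Sequential(*[TinyBlock(n_embed) for _ in range(n_layer)])

x = torch.randn(2, 8, n_embed)  # (B, T, C)
y = blocks(x)
print(x.shape, y.shape)
```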

In [75]:
###GPTMULTILAYER###
#################
### Libraries ###
#################
import torch
import torch.nn as nn
import torch.nn.functional as F
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device = 'cpu' # Overridden: this notebook runs on the CPU

In [76]:
###GPTMULTILAYER###
#######################
### Hyperparameters ###
#######################
batch_size = 32
block_size = 8
max_steps = 10000
eval_interval = 300
learning_rate = 1e-3
eval_iters = 200
n_embed = 32 

In [77]:
###GPTMULTILAYER###
############
### Data ###
############

# Reading Data
with open("../../data/tinyshakespeare.txt", "r", encoding="utf-8") as f:
    text = f.read()

# All unique characters
chars = sorted(list(set(text)))
vocab_size = len(chars)

# Building the vocabulary
ctoi = {s:i for i,s in enumerate(chars)}
itoc = {i:s for s,i in ctoi.items()}

# Building encoder/decoder
encode = lambda s: [ctoi[c] for c in s]
decode = lambda l: ''.join([itoc[i] for i in l])

# Tokenizing dataset
data = torch.tensor(encode(text), dtype=torch.long)

# Train/Valid Split
n = int(0.9*len(data))
data_train = data[:n]
data_valid = data[n:]

# Gets Random Batches
def get_batch(split: str, batch_size: int, block_size: int) -> tuple:
    """
    Description:
        Generates a batch of data of inputs x and targets y.
    Inputs:
        split: 'train' or 'valid' split
        batch_size: How many independent sequences will be processed in parallel
        block_size: Maximum context length
    Outputs:
        x, y: a tuple with xs and ys
    """
    data = data_train if split == 'train' else data_valid
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x,y = x.to(device), y.to(device)
    return x, y

In [78]:
###GPTMULTILAYER###
#############
### Model ###
#############
class Head(nn.Module):
    """ One head of self-attention """
    
    def __init__(self, head_size):
        """ Creating three linear layers and a mask """
        super().__init__()
        self.key = nn.Linear(n_embed, head_size, bias=False)
        self.query = nn.Linear(n_embed, head_size, bias=False)
        self.value = nn.Linear(n_embed, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
    
    def forward(self, x):
        """ Attention calculation """
        B,T,C = x.shape
        k = self.key(x)
        q = self.query(x)
        
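        # Compute attention scores, scaled to keep their variance in check before the softmax.
        # Note: C is n_embed here, whereas the paper scales by the key dimension (head_size).
        wei = q @ k.transpose(-2, -1) * C**-0.5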
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        
        v = self.value(x)
        out = wei @ v
        return out

In [79]:
###GPTMULTILAYER###
class MultiHeadAttention(nn.Module):
    """ Multiple heads of self-attention in parallel """
    
    def __init__(self, num_heads, head_size):
        """ Multiple heads in parallel """
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        
    def forward(self, x):
        """ Calculating and Concatenating Results """
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        return out

In [80]:
###GPTMULTILAYER###
class FeedForward(nn.Module):
    """ A linear layer followed by a non-linearity """

    def __init__(self, n_embed):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embed, n_embed),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)

In [81]:
###GPTMULTILAYER###
class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embed, n_head):
        """ Transformer block """
        super().__init__()
        head_size = n_embed // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffw = FeedForward(n_embed)

    def forward(self, x):
        """ Applying attention followed by ffw to X """
        x = self.sa(x)
        x = self.ffw(x)
        return x

In [82]:
###GPTMULTILAYER###
class BigramLanguageModel(nn.Module):
    
    def __init__(self):
        """ Creating Layers """
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embed)
        self.position_embedding_table = nn.Embedding(block_size, n_embed)
        self.blocks = nn.Sequential(
            Block(n_embed, n_head=4),
            Block(n_embed, n_head=4),
            Block(n_embed, n_head=4),
            Block(n_embed, n_head=4),
        )
        self.lm_head = nn.Linear(n_embed, vocab_size)
    
    def forward(self, idx: torch.tensor, targets: torch.tensor=None) -> tuple:
        """ Calculating the Loss """
        B, T = idx.shape
        tok_emb = self.token_embedding_table(idx) # (B,T,embed_size)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb
        x = self.blocks(x)
        logits = self.lm_head(x) # (B,T,vocab_size)
        
        if targets is None:
            loss = None
        else:
            # F.cross_entropy expects the class dimension second, so flatten (B, T, C) to (B*T, C)
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
        
        return logits, loss

    def generate(self, idx: torch.tensor, max_new_tokens: int) -> torch.tensor:
        """ Generates Tokens Using a Sliding Window """
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx if idx.size(1) <= block_size else idx[:, -block_size:]
            
            # get the predictions
            logits, loss = self(idx_cond)
            
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

In [83]:
###GPTMULTILAYER###
# Function for estimating loss
@torch.no_grad()
def estimate_loss():
    """
    Description:
        Estimates losses on train and valid
    Outputs:
        out: Mean loss across eval_iters items
    """
    out = {}
    model.eval()
    for split in ['train', 'valid']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X,Y = get_batch(split, batch_size, block_size)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

In [84]:
###GPTMULTILAYER###
# Creating model
model = BigramLanguageModel()
model = model.to(device)

In [85]:
###GPTMULTILAYER###
################
### Training ###
################

# Creating PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

# Training loop
for step in range(max_steps):
    
    # Once in a while evaluate loss on train and valid sets
    if step % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {step}: train loss {losses['train']:.4f}, valid loss {losses['valid']:.4f}")
    
    # Sample a batch of data
    xb, yb = get_batch('train', batch_size, block_size)
    
    # Evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

step 0: train loss 4.2121, valid loss 4.2139
step 300: train loss 3.2488, valid loss 3.2867
step 600: train loss 3.1272, valid loss 3.1155
step 900: train loss 2.8783, valid loss 2.8771
step 1200: train loss 2.7434, valid loss 2.7289
step 1500: train loss 2.6651, valid loss 2.6661
step 1800: train loss 2.6052, valid loss 2.5962
step 2100: train loss 2.5637, valid loss 2.5549
step 2400: train loss 2.5275, valid loss 2.5247
step 2700: train loss 2.4861, valid loss 2.4985
step 3000: train loss 2.4691, valid loss 2.4583
step 3300: train loss 2.4530, valid loss 2.4346
step 3600: train loss 2.4373, valid loss 2.4401
step 3900: train loss 2.3983, valid loss 2.4056
step 4200: train loss 2.4049, valid loss 2.3902
step 4500: train loss 2.3889, valid loss 2.3864
step 4800: train loss 2.3644, valid loss 2.3709
step 5100: train loss 2.3600, valid loss 2.3512
step 5400: train loss 2.3526, valid loss 2.3407
step 5700: train loss 2.3404, valid loss 2.3582
step 6000: train loss 2.3376, valid loss 2.329

In [86]:
###GPTMULTILAYER###
##################
### Generating ###
##################
# Generating some text
model_input = torch.zeros((1,1), dtype=torch.long, device=device) # Input token 0, which is \n
model_output = model.generate(model_input, max_new_tokens=300)[0].tolist()
print(f"Generated Text: \n {decode(model_output)}")

Generated Text: 
 
A beate of.
Andot ca? The, mings: mollavavot thee meast undel!' Circoveen this mor hasher, deoms for, why the bues butfer wown here henplise and the wave wity vate to feraths, where boires thow di teare you reith are,
Booy ils ous;
Wel dise-arands manow hat ir arlh, I dint tigon: blold, thins maint 


# Adding Residual Connections, Normalization and Dropout
Here we add three techniques used in the transformer paper:

* Residual Connection
  * The input of a sub-layer is added to its output, i.e. two tensors of the same shape are summed. This is done twice in the **Block** class.
* Layer Normalization
  * It is very similar to batch normalization in that it ensures zero mean and unit variance, but it normalizes rows (the features of each token) instead of columns (each feature across the batch). We add it in **Block**, before each sub-layer, and once more before the final linear layer.
* Dropout
  * Randomly zeroes some of the activations during training, which prevents some of the nodes from communicating. This has a regularizing effect, as the network is forced to make more robust connections. Implemented in **Head**, **MultiHeadAttention** and **FeedForward**.
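The three techniques listed above can each be checked in a few lines. The sketch below is an illustration with made-up tensor sizes; `sublayer` is a hypothetical stand-in for attention or the feed-forward network, not code from the model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 8, 32)  # (B, T, C), hypothetical sizes

# Layer normalization: every (b, t) row is normalized over its C features
ln = nn.LayerNorm(32)
y = ln(x)
row_mean = y.mean(dim=-1)                # approximately 0 for every row
row_var = y.var(dim=-1, unbiased=False)  # approximately 1 for every row

# Dropout: zeroes entries in train mode and rescales the survivors
drop = nn.Dropout(0.2)
drop.train()
dropped = drop(torch.ones(10000))
zero_frac = (dropped == 0).float().mean().item()  # roughly 0.2
kept_value = dropped.max().item()                 # 1 / (1 - 0.2) = 1.25

# In eval mode dropout is the identity
drop.eval()
same_in_eval = torch.equal(drop(x), x)

# Residual connection: the (pre-normalized) sub-layer output is added to its input
sublayer = nn.Linear(32, 32)  # hypothetical stand-in for attention / feed-forward
out = x + sublayer(ln(x))

print(row_mean.abs().max().item(), zero_frac, kept_value, same_in_eval, out.shape)
```

This is also why `estimate_loss` switches to `model.eval()` before measuring the loss: dropout should only be active while training.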

In [87]:
###GPTCONNECT###
#################
### Libraries ###
#################
import torch
import torch.nn as nn
import torch.nn.functional as F
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device = 'cpu' # Overridden: this notebook runs on the CPU

In [88]:
###GPTCONNECT###
#######################
### Hyperparameters ###
#######################
batch_size = 32
block_size = 16
max_steps = 10000
eval_interval = 300
learning_rate = 3e-3
eval_iters = 200
n_head = 5 # head_size = 32 // 5 = 6, so the heads concatenate to 30 features, projected back to 32
n_layer = 5
n_embed = 32 
dropout = 0.2

In [89]:
###GPTCONNECT###
############
### Data ###
############

# Reading Data
with open("../../data/tinyshakespeare.txt", "r", encoding="utf-8") as f:
    text = f.read()

# All unique characters
chars = sorted(list(set(text)))
vocab_size = len(chars)

# Building the vocabulary
ctoi = {s:i for i,s in enumerate(chars)}
itoc = {i:s for s,i in ctoi.items()}

# Building encoder/decoder
encode = lambda s: [ctoi[c] for c in s]
decode = lambda l: ''.join([itoc[i] for i in l])

# Tokenizing dataset
data = torch.tensor(encode(text), dtype=torch.long)

# Train/Valid Split
n = int(0.9*len(data))
data_train = data[:n]
data_valid = data[n:]

# Gets Random Batches
def get_batch(split: str, batch_size: int, block_size: int) -> tuple:
    """
    Description:
        Generates a batch of data of inputs x and targets y.
    Inputs:
        split: 'train' or 'valid' split
        batch_size: How many independent sequences will be processed in parallel
        block_size: Maximum context length
    Outputs:
        x, y: a tuple with xs and ys
    """
    data = data_train if split == 'train' else data_valid
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x,y = x.to(device), y.to(device)
    return x, y

In [90]:
###GPTCONNECT###
#############
### Model ###
#############
class Head(nn.Module):
    """ One head of self-attention """
    
    def __init__(self, head_size):
        """ Creating three linear layers and a mask """
        super().__init__()
        self.key = nn.Linear(n_embed, head_size, bias=False)
        self.query = nn.Linear(n_embed, head_size, bias=False)
        self.value = nn.Linear(n_embed, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x):
        """ Attention calculation """
        B,T,C = x.shape
        k = self.key(x)
        q = self.query(x)
        
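        # Compute attention scores, scaled to keep their variance in check before the softmax.
        # Note: C is n_embed here, whereas the paper scales by the key dimension (head_size).
        wei = q @ k.transpose(-2, -1) * C**-0.5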
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        wei = self.dropout(wei)
        
        v = self.value(x)
        out = wei @ v
        return out

In [91]:
###GPTCONNECT###
class MultiHeadAttention(nn.Module):
    """ Multiple heads of self-attention in parallel """
    
    def __init__(self, num_heads, head_size):
        """ Multiple heads in parallel """
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(head_size * num_heads, n_embed)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x):
        """ Calculating and Concatenating Results """
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

In [92]:
###GPTCONNECT###
class FeedForward(nn.Module):
    """ Two linear layers with a ReLU in between, followed by dropout """

    def __init__(self, n_embed):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embed, 4 * n_embed),
            nn.ReLU(),
            nn.Linear(4 * n_embed, n_embed),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

In [93]:
###GPTCONNECT###
class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embed, n_head):
        """ Transformer block """
        super().__init__()
        head_size = n_embed // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffw = FeedForward(n_embed)
        self.ln1 = nn.LayerNorm(n_embed) # Has trainable parameters gamma and beta
        self.ln2 = nn.LayerNorm(n_embed) # Has trainable parameters gamma and beta

    def forward(self, x):
        """ Adding Attention, ffw and pre-layernorm to X """
        x = x + self.sa(self.ln1(x))
        x = x + self.ffw(self.ln2(x))
        return x

In [94]:
###GPTCONNECT###
class BigramLanguageModel(nn.Module):
    
    def __init__(self):
        """ Creating Layers """
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embed)
        self.position_embedding_table = nn.Embedding(block_size, n_embed)
        self.blocks = nn.Sequential(*[Block(n_embed, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embed)
        self.lm_head = nn.Linear(n_embed, vocab_size)
    
    def forward(self, idx: torch.tensor, targets: torch.tensor=None) -> tuple:
        """ Calculating the Loss """
        B, T = idx.shape
        tok_emb = self.token_embedding_table(idx) # (B,T,embed_size)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb
        x = self.blocks(x) 
        x = self.ln_f(x)
        logits = self.lm_head(x) # (B,T,vocab_size)
        
        if targets is None:
            loss = None
        else:
            # F.cross_entropy expects the class dimension second, so flatten (B, T, C) to (B*T, C)
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
        
        return logits, loss

    def generate(self, idx: torch.tensor, max_new_tokens: int) -> torch.tensor:
        """ Generates Tokens Using a Sliding Window """
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx if idx.size(1) <= block_size else idx[:, -block_size:]
            
            # get the predictions
            logits, loss = self(idx_cond)
            
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

In [95]:
###GPTCONNECT###
# Function for estimating loss
@torch.no_grad()
def estimate_loss():
    """
    Description:
        Estimates losses on train and valid
    Outputs:
        out: Mean loss across eval_iters items
    """
    out = {}
    model.eval()
    for split in ['train', 'valid']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X,Y = get_batch(split, batch_size, block_size)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

In [96]:
###GPTCONNECT###
# Creating model
model = BigramLanguageModel()
model = model.to(device)

In [None]:
###GPTCONNECT###
################
### Training ###
################

# Creating PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

# Training loop
for step in range(max_steps):
    
    # Once in a while evaluate loss on train and valid sets
    if step % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {step}: train loss {losses['train']:.4f}, valid loss {losses['valid']:.4f}")
    
    # Sample a batch of data
    xb, yb = get_batch('train', batch_size, block_size)
    
    # Evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

step 0: train loss 4.3294, valid loss 4.3323
step 300: train loss 2.3949, valid loss 2.3950
step 600: train loss 2.2566, valid loss 2.2590
step 900: train loss 2.1788, valid loss 2.1846
step 1200: train loss 2.1008, valid loss 2.1441
step 1500: train loss 2.0630, valid loss 2.0964
step 1800: train loss 2.0359, valid loss 2.0921
step 2100: train loss 2.0032, valid loss 2.0599
step 2400: train loss 1.9748, valid loss 2.0579
step 2700: train loss 1.9587, valid loss 2.0482
step 3000: train loss 1.9506, valid loss 2.0228
step 3300: train loss 1.9302, valid loss 2.0066
step 3600: train loss 1.9142, valid loss 2.0016
step 3900: train loss 1.8987, valid loss 2.0026
step 4200: train loss 1.9032, valid loss 1.9945
step 4500: train loss 1.9038, valid loss 2.0191
step 4800: train loss 1.8872, valid loss 1.9736
step 5100: train loss 1.8728, valid loss 1.9725
step 5400: train loss 1.8685, valid loss 1.9724
step 5700: train loss 1.8691, valid loss 1.9703
step 6000: train loss 1.8553, valid loss 1.959

In [None]:
###GPTCONNECT###
##################
### Generating ###
##################
# Generating some text
model_input = torch.zeros((1,1), dtype=torch.long, device=device) # Input token 0, which is \n
model_output = model.generate(model_input, max_new_tokens=300)[0].tolist()
print(f"Generated Text: \n {decode(model_output)}")

# Exporting Code
Here we export code labeled with ###BIGRAM###, ###GPTHEAD### and so on to scripts.

In [None]:
!python3.10 ../../helpers/ipynb_to_py.py 1.\ Building\ GPT-like\ Model.ipynb "###BIGRAM###" ../../modules/GPT/bigram.py

In [None]:
!python3.10 ../../helpers/ipynb_to_py.py 1.\ Building\ GPT-like\ Model.ipynb "###GPTHEAD###" ../../modules/GPT/GPThead.py

In [None]:
!python3.10 ../../helpers/ipynb_to_py.py 1.\ Building\ GPT-like\ Model.ipynb "###GPTMULTIHEAD###" ../../modules/GPT/GPTmultihead.py

In [None]:
!python3.10 ../../helpers/ipynb_to_py.py 1.\ Building\ GPT-like\ Model.ipynb "###GPTNONLINEARITY###" ../../modules/GPT/GPTnonlinearity.py

In [None]:
!python3.10 ../../helpers/ipynb_to_py.py 1.\ Building\ GPT-like\ Model.ipynb "###GPTMULTILAYER###" ../../modules/GPT/GPTmultilayer.py

In [None]:
!python3.10 ../../helpers/ipynb_to_py.py 1.\ Building\ GPT-like\ Model.ipynb "###GPTCONNECT###" ../../modules/GPT/GPT.py
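The helper script `ipynb_to_py.py` is not shown in this notebook, so the following is only a sketch of how a tag-based export could work, not the actual helper. It assumes that a cell is selected when its first line equals the tag; the function name `export_tagged_cells` is hypothetical.

```python
import json

def export_tagged_cells(notebook_path, tag, output_path):
    """ Hypothetical sketch: write all code cells whose first line is `tag` to a script """
    with open(notebook_path, "r", encoding="utf-8") as f:
        notebook = json.load(f)

    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", [])
        # The notebook format stores source as a list of lines or a single string
        text = "".join(source) if isinstance(source, list) else source
        lines = text.splitlines()
        # Keep the cell only if it starts with the tag, and drop the tag line itself
        if lines and lines[0].strip() == tag:
            chunks.append("\n".join(lines[1:]).strip("\n"))

    with open(output_path, "w", encoding="utf-8") as f:
        f.write("\n\n".join(chunks) + "\n")
    return len(chunks)
```

Under these assumptions, a call such as `export_tagged_cells("notebook.ipynb", "###BIGRAM###", "bigram.py")` would collect the tagged cells in notebook order and return how many were written.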