# Introduction
Here we build a GPT-like model from scratch, based on two papers: [Attention is All You Need](https://arxiv.org/abs/1706.03762), which proposed the **transformer** architecture, and [GPT-3](https://arxiv.org/abs/2005.14165). GPT is a language model that, given an input sequence, simply predicts the next token.

# Libraries

In [1]:
%matplotlib inline
%config IPCompleter.use_jedi=False

In [2]:
import math
import random
import matplotlib.pyplot as plt

In [3]:
###BIGRAM###
#################
### Libraries ###
#################

In [4]:
###BIGRAM###
import torch
import torch.nn as nn
import torch.nn.functional as F
device = 'cuda' if torch.cuda.is_available() else 'cpu'  # is_available must be called, not just referenced
device = 'cpu'  # Forcing CPU for this notebook

# Data
We import the tiny-Shakespeare dataset and prepare it for training a model that generates Shakespeare-like text.

### Reading and Inspecting

In [5]:
###BIGRAM###
############
### Data ###
############

In [6]:
###BIGRAM###
# Reading Data
with open("../../data/tinyshakespeare.txt", "r", encoding="utf-8") as f:
    text = f.read()

In [7]:
print("Number of Characters: ", len(text))

Number of Characters:  1115394


In [8]:
# First 300 characters
print(text[:300])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us


In [9]:
###BIGRAM###
# All unique characters
chars = sorted(list(set(text)))
vocab_size = len(chars)

In [10]:
print(f"Vocab Size: {vocab_size}")
print(f"Vocab Chars: {''.join(chars)}")

Vocab Size: 65
Vocab Chars: 
 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz


### Building Vocabulary and Encoder/Decoder
We use a character-level tokenizer here. In practice, models such as OpenAI's GPT family use sub-word tokenizers instead, e.g. a **BPE** (byte-pair encoding) tokenizer:

In [11]:
# OpenAI encoder/decoder
import tiktoken
enc = tiktoken.get_encoding("gpt2")
enc.decode(enc.encode("hello world")) == "hello world"

True

We will be using a simple tokenizer that uses characters rather than word-chunks to make things easier to understand.

In [12]:
###BIGRAM###
# Building the vocabulary
ctoi = {s:i for i,s in enumerate(chars)}
itoc = {i:s for s,i in ctoi.items()}

{'\n': 0, ' ': 1, '!': 2, '$': 3, '&': 4, "'": 5, ',': 6, '-': 7, '.': 8, '3': 9, ':': 10, ';': 11, '?': 12, 'A': 13, 'B': 14, 'C': 15, 'D': 16, 'E': 17, 'F': 18, 'G': 19, 'H': 20, 'I': 21, 'J': 22, 'K': 23, 'L': 24, 'M': 25, 'N': 26, 'O': 27, 'P': 28, 'Q': 29, 'R': 30, 'S': 31, 'T': 32, 'U': 33, 'V': 34, 'W': 35, 'X': 36, 'Y': 37, 'Z': 38, 'a': 39, 'b': 40, 'c': 41, 'd': 42, 'e': 43, 'f': 44, 'g': 45, 'h': 46, 'i': 47, 'j': 48, 'k': 49, 'l': 50, 'm': 51, 'n': 52, 'o': 53, 'p': 54, 'q': 55, 'r': 56, 's': 57, 't': 58, 'u': 59, 'v': 60, 'w': 61, 'x': 62, 'y': 63, 'z': 64}
{0: '\n', 1: ' ', 2: '!', 3: '$', 4: '&', 5: "'", 6: ',', 7: '-', 8: '.', 9: '3', 10: ':', 11: ';', 12: '?', 13: 'A', 14: 'B', 15: 'C', 16: 'D', 17: 'E', 18: 'F', 19: 'G', 20: 'H', 21: 'I', 22: 'J', 23: 'K', 24: 'L', 25: 'M', 26: 'N', 27: 'O', 28: 'P', 29: 'Q', 30: 'R', 31: 'S', 32: 'T', 33: 'U', 34: 'V', 35: 'W', 36: 'X', 37: 'Y', 38: 'Z', 39: 'a', 40: 'b', 41: 'c', 42: 'd', 43: 'e', 44: 'f', 45: 'g', 46: 'h', 47: 'i',

In [None]:
# Printing vocabulary
print(ctoi)
print(itoc)

In [13]:
###BIGRAM###
# Building encoder/decoder
encode = lambda s: [ctoi[c] for c in s]
decode = lambda l: ''.join([itoc[i] for i in l])

In [14]:
# Testing encoder/decoder
print(encode("hii there"))
print(decode(encode("hii there")))

[46, 47, 47, 1, 58, 46, 43, 56, 43]
hii there
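
One consequence of a character-level vocabulary is that only characters seen in the dataset can be encoded. The following is a minimal plain-Python sketch of the same construction on a toy string. The names `toy_text`, `toy_encode` and `toy_decode` are made up for illustration and are not part of the notebook.

```python
# Toy corpus standing in for the Shakespeare text (an assumption for illustration)
toy_text = "hello there"

# Same construction as above: the sorted unique characters define the vocabulary
toy_chars = sorted(set(toy_text))
toy_ctoi = {c: i for i, c in enumerate(toy_chars)}
toy_itoc = {i: c for c, i in toy_ctoi.items()}

def toy_encode(s):
    """Map each character to its integer id; raises KeyError for unseen characters."""
    return [toy_ctoi[c] for c in s]

def toy_decode(ids):
    """Map integer ids back to a string."""
    return "".join(toy_itoc[i] for i in ids)

# Encoding followed by decoding should give back the original string
ids = toy_encode("hello")
round_trip = toy_decode(ids)

# A character outside the vocabulary cannot be encoded
try:
    toy_encode("xyz")
    unknown_ok = True
except KeyError:
    unknown_ok = False

print(ids, round_trip, unknown_ok)
```

Sub-word tokenizers such as BPE avoid this limitation at the cost of a much larger vocabulary.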


### Tokenizing
Using our simple tokenizer we tokenize the entire dataset.

In [15]:
###BIGRAM###
# Tokenizing dataset
data = torch.tensor(encode(text), dtype=torch.long)

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56])


In [None]:
# Printing example
print(data.shape, data.dtype)
print(data[:50])

### Train/Valid Split

In [16]:
###BIGRAM###
# Train/Valid Split
n = int(0.9*len(data))
data_train = data[:n]
data_valid = data[n:]

### Creating dataset
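When building our model, we would like it to be able to generate text from as little context as one character, and up to a context of size **block_size**. Each chunk of block_size + 1 tokens therefore yields block_size training examples, one for every context length.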

In [17]:
# Example of a sample
block_size = 8
x = data_train[:block_size]
y = data_train[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} target is: {target}")

when input is tensor([18]) target is: 47
when input is tensor([18, 47]) target is: 56
when input is tensor([18, 47, 56]) target is: 57
when input is tensor([18, 47, 56, 57]) target is: 58
when input is tensor([18, 47, 56, 57, 58]) target is: 1
when input is tensor([18, 47, 56, 57, 58,  1]) target is: 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]) target is: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) target is: 58


We create a function for getting random batches from the data.

In [18]:
###BIGRAM###
# Gets Random Batches
def get_batch(split: str, batch_size: int, block_size: int) -> tuple:
    """
    Description:
        Generates a batch of data of inputs x and targets y.
    Inputs:
        split: 'train' or 'valid' split
        batch_size: How many independent sequences will be processed in parallel
        block_size: Maximum context length
    Outputs:
        x, y: a tuple of inputs x and targets y (x shifted by one position), each of shape (batch_size, block_size)
    """
    data = data_train if split == 'train' else data_valid
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x,y = x.to(device), y.to(device)
    return x, y
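
The key property of a batch is that each target row is its input row shifted one position to the right. The sketch below checks this with plain Python lists instead of tensors. The helper `toy_get_batch` and the toy data are assumptions for illustration only.

```python
import random

def toy_get_batch(data, batch_size, block_size, rng):
    # Valid start offsets are 0 .. len(data) - block_size - 1, so the target
    # slice data[i+1 : i+block_size+1] never runs past the end of the data
    starts = [rng.randrange(len(data) - block_size) for _ in range(batch_size)]
    x = [data[i:i + block_size] for i in starts]
    y = [data[i + 1:i + block_size + 1] for i in starts]
    return x, y

toy_data = list(range(100))  # stand-in for the token ids
rng = random.Random(0)
xb_toy, yb_toy = toy_get_batch(toy_data, batch_size=4, block_size=8, rng=rng)

for x_row, y_row in zip(xb_toy, yb_toy):
    # Every target row overlaps its input row, offset by one position
    print(x_row, "->", y_row)
```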

In [19]:
# Testing function
batch_size = 4 
block_size = 8 

xb, yb = get_batch('train', batch_size, block_size)
print('inputs:')
print(xb.shape)
print(xb)
print('outputs:')
print(yb.shape)
print(yb)

for b in range(batch_size):
    for t in range(block_size):
        context = xb[b, :t+1]
        target = yb[b, t]
        print(f"when input is {context.tolist()} target is: {target}")

inputs:
torch.Size([4, 8])
tensor([[47, 50, 58, 63,  1, 53, 44,  8],
        [43,  1, 63, 53, 59,  1, 46, 39],
        [ 1, 39, 44, 56, 39, 47, 42,  1],
        [ 1, 57, 59, 41, 46,  1, 39, 50]])
outputs:
torch.Size([4, 8])
tensor([[50, 58, 63,  1, 53, 44,  8,  0],
        [ 1, 63, 53, 59,  1, 46, 39, 60],
        [39, 44, 56, 39, 47, 42,  1, 58],
        [57, 59, 41, 46,  1, 39, 50, 50]])
when input is [47] target is: 50
when input is [47, 50] target is: 58
when input is [47, 50, 58] target is: 63
when input is [47, 50, 58, 63] target is: 1
when input is [47, 50, 58, 63, 1] target is: 53
when input is [47, 50, 58, 63, 1, 53] target is: 44
when input is [47, 50, 58, 63, 1, 53, 44] target is: 8
when input is [47, 50, 58, 63, 1, 53, 44, 8] target is: 0
when input is [43] target is: 1
when input is [43, 1] target is: 63
when input is [43, 1, 63] target is: 53
when input is [43, 1, 63, 53] target is: 59
when input is [43, 1, 63, 53, 59] target is: 1
when input is [43, 1, 63, 53, 59, 1] tar

# Neural Network: Part I
Now we start feeding the data into a neural network. We begin with a bigram model, similar to the one we built previously.

In [20]:
# A batch
print(xb)
print(yb)

tensor([[47, 50, 58, 63,  1, 53, 44,  8],
        [43,  1, 63, 53, 59,  1, 46, 39],
        [ 1, 39, 44, 56, 39, 47, 42,  1],
        [ 1, 57, 59, 41, 46,  1, 39, 50]])
tensor([[50, 58, 63,  1, 53, 44,  8,  0],
        [ 1, 63, 53, 59,  1, 46, 39, 60],
        [39, 44, 56, 39, 47, 42,  1, 58],
        [57, 59, 41, 46,  1, 39, 50, 50]])


In [21]:
###BIGRAM###
#############
### Model ###
#############

In [22]:
###BIGRAM###
class BigramLanguageModel(nn.Module):
    
    def __init__(self, vocab_size):
        """ Creating Embedding Table """
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)
    
    def forward(self, idx: torch.Tensor, targets: torch.Tensor=None) -> tuple:
        """ Computes the logits and, if targets are given, the loss """
        logits = self.token_embedding_table(idx) # (BATCH, TIME, CHANNEL)
        if targets is None:
            loss = None
        else:
            # F.cross_entropy expects (N, CHANNEL) logits and (N,) targets, so we flatten BATCH and TIME
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
        
        return logits, loss

    def generate(self, idx: torch.Tensor, max_new_tokens: int) -> torch.Tensor:
        """ Generates Tokens Autoregressively, One at a Time """
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx
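
In the bigram model the embedding table acts as a lookup: row i holds the logits for the token that follows token i. A rough plain-Python sketch of that idea, using a made-up 3-token table (the values and the names `table`, `next_token_probs` and `cross_entropy` are illustrative assumptions):

```python
import math

# Row i holds the logits for "which token follows token i" (arbitrary values)
table = [
    [0.0, 1.0, 2.0],
    [2.0, 0.0, 0.0],
    [0.5, 0.5, 0.5],
]

def softmax(row):
    # Subtract the maximum for numerical stability
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_probs(token_id):
    # The embedding lookup amounts to selecting one row of the table
    return softmax(table[token_id])

def cross_entropy(token_id, target_id):
    # Negative log-probability assigned to the correct next token
    return -math.log(next_token_probs(token_id)[target_id])

probs = next_token_probs(2)  # equal logits give a uniform distribution
loss = cross_entropy(2, 0)
print(probs, loss)
```

Because the prediction depends only on the current token, the model cannot use any earlier context.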

In [36]:
###BIGRAM###
@torch.no_grad()
def estimate_loss():
    """
    Description:
        Estimates losses on train and valid
    Outputs:
        out: Mean loss across eval_iters items
    """
    out = {}
    model.eval()
    for split in ['train', 'valid']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X,Y = get_batch(split, batch_size, block_size)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out
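
Averaging over many batches makes the reported loss far less noisy than the loss of a single batch. The sketch below illustrates this with a hypothetical noisy loss (a fixed value plus Gaussian noise), which is an assumption and not the real model loss.

```python
import random
import statistics

rng = random.Random(0)

def noisy_batch_loss():
    # Hypothetical per-batch loss: a "true" value of 2.5 plus Gaussian noise
    return rng.gauss(2.5, 0.2)

def averaged_loss(eval_iters):
    # Mean over eval_iters batches, as estimate_loss does for each split
    return sum(noisy_batch_loss() for _ in range(eval_iters)) / eval_iters

single = [noisy_batch_loss() for _ in range(500)]
averaged = [averaged_loss(200) for _ in range(500)]

# The spread of the averaged estimate is much smaller
spread_single = statistics.stdev(single)
spread_averaged = statistics.stdev(averaged)
print(spread_single, spread_averaged)
```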

In [24]:
# Expected loss for a uniform prediction over the vocabulary
-math.log(1/vocab_size)

4.174387269895637
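
This value is the loss of a model that assigns equal probability to all 65 characters. A freshly initialized model tends to start somewhat above it, because random logits are not uniform. The following is a rough plain-Python simulation, assuming logits drawn from a standard normal distribution (the default initialization of nn.Embedding weights) and uniformly random targets. The names are made up for illustration.

```python
import math
import random

rng = random.Random(0)
toy_vocab_size = 65  # same vocabulary size as above

def random_init_loss():
    # Logits for one context, drawn from a standard normal
    logits = [rng.gauss(0.0, 1.0) for _ in range(toy_vocab_size)]
    target = rng.randrange(toy_vocab_size)
    # Cross-entropy = log-sum-exp(logits) - logit of the target
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

uniform_loss = math.log(toy_vocab_size)
mean_random_loss = sum(random_init_loss() for _ in range(2000)) / 2000
print(uniform_loss, mean_random_loss)
```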

In [25]:
###BIGRAM###
# Creating model
model = BigramLanguageModel(vocab_size)
model = model.to(device)

In [26]:
# Running forward pass of model
logits, loss = model(xb, yb)
print(logits.shape)
print(loss)

torch.Size([32, 65])
tensor(4.5953, grad_fn=<NllLossBackward0>)


In [27]:
# Generating some text
model_input = torch.zeros((1,1), dtype=torch.long, device=device) # Input token 0, which is \n
model_output = model.generate(model_input, max_new_tokens=100)[0].tolist()
print(f"Generated Text: \n {decode(model_output)}")

Generated Text: 
 
NoKFskWLAbZ-PQzpl3, YlcduNTOkyDjDncxShQ$
GJjMepDUggcth$xplM?wGTNgX ''BJ'WrnJekdW'Ul!afGSU3jUwMcaAySz
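
The generation loop repeatedly samples one token from the predicted distribution and appends it to the context. A minimal plain-Python sketch of that loop, assuming a hypothetical deterministic 3-token bigram table so the result is easy to follow:

```python
import random

# Hypothetical table: row i is P(next token | current token i)
toy_probs = [
    [0.0, 1.0, 0.0],  # token 0 is always followed by token 1
    [0.0, 0.0, 1.0],  # token 1 is always followed by token 2
    [1.0, 0.0, 0.0],  # token 2 is always followed by token 0
]

def toy_generate(idx, max_new_tokens, rng):
    idx = list(idx)
    for _ in range(max_new_tokens):
        # A bigram model only looks at the last token of the context
        probs = toy_probs[idx[-1]]
        # Sample the next token id from that distribution
        next_id = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        # Append the sampled id to the running sequence
        idx.append(next_id)
    return idx

generated = toy_generate([0], max_new_tokens=5, rng=random.Random(0))
print(generated)
```

With an untrained model the rows are close to random, which is why the text generated above looks like noise.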


# Training a Model: Part I
Here we use the AdamW optimizer (Adam with decoupled weight decay) instead of stochastic gradient descent, which we used earlier. The optimizer determines how the parameters are updated from their gradients. Before, we simply updated each parameter p in the following way:

p.data += -lr * 0.01 * p.grad

Now the optimizer instead keeps track of the gradient history, so that it can build momentum in a consistent direction and typically converge faster.
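
To make the difference concrete, here is a small plain-Python sketch comparing a plain gradient step with an Adam-style step on a toy one-dimensional loss. It follows the standard Adam update with its usual default coefficients and leaves out the weight decay that AdamW adds. The toy loss and the function names are assumptions for illustration.

```python
import math

def grad(p):
    # Gradient of the toy loss f(p) = (p - 3)^2
    return 2.0 * (p - 3.0)

def sgd(p, lr, steps):
    # Plain gradient descent: step proportional to the raw gradient
    for _ in range(steps):
        p -= lr * grad(p)
    return p

def adam(p, lr, steps, beta1=0.9, beta2=0.999, eps=1e-8):
    m, v = 0.0, 0.0  # running averages of the gradient and its square
    for t in range(1, steps + 1):
        g = grad(p)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        # Bias correction for the zero-initialized averages
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        p -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return p

p_sgd = sgd(0.0, lr=0.1, steps=500)
p_adam = adam(0.0, lr=0.1, steps=500)
print(p_sgd, p_adam)
```

Note that the first Adam step has a size of roughly lr regardless of how large the gradient is, whereas the plain step scales with the gradient.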

In [28]:
###BIGRAM###
################
### Training ###
################

In [30]:
###BIGRAM###
# Hyperparameters
batch_size = 32
block_size = 8
max_steps = 10000
eval_interval = 300
learning_rate = 1e-2
eval_iters = 200

In [31]:
###BIGRAM###
# Creating PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

In [48]:
###BIGRAM###
# Training loop
for step in range(max_steps):
    
    # Once in a while evaluate loss on train and valid sets
    if step % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {step}: train loss {losses['train']:.4f}, valid loss {losses['valid']:.4f}")
    
    # Sample a batch of data
    xb, yb = get_batch('train', batch_size, block_size)
    
    # Evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

step 0: train loss 4.7526, valid loss 4.7464
step 300: train loss 2.8255, valid loss 2.8341
step 600: train loss 2.5408, valid loss 2.5755
step 900: train loss 2.5011, valid loss 2.5296
step 1200: train loss 2.4853, valid loss 2.5101
step 1500: train loss 2.4799, valid loss 2.5050
step 1800: train loss 2.4623, valid loss 2.4960
step 2100: train loss 2.4696, valid loss 2.4863
step 2400: train loss 2.4694, valid loss 2.4842
step 2700: train loss 2.4628, valid loss 2.4837
step 3000: train loss 2.4629, valid loss 2.4981
step 3300: train loss 2.4653, valid loss 2.4845
step 3600: train loss 2.4593, valid loss 2.4868
step 3900: train loss 2.4517, valid loss 2.4858
step 4200: train loss 2.4537, valid loss 2.4839
step 4500: train loss 2.4582, valid loss 2.4896
step 4800: train loss 2.4691, valid loss 2.4812
step 5100: train loss 2.4537, valid loss 2.4890
step 5400: train loss 2.4577, valid loss 2.4800
step 5700: train loss 2.4630, valid loss 2.4834
step 6000: train loss 2.4557, valid loss 2.489

In [None]:
###BIGRAM###
##################
### Generating ###
##################

In [50]:
###BIGRAM###
# Generating some text
model_input = torch.zeros((1,1), dtype=torch.long, device=device) # Input token 0, which is \n
model_output = model.generate(model_input, max_new_tokens=300)[0].tolist()
print(f"Generated Text: \n {decode(model_output)}")

Generated Text: 
 
Anghairk s prd. athist isullllconeriean.

Dourd toferee y
Sirdyou ole!
DY JUCLLAUS:
Mes arine:



S:
IUKIn, w oweqund belequeyowe an wh be? I hours, turg ony irendichin avis arthemyore n;
Th o nds mantolost hesbainondeeomomem thinlitwhen NUEN ornd t avy we OFomesou herlo d! her
EE:
NGSTh me ofois wa


# Exporting Code
Here we export code labeled with ###BIGRAM### to a script.

In [53]:
!python3.10 ../../helpers/ipynb_to_py.py 1.\ Building\ GPT-like\ Model.ipynb "###BIGRAM###" ../../modules/GPT/bigram.py

INFO: Cells with label ###BIGRAM### extracted from 1. Building GPT-like Model.ipynb
