# Building a ChatGPT Clone
- GPT is a probabilistic system, so it can give different answers to the same prompt.
- Based on the famous "Attention Is All You Need" Transformer model.
- GPT stands for Generative Pre-trained Transformer.
- We will not be building anything close to the quality of OpenAI's ChatGPT (OC).

Unlike OC, which predicts the next token (a word or a piece of a word), our model will predict the next character.
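
To make the difference in granularity concrete, here is a minimal sketch in plain Python. The example line and the whitespace split are assumptions for illustration only; whitespace splitting is a crude stand-in for word-level units and is not how OC tokenizes text.

```python
# Illustrative only: compare character-level and naive word-level units.
line = "First Citizen: Before we proceed"

char_tokens = list(line)    # one token per character (what we will model)
word_tokens = line.split()  # naive whitespace split, a stand-in for words

print(len(char_tokens), char_tokens[:5])
print(len(word_tokens), word_tokens)
```

Character-level modelling gives a tiny vocabulary but long sequences; coarser units give shorter sequences at the cost of a much larger vocabulary.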

We will be training on the "Tiny Shakespeare" dataset, a single text file of roughly one million characters drawn from Shakespeare's works.

We will essentially be modelling how these characters follow one after another in a sequence.

We will be creating a Language Model to output language that looks like Shakespeare.

- https://www.youtube.com/watch?v=kCc8FmEb1nY&t=4828s

In [26]:
from pprint import pprint

import torch
import torch.nn as nn

from torch.nn import functional as F

# RNG seed for reproducibility
torch.manual_seed(1337)

<torch._C.Generator object at 0x10eebbdd0>

In [27]:
text: str
with open('tiny-shakespeare.txt', 'r', encoding='utf-8') as f:
    text = f.read() # Store whole file in 1 string ~ 1 million characters

In [28]:
unique_chars = sorted(list(set(text)))
vocab_size = len(unique_chars)
print(''.join(unique_chars))
print(vocab_size)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65


# Our First Tokenizer

In [29]:
# Google uses SentencePiece
# OpenAI uses tiktoken
# (both are sub-word tokenizers)

str_to_int = { ch:i for i, ch in enumerate(unique_chars) }
int_to_str = { i:ch for i, ch in enumerate(unique_chars) }

# Encoder and Decoder
# Here we assign each character an integer. In prod we might
# change this to be multi-character.
encode = lambda s: [str_to_int[c] for c in s]
decode = lambda l: ''.join([int_to_str[i] for i in l])

In [30]:
# Test
print(encode("Hello World"))
print(decode(encode("Hello World")))

[20, 43, 50, 50, 53, 1, 35, 53, 56, 50, 42]
Hello World
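
The tokenizer above is a plain lookup table in both directions, so it can only encode characters that appear in the text it was built from. The following sketch rebuilds the same idea on a tiny sample string (`sample_text` is an assumption for illustration, not the dataset) and shows the round trip plus what happens with an unseen character.

```python
# Minimal character-level tokenizer sketch on a tiny sample text.
sample_text = "Hello World"

vocab = sorted(set(sample_text))
str_to_int = {ch: i for i, ch in enumerate(vocab)}
int_to_str = {i: ch for i, ch in enumerate(vocab)}

def encode(s):
    """Map each character to its integer id."""
    return [str_to_int[c] for c in s]

def decode(ids):
    """Map integer ids back to a string."""
    return "".join(int_to_str[i] for i in ids)

ids = encode("Hello")
round_trip = decode(ids)
print(ids, round_trip)

# A character missing from the vocabulary cannot be encoded.
try:
    encode("Hello?")
    unseen_ok = True
except KeyError:
    unseen_ok = False
print("unseen character encodable:", unseen_ok)
```

Because the real vocabulary is built from the whole dataset, this only matters when encoding prompts that contain characters Shakespeare's text never uses.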


In [31]:
# Encode tiny-shakespeare.txt dataset into PyTorch tensor
data = torch.tensor(encode(text), dtype=torch.long)

In [32]:
# Train-Test split
n = int(0.9*len(data))
train_data = data[:n]
test_data = data[n:]

# Batching Our Training Data

In [33]:
BLOCK_SIZE = 8
train_data[:BLOCK_SIZE+1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

In [34]:
X = train_data[:BLOCK_SIZE]
Y = train_data[1:BLOCK_SIZE+1]
for t in range(BLOCK_SIZE):
    context = X[:t+1]
    target = Y[t]
    print(f"If INPUT = {context} ; TARGET = {target}")

If INPUT = tensor([18]) ; TARGET = 47
If INPUT = tensor([18, 47]) ; TARGET = 56
If INPUT = tensor([18, 47, 56]) ; TARGET = 57
If INPUT = tensor([18, 47, 56, 57]) ; TARGET = 58
If INPUT = tensor([18, 47, 56, 57, 58]) ; TARGET = 1
If INPUT = tensor([18, 47, 56, 57, 58,  1]) ; TARGET = 15
If INPUT = tensor([18, 47, 56, 57, 58,  1, 15]) ; TARGET = 47
If INPUT = tensor([18, 47, 56, 57, 58,  1, 15, 47]) ; TARGET = 58
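
The integer contexts above are easier to read once decoded back into characters. This sketch uses plain Python lists instead of tensors and rebuilds the vocabulary from the 65-character set printed earlier; the token ids are copied from the output of `train_data[:BLOCK_SIZE+1]`.

```python
import string

# Vocabulary reconstructed from the 65-character set printed earlier.
unique_chars = sorted("\n !$&',-.3:;?" + string.ascii_uppercase + string.ascii_lowercase)
int_to_str = {i: ch for i, ch in enumerate(unique_chars)}

BLOCK_SIZE = 8
chunk = [18, 47, 56, 57, 58, 1, 15, 47, 58]  # ids copied from the cell output above

x, y = chunk[:BLOCK_SIZE], chunk[1:BLOCK_SIZE + 1]
pairs = []
for t in range(BLOCK_SIZE):
    context = "".join(int_to_str[i] for i in x[:t + 1])
    target = int_to_str[y[t]]
    pairs.append((context, target))
    print(f"{context!r} -> {target!r}")
```

A single chunk of `BLOCK_SIZE + 1` tokens therefore yields `BLOCK_SIZE` training examples, with context lengths from 1 up to `BLOCK_SIZE`.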


The output above shows how the model takes a sequence and learns to predict the next element at every position in it. We bound this task with a block size, which limits the length of the sequence (the context) we learn from. The following code keeps the block size and also incorporates batching: stacking several independent sequences into one tensor so that the computation can be parallelised.

- `BATCH_SIZE`: Number of independent sequences we will process in parallel.
- `BLOCK_SIZE`: Maximum context length for predictions.  

In [35]:
BATCH_SIZE = 4
BLOCK_SIZE = 8

def get_batch(d: torch.Tensor):
    """Generate a batch of inputs (x) and targets (y).
    """
    # Sample BATCH_SIZE random start positions in the data
    ix = torch.randint(len(d) - BLOCK_SIZE, (BATCH_SIZE,))

    x = torch.stack([d[i:i+BLOCK_SIZE] for i in ix])
    y = torch.stack([d[i+1:i+BLOCK_SIZE+1] for i in ix])
    return x, y

# Create example batch
xb, yb = get_batch(train_data)

print(f"""
### EXAMPLE BATCH ###
Inputs:
{xb.shape}
{xb}
Targets:
{yb.shape}
{yb}
""")


### EXAMPLE BATCH ###
Inputs:
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
Targets:
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
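
Each row of the targets is the matching input row shifted left by one position, so a `(4, 8)` batch packs 32 separate training examples. The sketch below mirrors `get_batch` with plain Python lists; the stand-in `data` (where each value equals its position) and the seed are assumptions chosen to make the shift easy to see.

```python
import random

random.seed(0)  # arbitrary seed, only to make the sketch repeatable

BATCH_SIZE, BLOCK_SIZE = 4, 8
data = list(range(100))  # stand-in token ids; each value equals its position

def get_batch(d):
    """Sample BATCH_SIZE windows of length BLOCK_SIZE plus their shifted targets."""
    starts = [random.randrange(len(d) - BLOCK_SIZE) for _ in range(BATCH_SIZE)]
    x = [d[i:i + BLOCK_SIZE] for i in starts]
    y = [d[i + 1:i + BLOCK_SIZE + 1] for i in starts]
    return x, y

xb, yb = get_batch(data)

# Every (row, position) pair is one training example.
examples = [(row_x[:t + 1], row_y[t])
            for row_x, row_y in zip(xb, yb)
            for t in range(BLOCK_SIZE)]
print(len(examples))  # BATCH_SIZE * BLOCK_SIZE
```

The rows are sampled independently, so nothing in one row should influence predictions for another.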



# Models and Neural Networks Time!
In this section we will start developing models, based on Neural Networks (NN), and begin passing in our data in batches to see how good we can get our models to be.

Our models:
1. Bigram Language Model
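
A bigram model predicts the next character from the current character alone. As a rough intuition, here is a count-based sketch in plain Python. Note that this is a different technique from the learned embedding table used in the next cell: it estimates the same kind of "next character given current character" distribution by counting. The corpus `sample_text` is a made-up stand-in.

```python
from collections import Counter, defaultdict

sample_text = "to be or not to be"  # tiny stand-in corpus

# Count how often each character follows each other character.
follow_counts = defaultdict(Counter)
for current_char, next_char in zip(sample_text, sample_text[1:]):
    follow_counts[current_char][next_char] += 1

def next_char_probs(ch):
    """Turn the counts for `ch` into a probability distribution.

    Only valid for characters that have at least one successor in the corpus.
    """
    counts = follow_counts[ch]
    total = sum(counts.values())
    return {nxt: c / total for nxt, c in counts.items()}

probs_after_t = next_char_probs("t")
print(probs_after_t)
```

The neural version stores one row of scores per character instead of counts and learns those scores by gradient descent.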

In [36]:
###
# 1. Bigram Language Model
###

class BigramModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()

        # Lookup table with one row per character; row i holds the
        # logits for the character that follows character i
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, inputs, targets=None):
        # Make predictions from our input
        logits = self.token_embedding_table(inputs) # (B, T, C)

        # Calculate our loss function. cross_entropy implementation
        # requires us to reshape our data.
        loss = None
        if targets is not None:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss
    
    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

###
# Test Model
###

# We want to check if it is able to be initialised and infer.
m = BigramModel(vocab_size=vocab_size)
logits, loss = m(xb, yb)
print(f"Before Training Loss = {loss}")

# Here we print out the results from our model with no training.
# As we will see the results are very bad since they are totally random.
# print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))

###
# Train Model
###

optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

TRAINING_STEPS = 10_000
TRAIN = True

if TRAIN:
    for i in range(TRAINING_STEPS):
        xb, yb = get_batch(train_data)

        # Optimise and calculate Loss
        logits, loss = m(xb, yb)
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()

print(f"After Training Loss = {loss}")

Before Training Loss = 5.036386013031006


After Training Loss = 2.3838858604431152
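
A useful sanity check for these numbers: a model that spreads its probability evenly over all 65 characters would score a cross-entropy of ln(65), about 4.17. The untrained loss of about 5.04 is higher than that, plausibly because the randomly initialised table produces non-uniform and sometimes confidently wrong predictions. The sketch below computes the loss by hand in plain Python; the helper `cross_entropy` is written for illustration and handles a single example, unlike `F.cross_entropy`.

```python
import math

vocab_size = 65

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

# Uniform prediction: every character gets the same score.
uniform_loss = cross_entropy([0.0] * vocab_size, target=0)

# Confident and correct prediction: the loss drops below the uniform baseline.
confident_loss = cross_entropy([5.0] + [0.0] * (vocab_size - 1), target=0)

print(round(uniform_loss, 4), round(confident_loss, 4))
```

The trained loss of about 2.38 sits well below the uniform baseline, which suggests the table has picked up real bigram statistics.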


In [37]:
# Inference after training:
print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))


Rulcof tcas
S:

ar:
IV

CCARUCiouth mby.

ute: tity d Cat LOp; yrer at ifa,
BUEzYr: he m, h onubaima
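
The text differs on every run because `generate` samples from the predicted distribution instead of always taking the most likely character. A minimal sketch of that difference in plain Python follows; the three-character vocabulary and the logits are made up, and `random.choices` stands in for `torch.multinomial`.

```python
import math
import random

random.seed(1337)  # arbitrary seed so the sketch is repeatable

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

chars = ["a", "b", "c"]
logits = [2.0, 1.0, 0.0]  # made-up scores for the next character
probs = softmax(logits)

# Greedy decoding always picks the same character...
greedy = chars[probs.index(max(probs))]
# ...while sampling picks each character in proportion to its probability.
samples = random.choices(chars, weights=probs, k=1000)
counts = {ch: samples.count(ch) for ch in chars}
print(greedy, counts)
```

Sampling is what makes the model a probabilistic system: less likely characters still appear, just less often.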


# Writing Our First Self-Attention Block

## Math Trick For Computing `x_bow`

In [38]:
B, T, C = 4, 8, 2 # Batch, Time, Channels
x = torch.randn(B, T, C)

# bow: Bag of words
#
# Here we build an example tensor as a first step towards
# a simple self-attention block. Each element of x_bow
# stores the mean of the current element and all the
# elements which came before it in the sequence.

In [39]:
# Method 1: Brute Force For Looping
x_bow = torch.zeros((B, T, C))
for b in range(B):
    for t in range(T):
        x_prev = x[b,:t+1] # (t+1, C)

        # Mean over time (0th dim)
        x_bow[b, t] = torch.mean(x_prev, 0)

In [40]:
# Method 2: Fast Matrix Multiplication
# Using lower triangular matrices e.g.
# [[1, 0, 0]
#  [1, 1, 0]
#  [1, 1, 1]]

wei = torch.tril(torch.ones(T, T))
wei = wei /  wei.sum(1, keepdim=True)

# wei (T, T) is broadcast across the batch dimension, i.e. treated as (B, T, T)
x_bow_mat = wei @ x # (T, T) @ (B, T, C) -> (B, T, C)

In [41]:
# Check values same
torch.allclose(x_bow, x_bow_mat)

True
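
To see why the matrix trick reproduces the loop, it helps to write the weights out for a tiny case. This sketch uses plain Python lists with `T = 3` and a single channel of made-up values; the names `x_bow_mat` and `x_bow_loop` are chosen here for illustration.

```python
T = 3
x = [2.0, 4.0, 9.0]  # one channel of made-up values over T time steps

# Lower-triangular ones, then normalise each row so it sums to 1.
tril = [[1.0 if j <= i else 0.0 for j in range(T)] for i in range(T)]
wei = [[v / sum(row) for v in row] for row in tril]

# Matrix-vector product: each output is a weighted sum over the past.
x_bow_mat = [sum(w * xj for w, xj in zip(row, x)) for row in wei]

# Reference: explicit running mean, as in Method 1.
x_bow_loop = [sum(x[:t + 1]) / (t + 1) for t in range(T)]

print(wei)
print(x_bow_mat, x_bow_loop)
```

Row `t` of `wei` puts equal weight on positions `0..t` and zero weight on the future, so multiplying by it is exactly a running mean.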

In [42]:
# Method 3: Softmax
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T,T))

# Using lower triangular matrices e.g.
# [[1, 0, 0]
#  [1, 1, 0]
#  [1, 1, 1]]
# Every index where tril == 0, set the element
# at the same index in wei to '-inf'
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)

x_bow_sm = wei @ x


In [43]:
# Check values same
torch.allclose(x_bow, x_bow_sm)

True
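
The softmax version works because `exp(-inf)` is 0, so masked positions receive zero weight and the remaining equal scores share the probability evenly. A small sketch in plain Python with `T = 3` follows; the `softmax` helper is written here for illustration and stands in for `F.softmax`.

```python
import math

def softmax(row):
    """Softmax over one row; exp(-inf) evaluates to 0.0."""
    m = max(row)
    exps = [math.exp(z - m) for z in row]
    total = sum(exps)
    return [e / total for e in exps]

T = 3
NEG_INF = float("-inf")

# Start from all-zero scores and hide the future (j > i) with -inf.
scores = [[0.0 if j <= i else NEG_INF for j in range(T)] for i in range(T)]
wei = [softmax(row) for row in scores]

for row in wei:
    print(row)
```

The advantage over Method 2 is that the zero scores can later be replaced by data-dependent values, and the same masking and softmax still produce valid weights over the past.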

In [44]:
pprint(x_bow[0])

tensor([[ 0.6211, -0.3773],
        [ 0.0226, -0.7264],
        [ 0.5547, -0.6317],
        [ 0.4800, -0.6547],
        [ 0.2894, -0.3380],
        [ 0.5003, -0.3041],
        [ 0.1993, -0.1599],
        [ 0.1507,  0.0065]])


## The Crux of The Self-Attention Block
We will implement a small self-attention block for a single head.

Here are some notes from Andrej Karpathy on Attention:
```
- Attention is a communication mechanism. Can be seen as nodes in a directed graph looking at each other and aggregating information with a weighted sum from all nodes that point to them, with data-dependent weights.

- There is no notion of space. Attention simply acts over a set of vectors. This is why we need to positionally encode tokens.

- Each example across batch dimension is of course processed completely independently and never "talk" to each other

- In an "encoder" attention block just delete the single line that does masking with tril, allowing all tokens to communicate. This block here is called a "decoder" attention block because it has triangular masking, and is usually used in autoregressive settings, like language modeling.

- "self-attention" just means that the keys and values are produced from the same source as queries. In "cross-attention", the queries still get produced from x, but the keys and values come from some other, external source (e.g. an encoder module)

- "Scaled" attention additionally multiplies wei by 1/sqrt(head_size). This makes it so when input Q,K are unit variance, wei will be unit variance too and Softmax will stay diffuse and not saturate too much. Illustration below
```

In [45]:
B, T, C = 4, 8, 32
x = torch.randn(B, T, C)

# Single Head
HEAD_SIZE = 16
key = nn.Linear(C, HEAD_SIZE, bias=False)
query = nn.Linear(C, HEAD_SIZE, bias=False)
value = nn.Linear(C, HEAD_SIZE, bias=False)

k = key(x) # (B, T, 16)
q = query(x) # (B, T, 16)

# Communication between keys and queries
wei = q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) -> (B, T, T)

tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)


v = value(x)
out = wei @ v # -> (B, T, HEAD_SIZE)
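
The cell above does not apply the scaling described in the notes. As a rough check of why it matters, the sketch below estimates the variance of query-key dot products with and without the 1/sqrt(head_size) factor. It uses plain Python with unit-Gaussian vectors; the sample count, seed, and helper names are assumptions for illustration.

```python
import math
import random

random.seed(1337)  # arbitrary seed so the sketch is repeatable

HEAD_SIZE = 16
NUM_SAMPLES = 10_000

def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def random_vector(n):
    """n independent draws from a unit Gaussian."""
    return [random.gauss(0.0, 1.0) for _ in range(n)]

# Dot products of unit-variance queries and keys.
raw_scores = []
for _ in range(NUM_SAMPLES):
    q, k = random_vector(HEAD_SIZE), random_vector(HEAD_SIZE)
    raw_scores.append(sum(qi * ki for qi, ki in zip(q, k)))

scaled_scores = [s / math.sqrt(HEAD_SIZE) for s in raw_scores]

raw_var = variance(raw_scores)
scaled_var = variance(scaled_scores)
print(f"raw variance    ~ {raw_var:.2f}")     # close to HEAD_SIZE
print(f"scaled variance ~ {scaled_var:.2f}")  # close to 1
```

In the cell above, this would correspond to multiplying `wei` by `HEAD_SIZE ** -0.5` before the masking and softmax steps, which keeps the softmax from collapsing onto a single position.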