## Transformers From Scratch

A notebook where I build a Transformer from scratch using PyTorch.

### Read In Input File

First, we read in our input file, which is a collection of Shakespeare's text. The full vocabulary is stored in the `chars` variable: the sorted set of all unique characters that appear in the text.

In [1]:
#read in shakespeare input file
with open('../notebooks/data/input.txt', 'r', encoding = 'utf-8') as file:
    text = file.read()

In [2]:
#get all the unique characters that occur in the text
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65


Next we create a mapping from each unique character to a unique integer. The model operates on tensors of numbers, so the text must be represented as integers before we can train on it.

In [3]:
#create a mapping of characters to integers
string_to_integer = {ch: i for i, ch in enumerate(chars)}
#create a mapping of integers to characters
integer_to_string = {i: ch for i, ch in enumerate(chars)}

Next we build a simple encoder and decoder from these mappings, then use the encoder to turn the full text into a PyTorch tensor.

In [4]:
#build simple encoder that takes a string and outputs a list of integers
encoder = lambda input_string: [string_to_integer[character] for character in input_string]
#build a simple decoder that takes a list of integers, and outputs a string
decoder = lambda input_list: ''.join([integer_to_string[integer] for integer in input_list])
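
A quick round-trip check can confirm that the two mappings are inverses of each other. The sketch below is self-contained and uses a short hypothetical `sample_text` instead of the Shakespeare file, so the integer values differ from those in the notebook. The names `stoi`, `itos`, `encode`, and `decode` are illustrative stand-ins for the mappings and lambdas above.

```python
# Hypothetical stand-in for the Shakespeare text, for illustration only
sample_text = "hello world"

# Same construction as above, applied to the sample
sample_chars = sorted(set(sample_text))
stoi = {ch: i for i, ch in enumerate(sample_chars)}
itos = {i: ch for i, ch in enumerate(sample_chars)}

def encode(s):
    # string -> list of integers
    return [stoi[c] for c in s]

def decode(ids):
    # list of integers -> string
    return ''.join(itos[i] for i in ids)

encoded = encode("hello")
round_trip = decode(encoded)
print(encoded)     # [3, 2, 4, 4, 5]
print(round_trip)  # hello
```

Note that encoding a character that never appeared in the source text raises a `KeyError`, because the vocabulary is fixed when the mapping is built.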

In [5]:
import torch

#encode entire text dataset and store it into a torch.Tensor
data = torch.tensor(encoder(text), dtype=torch.long)
print(data[:10])

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47])


In [6]:
#get first 90% characters
n = int(0.9 * len(data))
#splitting our dataset into training and validation sets for future use
train_data = data[:n]
val_data = data[n:]

In [7]:
#make batches for training
block_size = 8
train_data[:block_size+1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

The cell below visualizes how training examples are drawn from a block. We want to predict the next character from the context that precedes it. With a block size of 8, a chunk of 9 characters (8 inputs plus one extra for the final target) contains 8 training examples: one for each context length from 1 to 8. Training on every prefix lets the model make predictions from as little as one character of context up to the full block size.

In [8]:
#training visualizer
x = train_data[:block_size]
y = train_data[1:block_size+1]

#see training inputs
for trial in range(block_size):
    context = x[:trial+1]
    target = y[trial]
    print(f"when input is {context} the target is: {target}")

when input is tensor([18]) the target is: 47
when input is tensor([18, 47]) the target is: 56
when input is tensor([18, 47, 56]) the target is: 57
when input is tensor([18, 47, 56, 57]) the target is: 58
when input is tensor([18, 47, 56, 57, 58]) the target is: 1
when input is tensor([18, 47, 56, 57, 58,  1]) the target is: 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]) the target is: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target is: 58


Now that we have visualized how contexts and targets are formed, we add a batch dimension so that several blocks can be processed in parallel.

In [9]:
#seeding for replication
torch.manual_seed(1337)
#the number of independent sequences processed in parallel in each batch
batch_size = 4
#the maximum context length for predictions
block_size = 8

#function that generates batches
def get_batch(split):
    #differentiate between training and validation
    if split == "train":
        data = train_data
    else:
        data = val_data
    #sample random starting indices, leaving room for a full block plus its shifted target
    random_index = torch.randint(len(data) - block_size, (batch_size, ))
    #sampling batches and stacking them to same dimension
    x = torch.stack([data[index:index+block_size] for index in random_index])
    #sampling batches for target variable
    y = torch.stack([data[index+1:index+block_size+1] for index in random_index])
    return x, y
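
To see what the batching function returns, it can help to run the same logic on data where the "next token" is easy to predict by eye. The sketch below is an assumption-laden toy, not part of the original notebook: `toy_data` and `get_toy_batch` are hypothetical names, and the data is just the integers 0 to 99, so every target should equal its input plus one.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy dataset: the integers 0..99, so the next token is always current + 1
toy_data = torch.arange(100)
batch_size = 4
block_size = 8

def get_toy_batch(data):
    # random start offsets; the upper bound leaves room for the shifted targets
    starts = torch.randint(len(data) - block_size, (batch_size,))
    # inputs: block_size tokens starting at each offset
    x = torch.stack([data[s:s + block_size] for s in starts])
    # targets: the same window shifted one position to the right
    y = torch.stack([data[s + 1:s + block_size + 1] for s in starts])
    return x, y

xb, yb = get_toy_batch(toy_data)
print(xb.shape, yb.shape)        # both (batch_size, block_size)
print(torch.equal(yb, xb + 1))   # targets are the inputs shifted by one
```

Each row of `xb` is an independent sequence, and each position in `yb` is the target for the context ending at the same position in `xb`.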

Now that we can generate training batches, we will test training on a simple bigram language model.

In [14]:
import torch.nn as nn

#embeddings visualized
token_embedding_table = nn.Embedding(vocab_size, vocab_size)

In [15]:
token_embedding_table

Embedding(65, 65)
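
The embedding table acts as a lookup: each token id selects one row of 65 numbers, which the bigram model treats as scores for the next character. A minimal sketch of that lookup, assuming a hypothetical random batch of ids with shape (4, 8):

```python
import torch
import torch.nn as nn

vocab_size = 65  # matches the vocabulary size printed earlier
table = nn.Embedding(vocab_size, vocab_size)

# Hypothetical batch of token ids with shape (B, T) = (4, 8)
idx = torch.randint(vocab_size, (4, 8))

# each id selects one row of the table, giving shape (B, T, C)
logits = table(idx)
print(logits.shape)

# the looked-up vector is exactly the table row for that token id
same_row = torch.equal(logits[0, 0], table.weight[idx[0, 0]])
print(same_row)
```

Because the lookup depends only on the current token, the prediction at each position ignores everything that came before it, which is what makes this a bigram model.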

In [10]:
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(1337)

#creating a simple BiGram Model Class
class BigramLanguageModel(nn.Module):
    #constructor for this class
    def __init__(self, vocab_size):
        super().__init__()
        #build an embedding for our vocab characters
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)
    
    def forward(self, idx, targets=None):
        #build logits, idx and targets are both (B, T) tensor of integers
        logits = self.token_embedding_table(idx) # (B, T, C)

        #generate() passes no targets, so skip the loss and keep logits as (B, T, C)
        if targets is None:
            return logits, None

        #re-organize our logits to fit cross entropy function
        B, T, C = logits.shape
        logits = logits.view(B * T, C)
        targets = targets.view(B * T)

        #build our loss function
        loss = F.cross_entropy(logits, targets)
        return logits, loss
    
    def generate(self, idx, max_new_tokens):
        #idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            #get the predictions
            logits, loss = self(idx)
            #focus only on the last time step
            logits = logits[:, -1, :] #becomes (B, C)
            #apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            #sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            #append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx
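
The sampling step inside `generate` can be tried in isolation. The sketch below is illustrative only: it uses hypothetical random logits for a batch of 2 sequences over a 5-token vocabulary rather than the model's real output, and shows how softmax, `torch.multinomial`, and `torch.cat` combine to extend the context by one token.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)

# Hypothetical last-step logits: batch of 2 sequences, vocabulary of 5 tokens
logits = torch.randn(2, 5)

# softmax turns each row of scores into a probability distribution
probs = F.softmax(logits, dim=-1)  # (2, 5)

# draw one token id per sequence according to those probabilities
idx_next = torch.multinomial(probs, num_samples=1)  # (2, 1)

# existing context of 3 tokens per sequence (all zeros, for illustration)
idx = torch.zeros((2, 3), dtype=torch.long)

# append the sampled token along the time dimension
idx = torch.cat((idx, idx_next), dim=1)  # (2, 4)
print(idx.shape)
```

Sampling, rather than always taking the most likely token, means repeated calls can produce different continuations from the same context.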

In [11]:
#get current batch
xb, yb = get_batch('train')
#build our bigram model
bigram = BigramLanguageModel(vocab_size)
#run our model
logits, loss = bigram(xb, yb)

print(logits.shape)
print(loss)

torch.Size([32, 65])
tensor(5.0364, grad_fn=<NllLossBackward0>)
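
As a sanity check on the loss above, it is useful to compare it with the loss of a model that assigns equal probability to all 65 characters. For such a uniform prediction the cross-entropy is $-\ln(1/65)$, which can be computed directly:

```python
import math

vocab_size = 65

# cross-entropy of a uniform prediction over the vocabulary: -ln(1 / vocab_size)
uniform_loss = -math.log(1 / vocab_size)
print(round(uniform_loss, 4))  # 4.1744
```

The untrained model's loss of 5.0364 is higher than this baseline. That is plausibly because the randomly initialized embedding rows produce non-uniform predictions that are mostly wrong; training should bring the loss below the uniform value.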
