# GPT-2 Implementation

#### Read in the text

In [16]:
# read the whole file into a single string
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# establish the characters and the vocabulary size
chars = sorted(list(set(text)))
vocab_size = len(chars)

#### Create encoder and decoder functions

In [17]:
# initialise lookup tables: character -> index and index -> character
char_ind = {}
ind_char = {}
for i, ch in enumerate(chars):
    char_ind[ch] = i
    ind_char[i] = ch

In [18]:
# encoder and decoder functions
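def encode(input_str: str) -> list:
    # map each character to its integer index
    return [char_ind[s] for s in input_str]

def decode(input_list: list) -> str:
    # map each integer index back to its character and join the result
    return ''.join(ind_char[i] for i in input_list)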

Andrej Karpathy points out a trade-off between vocabulary size and encoded sequence length. A larger vocabulary lets the same text be encoded as a shorter sequence of integers, while a smaller vocabulary, such as the character-level one used here, produces longer sequences.
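The sequence-length side of this trade-off can be seen on a single sentence. The sketch below is only an illustration: the whitespace-split word tokenizer and all names in it (`sample`, `char_to_id`, `word_to_id`) are assumptions made for the comparison and are not part of this notebook.

```python
# Sketch: encode the same sentence with a character-level and a word-level scheme.
sample = "before we proceed any further, hear me speak."

# Character-level: one token per character.
char_vocab = sorted(set(sample))
char_to_id = {ch: i for i, ch in enumerate(char_vocab)}
char_tokens = [char_to_id[ch] for ch in sample]

# Word-level (hypothetical whitespace split): one token per word.
words = sample.split()
word_vocab = sorted(set(words))
word_to_id = {w: i for i, w in enumerate(word_vocab)}
word_tokens = [word_to_id[w] for w in words]

print("character-level sequence length:", len(char_tokens))
print("word-level sequence length:", len(word_tokens))
```

The word-level encoding is much shorter, but on a sample this small its vocabulary is still tiny. Over a full corpus the set of distinct words keeps growing, while the set of distinct characters stays small.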

#### Convert the data into tensors with PyTorch

In this step we convert the encoded text into a PyTorch tensor. All later operations run on tensors, since that is the data type PyTorch models operate on.

In [19]:
import torch

data = torch.tensor(encode(text), dtype=torch.long)

From the output below, index `0` appears to encode the newline character and index `1` the space character.
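Why these two characters get the lowest indices can be checked on a small sample. The vocabulary is built with `sorted(set(text))`, which orders characters by code point, and newline (10) and space (32) come before punctuation and letters. This is a sketch on a hypothetical sample string (`sample_text`), and it assumes the full text contains no character with a lower code point than these two.

```python
# Sketch: rebuild the lookup table on a short sample and inspect the first two entries.
sample_text = "First Citizen:\nBefore we proceed any further, hear me speak."

sample_chars = sorted(set(sample_text))          # same construction as `chars`
sample_char_ind = {ch: i for i, ch in enumerate(sample_chars)}

# repr() makes the whitespace characters visible when printed
print(repr(sample_chars[0]), repr(sample_chars[1]))
print(ord("\n"), ord(" "))
```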

In [20]:
print(data.shape, data.dtype, data[:100], text[:100], sep = '\n\n')

torch.Size([1115394])

torch.int64

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You


#### Split up the data into training and evaluation

Here we've chosen a 90/10 split: the first 90% of the data is used to train the model, and the remaining 10% is held out to evaluate it.
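The split is a plain slice, so it is sequential rather than shuffled: the held-out portion is the tail of the text. A minimal sketch on a toy list (the `toy_*` names are hypothetical stand-ins for the real tensors) shows the behaviour:

```python
# Sketch: the same 90/10 slicing applied to a toy sequence of 20 items.
toy_data = list(range(20))

n_toy = int(0.9 * len(toy_data))   # int() truncates toward zero
toy_train = toy_data[:n_toy]       # first 90% of the sequence
toy_val = toy_data[n_toy:]         # final 10% of the sequence

print(len(toy_train), len(toy_val))
print(toy_val)
```

Because the two slices share the boundary `n_toy`, every item lands in exactly one of them.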

In [21]:
n = int(0.9*len(data))
train_data = data[:n]
val_data = data[n:]

The code below is taken directly from Andrej Karpathy. It creates the batches of data we will train our model on.
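Each batch row pairs an input block with a target block that is the same span shifted one position to the right, so a single block of length `block_size` yields `block_size` training examples. The sketch below illustrates this with plain lists, using the first nine token ids from the tensor printed earlier. The names `tokens`, `x_demo`, `y_demo` and `block_size_demo` are hypothetical and exist only for this illustration.

```python
# Sketch: how one input block and its shifted target block form training pairs.
block_size_demo = 8
tokens = [18, 47, 56, 57, 58, 1, 15, 47, 58]   # block_size_demo + 1 token ids

x_demo = tokens[:block_size_demo]              # inputs
y_demo = tokens[1:block_size_demo + 1]         # targets: inputs shifted by one

# At position t the model sees x_demo[:t + 1] and should predict y_demo[t].
pairs = [(x_demo[:t + 1], y_demo[t]) for t in range(block_size_demo)]

for context, target in pairs:
    print(f"context {context} -> target {target}")
```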

In [35]:
torch.manual_seed(1337)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
batch_size = 4  # number of independent sequences processed in parallel
block_size = 8  # maximum context length for a prediction

def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y