# Decoder-only based GPT (language model)

Here we take a transformer block, the decoder in particular, and use it for the task of language modeling. In general, this is how GPTs are trained. We will do this on a much smaller scale.

We take everything we've already built and leverage it in the way Karpathy implements a character level LM here: 

In [80]:
import torch
from torch.utils.data import random_split
import sys 
sys.path.append("../models")
import tiktoken

In [81]:
harry_potter_text = " "
for i in range(4):
    book_num = i+1
    with open(f'../data/hp{book_num}.txt', 'r', encoding='utf-8') as f:
        harry_potter_text += f.read()
print(len(harry_potter_text))

2652650


## Tokenization
Instead of character level, we're going to model this LM using a tokenizer. in particular, we're going to try to use OpenAI's tiktoken with the gpt2 50k tokenizer. This might end up being too large of a vocab size given compute constraints, but

In [82]:
enc = tiktoken.get_encoding("gpt2")

In [83]:
encoder = enc.encode
decoder = enc.decode

In [84]:
token_example = encoder("hello world test tiktokenizer")

In [85]:
decoded = decoder([token_example[-1]])
decoded

'izer'

In [86]:
dataset = torch.tensor(encoder(harry_potter_text), dtype=torch.long)
print(dataset.shape, dataset.dtype)

torch.Size([683899]) torch.int64


In [90]:
train_val_size = int(len(dataset) * 0.9)  
test_size = len(dataset) - train_val_size
train_val_data, test_data = random_split(dataset, [train_val_size, test_size])

train_size = int(len(train_val_data) * 0.9)  
val_size = len(train_val_data) - train_size
train_data, val_data = random_split(train_val_data, [train_size, val_size])

In [91]:
enc.n_vocab

50257

In [92]:
print(f"train set size: {train_size}, test: {test_size}, val: {val_size}")

train set size: 553958, test: 68390, val: 61551


In [104]:
torch.manual_seed(10000)
batch_size = 8 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?

def get_batch(data):
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])  # for any sequence x, the target y will be the next 8 tokens
    return x, y

xb, yb = get_batch(train_data)

print(xb.shape, yb.shape)

context = xb[0, :2]
target = yb[0,1]
print(f"when input is {context.tolist()} the target: {target}")

torch.Size([8, 8]) torch.Size([8, 8])
when input is [2034, 287] the target: 470
