<a href="https://colab.research.google.com/github/aryamanpandya99/Transformers/blob/main/notebooks/decoder_only_gpt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Decoder-only based GPT (language model)

Here we take a transformer block, the decoder in particular, and use it for the task of language modeling. In general, this is how GPTs are trained. We will do this on a much smaller scale.

We take everything we've already built and leverage it in the way Karpathy implements a character level LM here:

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!cp drive/MyDrive/Transformers/models/transformer_blocks.py .
!cp drive/MyDrive/Transformers/models/modules.py .
!cp -r drive/MyDrive/Transformers/data/ .

In [3]:
!pip install tokenmonster

Collecting tokenmonster
  Downloading tokenmonster-1.1.12.tar.gz (17 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: tokenmonster
  Building wheel for tokenmonster (setup.py) ... [?25l[?25hdone
  Created wheel for tokenmonster: filename=tokenmonster-1.1.12-py3-none-any.whl size=15820 sha256=268dfb701f51cc0e26fe9f5dc56120536a3c0203963ffd07f7786bcb54eec2df
  Stored in directory: /root/.cache/pip/wheels/aa/49/56/9db5eb8fd22ea838f03cc48cc4e096d0f1e810dff3e4559abe
Successfully built tokenmonster
Installing collected packages: tokenmonster
Successfully installed tokenmonster-1.1.12


In [4]:
import torch
from torch import nn
import numpy as np
from torch.utils.data import random_split
import sys
sys.path.append("~/")
from transformer_blocks import Transformer
from torch.utils.data import Dataset, DataLoader
import tokenmonster

In [5]:
harry_potter_text = " "
for i in range(4):
    book_num = i+1
    with open(f'/content/data/hp{book_num}.txt', 'r', encoding='utf-8') as f:
        harry_potter_text += f.read()
print(len(harry_potter_text))

2652650


## Tokenization
Instead of character level, we're going to model this LM using a tokenizer. in particular, we're going to try to use OpenAI's tiktoken with the gpt2 50k tokenizer. This might end up being too large of a vocab size given compute constraints, but

In [6]:
vocab = tokenmonster.load("fiction-1024-consistent-v1")
tokens = vocab.tokenize("This is a test.")

In [7]:
tokens

array([138, 918, 108, 318, 202,  17], dtype=uint16)

In [8]:
token_example = vocab.tokenize("hello world test monster tokenizer")

In [9]:
token_example

array([ 37, 445, 174, 785, 318, 202, 465, 547, 321, 169, 181, 218,  62],
      dtype=uint16)

In [10]:
[vocab.decode([token]) for token in token_example]

['',
 ' hel',
 'lo',
 ' world',
 ' te',
 'st',
 ' mon',
 'ster',
 ' to',
 'ke',
 'ni',
 'ze',
 'r']

In [11]:
tokens = np.array(vocab.tokenize(harry_potter_text), dtype=np.float16)

In [12]:
dataset = torch.tensor(tokens, dtype=torch.long)
print(dataset.shape, dataset.dtype)

torch.Size([1050223]) torch.int64


In [13]:
class HPDataset(Dataset):
    def __init__(self, data, block_size):
        self.data = data
        self.block_size = block_size

    def __len__(self):
        # Return the total number of possible sequences
        return len(self.data) - self.block_size

    def __getitem__(self, idx):
        # Fetch a single sequence x and its corresponding target y
        x = self.data[idx:idx + self.block_size]
        y = self.data[idx + 1:idx + self.block_size + 1]
        return x, y

BLOCK_SIZE = 25
hp_data = HPDataset(dataset, BLOCK_SIZE)

test_block = torch.tensor([dataset[i] for i in range(100)])
test_list = test_block.tolist()
print(vocab.decode(test_list))

train_size = int(len(hp_data) * 0.8)
test_size = int(len(hp_data) * 0.1)
val_size = len(hp_data) - train_size - test_size

print(f"train set size: {train_size}, test: {test_size}, val: {val_size}, data size: {len(dataset)}, dataset_size: {hp_data.__len__()}")

train_dataset, val_dataset, test_dataset = random_split(hp_data, [train_size, val_size, test_size])

batch_size = 64
train_loader, val_loader, test_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True), DataLoader(val_dataset, batch_size=batch_size, shuffle=True), DataLoader(test_dataset, batch_size=batch_size, shuffle=True)

 Harry Potter and the Sorcerer's Stone


CHAPTER ONE

THE BOY WHO LIVED

Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say
that they were perfectly normal, thank you very much. They were the last
people you'd expect to be
train set size: 840158, test: 105019, val: 105021, data size: 1050223, dataset_size: 1050198


In [14]:
print(len(train_loader))
print(len(test_loader))
print(len(val_loader))

13128
1641
1641


In [15]:
print(train_dataset.__getitem__(0))

(tensor([ 34,   5, 647,  36, 264,  62, 196, 494,  62,  37, 300,  69,  17, 392,
        881, 490, 358, 823, 271, 216,  56, 384, 271,   3,  16]), tensor([  5, 647,  36, 264,  62, 196, 494,  62,  37, 300,  69,  17, 392, 881,
        490, 358, 823, 271, 216,  56, 384, 271,   3,  16,   5]))


In [16]:
class rowlingGPT(nn.Module):
    """
    JK Rowling would probably not approve
    """
    def __init__(self, d_k, d_model, d_v, d_ff, num_heads, num_layers, vocab_size, dropout=0.1) -> None:
        super().__init__()
        self.decoder_transformer = Transformer(d_k, d_model, d_v, d_ff, num_heads, num_layers, vocab_size=vocab_size, mask=True, dropout=dropout)
        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(d_model)
        self.fc = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        out = self.decoder_transformer(x)
        return self.fc(self.layer_norm(out))

In [17]:
def compute_loss(y_target, y_pred, loss_function):
    B, T, C = y_pred.shape
    y_pred = y_pred.view(B*T, C)
    _, max_indices = torch.max(y_pred, dim=1)
    y_target_list = y_target.tolist()
    max_indices = max_indices.tolist()
    y_target = y_target.view(B*T)
    return loss_function(y_pred, y_target)

In [18]:
def generate(model, prompt: str, device,n = 50, block_size=BLOCK_SIZE):
  prompt_array = vocab.tokenize(prompt)
  print(prompt_array.shape)
  prompt_array = np.array(prompt_array[:block_size], dtype=np.int16)
  print(prompt_array.shape)
  print(prompt_array.tolist())
  decoded = vocab.decode(prompt_array)
  print(f"prompt: {decoded}")
  cumulative_array = prompt_array
  for i in range(n):
    prompt_tensor = torch.tensor(prompt_array, dtype=torch.long).to(device)
    next_token = predict_next_token(model, prompt_tensor.unsqueeze(0))
    next_token_np = next_token.cpu().numpy().flatten()
    cumulative_array = np.append(cumulative_array, next_token_np)
    prompt_array = np.append(prompt_array[1:], next_token_np)
    test_list = cumulative_array.tolist()
    print(vocab.decode(test_list))

In [19]:
def predict_next_token(model, block):
  with torch.no_grad():
    y_pred = model(block)
    token_probs = nn.functional.softmax(y_pred, dim=-1)
    _, max_idx = torch.max(token_probs, dim=-1)
  return max_idx.squeeze()[-1]  # return only the last next token prediction

In [20]:
def train(model, train_loader, val_loader, loss_function, optim, epochs, device):
    losses = [] #group losses for loss visualization
    running_loss = 0.0
    val_losses = []
    for epoch in range(epochs):
        model.train()
        print("Epoch %d / %d" % (epoch+1, epochs))
        print("-"*10)

        for i, batch_data in enumerate(train_loader):
            x, y = batch_data
            x, y = x.to(device), y.to(device)
            y_pred = model(x)

            loss = compute_loss(y, y_pred, loss_function)
            optim.zero_grad()
            loss.backward()
            optim.step()
            running_loss += loss.item()
            losses.append(loss)

            if (i+1) % 1000 == 0:
                print("Step: {}, average training loss over last 1000 steps: {:.4f}".format(i+1, running_loss/1000))
                running_loss = 0.0

        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for i, batch_data in enumerate(val_loader):
                (y, x) = batch_data
                y, x = y.to(device), x.to(device)
                y_pred = model(x)
                loss = compute_loss(y, y_pred, loss_function)
                _, predicted_labels = torch.max(y_pred, 1)
                val_loss += loss.item()

            val_losses.append(val_loss)
        print("Epoch: {}, validation loss: {:.4f}".format(epoch+1, val_loss/len(val_loader)))
        print("Generated text: ")
        generate(model, "Harry", device=DEVICE, n=20)

    return losses, val_losses

In [21]:
LEARNING_RATE = 6e-4
NUM_EPOCHS = 20
DROPOUT = 0.2
D_MODEL = 1024
NUM_HEADS = 8
D_K = int(D_MODEL / NUM_HEADS)
D_V = D_K
D_FF = D_MODEL * 4
NUM_LAYERS = 2
VOCAB_SIZE = vocab.vocab_size
DEVICE = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

In [22]:
model = rowlingGPT(D_K, D_MODEL, D_V, D_FF, num_heads=NUM_HEADS, num_layers=NUM_LAYERS, vocab_size=VOCAB_SIZE)
model = model.to(DEVICE)

In [23]:
optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)

In [24]:
DEVICE

device(type='cuda', index=0)

In [None]:
train_loss, val_loss = train(model, train_loader, val_loader, torch.nn.functional.cross_entropy, optimizer, NUM_EPOCHS, DEVICE)

Epoch 1 / 20
----------
Step: 1000, average training loss over last 1000 steps: 4.2043
Step: 2000, average training loss over last 1000 steps: 3.9020
Step: 3000, average training loss over last 1000 steps: 3.8385
Step: 4000, average training loss over last 1000 steps: 3.8025
Step: 5000, average training loss over last 1000 steps: 3.7699
Step: 6000, average training loss over last 1000 steps: 3.7490
Step: 7000, average training loss over last 1000 steps: 3.7234
Step: 8000, average training loss over last 1000 steps: 3.7069
Step: 9000, average training loss over last 1000 steps: 3.6878
Step: 10000, average training loss over last 1000 steps: 3.6697
Step: 11000, average training loss over last 1000 steps: 3.6624
Step: 12000, average training loss over last 1000 steps: 3.6471
Step: 13000, average training loss over last 1000 steps: 3.6346
Epoch: 1, validation loss: 8.6949
Generated text: 
(4,)
(4,)
[138, 264, 62, 196]
prompt: Harry
Harry,
Harry, and
Harry, and
Harry, and Her
Harry, and Her

In [None]:
text_sample = dataset

In [None]:
print(text_sample[0])
test_block = torch.tensor([text_sample[i] for i in range(8)])

In [None]:
test_block

In [None]:
test_list = test_block.tolist()

In [None]:
vocab.decode(test_list)

In [None]:
test_block = torch.tensor([text_sample[i] for i in range(100)])
test_list = test_block.tolist()
vocab.decode(test_list)

In [None]:
generate(model, "in the dark", device= DEVICE, n = 100)