<a href="https://colab.research.google.com/github/aryamanpandya99/Transformers/blob/main/notebooks/decoder_only_gpt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Decoder-only based GPT (language model)

Here we take the decoder block of a transformer and use it for language modeling: predicting the next token given the tokens that came before it. This is, broadly, how GPT models are trained; we do the same on a much smaller scale.

We reuse everything we've already built, following the way Karpathy implements a character-level LM.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!cp drive/MyDrive/Transformers/models/transformer_blocks.py .
!cp drive/MyDrive/Transformers/models/modules.py .
!cp -r drive/MyDrive/Transformers/data/ .

In [None]:
!pip install tokenmonster

Collecting tokenmonster
  Downloading tokenmonster-1.1.12.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: tokenmonster
  Building wheel for tokenmonster (setup.py) ... done
  Created wheel for tokenmonster: filename=tokenmonster-1.1.12-py3-none-any.whl size=15820 sha256=286ea71772d537ebafac372cfda04f7e72a01542aac12db34aacd58dc8b951ba
  Stored in directory: /root/.cache/pip/wheels/aa/49/56/9db5eb8fd22ea838f03cc48cc4e096d0f1e810dff3e4559abe
Successfully built tokenmonster
Installing collected packages: tokenmonster
Successfully installed tokenmonster-1.1.12


In [None]:
import torch
from torch import nn
import numpy as np
from torch.utils.data import random_split
import sys
sys.path.append(".")  # the helper modules were copied into the working directory above
from transformer_blocks import Transformer
from torch.utils.data import Dataset, DataLoader
import tokenmonster

In [None]:
harry_potter_text = " "
for i in range(4):
    book_num = i+1
    with open(f'/content/data/hp{book_num}.txt', 'r', encoding='utf-8') as f:
        harry_potter_text += f.read()
print(len(harry_potter_text))

2652650


## Tokenization
Instead of modeling at the character level, we tokenize the text into subword units. A large vocabulary such as the GPT-2 50k tokenizer in OpenAI's tiktoken would likely be too large given our compute constraints, so the code below uses TokenMonster with the much smaller `fiction-2048-consistent-v1` vocabulary.
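To make the compute concern concrete, here is a rough sketch of how vocabulary size drives the parameters that depend on it. It assumes the model has a token-embedding table and an output projection with a bias, a model width of 1024 (the `D_MODEL` value used later in this notebook), and GPT-2's vocabulary of 50,257 tokens. The helper name `vocab_param_count` is made up for illustration.

```python
def vocab_param_count(vocab_size: int, d_model: int) -> dict:
    """Count parameters that scale with vocabulary size (illustrative helper)."""
    embedding = vocab_size * d_model                 # one d_model vector per token
    output_head = d_model * vocab_size + vocab_size  # linear weight plus bias
    return {
        "embedding": embedding,
        "output_head": output_head,
        "total": embedding + output_head,
    }


D_MODEL = 1024  # assumed width, matching the hyperparameters used later

for name, size in [("2048-token vocab", 2048), ("GPT-2 50k vocab", 50257)]:
    counts = vocab_param_count(size, D_MODEL)
    print(f"{name}: {counts['total']:,} vocabulary-dependent parameters")
```

Under these assumptions the larger vocabulary costs roughly 25 times more parameters in these two layers alone, before counting any attention or feed-forward weights.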

In [None]:
vocab = tokenmonster.load("fiction-2048-consistent-v1")
tokens = vocab.tokenize("This is a test.")

In [None]:
tokens

array([ 149, 1674,  110,  374,  233,   17], dtype=uint16)

In [None]:
token_example = vocab.tokenize("hello world test monster tokenizer")

In [None]:
token_example

array([  37,  586,  196, 1261,  374,  233,  627,  773,  377,   37,  601,
         53,  252,   62], dtype=uint16)

In [None]:
[vocab.decode([token]) for token in token_example]

['',
 ' hel',
 'lo',
 ' world',
 ' te',
 'st',
 ' mon',
 'ster',
 ' to',
 '',
 ' ken',
 'i',
 'ze',
 'r']

In [None]:
# Token ids are integers, so keep them in an integer dtype rather than float16
tokens = np.array(vocab.tokenize(harry_potter_text), dtype=np.int64)

In [None]:
dataset = torch.tensor(tokens, dtype=torch.long)
print(dataset.shape, dataset.dtype)

torch.Size([900724]) torch.int64


In [None]:
train_size = int(len(dataset) * 0.8)
test_size = int(len(dataset) * 0.1)
val_size = len(dataset) - train_size - test_size

train_data, test_data, val_data = dataset[:train_size], dataset[train_size:train_size+test_size], dataset[train_size+test_size:]

# Decode the first 100 tokens of each split as a quick sanity check
print(vocab.decode(train_data[:100].tolist()))

print(vocab.decode(val_data[:100].tolist()))

print(vocab.decode(test_data[:100].tolist()))

 Harry Potter and the Sorcerer's Stone


CHAPTER ONE

THE BOY WHO LIVED

Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say
that they were perfectly normal, thank you very much. They were the last
people you'd expect to be involved in anything strange or mysterious,
because they just didn't hold with such no
 Viktor Krum, the famous International Quidditch player.  It was as though the eighteen-year-old Krum thought he. Harry, was an equal - a real rival -
"You haff never . . . you haff not..."
"No," said Harry very firmly.

moved from his own foot and tricked Mr. Malfoy into giving Dobby, thereby setting Dobby free.  The other was covered in pink and orange stripes.
"Dobby, what're you doing here?" Harry said in amazement. "Dobby has come to work at Hogwarts, sir!" Dob


In [None]:
print(f"train set size: {train_size}, test: {test_size}, val: {val_size}")

train set size: 720579, test: 90072, val: 90073
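The split above is contiguous rather than random: the first 80% of the token stream is used for training, the next 10% for testing, and the remainder for validation. A small sketch of the size arithmetic, using the token count printed earlier (the function name `split_sizes` is illustrative):

```python
def split_sizes(n_tokens: int, train_frac: float = 0.8, test_frac: float = 0.1):
    """Return (train, test, val) sizes; val takes whatever remains."""
    train = int(n_tokens * train_frac)
    test = int(n_tokens * test_frac)
    val = n_tokens - train - test  # remainder, so the sizes always sum to n_tokens
    return train, test, val


train, test, val = split_sizes(900724)
print(train, test, val)
```

Giving the validation split the remainder means no token is dropped when the fractions do not divide the corpus evenly.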


In [None]:
class HPDataset(Dataset):
    def __init__(self, data, block_size):
        self.data = data
        self.block_size = block_size

    def __len__(self):
        # Return the total number of possible sequences
        return len(self.data) - self.block_size

    def __getitem__(self, idx):
        # Fetch a single sequence x and its corresponding target y
        x = self.data[idx:idx + self.block_size]
        y = self.data[idx + 1:idx + self.block_size + 1]
        return x, y

BLOCK_SIZE = 32
train_dataset = HPDataset(train_data, BLOCK_SIZE)
val_dataset = HPDataset(val_data, BLOCK_SIZE)
test_dataset = HPDataset(test_data, BLOCK_SIZE)

batch_size = 64
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=True)
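To see what `HPDataset` hands to the model, here is a dependency-free sketch of the same slicing on a toy sequence. The token ids are stand-ins, not real vocabulary entries, and `window_pair` is a hypothetical helper that mirrors `__getitem__`.

```python
def window_pair(data, idx, block_size):
    """Return (x, y) where y is x shifted one position to the right."""
    x = data[idx:idx + block_size]
    y = data[idx + 1:idx + block_size + 1]
    return x, y


tokens = list(range(10, 20))  # stand-in token ids
block_size = 4
num_windows = len(tokens) - block_size  # same rule as HPDataset.__len__

for idx in range(num_windows):
    x, y = window_pair(tokens, idx, block_size)
    print(idx, x, y)
```

Each target `y[t]` is the token that follows `x[t]`, so a single window supplies `block_size` next-token training examples at once.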

In [None]:
print(len(train_loader))
print(len(test_loader))
print(len(val_loader))

11259
1407
1407
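These loader lengths follow from the split sizes, the block size, and the batch size. A minimal sketch of the arithmetic, assuming the `DataLoader` default of keeping the final partial batch (`drop_last=False`); `num_batches` is an illustrative name:

```python
import math


def num_batches(n_tokens: int, block_size: int, batch_size: int) -> int:
    """Batches per epoch for a sliding-window dataset that keeps the last partial batch."""
    n_windows = n_tokens - block_size
    return math.ceil(n_windows / batch_size)


for split, n in [("train", 720579), ("test", 90072), ("val", 90073)]:
    print(split, num_batches(n, block_size=32, batch_size=64))
```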


In [None]:
class zeptoGPT(nn.Module):
    """
    zepto because it's a really small GPT
    """
    def __init__(self, d_k, d_model, d_v, d_ff, num_heads, num_layers, vocab_size, dropout=0.1) -> None:
        super().__init__()
        self.decoder_transformer = Transformer(d_k, d_model, d_v, d_ff, num_heads, num_layers, vocab_size=vocab_size, mask=True, dropout=dropout)
        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(d_model)
        self.fc = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        out = self.decoder_transformer(x)
        return self.fc(self.layer_norm(out))
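The `mask=True` argument is what makes this a decoder. The `Transformer` class lives in `transformer_blocks.py` and is not shown here, so the following is only a sketch of the usual causal-masking idea, on the assumption that this is what the flag enables: position `i` may attend to positions `0..i` and nothing later.

```python
def causal_mask(seq_len: int):
    """mask[i][j] is True when position i is allowed to attend to position j."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


for row in causal_mask(4):
    print(" ".join("x" if allowed else "." for allowed in row))
```

The printed pattern is lower-triangular: each row gains one more visible position, which keeps the model from seeing the token it is asked to predict.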

In [None]:
def compute_loss(y_target, y_pred, loss_function):
    # y_pred: (B, T, C) logits; y_target: (B, T) token ids
    B, T, C = y_pred.shape
    # Flatten batch and time so every position is scored as one classification
    y_pred = y_pred.reshape(B * T, C)
    y_target = y_target.reshape(B * T)
    return loss_function(y_pred, y_target)
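A useful reference point for the loss values is what a model that guesses uniformly would score: `ln(C)` for a vocabulary of `C` tokens. The sketch below re-implements mean cross-entropy in plain Python, as an illustration rather than the PyTorch implementation, and applies it after the same flattening of batch and time. The shapes and target ids are made up.

```python
import math


def cross_entropy(logits_rows, targets):
    """Mean negative log-likelihood over rows of logits (one row per token)."""
    total = 0.0
    for logits, target in zip(logits_rows, targets):
        m = max(logits)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
        total += log_sum_exp - logits[target]
    return total / len(targets)


# A (B, T, C) nested list flattened to (B*T, C), mirroring the flattening above
B, T, C = 2, 3, 2048
logits = [[[0.0] * C for _ in range(T)] for _ in range(B)]  # uniform predictions
targets = [[5, 17, 99], [0, 1, 2047]]

flat_logits = [row for batch in logits for row in batch]
flat_targets = [t for batch in targets for t in batch]

loss = cross_entropy(flat_logits, flat_targets)
print(f"uniform-guess loss: {loss:.4f}, ln(C): {math.log(C):.4f}")
```

A loss that stays well above this baseline is a hint that something is wrong, for example inputs and targets that are not aligned.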

In [None]:
def generate(model, prompt: str, device, n=50, block_size=BLOCK_SIZE):
  # Tokenize the prompt and keep at most block_size tokens as the initial context
  prompt_array = vocab.tokenize(prompt)
  print(prompt_array.shape)
  prompt_array = np.array(prompt_array[:block_size], dtype=np.int16)
  print(prompt_array.shape)
  print(prompt_array.tolist())
  decoded = vocab.decode(prompt_array)
  print(f"prompt: {decoded}")
  cumulative_array = prompt_array  # everything produced so far, prompt included
  for i in range(n):
    prompt_tensor = torch.tensor(prompt_array, dtype=torch.long).to(device)
    next_token = predict_next_token(model, prompt_tensor.unsqueeze(0))
    next_token_np = next_token.cpu().numpy().flatten()
    cumulative_array = np.append(cumulative_array, next_token_np)
    # Slide the window: drop the oldest token and append the new one,
    # so the context length stays equal to the prompt length
    prompt_array = np.append(prompt_array[1:], next_token_np)
    test_list = cumulative_array.tolist()
    print(vocab.decode(test_list))
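The context handling in `generate` drops the oldest token on every step, so a short prompt never gives the model more context than the prompt itself had. One possible variant, sketched here in plain Python and not what the function above does, lets the context grow until it reaches `block_size` and only then starts sliding. The name `extend_context` and the token ids are illustrative.

```python
def extend_context(context, new_token, block_size):
    """Append a token, then keep only the most recent block_size tokens."""
    return (context + [new_token])[-block_size:]


context = [149, 582, 226]  # a short prompt of three token ids (illustrative)
block_size = 5
for new_token in [1, 2, 3, 4]:  # stand-ins for model predictions
    context = extend_context(context, new_token, block_size)
    print(context)
```

With this variant the model sees up to `block_size` tokens of its own output, which may reduce the short repeating loops that a three-token context tends to produce.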

In [None]:
def predict_next_token(model, block):
  with torch.no_grad():
    y_pred = model(block)
    token_probs = nn.functional.softmax(y_pred, dim=-1)
    _, max_idx = torch.max(token_probs, dim=-1)
  return max_idx.squeeze()[-1]  # return only the last next token prediction
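`predict_next_token` is greedy decoding: it always picks the most likely token. Because softmax preserves the ordering of the logits, the argmax of the probabilities is the same index as the argmax of the raw logits. A small plain-Python check of that property, with made-up logits:

```python
import math


def softmax(logits):
    """Convert logits to probabilities, subtracting the max for stability."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


logits = [1.2, -0.3, 4.1, 0.0]
probs = softmax(logits)
print(argmax(logits), argmax(probs))
```

The softmax step therefore matters only if you want the probabilities themselves, for instance to sample from them instead of always taking the top token.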

In [None]:
def train(model, train_loader, val_loader, loss_function, optim, epochs, device):
    losses = [] #group losses for loss visualization
    running_loss = 0.0
    val_losses = []
    for epoch in range(epochs):
        model.train()
        print("Epoch %d / %d" % (epoch+1, epochs))
        print("-"*10)

        for i, batch_data in enumerate(train_loader):
            x, y = batch_data
            x, y = x.to(device), y.to(device)
            y_pred = model(x)

            loss = compute_loss(y, y_pred, loss_function)
            optim.zero_grad()
            loss.backward()
            optim.step()
            running_loss += loss.item()
            losses.append(loss.item())  # store a float, not the graph-attached tensor

            if (i+1) % 1000 == 0:
                print("Step: {}, average training loss over last 1000 steps: {:.4f}".format(i+1, running_loss/1000))
                running_loss = 0.0

        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for batch_data in val_loader:
                # The dataset yields (input, target), in that order
                x, y = batch_data
                x, y = x.to(device), y.to(device)
                y_pred = model(x)
                loss = compute_loss(y, y_pred, loss_function)
                val_loss += loss.item()

        # Record the mean validation loss per batch for this epoch
        val_loss /= len(val_loader)
        val_losses.append(val_loss)
        print("Epoch: {}, validation loss: {:.4f}".format(epoch+1, val_loss))
        print("Generated text: ")
        generate(model, "Harry", device=device, n=20)

    return losses, val_losses

In [None]:
LEARNING_RATE = 6e-4
NUM_EPOCHS = 5
DROPOUT = 0.2
D_MODEL = 1024
NUM_HEADS = 8
D_K = D_MODEL // NUM_HEADS  # per-head dimension
D_V = D_K
D_FF = D_MODEL * 4
NUM_LAYERS = 2
VOCAB_SIZE = vocab.vocab_size
DEVICE = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
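The derived sizes above depend on `D_MODEL` dividing evenly by `NUM_HEADS`. One way to make that explicit is a small config object that checks it up front. This is a sketch under that assumption; the class name `ModelConfig` and the field `ff_multiplier` are hypothetical and not part of the notebook's modules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    d_model: int = 1024
    num_heads: int = 8
    num_layers: int = 2
    ff_multiplier: int = 4  # feed-forward width as a multiple of d_model

    def __post_init__(self):
        # Heads split d_model evenly, so reject sizes that do not divide
        if self.d_model % self.num_heads != 0:
            raise ValueError("d_model must be divisible by num_heads")

    @property
    def d_k(self) -> int:
        return self.d_model // self.num_heads

    @property
    def d_ff(self) -> int:
        return self.d_model * self.ff_multiplier


cfg = ModelConfig()
print(cfg.d_k, cfg.d_ff)
```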

In [None]:
model = zeptoGPT(D_K, D_MODEL, D_V, D_FF, num_heads=NUM_HEADS, num_layers=NUM_LAYERS, vocab_size=VOCAB_SIZE, dropout=DROPOUT)
model = model.to(DEVICE)

In [None]:
optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)

In [None]:
DEVICE

device(type='cuda', index=0)

In [None]:
train_loss, val_loss = train(model, train_loader, val_loader, torch.nn.functional.cross_entropy, optimizer, NUM_EPOCHS, DEVICE)

Epoch 1 / 5
----------
Step: 1000, average training loss over last 1000 steps: 4.5192
Step: 2000, average training loss over last 1000 steps: 4.0947
Step: 3000, average training loss over last 1000 steps: 3.9889
Step: 4000, average training loss over last 1000 steps: 3.9341
Step: 5000, average training loss over last 1000 steps: 3.8920
Step: 6000, average training loss over last 1000 steps: 3.8589
Step: 7000, average training loss over last 1000 steps: 3.8303
Step: 8000, average training loss over last 1000 steps: 3.8098
Step: 9000, average training loss over last 1000 steps: 3.7888
Step: 10000, average training loss over last 1000 steps: 3.7644
Step: 11000, average training loss over last 1000 steps: 3.7484
Epoch: 1, validation loss: 9.2498
Generated text: 
(3,)
(3,)
[149, 582, 226]
prompt: Harry
Harry,
Harry, and
Harry, and
Harry, and Har
Harry, and Harry
Harry, and Harry's
Harry, and Harry's

Harry, and Harry's
the
Harry, and Harry's
they
Harry, and Harry's
they'
Harry, and Harry's
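The training losses logged above are mean cross-entropy values. Assuming the loss uses natural logarithms, as `torch.nn.functional.cross_entropy` does, exponentiating a loss gives the perplexity: roughly the number of equally likely tokens the model is choosing between at each step. The helper name `perplexity` is illustrative.

```python
import math


def perplexity(mean_cross_entropy: float) -> float:
    """Convert a mean cross-entropy (in nats) to perplexity."""
    return math.exp(mean_cross_entropy)


# First and last training averages from the log above
for step, loss in [(1000, 4.5192), (11000, 3.7484)]:
    print(f"step {step}: loss {loss:.4f} -> perplexity {perplexity(loss):.1f}")
```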


In [None]:
text_sample = train_data

In [None]:
print(text_sample[0])
test_block = torch.tensor([text_sample[i] for i in range(8)])

tensor(36)


In [None]:
test_block

tensor([  36,  582,  226,   36,  354,  240,  172, 1528])

In [None]:
test_list = test_block.tolist()

In [None]:
vocab.decode(test_list)

' Harry Potter and the'

In [None]:
test_block = torch.tensor([text_sample[i] for i in range(100)])
test_list = test_block.tolist()
vocab.decode(test_list)

" Harry Potter and the Sorcerer's Stone\n\n\nCHAPTER ONE\n\nTHE BOY WHO LIVED\n\nMr. and Mrs. Dursley, of number four, Privet Drive, were proud to say\nthat they were perfectly normal, thank you very much. They were the last\npeople you'd expect to be involved in anything strange or mysterious,\nbecause they just didn't hold with such no"

In [None]:
generate(model, "in the dark", device=DEVICE, n=25)

(3,)
(3,)
[37, 1381, 813]
prompt: in the dark
in the darkness
in the darkness.
in the darkness. Har
in the darkness. Harry
in the darkness. Harry had
in the darkness. Harry had seen
in the darkness. Harry had seen the
in the darkness. Harry had seen the
in the darkness. Harry had seen the S
in the darkness. Harry had seen the Sna
in the darkness. Harry had seen the Snape
in the darkness. Harry had seen the Snape's
in the darkness. Harry had seen the Snape's eyes
in the darkness. Harry had seen the Snape's eyes.
in the darkness. Harry had seen the Snape's eyes. Har
in the darkness. Harry had seen the Snape's eyes. Harry
in the darkness. Harry had seen the Snape's eyes. Harry had
in the darkness. Harry had seen the Snape's eyes. Harry had seen
in the darkness. Harry had seen the Snape's eyes. Harry had seen the
in the darkness. Harry had seen the Snape's eyes. Harry had seen the
in the darkness. Harry had seen the Snape's eyes. Harry had seen the S
in the darkness. Harry had seen the Sna