# GPT v1
-------------------
Based on the following attention architecture:<br>
<img src="draws/attention-gpt.png" width="400">

In [1]:
!nvidia-smi

Tue May 28 17:00:31 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.65                 Driver Version: 551.86         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 4090        On  |   00000000:01:00.0  On |                    0 |
| 31%   37C    P0             53W /  450W |    1819MiB /  23028MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [2]:
from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"

In [3]:
from datasets import load_dataset
from markdown import markdown
import mmap
import pickle
import random
import torch
import torch.nn as nn
import torch.nn.functional as F
import tqdm as notebook_tqdm

In [4]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

In [5]:
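# hyperparameters
# number of characters per training sequence (context length)
block_size = 32
batch_size = 128
# number of training iterations (one random batch each, not full passes over the data)
epochs = 2100
learning_rate = 3e-4
# number of batches averaged for each loss estimate
eval_inters = 100
# evaluate every eval_interval iterations
eval_interval = 300
# number of features per embedding vector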
embedding_dim = 256
# for the multi head attention layer
n_heads = 4
# number of decoders
n_layers = 4
dropout = 0.1

## Read data from file
---
I used the file ```data/train_data_1.txt```, which contains the whole openwebtext corpus in a single file. It is too big to load into memory, so we read it in random chunks.<br>
Running ```read_data.py``` generates all the required files:
- ```data/train_data_1.txt```
- ```data/vocab.txt```

Now,
1. load the vocabulary so we can create the encode and decode functions.
2. load the data in chunks


In [6]:
with open("data/vocab.txt", "r", encoding="utf-8") as f:
    chars = sorted(list(set(f.read())))
    vocab_size = len(chars)

vocab_size

28477

### char encoder and decoder

---


In [7]:
# then, create the mappings
string_to_int = {string: i for i, string in enumerate(chars)}
int_to_string = {i: string for i, string in enumerate(chars)}

In [8]:
# encode maps text to a list of integer ids; decode maps the ids back to text
encode = lambda s: [string_to_int[c] for c in s]
decode = lambda s: "".join([int_to_string[c] for c in s])

### short test of the encoder and decoder

---


In [9]:
encoded_string = encode("hello")
decoded_string = decode(encoded_string)
print(decoded_string)

hello
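
The mappings above raise a `KeyError` for any character that is missing from the vocabulary. Here is a minimal sketch of the same round trip on a toy vocabulary with a fallback for unknown characters. The names `toy_chars`, `toy_encode`, `toy_decode` and the `unk` parameter are only illustrative; they are not part of this notebook.

```python
# Toy character vocabulary (assumption: stands in for data/vocab.txt).
toy_chars = sorted(set("hello world"))
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}
toy_itos = {i: ch for i, ch in enumerate(toy_chars)}


def toy_encode(text: str, unk: str = " ") -> list[int]:
    """Map each character to its id; unknown characters fall back to `unk`."""
    return [toy_stoi.get(ch, toy_stoi[unk]) for ch in text]


def toy_decode(ids: list[int]) -> str:
    """Map ids back to characters and join them into a string."""
    return "".join(toy_itos[i] for i in ids)


ids = toy_encode("hello")
print(ids)              # [3, 2, 4, 4, 5]
print(toy_decode(ids))  # hello
```

Mapping unknown characters to a space is just one option; a dedicated unknown symbol in the vocabulary would work as well.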


### read data in chunks
-------------------

In [10]:
def get_random_chunk(filename: str, block_size: int, batch_size: int) -> torch.Tensor:
    """Read a random window of the file and encode it as token ids.

    Args:
        filename (str): path to the UTF-8 text file.
        block_size (int): number of characters per training sequence.
        batch_size (int): number of sequences per batch.

    Returns:
        torch.Tensor: 1-D tensor of token ids (dtype torch.long) with at most
            block_size * batch_size - 1 entries.
    """
    with open(filename, "rb") as f:
        # mmap == memory mapping, does not load the whole file at once
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            file_size = m.size()
            start_index = random.randint(0, file_size - block_size * batch_size)
            m.seek(start_index)
            block = m.read(block_size * batch_size - 1)
            # we decode it because the file is read as raw bytes; a random offset
            # can split a multi-byte character, so invalid bytes are ignored
            decoded_block = block.decode("utf-8", errors="ignore").replace("\r", "")

            data = torch.tensor(encode(decoded_block), dtype=torch.long)
    return data
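
To see the memory-mapping idea in isolation, here is a small sketch that reads a random window from a throwaway file. The helper `read_random_window` and the temporary file are assumptions made for the demo; the notebook itself reads from `data/train_data_1.txt`.

```python
import mmap
import os
import random
import tempfile


def read_random_window(path: str, n_bytes: int) -> str:
    """Read n_bytes starting at a random offset without loading the whole file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # Clamp so that small files still yield a valid start offset.
            last_start = max(0, m.size() - n_bytes)
            start = random.randint(0, last_start)
            raw = m[start : start + n_bytes]
    # A random byte offset can split a multi-byte UTF-8 character;
    # errors="ignore" drops the broken bytes at the edges.
    return raw.decode("utf-8", errors="ignore")


# Demo on a throwaway file (assumption: stands in for the real training file).
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("héllo wörld " * 100)
    demo_path = tmp.name

window = read_random_window(demo_path, 64)
print(len(window), repr(window[:20]))
os.remove(demo_path)
```

The decoded window can be slightly shorter than the number of bytes requested, because accented characters take two bytes and broken ones at the edges are dropped.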

In [11]:
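def get_batch(filename: str, block_size: int, batch_size: int):
    """Sample a batch of input and target sequences from a random chunk.

    Args:
        filename (str): path to the UTF-8 text file.
        block_size (int): number of characters per training sequence.
        batch_size (int): number of sequences per batch.

    Returns:
        tuple[torch.Tensor, torch.Tensor]: inputs x and targets y, both of shape
            (batch_size, block_size) on `device`; y is x shifted by one position.
    """
    chunk = get_random_chunk(filename, block_size, batch_size)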
    starts = torch.randint(len(chunk) - block_size, (batch_size,))
    x = torch.stack([chunk[i : i + block_size] for i in starts]).to(device)
    y = torch.stack([chunk[i + 1 : i + block_size + 1] for i in starts]).to(device)
    return x, y

In [12]:
x, y = get_batch("data/train_data_1.txt", block_size, batch_size)
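
The relation between `x` and `y` is easier to see on a toy chunk. This is a sketch with made-up sizes (`toy_chunk`, `toy_block_size` and `toy_batch_size` are illustrative names), using the same indexing as `get_batch`:

```python
import torch

# Toy stand-in for an encoded chunk (assumption: ids 0..9 instead of real text).
toy_chunk = torch.arange(10)
toy_block_size, toy_batch_size = 4, 3

# Same indexing as get_batch: each start must leave room for block_size + 1 ids.
starts = torch.randint(len(toy_chunk) - toy_block_size, (toy_batch_size,))
xb = torch.stack([toy_chunk[i : i + toy_block_size] for i in starts])
yb = torch.stack([toy_chunk[i + 1 : i + toy_block_size + 1] for i in starts])

print(xb)
print(yb)
# Every target row is the input row shifted one position to the left.
print(torch.equal(xb[:, 1:], yb[:, :-1]))
```

So for every position in `x`, the matching entry in `y` is the next character the model has to predict.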

### helper functions
--------------------

In [13]:
@torch.no_grad()
def evaluate_loss(model, block_size):
    out = {}
    _ = model.eval()

    for split in ["train", "val"]:
        losses = torch.zeros(eval_inters)
        for i in range(eval_inters):
            # NOTE: both splits sample from the same training file, so "val"
            # is not a held-out estimate here
            input_batch, target_batch = get_batch(
                "data/train_data_1.txt", block_size, batch_size
            )
            _, loss = model(input_batch, target_batch)
            losses[i] = loss.item()

        out[split] = losses.mean()

    _ = model.train()
    return out
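
Note that `evaluate_loss` draws both the "train" and the "val" batches from `data/train_data_1.txt`, so the two numbers estimate the same quantity. One possible way to get a held-out estimate is to reserve the tail of the file for validation. The sketch below only computes the byte offsets; `split_range`, `random_start` and `val_fraction` are hypothetical helpers, not part of this notebook.

```python
import random


def split_range(
    file_size: int, split: str, val_fraction: float = 0.1
) -> tuple[int, int]:
    """Return the [low, high) byte range that belongs to a split."""
    boundary = int(file_size * (1.0 - val_fraction))
    return (0, boundary) if split == "train" else (boundary, file_size)


def random_start(file_size: int, split: str, n_bytes: int) -> int:
    """Pick a start offset so that the window stays inside the split's range."""
    low, high = split_range(file_size, split)
    if high - low < n_bytes:
        raise ValueError("split is smaller than the requested window")
    return random.randint(low, high - n_bytes)


file_size = 1_000_000  # assumption: size of the training file in bytes
n_bytes = 32 * 128     # block_size * batch_size from the hyperparameters
for split in ("train", "val"):
    start = random_start(file_size, split, n_bytes)
    print(split, start, start + n_bytes)
```

The resulting start offset could take the place of the `random.randint` call inside `get_random_chunk`.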

## GPT Model
--------------------
It follows the simplified GPT attention architecture; see [page 9](https://arxiv.org/pdf/2305.10435v1).<br>
It looks complicated, but it is not. Each piece is quite simple; there is just a lot of code.

In [14]:
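# Full (unmasked) multi-head attention, kept for reference.
# The DecoderBlock below uses MultiheadAttention / AttentionHead instead.
class FullMultiheadAttention(nn.Module):
    def __init__(self, embedding_dim, n_heads, dropout):
        super().__init__()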
        self.embedding_dim = embedding_dim
        self.n_heads = n_heads
        self.head_dim = embedding_dim // n_heads

        self.q_linear = nn.Linear(embedding_dim, embedding_dim)
        self.k_linear = nn.Linear(embedding_dim, embedding_dim)
        self.v_linear = nn.Linear(embedding_dim, embedding_dim)

        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(embedding_dim, embedding_dim)

    def forward(self, query, key, value):
        # get number of training examples
        n_batches = query.shape[0]

        # perform linear transformation and split into N heads
        query = self.q_linear(query).view(n_batches, -1, self.n_heads, self.head_dim)
        key = self.k_linear(key).view(n_batches, -1, self.n_heads, self.head_dim)
        value = self.v_linear(value).view(n_batches, -1, self.n_heads, self.head_dim)

        # transpose to get dimensions batch_size * n_heads * seq_len * head_dim
        query = query.transpose(1, 2)
        key = key.transpose(1, 2)
        value = value.transpose(1, 2)

        # scaled dot-product attention (method defined below); no causal mask is applied here
        scores = self.attention(query, key, value)

        # concatenate heads and put through final linear layer
        concat = (
            scores.transpose(1, 2).contiguous().view(n_batches, -1, self.embedding_dim)
        )
        output = self.out(concat)

        return output

    def attention(self, query, key, value):
        scores = torch.matmul(query, key.transpose(-2, -1)) / torch.sqrt(
            torch.tensor(self.head_dim, dtype=torch.float32)
        )
        scores = F.softmax(scores, dim=-1)
        scores = self.dropout(scores)
        output = torch.matmul(scores, value)
        return output

In [15]:
class AttentionHead(nn.Module):
    def __init__(self, embedding_dim, head_size, dropout):
        super().__init__()
        self.k = nn.Linear(embedding_dim, head_size, bias=False)
        self.q = nn.Linear(embedding_dim, head_size, bias=False)
        self.v = nn.Linear(embedding_dim, head_size, bias=False)
        # registers the no-look-ahead mask as a buffer in the model state:
        # it is built once, moves with the model to the right device,
        # and is not a trainable parameter
        # (block_size is the global hyperparameter)
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # input size is batch (b), sequence length (t), embedding_dim (c)
        # output size is batch (b), sequence length (t), head_size (h)
        b, t, c = x.shape
        k = self.k(x)
        q = self.q(x)
        v = self.v(x)

        # compute attention scores ("affinities"), w stands for weights
        # q @ k^T / sqrt(h): multiply q and k, then apply the scale factor (reference architecture)
        # scaling keeps the softmax from collapsing onto one position,
        # so we hear all parts of the conversation, not just the loudest

        # (b,t,h) @ (b,h,t) -> (b,t,t)
        w = q @ k.transpose(-2, -1) * k.shape[-1] ** (-0.5)
        # masking so the network does not look ahead and cheat (reference architecture)
        # (b, t, t)
        w = w.masked_fill(self.tril[:t, :t] == 0, float("-inf"))
        # (b, t, t)
        w = F.softmax(w, dim=-1)
        w = self.dropout(w)
        # perform weighted aggregation of the values
        # (b, t, t) @ (b, t, h) -> (b, t, h)
        return w @ v
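
To check what the mask does, here is a small standalone sketch of one causal attention step with random tensors in place of the learned projections. The sizes are toy values, not the notebook's hyperparameters.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
b, t, h = 1, 4, 8  # batch, sequence length, head size (toy values)
q = torch.randn(b, t, h)
k = torch.randn(b, t, h)
v = torch.randn(b, t, h)

# Scaled dot-product scores: (b, t, h) @ (b, h, t) -> (b, t, t)
scores = q @ k.transpose(-2, -1) * h**-0.5

# Lower-triangular mask: position i may only attend to positions <= i.
tril = torch.tril(torch.ones(t, t))
scores = scores.masked_fill(tril == 0, float("-inf"))

# Softmax turns -inf into exactly 0, so future positions get no weight.
weights = F.softmax(scores, dim=-1)
out = weights @ v  # (b, t, t) @ (b, t, h) -> (b, t, h)

print(weights[0])
print(out.shape)
```

The printed weight matrix is lower triangular and every row sums to 1; the first position can only attend to itself, so its output is just its own value vector.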

In [16]:
class MultiheadAttention(nn.Module):

    def __init__(self, n_heads, head_size, embedding_dim, dropout=0.1):
        super().__init__()
        # ModuleList registers the heads as submodules; the heads are independent
        # of each other and are evaluated one after another in forward.
        self.heads = nn.ModuleList(
            [AttentionHead(embedding_dim, head_size, dropout) for _ in range(n_heads)]
        )
        self.proj = nn.Linear(n_heads * head_size, embedding_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # concat along the last dimension, which is the feature dimension
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        return self.dropout(self.proj(out))
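
A quick shape check for the concatenation step, as a sketch: random tensors stand in for the per-head outputs, and `embedding_dim` and `n_heads` use the values from the hyperparameter cell.

```python
import torch
import torch.nn as nn

b, t = 2, 5                      # toy batch size and sequence length
embedding_dim, n_heads = 256, 4  # values from the hyperparameters
head_size = embedding_dim // n_heads

# Stand-ins for the per-head outputs (assumption: random values, not real attention).
head_outputs = [torch.randn(b, t, head_size) for _ in range(n_heads)]

# Concatenate along the feature dimension: 4 x (b, t, 64) -> (b, t, 256)
concat = torch.cat(head_outputs, dim=-1)

# The projection mixes information across heads and keeps the width.
proj = nn.Linear(n_heads * head_size, embedding_dim)
mixed = proj(concat)

print(concat.shape, mixed.shape)
```

Because `head_size = embedding_dim // n_heads`, the concatenated width matches `embedding_dim` again, which is what lets the residual connection in the decoder block add the result back to its input.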

In [17]:
class FeedForward(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, dropout):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embedding_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, embedding_dim),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

In [18]:
class DecoderBlock(nn.Module):
    def __init__(
        self, embedding_dim: int, n_heads: int, head_size: int, dropout: float
    ) -> None:
        super().__init__()
        # The head dimension of multi-head attention is calculated as:
        # head_dim = embedding_dim // n_heads
        # https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html
        # self.multihead_attention = nn.MultiheadAttention(embedding_dim, n_heads)
        self.multihead_attention = MultiheadAttention(
            n_heads, head_size, embedding_dim, dropout=dropout
        )
        # self.ff = nn.Sequential(
        #     nn.Linear(embedding_dim, n_layers * embedding_dim),
        #     nn.ReLU(),
        #     nn.Linear(n_layers * embedding_dim, embedding_dim),
        #     nn.Dropout(dropout),
        # )
        self.ff = FeedForward(embedding_dim, 4 * embedding_dim, dropout)
        self.norm1 = nn.LayerNorm(embedding_dim)
        self.norm2 = nn.LayerNorm(embedding_dim)

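    def forward(self, x):
        # attention sub-layer: residual connection, then layer norm (post-norm)
        y = self.multihead_attention(x)
        x = self.norm1(x + y)
        # feed-forward sub-layer: residual connection, then layer norm
        y = self.ff(x)
        x = self.norm2(x + y)
        return x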

In [19]:
class GPTLanguageModel(nn.Module):
    def __init__(
        self,
        vocab_size: int,
        embedding_dim: int,
        block_size: int,
        n_heads=4,
        n_layers=4,
        dropout=0.1,
    ):
        super().__init__()
        self.head_size = embedding_dim // n_heads
        self.char_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # from the architecture (refer to architecture image):
        # first, positional information. The original Transformer uses fixed
        # sinusoidal encodings:
        # PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
        # PE(pos, 2i + 1) = cos(pos / 10000^(2i/d_model))
        # where pos is the position, i indexes pairs of feature dimensions,
        # and d_model is the embedding dimension
        # example: for every character position in "hello", even feature indices
        # use "sin" and odd feature indices use "cos"
        # so, we need to create a matrix of shape (block_size, embedding_dim)
        # that holds one vector per position.
        # GPT-style models learn this matrix as an embedding table instead of
        # using the fixed sinusoidal formula
        self.positional_encodings = nn.Embedding(block_size, embedding_dim)

        # Second, define the decoder layers
        # n_layers decoder blocks are stacked (4 with the hyperparameters above)
        self.decoders = nn.Sequential(
            *[
                DecoderBlock(embedding_dim, n_heads, self.head_size, dropout)
                for _ in range(n_layers)
            ]
        )

        # Third, final layer normalization
        self.linear_f = nn.LayerNorm(embedding_dim)
        # Fourth, project to one logit per vocabulary entry
        # (softmax is applied later, in the loss and in generate)
        self.lm_head = nn.Linear(embedding_dim, vocab_size)

        self.apply(self._ini_weights)

    # std=0.02 is used in the original implementation
    # and is a very common value for initializing weights:
    # the small standard deviation keeps the initial weights close to zero, without outliers
    def _ini_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    # the dimensions are:
    # (B) batch size, (T) sequence length (at most block_size), (C) embedding_dim
    def forward(self, index, targets=None):
        b, t = index.shape
        tok_emb = self.char_embeddings(index)  # (B, T, C)
        pos_emb = self.positional_encodings(torch.arange(t, device=index.device))  # (T, C)
        x = tok_emb + pos_emb  # (B, T, C)
        x = self.decoders(x)  # (B, T, C)
        x = self.linear_f(x)  # (B, T, C)
        logits = self.lm_head(x)  # (B, T, vocab_size)

        if targets is None:
            return logits, None

        b, t, c = logits.shape
        logits = logits.view(b * t, c)
        targets = targets.view(b * t)
        loss = F.cross_entropy(logits, targets)
        return logits, loss

    def generate(self, index, max_new_tokens: int):
        for _ in range(max_new_tokens):
            # crop the context to the last block_size tokens: the positional table
            # only covers block_size positions, so longer inputs would fail
            index_cond = index[:, -block_size:]  # becomes (batch_size, <= block_size)
            logits, _ = self.forward(index_cond)
            logits = logits[:, -1, :]  # becomes (batch_size, n_classes)
            probs = F.softmax(logits, dim=-1)
            next_token = torch.multinomial(
                probs, num_samples=1
            )  # becomes (batch_size, 1)
            index = torch.cat(
                (index, next_token), dim=1
            )  # becomes (batch_size, time_dim + 1)
        return index


model = GPTLanguageModel(
    vocab_size=vocab_size,
    embedding_dim=embedding_dim,
    block_size=block_size,
    n_heads=n_heads,
    n_layers=n_layers,
    dropout=dropout,
)

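# load a previously trained model; this replaces the freshly initialized one above
# (unpickling needs the class definitions from the cells above, and the file
# must already exist, so skip this step on the first run)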
with open("models/gpt_v1.pkl", "rb") as f:
    model = pickle.load(f)

model = model.to(device)
model

GPTLanguageModel(
  (char_embeddings): Embedding(28477, 256)
  (positional_encodings): Embedding(32, 256)
  (decoders): Sequential(
    (0): DecoderBlock(
      (multihead_attention): MultiheadAttention(
        (heads): ModuleList(
          (0-3): 4 x AttentionHead(
            (k): Linear(in_features=256, out_features=64, bias=False)
            (q): Linear(in_features=256, out_features=64, bias=False)
            (v): Linear(in_features=256, out_features=64, bias=False)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (proj): Linear(in_features=256, out_features=256, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (ff): FeedForward(
        (net): Sequential(
          (0): Linear(in_features=256, out_features=1024, bias=True)
          (1): ReLU()
          (2): Linear(in_features=1024, out_features=256, bias=True)
          (3): Dropout(p=0.1, inplace=False)
        )
      )
      (norm1): LayerNorm((256,), eps=1e-05, e
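
The comments in `GPTLanguageModel` quote the sinusoidal formula, while the model itself uses a learned `nn.Embedding` table for positions. For comparison, here is a sketch of the fixed sinusoidal table; `sinusoidal_encoding` is an illustrative helper and is not used anywhere in this notebook.

```python
import math

import torch


def sinusoidal_encoding(n_positions: int, d_model: int) -> torch.Tensor:
    """Fixed positional table of shape (n_positions, d_model); d_model must be even."""
    pos = torch.arange(n_positions, dtype=torch.float32).unsqueeze(1)  # (n_positions, 1)
    # 1 / 10000^(2i / d_model) for i = 0 .. d_model/2 - 1
    inv_freq = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32)
        * (-math.log(10000.0) / d_model)
    )
    table = torch.zeros(n_positions, d_model)
    table[:, 0::2] = torch.sin(pos * inv_freq)  # even feature indices
    table[:, 1::2] = torch.cos(pos * inv_freq)  # odd feature indices
    return table


pe = sinusoidal_encoding(32, 256)  # block_size and embedding_dim from above
print(pe.shape)
print(pe[0, :4])  # position 0: sin(0) = 0 and cos(0) = 1 alternate
```

Sine and cosine alternate across the feature indices of every position, not across characters. Both variants produce a `(block_size, embedding_dim)` matrix that is added to the character embeddings.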

In [20]:
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

In [21]:
for i in range(epochs):
    input_batch, target_batch = get_batch("data/train_data_1.txt", block_size, batch_size)
    # forward pass
    logits, loss = model(input_batch, target_batch)
    # backward pass
    # gradients accumulate across backward passes, so we reset them before each step
    # set_to_none=True frees the gradient tensors instead of zero-filling them, which saves memory
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    if i % eval_interval == 0:
        out = evaluate_loss(model, block_size)
        print(f"iteration {i}, train loss: {out['train']}, val loss: {out['val']}")

out = evaluate_loss(model, block_size)
print(f"final training loss, train loss: {out['train']}, val loss: {out['val']}")

iteration 0, train loss: 1.9156895875930786, val loss: 1.9258555173873901
iteration 300, train loss: 1.8608574867248535, val loss: 1.827478051185608
iteration 600, train loss: 1.8513919115066528, val loss: 1.7979546785354614
iteration 900, train loss: 1.8045518398284912, val loss: 1.8094267845153809
iteration 1200, train loss: 1.779594898223877, val loss: 1.8018397092819214
iteration 1500, train loss: 1.8209081888198853, val loss: 1.7878472805023193
iteration 1800, train loss: 1.7546147108078003, val loss: 1.768484354019165
final training loss, train loss: 1.7540913820266724, val loss: 1.7346044778823853
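
To put these numbers in context, a small worked calculation helps. It assumes the loss is the mean cross-entropy in nats (the default of `F.cross_entropy`) and uses the vocabulary size and the final "val loss" printed above.

```python
import math

vocab_size = 28477       # from the vocabulary cell
final_val_loss = 1.7346  # last "val loss" in the log above, rounded

# Cross-entropy of a uniform guess over the vocabulary, in nats.
uniform_loss = math.log(vocab_size)

# Perplexity: the effective number of equally likely next characters.
perplexity = math.exp(final_val_loss)

# The same loss expressed in bits per character.
bits_per_char = final_val_loss / math.log(2)

print(f"uniform-guess loss: {uniform_loss:.2f}")
print(f"perplexity:         {perplexity:.2f}")
print(f"bits per character: {bits_per_char:.2f}")
```

A uniform guess would score a bit above 10 nats, so a loss near 1.73 means the model has narrowed the next character down to roughly five or six equally likely candidates.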


### Save model
-----
We save it as a pickle file.

In [22]:
with open("models/gpt_v1.pkl", "wb") as f:
    pickle.dump(model, f)
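
Pickling the whole module ties the file to the exact class definitions in this notebook. A common alternative, which this notebook does not use, is to save only the weights with `state_dict`. A minimal sketch with a tiny stand-in module and an in-memory buffer:

```python
import io

import torch
import torch.nn as nn

# Stand-in module (assumption: replaces GPTLanguageModel to keep the sketch short).
toy_model = nn.Linear(4, 2)

# Save only the parameter tensors, not the Python object.
# A file path (for example a hypothetical "models/gpt_v1.pt") works the same way.
buffer = io.BytesIO()
torch.save(toy_model.state_dict(), buffer)

# Restore: build the architecture first, then load the weights into it.
buffer.seek(0)
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load(buffer, weights_only=True))

print(torch.equal(toy_model.weight, restored.weight))
```

With this approach the model has to be constructed with the same hyperparameters before the weights are loaded.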

## Chat model
-----

In [23]:
context = torch.tensor(
    encode("hello, how are "), dtype=torch.long, device=device
).unsqueeze(0)
out = decode(model.generate(context, max_new_tokens=150)[0].tolist())
out

'hello, how are evate propment around a cold chan-music surings uponey who while is blike officing mistic, the Concley’s some annought the rotown you’ve disSiumacans '

In [24]:
out = decode(model.generate(context, max_new_tokens=32)[0].tolist())
out

'hello, how are find bunking follooms are one bl'
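
The sampling loop in `generate` can be tried without the trained model. In this sketch a stand-in function returns random logits, so the output is meaningless; the point is the cropping of the context and the sampling step. `toy_logits`, `sample`, `toy_vocab` and `toy_block` are illustrative names.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
toy_vocab, toy_block = 10, 8  # toy sizes (assumption: not the notebook's values)


def toy_logits(context: torch.Tensor) -> torch.Tensor:
    """Stand-in for the model: random logits of shape (batch, time, vocab)."""
    assert context.shape[1] <= toy_block, "context longer than the position table"
    return torch.randn(context.shape[0], context.shape[1], toy_vocab)


@torch.no_grad()
def sample(context: torch.Tensor, max_new_tokens: int) -> torch.Tensor:
    for _ in range(max_new_tokens):
        cropped = context[:, -toy_block:]       # keep only the last toy_block ids
        logits = toy_logits(cropped)[:, -1, :]  # logits for the next id only
        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        context = torch.cat((context, next_id), dim=1)
    return context


start = torch.zeros((1, 3), dtype=torch.long)
result = sample(start, max_new_tokens=20)
print(result.shape)  # torch.Size([1, 23])
```

Since each step only sees the last `block_size` characters (32 in this notebook), the model cannot refer back to anything earlier in a long generation.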

### Load dataset
------------
This is how I downloaded the ```openwebtext``` dataset. You don't need to save it to disk; you can work with it directly through ```datasets```. There is an example in ```1_bigram.ipynb```.

In [None]:
# Load the openwebtext dataset
dataset = load_dataset("Skylion007/openwebtext")

In [None]:
len(dataset)

In [None]:
# Define file paths to save the datasets
train_output_file = "data/openwebtext_train.txt"

# Save the train dataset to a text file
with open(train_output_file, "w", encoding="utf-8") as f:
    for sample in dataset["train"]:
        _ = f.write(sample["text"] + "\n")

In [None]:
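# Access the train split
train_dataset = dataset["train"]
# NOTE: this dataset may expose only a "train" split; check dataset.keys()
# before relying on a "validation" split
val_dataset = dataset["validation"] if "validation" in dataset else None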

In [None]:
# Example: Print the first sample from the train dataset
print(dataset["train"][0])
