# Building GPT

> Recreating [nanoGPT](https://github.com/karpathy/nanoGPT) from Andrej Karpathy

<font color="purple">We'll train a character-level GPT on the works of Shakespeare.</font>

## 1 - Shakespeare Dataset

<hr>

Let's download the tiny shakespeare dataset. It is a ~1 MB file containing 1,115,394 characters.

In [3]:
import os
import requests

url = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'
os.makedirs('data', exist_ok=True)  # make sure the target directory exists
response = requests.get(url)
if response.status_code == 200:
    with open('data/tinyshakespeare.txt', 'wb') as file:
        file.write(response.content)
    print("File downloaded successfully.")
else:
    print("Failed to download the file. Status code:", response.status_code)

File downloaded successfully.


In [4]:
# read the text to inspect it
with open('data/tinyshakespeare.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [5]:
print("length of dataset in characters: ", len(text))

length of dataset in characters:  1115394


In [9]:
# let's look at the first 500 characters
print(text[:500])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor


### 1.1 - Unique Characters

Extract the unique characters in the dataset.
- `set(text)` will create a set of the characters that occur in the text
- `list(.)` will create an arbitrary ordering from the set
- `sorted(.)` will create a sorted ordering from the list
- `vocab_size` is the number of all the characters

In [11]:
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars)) # note the newline and space characters at the start
print("Vocabulary size:", vocab_size)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
Vocabulary size: 65


## 2 - Tokenization

<hr>

A tokenizer functions as an encoder-decoder system, translating human-readable text into a format understandable by machines, and vice versa. Its primary role is to break down text into manageable pieces, known as tokens, which can represent individual characters, parts of words (sub-words), or whole words.

### 2.1 - Character-level vs. Sub-word Tokenization

- **Character-level Encoding:** This approach encodes each character of the text as a unique integer. Since we're building a character-level language model, we'll use this tokenizer.

- **Sub-word Tokenization:** Advanced tokenizers like OpenAI's `tiktoken` and Google's `SentencePiece` operate at the sub-word level. This approach strikes a balance: tokens are smaller than whole words but larger than individual characters.

For instance, the phrase "hii there" (9 characters) could be encoded into:
- 9 integers using character-level encoding
- 2 integers using word-level encoding
- 3 integers, for instance, using sub-word level encoding

```python
import tiktoken
enc = tiktoken.get_encoding('gpt2')
print(enc.n_vocab)
# 50257
print(enc.encode("hii there"))
# [71, 4178, 612]
print(enc.decode([71, 4178, 612]))
# 'hii there'
```

Notice that `tiktoken` uses a vocabulary size of 50257 instead of 65 (in our case). As a result, it encodes the string "hii there" with just 3 integers.

In essence, you can have:
- A very long sequence of integers with a small vocabulary
- A very short sequence of integers with a large vocabulary
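
The trade-off can be seen without any external library. The sketch below is a minimal illustration, assuming a toy vocabulary built from the sample string itself and a naive whitespace split as a stand-in for word-level tokenization (this is not how `tiktoken` works internally):

```python
# Illustrative sketch: sequence length vs. vocabulary size for one sample string.
# Both vocabularies are built from the sample itself (an assumption for brevity).
sample = "hii there"

# Character-level: one token per character, tiny vocabulary.
char_vocab = sorted(set(sample))
char_tokens = [char_vocab.index(c) for c in sample]

# Word-level (naive whitespace split): one token per word, but a real corpus
# would need a vocabulary entry for every distinct word.
word_vocab = sorted(set(sample.split()))
word_tokens = [word_vocab.index(w) for w in sample.split()]

print(len(char_tokens), len(char_vocab))  # 9 tokens from a 6-symbol vocabulary
print(len(word_tokens), len(word_vocab))  # 2 tokens from a 2-word vocabulary
```

Sub-word tokenizers sit between these two extremes.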

In [12]:
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s]          # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

print(encode("hii there"))
print(decode(encode("hii there")))

[46, 47, 47, 1, 58, 46, 43, 56, 43]
hii there


### 2.2 - Encoding Data into a Torch Tensor

Now that we have our character-level tokenizer, let's encode the entire Shakespeare dataset into a `torch.Tensor`.

In [13]:
import torch

data = torch.tensor(encode(text), dtype=torch.long)

In [14]:
print(data.shape, data.dtype)
print(data[:500]) # the 500 characters we looked at earlier will look like this to the GPT

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59,  1, 49, 52, 53, 61,  1, 15, 39, 47, 59, 57,  1, 25, 39, 56, 41,
      

## 3 - Train / Val Split & Data Loader

<hr>

Split the dataset into 90/10 train/val sets.

In [15]:
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]
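
As a quick sanity check on the split sizes, the arithmetic can be worked out by hand. This is a small sketch that only assumes the dataset length printed earlier (1,115,394 characters); the names `total`, `n_train`, and `n_val` are illustrative:

```python
# Worked arithmetic for the 90/10 split, using the dataset length printed earlier.
total = 1115394              # characters in the tiny shakespeare dataset
n_train = int(0.9 * total)   # int() truncates toward zero
n_val = total - n_train

print(n_train, n_val)        # 1003854 111540
```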

### 3.1 - Data Loader

To train transformer models like nanoGPT efficiently, we don't feed the entire dataset into the model at once, since that would be computationally prohibitive. Instead, the text is broken into manageable chunks.

#### 3.1.1 - Block Size / Context Length

- The `block_size` or `context_length` determines the maximum length of these data chunks.
- When processing a chunk, the model is trained to predict the next character (or token) at every position. For example, with `block_size = 8`, a chunk of 9 tokens yields 8 training pairs, each context `x` paired with the token `y` that follows it.

#### 3.1.2 - Dimensions in Training

- **Time Dimension:** This refers to the sequence of tokens fed into the model, reflecting the linear progression of text.
- **Batch Dimension:** Training data is also organized in batches. This dimension allows multiple chunks of data to be processed simultaneously, enhancing the efficiency of the training process.

In [17]:
block_size = 8
train_data[:block_size+1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

A chunk of 9 characters packs in 8 individual examples. For instance, given this chunk:

```python
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])
```

- in the context of [18], 47 comes next.
- in the context of [18, 47], 56 comes next.
- in the context of [18, 47, 56], 57 comes next, and so on.

Why do we do this?

- This approach ensures the transformer is exposed to contexts ranging from very small (a single integer) to the full length of the `block_size`, allowing it to learn and understand text in varying lengths effectively.
- The transformer never receives more than `block_size` tokens at once. When a context grows beyond that, for example while generating text, it has to be cropped to the most recent `block_size` tokens before being fed to the model. This bounds the compute per prediction.
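
One way to picture how an over-long context is handled: a minimal sketch, assuming the excess is dropped from the start so that the most recent `block_size` tokens are kept. `crop_context` is a hypothetical helper name, not part of nanoGPT or PyTorch, and plain lists stand in for tensors:

```python
# Sketch: keep only the most recent block_size tokens of a longer context.
# crop_context is a hypothetical helper used purely for illustration.
block_size = 8

def crop_context(tokens, block_size):
    """Return the last block_size tokens (or all of them if there are fewer)."""
    return tokens[-block_size:]

context = [18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 64]  # 11 tokens, too long
print(crop_context(context, block_size))    # [57, 58, 1, 15, 47, 58, 47, 64]
print(crop_context([18, 47], block_size))   # [18, 47]
```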

In [18]:
x = train_data[:block_size]      # inputs to the transformer: first block_size characters
y = train_data[1:block_size+1]   # targets for each input position: off-set by 1

for t in range(block_size):
    context = x[:t+1]            # all chars up to and including t
    target = y[t]                # t-th char in the y array
    print(f"when input is {context} the target: {target}")

when input is tensor([18]) the target: 47
when input is tensor([18, 47]) the target: 56
when input is tensor([18, 47, 56]) the target: 57
when input is tensor([18, 47, 56, 57]) the target: 58
when input is tensor([18, 47, 56, 57, 58]) the target: 1
when input is tensor([18, 47, 56, 57, 58,  1]) the target: 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]) the target: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target: 58


### 3.2 - Data Batches

- `batch_size = 4` specifies that four sequences (or chunks of data) will be processed in parallel during each training iteration. It enhances efficiency by utilizing the GPU's ability to handle parallel computations.
- `block_size = 8` indicates the maximum length of context the model will consider for making predictions. Each sequence in a batch will have up to 8 elements (e.g., characters or tokens).

- `get_batch(split)` creates a batch of inputs $(x)$ and targets $(y)$ for training or validation. It picks `batch_size` random starting points $(ix)$ in the chosen split, extracts a sequence of length `block_size` from each for $x$, and uses the same sequence shifted one position to the right as the targets $y$.

In [19]:
torch.manual_seed(1337)
batch_size = 4            # how many independent sequences will we process in parallel?
block_size = 8            # what is the maximum context length for predictions?

def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

In [20]:
xb, yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb)
print('targets:')
print(yb.shape)
print(yb)

print('----')

for b in range(batch_size):     # batch dimension
    for t in range(block_size): # time dimension
        context = xb[b, :t+1]
        target = yb[b,t]
        print(f"when input is {context.tolist()} the target: {target}")

inputs:
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
targets:
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
----
when input is [24] the target: 43
when input is [24, 43] the target: 58
when input is [24, 43, 58] the target: 5
when input is [24, 43, 58, 5] the target: 57
when input is [24, 43, 58, 5, 57] the target: 1
when input is [24, 43, 58, 5, 57, 1] the target: 46
when input is [24, 43, 58, 5, 57, 1, 46] the target: 43
when input is [24, 43, 58, 5, 57, 1, 46, 43] the target: 39
when input is [44] the target: 53
when input is [44, 53] the target: 56
when input is [44, 53, 56] the target: 1
when input is [44, 53, 56, 1] the target: 58
when input is [44, 53, 56, 1, 58] the target: 46
when input is [44, 53

Above, `xb` is a single batch of 4 sequences of 8 tokens each, which together contain 4 × 8 = 32 independent training examples sampled from the training dataset, and `yb` holds the corresponding targets (used for the loss computation later on). This batch `xb` will be fed into the transformer, which processes all 32 examples simultaneously.
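
To make the count of 32 concrete, the batch can be unrolled into explicit (context, target) pairs. This is a minimal sketch that copies the values of `xb` and `yb` printed above into plain Python lists (standing in for tensors); the `examples` name is illustrative:

```python
# Sketch: unroll a (batch, time) pair of arrays into individual training examples.
# Values are copied from the printed xb / yb above; lists stand in for tensors.
xb = [[24, 43, 58, 5, 57, 1, 46, 43],
      [44, 53, 56, 1, 58, 46, 39, 58],
      [52, 58, 1, 58, 46, 39, 58, 1],
      [25, 17, 27, 10, 0, 21, 1, 54]]
yb = [[43, 58, 5, 57, 1, 46, 43, 39],
      [53, 56, 1, 58, 46, 39, 58, 1],
      [58, 1, 58, 46, 39, 58, 1, 46],
      [17, 27, 10, 0, 21, 1, 54, 39]]

# One example per (row, time step): the context is everything up to and
# including position t, and the target is the token that follows it.
examples = [(row_x[:t + 1], row_y[t])
            for row_x, row_y in zip(xb, yb)
            for t in range(len(row_x))]

print(len(examples))   # 32
print(examples[0])     # ([24], 43)
```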