# Let's Build a GPT


This is a companion notebook to the video [Let's build GPT: from scratch, in code, spelled out](https://www.youtube.com/watch?v=kCc8FmEb1nY) by Andrej Karpathy.

This notebook also provides an in-depth introduction to LLMs. You can run this notebook locally, on [Colab](https://colab.research.google.com/), or on your preferred cloud service.


## Goal: Make a computer program that writes like Shakespeare

Dataset: [tinyshakespeare](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt). This dataset is a single .txt file containing roughly 1.1 million characters of Shakespeare's works.

Our goal is to write a program that generates sequences of characters in Shakespeare's style. Given a sequence of characters, a Transformer neural network predicts the most likely next character.


### Loading the data

We're going to read the file `dataset/tinyshakespeare.txt`, which will serve as our dataset.
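If the file is not on disk yet, it can be fetched first. The sketch below is one possible way to do that, using only the standard library. The helper name `ensure_dataset` is hypothetical; the URL is the dataset link given above, and the local path is the one the notebook reads from.

```python
import os
import urllib.request

# URL from the dataset link above; the local path matches the one the notebook opens.
DATA_URL = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"


def ensure_dataset(path="dataset/tinyshakespeare.txt", url=DATA_URL):
    """Download the dataset to `path` unless something already exists there (hypothetical helper)."""
    if not os.path.exists(path):
        # Create the parent folder (e.g. `dataset/`) before downloading into it.
        os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
        urllib.request.urlretrieve(url, path)
    return path


# Usage (needs network access the first time it runs):
# ensure_dataset()
```

Because the helper skips the download when the file already exists, it is safe to run at the top of the notebook every time.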

In [27]:
with open('dataset/tinyshakespeare.txt', 'r') as file:
    content = file.read()

len(content)

print(content[:100])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You



### Preparing Character Vocabulary

When training character-based models, each character serves as a "token" (the smallest unit the model deals with). We need to know all possible tokens to convert characters to integers (and vice versa).


The code below collects every unique character in the text and counts them. This vocabulary is essential for training a GPT, because it defines every token the model can read and predict.

In [28]:
# text is a series of characters in python
chars = sorted(list(set(content)))  # gets all unique characters in the dataset sorted
vocab_size = len(chars)  # possible elements of the sequence
print("".join(chars))
print(vocab_size)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65


### Tokenizing characters

Large language models such as GPT, LLaMA, and PaLM work in terms of tokens. They take text, convert it into tokens (integers), then predict which tokens should come next.

So, **tokenization** is the process of breaking down a piece of text into tokens that a model can understand. By understanding the statistical relationships between these tokens, models can predict the next token in a sequence of tokens.
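To make "statistical relationships between tokens" concrete, here is a deliberately simple stand-in: counting which character most often follows each character. This is not how a GPT works internally; the sample string and the helper `most_likely_next` are assumptions made for illustration only.

```python
from collections import Counter, defaultdict

# A tiny sample text (an assumption for this sketch, not the real dataset).
text = "speak, speak. hear me speak."

# For every character, count which characters follow it.
following = defaultdict(Counter)
for current, nxt in zip(text, text[1:]):
    following[current][nxt] += 1


def most_likely_next(ch):
    """Return the character that most often followed `ch` in `text`."""
    return following[ch].most_common(1)[0][0]


print(most_likely_next("s"))  # p
print(most_likely_next("a"))  # k
```

A GPT learns far richer relationships than these pair counts, but the objective has the same shape: given what came before, score what comes next.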

A few well-known tokenizers:

- [google/sentencepiece](https://github.com/google/sentencepiece)
- [openai/tiktoken](https://github.com/openai/tiktoken)


In [30]:
# this code creates a character-level tokenizer. i.e. converts raw text to a sequence of integers

char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for i, ch in enumerate(chars)}

encode = lambda s: [char_to_idx[c] for c in s]  # encoder: takes a string, outputs a list of integers
decode = lambda l: ''.join([idx_to_char[i] for i in l])  # decoder: takes a list of integers, outputs a string

print(encode('hello, world'))
print(decode(encode('hello, world')))

[46, 43, 50, 50, 53, 6, 1, 61, 53, 56, 50, 42]
hello, world
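Two properties of a character-level tokenizer are worth checking: encoding followed by decoding returns the original string, and any character outside the vocabulary cannot be encoded. The sketch below rebuilds a small tokenizer from a sample string (an assumption, to keep it self-contained) and uses `def` in place of `lambda`.

```python
# Build a small vocabulary from a sample string (assumption: the real notebook uses the full dataset).
sample = "hello, world"
chars = sorted(set(sample))
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for i, ch in enumerate(chars)}


def encode(s):
    """Convert a string into a list of integer token ids."""
    return [char_to_idx[c] for c in s]


def decode(ids):
    """Convert a list of integer token ids back into a string."""
    return "".join(idx_to_char[i] for i in ids)


# Round trip: decoding an encoding gives back the original text.
assert decode(encode(sample)) == sample

# A character that never appeared in the vocabulary raises a KeyError.
try:
    encode("hello!")
except KeyError as err:
    print("character not in vocabulary:", err)
```

With the full dataset vocabulary the same limitation applies: text containing characters that never occur in the training file cannot be encoded without extending the vocabulary.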


### Storing the tokens


First, we created a tokenizer to convert text into a sequence of integers (tokens). Now, rather than keeping these integers in a regular Python list, we store them in a [Tensor](https://pytorch.org/docs/stable/tensors.html). A Tensor is similar to a [NumPy ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html): a generic n-dimensional array that can be used for arbitrary numeric computations. Unlike an ndarray, a PyTorch Tensor can also live on a GPU and track gradients.

Think of a tensor as a specialized Python list that the model prefers because operations on it run much faster and more efficiently.

At the simplest level:

- A 0-dimensional tensor is just a number (also called a scalar).
- A 1-dimensional tensor is similar to a list of numbers.
- A 2-dimensional tensor is similar to a table (or matrix) of numbers.
- A 3-dimensional tensor can be visualized as a cube of numbers.

The same pattern continues for higher dimensions.
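The dimensions listed above can be illustrated with plain nested Python lists. The helper `nested_shape` is hypothetical and assumes regular (non-ragged) nesting; PyTorch reports the same information through `Tensor.shape`.

```python
def nested_shape(value):
    """Return the shape of a regular nested list as a tuple (hypothetical helper)."""
    shape = []
    while isinstance(value, list):
        shape.append(len(value))
        if not value:  # an empty list has no first element to descend into
            break
        value = value[0]
    return tuple(shape)


scalar = 7                                       # 0 dimensions
vector = [1, 2, 3]                               # 1 dimension
matrix = [[1, 2, 3], [4, 5, 6]]                  # 2 dimensions
cube = [[[0, 1], [2, 3]], [[4, 5], [6, 7]]]      # 3 dimensions

for name, value in [("scalar", scalar), ("vector", vector), ("matrix", matrix), ("cube", cube)]:
    print(name, nested_shape(value))
```

The length of the shape tuple is the number of dimensions, and each entry is the size along that dimension.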

In [31]:
# now let's encode the entire dataset and store it into a tensor
import torch # we use PyTorch: https://pytorch.org

# by definition, a Tensor is a multi-dimensional matrix that contains elements of a single data type.

# encode the text into ints, and then store the ints in a tensor.
# a tensor can be a 1D box (like a line), a 2D box (like a grid), a 3D box (like a cube), or more.
data = torch.tensor(encode(content), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:100]) # this shows what the first 100 chars in the dataset look like to a GPT.

# the entire dataset is now represented as a 1-D tensor of shape (num_tokens,)

torch.Size([1115393]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59])


### Monitoring Overfitting: splitting the dataset into training and validation sets

By dividing our data into training and validation sets, we can gauge if our model is overfitting. Overfitting happens when the model gets too tuned to the training data and struggles with new, unseen data.

Our goal isn't for the neural network to merely memorize Shakespeare's works. Instead, we want it to generate fresh new text that still feels Shakespearean.

In [32]:
# split the dataset into training and validation sets

n = int(0.9 * len(data))  # 90% of the data for training
train_data = data[:n] # first 90% of the data for training
val_data = data[n:] # last 10% of the data for validation
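As a quick sanity check of the split, the sketch below applies the same arithmetic to a stand-in sequence with the dataset's token count (1,115,393, as printed earlier). Using a `range` in place of the real tensor is an assumption made to keep the example free of dependencies.

```python
num_tokens = 1115393          # dataset length printed by the notebook
tokens = range(num_tokens)    # stand-in for the real token tensor

n = int(0.9 * len(tokens))    # index where the validation set begins
train, val = tokens[:n], tokens[n:]

print(len(train), len(val))   # 1003853 111540

# The two parts cover the data exactly once, with no overlap and no gap.
assert len(train) + len(val) == num_tokens
assert train[-1] + 1 == val[0]
```

Because the split is positional, the validation set is the last 10% of the text rather than a random sample of it.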




### Training the model

Next, we'll feed the sequence of tokens into the GPT model. By training it, the model will pick up on patterns and learn to anticipate the following token in the sequence.


https://colab.research.google.com/drive/1JMLa53HDuA-i7ZBmqV7ZnA3c_fvtXnx-?usp=sharing#scrollTo=YJb0OXPwzvqg

In [37]:

block_size = 8
train_data[:block_size+1] # block_size + 1 tokens give block_size training examples, one for every position

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])


#### Sequence prediction using the "sliding window" technique

At a high level, the following code prepares training examples that teach the model to predict the next token in a sequence from the tokens that came before it. The examples are built with a "sliding window" approach, which, in this example, starts with a single token of context and grows to include more tokens from the sequence.

Imagine a series of tokens as a horizontal line of blocks, like this:

```python
tensor([1, 2, 3, 4, 5, 6, 7, 8])
```

The window starts over the first token and grows by one token at a time. Every context needs a following token as its target, so these 8 tokens yield 7 training examples; getting `block_size = 8` examples takes `block_size + 1 = 9` tokens:

```python
tensor([1]) # next token (or target): tensor(2)
tensor([1, 2]) # next token: tensor(3)
tensor([1, 2, 3]) # next token: tensor(4)
tensor([1, 2, 3, 4]) # next token: tensor(5)
tensor([1, 2, 3, 4, 5]) # next token: tensor(6)
tensor([1, 2, 3, 4, 5, 6]) # next token: tensor(7)
tensor([1, 2, 3, 4, 5, 6, 7]) # next token: tensor(8)
tensor([1, 2, 3, 4, 5, 6, 7, 8]) # no next token or target since we're at the end of the tensor.

```

Parameters:
- `block_size`: The maximum context length. For instance, if `block_size` is 8, contexts of 1 to 8 tokens are processed.
- `train_data`: A Tensor of tokens representing the data used for training.

Variables:
- `x`: The first `block_size` tokens of the training data. Every context is a prefix of `x`.
- `y`: The targets, which are `x` shifted one position to the right. `y[i]` is the token that follows the context `x[:i+1]`.
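The relationship between `x`, `y`, and the growing context can be written as a small generator over a plain Python list. This is a minimal sketch; the function name `context_target_pairs` is hypothetical and not part of the notebook.

```python
def context_target_pairs(tokens, block_size):
    """Yield (context, target) pairs from the first block_size + 1 tokens."""
    chunk = tokens[:block_size + 1]
    x, y = chunk[:-1], chunk[1:]      # y is x shifted one position to the right
    for i in range(len(x)):
        yield x[: i + 1], y[i]        # context grows by one token each step


pairs = list(context_target_pairs([1, 2, 3, 4, 5, 6, 7, 8], block_size=8))
for context, target in pairs:
    print(context, "->", target)
```

With only 8 tokens available this produces 7 pairs, because the last token has nothing after it to serve as a target. Passing 9 tokens gives the full 8 pairs for `block_size=8`.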

#### Analogy: teaching a kid to guess the next word(s) in a story


Think of this model training process like teaching a kid to guess the next word in a story.


- **Different story lengths (using the sliding window technique)**
Sometimes we tell them just one word from the story, sometimes two words, sometimes three, and so on, up to a certain limit (like eight words). This is like giving the model different lengths of 'stories' to learn from.

- **Why vary lengths?**
This helps the kid (or our model) get comfortable with predicting the next word whether they've heard a short bit of the story or a longer bit. It's like practicing with different levels of difficulty.

- **Practice**
Once the kid has practiced enough, you can give them any short or long part of the story, and they'll try to guess the next word. This is because they've practiced with all different story lengths.

In [40]:
x = train_data[:block_size]
y = train_data[1:block_size+1]

# think of `x` as the question and `y` as the answer. For example, if x is "how are" then y is "are you?"
print(train_data[:block_size+1])

for i in range(block_size):
    context = x[:i+1]
    target = y[i]
    print("when the context is", context, "the target is",target)

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])
when the context is tensor([18]) the target is tensor(47)
when the context is tensor([18, 47]) the target is tensor(56)
when the context is tensor([18, 47, 56]) the target is tensor(57)
when the context is tensor([18, 47, 56, 57]) the target is tensor(58)
when the context is tensor([18, 47, 56, 57, 58]) the target is tensor(1)
when the context is tensor([18, 47, 56, 57, 58,  1]) the target is tensor(15)
when the context is tensor([18, 47, 56, 57, 58,  1, 15]) the target is tensor(47)
when the context is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target is tensor(58)



In this example, we vary how much information the model gets by changing the context length from just 1 token up to `block_size`. We do this so that:

1. The model predicts well from both short and long contexts, i.e., from any context length up to `block_size`.
2. The model learns how tokens relate across different amounts of history, so it can anticipate the next token whether it has seen one token or `block_size` tokens.
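It can help to read the same training examples as characters rather than integers. The sketch below rebuilds the 65-character vocabulary printed earlier (newline, space, punctuation, `3`, then the upper- and lowercase letters) and decodes the first nine tokens of the dataset. Reconstructing the vocabulary from that printout instead of from the file is an assumption made to keep the example self-contained.

```python
import string

# Vocabulary in sorted order, reconstructed from the printed character list.
chars = "\n !$&',-.3:;?" + string.ascii_uppercase + string.ascii_lowercase
assert len(chars) == 65
idx_to_char = dict(enumerate(chars))


def decode(ids):
    """Convert a list of integer token ids back into a string."""
    return "".join(idx_to_char[i] for i in ids)


block_size = 8
chunk = [18, 47, 56, 57, 58, 1, 15, 47, 58]   # first block_size + 1 tokens of the dataset
x, y = chunk[:block_size], chunk[1:block_size + 1]

# One (context, target) example per position, shown as text.
examples = [(decode(x[: i + 1]), decode([y[i]])) for i in range(block_size)]
for context, target in examples:
    print(repr(context), "->", repr(target))
```

The first example asks the model to continue `'F'` with `'i'`, and the last asks it to continue `'First Ci'` with `'t'`, so a single chunk covers every context length from 1 to `block_size`.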