## Code GPT from Scratch
### by Andrej Karpathy
[![YouTube Icon](https://upload.wikimedia.org/wikipedia/commons/thumb/0/09/YouTube_full-color_icon_%282017%29.svg/20px-YouTube_full-color_icon_%282017%29.svg.png)](https://www.youtube.com/watch?v=kCc8FmEb1nY&t=7s)

#### Table of Contents
1. Import Dataset from Hugging Face    
    Dataset: *Tiny Shakespeare* by Andrej Karpathy: [🤗](https://huggingface.co/datasets/karpathy/tiny_shakespeare)
2. Tokenization

In [1]:
"""
!pip cache purge
!pip install pandas datasets # uninstall pandas first to resolve the conflict
"""

'\n!pip cache purge\n!pip install pandas datasets # uninstall pandas first to resolve the conflict\n'

### 1. Import Dataset from 🤗

In [2]:
from datasets import load_dataset

# Load the dataset
dataset = load_dataset('karpathy/tiny_shakespeare')

# Print the dataset structure
print(type(dataset))



<class 'datasets.dataset_dict.DatasetDict'>


In [3]:
dataset.keys()

dict_keys(['train', 'validation', 'test'])

In [4]:
type(dataset['train'])

datasets.arrow_dataset.Dataset

In [5]:
# Extract a portion of the train text
text_sample = dataset['train'][0]['text']

print("Initial portion of the train text:\n------------------------------------")
print(text_sample[:200]) 

Initial portion of the train text:
------------------------------------
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you


### 2. Tokenization

- The video mentions other, subword-level tokenizer libraries:
    - [SentencePiece](https://github.com/google/sentencepiece) by Google
    - [tiktoken](https://github.com/openai/tiktoken) by OpenAI

In [6]:
# here are all the unique characters that occur in the sets:
def get_unique_characters(split):
    all_chars = set()
    for example in split:
        all_chars.update(example['text'])
    return ''.join(sorted(all_chars))

# Get unique characters for each split efficiently
unique_chars = {split: get_unique_characters(dataset[split]) for split in dataset.keys()}

# Print the unique characters for each split
for split, chars in unique_chars.items():
    print(f"Unique characters in the {split} set:")
    print(chars)
    print(f"Its length: {len(chars)}\n")

Unique characters in the train set:

 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
Its length: 65

Unique characters in the validation set:

 !',-.:;?ABCDEFGHIJKLMNOPQRSTUVWYabcdefghijklmnopqrstuvwxyz
Its length: 60

Unique characters in the test set:

 !',-.:;?ABCDEFGHIJKLMNOPRSTUVWYZabcdefghijklmnopqrstuvwxyz
Its length: 60
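
Because the tokenizer in the next cell is built from the training split only, it is worth confirming that the validation and test splits contain no characters outside the training vocabulary. A minimal sketch of that check, with the three character sets printed above copied in by hand rather than loaded from the dataset (the variable names are illustrative and not part of the notebook):

```python
import string

# Character sets as printed above, rebuilt by hand (each one starts with a newline).
train_chars = "\n !$&',-.3:;?" + string.ascii_uppercase + string.ascii_lowercase
val_chars = "\n !',-.:;?" + "ABCDEFGHIJKLMNOPQRSTUVWY" + string.ascii_lowercase
test_chars = "\n !',-.:;?" + "ABCDEFGHIJKLMNOPRSTUVWYZ" + string.ascii_lowercase

for name, chars in [("validation", val_chars), ("test", test_chars)]:
    unseen = set(chars) - set(train_chars)   # characters a train-only tokenizer could not encode
    absent = set(train_chars) - set(chars)   # training characters that never occur in this split
    print(f"{name}: {len(chars)} chars, unseen in train: {sorted(unseen)}, "
          f"absent here: {sorted(absent)}")
```

If `unseen` is empty for both splits, a mapping built from the training characters alone can encode all three splits.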



In [7]:
# Simple character-level tokenizer built from the training vocabulary:
# st2int maps each character to an integer ID, int2st maps each ID back to its character

st2int = {ch: i for i, ch in enumerate(unique_chars["train"])}
int2st = {i: ch for i, ch in enumerate(unique_chars["train"])}

# create encoder & decoder
encode = lambda sample: [st2int[ch] for ch in sample]
decode = lambda l: ''.join([int2st[i] for i in l])

In [8]:
# test encoder & decoder
print(encode("It is sunny"))
print(decode(encode("It is sunny")))

[21, 58, 1, 47, 57, 1, 57, 59, 52, 52, 63]
It is sunny
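
The IDs above follow directly from the sorted order of the training vocabulary: newline gets 0, space gets 1, and so on. The sketch below rebuilds the 65-character vocabulary by hand (instead of loading the dataset) and replaces the two lambdas with plain functions. The explicit error for unknown characters is an addition for illustration, not part of the notebook:

```python
import string

# The 65 training characters in sorted order, as printed earlier.
vocab = "\n !$&',-.3:;?" + string.ascii_uppercase + string.ascii_lowercase
st2int = {ch: i for i, ch in enumerate(vocab)}
int2st = {i: ch for ch, i in st2int.items()}

def encode(text):
    """Map a string to a list of integer IDs, one per character."""
    unknown = sorted(set(text) - set(st2int))
    if unknown:
        # Assumption for this sketch: fail loudly instead of raising a bare KeyError.
        raise ValueError(f"characters not in vocabulary: {unknown}")
    return [st2int[ch] for ch in text]

def decode(ids):
    """Map a list of integer IDs back to a string."""
    return "".join(int2st[i] for i in ids)

ids = encode("It is sunny")
print(ids)          # same IDs as in the cell above
print(decode(ids))  # round-trips to the original string
```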


In [9]:
# Test different tokenizers
# Using tiktoken
import tiktoken

# Load tiktoken model
tokenizer = tiktoken.get_encoding("gpt2")
print(tokenizer.n_vocab)


50257


In [10]:
# Encode and decode using tiktoken
tiktoken_encoded = tokenizer.encode("It is sunny")
tiktoken_decoded = tokenizer.decode(tiktoken_encoded)

print("tiktoken Encoded:", tiktoken_encoded)
print("tiktoken Decoded:", tiktoken_decoded)

tiktoken Encoded: [1026, 318, 27737]
tiktoken Decoded: It is sunny
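
The two tokenizers sit at opposite ends of a trade-off: the character-level tokenizer has a 65-entry vocabulary but needs 11 tokens for "It is sunny", while the GPT-2 encoding has 50257 entries and needs only 3. Byte-pair encoding, the scheme behind the GPT-2 vocabulary, shortens sequences by merging frequent adjacent pairs into new tokens. The following is a toy sketch of a single merge step over character codes. It illustrates the idea only and is not how tiktoken or SentencePiece is implemented; the function names and the new ID 256 are hypothetical:

```python
from collections import Counter

def most_frequent_pair(ids):
    """Return the adjacent pair of IDs that occurs most often (first one wins ties)."""
    counts = Counter(zip(ids, ids[1:]))
    return max(counts, key=counts.get)

def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

text = "speak, speak."
ids = [ord(ch) for ch in text]        # start from one ID per character
pair = most_frequent_pair(ids)
merged = merge_pair(ids, pair, 256)   # 256 is the first ID above the single-byte range
print(len(ids), "->", len(merged), "tokens after one merge")
```

Each merge adds one entry to the vocabulary and shortens every sequence that contains the pair, which is the vocabulary-size versus sequence-length trade-off in miniature.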


In [11]:
# Let's now encode the entire text dataset and store it in a torch.Tensor
import torch

train_data = torch.tensor(encode(dataset["train"][0]["text"]), dtype=torch.long)
val_data = torch.tensor(encode(dataset["validation"][0]["text"]), dtype=torch.long)
test_data = torch.tensor(encode(dataset["test"][0]["text"]), dtype=torch.long)

In [12]:
print(f"Train data type is {train_data.dtype}. Train data shape is {train_data.shape}")
print(f"Validation data type is {val_data.dtype}. Validation data shape is {val_data.shape}")
print(f"Test data type is {test_data.dtype}. Test data shape is {test_data.shape}")

Train data type is torch.int64. Train data shape is torch.Size([1003854])
Validation data type is torch.int64. Validation data shape is torch.Size([55770])
Test data type is torch.int64. Test data shape is torch.Size([55770])


In [13]:
print(train_data[:100])

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59])


* Note on the train/test ratio: 90% of the data is used for training and 10% is held out (train/test: 90/10).
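
A 90/10 split of a token sequence can be sketched as a single slice at the 90% mark. The snippet below uses a plain list standing in for the encoded text; `split_sequence` and `train_fraction` are hypothetical names. The Hugging Face dataset used in this notebook already ships with its own splits, so a step like this is only needed when starting from one raw text:

```python
def split_sequence(data, train_fraction=0.9):
    """Cut a sequence into a leading training part and a trailing held-out part."""
    n = int(train_fraction * len(data))  # index of the first held-out element
    return data[:n], data[n:]

tokens = list(range(100))                # stand-in for an encoded text
train_part, held_out = split_sequence(tokens)
print(len(train_part), len(held_out))
```

Slicing rather than shuffling keeps each part as contiguous text, which matters because training examples are drawn as consecutive runs of tokens.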

In [14]:
block_size = 8
train_data[:block_size+1] # a chunk of data

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

In [15]:
# Explanation of block size on the chunk of data:
# the target for each context is the token that follows it

x = train_data[:block_size]
y = train_data[1:block_size+1]  # targets are the inputs shifted by one position

for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    
    print(f"When the input is {context}, the target is {target}")

When the input is tensor([18]), the target is 47
When the input is tensor([18, 47]), the target is 56
When the input is tensor([18, 47, 56]), the target is 57
When the input is tensor([18, 47, 56, 57]), the target is 58
When the input is tensor([18, 47, 56, 57, 58]), the target is 1
When the input is tensor([18, 47, 56, 57, 58,  1]), the target is 15
When the input is tensor([18, 47, 56, 57, 58,  1, 15]), the target is 47
When the input is tensor([18, 47, 56, 57, 58,  1, 15, 47]), the target is 58
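
One chunk of `block_size + 1` tokens packs `block_size` training examples: for every position `t`, the context is the first `t + 1` tokens and the target is the token right after them. A sketch with a plain list in place of the tensor, using the 9-token chunk printed earlier (`context_target_pairs` is a hypothetical helper):

```python
def context_target_pairs(chunk):
    """Return every (context, target) pair contained in one chunk of tokens."""
    return [(chunk[:t + 1], chunk[t + 1]) for t in range(len(chunk) - 1)]

chunk = [18, 47, 56, 57, 58, 1, 15, 47, 58]  # train_data[:block_size + 1] with block_size = 8
pairs = context_target_pairs(chunk)
for context, target in pairs:
    print(f"context {context} -> target {target}")
```

Training on all of these contexts, from length 1 up to `block_size`, is what lets the model generate from as little as a single token.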


In [16]:
torch.manual_seed(1337)
batch_size = 4  # how many independent sequences will we process in parallel?
block_size = 8  # what is the maximum context length for predictions?

def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size, ))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+1+block_size] for i in ix])
    
    return x, y

xb, yb = get_batch("train")

In [17]:
print(f"Inputs: {xb}\nIts shape: {xb.shape}")
print(f"Targets: {yb}\nIts shape: {yb.shape}")

Inputs: tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
Its shape: torch.Size([4, 8])
Targets: tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
Its shape: torch.Size([4, 8])
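
Two details of `get_batch` are easy to miss: each target row is the input row shifted left by one token, and the upper bound `len(data) - block_size` keeps the target slice inside the data (the upper bound of `torch.randint` is exclusive). The sketch below re-creates the sampling with plain lists and the standard `random` module instead of torch, so it runs without the dataset. `get_batch_lists` is a hypothetical stand-in, not the notebook's function:

```python
import random

def get_batch_lists(data, batch_size, block_size, rng):
    """Sample `batch_size` windows of inputs and their shifted-by-one targets."""
    # randrange excludes its upper bound, like torch.randint
    starts = [rng.randrange(len(data) - block_size) for _ in range(batch_size)]
    x = [data[i:i + block_size] for i in starts]
    y = [data[i + 1:i + 1 + block_size] for i in starts]
    return x, y

data = list(range(50))                       # stand-in for the encoded text
rng = random.Random(1337)
x_rows, y_rows = get_batch_lists(data, batch_size=4, block_size=8, rng=rng)

for x_row, y_row in zip(x_rows, y_rows):
    assert len(x_row) == len(y_row) == 8     # every row is a full block
    assert x_row[1:] == y_row[:-1]           # targets are inputs shifted by one

# The largest start index that can be drawn still leaves room for a full target row.
last_start = len(data) - 8 - 1
assert len(data[last_start + 1:last_start + 1 + 8]) == 8

# The first recorded row of the batch above has the same property.
recorded_x = [24, 43, 58, 5, 57, 1, 46, 43]
recorded_y = [43, 58, 5, 57, 1, 46, 43, 39]
assert recorded_x[1:] == recorded_y[:-1]
```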


In [18]:
counter_1 = 0

for b in range(batch_size):              # batch_dimension
    for t in range(block_size):          # time dimension
        context = xb[b, :t+1]
        target = yb[b, t]
        counter_1 += 1
        print(f"{counter_1} -  When the input is {context.tolist()}, the target is {target}")
    print(f"-------------------------end of row={b}--------------------------------------")

1 -  When the input is [24], the target is 43
2 -  When the input is [24, 43], the target is 58
3 -  When the input is [24, 43, 58], the target is 5
4 -  When the input is [24, 43, 58, 5], the target is 57
5 -  When the input is [24, 43, 58, 5, 57], the target is 1
6 -  When the input is [24, 43, 58, 5, 57, 1], the target is 46
7 -  When the input is [24, 43, 58, 5, 57, 1, 46], the target is 43
8 -  When the input is [24, 43, 58, 5, 57, 1, 46, 43], the target is 39
-------------------------end of row=0--------------------------------------
9 -  When the input is [44], the target is 53
10 -  When the input is [44, 53], the target is 56
11 -  When the input is [44, 53, 56], the target is 1
12 -  When the input is [44, 53, 56, 1], the target is 58
13 -  When the input is [44, 53, 56, 1, 58], the target is 46
14 -  When the input is [44, 53, 56, 1, 58, 46], the target is 39
15 -  When the input is [44, 53, 56, 1, 58, 46, 39], the target is 58
16 -  When the input is [44, 53, 56, 1, 58, 46,