## Code GPT from Scratch
### by Andrej Karpathy
[![YouTube Icon](https://upload.wikimedia.org/wikipedia/commons/thumb/0/09/YouTube_full-color_icon_%282017%29.svg/20px-YouTube_full-color_icon_%282017%29.svg.png)](https://www.youtube.com/watch?v=kCc8FmEb1nY&t=7s)

#### Table of Contents
1. Import Dataset from Hugging Face    
    Dataset: *Tiny Shakespeare* by Andrej Karpathy: [🤗](https://huggingface.co/datasets/karpathy/tiny_shakespeare)
2. Tokenization

In [13]:
"""
!pip cache purge
!pip install pandas datasets  # if this conflicts with an installed pandas, uninstall pandas first
"""

Collecting pandas
  Downloading pandas-2.2.3-cp313-cp313-macosx_10_13_x86_64.whl.metadata (89 kB)
Collecting datasets
  Downloading datasets-3.3.2-py3-none-any.whl.metadata (19 kB)
Downloading pandas-2.2.3-cp313-cp313-macosx_10_13_x86_64.whl (12.5 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.5/12.5 MB 12.0 MB/s eta 0:00:00
Downloading datasets-3.3.2-py3-none-any.whl (485 kB)
Installing collected packages: pandas, datasets
Successfully installed datasets-3.3.2 pandas-2.2.3


### 1. Import Dataset from 🤗

In [2]:
from datasets import load_dataset

# Load the dataset
dataset = load_dataset('karpathy/tiny_shakespeare')

# Print the type of the returned object
print(type(dataset))


<class 'datasets.dataset_dict.DatasetDict'>


In [3]:
dataset.keys()

dict_keys(['train', 'validation', 'test'])

In [4]:
type(dataset['train'])

datasets.arrow_dataset.Dataset

In [17]:
# Take the text of the first train example and preview its first 200 characters
text_sample = dataset['train'][0]['text']

print("Initial portion of the train text:\n------------------------------------")
print(text_sample[:200]) 

Initial portion of the train text:
------------------------------------
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you


### 2. Tokenization

- The video mentions two subword tokenizer libraries as alternatives to a character-level tokenizer:
    - [SentencePiece](https://github.com/google/sentencepiece) by Google
    - [tiktoken](https://github.com/openai/tiktoken) by OpenAI

In [20]:
# Collect all the unique characters that occur in a split, as one sorted string
def get_unique_characters(split):
    all_chars = set()
    for example in split:
        all_chars.update(example['text'])
    return ''.join(sorted(all_chars))

# Get the unique characters for each split
unique_chars = {split: get_unique_characters(dataset[split]) for split in dataset.keys()}

# Print the unique characters for each split
for split, chars in unique_chars.items():
    print(f"Unique characters in the {split} set:")
    print(chars)
    print(f"Its length: {len(chars)}\n")

Unique characters in the train set:

 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
Its length: 65

Unique characters in the validation set:

 !',-.:;?ABCDEFGHIJKLMNOPQRSTUVWYabcdefghijklmnopqrstuvwxyz
Its length: 60

Unique characters in the test set:

 !',-.:;?ABCDEFGHIJKLMNOPRSTUVWYZabcdefghijklmnopqrstuvwxyz
Its length: 60
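
The validation and test splits each contain 60 unique characters, against 65 for train. A tokenizer built from the train vocabulary only works on the other splits if their characters are a subset of it. The sketch below checks this with the three character strings printed above, hard-coded so the block runs without downloading the dataset. The names `train_chars`, `val_chars`, and `test_chars` are illustrative and not part of the notebook.

```python
import string

# Character sets copied from the output above (hard-coded for illustration).
punct_train = "\n !$&',-.3:;?"
punct_other = "\n !',-.:;?"
train_chars = punct_train + string.ascii_uppercase + string.ascii_lowercase
val_chars = punct_other + "ABCDEFGHIJKLMNOPQRSTUVWY" + string.ascii_lowercase
test_chars = punct_other + "ABCDEFGHIJKLMNOPRSTUVWYZ" + string.ascii_lowercase

# Sanity check against the lengths reported by the notebook.
assert (len(train_chars), len(val_chars), len(test_chars)) == (65, 60, 60)

# Characters in validation/test that the train vocabulary does not cover.
missing_val = set(val_chars) - set(train_chars)
missing_test = set(test_chars) - set(train_chars)
print(missing_val, missing_test)  # set() set()

# Characters that occur only in the train split.
only_in_train = sorted(set(train_chars) - set(val_chars) - set(test_chars))
print(only_in_train)  # ['$', '&', '3', 'X']
```

Both differences are empty, so a mapping built from the train characters can encode every character in the validation and test splits.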



In [26]:
# Simple character-level tokenizer built from the train vocabulary:
# each unique character gets the integer ID of its position in the sorted string

st2int = {ch: i for i, ch in enumerate(unique_chars["train"])}  # character -> ID
int2st = {i: ch for i, ch in enumerate(unique_chars["train"])}  # ID -> character

# encoder: string -> list of IDs; decoder: list of IDs -> string
encode = lambda sample: [st2int[ch] for ch in sample]
decode = lambda l: ''.join([int2st[i] for i in l])

In [28]:
# test encoder & decoder
print(encode("It is sunny"))
print(decode(encode("It is sunny")))

[21, 58, 1, 47, 57, 1, 57, 59, 52, 52, 63]
It is sunny
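
The encoder looks up every character in `st2int`, so any character outside the 65-character train vocabulary raises a `KeyError`. The sketch below rebuilds that vocabulary from the string printed earlier, confirms the round trip, and shows one possible way to detect unsupported characters before encoding. The helper `find_unknown` is hypothetical and not part of the notebook, and the functions are written with `def` rather than `lambda` only for readability.

```python
import string

# Rebuild the 65-character train vocabulary shown earlier, in sorted order.
vocab = "".join(sorted("\n !$&',-.3:;?" + string.ascii_letters))
st2int = {ch: i for i, ch in enumerate(vocab)}
int2st = {i: ch for i, ch in enumerate(vocab)}

def encode(sample):
    """Map a string to a list of integer IDs."""
    return [st2int[ch] for ch in sample]

def decode(ids):
    """Map a list of integer IDs back to a string."""
    return "".join(int2st[i] for i in ids)

def find_unknown(sample):
    """Hypothetical helper: characters in `sample` missing from the vocabulary."""
    return sorted(set(sample) - set(st2int))

text = "It is sunny"
ids = encode(text)
print(ids)  # [21, 58, 1, 47, 57, 1, 57, 59, 52, 52, 63]
assert decode(ids) == text  # encoding then decoding returns the input

# '3' is the only digit in the vocabulary, so '2' and '5' are unknown here.
print(find_unknown("It is 25°C"))  # ['2', '5', '°']
```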


In [32]:
# Test different tokenizers
# Using tiktoken
import tiktoken

# Load the GPT-2 encoding from tiktoken
tokenizer = tiktoken.get_encoding("gpt2")
print(tokenizer.n_vocab)


50257


In [33]:
# Encode and decode using tiktoken
tiktoken_encoded = tokenizer.encode("It is sunny")
tiktoken_decoded = tokenizer.decode(tiktoken_encoded)

print("tiktoken Encoded:", tiktoken_encoded)
print("tiktoken Decoded:", tiktoken_decoded)

tiktoken Encoded: [1026, 318, 27737]
tiktoken Decoded: It is sunny
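
The two outputs make the trade-off concrete: the character-level tokenizer needs 11 tokens for "It is sunny" with a vocabulary of 65, while the GPT-2 encoding needs 3 tokens with a vocabulary of 50257. The arithmetic below uses only numbers printed in this notebook, copied in by hand so the block runs without `tiktoken` installed.

```python
text = "It is sunny"

# Token IDs copied from the notebook outputs above.
char_ids = [21, 58, 1, 47, 57, 1, 57, 59, 52, 52, 63]  # character-level tokenizer
bpe_ids = [1026, 318, 27737]                            # tiktoken "gpt2" encoding

# Vocabulary sizes copied from the notebook outputs above.
char_vocab, bpe_vocab = 65, 50257

chars_per_char_token = len(text) / len(char_ids)
chars_per_bpe_token = len(text) / len(bpe_ids)
vocab_ratio = bpe_vocab / char_vocab

print(chars_per_char_token)           # 1.0
print(round(chars_per_bpe_token, 2))  # 3.67
print(round(vocab_ratio))             # 773
```

For this sentence, the larger vocabulary buys a sequence that is between three and four times shorter.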


In [None]:
# Let's now encode the entire text dataset and store it in a torch.Tensor
import torch