# Building the GPT Tokenizer

## 0. Some Pre-requisites

### Introduction to Tokenization
Tokenization is the process of breaking down a piece of text into smaller units called **tokens**. The set of all distinct tokens a language model can read and produce is its vocabulary.

#### Why is Tokenization so Important?

- **Spelling, String Processing, and Arithmetic:** LLMs often struggle with tasks that seem simple to humans, such as spelling words correctly, reversing a string, or performing basic math. These tasks require operating on individual characters, but the model sees the world in terms of "tokens," which are often chunks of words. For example, the number 1275 might be a single token rather than four separate digit characters, making it hard for the model to perform arithmetic on it.

- **Non-English Languages and Code:** The vocabulary of a tokenizer is typically trained on a massive corpus of text, which is often predominantly English. As a result, other languages or programming languages may be split into more, less meaningful tokens, making the model less effective at understanding and generating them.

- **Weird Errors and Warnings:** Seemingly random failures or warnings, like those related to "trailing whitespace" or specific strings like "SolidGoldMagikarp", are often artifacts of how the tokenizer processes text before the model ever sees it.

- **The Dream of a Tokenizer-Free World:** A major goal in AI research is to create models that can operate directly on raw text (or even raw bytes), which would eliminate the entire layer of complexity and the associated problems introduced by tokenization.



### Unicode and UTF-8 Encoding

- **Unicode:** This is a universal standard that assigns a unique number, called a code point, to almost every character, symbol, or emoji in every language.
- **UTF-8:** This is an encoding scheme that specifies how to represent these Unicode code points as a sequence of bytes (integers from 0 to 255). Simple characters, like those in the English alphabet, are represented by a single byte, while more complex characters or emojis require two to four bytes.
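
To make the variable-width idea concrete, here is a minimal sketch that counts how many UTF-8 bytes a few sample characters need. The sample characters and the `byte_lengths` name are chosen only for illustration.

```python
# Sample characters picked for illustration: ASCII, accented Latin,
# a Korean syllable, and an emoji.
samples = ["a", "\u00e9", "안", "👋"]  # "\u00e9" is é

# Map each character to the number of bytes in its UTF-8 encoding.
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in byte_lengths.items():
    print(repr(ch), "->", n, "byte(s):", list(ch.encode("utf-8")))
```

Each of these characters is a single code point, yet they take 1, 2, 3, and 4 bytes respectively once encoded.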

## 1. Basic Text Encoding

### Unicode Code Points (ord)
The `ord()` function in Python returns the Unicode code point of a single character as an integer.

In [1]:
"안녕하세요 👋 (hello in Korean!)"

'안녕하세요 👋 (hello in Korean!)'

In [2]:
[ord(x) for x in "안녕하세요 👋 (hello in Korean!)"]

[50504,
 45397,
 54616,
 49464,
 50836,
 32,
 128075,
 32,
 40,
 104,
 101,
 108,
 108,
 111,
 32,
 105,
 110,
 32,
 75,
 111,
 114,
 101,
 97,
 110,
 33,
 41]
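
As a small complement, the sketch below shows that `chr()` is the inverse of `ord()`, and formats each code point in the conventional `U+XXXX` hexadecimal notation. The variable names are illustrative only.

```python
s = "안녕하세요 👋"

# ord() maps each character to its integer code point.
code_points = [ord(ch) for ch in s]

# Code points are conventionally written in hex with a "U+" prefix.
labels = [f"U+{cp:04X}" for cp in code_points]
print(labels)

# chr() maps a code point back to its character, so the round trip is lossless.
restored = "".join(chr(cp) for cp in code_points)
print(restored == s)
```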

### UTF-8 Encoded Bytes
We take the same string and encode it into a sequence of bytes. 
- Notice that the English letters, punctuation, and the space character (like `h`, `e`, `l`, `o`, and ` `) are represented by single bytes (numbers less than 128) that match their ASCII values.
- The Korean characters and the emoji are represented by sequences of multiple bytes, each with a value greater than 127. For example, the waving hand emoji `👋` becomes the four-byte sequence `[240, 159, 145, 139]`.

In [3]:
list("안녕하세요 👋 (hello in Korean!)".encode("utf-8"))

[236,
 149,
 136,
 235,
 133,
 149,
 237,
 149,
 152,
 236,
 132,
 184,
 236,
 154,
 148,
 32,
 240,
 159,
 145,
 139,
 32,
 40,
 104,
 101,
 108,
 108,
 111,
 32,
 105,
 110,
 32,
 75,
 111,
 114,
 101,
 97,
 110,
 33,
 41]
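
To see where a sequence like `[240, 159, 145, 139]` comes from, here is a hand-written sketch of the UTF-8 bit layout. The helper name `utf8_encode_code_point` is hypothetical and exists only for illustration; in practice `str.encode("utf-8")` does this for you. For simplicity, the sketch does not reject surrogate code points.

```python
def utf8_encode_code_point(cp):
    """Encode one Unicode code point into a list of UTF-8 byte values."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    if cp < 0x80:       # 1 byte:  0xxxxxxx
        return [cp]
    if cp < 0x800:      # 2 bytes: 110xxxxx 10xxxxxx
        return [0xC0 | (cp >> 6),
                0x80 | (cp & 0x3F)]
    if cp < 0x10000:    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return [0xE0 | (cp >> 12),
                0x80 | ((cp >> 6) & 0x3F),
                0x80 | (cp & 0x3F)]
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return [0xF0 | (cp >> 18),
            0x80 | ((cp >> 12) & 0x3F),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F)]


print(utf8_encode_code_point(ord("👋")))  # [240, 159, 145, 139]
print(utf8_encode_code_point(ord("h")))   # [104]
```

The leading bits of the first byte tell a decoder how many bytes belong to the character, and every continuation byte starts with the bits `10`, which is why all bytes of a multi-byte character are 128 or larger.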

## 2. Implementing Byte-Pair Encoding

The core idea of BPE is to start with a simple vocabulary (all individual bytes) and iteratively merge the most frequent pair of adjacent tokens into a new, longer token. This lets the tokenizer learn a vocabulary adapted to the specific text it is trained on, striking a balance between character-level and word-level tokenization.
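
As a rough illustration of this loop, the sketch below runs three merges on a toy string. The helper names `count_pairs` and `merge_pair`, the toy input, and the choice of three merges are assumptions made for this example, not a definitive implementation.

```python
from collections import Counter


def count_pairs(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`, left to right."""
    out = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2  # skip both members of the merged pair
        else:
            out.append(ids[i])
            i += 1
    return out


ids = list("aaabdaaabac".encode("utf-8"))  # toy input, 11 byte tokens
merges = {}                                # (left, right) -> new token id

for step in range(3):                      # 3 merges, chosen arbitrarily
    counts = count_pairs(ids)
    best = max(counts, key=counts.get)     # most frequent pair (first one wins ties)
    new_id = 256 + step                    # ids 0..255 are reserved for raw bytes
    ids = merge_pair(ids, best, new_id)
    merges[best] = new_id

print(ids)     # [258, 100, 258, 97, 99]
print(merges)  # {(97, 97): 256, (256, 97): 257, (257, 98): 258}
```

Each merge adds one entry to the vocabulary and shortens the sequence, here from 11 tokens down to 5.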

### Initial Text and Byte Conversion
We start with a sample text and convert it into its raw UTF-8 byte representation. This list of integers (from 0 to 255) is our initial sequence of tokens. Our goal is to compress this token sequence by merging frequent pairs.

In [4]:
text = "Ｕｎｉｃｏｄｅ! 🅤🅝🅘🅒🅞🅓🅔‽ 🇺‌🇳‌🇮‌🇨‌🇴‌🇩‌🇪! 😄 The very name strikes fear and awe into the hearts of programmers worldwide. We all know we ought to “support Unicode” in our software (whatever that means—like using wchar_t for all the strings, right?). But Unicode can be abstruse, and diving into the thousand-page Unicode Standard plus its dozens of supplementary annexes, reports, and notes can be more than a little intimidating. I don’t blame programmers for still finding the whole thing mysterious, even 30 years after Unicode’s inception."
tokens = text.encode("utf-8")  # raw bytes
# convert to a list of integers in range 0..255 for convenience
tokens = list(map(int, tokens))
print('---')
print(text)
print("length:", len(text))  # number of Unicode code points
print('---')
print(tokens)  # list of integers in range 0..255
print("length:", len(tokens))  # number of UTF-8 bytes

---
Ｕｎｉｃｏｄｅ! 🅤🅝🅘🅒🅞🅓🅔‽ 🇺‌🇳‌🇮‌🇨‌🇴‌🇩‌🇪! 😄 The very name strikes fear and awe into the hearts of programmers worldwide. We all know we ought to “support Unicode” in our software (whatever that means—like using wchar_t for all the strings, right?). But Unicode can be abstruse, and diving into the thousand-page Unicode Standard plus its dozens of supplementary annexes, reports, and notes can be more than a little intimidating. I don’t blame programmers for still finding the whole thing mysterious, even 30 years after Unicode’s inception.
length: 533
---
[239, 188, 181, 239, 189, 142, 239, 189, 137, 239, 189, 131, 239, 189, 143, 239, 189, 132, 239, 189, 133, 33, 32, 240, 159, 133, 164, 240, 159, 133, 157, 240, 159, 133, 152, 240, 159, 133, 146, 240, 159, 133, 158, 240, 159, 133, 147, 240, 159, 133, 148, 226, 128, 189, 32, 240, 159, 135, 186, 226, 128, 140, 240, 159, 135, 179, 226, 128, 140, 240, 159, 135, 174, 226, 128, 140, 240, 159, 135, 168, 226, 128, 140, 240, 159, 135, 180, 226, 128, 140