
# The Art of Tokenization, Part 1: Byte Pair Encoding

Welcome to this series on the art and science of tokenization! In Natural Language Processing (NLP), before a model like GPT or BERT can process human language, the text must first be broken down into pieces the model can recognize. These pieces are called tokens. The process of creating them is tokenization, and it’s one of the most fundamental steps in any NLP pipeline.

This series will explore the different strategies for tokenization. We’ll start with one of the most influential methods in the modern NLP landscape: Byte Pair Encoding (BPE).

# What is BPE? The “Happy Medium” of Tokenization
At its core, Byte Pair Encoding (BPE) is a subword tokenization algorithm. Instead of forcing us to choose between whole words or individual characters, it finds a “happy medium.”


Imagine trying to create a dictionary for a language model:
* Option A: Word Dictionary. You include every single word ("cat", "run", "photosynthesis", "antidisestablishmentarianism").
  * Problem: The dictionary becomes enormous. What about new words (“de-platforming”), slang (“yeet”), or simple typos (“helllo”)? They are all “Out-of-Vocabulary” (OOV) and become a meaningless `<UNK>` (unknown) token.
* Option B: Character Dictionary. You only include characters ('a', 'b', 'c', '!').
  * Problem: No more OOV issues, but the text sequences become incredibly long. The sentence “Hello world” is now 11 tokens ('H', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd'). Longer sequences are more expensive to process, and individual characters carry little meaning on their own, so the model has a harder time finding it.
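
The trade-off between the two options can be sketched in a few lines of Python. This is only an illustration: the tiny `word_vocab` set and the helper names `word_tokenize` and `char_tokenize` are hypothetical, chosen for this example rather than taken from any library.

```python
# Hypothetical toy word vocabulary; any word outside it maps to <UNK>.
word_vocab = {"hello", "world", "cat", "run"}

def word_tokenize(text):
    """Split on whitespace and replace out-of-vocabulary words with <UNK>."""
    return [w if w in word_vocab else "<UNK>" for w in text.lower().split()]

def char_tokenize(text):
    """Treat every character, including spaces, as its own token."""
    return list(text)

word_tokens = word_tokenize("helllo world")  # the typo "helllo" is OOV
char_tokens = char_tokenize("Hello world")

print(word_tokens)       # ['<UNK>', 'world']
print(len(char_tokens))  # 11
```

The word-level tokenizer loses the misspelled word entirely, while the character-level tokenizer keeps everything at the cost of a much longer sequence.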

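BPE sits between these two options: it builds a vocabulary of subword units, so frequent character sequences become single tokens while rarer ones are assembled from smaller pieces. This way, a common word like "where" can be a single token, while a rare, complex word like "retrofitting" can be broken down into meaningful pieces like ["retro", "fit", "ting"]. Crucially, no word is ever “unknown”: as long as its characters are in the base vocabulary, any new word can be built from these subword pieces.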

# How the BPE Algorithm Works (Conceptually)
BPE was originally a data compression algorithm. Its adaptation for NLP is based on a simple, greedy, and iterative idea:
> Core Logic: Continuously find the most common pair of adjacent tokens in your text and merge them into a single, new token.
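
To make the core logic concrete, here is a minimal sketch of a single merge step on a flat list of tokens, in the spirit of the original compression use. The function name `merge_most_frequent_pair` and the sample string are illustrative assumptions, and breaking ties by first occurrence is just one possible convention.

```python
from collections import Counter

def merge_most_frequent_pair(tokens):
    """Perform one merge step: fuse occurrences of the most common adjacent pair."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    if not pair_counts:
        return tokens, None
    best_pair = pair_counts.most_common(1)[0][0]

    merged, i = [], 0
    while i < len(tokens):
        # Scan left to right, fusing the chosen pair wherever it appears.
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best_pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged, best_pair

merged, best_pair = merge_most_frequent_pair(list("aaabdaaabac"))
print(best_pair)  # ('a', 'a')
print(merged)     # ['aa', 'a', 'b', 'd', 'aa', 'a', 'b', 'a', 'c']
```

Calling the function repeatedly on its own output gives the iterative behaviour: each pass adds one new, longer token and shortens the sequence.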


Let’s trace the process with a tiny imaginary corpus.

* Corpus: (low, low, low, lower, newest, wider)
* Step 0: First, we break every word down into its basic characters. We also add a special symbol, like `</w>`, to mark the end of a word. This is important to distinguish between `er` inside a word (as in “person”, a word outside our corpus) and `er` at the end of a word (as in “lower”). Our initial “tokens” are just characters: {l, o, w, e, r, n, s, t, i, d, `</w>`}.
* Step 1: First Merge. The algorithm scans the entire corpus and counts the frequency of every adjacent pair of tokens.
  For illustration, suppose the pair e followed by r (`e r`) is selected first; it appears in lower and wider. (A strict count on this exact corpus would rank `l o` and `o w` higher, at four occurrences each versus two for `e r`.)
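
Step 0 and the pair counting in Step 1 can be sketched as follows. This is a minimal illustration under a few assumptions: the variable names (`word_freqs`, `splits`, `pair_counts`) are made up for this example, and each word's pairs are weighted by how often the word occurs in the corpus.

```python
from collections import Counter

corpus = ["low", "low", "low", "lower", "newest", "wider"]

# Step 0: split each distinct word into characters plus the end-of-word marker.
word_freqs = Counter(corpus)
splits = {word: tuple(word) + ("</w>",) for word in word_freqs}

# Initial vocabulary: every distinct symbol seen in the splits.
vocab = {symbol for symbols in splits.values() for symbol in symbols}

# Step 1: count adjacent pairs, weighting each word by its frequency.
pair_counts = Counter()
for word, symbols in splits.items():
    for pair in zip(symbols, symbols[1:]):
        pair_counts[pair] += word_freqs[word]

print(sorted(vocab))
print(pair_counts[("e", "r")])  # 2 (from "lower" and "wider")
print(pair_counts[("l", "o")])  # 4 (three times "low", once "lower")
```

Exact tallies like these are a useful cross-check on any hand count over a small corpus.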