<a href="https://www.kaggle.com/code/aabdollahii/the-art-of-tokenization?scriptVersionId=271099105" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# The Art of Tokenization, Part 1: Byte Pair Encoding

Welcome to this series on the art and science of tokenization! In Natural Language Processing (NLP), before a powerful model like GPT or BERT can process human language, we must first break the text down into pieces the model can recognize. These pieces are called tokens. The process of creating them is tokenization, and it’s one of the most fundamental steps in any NLP pipeline.

This series will explore the different strategies for tokenization. We’ll start with one of the most influential methods in the modern NLP landscape: Byte Pair Encoding (BPE).

# What is BPE? The “Happy Medium” of Tokenization
At its core, Byte Pair Encoding (BPE) is a subword tokenization algorithm. Instead of forcing us to choose between whole words or individual characters, it finds a “happy medium.”


Imagine trying to create a dictionary for a language model:
* Option A: Word Dictionary. You include every single word ("cat", "run", "photosynthesis", "antidisestablishmentarianism").

* Problem: The dictionary becomes enormous. What about new words (“de-platforming”), slang (“yeet”), or simple typos (“helllo”)? They are all “Out-of-Vocabulary” (OOV) and become a meaningless &lt;UNK&gt; (unknown) token.

* Option B: Character Dictionary. You only include characters ('a', 'b', 'c', '!').

* Problem: No more OOV issues, but the text sequences become incredibly long. The sentence “Hello world” is now 11 tokens ('H', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd'). This makes it computationally expensive and harder for the model to find meaning.

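BPE sits between these two extremes: it starts from characters and learns to merge frequently co-occurring sequences into larger subword units. This way, a common word like "where" can be a single token, while a rare, complex word like "retrofitting" can be broken down into pieces like ["retro", "fit", "ting"]. Crucially, no word is ever “unknown”: as long as its characters are in the base vocabulary, any new word can be built from these subword pieces.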
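
To make the two extremes concrete, here is a minimal sketch comparing them. The tiny `word_vocab` set and both helper functions are hypothetical and exist only for illustration; real word-level tokenizers handle punctuation and casing more carefully.

```python
# Toy comparison of word-level and character-level tokenization.
# `word_vocab` is a hypothetical, deliberately tiny dictionary.
word_vocab = {"hello", "world", "cat", "run"}

def word_tokenize(text: str) -> list:
    """Split on whitespace; anything outside the dictionary becomes <UNK>."""
    return [w if w in word_vocab else "<UNK>" for w in text.lower().split()]

def char_tokenize(text: str) -> list:
    """Every character, including the space, is its own token."""
    return list(text)

print(word_tokenize("Hello world"))       # ['hello', 'world']
print(word_tokenize("helllo world"))      # ['<UNK>', 'world']
print(len(char_tokenize("Hello world")))  # 11
```

The typo costs the word-level tokenizer an entire word, while the character-level tokenizer loses nothing but spends 11 tokens on two words.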

# How the BPE Algorithm Works (Conceptually)
BPE was originally a data compression algorithm. Its adaptation for NLP is based on a simple, greedy, and iterative idea:
> Core Logic: Continuously find the most common pair of adjacent tokens in your text and merge them into a single, new token.


<div style="background-color: #2d3748; color: #e2e8f0; border: 1px solid #4a5568; border-radius: 10px; padding: 25px; font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif; line-height: 1.7;">

<h3 style="color: #90cdf4; border-bottom: 2px solid #4a5568; padding-bottom: 10px; margin-top: 0;">How the BPE Algorithm Works (Conceptually)</h3>
    <p style="color: #a0aec0;">BPE is based on a simple, greedy, and iterative idea: continuously find the most common pair of adjacent tokens and merge them. Let's trace the process.</p>

 <style>
        .bpe-kbd-dark {
            background-color: #1a202c;
            border: 1px solid #4a5568;
            border-bottom: 2px solid #718096;
            border-radius: 4px;
            padding: 3px 6px;
            font-family: 'SFMono-Regular', Consolas, 'Liberation Mono', Menlo, Courier, monospace;
            font-size: 0.9em;
            color: #cbd5e1;
            white-space: nowrap;
        }
    </style>

<div style="margin-top: 25px;">
        <h4 style="color: #a0aec0; margin-bottom: 5px;">Step 0: Initialize</h4>
        <p style="margin: 0;">
            First, we break every word down into its basic characters. We also add a special symbol, like <span class="bpe-kbd-dark">&lt;/w&gt;</span>, to mark the end of a word. This lets the algorithm distinguish <span class="bpe-kbd-dark">er</span> inside a word (as in “lowered”) from <span class="bpe-kbd-dark">er</span> at the end of a word (as in “lower”).
        </p>
        <p style="margin-top: 10px;">
            Our initial “tokens” are just characters: <span class="bpe-kbd-dark">l</span> <span class="bpe-kbd-dark">o</span> <span class="bpe-kbd-dark">w</span> <span class="bpe-kbd-dark">e</span> <span class="bpe-kbd-dark">r</span> <span class="bpe-kbd-dark">n</span> <span class="bpe-kbd-dark">s</span> <span class="bpe-kbd-dark">t</span> <span class="bpe-kbd-dark">i</span> <span class="bpe-kbd-dark">d</span> <span class="bpe-kbd-dark">&lt;/w&gt;</span>.
        </p>
    </div>

<div style="margin-top: 25px; border-top: 1px dashed #4a5568; padding-top: 20px;">
        <h4 style="color: #a0aec0; margin-bottom: 5px;">Step 1: First Merge</h4>
        <p>The algorithm scans the entire corpus and counts the frequency of every adjacent pair of tokens.</p>
        <p>Let’s say it finds that the pair <span class="bpe-kbd-dark">e</span> followed by <span class="bpe-kbd-dark">r</span> (i.e., <span class="bpe-kbd-dark">e r</span>) is the most common combination (appearing in <em>lower</em> and <em>wider</em>).</p>
        <ul style="list-style-type: '➡️'; padding-left: 20px; margin-top: 10px;">
            <li style="padding-left: 10px; margin-bottom: 8px;"><strong>Action:</strong> It merges <span class="bpe-kbd-dark">e</span> and <span class="bpe-kbd-dark">r</span> to create a new token, <span class="bpe-kbd-dark">er</span>.</li>
            <li style="padding-left: 10px; margin-bottom: 8px;"><strong>New Vocabulary:</strong> Our vocabulary now includes <span class="bpe-kbd-dark">er</span>.</li>
            <li style="padding-left: 10px; margin-bottom: 8px;"><strong>Corpus Update:</strong> All instances of <span class="bpe-kbd-dark">e r</span> are replaced. So, <span class="bpe-kbd-dark">l o w e r &lt;/w&gt;</span> becomes <span class="bpe-kbd-dark">l o w er &lt;/w&gt;</span>.</li>
        </ul>
    </div>

 <div style="margin-top: 25px; border-top: 1px dashed #4a5568; padding-top: 20px;">
        <h4 style="color: #a0aec0; margin-bottom: 5px;">Step 2: Second Merge</h4>
        <p>The process repeats. The algorithm scans the <em>updated</em> corpus. Perhaps it now finds that <span class="bpe-kbd-dark">er</span> followed by <span class="bpe-kbd-dark">&lt;/w&gt;</span> (i.e., <span class="bpe-kbd-dark">er &lt;/w&gt;</span>) is the most frequent pair.</p>
        <ul style="list-style-type: '➡️'; padding-left: 20px; margin-top: 10px;">
            <li style="padding-left: 10px; margin-bottom: 8px;"><strong>Action:</strong> It merges <span class="bpe-kbd-dark">er</span> and <span class="bpe-kbd-dark">&lt;/w&gt;</span> to create a new token, <span class="bpe-kbd-dark">er&lt;/w&gt;</span>.</li>
            <li style="padding-left: 10px; margin-bottom: 8px;"><strong>New Vocabulary:</strong> Our vocabulary now includes <span class="bpe-kbd-dark">er</span> and <span class="bpe-kbd-dark">er&lt;/w&gt;</span>.</li>
            <li style="padding-left: 10px; margin-bottom: 8px;"><strong>Corpus Update:</strong> <span class="bpe-kbd-dark">l o w er &lt;/w&gt;</span> becomes <span class="bpe-kbd-dark">l o w er&lt;/w&gt;</span>.</li>
        </ul>
    </div>
    
 <div style="margin-top: 25px; border-top: 1px dashed #4a5568; padding-top: 20px;">
        <h4 style="color: #a0aec0; margin-bottom: 5px;">Step 3: Keep Going…</h4>
        <p>Next, maybe <span class="bpe-kbd-dark">l</span> followed by <span class="bpe-kbd-dark">o</span> (<span class="bpe-kbd-dark">l o</span>) is the most common. They get merged into <span class="bpe-kbd-dark">lo</span>. Then <span class="bpe-kbd-dark">lo</span> and <span class="bpe-kbd-dark">w</span> get merged into <span class="bpe-kbd-dark">low</span>.</p>
    </div>

 <div style="margin-top: 25px; background-color: #1a202c; border-left: 5px solid #805ad5; padding: 15px;">
        <h4 style="color: #b794f4; margin: 0 0 5px 0;">The End Result</h4>
        <p style="margin: 0; color: #a0aec0;">This process is repeated for a predetermined number of merges (e.g., <strong>30,000 times</strong>). The final vocabulary consists of the initial characters plus all the new tokens created during the merges. This gives us a ranked list of merge rules that we can use to tokenize any new text.</p>
    </div>
</div>
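
To make the end result concrete, here is a minimal sketch that assembles the token vocabulary from a base character set and an ordered list of merges. The four merges are the hypothetical ones from the walkthrough above, not the output of a real training run, and `build_token_vocab` is an illustrative helper name rather than part of any library.

```python
def build_token_vocab(base_symbols, merges):
    """Return the base symbols followed by one new token per merge, in learned order."""
    tokens = list(base_symbols)
    for left, right in merges:
        merged = left + right
        if merged not in tokens:  # guard against duplicates
            tokens.append(merged)
    return tokens

# Base characters from Step 0 and the merges assumed in Steps 1-3.
base_symbols = ["l", "o", "w", "e", "r", "n", "s", "t", "i", "d", "</w>"]
merges = [("e", "r"), ("er", "</w>"), ("l", "o"), ("lo", "w")]

token_vocab = build_token_vocab(base_symbols, merges)
print(token_vocab[-4:])  # ['er', 'er</w>', 'lo', 'low']
print(len(token_vocab))  # 15
```

Each merge contributes at most one token, so the vocabulary size is roughly the number of base symbols plus the number of merges.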


<div style="background-color: #2d3748; color: #e2e8f0; border: 1px solid #4a5568; border-radius: 10px; padding: 25px; font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif; line-height: 1.7;">

  <h3 style="color: #90cdf4; border-bottom: 2px solid #4a5568; padding-bottom: 10px; margin-top: 0;">
    Pros and Cons of BPE
  </h3>
  <p style="color: #a0aec0;">
    BPE is widely used for a reason, but it’s not without its trade‑offs.
  </p>

  <style>
    .bpe-kbd-dark {
      background-color: #1a202c;
      border: 1px solid #4a5568;
      border-bottom: 2px solid #718096;
      border-radius: 4px;
      padding: 3px 6px;
      font-family: 'SFMono-Regular', Consolas, 'Liberation Mono', Menlo, Courier, monospace;
      font-size: 0.9em;
      color: #cbd5e1;
      white-space: nowrap;
    }
  </style>

  <!-- ✅ PROS SECTION -->
  <div style="margin-top: 25px;">
    <h4 style="color: #68d391; border-bottom: 1px solid #4a5568; padding-bottom: 5px;">
      Advantages (Pros) 👍
    </h4>
    <ul style="list-style-type: '✔️'; padding-left: 25px; margin-top: 15px;">
      <li style="margin-bottom: 10px;">
        <strong>Eliminates Out‑of‑Vocabulary (OOV) Words:</strong> Any new or rare word can be decomposed into a sequence of known subword tokens. As long as every character is in the base vocabulary, the model never has to fall back to an 
        <span class="bpe-kbd-dark">&lt;UNK&gt;</span> token.
      </li>
      <li style="margin-bottom: 10px;">
        <strong>Controllable Vocabulary Size:</strong> The number of merges is a hyperparameter, and the final vocabulary size is roughly the number of base symbols plus the number of merges. You can balance model size and expressiveness.
      </li>
      <li style="margin-bottom: 10px;">
        <strong>Efficiency:</strong> Common words stay as single tokens, making sequences shorter and inference faster. Rare words are decomposed smartly, making better use of vocab space.
      </li>
      <li style="margin-bottom: 10px;">
        <strong>Captures Morphology:</strong> Frequent merges often line up with meaningful parts like prefixes (<span class="bpe-kbd-dark">un-</span>), suffixes (<span class="bpe-kbd-dark">-ing</span>), and stems, which helps generalization, although the splits are driven by frequency and are not guaranteed to be linguistic. If it learns 
        <span class="bpe-kbd-dark">un</span>, it can reuse that token in 
        <em>unhappy</em>, <em>unclear</em>, <em>unbelievable</em>.
      </li>
    </ul>
  </div>

  <!-- ❌ CONS SECTION -->
  <div style="margin-top: 35px; border-top: 1px dashed #4a5568; padding-top: 25px;">
    <h4 style="color: #f56565; border-bottom: 1px solid #4a5568; padding-bottom: 5px;">
      Disadvantages (Cons) 👎
    </h4>
    <ul style="list-style-type: '⚠️'; padding-left: 25px; margin-top: 15px;">
      <li style="margin-bottom: 10px;">
        <strong>Greedy Approach:</strong> BPE makes local (greedy) merges — not necessarily globally optimal. A different merge order might yield a better vocabulary.
      </li>
      <li style="margin-bottom: 10px;">
        <strong>Sensitive to Training Data:</strong> Merge rules depend entirely on the training corpus. A BPE trained on Wikipedia would tokenize Twitter slang or biomedical text poorly.
      </li>
      <li style="margin-bottom: 10px;">
        <strong>The “Byte” Misnomer:</strong> The classic NLP formulation of BPE, including the one built in this notebook, operates on characters, not raw bytes. Byte‑level BPE (used in GPT‑2) works directly on UTF‑8 bytes and is more robust: it can represent any text, emoji, or noise.
      </li>
    </ul>
  </div>

  <!-- 🌍 CONTEXT SECTION -->
  <div style="margin-top: 35px; border-top: 1px dashed #4a5568; padding-top: 25px;">
    <h4 style="color: #b794f4; border-bottom: 1px solid #4a5568; padding-bottom: 5px;">
      BPE’s Place in the Tokenization World
    </h4>
    <p style="color: #a0aec0;">
      BPE is a cornerstone method for subword tokenization, but it’s not alone. Its relatives include:
    </p>
    <ul style="list-style-type: '🔹'; padding-left: 25px; margin-top: 15px;">
      <li style="margin-bottom: 10px;">
        <strong>WordPiece</strong> (<span class="bpe-kbd-dark">BERT</span>, <span class="bpe-kbd-dark">DistilBERT</span>): Similar to BPE, but it picks the merge that <em>maximizes the likelihood</em> of the training data instead of the pair with the highest raw frequency.
      </li>
      <li>
        <strong>Unigram Language Model</strong> (<span class="bpe-kbd-dark">T5</span>, <span class="bpe-kbd-dark">ALBERT</span>, <span class="bpe-kbd-dark">XLNet</span>): A probabilistic approach that prunes subwords contributing least to total corpus probability — can produce multiple valid tokenizations per word (useful as regularization).
      </li>
    </ul>
  </div>

  <!-- ⚡ FINAL NOTE -->
  <div style="margin-top: 30px; background-color: #1a202c; border-left: 5px solid #3182ce; padding: 15px; border-radius: 6px;">
    <p style="margin: 0; color: #a0aec0;">
      Now that we understand the theory, let’s roll up our sleeves and build a 
      <strong style="color:#63b3ed;">BPE tokenizer</strong> from scratch 🔧
    </p>
  </div>
</div>
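
Before building the character-level version, a brief aside on the byte-level variant mentioned in the list of disadvantages. The sketch below only illustrates why a byte alphabet can represent any string; it is not GPT-2's actual tokenizer, and the sample text is an arbitrary choice.

```python
# "h\u00e9llo" contains an accented character, followed by a space and an emoji.
text = "h\u00e9llo \U0001F642"

chars = list(text)                        # Unicode characters
byte_values = list(text.encode("utf-8"))  # integers in the range 0..255

print(len(chars))                               # 7
print(len(byte_values))                         # 11
print(all(0 <= b <= 255 for b in byte_values))  # True
```

A character-level base vocabulary must have seen the accented letter and the emoji during training to represent them, whereas a byte-level base vocabulary needs only 256 symbols to cover every possible input, at the cost of longer initial sequences.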


<div style="background-color: #2d3748; color: #e2e8f0; border: 1px solid #4a5568; border-radius: 10px; padding: 25px; font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif; line-height: 1.7;">

<h3 style="color: #90cdf4; border-bottom: 2px solid #4a5568; padding-bottom: 10px; margin-top: 0;">Step 1: Preparing the Training Corpus</h3>

<style>
        .bpe-kbd-dark {
            background-color: #1a202c;
            border: 1px solid #4a5568;
            border-bottom: 2px solid #718096;
            border-radius: 4px;
            padding: 3px 6px;
            font-family: 'SFMono-Regular', Consolas, 'Liberation Mono', Menlo, Courier, monospace;
            font-size: 0.9em;
            color: #cbd5e1;
            white-space: nowrap;
        }
    </style>

 <h4 style="color: #a0aec0; margin-bottom: 5px;">What this part does:</h4>
    <p style="color: #a0aec0;">
        Before we can start merging pairs, we need a corpus of text in the right format. The BPE algorithm operates on a vocabulary built from words and their frequencies. So, our first step is to:
    </p>
    <ol style="color: #a0aec0; padding-left: 25px;">
        <li style="padding-left: 10px; margin-bottom: 8px;">Take a raw text corpus.</li>
        <li style="padding-left: 10px; margin-bottom: 8px;">Split it into individual words and count the frequency of each unique word.</li>
        <li style="padding-left: 10px; margin-bottom: 8px;">Represent each word as a sequence of characters, separated by spaces, and add a special end-of-word marker, like <span class="bpe-kbd-dark">&lt;/w&gt;</span>.</li>
    </ol>
     <p style="color: #a0aec0; margin-top: 15px;">
        This marker is crucial as it allows the algorithm to distinguish between a subword found inside a larger word (like <span class="bpe-kbd-dark">er</span> in "lowered") and a subword that forms the end of a word (like <span class="bpe-kbd-dark">er&lt;/w&gt;</span> in "lower"). Our goal is to transform a block of text into a dictionary where keys are the character-split words and values are their frequencies.
    </p>

 <div style="margin-top: 25px; background-color: #1a202c; border-left: 5px solid #805ad5; padding: 15px;">
        <h4 style="color: #b794f4; margin: 0 0 5px 0;">Example</h4>
         <p style="margin: 0; color: #a0aec0;">
            <strong>Input Corpus:</strong> <code>"low lower newest wider low low"</code><br>
            <strong>Becomes:</strong> <code>{'l o w &lt;/w&gt;': 3, 'l o w e r &lt;/w&gt;': 1, ...}</code>
        </p>
    </div>
</div>


In [1]:
import collections

def get_vocab(corpus: str) -> dict:
    """
    Takes a raw text corpus and creates an initial vocabulary.
    
    The vocabulary maps each word (split into characters and with an end-of-word marker)
    to its frequency in the corpus.

    Args:
        corpus (str): A string containing the training text.

    Returns:
        dict: A dictionary where keys are space-separated characters of words
              (e.g., 'l o w </w>') and values are their counts.
    """
    # Split the corpus into words and count their frequencies
    words = corpus.strip().split()
    word_counts = collections.Counter(words)
    
    # Initialize the vocabulary dictionary
    vocab = {}
    
    # For each word and its count, format it for BPE
    for word, count in word_counts.items():
        # Join characters with a space and add the end-of-word marker
        bpe_word = ' '.join(list(word)) + ' </w>'
        vocab[bpe_word] = count
        
    return vocab

# --- Let's test it with our example corpus ---

corpus_text = "low lower newest wider low low"
initial_vocab = get_vocab(corpus_text)

print("--- Initial Vocabulary ---")
# Use a nicer format for printing the dictionary
for word, count in initial_vocab.items():
    print(f"'{word}': {count}")


--- Initial Vocabulary ---
'l o w </w>': 3
'l o w e r </w>': 1
'n e w e s t </w>': 1
'w i d e r </w>': 1


<div style="background-color: #2d3748; color: #e2e8f0; border: 1px solid #4a5568; border-radius: 10px; padding: 25px; font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif; line-height: 1.7;">

<h3 style="color: #90cdf4; border-bottom: 2px solid #4a5568; padding-bottom: 10px; margin-top: 0;">Step 2: Finding the Most Frequent Pair</h3>

<style>
        .bpe-kbd-dark {
            background-color: #1a202c;
            border: 1px solid #4a5568;
            border-bottom: 2px solid #718096;
            border-radius: 4px;
            padding: 3px 6px;
            font-family: 'SFMono-Regular', Consolas, 'Liberation Mono', Menlo, Courier, monospace;
            font-size: 0.9em;
            color: #cbd5e1;
            white-space: nowrap;
        }
    </style>

 <h4 style="color: #a0aec0; margin-bottom: 5px;">What this part does:</h4>
    <p style="color: #a0aec0;">
        This is the "counting" step of the BPE loop. At each iteration, we need to scan our entire current vocabulary and count the occurrences of all adjacent pairs of symbols.
    </p>
    <p style="color: #a0aec0; margin-top: 15px;">
        For example, in the vocabulary <code style="color: #e2e8f0;">{'l o w &lt;/w&gt;': 3, 'l o w e r &lt;/w&gt;': 1}</code>, the pair <span class="bpe-kbd-dark">l o</span> appears in both keys. Its total count would be its count in the first word (3) plus its count in the second word (1), for a total of <strong>4</strong>.
    </p>
     <p style="color: #a0aec0; margin-top: 15px;">
        This function will iterate through every word-tokenization in our vocabulary, find all adjacent pairs within it, and sum up their counts, weighted by the frequency of the word itself. The function will return a dictionary of pairs and their total frequencies.
    </p>
</div>


In [2]:
import collections

# We assume 'initial_vocab' exists from the previous cell.
# If running this cell alone, uncomment the line below:
# initial_vocab = {'l o w </w>': 3, 'l o w e r </w>': 1, 'n e w e s t </w>': 1, 'w i d e r </w>': 1}

def get_pair_stats(vocab: dict) -> collections.Counter:
    """
    Counts the frequency of each adjacent pair of symbols in the vocabulary.

    Args:
        vocab (dict): The current vocabulary mapping token sequences to their counts.

    Returns:
        collections.Counter: A Counter object mapping pairs (tuples) to their frequencies.
    """
    pair_counts = collections.Counter()
    
    for word, count in vocab.items():
        symbols = word.split()
        
        # Iterate through symbols to find adjacent pairs
        for i in range(len(symbols) - 1):
            pair = (symbols[i], symbols[i+1])
            # Increment the pair's count by the frequency of the word it appeared in
            pair_counts[pair] += count
            
    return pair_counts

# --- Let's test it with our initial_vocab from Step 1 ---

pair_stats = get_pair_stats(initial_vocab)

print("--- Pair Frequencies (Sorted) ---")
# .most_common() conveniently sorts them from most to least common
for pair, count in pair_stats.most_common():
    print(f"{pair}: {count}")


--- Pair Frequencies (Sorted) ---
('l', 'o'): 4
('o', 'w'): 4
('w', '</w>'): 3
('w', 'e'): 2
('e', 'r'): 2
('r', '</w>'): 2
('n', 'e'): 1
('e', 'w'): 1
('e', 's'): 1
('s', 't'): 1
('t', '</w>'): 1
('w', 'i'): 1
('i', 'd'): 1
('d', 'e'): 1
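
The output above has a tie at the top: both `('l', 'o')` and `('o', 'w')` occur 4 times. The sketch below shows how Python resolves such ties, assuming Python 3.7 or later, where `Counter` preserves insertion order. The final rule using `min` is one possible explicit tie-break, not something the BPE algorithm prescribes.

```python
import collections

pair_counts = collections.Counter()
pair_counts[("l", "o")] += 4
pair_counts[("o", "w")] += 4
pair_counts[("w", "</w>")] += 3

# Both calls return the first-inserted pair among those tied for the top count.
print(pair_counts.most_common(1)[0])          # (('l', 'o'), 4)
print(max(pair_counts, key=pair_counts.get))  # ('l', 'o')

# An explicit rule that does not depend on insertion order:
# highest count first, then the lexicographically smallest pair.
best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
print(best)                                   # ('l', 'o')
```

Whichever rule is used, it should be deterministic, because the order of the learned merges determines how new text is tokenized later.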


<div style="background-color: #2d3748; color: #e2e8f0; border: 1px solid #4a5568; border-radius: 10px; padding: 25px; font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif; line-height: 1.7;">

<h3 style="color: #90cdf4; border-bottom: 2px solid #4a5568; padding-bottom: 10px; margin-top: 0;">Step 3: Merging the Best Pair</h3>

<style>
        .bpe-kbd-dark {
            background-color: #1a202c;
            border: 1px solid #4a5568;
            border-bottom: 2px solid #718096;
            border-radius: 4px;
            padding: 3px 6px;
            font-family: 'SFMono-Regular', Consolas, 'Liberation Mono', Menlo, Courier, monospace;
            font-size: 0.9em;
            color: #cbd5e1;
            white-space: nowrap;
        }
    </style>

 <h4 style="color: #a0aec0; margin-bottom: 5px;">What this part does:</h4>
    <p style="color: #a0aec0;">
        Once we've identified the most frequent pair using our <code>get_pair_stats</code> function, the next logical step is to merge that pair into a single, new token. This is the core "encoding" step.
    </p>
    <p style="color: #a0aec0; margin-top: 15px;">
        This operation involves creating a new vocabulary by iterating through the old one. For each word in the vocabulary, we find all occurrences of the target pair (e.g., <span class="bpe-kbd-dark">l o</span>) and replace them with the newly merged token (e.g., <span class="bpe-kbd-dark">lo</span>). The space between the original pair is removed, signifying they are now one unit.
    </p>

 <div style="margin-top: 25px; background-color: #1a202c; border-left: 5px solid #805ad5; padding: 15px;">
        <h4 style="color: #b794f4; margin: 0 0 5px 0;">Example</h4>
        <ul style="list-style-type: '➡️'; padding-left: 20px; margin-top: 10px; color: #a0aec0;">
            <li style="padding-left: 10px; margin-bottom: 8px;"><strong>Best Pair to Merge:</strong> <code>('l', 'o')</code></li>
            <li style="padding-left: 10px; margin-bottom: 8px;"><strong>Old Word Tokenization:</strong> <span class="bpe-kbd-dark">l o w &lt;/w&gt;</span></li>
            <li style="padding-left: 10px; margin-bottom: 8px;"><strong>New Word Tokenization:</strong> <span class="bpe-kbd-dark">lo w &lt;/w&gt;</span></li>
        </ul>
    </div>
</div>


In [3]:
import re

def merge_vocab(pair: tuple, v_in: dict) -> dict:
    """
    Merges the most frequent pair in the vocabulary.

    Args:
        pair (tuple): The pair of symbols to merge (e.g., ('l', 'o')).
        v_in (dict): The vocabulary before the merge.

    Returns:
        dict: The vocabulary after the merge.
    """
    v_out = {}
    
    # The pair as a string with a space, for regex replacement.
    # We use re.escape to handle tokens that might contain special regex characters.
    bigram = re.escape(' '.join(pair))
    
    # The merged pair as a single token string
    p = ''.join(pair)
    
    # Compile a regex to find the bigram. (?<!\S) is a negative lookbehind and (?!\S)
    # a negative lookahead; together they ensure we match whole tokens, not the tail
    # or head of a longer token. For example, the pair ('e', 'r') must not match
    # inside 'ne r', where 'ne' is already a single token.
    regex = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    
    for word in v_in:
        # Replace the pair with the merged token in the word representation
        w_out = regex.sub(p, word)
        v_out[w_out] = v_in[word]
        
    return v_out

# --- Let's test it with our previous results ---

# From our pair_stats, the best pair is ('l', 'o') with a count of 4.
# (If there's a tie, we just pick one. 'o w' also had a count of 4).
best_pair = ('l', 'o')

# We use 'initial_vocab' from Step 1.
# If running this cell alone, uncomment the line below:
# initial_vocab = {'l o w </w>': 3, 'l o w e r </w>': 1, 'n e w e s t </w>': 1, 'w i d e r </w>': 1}

# Perform the merge
merged_vocab = merge_vocab(best_pair, initial_vocab)

print(f"--- Vocabulary after merging {best_pair} ---")
for word, count in merged_vocab.items():
    print(f"'{word}': {count}")


--- Vocabulary after merging ('l', 'o') ---
'lo w </w>': 3
'lo w e r </w>': 1
'n e w e s t </w>': 1
'w i d e r </w>': 1
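
To see why `merge_vocab` uses lookarounds rather than a plain string replacement, consider the following minimal sketch. The word state `'ne r </w>'` is a hypothetical one, chosen so that `ne` is already a single token.

```python
import re

pair = ("e", "r")
word = "ne r </w>"  # hypothetical state where 'ne' is already one token

# Naive substring replacement fuses the tail of 'ne' with 'r'.
naive = word.replace(" ".join(pair), "".join(pair))
print(naive)    # ner </w>

# The lookarounds require whitespace (or a string edge) on both sides of the match.
pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
guarded = pattern.sub("".join(pair), word)
print(guarded)  # ne r </w>

# When 'e' and 'r' really are separate tokens, the merge still happens.
print(pattern.sub("".join(pair), "n e r </w>"))  # n er </w>
```

The naive version silently creates a token `ner` that no merge rule ever produced, which would corrupt the pair statistics in later iterations.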


<div style="background-color: #2d3748; color: #e2e8f0; border: 1px solid #4a5568; border-radius: 10px; padding: 25px; font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif; line-height: 1.7;">

<h3 style="color: #90cdf4; border-bottom: 2px solid #4a5568; padding-bottom: 10px; margin-top: 0;">Step 4: Putting It All Together - The Training Loop</h3>

 <style>
        .bpe-kbd-dark {
            background-color: #1a202c;
            border: 1px solid #4a5568;
            border-bottom: 2px solid #718096;
            border-radius: 4px;
            padding: 3px 6px;
            font-family: 'SFMono-Regular', Consolas, 'Liberation Mono', Menlo, Courier, monospace;
            font-size: 0.9em;
            color: #cbd5e1;
            white-space: nowrap;
        }
    </style>

 <h4 style="color: #a0aec0; margin-bottom: 5px;">What this part does:</h4>
    <p style="color: #a0aec0;">
        This is where we orchestrate the entire BPE training process. We'll create a main function that runs a loop for a specified number of merges (<code>num_merges</code>). This hyperparameter controls the size of our final vocabulary.
    </p>
    <p style="color: #a0aec0; margin-top: 15px;">In each iteration of the loop, the algorithm will perform the following actions:</p>
    <ol style="color: #a0aec0; padding-left: 25px;">
        <li style="padding-left: 10px; margin-bottom: 8px;">Call <code>get_pair_stats</code> on the current vocabulary to count all adjacent pairs.</li>
        <li style="padding-left: 10px; margin-bottom: 8px;">Identify the most frequent pair from the statistics. If no pairs are left, the training stops early.</li>
        <li style="padding-left: 10px; margin-bottom: 8px;">Record this "best pair" as a new <strong>merge rule</strong>. The ordered list of these rules is the primary output of our BPE training.</li>
        <li style="padding-left: 10px; margin-bottom: 8px;">Call <code>merge_vocab</code> to update the vocabulary by merging the best pair into a single token.</li>
        <li style="padding-left: 10px; margin-bottom: 8px;">Repeat this process until the desired number of merges is complete.</li>
    </ol>
    <p style="color: #a0aec0; margin-top: 15px;">
        The two final outputs will be the final state of the vocabulary and, more importantly, the ordered list of <strong>merge rules</strong>. This list is what we need to tokenize new, unseen text.
    </p>
</div>


In [4]:
import collections
import re

# We need to include all our helper functions in this cell to make it self-contained
# and runnable in the Kaggle notebook.

def get_vocab(corpus: str) -> dict:
    """Creates an initial vocabulary from a raw text corpus."""
    words = corpus.strip().split()
    word_counts = collections.Counter(words)
    vocab = {' '.join(list(word)) + ' </w>': count for word, count in word_counts.items()}
    return vocab

def get_pair_stats(vocab: dict) -> collections.Counter:
    """Counts the frequency of each adjacent pair of symbols."""
    pair_counts = collections.Counter()
    for word, count in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pair_counts[(symbols[i], symbols[i+1])] += count
    return pair_counts

def merge_vocab(pair: tuple, v_in: dict) -> dict:
    """Merges a specific pair in the vocabulary."""
    v_out = {}
    bigram = re.escape(' '.join(pair))
    p = ''.join(pair)
    regex = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        w_out = regex.sub(p, word)
        v_out[w_out] = v_in[word]
    return v_out

# --- The Main BPE Training Function ---

def train_bpe(corpus: str, num_merges: int, verbose: bool = True):
    """
    Trains a BPE model on a corpus.

    Args:
        corpus (str): The training text.
        num_merges (int): The number of merge operations to perform.
        verbose (bool): If True, prints the progress of each merge.

    Returns:
        tuple: A tuple containing:
            - The final vocabulary after all merges.
            - A list of the merge rules in the order they were learned.
    """
    # 1. Initialize vocabulary from the corpus
    vocab = get_vocab(corpus)
    
    # This will store our ordered merge rules
    merge_rules = []
    
    if verbose:
        print("--- Initial Vocabulary ---")
        print(vocab)
        print("-" * 50)
    
    for i in range(num_merges):
        # 2. Get pair statistics from the current vocabulary
        pair_stats = get_pair_stats(vocab)
        
        # If there are no more pairs to merge, stop
        if not pair_stats:
            if verbose: print("No more pairs to merge. Stopping early.")
            break
        
        # 3. Find the most frequent pair
        best_pair = max(pair_stats, key=pair_stats.get)
        merge_rules.append(best_pair)
        
        # 4. Merge the best pair in the vocabulary
        vocab = merge_vocab(best_pair, vocab)
        
        if verbose:
            print(f"Merge {i+1}/{num_merges}: Merged {best_pair} -> {''.join(best_pair)}")
            print(f"  Current Vocab: {list(vocab.keys())}\n")
            
    return vocab, merge_rules

# --- Let's run the full training loop on our corpus ---

corpus_text = "low lower newest wider low low"
num_merges_to_perform = 10

final_vocab, learned_rules = train_bpe(corpus_text, num_merges_to_perform)

print("\n" + "="*25 + " TRAINING COMPLETE " + "="*25)
print("\n--- Final Vocabulary State ---")
for word, count in final_vocab.items():
    print(f"'{word}': {count}")

print("\n--- Learned Merge Rules (in order) ---")
for i, rule in enumerate(learned_rules):
    print(f"{i+1}: {rule}")


--- Initial Vocabulary ---
{'l o w </w>': 3, 'l o w e r </w>': 1, 'n e w e s t </w>': 1, 'w i d e r </w>': 1}
--------------------------------------------------
Merge 1/10: Merged ('l', 'o') -> lo
  Current Vocab: ['lo w </w>', 'lo w e r </w>', 'n e w e s t </w>', 'w i d e r </w>']

Merge 2/10: Merged ('lo', 'w') -> low
  Current Vocab: ['low </w>', 'low e r </w>', 'n e w e s t </w>', 'w i d e r </w>']

Merge 3/10: Merged ('low', '</w>') -> low</w>
  Current Vocab: ['low</w>', 'low e r </w>', 'n e w e s t </w>', 'w i d e r </w>']

Merge 4/10: Merged ('e', 'r') -> er
  Current Vocab: ['low</w>', 'low er </w>', 'n e w e s t </w>', 'w i d er </w>']

Merge 5/10: Merged ('er', '</w>') -> er</w>
  Current Vocab: ['low</w>', 'low er</w>', 'n e w e s t </w>', 'w i d er</w>']

Merge 6/10: Merged ('low', 'er</w>') -> lower</w>
  Current Vocab: ['low</w>', 'lower</w>', 'n e w e s t </w>', 'w i d er</w>']

Merge 7/10: Merged ('n', 'e') -> ne
  Current Vocab: ['low</w>', 'lower</w>', 'ne w e s t </

<div style="background-color: #2d3748; color: #e2e8f0; border: 1px solid #4a5568; border-radius: 10px; padding: 25px; font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif; line-height: 1.7;">

<h3 style="color: #90cdf4; border-bottom: 2px solid #4a5568; padding-bottom: 10px; margin-top: 0;">Step 5: Applying the BPE Model to Tokenize New Words</h3>

 <style>
        .bpe-kbd-dark {
            background-color: #1a202c;
            border: 1px solid #4a5568;
            border-bottom: 2px solid #718096;
            border-radius: 4px;
            padding: 3px 6px;
            font-family: 'SFMono-Regular', Consolas, 'Liberation Mono', Menlo, Courier, monospace;
            font-size: 0.9em;
            color: #cbd5e1;
            white-space: nowrap;
        }
    </style>

 <h4 style="color: #a0aec0; margin-bottom: 5px;">What this part does:</h4>
    <p style="color: #a0aec0;">
        Now that we have our ordered list of <code>learned_rules</code>, we can build a tokenizer function. This function will take a new word (or a list of words) and segment it into the subword tokens our model has learned. This is the "inference" step.
    </p>
    <p style="color: #a0aec0; margin-top: 15px;">
        The process for tokenizing a single new word is as follows:
    </p>
    <ol style="color: #a0aec0; padding-left: 25px;">
        <li style="padding-left: 10px; margin-bottom: 8px;">First, split the word into its basic characters and add the end-of-word marker <span class="bpe-kbd-dark">&lt;/w&gt;</span>. For example, "lowest" becomes <span class="bpe-kbd-dark">l o w e s t &lt;/w&gt;</span>.</li>
        <li style="padding-left: 10px; margin-bottom: 8px;">Collect the adjacent token pairs in the word's current tokenization.</li>
        <li style="padding-left: 10px; margin-bottom: 8px;">Scan our <strong>ordered</strong> <code>learned_rules</code> from the top and pick the first rule whose pair appears among those adjacent pairs. Pair frequency is not consulted at this stage; the position of a rule in the list is its priority.</li>
        <li style="padding-left: 10px; margin-bottom: 8px;">Merge every occurrence of that pair in the word (e.g., the rule <span class="bpe-kbd-dark">('l', 'o')</span> turns <span class="bpe-kbd-dark">l o</span> into <span class="bpe-kbd-dark">lo</span>).</li>
        <li style="padding-left: 10px; margin-bottom: 8px;">Restart the scan <em>from the beginning of the merge rules list</em>. This ensures that higher-priority merges are always performed first.</li>
        <li style="padding-left: 10px; margin-bottom: 8px;">Stop when no rule in the list matches any adjacent pair in the word.</li>
    </ol>
    <p style="color: #a0aec0; margin-top: 15px;">
        This procedure allows BPE to handle out-of-vocabulary (OOV) words gracefully. For instance, the word "lowest" was not in our training data, but it starts with the learned subword "low". Our tokenizer reuses that piece and falls back to single characters for the rest, producing <span class="bpe-kbd-dark">low</span>, <span class="bpe-kbd-dark">e</span>, <span class="bpe-kbd-dark">s</span>, <span class="bpe-kbd-dark">t</span>, and <span class="bpe-kbd-dark">&lt;/w&gt;</span> with our 10 rules. A larger rule set could merge the suffix further.
    </p>
</div>


In [5]:
import collections

# Helper: collect the adjacent pairs that occur in a single word's tokenization
def get_word_pair_stats(tokens: list) -> collections.Counter:
    """Counts adjacent pairs in a single tokenized word."""
    pair_counts = collections.Counter()
    for i in range(len(tokens) - 1):
        pair_counts[(tokens[i], tokens[i+1])] += 1
    return pair_counts

# --- The BPE Tokenizer Function ---

def tokenize_word(word: str, learned_rules: list) -> list:
    """
    Tokenizes a single word using a pre-learned list of BPE merge rules.
    
    Args:
        word (str): The word to tokenize.
        learned_rules (list): The ordered list of merge rules.
        
    Returns:
        list: A list of BPE tokens for the word.
    """
    # 1. Pre-process the word: split into chars and add end-of-word marker
    tokens = list(word) + ['</w>']
    
    while True:
        # Collect the adjacent pairs present in the current tokenization
        pair_stats = get_word_pair_stats(tokens)
        
        # If there are no pairs, we're done with this word
        if not pair_stats:
            break
            
        # Find the pair that should be merged next by checking against the ordered rules
        # We look for the first rule in `learned_rules` that exists in our current `pair_stats`
        best_pair_to_merge = None
        for rule in learned_rules:
            if rule in pair_stats:
                best_pair_to_merge = rule
                break
        
        # If no pair from our rules is found in the current word tokens, we're done.
        if best_pair_to_merge is None:
            break

        # Merge every occurrence of the chosen pair, scanning left to right
        first, second = best_pair_to_merge
        new_tokens = []
        i = 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == first and tokens[i+1] == second:
                # Found the pair: emit the merged token and skip both halves
                new_tokens.append(first + second)
                i += 2
            else:
                # Not the start of the pair: keep the token unchanged
                new_tokens.append(tokens[i])
                i += 1
        
        tokens = new_tokens

    return tokens

# --- Let's test our tokenizer! ---

# We use the 'learned_rules' from the previous step.
# If running this cell alone, uncomment the line below:
# learned_rules = [('l', 'o'), ('lo', 'w'), ('low', '</w>'), ('e', 'r'), ('er', '</w>'), ('low', 'er</w>'), ('n', 'e'), ('ne', 'w'), ('new', 'e'), ('newe', 's')]

# Let's tokenize a word that was in our training set: "lower"
tokenized_lower = tokenize_word("lower", learned_rules)
print(f"Tokenization of 'lower': {tokenized_lower}")

# Now, let's tokenize a new, unseen word: "lowest"
tokenized_lowest = tokenize_word("lowest", learned_rules)
print(f"Tokenization of 'lowest': {tokenized_lowest}")

# Another unseen word: "newer"
tokenized_newer = tokenize_word("newer", learned_rules)
print(f"Tokenization of 'newer': {tokenized_newer}")

# A word that can't be merged much: "know"
tokenized_know = tokenize_word("know", learned_rules)
print(f"Tokenization of 'know': {tokenized_know}")


Tokenization of 'lower': ['lower</w>']
Tokenization of 'lowest': ['low', 'e', 's', 't', '</w>']
Tokenization of 'newer': ['new', 'er</w>']
Tokenization of 'know': ['k', 'n', 'o', 'w', '</w>']


<div style="background-color: #2d3748; color: #e2e8f0; border: 1px solid #4a5568; border-radius: 10px; padding: 25px; font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif; line-height: 1.7;">

<h3 style="color: #90cdf4; border-bottom: 2px solid #4a5568; padding-bottom: 10px; margin-top: 0;">Part 6: Understanding the Tokenizer's Output</h3>

 <style>
        .bpe-kbd-dark {
            background-color: #1a202c;
            border: 1px solid #4a5568;
            border-bottom: 2px solid #718096;
            border-radius: 4px;
            padding: 3px 6px;
            font-family: 'SFMono-Regular', Consolas, 'Liberation Mono', Menlo, Courier, monospace;
            font-size: 0.9em;
            color: #cbd5e1;
            white-space: nowrap;
        }
    </style>

<p style="color: #a0aec0;">
        The output from our tokenizer function demonstrates the core principles of BPE. The final tokenization of each word is a direct result of applying our 10 learned merge rules in their specific order. Let's analyze each case to see how the model arrived at its decision.
    </p>

 <div style="margin-top: 25px; background-color: #1a202c; border-left: 5px solid #63b3ed; padding: 15px;">
        <h4 style="color: #90cdf4; margin: 0 0 10px 0;">1. Word: "lower" &rarr; Result: <code>['lower&lt;/w&gt;']</code></h4>
        <p style="margin: 0; color: #a0aec0;">This is a perfect example of a word our model is "expert" at. It was present in our training corpus, so the model learned all the necessary merges to represent it as a single token:</p>
        <ol style="color: #a0aec0; padding-left: 25px; font-size: 0.95em;">
            <li style="padding-left: 10px; margin-bottom: 5px;"><code>l o w e r &lt;/w&gt;</code> &rarr; applies rule #1 <code>('l','o')</code> &rarr; <code>lo w e r &lt;/w&gt;</code></li>
            <li style="padding-left: 10px; margin-bottom: 5px;">&rarr; applies rule #2 <code>('lo','w')</code> &rarr; <code>low e r &lt;/w&gt;</code></li>
            <li style="padding-left: 10px; margin-bottom: 5px;">&rarr; applies rule #4 <code>('e','r')</code> &rarr; <code>low er &lt;/w&gt;</code></li>
            <li style="padding-left: 10px; margin-bottom: 5px;">&rarr; applies rule #5 <code>('er','&lt;/w&gt;')</code> &rarr; <code>low er&lt;/w&gt;</code></li>
            <li style="padding-left: 10px; margin-bottom: 5px;">&rarr; applies rule #6 <code>('low','er&lt;/w&gt;')</code> &rarr; <code>lower&lt;/w&gt;</code></li>
        </ol>
        <p style="margin: 0; color: #a0aec0;">Since all sub-parts could be progressively merged into one, the final output is a single token.</p>
    </div>

 <div style="margin-top: 25px; background-color: #1a202c; border-left: 5px solid #f6ad55; padding: 15px;">
        <h4 style="color: #fbd38d; margin: 0 0 10px 0;">2. Word: "lowest" &rarr; Result: <code>['low', 'e', 's', 't', '&lt;/w&gt;']</code></h4>
        <p style="margin: 0; color: #a0aec0;">This is an out-of-vocabulary (OOV) word. The tokenizer does its best with the rules it knows:</p>
        <ol style="color: #a0aec0; padding-left: 25px; font-size: 0.95em;">
            <li style="padding-left: 10px; margin-bottom: 5px;"><code>l o w e s t &lt;/w&gt;</code> &rarr; applies rules #1 and #2 &rarr; <code>low e s t &lt;/w&gt;</code></li>
        </ol>
        <p style="margin: 0; color: #a0aec0;">At this point, the tokens are <span class="bpe-kbd-dark">low</span>, <span class="bpe-kbd-dark">e</span>, <span class="bpe-kbd-dark">s</span>, <span class="bpe-kbd-dark">t</span>, <span class="bpe-kbd-dark">&lt;/w&gt;</span>. The tokenizer looks for pairs like <span class="bpe-kbd-dark">('low', 'e')</span>, <span class="bpe-kbd-dark">('e', 's')</span>, <span class="bpe-kbd-dark">('s', 't')</span>, etc. <strong>None of these pairs exist in our 10 learned rules.</strong> For example, we learned to merge <span class="bpe-kbd-dark">('new', 'e')</span> but not <span class="bpe-kbd-dark">('low', 'e')</span>. Therefore, the merging process stops here.
        </p>
    </div>

 <div style="margin-top: 25px; background-color: #1a202c; border-left: 5px solid #81e6d9; padding: 15px;">
        <h4 style="color: #a7f3d0; margin: 0 0 10px 0;">3. Word: "newer" &rarr; Result: <code>['new', 'er&lt;/w&gt;']</code></h4>
        <p style="margin: 0; color: #a0aec0;">This is another great example of handling an OOV word by composing known subwords. "newer" was not in our training data, but "newest" and "lower/wider" were.</p>
        <ol style="color: #a0aec0; padding-left: 25px; font-size: 0.95em;">
            <li style="padding-left: 10px; margin-bottom: 5px;"><code>n e w e r &lt;/w&gt;</code> &rarr; applies rule #4 <code>('e','r')</code>, the highest-priority rule present in the word &rarr; <code>n e w er &lt;/w&gt;</code></li>
            <li style="padding-left: 10px; margin-bottom: 5px;">&rarr; applies rule #5 <code>('er','&lt;/w&gt;')</code> &rarr; <code>n e w er&lt;/w&gt;</code></li>
            <li style="padding-left: 10px; margin-bottom: 5px;">&rarr; applies rules #7 and #8 &rarr; <code>new er&lt;/w&gt;</code></li>
        </ol>
        <p style="margin: 0; color: #a0aec0;">The process stops here because the pair <span class="bpe-kbd-dark">('new', 'er&lt;/w&gt;')</span> was never learned. Our model only learned to merge <span class="bpe-kbd-dark">('low', 'er&lt;/w&gt;')</span>. This shows how BPE captures both stem words (<span class="bpe-kbd-dark">new</span>) and common suffixes (<span class="bpe-kbd-dark">er&lt;/w&gt;</span>).</p>
    </div>

 <div style="margin-top: 25px; background-color: #1a202c; border-left: 5px solid #f56565; padding: 15px;">
        <h4 style="color: #fc8181; margin: 0 0 10px 0;">4. Word: "know" &rarr; Result: <code>['k', 'n', 'o', 'w', '&lt;/w&gt;']</code></h4>
        <p style="margin: 0; color: #a0aec0;">This word demonstrates the ultimate fallback: if a word contains no known sub-sequences, it's simply broken down into its individual characters.
        The initial tokens are <span class="bpe-kbd-dark">k</span>, <span class="bpe-kbd-dark">n</span>, <span class="bpe-kbd-dark">o</span>, <span class="bpe-kbd-dark">w</span>, <span class="bpe-kbd-dark">&lt;/w&gt;</span>. The tokenizer looks at the pairs <span class="bpe-kbd-dark">('k', 'n')</span>, <span class="bpe-kbd-dark">('n', 'o')</span>, <span class="bpe-kbd-dark">('o', 'w')</span>, etc. Not a single one of these pairs exists in our <code>learned_rules</code>. Therefore, no merges can be performed, and the word remains fully split.
        </p>
    </div>
</div>
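
The four walkthroughs above can be checked mechanically. The following is a minimal sketch, not part of the original notebook: `trace_tokenize` is a hypothetical helper that follows the same "first matching rule wins, then restart" procedure as `tokenize_word`, but also records the 1-based number of each rule in the order it actually fires. The rule list is the one given in the test cell above.

```python
def trace_tokenize(word, learned_rules):
    """Tokenize `word` and record which rule number fired at each step."""
    tokens = list(word) + ["</w>"]
    applied = []  # 1-based rule numbers, in the order they were applied
    while True:
        pairs = set(zip(tokens, tokens[1:]))
        # Highest-priority rule (earliest in the list) present in the word
        match = next(
            ((n, rule) for n, rule in enumerate(learned_rules, start=1)
             if rule in pairs),
            None,
        )
        if match is None:
            break
        number, (first, second) = match
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == first and tokens[i + 1] == second:
                merged.append(first + second)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
        applied.append(number)
    return tokens, applied


learned_rules = [('l', 'o'), ('lo', 'w'), ('low', '</w>'), ('e', 'r'),
                 ('er', '</w>'), ('low', 'er</w>'), ('n', 'e'), ('ne', 'w'),
                 ('new', 'e'), ('newe', 's')]

for word in ["lower", "lowest", "newer", "know"]:
    tokens, applied = trace_tokenize(word, learned_rules)
    print(f"{word!r}: rules {applied} -> {tokens}")
```

For "lower" this prints rules `[1, 2, 4, 5, 6]`, and for "know" the list of applied rules is empty. Running it on other words is a quick way to see which rule takes priority when several could apply.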


<div style="background-color: #2d3748; color: #e2e8f0; border: 1px solid #4a5568; border-radius: 10px; padding: 25px; font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif; line-height: 1.7;">

<h3 style="color: #90cdf4; border-bottom: 2px solid #4a5568; padding-bottom: 10px; margin-top: 0;">Part 7: Where to Go From Here?</h3>

<p style="color: #a0aec0;">
        We've successfully built a BPE tokenizer. Now, the real fun begins! We can use this tokenizer as a foundational piece for more advanced NLP models or explore ways to improve the tokenization process itself. Below are three potential paths to continue this project, ranging from direct improvements to building a full language model.
    </p>

<div style="margin-top: 25px; background-color: #1a202c; border-left: 5px solid #63b3ed; padding: 20px; border-radius: 5px;">
        <h4 style="color: #90cdf4; margin: 0 0 10px 0;">Path 1: Building a Text Encoder/Decoder</h4>
        <p style="margin: 0; color: #a0aec0;">
            Our current tokenizer works on single words. A complete tokenizer needs to handle entire sentences or documents. This involves creating a full "Encoder" class that manages the vocabulary and converts text to integer sequences, and a "Decoder" to convert them back.
        </p>
        <ul style="color: #a0aec0; padding-left: 25px; font-size: 0.95em;">
            <li style="padding-left: 10px; margin-bottom: 8px;"><strong>Create a Vocabulary Lookup:</strong> After training, create a final vocabulary dictionary (<code>vocab</code>) that maps each unique token (from the initial characters and all merged tokens) to an integer index. Example: <code>{'l': 0, 'o': 1, ..., 'low': 27, 'er&lt;/w&gt;': 28, ...}</code>.</li>
            <li style="padding-left: 10px; margin-bottom: 8px;"><strong>Build the <code>encode()</code> method:</strong> This method will take a string of text, split it into words, tokenize each word using our <code>tokenize_word</code> function, and then map each resulting subword token to its integer ID from the vocabulary lookup.</li>
            <li style="padding-left: 10px; margin-bottom: 8px;"><strong>Build the <code>decode()</code> method:</strong> This method will take a list of integers, map them back to their string tokens, and then join them to reconstruct the original text. You'll need to handle the <code>&lt;/w&gt;</code> tokens by replacing them with a space.</li>
        </ul>
        <p style="margin: 15px 0 0 0; color: #a0aec0;">
            <strong>Outcome:</strong> A self-contained Python class that can encode any text into numbers and decode them back, just like tokenizers from libraries like Hugging Face's <code>tokenizers</code>.
        </p>
    </div>

<div style="margin-top: 25px; background-color: #1a202c; border-left: 5px solid #f6ad55; padding: 20px; border-radius: 5px;">
        <h4 style="color: #fbd38d; margin: 0 0 10px 0;">Path 2: Build a Simple Language Model (N-gram)</h4>
        <p style="margin: 0; color: #a0aec0;">
            With your text encoded into token IDs, you can build a classical statistical language model. An N-gram model learns to predict the next token based on the previous <em>N-1</em> tokens. A bigram (N=2) model is a great place to start.
        </p>
        <ul style="color: #a0aec0; padding-left: 25px; font-size: 0.95em;">
            <li style="padding-left: 10px; margin-bottom: 8px;"><strong>Encode your corpus:</strong> Use the encoder from Path 1 to convert your entire training corpus into a long sequence of token IDs.</li>
            <li style="padding-left: 10px; margin-bottom: 8px;"><strong>Count Bigram Frequencies:</strong> Iterate through the sequence and count the occurrences of every adjacent pair of token IDs. For example, how many times does token <code>27</code> (<code>low</code>) appear before token <code>15</code> (<code>e</code>)? Store this in a nested dictionary or a matrix.</li>
            <li style="padding-left: 10px; margin-bottom: 8px;"><strong>Calculate Probabilities:</strong> Convert these counts into conditional probabilities: $P(\text{token}_i | \text{token}_{i-1})$.</li>
            <li style="padding-left: 10px; margin-bottom: 8px;"><strong>Generate Text:</strong> Create a function that starts with a seed token, predicts the most likely next token, appends it, and then uses that new token to predict the next one, generating new text one token at a time.</li>
        </ul>
        <p style="margin: 15px 0 0 0; color: #a0aec0;">
            <strong>Outcome:</strong> A generative model that can produce novel text that mimics the style of your training corpus. It's the "ancestor" of modern models like GPT.
        </p>
    </div>

<div style="margin-top: 25px; background-color: #1a202c; border-left: 5px solid #81e6d9; padding: 20px; border-radius: 5px;">
        <h4 style="color: #a7f3d0; margin: 0 0 10px 0;">Path 3: Text Classification with Embeddings</h4>
        <p style="margin: 0; color: #a0aec0;">
            This is a more modern approach. You can use your tokenizer to prepare data for a neural network that performs a task like sentiment analysis.
        </p>
        <ul style="color: #a0aec0; padding-left: 25px; font-size: 0.95em;">
            <li style="padding-left: 10px; margin-bottom: 8px;"><strong>Find a Labeled Dataset:</strong> Use a dataset like the IMDB movie reviews, which has text and a positive/negative label.</li>
            <li style="padding-left: 10px; margin-bottom: 8px;"><strong>Train the BPE Tokenizer:</strong> Train your BPE tokenizer on the text from this dataset to create a domain-specific vocabulary.</li>
            <li style="padding-left: 10px; margin-bottom: 8px;"><strong>Build a Neural Network Model:</strong> Using a library like TensorFlow or PyTorch, build a simple model. The first layer will be an <strong>Embedding layer</strong>. This layer maps each of your integer token IDs to a dense vector, turning <code>[27, 15, 83]</code> into a matrix of float vectors.</li>
            <li style="padding-left: 10px; margin-bottom: 8px;"><strong>Train the Classifier:</strong> Feed the embedded token sequences into subsequent layers (like LSTM, GRU, or even just a simple pooling layer) and train the network to predict the sentiment label. The embedding vectors will be learned automatically during this process.</li>
        </ul>
        <p style="margin: 15px 0 0 0; color: #a0aec0;">
            <strong>Outcome:</strong> A complete text classification pipeline, where you've built the tokenizer from scratch. You'll gain a deep understanding of how text is prepared for modern deep learning models.
        </p>
    </div>
</div>
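
The counting, probability, and generation steps of Path 2 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a full language model: `train_bigram` and `generate` are hypothetical names, the token IDs are a made-up toy sequence rather than an encoded corpus, and generation is greedy (it always picks the most likely next token).

```python
import collections


def train_bigram(token_ids):
    """Count adjacent ID pairs and normalise them to P(next | previous)."""
    counts = collections.defaultdict(collections.Counter)
    for prev, nxt in zip(token_ids, token_ids[1:]):
        counts[prev][nxt] += 1
    probs = {}
    for prev, followers in counts.items():
        total = sum(followers.values())
        probs[prev] = {nxt: c / total for nxt, c in followers.items()}
    return probs


def generate(probs, seed, max_new_tokens=5):
    """Greedy generation: repeatedly append the most likely next token."""
    out = [seed]
    for _ in range(max_new_tokens):
        followers = probs.get(out[-1])
        if not followers:  # token never seen as a left-hand context
            break
        out.append(max(followers, key=followers.get))
    return out


# Hypothetical toy sequence of token IDs (stand-in for an encoded corpus)
ids = [0, 1, 2, 0, 1, 3, 0, 1, 2]
probs = train_bigram(ids)
print(probs[1])                                   # P(2 | 1) = 2/3, P(3 | 1) = 1/3
print(generate(probs, seed=0, max_new_tokens=4))  # [0, 1, 2, 0, 1]
```

Greedy decoding quickly falls into loops, as the toy output shows. Sampling the next token in proportion to its probability (for example with `random.choices`) usually gives more varied text.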


<div style="background-color: #2d3748; color: #e2e8f0; border: 1px solid #4a5568; border-radius: 10px; padding: 25px; font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif; line-height: 1.7;">

<h3 style="color: #90cdf4; border-bottom: 2px solid #4a5568; padding-bottom: 10px; margin-top: 0;">Part 8: Encapsulating BPE into a Reusable Tokenizer Class</h3>

<p style="color: #a0aec0;">
        So far, we have a set of powerful but disconnected functions: one to train BPE rules (<code>train_bpe</code>) and another to tokenize a single word (<code>tokenize_word</code>). This is like having the engine of a car but no chassis, steering wheel, or pedals. To make our tokenizer truly useful, we need to build a complete "vehicle" around it: a self-contained class that handles the entire process from raw text to numerical data and back again.
    </p>

<p style="color: #a0aec0; margin-top: 15px;">
        This class, which we'll call <code>BPE_Tokenizer</code>, will manage three critical components:
    </p>
    <ol style="color: #a0aec0; padding-left: 25px;">
        <li style="padding-left: 10px; margin-bottom: 8px;"><strong>Merge Rules:</strong> The ordered list of merge rules we learned during training.</li>
        <li style="padding-left: 10px; margin-bottom: 8px;"><strong>Vocabulary Lookup:</strong> A mapping from each unique subword token (like <code>'low'</code>, <code>'er&lt;/w&gt;'</code>) to a unique integer ID. This is what allows us to convert text into a format that machine learning models can understand.</li>
        <li style="padding-left: 10px; margin-bottom: 8px;"><strong>Encoder/Decoder Logic:</strong> The main <code>encode()</code> and <code>decode()</code> methods that serve as the public interface for our tokenizer.</li>
    </ol>

<h4 style="color: #a0aec0; margin-top: 25px;">Pros &amp; Cons: Moving from Raw Functions to a Class Structure</h4>

<div style="display: flex; gap: 20px; margin-top: 15px;">
        <div style="flex: 1; background-color: #1a202c; border: 1px solid #38a169; border-radius: 8px; padding: 15px;">
            <h5 style="color: #68d391; margin: 0 0 10px 0; font-size: 1.1em;">Pros (Why a Class is Better)</h5>
            <ul style="padding-left: 20px; margin: 0; color: #a0aec0; font-size: 0.95em;">
                <li style="margin-bottom: 10px;">
                    <strong>State Management:</strong> The class holds the <code>learned_rules</code> and <code>vocab</code> as internal state (attributes). We no longer need to pass them around as function arguments, which is cleaner and less error-prone.
                </li>
                <li style="margin-bottom: 10px;">
                    <strong>Abstraction:</strong> A user of the class only needs to know about <code>.train()</code>, <code>.encode()</code>, and <code>.decode()</code>. The complex internal logic of merging pairs and handling word boundaries is hidden away. This is a core principle of good software design.
                </li>
                <li style="margin-bottom: 10px;">
                    <strong>Reusability &amp; Portability:</strong> The trained tokenizer object can be easily saved (e.g., using <code>pickle</code>) and loaded in another project without needing to retrain or copy-paste all the helper functions. It becomes a single, portable artifact.
                </li>
                <li style="margin-bottom: 10px;">
                    <strong>Handles Full Sentences:</strong> Our previous <code>tokenize_word</code> function only worked on a single word. The <code>encode</code> method in the class will be designed to process entire strings of text by splitting them into words before applying the merge rules.
                </li>
            </ul>
        </div>

 <div style="flex: 1; background-color: #1a202c; border: 1px solid #c53030; border-radius: 8px; padding: 15px;">
            <h5 style="color: #f56565; margin: 0 0 10px 0; font-size: 1.1em;">Cons (Potential Downsides)</h5>
            <ul style="padding-left: 20px; margin: 0; color: #a0aec0; font-size: 0.95em;">
                <li style="margin-bottom: 10px;">
                    <strong>Increased Complexity (Initially):</strong> Writing a class requires more boilerplate code than simple functions. We need to think about the <code>__init__</code> method and how the object's attributes relate to each other.
                </li>
                <li style="margin-bottom: 10px;">
                    <strong>Less Transparency (for Debugging):</strong> While abstraction is a pro for users, it can be a con for developers trying to debug. The logic is now "hidden" inside methods, which can make tracing data flow slightly more challenging than with standalone functions.
                </li>
                <li style="margin-bottom: 10px;">
                    <strong>Stateful is not always better:</strong> For very simple, one-off scripts, a functional approach can be more straightforward. An object holding state can sometimes lead to unexpected behavior if its state is modified incorrectly. (However, for a tokenizer, state is essential).
                </li>
            </ul>
        </div>
    </div>
    <p style="margin-top:20px; color: #cbd5e1; text-align: center;">
        Overall, for any serious application, the benefits of encapsulation overwhelmingly outweigh the drawbacks. We are now moving from a "script" to a "tool".
    </p>

</div>
