# **Create The Byte-Pair Encoding (BPE) Tokenizer From Scratch**

As of 2025, BPE remains popular and widely used. Models including GPT-2, GPT-3, GPT-4, and Llama 3 use this tokenizer.

OpenAI's original implementation of the BPE tokenizer can be found [here](https://github.com/openai/gpt-2/blob/master/src/encoder.py), while practitioners usually incorporate the [tiktoken](https://github.com/openai/tiktoken) library in their model development pipelines. Karpathy's [minBPE](https://github.com/karpathy/minbpe) is also mentioned in Sebastian's work as a possible alternative to the workflow below.

For practice, the BPE tokenizer will be implemented from scratch in this notebook - though it won't be nearly as optimized as OpenAI's or maybe even Karpathy's versions.



## **BPE Algorithm Outline**

>**1. Identify frequent pairs**
>- In each iteration, scan the text to find the most commonly occurring pair of bytes (or characters)
>
>**2. Replace and record**
>
>- Replace that pair with a new placeholder ID (one not already in use, e.g., if we start with 0...255, the first placeholder would be 256)
>- Record this mapping in a lookup table
>- The size of the lookup table is a hyperparameter, also called the "vocabulary size" (for GPT-2, that's 50,257)
>
>**3. Repeat until no gains**
>
>- Keep repeating steps 1 and 2, continually merging the most frequent pairs
>- Stop when no further compression is possible (e.g., no pair occurs more than once)
>
>**Decompression (decoding)**
>
>- To restore the original text, reverse the process by substituting each ID with its corresponding pair, using the lookup table
>
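
As a minimal sketch of the three steps above, the function below runs the merge loop directly on UTF-8 byte values. The name `bpe_train_sketch` and its `num_merges` parameter are illustrative assumptions rather than part of any library, and ties between equally frequent pairs are broken by first occurrence, which is only one possible choice.

```python
from collections import Counter


def bpe_train_sketch(text, num_merges):
    """Toy BPE training loop over raw UTF-8 bytes (hypothetical helper)."""
    ids = list(text.encode("utf-8"))  # start from byte values 0..255
    merges = {}                       # (id_a, id_b) -> new_id lookup table
    next_id = 256                     # first ID not already in use

    for _ in range(num_merges):
        # Step 1: count adjacent pairs and pick the most frequent one.
        counts = Counter(zip(ids, ids[1:]))
        if not counts:
            break
        pair = max(counts, key=counts.get)  # first pair wins on ties
        if counts[pair] < 2:
            break  # Step 3: no pair repeats, so nothing more to compress

        # Step 2: replace every occurrence of the pair and record the mapping.
        merged, i = [], 0
        while i < len(ids):
            if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
                merged.append(next_id)
                i += 2
            else:
                merged.append(ids[i])
                i += 1
        ids = merged
        merges[pair] = next_id
        next_id += 1

    return ids, merges


ids, merges = bpe_train_sketch("the cat in the hat", num_merges=3)
print(merges)
print(len(ids))
```

With three merges, the 18 input bytes shrink to 12 IDs, and the lookup table holds the entries 256, 257, and 258.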

### **Working Example**

>&nbsp;
> Suppose we have the text (training dataset) `the cat in the hat` from which we want to build the vocabulary for a BPE tokenizer
>
>**Iteration 1**
>
>1. Identify frequent pairs
>  - In this text, "th" appears twice (at the beginning and before the second "e")
>
>2. Replace and record
>  - replace "th" with a new token ID that is not already in use, e.g., 256
>  - the new text is: `<256>e cat in <256>e hat`
>  - the new vocabulary is
>
>```
>  0: ...
>  ...
>  256: "th"
>```
>
>**Iteration 2**
>
>1. **Identify frequent pairs**  
>   - In the text `<256>e cat in <256>e hat`, the pair `<256>e` appears twice
>
>2. **Replace and record**  
>   - replace `<256>e` with a new token ID that is not already in use, for example, `257`.  
>   - The new text is:
>     ```
>     <257> cat in <257> hat
>     ```
>   - The updated vocabulary is:
>     ```
>     0: ...
>     ...
>     256: "th"
>     257: "<256>e"
>     ```
>
>**Iteration 3**
>
>1. **Identify frequent pairs**  
>   - In the text `<257> cat in <257> hat`, the pair `<257> ` appears twice (once at the beginning and once before “hat”).
>
>2. **Replace and record**  
>   - replace `<257> ` with a new token ID that is not already in use, for example, `258`.  
>   - the new text is:
>     ```
>     <258>cat in <258>hat
>     ```
>   - The updated vocabulary is:
>     ```
>     0: ...
>     ...
>     256: "th"
>     257: "<256>e"
>     258: "<257> "
>     ```
>     
>- and so forth
>
>&nbsp;
>#### Decoding Steps:
>
>- To restore the original text, we reverse the process by substituting each token ID with its corresponding pair in the reverse order they were introduced
>- Start with the final compressed text: `<258>cat in <258>hat`
>-  Substitute `<258>` → `<257> `: `<257> cat in <257> hat`  
>- Substitute `<257>` → `<256>e`: `<256>e cat in <256>e hat`
>- Substitute `<256>` → "th": `the cat in the hat`
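
The substitutions above can be sketched in a few lines. The merge table is written out by hand to match the working example (byte values 116, 104, 101, and 32 stand for `t`, `h`, `e`, and the space), and `expand_ids` is a hypothetical helper name.

```python
def expand_ids(ids, merges):
    """Undo merges newest-first, then decode the remaining bytes as UTF-8."""
    id_to_pair = {new_id: pair for pair, new_id in merges.items()}
    # Later merges may contain earlier ones, so expand the highest ID first.
    for new_id in sorted(id_to_pair, reverse=True):
        expanded = []
        for token_id in ids:
            if token_id == new_id:
                expanded.extend(id_to_pair[new_id])
            else:
                expanded.append(token_id)
        ids = expanded
    return bytes(ids).decode("utf-8")


# Lookup table from the working example: "th", "<256>e", "<257> "
merges = {(116, 104): 256, (256, 101): 257, (257, 32): 258}
# "<258>cat in <258>hat" written as IDs
compressed = [258, 99, 97, 116, 32, 105, 110, 32, 258, 104, 97, 116]
print(expand_ids(compressed, merges))  # the cat in the hat
```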

## **Simplified BPE Implementation**

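This is a simplified implementation of the BPE algorithm that mimics the `tiktoken` interface. The `train()` method implements the merge loop outlined above, and `encode()` applies the learned merges to new text in a similar way, with extra handling for special tokens.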

In [1]:
from collections import Counter, deque
from functools import lru_cache
import json

In [2]:
class BPETokenizerLocal:
    def __init__(self):
        # Map token_id to token_str   
        self.vocab = {}
        # Map token_str to token_id
        self.inverse_vocab = {}
        # Dict of BPE merges
        self.bpe_merges = {}
        # Use a rank dict for GPT-2 merges. Low ranks have higher priority
        self.bpe_ranks = {}
     
    def train(self, text, vocab_size, allowed_special={"<|endoftext|>"}):
        """
        Train BPE tokenizer from scratch

        Args:
            text (str): Input / training text
            vocab_size (int): Desired vocabulary size
            allowed_special (set): Set of special tokens to include.
        """
        
        # Preprocessing: Replace spaces with "Ġ", as implemented in GPT-2.
        processed_text = []
        for i, char in enumerate(text):
            if char == " " and i != 0:
                processed_text.append("Ġ")
            if char != " ":
                processed_text.append(char)
        processed_text = "".join(processed_text)
        
        # Initialize vocab with the first 256 Unicode code points (chr(0) to chr(255)),
        # then append any other characters found in the text, including "Ġ"
        unique_chars = [chr(i) for i in range(256)]
        unique_chars.extend(
            char for char in sorted(set(processed_text))
            if char not in unique_chars
        )
        if "Ġ" not in unique_chars:
            unique_chars.append("Ġ")
            
        self.vocab = {i: char for i, char in enumerate(unique_chars)}
        self.inverse_vocab = {char: i for i, char in self.vocab.items()}
        
        # Add allowed special tokens
        if allowed_special:
            for token in allowed_special:
                if token not in self.inverse_vocab:
                    new_id = len(self.vocab)
                    self.vocab[new_id] = token
                    self.inverse_vocab[token] = new_id
        
        # Tokenize the processed_text into token Ids
        token_ids = [self.inverse_vocab[char] for char in processed_text]
        
        # BPE steps: Repeatedly find and replace frequent pairs
        for new_id in range(len(self.vocab), vocab_size):
            pair_id = self.find_freq_pair(token_ids, mode='most')
            if pair_id is None:
                break
            token_ids = self.replace_pair(token_ids, pair_id, new_id)
            self.bpe_merges[pair_id] = new_id
            
        # Build vocab with merged tokens
        for (p0, p1), new_id in self.bpe_merges.items():
            merged_token = self.vocab[p0] + self.vocab[p1]
            self.vocab[new_id] = merged_token
            self.inverse_vocab[merged_token] = new_id
            
    def load_vocab_and_merges_from_openai(self, vocab_path, bpe_merges_path):
        """
        Load pretrained vocabulary and BPE merges from OpenAI's GPT-2 files

        Args:
            vocab_path (str): Path to the vocab file (GPT-2 calls it 'encoder.json')
            bpe_merges_path (str): Path to bpe_merges file (GPT-2 calls it 'vocab.bpe').
        """
        # Load vocab
        with open(vocab_path, "r", encoding="utf-8") as file:
            loaded_vocab = json.load(file)
            # Load vocab to correct format
            self.vocab = {int(v): k for k, v in loaded_vocab.items()}
            self.inverse_vocab = {k: int(v) for k, v in loaded_vocab.items()}
        
        # Handle newline character without adding a new token
        if "\n" not in self.inverse_vocab:
            # Use existing token ID as a placeholder for '\n' i.e. "<|endoftext|>" if available
            fallback_token = next((token for token in ["<|endoftext|>", "Ġ", ""] if token in self.inverse_vocab), None)
            if fallback_token is not None:
                newline_token_id = self.inverse_vocab[fallback_token]
            else:
                raise KeyError("No suitable token found in vocabulary to map '\\n'.")
            
            self.inverse_vocab["\n"] = newline_token_id
            self.vocab[newline_token_id]= "\n"
            
        # Load GPT-2 merges and store these with an assigned rank.
        self.bpe_ranks = {}
        with open(bpe_merges_path, "r", encoding="utf-8") as file:
            lines = file.readlines()
            if lines and lines[0].startswith("#"):
                lines = lines[1:]
            
            rank = 0
            for line in lines:
                pair = tuple(line.strip().split())
                if len(pair) == 2:
                    token1, token2 = pair
                    # Skip the pair unless both tokens are in the vocab
                    if token1 in self.inverse_vocab and token2 in self.inverse_vocab:
                        self.bpe_ranks[(token1, token2)] = rank
                        rank += 1
                    else:
                        print(f"Skipping pair {pair} since one token isn't in the vocabulary!")
  
    def encode(self, text, allowed_special=None):
        """
        Encode the input text into a list of token IDs, with tiktoken style handling of special tokens.
        
        Args:
            text (str): The input text to encode.
            allowed_special (set or None): Special tokens to allow passthrough. If None, special handling is disabled.
    
        Returns:
            List of token IDs.
        """           
        import re
        token_ids = []
        
        # If special token handling is enabled
        if allowed_special is not None and len(allowed_special) > 0:
            # Regex to match allowed special tokens
            special_pattern = (
                "(" + "|".join(re.escape(tok) for tok in sorted(allowed_special, key=len, reverse=True)) + ")"
            )
            
            last_index = 0
            for match in re.finditer(special_pattern, text):
                prefix = text[last_index:match.start()]
                token_ids.extend(self.encode(prefix, allowed_special=None)) # Encode prefix without special handling
                
                special_token = match.group(0)
                if special_token in self.inverse_vocab:
                    token_ids.append(self.inverse_vocab[special_token])
                else:
                    raise ValueError(f"Special token {special_token} not found in vocabulary!")
                last_index = match.end()
            # Normal processing of remaining parts
            text = text[last_index:]
            
            # Check for disallowed special tokens in the remainder
            disallowed = [
                tok for tok in self.inverse_vocab
                if tok.startswith("<|") and tok.endswith("|>") and tok in text and tok not in allowed_special
            ]
            if disallowed:
                raise ValueError(f"Disallowed special tokens encountered in text: {disallowed}")
        
        # In case of no special tokens, or remaining text after special token split:
        tokens = []
        lines = text.split("\n")
        for i, line in enumerate(lines):
            if i > 0: 
                tokens.append("\n")
            words = line.split()
            for j, word in enumerate(words):
                if j == 0 and i > 0:
                    tokens.append("Ġ" + word)
                elif j == 0:
                    tokens.append(word)
                else:
                    tokens.append("Ġ" + word)
        
        for token in tokens:
            if token in self.inverse_vocab:
                token_ids.append(self.inverse_vocab[token])
            else:
                token_ids.extend(self.tokenize_with_bpe(token))
        
        return token_ids
    
    def tokenize_with_bpe(self, token):
        """
        Tokenize a single token using BPE merges.

        Args:
            token (str): The token to tokenize.

        Returns:
            List[int]: The list of token IDs after applying BPE.
        """
        # Tokenize token into individual characters
        token_ids  = [self.inverse_vocab.get(char, None) for char in token]
        if None in token_ids:
            missing_chars = [char for char, tid in zip(token, token_ids) if tid is None]
            raise ValueError(f"Characters not found in vocab: {missing_chars}")
        
        # In case OpenAI's GPT-2 merges weren't loaded, run the following
        if not self.bpe_ranks:
            can_merge = True
            while can_merge and len(token_ids) > 1:
                can_merge = False
                new_tokens = []
                i = 0
                while i < len(token_ids) - 1:
                    pair = (token_ids[i], token_ids[i + 1])
                    if pair in self.bpe_merges:
                        merged_token_id = self.bpe_merges[pair]
                        new_tokens.append(merged_token_id)
                        # Skip the next token as it is merged
                        i += 2 
                        can_merge = True
                    else:
                        new_tokens.append(token_ids[i])
                        i += 1
                if  i < len(token_ids):
                    new_tokens.append(token_ids[i])
                token_ids = new_tokens
            return token_ids
        
        # Alternatively run GPT-2 style merging with ranking:
        # Convert token_ids back to string "symbols" for each ID
        symbols = [self.vocab[id_num] for id_num in token_ids]
        
        # Repeatedly merge all occurrences of the lowest-rank pair.
        while True:
            # Collect all adjacent pairs
            pairs = set(zip(symbols, symbols[1:]))
            if not pairs: 
                break
            
            # Find the pair with the best / lowest rank
            min_rank = float("inf")
            bigram = None
            for p in pairs:
                r = self.bpe_ranks.get(p, float("inf"))
                if r < min_rank:
                    min_rank = r
                    bigram = p
                
            # If no valid ranked pair is present, terminate
            if bigram is None or bigram not in self.bpe_ranks:
                break
            
            # Merge all occurrences of the pair in question
            first, second = bigram
            new_symbols = []
            i = 0
            while i < len(symbols):
                # In case of (first, second) at position i, merge
                if i < len(symbols) - 1 and symbols[i] == first and symbols[i+1] == second:
                    new_symbols.append(first + second)
                    i += 2
                else:
                    new_symbols.append(symbols[i])
                    i += 1
            symbols = new_symbols
            
            if len(symbols) == 1:
                break
        
        # Convert merged symbols back to IDs
        merged_ids = [self.inverse_vocab[sym] for sym in symbols]
        return merged_ids
    
    def decode(self, token_ids):
        """
        Decode a list of token IDs back into a string.

        Args:
            token_ids (List[int]): The list of token IDs to decode.

        Returns:
            str: The decoded string.
        """
        decoded_string = ""
        for i, token_id in enumerate(token_ids):
            if token_id not in self.vocab:
                raise ValueError(f"Token ID {token_id} not found in vocab.")
            token = self.vocab[token_id]
            if token == "\n":
                if decoded_string and not decoded_string.endswith(" "):
                    decoded_string += " "  # Add space if not present before a newline
                decoded_string += token
            elif token.startswith("Ġ"):
                decoded_string += " " + token[1:]
            else:
                decoded_string += token
        return decoded_string

    def save_vocab_and_merges(self, vocab_path, bpe_merges_path):
        """
        Save the vocabulary and BPE merges to JSON files.

        Args:
            vocab_path (str): Path to save the vocabulary.
            bpe_merges_path (str): Path to save the BPE merges.
        """
        # Save vocabulary
        with open(vocab_path, "w", encoding="utf-8") as file:
            json.dump(self.vocab, file, ensure_ascii=False, indent=2)

        # Save BPE merges as a list of dictionaries
        with open(bpe_merges_path, "w", encoding="utf-8") as file:
            merges_list = [{"pair": list(pair), "new_id": new_id}
                           for pair, new_id in self.bpe_merges.items()]
            json.dump(merges_list, file, ensure_ascii=False, indent=2)

    def load_vocab_and_merges(self, vocab_path, bpe_merges_path):
        """
        Load the vocabulary and BPE merges from JSON files.

        Args:
            vocab_path (str): Path to the vocabulary file.
            bpe_merges_path (str): Path to the BPE merges file.
        """
        # Load vocabulary
        with open(vocab_path, "r", encoding="utf-8") as file:
            loaded_vocab = json.load(file)
            self.vocab = {int(k): v for k, v in loaded_vocab.items()}
            self.inverse_vocab = {v: int(k) for k, v in loaded_vocab.items()}

        # Load BPE merges
        with open(bpe_merges_path, "r", encoding="utf-8") as file:
            merges_list = json.load(file)
            for merge in merges_list:
                pair = tuple(merge["pair"])
                new_id = merge["new_id"]
                self.bpe_merges[pair] = new_id
                
    
    @lru_cache(maxsize=None)
    def get_special_token_id(self, token):
        return self.inverse_vocab.get(token, None)
    
    @staticmethod
    def find_freq_pair(token_ids, mode="most"):
        pairs = Counter(zip(token_ids, token_ids[1:]))

        if not pairs:
            return None

        if mode == "most":
            return max(pairs.items(), key=lambda x: x[1])[0]
        elif mode == "least":
            return min(pairs.items(), key=lambda x: x[1])[0]
        else:
            raise ValueError("Invalid mode. Choose 'most' or 'least'.")
    
    @staticmethod
    def replace_pair(token_ids, pair_id, new_id):
        dq = deque(token_ids)
        replaced = []
        
        while dq:
            current = dq.popleft()
            if dq and (current, dq[0]) == pair_id:
                replaced.append(new_id)
                # Remove 2nd token since 1st was already removed.
                dq.popleft()
            else:
                replaced.append(current)
        
        return replaced
    

## **Implementation**

### Train, encode and decode

In [6]:
import os
import urllib.request

In [7]:
def download_file_if_absent(url, filename, search_dirs):
    for directory in search_dirs:
        file_path = os.path.join(directory, filename)
        if os.path.exists(file_path):
            print(f"{filename} already exists in {file_path}")
            return file_path
        
    target_path = os.path.join(search_dirs[0], filename)
    try:
        with urllib.request.urlopen(url) as response, open(target_path, "wb") as out_file:
            out_file.write(response.read())
        print(f"Downloaded {filename} to {target_path}")
    except Exception as e:
        print(f"Failed to download {filename}. Error: {e}")
    return target_path

In [8]:
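law_path = download_file_if_absent(
    # Note: a github.com "blob" URL serves an HTML page rather than the raw text,
    # so a raw-file URL is needed if the download is ever triggered.
    url=(
        "https://github.com/bachaudhry/my-llm-from-scratch/blob/main/data/the-law-bastiat.txt"
    ),
    filename="the-law.txt",
    search_dirs=["."]  # a list of directories, not a bare string
)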

with open(law_path, "r", encoding="utf-8") as f:
    text = f.read()

the-law.txt already exists in ./the-law.txt


In [9]:
# Initialize the BPE Tokenizer
tokenizer = BPETokenizerLocal()
tokenizer.train(text, vocab_size=1000, allowed_special={"<|endoftext|>"})

In [14]:
print("Vocab: ", len(tokenizer.vocab), "\nBPE Merges: ", (len(tokenizer.bpe_merges)))

Vocab:  1000 
BPE Merges:  739


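- The vocabulary already holds 256 entries before training, one for each of the first 256 Unicode code points (`chr(0)` to `chr(255)`). Adding `Ġ`, `<|endoftext|>`, and the few characters in the training text that fall outside that range brings the starting size to 261 (`<|endoftext|>` gets ID 260, as the encoding output further down shows). The tokenizer therefore learns the remaining 739 entries through merges (1000 - 261 = 739).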
- The GPT-2 tokenizer vocabulary is 50,257 tokens while GPT-4o takes it to 199,997 tokens.
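
The bookkeeping behind these counts can be checked with a short calculation. This is a sketch that mirrors how `train()` fills its starting vocabulary; `starting_vocab_size` is a hypothetical helper, and the toy string is only for illustration.

```python
def starting_vocab_size(text, special_tokens):
    """Count the vocabulary entries that exist before any merge is learned."""
    base = {chr(i) for i in range(256)}  # code points 0..255
    # Spaces are represented by "Ġ", which is always added to the vocabulary.
    extra = (set(text.replace(" ", "Ġ")) | {"Ġ"}) - base
    specials = set(special_tokens) - base - extra
    return len(base) + len(extra) + len(specials)


# Toy text: the curly quotes lie outside the first 256 code points, "ï" does not.
start = starting_vocab_size("naïve “quotes”", {"<|endoftext|>"})
print(start)         # 260
print(1000 - start)  # 740 merges available for vocab_size=1000
```

For the run above, the same subtraction gives 1000 - 739 = 261 starting entries.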

In [21]:
input_text = "One of the first cares of the prince was to encourage agriculture."
token_ids = tokenizer.encode(input_text)
print(token_ids)
print("\n", len(token_ids))

[79, 110, 101, 256, 307, 470, 465, 509, 461, 256, 987, 500, 256, 307, 470, 351, 392, 110, 388, 256, 119, 522, 256, 301, 256, 273, 302, 413, 481, 101, 256, 481, 392, 99, 386, 741, 101, 46]

 38


In [20]:
input_text = "One of the first cares of the prince was to encourage agriculture.<|endoftext|>"
token_ids = tokenizer.encode(input_text)
print(token_ids)
print("\n", len(token_ids))

[79, 110, 101, 256, 307, 470, 465, 509, 461, 256, 987, 500, 256, 307, 470, 351, 392, 110, 388, 256, 119, 522, 256, 301, 256, 273, 302, 413, 481, 101, 256, 481, 392, 99, 386, 741, 101, 46, 60, 124, 740, 307, 116, 562, 124, 62]

 46


In [22]:
input_text = "One of the first cares of the prince was to encourage agriculture.<|endoftext|>"
token_ids = tokenizer.encode(input_text, allowed_special={"<|endoftext|>"})
print(token_ids)
print("\n", len(token_ids))

[79, 110, 101, 256, 307, 470, 465, 509, 461, 256, 987, 500, 256, 307, 470, 351, 392, 110, 388, 256, 119, 522, 256, 301, 256, 273, 302, 413, 481, 101, 256, 481, 392, 99, 386, 741, 101, 46, 260]

 39
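
The difference between the last two outputs comes from how the text is split before BPE is applied. The sketch below isolates that step. It uses `re.split` instead of the `re.finditer` loop in `encode()`, and `split_on_special` is a hypothetical helper, not a method of the class above. It only illustrates the splitting, not the ID lookup.

```python
import re


def split_on_special(text, allowed_special):
    """Split text so that each allowed special token becomes its own piece."""
    if not allowed_special:
        return [text]
    # Longest tokens first, so a longer special token is never cut short.
    ordered = sorted(allowed_special, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(tok) for tok in ordered) + ")"
    # A capturing group makes re.split keep the matched separators.
    return [piece for piece in re.split(pattern, text) if piece]


print(split_on_special("agriculture.<|endoftext|>", {"<|endoftext|>"}))
# ['agriculture.', '<|endoftext|>']
print(split_on_special("agriculture.<|endoftext|>", None))
# ['agriculture.<|endoftext|>']
```

When the special token is kept as one piece, it maps to a single ID. Otherwise it is treated as ordinary text and broken into several BPE tokens.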


In [23]:
print("Number of characters:", len(input_text))
print("Number of token IDs:", len(token_ids))

Number of characters: 79
Number of token IDs: 39


- Here, the 79-character sentence was encoded into 39 token IDs, which illustrates the compression the tokenizer provides.
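
The ratio can be made explicit with a small calculation. The counts are taken from the outputs above, and `chars_per_token` is a hypothetical helper. Note that the 13-character `<|endoftext|>` string counts as a single token, so the second line repeats the calculation for the 66-character sentence alone, which encoded to 38 IDs.

```python
def chars_per_token(num_chars, num_tokens):
    """Average number of characters represented by one token ID."""
    return num_chars / num_tokens


with_special = chars_per_token(79, 39)     # sentence plus <|endoftext|>
without_special = chars_per_token(66, 38)  # sentence only

print(f"{with_special:.2f}")     # 2.03
print(f"{without_special:.2f}")  # 1.74
```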

In [24]:
print(tokenizer.decode(token_ids))

One of the first cares of the prince was to encourage agriculture.<|endoftext|>


In [26]:
# Iterating over each token ID
for token_id in token_ids:  # avoid shadowing the built-in `id`
    print(f"{token_id} ---> {tokenizer.decode([token_id])}")

79 ---> O
110 ---> n
101 ---> e
256 --->  
307 ---> of
470 --->  the
465 --->  f
509 ---> ir
461 ---> st
256 --->  
987 ---> ca
500 ---> res
256 --->  
307 ---> of
470 --->  the
351 --->  p
392 ---> ri
110 ---> n
388 ---> ce
256 --->  
119 ---> w
522 ---> as
256 --->  
301 ---> to
256 --->  
273 ---> en
302 ---> co
413 ---> ur
481 ---> ag
101 ---> e
256 --->  
481 ---> ag
392 ---> ri
99 ---> c
386 ---> ul
741 ---> tur
101 ---> e
46 ---> .
260 ---> <|endoftext|>
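
The leading spaces in pieces such as ` the` and ` f` come from the `Ġ` marker that `train()` substitutes for spaces. A minimal sketch of that round trip follows. `mark_spaces` and `unmark_spaces` are hypothetical names, and the first function copies the preprocessing loop from `train()`.

```python
def mark_spaces(text):
    """Replace each space (except a leading one) with the "Ġ" marker."""
    out = []
    for i, char in enumerate(text):
        if char == " " and i != 0:
            out.append("Ġ")
        if char != " ":
            out.append(char)
    return "".join(out)


def unmark_spaces(text):
    """Turn "Ġ" markers back into ordinary spaces."""
    return text.replace("Ġ", " ")


marked = mark_spaces("the cat in the hat")
print(marked)                 # theĠcatĠinĠtheĠhat
print(unmark_spaces(marked))  # the cat in the hat
```

Because the marker is an ordinary character during training, merges can span a word boundary, which is how a token like ` the` ends up in the vocabulary.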


### Saving and Loading the Tokenizer

In [28]:
# Save tokenizer (create the output directory first if it is missing)
os.makedirs("output", exist_ok=True)
tokenizer.save_vocab_and_merges(vocab_path="output/vocab.json", bpe_merges_path="output/bpe_merges.txt")

In [29]:
# Load the tokenizer
tokenizer2 = BPETokenizerLocal()
tokenizer2.load_vocab_and_merges(vocab_path="output/vocab.json", bpe_merges_path="output/bpe_merges.txt")

In [30]:
print(tokenizer2.decode(token_ids))

One of the first cares of the prince was to encourage agriculture.<|endoftext|>
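
The type conversions in `load_vocab_and_merges()` exist because JSON stores object keys as strings and has no tuple type. The in-memory round trip below is a sketch of that behaviour on a made-up three-entry vocabulary. It uses `json.dumps` and `json.loads` rather than files.

```python
import json

vocab = {0: "a", 1: "b", 256: "ab"}  # toy vocabulary, for illustration only
merges = {(0, 1): 256}               # toy merge table

# Saving: integer keys become strings, tuple pairs are stored as lists.
vocab_json = json.dumps(vocab)
merges_json = json.dumps(
    [{"pair": list(pair), "new_id": new_id} for pair, new_id in merges.items()]
)
print(list(json.loads(vocab_json)))  # ['0', '1', '256']

# Loading: convert keys back to int and pairs back to tuples.
restored_vocab = {int(k): v for k, v in json.loads(vocab_json).items()}
restored_merges = {tuple(m["pair"]): m["new_id"] for m in json.loads(merges_json)}

print(restored_vocab == vocab, restored_merges == merges)  # True True
```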


### Loading the GPT-2 BPE from OpenAI

In [33]:
# Download necessary files
search_dir = [".", "../supplementary/output/gpt2-model/"]

files_to_download = {
    "https://openaipublic.blob.core.windows.net/gpt-2/models/124M/vocab.bpe": "vocab.bpe",
    "https://openaipublic.blob.core.windows.net/gpt-2/models/124M/encoder.json": "encoder.json"
}

# Download each file unless it already exists in one of the search directories
paths = {}
for url, filename in files_to_download.items():
    paths[filename] = download_file_if_absent(url, filename, search_dir)

vocab.bpe already exists in ../supplementary/output/gpt2-model/vocab.bpe
encoder.json already exists in ../supplementary/output/gpt2-model/encoder.json


In [34]:
# Loading files
tokenizer_gpt2 = BPETokenizerLocal()
tokenizer_gpt2.load_vocab_and_merges_from_openai(
    vocab_path=paths["encoder.json"], bpe_merges_path=paths["vocab.bpe"]
)

len(tokenizer_gpt2.vocab)

50257

In [35]:
# Testing the GPT tokenizer
input_text = "This is a text sample."
token_ids = tokenizer_gpt2.encode(input_text)
print(token_ids)

[1212, 318, 257, 2420, 6291, 13]


In [36]:
print(tokenizer_gpt2.decode(token_ids))

This is a text sample.
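
When the GPT-2 files are loaded, `tokenize_with_bpe()` repeatedly merges the adjacent pair with the lowest rank. The toy example below isolates that loop. The `ranks` table is made up for illustration and is not taken from `vocab.bpe`, and `merge_by_rank` is a hypothetical name.

```python
def merge_by_rank(symbols, ranks):
    """Merge adjacent symbols, always choosing the lowest-ranked pair first."""
    symbols = list(symbols)
    while len(symbols) > 1:
        pairs = set(zip(symbols, symbols[1:]))
        best = min(pairs, key=lambda pair: ranks.get(pair, float("inf")))
        if best not in ranks:
            break  # no remaining pair has a known merge
        first, second = best
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == first and symbols[i + 1] == second:
                merged.append(first + second)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical ranks: a lower number means the merge was learned earlier.
ranks = {("t", "h"): 0, ("e", "r"): 1, ("th", "er"): 2}
print(merge_by_rank("there", ranks))  # ['ther', 'e']
```

The order matters: `("th", "er")` can only apply after both lower-ranked merges have produced `th` and `er`.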
