# Step-by-Step Explanation of Byte Pair Encoding (BPE)

#### Step 1 - Initialize Vocabulary: 
    - Start from individual characters (a, b, c, ..., space, punctuation)
    - Each character becomes a token in the initial vocabulary
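
As a rough sketch of this step, the snippet below splits each word into space-separated characters and counts how often each word occurs. The helper name `init_vocab` and the sample word list are assumptions made for illustration; the resulting keys use the same `"l o w"` format as the corpus dictionary in this notebook.

```python
from collections import Counter

def init_vocab(words):
    """Split each word into space-separated characters and count occurrences."""
    word_counts = Counter(words)
    # Keys look like "l o w": one symbol per character, separated by spaces
    return {" ".join(word): count for word, count in word_counts.items()}

# Hypothetical training words, repeated to mimic word frequencies
words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
vocab = init_vocab(words)

# The initial token vocabulary is the set of distinct characters
symbols = sorted({ch for word in vocab for ch in word.split()})
print(vocab)
print(symbols)
```

Storing word counts instead of raw text keeps the later pair counting cheap, since each distinct word is scanned once and weighted by its frequency.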

#### Step 2 - Train on Corpus:
    - Count the frequency of every adjacent token pair in the training data
    - Find the most frequent pair (e.g., "e" + "s" = "es")

#### Step 3 - Merge Most Frequent Pair: 
    - Replace all occurrences of that pair with a new single token
    - Add this new merged token to vocabulary
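
One way to picture the merge on a single word is the sketch below, which works on a plain list of symbols. The function name `merge_pair` is an assumption for illustration, not part of any library.

```python
def merge_pair(symbols, pair):
    """Return a new symbol list with every adjacent occurrence of `pair` fused."""
    merged = []
    i = 0
    while i < len(symbols):
        # Fuse the two symbols when they match the pair, then skip past both
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

print(merge_pair(["n", "e", "w", "e", "s", "t"], ("e", "s")))
# ['n', 'e', 'w', 'es', 't']
```

The scan runs left to right and never reuses a symbol that was just merged, so overlapping matches such as `a a a` with the pair `(a, a)` yield `['aa', 'a']`.
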

#### Step 4 - Repeat:
    - Continue counting and merging until the vocabulary reaches the desired size
    - Each iteration adds one token, so later merges build longer subword units from earlier ones
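
The four steps can be tied together in a short training loop. This is a minimal sketch, not a production tokenizer: the names `count_pairs`, `apply_merge`, and `train_bpe` are assumptions, and the stopping rule is a fixed number of merges (each merge adds one token, so a target vocabulary size translates into a merge count).

```python
import re
from collections import defaultdict

def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[left, right] += freq
    return pairs

def apply_merge(pair, vocab):
    """Fuse `pair` wherever it appears as two whole adjacent symbols."""
    merged = "".join(pair)
    # Lookarounds make sure we match whole symbols, not the tail of a longer one
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub(lambda m: merged, word): freq for word, freq in vocab.items()}

def train_bpe(vocab, num_merges):
    """Run up to `num_merges` count-and-merge iterations."""
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(vocab)
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # ties go to the pair counted first
        vocab = apply_merge(best, vocab)
        merges.append(best)
    return merges, vocab

corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
merges, final_vocab = train_bpe(corpus, num_merges=3)
print(merges)       # [('e', 's'), ('es', 't'), ('l', 'o')]
print(final_vocab)  # {'lo w': 5, 'lo w e r': 2, 'n e w est': 6, 'w i d est': 3}
```

The ordered `merges` list is what a tokenizer would replay on new text, which is why the loop records it alongside the updated vocabulary.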

In [None]:
"""
Initial (word: count): "low": 5, "lower": 2, "newest": 6, "widest": 3
Char vocab: l, o, w, e, r, n, s, t, i, d

Iteration 1: Most frequent pair "e" + "s" (count 9, tied with "s" + "t") → merge to "es"
Iteration 2: Most frequent pair "es" + "t" (count 9) → merge to "est"
Iteration 3: Most frequent pair "l" + "o" (count 7, tied with "o" + "w") → merge to "lo"
"""

In [5]:
import re
from collections import defaultdict

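def get_stats(vocab):
    """Count the frequency of each adjacent symbol pair, weighted by word frequency."""
    pairs = defaultdict(int)

    for word, freq in vocab.items():
        symbols = word.split()  # "l o w" -> ["l", "o", "w"]
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs


# Toy corpus: each key is a word pre-split into space-separated symbols,
# each value is how often that word occurs.
corpus = {
    "l o w": 5,
    "l o w e r": 2,
    "n e w e s t": 6,
    "w i d e s t": 3,
}
print(get_stats(corpus))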

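defaultdict(<class 'int'>, {('l', 'o'): 7, ('o', 'w'): 7, ('w', 'e'): 8, ('e', 'r'): 2, ('n', 'e'): 6, ('e', 'w'): 6, ('e', 's'): 9, ('s', 't'): 9, ('w', 'i'): 3, ('i', 'd'): 3, ('d', 'e'): 3})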
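
Once the pairs are counted, the next step is to pick the most frequent one. The sketch below assumes the pair counts that `get_stats` produces for the toy corpus, written out by hand so the cell runs on its own; the variable names are illustrative.

```python
# Pair counts for the toy corpus, written out by hand (assumed to match
# what get_stats returns for it)
pairs = {
    ("l", "o"): 7, ("o", "w"): 7, ("w", "e"): 8, ("e", "r"): 2,
    ("n", "e"): 6, ("e", "w"): 6, ("e", "s"): 9, ("s", "t"): 9,
    ("w", "i"): 3, ("i", "d"): 3, ("d", "e"): 3,
}

# max() returns the first key with the highest count, so ties are
# resolved by insertion order
best_pair = max(pairs, key=pairs.get)
print(best_pair, pairs[best_pair])  # ('e', 's') 9

# Optional: rank the pairs to inspect the runners-up
top3 = sorted(pairs.items(), key=lambda kv: kv[1], reverse=True)[:3]
print(top3)
```

Because `("e", "s")` and `("s", "t")` both occur 9 times, the choice between them depends on the tie-breaking rule; a different rule would give a different but equally valid merge order.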