# Byte Pair Encoding (BPE), Unigram Language Model, and WordPiece Tokenization

## Byte Pair Encoding (BPE)

### Definition
Byte Pair Encoding is a subword tokenization algorithm originally developed as a data compression technique by Gage (1994) and adapted for NLP by Sennrich et al. (2016). BPE iteratively merges the most frequent pair of adjacent tokens, building a fixed-size vocabulary that can represent rare words as sequences of subword units.

### Mathematical Formulation
In BPE, we begin with a vocabulary of individual characters and incrementally merge the most frequent adjacent token pairs.

Let:
- $V$ be the vocabulary of tokens
- $C(xy)$ be the count of occurrences of adjacent tokens $x$ and $y$ in the corpus

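Each merge step selects the most frequent adjacent pair:

$$(x^*, y^*) = \arg\max_{(x,y) \in V \times V} C(xy)$$

The concatenation $x^*y^*$ is then added to $V$. BPE itself does not define token probabilities; if one is needed, a unigram estimate can be computed from the token counts $C(t)$ in the segmented corpus:

$$P(t) = \frac{C(t)}{\sum_{t' \in V} C(t')}$$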
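
To make the pair count $C(xy)$ and the token estimate $P(t)$ concrete, here is a minimal sketch on a toy corpus. The corpus contents and the helper name `pair_counts` are assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical toy corpus: each word is pre-split into characters and mapped to its frequency.
corpus = {("l", "o", "w"): 5, ("l", "o", "t"): 3, ("n", "e", "w"): 6}

def pair_counts(corpus):
    """Return C(xy): frequency-weighted counts of adjacent token pairs."""
    counts = Counter()
    for tokens, freq in corpus.items():
        for x, y in zip(tokens, tokens[1:]):
            counts[(x, y)] += freq
    return counts

counts = pair_counts(corpus)
best_pair = max(counts, key=counts.get)  # the arg max over C(xy)

# Unigram estimate P(t) = C(t) / sum of all token counts
token_counts = Counter()
for tokens, freq in corpus.items():
    for t in tokens:
        token_counts[t] += freq
total = sum(token_counts.values())
token_probs = {t: c / total for t, c in token_counts.items()}

print(best_pair, counts[best_pair])  # ('l', 'o') 8
```

In this corpus the pair `('l', 'o')` occurs 8 times (5 from "low" and 3 from "lot"), more than any other pair, so it would be the first merge.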

### Core Principles
1. Initialize the vocabulary with individual characters/bytes
2. Count frequencies of all adjacent token pairs in the corpus
3. Iteratively merge the most frequent pair and add the new token to the vocabulary
4. Update the corpus by replacing all occurrences of the merged pair
5. Continue until reaching the desired vocabulary size or merge operations limit

### Pseudo-Algorithm
```
from collections import Counter

def bpe(corpus, num_merges):
    """Learn BPE merges from `corpus`, a dict mapping each word to its frequency."""
    # Initialize the vocabulary with individual characters
    vocab = {ch for word in corpus for ch in word}

    # Represent each word as a tuple of single-character tokens
    words = {word: tuple(word) for word in corpus}
    merges = []  # learned merges, in order; needed later to encode new text

    for _ in range(num_merges):
        # Count all adjacent pairs, weighted by word frequency
        pairs = Counter()
        for word, tokens in words.items():
            for j in range(len(tokens) - 1):
                pairs[(tokens[j], tokens[j + 1])] += corpus[word]
        if not pairs:
            break  # every word is already a single token

        # Find the most frequent pair (ties go to the pair counted first)
        best_pair = max(pairs, key=pairs.get)

        # Create the new merged token
        new_token = best_pair[0] + best_pair[1]
        vocab.add(new_token)
        merges.append(best_pair)

        # Update words by replacing each occurrence of the pair
        new_words = {}
        for word, tokens in words.items():
            new_tokens = []
            k = 0  # separate index, so the merge loop counter is not overwritten
            while k < len(tokens):
                if tokens[k:k + 2] == best_pair:
                    new_tokens.append(new_token)
                    k += 2
                else:
                    new_tokens.append(tokens[k])
                    k += 1
            new_words[word] = tuple(new_tokens)
        words = new_words

    return vocab, merges
```
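
Training only produces the vocabulary. To tokenize new text, the learned merges are replayed in the order they were learned. The following is a minimal sketch of that encoding step; the function name `apply_merges` and the merge list are hypothetical, chosen for illustration.

```python
def apply_merges(word, merges):
    """Segment `word` by replaying BPE merges in the order they were learned."""
    tokens = list(word)
    for left, right in merges:
        merged = []
        k = 0
        while k < len(tokens):
            if k < len(tokens) - 1 and tokens[k] == left and tokens[k + 1] == right:
                merged.append(left + right)
                k += 2
            else:
                merged.append(tokens[k])
                k += 1
        tokens = merged
    return tokens

# Hypothetical merge list, as if learned from a corpus
merges = [("l", "o"), ("lo", "w"), ("e", "r")]

print(apply_merges("lower", merges))   # ['low', 'er']
print(apply_merges("slower", merges))  # ['s', 'low', 'er']
```

The second call shows how a word absent from the training data is still covered: the unseen prefix falls back to a single character while the rest reuses learned subwords.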

### Importance
BPE is vital in NLP because it:
- Mitigates the out-of-vocabulary problem by representing unseen words as sequences of known subwords (byte-level variants can encode any input string)
- Enables effective handling of morphologically rich languages
- Provides a compact vocabulary that balances common words and subword units
- Forms the foundation for tokenization in many state-of-the-art language models

### Pros and Cons

#### Pros
- Simple and efficient implementation
- Handles morphologically rich languages effectively
- Creates a fixed-size vocabulary that minimizes out-of-vocabulary tokens
- Language-agnostic approach that works for any script
- Often produces recognizable subword units, such as common stems and affixes, for frequent patterns

#### Cons
- Merges are based solely on frequency, not linguistic meaning
- Can create unintuitive or linguistically meaningless subword units
- No consideration of context when creating tokens
- Original implementation doesn't directly optimize for likelihood of training data

### Recent Advancements
- **BPE-Dropout**: Introduces stochastic segmentation during training to improve robustness
- **Regularized BPE**: Incorporates regularization terms to prevent overfitting
- **Multilingual BPE**: Shared vocabulary across multiple languages enabling cross-lingual transfer
- **SentencePiece BPE**: Language-agnostic implementation that operates on raw Unicode text and treats whitespace as an ordinary symbol, so no language-specific pre-tokenization is required
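
The idea behind BPE-Dropout can be sketched as follows. This is a simplified variant, not the exact published procedure: it assumes each applicable merge occurrence is skipped independently with probability `p` during a single pass over the merge list. The function name and parameters are hypothetical.

```python
import random

def bpe_dropout_encode(word, merges, p=0.1, rng=None):
    """Replay BPE merges, skipping each applicable merge with probability p.

    Simplified illustration: p=0 gives ordinary BPE, p=1 gives characters only.
    """
    if rng is None:
        rng = random.Random()
    tokens = list(word)
    for left, right in merges:
        out = []
        k = 0
        while k < len(tokens):
            is_match = (k < len(tokens) - 1
                        and tokens[k] == left and tokens[k + 1] == right)
            # rng.random() is in [0, 1), so p=0 always merges and p=1 never does
            if is_match and rng.random() >= p:
                out.append(left + right)
                k += 2
            else:
                out.append(tokens[k])
                k += 1
        tokens = out
    return tokens

merges = [("l", "o"), ("lo", "w"), ("e", "r")]  # hypothetical merge list
print(bpe_dropout_encode("lower", merges, p=0.0))  # ['low', 'er']
print(bpe_dropout_encode("lower", merges, p=1.0))  # ['l', 'o', 'w', 'e', 'r']
```

With an intermediate `p`, repeated calls yield different segmentations of the same word, which is what exposes the downstream model to varied subword sequences during training.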

## Unigram Language Model

### Definition
Unigram Language Model tokenization is a probabilistic subword segmentation method introduced by Kudo (2018). It treats tokenization as a statistical inference problem: find the subword vocabulary and token probabilities that maximize the likelihood of the training corpus.

### Mathematical Formulation
The Unigram model defines the probability of a sentence $X$ as:

$$P(X) = \sum_{x \in S(X)} P(x)$$

where $S(X)$ is the set of all possible segmentations of $X$ into vocabulary tokens, and $P(x)$ is the probability of a specific segmentation $x = (x_1, x_2, \ldots, x_m)$, with tokens assumed to be generated independently:

$$P(x) = \prod_{i=1}^{m} P(x_i)$$

The training objective is to find the vocabulary $V$ and token probabilities $P(x_i)$ that maximize:

$$\mathcal{L} = \sum_{s \in D} \log \left( \sum_{x \in S(s)} \prod_{i=1}^{|x|} P(x_i) \right)$$

where $D$ is the training corpus.
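
For a short string, the sum over $S(X)$ can be evaluated by brute force. The sketch below enumerates every segmentation and adds up the products of token probabilities. The probability table is a hypothetical toy example, and real implementations use dynamic programming rather than enumeration.

```python
import math

# Hypothetical token probabilities (they sum to 1)
probs = {"a": 0.2, "b": 0.2, "ab": 0.5, "ba": 0.1}

def segmentations(text, vocab):
    """Yield every way to split `text` into tokens from `vocab`."""
    if not text:
        yield ()
        return
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if piece in vocab:
            for rest in segmentations(text[end:], vocab):
                yield (piece,) + rest

def marginal(text, probs):
    """P(X): sum over all segmentations of the product of token probabilities."""
    return sum(math.prod(probs[t] for t in seg)
               for seg in segmentations(text, probs))

for seg in segmentations("aba", probs):
    print(seg, math.prod(probs[t] for t in seg))
print(marginal("aba", probs))  # approximately 0.128
```

The string "aba" has three segmentations here, `(a, b, a)`, `(a, ba)`, and `(ab, a)`, with probabilities 0.008, 0.02, and 0.1.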

The Expectation-Maximization (EM) algorithm is used for optimization:

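**E-step**: Compute expected counts for each subword token, where $P(x|s) = P(x) / \sum_{x' \in S(s)} P(x')$ is the posterior probability of segmentation $x$ under the current parameters and $\text{count}(w, x)$ is the number of times $w$ appears in $x$:
$$c(w) = \sum_{s \in D} \sum_{x \in S(s)} P(x|s) \cdot \text{count}(w, x)$$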

**M-step**: Update probabilities:
$$P(w) = \frac{c(w)}{\sum_{w' \in V} c(w')}$$
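
One EM iteration can be sketched with a forward-backward pass, which yields the expected counts without enumerating segmentations. This is a minimal illustration under stated assumptions: the corpus is a dict from string to frequency, every string is segmentable with the current vocabulary, and the names `expected_counts`, `em_step`, and `max_len` are hypothetical.

```python
import math
from collections import defaultdict

def expected_counts(text, probs, max_len=8):
    """E-step for one string: expected token counts and the marginal P(text)."""
    n = len(text)
    # forward[i]: total probability of all segmentations of text[:i]
    forward = [0.0] * (n + 1)
    forward[0] = 1.0
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            if text[i:j] in probs:
                forward[j] += forward[i] * probs[text[i:j]]
    # backward[i]: total probability of all segmentations of text[i:]
    backward = [0.0] * (n + 1)
    backward[n] = 1.0
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, min(n, i + max_len) + 1):
            if text[i:j] in probs:
                backward[i] += probs[text[i:j]] * backward[j]
    total = forward[n]
    counts = defaultdict(float)
    if total == 0.0:
        return counts, total  # no valid segmentation
    for i in range(n):
        for j in range(i + 1, min(n, i + max_len) + 1):
            piece = text[i:j]
            if piece in probs:
                # posterior probability that `piece` spans positions i..j
                counts[piece] += forward[i] * probs[piece] * backward[j] / total
    return counts, total

def em_step(corpus, probs):
    """One EM iteration; returns (new_probs, log_likelihood under the old probs)."""
    totals = defaultdict(float)
    log_likelihood = 0.0
    for text, freq in corpus.items():
        counts, marginal = expected_counts(text, probs)
        log_likelihood += freq * math.log(marginal)  # assumes marginal > 0
        for token, c in counts.items():
            totals[token] += freq * c
    norm = sum(totals.values())
    return {t: c / norm for t, c in totals.items()}, log_likelihood  # M-step

corpus = {"abab": 3, "ab": 2, "ba": 1}                       # hypothetical toy corpus
probs = {"a": 0.25, "b": 0.25, "ab": 0.25, "ba": 0.25}       # uniform start
for step in range(5):
    probs, ll = em_step(corpus, probs)
    print(step, round(ll, 4))
```

The printed log-likelihood should not decrease from one iteration to the next, which is the usual sanity check for an EM implementation.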

### Core Principles
1. Start with a large seed vocabulary (typically all characters plus the most frequent substrings; a BPE vocabulary with many merges is an alternative)
2. Estimate token probabilities with the EM algorithm
3. Estimate the likelihood loss from removing each token
4. Prune the tokens that contribute least to the corpus likelihood (for example, the bottom 20%), always keeping single characters so that any string remains segmentable
5. Repeat steps 2-4 until the target vocabulary size is reached
6. For tokenization, find the most probable segmentation using the Viterbi algorithm

### Pseudo-Algorithm

```
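import math

def unigram_training(corpus, initial_vocab_size, target_vocab_size, prune_fraction=0.2):
    # Helper functions written in Capitalized_Style are placeholders, not defined here.
    # Initialize with a large seed vocabulary
    vocab = Initialize_Vocabulary(corpus, initial_vocab_size)

    while len(vocab) > target_vocab_size:
        # Estimate token probabilities. Normalized counts are shown for brevity;
        # the full method iterates the E-step and M-step until convergence.
        counts = {token: Count_Token_Occurrences(token, corpus) for token in vocab}
        total_count = sum(counts.values())
        probs = {token: count / total_count for token, count in counts.items()}

        # Compute the likelihood loss for each token if removed.
        # Single characters are never candidates, so any string stays segmentable.
        token_losses = {
            token: Compute_Loss_Without_Token(token, vocab, probs, corpus)
            for token in vocab if len(token) > 1
        }
        if not token_losses:
            break  # only single characters remain

        # Sort tokens by loss, lowest impact first
        sorted_tokens = sorted(token_losses, key=token_losses.get)

        # Remove the prune_fraction of tokens with the lowest impact:
        # at least one per round, and never below the target size
        num_to_remove = min(max(1, int(prune_fraction * len(vocab))),
                            len(vocab) - target_vocab_size)
        for token in sorted_tokens[:num_to_remove]:
            vocab.remove(token)

    return vocab, Calculate_Final_Probabilities(vocab, corpus)

def unigram_tokenization(text, vocab, probs, max_token_length=16):
    """Viterbi search for the most probable segmentation; None if none exists."""
    n = len(text)
    best_score = [-math.inf] * (n + 1)  # best log-probability of text[:j]
    best_path = [None] * (n + 1)        # start index of the last token on that path
    best_score[0] = 0.0

    for i in range(n):
        if best_score[i] == -math.inf:
            continue  # text[:i] cannot be segmented, so nothing can start here
        for j in range(i + 1, min(i + max_token_length, n) + 1):
            piece = text[i:j]
            if piece in vocab:
                score = best_score[i] + math.log(probs[piece])
                if score > best_score[j]:
                    best_score[j] = score
                    best_path[j] = i

    if n > 0 and best_path[n] is None:
        return None  # no segmentation covers the whole text

    # Backtrack to recover the tokens
    tokens = []
    pos = n
    while pos > 0:
        prev_pos = best_path[pos]
        tokens.append(text[prev_pos:pos])
        pos = prev_pos
    tokens.reverse()
    return tokens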
```
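
The likelihood loss used for pruning can be illustrated by brute force: recompute the corpus log-likelihood with each multi-character token removed and the remaining probabilities renormalized. This is a sketch for small examples only, not how production trainers do it (they rely on cheaper approximations). All names and the toy data are hypothetical.

```python
import math

def log_marginal(text, probs):
    """log P(text), summing over all segmentations with a forward pass."""
    n = len(text)
    forward = [0.0] * (n + 1)
    forward[0] = 1.0
    for j in range(1, n + 1):
        for i in range(j):
            p = probs.get(text[i:j])
            if p:
                forward[j] += forward[i] * p
    return math.log(forward[n]) if forward[n] > 0 else -math.inf

def corpus_log_likelihood(corpus, probs):
    return sum(freq * log_marginal(text, probs) for text, freq in corpus.items())

def removal_losses(corpus, probs):
    """Drop in corpus log-likelihood caused by removing each multi-character token."""
    base = corpus_log_likelihood(corpus, probs)
    losses = {}
    for token in probs:
        if len(token) == 1:
            continue  # single characters are always kept
        reduced = {t: p for t, p in probs.items() if t != token}
        norm = sum(reduced.values())
        reduced = {t: p / norm for t, p in reduced.items()}
        losses[token] = base - corpus_log_likelihood(corpus, reduced)
    return losses

corpus = {"abab": 3, "ab": 2}                         # hypothetical toy corpus
probs = {"a": 0.2, "b": 0.2, "ab": 0.5, "ba": 0.1}    # hypothetical probabilities
losses = removal_losses(corpus, probs)
for token in sorted(losses, key=losses.get):
    print(token, round(losses[token], 4))
```

Here removing `ab` costs likelihood, while removing `ba` actually helps slightly because its probability mass is redistributed to tokens the corpus uses more, so `ba` would be pruned first.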

### Importance
The Unigram model is significant because it:
- Introduces a probabilistic framework for subword segmentation
- Directly optimizes the likelihood of the training data
- Allows for multiple possible segmentations of a word
- Provides a principled approach to vocabulary pruning
- Captures the statistical properties of language more effectively

### Pros and Cons

#### Pros
- Based on sound statistical principles
- Produces linguistically meaningful subword units
- Handles ambiguity through probabilistic segmentation
- Optimizes segmentation for likelihood of the training data
- Enables efficient pruning of ineffective tokens

#### Cons
- More computationally intensive than BPE
- Requires careful initialization of the vocabulary
- More complex implementation than frequency-based methods
- Sensitive to hyperparameter choices

### Recent Advancements
- **SentencePiece**: End-to-end text tokenization with Unigram model
- **Subword Regularization**: Training with multiple tokenization candidates to improve robustness
- **Unigram Mixture Model**: Extensions to handle multiple languages or domains
- **Dynamic Programming Optimizations**: Faster training and inference algorithms
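
Subword regularization needs a way to draw segmentations at random rather than always taking the best one. One way to do this is forward-filtering backward-sampling over the segmentation lattice, sketched below. It assumes segmentations are sampled with probability proportional to $P(x)^\alpha$; the function name, the `alpha` and `max_len` parameters, and the toy probabilities are hypothetical.

```python
import random

def sample_segmentation(text, probs, alpha=1.0, rng=None, max_len=16):
    """Sample a segmentation with probability proportional to P(x) ** alpha."""
    if rng is None:
        rng = random.Random()
    n = len(text)
    # forward[j]: sum over segmentations of text[:j] of the product of p ** alpha
    forward = [0.0] * (n + 1)
    forward[0] = 1.0
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            p = probs.get(text[i:j])
            if p:
                forward[j] += forward[i] * p ** alpha
    if forward[n] == 0.0:
        return None  # text cannot be segmented with this vocabulary

    # Walk backward from the end, sampling where the last token starts
    tokens = []
    j = n
    while j > 0:
        starts = [i for i in range(max(0, j - max_len), j)
                  if probs.get(text[i:j]) and forward[i] > 0]
        weights = [forward[i] * probs[text[i:j]] ** alpha for i in starts]
        i = rng.choices(starts, weights=weights)[0]
        tokens.append(text[i:j])
        j = i
    tokens.reverse()
    return tokens

probs = {"a": 0.2, "b": 0.2, "ab": 0.5, "ba": 0.1}  # hypothetical probabilities
rng = random.Random(0)
for _ in range(5):
    print(sample_segmentation("abab", probs, alpha=0.5, rng=rng))
```

A small `alpha` flattens the distribution toward uniform over segmentations, while a large `alpha` concentrates it on the Viterbi segmentation. For long inputs a log-space version would be needed to avoid underflow.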

## WordPiece Tokenization

### Definition
WordPiece is a subword tokenization algorithm developed by Schuster and Nakajima (2012) at Google and later used in BERT and other transformer models. It's similar to BPE but uses a likelihood-based criterion for merging tokens rather than raw frequency.

### Mathematical Formulation
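WordPiece selects the merge that most increases the likelihood of the training data under a unigram model. A commonly used form of the criterion scores each adjacent pair $(x,y)$ by:

$$\arg\max_{(x,y) \in V \times V} \frac{freq(xy)}{freq(x) \times freq(y)}$$

Because the logarithm is monotonic, this ranks pairs identically to a pointwise-mutual-information score:

$$\arg\max_{(x,y) \in V \times V} \log\frac{freq(xy)}{freq(x) \times freq(y)}$$

A frequency-weighted variant, $freq(xy) \times \log\frac{freq(xy)}{freq(x) \times freq(y)}$, approximates the total log-likelihood gain of a merge (up to normalization of the frequencies into probabilities). It is not equivalent to the ratio above and can rank pairs differently, because it also rewards pairs that occur often.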

The probability of a sequence $x = (x_1, x_2, ..., x_m)$ is:

$$P(x) = \prod_{i=1}^{m} P(x_i)$$

where $P(x_i)$ is estimated from the corpus frequencies.
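
The difference between the BPE and WordPiece criteria shows up even on a tiny corpus. The sketch below computes both scores from the same counts; the word list and frequencies are a hypothetical toy example.

```python
from collections import Counter

# Hypothetical toy corpus: character-split words mapped to frequencies
words = {
    ("h", "u", "g"): 10,
    ("p", "u", "g"): 5,
    ("p", "u", "n"): 12,
    ("b", "u", "n"): 4,
    ("h", "u", "g", "s"): 5,
}

token_freq = Counter()
pair_freq = Counter()
for tokens, freq in words.items():
    for t in tokens:
        token_freq[t] += freq
    for x, y in zip(tokens, tokens[1:]):
        pair_freq[(x, y)] += freq

def wordpiece_score(pair):
    """freq(xy) / (freq(x) * freq(y))"""
    x, y = pair
    return pair_freq[pair] / (token_freq[x] * token_freq[y])

bpe_choice = max(pair_freq, key=pair_freq.get)       # raw frequency
wordpiece_choice = max(pair_freq, key=wordpiece_score)

print("BPE picks:      ", bpe_choice)        # ('u', 'g')
print("WordPiece picks:", wordpiece_choice)  # ('g', 's')
```

BPE picks `('u', 'g')` because it is the most frequent pair (20 occurrences). WordPiece picks `('g', 's')`: it occurs only 5 times, but `s` never appears without a preceding `g`, so the ratio is highest.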

### Core Principles
1. Initialize vocabulary with basic units (usually characters)
2. Calculate likelihood gain for each potential merge
3. Select the merge that maximizes the likelihood gain
4. Add the new merged token to the vocabulary
5. Iterate until desired vocabulary size is reached
6. Use special WordPiece convention: prefix subwords with ## if they don't start a word
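
The `##` convention matters mainly at tokenization time. The steps above cover training only; the sketch below assumes the greedy longest-match-first strategy used by BERT's reference tokenizer to apply a trained vocabulary. The function name, the `[UNK]` fallback, and the vocabulary contents are assumptions for illustration.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark pieces that do not start the word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matches: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

vocab = {"un", "##aff", "##able", "##a"}  # hypothetical vocabulary
print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##aff', '##able']
print(wordpiece_tokenize("unxyz", vocab))      # ['[UNK]']
```

Because matching is greedy, the result depends only on which pieces are in the vocabulary, not on the order in which merges were learned. This is one practical difference from BPE encoding.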

### Pseudo-Algorithm
```
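from collections import Counter

def wordpiece(corpus, num_merges):
    """Learn a WordPiece-style vocabulary from `corpus`, a dict mapping word -> frequency."""
    # Split each word into characters. The ## prefix is applied up front, because
    # position within the word cannot be recovered from the vocabulary afterwards.
    words = {w: [c if k == 0 else "##" + c for k, c in enumerate(w)] for w in corpus}

    # Initialize vocabulary with the individual (prefixed) characters
    vocab = {t for tokens in words.values() for t in tokens}

    for _ in range(num_merges):
        # Count token and adjacent-pair frequencies, weighted by word frequency
        token_counts, pair_counts = Counter(), Counter()
        for w, tokens in words.items():
            for t in tokens:
                token_counts[t] += corpus[w]
            for x, y in zip(tokens, tokens[1:]):
                pair_counts[(x, y)] += corpus[w]
        if not pair_counts:
            break  # every word is already a single token

        # Score each pair by freq(xy) / (freq(x) * freq(y)) and take the best
        def score(pair):
            return pair_counts[pair] / (token_counts[pair[0]] * token_counts[pair[1]])
        best_pair = max(pair_counts, key=score)

        # Merge the best pair; the right-hand piece drops its ## marker
        x, y = best_pair
        new_token = x + y[2:]
        vocab.add(new_token)

        # Update the corpus by replacing occurrences of best_pair with new_token
        for w in list(words):
            tokens, merged, k = words[w], [], 0
            while k < len(tokens):
                if k < len(tokens) - 1 and tokens[k] == x and tokens[k + 1] == y:
                    merged.append(new_token)
                    k += 2
                else:
                    merged.append(tokens[k])
                    k += 1
            words[w] = merged

    return vocab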
```

### Importance
WordPiece is important because:
- It selects merges by an approximate likelihood criterion rather than raw frequency
- It favors pairs whose parts rarely occur apart, which often yields more coherent subword units
- It has been used in highly successful models such as BERT, DistilBERT, and ELECTRA
- It bridges the gap between character-level and word-level tokenization
- Its ## prefix convention clearly distinguishes word-initial from word-internal subwords

### Pros and Cons

#### Pros
- Often creates more coherent subwords than BPE, since very frequent individual tokens are penalized in the merge score
- Greedily approximates a data-likelihood objective
- Effective for morphologically rich languages
- Clear marking of word-internal subwords improves readability
- Demonstrated effectiveness in state-of-the-art models

#### Cons
- More computationally expensive than BPE
- Training details are less well documented (Google has not released the original training implementation)
- May require more hyperparameter tuning
- Less flexible than fully probabilistic approaches like Unigram

### Recent Advancements
- **Multilingual WordPiece**: Used in mBERT with shared vocabulary across languages
- **Efficiency optimizations**: Faster implementations for large-scale training
- **Dynamic WordPiece**: Adaptive vocabulary selection based on domain
- **WordPiece with contextual information**: Incorporating surrounding context for better segmentation

## Comparative Analysis

### Mathematical Differences

| Algorithm | Selection Criterion | Optimization Target |
|-----------|---------------------|---------------------|
| BPE | $\arg\max_{(x,y)} C(xy)$ | Frequency of token pairs |
| WordPiece | $\arg\max_{(x,y)} \frac{freq(xy)}{freq(x) \times freq(y)}$ | Likelihood gain per merge |
| Unigram | $\arg\max_{V, P} \sum_{s \in D} \log \left( \sum_{x \in S(s)} P(x) \right)$ | Overall corpus likelihood |

### Tokenization Approaches

| Algorithm | Building Direction | Approach | Segmentation Strategy |
|-----------|-------------------|----------|------------------------|
| BPE | Bottom-up | Greedy, deterministic | Maximum frequency merging |
| WordPiece | Bottom-up | Greedy, deterministic | Maximum likelihood gain |
| Unigram | Top-down | Probabilistic, iterative pruning | Multiple segmentation candidates with EM |

### Implementation and Usage

| Algorithm | Computational Complexity | Implementation Difficulty | Notable Models |
|-----------|-------------------------|---------------------------|----------------|
| BPE | Lower | Simpler | GPT, RoBERTa, XLM |
| WordPiece | Medium | Medium | BERT, DistilBERT, ELECTRA |
| Unigram | Higher | Complex | T5, ALBERT, XLNet (via SentencePiece) |