# Byte Pair Encoding (BPE)

Byte Pair Encoding (BPE) is a subword tokenization method widely used in Natural Language Processing (NLP), particularly in Large Language Models (LLMs) such as GPT and RoBERTa.

## How it Works

1.  **Initialization**: Start with a vocabulary of individual characters.
2.  **Counting**: Count the frequency of every adjacent pair of symbols in the training corpus.
3.  **Merging**: Identify the most frequent pair and merge it into a single new symbol, which is added to the vocabulary.
4.  **Iteration**: Repeat steps 2 and 3 for a fixed number of merges (a hyperparameter) or until the desired vocabulary size is reached.
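
The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names `train_bpe` and `merge_pair` are made up for this example, the corpus is assumed to be a word-to-frequency dictionary, and ties between equally frequent pairs are broken by taking the first pair encountered.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def train_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge rules from a {word: frequency} mapping."""
    # Step 1: each word becomes a tuple of characters plus an end-of-word marker.
    corpus = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Step 2: count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # nothing left to merge
        # Step 3: merge the most frequent pair (assumption: first pair seen wins ties).
        best = max(pair_counts, key=pair_counts.get)
        corpus = {merge_pair(symbols, best): freq for symbols, freq in corpus.items()}
        merges.append(best)
    return merges, corpus

merges, corpus = train_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=2)
print(merges)
```

The function returns the ordered list of merge rules along with the re-segmented corpus; the order matters because later merges build on symbols created by earlier ones.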

## Example Walkthrough

Let's assume we have a small corpus with the following word frequencies:

*   "low": 5
*   "lower": 2
*   "newest": 6
*   "widest": 3

### Step 1: Initialization

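We split each word into characters and append a special end-of-word symbol `</w>`. The marker records where a word ends, so a suffix such as `est` can be told apart from the same letters in the middle of a word.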

*   `l o w </w>`: 5
*   `l o w e r </w>`: 2
*   `n e w e s t </w>`: 6
*   `w i d e s t </w>`: 3

**Vocabulary**: `l, o, w, e, r, n, s, t, i, d, </w>`
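
As a rough sketch, the initialization can be written as follows. The function name `initialize` and its `end_symbol` parameter are illustrative choices, and the corpus is assumed to be a plain word-to-frequency dictionary.

```python
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}

def initialize(word_freqs, end_symbol="</w>"):
    """Split each word into characters, append the end-of-word symbol, collect the vocabulary."""
    corpus = {tuple(word) + (end_symbol,): freq for word, freq in word_freqs.items()}
    vocab = []
    for symbols in corpus:
        for symbol in symbols:
            if symbol not in vocab:  # keep order of first appearance
                vocab.append(symbol)
    return corpus, vocab

corpus, vocab = initialize(word_freqs)
for symbols, freq in corpus.items():
    print(" ".join(symbols), freq)
print(len(vocab), "symbols:", vocab)
```

Storing each word as a tuple of symbols keeps the segmentation explicit, which makes the later counting and merging steps straightforward.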

### Step 2: Counting Pairs

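We count the frequency of all adjacent pairs, weighting each occurrence by the frequency of the word it appears in. Only pairs are counted; longer sequences such as `e s t` emerge later from successive merges.

*   `e s`: 6 (newest) + 3 (widest) = **9**
*   `s t`: 6 (newest) + 3 (widest) = **9**
*   `t </w>`: 6 (newest) + 3 (widest) = **9**
*   `w e`: 2 (lower) + 6 (newest) = 8
*   `l o`: 5 (low) + 2 (lower) = 7
*   `o w`: 5 (low) + 2 (lower) = 7
*   ...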
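
The pair counts can be checked mechanically. The sketch below assumes the corpus is stored as tuples of symbols mapped to word frequencies; `count_pairs` is a hypothetical helper name.

```python
from collections import Counter

corpus = {
    ("l", "o", "w", "</w>"): 5,
    ("l", "o", "w", "e", "r", "</w>"): 2,
    ("n", "e", "w", "e", "s", "t", "</w>"): 6,
    ("w", "i", "d", "e", "s", "t", "</w>"): 3,
}

def count_pairs(corpus):
    """Count adjacent symbol pairs, weighting each by the frequency of its word."""
    counts = Counter()
    for symbols, freq in corpus.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

pair_counts = count_pairs(corpus)
for pair, count in pair_counts.most_common(6):
    print(pair, count)
```

Printing the most common pairs gives the full ranking, including any pairs elided from the hand-written list.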

### Step 3: Merging

Three pairs tie for the highest count of 9: `e s`, `s t`, and `t </w>`. Ties can be broken arbitrarily, as long as the rule is applied consistently. Here we pick `e` and `s` and merge them into `es`.

*   `l o w </w>`: 5
*   `l o w e r </w>`: 2
*   `n e w es t </w>`: 6
*   `w i d es t </w>`: 3

**New Token**: `es`
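
One way to apply a merge is a left-to-right scan over each word's symbols. This is a sketch under the same tuple-based corpus assumption; `merge_pair` is an illustrative name.

```python
def merge_pair(symbols, pair):
    """Scan left to right, replacing each adjacent occurrence of `pair` with one symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2  # skip both halves of the merged pair
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

corpus = {
    ("l", "o", "w", "</w>"): 5,
    ("l", "o", "w", "e", "r", "</w>"): 2,
    ("n", "e", "w", "e", "s", "t", "</w>"): 6,
    ("w", "i", "d", "e", "s", "t", "</w>"): 3,
}

merged_corpus = {merge_pair(symbols, ("e", "s")): freq for symbols, freq in corpus.items()}
for symbols, freq in merged_corpus.items():
    print(" ".join(symbols), freq)
```

Words that do not contain the pair, such as `l o w e r </w>`, pass through unchanged, and the `e` in `n e w` is untouched because it is not followed by `s`.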

### Step 4: Iteration

Now we count pairs again over the updated corpus. The pair `es` and `t` appears 9 times (6 in newest + 3 in widest), tied with `t </w>`. We again break the tie in favor of `es` and `t`.

Merge `es` and `t` -> `est`.

*   `l o w </w>`: 5
*   `l o w e r </w>`: 2
*   `n e w est </w>`: 6
*   `w i d est </w>`: 3

**New Token**: `est`

We continue this process until the merge budget is exhausted or the vocabulary reaches the desired size.
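
The vocabulary-size stopping rule can be sketched as a loop that keeps merging while the vocabulary is below a target. The function name `train_to_vocab_size` and the target of 14 are arbitrary choices for illustration, and ties are again broken by taking the first pair encountered.

```python
from collections import Counter

def train_to_vocab_size(word_freqs, target_vocab_size):
    """Merge until the vocabulary holds `target_vocab_size` symbols or no pairs remain."""
    corpus = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}
    # Initial vocabulary: every distinct symbol, in order of first appearance.
    vocab = list(dict.fromkeys(s for symbols in corpus for s in symbols))
    while len(vocab) < target_vocab_size:
        counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                counts[pair] += freq
        if not counts:
            break  # every word is already a single symbol
        a, b = max(counts, key=counts.get)  # assumption: first pair seen wins ties
        new_corpus = {}
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == (a, b):
                    out.append(a + b)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] = freq
        corpus = new_corpus
        if a + b not in vocab:  # a merge normally yields a symbol not seen before
            vocab.append(a + b)
    return vocab, corpus

vocab, corpus = train_to_vocab_size({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 14)
print(vocab[11:])
```

Starting from the 11 initial symbols, a target of 14 allows three merges, so the loop performs one more merge beyond the two shown in the walkthrough.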