# Tokenizers

Tokenizers are fundamental components in Natural Language Processing (NLP) that convert raw text into a sequence of smaller units called tokens. These tokens can be words, subwords, or even characters. Tokenization is a crucial first step before feeding text data into NLP models. Different tokenizers employ various strategies to balance vocabulary size, handling of out-of-vocabulary (OOV) words, and computational efficiency.

Here's an overview of three common subword tokenizers: BPE, WordPiece, and SentencePiece.

### Byte-Pair Encoding (BPE)

**Working Principle:**
BPE is a greedy algorithm that iteratively merges the most frequent pair of adjacent symbols into a new, larger token. It starts with a base vocabulary of individual characters (or bytes, in byte-level variants) and repeatedly adds the most frequent adjacent pair to the vocabulary until the desired vocabulary size is reached or no remaining pair occurs often enough to merge.

**Merge Principle:**
The core idea is to identify the most frequent adjacent pair of characters or subwords in the training corpus and replace all occurrences of this pair with a new, combined token. This process is repeated, creating a vocabulary of subword units of varying lengths.
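
The merge loop can be sketched in plain Python. This is a minimal illustration rather than the `tokenizers` implementation: the toy word counts, the helper names `merge_pair` and `learn_bpe_merges`, and the tie-breaking rule (the pair counted first wins) are all assumptions made for the example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe_merges(word_counts, num_merges):
    """Learn up to `num_merges` merge rules from a {word: count} mapping."""
    # Start from single characters: each word is a tuple of symbols.
    corpus = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair; on ties, max() keeps the pair that was counted first.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        corpus = {merge_pair(s, best): c for s, c in corpus.items()}
    return merges


# Hypothetical toy corpus, given as word frequencies.
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
print(learn_bpe_merges(toy_counts, num_merges=3))
# [('e', 's'), ('es', 't'), ('l', 'o')]
```

Each learned rule builds on earlier ones: `('es', 't')` only becomes a candidate after `('e', 's')` has been merged.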

**OOV Handling:**
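BPE handles OOV words by breaking them down into smaller, known subword units or individual characters. Since the base vocabulary includes every character seen during training, any word made of those characters can ultimately be represented as a sequence of characters if no larger subword units match. A character that never appeared in the training data is still unknown; it maps to an unknown token if one is configured, unless a byte-level base vocabulary is used.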
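
To illustrate the fallback to smaller units, the sketch below applies an ordered list of merge rules to words that were not in the training data. The merge list and the function name `apply_merges` are hypothetical, and replaying the rules one by one in learned order is a simplification of what production tokenizers do.

```python
def apply_merges(word, merges):
    """Tokenize `word` by replaying merge rules in the order they were learned."""
    symbols = list(word)  # start from individual characters
    for left, right in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                out.append(left + right)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols


# Hypothetical merge rules, in learned order.
merges = [("e", "s"), ("es", "t"), ("l", "o")]

print(apply_merges("lowest", merges))  # ['lo', 'w', 'est']
print(apply_merges("zebra", merges))   # ['z', 'e', 'b', 'r', 'a']
```

The unseen word `"lowest"` reuses known subwords, while `"zebra"` matches no rule and falls back to single characters.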

**Advantages:**
- Relatively simple to implement.
- Effectively reduces vocabulary size while handling OOV words.
- Frequent subwords often correspond to meaningful units such as common prefixes and suffixes.

**Disadvantages:**
- With a large vocabulary budget, frequent strings can be merged into very long tokens.
- The greedy approach might not always result in the optimal tokenization.

### WordPiece

**Working Principle:**
WordPiece is similar to BPE but differs in how it chooses merges. Instead of merging the most frequent pair, WordPiece merges the pair that most increases the likelihood of the training data once it is added to the vocabulary. It starts with a vocabulary of individual characters and iteratively adds the merged unit that yields the greatest gain.

**Merge Principle:**
WordPiece compares how often a pair of tokens appears together with how often its parts appear overall. In a commonly described formulation, each candidate pair `(A, B)` is scored as the frequency of `AB` divided by the product of the frequencies of `A` and `B`, and the highest-scoring pair is merged. This favors pairs whose parts occur together more often than their individual frequencies would suggest, rather than pairs that are simply frequent.
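
One commonly described way to make this criterion concrete is the score `count(AB) / (count(A) * count(B))`. The sketch below computes it on a toy corpus and compares the winner with the pair BPE would pick. The word counts and the function name `wordpiece_pair_stats` are assumptions for illustration, and the scoring formula is the widely cited approximation, not necessarily what any particular WordPiece trainer implements.

```python
from collections import Counter


def wordpiece_pair_stats(word_counts, prefix="##"):
    """Return (pair_counts, scores) with score = count(AB) / (count(A) * count(B))."""
    unit_counts, pair_counts = Counter(), Counter()
    for word, count in word_counts.items():
        # Split into characters; non-initial characters carry the continuation prefix.
        symbols = [word[0]] + [prefix + ch for ch in word[1:]]
        for symbol in symbols:
            unit_counts[symbol] += count
        for pair in zip(symbols, symbols[1:]):
            pair_counts[pair] += count
    scores = {
        pair: n / (unit_counts[pair[0]] * unit_counts[pair[1]])
        for pair, n in pair_counts.items()
    }
    return pair_counts, scores


# Hypothetical toy corpus, given as word frequencies.
toy_counts = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
pair_counts, scores = wordpiece_pair_stats(toy_counts)

print(max(pair_counts, key=pair_counts.get))  # ('##u', '##g'), the most frequent pair
print(max(scores, key=scores.get))            # ('##g', '##s'), the highest-scoring pair
```

The pair `('##u', '##g')` is the most frequent, but `##u` is common everywhere, so its score is low. `##s` only ever follows `##g`, so `('##g', '##s')` wins under this score.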

**OOV Handling:**
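Similar to BPE, WordPiece handles OOV words by breaking them down into smaller, known subword units. It typically uses a special prefix (e.g., `##` in BERT) to indicate that a subword is not the beginning of a word. At tokenization time, BERT-style WordPiece matches the longest vocabulary entry first, and if some part of a word cannot be matched, the whole word is replaced by the unknown token (e.g., `[UNK]`).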
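
A minimal sketch of BERT-style greedy longest-match-first tokenization follows. The vocabulary and the function name `wordpiece_tokenize` are hypothetical; real implementations add details such as a maximum word length.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first segmentation of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end, match = len(word), None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = prefix + piece  # mark pieces that do not start the word
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk_token]  # one unmatched span makes the whole word unknown
        tokens.append(match)
        start = end
    return tokens


# Hypothetical vocabulary for illustration.
vocab = {"un", "##believ", "##able", "trans", "##form"}

print(wordpiece_tokenize("unbelievable", vocab))  # ['un', '##believ', '##able']
print(wordpiece_tokenize("unbelievably", vocab))  # ['[UNK]']
```

The second call shows why a WordPiece vocabulary normally includes all single characters (with and without the prefix): without them, one unmatched suffix turns the entire word into `[UNK]`.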

**Advantages:**
- Its merge criterion favors pairs whose parts rarely occur apart, which can yield more linguistically plausible subwords than purely frequency-based BPE.
- Used in popular models like BERT.

**Disadvantages:**
- Can be slightly more complex to implement than BPE.

### SentencePiece

**Working Principle:**
SentencePiece treats the input as a raw stream of characters and learns a vocabulary of subword units directly from it. It does not rely on pre-splitting the text into words using whitespace; instead, whitespace is kept as an ordinary symbol (displayed as `▁`). This makes it suitable for languages without explicit word boundaries (e.g., Chinese, Japanese). SentencePiece supports both BPE and unigram language model tokenization.

**Merge Principle (for BPE mode):**
Similar to BPE, it iteratively merges the most frequent character or subword sequences.

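**Segmentation Principle (for Unigram Language Model mode):**
Unigram mode does not merge. Training starts from a large candidate vocabulary and repeatedly prunes the pieces that contribute least to the likelihood of the corpus until the target size is reached. It learns a probability for each remaining piece and tokenizes the text by finding the most likely sequence of subword units that reconstructs the original text.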
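
Finding the most likely sequence of subword units can be done with dynamic programming (Viterbi). The sketch below assumes each piece has an independent probability; the toy probabilities are made up and not normalized, and `viterbi_segment` is a hypothetical helper, not the SentencePiece API.

```python
import math


def viterbi_segment(text, log_probs):
    """Return the segmentation of `text` with the highest total log-probability."""
    n = len(text)
    # best[i] = (best score for text[:i], start index of the last piece used)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best[start][0] > -math.inf:
                score = best[start][0] + log_probs[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the text to recover the pieces.
    pieces, end = [], n
    while end > 0:
        start = best[end][1]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]


# Made-up piece probabilities for illustration only.
probs = {"h": 0.05, "u": 0.05, "g": 0.05, "s": 0.05,
         "hu": 0.10, "ug": 0.20, "gs": 0.05, "hug": 0.15}
log_probs = {piece: math.log(p) for piece, p in probs.items()}

print(viterbi_segment("hugs", log_probs))  # ['hug', 's']
```

Here `hug + s` (0.15 × 0.05) beats `hu + gs` (0.10 × 0.05) and every finer split, so it is selected.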

**OOV Handling:**
SentencePiece can handle OOV words by breaking them down into smaller subword units or individual characters, similar to BPE and WordPiece. Its ability to handle raw character streams makes it robust to variations in whitespace and punctuation.

**Advantages:**
- Handles languages without explicit word boundaries effectively.
- Can produce reversible tokenization (decode tokens back to the original text).
- Supports both BPE and unigram language model approaches.
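
The reversibility point follows from keeping whitespace inside the pieces. A minimal sketch, assuming the default behavior in which spaces are replaced by `▁` (U+2581) and a leading `▁` is added to the text; the pieces below are written by hand for illustration, not produced by a trained model, and `detokenize` is a hypothetical helper.

```python
def detokenize(pieces):
    """Rebuild text from SentencePiece-style pieces that encode spaces as U+2581."""
    text = "".join(pieces).replace("\u2581", " ")
    # Drop the single leading space that comes from the dummy prefix.
    return text[1:] if text.startswith(" ") else text


# Hand-written pieces for illustration.
pieces_a = ["\u2581Hello", "\u2581wor", "ld", "."]
pieces_b = ["\u2581Hello", "\u2581wor", "ld", "\u2581."]

print(detokenize(pieces_a))  # Hello world.
print(detokenize(pieces_b))  # Hello world .
```

Because the space before the period is part of the pieces, the two inputs stay distinguishable after tokenization. A tokenizer that discards whitespace during pre-tokenization cannot recover that difference from its tokens alone.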

**Disadvantages:**
- Can be more computationally intensive during training compared to basic BPE.

In summary, BPE, WordPiece, and SentencePiece are powerful subword tokenization techniques that offer different approaches to balancing vocabulary size and OOV handling. The choice of tokenizer often depends on the specific language, dataset, and the requirements of the NLP task.

In [16]:
from tokenizers.models import BPE, WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer, WordPieceTrainer
from tokenizers import Tokenizer

## BPE Encoding

In [2]:
bpe_tokenizer = Tokenizer(BPE())

In [3]:
bpe_tokenizer

Tokenizer(version="1.0", truncation=None, padding=None, added_tokens=[], normalizer=None, pre_tokenizer=None, post_processor=None, decoder=None, model=BPE(dropout=None, unk_token=None, continuing_subword_prefix=None, end_of_word_suffix=None, fuse_unk=False, byte_fallback=False, ignore_merges=False, vocab={}, merges=[]))

In [4]:
bpe_tokenizer.pre_tokenizer = Whitespace()

In [5]:
bpe_tokenizer

Tokenizer(version="1.0", truncation=None, padding=None, added_tokens=[], normalizer=None, pre_tokenizer=Whitespace(), post_processor=None, decoder=None, model=BPE(dropout=None, unk_token=None, continuing_subword_prefix=None, end_of_word_suffix=None, fuse_unk=False, byte_fallback=False, ignore_merges=False, vocab={}, merges=[]))

In [6]:
trainer = BpeTrainer(vocab_size=1000, min_frequency=2, special_tokens=["<unk>", "<pad>", "<s>", "</s>"])
trainer

BpeTrainer(BpeTrainer(min_frequency=2, vocab_size=1000, show_progress=True, special_tokens=[AddedToken(content="<unk>", single_word=False, lstrip=False, rstrip=False, normalized=False, special=True), AddedToken(content="<pad>", single_word=False, lstrip=False, rstrip=False, normalized=False, special=True), AddedToken(content="<s>", single_word=False, lstrip=False, rstrip=False, normalized=False, special=True), AddedToken(content="</s>", single_word=False, lstrip=False, rstrip=False, normalized=False, special=True)], limit_alphabet=None, initial_alphabet=[], continuing_subword_prefix=None, end_of_word_suffix=None, max_token_length=None, words={}))

In [8]:
bpe_tokenizer.train(files=['corpus.txt'], trainer=trainer)

In [9]:
bpe_tokenizer

Tokenizer(version="1.0", truncation=None, padding=None, added_tokens=[{"id":0, "content":"<unk>", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":1, "content":"<pad>", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":2, "content":"<s>", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":3, "content":"</s>", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}], normalizer=None, pre_tokenizer=Whitespace(), post_processor=None, decoder=None, model=BPE(dropout=None, unk_token=None, continuing_subword_prefix=None, end_of_word_suffix=None, fuse_unk=False, byte_fallback=False, ignore_merges=False, vocab={"<unk>":0, "<pad>":1, "<s>":2, "</s>":3, ",":4, ".":5, ":":6, "A":7, "C":8, "E":9, "I":10, "J":11, "L":12, "P":13, "S":14, "a":15, "b":16, "c":17, "d":18, "e":19, "f":20, "g":21, "h":22, "i":23, "k":24, "l":25, "m":26,

### BPE Tokenizer Configuration

- **Special Tokens:**  
  - `<unk>`, `<pad>`, `<s>`, `</s>`  
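    These tokens are reserved for unknown words, padding, sequence starts, and sequence ends. Note that the model above was created with `BPE()`, so its `unk_token` is `None`: `<unk>` has an ID in the vocabulary, but the model has no unknown-token fallback unless `unk_token="<unk>"` is passed to `BPE`.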

- **Pre-Tokenizer:**  
  - `Whitespace`  
    Text is split on whitespace, and punctuation is separated from word characters, before the BPE model runs.

- **Model:**  
  - `BPE`  
    The Byte-Pair Encoding model uses its learned vocabulary and merge rules.

- **Vocabulary:**  
  - Contains the learned tokens:  
    - Characters  
    - Subwords  
    - Words  
  - Each token is mapped to a unique ID.

- **Merges:**  
  - Lists pairs of tokens that were merged during training.  
  - These merges form larger subword units from smaller ones.
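
The pre-tokenization step can be approximated in plain Python. The `tokenizers` documentation describes `Whitespace` as splitting with the regex `\w+|[^\w\s]+`; the sketch below applies that pattern with Python's `re` module, which is an approximation of the library's behavior rather than its actual code. The function name `whitespace_pre_tokenize` and the sample sentence are assumptions.

```python
import re

# Runs of word characters, or runs of characters that are neither word nor space.
WHITESPACE_PATTERN = re.compile(r"\w+|[^\w\s]+")


def whitespace_pre_tokenize(text):
    """Return (piece, (start, end)) tuples, mimicking a whitespace pre-tokenizer."""
    return [(m.group(), m.span()) for m in WHITESPACE_PATTERN.finditer(text)]


for piece, span in whitespace_pre_tokenize("huggingface and transformers, v2!"):
    print(piece, span)
# huggingface (0, 11)
# and (12, 15)
# transformers (16, 28)
# , (28, 29)
# v2 (30, 32)
# ! (32, 33)
```

Merges are learned and applied within each of these pieces, which is why no BPE token in this setup spans a space or joins a word to adjacent punctuation.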

In [10]:
bpe_tokenizer.save("bpe-tokenizer.json")
loaded_bpe = Tokenizer.from_file("bpe-tokenizer.json")

In [11]:
output = loaded_bpe.encode("huggingface and transformers")
print(output.tokens)

['h', 'u', 'g', 'g', 'ing', 'f', 'ac', 'e', 'and', 'tr', 'an', 's', 'f', 'orm', 'ers']


The output shows the tokens generated by the **BPE tokenizer** for the input string:

- The tokenizer breaks down the words into subword units based on the learned vocabulary and merges.

- Example token breakdown:  
  - `"huggingface"` is split into:  
    `'h'`, `'u'`, `'g'`, `'g'`, `'ing'`, `'f'`, `'ac'`, `'e'`  
  - `"transformers"` is split into:  
    `'tr'`, `'an'`, `'s'`, `'f'`, `'orm'`, `'ers'`

- The word `"and"` is kept as a **single token**, most likely because it is frequent enough in the corpus for its characters to have been merged into one vocabulary entry.

This demonstrates how BPE splits rare words into subword units learned from the corpus while keeping common words intact. The spaces do not appear in the output because the `Whitespace` pre-tokenizer discards them.

## WordPiece Tokenizer

In [17]:
wordpiece_tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
wordpiece_tokenizer.pre_tokenizer = Whitespace()
trainer = WordPieceTrainer(vocab_size=1000, special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

In [18]:
wordpiece_tokenizer.train(['corpus.txt'], trainer)
wordpiece_tokenizer

Tokenizer(version="1.0", truncation=None, padding=None, added_tokens=[{"id":0, "content":"[UNK]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":1, "content":"[CLS]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":2, "content":"[SEP]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":3, "content":"[PAD]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":4, "content":"[MASK]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}], normalizer=None, pre_tokenizer=Whitespace(), post_processor=None, decoder=None, model=WordPiece(unk_token="[UNK]", continuing_subword_prefix="##", max_input_chars_per_word=100, vocab={"[UNK]":0, "[CLS]":1, "[SEP]":2, "[PAD]":3, "[MASK]":4, ",":5, ".":6, ":":7, "A":8, "C":9, "E":10, "I":11, "J":12, "L":13, "P":14, "S":15, "a":16, "b":17,

In [19]:
output = wordpiece_tokenizer.encode("unbelievable transformation")
print(output.tokens)

['un', '##b', '##el', '##i', '##e', '##va', '##b', '##le', 'transf', '##orm', '##ation']


## SentencePiece Tokenizer

In [20]:
from sentencepiece import SentencePieceTrainer, SentencePieceProcessor

In [22]:
SentencePieceTrainer.Train(
    input='corpus.txt', model_prefix='spm_model', vocab_size=100,
    model_type='unigram', pad_id=0, unk_id=1, bos_id=2, eos_id=3,
    user_defined_symbols='[MASK]'
)

In [23]:
sp = SentencePieceProcessor(model_file='spm_model.model')
tokens = sp.encode("Sesquipedalophobia", out_type=str)
print(tokens)

['▁S', 'es', 'q', 'u', 'i', 'p', 'e', 'd', 'al', 'o', 'p', 'h', 'o', 'b', 'i', 'a']
