# Byte Pair Encoding Tokenizer from scratch

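In this assignment, we will build a BPE tokenizer of the kind used in GPT models.
BPE differs from WordPiece in a few ways. For example, GPT-2's tokenizer does not apply a normalizer. Byte-level BPE, which GPT-2 uses, also does not need an `unk_token`, because any input can be represented as a sequence of bytes. This notebook uses a Whitespace pre-tokenizer rather than a byte-level one and still registers an unknown token among its special tokens.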
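To make the idea behind BPE concrete, here is a minimal pure-Python sketch of the merge-learning loop: repeatedly find the most frequent adjacent symbol pair and merge it into a new symbol. The function names (`count_pairs`, `merge_pair`, `learn_bpe`), the toy word counts, and the tie-breaking rule are illustrative assumptions, not the implementation used by the `tokenizers` library.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_counts, num_merges):
    """Learn up to `num_merges` merge rules from a {word: count} mapping."""
    # Start from single characters as the initial symbols.
    words = {tuple(word): freq for word, freq in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        # Assumption: ties are broken by the lexicographically largest pair,
        # only to keep this sketch deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words


# Toy corpus (hypothetical counts).
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe(toy_counts, num_merges=3)
print(merges)
print(segmented)
```

Each learned merge adds one entry to the vocabulary, which is why the `vocab_size` passed to the trainer later in this notebook effectively bounds the number of merges.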

In [1]:
# Imports and libraries
from datasets import load_dataset
from tokenizers import Tokenizer, models, trainers, pre_tokenizers, decoders, processors
from tests import *

### Loading dataset

In [None]:
def load_data():
    """
    Load the specified dataset and return the text data.
    """
    # Load the WikiText dataset using the wikitext-2-raw-v1 config
    dataset = load_dataset("wikitext", "wikitext-2-raw-v1")
    return dataset["train"]["text"]

In [None]:
def initialize_bpe_tokenizer():
    """
    Initialize a Byte Pair Encoding (BPE) tokenizer with a Whitespace pre-tokenizer.
    """
    # Use BPE as the underlying tokenization model
    tokenizer = Tokenizer(models.BPE())
    # Split the input on whitespace and punctuation before BPE merges are applied
    tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
    return tokenizer

In the code block below, we train the BPE tokenizer on the WikiText dataset. We add the following special tokens:

- Padding token: Pads sequences to a uniform length within a batch.
- Unknown token: Represents words or tokens not found in the model's vocabulary.
- Classification token: Added at the start of a sequence for classification tasks.
- Separator token: Marks the boundary between sequences in multi-sequence tasks.
- Mask token: A placeholder used in masked language modeling to predict masked words.
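As a rough illustration of how two of these tokens are used downstream, the sketch below pads a batch of ID sequences and masks one position. The token names (`[PAD]`, `[UNK]`, `[MASK]`), the ID values, and the helper names `pad_batch` and `mask_position` are assumptions for illustration; the real IDs come from the trained vocabulary.

```python
# Hypothetical IDs for illustration only; look up the real ones with
# tokenizer.token_to_id(...) after training.
SPECIAL_IDS = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "[MASK]": 4}


def pad_batch(sequences, pad_id):
    """Right-pad every sequence to the longest length in the batch."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, attention_mask


def mask_position(ids, position, mask_id):
    """Return a copy of `ids` with one position replaced by the mask ID."""
    masked = list(ids)
    masked[position] = mask_id
    return masked


batch = [[2, 10, 11, 3], [2, 12, 3]]
padded, attention_mask = pad_batch(batch, SPECIAL_IDS["[PAD]"])
masked = mask_position(batch[0], 1, SPECIAL_IDS["[MASK]"])
print(padded)
print(attention_mask)
print(masked)
```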


In [None]:
def train_bpe_tokenizer(tokenizer, texts, vocab_size=30000, min_frequency=2, special_tokens=None):
    """
    Train a BPE tokenizer on the provided texts.
    """
    if special_tokens is None:
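        # Assumed conventional names; [CLS] and [SEP] must match the post-processing template
        special_tokens = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]

    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        min_frequency=min_frequency,
        special_tokens=special_tokens,
    )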

    def batch_iterator(batch_size=1000):
        for i in range(0, len(texts), batch_size):
            yield texts[i : i + batch_size]

    # Train on batches of text; length gives the progress bar a total to track
    tokenizer.train_from_iterator(batch_iterator(), trainer=trainer, length=len(texts))
    return tokenizer

The `configure_post_processing` function sets up rules to add special tokens (e.g., `[CLS]`, `[SEP]`) to tokenized sequences for single or paired inputs, ensuring proper formatting for downstream tasks. It also configures a BPE decoder to reconstruct text from token IDs.
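To show what these templates produce, here is a small pure-Python sketch that mimics the single and pair layouts on already-tokenized input. The helper `apply_template` is hypothetical and only imitates the output shape; the actual work is done by `processors.TemplateProcessing`.

```python
def apply_template(tokens_a, tokens_b=None):
    """Mimic "[CLS] $A [SEP]" and "[CLS] $A [SEP] $B:1 [SEP]:1".

    Returns the token list and the matching type IDs (0 for the first
    sequence and its special tokens, 1 for the second sequence and its
    closing separator).
    """
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    type_ids = [0] * len(tokens)
    if tokens_b is not None:
        tokens += list(tokens_b) + ["[SEP]"]
        type_ids += [1] * (len(tokens_b) + 1)
    return tokens, type_ids


single_tokens, single_types = apply_template(["hello", "world"])
pair_tokens, pair_types = apply_template(["hello"], ["good", "bye"])
print(single_tokens, single_types)
print(pair_tokens, pair_types)
```

The `:1` suffix in the pair template is what assigns type ID 1 to the second sequence and its trailing `[SEP]`.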

In [None]:
def configure_post_processing(tokenizer):
    """
    Configure the post-processing and decoding rules for the tokenizer.
    """
    tokenizer.post_processor = processors.TemplateProcessing(
        single="[CLS] $A [SEP]",
        pair="[CLS] $A [SEP] $B:1 [SEP]:1",
        special_tokens=[
            ("[CLS]", tokenizer.token_to_id("[CLS]")),
            ("[SEP]", tokenizer.token_to_id("[SEP]")),
        ],
    )
    tokenizer.decoder = decoders.BPEDecoder()

Save the tokenizer to the specified path so that it can be reloaded later. This avoids retraining and lets the same tokenizer be reused across projects.

In [None]:
def save_tokenizer(tokenizer, filepath):
    """
    Save the tokenizer to the specified filepath.
    """
    # Serialize the tokenizer (model and configuration) to a JSON file at filepath
    tokenizer.save(filepath)

In [None]:
def test_tokenizer(tokenizer, text):
    """
    Test the tokenizer on a sample text and return the tokens and IDs.
    """
    # Encode the text, then return the token strings and their corresponding IDs
    encoding = tokenizer.encode(text)
    return encoding.tokens, encoding.ids

Here we assemble all the pieces to train a BPE tokenizer and test it.

In [None]:
# Load the data
texts = load_data()
test_load_data(texts)

# Initialize the tokenizer
tokenizer = initialize_bpe_tokenizer()
test_initialize_bpe_tokenizer(tokenizer)

# Train the tokenizer
tokenizer = train_bpe_tokenizer(tokenizer, texts)
test_train_bpe_tokenizer(tokenizer)

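# Configure post-processing and decoding
configure_post_processing(tokenizer)

# Save the tokenizer (the filename here is a placeholder; use any path you like)
save_tokenizer(tokenizer, "bpe_tokenizer.json")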

# Test the tokenizer
test_text = "Natural Language Processing is fascinating."
tokens, ids = test_tokenizer(tokenizer, test_text)

print(f"Tokens: {tokens}")
print(f"IDs: {ids}")

test_tokenizer_func(tokens, ids)