<img src="../images/cover.jpg" width="1920"/>

# Tokenization

Tokenization converts text into smaller units, called tokens, that a neural network can map to integer IDs. The simplest approach splits text on whitespace into words. More advanced techniques such as WordPiece and Byte Pair Encoding (BPE) split words into subword units, so a rare or unseen word can still be represented as a sequence of known pieces instead of a single unknown token. BPE builds its vocabulary bottom-up: it starts from individual characters and repeatedly merges the most frequent pair of adjacent symbols. WordPiece builds its vocabulary in a similar bottom-up way and marks word-internal pieces with a prefix such as `##`. Subword vocabularies keep the vocabulary size bounded while still covering large vocabularies and morphologically complex languages, which is why transformer models such as BERT rely on them.

<img src="../images/tokenization.svg" width="1920"/>
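The merge loop at the heart of BPE is small enough to sketch in plain Python. The block below is a minimal illustration, not the implementation used by the `tokenizers` library: the toy word counts and the helper names `get_pair_counts` and `merge_pair` are made up for this example. WordPiece, as it is usually described, follows the same bottom-up pattern but ranks candidate merges with a different score than raw pair frequency.

```python
from collections import Counter


def get_pair_counts(word_freqs):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, word_freqs):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Toy corpus (hypothetical): word -> number of occurrences
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}

# Every word starts out as a tuple of single characters
word_freqs = {tuple(word): count for word, count in toy_counts.items()}

merges = []
for _ in range(3):
    pairs = get_pair_counts(word_freqs)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    merges.append(best)
    word_freqs = merge_pair(best, word_freqs)

print("Learned merges:", merges)
print("Segmented words:", list(word_freqs))
```

After a few merges the frequent suffix `est` becomes a single symbol, which is the kind of reusable subword unit a real vocabulary ends up containing.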

In [None]:
from tokenizers import BertWordPieceTokenizer
from transformers import PreTrainedTokenizerFast
import os

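Prepare the dataset for tokenizer training. The corpus is expected to be a plain-text file with one article per line; because the trainer reads from files, each article is written to its own file and the paths are collected in a list.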

In [None]:
# To download the dataset uncomment the following line
# ! python download_asosoft_small_text_corpus.py

In [None]:
input_file = "data/text_data.txt"  # a text file with one article per line
# Read the input file and split it into lines (one article per line)
with open(input_file, "r", encoding="utf-8") as f:
    articles = f.read().split("\n")

# Strip surrounding whitespace and drop empty lines
articles = [article.strip() for article in articles if article.strip()]

# Create a temporary directory for individual article files
os.makedirs("temp_articles", exist_ok=True)

# Write each article to a separate file
files = []
for idx, article in enumerate(articles):
    file_path = f"temp_articles/article_{idx}.txt"
    with open(file_path, "w", encoding="utf-8") as f:
        f.write(article)
    files.append(file_path)

print(f"Created {len(files)} temporary files.")

Train a BERT WordPiece tokenizer on the provided dataset.

- files: List of file paths containing the training texts
- vocab_size: Target (maximum) size of the final vocabulary
- min_frequency: Minimum number of times a pair must occur in the corpus to be merged into a new token

In [3]:
vocab_size = 30000
min_frequency = 2

# Initialize a BERT WordPiece tokenizer
tokenizer = BertWordPieceTokenizer(
    clean_text=True, handle_chinese_chars=True, strip_accents=False, lowercase=False
)

# Train the tokenizer
tokenizer.train(
    files=files,
    vocab_size=vocab_size,
    min_frequency=min_frequency,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
    limit_alphabet=1000,
    wordpieces_prefix="##",
)

In [None]:
output_dir = "custom_tokenizer"

# Create output directory
os.makedirs(output_dir, exist_ok=True)


# Save the trained tokenizer as a single JSON file inside output_dir
tokenizer.save(f"{output_dir}/tokenizer.json")

# Convert to PreTrainedTokenizerFast
fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file=f"{output_dir}/tokenizer.json",
    # Add BERT-specific parameters
    bos_token="[CLS]",
    eos_token="[SEP]",
    unk_token="[UNK]",
    sep_token="[SEP]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    mask_token="[MASK]",
)

# Save the fast tokenizer
fast_tokenizer.save_pretrained(output_dir)

Test the trained tokenizer on a sample text.

In [6]:
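# Reload the saved tokenizer from disk; `tokenizer` is now a PreTrainedTokenizerFast,
# not the BertWordPieceTokenizer used for training
tokenizer = PreTrainedTokenizerFast.from_pretrained(output_dir)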

In [None]:
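# Note: if the training corpus is not English, this sentence may come back mostly as
# [UNK] or very short pieces; a sentence in the corpus language gives a more meaningful test
test_text = "This is a sample text to test our new BERT tokenizer."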
print(f"Original text: {test_text}")

# Encode the text
ids = tokenizer.encode(test_text)
tokens = tokenizer.convert_ids_to_tokens(ids)
# Print the results
print(f"Encoded tokens: {tokens}")
print(f"Token IDs: {ids}")

# Decode the token IDs back into text
decoded = tokenizer.decode(ids)
print(f"Decoded text: {decoded}")
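To make the `##` pieces in the encoded output easier to read, here is a minimal sketch of how a WordPiece vocabulary is applied to a single word: greedy longest-match-first, with word-internal pieces carrying the `##` prefix. The function name `wordpiece_encode` and the tiny `toy_vocab` are hypothetical and exist only for this illustration; the trained tokenizer above uses the library's own implementation and its own vocabulary.

```python
def wordpiece_encode(word, vocab, unk_token="[UNK]", prefix="##"):
    """Split one word into vocabulary pieces, longest match first."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # mark word-internal pieces
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word becomes unknown
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical vocabulary, made up for this example
toy_vocab = {"token", "##izer", "##ize", "##r", "test", "##s", "[UNK]"}

for word in ["tokenizer", "tests", "xyz"]:
    print(word, "->", wordpiece_encode(word, toy_vocab))
```

A word that cannot be covered by any sequence of pieces maps to `[UNK]`, which is why the character coverage of the training corpus matters for the sample text used in the test.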

In [None]:
# Clean up temporary files
for file in files:
    os.remove(file)
os.rmdir("temp_articles")
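As an alternative to creating and removing `temp_articles` by hand, the standard library's `tempfile.TemporaryDirectory` deletes the directory and everything in it when the `with` block exits, even if training raises an exception. The sketch below uses made-up sample articles and a hypothetical helper `write_article_files`; the training call is left as a comment because it depends on the tokenizer object defined earlier.

```python
import os
import tempfile


def write_article_files(articles, directory):
    """Write each article to its own UTF-8 file and return the file paths."""
    paths = []
    for idx, article in enumerate(articles):
        path = os.path.join(directory, f"article_{idx}.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write(article)
        paths.append(path)
    return paths


# Made-up sample data standing in for the real corpus
sample_articles = ["first sample article", "second sample article"]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_files = write_article_files(sample_articles, tmp_dir)
    files_existed = all(os.path.exists(p) for p in sample_files)
    # Tokenizer training would go here, e.g. tokenizer.train(files=sample_files, ...)

# The directory and its files are removed once the block exits
dir_removed = not os.path.exists(tmp_dir)
print(f"Files existed during training: {files_existed}, cleaned up: {dir_removed}")
```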