## Training a Byte-Pair Encoding Tokenizer

In this experiment, a BPE tokenizer is trained on the cleaned text in the 'body' column of the dataset.

In [73]:
from tokenizers import Tokenizer, trainers, pre_tokenizers, models
import pandas as pd

The cleaned data is loaded and converted into a list of strings, which is passed to the tokenizer during training.

In [74]:
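# Load the data and convert it to a list of strings
df = pd.read_csv('Datasets/train_cleaned.csv')
# Assuming 'body' is the column containing text data; missing rows are dropped
# because the trainer expects every item to be a string
corpus = df['body'].dropna().astype(str).tolist()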

The trainer is configured so that the vocabulary grows to 10,000 entries, the initial alphabet covers every byte-level symbol, and a handful of special tokens are reserved for exceptional cases in the text.

In [75]:
# Initialize the tokenizer
tokenizer = Tokenizer(models.BPE())

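# Pre-tokenize at the byte level, adding a leading space so the first word is treated like the others
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)

# Configure training: vocabulary size, byte-level alphabet, and special tokens
trainer = trainers.BpeTrainer(
    vocab_size=10000,
    show_progress=True,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)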
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Save the trained tokenizer to a single JSON file
tokenizer.save('Tokenizers/bpe_tokenizer.json')
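
For intuition about what the trainer does while it grows the vocabulary, the merge loop at the heart of BPE can be sketched in plain Python. This is a simplified illustration, not the library's implementation: it works on characters rather than bytes, and the helper names (`most_frequent_pair`, `merge_pair`, `toy_bpe`) and the word counts are hypothetical.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, weighted by word count."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def toy_bpe(word_counts, num_merges):
    """Learn up to `num_merges` merges from a {word: count} dictionary."""
    words = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:  # every word is already a single symbol
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words


# Hypothetical word counts, chosen only for illustration
merges, words = toy_bpe({"collectif": 3, "actif": 2, "continuent": 1}, num_merges=1)
print(merges)
```

Each merge adds one new symbol to the vocabulary, so a target such as `vocab_size=10000` roughly determines how many merges are learned on top of the initial alphabet and special tokens.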

To see how the tokenizer behaves, we encode one sample from the corpus and print its tokens.

In [77]:
# Test the tokenizer
encoded = tokenizer.encode(df['body'][543])
print(encoded.tokens)

['Ġc', "'", 'est', 'Ġtan', 'nant', 'Ġquel', 'Ġpoint', 'Ġquartiers', 'Ġcentr', 'aux', 'ĠdÃ©jÃł', 'Ġbien', 'Ġdess', 'erv', 'is', 'Ġtransport', 'Ġact', 'if', 'Ġcollect', 'if', 'Ġcontinu', 'ent', 'Ġavoir', 'ĠamÃ©li', 'orations', 'Ġleurs', 'Ġinfr', 'astruct', 'ures', 'Ġpendant', 'Ġquartiers', 'ĠpÃ©ri', 'ph', 'Ã©ri', 'ques', 'Ġlaiss', 'Ã©s', 'Ġj', 'ach', 'Ã¨re', 'âĢ¦', 'Ġc', "'", 'est', 'Ġrendu', 'Ġc', "'", 'est', 'Ġbeaucoup', 'Ġplus', 'Ġfacile', 'Ġvivre', 'Ġsans', 'Ġauto', 'Ġlongueuil', 'Ġbrossard', 'Ġqu', "'", 'Ãł', 'Ġst', 'Ġlaurent', 'Ġcd', 'n', 'Ġndg', 'Ġlas', 'alle', 'ĠmontrÃ©al', 'ĠmontrÃ©al', 'Ġnord', 'Ġetc', 'Ġcomprends', 'Ġqu', "'", 'il', 'Ġfaut', 'Ġcommencer', 'Ġquelque', 'Ġpart', 'Ġdis', 'Ġc', "'", 'est', 'Ġcorrect', 'Ġproc', 'Ã©der', 'Ġainsi', 'Ġsachant', 'Ġservice', 'Ġsans', 'Ġdoute', 'ĠamÃ©li', 'orÃ©', 'Ġag', 'rand', 'i', 'Ġchaque', 'ĠannÃ©e', 'Ġc', "'", 'est', 'Ġfrustr', 'ant', 'Ġpareil', 'Ġvoir', 'Ġquel', 'Ġpoint', 'Ġchoses', 'Ġprogress', 'ent', 'Ġlent', 'ement']
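
The unusual characters in this output are expected rather than encoding errors. Byte-level pre-tokenization represents each UTF-8 byte as a printable character, so a leading space appears as `Ġ` and an accented letter such as `é` (two bytes) appears as `Ã©`. The sketch below rebuilds that mapping in plain Python to turn tokens back into readable text. It assumes the table follows the GPT-2 byte-to-character convention, which is consistent with the tokens shown above; the helper names `byte_level_alphabet` and `decode_byte_level` are hypothetical.

```python
def byte_level_alphabet():
    """Map each byte value (0-255) to a printable character, GPT-2 style (assumed)."""
    # Bytes that are already printable keep their own code point
    printable = list(range(33, 127)) + list(range(161, 173)) + list(range(174, 256))
    byte_to_char = {b: chr(b) for b in printable}
    # Remaining bytes (space, control characters, ...) are shifted past U+00FF
    shift = 0
    for b in range(256):
        if b not in byte_to_char:
            byte_to_char[b] = chr(256 + shift)
            shift += 1
    return byte_to_char


def decode_byte_level(tokens):
    """Convert byte-level token strings back into ordinary text."""
    char_to_byte = {c: b for b, c in byte_level_alphabet().items()}
    raw = bytes(char_to_byte[c] for token in tokens for c in token)
    return raw.decode("utf-8", errors="replace")


# A few tokens copied from the output above
sample = ['ĠdÃ©jÃł', 'Ġbien', 'ĠamÃ©li', 'orations']
print(decode_byte_level(sample))  # prints " déjà bien améliorations"
```

Within the library itself, the usual route is to attach `decoders.ByteLevel()` to the tokenizer and call `tokenizer.decode(encoded.ids)`; the manual version above is only meant to show why the raw tokens look the way they do.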
