## SentencePiece Encoding

In [45]:
import sentencepiece as spm
import pandas as pd

The cleaned data is loaded and converted into a list of strings to be fed into the tokenizer during training.

In [46]:
# Load the data and convert it to a list of strings
df = pd.read_csv('Datasets/train_cleaned.csv')
corpus = df['body'].tolist()  # Assuming 'body' is the column containing text data

Some housekeeping is needed before the SentencePiece library can be used. The trainer call used here reads its input from a text file with one sentence per line, so the corpus is first saved to a temporary file.

In [47]:
# Save the corpus to a temporary file
with open('temp_corpus.txt', 'w', encoding='utf-8') as f:
    for sentence in corpus:
        f.write(sentence + '\n')
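The loop above assumes every entry in `corpus` is a single-line string. As a minimal sketch, not part of the original notebook, the hypothetical helper `prepare_lines` below shows one way to guard against two common problems: empty CSV cells, which pandas loads as `NaN` (a float), and entries that contain their own newlines, which would otherwise be split into several training lines. Collapsing all whitespace to single spaces is an assumption made here for simplicity.

```python
def prepare_lines(texts):
    """Yield one cleaned, single-line string per usable entry.

    Hypothetical helper: skips non-string entries and collapses any
    internal whitespace (including newlines) to single spaces.
    """
    for text in texts:
        if not isinstance(text, str):
            continue  # e.g. NaN (a float) from an empty CSV cell
        line = ' '.join(text.split())  # newlines, tabs, repeated spaces -> one space
        if line:
            yield line


# Small illustrative input, not taken from the dataset
sample = ["c'est tannant", float('nan'), 'ligne un\nligne deux', '   ']
print(list(prepare_lines(sample)))
```

With a helper like this, the write loop could iterate over `prepare_lines(corpus)` instead of `corpus` directly.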

The tokenizer is trained on the full training set with a vocabulary size of 5000. Training writes a model file that can then be loaded to tokenize text.

In [48]:
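import os

output_model = 'Tokenizers/sp_model'
vocab_size = 5000

# Make sure the output directory exists before training writes into it
os.makedirs(os.path.dirname(output_model), exist_ok=True)

# Train the tokenizer; this writes sp_model.model and sp_model.vocab
spm.SentencePieceTrainer.train(
    input='temp_corpus.txt',
    model_prefix=output_model,
    vocab_size=vocab_size,
)

# Remove the temporary file
os.remove('temp_corpus.txt')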
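Alongside the `.model` file, SentencePiece writes a plain-text `.vocab` file next to it, with one piece and its score per line, separated by a tab. As a rough sketch for inspecting what was learned, the hypothetical helper `load_vocab` below reads that file using only the standard library. The path `Tokenizers/sp_model.vocab` is assumed from the `model_prefix` used above.

```python
from pathlib import Path


def load_vocab(vocab_path):
    """Read a SentencePiece .vocab file into a list of (piece, score) tuples.

    Hypothetical helper: expects one 'piece<TAB>score' entry per line.
    """
    entries = []
    with open(vocab_path, encoding='utf-8') as f:
        for line in f:
            line = line.rstrip('\n')
            if not line:
                continue
            # Split on the last tab so the piece is kept intact
            piece, _, score = line.rpartition('\t')
            entries.append((piece, float(score)))
    return entries


# Assumed location, derived from the model prefix used for training
vocab_file = Path('Tokenizers/sp_model.vocab')
if vocab_file.exists():
    vocab = load_vocab(vocab_file)
    print(len(vocab), vocab[:10])
```

Looking at the first few entries is a quick sanity check that training produced the expected number of pieces.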

We can now test how the tokenizer splits a sample from the dataset. In the output, the `▁` prefix marks a piece that begins a new word.

In [52]:
# Load the model that was just trained
model_path = 'Tokenizers/sp_model.model'
sp = spm.SentencePieceProcessor(model_file=model_path)

# Encode a sample sentence from the dataset
tokens = sp.encode_as_pieces(df['body'][543])
print(tokens)

['▁c', "'", 'est', '▁t', 'an', 'nant', '▁quel', '▁point', '▁quartier', 's', '▁centr', 'aux', '▁déj', 'à', '▁bien', '▁dess', 'er', 'vis', '▁transport', '▁act', 'if', '▁collectif', '▁continue', 'nt', '▁a', 'voir', '▁amélior', 'ations', '▁l', 'eurs', '▁infrastructure', 's', '▁p', 'endant', '▁quartier', 's', '▁péri', 'ph', 'éri', 'ques', '▁laissé', 's', '▁ja', 'ch', 'ère', '...', '▁c', "'", 'est', '▁rendu', '▁c', "'", 'est', '▁beaucoup', '▁plus', '▁facile', '▁viv', 're', '▁san', 's', '▁auto', '▁longue', 'uil', '▁bro', 's', 's', 'ard', '▁qu', "'", 'à', '▁st', '▁', 'laurent', '▁c', 'd', 'n', '▁n', 'd', 'g', '▁la', 's', 'alle', '▁montréal', '▁montréal', '▁nor', 'd', '▁etc', '▁comprend', 's', '▁qu', "'", 'il', '▁faut', '▁commence', 'r', '▁quelque', '▁part', '▁dis', '▁c', "'", 'est', '▁correct', '▁proc', 'é', 'der', '▁a', 'insi', '▁sach', 'ant', '▁service', '▁san', 's', '▁doute', '▁amélior', 'é', '▁', 'agrandi', '▁cha', 'que', '▁année', '▁c', "'", 'est', '▁frustra', 'nt', '▁pare', 'il', '▁', 'v