In this notebook we train a tokenizer on the sequences. In Natural Language Processing, a tokenizer splits text (words, sentences or documents) into smaller units called tokens.

Here the tokens are subsequences, which we will use as features in modelling.

Think of it as **an N-gram technique, but far more effective than enumerating n-gram permutations**.

You can see that this technique is **fast and affordable**: **I ran the whole process on my laptop**, so **anyone can do it without heavy computational power**.
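
To make the n-gram comparison concrete, here is a minimal sketch of plain character n-gram extraction. The toy sequences and the four-letter alphabet are assumptions made for illustration only; they are not taken from the dataset. The point is that fixed-length n-grams enumerate every substring of one length, whereas a trained tokenizer keeps a bounded vocabulary of frequent, variable-length subsequences.

```python
from collections import Counter

def char_ngrams(sequence, n):
    """Return every overlapping substring of length n in `sequence`."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]

# Toy sequences over a hypothetical four-letter alphabet (illustration only)
sequences = ["ACGTACGTAC", "ACGTTTACGT"]

# Count how often each 4-gram appears across all toy sequences
counts = Counter(g for s in sequences for g in char_ngrams(s, 4))

# Number of distinct 4-grams that could exist over a 4-letter alphabet
possible = 4 ** 4

print(len(counts), "distinct 4-grams observed out of", possible, "possible")
print(counts.most_common(3))
```

The number of possible n-grams grows exponentially with `n`, which is why a learned vocabulary of fixed size is attractive for long sequences.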

## DEPENDENCIES

In [1]:
import pandas as pd
import numpy as np

from tqdm.notebook import tqdm

from tokenizers import ByteLevelBPETokenizer, SentencePieceBPETokenizer

import seaborn as sns 

## DATA

In [2]:
train = pd.read_csv('../data/raw/train_values.csv')
test = pd.read_csv('../data/raw/test_values.csv')
print('Train: ',train.shape)
print('Test: ',test.shape)

Train:  (63017, 41)
Test:  (18816, 41)


In [3]:
# Write every sequence (train first, then test) to a text file, one sequence per line
filename = '../data/tokenizer/corpus.txt'

with open(filename, 'w') as f:
    for df in (train, test):
        for seq in tqdm(df["sequence"].values, total=df.shape[0], leave=False):
            f.write(seq + '\n')


## CREATE TOKENIZER

In [4]:
%%time
# Run time depends on your hardware. This was run on a laptop.

# Train the tokenizer to generate a vocabulary file; the vocabulary entries are subsequences.
# vocab_size is the number of subsequences to generate. It is a good hyperparameter to tune
# for your problem and sequences.

# Initialize a tokenizer
# You can choose which best suits you
#tokenizer = ByteLevelBPETokenizer()
tokenizer = SentencePieceBPETokenizer() # A bit slower

# Customize training
tokenizer.train(files='../data/tokenizer/corpus.txt', vocab_size=2500, min_frequency=2, special_tokens=['<unk>'])

Wall time: 21min 48s
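
To build intuition for what `vocab_size` and `min_frequency` control, here is a toy byte-pair-encoding trainer in pure Python. It is only a sketch of the general BPE idea (repeatedly merge the most frequent adjacent pair); it is not how the `tokenizers` library is implemented, and the function name `train_toy_bpe` and the toy sequences are hypothetical.

```python
from collections import Counter

def train_toy_bpe(sequences, vocab_size, min_frequency=2):
    """Greedy BPE sketch: merge the most frequent adjacent pair until a limit is hit."""
    # Start with every sequence split into single characters
    corpus = [list(s) for s in sequences]
    vocab = sorted({ch for seq in corpus for ch in seq})

    while len(vocab) < vocab_size:
        # Count every adjacent pair of tokens in the current corpus
        pairs = Counter()
        for seq in corpus:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        (left, right), freq = pairs.most_common(1)[0]
        if freq < min_frequency:
            break  # no remaining pair is frequent enough to merge

        merged = left + right
        if merged not in vocab:
            vocab.append(merged)

        # Replace each occurrence of the pair with the merged token
        for idx, seq in enumerate(corpus):
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
                    out.append(merged)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            corpus[idx] = out
    return vocab

# Toy sequences (illustration only, not from the dataset)
toy = ["ACGTACGT", "ACGTTT", "TTACGT"]

small_vocab = train_toy_bpe(toy, vocab_size=6)   # stops when vocab_size is reached
large_vocab = train_toy_bpe(toy, vocab_size=50)  # stops early because of min_frequency
print(small_vocab)
print(large_vocab)
```

A larger `vocab_size` allows more merges and therefore longer subsequences, while `min_frequency` stops the process once the remaining pairs are too rare to be useful.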


In [5]:
# Save the tokenizer; we will use the generated vocab file for modelling
tokenizer.save_model('../data/tokenizer/SP_2500/')

['../data/tokenizer/SP_2500/vocab.json',
 '../data/tokenizer/SP_2500/merges.txt']

In [None]:
# We will only use the vocab file (vocab.json)
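
One possible way to turn the vocab file into features is to count how often each subsequence occurs in a sequence. The sketch below is an assumption about how the vocabulary might be used, not the method of the modelling notebook. The helpers `load_subsequences` and `count_features` are hypothetical names, and the code assumes that `vocab.json` maps each token to an integer id and that tokens may carry the `▁` word-boundary marker used by SentencePiece-style tokenizers.

```python
import json

def load_subsequences(vocab_path):
    """Read vocab.json (assumed token -> id mapping) and return plain subsequences."""
    with open(vocab_path, encoding='utf-8') as f:
        vocab = json.load(f)
    # Drop the special token and strip the '▁' boundary marker, if present
    tokens = {tok.replace('▁', '') for tok in vocab if tok != '<unk>'}
    tokens.discard('')
    return sorted(tokens)

def count_features(sequence, subsequences):
    """Count non-overlapping occurrences of each subsequence in one sequence."""
    return {sub: sequence.count(sub) for sub in subsequences}

# Toy vocabulary standing in for:
# load_subsequences('../data/tokenizer/SP_2500/vocab.json')
toy_subsequences = ['ACG', 'GT', 'TT']

features = count_features('ACGTTACGT', toy_subsequences)
print(features)
```

Applied to every row of `train["sequence"]`, this would give one count column per vocabulary entry, which could then be fed to a model.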