In [None]:
import pandas as pd
import torch

In [None]:
data = pd.read_csv('data_for_tokenizer.csv')

The file data_for_tokenizer.csv contains many possible addresses, both unnormalized and normalized, in a single column named '0'.

In [None]:
class NormalizedDataset(torch.utils.data.Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # idx may be an int (one address) or a slice (a pandas Series of addresses)
        texto = self.data.iloc[idx]['0']
        return {'train': texto}

dataset_aux = NormalizedDataset(data=data)

In [None]:
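def get_training_corpus():
    # Slicing the whole dataset returns the full '0' column as a pandas Series
    dataset = dataset_aux[:]["train"]
    # Yield the corpus in batches of 1000 addresses
    for start_idx in range(0, len(dataset), 1000):
        samples = dataset[start_idx : start_idx + 1000]
        # Convert each batch to a plain list of strings for the trainer
        yield samples.tolist()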
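
The batching pattern used by get_training_corpus() can be checked in isolation. The snippet below is a minimal sketch on a plain Python list; the `batched` helper and the stand-in corpus are hypothetical names used only for illustration and are not part of the notebook's data.

```python
def batched(items, batch_size=1000):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


# Hypothetical stand-in corpus of 2500 address strings
corpus = [f"rua exemplo {i}" for i in range(2500)]

# The last batch is shorter because 2500 is not a multiple of 1000
batch_sizes = [len(batch) for batch in batched(corpus)]
print(batch_sizes)
```

Feeding the trainer in batches keeps each chunk small, which is the same reason the generator above slices the column 1000 rows at a time.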

WordPiece Tokenizer<br>
https://huggingface.co/course/chapter6/8?fw=pt#acquiring-a-corpus

In [1]:
from tokenizers import (
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
)
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))

<p>The code below sets the <code>normalizer</code> of the tokenizer object. The <code>normalizer</code> is a <code>Sequence</code> of operations applied, in order, to the input text before tokenization. Here it is composed of three operations:</p>
<ul>
    <li><code>normalizers.NFD()</code> - applies Unicode canonical decomposition (NFD), which splits each accented character into its base character followed by separate combining marks</li>
    <li><code>normalizers.Lowercase()</code> - converts all characters to lowercase</li>
    <li><code>normalizers.StripAccents()</code> - removes the combining marks produced by the NFD step, leaving only the base characters</li>
</ul>


In [None]:
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFD(), normalizers.Lowercase(), normalizers.StripAccents()]
)
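
The effect of the three normalization steps can be approximated with the Python standard library. This is only a sketch, not the library's implementation: `normalize_sketch` is a hypothetical helper, and it assumes that stripping accents is equivalent to dropping every character whose Unicode category starts with "M" (combining marks).

```python
import unicodedata


def normalize_sketch(text):
    """Approximate NFD -> Lowercase -> StripAccents with the standard library."""
    # NFD: split accented characters into base character + combining marks
    decomposed = unicodedata.normalize("NFD", text)
    # Lowercase
    lowered = decomposed.lower()
    # StripAccents: drop combining marks (assumed: Unicode categories Mn, Mc, Me)
    return "".join(
        ch for ch in lowered if not unicodedata.category(ch).startswith("M")
    )


example = normalize_sketch("Avenida São João")
print(example)
```

On real data, `tokenizer.normalizer.normalize_str(...)` can be used to inspect what the configured normalizer actually produces.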

The code below sets the pre-tokenizer of the ``tokenizer`` object. The pre-tokenizer splits the normalized text into smaller pieces, and the model then tokenizes each piece independently.

``pre_tokenizers.Sequence()`` chains several pre-tokenizers so that they are applied in order. In this case, there are three pre-tokenizers in the sequence:
<ul>
    <li><code>pre_tokenizers.Whitespace()</code> - This pre-tokenizer splits the text on whitespace (spaces, tabs, and newlines) and also separates runs of punctuation from words; it uses the pattern <code>\w+|[^\w\s]+</code>.</li>
    <li><code>pre_tokenizers.Digits(False)</code> - This pre-tokenizer splits digits (0-9) off from the surrounding characters. The <code>False</code> argument (<code>individual_digits</code>) keeps each run of consecutive digits together as one piece, so "123" stays "123" instead of becoming "1", "2", "3".</li>
    <li><code>pre_tokenizers.Punctuation('removed')</code> - This pre-tokenizer splits on punctuation characters and, with the <code>'removed'</code> behavior, drops them from the output.</li>
</ul>

In [None]:
tokenizer.pre_tokenizer = pre_tokenizers.Sequence(
    [pre_tokenizers.Whitespace(), pre_tokenizers.Digits(False), pre_tokenizers.Punctuation('removed')]
)
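
To make the combined effect of the three pre-tokenizers concrete, the sketch below imitates them with regular expressions. It is an approximation under stated assumptions, not the library's code: `pre_tokenize_sketch` is a hypothetical helper, and it only treats ASCII punctuation as punctuation.

```python
import re
import string

# ASCII punctuation only (a simplification of the library's punctuation handling)
PUNCT_RUN = re.compile("[" + re.escape(string.punctuation) + "]+")


def pre_tokenize_sketch(text):
    """Imitate Whitespace() -> Digits(False) -> Punctuation('removed')."""
    pieces = []
    # Step 1: words and punctuation runs, whitespace discarded
    for chunk in re.findall(r"\w+|[^\w\s]+", text):
        # Step 2: separate digit runs from non-digits, keeping runs intact
        for part in re.findall(r"\d+|\D+", chunk):
            # Step 3: split on punctuation and drop it
            pieces.extend(p for p in PUNCT_RUN.split(part) if p)
    return pieces


example_pieces = pre_tokenize_sketch("Rua 12A, apto. 3")
print(example_pieces)
```

For the actual behavior on a given string, `tokenizer.pre_tokenizer.pre_tokenize_str(...)` returns the pieces together with their character offsets.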

The code below creates a <code>WordPieceTrainer</code> object from the <code>trainers</code> module and configures it with the following parameters:

<ul>
    <li><code>special_tokens</code>: a list of special tokens to be included in the vocabulary of the trained tokenizer. The special tokens are <code>[UNK]</code>, <code>[PAD]</code>, <code>[CLS]</code>, <code>[SEP]</code>, and <code>[MASK]</code>.</li>
    <li><code>show_progress</code>: a boolean value indicating whether or not to display progress bars during training.</li>
    <li><code>min_frequency</code>: an integer value indicating the minimum number of times a pair of symbols must occur in the corpus in order to be merged into a new vocabulary entry. With a value of 2, pairs seen only once are never merged.</li>
</ul>

In [None]:
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.WordPieceTrainer(special_tokens=special_tokens, show_progress=True, min_frequency=2)
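
The role of <code>min_frequency</code> can be illustrated with a toy pair count. This is a simplified sketch, not the trainer's actual algorithm: it only counts adjacent symbol pairs in a hypothetical four-word corpus and keeps those that reach the threshold, whereas the real trainer also scores and iteratively merges pairs.

```python
from collections import Counter


def pair_counts(words):
    """Count adjacent symbol pairs, marking word-internal symbols with '##'."""
    counts = Counter()
    for word in words:
        if not word:
            continue
        symbols = [word[0]] + ["##" + ch for ch in word[1:]]
        for left, right in zip(symbols, symbols[1:]):
            counts[(left, right)] += 1
    return counts


# Hypothetical toy corpus (already pre-tokenized into words)
words = ["rua", "rua", "ruela", "av"]
min_frequency = 2

counts = pair_counts(words)
# Only pairs seen at least `min_frequency` times remain merge candidates
mergeable = {pair: n for pair, n in counts.items() if n >= min_frequency}
print(mergeable)
```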

<p>The code below calls the <code>train_from_iterator()</code> method of the tokenizer object to train it on the batches yielded by <code>get_training_corpus()</code>. <br>
The trainer object configured above is passed to <code>train_from_iterator()</code> through the <code>trainer</code> argument.</p>

In [None]:
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)

<div>
<p>The code below performs the following tasks:</p>
<ul>
<li>Get the token IDs of the "[CLS]" and "[SEP]" tokens using the <code>tokenizer.token_to_id()</code> method.</li>
<li>Set <code>tokenizer.post_processor</code> to a <code>processors.TemplateProcessing()</code> object, which adds the special tokens around the input sequence(s).</li>
<li>The <code>single</code> and <code>pair</code> parameters specify the templates used for a single sequence and for a pair of sequences respectively. In a template, <code>$A</code> and <code>$B</code> stand for the first and second sequence, and the number after the colon is the token type ID assigned to that part.</li>
<li>The <code>special_tokens</code> parameter lists the special tokens used in the templates together with their token IDs.</li>
</ul>
</div>

In [None]:
cls_token_id = tokenizer.token_to_id("[CLS]")
sep_token_id = tokenizer.token_to_id("[SEP]")

tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS]:0 $A:0 [SEP]:0",
    pair="[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
    special_tokens=[("[CLS]", cls_token_id), ("[SEP]", sep_token_id)],
)
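
What the two templates produce can be sketched in plain Python. The `apply_template` helper below is hypothetical and only mirrors the <code>single</code> and <code>pair</code> templates from the cell above; it works on token strings rather than IDs and is not how the library implements post-processing.

```python
def apply_template(tokens_a, tokens_b=None):
    """Mirror '[CLS]:0 $A:0 [SEP]:0' and '[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1'."""
    # Single sequence: [CLS] A [SEP], all with type ID 0
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    type_ids = [0] * len(tokens)
    if tokens_b is not None:
        # Pair: append B and a second [SEP], both with type ID 1
        tokens += list(tokens_b) + ["[SEP]"]
        type_ids += [1] * (len(tokens_b) + 1)
    return tokens, type_ids


single_tokens, single_types = apply_template(["rua", "a"])
pair_tokens, pair_types = apply_template(["rua", "a"], ["av", "b"])
print(single_tokens, single_types)
print(pair_tokens, pair_types)
```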

<p>The code below does the following:</p>
<ul>
  <li>Setting the <code>WordPiece</code> decoder for the <code>tokenizer</code> object</li>
  <li>Saving the <code>tokenizer</code> object in <code>JSON</code> format as <code>tokenizer.json</code></li>
  <li>Creating a new instance of <code>BertTokenizerFast</code> from the in-memory <code>tokenizer</code> object and assigning it to a variable <code>wrapped_tokenizer</code></li>
  <li>Saving the <code>wrapped_tokenizer</code> in a folder named 'Custom_Tokenizer'</li>
</ul>


In [None]:
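from transformers import BertTokenizerFast

# Decoder that merges "##" continuation pieces back into whole words
tokenizer.decoder = decoders.WordPiece(prefix="##")
# Serialize the full tokenizer pipeline to a single JSON file
tokenizer.save("tokenizer.json")
# Wrap the trained tokenizer so it can be used with transformers
wrapped_tokenizer = BertTokenizerFast(tokenizer_object=tokenizer)
wrapped_tokenizer.save_pretrained('Custom_Tokenizer')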
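
The job of the WordPiece decoder set above can be sketched in a few lines: pieces that start with the "##" prefix are glued to the previous piece, and all other pieces start a new word. The `decode_wordpiece_sketch` helper is hypothetical and leaves out the extra whitespace cleanup that the library's decoder can also apply.

```python
def decode_wordpiece_sketch(tokens, prefix="##"):
    """Join WordPiece tokens back into text, merging continuation pieces."""
    text = ""
    for token in tokens:
        if token.startswith(prefix):
            # Continuation piece: append without a space
            text += token[len(prefix):]
        elif text:
            # New word: separate from the previous one with a space
            text += " " + token
        else:
            text = token
    return text


decoded = decode_wordpiece_sketch(["av", "##enida", "paulista", "1000"])
print(decoded)
```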