# Train tokenizer

Before training language models, we need to learn how to tokenize text and math. We are going to use the [pre-trained `roberta-base` tokenizer][1] and extend it with the `[MATH]` and `[/MATH]` special tokens and with math-specific tokens.

 [1]: https://huggingface.co/roberta-base

In [1]:
! hostname

docker.apollo.fi.muni.cz


## The LaTeX format

First, we will train a tokenizer on LaTeX math to learn math-specific tokens.

In [2]:
from tokenizers import Tokenizer, normalizers, pre_tokenizers
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

try:
    # Reuse a previously trained tokenizer if one has been saved.
    latex_tokenizer = Tokenizer.from_file('tokenizer-latex.json')
except Exception:
    # No usable saved tokenizer: train a BPE model on the LaTeX dataset.
    latex_model = BPE(unk_token='[UNK]')
    latex_tokenizer = Tokenizer(latex_model)
    # Split on whitespace only, so that tokens never span several words.
    latex_tokenizer.pre_tokenizer = pre_tokenizers.WhitespaceSplit()
    latex_tokenizer.normalizer = normalizers.Sequence([normalizers.Strip()])
    latex_tokenizer_trainer = BpeTrainer(special_tokens=['[UNK]'])
    latex_tokenizer.train(['dataset-latex.txt'], latex_tokenizer_trainer)
    _ = latex_tokenizer.save('tokenizer-latex.json')

In [3]:
print(latex_tokenizer.encode(r'F(x)&=\int^a_b\frac{1}{3}x^3').tokens)

['F(x)', '&=\\int', '^a', '_b', '\\frac{1}{3}', 'x^3']
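
To make the idea of learning math-specific tokens more concrete, here is a minimal pure-Python sketch of byte-pair encoding (BPE) on whitespace-split words. It is a simplified illustration, not the implementation used by the `tokenizers` library; the function names `merge_pair`, `learn_bpe_merges`, and `apply_bpe` and the three-line corpus are hypothetical and chosen only for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe_merges(lines, num_merges):
    """Learn up to `num_merges` merges (toy version, assumed setup for illustration)."""
    # Splitting on whitespace first means a merge can never span two words.
    words = Counter(tuple(word) for line in lines for word in line.split())
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols occurs in the corpus.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Pick the most frequent pair; ties are broken deterministically.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged_words = Counter()
        for symbols, freq in words.items():
            merged_words[tuple(merge_pair(list(symbols), best))] += freq
        words = merged_words
    return merges


def apply_bpe(word, merges):
    """Tokenize a single word by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


# Hypothetical miniature corpus standing in for dataset-latex.txt.
corpus = [r'\frac{1}{2} x^2', r'\frac{1}{3} x^3', r'\frac{a}{b}']
merges = learn_bpe_merges(corpus, num_merges=50)
print(apply_bpe(r'\frac{1}{3}', merges))
print(apply_bpe('x^4', merges))
```

With enough merges, a word seen during training collapses into a single token, which is consistent with tokens such as `\frac{1}{3}` in the output above. An unseen word such as `x^4` is still covered, but by several shorter tokens.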


## The text + LaTeX format

Next, we will extend the pre-trained `roberta-base` tokenizer with the `[MATH]` and `[/MATH]` special tokens and with the LaTeX math tokens learned above.

In [4]:
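from transformers import AutoTokenizer

try:
    # Reuse the extended tokenizer if it has already been saved.
    text_latex_tokenizer = AutoTokenizer.from_pretrained('./roberta-base-text+latex/')
except Exception:
    # Otherwise start from roberta-base and add the math-related tokens.
    text_latex_tokenizer = AutoTokenizer.from_pretrained('roberta-base', add_prefix_space=True)
    # The surrounding spaces are part of the special tokens as registered here.
    text_latex_tokenizer.add_special_tokens({'additional_special_tokens': [' [MATH] ', ' [/MATH]']})
    # Add the whole vocabulary of the LaTeX tokenizer as new tokens.
    text_latex_tokenizer.add_tokens(list(latex_tokenizer.get_vocab()))
    _ = text_latex_tokenizer.save_pretrained('./roberta-base-text+latex/')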

In [5]:
print(text_latex_tokenizer.tokenize(
    r'The proposed model [MATH] F(x)&=\int^a_b\frac{1}{3}x^3 [/MATH] was trained using ADAM optimizer'))

['ĠThe', 'Ġproposed', 'Ġmodel', ' [MATH] ', 'F(x)', '&=\\int', '^a', '_b', '\\frac{1}{3}', 'x^3', ' [/MATH]', 'Ġwas', 'Ġtrained', 'Ġusing', 'ĠAD', 'AM', 'Ġopt', 'imiz', 'Ġer']
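
When inspecting output like this, it can help to see which parts of an input are plain text and which are math. The following is a minimal sketch that splits a string on the `[MATH]` and `[/MATH]` markers. The helper `split_text_and_math` and the `MATH_SPAN` pattern are hypothetical and are not part of `transformers` or of this notebook; the sketch assumes that math spans are not nested and that every `[MATH]` has a matching `[/MATH]`.

```python
import re

# Hypothetical pattern: a non-greedy match of everything between the two markers.
MATH_SPAN = re.compile(r'\[MATH\](.*?)\[/MATH\]', flags=re.DOTALL)


def split_text_and_math(document):
    """Return a list of (kind, content) pairs, where kind is 'text' or 'math'."""
    segments = []
    position = 0
    for match in MATH_SPAN.finditer(document):
        # Plain text between the previous math span and this one.
        text = document[position:match.start()].strip()
        if text:
            segments.append(('text', text))
        # The LaTeX source between the markers, without the padding spaces.
        segments.append(('math', match.group(1).strip()))
        position = match.end()
    # Any plain text after the last math span.
    tail = document[position:].strip()
    if tail:
        segments.append(('text', tail))
    return segments


example = (r'The proposed model [MATH] F(x)&=\int^a_b\frac{1}{3}x^3 [/MATH] '
           r'was trained using ADAM optimizer')
segments = split_text_and_math(example)
for kind, content in segments:
    print(kind, repr(content))
```

Only the `math` segments are expected to be covered by the added LaTeX tokens; the `text` segments should mostly fall to the original `roberta-base` vocabulary.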
