# Byte-Pair Encoding Algorithm

##### Reading the text file into a variable  

`utf-8` is the dominant text encoding today. It can represent every Unicode code point, which covers almost every character and symbol in the world, as a sequence of one to four bytes.

In [7]:
with open("../data/private/output/combined_text.txt", "r", encoding="utf-8") as f:
    text_sequence = f.read()

len(text_sequence)

1185064
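To make the UTF-8 point concrete, here is a small sketch that encodes a few sample characters and shows how many bytes each one needs. The sample characters are chosen for illustration and are not taken from the dataset. Note that `len()` on a Python string counts characters, not bytes, so the byte length of the file can be larger than the number printed above.

```python
# Sample characters chosen for illustration (not from the dataset).
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # A, e-acute, euro sign, emoji

for ch in samples:
    encoded = ch.encode("utf-8")
    # ord() gives the Unicode code point; encode() gives the UTF-8 bytes.
    print(f"{ch!r}: U+{ord(ch):04X} -> {len(encoded)} byte(s) {list(encoded)}")

# Number of UTF-8 bytes needed for each sample character.
byte_lengths = [len(ch.encode("utf-8")) for ch in samples]
```

A byte-level tokenizer works on these byte values, so a single non-ASCII character can start out as several base tokens.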

##### Set system path so it knows the root directory

In [1]:
import sys
sys.path.append('..')

##### Train the Tokenizer

For now, we use the `BasicTokenizer`, but there are other options in the `minbpe` package.

In [6]:
from minbpe import BasicTokenizer

In [8]:
tokenizer = BasicTokenizer()
tokenizer.train(text_sequence, vocab_size=1024)

100%|██████████| 768/768 [03:01<00:00,  4.23it/s]
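As a rough illustration of what training does, the following is a minimal byte-level BPE sketch. It is not the `minbpe` implementation; the names `count_pairs`, `merge`, and `train_bpe` are chosen for this example. It starts from the 256 possible byte values and adds one new token per merge, which is consistent with the 768 iterations in the progress bar above (1024 - 256).

```python
from collections import Counter


def count_pairs(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, vocab_size):
    """Learn `vocab_size - 256` merges on the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))
    merges = {}
    vocab = {i: bytes([i]) for i in range(256)}  # base vocabulary: raw bytes
    for new_id in range(256, vocab_size):
        pairs = count_pairs(ids)
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        ids = merge(ids, best, new_id)
        merges[best] = new_id
        vocab[new_id] = vocab[best[0]] + vocab[best[1]]
    return merges, vocab


# Toy run: three merges on a short string.
toy_merges, toy_vocab = train_bpe("aaabdaaabac", vocab_size=259)
print(toy_vocab[256])
```

On the toy string the first learned token is `b"aa"`, since that pair is the most frequent one. This sketch recounts all pairs on every iteration, which keeps it short but is slow on large inputs.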


Now we can inspect the vocabulary, which maps each token id to the bytes it encodes.

In [None]:
vocab = tokenizer.vocab
vocab
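Assuming `vocab` maps integer ids to `bytes` objects, as in the upstream `minbpe` code, the raw entries can be hard to read. The sketch below uses a made-up `sample_vocab` (hypothetical entries, not taken from the trained tokenizer) to show one way to render entries as text. A merged token can end in the middle of a multi-byte character, so decoding uses `errors="replace"`.

```python
# Hypothetical entries standing in for tokenizer.vocab (id -> bytes).
sample_vocab = {
    72: b"H",            # a single base byte
    256: b"th",          # a merged token
    257: b"\xe2\x82",    # first two bytes of a three-byte character
}

# Decode each entry for display; invalid or partial sequences become U+FFFD.
readable = {
    token_id: token_bytes.decode("utf-8", errors="replace")
    for token_id, token_bytes in sample_vocab.items()
}

for token_id, token_text in readable.items():
    print(token_id, repr(token_text))
```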

##### Test the tokenizer

In [14]:
tokenizer.encode('Hello')

[896, 269, 111]

In [12]:
tokenizer.decode([896, 269, 111])

'Hello'

##### Add special tokens
*   Add the special tokens that I will use in the fine-tuning and inference steps.
*   Special tokens are control tokens that are added to the vocab and are not treated like regular text.

In [19]:
max_vocab_id = list(vocab.keys())[-1]  # Get the id of the last token in the vocab
tokenizer.special_tokens = {
    '<|startoftext|>': max_vocab_id + 1,  # placed at the beginning of a prompt to signal that a new, independent piece of text is starting
    '<|separator|>': max_vocab_id + 2,  # separates different parts of the input: "[Instruction] <|separator|> [User's Question]"
    '<|endoftext|>': max_vocab_id + 3,  # signals the end of a coherent passage; the model stops generating when it produces this
    '<|unk|>': max_vocab_id + 4,  # replaces an unknown word that is not in the vocab during inference
    '<|padding|>': max_vocab_id + 5  # pads all sequences in a batch to the same length; the model ignores this token during attention calculations
}
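Every special token needs its own id, and hand-written offsets are easy to get wrong. Here is a minimal sketch that derives consecutive ids from a list of names. The helper `assign_special_token_ids` and the stand-in `demo_vocab` are hypothetical and not part of `minbpe`.

```python
def assign_special_token_ids(vocab, names):
    """Give each name the next free id after the largest id in `vocab`."""
    next_id = max(vocab) + 1
    return {name: next_id + offset for offset, name in enumerate(names)}


# Stand-in for a trained vocab with ids 0..1023 (assumption for this demo).
demo_vocab = {i: b"" for i in range(1024)}

special_names = [
    "<|startoftext|>",
    "<|separator|>",
    "<|endoftext|>",
    "<|unk|>",
    "<|padding|>",
]
demo_special_tokens = assign_special_token_ids(demo_vocab, special_names)

# Sanity check: no two special tokens share an id.
assert len(set(demo_special_tokens.values())) == len(special_names)
print(demo_special_tokens)
```

Using `max(vocab)` instead of the last key also avoids relying on the insertion order of the dictionary.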

##### Encode the text using the Tokenizer we trained

In [21]:
len(tokenizer.encode(text_sequence))

446516
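The two lengths reported in this notebook give a quick measure of how much the tokenizer compresses the text. The short calculation below uses the character count and token count shown above. Treating characters per token as the compression measure is a choice made for this example.

```python
num_chars = 1_185_064   # len(text_sequence) from the first cell
num_tokens = 446_516    # len(tokenizer.encode(text_sequence))

# Average number of characters covered by one token.
chars_per_token = num_chars / num_tokens
print(f"{chars_per_token:.2f} characters per token")
```

With a vocabulary of 1024 tokens, each token covers a little over two and a half characters on average.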

In [23]:
tokenizer.save(file_prefix='../output/tokenizer/my_tokenizer')