What is a transformer?
The Transformer is an encoder-decoder architecture built on the attention mechanism. It takes a sequence of words in a source language as input and returns a sequence of words as output; the output length need not match the input length.
![txt](images/1transformer.png)

What is a tokenizer, and where does it fit into the training pipeline of transformers?
Tokenization is the process of splitting input text into tokens. For example, whitespace tokenization splits a sentence into words. Each unique token is then mapped to a unique ID, and each ID is associated with a unique embedding vector in the embedding space.

The language model takes these embedding vectors as input and predicts token IDs. At inference time, the predicted token IDs are converted back to tokens and then to words.
![alt text](images/2tokenizers.png)
Hence, a tokenizer contains two components:
1. Encoder, which converts the input tokens (words) to token IDs.
2. Decoder, which converts the token IDs predicted by the LM back to words, i.e. the reverse operation of the encoder.
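
The two components can be sketched with a toy whitespace tokenizer. This is a minimal illustration in plain Python, not the Hugging Face API; the corpus, the names `token_to_id`, `encode`, and `decode`, and the choice of reserving ID 0 for `[UNK]` are all assumptions made for the example.

```python
# Toy whitespace tokenizer (hypothetical illustration, not a library API).
corpus = ["the cat sat", "the dog sat down"]

# Build the vocabulary: every unique whitespace-separated token gets an ID.
# "[UNK]" is reserved at ID 0 for tokens that were never seen in the corpus.
token_to_id = {"[UNK]": 0}
for sentence in corpus:
    for token in sentence.split():
        token_to_id.setdefault(token, len(token_to_id))

# Reverse mapping, used by the decoder.
id_to_token = {idx: token for token, idx in token_to_id.items()}


def encode(text):
    """Encoder: text -> list of token IDs."""
    return [token_to_id.get(token, token_to_id["[UNK]"]) for token in text.split()]


def decode(ids):
    """Decoder: list of token IDs -> text (the reverse of encode)."""
    return " ".join(id_to_token[i] for i in ids)


ids = encode("the dog sat")
print(ids)          # [1, 4, 3]
print(decode(ids))  # the dog sat
```

Any word missing from the corpus, such as "bird", maps to the `[UNK]` ID, which is one reason pure word-level vocabularies are limiting.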

The size of the vocabulary determines the number of embedding vectors in the embedding space. How, then, do we build (learn) a vocabulary $V$ from a large corpus of text that contains trillions of tokens?

What would be a reasonable size for the vocabulary we need to build?
1. Arbitrary size: x
2. As small as the number of characters: x
3. Subwords: yes, because this gives a reasonably sized vocabulary over which it is feasible to apply a softmax for probability prediction.
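
The trade-off behind these three options can be made concrete with a rough sketch. The vocabulary sizes and the hidden size `d_model` below are hypothetical round numbers chosen only for illustration, and `output_layer_weights` is a helper introduced here; the point is that the projection feeding the softmax grows linearly with the vocabulary, while a character-level vocabulary makes sequences much longer.

```python
def output_layer_weights(vocab_size, d_model):
    """Weights in the final projection feeding the softmax: one row per vocabulary entry."""
    return vocab_size * d_model


# Sequence length: characters give long sequences, words give short ones.
sentence = "tokenization determines vocabulary size"
char_tokens = [ch for ch in sentence if ch != " "]
word_tokens = sentence.split()
print(f"character-level length: {len(char_tokens)}, word-level length: {len(word_tokens)}")

# Hypothetical vocabulary sizes and hidden size, for illustration only.
d_model = 512
vocab_sizes = {"characters": 256, "subwords": 30_000, "words": 1_000_000}
for name, size in vocab_sizes.items():
    print(f"{name:>10}: {output_layer_weights(size, d_model):,} output weights")
```

Under these assumed numbers, subwords sit between the two extremes: far fewer softmax weights than a word-level vocabulary, and far shorter sequences than a character-level one.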

Question 1: What are the challenges in building a vocabulary for a given large corpus of text?
![alt text](images/3challengestokenization.png)

Note: Whitespace tokenization is a form of pre-tokenization: we split each sentence into words, then add all the unique words to the vocabulary. We can also add special tokens such as `<go>`, `<stop>`, `<mask>`, `<sep>`, and `<cls>` to the vocabulary, depending on the downstream task and the choice of architecture (GPT/BERT).




### 2. HF-Tokenizers
Tokenization Algorithms
What is on our wishlist for a tokenization algorithm?
1. Moderate-size vocabulary
2. Efficiently handle unknown words during inference
3. Be language agnostic
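
To illustrate the second wish, here is a minimal sketch of how a subword vocabulary can cover a word it has never seen as a whole. It uses greedy longest-match-first segmentation, similar in spirit to how WordPiece splits words (BPE instead replays its learned merges). The function `segment` and the tiny vocabulary are hypothetical, made up for this example.

```python
def segment(word, vocab):
    """Greedy longest-match-first split of `word` into pieces found in `vocab`.

    Returns ["[UNK]"] if some part of the word cannot be covered at all.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:  # not even a single character matched
            return ["[UNK]"]
        pieces.append(word[start:end])
        start = end
    return pieces


# Hypothetical subword vocabulary.
vocab = {"token", "iz", "ation", "un", "s", "e", "r"}
print(segment("tokenization", vocab))  # ['token', 'iz', 'ation']
print(segment("xyz", vocab))           # ['[UNK]']
```

The word "tokenization" is not in the vocabulary, yet it is represented without falling back to `[UNK]`, and the vocabulary stays small.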
What are the different tokenization algorithms available?
![alt text](images/4tokenizercategories.png)

Pre-processing pipeline: the HF `tokenizers` module provides a class that encapsulates all of these components.
![alt text](images/5image.png)

We can customize each step of the tokenizer pipeline:
1. Normalizer: Lowercase, StripAccents
2. Pre-tokenizer: Whitespace, Regex, BERT-like, etc.
3. Algorithm (Model): BPE, WordPiece, etc.
4. Post-processor: insert model-specific tokens.
![alt text](images/image.png)
![alt text](images/image-1.png)

![alt text](images/image-2.png)
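
The four pipeline stages can be mimicked in plain Python to see what each one contributes. This is a toy analogue under stated assumptions, not the `tokenizers` API: the function names are invented here, the model step is a simple vocabulary lookup standing in for BPE or WordPiece, and the post-processor adds BERT-style `[CLS]`/`[SEP]` tokens as one possible choice.

```python
import re
import unicodedata


def normalize(text):
    """Normalizer: lowercase, then strip accents (drop combining marks)."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def pre_tokenize(text):
    """Pre-tokenizer: split into words and standalone punctuation."""
    return re.findall(r"\w+|[^\w\s]+", text)


def model(words, vocab):
    """Model (stand-in): keep known words, map everything else to [UNK]."""
    return [w if w in vocab else "[UNK]" for w in words]


def post_process(tokens):
    """Post-processor: add model-specific special tokens (BERT-style here)."""
    return ["[CLS]"] + tokens + ["[SEP]"]


vocab = {"cafe", "is", "open", "!"}  # hypothetical vocabulary
tokens = post_process(model(pre_tokenize(normalize("Café is OPEN today!")), vocab))
print(tokens)  # ['[CLS]', 'cafe', 'is', 'open', '[UNK]', '!', '[SEP]']
```

Each stage is a separate function, so swapping one step (for example a different pre-tokenizer) leaves the others untouched, which mirrors how the real pipeline is customized.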

We will use the `bookcorpus` dataset to train our tokenizer. Make sure you have sufficient memory: loading the dataset takes approximately 5 GB.

In [None]:
# Build and Train a Tokenizer
from datasets import load_dataset

# building blocks of the tokenizer pipeline
from tokenizers import Tokenizer
from tokenizers.normalizers import Lowercase
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

In [None]:
ds = load_dataset('bookcorpus', split='all')
print(ds)

In [None]:
num_samples = 5
for idx, sample in enumerate(ds[0:num_samples]['text']):
    print(f"{idx} : {sample}")

In [None]:
# Let's define the pipeline of the tokenizer class
# 1. Normalizer : Lowercase
# 2. Pre-tokenizer : Whitespace
# 3. Tokenizer model : BPE
# 4. Post Processor : x

In [None]:
# initialize the tokenizer model: BPE with a special unknown token, which the model falls back to for unseen input
model = BPE(unk_token="[UNK]")
tokenizer = Tokenizer(model)

# addition of normalizer and pretokenizer to the pipeline
tokenizer.normalizer = Lowercase()
tokenizer.pre_tokenizer = Whitespace()
# creating the trainer for the BPE with vocab_size and special tokens
trainer = BpeTrainer(vocab_size=12000, special_tokens=["[PAD]", "[UNK]"], continuing_subword_prefix='##')

In [None]:
# yield the 'text' column in batches instead of materializing the whole dataset at once
def get_examples(batch_size=1000):
    for i in range(0, len(ds), batch_size):
        yield ds[i : i + batch_size]['text']
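
The batching pattern behind `get_examples` can be tried in isolation on a plain list. This is a minimal sketch; `batched_texts` and the sample sentences are hypothetical stand-ins for the dataset, used only to show that the generator yields consecutive slices and that the last batch may be shorter.

```python
def batched_texts(texts, batch_size):
    """Yield consecutive slices of `texts`, each holding at most `batch_size` items."""
    for start in range(0, len(texts), batch_size):
        yield texts[start : start + batch_size]


texts = [f"sentence {i}" for i in range(5)]  # stand-in for the dataset's text column
batches = list(batched_texts(texts, batch_size=2))
print(batches)
# [['sentence 0', 'sentence 1'], ['sentence 2', 'sentence 3'], ['sentence 4']]
```

Because the function is a generator, only one batch is held in memory at a time while the consumer iterates over it.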

In [None]:
from multiprocessing import cpu_count
print(cpu_count())

In [None]:
tokenizer.train_from_iterator(get_examples(batch_size=1000), trainer=trainer, length=len(ds))
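
To make concrete what BPE training does, here is a toy sketch of the merge-learning loop in plain Python. It is not the library's implementation: it ignores the `##` continuing-subword prefix and special tokens, breaks ties by first occurrence (the library may break ties differently), and the word counts are invented for the example.

```python
from collections import Counter


def learn_bpe(word_counts, num_merges):
    """Learn up to `num_merges` merge rules from a {word: frequency} mapping."""
    # Represent each word as a tuple of symbols, starting from single characters.
    words = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        # Most frequent pair; ties go to the pair seen first.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Replace each occurrence of the best pair with the merged symbol.
        merged_words = {}
        for symbols, count in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            merged_words[key] = merged_words.get(key, 0) + count
        words = merged_words
    return merges


# Hypothetical word frequencies.
merges = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=3)
print(merges)  # [('e', 's'), ('es', 't'), ('l', 'o')]
```

Each learned merge adds one new symbol to the vocabulary, so the number of merges controls the final vocabulary size, which is the role `vocab_size` plays in the trainer.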


In [None]:
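# save the vocabulary and merge rules (hopper-vocab.json, hopper-merges.txt) into the 'model' folder
tokenizer.model.save('model', prefix='hopper')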

In [None]:
# first 10 merges
with open('model/hopper-merges.txt', 'r') as file:
    row = 0
    num_lines = 10
    for line in file.readlines():
        print(line)
        row += 1
        if row >= num_lines:
            break

In [None]:
# last 10 merges
with open('model/hopper-merges.txt', 'r') as file:
    row = 0
    num_lines = 10
    for line in reversed(file.readlines()):
        print(line)
        row += 1
        if row >= num_lines:
            break

In [None]:
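# let's look at the vocabulary
# number of merges (skipping the "#version" header line, if the file has one)
with open('model/hopper-merges.txt', 'r') as file:
    lines = [line for line in file if not line.startswith('#version')]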

print(f"Number of merges: {len(lines)}")

print(f"Vocab Size: {tokenizer.get_vocab_size()}")

vocab = tokenizer.get_vocab()
# sort vocab by token IDs
vocab_sorted = sorted(vocab.items(), key=lambda item: item[1])