<a href="https://colab.research.google.com/github/anhatsingh/hugging-face-learning/blob/main/Wk2_Tokenizers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TOKENIZATION
This module covers how to tokenize a simple dataset into a form that an LLM can consume (essentially a list of numbers).

Notebook average execution time: 33min + 25min + 25min = 1hr 23mins

## Install the Tokenizers Package

In [5]:
!pip install tokenizers datasets &> /dev/null

from tokenizers import Tokenizer
from datasets import load_dataset
from pprint import pprint

ds = load_dataset('bookcorpus', split='all', trust_remote_code=True)
pprint(ds)

Dataset({
    features: ['text'],
    num_rows: 74004228
})


In [6]:
# check if the dataset is loaded correctly and look at the initial few lines
num_samples = 5
for idx, sample in enumerate(ds[0:num_samples]['text']):
  print(f'{idx} : {sample}')

0 : usually , he would be tearing around the living room , playing with his toys .
1 : but just one look at a minion sent him practically catatonic .
2 : that had been megan 's plan when she got him dressed earlier .
3 : he 'd seen the movie almost by mistake , considering he was a little young for the pg cartoon , but with older cousins , along with her brothers , mason was often exposed to things that were older .
4 : she liked to think being surrounded by adults and older kids was one reason why he was a such a good talker for his age .


The usual steps in a tokenization pipeline:

Input Text $→$ Normalization $→$ Pre-Tokenizer $→$ Algorithm (model) $→$ PostProcessing $→$ Output tokens

**Example:**<br>
1. **Input Text**: Plain text, stored as a list of sentences, to be passed to the tokenizer.
2. **Normalization**: Converts the text to a standard form (for example, lowercasing everything or removing accent marks).
3. **Pre-Tokenizer**: A basic rule (such as splitting sentences on the " " character) that produces an initial list of words.
4. **Algorithm**: An algorithm (such as Byte-Pair Encoding, WordPiece or SentencePiece) that converts the pre-tokenized words into more **useful** subword tokens.
5. **PostProcessing**: Adds special tokens such as "[CLS]" or "[SEP]".
6. **Output tokens**: A list of tokens (and their ids) that can be passed to the model to train on!
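
To make the stages concrete, here is a minimal pure-Python sketch of the same flow. It does not use the `tokenizers` library: the vocabulary `TOY_VOCAB` and all function names are made up for illustration, and the "model" step is a plain word lookup rather than a trained subword algorithm.

```python
import re

# Hypothetical toy vocabulary, not a trained one.
TOY_VOCAB = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "hello": 3, "world": 4, "!": 5}

def normalize(text):
    """Normalization: lowercase the text."""
    return text.lower()

def pre_tokenize(text):
    """Pre-tokenization: split into words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def apply_model(words, vocab):
    """Model: a plain lookup here; words outside the vocabulary become [UNK]."""
    return [w if w in vocab else "[UNK]" for w in words]

def post_process(tokens):
    """Post-processing: wrap the sentence in special tokens."""
    return ["[CLS]"] + tokens + ["[SEP]"]

def toy_tokenize(text, vocab=TOY_VOCAB):
    tokens = post_process(apply_model(pre_tokenize(normalize(text)), vocab))
    ids = [vocab[t] for t in tokens]
    return tokens, ids

tokens, ids = toy_tokenize("Hello there World!")
print(tokens)  # ['[CLS]', 'hello', '[UNK]', 'world', '!', '[SEP]']
print(ids)     # [1, 3, 0, 4, 5, 2]
```

The word "there" is not in the toy vocabulary, so it maps to `[UNK]`; a trained subword model would instead break it into smaller known pieces.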

## Some useful Tokenizer methods:
1. `add_special_tokens(tokens)`, where `tokens` is a list of `str` or `AddedToken`
2. `add_tokens()`
3. `enable_padding()` and `enable_truncation()`
4. `encode(seq, pair, is_pretokenized)` and `encode_batch()`
5. `decode()` and `decode_batch()`
6. `from_file(path)` to load a local JSON file
7. `from_pretrained(identifier)` to load a tokenizer from the Hub
8. `get_vocab()` and `get_vocab_size()`
9. `id_to_token()` and `token_to_id()`
10. `post_process()`
11. `train(files)` and `train_from_iterator(dataset)`

In [11]:
from tokenizers.normalizers import Lowercase
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.models import BPE

model = BPE(unk_token="[UNK]")
tokenizer = Tokenizer(model)

tokenizer.normalizer = Lowercase()
tokenizer.pre_tokenizer = Whitespace()

## Trainer

We have passed the required information to initialize the tokenizer. Now we need to train the tokenizer (i.e. run the above steps in order on the training dataset) so that we get our desired output tokens.

Assuming that we want an output vocabulary of 32k tokens, we run the following code:

In [12]:
from tokenizers.trainers import BpeTrainer

trainer = BpeTrainer(vocab_size=32000,
                     special_tokens=['[PAD]', '[UNK]'],
                     continuing_subword_prefix="##"
                     )

For batch processing, create a generator function that yields the dataset in batches:

In [13]:
def get_batch(batch_size=1000):
  for i in range(0, len(ds), batch_size):
    yield ds[i: i+batch_size]['text']
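
The same batching pattern can be tried on a plain Python list to see what the generator yields. This is only an illustrative sketch; `iter_batches` and `toy_rows` are hypothetical names, and a list stands in for the dataset.

```python
def iter_batches(rows, batch_size):
    """Yield consecutive slices of `rows`, each at most `batch_size` long."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

toy_rows = [f"sentence {i}" for i in range(7)]
batches = list(iter_batches(toy_rows, batch_size=3))

print([len(b) for b in batches])  # [3, 3, 1]
print(batches[-1])                # ['sentence 6']
```

The last batch is simply shorter when the number of rows is not a multiple of the batch size, so no row is dropped.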

### Train the tokenizer from the data

In [14]:
tokenizer.train_from_iterator(
      iterator = get_batch(batch_size=10000),
      trainer = trainer,
      length = len(ds)
    )
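
For intuition about what BPE training does, below is a minimal pure-Python sketch of the merge-learning loop on a tiny made-up word list. It is not the library's implementation: `learn_bpe` is a hypothetical helper, it ignores the `##` prefix and special tokens, and its tie-breaking rule is an arbitrary choice made only to keep the result deterministic.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn `num_merges` BPE merges from a list of words."""
    # Each distinct word is a tuple of single-character symbols with a frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Pick the most frequent pair; ties are broken by comparing the pairs.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

toy_words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges, toy_vocab = learn_bpe(toy_words, num_merges=4)
print(merges)  # [('s', 't'), ('e', 'st'), ('o', 'w'), ('l', 'ow')]
```

After four merges the frequent word "low" has become a single symbol, while rarer words are still split into several pieces. Training with `vocab_size=32000` keeps merging in the same spirit until the vocabulary reaches the requested size.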

#### Save the trained model to disk

This function saves only the model to disk, i.e. the vocabulary and the merge rules. No extra information (like the normalizer, pre-tokenizer, special tokens list, etc.) is stored. For that, look at the sections ahead.

`model` is the directory name, `hopper` is the file-prefix for the files which will be saved into this directory.

In [16]:
import os

os.makedirs('model', exist_ok=True)  # the target directory must exist before saving
tokenizer.model.save('model', prefix='hopper')

['model/hopper-vocab.json', 'model/hopper-merges.txt']

#### The trained vocabulary

Get the trained vocabulary.

In [17]:
vocab = tokenizer.get_vocab()
# vocab_sorted = sorted(vocab.items(), key = lambda item: item[1])
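
`get_vocab()` returns a plain dictionary mapping each token to its id, so it can be sorted or inverted with ordinary Python. A small sketch on a made-up dictionary (the tokens and ids in `toy_vocab` are hypothetical, not taken from the trained tokenizer):

```python
# Hypothetical token -> id mapping in the same shape as get_vocab() returns.
toy_vocab = {"the": 3, "[PAD]": 0, "##ing": 4, "[UNK]": 1, "a": 2}

# Sort by id to read the vocabulary in the order the ids were assigned.
vocab_sorted = sorted(toy_vocab.items(), key=lambda item: item[1])

# Invert the mapping to look up a token from its id.
id_to_token = {idx: tok for tok, idx in toy_vocab.items()}

print(vocab_sorted[:2])  # [('[PAD]', 0), ('[UNK]', 1)]
print(id_to_token[4])    # ##ing
```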

## Encoding

Now that we have learned our vocabulary, let us encode some sample English text with it.

In [18]:
sample = ds[0]['text']
print(f'Sample: {sample}')

encoding = tokenizer.encode(sample)
print(encoding)

Sample: usually , he would be tearing around the living room , playing with his toys .
Encoding(num_tokens=16, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])


Any encoding will contain the following useful attributes:

1. `ids`: the integer ids that are actually passed to an ML model.

2. `tokens`: the actual tokens from the vocabulary that make up the sentence.

3. `type_ids`: used by models like BERT that take sentence pairs; they mark which sentence each token belongs to (0 for the first, 1 for the second).

4. `attention_mask`: a list of 0s and 1s telling the model which tokens to pay attention to and which to ignore. For example, if we have added padding to the text, the padding tokens get 0, as they add no value for us.

In [20]:
import pandas as pd

token_ids = encoding.ids
tokens = encoding.tokens
type_ids = encoding.type_ids
attention_mask = encoding.attention_mask

out_dict = {'tokens': tokens, "ids": token_ids, "type_ids": type_ids, "attention_mask": attention_mask}
df = pd.DataFrame.from_dict(out_dict)
df

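| | tokens | ids | type_ids | attention_mask |
|---|---|---|---|---|
| 0 | usually | 2462 | 0 | 1 |
| 1 | , | 19 | 0 | 1 |
| 2 | he | 149 | 0 | 1 |
| 3 | would | 277 | 0 | 1 |
| 4 | be | 162 | 0 | 1 |
| 5 | tearing | 6456 | 0 | 1 |
| 6 | around | 422 | 0 | 1 |
| 7 | the | 131 | 0 | 1 |
| 8 | living | 1559 | 0 | 1 |
| 9 | room | 536 | 0 | 1 |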


### Visualize the encoded sentence

Let us visualize the sentence in the form of tokens encoded.

In [21]:
from tokenizers.tools import EncodingVisualizer
vs = EncodingVisualizer(tokenizer=tokenizer)
vs(text=sample)

### Padding and Batch encoding

When we have many sentences, we encode them as a batch and add padding so that they all have the same length. We use the following methods for that:

In [22]:
samples = ds[0:4]["text"]

batch_encoding = tokenizer.encode_batch(samples)
pprint(batch_encoding)

[Encoding(num_tokens=16, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),
 Encoding(num_tokens=14, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),
 Encoding(num_tokens=14, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),
 Encoding(num_tokens=42, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])]


Note that no padding has been added to the above sentences: the encodings have a different length for each sentence. Let us add padding now.

We assume that our context window size is 512, so we truncate the sentences to a maximum of 512 tokens.

In [23]:
tokenizer.enable_padding(direction='right',
                         pad_id=0,
                         pad_type_id=0,
                         pad_token = '[PAD]',
                         length = None, # if we have a specific context window size, we can put it here, otherwise, it defaults to the max(lengths of the sentences in the sample).
                         pad_to_multiple_of = None
                         )

# Maximum number of tokens a sentence can have. Padding only adds tokens; to drop tokens beyond the context window, enable truncation.
tokenizer.enable_truncation(max_length=512)

In [24]:
batch_encoding = tokenizer.encode_batch(samples)
pprint(batch_encoding)

[Encoding(num_tokens=42, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),
 Encoding(num_tokens=42, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),
 Encoding(num_tokens=42, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),
 Encoding(num_tokens=42, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])]


Now observe that the length of the encoding is the same for all sentences: it equals the length of the longest sentence in the batch, which is 42.
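
What padding and truncation do to a batch can be sketched in a few lines of plain Python. This is an illustration of the idea only, not the library's code; `pad_and_truncate` and the ids in `toy_batch` are made up.

```python
def pad_and_truncate(batch_ids, pad_id=0, max_length=None):
    """Truncate each sequence to `max_length`, then right-pad to the longest one."""
    if max_length is not None:
        batch_ids = [ids[:max_length] for ids in batch_ids]
    target = max(len(ids) for ids in batch_ids)
    padded, masks = [], []
    for ids in batch_ids:
        n_pad = target - len(ids)
        padded.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        masks.append([1] * len(ids) + [0] * n_pad)
    return padded, masks

toy_batch = [[5, 6, 7], [8], [9, 10, 11, 12, 13]]
padded, masks = pad_and_truncate(toy_batch, pad_id=0, max_length=4)

print(padded)  # [[5, 6, 7, 0], [8, 0, 0, 0], [9, 10, 11, 12]]
print(masks)   # [[1, 1, 1, 0], [1, 0, 0, 0], [1, 1, 1, 1]]
```

Truncation runs first, so the longest sequence after truncation sets the padded length, which mirrors the behavior seen above when `length=None`.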

## Save everything to disk

Earlier we saved only the model, i.e. the generated vocabulary. The call below saves all the parameters of the tokenizer, such as `added_tokens`, the `model` and its details, the `normalizer`, the `pre_tokenizer`, etc.

In [25]:
tokenizer.save('hopper.json')

In [26]:
import json
with open('hopper.json', 'r') as file:
  json_data = json.load(file)

pprint(json_data, depth=1)

{'added_tokens': [...],
 'decoder': None,
 'model': {...},
 'normalizer': {...},
 'padding': {...},
 'post_processor': None,
 'pre_tokenizer': {...},
 'truncation': {...},
 'version': '1.0'}


## Load the saved tokenizer from the disk

Use the `from_file()` method of the Tokenizer class.

In [27]:
# from_file() is a static method, so no placeholder Tokenizer instance is needed
trained_tokenizer = Tokenizer.from_file('hopper.json')

Check that it worked:

In [29]:
text = ds[0]['text']
tokens = trained_tokenizer.encode(text).tokens
print(tokens)

['usually', ',', 'he', 'would', 'be', 'tearing', 'around', 'the', 'living', 'room', ',', 'playing', 'with', 'his', 'toys', '.']


## BERT-Like Tokenizers

Let us build a BERT-like tokenizer.
It adds the special token `[CLS]` at the start of the input and `[SEP]` after a sentence.

In [30]:
bert_tokenizer = Tokenizer(BPE(unk_token='[UNK]'))
bert_tokenizer.normalizer = Lowercase()
bert_tokenizer.pre_tokenizer = Whitespace()

bert_trainer = BpeTrainer(vocab_size=32000,
                          special_tokens = ['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]'],
                          continuing_subword_prefix='##'
                          )

Up to this point, everything is the same as before, except that we have added some more special tokens to our vocabulary.

Let us now define how these tokens are to be added to the sentences. We just need to define the post_processor of our Tokenizer class.

In [31]:
from tokenizers.processors import TemplateProcessing

bert_tokenizer.post_processor = TemplateProcessing(single = "[CLS] $0 [SEP]",
                                                   pair = "[CLS] $A [SEP] $B:1",
                                                   special_tokens = [
                                                       ('[CLS]', 2),
                                                       ('[SEP]', 3)
                                                   ]
                                                   )
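
To see how such a template turns into tokens and type ids, here is a small pure-Python sketch. It is not the library's parser: `apply_template` is a hypothetical helper that understands only `$A`, `$B`, an optional `:type_id` suffix, and literal special tokens.

```python
def apply_template(template, seq_a, seq_b=None):
    """Expand a whitespace-separated template into tokens and type ids."""
    tokens, type_ids = [], []
    for piece in template.split():
        # "$B:1" -> name "$B", type id 1; a missing suffix means type id 0.
        name, _, suffix = piece.partition(":")
        type_id = int(suffix) if suffix else 0
        if name == "$A":
            piece_tokens = seq_a
        elif name == "$B":
            piece_tokens = seq_b
        else:
            piece_tokens = [name]  # anything else is a literal special token
        tokens.extend(piece_tokens)
        type_ids.extend([type_id] * len(piece_tokens))
    return tokens, type_ids

pair_tokens, pair_type_ids = apply_template(
    "[CLS] $A [SEP] $B:1", ["we", "go"], ["long", "way", "!"]
)
print(pair_tokens)    # ['[CLS]', 'we', 'go', '[SEP]', 'long', 'way', '!']
print(pair_type_ids)  # [0, 0, 0, 0, 1, 1, 1]
```

With the pair template used above, no `[SEP]` follows the second sentence. A template such as `"[CLS] $A [SEP] $B:1 [SEP]:1"` would add one, if that is the layout the model expects.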

#### Train the BERT-like tokenizer

In [32]:
bert_tokenizer.train_from_iterator(
    get_batch(batch_size=10000),
    trainer = bert_trainer,
    length = len(ds)
  )

#### See the Encoding

First, let us encode a single sentence:

In [33]:
text = "All these are so simple to do in HF. Let's do more"
encoded = bert_tokenizer.encode(text)

tokens = encoded.tokens
ids = encoded.ids
out_dict = {'tokens': tokens, 'ids': ids}
pprint(out_dict, depth=2, compact=True)

{'ids': [2, 270, 956, 336, 231, 2534, 141, 206, 157, 56, 98, 24, 462, 17, 67,
         206, 387, 3],
 'tokens': ['[CLS]', 'all', 'these', 'are', 'so', 'simple', 'to', 'do', 'in',
            'h', '##f', '.', 'let', "'", 's', 'do', 'more', '[SEP]']}


A pair of sentences:

In [34]:
text = "All these are so simple to do in HF. Let's do more"
pair = "We have a long way to go!"
encoded = bert_tokenizer.encode(text, pair)

tokens = encoded.tokens
ids = encoded.ids
out_dict = {'tokens': tokens, 'ids': ids}
pprint(out_dict, depth=2, compact=True)

{'ids': [2, 270, 956, 336, 231, 2534, 141, 206, 157, 56, 98, 24, 462, 17, 67,
         206, 387, 3, 214, 250, 49, 490, 415, 141, 260, 12],
 'tokens': ['[CLS]', 'all', 'these', 'are', 'so', 'simple', 'to', 'do', 'in',
            'h', '##f', '.', 'let', "'", 's', 'do', 'more', '[SEP]', 'we',
            'have', 'a', 'long', 'way', 'to', 'go', '!']}


## Decoding

After our LLM produces its output as token ids, we need to convert those ids back into text. For this we need a decoder.

The special tokens need to be removed and the subwords merged before the final result is shown to the end user.

In [35]:
plain_tokens = bert_tokenizer.decode(ids)
print(plain_tokens)

all these are so simple to do in h ##f . let ' s do more we have a long way to go !


Here we see that the special tokens are removed, but the subwords are still not merged together!

We have to set a decoder for this:

In [36]:
from tokenizers.decoders import WordPiece

bert_tokenizer.decoder = WordPiece(prefix = "##")

plain_tokens = bert_tokenizer.decode(ids)
print(plain_tokens)

all these are so simple to do in hf. let ' s do more we have a long way to go!
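
The core of this decoding step, gluing `##` pieces back onto the previous token, can be sketched in plain Python. This is only an approximation for illustration: `merge_wordpieces` is a hypothetical helper, and unlike the output above it does not clean up the spaces before punctuation.

```python
def merge_wordpieces(tokens, prefix="##"):
    """Join tokens with spaces, attaching prefixed pieces to the previous token."""
    words = []
    for token in tokens:
        if token.startswith(prefix) and words:
            # Continuation piece: strip the prefix and append to the last word.
            words[-1] += token[len(prefix):]
        else:
            words.append(token)
    return " ".join(words)

decoded = merge_wordpieces(["do", "in", "h", "##f", ".", "let"])
print(decoded)  # do in hf . let
```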


## Pre-Trained Tokenizer Wrapper

The Tokenizer is only used to prepare input for the LLM. Sometimes, apart from the tokens, the model needs additional information such as special tokens and attention masks, so we can wrap our Tokenizer in a wrapper class as follows:

In [37]:
!pip install transformers &> /dev/null

Define the wrapper class

In [38]:
from transformers import PreTrainedTokenizerFast

pt_tokenizer = PreTrainedTokenizerFast(tokenizer_file = 'hopper.json',
                                       unk_token = '[UNK]',
                                       pad_token = '[PAD]',
                                       model_input_names = ['input_ids', 'token_type_ids', 'attention_mask']
                                       )


Encode a single sentence:

In [39]:
model_inputs = pt_tokenizer(text)

# these can now be passed directly to our LLM
pprint(model_inputs, compact=True)

{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'input_ids': [267, 953, 333, 228, 2531, 138, 203, 154, 53, 95, 21, 459, 14, 64,
               203, 384],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}


Encode a pair of sentences:

In [40]:
model_inputs = pt_tokenizer(text, text_pair = pair)

# these can now be passed directly to our LLM
pprint(model_inputs, compact=True)

{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                    1, 1, 1, 1],
 'input_ids': [267, 953, 333, 228, 2531, 138, 203, 154, 53, 95, 21, 459, 14, 64,
               203, 384, 211, 247, 46, 487, 412, 138, 257, 9],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
                    1, 1, 1, 1]}


### Batch Processing
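We can also pass a batch of samples as input. No padding is applied by default, so each sequence keeps its own length; pass `padding=True` to pad to the longest sequence in the batch.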

In [41]:
batch_text = ['I like the book The Psychology of Money', "I enjoyed watching the Transformers movie", "Oh! thanks for this"]

model_inputs = pt_tokenizer(batch_text)
pprint(model_inputs, compact=True)

{'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1]],
 'input_ids': [[54, 281, 131, 1701, 131, 19478, 153, 1564],
               [54, 4096, 1443, 131, 7744, 307, 3760],
               [772, 9, 1767, 200, 254]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0]]}
