<a href="https://colab.research.google.com/github/abdulsamadkhan/Courses-LLM-Lectures/blob/main/TokenizationinHuggingFace.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tokenizer (AutoTokenizer)

`AutoTokenizer` is a core component of the Hugging Face Transformers library. It loads the appropriate tokenizer for a given model checkpoint and handles the text preprocessing steps needed for natural language processing (NLP) tasks:


*   **Tokenization**: Breaking down text into smaller units such as words or subwords.
*   **Encoding**: Converting tokens into numerical IDs suitable for deep learning models.
*   **Decoding**: Transforming numerical IDs back into text.

We will explore two different tokenizers in this notebook:

1.   BERT Tokenizer
2.   ALBERT Tokenizer



Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [1]:
!pip install datasets evaluate transformers[sentencepiece]


## Tokenize by Space

In [4]:
tokenized_text = "This is very interesting class, I get a chance to learn about genAI".split()
print(tokenized_text)

['This', 'is', 'very', 'interesting', 'class,', 'I', 'get', 'a', 'chance', 'to', 'learn', 'about', 'genAI']
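Note that splitting on whitespace leaves punctuation attached to the neighbouring word (`'class,'`). As a minimal sketch, assuming a simple regular-expression rule that is not part of the original notebook, punctuation can be separated into its own tokens:

```python
import re

text = "This is very interesting class, I get a chance to learn about genAI"

# \w+ matches a run of letters, digits or underscores;
# [^\w\s] matches a single character that is neither of those nor whitespace.
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)
```

This is still a word-level scheme: any word not seen during training would have no entry in the vocabulary, which is the problem subword tokenizers are designed to reduce.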


## 1. BERT Tokenizer
Tokenization is the process of splitting text into tokens, which are then mapped to numerical IDs that machine learning models can process. BERT uses a subword scheme known as WordPiece tokenization.

The `bert-base-cased` checkpoint uses WordPiece as its tokenization method.
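To give an intuition for how WordPiece splits a word, here is a rough sketch of greedy longest-match-first splitting. The function name `wordpiece` and the tiny `toy_vocab` are hypothetical and chosen only for illustration; the real tokenizer works against the full vocabulary shipped with the checkpoint and also handles normalization and punctuation.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece-style pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first and shrink until the vocab has it.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # pieces that continue a word carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # nothing fits, so the whole word becomes the unknown token
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary (an assumption for this example only).
toy_vocab = {"Using", "a", "Trans", "##former", "network", "is", "simple"}

print(wordpiece("Transformer", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```

With this toy vocabulary, `Transformer` is split into a word-initial piece and a `##`-prefixed continuation piece, while a word with no matching pieces falls back to `[UNK]`.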

In [18]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

In [19]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


1.    **token_type_ids** is an output of BERT's tokenizer that is specifically useful when the input is a pair of sentences. In BERT, if you're working with two sentences, you concatenate them and place a [SEP] token in between. token_type_ids is then a binary mask identifying the two sentences: it is 0 for the tokens of the first sentence and 1 for the tokens of the second.
2.   **attention_mask** is a binary mask marking which positions hold real tokens and which hold padding, so that the model does not attend to the padding. For the BertTokenizer, 1 indicates a position that should be attended to, while 0 indicates a padded position.
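The two masks can be illustrated without loading a model. The sketch below is a simplified stand-in, not the library's implementation: `encode_pair` is a hypothetical helper that works on already-split tokens and pads to a fixed length.

```python
def encode_pair(tokens_a, tokens_b, max_len):
    """Build BERT-style masks for a sentence pair, padded to max_len."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    if len(tokens) > max_len:
        raise ValueError("max_len is too small for this pair")

    # Segment 0: [CLS], sentence A and its [SEP]. Segment 1: sentence B and its [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # Every real token gets attention.
    attention_mask = [1] * len(tokens)

    # Padding positions get segment 0 and attention 0.
    pad = max_len - len(tokens)
    tokens = tokens + ["[PAD]"] * pad
    token_type_ids = token_type_ids + [0] * pad
    attention_mask = attention_mask + [0] * pad
    return tokens, token_type_ids, attention_mask


tokens, token_type_ids, attention_mask = encode_pair(
    ["Hello", "world"], ["Good", "day"], max_len=10
)
print(tokens)
print(token_type_ids)
print(attention_mask)
```

For a single sentence with no padding, as in the cells that follow, every `token_type_ids` entry is 0 and every `attention_mask` entry is 1.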


In [20]:
tokenizer("Using a Transformer network is simple")

{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [21]:
sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)

print(tokens)

['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']


In [22]:
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

[7993, 170, 13809, 23763, 2443, 1110, 3014]
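Conceptually, `convert_tokens_to_ids` is a vocabulary lookup, and calling the tokenizer directly additionally wraps the sequence in the special tokens `[CLS]` and `[SEP]`. A minimal sketch, using only the token/ID pairs visible in the outputs above (the dictionary here is a hand-built stand-in for the real vocabulary):

```python
# Token-to-ID pairs copied from the outputs above; the real vocabulary is far larger.
vocab = {
    "[CLS]": 101, "[SEP]": 102,
    "Using": 7993, "a": 170, "Trans": 13809, "##former": 23763,
    "network": 2443, "is": 1110, "simple": 3014,
}
id_to_token = {idx: tok for tok, idx in vocab.items()}

tokens = ["Using", "a", "Trans", "##former", "network", "is", "simple"]

# Plain lookup, as in convert_tokens_to_ids.
ids = [vocab[tok] for tok in tokens]

# Calling the tokenizer on raw text also adds [CLS] at the start and [SEP] at the end.
with_special = [vocab["[CLS]"]] + ids + [vocab["[SEP]"]]

print(ids)
print(with_special)
print([id_to_token[i] for i in with_special])
```

This explains why the `input_ids` returned by `tokenizer(...)` are two entries longer than the list returned by `convert_tokens_to_ids`.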


In [23]:
decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
print(decoded_string)

Using a transformer network is simple


In [24]:
inputs = tokenizer("Let's try to tokenize!")

print(tokenizer.decode(inputs["input_ids"]))


[CLS] Let's try to tokenize! [SEP]


## 2. ALBERT Tokenizer

The `albert-base-v1` checkpoint uses SentencePiece for tokenization.

In [25]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("albert-base-v1")
tokens = tokenizer.tokenize("Let's try to tokenize!")
print(tokens)


['▁let', "'", 's', '▁try', '▁to', '▁to', 'ken', 'ize', '!']
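In this output the `▁` character marks the start of a new word, and pieces without it continue the previous word; the text has also been lowercased. As a rough sketch of how such tokens map back to text (a simplification, not the library's actual decoding code), the pieces can be concatenated and each marker turned back into a space:

```python
# Tokens copied from the output above.
tokens = ['▁let', "'", 's', '▁try', '▁to', '▁to', 'ken', 'ize', '!']

# Join all pieces, then turn each word-start marker into a space.
text = "".join(tokens).replace("▁", " ").strip()
print(text)
```

Unlike WordPiece's `##` prefix, which marks continuation pieces, the SentencePiece marker is attached to the piece that begins a word, so whitespace is encoded in the tokens themselves.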


In [26]:
tokens = tokenizer.tokenize("Let's try to tokenize!")
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print(input_ids)


[408, 22, 18, 1131, 20, 20, 2853, 2952, 187]


In [27]:
tokenizer("Let's try to tokenize!")

{'input_ids': [2, 408, 22, 18, 1131, 20, 20, 2853, 2952, 187, 3], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [28]:
inputs = tokenizer("Let's try to tokenize!")

print(tokenizer.decode(inputs["input_ids"]))


[CLS] let's try to tokenize![SEP]


In [17]:
tokenizer.get_vocab()

{'might': 24577,
 '▁transition': 4513,
 'idge': 21292,
 '▁saline': 27905,
 '▁sentient': 29707,
 '▁medal': 1217,
 '▁myth': 7970,
 'atus': 8796,
 'eaux': 14693,
 '▁neue': 24836,
 '▁annexed': 15334,
 'workers': 16355,
 '▁blurred': 21322,
 '▁temples': 9111,
 '▁affected': 4114,
 '▁lies': 1966,
 '▁gera': 23416,
 'lite': 10601,
 'visual': 20893,
 'planned': 27056,
 '▁andrzej': 28949,
 'oides': 11769,
 '▁brazilian': 5053,
 '▁lillian': 21759,
 '▁oral': 9144,
 '▁suggests': 5049,
 '▁showing': 3187,
 '▁instruments': 4507,
 '▁raced': 8066,
 '▁calculations': 19186,
 'learning': 26001,
 '▁companies': 1532,
 '▁salisbury': 16475,
 '▁slam': 7722,
 '▁truck': 2956,
 '▁byzantine': 7880,
 '▁geologic': 23689,
 'good': 3264,
 '▁elder': 5227,
 '▁kamal': 19258,
 '▁literate': 27146,
 'start': 13680,
 '▁connecting': 6440,
 '▁tyrol': 25378,
 '▁sect': 16742,
 'pl': 5727,
 'kal': 6766,
 'ed': 69,
 '▁unpleasant': 16727,
 '▁201112': 14447,
 '253': 23519,
 '▁figured': 5700,
 '▁cattle': 6206,
 '▁adolescent': 17051,
 '▁t