# Tokenizers Overview
#### Tokenizers serve one purpose: to translate text into data the model can process (numeric IDs).

# Word Based
Each word has its own ID (words are separated by spaces and the relevant punctuation).

The two limitations are:
###### 1) Similar words get unrelated IDs ("cat" vs "cats"). 2) The vocabulary is very large, since every single word needs its own token. (This can be mitigated by keeping only the top-k most frequent words and mapping the rest to an out-of-vocabulary (OOV) token, which loses information.)

In [22]:
tokenized_text = "Jim Henson was a puppeteer".split()
print(tokenized_text)

['Jim', 'Henson', 'was', 'a', 'puppeteer']
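
The top-k idea mentioned in the limitations above can be sketched in a few lines. This is only an illustration: the corpus, the value of `k`, and the `[UNK]` name for the out-of-vocabulary token are assumptions made for the example, not something a particular library prescribes.

```python
from collections import Counter

# Toy corpus (an assumption for illustration); any list of sentences would do.
corpus = [
    "the cat sat on the mat",
    "the cats sat on the mats",
    "a cat saw the dog",
]

# Count whitespace-separated words across the corpus.
counts = Counter(word for line in corpus for word in line.split())

# Keep only the k most frequent words; ID 0 is reserved for unknown words.
k = 4
vocab = {"[UNK]": 0}
for word, _ in counts.most_common(k):
    vocab[word] = len(vocab)

def encode(text):
    """Map each word to its ID, falling back to the [UNK] ID."""
    return [vocab.get(word, vocab["[UNK]"]) for word in text.split()]

print(vocab)
print(encode("the cat sat on the mat"))
print(encode("the dog sat"))
```

Words outside the top k ("mat", "dog", and also "cats") all collapse to the same ID, which is exactly the information loss described above.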


# Character Based
Character-based tokenizers split the text into characters, rather than words.

The Two primary Benefits compared to Word Based are:
###### 1) The vocabulary is much smaller. 2) There are far fewer out-of-vocabulary (unknown) tokens, since every word can be built from characters.

The three limitations are:
###### 1) It is less meaningful: a single character doesn't carry much meaning on its own, whereas a word does. 2) We end up with a very large number of tokens for the model to process. 3) Questions arise about how to handle spaces and punctuation.
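
To make the trade-off concrete, here is a minimal sketch that splits the same sentence used earlier at word level and at character level. Treating spaces as tokens is an assumption of this example; a real tokenizer may handle them differently.

```python
text = "Jim Henson was a puppeteer"

# Word-level split: few tokens, but every distinct word needs its own ID.
word_tokens = text.split()

# Character-level split: every character (spaces included here) is a token.
char_tokens = list(text)

# The character vocabulary is just the set of distinct characters.
char_vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}
char_ids = [char_vocab[ch] for ch in char_tokens]

print(len(word_tokens), "word tokens")
print(len(char_tokens), "character tokens")
print(len(char_vocab), "distinct characters in the vocabulary")
```

The vocabulary stays tiny, but the sequence the model has to process becomes several times longer.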


# Subword tokenization
Frequently used words should not be split into smaller subwords, but rare words should be decomposed into meaningful subwords ("annoyingly" becomes "annoying" + "ly"; "positively" becomes "positive" + "ly").

This allows us to have relatively good coverage with small vocabularies, and close to no unknown tokens.

The Two primary Benefits compared to Word Based are:
###### 1) Semantic meaning is kept intact better. 2) It is space efficient (a long word may need only two tokens).
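
The splitting described above can be sketched with a greedy longest-match-first rule, in the style of WordPiece. The vocabulary here is a hypothetical toy one and `split_word` is a name made up for this example; real tokenizers learn their vocabulary from data.

```python
# Hypothetical toy vocabulary; "##" marks a piece that continues a word.
vocab = {"the", "is", "annoying", "positive", "##ly", "[UNK]"}

def split_word(word):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

print(split_word("the"))
print(split_word("annoyingly"))
print(split_word("positively"))
```

A frequent word such as "the" stays whole, while the rarer words are covered by two pieces each without needing their own vocabulary entries.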

More techniques:

###### Byte-level BPE (GPT-2), WordPiece (BERT), SentencePiece or Unigram (several multilingual models)


### Loading and Saving

In [23]:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

In [24]:
# Similar to AutoModel, the AutoTokenizer class picks the proper tokenizer class in the library
# based on the checkpoint name, and can be used directly with any checkpoint:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [25]:
tokenizer("Using a Transformer network is simple")

{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [26]:
tokenizer.save_pretrained("Saved_Tokenizers")

('Saved_Tokenizers\\tokenizer_config.json',
 'Saved_Tokenizers\\special_tokens_map.json',
 'Saved_Tokenizers\\vocab.txt',
 'Saved_Tokenizers\\added_tokens.json',
 'Saved_Tokenizers\\tokenizer.json')

### Encoding
Translating text to numbers is known as encoding. Encoding is done in a two-step process: the tokenization, followed by the conversion to input IDs.

In [27]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)

print(tokens)
# The "##" prefix marks a token that continues the previous piece of the same word

['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']


### From Tokens to Input IDs

In [28]:
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

[7993, 170, 13809, 23763, 2443, 1110, 3014]
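
These IDs are two shorter than the `input_ids` returned earlier by calling the tokenizer directly, which start with 101 and end with 102. The sketch below checks that relationship using only the numbers printed in this notebook; reading 101 as `[CLS]` and 102 as `[SEP]` is an assumption about the bert-base-cased vocabulary.

```python
# Values copied from the cell outputs in this notebook.
ids = [7993, 170, 13809, 23763, 2443, 1110, 3014]
full_input_ids = [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102]

# Assumption: in the bert-base-cased vocabulary, 101 is [CLS] and 102 is [SEP].
CLS_ID, SEP_ID = 101, 102

# Wrapping the plain token IDs in the two special tokens
# reproduces the input_ids from the direct tokenizer call.
with_special = [CLS_ID] + ids + [SEP_ID]
print(with_special == full_input_ids)
```

In other words, `tokenize()` followed by `convert_tokens_to_ids()` gives only the IDs of the text itself, while the direct call also adds the special tokens.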


### Decoding
Decoding is going the other way around: from vocabulary indices, we want to get a string. This can be done with the decode() method as follows:

In [29]:
decoded_string = tokenizer.decode(ids)
print(decoded_string)

Using a Transformer network is simple


Note that the decode method not only converts the indices back to tokens, but also groups together the tokens that were part of the same words to produce a readable sentence. This behavior will be extremely useful when we use models that predict new text (either text generated from a prompt, or for sequence-to-sequence problems like translation or summarization).
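
The grouping step can be illustrated with a small stand-in for `decode()`. This is not the library's implementation, only a sketch: `toy_decode` is a hypothetical helper, and the lookup table is built from the tokens and IDs printed earlier in this notebook.

```python
# Toy ID-to-token table using the tokens and IDs printed earlier in this notebook.
id_to_token = {
    7993: "Using", 170: "a", 13809: "Trans", 23763: "##former",
    2443: "network", 1110: "is", 3014: "simple",
}

def toy_decode(ids):
    """Look up each token, then glue '##' pieces onto the previous word."""
    words = []
    for token in (id_to_token[i] for i in ids):
        if token.startswith("##") and words:
            words[-1] += token[2:]  # continuation piece: append without a space
        else:
            words.append(token)
    return " ".join(words)

print(toy_decode([7993, 170, 13809, 23763, 2443, 1110, 3014]))
```

The pieces "Trans" and "##former" are merged back into "Transformer", so the result matches the readable sentence shown above.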