# Tokenizers

>Tokenizers convert our text inputs to numerical data, because models can only process numbers.

![](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter2/word_based_tokenization.svg)

>There are different ways to split the text, including extra rules for punctuation. For example, we could use whitespace to tokenize the text into words by applying Python’s split() function:

In [2]:
tokenized_text = "Jim Henson was a puppeteer".split()
print(tokenized_text)

['Jim', 'Henson', 'was', 'a', 'puppeteer']


>Each word gets assigned an ID, starting from 0 and going up to the size of the vocabulary. 
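
The ID assignment described above can be sketched in a few lines. This is only an illustration: the helper name `build_vocab` and the second sentence in the toy corpus are made up here, and a real vocabulary would be built from a much larger corpus.

```python
def build_vocab(sentences):
    """Give each distinct word an ID in order of first appearance, starting at 0."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.split():
            if word not in vocab:
                vocab[word] = len(vocab)  # next free ID
    return vocab


# Toy corpus (the second sentence is invented for this example)
corpus = ["Jim Henson was a puppeteer", "Jim was a director"]
vocab = build_vocab(corpus)
print(vocab)

# Look up the IDs of the sentence tokenized earlier
ids = [vocab[word] for word in "Jim Henson was a puppeteer".split()]
print(ids)
```

Each new word extends the dictionary by one entry, so the number of IDs grows with the number of distinct words seen.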

There are character-based and word-based approaches to tokenization.  
**Character-based tokenization**  
Splitting text into characters has two drawbacks:  
1. The model has to process a very large number of tokens, whereas a word would be only a single token with a word-based tokenizer.
2. Each character doesn’t mean a lot on its own, although this varies by language: in **Chinese**, for example, each character carries more information.

**It also has two primary benefits:**  
1. The vocabulary is much smaller.  
2. There are far fewer out-of-vocabulary (unknown) tokens, since every word can be built from characters.
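
To make the trade-off concrete, here is a small sketch that compares word-level and character-level splitting on the sentence used earlier. The variable names and the word "puppet" are chosen only for illustration.

```python
text = "Jim Henson was a puppeteer"

word_tokens = text.split()
char_tokens = list(text)  # every character, spaces included, becomes a token

# 5 word tokens versus 26 character tokens for the same sentence
print(len(word_tokens), len(char_tokens))

# The character vocabulary stays tiny: one ID per distinct character
char_vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}
print(len(char_vocab))

# A word never seen as a whole can still be encoded from known characters
new_word = "puppet"
new_word_ids = [char_vocab[ch] for ch in new_word]
print(new_word_ids)
```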

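**Word-based tokenization**  
Splitting text into words has its own limitations:  
1. With 500,000 words in the English language, mapping each word to an input ID means keeping track of that many IDs.
2. Related words such as “run” and “running”, or “dog” and “dogs”, get unrelated IDs, so the model will not initially see them as similar.
3. Words that are not in the vocabulary are replaced by an unknown token. It’s generally a bad sign if you see that the tokenizer is producing a lot of `[UNK]` or `<unk>` tokens.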
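
A minimal sketch of the unknown-token problem, assuming a hypothetical five-entry vocabulary and a made-up `encode` helper (neither comes from a real tokenizer):

```python
UNK = "[UNK]"

# Hypothetical toy vocabulary; a real one is built from a large corpus
vocab = {UNK: 0, "the": 1, "dog": 2, "can": 3, "run": 4}


def encode(text):
    """Look up each whitespace-separated word, falling back to the unknown token."""
    return [vocab.get(word, vocab[UNK]) for word in text.split()]


print(encode("the dog can run"))     # [1, 2, 3, 4]
print(encode("the dogs can run"))    # [1, 0, 3, 4]  "dogs" is not in the vocabulary
print(encode("the dog is running"))  # [1, 2, 0, 0]
```

Both "dogs" and "running" collapse to the same ID, so the model loses the information that they are related to "dog" and "run".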

![](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter2/character_based_tokenization.svg)

## Subword Tokenization

Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller subwords, but rare words should be decomposed into meaningful subwords.  
For example, “annoyingly” might be considered a rare word and could be decomposed into “annoying” and “ly”.   
Both are likely to appear more frequently as standalone subwords, while the meaning of “annoyingly” is kept by the composite meaning of “annoying” and “ly”.
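
The decomposition can be sketched with a greedy longest-match splitter, loosely modelled on the `##` continuation convention that appears in the BERT outputs later in these notes. The function `subword_tokenize` and the tiny vocabulary are assumptions made for this example; a real tokenizer learns its vocabulary from data.

```python
def subword_tokenize(word, vocab, unk="[UNK]"):
    """Greedily take the longest vocabulary entry that matches at each position."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark pieces that continue a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no known piece fits: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary
toy_vocab = {"annoying", "##ly", "dog", "##s", "run", "##ning"}

print(subword_tokenize("annoyingly", toy_vocab))  # ['annoying', '##ly']
print(subword_tokenize("dogs", toy_vocab))        # ['dog', '##s']
print(subword_tokenize("running", toy_vocab))     # ['run', '##ning']
print(subword_tokenize("cat", toy_vocab))         # ['[UNK]']
```

With subwords, "dog" and "dogs" now share the token "dog", which addresses the second limitation of word-based tokenization.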


>Loading and saving tokenizers is as simple as it is with models. It is based on the same two methods: from_pretrained() and save_pretrained(). These methods will load or save the algorithm used by the tokenizer as well as its vocabulary.

In [3]:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

In [4]:
# Similar to AutoModel, the AutoTokenizer class will grab the proper tokenizer class based on the checkpoint name
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [5]:
tokenizer("Using a Transformer network is simple")

{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [6]:
tokenizer.save_pretrained("directory_on_my_computer")

('directory_on_my_computer/tokenizer_config.json',
 'directory_on_my_computer/special_tokens_map.json',
 'directory_on_my_computer/vocab.txt',
 'directory_on_my_computer/added_tokens.json',
 'directory_on_my_computer/tokenizer.json')

## Encoding
>Translating text to numbers is known as encoding. Encoding is done in a two-step process: the tokenization, followed by the conversion to input IDs.



>The first step splits the text into tokens (words, parts of words, punctuation symbols). The second step converts those tokens into numbers, so we can build a tensor out of them and feed them to the model. To do this, the tokenizer has a vocabulary, which is the part we download when we instantiate it with the from_pretrained() method. Again, we need to use the same vocabulary used when the model was pretrained.

### Tokenization


In [7]:
from transformers import AutoTokenizer

In [8]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [13]:
sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)
# This is a subword tokenizer: "Transformer" is split into two tokens, "Trans" and "##former"
print(tokens)

['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']


In [12]:
# Stored under separate names so that `tokens` still refers to the first sequence
sequence_2 = "I’ve been waiting for a HuggingFace course my whole life.” and “I hate this so much!"
tokens_2 = tokenizer.tokenize(sequence_2)
print(tokens_2)

['I', '’', 've', 'been', 'waiting', 'for', 'a', 'Hu', '##gging', '##F', '##ace', 'course', 'my', 'whole', 'life', '.', '”', 'and', '“', 'I', 'hate', 'this', 'so', 'much', '!']


### From tokens to input IDs

In [14]:
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

[7993, 170, 13809, 23763, 2443, 1110, 3014]
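
These IDs match the `input_ids` returned earlier by `tokenizer("Using a Transformer network is simple")`, except that the direct call also added 101 at the start and 102 at the end. For `bert-base-cased` these are the IDs of the special tokens `[CLS]` and `[SEP]`. A rough sketch of how the two outputs relate, using a hypothetical helper `wrap_with_special_tokens` written only for this illustration (it is not part of the library):

```python
CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] in the bert-base-cased vocabulary


def wrap_with_special_tokens(ids):
    """Rebuild the dictionary the tokenizer call returns for a single, unpadded sentence."""
    input_ids = [CLS_ID] + ids + [SEP_ID]
    return {
        "input_ids": input_ids,
        "token_type_ids": [0] * len(input_ids),  # single sentence: all zeros
        "attention_mask": [1] * len(input_ids),  # no padding: attend to every token
    }


raw_ids = [7993, 170, 13809, 23763, 2443, 1110, 3014]
encoded = wrap_with_special_tokens(raw_ids)
print(encoded)
```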


### Decoding

In [15]:
decoded_string = tokenizer.decode(ids)
decoded_string

'Using a Transformer network is simple'

>Note that the decode method not only converts the indices back to tokens, but also groups together the subword tokens that were part of the same word to produce a readable sentence.
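
The grouping step can be sketched in plain Python. The token and ID pairs below are copied from the outputs above; `toy_decode` is a made-up helper, and the library's `decode` does more than this (for example, it can skip special tokens and clean up spacing).

```python
# Token/ID pairs taken from the tokenize and convert_tokens_to_ids outputs above
vocab = {
    "Using": 7993, "a": 170, "Trans": 13809, "##former": 23763,
    "network": 2443, "is": 1110, "simple": 3014,
}
id_to_token = {token_id: token for token, token_id in vocab.items()}


def toy_decode(ids):
    """Map IDs back to tokens and glue '##' continuation pieces onto the previous word."""
    words = []
    for token in (id_to_token[i] for i in ids):
        if token.startswith("##") and words:
            words[-1] += token[2:]  # continuation piece: append without a space
        else:
            words.append(token)
    return " ".join(words)


decoded = toy_decode([7993, 170, 13809, 23763, 2443, 1110, 3014])
print(decoded)  # Using a Transformer network is simple
```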