<a href="https://colab.research.google.com/github/dhnanjay/HuggingFace/blob/main/Hugging_Face_Course_3_Tokenizers_(PyTorch).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tokenizers (PyTorch)

[Hugging Face Ch 2 Tokenizer](https://huggingface.co/course/chapter2/4?fw=pt)

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [1]:
# !pip install datasets evaluate transformers[sentencepiece]
# !pip install transformers

Tokenizers are one of the core components of the NLP pipeline. **They serve one purpose: to translate text into data that can be processed by the model.** Models can only process numbers, so tokenizers need to convert our text inputs to numerical data. In this section, we’ll explore exactly what happens in the tokenization pipeline.

In NLP tasks, the data to be processed is generally raw text. Here’s an example of such text:


```
Jim Henson was a puppeteer
```

Since models can only process numbers, **we need a way to convert the raw text to numbers. That’s what tokenizers do**, and there are many ways to go about it. The goal is to find the most meaningful representation (that is, the one that makes the most sense to the model) and, if possible, the smallest one. The simplest approach is to split the text on whitespace:

In [2]:
tokenized_text = "Jim Henson was a puppeteer".split()
print(tokenized_text)

['Jim', 'Henson', 'was', 'a', 'puppeteer']
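
Splitting on whitespace only produces strings; the model still needs numbers. Below is a minimal sketch of that remaining step, assuming a toy vocabulary built from the sentence itself. The names `build_vocab`, `encode_words`, and the `[UNK]` placeholder are illustrative and not part of any library.

```python
def build_vocab(words):
    """Assign each distinct word an integer ID, reserving 0 for unknown words."""
    vocab = {"[UNK]": 0}
    for word in words:
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab


def encode_words(text, vocab):
    """Split on whitespace and look each word up, falling back to [UNK]."""
    return [vocab.get(word, vocab["[UNK]"]) for word in text.split()]


tokenized_text = "Jim Henson was a puppeteer".split()
vocab = build_vocab(tokenized_text)

print(encode_words("Jim Henson was a puppeteer", vocab))  # [1, 2, 3, 4, 5]
print(encode_words("Jim Henson was a musician", vocab))   # [1, 2, 3, 4, 0]
```

Every word the toy vocabulary has never seen collapses to the same ID, which is one reason practical tokenizers often work with units smaller than whole words.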


<h3>Loading and saving</h3>

Loading and saving tokenizers is as simple as it is with models. Actually, it’s based on the same two methods: <b>from_pretrained() and save_pretrained().</b> These methods will load or save the algorithm used by the tokenizer (a bit like the architecture of the model) as well as its vocabulary (a bit like the weights of the model).

Loading the BERT tokenizer trained with the same checkpoint as BERT is done the same way as loading the model, except we use the BertTokenizer class:

In [None]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

Similar to AutoModel, the AutoTokenizer class will grab the proper tokenizer class in the library based on the checkpoint name, and can be used directly with any checkpoint:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

We can now call the tokenizer directly on a sentence:

In [None]:
tokenizer("Using a Transformer network is simple")

{'input_ids': [101, 7993, 170, 11303, 1200, 2443, 1110, 3014, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

Saving a tokenizer is identical to saving a model:

In [None]:
tokenizer.save_pretrained("directory_on_my_computer")
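
To make the algorithm/vocabulary analogy concrete, here is a toy sketch of what saving and reloading a tokenizer involves: a small configuration describing the algorithm, plus the vocabulary, written to a directory and read back. This is not how the Transformers library stores its files. The class `ToyTokenizer`, its `save` and `load` methods, and the file name `toy_tokenizer.json` are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


class ToyTokenizer:
    """Hypothetical whitespace tokenizer; only illustrates the save/load idea."""

    def __init__(self, vocab, lowercase=False):
        self.vocab = vocab          # the data, a bit like a model's weights
        self.lowercase = lowercase  # the algorithm setting, a bit like the architecture

    def encode(self, text):
        if self.lowercase:
            text = text.lower()
        return [self.vocab.get(word, self.vocab["[UNK]"]) for word in text.split()]

    def save(self, directory):
        path = Path(directory)
        path.mkdir(parents=True, exist_ok=True)
        state = {"lowercase": self.lowercase, "vocab": self.vocab}
        (path / "toy_tokenizer.json").write_text(json.dumps(state))

    @classmethod
    def load(cls, directory):
        state = json.loads((Path(directory) / "toy_tokenizer.json").read_text())
        return cls(state["vocab"], lowercase=state["lowercase"])


original = ToyTokenizer({"[UNK]": 0, "Jim": 1, "Henson": 2}, lowercase=False)
with tempfile.TemporaryDirectory() as tmp:
    original.save(tmp)
    restored = ToyTokenizer.load(tmp)

# The reloaded tokenizer encodes text exactly like the one that was saved.
print(restored.encode("Jim Henson sings") == original.encode("Jim Henson sings"))  # True
```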

<h3>Encoding</h3>

Translating text to numbers is known as encoding. Encoding is a two-step process: tokenization, followed by conversion to input IDs.

As we’ve seen, the first step is to split the text into words (or parts of words, punctuation symbols, etc.), usually called tokens. There are multiple rules that can govern that process, which is why we need to instantiate the tokenizer using the name of the model, to make sure we use the same rules that were used when the model was pretrained.

The second step is to convert those tokens into numbers, so we can build a tensor out of them and feed them to the model. To do this, the tokenizer has a vocabulary, which is the part we download when we instantiate it with the from_pretrained() method. Again, we need to use the same vocabulary used when the model was pretrained.

To get a better understanding of the two steps, we’ll explore them separately. We will use methods that perform parts of the tokenization pipeline on their own to show the intermediate results, but in practice you should call the tokenizer directly on your inputs (as shown in the “Loading and saving” section above).

<h3>Tokenization</h3>

The tokenization process is done by the tokenize() method of the tokenizer:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)

print(tokens)

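['Using', 'a', 'Transform', '##er', 'network', 'is', 'simple']

This tokenizer is a subword tokenizer: it splits words until it obtains tokens that are in its vocabulary. Here, Transformer is split into two tokens, Transform and ##er, where the ## prefix marks a piece that continues the previous token. Because the checkpoint is cased, the capital T is preserved.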
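
BERT tokenizers use the WordPiece algorithm, whose splitting step is commonly described as greedy longest-match-first: for each word, take the longest prefix found in the vocabulary, then repeat on the remainder with a `##` prefix. The sketch below illustrates that idea under stated assumptions: `wordpiece_split` and `toy_vocab` are hypothetical names, the vocabulary is a seven-entry stand-in for the real one, and the normalization and punctuation handling of the real tokenizer are left out.

```python
def wordpiece_split(word, vocab):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word continuation
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary (an assumption for this sketch, not the real BERT vocabulary).
toy_vocab = {"Using", "a", "Transform", "##er", "network", "is", "simple"}

sequence = "Using a Transformer network is simple"
tokens = [piece for word in sequence.split() for piece in wordpiece_split(word, toy_vocab)]
print(tokens)
# ['Using', 'a', 'Transform', '##er', 'network', 'is', 'simple']
```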

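The second step, converting the tokens to input IDs, is handled by the convert_tokens_to_ids() method of the tokenizer:

In [None]:
# Map each token to its index in the tokenizer's vocabulary.
# Requires `tokenizer` and `tokens` from the previous cell.
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

[7993, 170, 11303, 1200, 2443, 1110, 3014]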
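
The seven token IDs for this sentence are 7993, 170, 11303, 1200, 2443, 1110, and 3014. Comparing them with the `input_ids` returned earlier by calling the tokenizer directly shows what the direct call adds: the same IDs wrapped in 101 and 102, which are the IDs of the special `[CLS]` and `[SEP]` tokens in the bert-base-cased vocabulary. The sketch below rebuilds that output from the numbers printed in this notebook. It makes no library call, and the helper name `wrap_single_sequence` is hypothetical.

```python
CLS_ID = 101  # [CLS] in the bert-base-cased vocabulary
SEP_ID = 102  # [SEP] in the bert-base-cased vocabulary

# IDs of the seven tokens of "Using a Transformer network is simple".
ids = [7993, 170, 11303, 1200, 2443, 1110, 3014]


def wrap_single_sequence(token_ids):
    """Wrap a single sequence the way BERT expects: [CLS] ... [SEP]."""
    return [CLS_ID] + token_ids + [SEP_ID]


input_ids = wrap_single_sequence(ids)
attention_mask = [1] * len(input_ids)  # every position holds a real token
token_type_ids = [0] * len(input_ids)  # a single sentence: all segment 0

print(input_ids)
# [101, 7993, 170, 11303, 1200, 2443, 1110, 3014, 102]
```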

<h3>Decoding</h3>

Decoding goes the other way: from vocabulary indices, we want to get a string. This can be done with the decode() method:

In [None]:
decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
print(decoded_string)

Using a Transformer network is simple

Note that the decode() method not only converts the indices back to tokens, but also groups together the tokens that were part of the same word to produce a readable sentence. This behavior will be extremely useful when we use models that predict new text (either text generated from a prompt, or for sequence-to-sequence problems like translation or summarization).
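
The grouping of word pieces can be sketched in a few lines. This is a simplified illustration, not the library's decode() implementation, which also deals with special tokens and spacing around punctuation. The helper `merge_wordpieces` and the `id_to_token` table are hypothetical; the table covers only the seven IDs used in this notebook.

```python
def merge_wordpieces(tokens):
    """Join tokens with spaces, gluing '##' continuations onto the previous token."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]  # drop the '##' marker and append
        else:
            words.append(token)
    return " ".join(words)


# Toy ID-to-token table for this one sentence (IDs taken from the notebook).
id_to_token = {
    7993: "Using", 170: "a", 11303: "Transform", 1200: "##er",
    2443: "network", 1110: "is", 3014: "simple",
}

ids = [7993, 170, 11303, 1200, 2443, 1110, 3014]
tokens = [id_to_token[i] for i in ids]  # step 1: IDs back to tokens
print(merge_wordpieces(tokens))         # step 2: regroup the word pieces
# Using a Transformer network is simple
```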

Tokenization, encoding, and embedding are the steps that turn raw text into the numerical input a machine learning model consumes.

Tokenization breaks a sequence of text into individual tokens (words, subwords, or punctuation symbols) according to the rules of the tokenizer.

Encoding, in the narrow sense, maps each token to its integer index in the vocabulary. This is a lookup-table operation: every token in the vocabulary is associated with exactly one integer. (Earlier in this notebook, encoding referred to the whole text-to-IDs process, with tokenization as its first step.)

Embedding represents each token ID as a dense vector. These vectors are learned by a neural network model and capture aspects of the token’s meaning in a space whose dimension is much smaller than the vocabulary size. They serve as input features for downstream NLP tasks such as sentiment analysis, text classification, or translation.
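
An embedding layer is, at its core, a table with one row per token ID. The sketch below shows the lookup in plain Python under toy assumptions: the sizes `VOCAB_SIZE` and `EMBEDDING_DIM`, the random values, and the helper `embed` are all illustrative, and no training takes place.

```python
import random

random.seed(0)  # reproducible toy values

VOCAB_SIZE = 10    # toy value; real vocabularies have tens of thousands of entries
EMBEDDING_DIM = 4  # toy value; real models use hundreds of dimensions

# One row of EMBEDDING_DIM floats per token ID. In a real model these values
# are parameters updated during training; here they are just random numbers.
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(EMBEDDING_DIM)]
    for _ in range(VOCAB_SIZE)
]


def embed(token_ids):
    """Look up the vector for each token ID."""
    return [embedding_table[i] for i in token_ids]


vectors = embed([1, 2, 3, 4, 5])
print(len(vectors), len(vectors[0]))  # 5 4
```

In PyTorch, this kind of lookup table is typically provided by `torch.nn.Embedding`.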

In summary, tokenization splits text into tokens, encoding maps those tokens to integer IDs, and embedding turns each ID into a dense vector.