# **The Tokenizers**
**Train** a brand **new tokenizer on a corpus of texts**, so it can then be **used to pretrain a language model**. This will all be done with the help of the 🤗 `Tokenizers` library, which provides the **“fast”** tokenizers in the 🤗 Transformers library. We’ll take a close look at the features that this library provides, and explore how the fast tokenizers differ from the **“slow”** versions.

**Topics:**
- How to **train a new tokenizer** similar to the one used by a given checkpoint on a new corpus of texts
- The special features of **fast tokenizers**
- The differences between the **three main subword tokenization algorithms** used in NLP today
- How to **build a tokenizer from scratch** with the 🤗 Tokenizers library and train it on some data.

## **Training a New Tokenizer from an old one**
**If a language model is not available in the language you are interested in**, *or* **if your corpus is very different from the one your language model was trained on**, you will most likely want to **retrain the model from scratch using a tokenizer adapted to your data**. That will require training a new tokenizer on your dataset.


Most Transformers models use a *`subword tokenization algorithm`*. To identify which subwords are of interest and occur most frequently in the corpus at hand, the **`tokenizer` needs to take a hard look at all the texts in the corpus**, a process we call *training*. The exact rules that govern this training depend on **the type of tokenizer used** (we’ll go over the three main algorithms later).

⚠️ **Training a tokenizer** is not the same as training a model! Model training uses *stochastic gradient descent* to make the loss a little bit smaller for each batch. It’s randomized by nature (meaning you have to set some seeds to get the same results when doing the same training twice).

⚠️ **Training a tokenizer** is a statistical process that tries to **identify which subwords are the best to pick for a given corpus**, *and* **the exact rules used to pick them depend on the tokenization algorithm**. It’s deterministic, meaning you *always get the same results when training with the same algorithm on the same corpus*.
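To make the "deterministic statistics" point concrete, here is a toy sketch. It is not the algorithm of any real tokenizer: `most_frequent_pairs` and the tiny corpus are made up for illustration. It simply counts adjacent character pairs and ranks them, and since nothing in it is random, two runs on the same corpus agree.

```python
from collections import Counter


def most_frequent_pairs(corpus, top_n=3):
    """Count adjacent character pairs inside each word and return the most common ones."""
    counts = Counter()
    for text in corpus:
        for word in text.split():
            for left, right in zip(word, word[1:]):
                counts[left + right] += 1
    # Sort by descending count, then alphabetically, so ties are broken deterministically
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


toy_corpus = ["low lower lowest", "new newer newest"]  # made-up corpus
first_run = most_frequent_pairs(toy_corpus)
second_run = most_frequent_pairs(toy_corpus)

print(first_run)
print(first_run == second_run)
```

Real subword algorithms use more elaborate rules, but they share this property: the result is a function of the corpus and the algorithm only.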

### **Assembling a corpus**
There’s a very simple API in 🤗 Transformers that you can use to **train a new tokenizer with the same characteristics as an existing one**: `AutoTokenizer.train_new_from_iterator()`.

To see this in action, let’s say we want to **train `GPT-2` from scratch, but in a language other than *English*.** Our *first task* will be to **gather lots of data in that language in a training corpus**. To provide examples everyone will be able to understand, we won’t use a language like Russian or Chinese here, but rather a specialized English language: ***Python code.***

The 🤗 Datasets library can help us assemble a corpus of Python source code. We’ll use the usual `load_dataset()` function to download and cache the `CodeSearchNet` dataset. This dataset was created for the CodeSearchNet challenge and **contains millions of functions from open source libraries on GitHub in several programming languages**. Here, we will load the **Python** part of this dataset:

In [None]:
from datasets import load_dataset

# This can take a few minutes to load, so grab a coffee or tea while you wait!
raw_datasets = load_dataset("code_search_net", "python")

In [None]:
raw_datasets["train"]

## Output
# Dataset({
#     features: ['repository_name', 'func_path_in_repository', 'func_name', 'whole_func_string', 'language',
#       'func_code_string', 'func_code_tokens', 'func_documentation_string', 'func_documentation_tokens', 'split_name',
#       'func_code_url'
#     ],
#     num_rows: 412178
# })

In [None]:
raw_datasets["train"][0]

Here, we’ll just use the `whole_func_string` column to train our tokenizer.

In [None]:
print(raw_datasets["train"][123456]["whole_func_string"])

## Output
# def handle_simple_responses(
#       self, timeout_ms=None, info_cb=DEFAULT_MESSAGE_CALLBACK):
#     """Accepts normal responses from the device.

#     Args:
#       timeout_ms: Timeout in milliseconds to wait for each response.
#       info_cb: Optional callback for text sent from the bootloader.

#     Returns:
#       OKAY packet's message.
#     """
#     return self._accept_responses('OKAY', info_cb, timeout_ms=timeout_ms)

The first thing we need to do is **transform the dataset into an *iterator* of lists of texts** — for instance, a list of list of texts.

Using lists of texts will enable our tokenizer to go faster (training on batches of texts instead of processing individual texts one by one), and it should be an iterator if we want to avoid having everything in memory at once.

If your corpus is huge, you will want to take advantage of the fact that *🤗 `Datasets` does not load everything into RAM but stores the elements of the dataset on disk*.

We could create a **list of lists of 1,000 texts each** directly, but that would load everything into memory. By using a *Python generator* instead, we can avoid Python loading anything into memory until it’s actually necessary.

In [None]:
# Don't uncomment the following line unless your dataset is small!
# training_corpus = [raw_datasets["train"][i: i + 1000]["whole_func_string"] for i in range(0, len(raw_datasets["train"]), 1000)]

In [None]:
## Using Python Generator
training_corpus = (
    raw_datasets["train"][i : i + 1000]["whole_func_string"]
    for i in range(0, len(raw_datasets["train"]), 1000)
)
# This code doesn’t fetch any elements of the dataset;
# it just creates an object you can use in a Python for loop

The texts will only be loaded when you need them (that is, when you’re at the step of the `for` loop that requires them), and only 1,000 texts at a time will be loaded.
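The lazy behaviour can be checked on a small stand-in, without downloading the dataset. In this sketch, `iter_batches`, `toy_texts` and the `log` argument are hypothetical helpers invented for the demonstration; the log records when each slice is actually produced.

```python
def iter_batches(texts, batch_size, log):
    """Yield consecutive slices of `texts`, recording each slice start in `log` as it is produced."""
    for start in range(0, len(texts), batch_size):
        log.append(start)
        yield texts[start : start + batch_size]


toy_texts = [f"def f{i}(): pass" for i in range(10)]  # stand-in for the real dataset column
load_log = []
batches = iter_batches(toy_texts, batch_size=4, log=load_log)

created_log = list(load_log)  # snapshot right after creating the generator
first_batch = next(batches)   # only now is the first slice produced
after_first = list(load_log)
remaining = list(batches)     # consume the rest

print(created_log, after_first, [len(b) for b in remaining])
```

Creating the generator does no work at all; each batch is only sliced when the consumer asks for it.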

The problem with a *generator object* is that it can only be *used once*. For example, if we build a generator over the first 10 digits and convert it to a list twice, we get the digits **once and then an empty list**, rather than the same list twice.
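A quick way to see this, using a throwaway generator over the first 10 digits (the names `gen` and `make_digits` are only for illustration):

```python
gen = (i for i in range(10))
first_pass = list(gen)   # consumes the generator
second_pass = list(gen)  # nothing left to yield

print(first_pass)
print(second_pass)


def make_digits():
    """Return a fresh generator on every call, so it can be iterated again."""
    return (i for i in range(10))


print(list(make_digits()) == list(make_digits()))
```

Wrapping the generator expression in a function gives a new generator on each call, which is the approach used for the training corpus.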

In [None]:
## Define a function that returns a generator instead
def get_training_corpus():
    return (
        raw_datasets["train"][i : i + 1000]["whole_func_string"]
        for i in range(0, len(raw_datasets["train"]), 1000)
    )


training_corpus = get_training_corpus()

In [None]:
# Or define a generator function that uses `yield` inside a for loop
def get_training_corpus():
    dataset = raw_datasets["train"]
    for start_idx in range(0, len(dataset), 1000):
        samples = dataset[start_idx : start_idx + 1000]
        yield samples["whole_func_string"]

# This produces the exact same generator as before,
# but allows you to use more complex logic than you can in a generator expression.

### **Training a new Tokenizer**
Now that we have our **corpus** in the form of an *iterator of batches of texts*, we are ready to train a new tokenizer.

Load the tokenizer that we want to pair with our model (e.g., GPT-2):

In [None]:
from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("gpt2")

Even though we are going to train a new tokenizer, it’s a good idea to do this to avoid starting entirely from scratch.

This way, we won’t have to specify anything about the tokenization algorithm or the special tokens we want to use; **our new tokenizer will be exactly the same as GPT-2**, *and* **the only thing that will change is the vocabulary**, which will be **determined by the training on our corpus**.

In [None]:
## Example
example = '''def add_numbers(a, b):
    """Add the two numbers `a` and `b`."""
    return a + b'''

tokens = old_tokenizer.tokenize(example)
tokens

# ['def', 'Ġadd', '_', 'n', 'umbers', '(', 'a', ',', 'Ġb', '):', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġ"""', 'Add', 'Ġthe', 'Ġtwo',
#  'Ġnumbers', 'Ġ`', 'a', '`', 'Ġand', 'Ġ`', 'b', '`', '."', '""', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġreturn', 'Ġa', 'Ġ+', 'Ġb']

This tokenizer has a few special symbols, like `Ġ` and `Ċ`, which denote *spaces* and *newlines*, respectively.

As we can see, this is not too efficient: the tokenizer returns individual tokens for each space, when it could group together indentation levels (since having sets of four or eight spaces is going to be very common in code). It also split the function name a bit weirdly, not being used to seeing words with the `_` character.
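The `Ġ` and `Ċ` symbols are not arbitrary: GPT-2 works on bytes and displays each byte as a printable character. The sketch below rebuilds that byte-to-character table following the logic of GPT-2's reference encoder as we understand it; treat it as an illustration rather than the library's actual code.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable keep their own character
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Remaining bytes (space, newline, control bytes...) are shifted past U+00FF
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


table = bytes_to_unicode()
visible = "".join(table[b] for b in "a b\n".encode("utf-8"))
print(visible)
```

With this table, the space byte is displayed as `Ġ` and the newline byte as `Ċ`, which matches the tokens shown above.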

Let’s train a new tokenizer and see if it solves those issues. For this, we’ll use the method `train_new_from_iterator()`:

In [None]:
tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 52000)

In [None]:
tokens = tokenizer.tokenize(example)
tokens

# ['def', 'Ġadd', '_', 'numbers', '(', 'a', ',', 'Ġb', '):', 'ĊĠĠĠ', 'Ġ"""', 'Add', 'Ġthe', 'Ġtwo', 'Ġnumbers', 'Ġ`',
#  'a', '`', 'Ġand', 'Ġ`', 'b', '`."""', 'ĊĠĠĠ', 'Ġreturn', 'Ġa', 'Ġ+', 'Ġb']

Here we again see the special symbols `Ġ` and `Ċ` that denote *spaces* and *newlines*, but we can also see that our tokenizer learned some tokens that are highly specific to a corpus of Python functions: for example, there is a `ĊĠĠĠ` token that represents an *indentation*, and a `Ġ"""` token that represents the *three quotes that start a docstring*. The tokenizer also correctly split the function name on `_`.

This is quite a compact representation; comparatively, using the plain English tokenizer on the same example will give us a longer sentence:

In [None]:
print(len(tokens))
print(len(old_tokenizer.tokenize(example)))

In [None]:
# Another example
example = """class LinearLayer():
    def __init__(self, input_size, output_size):
        self.weight = torch.randn(input_size, output_size)
        self.bias = torch.zeros(output_size)

    def __call__(self, x):
        return x @ self.weights + self.bias
    """
tokenizer.tokenize(example)

# ['class', 'ĠLinear', 'Layer', '():', 'ĊĠĠĠ', 'Ġdef', 'Ġ__', 'init', '__(', 'self', ',', 'Ġinput', '_', 'size', ',',
#  'Ġoutput', '_', 'size', '):', 'ĊĠĠĠĠĠĠĠ', 'Ġself', '.', 'weight', 'Ġ=', 'Ġtorch', '.', 'randn', '(', 'input', '_',
#  'size', ',', 'Ġoutput', '_', 'size', ')', 'ĊĠĠĠĠĠĠĠ', 'Ġself', '.', 'bias', 'Ġ=', 'Ġtorch', '.', 'zeros', '(',
#  'output', '_', 'size', ')', 'ĊĊĠĠĠ', 'Ġdef', 'Ġ__', 'call', '__(', 'self', ',', 'Ġx', '):', 'ĊĠĠĠĠĠĠĠ',
#  'Ġreturn', 'Ġx', 'Ġ@', 'Ġself', '.', 'weights', 'Ġ+', 'Ġself', '.', 'bias', 'ĊĠĠĠĠ']

In addition to the token corresponding to an indentation, here we can also see a token for a double indentation: `ĊĠĠĠĠĠĠĠ`. The special Python words like `class`, `init`, `call`, `self`, and `return` are each tokenized as one token, and we can see that, as well as splitting on `_` and `.`, the tokenizer correctly splits even camel-cased names: `LinearLayer` is tokenized as `["ĠLinear", "Layer"]`.

### **Saving the Tokenizer**
To make sure we can use it later, we need to save our new tokenizer. As with models, this is done with the `save_pretrained()` method:

In [None]:
tokenizer.save_pretrained("code-search-net-tokenizer")

This will create a new folder named `code-search-net-tokenizer`, which will contain all the files the tokenizer needs to be reloaded.

If you want to share this tokenizer with your colleagues and friends, you can upload it to the Hub by logging into your account:

In [None]:
from huggingface_hub import notebook_login

notebook_login()

## Working with terminal
# huggingface-cli login

**Push tokenizer to the Hub**

In [None]:
tokenizer.push_to_hub("code-search-net-tokenizer")

This will create a new repository in your namespace with the name `code-search-net-tokenizer`, containing the tokenizer file.

In [None]:
## Load our new Tokenizer
# Replace "huggingface-course" below with your actual namespace to use your own tokenizer
tokenizer = AutoTokenizer.from_pretrained("huggingface-course/code-search-net-tokenizer")

## **Fast Tokenizers' special powers**
Take a closer look at the capabilities of the tokenizers in 🤗 Transformers. Up to now we have only used them to tokenize inputs or decode IDs back into text, but tokenizers — especially those backed by the 🤗 Tokenizers library — **can do a lot more**.

We will also explore how to **reproduce the results** of the `token-classification` pipeline (which we called `ner`) and the `question-answering` pipeline.

**Slow** tokenizers are those *written in Python* inside the 🤗 Transformers library, while the **fast** versions are the ones provided by 🤗 Tokenizers, which are *written in Rust*.

### **Batch Encoding**
The output of a tokenizer isn’t a simple Python dictionary; what we get is actually a special `BatchEncoding` object. It’s a subclass of a dictionary (which is why we were able to index into that result without any problem before), but with additional methods that are mostly used by **fast tokenizers**.

Besides their *parallelization* capabilities, the key functionality of **fast tokenizers** is that they always **keep track of the original span of texts the final tokens come from** — a feature we call *offset mapping*. This in turn unlocks features like mapping each word to the tokens it generated or mapping each character of the original text to the token it’s inside, and vice versa.
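To get a feel for what offset mapping means, here is a toy sketch in plain Python. It is a stand-in and not how the 🤗 Tokenizers library is implemented: `tokenize_with_offsets` and `toy_char_to_token` are hypothetical helpers that split on words and punctuation while remembering each token's character span.

```python
import re


def tokenize_with_offsets(text):
    """Split into words and punctuation marks, keeping each token's (start, end) span."""
    return [(m.group(), m.span()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def toy_char_to_token(offsets, char_index):
    """Return the index of the token whose span covers `char_index`, or None (e.g. for a space)."""
    for token_index, (start, end) in enumerate(offsets):
        if start <= char_index < end:
            return token_index
    return None


text = "My name is Sylvain."
pairs = tokenize_with_offsets(text)
toy_tokens = [tok for tok, _ in pairs]
toy_offsets = [span for _, span in pairs]

print(toy_tokens)
print(toy_offsets)
print(toy_char_to_token(toy_offsets, 12))
```

Because every token carries its span, slicing the original text with that span gives back the token, and any character position can be mapped to the token that contains it.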

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
encoding = tokenizer(example)
print(type(encoding))

## output
# <class 'transformers.tokenization_utils_base.BatchEncoding'>


Since the `AutoTokenizer` class picks a **fast tokenizer** by default, we can use the additional methods this `BatchEncoding` object provides.

We have two ways to check if our tokenizer is a fast or a slow one. We can either check the attribute `is_fast` of the `tokenizer`, or check the same attribute of our `encoding`:

In [None]:
# is_fast
tokenizer.is_fast

# check the same attribute of encoding
encoding.is_fast

True

Let’s see what a **fast tokenizer** enables us to do.

First, we can **access the tokens without having to convert the IDs back to tokens**:

In [None]:
encoding.tokens()
# ['[CLS]', 'My', 'name', 'is', 'S', '##yl', '##va', '##in', 'and', 'I', 'work', 'at', 'Hu', '##gging', 'Face', 'in', 'Brooklyn', '.', '[SEP]']


We can also use the `word_ids()` method to get the index of the word each token comes from:

In [None]:
encoding.word_ids()

[None, 0, 1, 2, 3, 3, 3, 3, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, None]

We can see that the tokenizer’s special tokens **[CLS]** and **[SEP]** are mapped to None, and then each token is mapped to the word it originates from.

This is especially useful to determine if a token is at the start of a word or if two tokens are in the same word. We could rely on the `##` prefix for that, but it only works for BERT-like tokenizers; this method works for any type of tokenizer as long as it’s a fast one. We can also use it to mask all the tokens coming from the same word in masked language modeling (a technique called *whole word masking*).
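As a small illustration of both uses, the sketch below works directly on the tokens and word IDs printed above (copied by hand, so no tokenizer is needed). The variable names are ours.

```python
from collections import defaultdict

tokens = ['[CLS]', 'My', 'name', 'is', 'S', '##yl', '##va', '##in', 'and', 'I',
          'work', 'at', 'Hu', '##gging', 'Face', 'in', 'Brooklyn', '.', '[SEP]']
word_ids = [None, 0, 1, 2, 3, 3, 3, 3, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, None]

# A token starts a word when its word ID differs from the previous token's
is_word_start = [
    wid is not None and (i == 0 or wid != word_ids[i - 1])
    for i, wid in enumerate(word_ids)
]

# Group token positions by word, which is the grouping whole word masking needs
tokens_per_word = defaultdict(list)
for position, wid in enumerate(word_ids):
    if wid is not None:
        tokens_per_word[wid].append(position)

print([tok for tok, starts in zip(tokens, is_word_start) if starts])
print(tokens_per_word[3])
```

Masking every position in `tokens_per_word[3]` at once would hide the whole word "Sylvain" rather than one of its pieces.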

Lastly, we can **map any word or token to characters in the original text**, and *vice versa*, via the `word_to_chars()` or `token_to_chars()` and `char_to_word()` or `char_to_token()` methods. For instance, the `word_ids()` method told us that `##yl` is part of the word at index 3, but which word is it in the sentence?

In [None]:
start, end = encoding.word_to_chars(3)
example[start:end]

'Sylvain'

As we mentioned previously, this is all **powered by the fact the *fast tokenizer* keeps track of the span of text each token comes from in a list of offsets**.

In [None]:
## Example
example2 = "My name is Aditya, currently exploring NLP and Transformers in HuggingFace"
encoding2 = tokenizer(example2)
print(type(encoding2))

<class 'transformers.tokenization_utils_base.BatchEncoding'>


In [None]:
print(encoding2.tokens())
print(encoding2.word_ids())

['[CLS]', 'My', 'name', 'is', 'Ad', '##ity', '##a', ',', 'currently', 'exploring', 'NL', '##P', 'and', 'Transformers', 'in', 'Hu', '##gging', '##F', '##ace', '[SEP]']
[None, 0, 1, 2, 3, 3, 3, 4, 5, 6, 7, 7, 8, 9, 10, 11, 11, 11, 11, None]


In [None]:
start, end = encoding2.word_to_chars(11)
example2[start:end]

'HuggingFace'

### **Inside the token-classification pipeline**
- In Chapter 1, we got our first taste of applying NER (where the task is to identify which parts of the text correspond to entities like persons, locations, or organizations) with the 🤗 Transformers `pipeline()` function.
- In Chapter 2, we saw how a `pipeline` groups together the three stages necessary to get the predictions from a raw text: *tokenization, passing the inputs through the model, and post-processing*.

The first two steps in the `token-classification` pipeline are the same as in any other pipeline, but the post-processing is a little more complex.


#### **Getting the base results with the pipeline**
Grab a *token classification* pipeline so we can get some results to compare manually. The model used by default is `dbmdz/bert-large-cased-finetuned-conll03-english`; it performs `NER` on sentences.

In [None]:
from transformers import pipeline

token_classifier = pipeline("token-classification")
token_classifier("My name is Sylvain and I work at Hugging Face in Brooklyn.")

[{'entity': 'I-PER',
  'score': 0.99938285,
  'index': 4,
  'word': 'S',
  'start': 11,
  'end': 12},
 {'entity': 'I-PER',
  'score': 0.99815494,
  'index': 5,
  'word': '##yl',
  'start': 12,
  'end': 14},
 {'entity': 'I-PER',
  'score': 0.99590707,
  'index': 6,
  'word': '##va',
  'start': 14,
  'end': 16},
 {'entity': 'I-PER',
  'score': 0.99923277,
  'index': 7,
  'word': '##in',
  'start': 16,
  'end': 18},
 {'entity': 'I-ORG',
  'score': 0.9738931,
  'index': 12,
  'word': 'Hu',
  'start': 33,
  'end': 35},
 {'entity': 'I-ORG',
  'score': 0.976115,
  'index': 13,
  'word': '##gging',
  'start': 35,
  'end': 40},
 {'entity': 'I-ORG',
  'score': 0.9887976,
  'index': 14,
  'word': 'Face',
  'start': 41,
  'end': 45},
 {'entity': 'I-LOC',
  'score': 0.9932106,
  'index': 16,
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]

The model properly identified each token generated by “Sylvain” as a person, each token generated by “Hugging Face” as an organization, and the token “Brooklyn” as a location.

We can also ask the pipeline to group together the tokens that correspond to the same entity:

In [None]:
token_classifier = pipeline("token-classification", aggregation_strategy="simple")
token_classifier("My name is Sylvain and I work at Hugging Face in Brooklyn.")

[{'entity_group': 'PER',
  'score': 0.9981694,
  'word': 'Sylvain',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': 0.9796019,
  'word': 'Hugging Face',
  'start': 33,
  'end': 45},
 {'entity_group': 'LOC',
  'score': 0.9932106,
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]

The `aggregation_strategy` picked will change the scores computed for each grouped entity. With `"simple"` the score is just the *mean of the scores of each token in the given entity*: for instance, the score of “Sylvain” is the mean of the scores we saw in the previous example for the tokens `S`, `##yl`, `##va`, and `##in`.
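As a quick check of the `"simple"` strategy, the sketch below recomputes the grouped scores from the per-token scores printed earlier. The values are copied by hand from that output, and the dictionary layout is only for illustration.

```python
from statistics import mean

# Per-token scores copied from the ungrouped pipeline output above
token_scores = {
    "Sylvain": [0.99938285, 0.99815494, 0.99590707, 0.99923277],
    "Hugging Face": [0.9738931, 0.976115, 0.9887976],
}

# With "simple", the entity score is the plain mean of its token scores
simple_scores = {entity: mean(scores) for entity, scores in token_scores.items()}
print(simple_scores)
```

The means agree with the grouped scores reported by the pipeline (about 0.99817 for "Sylvain" and 0.97960 for "Hugging Face").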

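The other strategies are:
- `"first"`: the score of each entity is the score of the first token of that entity
- `"max"`: the score of each entity is the maximum score of the tokens in that entity
- `"average"`: the score of each entity is the average of the scores of the words composing that entity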

Now let’s see how to obtain these results without using the `pipeline()` function!

#### **From Inputs to Prediction**
*First* we need to **tokenize our input and pass it through the model**. We instantiate the tokenizer and the model using the `AutoXxx` classes and then use them on our example.

In [None]:
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_checkpoint = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint)

example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
inputs = tokenizer(example, return_tensors="pt")
outputs = model(**inputs)

Since we’re using `AutoModelForTokenClassification` here, we get one set of `logits` for each token in the input sequence:

In [None]:
inputs

{'input_ids': tensor([[  101,  1422,  1271,  1110,   156,  7777,  2497,  1394,  1105,   146,
          1250,  1120, 20164, 10932, 10289,  1107,  6010,   119,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [None]:
print(inputs["input_ids"].shape)
print(outputs.logits.shape)

torch.Size([1, 19])
torch.Size([1, 19, 9])


We have a batch with *1 sequence* of *19 tokens* and the model has *9 different labels*, so the output of the model has a shape of `1 x 19 x 9`. As with the text classification pipeline, we use a `softmax` function to *convert those logits to probabilities*, and we take the `argmax` *to get predictions* (note that we can take the argmax on the logits because the softmax does not change the order).

In [None]:
import torch

probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)[0].tolist() # logits to probabilities
predictions = outputs.logits.argmax(dim=-1)[0].tolist() # get the prediction
print(probabilities)
print(predictions)

[[0.9994322657585144, 1.6470330592710525e-05, 3.4267097362317145e-05, 1.6042358765844256e-05, 8.25070746941492e-05, 2.1382355043897405e-05, 0.00015649135457351804, 1.965215415111743e-05, 0.00022089284902904183],
 [0.9989631175994873, 1.8515771444072016e-05, 5.240457539912313e-05, 1.253474511031527e-05, 0.0004347368376329541, 3.0874354706611484e-05, 0.0003146878443658352, 2.786072582239285e-05, 0.00014510878827422857],
 [0.999708354473114, 8.308103133458644e-06, 2.8745558665832505e-05, 5.650347247865284e-06, 8.694831922184676e-05, 9.783449058886617e-06, 6.786132144043222e-05, 1.1793958947237115e-05, 7.241879211505875e-05],
 [0.9998350143432617, 5.645520559482975e-06, 1.3955125723441597e-05, 4.3133691178809386e-06, 4.017683386337012e-05, 8.123054612951819e-06, 5.648485239362344e-05, 8.991617505671456e-06, 2.7239060727879405e-05],
 [0.00018333387561142445, 2.5156570700346492e-05, 4.846194133278914e-05, 1.4900567293807399e-05, 0.9993828535079956, 1.9997702111140825e-05, 0.00011153610103065148, ...
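The remark that the argmax can be taken directly on the logits is easy to check on made-up numbers. This sketch uses a small pure-Python `softmax` (our own helper, not the `torch` function) on two hypothetical rows of logits.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


toy_logits = [[2.0, -1.0, 0.5], [-0.3, 0.1, 4.2]]  # made-up logits: 2 tokens, 3 labels
toy_probs = [softmax(row) for row in toy_logits]

argmax_on_logits = [argmax(row) for row in toy_logits]
argmax_on_probs = [argmax(row) for row in toy_probs]
print(argmax_on_logits, argmax_on_probs)
```

Softmax is strictly increasing in each logit, so the largest logit always becomes the largest probability and both argmax calls agree.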

The `model.config.id2label` attribute contains the **mapping of indexes to labels** that we can use to make sense of the predictions:

In [None]:
model.config.id2label

{0: 'O',
 1: 'B-MISC',
 2: 'I-MISC',
 3: 'B-PER',
 4: 'I-PER',
 5: 'B-ORG',
 6: 'I-ORG',
 7: 'B-LOC',
 8: 'I-LOC'}

As we saw earlier, there are 9 labels: `O` is the label for the tokens that are not in any named entity (it stands for “outside”), and we then have two labels for each type of entity (miscellaneous, person, organization, and location). The label `B-XXX` indicates the token is at the beginning of an entity `XXX` and the label `I-XXX` indicates the token is inside the entity `XXX`.
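The label scheme can be taken apart with a few lines of Python. In this sketch the mapping is copied from the output above, and `split_label` is a hypothetical helper that separates the `B`/`I` prefix from the entity type.

```python
id2label = {0: "O", 1: "B-MISC", 2: "I-MISC", 3: "B-PER", 4: "I-PER",
            5: "B-ORG", 6: "I-ORG", 7: "B-LOC", 8: "I-LOC"}


def split_label(label):
    """Return (prefix, entity_type); the outside label has no entity type."""
    if label == "O":
        return "O", None
    prefix, entity_type = label.split("-", 1)
    return prefix, entity_type


entity_types = sorted({split_label(l)[1] for l in id2label.values() if l != "O"})
print(entity_types)
print(split_label("B-PER"), split_label("O"))
```

Four entity types with two labels each, plus `O`, account for the 9 labels.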

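📓 For instance, in the current example we would expect our model to classify the token `S` as `B-PER` (beginning of a person entity) and the tokens `##yl`, `##va` and `##in` as `I-PER` (inside a person entity). The pipeline output above labels all four tokens `I-PER` instead, which is not necessarily a mistake: there are two formats for these labels, IOB1 and IOB2. In IOB2, every entity starts with a `B-` label; in IOB1, `B-` is only used to separate two adjacent entities of the same type. This model was fine-tuned on a dataset using the IOB1 format, which is why it assigns `I-PER` to `S`.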

With this map, we are ready to reproduce (almost entirely) the results of the first `pipeline` — we can just grab the score and label of each token that was not classified as `O`:

In [None]:
results = []
tokens = inputs.tokens()

for idx, pred in enumerate(predictions):
    label = model.config.id2label[pred]
    if label != "O":
        results.append(
            {"entity": label, "score": probabilities[idx][pred], "word": tokens[idx]}
        )

print(results)

[{'entity': 'I-PER', 'score': 0.9993828535079956, 'word': 'S'}, {'entity': 'I-PER', 'score': 0.9981548190116882, 'word': '##yl'}, {'entity': 'I-PER', 'score': 0.995907187461853, 'word': '##va'}, {'entity': 'I-PER', 'score': 0.9992327690124512, 'word': '##in'}, {'entity': 'I-ORG', 'score': 0.9738931059837341, 'word': 'Hu'}, {'entity': 'I-ORG', 'score': 0.9761149883270264, 'word': '##gging'}, {'entity': 'I-ORG', 'score': 0.9887974858283997, 'word': 'Face'}, {'entity': 'I-LOC', 'score': 0.99321049451828, 'word': 'Brooklyn'}]


This is very similar to what we had before, with one exception: the `pipeline` also gave us information about the `start` and `end` of each entity in the original sentence. This is where our *offset mapping* will come into play. To get the offsets, we just have to set `return_offsets_mapping=True` when we apply the tokenizer to our inputs:

In [None]:
inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
inputs_with_offsets["offset_mapping"]

[(0, 0),
 (0, 2),
 (3, 7),
 (8, 10),
 (11, 12),
 (12, 14),
 (14, 16),
 (16, 18),
 (19, 22),
 (23, 24),
 (25, 29),
 (30, 32),
 (33, 35),
 (35, 40),
 (41, 45),
 (46, 48),
 (49, 57),
 (57, 58),
 (0, 0)]

"My name is Sylvain and I work at Hugging Face in Brooklyn."

Each tuple is the span of text corresponding to each token, where `(0, 0)` is reserved for the special tokens. We saw before that the token at index 5 is `##yl`, which has `(12, 14)` as offsets here.

If we grab the corresponding slice in our example, we get the proper span of text without the `##`:

In [None]:
example[12:14]

'yl'
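The same slicing works for every token at once. The sketch below uses the example sentence and the offsets printed above (copied by hand, so it runs without the tokenizer).

```python
example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
offsets = [(0, 0), (0, 2), (3, 7), (8, 10), (11, 12), (12, 14), (14, 16), (16, 18),
           (19, 22), (23, 24), (25, 29), (30, 32), (33, 35), (35, 40), (41, 45),
           (46, 48), (49, 57), (57, 58), (0, 0)]

# Slice the original text with each token's span; special tokens give an empty string
spans = [example[start:end] for start, end in offsets]
print(spans)
```

Note that the spans contain the raw text of each token, so `##yl` comes back as `yl`, and joining the spans at indices 4 to 7 rebuilds "Sylvain".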

Using this we can complete the previous results:

In [None]:
results = []
inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
tokens = inputs_with_offsets.tokens()
offsets = inputs_with_offsets["offset_mapping"]

for idx, pred in enumerate(predictions):
    label = model.config.id2label[pred]
    if label != "O":
        start, end = offsets[idx]
        results.append(
            {
                "entity": label,
                "score": probabilities[idx][pred],
                "word": tokens[idx],
                "start": start,
                "end": end,
            }
        )

print(results)

[{'entity': 'I-PER', 'score': 0.9993828535079956, 'word': 'S', 'start': 11, 'end': 12}, {'entity': 'I-PER', 'score': 0.9981548190116882, 'word': '##yl', 'start': 12, 'end': 14}, {'entity': 'I-PER', 'score': 0.995907187461853, 'word': '##va', 'start': 14, 'end': 16}, {'entity': 'I-PER', 'score': 0.9992327690124512, 'word': '##in', 'start': 16, 'end': 18}, {'entity': 'I-ORG', 'score': 0.9738931059837341, 'word': 'Hu', 'start': 33, 'end': 35}, {'entity': 'I-ORG', 'score': 0.9761149883270264, 'word': '##gging', 'start': 35, 'end': 40}, {'entity': 'I-ORG', 'score': 0.9887974858283997, 'word': 'Face', 'start': 41, 'end': 45}, {'entity': 'I-LOC', 'score': 0.99321049451828, 'word': 'Brooklyn', 'start': 49, 'end': 57}]


This is the same as what we got from the first pipeline!

#### **Grouping Entities**
Using the `offsets` to determine the start and end keys for each entity is handy, but that information isn’t strictly necessary.

When we want to group the entities together, however, the offsets will save us a lot of messy code. For example, if we wanted to group together the tokens `Hu`, `##gging`, and `Face`, we could make special rules that say *the first two should be attached while removing the `##`*, and *the `Face` should be added with a space since it does not begin with `##`*, but that would only work for this particular type of tokenizer. We would have to write another set of rules for a SentencePiece or a Byte-Pair-Encoding tokenizer.

With the offsets, all that custom code goes away: we can just **take the span in the original text that begins with the first token and ends with the last token**. So, in the case of the tokens `Hu`, `##gging`, and `Face`, we should start at character 33 (the beginning of `Hu`) and end before character 45 (the end of `Face`):

In [None]:
example[33:45]

'Hugging Face'

To write the code that *post-processes the predictions while grouping entities*, we will **group together entities that are consecutive and labeled with `I-XXX`, except for the first one, which can be labeled as `B-XXX` or `I-XXX`** (so, we stop grouping an entity when we get an `O`, a new type of entity, or a `B-XXX` that tells us an entity of the same type is starting):

In [None]:
import numpy as np

results = [] # save the results
inputs_with_offsets = tokenizer(example, return_offsets_mapping=True) # tokenizer with offset
tokens = inputs_with_offsets.tokens() # get the tokens of input
offsets = inputs_with_offsets["offset_mapping"] # slice for the offset value only

idx = 0
while idx < len(predictions):
    pred = predictions[idx] # get the prediction value by index
    label = model.config.id2label[pred] # get the label of the model from prediction values
    if label == "O":
        idx += 1
        continue

    # Remove the B- or I-
    label = label[2:]
    start, end = offsets[idx]

    # The first token of an entity can be B-label or I-label, so always take it
    all_scores = [probabilities[idx][pred]]
    idx += 1

    # Grab all the following tokens labeled with I-label
    while (
        idx < len(predictions)
        and model.config.id2label[predictions[idx]] == f"I-{label}"
    ):
        all_scores.append(probabilities[idx][predictions[idx]])
        _, end = offsets[idx]
        idx += 1
    # idx now points at the first token after the entity, so we don't advance it again

    # The score is the mean of all the scores of the tokens in that grouped entity
    score = np.mean(all_scores).item()
    word = example[start:end]
    results.append(
        {
            "entity_group": label,
            "score": score,
            "word": word,
            "start": start,
            "end": end,
        }
    )

print(results)

[{'entity_group': 'PER', 'score': 0.998169407248497, 'word': 'Sylvain', 'start': 11, 'end': 18}, {'entity_group': 'ORG', 'score': 0.9796018600463867, 'word': 'Hugging Face', 'start': 33, 'end': 45}, {'entity_group': 'LOC', 'score': 0.99321049451828, 'word': 'Brooklyn', 'start': 49, 'end': 57}]


And we get the same results as with our second pipeline!
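
To make the grouping rule easier to experiment with, here is a self-contained variant on toy data that needs no model. It is a sketch under stated assumptions: `group_entities` is a hypothetical helper, the labels, word-level offsets, and scores are made up, and the first token of an entity may carry either a `B-` or an `I-` prefix, as the rule above describes.

```python
def group_entities(text, labels, offsets, scores):
    """Group consecutive tokens of the same entity type into spans of `text`."""
    results = []
    idx = 0
    while idx < len(labels):
        label = labels[idx]
        if label == "O":
            idx += 1
            continue
        entity_type = label[2:]  # drop the "B-" or "I-" prefix
        start, end = offsets[idx]
        entity_scores = [scores[idx]]
        idx += 1
        # Extend the entity while the following tokens are labeled I-<same type>
        while idx < len(labels) and labels[idx] == f"I-{entity_type}":
            _, end = offsets[idx]
            entity_scores.append(scores[idx])
            idx += 1
        results.append(
            {
                "entity_group": entity_type,
                "score": sum(entity_scores) / len(entity_scores),
                "word": text[start:end],
                "start": start,
                "end": end,
            }
        )
    return results


text = "My name is Sylvain and I work at Hugging Face in Brooklyn."
# Toy word-level predictions (hypothetical labels, offsets and scores)
labels = ["O", "B-PER", "O", "B-ORG", "I-ORG", "O", "I-LOC"]
offsets = [(0, 10), (11, 18), (19, 32), (33, 40), (41, 45), (46, 48), (49, 57)]
scores = [0.99, 0.98, 0.99, 0.97, 0.99, 0.99, 0.96]

grouped = group_entities(text, labels, offsets, scores)
print([(g["entity_group"], g["word"]) for g in grouped])
```

The first token of each entity is consumed before the inner loop starts, so an entity that opens with `B-XXX` still gets a score and an end offset.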

Another example of a task where these `offsets` are extremely useful is *question answering*.

## **Fast Tokenizers in the `QA Pipeline`**
We will now dive into the `question-answering` pipeline and see how to leverage the offsets to grab the answer to the question at hand from the context, a bit like we did for the grouped entities in the previous section. Then we will see how we can deal with very long contexts that end up being truncated.

### **Using the Question-Answering pipeline**

Here is a basic use of the `question-answering` pipeline, asking for the five best answers with `top_k=5`:

In [None]:
from transformers import pipeline

question_answerer = pipeline("question-answering", top_k=5)
context = """
🤗 Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch, and TensorFlow — with a seamless integration
between them. It's straightforward to train your models with one before loading them for inference with the other.
"""

question = "Which deep learning libraries back 🤗 Transformers?"
question_answerer(question=question, context=context)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'score': 0.98026043176651,
  'start': 78,
  'end': 106,
  'answer': 'Jax, PyTorch, and TensorFlow'},
 {'score': 0.008247774094343185,
  'start': 78,
  'end': 108,
  'answer': 'Jax, PyTorch, and TensorFlow —'},
 {'score': 0.0013676995877176523,
  'start': 78,
  'end': 90,
  'answer': 'Jax, PyTorch'},
 {'score': 0.00038108433363959193,
  'start': 83,
  'end': 106,
  'answer': 'PyTorch, and TensorFlow'},
 {'score': 0.00021684507373720407,
  'start': 96,
  'end': 106,
  'answer': 'TensorFlow'}]

Unlike the other pipelines, which can’t truncate and split texts that are longer than the maximum length accepted by the model (and thus may miss information at the end of a document), this pipeline can *deal with very long contexts* and will *return the answer to the question even if it’s at the end*:

In [None]:
long_context = """
🤗 Transformers: State of the Art NLP

🤗 Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction,
question answering, summarization, translation, text generation and more in over 100 languages.
Its aim is to make cutting-edge NLP easier to use for everyone.

🤗 Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets and
then share them with the community on our model hub. At the same time, each python module defining an architecture is fully standalone and
can be modified to enable quick research experiments.

Why should I use transformers?

1. Easy-to-use state-of-the-art models:
  - High performance on NLU and NLG tasks.
  - Low barrier to entry for educators and practitioners.
  - Few user-facing abstractions with just three classes to learn.
  - A unified API for using all our pretrained models.
  - Lower compute costs, smaller carbon footprint:

2. Researchers can share trained models instead of always retraining.
  - Practitioners can reduce compute time and production costs.
  - Dozens of architectures with over 10,000 pretrained models, some in more than 100 languages.

3. Choose the right framework for every part of a model's lifetime:
  - Train state-of-the-art models in 3 lines of code.
  - Move a single model between TF2.0/PyTorch frameworks at will.
  - Seamlessly pick the right framework for training, evaluation and production.

4. Easily customize a model or an example to your needs:
  - We provide examples for each architecture to reproduce the results published by its original authors.
  - Model internals are exposed as consistently as possible.
  - Model files can be used independently of the library for quick experiments.

🤗 Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch and TensorFlow — with a seamless integration
between them. It's straightforward to train your models with one before loading them for inference with the other.
"""

question_answerer(question=question, context=long_context)

{'score': 0.9714871048927307,
 'start': 1892,
 'end': 1919,
 'answer': 'Jax, PyTorch and TensorFlow'}

### **Using a model for Question-Answering**
Like with any other pipeline, we start by *tokenizing* our input and then *send it through the model*. The checkpoint used by default for the `question-answering` pipeline is `distilbert-base-cased-distilled-squad` (the “squad” in the name comes from the dataset on which the model was fine-tuned):

In [None]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_checkpoint = "distilbert-base-cased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

inputs = tokenizer(question, context, return_tensors="pt")
outputs = model(**inputs)

Models for question answering work a little differently from the models we’ve seen up to now. Using the picture above as an example, the model has been **trained to predict the index of the token starting the answer** (here 21) *and* **the index of the token where the answer ends** (here 24).

This is why those models don’t return one tensor of logits but two: **one for the logits corresponding to the start token of the answer**, *and* **one for the logits corresponding to the end token of the answer**. Since in this case we have only one input containing 67 tokens, we get:

In [None]:
start_logits = outputs.start_logits
end_logits = outputs.end_logits

print(start_logits.shape, end_logits.shape)
# torch.Size([1, 67]) torch.Size([1, 67])

torch.Size([1, 67]) torch.Size([1, 67])


To **convert those logits into probabilities**, we will apply a `softmax` function — but before that, we need to make sure we **mask the indices that are not part of the context**. Our input is `[CLS] question [SEP] context [SEP]`, so we **need to mask the tokens of the question as well as the `[SEP]` token**. We’ll **keep the `[CLS]` token**, however, as some models use it to indicate that the answer is not in the context.

Since we will apply a `softmax` afterward, we just need to replace the `logits` we want to mask with a large negative number. Here, we use `-10000`:

In [None]:
import torch

sequence_ids = inputs.sequence_ids()

# Mask everything apart from the tokens of the context
mask = [i != 1 for i in sequence_ids]

# Unmask the [CLS] token
mask[0] = False
mask = torch.tensor(mask)[None]

start_logits[mask] = -10000
end_logits[mask] = -10000

Now that we have properly **masked the logits corresponding to positions we don’t want to predict**, we can apply the softmax:

In [None]:
start_probabilities = torch.nn.functional.softmax(start_logits, dim=-1)[0]
end_probabilities = torch.nn.functional.softmax(end_logits, dim=-1)[0]

In [None]:
start_probabilities

tensor([4.4531e-07, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
        0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 8.1185e-06, 1.3470e-05,
        2.4368e-07, 2.1236e-06, 1.3220e-06, 3.7722e-04, 6.9219e-03, 1.0237e-05,
        4.3289e-06, 1.5143e-05, 3.2463e-07, 4.1933e-06, 1.6808e-04, 9.9179e-01,
        8.6288e-06, 3.8557e-04, 5.9956e-06, 4.3725e-06, 5.8977e-07, 3.0929e-06,
        3.8998e-06, 2.9493e-06, 2.1940e-04, 5.4713e-06, 7.1354e-06, 2.3212e-05,
        5.2711e-06, 4.7788e-07, 2.4291e-07, 4.4467e-07, 1.4879e-08, 4.8133e-08,
        3.7169e-07, 7.1242e-08, 3.1735e-07, 2.2365e-07, 1.3685e-06, 2.4093e-08,
        1.1470e-08, 4.4891e-07, 2.2828e-08, 5.2562e-07, 5.8092e-07, 1.6419e-06,
        1.4114e-08, 2.0591e-07, 2.0161e-08, 2.5390e-07, 2.3251e-08, 1.4667e-08,
        5.4532e-08, 2.4235e-08, 5.5390e-09, 1.8524e-08, 3.6818e-08, 3.4721e-08,
        0.0000e+00], grad_fn=<SelectBackward0>)
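
The exact zeros at the masked positions follow directly from the large negative fill value. The sketch below reproduces the effect in pure Python on toy logits; `masked_softmax` is a hypothetical helper and the numbers are made up for illustration.

```python
import math


def masked_softmax(logits, mask, fill_value=-10000.0):
    """Softmax over `logits` after replacing masked positions with `fill_value`."""
    filled = [fill_value if masked else logit for logit, masked in zip(logits, mask)]
    highest = max(filled)  # subtract the max for numerical stability
    exps = [math.exp(value - highest) for value in filled]
    total = sum(exps)
    return [value / total for value in exps]


# Toy logits (hypothetical): positions 1 and 2 play the role of question tokens
logits = [0.5, 3.0, 2.0, 1.0, 4.0]
mask = [False, True, True, False, False]

probs = masked_softmax(logits, mask)
print(probs)
```

The exponential of a number near `-10000` underflows to `0.0`, so the masked positions drop out of the distribution while the remaining probabilities still sum to 1.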

At this stage, we could take the `argmax` of the start and end probabilities — but we might end up with a start index that is greater than the end index, so we need to take a few more precautions.

We will compute the **probabilities** of each possible `start_index` and `end_index` where `start_index <= end_index`, then take the tuple `(start_index, end_index)` with the ***highest probability***.

Assuming the events **“The answer starts at `start_index`”** and **“The answer ends at `end_index`”** to be independent, the probability that the answer starts at `start_index` and ends at `end_index` is:

**`start_probabilities[start_index] × end_probabilities[end_index]`**



So, to compute all the scores, we just need to compute all the products **`start_probabilities[start_index] × end_probabilities[end_index]`** where `start_index <= end_index`:

In [None]:
# Compute all the possible products: scores[i, j] = start_probabilities[i] * end_probabilities[j]
scores = start_probabilities[:, None] * end_probabilities[None, :]

Then we’ll mask the values where `start_index > end_index` by setting them to `0` (the other probabilities are all positive numbers). The `torch.triu()` function returns the upper triangular part of the 2D tensor passed as an argument, so `scores = torch.triu(scores)` does that masking for us.

Now we just have to **get the index of the maximum**. Since PyTorch will return the index in the flattened tensor, we need to use the floor division `//` and modulus `%` operations to get the `start_index` and `end_index`:

In [None]:
max_index = scores.argmax().item()
start_index = max_index // scores.shape[1] # 23
end_index = max_index % scores.shape[1] # 35
print(scores[start_index, end_index])

tensor(0.9803, grad_fn=<SelectBackward0>)
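
To see why the `start_index <= end_index` constraint matters, here is a pure-Python sketch of the same span selection on toy probabilities. `best_span` is a hypothetical helper, and the values are chosen so that the unconstrained maxima (start at 2, end at 1) would give an invalid span.

```python
def best_span(start_probs, end_probs):
    """Return (start_index, end_index, score) maximizing the product with start <= end."""
    n = len(end_probs)
    # Flattened score matrix: entry start * n + end, zeroed where start > end
    flat_scores = [
        start_probs[start] * end_probs[end] if start <= end else 0.0
        for start in range(len(start_probs))
        for end in range(n)
    ]
    max_index = max(range(len(flat_scores)), key=flat_scores.__getitem__)
    # divmod gives the same result as using // and % on the flattened index
    start_index, end_index = divmod(max_index, n)
    return start_index, end_index, flat_scores[max_index]


# Toy probabilities (hypothetical values)
start_probs = [0.1, 0.2, 0.7]
end_probs = [0.1, 0.6, 0.3]

span = best_span(start_probs, end_probs)
print(span)
```

Here the best valid span starts and ends at index 2, even though index 1 has the highest end probability on its own.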


We have the `start_index` and `end_index` of the answer in terms of tokens, so now we just need to **convert to the character indices in the context**. This is where the `offsets` will be super useful. We can grab them and use them like we did in the token classification task:

In [None]:
inputs_with_offsets = tokenizer(question, context, return_offsets_mapping=True)
offsets = inputs_with_offsets["offset_mapping"]

start_char, _ = offsets[start_index]
_, end_char = offsets[end_index]
answer = context[start_char:end_char]

In [None]:
result = {
    "answer": answer,
    "start": start_char,
    "end": end_char,
    "score": scores[start_index, end_index],
}

print(result)

{'answer': 'Jax, PyTorch, and TensorFlow', 'start': 78, 'end': 106, 'score': tensor(0.9803, grad_fn=<SelectBackward0>)}


### **Handling long contexts**
If we try to tokenize the question and long context we used as an example previously, we’ll get a **number of tokens *higher* than the maximum length** used in the `question-answering` pipeline (which is `384`):

In [None]:
inputs = tokenizer(question, long_context)
print(len(inputs["input_ids"]))

461


So, we’ll need to **truncate our inputs at that maximum length**. There are several ways we can do this, but we *don’t want to truncate the question, only the context*.

Since the context is the second sentence, we’ll use the `"only_second"` truncation strategy. The problem that arises then is that the answer to the question may not be in the truncated context. Here, for instance, we picked a question where the answer is toward the end of the context, and when we truncate it that answer is not present:

In [None]:
inputs = tokenizer(question, long_context, max_length=384, truncation="only_second")
print(tokenizer.decode(inputs["input_ids"]))

[CLS] Which deep learning libraries back [UNK] Transformers? [SEP] [UNK] Transformers : State of the Art NLP [UNK] Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction, question answering, summarization, translation, text generation and more in over 100 languages. Its aim is to make cutting - edge NLP easier to use for everyone. [UNK] Transformers provides APIs to quickly download and use those pretrained models on a given text, fine - tune them on your own datasets and then share them with the community on our model hub. At the same time, each python module defining an architecture is fully standalone and can be modified to enable quick research experiments. Why should I use transformers? 1. Easy - to - use state - of - the - art models : - High performance on NLU and NLG tasks. - Low barrier to entry for educators and practitioners. - Few user - facing abstractions with just three classes to learn. - A unified A

This means the model will have a hard time picking the correct answer. To fix this, the `question-answering` pipeline allows us to **split the context into smaller chunks, specifying the maximum length**. To make sure we don’t split the context at exactly the wrong place, which would make it impossible to find the answer, it also includes some overlap between the chunks.

We can have the tokenizer (fast or slow) do this for us by adding `return_overflowing_tokens=True`, and we can specify the *overlap* we want with the `stride` argument:

In [None]:
sentence = "This sentence is not too long but we are going to split it anyway."
inputs = tokenizer(
    sentence, truncation=True, return_overflowing_tokens=True, max_length=6, stride=2
)

for ids in inputs["input_ids"]:
    print(tokenizer.decode(ids))

[CLS] This sentence is not [SEP]
[CLS] is not too long [SEP]
[CLS] too long but we [SEP]
[CLS] but we are going [SEP]
[CLS] are going to split [SEP]
[CLS] to split it anyway [SEP]
[CLS] it anyway. [SEP]


As we can see, the sentence has been split into chunks in such a way that each entry in `inputs["input_ids"]` has at most 6 tokens (we would need to add padding to have the last entry be the same size as the others) and there is an overlap of 2 tokens between each of the entries.
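
The chunk boundaries above can be reproduced with a few lines of plain Python. This is only a sketch that mimics the behaviour seen in the output, not the library's implementation: `split_with_overlap` is a hypothetical helper, the sentence is pre-split into word-level tokens by hand, and two slots per chunk are reserved for `[CLS]` and `[SEP]`.

```python
def split_with_overlap(tokens, max_length, stride, num_special=2):
    """Split `tokens` into windows that share `stride` tokens with their neighbor."""
    window = max_length - num_special  # room left once [CLS] and [SEP] are added
    step = window - stride  # how far the window moves each time
    if step <= 0:
        raise ValueError("stride must be smaller than the usable window size")
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the last window reached the end of the sequence
        start += step
    return chunks


tokens = "This sentence is not too long but we are going to split it anyway .".split()
chunks = split_with_overlap(tokens, max_length=6, stride=2)
for chunk in chunks:
    print(chunk)
```

With `max_length=6` and `stride=2`, each window holds four tokens and advances by two, which gives the seven chunks shown above.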

In [None]:
print(inputs.keys())

dict_keys(['input_ids', 'attention_mask', 'overflow_to_sample_mapping'])


In [None]:
inputs

{'input_ids': [[101, 1188, 5650, 1110, 1136, 102], [101, 1110, 1136, 1315, 1263, 102], [101, 1315, 1263, 1133, 1195, 102], [101, 1133, 1195, 1132, 1280, 102], [101, 1132, 1280, 1106, 3325, 102], [101, 1106, 3325, 1122, 4050, 102], [101, 1122, 4050, 119, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1]], 'overflow_to_sample_mapping': [0, 0, 0, 0, 0, 0, 0]}

As expected, we get input `IDs` and an `attention mask`. The last key, `overflow_to_sample_mapping`, is a map that tells us **which sentence each of the results corresponds to** — here we have 7 results that all come from the (only) sentence we passed the tokenizer.

In [None]:
print(inputs["overflow_to_sample_mapping"])

[0, 0, 0, 0, 0, 0, 0]


This mapping is more useful when we tokenize several sentences together:

In [None]:
sentences = [
    "This sentence is not too long but we are going to split it anyway.",
    "This sentence is shorter but will still get split.",
]

inputs = tokenizer(
    sentences, truncation=True, return_overflowing_tokens=True, max_length=6, stride=2
)

print(inputs["overflow_to_sample_mapping"])

[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]


This means the first sentence is split into 7 chunks as before, and the next 4 chunks come from the second sentence.

Now let’s go back to our long context. By default the `question-answering` pipeline uses a **maximum length of 384**, as we mentioned earlier, and a **stride of 128**, which correspond to the way the model was fine-tuned (you can adjust those parameters by passing `max_seq_len` and `stride` arguments when calling the pipeline). We will thus use those parameters when tokenizing. We’ll also add padding (to have samples of the same length, so we can build tensors) as well as ask for the offsets:

In [None]:
inputs = tokenizer(
    question,
    long_context,
    stride=128,
    max_length=384,
    padding="longest",
    truncation="only_second",
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
)

In [None]:
inputs

{'input_ids': [[101, 5979, 1996, 3776, 9818, 1171, 100, 25267, 136, 102, 100, 25267, 131, 1426, 1104, 1103, 2051, 21239, 2101, 100, 25267, 2790, 4674, 1104, 3073, 4487, 9044, 3584, 1106, 3870, 8249, 1113, 6685, 1216, 1112, 5393, 117, 1869, 16026, 117, 2304, 10937, 117, 7584, 7317, 2734, 117, 5179, 117, 3087, 3964, 1105, 1167, 1107, 1166, 1620, 3483, 119, 2098, 6457, 1110, 1106, 1294, 5910, 118, 2652, 21239, 2101, 5477, 1106, 1329, 1111, 2490, 119, 100, 25267, 2790, 20480, 1116, 1106, 1976, 9133, 1105, 1329, 1343, 3073, 4487, 9044, 3584, 1113, 170, 1549, 3087, 117, 2503, 118, 9253, 1172, 1113, 1240, 1319, 2233, 27948, 1105, 1173, 2934, 1172, 1114, 1103, 1661, 1113, 1412, 2235, 10960, 119, 1335, 1103, 1269, 1159, 117, 1296, 185, 25669, 8613, 13196, 13682, 1126, 4220, 1110, 3106, 2484, 20717, 1673, 1105, 1169, 1129, 5847, 1106, 9396, 3613, 1844, 7857, 119, 2009, 1431, 146, 1329, 11303, 1468, 136, 122, 119, 12167, 118, 1106, 118, 1329, 1352, 118, 1104, 118, 1103, 118, 1893, 3584, 131, 118,

Those inputs contain the input `IDs` and `attention masks` the model expects, as well as the `offsets` and the `overflow_to_sample_mapping` we just talked about. Since those two are not parameters used by the model, we’ll pop them out of the inputs (and we won’t store the map, since it’s not useful here) before converting it to a tensor:

In [None]:
_ = inputs.pop("overflow_to_sample_mapping")
offsets = inputs.pop("offset_mapping")

inputs = inputs.convert_to_tensors("pt")
print(inputs["input_ids"].shape)

torch.Size([2, 384])


Our long context was split in two, which means that after it goes through our model, we will have two sets of `start` and `end` logits:

In [None]:
outputs = model(**inputs)

start_logits = outputs.start_logits
end_logits = outputs.end_logits
print(start_logits.shape, end_logits.shape)

torch.Size([2, 384]) torch.Size([2, 384])


Like before, we first **mask the tokens that are not part of the context before taking the `softmax`**. We also **mask all the padding tokens** (as flagged by the *attention mask*).

In [None]:
inputs.tokens()

['[CLS]',
 'Which',
 'deep',
 'learning',
 'libraries',
 'back',
 '[UNK]',
 'Transformers',
 '?',
 '[SEP]',
 '[UNK]',
 'Transformers',
 ':',
 'State',
 'of',
 'the',
 'Art',
 'NL',
 '##P',
 '[UNK]',
 'Transformers',
 'provides',
 'thousands',
 'of',
 'pre',
 '##tra',
 '##ined',
 'models',
 'to',
 'perform',
 'tasks',
 'on',
 'texts',
 'such',
 'as',
 'classification',
 ',',
 'information',
 'extraction',
 ',',
 'question',
 'answering',
 ',',
 'sum',
 '##mar',
 '##ization',
 ',',
 'translation',
 ',',
 'text',
 'generation',
 'and',
 'more',
 'in',
 'over',
 '100',
 'languages',
 '.',
 'Its',
 'aim',
 'is',
 'to',
 'make',
 'cutting',
 '-',
 'edge',
 'NL',
 '##P',
 'easier',
 'to',
 'use',
 'for',
 'everyone',
 '.',
 '[UNK]',
 'Transformers',
 'provides',
 'API',
 '##s',
 'to',
 'quickly',
 'download',
 'and',
 'use',
 'those',
 'pre',
 '##tra',
 '##ined',
 'models',
 'on',
 'a',
 'given',
 'text',
 ',',
 'fine',
 '-',
 'tune',
 'them',
 'on',
 'your',
 'own',
 'data',
 '##sets',
 'and

In [None]:
sequence_ids = inputs.sequence_ids()

# Mask everything apart from the tokens of the context
mask = [i != 1 for i in sequence_ids]

# Unmask the [CLS] token
mask[0] = False

# Mask all the [PAD] tokens
mask = torch.logical_or(torch.tensor(mask)[None], (inputs["attention_mask"] == 0))

start_logits[mask] = -10000
end_logits[mask] = -10000

Then we can use the `softmax` to convert our logits to probabilities:

In [None]:
start_probabilities = torch.nn.functional.softmax(start_logits, dim=-1)
end_probabilities = torch.nn.functional.softmax(end_logits, dim=-1)

The next step is similar to what we did for the small context, but we **repeat it for each of our two chunks**. We attribute a **score to all possible spans of answer**, then **take the span with the best score**:

In [None]:
candidates = []
for start_probs, end_probs in zip(start_probabilities, end_probabilities):
    scores = start_probs[:, None] * end_probs[None, :]
    idx = torch.triu(scores).argmax().item()

    start_idx = idx // scores.shape[1]
    end_idx = idx % scores.shape[1]
    score = scores[start_idx, end_idx].item()
    candidates.append((start_idx, end_idx, score))

print(candidates)

[(0, 18, 0.33867061138153076), (173, 184, 0.9714869856834412)]


Those two candidates correspond to **the best answers the model was able to find in each chunk**. The model is way more confident the right answer is in the second part (which is a good sign!). Now we just have to **map those two token spans to spans of characters in the context** (we only need to map the second one to have our answer, but it’s interesting to see what the model has picked in the first chunk).

The `offsets` we grabbed earlier is actually a list of `offsets`, with one list per chunk of text:

In [None]:
for candidate, offset in zip(candidates, offsets):
    start_token, end_token, score = candidate
    start_char, _ = offset[start_token]
    _, end_char = offset[end_token]
    answer = long_context[start_char:end_char]
    result = {"answer": answer, "start": start_char, "end": end_char, "score": score}
    print(result)

{'answer': '\n🤗 Transformers: State of the Art NLP', 'start': 0, 'end': 37, 'score': 0.33867061138153076}
{'answer': 'Jax, PyTorch and TensorFlow', 'start': 1892, 'end': 1919, 'score': 0.9714869856834412}


If we ignore the first result, we get the same result as our pipeline for this long context — yay!
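
Rather than ignoring the first result by eye, we can pick the highest-scoring candidate programmatically. A minimal sketch, reusing the candidate tuples printed earlier; `best_chunk` and `best_candidate` are names introduced here for illustration.

```python
# (start_token, end_token, score) for each chunk, copied from the output above
candidates = [(0, 18, 0.33867061138153076), (173, 184, 0.9714869856834412)]

# Keep the chunk index with the candidate so the matching offsets can be looked up
best_chunk, best_candidate = max(enumerate(candidates), key=lambda item: item[1][2])
print(best_chunk, best_candidate)
```

The chunk index matters because token positions are relative to each chunk, so the character span has to be read from that chunk's own list of offsets.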

## **`Normalization` and Pre-Tokenization**
Before looking at the three most common subword tokenization algorithms used with Transformer models (**Byte-Pair Encoding [BPE]**, **WordPiece**, and **Unigram**), we first look at the preprocessing that each tokenizer applies to text.

Before splitting a text into subtokens (according to its model), the tokenizer performs two steps: ***normalization*** and ***pre-tokenization***.

### **Normalization**
The normalization step involves some *general cleanup*, such as *removing needless whitespace*, *lowercasing*, and/or *removing accents*. If you’re familiar with **Unicode normalization** (such as NFC or NFKC), this is also something the tokenizer may apply.

The 🤗 Transformers tokenizer has an attribute called `backend_tokenizer` that provides **access to the underlying tokenizer** from the 🤗 Tokenizers library:

In [None]:
from transformers import AutoTokenizer

# tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
print(type(tokenizer.backend_tokenizer))

The `normalizer` attribute of the tokenizer object has a `normalize_str()` method that we can use to see **how the normalization is performed**:

In [None]:
print(tokenizer.backend_tokenizer.normalizer.normalize_str("Héllò hôw are ü?"))

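With the `bert-base-uncased` checkpoint, the normalizer applies lowercasing and removes the accents. The cased checkpoint loaded above is expected to leave both casing and accents in place.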
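
The lowercasing and accent removal can be approximated with the standard library alone. This sketch is not the library's normalizer: `strip_accents_and_lowercase` is a hypothetical helper that decomposes characters with NFD, drops the combining marks, and lowercases the result.

```python
import unicodedata


def strip_accents_and_lowercase(text):
    """Rough stand-in for an uncased normalizer: NFD, drop accents, lowercase."""
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (nonspacing mark) covers the combining accents split off by NFD
    without_marks = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return without_marks.lower()


normalized = strip_accents_and_lowercase("Héllò hôw are ü?")
print(normalized)
```

NFD splits a character such as `é` into a base letter plus a combining accent, which is what makes the accent easy to filter out.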

### **Pre-Tokenization**
A tokenizer cannot be trained on raw text alone. Instead, we first need to **split the texts into small entities**, like *words* (for example by splitting on whitespace and punctuation). That’s where the pre-tokenization step comes in. Those words will be the boundaries of the subtokens the tokenizer can learn during its training.

To see **how a fast tokenizer performs** `pre-tokenization`, we can use the `pre_tokenize_str()` method of the `pre_tokenizer` attribute of the `tokenizer` object:

In [None]:
tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello, how are  you?")

Notice how the `tokenizer` is already keeping track of the offsets, which is how it can give us the offset mapping we used in the previous section. Here the tokenizer ignores the two spaces and replaces them with just one, but the offset jumps between `are` and `you` to account for that.
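
The jump in offsets is easy to see with a regular-expression approximation of this splitting. The sketch below is not how the library implements pre-tokenization; `pre_tokenize_with_offsets` is a hypothetical helper that treats runs of word characters and single punctuation marks as pieces.

```python
import re


def pre_tokenize_with_offsets(text):
    """Split on whitespace and punctuation, keeping (start, end) character offsets."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark
    return [(m.group(), m.span()) for m in re.finditer(r"\w+|[^\w\s]", text)]


pieces = pre_tokenize_with_offsets("Hello, how are  you?")
print(pieces)
```

`are` ends at character 14 while `you` starts at 16: the two spaces are dropped from the pieces, but the offsets still point into the original string.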

Since we’re using a `BERT` tokenizer, the pre-tokenization involves splitting on whitespace and punctuation. Other tokenizers can have different rules for this step. For example, if we use the **`GPT-2` tokenizer**:

In [None]:
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello, how are  you?")

# [('Hello', (0, 5)), (',', (5, 6)), ('Ġhow', (6, 10)), ('Ġare', (10, 14)), ('Ġ', (14, 15)), ('Ġyou', (15, 19)), ('?', (19, 20))]

It will split on whitespace and punctuation as well, but it will keep the spaces and replace them with a `Ġ` symbol.

📔 Also note that unlike the BERT tokenizer, this (GPT-2) tokenizer does not ignore the double space.

For a last example, let’s have a look at the **`T5` tokenizer**, which is based on the `SentencePiece` algorithm:

In [None]:
tokenizer = AutoTokenizer.from_pretrained("t5-small")
tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello, how are  you?")

# [('▁Hello,', (0, 6)), ('▁how', (7, 10)), ('▁are', (11, 14)), ('▁you?', (16, 20))]

Like the GPT-2 tokenizer, this one *keeps spaces and replaces them with a specific token (`▁`)*, but the T5 tokenizer *only splits on whitespace*, not punctuation. Also note that it added a space by default at the beginning of the sentence (before `Hello`) and ignored the double space between `are` and `you`.

### **SentencePiece**
SentencePiece is a tokenization algorithm for the preprocessing of text that you can use with any of the models. It considers the text as a sequence of Unicode characters, and replaces spaces with a special character, `▁`. Used in conjunction with the *Unigram algorithm* (see section 7), it *doesn’t even require a pre-tokenization step*, which is very useful for languages where the space character is not used (like Chinese or Japanese).

The other main feature of `SentencePiece` is **reversible tokenization**: since there is no special treatment of spaces, **decoding the tokens** is done simply by **concatenating them and replacing the `▁`s with spaces**, which yields the normalized text. As we saw earlier, the *`BERT tokenizer` removes repeating spaces, so its tokenization is not reversible.*
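
Decoding by concatenation can be checked directly on the pieces from the T5 example above. This is a small sketch in plain Python with the pieces typed in by hand; it does not call the tokenizer.

```python
# Pieces as produced by the T5 pre-tokenization example above
pieces = ["▁Hello,", "▁how", "▁are", "▁you?"]

# Concatenate, then turn every "▁" back into a regular space
decoded = "".join(pieces).replace("▁", " ")
print(repr(decoded))
```

The result keeps the leading space that was added in front of `Hello`, and the double space of the raw input is gone, so what we recover is the normalized text rather than the original string.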

## **`Byte-Pair Encoding` Tokenization**
**Byte-Pair Encoding** (BPE) was initially developed as an **algorithm to compress texts**, and then used by OpenAI for tokenization when pretraining the GPT model. It’s used by a lot of Transformer models, including GPT, GPT-2, RoBERTa, BART, and DeBERTa.

### **Training Algorithm**
Compute the **unique set of words** used in the corpus ➡ **build the base vocabulary** by taking all the symbols used to write those words (`"hugs"` 👉 `["h","u","g","s"]`) ➡ **add new tokens** until the desired vocabulary size is reached by **learning merges** (rules that merge two elements of the vocabulary into a new one)

At any step during the tokenizer training, the **BPE algorithm** will search for **the most frequent pair of existing tokens** (by “pair,” here we mean two consecutive tokens in a word). That **most frequent pair is the one that will be merged**, and we rinse and repeat for the next step.

**Initial**:
```
corpus -> "hug", "pug", "pun", "bun", "hugs"
vocabs -> ["b", "g", "h", "n", "p", "s", "u"]
```
**Frequencies (*training*)**:
```
freq -> ("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)
train -> ("h" "u" "g", 10), ("p" "u" "g", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "u" "g" "s", 5)
```
**Pairing** (*most frequent pair*):
```
first-merge = ("u", "g") -> "ug"
Vocabulary: ["b", "g", "h", "n", "p", "s", "u", "ug"]
Corpus: ("h" "ug", 10), ("p" "ug", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "ug" "s", 5)
```
Second pairing:
```
s2 = ("u", "n") -> "un"
Vocabulary: ["b", "g", "h", "n", "p", "s", "u", "ug", "un"]
Corpus: ("h" "ug", 10), ("p" "ug", 5), ("p" "un", 12), ("b" "un", 4), ("h" "ug" "s", 5)
```
Third pairing:
```
s3 = ("h", "ug") -> "hug"
Vocabulary: ["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"]
Corpus: ("hug", 10), ("p" "ug", 5), ("p" "un", 12), ("b" "un", 4), ("hug" "s", 5)
```
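
The three merges above can be checked with a short pure-Python sketch on the same toy corpus. `count_pairs` and `apply_merge` are hypothetical helpers written for this illustration, and the number of merges is fixed to three to match the walkthrough.

```python
from collections import Counter


def count_pairs(splits, word_freqs):
    """Count adjacent symbol pairs, weighted by the frequency of each word."""
    pair_counts = Counter()
    for word, symbols in splits.items():
        for left, right in zip(symbols, symbols[1:]):
            pair_counts[(left, right)] += word_freqs[word]
    return pair_counts


def apply_merge(symbols, pair):
    """Replace every occurrence of `pair` in `symbols` with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
splits = {word: list(word) for word in word_freqs}
vocab = sorted({symbol for word in word_freqs for symbol in word})
merges = []

for _ in range(3):  # three merges, as in the walkthrough above
    pair_counts = count_pairs(splits, word_freqs)
    best_pair = max(pair_counts, key=pair_counts.get)
    merges.append(best_pair)
    vocab.append("".join(best_pair))
    splits = {word: apply_merge(symbols, best_pair) for word, symbols in splits.items()}

print(merges)
print(vocab)
```

In the first round `("u", "g")` wins with 10 + 5 + 5 = 20 occurrences, ahead of `("p", "u")` with 17 and `("u", "n")` with 16.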

### **Tokenization Algorithm**
Tokenization follows the training process closely:
1. Normalization
2. Pre-tokenization
3. Splitting the words into individual characters
4. Applying the merge rules learned in order on those splits

For example, with the three merge rules learned during training:
```
("u", "g") -> "ug"
("u", "n") -> "un"
("h", "ug") -> "hug"
```
The word `"bug"` will be tokenized as `["b", "ug"]`. `"mug"`, however, will be tokenized as `["[UNK]", "ug"]` since the letter `"m"` was not in the base vocabulary.
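
A minimal sketch of this tokenization step, applied to a single pre-tokenized word. `bpe_tokenize` is a hypothetical helper; it assumes characters outside the base vocabulary are mapped to `[UNK]` before the merges are applied in the order they were learned.

```python
def bpe_tokenize(word, merges, base_vocab, unk_token="[UNK]"):
    """Tokenize one pre-tokenized word by applying `merges` in learned order."""
    # Characters outside the base vocabulary become the unknown token
    symbols = [ch if ch in base_vocab else unk_token for ch in word]
    for left, right in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == left and symbols[i + 1] == right:
                symbols[i:i + 2] = [left + right]  # merge the pair in place
            else:
                i += 1
    return symbols


base_vocab = {"b", "g", "h", "n", "p", "s", "u"}
merges = [("u", "g"), ("u", "n"), ("h", "ug")]

print(bpe_tokenize("bug", merges, base_vocab))
print(bpe_tokenize("mug", merges, base_vocab))
print(bpe_tokenize("hugs", merges, base_vocab))
```

The order of the rules matters: `"hugs"` only becomes `["hug", "s"]` because `("u", "g")` has already produced `"ug"` by the time `("h", "ug")` is tried.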

### **Implementing BPE**

In [1]:
# Create a simple corpus made of a few sentences
corpus = [
    "This is the Hugging Face Course.",
    "This chapter is about tokenization.",
    "This section shows several tokenizer algorithms.",
    "Hopefully, you will be able to understand how they are trained and generate tokens.",
]

Next, we need to **`pre-tokenize` that corpus into words**. Since we are replicating a **BPE tokenizer** (like GPT-2), we will use the `gpt2` tokenizer for the pre-tokenization:

In [2]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

**Compute the frequencies of each word in the corpus** as we do the pre-tokenization:

In [3]:
from collections import defaultdict

word_freqs = defaultdict(int)

for text in corpus:
    words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    new_words = [word for word, offset in words_with_offsets]
    for word in new_words:
        word_freqs[word] += 1

print(word_freqs)

defaultdict(<class 'int'>, {'This': 3, 'Ġis': 2, 'Ġthe': 1, 'ĠHugging': 1, 'ĠFace': 1, 'ĠCourse': 1, '.': 4, 'Ġchapter': 1, 'Ġabout': 1, 'Ġtokenization': 1, 'Ġsection': 1, 'Ġshows': 1, 'Ġseveral': 1, 'Ġtokenizer': 1, 'Ġalgorithms': 1, 'Hopefully': 1, ',': 1, 'Ġyou': 1, 'Ġwill': 1, 'Ġbe': 1, 'Ġable': 1, 'Ġto': 1, 'Ġunderstand': 1, 'Ġhow': 1, 'Ġthey': 1, 'Ġare': 1, 'Ġtrained': 1, 'Ġand': 1, 'Ġgenerate': 1, 'Ġtokens': 1})


The next step is to **compute the base vocabulary**, formed by all the characters used in the corpus:

In [4]:
alphabet = []

for word in word_freqs.keys():
    for letter in word:
        if letter not in alphabet:
            alphabet.append(letter)
alphabet.sort()

print(alphabet)

[',', '.', 'C', 'F', 'H', 'T', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'v', 'w', 'y', 'z', 'Ġ']


We also **add the special tokens used by the model at the beginning of that vocabulary**. In the case of `GPT-2`, the only special token is `"<|endoftext|>"`.

In [5]:
vocab = ["<|endoftext|>"] + alphabet.copy()

We now need to **split each word into individual characters**, to be able to start training:

In [6]:
splits = {word: [c for c in word] for word in word_freqs.keys()}
splits

{'This': ['T', 'h', 'i', 's'],
 'Ġis': ['Ġ', 'i', 's'],
 'Ġthe': ['Ġ', 't', 'h', 'e'],
 'ĠHugging': ['Ġ', 'H', 'u', 'g', 'g', 'i', 'n', 'g'],
 'ĠFace': ['Ġ', 'F', 'a', 'c', 'e'],
 'ĠCourse': ['Ġ', 'C', 'o', 'u', 'r', 's', 'e'],
 '.': ['.'],
 'Ġchapter': ['Ġ', 'c', 'h', 'a', 'p', 't', 'e', 'r'],
 'Ġabout': ['Ġ', 'a', 'b', 'o', 'u', 't'],
 'Ġtokenization': ['Ġ',
  't',
  'o',
  'k',
  'e',
  'n',
  'i',
  'z',
  'a',
  't',
  'i',
  'o',
  'n'],
 'Ġsection': ['Ġ', 's', 'e', 'c', 't', 'i', 'o', 'n'],
 'Ġshows': ['Ġ', 's', 'h', 'o', 'w', 's'],
 'Ġseveral': ['Ġ', 's', 'e', 'v', 'e', 'r', 'a', 'l'],
 'Ġtokenizer': ['Ġ', 't', 'o', 'k', 'e', 'n', 'i', 'z', 'e', 'r'],
 'Ġalgorithms': ['Ġ', 'a', 'l', 'g', 'o', 'r', 'i', 't', 'h', 'm', 's'],
 'Hopefully': ['H', 'o', 'p', 'e', 'f', 'u', 'l', 'l', 'y'],
 ',': [','],
 'Ġyou': ['Ġ', 'y', 'o', 'u'],
 'Ġwill': ['Ġ', 'w', 'i', 'l', 'l'],
 'Ġbe': ['Ġ', 'b', 'e'],
 'Ġable': ['Ġ', 'a', 'b', 'l', 'e'],
 'Ġto': ['Ġ', 't', 'o'],
 'Ġunderstand': ['Ġ', 'u', 'n'

Now that we are ready for **training**, let’s write a function that **computes the frequency of each pair**.

In [8]:
def compute_pair_freqs(splits):
    pair_freqs = defaultdict(int)
    for word, freq in word_freqs.items():  # e.g. ("This", 3)
        split = splits[word]  # e.g. ["T", "h", "i", "s"]
        if len(split) == 1:
            continue  # a single symbol contains no pair
        for i in range(len(split) - 1):
            pair = (split[i], split[i + 1])  # e.g. ("T", "h")
            pair_freqs[pair] += freq  # each pair counts once per occurrence of the word

    return pair_freqs

In [9]:
pair_freqs = compute_pair_freqs(splits)

for i, key in enumerate(pair_freqs.keys()):
    print(f"{key}: {pair_freqs[key]}")
    if i >= 5:
        break

('T', 'h'): 3
('h', 'i'): 3
('i', 's'): 5
('Ġ', 'i'): 2
('Ġ', 't'): 7
('t', 'h'): 3


Now, finding **the most frequent pair** only takes a quick loop:

In [10]:
best_pair = ""
max_freq = None

for pair, freq in pair_freqs.items():
    if max_freq is None or max_freq < freq:
        best_pair = pair
        max_freq = freq

print(best_pair, max_freq)

('Ġ', 't') 7


So the first merge to learn is `('Ġ', 't') -> 'Ġt'`, and we add `'Ġt'` to the vocabulary:

In [11]:
merges = {("Ġ", "t"): "Ġt"}
vocab.append("Ġt")

To continue, we need to **apply that merge in our splits dictionary**.

In [12]:
def merge_pair(a, b, splits):  # e.g. a="Ġ", b="t"
    for word in word_freqs:
        split = splits[word]  # current list of symbols for this word
        if len(split) == 1:
            continue

        i = 0
        while i < len(split) - 1:
            if split[i] == a and split[i + 1] == b:
                # Replace the two adjacent symbols with their concatenation
                split = split[:i] + [a + b] + split[i + 2 :]
            else:
                i += 1
        splits[word] = split
    return splits

In [13]:
splits = merge_pair("Ġ", "t", splits)
print(splits["Ġtrained"])

['Ġt', 'r', 'a', 'i', 'n', 'e', 'd']


Now we have everything we need to loop until we have learned all the merges we want. Let’s aim for a vocab size of 50:

In [14]:
vocab_size = 50

while len(vocab) < vocab_size:
    pair_freqs = compute_pair_freqs(splits)
    best_pair = ""
    max_freq = None
    for pair, freq in pair_freqs.items():
        if max_freq is None or max_freq < freq:
            best_pair = pair
            max_freq = freq
    splits = merge_pair(*best_pair, splits)
    merges[best_pair] = best_pair[0] + best_pair[1]
    vocab.append(best_pair[0] + best_pair[1])

As a result, we’ve learned 19 merge rules (the initial vocabulary had a size of 31: 30 characters in the alphabet, plus the special token):

In [15]:
print(merges)

{('Ġ', 't'): 'Ġt', ('i', 's'): 'is', ('e', 'r'): 'er', ('Ġ', 'a'): 'Ġa', ('Ġt', 'o'): 'Ġto', ('e', 'n'): 'en', ('T', 'h'): 'Th', ('Th', 'is'): 'This', ('o', 'u'): 'ou', ('s', 'e'): 'se', ('Ġto', 'k'): 'Ġtok', ('Ġtok', 'en'): 'Ġtoken', ('n', 'd'): 'nd', ('Ġ', 'is'): 'Ġis', ('Ġt', 'h'): 'Ġth', ('Ġth', 'e'): 'Ġthe', ('i', 'n'): 'in', ('Ġa', 'b'): 'Ġab', ('Ġtoken', 'i'): 'Ġtokeni'}


In [16]:
print(vocab)

['<|endoftext|>', ',', '.', 'C', 'F', 'H', 'T', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'v', 'w', 'y', 'z', 'Ġ', 'Ġt', 'is', 'er', 'Ġa', 'Ġto', 'en', 'Th', 'This', 'ou', 'se', 'Ġtok', 'Ġtoken', 'nd', 'Ġis', 'Ġth', 'Ġthe', 'in', 'Ġab', 'Ġtokeni']
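
The steps above can also be collected into a single function. The following is a minimal, self-contained sketch rather than the notebook's exact code: the names `merge_symbols` and `train_bpe`, the `num_merges` parameter, and the toy word counts are assumptions made for illustration. Ties are broken by first occurrence, as in the loop above.

```python
from collections import Counter


def merge_symbols(symbols, pair):
    """Return `symbols` with every adjacent occurrence of `pair` fused."""
    merged, i = [], 0
    while i < len(symbols):
        if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def train_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` BPE merges from a {word: count} mapping."""
    splits = {word: list(word) for word in word_freqs}
    merges = []
    for _ in range(num_merges):
        # Count adjacent pairs, weighted by how often each word occurs.
        pair_freqs = Counter()
        for word, freq in word_freqs.items():
            symbols = splits[word]
            for pair in zip(symbols, symbols[1:]):
                pair_freqs[pair] += freq
        if not pair_freqs:
            break  # every word is already a single symbol
        # max() keeps the first pair seen among those with the top count.
        best_pair = max(pair_freqs, key=pair_freqs.get)
        merges.append(best_pair)
        splits = {w: merge_symbols(s, best_pair) for w, s in splits.items()}
    return merges, splits


# Hypothetical toy counts, chosen so that no tie occurs.
toy_word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
merges_toy, splits_toy = train_bpe(toy_word_freqs, num_merges=3)
print(merges_toy)          # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
print(splits_toy["hugs"])  # ['hug', 's']
```

Stopping after a fixed number of merges is equivalent to the vocabulary-size target used above, since each merge adds exactly one token.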


💡 Using `train_new_from_iterator()` on the same corpus won’t result in the exact same vocabulary. This is because when there is a choice of the most frequent pair, we selected the first one encountered, while the 🤗 Tokenizers library selects the first one based on its inner IDs.
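
To see why the tie-breaking rule matters, here is a small sketch. The pair counts and the `token_ids` table are made up for illustration, and the ID-based rule is only a stand-in for an ordering a library might apply internally. It is not a description of the 🤗 Tokenizers implementation.

```python
# Two pairs are tied for the highest count (hypothetical numbers).
tied_pair_freqs = {("a", "b"): 3, ("c", "d"): 3, ("e", "f"): 1}

# Rule used in this notebook: keep the first maximal pair encountered.
# Python's max() likewise returns the first maximal item in iteration order.
first_seen = max(tied_pair_freqs, key=tied_pair_freqs.get)

# Hypothetical alternative: among tied pairs, prefer the smallest token IDs.
token_ids = {"c": 0, "b": 1, "d": 2, "e": 3, "f": 4, "a": 5}


def by_count_then_ids(pair):
    left, right = pair
    # Negated IDs so that max() favours smaller IDs when counts are equal.
    return (tied_pair_freqs[pair], -token_ids[left], -token_ids[right])


id_based = max(tied_pair_freqs, key=by_count_then_ids)

print(first_seen)  # ('a', 'b')
print(id_based)    # ('c', 'd')
```

Once the first merge differs, the pair counts in every later round can differ as well, so the two vocabularies may diverge.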

To **tokenize a new text**, we pre-tokenize it, split it into characters, then apply all the merge rules in the order they were learned:

In [17]:
def tokenize(text):
    # Pre-tokenize into (word, offsets) pairs and keep only the words
    pre_tokenize_result = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    pre_tokenized_text = [word for word, offset in pre_tokenize_result]
    # Split every word into single characters
    splits = [[l for l in word] for word in pre_tokenized_text]
    # Apply each merge in the order it was learned, e.g. ('Ġ', 't') -> 'Ġt'
    for pair, merge in merges.items():
        for idx, split in enumerate(splits):
            i = 0
            while i < len(split) - 1:
                if split[i] == pair[0] and split[i + 1] == pair[1]:
                    split = split[:i] + [merge] + split[i + 2 :]
                else:
                    i += 1
            splits[idx] = split

    # Flatten the per-word lists into a single list of tokens
    return sum(splits, [])

In [18]:
tokenize("This is not a token.")

['This', 'Ġis', 'Ġ', 'n', 'o', 't', 'Ġa', 'Ġtoken', '.']

In [28]:
# Example with emojis
print(tokenize("😀This is smiley😀"))

['ð', 'Ł', 'ĺ', 'Ģ', 'This', 'Ġis', 'Ġ', 's', 'm', 'i', 'l', 'e', 'y', 'ð', 'Ł', 'ĺ', 'Ģ']
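
The four odd-looking symbols produced for each emoji are not part of our 50-entry vocabulary. One way to check this is to look tokens up by their position in the vocabulary. The sketch below is self-contained, so it repeats the vocabulary printed earlier under the name `vocab_snapshot`. Using the list index as the token ID and the helper name `missing_tokens` are assumptions for illustration.

```python
vocab_snapshot = [
    '<|endoftext|>', ',', '.', 'C', 'F', 'H', 'T', 'a', 'b', 'c', 'd', 'e',
    'f', 'g', 'h', 'i', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u',
    'v', 'w', 'y', 'z', 'Ġ', 'Ġt', 'is', 'er', 'Ġa', 'Ġto', 'en', 'Th',
    'This', 'ou', 'se', 'Ġtok', 'Ġtoken', 'nd', 'Ġis', 'Ġth', 'Ġthe', 'in',
    'Ġab', 'Ġtokeni',
]

# Assumption: a token's ID is simply its index in the vocabulary list.
token_to_id = {token: idx for idx, token in enumerate(vocab_snapshot)}


def missing_tokens(tokens):
    """Return the tokens that have no entry in the vocabulary."""
    return [tok for tok in tokens if tok not in token_to_id]


known = ['This', 'Ġis', 'Ġ', 'n', 'o', 't', 'Ġa', 'Ġtoken', '.']
with_emoji = ['ð', 'Ł', 'ĺ', 'Ģ', 'This', 'Ġis']

known_ids = [token_to_id[tok] for tok in known]
print(known_ids)                   # [38, 44, 30, 19, 20, 24, 34, 42, 2]
print(missing_tokens(with_emoji))  # ['ð', 'Ł', 'ĺ', 'Ģ']
```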


⚠️ Our implementation does nothing to handle unknown characters. As the emoji example above shows, symbols that are missing from the vocabulary simply come out as single-character tokens, and any later lookup of their IDs would fail. **`GPT-2` doesn’t actually have an unknown token** (*it’s impossible to get an unknown character when using byte-level BPE*), but this can happen here because we did not include all the possible bytes in the initial vocabulary. This aspect of BPE is beyond the scope of this section, so we’ve left the details out.
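
As a brief illustration of why byte-level BPE never meets an unknown character, the sketch below rebuilds a byte-to-character table in the style of the one published with GPT-2. It is written from memory of that scheme, so treat the function name `bytes_to_unicode` and the exact ranges as assumptions to verify against the reference code. Printable bytes keep their own character and the remaining bytes are shifted to code points from 256 upward, so all 256 byte values get a visible symbol.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that already correspond to printable, non-space characters.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Space, control characters and similar bytes are moved to
            # otherwise unused code points starting at 256.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(cp) for b, cp in zip(byte_values, code_points)}


byte_to_char = bytes_to_unicode()
emoji_symbols = "".join(byte_to_char[b] for b in "😀".encode("utf-8"))

print(len(byte_to_char))       # 256
print(byte_to_char[ord(" ")])  # Ġ
print(emoji_symbols)           # ðŁĺĢ
```

A base vocabulary built from these 256 symbols, rather than only the 30 characters seen in the corpus, would cover any input text.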

## **`WordPiece` Tokenization**

## **`Unigram` Tokenization**

## **Building a Tokenizer, block-by-block**