# Fast tokenizers' special powers


Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [1]:
%%capture
!pip install datasets evaluate transformers[sentencepiece]

Also, log in to Hugging Face.

In [2]:
from huggingface_hub import notebook_login

notebook_login()


In this section we will take a closer look at the <font color='blue'>capabilities</font> of the <font color='blue'>tokenizers</font> in 🤗 Transformers. Up to now we have only used them to <font color='blue'>tokenize inputs</font> or <font color='blue'>decode IDs</font> back into text, but tokenizers -- especially those backed by the 🤗 Tokenizers library -- can do a lot more. To illustrate these additional features, we will explore how to reproduce the results of the `token-classification` (that we called `ner`) and `question-answering` pipelines that we first encountered in [Chapter 1](https://huggingface.co/learn/llm-course/chapter1/3).

In the following discussion, we will often make the distinction between <font color='blue'>slow</font> and <font color='blue'>fast</font> tokenizers. <font color='blue'>Slow</font> tokenizers are those <font color='blue'>written in Python</font> inside the 🤗 Transformers library, while the <font color='blue'>fast</font> versions are the ones provided by 🤗 Tokenizers, which are <font color='blue'>written in Rust</font>. If you remember the table from [Chapter 5](https://huggingface.co/learn/llm-course/chapter5/3) that reported how long it took a fast and a slow tokenizer to tokenize the Drug Review Dataset, you should have an idea of why we call them fast and slow:

|                 | Fast tokenizer | Slow tokenizer |
|:---------------:|:--------------:|:--------------:|
| `batched=True`  | 10.8s          | 4min41s        |
| `batched=False` | 59.2s          | 5min3s         |
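
As a rough back-of-the-envelope check, the timings in the table can be turned into speedup ratios. The helper name `to_seconds` is made up for this sketch, and the numbers are simply the ones reported in the table:

```python
def to_seconds(minutes=0, seconds=0.0):
    """Convert a minutes/seconds timing into seconds (hypothetical helper)."""
    return 60 * minutes + seconds


# Timings copied from the table above
timings = {
    "batched=True": {"fast": to_seconds(seconds=10.8), "slow": to_seconds(4, 41)},
    "batched=False": {"fast": to_seconds(seconds=59.2), "slow": to_seconds(5, 3)},
}

# Ratio of slow time to fast time for each setting
speedups = {mode: t["slow"] / t["fast"] for mode, t in timings.items()}

for mode, ratio in speedups.items():
    print(f"{mode}: the fast tokenizer is about {ratio:.1f}x faster")
```

With batching the gap is roughly 26x, while without batching it shrinks to roughly 5x, which is consistent with the idea that the fast tokenizer benefits most from processing many texts at once.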

<Tip warning={true}>

⚠️ When <font color='blue'>tokenizing</font> a <font color='blue'>single sentence</font>, you won't always <font color='blue'>see a difference</font> in speed between the slow and fast versions of the same tokenizer. In fact, the <font color='blue'>fast version</font> might actually be <font color='blue'>slower</font>! It's only when <font color='blue'>tokenizing</font> lots of <font color='blue'>texts</font> in <font color='blue'>parallel</font> at the same time that you will be able to clearly see the difference.

</Tip>

## Batch encoding


The <font color='blue'>output</font> of a <font color='blue'>tokenizer</font> isn't a simple Python dictionary; what we get is actually a special <font color='blue'>`BatchEncoding` object</font>. It's a <font color='blue'>subclass</font> of a <font color='blue'>dictionary</font> (which is why we were able to index into that result without any problem before), but with <font color='blue'>additional methods</font> that are mostly used by <font color='blue'>fast tokenizers</font>.

Besides their parallelization capabilities, the <font color='blue'>key functionality</font> of <font color='blue'>fast tokenizers</font> is that they always keep track of the <font color='blue'>original span</font> of <font color='blue'>texts</font> the <font color='blue'>final tokens come from</font> -- a feature we call <font color='blue'>offset mapping</font>. This in turn unlocks features like mapping each word to the tokens it generated or mapping each character of the original text to the token it's inside, and vice versa.

Let's take a look at an example:

In [39]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
encoding = tokenizer(example)
print(type(encoding))

<class 'transformers.tokenization_utils_base.BatchEncoding'>


As mentioned previously, we get a <font color='blue'>BatchEncoding object</font> in the <font color='blue'>tokenizer's output</font>. Since the `AutoTokenizer` class picks a <font color='blue'>fast tokenizer</font> by <font color='blue'>default</font>, we can use the <font color='blue'>additional methods</font> this <font color='blue'>`BatchEncoding`</font> object provides. We have two ways to check if our tokenizer is a fast or a slow one. We can either check the attribute `is_fast` of the `tokenizer`:


In [4]:
tokenizer.is_fast

True

or check the same attribute of our `encoding`:

In [5]:
encoding.is_fast

True

Let's see what a <font color='blue'>fast tokenizer</font> enables us to do. First, we can <font color='blue'>access the tokens</font> without having to convert the IDs back to tokens:

In [6]:
encoding.tokens()

['[CLS]',
 'My',
 'name',
 'is',
 'S',
 '##yl',
 '##va',
 '##in',
 'and',
 'I',
 'work',
 'at',
 'Hu',
 '##gging',
 'Face',
 'in',
 'Brooklyn',
 '.',
 '[SEP]']

In this case the token at <font color='blue'>index 5</font> is <font color='blue'>`##yl`</font>, which is part of the word <font color='blue'>Sylvain</font> in the original sentence. We can also use the <font color='blue'>`word_ids()` method</font> to get the <font color='blue'>index of the word</font> each token comes from:


In [7]:
encoding.word_ids()

[None, 0, 1, 2, 3, 3, 3, 3, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, None]

We can see that the tokenizer's special tokens <font color='blue'>`[CLS]`</font> and <font color='blue'>`[SEP]`</font> are mapped to <font color='blue'>`None`</font>, and then each <font color='blue'>token</font> is <font color='blue'>mapped</font> to the <font color='blue'>word</font> it <font color='blue'>originates from</font>. This is especially useful to determine if a <font color='blue'>token</font> is at the <font color='blue'>start</font> of a <font color='blue'>word</font> or if <font color='blue'>two tokens</font> are <font color='blue'>in</font> the <font color='blue'>same word</font>. We could rely on the <font color='blue'>`##` prefix</font> for that, but it only works for <font color='blue'>BERT-like tokenizers</font>; this method works for any type of tokenizer as long as it's a fast one. In the next chapter, we'll see how we can use this capability to apply the <font color='blue'>labels</font> we have for each word properly to the tokens in tasks like named entity recognition (NER) and part-of-speech (POS) tagging. We can also use it to <font color='blue'>mask all the tokens</font> coming from the <font color='blue'>same word</font> in masked language modeling (a technique called <font color='blue'>whole word masking</font>).
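
To make this concrete, here is a minimal sketch that works directly on the `tokens()` and `word_ids()` lists shown above (copied by hand, so no tokenizer is needed to run it). The helper names `word_start_flags` and `tokens_by_word` are made up for illustration:

```python
# Lists copied from the outputs of encoding.tokens() and encoding.word_ids() above
tokens = ['[CLS]', 'My', 'name', 'is', 'S', '##yl', '##va', '##in', 'and', 'I',
          'work', 'at', 'Hu', '##gging', 'Face', 'in', 'Brooklyn', '.', '[SEP]']
word_ids = [None, 0, 1, 2, 3, 3, 3, 3, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, None]


def word_start_flags(word_ids):
    """True for tokens that start a word, False for continuations and special tokens."""
    flags = []
    previous = None
    for word_id in word_ids:
        flags.append(word_id is not None and word_id != previous)
        previous = word_id
    return flags


def tokens_by_word(tokens, word_ids):
    """Group tokens by the word they come from, skipping special tokens."""
    groups = {}
    for token, word_id in zip(tokens, word_ids):
        if word_id is not None:
            groups.setdefault(word_id, []).append(token)
    return groups


flags = word_start_flags(word_ids)
groups = tokens_by_word(tokens, word_ids)
print(groups[3])  # all the tokens of the word at index 3
```

The groups are the units one would mask together for whole word masking, and the start flags are what one would use to decide which token of a word receives that word's label.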

<Tip>

The notion of what a word is is complicated. For instance, does "I'll" (a contraction of "I will") count as one or two words? It actually <font color='blue'>depends</font> on the <font color='blue'>tokenizer</font> and the <font color='blue'>pre-tokenization operation</font> it applies. Some tokenizers just split on spaces, so they will consider this as one word. Others use punctuation on top of spaces, so they will consider it two words.

✏️ **Try it out!** Create a tokenizer from the `bert-base-cased` and `roberta-base` checkpoints and <font color='blue'>tokenize "81s"</font> with them. What do you observe? What are the word IDs?

</Tip>

In [47]:
from transformers import AutoTokenizer

# Create tokenizers for BERT and RoBERTa
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
roberta_tokenizer = AutoTokenizer.from_pretrained("roberta-base")

test_string = "81s"

# Tokenization with BERT
bert_encoding = bert_tokenizer(test_string)
print("Tokenization with BERT\n")
print(f"Input: '{test_string}'")
print(f"Tokens: {bert_tokenizer.tokenize(test_string)}")
print(f"Token IDs: {bert_encoding['input_ids']}")
print(f"Word IDs: {bert_encoding.word_ids()}\n")

# Tokenization with RoBERTa
roberta_encoding = roberta_tokenizer(test_string)
print("Tokenization with RoBERTa\n")
print(f"Input: '{test_string}'")
print(f"Tokens: {roberta_tokenizer.tokenize(test_string)}")
print(f"Token IDs: {roberta_encoding['input_ids']}")
print(f"Word IDs: {roberta_encoding.word_ids()}")


Tokenization with BERT

Input: '81s'
Tokens: ['81', '##s']
Token IDs: [101, 5615, 1116, 102]
Word IDs: [None, 0, 0, None]

Tokenization with RoBERTa

Input: '81s'
Tokens: ['81', 's']
Token IDs: [0, 6668, 29, 2]
Word IDs: [None, 0, 1, None]


This comparison shows <font color='blue'>differences</font> between the <font color='blue'>BERT</font> and <font color='blue'>RoBERTa tokenization</font> strategies. <font color='blue'>BERT</font> employs the <font color='blue'>WordPiece algorithm</font>, which splits `81s` into subword pieces (`81` + `##s`),  with <font color='blue'>both tokens</font> receiving <font color='blue'>`word_id = 0`</font> since they <font color='blue'>originate</font> from the <font color='blue'>same word</font>. In contrast, <font color='blue'>RoBERTa</font> tokenizes `81s` as `['81', 's']` with word IDs `[None, 0, 1, None]`, meaning it treats <font color='blue'>`81`</font> and <font color='blue'>`s`</font> as <font color='blue'>separate words</font> rather than <font color='blue'>parts</font> of the <font color='blue'>same word</font> like BERT does. Both approaches assign `word_id = None` to their respective special tokens (`[CLS]/[SEP]` for BERT, `<s>/</s>` for RoBERTa).

Similarly, there is a <font color='blue'>`sequence_ids()` method</font> that we can use to <font color='blue'>map</font> a <font color='blue'>token</font> to the <font color='blue'>sentence it came from</font> (though in this case, the `token_type_ids` returned by the tokenizer can give us the same information).

Lastly, we can map <font color='blue'>any word</font> or <font color='blue'>token</font> to <font color='blue'>characters</font> in the <font color='blue'>original text</font>, and vice versa, via the `word_to_chars()` or `token_to_chars()` and `char_to_word()` or `char_to_token()` methods. For instance, the <font color='blue'>`word_ids()` method</font> told us that <font color='blue'>`##yl`</font> is <font color='blue'>part</font> of the <font color='blue'>word</font> at <font color='blue'>index 3</font>, but which word is it in the sentence? We can find out like this:

In [8]:
start, end = encoding.word_to_chars(3)
example[start:end]

'Sylvain'

As we mentioned previously, this is all powered by the fact that the <font color='blue'>fast tokenizer</font> keeps track of the <font color='blue'>span of text</font> each <font color='blue'>token comes from</font> in a list of <font color='blue'>offsets</font>. To illustrate their use, next we'll show you how to replicate the results of the `token-classification` pipeline manually.
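
As a rough illustration of how such lookups can be derived from offsets, here is a small pure-Python sketch. The offsets and word IDs are hand-copied values for the example sentence with `bert-base-cased`, and the functions `sketch_word_to_chars` and `sketch_char_to_token` are hypothetical stand-ins, not the library's implementation:

```python
example = "My name is Sylvain and I work at Hugging Face in Brooklyn."

# Hand-copied (start, end) character offsets for each token; (0, 0) marks special tokens
offsets = [(0, 0), (0, 2), (3, 7), (8, 10), (11, 12), (12, 14), (14, 16), (16, 18),
           (19, 22), (23, 24), (25, 29), (30, 32), (33, 35), (35, 40), (41, 45),
           (46, 48), (49, 57), (57, 58), (0, 0)]
word_ids = [None, 0, 1, 2, 3, 3, 3, 3, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, None]


def sketch_word_to_chars(word_index):
    """Span from the start of the word's first token to the end of its last token."""
    spans = [offsets[i] for i, word_id in enumerate(word_ids) if word_id == word_index]
    return spans[0][0], spans[-1][1]


def sketch_char_to_token(char_index):
    """Index of the token whose span contains the character, or None (e.g. for a space)."""
    for token_index, (start, end) in enumerate(offsets):
        if start <= char_index < end:
            return token_index
    return None


start, end = sketch_word_to_chars(3)
print(example[start:end])
print(sketch_char_to_token(12))
```

Because special tokens carry the empty span `(0, 0)`, they never match a character in this sketch, and characters that fall between tokens (such as spaces) map to `None`.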

<Tip>

✏️ **Try it out!** Create your own example text and see if you can understand which tokens are associated with each word ID, and also how to extract the character spans for a single word. For bonus points, try using two sentences as input and see if the sentence IDs make sense to you.

</Tip>

In [56]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Example sentence for tokens associated with a word ID
print('Example of a single sentence\n')
example = "OpenAI's ChatGPT revolutionized the ability of users to get questionable information."
encoding = tokenizer(example)

print(f"Sentence: {example}")
print(f"Tokens: {tokenizer.tokenize(example)}")
print(f"Word IDs: {encoding.word_ids()}")

# Character span for word 3 (ChatGPT)
word_id = 3
start, end = encoding.word_to_chars(word_id)
print(f"Word {word_id}: '{example[start:end]}'")

# Example for two sentences
print('\nExample for two sentences\n')
sentence1 = "I love analytic number theory!"
sentence2 = "The Riemann zeta function is the most interesting thing on planet Earth."
two_sent_encoding = tokenizer(sentence1, sentence2)

print(f"Sentence 1: {sentence1}")
print(f"Sentence 2: {sentence2}")
print(f"Tokens: {tokenizer.convert_ids_to_tokens(two_sent_encoding['input_ids'])}")
print(f"Word IDs: {two_sent_encoding.word_ids()}")
print(f"Token Type IDs: {two_sent_encoding['token_type_ids']}")

# Sentence mapping via token_type_ids (0 = first sentence, 1 = second sentence)
print('\nSentence mapping via token_type_ids\n')
tokens = tokenizer.convert_ids_to_tokens(two_sent_encoding['input_ids'])
for token, type_id in zip(tokens, two_sent_encoding['token_type_ids']):
    sentence = "First sentence" if type_id == 0 else "Second sentence"
    print(f"{token} -> {sentence}")

Example of a single sentence

Sentence: OpenAI's ChatGPT revolutionized the ability of users to get questionable information.
Tokens: ['Open', '##A', '##I', "'", 's', 'Cha', '##t', '##GP', '##T', 'revolution', '##ized', 'the', 'ability', 'of', 'users', 'to', 'get', 'questionable', 'information', '.']
Word IDs: [None, 0, 0, 0, 1, 2, 3, 3, 3, 3, 4, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, None]
Word 3: 'ChatGPT'

Example for two sentences

Sentence 1: I love analytic number theory!
Sentence 2: The Riemann zeta function is the most interesting thing on planet Earth.
Tokens: ['[CLS]', 'I', 'love', 'anal', '##ytic', 'number', 'theory', '!', '[SEP]', 'The', 'R', '##ie', '##mann', 'z', '##eta', 'function', 'is', 'the', 'most', 'interesting', 'thing', 'on', 'planet', 'Earth', '.', '[SEP]']
Word IDs: [None, 0, 1, 2, 2, 3, 4, 5, None, 0, 1, 1, 1, 2, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, None]
Token Type IDs: [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

Sentence mapping 

## Inside the `token-classification` pipeline

In [Chapter 1](https://huggingface.co/learn/llm-course/chapter1/3) we got our first taste of applying <font color='blue'>NER</font> -- where the task is to identify which <font color='blue'>parts of the text</font> correspond to <font color='blue'>entities</font> like <font color='blue'>persons, locations, or organizations</font> -- with the 🤗 Transformers `pipeline()` function. Then, in [Chapter 2](https://huggingface.co/learn/llm-course/chapter2/2), we saw how a <font color='blue'>pipeline groups together</font> the <font color='blue'>three stages</font> necessary to get the <font color='blue'>predictions</font> from a <font color='blue'>raw text</font>: tokenization, passing the inputs through the model, and post-processing. The <font color='blue'>first two steps</font> in the `token-classification` pipeline are the <font color='blue'>same</font> as in any other pipeline, but the <font color='blue'>post-processing</font> is a little <font color='blue'>more complex</font> -- let's see how!


### Getting the base results with the pipeline

First, let's grab a token classification pipeline so we can get some results to compare manually. The model used by default is [`dbmdz/bert-large-cased-finetuned-conll03-english`](https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english); it performs NER on sentences:


In [40]:
from transformers import pipeline

token_classifier = pipeline("token-classification")
token_classifier("My name is Sylvain and I work at Hugging Face in Brooklyn.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


[{'entity': 'I-PER',
  'score': np.float32(0.99938285),
  'index': 4,
  'word': 'S',
  'start': 11,
  'end': 12},
 {'entity': 'I-PER',
  'score': np.float32(0.99815494),
  'index': 5,
  'word': '##yl',
  'start': 12,
  'end': 14},
 {'entity': 'I-PER',
  'score': np.float32(0.99590707),
  'index': 6,
  'word': '##va',
  'start': 14,
  'end': 16},
 {'entity': 'I-PER',
  'score': np.float32(0.99923277),
  'index': 7,
  'word': '##in',
  'start': 16,
  'end': 18},
 {'entity': 'I-ORG',
  'score': np.float32(0.9738931),
  'index': 12,
  'word': 'Hu',
  'start': 33,
  'end': 35},
 {'entity': 'I-ORG',
  'score': np.float32(0.976115),
  'index': 13,
  'word': '##gging',
  'start': 35,
  'end': 40},
 {'entity': 'I-ORG',
  'score': np.float32(0.9887976),
  'index': 14,
  'word': 'Face',
  'start': 41,
  'end': 45},
 {'entity': 'I-LOC',
  'score': np.float32(0.9932106),
  'index': 16,
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]

The <font color='blue'>model</font> properly <font color='blue'>identified</font> each <font color='blue'>token</font> generated by <font color='blue'>Sylvain</font> as a <font color='blue'>person</font>, each token generated by <font color='blue'>Hugging Face</font> as an <font color='blue'>organization</font>, and the token <font color='blue'>Brooklyn</font> as a <font color='blue'>location</font>. We can also ask the <font color='blue'>pipeline</font> to <font color='blue'>group</font> together the <font color='blue'>tokens</font> that correspond to the <font color='blue'>same entity</font>:


In [41]:
from transformers import pipeline

token_classifier = pipeline("token-classification", aggregation_strategy="simple")
token_classifier("My name is Sylvain and I work at Hugging Face in Brooklyn.")


[{'entity_group': 'PER',
  'score': np.float32(0.9981694),
  'word': 'Sylvain',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': np.float32(0.9796019),
  'word': 'Hugging Face',
  'start': 33,
  'end': 45},
 {'entity_group': 'LOC',
  'score': np.float32(0.9932106),
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]

The <font color='blue'>`aggregation_strategy`</font> picked will <font color='blue'>change the scores</font> computed for each grouped entity. With <font color='blue'>simple</font> the <font color='blue'>score</font> is just the <font color='blue'>mean of the scores of each token</font> in the given entity: for instance, the score of <font color='blue'>Sylvain</font> is the <font color='blue'>mean of the scores</font> we saw in the <font color='blue'>previous example</font> for the tokens `S`, `##yl`, `##va`, and `##in`. Other strategies available are:

- <font color='blue'>first</font>, where the <font color='blue'>score</font> of each entity is the <font color='blue'>score of the first token</font> of that entity (so for "Sylvain" it would be 0.9993828, the score of the token `S`)
- <font color='blue'>max</font>, where the score of each entity is the <font color='blue'>maximum score of the tokens</font> in that entity (so for "Hugging Face" it would be 0.98879766, the score of "Face")
- <font color='blue'>average</font>, where the score of each entity is the <font color='blue'>average of the scores of the words</font> composing that entity (so for "Sylvain" there would be no difference from the `"simple"` strategy, but "Hugging Face" would have a score of 0.9819, the average of the scores for "Hugging", 0.975, and "Face", 0.98879)
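
The arithmetic behind these strategies can be checked with a few lines of plain Python. This sketch only mirrors the descriptions in the list above and is not the pipeline's internal code; the function name `entity_score` is made up, and the scores are copied from the ungrouped pipeline output:

```python
from statistics import mean

# Per-token scores copied from the pipeline output, grouped by word
sylvain = [[0.99938285, 0.99815494, 0.99590707, 0.99923277]]  # one word, four tokens
hugging_face = [[0.9738931, 0.976115], [0.9887976]]           # "Hugging" and "Face"


def entity_score(words, strategy):
    """Score of an entity given its per-word lists of token scores (illustrative only)."""
    token_scores = [score for word in words for score in word]
    if strategy == "simple":
        return mean(token_scores)
    if strategy == "first":
        return token_scores[0]
    if strategy == "max":
        return max(token_scores)
    if strategy == "average":
        return mean([mean(word) for word in words])
    raise ValueError(f"Unknown strategy: {strategy}")


for strategy in ("simple", "first", "max", "average"):
    print(strategy, entity_score(sylvain, strategy), entity_score(hugging_face, strategy))
```

For "Sylvain", `simple` and `average` coincide because the entity is a single word; for "Hugging Face" they differ because `average` first averages within each word and then across words.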

Now let's see how to obtain these results without using the `pipeline()` function!


### From inputs to predictions

First we need to <font color='blue'>tokenize</font> our <font color='blue'>input</font> and <font color='blue'>pass</font> it <font color='blue'>through the model</font>. This is done exactly as in [Chapter 2](https://huggingface.co/learn/llm-course/chapter2/4); we instantiate the tokenizer and the model using the `AutoXxx` classes and then use them on our example:


In [42]:
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_checkpoint = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint)

example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
inputs = tokenizer(example, return_tensors="pt")
outputs = model(**inputs)


Since we're using `AutoModelForTokenClassification` here, we get <font color='blue'>one set of logits</font> for <font color='blue'>each token</font> in the input sequence:

In [12]:
print(inputs["input_ids"].shape)
print(outputs.logits.shape)

torch.Size([1, 19])
torch.Size([1, 19, 9])


The TensorFlow version follows the same steps: we <font color='blue'>tokenize our input</font> and pass it <font color='blue'>through the model</font>, exactly as in [Chapter 2](https://huggingface.co/learn/llm-course/chapter2/4). We <font color='blue'>instantiate</font> the <font color='blue'>tokenizer</font> with `AutoTokenizer` and the <font color='blue'>model</font> with the <font color='blue'>`TFAutoXxx` classes</font>, and then use them on our example:


In [13]:
from transformers import AutoTokenizer, TFAutoModelForTokenClassification

model_checkpoint = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = TFAutoModelForTokenClassification.from_pretrained(model_checkpoint)

example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
inputs = tokenizer(example, return_tensors="tf")
outputs = model(**inputs)

All PyTorch model weights were used when initializing TFBertForTokenClassification.

All the weights of TFBertForTokenClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForTokenClassification for predictions without further training.


Since we're using `TFAutoModelForTokenClassification` here, we get <font color='blue'>one set of logits</font> for <font color='blue'>each token</font> in the input sequence:

In [14]:
print(inputs["input_ids"].shape)
print(outputs.logits.shape)

(1, 19)
(1, 19, 9)


We have a batch with <font color='blue'>1 sequence of 19 tokens</font> and the model has <font color='blue'>9 different labels</font>, so the output of the model has a shape of <font color='blue'>`1 x 19 x 9`</font>. Like for the text classification pipeline, we use a <font color='blue'>softmax function</font> to <font color='blue'>convert</font> those <font color='blue'>logits to probabilities</font>, and we take the <font color='blue'>argmax</font> to get <font color='blue'>predictions</font> (note that we can take the argmax on the logits because the softmax does not change the order):


In [21]:
import torch
import tensorflow as tf

# Convert TensorFlow tensor to PyTorch tensor before using torch.nn.functional.softmax
logits_torch = torch.tensor(outputs.logits.numpy())
probabilities = torch.nn.functional.softmax(logits_torch, dim=-1)[0].tolist()
predictions = logits_torch.argmax(dim=-1)[0].tolist()

print(predictions)

[0, 0, 0, 0, 4, 4, 4, 4, 0, 0, 0, 0, 6, 6, 6, 0, 8, 0, 0]


In [23]:
probabilities = tf.math.softmax(outputs.logits, axis=-1)[0]
probabilities = probabilities.numpy().tolist()
predictions = tf.math.argmax(outputs.logits, axis=-1)[0]
predictions = predictions.numpy().tolist()
print(predictions)

[0, 0, 0, 0, 4, 4, 4, 4, 0, 0, 0, 0, 6, 6, 6, 0, 8, 0, 0]
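
The remark that the argmax can be taken directly on the logits is easy to verify in isolation. The following is a small pure-Python sketch with made-up logits (not taken from the model); `softmax` and `argmax` are written by hand here just to keep the example dependency-free:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


logits = [-1.2, 0.3, 4.1, 0.0]  # made-up logits for a single token
probabilities_for_token = softmax(logits)

print(argmax(logits), argmax(probabilities_for_token))
```

Since the exponential is strictly increasing and the normalization divides every entry by the same positive number, the largest logit always maps to the largest probability.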


The <font color='blue'>`model.config.id2label`</font> attribute contains the mapping of <font color='blue'>indexes to labels</font> that we can use to make sense of the predictions:

In [24]:
model.config.id2label

{0: 'O',
 1: 'B-MISC',
 2: 'I-MISC',
 3: 'B-PER',
 4: 'I-PER',
 5: 'B-ORG',
 6: 'I-ORG',
 7: 'B-LOC',
 8: 'I-LOC'}

As we see, there are <font color='blue'>9 labels</font>: <font color='blue'>`O`</font> is the label for the <font color='blue'>tokens</font> that are <font color='blue'>not in any named entity</font> (it stands for "outside"), and we then have <font color='blue'>two labels</font> for <font color='blue'>each type of entity</font> (miscellaneous, person, organization, and location). The label <font color='blue'>`B-XXX`</font> indicates the <font color='blue'>token</font> is at the <font color='blue'>beginning</font> of an entity `XXX` and the label <font color='blue'>`I-XXX`</font> indicates the <font color='blue'>token</font> is <font color='blue'>inside</font> the entity `XXX`. For instance, in the current example we would expect our model to classify the token <font color='blue'>`S`</font> as <font color='blue'>`B-PER`</font> (beginning of a person entity) and the tokens <font color='blue'>`##yl`, `##va`</font> and <font color='blue'>`##in`</font> as <font color='blue'>`I-PER`</font> (inside a person entity).

You might think the model was wrong in this case as it gave the label <font color='blue'>`I-PER`</font> to <font color='blue'>all four</font> of these <font color='blue'>tokens</font>, but that's not entirely true. There are actually two formats for those `B-` and `I-` labels: <font color='blue'>IOB1</font> and <font color='blue'>IOB2</font>. The <font color='blue'>IOB2</font> format (in <font color='blue'>pink</font> below) is the one we introduced, whereas in the <font color='blue'>IOB1</font> format (in <font color='blue'>blue</font>), the labels <font color='blue'>beginning</font> with <font color='blue'>`B-`</font> are only ever used to <font color='blue'>separate two adjacent entities</font> of the <font color='blue'>same type</font>. The model we are using was <font color='blue'>fine-tuned</font> on a <font color='blue'>dataset</font> using the IOB1 format, which is why it assigns the label `I-PER` to the `S` token.

<div class="flex justify-center">
<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter6/IOB_versions.svg" alt="IOB1 vs IOB2 format"/>
</div>
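
To see the difference between the two formats in code, here is a minimal sketch that converts a sequence of IOB1 tags into IOB2. The function name `iob1_to_iob2` and the tag sequence are invented for illustration:

```python
def iob1_to_iob2(tags):
    """Convert IOB1 tags to IOB2: every entity must start with a B- tag."""
    converted = []
    previous = "O"
    for tag in tags:
        if tag == "O":
            converted.append(tag)
        else:
            prefix, entity = tag.split("-", 1)
            previous_entity = previous.split("-", 1)[1] if previous != "O" else None
            if prefix == "I" and previous_entity != entity:
                # In IOB1 an entity may start with I-; IOB2 requires B- here
                converted.append(f"B-{entity}")
            else:
                converted.append(tag)
        previous = tag
    return converted


# Made-up IOB1 sequence: B-ORG appears only to separate two adjacent ORG entities
iob1_tags = ["O", "I-PER", "I-PER", "O", "I-ORG", "I-ORG", "B-ORG", "I-LOC"]
iob2_tags = iob1_to_iob2(iob1_tags)
print(iob2_tags)
```

In the IOB1 input, the only `B-` tag marks the boundary between two adjacent `ORG` entities; in the IOB2 output, every entity begins with a `B-` tag.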

With this map, we are ready to <font color='blue'>reproduce</font> (almost entirely) the <font color='blue'>results of the first pipeline</font> -- we can just grab the score and label of each token that was not classified as `O`:

In [35]:
results = []
tokens = inputs.tokens()

for idx, pred in enumerate(predictions):
    label = model.config.id2label[pred]
    if label != "O":
        results.append(
            {"entity": label, "score": probabilities[idx][pred], "word": tokens[idx]}
        )

for result in results:
    print(result)

{'entity': 'I-PER', 'score': 0.9993829131126404, 'word': 'S'}
{'entity': 'I-PER', 'score': 0.998154878616333, 'word': '##yl'}
{'entity': 'I-PER', 'score': 0.995907187461853, 'word': '##va'}
{'entity': 'I-PER', 'score': 0.9992326498031616, 'word': '##in'}
{'entity': 'I-ORG', 'score': 0.9738930463790894, 'word': 'Hu'}
{'entity': 'I-ORG', 'score': 0.9761150479316711, 'word': '##gging'}
{'entity': 'I-ORG', 'score': 0.9887974858283997, 'word': 'Face'}
{'entity': 'I-LOC', 'score': 0.9932106137275696, 'word': 'Brooklyn'}


This is very similar to what we had before, with one exception: the <font color='blue'>pipeline</font> also gave us <font color='blue'>information</font> about the <font color='blue'>`start` and `end`</font> of each entity in the original sentence. This is where our <font color='blue'>offset mapping</font> will come into play. To get the offsets, we just have to set <font color='blue'>`return_offsets_mapping=True`</font> when we apply the tokenizer to our inputs:


In [26]:
inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
inputs_with_offsets["offset_mapping"]

[(0, 0),
 (0, 2),
 (3, 7),
 (8, 10),
 (11, 12),
 (12, 14),
 (14, 16),
 (16, 18),
 (19, 22),
 (23, 24),
 (25, 29),
 (30, 32),
 (33, 35),
 (35, 40),
 (41, 45),
 (46, 48),
 (49, 57),
 (57, 58),
 (0, 0)]

Each <font color='blue'>tuple</font> is the <font color='blue'>span of text</font> corresponding to <font color='blue'>each token</font>, where `(0, 0)` is reserved for the special tokens. We saw before that the token at index 5 is `##yl`, which has <font color='blue'>`(12, 14)`</font> as <font color='blue'>offsets</font> here. If we grab the corresponding slice in our example:


In [27]:
example[12:14]

'yl'

we get the <font color='blue'>proper span</font> of <font color='blue'>text</font> without the `##`. Using this, we can now complete the previous results:

In [36]:
results = []
inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
tokens = inputs_with_offsets.tokens()
offsets = inputs_with_offsets["offset_mapping"]

for idx, pred in enumerate(predictions):
    label = model.config.id2label[pred]
    if label != "O":
        start, end = offsets[idx]
        results.append(
            {
                "entity": label,
                "score": probabilities[idx][pred],
                "word": tokens[idx],
                "start": start,
                "end": end,
            }
        )

for result in results:
    print(result)

{'entity': 'I-PER', 'score': 0.9993829131126404, 'word': 'S', 'start': 11, 'end': 12}
{'entity': 'I-PER', 'score': 0.998154878616333, 'word': '##yl', 'start': 12, 'end': 14}
{'entity': 'I-PER', 'score': 0.995907187461853, 'word': '##va', 'start': 14, 'end': 16}
{'entity': 'I-PER', 'score': 0.9992326498031616, 'word': '##in', 'start': 16, 'end': 18}
{'entity': 'I-ORG', 'score': 0.9738930463790894, 'word': 'Hu', 'start': 33, 'end': 35}
{'entity': 'I-ORG', 'score': 0.9761150479316711, 'word': '##gging', 'start': 35, 'end': 40}
{'entity': 'I-ORG', 'score': 0.9887974858283997, 'word': 'Face', 'start': 41, 'end': 45}
{'entity': 'I-LOC', 'score': 0.9932106137275696, 'word': 'Brooklyn', 'start': 49, 'end': 57}


This is the same as what we got from the first pipeline!

### Grouping entities

Using the <font color='blue'>offsets</font> to determine the <font color='blue'>start and end keys</font> for <font color='blue'>each entity</font> is <font color='blue'>handy</font>, but that information isn't strictly necessary. When we want to <font color='blue'>group the entities together</font>, however, the <font color='blue'>offsets</font> will save us a lot of <font color='blue'>messy code</font>. For example, if we wanted to group together the tokens `Hu`, `##gging`, and `Face`, we could make special rules that say the first two should be attached while removing the `##`, and the `Face` should be added with a space since it does not begin with `##` -- but that would only work for this particular type of tokenizer. We would have to write <font color='blue'>another set of rules</font> for a <font color='blue'>SentencePiece</font> or a <font color='blue'>Byte-Pair-Encoding tokenizer</font> (discussed later in this chapter).

With the <font color='blue'>offsets</font>, all that <font color='blue'>custom code goes away</font>: we can just take the <font color='blue'>span in the original text</font> that <font color='blue'>begins</font> with the <font color='blue'>first token</font> and <font color='blue'>ends</font> with the <font color='blue'>last token</font>. So, in the case of the tokens `Hu`, `##gging`, and `Face`, we should start at character 33 (the beginning of `Hu`) and end before character 45 (the end of `Face`):


In [29]:
example[33:45]

'Hugging Face'
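
For comparison, here is a sketch of both approaches side by side: a hand-written rule that only understands the WordPiece `##` convention, and the offset-based slice. The function names `join_wordpiece` and `span_text` are hypothetical, and the tokens and offsets are copied from the outputs above:

```python
example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
entity_tokens = ["Hu", "##gging", "Face"]
entity_offsets = [(33, 35), (35, 40), (41, 45)]


def join_wordpiece(tokens):
    """Tokenizer-specific rule: glue '##' pieces, otherwise insert a space."""
    text = ""
    for token in tokens:
        if token.startswith("##"):
            text += token[2:]
        else:
            text += (" " if text else "") + token
    return text


def span_text(text, offsets):
    """Tokenizer-agnostic: slice from the first token's start to the last token's end."""
    return text[offsets[0][0]:offsets[-1][1]]


print(join_wordpiece(entity_tokens))
print(span_text(example, entity_offsets))
```

Both give the same string here, but only `span_text` would keep working unchanged with a tokenizer that marks word boundaries differently.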

To write the code that <font color='blue'>post-processes</font> the <font color='blue'>predictions</font> while <font color='blue'>grouping entities</font>, we will <font color='blue'>group together entities</font> that are <font color='blue'>consecutive</font> and <font color='blue'>labeled with `I-XXX`</font>, <font color='blue'>except</font> for the <font color='blue'>first one</font>, which can be <font color='blue'>labeled</font> as <font color='blue'>`B-XXX` or `I-XXX`</font> (so, we stop grouping an entity when we get an `O`, a new type of entity, or a `B-XXX` that tells us an entity of the same type is starting):


In [37]:
import numpy as np

results = []
inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
tokens = inputs_with_offsets.tokens()
offsets = inputs_with_offsets["offset_mapping"]

idx = 0
while idx < len(predictions):
    pred = predictions[idx]
    label = model.config.id2label[pred]
    if label == "O":
        idx += 1
        continue

    # Remove the B- or I-
    label = label[2:]
    start, end = offsets[idx]

    # The first token of an entity may be labeled B-label or I-label
    all_scores = [probabilities[idx][pred]]
    idx += 1

    # Grab all the following tokens labeled with I-label
    while (
        idx < len(predictions)
        and model.config.id2label[predictions[idx]] == f"I-{label}"
    ):
        # Use each token's own predicted class to look up its score
        all_scores.append(probabilities[idx][predictions[idx]])
        _, end = offsets[idx]
        idx += 1

    # The score is the mean of all the scores of the tokens in that grouped entity
    score = np.mean(all_scores).item()
    word = example[start:end]
    results.append(
        {
            "entity_group": label,
            "score": score,
            "word": word,
            "start": start,
            "end": end,
        }
    )
    # No extra increment here: idx already points at the first token after the entity

for result in results:
    print(result)

{'entity_group': 'PER', 'score': 0.998169407248497, 'word': 'Sylvain', 'start': 11, 'end': 18}
{'entity_group': 'ORG', 'score': 0.9796018600463867, 'word': 'Hugging Face', 'start': 33, 'end': 45}
{'entity_group': 'LOC', 'score': 0.9932106137275696, 'word': 'Brooklyn', 'start': 49, 'end': 57}


And we get the <font color='blue'>same results</font> as with our <font color='blue'>second pipeline</font>! <font color='blue'>Another example</font> of a task where these offsets are extremely useful is <font color='blue'>question answering</font>. Diving into that pipeline, which we'll do in the next section, will also enable us to take a look at one last feature of the tokenizers in the 🤗 Transformers library: <font color='blue'>dealing</font> with <font color='blue'>overflowing tokens</font> when we <font color='blue'>truncate an input</font> to a given length.
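
The grouping rule can also be packaged as a standalone function that takes plain lists, which makes it easy to try on cases the example sentence does not cover, such as a `B-XXX` label separating two adjacent entities of the same type. This is a sketch under stated assumptions: the function name `group_entities` and the tiny input below are invented, and this is not the pipeline's actual implementation:

```python
def group_entities(text, labels, scores, offsets):
    """Group consecutive tokens into entities using labels and character offsets."""
    results = []
    idx = 0
    while idx < len(labels):
        if labels[idx] == "O":
            idx += 1
            continue
        entity = labels[idx][2:]  # strip the B- or I- prefix
        start, end = offsets[idx]
        group_scores = [scores[idx]]
        idx += 1
        # Extend the group while the following tokens are I- tags of the same type
        while idx < len(labels) and labels[idx] == f"I-{entity}":
            group_scores.append(scores[idx])
            end = offsets[idx][1]
            idx += 1
        results.append(
            {
                "entity_group": entity,
                "score": sum(group_scores) / len(group_scores),
                "word": text[start:end],
                "start": start,
                "end": end,
            }
        )
    return results


# Made-up input: two adjacent locations, separated by a B-LOC tag (IOB1 style)
text = "Paris Lyon"
labels = ["I-LOC", "B-LOC"]
scores = [0.9, 0.8]
offsets = [(0, 5), (6, 10)]

groups = group_entities(text, labels, scores, offsets)
for group in groups:
    print(group)
```

Here the `B-LOC` tag ends the first group and starts a second one, so the two cities are reported as separate entities rather than being merged.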
