<a href="https://colab.research.google.com/github/chineidu/NLP-Tutorial/blob/main/notebook/06_Transformers/06a_tokenizers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Training A New Tokenizer From An Old Tokenizer

- See [the Hugging Face NLP course](https://huggingface.co/learn/nlp-course/chapter6/2?fw=pt) for details on training a new tokenizer from a pretrained one.

In [1]:
!pip install rich
!pip install transformers[torch]
!pip install torch datasets evaluate



In [2]:
# Built-in library
import re
import json
from typing import Any, Dict, List, Optional, Union
import logging
import warnings

# Standard imports
import numpy as np
import pandas as pd
from rich import print

# Visualization
import matplotlib.pyplot as plt


# Pandas settings
pd.options.display.max_rows = 1_000
pd.options.display.max_columns = 1_000
pd.options.display.max_colwidth = 600

warnings.filterwarnings("ignore")

# Black code formatter (Optional)
# %load_ext lab_black

# auto reload imports
# %load_ext autoreload
# %autoreload 2

<hr><br>

## Batch Encoding Using Fast Tokenizers

In [3]:
from transformers import AutoTokenizer


CHECKPOINT: str = "bert-base-cased"
tokenizer: AutoTokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
example: str = "My name is Chineidu and I work at Hugging Face In Brooklyn."
encoding: dict[str, Any] = tokenizer(example)

print(type(encoding))

In [4]:
print(encoding)

# Access the tokens (w/o converting the IDs back to tokens)
print(encoding.tokens())

In [5]:
# Get the index of the word each token comes from.
# The special tokens [CLS] and [SEP] are represented as None.
print(encoding.word_ids())

In [6]:
# Try another tokenizer!
CHECKPOINT: str = "roberta-base"
tokenizer_2: AutoTokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
example_2: str = "My name is Chineidu and I work at Hugging Face In Brooklyn."
encoding_2: dict[str, Any] = tokenizer_2(example_2)

print(encoding_2)

# Access the tokens (w/o converting the IDs back to tokens)
print(encoding_2.tokens())

In [7]:
print(example)
print(encoding.word_ids())

# Access the tokens (w/o converting the IDs back to tokens)
print(encoding.tokens())

```text
- We can map any word or token to characters in the original text, and vice versa, via these methods:
  - word_to_chars()
  - token_to_chars()
  - char_to_word()
  - char_to_token()

- The word_ids() method told us that ##ei is part of the word at index 3, but which word is it in the sentence? We can find out like this:
```

In [8]:
start, end = encoding.word_to_chars(3)
example[start:end]

'Chineidu'
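
The lookup above can also be reproduced by hand from two per-token lists: the word index and the character offsets. The sketch below hard-codes both for the example sentence. The offsets are copied from the `offset_mapping` output printed later in this notebook; the `word_ids` list is an assumption about what `encoding.word_ids()` returns here. The two functions are plain-Python stand-ins for illustration, not the library methods of the same name.

```python
from typing import Optional

example = "My name is Chineidu and I work at Hugging Face In Brooklyn."

# One entry per token, [CLS] and [SEP] included (assumed values for this sentence).
word_ids: list[Optional[int]] = [
    None, 0, 1, 2, 3, 3, 3, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, None,
]
offsets: list[tuple[int, int]] = [
    (0, 0), (0, 2), (3, 7), (8, 10), (11, 15), (15, 17), (17, 19), (20, 23), (24, 25),
    (26, 30), (31, 33), (34, 36), (36, 41), (42, 46), (47, 49), (50, 58), (58, 59), (0, 0),
]


def word_to_chars(word_index: int) -> tuple[int, int]:
    """Return the (start, end) character span covered by all tokens of a word."""
    spans = [span for wid, span in zip(word_ids, offsets) if wid == word_index]
    return spans[0][0], spans[-1][1]


def char_to_token(char_index: int) -> Optional[int]:
    """Return the index of the token whose span contains the character, if any."""
    for token_index, (start, end) in enumerate(offsets):
        if start <= char_index < end:
            return token_index
    return None


start, end = word_to_chars(3)
print(example[start:end])  # Chineidu
print(char_to_token(16))   # 5, the token covering "ei"
```

Special tokens carry the empty span `(0, 0)`, so `char_to_token` never matches them.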

<hr><br>

## [Token Classification Pipeline](https://huggingface.co/learn/nlp-course/chapter6/3?fw=pt)

```text
- Using a token classification pipeline, we can get some results to compare manually.
- The model used by default is dbmdz/bert-large-cased-finetuned-conll03-english and it performs NER on sentences.
```

In [9]:
from transformers import pipeline


TASK: str = "token-classification"  # Named Entity Recognition (NER)
token_classifier: pipeline = pipeline(task=TASK)
example: str = "My name is Chineidu and I work at Hugging Face In Brooklyn."

token_classifier(example)

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'entity': 'I-PER',
  'score': 0.99802446,
  'index': 4,
  'word': 'Chin',
  'start': 11,
  'end': 15},
 {'entity': 'I-PER',
  'score': 0.96976656,
  'index': 5,
  'word': '##ei',
  'start': 15,
  'end': 17},
 {'entity': 'I-PER',
  'score': 0.99290186,
  'index': 6,
  'word': '##du',
  'start': 17,
  'end': 19},
 {'entity': 'I-ORG',
  'score': 0.99207014,
  'index': 11,
  'word': 'Hu',
  'start': 34,
  'end': 36},
 {'entity': 'I-ORG',
  'score': 0.99378514,
  'index': 12,
  'word': '##gging',
  'start': 36,
  'end': 41},
 {'entity': 'I-ORG',
  'score': 0.9924396,
  'index': 13,
  'word': 'Face',
  'start': 42,
  'end': 46},
 {'entity': 'I-LOC',
  'score': 0.9217939,
  'index': 15,
  'word': 'Brooklyn',
  'start': 50,
  'end': 58}]

<br>

#### Comment

```text
- The model properly identified each token generated by `Chineidu` as a person, each token generated by “Hugging Face” as an organization, and the token “Brooklyn” as a location. We can also ask the pipeline to group together the tokens that correspond to the same entity:
```

In [10]:
from transformers import pipeline


TASK: str = "token-classification"  # Named Entity Recognition (NER)

# With "simple" the score is just the mean of the scores of each token in the
# given entity: e.g., the score of “Chineidu” is the mean of the scores
# we saw in the previous example for the tokens Chin, ##ei, and ##du
token_classifier: pipeline = pipeline(task=TASK, aggregation_strategy="simple")
example: str = "My name is Chineidu and I work at Hugging Face In Brooklyn."

token_classifier(example)



[{'entity_group': 'PER',
  'score': 0.98689765,
  'word': 'Chineidu',
  'start': 11,
  'end': 19},
 {'entity_group': 'ORG',
  'score': 0.99276495,
  'word': 'Hugging Face',
  'start': 34,
  'end': 46},
 {'entity_group': 'LOC',
  'score': 0.9217939,
  'word': 'Brooklyn',
  'start': 50,
  'end': 58}]

#### Other Strategies:

```text
- "first", where the score of each entity is the score of the first token of that entity (so for “Chineidu” it would be 0.99802446, the score of the token Chin)

- "max", where the score of each entity is the maximum score of the tokens in that entity (so for “Hugging Face” it would be 0.99378514, the score of the token ##gging)

- "average", where the score of each entity is the average of the scores of the words composing that entity (so for “Chineidu” there would be no difference from the "simple" strategy, but “Hugging Face” would have a score of about 0.9927, the average of the scores for “Hugging”, 0.9929, and “Face”, 0.9924)
```
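
The strategies can be compared with plain arithmetic on the token scores printed by the first pipeline. The sketch below follows the wording of the bullets above. It is not the library's internal implementation, and `entity_score` is a hypothetical helper introduced only for illustration.

```python
from statistics import mean

# Token-level scores copied from the ungrouped pipeline output above.
# Each entity is a list of words; each word is a list of (token, score) pairs.
entities = {
    "Chineidu": [[("Chin", 0.99802446), ("##ei", 0.96976656), ("##du", 0.99290186)]],
    "Hugging Face": [[("Hu", 0.99207014), ("##gging", 0.99378514)], [("Face", 0.9924396)]],
}


def entity_score(words, strategy):
    """Score one entity according to the strategy descriptions in the text."""
    token_scores = [score for word in words for _, score in word]
    if strategy == "simple":   # mean over all tokens of the entity
        return mean(token_scores)
    if strategy == "first":    # score of the entity's first token
        return token_scores[0]
    if strategy == "max":      # highest token score in the entity
        return max(token_scores)
    if strategy == "average":  # mean of the per-word mean scores
        return mean(mean(score for _, score in word) for word in words)
    raise ValueError(f"unknown strategy: {strategy}")


for name, words in entities.items():
    for strategy in ("simple", "first", "max", "average"):
        print(f"{name!r:16} {strategy:8} {entity_score(words, strategy):.4f}")
```

For a single-word entity such as “Chineidu”, "simple" and "average" coincide because there is only one word to average over.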

In [11]:
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_checkpoint = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint)

example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
inputs = tokenizer(example, return_tensors="pt")
outputs = model(**inputs)



In [12]:
from transformers import AutoTokenizer, AutoModelForTokenClassification


model_checkpoint: str = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer: AutoTokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model: AutoModelForTokenClassification = AutoModelForTokenClassification.from_pretrained(model_checkpoint)

example: str = "My name is Chineidu and I work at Hugging Face In Brooklyn."
inputs: dict[str, Any] = tokenizer(example, return_tensors="pt")
outputs = model(**inputs)




In [13]:
inputs.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

In [14]:
print(inputs["input_ids"].shape)
print(outputs.logits.shape)

#### Comment:

```text
- The output is a batch with 1 sequence of 18 tokens and the model has 9 different labels, so the output of the model has a shape of 1 x 18 x 9.

- Like for the text classification pipeline, a softmax function is used to convert those logits to probabilities, and the argmax is calculated to get predictions (note that we can take the argmax on the logits because the softmax does not change the order)
```
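
The claim that softmax does not change the order can be checked in a few lines. This is a minimal pure-Python sketch; the logits are hypothetical values for one token over 9 labels, not output from the model.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values: list[float]) -> int:
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical logits for one token over the 9 labels.
logits = [-1.2, 0.3, -0.5, 0.1, 6.4, -2.0, 0.0, -0.7, 1.1]
probs = softmax(logits)

print(argmax(logits), argmax(probs))  # the same index both times
print(round(sum(probs), 6))           # the probabilities sum to 1
```

Softmax is strictly increasing in each logit, so the largest logit always maps to the largest probability.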

In [15]:
import torch.nn.functional as F


probabilities: list[list[float]] = F.softmax(outputs.logits, dim=-1)[0].tolist()
predictions: list[int] = outputs.logits.argmax(dim=-1)[0].tolist()
print(predictions)

In [16]:
# The model.config.id2label attribute contains the mapping of indexes to labels
# that we can use to make sense of the predictions:
print(model.config.id2label)

In [17]:
print(probabilities[4])

In [18]:
# entity, score
print((model.config.id2label[4], probabilities[4][4]))

#### Note:

```text
- There are 9 labels:
  - O is the label for the tokens that are not in any named entity (it stands for “outside”), and we then have two labels for each type of entity (miscellaneous, person, organization, and location).
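  - The label B-XXX indicates the token is at the beginning of an entity XXX and the label I-XXX indicates the token is inside the entity XXX. For instance, in the current example one might expect the model to classify the token `Chin` as B-PER (beginning of a person entity) and the tokens ##ei and ##du as I-PER (inside a person entity). The pipeline output above labels all three as I-PER instead, which is consistent with the IOB1 convention, where B-XXX is only used to separate two adjacent entities of the same type.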
```
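
The B- and I- prefixes are commonly used in two conventions, usually called IOB1 and IOB2 (names introduced here for illustration). In IOB2 every entity starts with B-; in IOB1, B- appears only when an entity directly follows another entity of the same type. The sketch below converts IOB1-style tags to IOB2. The tag list is reconstructed from the token indices in the first pipeline output, and `iob1_to_iob2` is a hypothetical helper, not part of `transformers`.

```python
def iob1_to_iob2(tags: list[str]) -> list[str]:
    """Rewrite IOB1 tags so that every entity starts with a B- tag."""
    converted = []
    previous = "O"
    for tag in tags:
        if tag.startswith("I-") and previous[2:] != tag[2:]:
            # First token of an entity: the previous tag is O or another entity type.
            converted.append("B-" + tag[2:])
        else:
            converted.append(tag)
        previous = tag
    return converted


# Labels for the 18 tokens of the example, as reported by the first pipeline.
iob1 = [
    "O", "O", "O", "O", "I-PER", "I-PER", "I-PER", "O", "O",
    "O", "O", "I-ORG", "I-ORG", "I-ORG", "O", "I-LOC", "O", "O",
]
iob2 = iob1_to_iob2(iob1)
print(iob2[4:7])    # ['B-PER', 'I-PER', 'I-PER']
print(iob2[11:14])  # ['B-ORG', 'I-ORG', 'I-ORG']
```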

In [19]:
# With this map, we are ready to reproduce (almost entirely) the results of the first pipeline:
# we can just grab the score and label of each token that was not classified as O.
results: list[dict[str, Any]] = []
tokens: list[str] = inputs.tokens()

for idx, pred in enumerate(predictions):
    label = model.config.id2label[pred]
    if label != "O":  # "O" marks tokens outside any named entity
        results.append(
            {
                "entity": label, "score": probabilities[idx][pred],
                "index":idx , "word": tokens[idx]
            }
        )

print(results)

In [20]:
# To obtain the `start` and `end` of each entity in the original sentence,
# add `return_offsets_mapping=True`.
inputs_with_offsets: dict[str, Any] = tokenizer(example, return_offsets_mapping=True)
inputs_with_offsets["offset_mapping"]

[(0, 0),
 (0, 2),
 (3, 7),
 (8, 10),
 (11, 15),
 (15, 17),
 (17, 19),
 (20, 23),
 (24, 25),
 (26, 30),
 (31, 33),
 (34, 36),
 (36, 41),
 (42, 46),
 (47, 49),
 (50, 58),
 (58, 59),
 (0, 0)]

In [21]:
example[8:10], example[11:15]

('is', 'Chin')

In [22]:
# Update the logic: add the character offsets of each token to the results.
results: list[dict[str, Any]] = []
tokens: list[str] = inputs.tokens()
offsets: list[tuple[int, int]] = inputs_with_offsets["offset_mapping"]

for idx, pred in enumerate(predictions):
    label = model.config.id2label[pred]
    if label != "O":  # "O" marks tokens outside any named entity
        start, end = offsets[idx]
        results.append(
            {
                "entity": label, "score": probabilities[idx][pred],
                "index":idx , "word": tokens[idx],
                "start": start, "end": end,
            }
        )

print(results)

#### Grouping Entities

```text

- Using the offsets to determine the start and end keys for each entity is handy, but that information isn’t strictly necessary.
- When we want to group the entities together, however, the offsets will save us a lot of messy code. e.g., if we wanted to group together the tokens Hu, ##gging, and Face, we could make special rules that say the first two should be attached while removing the ##, and the Face should be added with a space since it does not begin with ## — but that would only work for this particular type of tokenizer. We would have to write another set of rules for a SentencePiece or a Byte-Pair-Encoding tokenizer (discussed later in this chapter).

- With the offsets, all that custom code goes away: we can just take the span in the original text that begins with the first token and ends with the last token. So, in the case of the tokens Hu, ##gging, and Face, we should start at character 34 (the beginning of Hu) and end before character 46 (the end of Face):
```

In [23]:
example[34:46]

'Hugging Face'

In [24]:
# To post-process the predictions while grouping entities, we group together consecutive tokens
# labeled I-XXX. The first token of a group may be labeled B-XXX or I-XXX, so we stop grouping
# when we reach an O, a new type of entity, or a B-XXX that tells us another entity of the
# same type is starting:
import numpy as np


results: list[dict[str, Any]] = []
inputs_with_offsets: dict[str, Any] = tokenizer(example, return_offsets_mapping=True)
tokens: list[str] = inputs_with_offsets.tokens()
offsets: list[tuple[int, int]] = inputs_with_offsets["offset_mapping"]

idx: int = 0
while idx < len(predictions):
    pred = predictions[idx]
    label = model.config.id2label[pred]
    if label == "O":
        idx += 1
        continue

    # Remove the B- or I- prefix
    label = label[2:]
    start, end = offsets[idx]
    # The first token always belongs to the group, whether it is B-label or I-label
    all_scores = [probabilities[idx][pred]]
    idx += 1

    # Grab all the following tokens labeled with I-label
    while (
        idx < len(predictions)
        and model.config.id2label[predictions[idx]] == f"I-{label}"
    ):
        all_scores.append(probabilities[idx][predictions[idx]])
        _, end = offsets[idx]
        idx += 1

    # The score is the mean of all the scores of the tokens in that grouped entity
    score = np.mean(all_scores).item()
    results.append(
        {
            "entity_group": label,
            "score": score,
            "word": example[start:end],
            "start": start,
            "end": end,
        }
    )
    # No extra increment here: idx already points at the first token after the group

print(results)

<hr><br>

### [Fast Tokenizers In The QA Pipeline](https://huggingface.co/learn/nlp-course/chapter6/3b?fw=pt)

In [25]:
from transformers import pipeline

TASK: str = "question-answering"
question_answerer:pipeline = pipeline(TASK)
context: str = """
🤗 Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch, and TensorFlow — with a seamless integration
between them. It's straightforward to train your models with one before loading them for inference with the other.
"""
question: str = "Which deep learning libraries back 🤗 Transformers?"
question_answerer(question=question, context=context)

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.9802603125572205,
 'start': 78,
 'end': 106,
 'answer': 'Jax, PyTorch, and TensorFlow'}

In [26]:
# Rephrase the question
question: str = "What packages power transformers behind the scenes?"
question_answerer(question=question, context=context)

{'score': 0.8080288171768188,
 'start': 78,
 'end': 106,
 'answer': 'Jax, PyTorch, and TensorFlow'}

In [27]:
# Unlike the other pipelines, which can’t truncate and split texts that are longer than the maximum length
# accepted by the model (and thus may miss information at the end of a document), this pipeline can deal with
# very long contexts and will return the answer to the question even if it’s at the end:
long_context: str = """
🤗 Transformers: State of the Art NLP

🤗 Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction,
question answering, summarization, translation, text generation and more in over 100 languages.
Its aim is to make cutting-edge NLP easier to use for everyone.

🤗 Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets and
then share them with the community on our model hub. At the same time, each python module defining an architecture is fully standalone and
can be modified to enable quick research experiments.

Why should I use transformers?

1. Easy-to-use state-of-the-art models:
  - High performance on NLU and NLG tasks.
  - Low barrier to entry for educators and practitioners.
  - Few user-facing abstractions with just three classes to learn.
  - A unified API for using all our pretrained models.
  - Lower compute costs, smaller carbon footprint:

2. Researchers can share trained models instead of always retraining.
  - Practitioners can reduce compute time and production costs.
  - Dozens of architectures with over 10,000 pretrained models, some in more than 100 languages.

3. Choose the right framework for every part of a model's lifetime:
  - Train state-of-the-art models in 3 lines of code.
  - Move a single model between TF2.0/PyTorch frameworks at will.
  - Seamlessly pick the right framework for training, evaluation and production.

4. Easily customize a model or an example to your needs:
  - We provide examples for each architecture to reproduce the results published by its original authors.
  - Model internals are exposed as consistently as possible.
  - Model files can be used independently of the library for quick experiments.

🤗 Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch and TensorFlow — with a seamless integration
between them. It's straightforward to train your models with one before loading them for inference with the other.
"""
question: str = "Which deep learning libraries back 🤗 Transformers?"

question_answerer(question=question, context=long_context)

{'score': 0.9714871048927307,
 'start': 1892,
 'end': 1919,
 'answer': 'Jax, PyTorch and TensorFlow'}

### Breaking Down The QA Pipeline

```text
- We start by tokenizing our input and then send it through the model.
- The checkpoint used by default for the question-answering pipeline is distilbert-base-cased-distilled-squad (the “squad” in the name comes from the dataset on which the model was fine-tuned; we’ll talk more about the SQuAD dataset in Chapter 7)
```

In [28]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering


model_checkpoint: str = "distilbert-base-cased-distilled-squad"
tokenizer: AutoTokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model: AutoModelForQuestionAnswering = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

inputs: dict[str, Any] = tokenizer(question, context, return_tensors="pt")
outputs = model(**inputs)

[![image.png](https://i.postimg.cc/9XBkyMSp/image.png)](https://postimg.cc/nMCTJHnj)

<br>

```text
- Models for question answering work a little differently from the models we’ve seen up to now.
- Using the picture above as an example, the model has been trained to predict the index of the token starting the answer (here 21) and the index of the token where the answer ends (here 24).
- This is why those models don’t return one tensor of logits but two: one for the logits corresponding to the start token of the answer, and one for the logits corresponding to the end token of the answer.
- Since in this case we have only one input containing 67 tokens, we get:
```

In [29]:
import torch


start_logits: torch.Tensor = outputs.start_logits
end_logits: torch.Tensor = outputs.end_logits

print(start_logits.shape, end_logits.shape)

In [30]:
start_logits

tensor([[-4.4952, -6.4454, -4.7115, -7.0968, -7.0726, -7.4981, -5.5397, -4.1368,
         -5.9199, -5.4193, -1.5920, -1.0857, -5.0981, -2.9331, -3.4070,  2.2467,
          5.1563, -1.3602, -2.2209, -0.9686, -4.8112, -2.2527,  1.4383, 10.1211,
         -1.5311,  2.2685, -1.8951, -2.2108, -4.2142, -2.5571, -2.3252, -2.6046,
          1.7047, -1.9867, -1.7211, -0.5415, -2.0239, -4.4246, -5.1012, -4.4966,
         -7.8940, -6.7200, -4.6759, -6.3278, -4.8339, -5.1839, -3.3724, -7.4120,
         -8.1542, -4.4871, -7.4659, -4.3293, -4.2293, -3.1903, -7.9467, -5.2665,
         -7.5902, -5.0570, -7.4476, -7.9083, -6.5951, -7.4061, -8.8821, -7.6749,
         -6.9879, -7.0466, -5.4193]], grad_fn=<CloneBackward0>)

#### Comment

```text
- To convert those logits into probabilities, we will apply a softmax function — but before that, we need to make sure we mask the indices that are not part of the context. Our input is [CLS] question [SEP] context [SEP], so we need to mask the tokens of the question as well as the [SEP] token.
- We’ll keep the [CLS] token, however, as some models use it to indicate that the answer is not in the context.

- Since we will apply a softmax afterward, we just need to replace the logits we want to mask with a large negative number. Here, we use -10000:
```
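
To see why a large negative number is enough, here is a minimal pure-Python sketch of the masking step. The logits and `sequence_ids` are hypothetical values for a short input of the form [CLS] question [SEP] context [SEP], not output from the model.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for: [CLS], two question tokens, [SEP], three context tokens, [SEP]
logits = [-4.5, 2.0, 3.0, -5.4, -1.6, 5.2, 1.4, -5.4]
sequence_ids = [None, 0, 0, None, 1, 1, 1, None]

# True marks positions to hide: everything that is not context, except [CLS].
mask = [sid != 1 for sid in sequence_ids]
mask[0] = False

masked = [-10000.0 if hide else x for x, hide in zip(logits, mask)]
probs = softmax(masked)
print([round(p, 4) for p in probs])
```

After the softmax, `exp(-10000 - max_logit)` underflows to zero, so the hidden positions receive no probability mass and the remaining positions are renormalized among themselves.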

In [31]:
import torch


sequence_ids: list[Optional[int]] = inputs.sequence_ids()

# Mask everything apart from the tokens of the context
mask: list[bool] = [i != 1 for i in sequence_ids]

# Unmask the [CLS] token
mask[0] = False
mask = torch.tensor(mask)[None]

# Replace the logits you want to mask with a large negative number. e.g. -10000
start_logits[mask] = -10000
end_logits[mask] = -10000

In [32]:
# Apply softmax

start_probabilities: torch.Tensor = F.softmax(start_logits, dim=-1)[0]
end_probabilities: torch.Tensor = F.softmax(end_logits, dim=-1)[0]

In [33]:
start_probabilities

tensor([4.4531e-07, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
        0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 8.1185e-06, 1.3470e-05,
        2.4368e-07, 2.1236e-06, 1.3220e-06, 3.7722e-04, 6.9219e-03, 1.0237e-05,
        4.3289e-06, 1.5143e-05, 3.2463e-07, 4.1933e-06, 1.6808e-04, 9.9179e-01,
        8.6288e-06, 3.8557e-04, 5.9956e-06, 4.3725e-06, 5.8977e-07, 3.0929e-06,
        3.8999e-06, 2.9493e-06, 2.1940e-04, 5.4713e-06, 7.1354e-06, 2.3212e-05,
        5.2711e-06, 4.7788e-07, 2.4291e-07, 4.4467e-07, 1.4879e-08, 4.8133e-08,
        3.7169e-07, 7.1242e-08, 3.1735e-07, 2.2365e-07, 1.3685e-06, 2.4093e-08,
        1.1470e-08, 4.4891e-07, 2.2828e-08, 5.2562e-07, 5.8092e-07, 1.6419e-06,
        1.4114e-08, 2.0591e-07, 2.0161e-08, 2.5390e-07, 2.3251e-08, 1.4667e-08,
        5.4533e-08, 2.4235e-08, 5.5390e-09, 1.8524e-08, 3.6818e-08, 3.4721e-08,
        0.0000e+00], grad_fn=<SelectBackward0>)

In [34]:
# First let’s compute all the possible products:
scores: torch.Tensor = start_probabilities[:, None] * end_probabilities[None, :]

In [35]:
# Then we’ll mask the values where start_index > end_index by setting them to 0
# (the other probabilities are all positive numbers). The torch.triu() function
# returns the upper triangular part of the 2D tensor passed as an argument, so it will do that masking for us:
scores: torch.Tensor = torch.triu(scores)

In [36]:
# Now we just have to get the index of the maximum. Since PyTorch will return the index in the flattened
# tensor, we need to use the floor division // and modulus % operations to get the start_index and end_index:
max_index: int = scores.argmax().item()
start_index: int = max_index // scores.shape[1]
end_index: int = max_index % scores.shape[1]
print(scores[start_index, end_index])
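
The three steps above (outer product, upper-triangular mask, flattened argmax) can be traced on a tiny example without PyTorch. The probabilities below are hypothetical, chosen so that the unmasked maximum would be an invalid span that ends before it starts.

```python
# Hypothetical start/end probabilities for a 4-token input.
start_probs = [0.05, 0.10, 0.80, 0.05]
end_probs = [0.70, 0.05, 0.05, 0.20]
n = len(start_probs)

# Outer product: scores[i][j] = P(start = i) * P(end = j),
# zeroed below the diagonal, where the span would end before it starts.
scores = [
    [start_probs[i] * end_probs[j] if i <= j else 0.0 for j in range(n)]
    for i in range(n)
]

# Argmax over the flattened matrix, then recover (row, column).
flat = [value for row in scores for value in row]
max_index = max(range(len(flat)), key=flat.__getitem__)
start_index, end_index = max_index // n, max_index % n
print(start_index, end_index, round(scores[start_index][end_index], 4))
```

Without the mask, the pair (start 2, end 0) would win with a product of about 0.56; with it, the best valid span is (2, 3).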

In [37]:
# Convert the start_index and end_index to the character indices in the context.
# We can grab them using the offset_mapping:

inputs_with_offsets: dict[str, Any] = tokenizer(question, context, return_offsets_mapping=True)
offsets: list[tuple[int]] = inputs_with_offsets["offset_mapping"]

start_char, _ = offsets[start_index]
_, end_char = offsets[end_index]
answer = context[start_char:end_char]

In [38]:
result: dict[str, Any] = {
    "answer": answer,
    "start": start_char,
    "end": end_char,
    "score": scores[start_index, end_index],
}

print(result)

In [39]:
# Ex: Try it out! Use the best scores you computed earlier to show the five most likely answers.
# To check your results, go back to the first pipeline and pass in top_k=5 when calling it.
long_context: str = """
🤗 Transformers: State of the Art NLP

🤗 Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction,
question answering, summarization, translation, text generation and more in over 100 languages.
Its aim is to make cutting-edge NLP easier to use for everyone.

🤗 Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets and
then share them with the community on our model hub. At the same time, each python module defining an architecture is fully standalone and
can be modified to enable quick research experiments.

Why should I use transformers?

1. Easy-to-use state-of-the-art models:
  - High performance on NLU and NLG tasks.
  - Low barrier to entry for educators and practitioners.
  - Few user-facing abstractions with just three classes to learn.
  - A unified API for using all our pretrained models.
  - Lower compute costs, smaller carbon footprint:

2. Researchers can share trained models instead of always retraining.
  - Practitioners can reduce compute time and production costs.
  - Dozens of architectures with over 10,000 pretrained models, some in more than 100 languages.

3. Choose the right framework for every part of a model's lifetime:
  - Train state-of-the-art models in 3 lines of code.
  - Move a single model between TF2.0/PyTorch frameworks at will.
  - Seamlessly pick the right framework for training, evaluation and production.

4. Easily customize a model or an example to your needs:
  - We provide examples for each architecture to reproduce the results published by its original authors.
  - Model internals are exposed as consistently as possible.
  - Model files can be used independently of the library for quick experiments.

🤗 Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch and TensorFlow — with a seamless integration
between them. It's straightforward to train your models with one before loading them for inference with the other.
"""
question: str = "Which deep learning libraries back 🤗 Transformers?"

question_answerer(question=question, context=long_context, top_k=5)

[{'score': 0.9714871048927307,
  'start': 1892,
  'end': 1919,
  'answer': 'Jax, PyTorch and TensorFlow'},
 {'score': 0.14949701726436615,
  'start': 17,
  'end': 37,
  'answer': 'State of the Art NLP'},
 {'score': 0.015565173700451851,
  'start': 1892,
  'end': 1921,
  'answer': 'Jax, PyTorch and TensorFlow —'},
 {'score': 0.01370556652545929, 'start': 34, 'end': 37, 'answer': 'NLP'},
 {'score': 0.010596856474876404,
  'start': 3,
  'end': 37,
  'answer': 'Transformers: State of the Art NLP'}]
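
One possible way to approach the exercise is to rank every valid (start, end) pair instead of keeping only the maximum. The sketch below does this in plain Python with hypothetical probabilities. `top_k_spans` is an illustrative helper and not how the pipeline is implemented; in the notebook, one would presumably pass `start_probabilities.tolist()` and `end_probabilities.tolist()` and then map the token indices to characters through the offsets.

```python
import heapq


def top_k_spans(start_probs, end_probs, k=5):
    """Return the k best (score, start, end) triples with start <= end."""
    spans = (
        (s * e, i, j)
        for i, s in enumerate(start_probs)
        for j, e in enumerate(end_probs)
        if i <= j
    )
    return heapq.nlargest(k, spans)


# Hypothetical start/end probabilities for a 4-token input.
start_probs = [0.05, 0.10, 0.80, 0.05]
end_probs = [0.70, 0.05, 0.05, 0.20]

for score, start, end in top_k_spans(start_probs, end_probs, k=3):
    print(start, end, round(score, 4))
```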

<br>

### Handling long contexts

```text
If we try to tokenize the question and long context we used as an example previously, we’ll get a number of tokens higher than the maximum length used in the question-answering pipeline (which is 384):
```

In [40]:
inputs: dict[str, Any] = tokenizer(question, long_context)

print(len(inputs["input_ids"]))

In [41]:
# To address the issue of exceeding the maximum input length, we can truncate the context while
# keeping the question intact using the "only_second" truncation strategy. However, this approach
# may lead to the omission of the answer in the truncated context.

inputs: dict[str, Any] = tokenizer(question, long_context, max_length=384, truncation="only_second")
print(tokenizer.decode(inputs["input_ids"]))

In [42]:
# This means the model will have a hard time picking the correct answer. To fix this, the question-answering pipeline allows
# us to split the context into smaller chunks, specifying the maximum length. To avoid splitting the context at exactly
# the wrong place, which would make the answer impossible to find, it also includes some overlap between the chunks.
# We can have the tokenizer (fast or slow) do this for us by adding return_overflowing_tokens=True, and we can specify the overlap
# we want with the stride argument. Here is an example, using a smaller sentence:
sentence: str = "This sentence is not too long but we are going to split it anyway."
inputs: dict[str, Any] = tokenizer(
    sentence, truncation=True, return_overflowing_tokens=True, max_length=6, stride=2
)


for ids in inputs["input_ids"]:
    print(tokenizer.decode(ids))

#### Comment

```text
- As we can see, the sentence has been split into chunks in such a way that each entry in inputs["input_ids"] has at most 6 tokens (we would need to add padding to have the last entry be the same size as the others) and there is an overlap of 2 tokens between each of the entries.

- Let’s take a closer look at the result of the tokenization:
```

In [43]:
print(inputs.keys())

In [44]:
# As expected, we get input IDs and an attention mask. The last key, overflow_to_sample_mapping, is a map that tells us which sentence each
# of the results corresponds to — here we have 7 results that all come from the (only) sentence we passed the tokenizer:
print(inputs["overflow_to_sample_mapping"])

In [45]:
# This is more useful when we tokenize several sentences together. For instance, this:
sentences: list[str] = [
    "This sentence is not too long but we are going to split it anyway.",
    "This sentence is shorter but will still get split.",
]
inputs: dict[str, Any] = tokenizer(
    sentences, truncation=True, return_overflowing_tokens=True, max_length=6, stride=2
)



In [46]:
# This means that the 1st sentence is split into 7 chunks as before, and the next 4 chunks come from the second sentence.
print(inputs["overflow_to_sample_mapping"])
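
The chunk counts (7 and 4) follow from a sliding window over the tokens. Below is a minimal sketch of that windowing, assuming one token per word with the final period split off, and assuming `max_length=6` leaves 4 slots for content once [CLS] and [SEP] are added. `split_with_stride` is a hypothetical helper for illustration, not the tokenizer's implementation.

```python
def split_with_stride(tokens, window, stride):
    """Split tokens into windows of at most `window` items overlapping by `stride`."""
    if not 0 <= stride < window:
        raise ValueError("stride must be non-negative and smaller than the window")
    chunks, start = [], 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
        start += window - stride
    return chunks


# Assumed tokenization: one token per word, final period as its own token.
first = "This sentence is not too long but we are going to split it anyway .".split()
second = "This sentence is shorter but will still get split .".split()

mapping = []
for sample_index, tokens in enumerate([first, second]):
    for chunk in split_with_stride(tokens, window=4, stride=2):
        mapping.append(sample_index)
        print(sample_index, chunk)

print(mapping)  # seven 0s followed by four 1s
```

Each window starts `window - stride` tokens after the previous one, which is what produces the 2-token overlap between neighbouring chunks.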

In [47]:
print(inputs)

In [48]:
# When tokenizing the long context, the question-answering pipeline follows a default maximum length of 384 and
# a stride of 128, aligned with the model's fine-tuning. Padding and offset information will also be included.
inputs: dict[str, Any] = tokenizer(
    question,
    long_context,
    stride=128,
    max_length=384,
    padding="longest",
    truncation="only_second",
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
)

In [49]:
# The inputs contain the input IDs and attention masks the model expects.
# Pop the offsets and the overflow_to_sample_mapping out of the inputs before converting it to a tensor:

_ = inputs.pop("overflow_to_sample_mapping")
offsets = inputs.pop("offset_mapping")

inputs: dict[str, Any] = inputs.convert_to_tensors("pt")
print(inputs["input_ids"].shape)

In [50]:
# Our long context was split in two, which means that after it goes through our model, we will have two sets of start and end logits:
outputs = model(**inputs)

start_logits: torch.Tensor = outputs.start_logits
end_logits: torch.Tensor = outputs.end_logits
print(start_logits.shape, end_logits.shape)

In [51]:
# Mask the tokens that are not part of the context before taking the softmax.
# Mask also all the padding tokens (as flagged by the attention mask):
sequence_ids: list[Optional[int]] = inputs.sequence_ids()
# Mask everything apart from the tokens of the context
mask: list[bool] = [i != 1 for i in sequence_ids]
# Unmask the [CLS] token
mask[0] = False
# Mask all the [PAD] tokens
mask = torch.logical_or(torch.tensor(mask)[None], (inputs["attention_mask"] == 0))

start_logits[mask] = -10000
end_logits[mask] = -10000

In [52]:
# Convert the logits to probabilities:
start_probabilities: torch.Tensor = F.softmax(start_logits, dim=-1)
end_probabilities: torch.Tensor = F.softmax(end_logits, dim=-1)

In [53]:
# For each of the two chunks, we assign scores to all potential answer spans and select the one with the highest score.

candidates: list[tuple[int, int, float]] = []
for start_probs, end_probs in zip(start_probabilities, end_probabilities):
    scores: torch.Tensor = start_probs[:, None] * end_probs[None, :]
    idx: int = torch.triu(scores).argmax().item()

    start_idx: int = idx // scores.shape[1]
    end_idx: int = idx % scores.shape[1]
    score: float = scores[start_idx, end_idx].item()
    candidates.append((start_idx, end_idx, score))

print(candidates)

In [54]:
for candidate, offset in zip(candidates, offsets):
    start_token, end_token, score = candidate
    start_char, _ = offset[start_token]
    _, end_char = offset[end_token]
    answer: str = long_context[start_char:end_char]
    result: dict[str, Any] = {"answer": answer, "start": start_char, "end": end_char, "score": score}
    print(result)