# Fast tokenizers in the QA pipeline (PyTorch)

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [3]:
%%capture
!pip install datasets evaluate transformers[sentencepiece]

Also, log in to Hugging Face.

In [4]:
from huggingface_hub import notebook_login

notebook_login()


We will now dive into the <font color='blue'>question-answering pipeline</font> and see how to leverage the offsets to grab the answer to the question at hand from the context, a bit like we did for the <font color='blue'>grouped entities</font> in the previous section. Then we will see how we can deal with <font color='blue'>very long contexts</font> that end up being <font color='blue'>truncated</font>. You can skip this section if you're not interested in the question answering task.


## Using the `question-answering` pipeline

As we saw in [Chapter 1](https://huggingface.co/learn/llm-course/chapter1/3), we can use the <font color='blue'>question-answering pipeline</font> like this to get the <font color='blue'>answer</font> to a <font color='blue'>question</font>:

In [6]:
from transformers import pipeline

question_answerer = pipeline("question-answering")
context = """
🤗 Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch, and TensorFlow — with a seamless integration
between them. It's straightforward to train your models with one before loading them for inference with the other.
"""
question = "Which deep learning libraries back 🤗 Transformers?"
question_answerer(question=question, context=context)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


{'score': 0.98026043176651,
 'start': 78,
 'end': 106,
 'answer': 'Jax, PyTorch, and TensorFlow'}

Unlike the <font color='blue'>other pipelines</font>, which <font color='blue'>can't truncate</font> and <font color='blue'>split</font> texts that are <font color='blue'>longer</font> than the <font color='blue'>maximum length</font> accepted by the model (and thus may <font color='blue'>miss information</font> at the <font color='blue'>end</font> of a <font color='blue'>document</font>), this <font color='blue'>pipeline</font> can deal with <font color='blue'>very long contexts</font> and will <font color='blue'>return</font> the <font color='blue'>answer</font> to the <font color='blue'>question</font> even if it's at the end:


In [7]:
long_context = """
🤗 Transformers: State of the Art NLP

🤗 Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction,
question answering, summarization, translation, text generation and more in over 100 languages.
Its aim is to make cutting-edge NLP easier to use for everyone.

🤗 Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets and
then share them with the community on our model hub. At the same time, each python module defining an architecture is fully standalone and
can be modified to enable quick research experiments.

Why should I use transformers?

1. Easy-to-use state-of-the-art models:
  - High performance on NLU and NLG tasks.
  - Low barrier to entry for educators and practitioners.
  - Few user-facing abstractions with just three classes to learn.
  - A unified API for using all our pretrained models.
  - Lower compute costs, smaller carbon footprint:

2. Researchers can share trained models instead of always retraining.
  - Practitioners can reduce compute time and production costs.
  - Dozens of architectures with over 10,000 pretrained models, some in more than 100 languages.

3. Choose the right framework for every part of a model's lifetime:
  - Train state-of-the-art models in 3 lines of code.
  - Move a single model between TF2.0/PyTorch frameworks at will.
  - Seamlessly pick the right framework for training, evaluation and production.

4. Easily customize a model or an example to your needs:
  - We provide examples for each architecture to reproduce the results published by its original authors.
  - Model internals are exposed as consistently as possible.
  - Model files can be used independently of the library for quick experiments.

🤗 Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch and TensorFlow — with a seamless integration
between them. It's straightforward to train your models with one before loading them for inference with the other.
"""
question_answerer(question=question, context=long_context)

{'score': 0.9714871048927307,
 'start': 1892,
 'end': 1919,
 'answer': 'Jax, PyTorch and TensorFlow'}

Let's see how it does all of this!

## Using a model for question answering

Like with any other pipeline, we start by <font color='blue'>tokenizing our input</font> and then <font color='blue'>sending</font> it through the <font color='blue'>model</font>. The <font color='blue'>checkpoint</font> used by <font color='blue'>default</font> for the `question-answering` pipeline is [`distilbert-base-cased-distilled-squad`](https://huggingface.co/distilbert-base-cased-distilled-squad), where the <font color='blue'>squad</font> in the <font color='blue'>name</font> comes from the <font color='blue'>dataset</font> on which the <font color='blue'>model</font> was <font color='blue'>fine-tuned</font>; we'll talk more about the SQuAD dataset in [Chapter 7](https://huggingface.co/learn/llm-course/chapter7/7):

In [9]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_checkpoint = "distilbert-base-cased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

inputs = tokenizer(question, context, return_tensors="pt")
outputs = model(**inputs)

Note that we <font color='blue'>tokenize</font> the <font color='blue'>question</font> and the <font color='blue'>context</font> as a <font color='blue'>pair</font>, with the <font color='blue'>question first</font>.

![](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter6/question_tokens.svg)

Models for question answering work a little differently from the models we've seen up to now. Using the <font color='blue'>picture</font> above as an <font color='blue'>example</font>, the model has been <font color='blue'>trained</font> to <font color='blue'>predict</font> the <font color='blue'>index</font> of the <font color='blue'>token starting</font> the <font color='blue'>answer</font> (here <font color='blue'>21</font>) and the <font color='blue'>index</font> of the <font color='blue'>token</font> where the <font color='blue'>answer ends</font> (here <font color='blue'>24</font>). This is why those models don't return one tensor of logits but <font color='blue'>two</font>: <font color='blue'>one</font> for the <font color='blue'>logits</font> corresponding to the <font color='blue'>start token</font> of the <font color='blue'>answer</font>, and <font color='blue'>one</font> for the <font color='blue'>logits</font> corresponding to the <font color='blue'>end token</font> of the <font color='blue'>answer</font>. Since in this case we have only one input containing <font color='blue'>67 tokens</font>, we get:


In [10]:
start_logits = outputs.start_logits
end_logits = outputs.end_logits
print(start_logits.shape, end_logits.shape)

torch.Size([1, 67]) torch.Size([1, 67])


To convert those <font color='blue'>logits</font> into <font color='blue'>probabilities</font>, we will apply a <font color='blue'>softmax function</font> -- but before that, we need to make sure we <font color='blue'>mask</font> the <font color='blue'>indices</font> that are <font color='blue'>not part</font> of the <font color='blue'>context</font>. Our input is `[CLS] question [SEP] context [SEP]`, so we need to mask the tokens of the <font color='blue'>question</font> as well as the <font color='blue'>`[SEP]` token</font>. We'll <font color='blue'>keep</font> the <font color='blue'>`[CLS]` token</font>, however, as some models use it to <font color='blue'>indicate</font> that the <font color='blue'>answer</font> is <font color='blue'>not in the context</font>.

Since we will apply a <font color='blue'>softmax afterward</font>, we just need to <font color='blue'>replace</font> the <font color='blue'>logits</font> we want to <font color='blue'>mask</font> with a <font color='blue'>large negative number</font>. Here, we use `-10000`:

In [11]:
import torch

sequence_ids = inputs.sequence_ids()
# Mask everything apart from the tokens of the context
mask = [i != 1 for i in sequence_ids]
# Unmask the [CLS] token
mask[0] = False
mask = torch.tensor(mask)[None]

start_logits[mask] = -10000
end_logits[mask] = -10000

Now that we have properly <font color='blue'>masked</font> the <font color='blue'>logits</font> corresponding to <font color='blue'>positions</font> we <font color='blue'>don't want</font> to <font color='blue'>predict</font>, we can apply the <font color='blue'>softmax</font>:

In [12]:
start_probabilities = torch.nn.functional.softmax(start_logits, dim=-1)[0]
end_probabilities = torch.nn.functional.softmax(end_logits, dim=-1)[0]
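
To see why replacing logits with a large negative number is enough, here is a minimal pure-Python sketch. The logits and the mask are made-up toy values (not the model's outputs), and the `softmax` helper is our own stand-in for `torch.nn.functional.softmax`:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy logits for 5 tokens; pretend tokens 1 and 2 belong to the question.
toy_logits = [0.5, 4.0, 3.0, 2.0, 1.0]
toy_mask = [False, True, True, False, False]

# Same trick as above: masked positions get a large negative logit.
toy_masked_logits = [-10000.0 if m else x for x, m in zip(toy_logits, toy_mask)]
toy_probs = softmax(toy_masked_logits)

print([round(p, 4) for p in toy_probs])
```

The masked positions underflow to a probability of zero, so the remaining probability mass is shared only among the tokens we kept, even though token 1 had the largest raw logit.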

At this stage, we could take the <font color='blue'>argmax</font> of the <font color='blue'>start</font> and <font color='blue'>end probabilities</font> -- but we <font color='blue'>might end up</font> with a <font color='blue'>start index</font> that is <font color='blue'>greater</font> than the <font color='blue'>end index</font>, so we need to <font color='blue'>take</font> a few <font color='blue'>more precautions</font>. We will compute the <font color='blue'>probabilities</font> of each possible <font color='blue'>`start_index`</font> and <font color='blue'>`end_index`</font> where <font color='blue'>`start_index <= end_index`</font>, then take the <font color='blue'>tuple `(start_index, end_index)`</font> with the <font color='blue'>highest probability</font>.

Assuming the <font color='blue'>events</font> "The answer starts at `start_index`" and "The answer ends at `end_index`" to be <font color='blue'>independent</font>, the <font color='blue'>probability</font> that the <font color='blue'>answer starts</font> at <font color='blue'>`start_index`</font> and <font color='blue'>ends</font> at <font color='blue'>`end_index`</font> is

$$\mathrm{start\_probabilities}\left[\mathrm{start\_index}\right] \times \mathrm{end\_probabilities}\left[\mathrm{end\_index}\right].$$

So, to compute <font color='blue'>all the scores</font>, we just need to <font color='blue'>compute</font> all the products

$$\mathrm{start\_probabilities}\left[\mathrm{start\_index}\right] \times \mathrm{end\_probabilities}\left[\mathrm{end\_index}\right]$$

where <font color='blue'>`start_index <= end_index`</font>. First let's compute <font color='blue'>all</font> the <font color='blue'>possible products</font>:

In [13]:
scores = start_probabilities[:, None] * end_probabilities[None, :]

Then we'll <font color='blue'>mask</font> the <font color='blue'>values</font> where <font color='blue'>`start_index > end_index`</font> by <font color='blue'>setting them</font> to <font color='blue'>`0`</font> (the other probabilities are all positive numbers). The <font color='blue'>`torch.triu()` function</font> returns the <font color='blue'>upper triangular part</font> of the <font color='blue'>2D tensor</font> passed as an argument, so it will <font color='blue'>do</font> that <font color='blue'>masking for us</font>:


In [14]:
scores = torch.triu(scores)
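
The two steps above (all pairwise products, then keeping only the upper triangle) can be sketched without PyTorch on a tiny example. The probabilities below are invented toy values over 4 token positions, chosen only to make the matrix easy to read:

```python
# Toy probabilities over 4 token positions (assumed values, each list sums to 1).
toy_start_probs = [0.1, 0.6, 0.2, 0.1]
toy_end_probs = [0.05, 0.15, 0.1, 0.7]

n = len(toy_start_probs)

# toy_scores[i][j] = P(start = i) * P(end = j), set to 0 when i > j.
# This mirrors the broadcasted product followed by torch.triu().
toy_scores = [
    [toy_start_probs[i] * toy_end_probs[j] if i <= j else 0.0 for j in range(n)]
    for i in range(n)
]

for row in toy_scores:
    print([round(s, 3) for s in row])
```

Row `i` holds every span starting at token `i`; everything below the diagonal is zero because a span cannot end before it starts.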

Now we just have to <font color='blue'>get</font> the <font color='blue'>index</font> of the <font color='blue'>maximum</font>. Since <font color='blue'>PyTorch</font> will <font color='blue'>return</font> the <font color='blue'>index</font> in the <font color='blue'>flattened tensor</font>, we need to use the <font color='blue'>floor division `//`</font> and <font color='blue'>modulus `%`</font> operations to get the <font color='blue'>`start_index`</font> and <font color='blue'>`end_index`</font>:


In [15]:
max_index = scores.argmax().item()
start_index = max_index // scores.shape[1]
end_index = max_index % scores.shape[1]
print(f"Start: {start_index}, End: {end_index}, Score: {scores[start_index, end_index].item():.6f}")

Start: 23, End: 35, Score: 0.980260
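
As a small aside, the floor division and modulus can be done in one call with Python's built-in `divmod`. The sketch below uses a toy 3x3 matrix and a helper name (`unflatten`) of our own choosing:

```python
def unflatten(flat_index, num_cols):
    """Map an index in a row-major flattened matrix back to (row, col)."""
    return divmod(flat_index, num_cols)


# Toy upper-triangular score matrix (assumed values).
toy_matrix = [
    [0.01, 0.02, 0.03],
    [0.00, 0.10, 0.70],
    [0.00, 0.00, 0.14],
]
toy_num_cols = len(toy_matrix[0])

# Flatten row by row, then find the position of the maximum.
toy_flat = [s for row in toy_matrix for s in row]
toy_max_index = max(range(len(toy_flat)), key=toy_flat.__getitem__)

toy_start, toy_end = unflatten(toy_max_index, toy_num_cols)
print(toy_max_index, toy_start, toy_end)
```

With the 67 columns of our real score matrix, the pair `(23, 35)` found above corresponds to the flat index `23 * 67 + 35 = 1576`.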


We're <font color='blue'>not quite done yet</font>, but at least we already have the <font color='blue'>correct score</font> for the <font color='blue'>answer</font> (you can check this by comparing it to the first result in the previous section).

✏️ **Try it out!** Compute the start and end indices for the five most likely answers.

In [16]:
# Exercise - use torch.topk to return the k largest elements of a given input tensor
flat_scores = scores.flatten()
top_5_flat_indices = torch.topk(flat_scores, k=5).indices
start_indices = top_5_flat_indices // scores.shape[1]
end_indices = top_5_flat_indices % scores.shape[1]
top_5_scores = scores[start_indices, end_indices]

for i in range(5):
    print(f"{i+1}. Start: {start_indices[i].item()}, End: {end_indices[i].item()}, Score: {top_5_scores[i].item():.6f}")

1. Start: 23, End: 35, Score: 0.980260
2. Start: 23, End: 36, Score: 0.008248
3. Start: 16, End: 35, Score: 0.006841
4. Start: 23, End: 29, Score: 0.001368
5. Start: 25, End: 35, Score: 0.000381


We have the <font color='blue'>`start_index`</font> and <font color='blue'>`end_index`</font> of the <font color='blue'>answer</font> in terms of <font color='blue'>tokens</font>, so now we just need to <font color='blue'>convert to</font> the <font color='blue'>character indices</font> in the context. This is where the <font color='blue'>offsets</font> will be super <font color='blue'>useful</font>. We can grab them and use them like we did in the token classification task:


In [17]:
inputs_with_offsets = tokenizer(question, context, return_offsets_mapping=True)
offsets = inputs_with_offsets["offset_mapping"]

start_char, _ = offsets[start_index]
_, end_char = offsets[end_index]
answer = context[start_char:end_char]
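
To make the role of the offsets concrete, here is a self-contained sketch with a hypothetical regex-based tokenizer standing in for the fast tokenizer. The function `tokenize_with_offsets`, the toy sentence, and the chosen token indices are all assumptions made for illustration:

```python
import re


def tokenize_with_offsets(text):
    """Hypothetical stand-in for a fast tokenizer: split into words and
    punctuation marks, recording the (start_char, end_char) span of each."""
    return [(m.group(), m.span()) for m in re.finditer(r"\w+|[^\w\s]", text)]


toy_context = "Backed by Jax, PyTorch, and TensorFlow with seamless integration."
toy_tokens = tokenize_with_offsets(toy_context)
toy_offsets = [span for _, span in toy_tokens]

# Suppose the model picked tokens 2 to 7 (inclusive) as the answer span.
toy_start_index, toy_end_index = 2, 7

# Start of the first token and end of the last token give the character span.
toy_start_char, _ = toy_offsets[toy_start_index]
_, toy_end_char = toy_offsets[toy_end_index]
print(toy_context[toy_start_char:toy_end_char])
```

Slicing the original string, rather than joining the tokens back together, is what preserves the exact spacing and punctuation of the answer.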

Now we just have to <font color='blue'>format everything</font> to <font color='blue'>get our result</font>:

In [18]:
result = {
    "answer": answer,
    "start": start_char,
    "end": end_char,
    "score": scores[start_index, end_index],
}
print(result['answer'])
print(f"Start: {result['start']}, End: {result['end']}, Score: {result['score'].item():.6f}")

Jax, PyTorch, and TensorFlow
Start: 78, End: 106, Score: 0.980260


Great! That's the <font color='blue'>same</font> as in our <font color='blue'>first example</font>!

✏️ **Try it out!** Use the best scores you computed earlier to show the five most likely answers. To check your results, go back to the first pipeline and pass in `top_k=5` when calling it.

In [19]:
# Exercise - use top_k=5 and extract answers
offsets = tokenizer(question, context, return_offsets_mapping=True)["offset_mapping"]
top_5 = torch.topk(scores.flatten(), 5).indices

for i, idx in enumerate(top_5, 1):
    start_idx, end_idx = divmod(idx.item(), scores.shape[1])
    start_char, end_char = offsets[start_idx][0], offsets[end_idx][1]
    answer = context[start_char:end_char]
    score = scores[start_idx, end_idx].item()
    print(f"{i}. '{answer}' (Score: {score:.6f})")

1. 'Jax, PyTorch, and TensorFlow' (Score: 0.980260)
2. 'Jax, PyTorch, and TensorFlow —' (Score: 0.008248)
3. 'three most popular deep learning libraries — Jax, PyTorch, and TensorFlow' (Score: 0.006841)
4. 'Jax, PyTorch' (Score: 0.001368)
5. 'PyTorch, and TensorFlow' (Score: 0.000381)


## Handling long contexts

If we try to <font color='blue'>tokenize</font> the <font color='blue'>question</font> and <font color='blue'>long context</font> we used as an <font color='blue'>example previously</font>, we'll get a <font color='blue'>number of tokens higher</font> than the <font color='blue'>maximum length</font> used in the <font color='blue'>question-answering pipeline</font> (which is <font color='blue'>384</font>):


In [23]:
inputs = tokenizer(question, long_context)
print(f'The number of tokens is {len(inputs["input_ids"])}')

The number of tokens is 461


So, we'll need to <font color='blue'>truncate</font> our <font color='blue'>inputs</font> at that <font color='blue'>maximum length</font>. There are several ways we can do this, but we don't want to <font color='blue'>truncate</font> the question, only the <font color='blue'>context</font>. Since the <font color='blue'>context</font> is the <font color='blue'>second sentence</font>, we'll use the <font color='blue'>`"only_second"` truncation strategy.</font> The <font color='blue'>problem</font> that arises then is that the <font color='blue'>answer</font> to the question <font color='blue'>may not be in</font> the <font color='blue'>truncated context</font>. Here, for instance, we picked a question where the answer is toward the end of the context, and when we truncate it that answer is not present:


In [44]:
inputs = tokenizer(question, long_context, max_length=384, truncation="only_second")

# Match the format provided at the beginning of the notebook
text = tokenizer.decode(inputs["input_ids"])
text = text.replace(" - ", "-").replace(" , ", ", ").replace(" .", ".").replace(" :", ":")
text = text.replace(".-", "\n- ")
text = text.replace(":-", "\n- ")
text = text.replace("[SEP] ", "[SEP] \n\n")
text = text.replace("everyone. ", "everyone.\n\n")
text = text.replace("experiments. Why", "experiments.\n\nWhy")
text = text.replace("transformers? 1.", "transformers?\n\n1.")
text = text.replace("footprint: 2.", "footprint:\n\n2.")
text = text.replace("languages. 3.", "languages.\n\n3.")
text = text.replace("production. 4.", "production.\n\n4.")

print(text)

[CLS] Which deep learning libraries back [UNK] Transformers? [SEP] 

[UNK] Transformers: State of the Art NLP [UNK] Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction, question answering, summarization, translation, text generation and more in over 100 languages. Its aim is to make cutting-edge NLP easier to use for everyone.

[UNK] Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets and then share them with the community on our model hub. At the same time, each python module defining an architecture is fully standalone and can be modified to enable quick research experiments.

Why should I use transformers?

1. Easy-to-use state-of-the-art models
- High performance on NLU and NLG tasks
- Low barrier to entry for educators and practitioners
- Few user-facing abstractions with just three classes to learn
- A unified API for using all 

This means the <font color='blue'>model</font> will have a <font color='blue'>hard time picking</font> the <font color='blue'>correct answer</font>. To fix this, the <font color='blue'>`question-answering` pipeline</font> allows us to <font color='blue'>split</font> the <font color='blue'>context</font> into <font color='blue'>smaller chunks</font>, <font color='blue'>specifying</font> the <font color='blue'>maximum length</font>. To make sure we don't split the context at exactly the wrong place, which would make it impossible to find the answer, it also <font color='blue'>includes</font> some <font color='blue'>overlap between</font> the <font color='blue'>chunks</font>.

We can have the <font color='blue'>tokenizer</font> (fast or slow) do this for us by <font color='blue'>adding `return_overflowing_tokens=True`</font>, and we can <font color='blue'>specify the overlap</font> we want with the <font color='blue'>`stride` argument</font>. Here is an example, using a smaller sentence:


In [45]:
sentence = "This sentence is not too long but we are going to split it anyway."
inputs = tokenizer(
    sentence, truncation=True, return_overflowing_tokens=True, max_length=6, stride=2
)

for ids in inputs["input_ids"]:
    print(tokenizer.decode(ids))

[CLS] This sentence is not [SEP]
[CLS] is not too long [SEP]
[CLS] too long but we [SEP]
[CLS] but we are going [SEP]
[CLS] are going to split [SEP]
[CLS] to split it anyway [SEP]
[CLS] it anyway. [SEP]


As we can see, the <font color='blue'>sentence</font> has been <font color='blue'>split into chunks</font> in such a way that <font color='blue'>each entry</font> in `inputs["input_ids"]` <font color='blue'>has at most 6 tokens</font>  (we would need to add padding to have the last entry be the same size as the others) and there is an <font color='blue'>overlap of 2 tokens</font> between <font color='blue'>each of the entries</font>.
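
The chunking behaviour can be imitated with a simple sliding window. This is only a sketch of the idea, not the tokenizer's actual implementation: it assumes one token per whitespace-separated word (which happens to hold for this sentence), and `chunk_with_stride` is a helper name of our own. The window is 4 because `max_length=6` leaves 4 slots once `[CLS]` and `[SEP]` are added.

```python
def chunk_with_stride(tokens, window, overlap):
    """Split `tokens` into windows of at most `window` items, where
    consecutive windows share `overlap` items."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    step = window - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this window already reaches the end of the sequence
        start += step
    return chunks


toy_tokens = "This sentence is not too long but we are going to split it anyway .".split()
toy_chunks = chunk_with_stride(toy_tokens, window=4, overlap=2)

for chunk in toy_chunks:
    print(" ".join(chunk))
```

Each window starts `window - overlap` tokens after the previous one, which is why a larger stride produces more chunks for the same text.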

Let's take a <font color='blue'>closer look</font> at the <font color='blue'>result</font> of the <font color='blue'>tokenization</font>:

In [19]:
print(inputs.keys())

dict_keys(['input_ids', 'attention_mask', 'overflow_to_sample_mapping'])


As expected, we get <font color='blue'>input IDs</font> and an <font color='blue'>attention mask</font>. The last key, <font color='blue'>`overflow_to_sample_mapping`</font>, is a <font color='blue'>map</font> that <font color='blue'>tells us</font> which <font color='blue'>sentence</font> each of the <font color='blue'>results corresponds to</font> -- here we have <font color='blue'>7 results</font> that all come from the <font color='blue'>(only) sentence</font> we passed the tokenizer:


In [46]:
print(inputs["overflow_to_sample_mapping"])

[0, 0, 0, 0, 0, 0, 0]


This is <font color='blue'>more useful</font> when we <font color='blue'>tokenize several sentences together</font>. For instance, this returns:

In [47]:
sentences = [
    "This sentence is not too long but we are going to split it anyway.",
    "This sentence is shorter but will still get split.",
]
inputs = tokenizer(
    sentences, truncation=True, return_overflowing_tokens=True, max_length=6, stride=2
)

print(inputs["overflow_to_sample_mapping"])

[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]


which means the <font color='blue'>first sentence</font> is split into <font color='blue'>7 chunks</font> as before, and the <font color='blue'>next 4 chunks</font> come from the <font color='blue'>second sentence</font>.
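
The mapping itself is easy to picture: every time a chunk is produced, we record the index of the sample it came from. The sketch below is a pure-Python imitation under the same assumptions as before (one token per whitespace-separated word, a window of 4 tokens once the special tokens are accounted for); `chunk_with_mapping` is a hypothetical helper, not part of the library:

```python
def chunk_with_mapping(samples, window, overlap):
    """Split each token list into overlapping windows and record, for every
    chunk, the index of the sample it was cut from."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    step = window - overlap
    chunks, sample_mapping = [], []
    for sample_idx, tokens in enumerate(samples):
        start = 0
        while True:
            chunks.append(tokens[start:start + window])
            sample_mapping.append(sample_idx)
            if start + window >= len(tokens):
                break
            start += step
    return chunks, sample_mapping


toy_samples = [
    "This sentence is not too long but we are going to split it anyway .".split(),
    "This sentence is shorter but will still get split .".split(),
]
toy_chunks, toy_mapping = chunk_with_mapping(toy_samples, window=4, overlap=2)
print(toy_mapping)
```

Having this list is what lets us regroup the per-chunk predictions by original sample when several contexts are processed in one batch.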

Now let's go back to our long context. By default the <font color='blue'>`question-answering` pipeline</font> uses a <font color='blue'>maximum length</font> of <font color='blue'>384</font>, as we mentioned earlier, and a <font color='blue'>stride</font> of <font color='blue'>128</font>, which <font color='blue'>correspond</font> to the <font color='blue'>way</font> the <font color='blue'>model was fine-tuned</font> (you can <font color='blue'>adjust</font> those <font color='blue'>parameters</font> by passing <font color='blue'>`max_seq_len`</font> and <font color='blue'>`doc_stride`</font> arguments when calling the <font color='blue'>pipeline</font>). We will thus use those parameters when tokenizing. We'll also <font color='blue'>add padding</font> (to have <font color='blue'>samples</font> of the <font color='blue'>same length</font>, so we can build tensors) as well as <font color='blue'>ask for the offsets</font>:


In [66]:
inputs = tokenizer(
    question,
    long_context,
    stride=128,
    max_length=384,
    padding="longest",
    truncation="only_second",
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
)
print(inputs.keys())

dict_keys(['input_ids', 'attention_mask', 'offset_mapping', 'overflow_to_sample_mapping'])


Those <font color='blue'>inputs</font> will contain the <font color='blue'>input IDs</font> and <font color='blue'>attention masks</font> the model expects, as well as the <font color='blue'>offsets</font> and the <font color='blue'>`overflow_to_sample_mapping`</font> we just talked about. Since those two are <font color='blue'>not parameters used</font> by the <font color='blue'>model</font>, we'll <font color='blue'>pop them out</font> of the <font color='blue'>`inputs`</font> (and we <font color='blue'>won't store the map</font>, since it's not useful here) before converting it to a tensor:


In [67]:
_ = inputs.pop("overflow_to_sample_mapping")
offsets = inputs.pop("offset_mapping")

inputs = inputs.convert_to_tensors("pt")
print(inputs["input_ids"].shape)

torch.Size([2, 384])


Our <font color='blue'>long context</font> was <font color='blue'>split in two</font>, which means that <font color='blue'>after</font> it <font color='blue'>goes through</font> our <font color='blue'>model</font>, we will have <font color='blue'>two sets</font> of <font color='blue'>start</font> and <font color='blue'>end logits</font>:

In [68]:
outputs = model(**inputs)

start_logits = outputs.start_logits
end_logits = outputs.end_logits
print(start_logits.shape, end_logits.shape)

torch.Size([2, 384]) torch.Size([2, 384])



Like before, we <font color='blue'>first mask</font> the <font color='blue'>tokens</font> that are <font color='blue'>not part</font> of the <font color='blue'>context before</font> taking the <font color='blue'>softmax</font>. We also <font color='blue'>mask</font> all the <font color='blue'>padding tokens</font> (as flagged by the attention mask):

In [69]:
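# sequence_ids() with no argument returns the IDs of the first chunk; the question
# has the same length in every chunk, so this mask is broadcast to all chunks below
sequence_ids = inputs.sequence_ids()
# Mask everything apart from the tokens of the context
mask = [i != 1 for i in sequence_ids]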
# Unmask the [CLS] token
mask[0] = False
# Mask all the [PAD] tokens
mask = torch.logical_or(torch.tensor(mask)[None], (inputs["attention_mask"] == 0))

start_logits[mask] = -10000
end_logits[mask] = -10000
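
If we prefer to build the mask chunk by chunk instead of broadcasting the first chunk's mask, the logic looks like the sketch below. It is written in plain Python on a made-up chunk; `build_mask` is our own helper, and we assume per-chunk sequence IDs in which special and padding tokens are `None`:

```python
def build_mask(sequence_ids, attention_mask):
    """Return a list where True marks a position to hide before the softmax."""
    mask = [sid != 1 for sid in sequence_ids]  # keep only context tokens
    mask[0] = False                            # but leave [CLS] visible
    # Also hide every padding position flagged by the attention mask.
    return [m or a == 0 for m, a in zip(mask, attention_mask)]


# Toy chunk laid out as: [CLS] q q [SEP] c c c [SEP] [PAD] [PAD]
toy_sequence_ids = [None, 0, 0, None, 1, 1, 1, None, None, None]
toy_attention_mask = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]

toy_mask = build_mask(toy_sequence_ids, toy_attention_mask)
print(toy_mask)
```

Only `[CLS]` and the three context tokens stay unmasked; the question, both `[SEP]` tokens, and the padding are hidden.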

Then we can use the <font color='blue'>softmax</font> to <font color='blue'>convert</font> our <font color='blue'>logits</font> to <font color='blue'>probabilities</font>:

In [72]:
start_probabilities = torch.nn.functional.softmax(start_logits, dim=-1)
end_probabilities = torch.nn.functional.softmax(end_logits, dim=-1)

The next step is <font color='blue'>similar</font> to what we did for the <font color='blue'>small context</font>, but we <font color='blue'>repeat it</font> for each of our <font color='blue'>two chunks</font>. We <font color='blue'>attribute</font> a <font color='blue'>score</font> to <font color='blue'>all possible spans</font> of <font color='blue'>answer</font>, then take the <font color='blue'>span</font> with the <font color='blue'>best score</font>:

In [73]:
candidates = []
for start_probs, end_probs in zip(start_probabilities, end_probabilities):
    scores = start_probs[:, None] * end_probs[None, :]
    idx = torch.triu(scores).argmax().item()

    start_idx = idx // scores.shape[1]
    end_idx = idx % scores.shape[1]
    score = scores[start_idx, end_idx].item()
    candidates.append((start_idx, end_idx, score))

print(candidates)

[(0, 18, 0.3386705815792084), (173, 184, 0.9714869856834412)]


Those <font color='blue'>two candidates correspond</font> to the <font color='blue'>best answers</font> the model was able to find in <font color='blue'>each chunk</font>. The model is way more confident the <font color='blue'>right answer</font> is in the <font color='blue'>second part</font> (which is a good sign!). Now we just have to <font color='blue'>map</font> those <font color='blue'>two token spans</font> to <font color='blue'>spans of characters</font> in the <font color='blue'>context</font> (we only need to map the second one to have our answer, but it's interesting to see what the model has picked in the first chunk).
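
Picking the overall winner is then a matter of comparing the per-chunk scores. A minimal sketch, reusing the two `(start_token, end_token, score)` tuples printed above (the variable names are ours):

```python
# Best span of each chunk, copied from the output above.
toy_candidates = [(0, 18, 0.3386705815792084), (173, 184, 0.9714869856834412)]

# Keep track of the chunk index: token positions are relative to their chunk,
# so we need it later to look up the right list of offsets.
best_chunk, (best_start, best_end, best_score) = max(
    enumerate(toy_candidates), key=lambda item: item[1][2]
)

print(best_chunk, best_start, best_end, round(best_score, 4))
```

Because the token indices only make sense inside their own chunk, the chunk index has to travel with the span until it has been converted to character positions.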


✏️ **Try it out!** Adapt the code above to return the scores and spans for the five most likely answers (in total, not per chunk).

In [130]:
# Exercise - collect all candidates from all chunks
all_candidates = []

# Process each chunk separately (through stride/max_length splitting)
for i, (s_logits, e_logits) in enumerate(zip(outputs.start_logits, outputs.end_logits)):
    # Create mask to hide non-context tokens
    # Here, sequence_ids: 0=question, 1=context, None=special tokens
    mask = torch.tensor([j != 1 for j in inputs.sequence_ids(i)])  # Hide everything except context
    mask[0] = False  # Keep [CLS] token visible for potential answers
    mask |= inputs["attention_mask"][i] == 0  # Mask padding tokens

    # Need to clone as we will do in-place modification on tensors
    s_logits, e_logits = s_logits.clone(), e_logits.clone()

    # Apply mask through the same strategy as above
    s_logits[mask] = e_logits[mask] = -10000

    # Calculate probability matrix: start_prob[i] * end_prob[j] for all i,j combinations
    # torch.triu keeps only upper triangle (start <= end positions)
    scores = torch.triu(torch.softmax(s_logits, -1)[:, None] * torch.softmax(e_logits, -1)[None, :])

    # Get top 10 scoring spans from this chunk
    for idx in torch.topk(scores.flatten(), min(10, scores.nonzero().size(0))).indices:
        # Convert flat index back to 2D coordinates (start_pos, end_pos)
        start, end = divmod(idx.item(), scores.shape[1])

        # Only keep candidates whose score exceeds a small threshold (0.001)
        if (score := scores[start, end].item()) > 0.001:
            # Verify that both start and end are in the context (sequence_id == 1)
            if inputs.sequence_ids(i)[start] == 1 and inputs.sequence_ids(i)[end] == 1:
                # Store: (score, chunk_index, start_token, end_token)
                all_candidates.append((score, i, start, end))

# Get the top 5 answers across all chunks
top_5 = sorted(all_candidates, reverse=True)[:5]

for rank, (score, chunk_idx, start_idx, end_idx) in enumerate(top_5, 1):
    # Only the scores and chunk indices are printed here; the spans are decoded in the next exercise
    print(f"{rank}. Score: {score:.6f}, Chunk: {chunk_idx}")

1. Score: 0.971487, Chunk: 1
2. Score: 0.149496, Chunk: 0
3. Score: 0.015565, Chunk: 1
4. Score: 0.013706, Chunk: 0
5. Score: 0.010597, Chunk: 0


The <font color='blue'>`offsets`</font> we <font color='blue'>grabbed earlier</font> is actually a <font color='blue'>list of offsets</font>, with <font color='blue'>one list per chunk</font> of text:

In [131]:
for i, (candidate, offset) in enumerate(zip(candidates, offsets), 1):
    start_token, end_token, score = candidate
    start_char, _ = offset[start_token]
    _, end_char = offset[end_token]
    answer = long_context[start_char:end_char].strip()  # Remove leading/trailing whitespace

    print(f"{i}. {answer}")
    print(f"   Score: {score:.4f}, Position: {start_char}-{end_char}, Tokens: {start_token}-{end_token}")
    print()

1. 🤗 Transformers: State of the Art NLP
   Score: 0.3387, Position: 0-37, Tokens: 0-18

2. Jax, PyTorch and TensorFlow
   Score: 0.9715, Position: 1892-1919, Tokens: 173-184



If we <font color='blue'>ignore</font> the <font color='blue'>first result</font>, we get the <font color='blue'>same result</font> as <font color='blue'>our pipeline</font> for this long context -- yay!

✏️ **Try it out!** Use the best scores you computed before to show the five most likely answers (for the whole context, not each chunk). To check your results, go back to the first pipeline and pass in `top_k=5` when calling it.


In [132]:
# Exercise - Use the work done before
for rank, (score, chunk_idx, start_idx, end_idx) in enumerate(top_5, 1):
    # Extract the actual tokens from this span
    tokens = inputs["input_ids"][chunk_idx][start_idx:end_idx+1]
    # Decode tokens back to readable text
    answer = tokenizer.decode(tokens, skip_special_tokens=True)
    print(f"{answer}, Score: {score:.6f}")

Jax, PyTorch and TensorFlow, Score: 0.971487
State of the Art NLP, Score: 0.149496
Jax, PyTorch and TensorFlow —, Score: 0.015565
NLP, Score: 0.013706
Transformers : State of the Art NLP, Score: 0.010597


In [127]:
# Verification: We compare with the pipeline results
print("Verification using pipeline with top_k=5:")
print()
pipeline_results = question_answerer(question=question, context=long_context, top_k=5)
for i, result in enumerate(pipeline_results, 1):
    print(f"{i}. {result['answer']}, Score: {result['score']:.4f}")

Verification using pipeline with top_k=5:

1. Jax, PyTorch and TensorFlow, Score: 0.9715
2. State of the Art NLP, Score: 0.1495
3. Jax, PyTorch and TensorFlow —, Score: 0.0156
4. NLP, Score: 0.0137
5. Transformers: State of the Art NLP, Score: 0.0106


This concludes our <font color='blue'>deep dive</font> into the <font color='blue'>tokenizer's capabilities</font>. We will put all of this into <font color='blue'>practice again</font> in the <font color='blue'>next chapter</font>, when we show you how to <font color='blue'>fine-tune</font> a <font color='blue'>model</font> on a <font color='blue'>range</font> of common <font color='blue'>NLP tasks</font>.
