# Question answering (extractive)

**Reference**: https://huggingface.co/learn/nlp-course/chapter7/7

> 💡 Encoder-only models like BERT tend to be great at extracting answers to factoid questions like “Who invented the Transformer architecture?” but fare poorly when given open-ended questions like “Why is the sky blue?” \
In these more challenging cases, encoder-decoder models like T5 and BART are typically used to synthesize the information in a way that’s quite similar to text summarization. \
If you’re interested in this type of generative question answering, we recommend checking out our demo based on the ELI5 dataset.

- Generative question answering demo: [https://yjernite.github.io/lfqa.html](https://yjernite.github.io/lfqa.html)

# Preparing the data
1. The most widely used academic benchmark for extractive question answering: [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/)
    - [SQuAD v2](https://huggingface.co/datasets/squad_v2): includes questions that don't have an answer

## The SQuAD dataset

In [1]:
from datasets import load_dataset

raw_datasets = load_dataset("squad")
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

In [2]:
print("- Context: ",  raw_datasets["train"][0]["context"])
print("- Question: ", raw_datasets["train"][0]["question"])
print("- Answers: ",  raw_datasets["train"][0]["answers"])

- Context:  Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.
- Question:  To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?
- Answers:  {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}


- `answers`: This is the format that will be expected by the squad metric during evaluation.
    ```
    {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}
    ```
    - `answer_start`: character index in `context` where the answer begins

In [3]:
context, answers = raw_datasets["train"][0]["context"], raw_datasets["train"][0]["answers"]

print(answers)
print(context[answers['answer_start'][0] : answers['answer_start'][0] + len(answers['text'][0])])

{'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}
Saint Bernadette Soubirous


In the training set, every question has exactly **one** answer.

In [4]:
raw_datasets['train'].filter(lambda x: len(x['answers']['text']) != 1)

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 0
})

For evaluation, however, there are several possible answers for each sample, which may be the same or different.

In [5]:
print(raw_datasets['validation'][0]['answers'])
print(raw_datasets['validation'][2]['answers'])

{'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos'], 'answer_start': [177, 177, 177]}
{'text': ['Santa Clara, California', "Levi's Stadium", "Levi's Stadium in the San Francisco Bay Area at Santa Clara, California."], 'answer_start': [403, 355, 355]}


Some questions have several acceptable answers; the evaluation script (the `squad` metric) will **compare a predicted answer to all the acceptable answers and take the best score**.
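
The "best score over all acceptable answers" rule can be sketched in plain Python. This is a simplified approximation modeled on the SQuAD v1.1 evaluation (exact match and token-level F1 after normalization), not the code of the `squad` metric itself; the function names are made up for illustration.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    return float(normalize(prediction) == normalize(reference))

def f1(prediction, reference):
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def best_over_references(score_fn, prediction, references):
    """Score the prediction against every acceptable answer and keep the best."""
    return max(score_fn(prediction, ref) for ref in references)

# Acceptable answers of raw_datasets['validation'][2], copied from the output above
references = [
    "Santa Clara, California",
    "Levi's Stadium",
    "Levi's Stadium in the San Francisco Bay Area at Santa Clara, California.",
]
prediction = "Levi's Stadium in the San Francisco Bay Area at Santa Clara, California"

print(best_over_references(exact_match, prediction, references))
print(best_over_references(f1, prediction, references))
```

The prediction only matches the third reference exactly (after normalization), which is enough: taking the maximum means one matching reference gives full credit.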

In [6]:
print(raw_datasets['validation'][2]['context'])
print(raw_datasets['validation'][2]['question'])

Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.
Where did Super Bowl 50 take place?


## Processing the training data

- Supported models: [https://huggingface.co/docs/transformers/index#supported-models-and-frameworks](https://huggingface.co/docs/transformers/index#supported-models-and-frameworks)

In [13]:
from transformers import AutoTokenizer

model_checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

Many models have a pure-Python tokenizer (called “**slow**”) and a “**fast**” tokenizer backed by the 🤗 Tokenizers library. The offset mapping used below requires a fast tokenizer.

In [14]:
tokenizer.is_fast

True

- `BERT` input sentence format
    ```
    [CLS] question [SEP] context [SEP]
    ```

In [82]:
context  = raw_datasets['train'][0]['context']
question = raw_datasets['train'][0]['question']
answer   = raw_datasets['train'][0]['answers']

Task image
![](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter7/qa_labels.svg)

In [18]:
inputs = tokenizer(
    question, context,
    max_length=100,  # maximum number of tokens per feature (question + context chunk + special tokens)
    truncation='only_second',
    stride=50,  # number of context tokens shared by consecutive chunks
    return_overflowing_tokens=True,
    return_offsets_mapping=True
)

In [28]:
question

'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?'

In [29]:
context

'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.'

In [85]:
answer

{'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}

In [19]:
inputs.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'offset_mapping', 'overflow_to_sample_mapping'])

In [60]:
n_sents = len(inputs['input_ids'])
assert n_sents == len(inputs['overflow_to_sample_mapping'])
n_sents

4

In [73]:
import pandas as pd
pd.set_option("display.max_rows", 100)
pd.set_option("display.max_columns", 100)

for idx_sent in range(n_sents):
    print(f"- sentence {idx_sent}:", tokenizer.decode(inputs['input_ids'][idx_sent]))
    print("- length:", len(inputs['input_ids'][idx_sent]))
    
    df = pd.DataFrame({
        'token': [tokenizer.decode(token_id) for token_id in inputs['input_ids'][idx_sent]],
        'token_id': inputs['input_ids'][idx_sent],
        'token_type_id': inputs['token_type_ids'][idx_sent],
        'attention_mask': inputs['attention_mask'][idx_sent],
        'offset_mapping': inputs['offset_mapping'][idx_sent]
    })
    display(df.T)

- sentence 0: [CLS] To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? [SEP] Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend " Venite Ad Me Omnes ". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basi [SEP]
- length: 100


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99
token,[CLS],To,whom,did,the,Virgin,Mary,allegedly,appear,in,1858,in,Lou,##rdes,France,?,[SEP],Architectural,##ly,",",the,school,has,a,Catholic,character,.,At,##op,the,Main,Building,',s,gold,dome,is,a,golden,statue,of,the,Virgin,Mary,.,Immediately,in,front,of,the,Main,Building,and,facing,it,",",is,a,copper,statue,of,Christ,with,arms,up,##rai,##sed,with,the,legend,"""",V,##eni,##te,Ad,Me,O,##m,##nes,"""",.,Next,to,the,Main,Building,is,the,Basilica,of,the,Sacred,Heart,.,Immediately,behind,the,b,##asi,[SEP]
token_id,101,1706,2292,1225,1103,6567,2090,9273,2845,1107,8109,1107,10111,20500,1699,136,102,22182,1193,117,1103,1278,1144,170,2336,1959,119,1335,4184,1103,4304,4334,112,188,2284,10945,1110,170,5404,5921,1104,1103,6567,2090,119,13301,1107,1524,1104,1103,4304,4334,1105,4749,1122,117,1110,170,7335,5921,1104,4028,1114,1739,1146,14089,5591,1114,1103,7051,107,159,21462,1566,24930,2508,152,1306,3965,107,119,5893,1106,1103,4304,4334,1110,1103,19349,1104,1103,11373,4641,119,13301,1481,1103,171,17506,102
token_type_id,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
attention_mask,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
offset_mapping,"(0, 0)","(0, 2)","(3, 7)","(8, 11)","(12, 15)","(16, 22)","(23, 27)","(28, 37)","(38, 44)","(45, 47)","(48, 52)","(53, 55)","(56, 59)","(59, 63)","(64, 70)","(70, 71)","(0, 0)","(0, 13)","(13, 15)","(15, 16)","(17, 20)","(21, 27)","(28, 31)","(32, 33)","(34, 42)","(43, 52)","(52, 53)","(54, 56)","(56, 58)","(59, 62)","(63, 67)","(68, 76)","(76, 77)","(77, 78)","(79, 83)","(84, 88)","(89, 91)","(92, 93)","(94, 100)","(101, 107)","(108, 110)","(111, 114)","(115, 121)","(122, 126)","(126, 127)","(128, 139)","(140, 142)","(143, 148)","(149, 151)","(152, 155)","(156, 160)","(161, 169)","(170, 173)","(174, 180)","(181, 183)","(183, 184)","(185, 187)","(188, 189)","(190, 196)","(197, 203)","(204, 206)","(207, 213)","(214, 218)","(219, 223)","(224, 226)","(226, 229)","(229, 232)","(233, 237)","(238, 241)","(242, 248)","(249, 250)","(250, 251)","(251, 254)","(254, 256)","(257, 259)","(260, 262)","(263, 264)","(264, 265)","(265, 268)","(268, 269)","(269, 270)","(271, 275)","(276, 278)","(279, 282)","(283, 287)","(288, 296)","(297, 299)","(300, 303)","(304, 312)","(313, 315)","(316, 319)","(320, 326)","(327, 332)","(332, 333)","(334, 345)","(346, 352)","(353, 356)","(357, 358)","(358, 361)","(0, 0)"


- sentence 1: [CLS] To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? [SEP] the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend " Venite Ad Me Omnes ". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin [SEP]
- length: 100


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99
token,[CLS],To,whom,did,the,Virgin,Mary,allegedly,appear,in,1858,in,Lou,##rdes,France,?,[SEP],the,Main,Building,and,facing,it,",",is,a,copper,statue,of,Christ,with,arms,up,##rai,##sed,with,the,legend,"""",V,##eni,##te,Ad,Me,O,##m,##nes,"""",.,Next,to,the,Main,Building,is,the,Basilica,of,the,Sacred,Heart,.,Immediately,behind,the,b,##asi,##lica,is,the,G,##rot,##to,",",a,Marian,place,of,prayer,and,reflection,.,It,is,a,replica,of,the,g,##rot,##to,at,Lou,##rdes,",",France,where,the,Virgin,[SEP]
token_id,101,1706,2292,1225,1103,6567,2090,9273,2845,1107,8109,1107,10111,20500,1699,136,102,1103,4304,4334,1105,4749,1122,117,1110,170,7335,5921,1104,4028,1114,1739,1146,14089,5591,1114,1103,7051,107,159,21462,1566,24930,2508,152,1306,3965,107,119,5893,1106,1103,4304,4334,1110,1103,19349,1104,1103,11373,4641,119,13301,1481,1103,171,17506,9538,1110,1103,144,10595,2430,117,170,14789,1282,1104,8070,1105,9284,119,1135,1110,170,16498,1104,1103,176,10595,2430,1120,10111,20500,117,1699,1187,1103,6567,102
token_type_id,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
attention_mask,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
offset_mapping,"(0, 0)","(0, 2)","(3, 7)","(8, 11)","(12, 15)","(16, 22)","(23, 27)","(28, 37)","(38, 44)","(45, 47)","(48, 52)","(53, 55)","(56, 59)","(59, 63)","(64, 70)","(70, 71)","(0, 0)","(152, 155)","(156, 160)","(161, 169)","(170, 173)","(174, 180)","(181, 183)","(183, 184)","(185, 187)","(188, 189)","(190, 196)","(197, 203)","(204, 206)","(207, 213)","(214, 218)","(219, 223)","(224, 226)","(226, 229)","(229, 232)","(233, 237)","(238, 241)","(242, 248)","(249, 250)","(250, 251)","(251, 254)","(254, 256)","(257, 259)","(260, 262)","(263, 264)","(264, 265)","(265, 268)","(268, 269)","(269, 270)","(271, 275)","(276, 278)","(279, 282)","(283, 287)","(288, 296)","(297, 299)","(300, 303)","(304, 312)","(313, 315)","(316, 319)","(320, 326)","(327, 332)","(332, 333)","(334, 345)","(346, 352)","(353, 356)","(357, 358)","(358, 361)","(361, 365)","(366, 368)","(369, 372)","(373, 374)","(374, 377)","(377, 379)","(379, 380)","(381, 382)","(383, 389)","(390, 395)","(396, 398)","(399, 405)","(406, 409)","(410, 420)","(420, 421)","(422, 424)","(425, 427)","(428, 429)","(430, 437)","(438, 440)","(441, 444)","(445, 446)","(446, 449)","(449, 451)","(452, 454)","(455, 458)","(458, 462)","(462, 463)","(464, 470)","(471, 476)","(477, 480)","(481, 487)","(0, 0)"


- sentence 2: [CLS] To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? [SEP] Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive ( and in a direct line that connects through 3 [SEP]
- length: 100


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99
token,[CLS],To,whom,did,the,Virgin,Mary,allegedly,appear,in,1858,in,Lou,##rdes,France,?,[SEP],Next,to,the,Main,Building,is,the,Basilica,of,the,Sacred,Heart,.,Immediately,behind,the,b,##asi,##lica,is,the,G,##rot,##to,",",a,Marian,place,of,prayer,and,reflection,.,It,is,a,replica,of,the,g,##rot,##to,at,Lou,##rdes,",",France,where,the,Virgin,Mary,reputed,##ly,appeared,to,Saint,Bern,##ade,##tte,So,##ubi,##rous,in,1858,.,At,the,end,of,the,main,drive,(,and,in,a,direct,line,that,connects,through,3,[SEP]
token_id,101,1706,2292,1225,1103,6567,2090,9273,2845,1107,8109,1107,10111,20500,1699,136,102,5893,1106,1103,4304,4334,1110,1103,19349,1104,1103,11373,4641,119,13301,1481,1103,171,17506,9538,1110,1103,144,10595,2430,117,170,14789,1282,1104,8070,1105,9284,119,1135,1110,170,16498,1104,1103,176,10595,2430,1120,10111,20500,117,1699,1187,1103,6567,2090,25153,1193,1691,1106,2216,17666,6397,3786,1573,25422,13149,1107,8109,119,1335,1103,1322,1104,1103,1514,2797,113,1105,1107,170,2904,1413,1115,8200,1194,124,102
token_type_id,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
attention_mask,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
offset_mapping,"(0, 0)","(0, 2)","(3, 7)","(8, 11)","(12, 15)","(16, 22)","(23, 27)","(28, 37)","(38, 44)","(45, 47)","(48, 52)","(53, 55)","(56, 59)","(59, 63)","(64, 70)","(70, 71)","(0, 0)","(271, 275)","(276, 278)","(279, 282)","(283, 287)","(288, 296)","(297, 299)","(300, 303)","(304, 312)","(313, 315)","(316, 319)","(320, 326)","(327, 332)","(332, 333)","(334, 345)","(346, 352)","(353, 356)","(357, 358)","(358, 361)","(361, 365)","(366, 368)","(369, 372)","(373, 374)","(374, 377)","(377, 379)","(379, 380)","(381, 382)","(383, 389)","(390, 395)","(396, 398)","(399, 405)","(406, 409)","(410, 420)","(420, 421)","(422, 424)","(425, 427)","(428, 429)","(430, 437)","(438, 440)","(441, 444)","(445, 446)","(446, 449)","(449, 451)","(452, 454)","(455, 458)","(458, 462)","(462, 463)","(464, 470)","(471, 476)","(477, 480)","(481, 487)","(488, 492)","(493, 500)","(500, 502)","(503, 511)","(512, 514)","(515, 520)","(521, 525)","(525, 528)","(528, 531)","(532, 534)","(534, 537)","(537, 541)","(542, 544)","(545, 549)","(549, 550)","(551, 553)","(554, 557)","(558, 561)","(562, 564)","(565, 568)","(569, 573)","(574, 579)","(580, 581)","(581, 584)","(585, 587)","(588, 589)","(590, 596)","(597, 601)","(602, 606)","(607, 615)","(616, 623)","(624, 625)","(0, 0)"


- sentence 3: [CLS] To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? [SEP]. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive ( and in a direct line that connects through 3 statues and the Gold Dome ), is a simple, modern stone statue of Mary. [SEP]
- length: 85


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84
token,[CLS],To,whom,did,the,Virgin,Mary,allegedly,appear,in,1858,in,Lou,##rdes,France,?,[SEP],.,It,is,a,replica,of,the,g,##rot,##to,at,Lou,##rdes,",",France,where,the,Virgin,Mary,reputed,##ly,appeared,to,Saint,Bern,##ade,##tte,So,##ubi,##rous,in,1858,.,At,the,end,of,the,main,drive,(,and,in,a,direct,line,that,connects,through,3,statues,and,the,Gold,Dome,),",",is,a,simple,",",modern,stone,statue,of,Mary,.,[SEP]
token_id,101,1706,2292,1225,1103,6567,2090,9273,2845,1107,8109,1107,10111,20500,1699,136,102,119,1135,1110,170,16498,1104,1103,176,10595,2430,1120,10111,20500,117,1699,1187,1103,6567,2090,25153,1193,1691,1106,2216,17666,6397,3786,1573,25422,13149,1107,8109,119,1335,1103,1322,1104,1103,1514,2797,113,1105,1107,170,2904,1413,1115,8200,1194,124,11739,1105,1103,3487,17917,114,117,1110,170,3014,117,2030,2576,5921,1104,2090,119,102
token_type_id,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
attention_mask,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
offset_mapping,"(0, 0)","(0, 2)","(3, 7)","(8, 11)","(12, 15)","(16, 22)","(23, 27)","(28, 37)","(38, 44)","(45, 47)","(48, 52)","(53, 55)","(56, 59)","(59, 63)","(64, 70)","(70, 71)","(0, 0)","(420, 421)","(422, 424)","(425, 427)","(428, 429)","(430, 437)","(438, 440)","(441, 444)","(445, 446)","(446, 449)","(449, 451)","(452, 454)","(455, 458)","(458, 462)","(462, 463)","(464, 470)","(471, 476)","(477, 480)","(481, 487)","(488, 492)","(493, 500)","(500, 502)","(503, 511)","(512, 514)","(515, 520)","(521, 525)","(525, 528)","(528, 531)","(532, 534)","(534, 537)","(537, 541)","(542, 544)","(545, 549)","(549, 550)","(551, 553)","(554, 557)","(558, 561)","(562, 564)","(565, 568)","(569, 573)","(574, 579)","(580, 581)","(581, 584)","(585, 587)","(588, 589)","(590, 596)","(597, 601)","(602, 606)","(607, 615)","(616, 623)","(624, 625)","(626, 633)","(634, 637)","(638, 641)","(642, 646)","(647, 651)","(651, 652)","(652, 653)","(654, 656)","(657, 658)","(659, 665)","(665, 666)","(667, 673)","(674, 679)","(680, 686)","(687, 689)","(690, 694)","(694, 695)","(0, 0)"


1. `max_length=100`
    - The maximum number of tokens per feature (question + context chunk + special tokens).
2. `stride`
    - The number of context tokens shared by two consecutive chunks.
3. `return_overflowing_tokens`
    - Tells the tokenizer to return the overflowing tokens as additional features instead of discarding them.
4. `offset_mapping` (returned because of `return_offsets_mapping=True`)
    - `(0, 0)`: special token (`[CLS]`, `[SEP]`)
    - `(s, e)`: the token covers `question[s:e]` or `context[s:e]`
5. `token_type_ids`
    - `0` for the question segment (`[CLS]`, question, first `[SEP]`), `1` for the context segment (context, final `[SEP]`)
    - **Since those do not necessarily exist for all models (DistilBERT does not require them, for instance), we’ll instead use the `sequence_ids()` method**
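
How `max_length` and `stride` interact can be approximated with a small sliding-window calculation. This is a simplified model of the windowing, not the tokenizer's actual implementation; `chunk_starts` is a hypothetical helper. The numbers are read off the output above: 18 of the 100 tokens are taken by the question and special tokens, which leaves 82 context tokens per feature, and the context has 163 tokens in total.

```python
def chunk_starts(n_context_tokens, window, stride):
    """Start index (in context tokens) of each chunk.

    window: context tokens that fit in one feature
    stride: context tokens shared by two consecutive chunks
    """
    if stride >= window:
        raise ValueError("stride must be smaller than the window")
    step = window - stride  # how far the window moves each time
    starts = []
    start = 0
    while True:
        starts.append(start)
        if start + window >= n_context_tokens:  # this chunk reaches the end
            break
        start += step
    return starts

# 100 (max_length) - 18 (question + special tokens) = 82 context tokens per feature
print(chunk_starts(n_context_tokens=163, window=82, stride=50))  # [0, 32, 64, 96]
```

The four start positions correspond to the four features shown above, whose context parts begin at character offsets 0, 152, 271 and 420.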

In [93]:
inputs = tokenizer(
    raw_datasets["train"][2:6]["question"],
    raw_datasets["train"][2:6]["context"],
    max_length=100,
    truncation="only_second",
    stride=50,
    return_overflowing_tokens=True,
    return_offsets_mapping=True
)
answers = raw_datasets['train'][2:6]['answers']

print(f"The 4 examples gave {len(inputs['input_ids'])} features.")
print(f"Here is where each comes from: {inputs['overflow_to_sample_mapping']}.")

The 4 examples gave 19 features.
Here is where each comes from: [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3].
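
To see which features belong to which example, the mapping can be inverted. A minimal sketch using the list printed above (the variable names are illustrative):

```python
from collections import defaultdict

# Copied from the output above: feature index -> index of the example it came from
overflow_to_sample_mapping = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3]

features_per_sample = defaultdict(list)
for feature_idx, sample_idx in enumerate(overflow_to_sample_mapping):
    features_per_sample[sample_idx].append(feature_idx)

for sample_idx, feature_indices in features_per_sample.items():
    print(sample_idx, feature_indices)
```

The first three examples were each split into 4 features and the last one into 7.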


In [94]:
answers

[{'text': ['the Main Building'], 'answer_start': [279]},
 {'text': ['a Marian place of prayer and reflection'], 'answer_start': [381]},
 {'text': ['a golden statue of the Virgin Mary'], 'answer_start': [92]},
 {'text': ['September 1876'], 'answer_start': [248]}]

- Generate `(start_position, end_position)` tuple labels

In [131]:
start_positions = []
end_positions   = []

for idx_feat, offset in enumerate(inputs['offset_mapping']):
    idx_sample = inputs['overflow_to_sample_mapping'][idx_feat]  # idx of answer
    answer     = answers[idx_sample]
    start_char = answer['answer_start'][0]  # offset
    end_char   = start_char + len(answer['text'][0])
    sequence_ids = inputs.sequence_ids(idx_feat)  # None: [CLS] or [SEP], 0: question, 1: context 
    
    # Find the start and end of the context
    idx = 0
    while sequence_ids[idx] != 1:
        idx += 1
    context_start = idx  # sequence_ids[context_start] == 1
    
    while sequence_ids[idx] == 1:
        idx += 1
    context_end = idx - 1    # sequence_ids[context_end] == 1
    
    # If the answer is fully inside this chunk's context, label it with token positions; otherwise use (0, 0)
    if offset[context_start][0] <= start_char and end_char <= offset[context_end][1]:
        idx = context_start
        while offset[idx][0] < start_char:
            idx += 1
        start_positions.append(idx)

        while offset[idx][1] < end_char:
            idx += 1
        end_positions.append(idx)
    else:
        start_positions.append(0)
        end_positions.append(0)

In [132]:
start_positions, end_positions

([83, 51, 19, 0, 0, 64, 27, 0, 34, 0, 0, 0, 67, 34, 0, 0, 0, 0, 0],
 [85, 53, 21, 0, 0, 70, 33, 0, 40, 0, 0, 0, 68, 35, 0, 0, 0, 0, 0])

- Check a feature labeled `(83, 85)`: decoding the labeled tokens should give back the answer

In [133]:
idx = 0
sample_idx = inputs['overflow_to_sample_mapping'][idx]
answer = answers[sample_idx]['text'][0]

start, end = start_positions[idx], end_positions[idx]
labeled_answer = tokenizer.decode(inputs['input_ids'][idx][start:end+1])

print(f"Theoretical answer: {answer}, labels give: {labeled_answer}")

Theoretical answer: the Main Building, labels give: the Main Building


- Check a feature labeled `(0, 0)`: the answer is not fully inside this chunk

In [134]:
idx = 4
sample_idx = inputs["overflow_to_sample_mapping"][idx]
answer = answers[sample_idx]["text"][0]

decoded_example = tokenizer.decode(inputs["input_ids"][idx])
print(f"Theoretical answer: {answer}, decoded example: {decoded_example}")

Theoretical answer: a Marian place of prayer and reflection, decoded example: [CLS] What is the Grotto at Notre Dame? [SEP] Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend " Venite Ad Me Omnes ". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grot [SEP]
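
The labeling rule can be isolated in a small function and tried on hand-written data. This is a sketch under assumptions: the offsets and sequence ids below are invented for a toy context ("Paris is in France"), and `char_span_to_token_span` is a hypothetical helper, not part of 🤗 Transformers.

```python
def char_span_to_token_span(offsets, sequence_ids, start_char, end_char):
    """Map a character span of the context to (start_token, end_token).

    Returns (0, 0) when the answer is not fully inside this chunk's context.
    """
    context_indices = [i for i, s in enumerate(sequence_ids) if s == 1]
    context_start, context_end = context_indices[0], context_indices[-1]

    if offsets[context_start][0] > start_char or offsets[context_end][1] < end_char:
        return 0, 0

    # Last context token that starts at or before the answer start
    start_token = context_start
    while start_token < context_end and offsets[start_token + 1][0] <= start_char:
        start_token += 1

    # First context token that ends at or after the answer end
    end_token = context_end
    while end_token > context_start and offsets[end_token - 1][1] >= end_char:
        end_token -= 1

    return start_token, end_token

# Toy feature: [CLS] question [SEP] Paris is in France [SEP]
offsets = [(0, 0), (0, 5), (0, 0), (0, 5), (6, 8), (9, 11), (12, 18), (0, 0)]
sequence_ids = [None, 0, None, 1, 1, 1, 1, None]

print(char_span_to_token_span(offsets, sequence_ids, 12, 18))  # "France"
print(char_span_to_token_span(offsets, sequence_ids, 6, 11))   # "is in"

# A chunk that stops after "is" cannot contain "France"
print(char_span_to_token_span(offsets[:5] + [(0, 0)], [None, 0, None, 1, 1, None], 12, 18))
```

Bounding both loops by the context limits keeps the indices inside the context even when the answer boundary falls in the middle of a token.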


In [137]:
max_length = 384
stride = 128

def preprocess_training_examples(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    sample_map = inputs.pop("overflow_to_sample_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        sample_idx = sample_map[i]
        answer = answers[sample_idx]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label is (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

In [139]:
train_dataset = raw_datasets['train'].map(
    preprocess_training_examples,
    batched=True,
    remove_columns=raw_datasets['train'].column_names
)
len(raw_datasets['train']), len(train_dataset)

(87599, 88729)

## Processing the validation data

In [149]:
def preprocess_validation_examples(examples):
    questions = [q.strip() for q in examples['question']]
    contexts  = examples['context']
    
    inputs = tokenizer(
        questions,
        contexts,
        max_length=max_length,
        truncation='only_second',
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding='max_length'
    )
    
    sample_map  = inputs.pop('overflow_to_sample_mapping')
    example_ids = []
    
    for i in range(len(inputs['input_ids'])):
        sample_idx = sample_map[i]
        example_ids.append(examples['id'][sample_idx])
        
        sequence_ids = inputs.sequence_ids(i)  # None: [CLS] or [SEP], 0: question, 1: context
        offset = inputs['offset_mapping'][i]
        inputs['offset_mapping'][i] = [o if sequence_ids[k] == 1 else None for k, o in enumerate(offset)]  # use only context offset
        
    inputs['example_id'] = example_ids
    return inputs
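
Why the non-context offsets are replaced by `None`: question offsets index into the question string, so they could be mistaken for context offsets during post-processing. A toy illustration with invented offsets:

```python
# Toy feature: [CLS] question-token [SEP] context-token context-token [SEP]
offsets = [(0, 0), (0, 4), (0, 0), (0, 5), (6, 8), (0, 0)]
sequence_ids = [None, 0, None, 1, 1, None]

# Keep an offset only when the token belongs to the context (sequence id 1)
masked = [o if s == 1 else None for o, s in zip(offsets, sequence_ids)]
print(masked)

# A token is part of the context exactly when its masked offset is not None
context_token_indices = [i for i, o in enumerate(masked) if o is not None]
print(context_token_indices)
```

After masking, a simple `is None` check is enough to reject any predicted span that starts or ends outside the context.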

In [151]:
validation_dataset = raw_datasets['validation'].map(
    preprocess_validation_examples,
    batched=True,
    remove_columns=raw_datasets['validation'].column_names
)

len(raw_datasets['validation']), len(validation_dataset)

(10570, 10822)

# Fine-tuning the model with the `Trainer` API

## Post-processing

1. The hardest thing will be to write the `compute_metrics()` function
2. Since we **padded** all the samples to the maximum length we set, there is **no data collator** to define
3. Post-processing for **Question answering**
    - We masked the start and end logits corresponding to tokens outside of the context.
    - We then converted the start and end logits into probabilities using a softmax.
    - We attributed a score to each (start_token, end_token) pair by taking the product of the corresponding two probabilities.
    - We looked for the pair with the maximum score that yielded a valid answer (e.g., a start_token lower than end_token).
4. Faster version
    - We don’t need to compute actual scores (just the predicted answer) (skip the softmax step)
    - We also won’t score all the possible (start_token, end_token) pairs, but only the ones corresponding to the highest n_best logits (with n_best=20)
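
Skipping the softmax is safe when only the best span is needed: the product of the two probabilities is proportional to `exp(start_logit + end_logit)`, so ranking by the sum of logits gives the same winner. A small sketch with made-up logits (all names and values here are illustrative):

```python
import math
from itertools import product

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def best_span(start_scores, end_scores, combine):
    """Best (start, end) pair with start <= end under the given scoring rule."""
    n = len(start_scores)
    candidates = [(s, e) for s, e in product(range(n), repeat=2) if s <= e]
    return max(candidates, key=lambda se: combine(start_scores[se[0]], end_scores[se[1]]))

# Made-up logits for a 4-token context
start_logits = [0.1, 2.3, -1.0, 0.5]
end_logits = [-0.5, 0.2, 3.1, 1.0]

by_probability = best_span(softmax(start_logits), softmax(end_logits), lambda p, q: p * q)
by_logit_sum = best_span(start_logits, end_logits, lambda a, b: a + b)
print(by_probability, by_logit_sum)
```

Both scoring rules select the same span, which is why the faster version works directly with logits.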

- Sample model: a DistilBERT checkpoint already fine-tuned on SQuAD

In [155]:
small_eval_set = raw_datasets['validation'].select(range(100))
trained_checkpoint = 'distilbert-base-cased-distilled-squad'

tokenizer = AutoTokenizer.from_pretrained(trained_checkpoint)
eval_set = small_eval_set.map(
    preprocess_validation_examples,
    batched=True,
    remove_columns=raw_datasets['validation'].column_names
)

In [160]:
eval_set_for_model = eval_set.remove_columns(['example_id', 'offset_mapping'])
eval_set_for_model.set_format('torch')
eval_set_for_model

Dataset({
    features: ['input_ids', 'attention_mask'],
    num_rows: 100
})

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [165]:
import torch
from transformers import AutoModelForQuestionAnswering

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  # fall back to CPU when no GPU is available
batch = {k: eval_set_for_model[k].to(device) for k in eval_set_for_model.column_names}
trained_model = AutoModelForQuestionAnswering.from_pretrained(trained_checkpoint).to(device)

with torch.no_grad():
    outputs = trained_model(**batch)

In [171]:
start_logits = outputs.start_logits.cpu().numpy()
end_logits   = outputs.end_logits.cpu().numpy()

In [174]:
import collections

example_to_features = collections.defaultdict(list)
for idx, feature in enumerate(eval_set):
    example_to_features[feature['example_id']].append(idx)

In [179]:
import numpy as np

n_best = 20
max_answer_length = 30
predicted_answers = []

for example in small_eval_set:
    example_id = example['id']
    context = example['context']
    answers = []
    
    for feature_index in example_to_features[example_id]:
        start_logit = start_logits[feature_index]
        end_logit   = end_logits[feature_index]
        offsets     = eval_set['offset_mapping'][feature_index]
        
        start_indexes = np.argsort(start_logit)[-1 : -n_best-1 : -1].tolist()  # indices of the n_best highest start logits, best first
        end_indexes   = np.argsort(end_logit)[-1 : -n_best-1 : -1].tolist()  # indices of the n_best highest end logits, best first
        
        for start_index in start_indexes:
            for end_index in end_indexes:
                # Skip answers that are not fully in the context
                if offsets[start_index] is None or offsets[end_index] is None:
                    continue
                if (start_index > end_index) or (end_index - start_index + 1 > max_answer_length):
                    continue
                answers.append({
                    'text': context[offsets[start_index][0] : offsets[end_index][1]],
                    'logit_score': start_logit[start_index] + end_logit[end_index]
                })
    
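    # Keep the highest-scoring span; fall back to an empty answer if no valid span was found
    if len(answers) > 0:
        best_answer = max(answers, key=lambda x: x['logit_score'])
        predicted_answers.append({'id': example_id, 'prediction_text': best_answer['text']})
    else:
        predicted_answers.append({'id': example_id, 'prediction_text': ''})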
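
The span search in the cell above can be checked on toy numbers. The following is a minimal pure-Python sketch, not the notebook's code: `top_indexes`, `best_span`, and the toy logits are hypothetical names and values chosen for illustration, and `sorted(...)[:n_best]` stands in for the `np.argsort(...)[-1 : -n_best-1 : -1]` slice.

```python
def top_indexes(logits, n_best):
    """Indices of the n_best largest logits, highest first."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:n_best]

def best_span(start_logit, end_logit, n_best=2, max_answer_length=3):
    """Return (start, end, score) for the best valid span, or None if there is none."""
    best = None
    for s in top_indexes(start_logit, n_best):
        for e in top_indexes(end_logit, n_best):
            # Reject spans that end before they start or that are too long
            if s > e or e - s + 1 > max_answer_length:
                continue
            score = start_logit[s] + end_logit[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best

# Hypothetical logits for a five-token feature
toy_start = [0.1, 2.0, 0.3, 1.5, 0.0]
toy_end   = [0.2, 0.1, 1.0, 3.0, 0.5]
toy_best  = best_span(toy_start, toy_end)
print(toy_best)
```

Only `n_best * n_best` pairs are scored per feature, so the cost stays small even for long sequences. The sketch skips the offset-mapping check because the toy logits have no question or padding tokens.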

In [180]:
predicted_answers

[{'id': '56be4db0acb8001400a502ec', 'prediction_text': 'Denver Broncos'},
 {'id': '56be4db0acb8001400a502ed', 'prediction_text': 'Carolina Panthers'},
 {'id': '56be4db0acb8001400a502ee',
  'prediction_text': "Levi's Stadium in the San Francisco Bay Area at Santa Clara, California"},
 {'id': '56be4db0acb8001400a502ef', 'prediction_text': 'Carolina Panthers'},
 {'id': '56be4db0acb8001400a502f0', 'prediction_text': 'gold'},
 {'id': '56be8e613aeaaa14008c90d1', 'prediction_text': 'golden anniversary'},
 {'id': '56be8e613aeaaa14008c90d2', 'prediction_text': 'February 7, 2016'},
 {'id': '56be8e613aeaaa14008c90d3',
  'prediction_text': 'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference'},
 {'id': '56bea9923aeaaa14008c91b9', 'prediction_text': 'golden anniversary'},
 {'id': '56bea9923aeaaa14008c91ba',
  'prediction_text': 'American Football Conference'},
 {'id': '56bea9923aeaaa14008c9

In [182]:
import evaluate

metric = evaluate.load('squad')
theoretical_answers = [{'id': ex['id'], 'answers': ex['answers']} for ex in small_eval_set]

In [184]:
print(predicted_answers[0])
print(theoretical_answers[0])

{'id': '56be4db0acb8001400a502ec', 'prediction_text': 'Denver Broncos'}
{'id': '56be4db0acb8001400a502ec', 'answers': {'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos'], 'answer_start': [177, 177, 177]}}


In [185]:
metric.compute(predictions=predicted_answers, references=theoretical_answers)

{'exact_match': 83.0, 'f1': 88.25000000000004}
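
For intuition about the two numbers, here is a simplified sketch of how SQuAD-style exact match and token-level F1 are computed for a single prediction/reference pair. It is an illustration under assumptions, not the code behind `evaluate.load('squad')`: the helper names are hypothetical, and the real metric additionally takes the best score over all reference answers and averages over examples, reporting percentages.

```python
import collections
import re
import string

PUNCTUATION = set(string.punctuation)

def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))

def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    common = collections.Counter(pred_tokens) & collections.Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

toy_em = exact_match("the Denver Broncos", "Denver Broncos")
toy_f1 = token_f1("Carolina Panthers team", "Carolina Panthers")
print(toy_em, toy_f1)
```

F1 gives partial credit for overlapping tokens, which is why it is higher than exact match in the result above.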

In [212]:
from tqdm.auto import tqdm

def compute_metrics(start_logits, end_logits, features, examples):
    example_to_features = collections.defaultdict(list)
    for idx, feature in enumerate(features):
        example_to_features[feature['example_id']].append(idx)
    
    predicted_answers = []
    for example in tqdm(examples):
        example_id = example['id']
        context = example['context']
        answers = []
        
        # Loop through all features associated with that example
        for feature_index in example_to_features[example_id]:
            start_logit = start_logits[feature_index]
            end_logit   = end_logits[feature_index]
            offsets     = features[feature_index]['offset_mapping']
            
            start_indexes = np.argsort(start_logit)[-1 : -n_best-1 : -1].tolist()  # n_best highest start logits, best first
            end_indexes   = np.argsort(end_logit)[-1 : -n_best-1 : -1].tolist()  # n_best highest end logits, best first

            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Skip answers that are not fully in the context
                    if offsets[start_index] is None or offsets[end_index] is None:
                        continue
                    if (start_index > end_index) or (end_index - start_index + 1 > max_answer_length):
                        continue
                    answers.append({
                        'text': context[offsets[start_index][0] : offsets[end_index][1]],
                        'logit_score': start_logit[start_index] + end_logit[end_index]
                    })
        
        # Select the answer with the best score
        if len(answers) > 0:
            best_answer = max(answers, key=lambda x: x['logit_score'])
            predicted_answers.append({'id': example_id, 'prediction_text': best_answer['text']})
        else:
            predicted_answers.append({'id': example_id, 'prediction_text': ''})
    
    theoretical_answers = [{'id': ex['id'], 'answers': ex['answers']} for ex in examples]    
    return metric.compute(predictions=predicted_answers, references=theoretical_answers)

In [192]:
compute_metrics(start_logits, end_logits, eval_set, small_eval_set)

{'exact_match': 83.0, 'f1': 88.25000000000004}

## Fine-tuning the model

In [197]:
model_checkpoint

'bert-base-cased'

In [193]:
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



In [200]:
from huggingface_hub import notebook_login

notebook_login()


In [203]:
from transformers import TrainingArguments

args = TrainingArguments(
    'bert-finetuned-squad',
    evaluation_strategy='no',
    save_strategy='epoch',
    learning_rate=2e-5,
    num_train_epochs=1,
    weight_decay=1e-2,
    fp16=True,
    push_to_hub=True
)

In [204]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    tokenizer=tokenizer
)
trainer.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
500,2.5967
1000,1.6626
1500,1.4594
2000,1.3826
2500,1.3274
3000,1.2913
3500,1.2264
4000,1.1729
4500,1.1251
5000,1.1515


Checkpoint destination directory bert-finetuned-squad/checkpoint-11092 already exists and is non-empty.Saving will proceed but saved results may be invalid.


TrainOutput(global_step=11092, training_loss=1.2305639784902689, metrics={'train_runtime': 1028.6571, 'train_samples_per_second': 86.257, 'train_steps_per_second': 10.783, 'total_flos': 1.7388449946321408e+16, 'train_loss': 1.2305639784902689, 'epoch': 1.0})

In [213]:
(start_logits, end_logits), *_ = trainer.predict(validation_dataset)
compute_metrics(start_logits, end_logits, validation_dataset, raw_datasets['validation'])

{'exact_match': 80.21759697256385, 'f1': 87.63448875058538}

In [214]:
trainer.push_to_hub(commit_message="Training complete")

'https://huggingface.co/Alchem/bert-finetuned-squad/tree/main/'

# A custom training loop

## Preparing everything for training

In [216]:
from torch.utils.data import DataLoader
from transformers import default_data_collator

train_dataset.set_format('torch')
validation_set = validation_dataset.remove_columns(['example_id', 'offset_mapping'])
validation_set.set_format('torch')

train_dataloader = DataLoader(
    train_dataset,
    shuffle=True,
    collate_fn=default_data_collator,
    batch_size=8
)
eval_dataloader = DataLoader(
    validation_set,
    collate_fn=default_data_collator,
    batch_size=8
)

In [218]:
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [219]:
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=2e-5)

In [221]:
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision='fp16')
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(model, optimizer, train_dataloader, eval_dataloader)

In [224]:
from transformers import get_scheduler

num_train_epochs = 1
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    'linear',
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)
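
With `num_warmup_steps=0`, the `'linear'` schedule decays the learning rate from its initial value to zero over `num_training_steps`. The following is a minimal sketch of that multiplier on toy numbers; `linear_lr` is a hypothetical helper written for illustration, not a function from `transformers`.

```python
def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate at a given step under linear warmup followed by linear decay."""
    if step < num_warmup_steps:
        # Ramp up linearly from 0 to base_lr during warmup
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_steps = max(1, num_training_steps - num_warmup_steps)
    # Decay linearly to 0, never going negative
    return base_lr * max(0.0, remaining / decay_steps)

# Toy run: 4 training steps, no warmup, same base learning rate as the optimizer above
toy_schedule = [linear_lr(step, 2e-5, 4) for step in range(5)]
print(toy_schedule)
```

Because the decay length depends on `num_training_steps`, the scheduler has to be created after the dataloader has been prepared, so that `len(train_dataloader)` reflects the per-process batch count.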

In [225]:
from huggingface_hub import Repository, get_full_repo_name

model_name = 'bert-finetuned-squad-accelerate'
repo_name  = get_full_repo_name(model_name)
repo_name

'Alchem/bert-finetuned-squad-accelerate'

In [226]:
output_dir = 'bert-finetuned-squad-accelerate'
repo = Repository(output_dir, clone_from=repo_name)

For more details, please read https://huggingface.co/docs/huggingface_hub/concepts/git_vs_http.


OSError: Looks like you do not have git-lfs installed, please install. You can install from https://git-lfs.github.com/. Then run `git lfs install` (you only have to do this once).

## Training loop

In [None]:
from tqdm.auto import tqdm
import torch

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    for step, batch in enumerate(train_dataloader):
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)
        
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
    
    # Evaluation
    model.eval()
    start_logits = []
    end_logits   = []
    accelerator.print("Evaluation!")
    for batch in tqdm(eval_dataloader):
        with torch.no_grad():
            outputs = model(**batch)

        # Gather logits from all processes before moving them to NumPy
        start_logits.append(accelerator.gather(outputs.start_logits).cpu().numpy())
        end_logits.append(accelerator.gather(outputs.end_logits).cpu().numpy())
    
    start_logits = np.concatenate(start_logits)
    end_logits   = np.concatenate(end_logits)
    
    start_logits = start_logits[:len(validation_dataset)]
    end_logits   = end_logits[:len(validation_dataset)]
    
    metrics = compute_metrics(start_logits, end_logits, validation_dataset, raw_datasets['validation'])
    print(f"epoch {epoch}: {metrics}")
    
    # Save and upload
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)
        repo.push_to_hub(
            commit_message=f"Training in progress epoch {epoch}", blocking=False
        )

In [None]:
accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)

# Using the fine-tuned model

In [227]:
from transformers import pipeline

model_checkpoint = "Alchem/bert-finetuned-squad"
question_answerer = pipeline("question-answering", model=model_checkpoint)

context = """
🤗 Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch and TensorFlow — with a seamless integration
between them. It's straightforward to train your models with one before loading them for inference with the other.
"""
question = "Which deep learning libraries back 🤗 Transformers?"
question_answerer(question=question, context=context)

{'score': 0.9340673089027405,
 'start': 78,
 'end': 105,
 'answer': 'Jax, PyTorch and TensorFlow'}