<a href="https://colab.research.google.com/github/claudio-bon/question-answering-squad/blob/main/nlp_squad.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Question Answering on SQuAD dataset
Authors:
- Angely Jazmín Oyola Suárez
- Claudio Bonetta

To perform this task we will use a transformer-based architecture.<br>
The transformer is pre-trained on a generic task and then finetuned on the task at hand.<br>
The transformer implementation is provided by the **HuggingFace** libraries.<br>
Let's start by installing them.

In [None]:
! pip install datasets transformers

Collecting datasets
  Downloading https://files.pythonhosted.org/packages/86/27/9c91ddee87b06d2de12f134c5171a49890427e398389f07f6463485723c3/datasets-1.9.0-py3-none-any.whl (262kB)
Collecting transformers
  Downloading https://files.pythonhosted.org/packages/fd/1a/41c644c963249fd7f3836d926afa1e3f1cc234a1c40d80c5f03ad8f6f1b2/transformers-4.8.2-py3-none-any.whl (2.5MB)
Collecting xxhash
  Downloading https://files.pythonhosted.org/packages/7d/4f/0a862cad26aa2ed7a7cd87178cbbfa824fc1383e472d63596a0d018374e7/xxhash-2.0.2-cp37-cp37m-manylinux2010_x86_64.whl (243kB)
Collecting huggingface-hub<0.1.0
  Downloading https://files.pythonhosted.org/packages/35/03/071adc023c0a7e540cf4652fa9cad13ab32e6ae469bf0cc0262045244812/huggingface_hub-0.0.13-py3-none-any.whl
Collecting fsspec>=2021.05.

## Loading the Dataset

### Dataset Downloading 
The dataset is a JSON file hosted on Google Drive.


In [None]:
!gdown --id "1aURk7-EAowXK-KXy7Ut1Y3z1X18kHv0E"

Downloading...
From: https://drive.google.com/uc?id=1aURk7-EAowXK-KXy7Ut1Y3z1X18kHv0E
To: /content/training_set.json
30.3MB [00:00, 53.6MB/s]


### Dataset Creation

The dataset will be loaded using HuggingFace's loading function.

In [None]:
from datasets import load_dataset

json_file_path = "training_set.json"
ds_original = load_dataset('json', data_files= json_file_path, field='data')

Using custom data configuration default-57c0658af083c8d9


Downloading and preparing dataset json/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/json/default-57c0658af083c8d9/0.0.0/793d004298099bd3c4e61eb7878475bcf1dc212bf2e34437d85126758720d7f9...



Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-57c0658af083c8d9/0.0.0/793d004298099bd3c4e61eb7878475bcf1dc212bf2e34437d85126758720d7f9. Subsequent calls will reuse this data.


HuggingFace's loading function returns a dict-like object called `DatasetDict` that encapsulates the actual dataset.
The loaded dataset is stored under the key "train"; it will later be split according to the project's requirements.

In [None]:
ds_original

DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
})

In [None]:
# Print the 1st row
ds_original['train'][0]

{'paragraphs': [{'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
   'qas': [{'answers': [{'answer_start': 515,
       'text': 'Saint Bernadette Soubirous'}],
     'id': '5733be284776f41900661182',
     'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?'},
    {'answers': [{'answer_start': 188, 'text': 

The assignment requires the dataset to be framed so that each row associates a title with a single context and a single question/answer pair.<br>
This implies replicating the titles (since one title has several contexts) and the contexts (since one context can hold more than one question/answer pair).

In [None]:
def generate_dataset(dataset, test=False):
    """Yield one (id, row) pair for each question of the dataset."""
    for data in dataset["train"]:
        title = data.get("title", "").strip()
        for paragraph in data["paragraphs"]:
            context = paragraph["context"].strip()
            for qa in paragraph["qas"]:
                # Handling questions
                question = qa["question"].strip()
                id_ = qa["id"]
                row = {
                    "title": title,
                    "context": context,
                    "question": question,
                    "id": id_,
                }
                # Answers won't be present in the testing (compute_answers.py)
                if not test:
                    # Handling answers: one entry per answer (empty lists if there is none)
                    row["answers"] = {
                        "answer_start": [answer["answer_start"] for answer in qa["answers"]],
                        "text": [answer["text"].strip() for answer in qa["answers"]],
                    }
                yield id_, row
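
The flattening described above can be illustrated on a tiny hand-made record. This is only a sketch: the record below is invented for the example (it is not taken from the dataset) and the comprehension mirrors the nesting of `generate_dataset` without calling it.

```python
# Hypothetical record with the same nesting as the JSON file:
# one title -> several paragraphs -> several questions/answers.
record = {
    "title": "Toy_Title",
    "paragraphs": [
        {
            "context": "Paris is the capital of France.",
            "qas": [
                {"id": "q1", "question": "What is the capital of France?",
                 "answers": [{"answer_start": 0, "text": "Paris"}]},
                {"id": "q2", "question": "Paris is the capital of which country?",
                 "answers": [{"answer_start": 24, "text": "France"}]},
            ],
        },
    ],
}

# One row per question: title and context are replicated for every question.
rows = [
    {
        "title": record["title"],
        "context": paragraph["context"],
        "question": qa["question"],
        "id": qa["id"],
        "answers": {
            "answer_start": [a["answer_start"] for a in qa["answers"]],
            "text": [a["text"] for a in qa["answers"]],
        },
    }
    for paragraph in record["paragraphs"]
    for qa in paragraph["qas"]
]

for row in rows:
    start = row["answers"]["answer_start"][0]
    text = row["answers"]["text"][0]
    # answer_start is a character offset into the (replicated) context
    print(row["id"], row["context"][start:start + len(text)])
```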

The `generate_dataset` generator is then used to create a `DataFrame` containing the whole dataset framed as described above.

In [None]:
import pandas as pd

# Create a pandas dataframe that contains all the data
df = pd.DataFrame(
    [value[1] for value in generate_dataset(ds_original)]
)

The result is:

In [None]:
from IPython.display import display, HTML

def display_dataframe(df):
    display(HTML(df.to_html()))

In [None]:
display_dataframe(df.head())

Unnamed: 0,title,context,question,id,answers
0,University_of_Notre_Dame,"Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend ""Venite Ad Me Omnes"". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.",To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?,5733be284776f41900661182,"{'answer_start': [515], 'text': ['Saint Bernadette Soubirous']}"
1,University_of_Notre_Dame,"Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend ""Venite Ad Me Omnes"". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.",What is in front of the Notre Dame Main Building?,5733be284776f4190066117f,"{'answer_start': [188], 'text': ['a copper statue of Christ']}"
2,University_of_Notre_Dame,"Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend ""Venite Ad Me Omnes"". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.",The Basilica of the Sacred heart at Notre Dame is beside to which structure?,5733be284776f41900661180,"{'answer_start': [279], 'text': ['the Main Building']}"
3,University_of_Notre_Dame,"Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend ""Venite Ad Me Omnes"". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.",What is the Grotto at Notre Dame?,5733be284776f41900661181,"{'answer_start': [381], 'text': ['a Marian place of prayer and reflection']}"
4,University_of_Notre_Dame,"Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend ""Venite Ad Me Omnes"". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.",What sits on top of the Main Building at Notre Dame?,5733be284776f4190066117e,"{'answer_start': [92], 'text': ['a golden statue of the Virgin Mary']}"


Number of newly generated rows:

In [None]:
n_answers = df['answers'].count()
print("Total samples:\n{}".format(n_answers))

Total samples:
87599


### Dataset Split
The dataset has to be split into a training set and a validation set.

In [None]:
from datasets import Dataset, DatasetDict

def split_train_validation(df, train_size):
    """
    Returns a DatasetDict with the train and validation splits.

    Parameters
    ----------
    df : pandas.DataFrame
        Dataframe to split.
    train_size : int or float
        Size of the train split. If it is less than or equal to 1 it is
        interpreted as a fraction of the samples, otherwise as the
        absolute number of training samples.

    Returns
    -------
    datasets.DatasetDict
        Dictionary whose keys are the "train" and "validation" splits and
        whose values are the corresponding datasets.

    """

    dataset = {}
    # Number of samples in df
    n_answers = df['answers'].count()
    # Target number of training samples
    s_train = n_answers * train_size if train_size <= 1 else train_size
    # Number of samples per title (indexed by title)
    df_bytitle = df.groupby(by='title')['answers'].count()
    # Cumulative sum over the per-title counts, sorted in ascending order, used to
    # select the train titles until the train size is reached
    # (the boolean mask is aligned to df_bytitle by title)
    train_title = df_bytitle[df_bytitle.sort_values().cumsum() < s_train]
    # Splitting the two dataframes
    df_train = df[df.title.isin(train_title.index.tolist())].reset_index(drop=True)
    df_validation = df[~df.title.isin(train_title.index.tolist())].reset_index(drop=True)
    # Building the two HuggingFace's datasets using train and validation dataframes
    dataset["train"]= Dataset.from_pandas(df_train)
    dataset["validation"]= Dataset.from_pandas(df_validation)

    return DatasetDict(**dataset)

`split_train_validation` is then called to split the previously created `DataFrame` into training and validation sets. Since the split is performed by title, all the rows sharing a title end up in the same split.

In [None]:
datasets = split_train_validation(df, 0.9)

The result is:

In [None]:
datasets

DatasetDict({
    train: Dataset({
        features: ['title', 'context', 'question', 'id', 'answers'],
        num_rows: 78428
    })
    validation: Dataset({
        features: ['title', 'context', 'question', 'id', 'answers'],
        num_rows: 9171
    })
})
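
The title-based selection used by `split_train_validation` can be reproduced on a toy list of titles with the standard library only. This is a sketch of the selection rule under the same assumptions (titles sorted by ascending number of samples, kept while the cumulative count stays below the target); `split_titles` is a hypothetical helper written for this illustration.

```python
from collections import Counter

def split_titles(titles, train_size):
    """Return the set of titles assigned to the train split."""
    counts = Counter(titles)
    # Fraction of the samples if <= 1, absolute number of samples otherwise
    target = len(titles) * train_size if train_size <= 1 else train_size
    train_titles, running = set(), 0
    # Visit the titles from the smallest to the largest
    for title, count in sorted(counts.items(), key=lambda kv: kv[1]):
        running += count
        if running < target:
            train_titles.add(title)
    return train_titles

# Toy data: title "A" has 2 samples, "B" has 3, "C" has 5
titles = ["A"] * 2 + ["B"] * 3 + ["C"] * 5
train_titles = split_titles(titles, 0.6)
validation_titles = set(titles) - train_titles
print(sorted(train_titles), sorted(validation_titles))
```

Because whole titles are assigned to a split, the actual train fraction is only approximately the requested one.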

## Preprocessing the Data

### Choosing the Model
As stated at the beginning, the model is a transformer that has been pretrained on a generic task. Hence, in order to finetune it, it is important to faithfully **repeat the preprocessing steps used during the pre-training phase**. As such, the model to be used must be chosen before the preprocessing phase.<br>
Since the questions have to be answered not by generating new text but by extracting a substring from a paragraph, the ideal type of transformer is the **encoder** kind.
<figure class="image">
<img src="https://drive.google.com/uc?export=view&id=1A9BFo4m5zuVNceYccmS_thUiUhwQfJmm">
<figcaption>Typical structure of an encoder-based transformer.</figcaption>
</figure>

From this family of transformers, DistilBERT has been chosen.

In [None]:
model_checkpoint = "distilbert-base-uncased"

### Loading the Tokenizer
The preprocessing is handled by HuggingFace's `Tokenizer` class.<br>
This class handles the preprocessing of the dataset in conformity with the specification of each pre-trained model available in HuggingFace's model hub. In particular, it holds the vocabulary built in the pre-training phase and the tokenization methodology used, which is generally word-based, character-based or subword-based. DistilBERT uses the same tokenization as BERT, namely end-to-end tokenization: punctuation splitting followed by WordPiece (subword segmentation).<br>
The method `AutoTokenizer.from_pretrained` downloads the appropriate tokenizer.

In [None]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
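
To give an intuition of the subword segmentation mentioned above, here is a minimal sketch of the greedy longest-match-first strategy used by WordPiece. It is an illustration only: the vocabulary is invented, `wordpiece` is a hypothetical helper, and the real tokenizer also performs normalization and punctuation splitting before this step.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split a single word into the longest pieces found in vocab."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest candidate first, then shrink it from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word are marked with "##"
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches: the whole word becomes unknown
            return [unk]
        tokens.append(piece)
        start = end
    return tokens

# Invented toy vocabulary (the real one holds about 30k entries)
toy_vocab = {"token", "##ization", "##ize", "un", "##related"}
print(wordpiece("tokenization", toy_vocab))
print(wordpiece("unrelated", toy_vocab))
```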

### Handling Long Sequences
Transformer models have a maximum number of tokens they are able to process, and this limit varies with the architecture.<br>
A solution usually adopted for sequences longer than the limit (other than choosing a model that can handle longer sequences) is to **truncate** the sequence.<br>
While this approach may be effective for some tasks, in this case it is **not a valid solution**, since the answer to the question might be truncated out of the context.<br>
To overcome this limitation, the input sequence is split by **sliding** a window over it with a certain **stride**, allowing a certain degree of **overlap** between consecutive windows. The overlap is necessary to avoid cutting the context at a point where an answer lies.

In [None]:
max_length = 384 # Max length of the input sequence
stride = 128 # Overlap of the context
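
The sliding mechanism can be sketched in plain Python, independently of the tokenizer. This is a simplified illustration, assuming that the stride is the number of tokens shared by two consecutive windows and ignoring the question and the special tokens; `sliding_windows` is a hypothetical helper, not part of the HuggingFace API.

```python
def sliding_windows(tokens, max_len, overlap):
    """Split tokens into windows of at most max_len items sharing `overlap` items."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be in [0, max_len)")
    step = max_len - overlap
    windows, start = [], 0
    while True:
        windows.append(tokens[start:start + max_len])
        # Stop as soon as the current window reaches the end of the sequence
        if start + max_len >= len(tokens):
            break
        start += step
    return windows

toy_tokens = list(range(10))
windows = sliding_windows(toy_tokens, max_len=4, overlap=2)
for window in windows:
    print(window)
```

Each window repeats the last `overlap` tokens of the previous one, so a short answer cut by one window boundary is still entirely contained in a neighbouring window.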

HuggingFace's tokenizers allow this kind of operation by passing the argument `return_overflowing_tokens=True` and by specifying the overlap through the argument `stride`.

In [None]:
def tokenize(tokenizer, max_length, stride, row):
    pad_on_right = tokenizer.padding_side == "right"
    
    return tokenizer(
        row["question" if pad_on_right else "context"],
        row["context" if pad_on_right else "question"],
        max_length=max_length,
        truncation="only_second" if pad_on_right else "only_first",
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        stride=stride,
        padding="max_length"
    )

Dividing a context into several truncated contexts creates some issues regarding the detection of the answer inside the context, since a single question-context pair may generate multiple question-truncated context pairs. This implies that using `answers["answer_start"]` alone is no longer sufficient. As such, an additional step needs to be integrated in the preprocessing pipeline: the detection of the answers in the truncated contexts.

In [None]:
import collections

# This structure is an aid to the following functions, which have to deal with many start and end indexes.
Position = collections.namedtuple("Position", ["start","end"])

The first step is to retrieve the answer position in the original context.

In [None]:
def get_answer_position_in_context(answers):
    # Index of the answer starting character inside the context.
    start_char = answers["answer_start"][0]
    # Index of the answer ending character inside the context.
    end_char = start_char + len(answers["text"][0])
    
    return Position(start=start_char, end=end_char)

Since the tokenized input sequence encodes both the question and the context, it is necessary to identify which part of the sequence matches the context.<br>
The method `sequence_ids()` helps with this task.<br>
In particular, `sequence_ids()` tags the input tokens as `0` if they belong to the question and `1` if they belong to the context (the reverse is true when the model pads the sequence on the left); `None` is used for special tokens.

In [None]:
def get_context_position_in_tokenized_input(tokenized_row, i, pad_on_right):
    # List that holds, for each index (up to the length of the tokenized input sequence),
    # 1 if its corresponding token is a context's token, 0 if it's a question's token
    # (the opposite if pad_on_right is False). None for the special tokens.
    sequence_ids = tokenized_row.sequence_ids(i)

    # Start context's token's index inside the input sequence.
    token_start_index = sequence_ids.index(1 if pad_on_right else 0)

    # End context's token's index inside the input sequence.
    token_end_index = len(sequence_ids)-1 - list(reversed(sequence_ids)).index(1 if pad_on_right else 0)

    return Position(start=token_start_index, end=token_end_index)
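
The index arithmetic above can be checked on a hand-written list shaped like the output of `sequence_ids()`. The list below is invented for the illustration and assumes right padding, with special and padding tokens tagged as `None`.

```python
# [CLS] q q q [SEP] c c c c [SEP] [PAD] [PAD]
toy_sequence_ids = [None, 0, 0, 0, None, 1, 1, 1, 1, None, None, None]

# First token tagged as context
context_start = toy_sequence_ids.index(1)
# Last token tagged as context: search the reversed list and map the index back
context_end = len(toy_sequence_ids) - 1 - list(reversed(toy_sequence_ids)).index(1)

print(context_start, context_end)
```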

In order to properly tag the position of an answer in a truncated context, the answer needs to be fully included in the truncated context, since partial answers may be neither fully explanatory nor grammatically consistent.<br>
Given the start and end indexes of the answer inside the original context and the position of the truncated context inside the tokenized input sequence (which is composed of the question and the context), what is left is to identify the position of the answer in the tokenized and truncated context.<br>
This is done with the aid of the `offset_mapping` attribute of the tokenized sequence (obtained by passing `return_offsets_mapping=True` to the tokenizer), which indicates for each token its starting and ending character index in the original text.

In [None]:
def get_answer_position_in_tokenized_input(offsets, char_pos, token_pos, cls_index):
    # Check if the answer is fully included in the truncated context.
    if offsets[token_pos.start][0] <= char_pos.start and offsets[token_pos.end][1] >= char_pos.end:
        # Starting token's index of the answer with respect to the input sequence:
        # the last context token that starts at or before the answer's first character.
        start_position = token_pos.start
        while start_position < token_pos.end and offsets[start_position + 1][0] <= char_pos.start:
            start_position += 1
        # Ending token's index of the answer with respect to the input sequence:
        # the first context token that ends at or after the answer's last character.
        end_position = token_pos.end
        while end_position > token_pos.start and offsets[end_position - 1][1] >= char_pos.end:
            end_position -= 1

        return Position(start=start_position, end=end_position)
    else:
        return Position(start=cls_index, end=cls_index)
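
The role of the offsets can be illustrated without a real tokenizer. The sketch below assumes a naive whitespace tokenization, whose character spans play the role of `offset_mapping`; `char_span_to_token_span` is a hypothetical helper written for this example and only covers a context without question or special tokens.

```python
import re

def char_span_to_token_span(offsets, char_start, char_end):
    """Return (first, last) token indexes covering [char_start, char_end), or None."""
    # The answer must be fully contained in the tokens at hand
    if not offsets or offsets[0][0] > char_start or offsets[-1][1] < char_end:
        return None
    # Last token starting at or before the first character of the answer
    first = max(i for i, (start, _) in enumerate(offsets) if start <= char_start)
    # First token ending at or after the last character of the answer
    last = min(i for i, (_, end) in enumerate(offsets) if end >= char_end)
    return first, last

toy_context = "The cat sat on the mat"
# (start, end) character span of every whitespace-separated token
toy_offsets = [match.span() for match in re.finditer(r"\S+", toy_context)]

toy_answer = "sat on"
answer_start = toy_context.index(toy_answer)
span = char_span_to_token_span(toy_offsets, answer_start, answer_start + len(toy_answer))
print(span)
```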

The preprocessing function is obtained by composing the previously defined functions.

In [None]:
def preprocess_train(tokenizer, max_length, stride):
    pad_on_right = tokenizer.padding_side == "right"

    def preprocess_train_impl(rows):
        tokenized_rows = tokenize(tokenizer, max_length, stride, rows)
        # overflow_to_sample_mapping keeps the correspondence between a feature and the row it was generated from.
        sample_mapping = tokenized_rows.pop("overflow_to_sample_mapping")
        # offset_mapping holds, for each input token, its position in the textual counterpart
        # (be it the question or the context).
        offset_mapping = tokenized_rows.pop("offset_mapping")

        tokenized_rows["start_positions"] = []
        tokenized_rows["end_positions"] = []
        for i, offsets in enumerate(offset_mapping):
            input_ids = tokenized_rows["input_ids"][i]

            # cls is a special token. It will be used to label "impossible answers".
            cls_index = input_ids.index(tokenizer.cls_token_id)

            # One row can generate several truncated contexts; this is the index of the row containing this portion of context.
            sample_index = sample_mapping[i]
            answers = rows["answers"][sample_index]
            # If no answers are given, set the cls_index as answer.
            if len(answers["answer_start"]) == 0:
                pos = Position(cls_index,cls_index)
            else:
                char_pos = get_answer_position_in_context(answers)
                token_pos = get_context_position_in_tokenized_input(tokenized_rows, i, pad_on_right)
                pos = get_answer_position_in_tokenized_input(offsets, char_pos, token_pos, cls_index)

            tokenized_rows["start_positions"].append(pos.start)
            tokenized_rows["end_positions"].append(pos.end)

        return tokenized_rows
    return preprocess_train_impl

### Calling the Preprocessing Method
The `map` method of `DatasetDict` applies a given function to every split of the dataset; with `batched=True` the function receives a batch of rows at a time.

In [None]:
tokenized_datasets = datasets.map(preprocess_train(tokenizer, max_length, stride),
                                  batched=True,
                                  remove_columns=datasets["train"].column_names)

The result is:

In [None]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'end_positions', 'input_ids', 'start_positions'],
        num_rows: 79245
    })
    validation: Dataset({
        features: ['attention_mask', 'end_positions', 'input_ids', 'start_positions'],
        num_rows: 9279
    })
})

## Training

**NOTE**: the cells performing the actual training have been commented out and replaced with the download of an already finetuned model.
If you wish to finetune the model yourself, uncomment the following line in the code cell below:
```
#model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)
```
and comment:
```
!gdown --id "1ThyHyaFwci_SXLB6jrBnm6aacN74_YCd"

with zipfile.ZipFile('squad_trained.zip', 'r') as zip_ref:
    zip_ref.extractall('./')

model = AutoModelForQuestionAnswering.from_pretrained("squad_trained")
```
Moreover the two code cells under the section "Finetuning" need to be uncommented as well.

As previously mentioned, a pretrained model is used and then finetuned on the task at hand. In particular, DistilBERT is pre-trained by distilling BERT with a masked language modeling objective; unlike BERT, it does not use the next sentence prediction objective.<br>
Since the model has already been chosen during the preprocessing phase, it can now be downloaded directly from the HuggingFace Model Hub using the `from_pretrained` method.<br>
`AutoModel` is the class that instantiates the correct architecture based on the model downloaded from the hub. `AutoModelForQuestionAnswering` additionally attaches to the pretrained backbone the head needed to perform this kind of task (the head itself is not pretrained).

In [None]:
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer
import zipfile

#model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

!gdown --id "1ThyHyaFwci_SXLB6jrBnm6aacN74_YCd"

with zipfile.ZipFile('squad_trained.zip', 'r') as zip_ref:
    zip_ref.extractall('./')

model = AutoModelForQuestionAnswering.from_pretrained("squad_trained")

Downloading...
From: https://drive.google.com/uc?id=1ThyHyaFwci_SXLB6jrBnm6aacN74_YCd
To: /content/squad_trained.zip
245MB [00:04, 56.6MB/s]
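
To make the role of the question answering head more concrete, here is a dependency-free sketch, assuming the head is a single linear layer that maps the hidden vector of every token to two scores (a start logit and an end logit). The function `qa_head`, the weights and the hidden states are invented for the illustration and do not come from the model.

```python
def qa_head(hidden_states, weight, bias):
    """Map each token's hidden vector to a (start_logit, end_logit) pair."""
    start_logits, end_logits = [], []
    for hidden in hidden_states:
        # Row 0 of the weight matrix scores "start", row 1 scores "end"
        start_logits.append(sum(x * w for x, w in zip(hidden, weight[0])) + bias[0])
        end_logits.append(sum(x * w for x, w in zip(hidden, weight[1])) + bias[1])
    return start_logits, end_logits

# Three tokens with a hidden size of 2 (toy values)
toy_hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_weight = [[2.0, 0.0], [0.0, 3.0]]
toy_bias = [0.0, 1.0]

toy_start_logits, toy_end_logits = qa_head(toy_hidden_states, toy_weight, toy_bias)
print(toy_start_logits, toy_end_logits)
```

Since the weights of this layer are randomly initialized when the checkpoint is loaded, the finetuning step is what makes the two scores meaningful.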


### Trainer Class Definition
The finetuning of the model is handled by the `Trainer` class.<br>
Still, a few things need to be defined before the `Trainer` class can be used.<br>
The first is the `TrainingArguments` object, which specifies the saving folder, batch size, learning rate, etc.

In [None]:
batch_size = 16

args = TrainingArguments(
    "squad",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01
)

The second and last thing to define is the data collator, which assembles the samples into batches. Since every sequence has already been padded to `max_length` during tokenization, the default collator is sufficient.

In [None]:
from transformers import default_data_collator

data_collator = default_data_collator

Now it's finally possible to instantiate the `Trainer` class.

In [None]:
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

### Finetuning
The method `train` of the `Trainer` class is used to trigger the finetuning process.

In [None]:
#trainer.train()

Saving the model.

In [None]:
#trainer.save_model("squad-trained")

## Evaluation

The evaluation phase is not straightforward and requires some additional steps.<br>
In particular, the model outputs the loss and two scores for each token, indicating the likelihood of that token being the start and the end of the answer.<br>
Simply taking the argmax of both will not do, since it may produce infeasible situations: a start position greater than the end position and/or a start position inside the question (remember that the input sequence is the concatenation of the tokenized question and the tokenized context).
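
A toy example shows how the naive argmax can fail. The logits below are invented for the illustration; nothing here depends on the model.

```python
# Invented start/end scores for a sequence of six tokens
toy_start_scores = [0.1, 0.3, 0.2, 0.4, 2.5, 0.1]
toy_end_scores = [0.2, 0.1, 3.0, 0.5, 0.4, 0.3]

# Independent argmax over the two score lists
naive_start = max(range(len(toy_start_scores)), key=toy_start_scores.__getitem__)
naive_end = max(range(len(toy_end_scores)), key=toy_end_scores.__getitem__)

# The predicted span starts after it ends, so it cannot be an answer
print(naive_start, naive_end, naive_start > naive_end)
```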

### Preprocessing the Evaluation Data
Before evaluating the model some processing steps are required: all the data necessary to avoid the aforementioned problems needs to be added to the dataset.<br>
The problem of the answer being located inside the question is addressed by storing, for each feature, the start and end token positions of the context inside the unified input sequence.<br>
Thanks to the column `overflow_to_sample_mapping` it's also possible to keep a reference between each feature and the row it was generated from.

In [None]:
def preprocess_eval(tokenizer, max_length, stride):
    pad_on_right = tokenizer.padding_side == "right"
    def preprocess_eval_impl(rows):
        # Tokenize the rows
        tokenized_rows = tokenize(tokenizer, max_length, stride, rows)

        # overflow_to_sample_mapping keeps the correspondence between a feature and the row it was generated from.
        sample_mapping = tokenized_rows.pop("overflow_to_sample_mapping")

        # For each feature save the id of the row that generated it.
        tokenized_rows["row_id"] = [rows["id"][sample_index] for sample_index in sample_mapping]

        # Save the start and end context's token's position inside the tokenized input sequence (composed of the question plus the context)
        context_pos = [get_context_position_in_tokenized_input(tokenized_rows,i,pad_on_right) for i in range(len(tokenized_rows["input_ids"]))]
        tokenized_rows["context_start"], tokenized_rows["context_end"] = [index.start for index in context_pos], [index.end for index in context_pos]

        return tokenized_rows
    return preprocess_eval_impl
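
The `row_id` column makes it possible to gather all the features generated by the same row. A minimal sketch with invented values, assuming a batch of two rows where the first one overflowed into two features:

```python
import collections

toy_row_ids = ["q1", "q2"]      # ids of the rows in the batch
toy_sample_mapping = [0, 0, 1]  # feature index -> index of the originating row

# Same expression used to fill the "row_id" column
feature_row_ids = [toy_row_ids[sample_index] for sample_index in toy_sample_mapping]

# Group the feature indexes by the row that generated them
features_per_row = collections.defaultdict(list)
for feature_index, row_id in enumerate(feature_row_ids):
    features_per_row[row_id].append(feature_index)

print(dict(features_per_row))
```

A grouping of this kind is what allows the predictions of several truncated contexts to be compared when choosing a single answer per question.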

In [None]:
validation_features = datasets["validation"].map(
    preprocess_eval(tokenizer, max_length, stride),
    batched=True,
    remove_columns=datasets["validation"].column_names
)

The validation's features generated from the preprocessing are used to compute the predictions.

In [None]:
raw_valid_predictions = trainer.predict(validation_features)

The following columns in the test set  don't have a corresponding argument in `DistilBertForQuestionAnswering.forward` and have been ignored: row_id, context_end, offset_mapping, context_start.
***** Running Prediction *****
  Num examples = 9279
  Batch size = 16


Since the `Trainer` class hides the columns that are not used during prediction, they have to be restored.

In [None]:
validation_features.set_format(type=validation_features.format["type"], columns=list(validation_features.features.keys()))

### Postprocessing the Evaluation Data
The aim of the postprocessing is the following: given the raw predictions (composed of the likelihoods of each input token being the starting and ending token of the answer), retrieve the portion of the context's text corresponding to the predicted answer.

The `get_best_feasible_position` function selects the best feasible pair of starting and ending tokens for each answer.<br>
The problem can easily be framed as a constrained optimization problem.<br>
The function was originally implemented with the `z3` library, but that implementation was later discarded because of performance issues.<br>
The implementation actually used can be found after the `z3` one.

In [None]:
#!pip install z3-solver

In [None]:
"""
from z3 import *

Score = collections.namedtuple("Score", ["index","score"])

def get_best_feasible_position(context_start, context_end, start_logits, end_logits):
    start_index = Int("start_index")
    end_index = Int("end_index")
    st_log = Array('st_log', IntSort(), RealSort())
    e_log = Array('e_log', IntSort(), RealSort())
    for i,sl in enumerate(start_logits):
        st_log = Store(st_log, i, sl)
    for i,el in enumerate(end_logits):
        e_log = Store(e_log, i, el)

    constraint = And(start_index < end_index,
                     start_index >= context_start,
                     end_index <= context_end)
    opt = Optimize()
    opt.add(constraint)
    opt.maximize(st_log[start_index]+e_log[end_index])
    if opt.check() == sat:
        model = opt.model()
        return Score(index=Position(start=model.evaluate(start_index).as_long(),
                                    end=model.evaluate(end_index).as_long()),
                     score=st_log[start_index]+e_log[end_index])
    else:
        raise StopIteration
"""


The new `get_best_feasible_position` implementation evaluates only a portion of the logits, namely the ones with the highest values. The selected pair of logits is the one with the highest score, computed as the sum of the two logits, among the pairs that respect the consistency constraints:<br>
- The index of the starting logit must be less than or equal to that of the ending logit.
- The starting logit's index must be greater than or equal to the index of the context's first token.
- The ending logit's index must be less than or equal to the index of the context's last token.

Using only a portion of the logits was necessary because scoring every pair has $n^2$ complexity, which would make the postprocessing too slow. Reducing the number of logits doesn't change the $n^2$ time complexity but, since $n$ is much smaller (with the default `n_logits=0.15` only 15% of the logits are kept), the procedure becomes more bearable.
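
As a rough illustration of the saving, the number of scored pairs can be counted before and after pruning. This is only a sketch: the sequence length of 384 tokens is a hypothetical value chosen for the example, not one taken from this notebook.

```python
# Hypothetical feature length, assumed only for this illustration.
seq_len = 384
# Default fraction of logits kept by get_best_feasible_position.
n_logits = 0.15

kept = int(seq_len * n_logits)  # logits kept on each side
all_pairs = seq_len ** 2        # pairs scored without pruning
kept_pairs = kept ** 2          # pairs scored after pruning

print(kept, all_pairs, kept_pairs)
print(f"{kept_pairs / all_pairs:.2%} of the original pairs are scored")
```

Because the cost is quadratic, keeping 15% of the logits on each side leaves only a few percent of the pairs to score.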

In [None]:
Score = collections.namedtuple("Score", ["index","score"])

def get_best_feasible_position(start_logits, end_logits, context_start, context_end, n_logits=0.15):
    # Keep only the top fraction of logits, sorted in descending order of value
    sorted_start_logit = sorted(enumerate(start_logits), key=lambda x: x[1], reverse=True)[:int(len(start_logits)*n_logits)]
    sorted_end_logit = sorted(enumerate(end_logits), key=lambda x: x[1], reverse=True)[:int(len(end_logits)*n_logits)]

    # Associate the positions of each pair of start and end tokens to their score and sort them in descending order of score
    sorted_scores = collections.OrderedDict(
                            sorted({Position(start=i, end=j):sl+el for i,sl in sorted_start_logit for j,el in sorted_end_logit}.items(),
                                    key=lambda x: x[1],
                                    reverse=True)
                    )
    
    # Return the position of the pair of higher score that respects the consistency constraints
    return next(Score(index=pos, score=score) for pos,score in sorted_scores.items() \
                if pos.start <= pos.end and pos.start >= context_start and pos.end <= context_end)
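
The behaviour on a toy input can be sketched as follows. `best_feasible_pair` is a hypothetical, condensed restatement of the function above, written so that the snippet runs on its own; the `Position` namedtuple is assumed to have `start` and `end` fields, as its usage above suggests. The second call shows that pruning can leave no feasible pair, in which case `next` raises `StopIteration`.

```python
import collections

# Assumed shape of the notebook's Position type.
Position = collections.namedtuple("Position", ["start", "end"])
Score = collections.namedtuple("Score", ["index", "score"])

def best_feasible_pair(start_logits, end_logits, context_start, context_end, n_logits=0.15):
    # Keep the top fraction of start and end logits together with their indices.
    top_start = sorted(enumerate(start_logits), key=lambda x: x[1],
                       reverse=True)[:int(len(start_logits) * n_logits)]
    top_end = sorted(enumerate(end_logits), key=lambda x: x[1],
                     reverse=True)[:int(len(end_logits) * n_logits)]
    # Score every remaining pair and visit them from best to worst.
    candidates = sorted(((Position(i, j), sl + el)
                         for i, sl in top_start for j, el in top_end),
                        key=lambda x: x[1], reverse=True)
    # Raises StopIteration when no candidate satisfies the constraints.
    return next(Score(index=pos, score=score) for pos, score in candidates
                if context_start <= pos.start <= pos.end <= context_end)

start_logits = [0.1, 0.2, 5.0, 0.3, 0.0, 4.0, 0.1, 0.2]
end_logits = [0.0, 3.0, 0.1, 0.2, 6.0, 0.1, 0.3, 0.2]

# With n_logits=0.25 only two start and two end logits survive the pruning.
found = best_feasible_pair(start_logits, end_logits, 1, 7, n_logits=0.25)
print(found)

# If the context starts at token 5, none of the surviving pairs is feasible.
try:
    best_feasible_pair(start_logits, end_logits, 5, 7, n_logits=0.25)
    no_candidate = False
except StopIteration:
    no_candidate = True
print(no_candidate)
```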

`map_feature_to_row` uses the `row_id` added during the preprocessing step to create a correspondence between each feature and the row it belongs to.

In [None]:
def map_feature_to_row(dataset, features):
    # Associate rows' id with an index
    row_id_to_index = {k: i for i, k in enumerate(dataset["id"])}
    features_per_row = collections.defaultdict(list)
    # Create a correspondence between the previously computed row indices and
    # the indices of the features that belong to those rows
    for i, feature in enumerate(features):
        features_per_row[row_id_to_index[feature["row_id"]]].append(i)

    return features_per_row
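
The mapping can be checked on toy data. The snippet below repeats the body of the function so that it runs on its own; the ids `q1` and `q2` and the plain dictionaries standing in for the dataset and the features are hypothetical.

```python
import collections

# Toy stand-ins: two rows, the first one split into two features.
dataset = {"id": ["q1", "q2"]}
features = [{"row_id": "q1"}, {"row_id": "q1"}, {"row_id": "q2"}]

row_id_to_index = {k: i for i, k in enumerate(dataset["id"])}
features_per_row = collections.defaultdict(list)
for i, feature in enumerate(features):
    features_per_row[row_id_to_index[feature["row_id"]]].append(i)

print(dict(features_per_row))  # {0: [0, 1], 1: [2]}
```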

The `postprocess_eval` function uses the two functions defined above and, for each raw prediction, returns the portion of the context's text that best matches it, taking into account:
- The logit values output by the model.
- The consistency constraints mentioned above.

In [None]:
from tqdm.notebook import tqdm

def postprocess_eval(dataset, features, raw_predictions, verbose=True):
    all_start_logits, all_end_logits = raw_predictions

    # Map the dataset's rows to their corresponding features.
    features_per_row = map_feature_to_row(dataset, features)

    predictions = collections.OrderedDict()

    if verbose:
        print(f"Post-processing {len(dataset)} dataset predictions split into {len(features)} features.")

    for row_index, row in enumerate(tqdm(dataset)):
        valid_answers = []

        # Indices of the features associated to the current row.
        feature_indices = features_per_row[row_index]
        
        context = row["context"]
        # Loop on the features associated to the current row.
        for feature_index in feature_indices:
            context_start = features[feature_index]["context_start"]
            context_end = features[feature_index]["context_end"]

            offsets = features[feature_index]["offset_mapping"]

            # Computation of the answer from the raw predictions.
            start_logits = all_start_logits[feature_index]
            end_logits = all_end_logits[feature_index]
            try:
                candidate = get_best_feasible_position(start_logits, end_logits, context_start, context_end)
                # Keep the offsets of the feature that produced the candidate:
                # token indices are only meaningful within that feature.
                valid_answers.append((candidate, offsets))
            except StopIteration:
                continue

        # For each row use as answer the best candidate generated by the row's features
        if len(valid_answers) > 0:
            best, best_offsets = max(valid_answers, key=lambda x: x[0].score)
            answer_pos = best.index
            answer = context[best_offsets[answer_pos.start][0]: best_offsets[answer_pos.end][1]]
        # In case no candidates are found return an empty string
        else:
            print("No consistent answer start and/or end found")
            answer = ""

        predictions[row["id"]] = answer

    return predictions
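
The last step, turning a pair of token indices into text, can be illustrated in isolation. This is a minimal sketch: the offsets are built with a regular expression as a hypothetical stand-in for the tokenizer's `offset_mapping`, and `Position` is assumed to have `start` and `end` fields.

```python
import collections
import re

Position = collections.namedtuple("Position", ["start", "end"])  # assumed shape

context = "Born and raised in Houston, Texas, she performed in competitions."
# Stand-in for offset_mapping: one (char_start, char_end) pair per token.
offsets = [m.span() for m in re.finditer(r"\w+|[^\w\s]", context)]

# Both token indices are inclusive: take the start of the first token
# and the end of the last one.
answer_pos = Position(start=4, end=6)
answer = context[offsets[answer_pos.start][0]: offsets[answer_pos.end][1]]
print(answer)  # Houston, Texas
```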

Calling the post-processing function over the validation set.

In [None]:
validation_predictions = postprocess_eval(datasets["validation"],
                                          validation_features,
                                          raw_valid_predictions.predictions)

Post-processing 9171 dataset predictions split into 9279 features.






### Compute Metrics
The metrics used are those provided by HuggingFace for the SQuAD dataset: exact match and F1 score.
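
To make the two metrics concrete, here is a simplified sketch of the usual SQuAD-style definitions for a single prediction. It is not the code of the HuggingFace metric: the helper names are hypothetical, the inputs are assumed to be already normalized, and the real metric additionally takes the best score over all reference answers and reports percentages.

```python
import collections

def exact_match(prediction, truth):
    # 1.0 when the two (already normalized) strings are identical.
    return float(prediction == truth)

def f1_score(prediction, truth):
    # Token-level F1: harmonic mean of precision and recall
    # computed on the multiset of shared tokens.
    pred_tokens = prediction.split()
    truth_tokens = truth.split()
    common = collections.Counter(pred_tokens) & collections.Counter(truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

em = exact_match("forbes magazine", "forbes")
f1 = f1_score("forbes magazine", "forbes")
print(em, round(f1, 3))
```

A prediction that contains the reference plus one extra word gets no credit from exact match but a partial score from F1, which is why the F1 value is the higher of the two.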

In [None]:
from datasets import load_metric

metric = load_metric("squad")





In [None]:
formatted_predictions = [{"id": k, "prediction_text": v} for k, v in validation_predictions.items()]
references = [{"id": r["id"], "answers": r["answers"]} for r in datasets["validation"]]

metric.compute(predictions=formatted_predictions, references=references)

{'exact_match': 69.42536255588267, 'f1': 80.70188027992519}

### Inspect Errors
To analyze what kind of errors the model made, the mistaken predictions must first be retrieved.<br>
Here, "mistaken predictions" are those predictions that, after normalization, do not exactly match the first ground-truth answer.

In [None]:
import re
import string

def normalize_answer(s):
    """Lower text and remove punctuation, articles and extra whitespace."""
    def remove_articles(text):
        regex = re.compile(r'\b(a|an|the)\b', re.UNICODE)
        return re.sub(regex, ' ', text)
    def white_space_fix(text):
        return ' '.join(text.split())
    def remove_punc(text):
        exclude = set(string.punctuation)
        return ''.join(ch for ch in text if ch not in exclude)
    def lower(text):
        return text.lower()
    return white_space_fix(remove_articles(remove_punc(lower(s))))
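
The effect of the normalization can be seen on a few strings. `normalize` below is a hypothetical, condensed restatement of `normalize_answer`, included only so that the snippet runs on its own.

```python
import re
import string

def normalize(s):
    # Same steps as normalize_answer: lowercase, strip punctuation,
    # drop articles, collapse whitespace.
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

# Differences in case, articles and punctuation disappear...
same = normalize("the Information age") == normalize("Information Age.")
# ...but an extra word still counts as a mismatch.
different = normalize("Forbes") != normalize("Forbes magazine")
print(same, different)
```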

In [None]:
errors = pd.DataFrame([{"question":row["question"], "context":row["context"], "ground_truth":row["answers"]["text"][0], "prediction":validation_predictions[row["id"]]}
                       for row in datasets["validation"] \
                       if normalize_answer(row["answers"]["text"][0]) != normalize_answer(validation_predictions[row["id"]])])

Total number of mistaken predictions.

In [None]:
print("Wrong answers: {}/{}".format(len(errors),len(datasets["validation"])))

Wrong answers: 2804/9171
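
As a quick sanity check, the error count can be compared with the exact-match score reported above. The snippet is plain arithmetic on the two printed values.

```python
wrong, total = 2804, 9171                  # counts printed above
exact_match_reported = 69.42536255588267   # value returned by metric.compute

# Share of rows that were not collected as errors, as a percentage.
exact_match_from_errors = 100 * (total - wrong) / total
print(exact_match_from_errors, exact_match_reported)
```

The two values agree, so the rows collected in `errors` are exactly the ones the metric counts as not matching.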


In order to check what kind of mistakes the model made, some of the errors will be displayed.<br>
First 30 errors:

In [None]:
# display_dataframe is defined in the Dataset Creation paragraph
display_dataframe(errors.head(30))

Unnamed: 0,question,context,ground_truth,prediction
0,When did Beyonce start becoming popular?,"Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles ""Crazy in Love"" and ""Baby Boy"".",in the late 1990s,1990s
1,When did Beyonce leave Destiny's Child and become a solo singer?,"Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles ""Crazy in Love"" and ""Baby Boy"".",2003,1990s
2,In which decade did Beyonce become famous?,"Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles ""Crazy in Love"" and ""Baby Boy"".",late 1990s,1990s
3,In what city did Beyonce grow up?,"Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles ""Crazy in Love"" and ""Baby Boy"".",Houston,"Houston, Texas"
4,What is Beyonce's full name?,"Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles ""Crazy in Love"" and ""Baby Boy"".",Beyoncé Giselle Knowles-Carter,Giselle Knowles-Carter
5,Which album was darker in tone from her previous work?,"Following the disbandment of Destiny's Child in June 2005, she released her second solo album, B'Day (2006), which contained hits ""Déjà Vu"", ""Irreplaceable"", and ""Beautiful Liar"". Beyoncé also ventured into acting, with a Golden Globe-nominated performance in Dreamgirls (2006), and starring roles in The Pink Panther (2006) and Obsessed (2009). Her marriage to rapper Jay Z and portrayal of Etta James in Cadillac Records (2008) influenced her third album, I Am... Sasha Fierce (2008), which saw the birth of her alter-ego Sasha Fierce and earned a record-setting six Grammy Awards in 2010, including Song of the Year for ""Single Ladies (Put a Ring on It)"". Beyoncé took a hiatus from music in 2010 and took over management of her career; her fourth album 4 (2011) was subsequently mellower in tone, exploring 1970s funk, 1980s pop, and 1990s soul. Her critically acclaimed fifth studio album, Beyoncé (2013), was distinguished from previous releases by its experimental production and exploration of darker themes.",Beyoncé,"4 (2011) was subsequently mellower in tone, exploring 1970s funk, 1980s pop, and 1990s soul. Her critically acclaimed fifth studio album, Beyoncé"
6,"After what movie portraying Etta James, did Beyonce create Sasha Fierce?","Following the disbandment of Destiny's Child in June 2005, she released her second solo album, B'Day (2006), which contained hits ""Déjà Vu"", ""Irreplaceable"", and ""Beautiful Liar"". Beyoncé also ventured into acting, with a Golden Globe-nominated performance in Dreamgirls (2006), and starring roles in The Pink Panther (2006) and Obsessed (2009). Her marriage to rapper Jay Z and portrayal of Etta James in Cadillac Records (2008) influenced her third album, I Am... Sasha Fierce (2008), which saw the birth of her alter-ego Sasha Fierce and earned a record-setting six Grammy Awards in 2010, including Song of the Year for ""Single Ladies (Put a Ring on It)"". Beyoncé took a hiatus from music in 2010 and took over management of her career; her fourth album 4 (2011) was subsequently mellower in tone, exploring 1970s funk, 1980s pop, and 1990s soul. Her critically acclaimed fifth studio album, Beyoncé (2013), was distinguished from previous releases by its experimental production and exploration of darker themes.",Cadillac Records,Destiny's Child
7,"What was Beyoncé's first acting job, in 2006?","Following the disbandment of Destiny's Child in June 2005, she released her second solo album, B'Day (2006), which contained hits ""Déjà Vu"", ""Irreplaceable"", and ""Beautiful Liar"". Beyoncé also ventured into acting, with a Golden Globe-nominated performance in Dreamgirls (2006), and starring roles in The Pink Panther (2006) and Obsessed (2009). Her marriage to rapper Jay Z and portrayal of Etta James in Cadillac Records (2008) influenced her third album, I Am... Sasha Fierce (2008), which saw the birth of her alter-ego Sasha Fierce and earned a record-setting six Grammy Awards in 2010, including Song of the Year for ""Single Ladies (Put a Ring on It)"". Beyoncé took a hiatus from music in 2010 and took over management of her career; her fourth album 4 (2011) was subsequently mellower in tone, exploring 1970s funk, 1980s pop, and 1990s soul. Her critically acclaimed fifth studio album, Beyoncé (2013), was distinguished from previous releases by its experimental production and exploration of darker themes.",Dreamgirls,Golden Globe-nominated performance in Dreamgirls
8,Which magazine declared her the most dominant woman musician?,"A self-described ""modern-day feminist"", Beyoncé creates songs that are often characterized by themes of love, relationships, and monogamy, as well as female sexuality and empowerment. On stage, her dynamic, highly choreographed performances have led to critics hailing her as one of the best entertainers in contemporary popular music. Throughout a career spanning 19 years, she has sold over 118 million records as a solo artist, and a further 60 million with Destiny's Child, making her one of the best-selling music artists of all time. She has won 20 Grammy Awards and is the most nominated woman in the award's history. The Recording Industry Association of America recognized her as the Top Certified Artist in America during the 2000s decade. In 2009, Billboard named her the Top Radio Songs Artist of the Decade, the Top Female Artist of the 2000s and their Artist of the Millennium in 2011. Time listed her among the 100 most influential people in the world in 2013 and 2014. Forbes magazine also listed her as the most powerful female musician of 2015.",Forbes,Forbes magazine
9,What magazine rated Beyonce as the most powerful female musician in 2015?,"A self-described ""modern-day feminist"", Beyoncé creates songs that are often characterized by themes of love, relationships, and monogamy, as well as female sexuality and empowerment. On stage, her dynamic, highly choreographed performances have led to critics hailing her as one of the best entertainers in contemporary popular music. Throughout a career spanning 19 years, she has sold over 118 million records as a solo artist, and a further 60 million with Destiny's Child, making her one of the best-selling music artists of all time. She has won 20 Grammy Awards and is the most nominated woman in the award's history. The Recording Industry Association of America recognized her as the Top Certified Artist in America during the 2000s decade. In 2009, Billboard named her the Top Radio Songs Artist of the Decade, the Top Female Artist of the 2000s and their Artist of the Millennium in 2011. Time listed her among the 100 most influential people in the world in 2013 and 2014. Forbes magazine also listed her as the most powerful female musician of 2015.",Forbes,Forbes magazine


Random 30 errors:

In [None]:
display_dataframe(errors.sample(frac=1).reset_index(drop=True).head(30))

Unnamed: 0,question,context,ground_truth,prediction
0,"The Buddhism practiced in Tibet, the Himalayan regions and Mongolia are often referred to as?","Native Mahayana Buddhism is practiced today in China, Japan, Korea, Singapore, parts of Russia and most of Vietnam (also commonly referred to as ""Eastern Buddhism""). The Buddhism practiced in Tibet, the Himalayan regions, and Mongolia is also Mahayana in origin, but is discussed below under the heading of Vajrayana (also commonly referred to as ""Northern Buddhism""). There are a variety of strands in Eastern Buddhism, of which ""the Pure Land school of Mahayana is the most widely practised today."". In most of this area however, they are fused into a single unified form of Buddhism. In Japan in particular, they form separate denominations with the five major ones being: Nichiren, peculiar to Japan; Pure Land; Shingon, a form of Vajrayana; Tendai, and Zen. In Korea, nearly all Buddhists belong to the Chogye school, which is officially Son (Zen), but with substantial elements from other traditions.",Northern Buddhism,Mahayana
1,What is the term for insight knowledge?,"In Buddhist practice, it is said that while samatha meditation can calm the mind, only vipassanā meditation can reveal how the mind was disturbed to start with, which is what leads to insight knowledge (jñāna; Pāli ñāṇa) and understanding (prajñā Pāli paññā), and thus can lead to nirvāṇa (Pāli nibbāna). When one is in jhana, all defilements are suppressed temporarily. Only understanding (prajñā or vipassana) eradicates the defilements completely. Jhanas are also states that Arahants abide in order to rest.",jñāna,jñāna; Pāli ñāṇa
2,"During the High Middle Ages, what percentage of the European population consisted of rural peasants?","The High Middle Ages was a period of tremendous expansion of population. The estimated population of Europe grew from 35 to 80 million between 1000 and 1347, although the exact causes remain unclear: improved agricultural techniques, the decline of slaveholding, a more clement climate and the lack of invasion have all been suggested. As much as 90 per cent of the European population remained rural peasants. Many were no longer settled in isolated farms but had gathered into small communities, usually known as manors or villages. These peasants were often subject to noble overlords and owed them rents and other services, in a system known as manorialism. There remained a few free peasants throughout this period and beyond, with more of them in the regions of Southern Europe than in the north. The practice of assarting, or bringing new lands into production by offering incentives to the peasants who settled them, also contributed to the expansion of population.",90,90 per cent
3,What was the name of the revolutionary group led by Gaddafi?,"Having removed the monarchical government, Gaddafi proclaimed the foundation of the Libyan Arab Republic. Addressing the populace by radio, he proclaimed an end to the ""reactionary and corrupt"" regime, ""the stench of which has sickened and horrified us all."" Due to the coup's bloodless nature, it was initially labelled the ""White Revolution"", although was later renamed the ""One September Revolution"" after the date on which it occurred. Gaddafi insisted that the Free Officers' coup represented a revolution, marking the start of widespread change in the socio-economic and political nature of Libya. He proclaimed that the revolution meant ""freedom, socialism, and unity"", and over the coming years implemented measures to achieve this.",Free Officers,Free Officers' coup
4,The importance of what military arm increased during the Carolingian era?,"The importance of infantry and light cavalry began to decline during the early Carolingian period, with a growing dominance of elite heavy cavalry. The use of militia-type levies of the free population declined over the Carolingian period. Although much of the Carolingian armies were mounted, a large proportion during the early period appear to have been mounted infantry, rather than true cavalry. One exception was Anglo-Saxon England, where the armies were still composed of regional levies, known as the fyrd, which were led by the local elites. In military technology, one of the main changes was the return of the crossbow, which had been known in Roman times and reappeared as a military weapon during the last part of the Early Middle Ages. Another change was the introduction of the stirrup, which increased the effectiveness of cavalry as shock troops. A technological advance that had implications beyond the military was the horseshoe, which allowed horses to be used in rocky terrain.",heavy cavalry,infantry and light cavalry
5,What is the popular name of New York's advertising industry?,"As of 2013, the global advertising agencies of Omnicom Group and Interpublic Group, both based in Manhattan, had combined annual revenues of approximately US$21 billion, reflecting New York City's role as the top global center for the advertising industry, which is metonymously referred to as ""Madison Avenue"". The city's fashion industry provides approximately 180,000 employees with $11 billion in annual wages.",Madison Avenue,Omnicom Group
6,The parent company of Tidal became under the ownership of whom in 2015?,"On March 30, 2015, it was announced that Beyoncé is a co-owner, with various other music artists, in the music streaming service Tidal. The service specialises in lossless audio and high definition music videos. Beyoncé's husband Jay Z acquired the parent company of Tidal, Aspiro, in the first quarter of 2015. Including Beyoncé and Jay-Z, sixteen artist stakeholders (such as Kanye West, Rihanna, Madonna, Chris Martin, Nicki Minaj and more) co-own Tidal, with the majority owning a 3% equity stake. The idea of having an all artist owned streaming service was created by those involved to adapt to the increased demand for streaming within the current music industry, and to rival other streaming services such as Spotify, which have been criticised for their low payout of royalties. ""The challenge is to get everyone to respect music again, to recognize its value"", stated Jay-Z on the release of Tidal.",Jay Z,Aspiro
7,What is the Manueline style?,"Traditional architecture is distinctive and include the Manueline, also known as Portuguese late Gothic, a sumptuous, composite Portuguese style of architectural ornamentation of the first decades of the 16th century. A 20th-century interpretation of traditional architecture, Soft Portuguese style, appears extensively in major cities, especially Lisbon. Modern Portugal has given the world renowned architects like Eduardo Souto de Moura, Álvaro Siza Vieira (both Pritzker Prize winners) and Gonçalo Byrne. In Portugal Tomás Taveira is also noteworthy, particularly for stadium design.","a sumptuous, composite Portuguese style of architectural ornamentation of the first decades of the 16th century",Portuguese late Gothic
8,The 21st century is refereed to as what?,"In the Contemporary era, there were various socio-technological trends. Regarding the 21st century and the late modern world, the Information age and computers were forefront in use, not completely ubiquitous but often present in daily life. The development of Eastern powers was of note, with China and India becoming more powerful. In the Eurasian theater, the European Union and Russian Federation were two forces recently developed. A concern for Western world, if not the whole world, was the late modern form of terrorism and the warfare that has resulted from the contemporary terrorist acts.",the Information age,the late modern world
9,Name one of the tribes Native Americans formed in the Americas?,"In Asia, various Chinese dynasties and Japanese shogunates controlled the Asian sphere. In Japan, the Edo period from 1600 to 1868 is also referred to as the early modern period. And in Korea, from the rising of Joseon Dynasty to the enthronement of King Gojong is referred to as the early modern period. In the Americas, Native Americans had built a large and varied civilization, including the Aztec Empire and alliance, the Inca civilization, the Mayan Empire and cities, and the Chibcha Confederation. In the west, the European kingdoms and movements were in a movement of reformation and expansion. Russia reached the Pacific coast in 1647 and consolidated its control over the Russian Far East in the 19th century.",Aztec Empire,Chibcha Confederation


Retrieve an error by querying by question.

In [None]:
def get_error(errors, question):
    return errors[errors['question']==question]

In [None]:
display_dataframe(get_error(errors, "What did Gao Qiang tell reporters in Beijing?"))

Unnamed: 0,question,context,ground_truth,prediction
458,What did Gao Qiang tell reporters in Beijing?,"Experts point out that the earthquake hit an area that has been largely neglected and untouched by China's economic rise. Health care is poor in inland areas such as Sichuan, highlighting the widening gap between prosperous urban dwellers and struggling rural people. Vice Minister of Health Gao Qiang told reporters in Beijing that the ""public health care system in China is insufficient."" The Vice Minister of Health also suggested that the government would pick up the costs of care to earthquake victims, many of whom have little or no insurance: ""The government should be responsible for providing medical treatment to them,"" he said.",public health care system in China is insufficient,"that the ""public health care system in China is insufficient."""
