# Everything you (n)ever wanted to know about SQuAD evaluation
> A deep dive into computing QA predictions and evaluating a model on SQuAD2.0 

- toc: true 
- badges: true
- comments: true
- permalink: /hidden/
- hidden: true
- search_exclude: false
- categories: [PyTorch, Hugging Face, Wikipedia, BERT, Transformers]

In [None]:
- image: images/diagram.png


### Prerequisites: 
(See our previous post XXX)
* A basic understanding of Transformers and PyTorch
* A Transformer fine-tuned on SQuAD (preferably SQuAD 2.0)
* The SQuAD dev set

### What you'll learn
1. How to evaluate BERT on SQuAD
2. Metrics for evaluating QA
3. How to handle the null response, when a question doesn't have an answer in the passage
4. How to implement a more robust answering method for your QA system

### Walk through the process:
1. Overview of the SQuAD 2.0 dev set
2. Overview of the EM and F1 metrics
3. Evaluating a model on the SQuAD dev set and understanding the outputs
    1. preds and nbest_preds
    2. the null_odds json
4. Using the null_odds json to determine the best null answer threshold
    1. wait, what? 
    2. visualizing the null answer PR curve
5. Implementing a more robust get_answer() method
    1. checks to prevent impossible answers
    2. finds one answer for examples that are parsed into multiple features
    3. allows for null response


In [48]:
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

In our last post [Building a QA System with BERT on Wikipedia](https://qa.fastforwardlabs.com/pytorch/hugging%20face/wikipedia/bert/transformers/2020/05/19/Getting_Started_with_QA.html) we used the HuggingFace framework to train BERT on the SQuAD dataset and built a simple QA system on top of the Wikipedia search engine. This time, we'll look at the quality of BERT for Question Answering. We'll cover what metrics are used to quantify quality, how to evaluate BERT with the HuggingFace framework and how to incorporate the "null response" into your model for more realistic QA output. 

# Answering Questions is hard
Quantifying the success of answering a question is a tricky task. When you or I ask a question, the correct answer could take multiple forms. Let's look at an example. 

In our previous post, BERT answered the question, "Why is the sky blue?" with "Rayleigh scattering" but another correct answer would be 

"The Earth's atmosphere scatters short-wavelength light more efficiently than that of longer wavelengths. Because its wavelengths are shorter, blue light is more strongly scattered than the longer-wavelength lights, red or green. Hence the result that when looking at the sky away from the direct incident sunlight, the human eye perceives the sky to be blue." 

Both of these passages can be found in the Wikipedia article Diffuse Sky Radiation and both are correct but only one of these responses is returned as the answer.  


**Need a different example** 


In order to determine whether BERT is correctly answering questions, we of course need a gold standard set of questions and answers! This is exactly what the SQuAD dataset provides. 

# More about SQuAD
The SQuAD dataset comes in two flavors: SQuAD 1.1 and SQuAD 2.0. The latter contains the same questions and answers as the former but also includes additional questions that _cannot_ be answered by the accompanying passage. This is intended to provide a more realistic question answering task because often times there really won't be an answer to the question in the document we're parsing. This ability to properly identify when a question does not have an answer is much more challenging for Transformer models and it's why we focused on this dataset rather than SQuAD 1.1. 

SQuAD 2.0 consists of more than 130k questions, of which a full third do not have an answer that can be found in the associated passage. The dev set in particular contains a roughly 50/50 split of answerable and unanswerable questions. SQuAD examples consist of question + context pairs. The context is a single paragraph from a Wikipedia article. Each paragraph has several questions (both answerable and unanswerable) associated with it. The dev set paragraphs are drawn from 35 Wikipedia articles, and every paragraph in an article has at least one question associated with it. The full SQuAD stats, from the paper XXX, are shown below.


![](images/squad_datasets.png "SQuAD Stats")
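To make the structure of the dataset concrete, here is a minimal sketch of how answerable and unanswerable questions could be counted directly from a SQuAD 2.0 style JSON file. The `toy_squad` dictionary and the `count_questions` helper are made up for illustration; only the nesting (articles, then paragraphs, then question/answer entries with an `is_impossible` flag) follows the published file layout.

```python
# Toy data in the SQuAD 2.0 JSON layout: data -> paragraphs -> qas.
# The content is invented for illustration; a real file has the same nesting.
toy_squad = {
    "version": "v2.0",
    "data": [
        {
            "title": "Oxygen",
            "paragraphs": [
                {
                    "context": "Free oxygen also occurs in solution in the world's water bodies.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "Where on Earth is free oxygen found?",
                            "is_impossible": False,
                            "answers": [{"text": "water", "answer_start": 51}],
                        },
                        {
                            "id": "q2",
                            "question": "Where on Mars is free oxygen found?",
                            "is_impossible": True,
                            "answers": [],
                        },
                    ],
                }
            ],
        }
    ],
}


def count_questions(squad):
    """Return (answerable, unanswerable) question counts (hypothetical helper)."""
    answerable = unanswerable = 0
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                if qa.get("is_impossible", False):
                    unanswerable += 1
                else:
                    answerable += 1
    return answerable, unanswerable


print(count_questions(toy_squad))
```

On the real dev file you would load the JSON from disk first and pass the resulting dictionary to the same function.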


### Load the SQuAD 2.0 dev set using HF data processors

Hugging Face provides the [Processors](https://huggingface.co/transformers/main_classes/processors.html) library for facilitating basic processing tasks with some canonical NLP datasets. The processors can be used for loading datasets and converting their examples to features for direct use in the model. We'll be using the [SQuAD processors](https://huggingface.co/transformers/main_classes/processors.html#squad). 

In [8]:
# hide
data_dir = "/home/ryan/work/ff14/data/squad/"
dev_set_file = data_dir + "dev-v2.0.json"

In [9]:
from transformers.data.processors.squad import SquadV2Processor

processor = SquadV2Processor()
examples = processor.get_dev_examples(data_dir, filename=dev_set_file)

100%|██████████| 35/35 [00:05<00:00,  6.64it/s]


In [11]:
# generate some maps to help us identify examples of interest
qid_to_example_index = {example.qas_id: i for i, example in enumerate(examples)}
qid_to_has_answer = {example.qas_id: bool(example.answers) for example in examples}
answer_qids = [qas_id for qas_id, has_answer in qid_to_has_answer.items() if has_answer]
no_answer_qids = [qas_id for qas_id, has_answer in qid_to_has_answer.items() if not has_answer]

In [46]:
def display_example(qid):    
    idx = qid_to_example_index[qid]
    q = examples[idx].question_text
    c = examples[idx].context_text
    a = [answer['text'] for answer in examples[idx].answers]
    
    print(f'Example {idx} of {len(examples)}\n---------------------')
    print(f"Q: {q}\n")
    print(f"Context: \n{c}\n")
    print(f"True Answers:\n{a}")

#### A SQuAD example 

Approximately 50% of the examples in the dev set are questions that have answers contained within their corresponding passage. In these cases, up to five possible correct answers are provided (questions and answers were generated and identified by crowd-sourced workers). Answers must be direct excerpts from the passage but we can see there are several ways to have a "correct" answer. 

In [44]:
display_example(answer_qids[1300])

Example 2548 of 11873
---------------------
Q: Where on Earth is free oxygen found?

Context: 
Free oxygen also occurs in solution in the world's water bodies. The increased solubility of O
2 at lower temperatures (see Physical properties) has important implications for ocean life, as polar oceans support a much higher density of life due to their higher oxygen content. Water polluted with plant nutrients such as nitrates or phosphates may stimulate growth of algae by a process called eutrophication and the decay of these organisms and other biomaterials may reduce amounts of O
2 in eutrophic water bodies. Scientists assess this aspect of water quality by measuring the water's biochemical oxygen demand, or the amount of O
2 needed to restore it to a normal concentration.

True Answers:
['water', "in solution in the world's water bodies", "the world's water bodies"]


#### A SQuAD negative example

The other half of the questions in the dev set do not have an answer in the corresponding passage. These questions were generated by crowd-sourced workers to be related and relevant to the passage but unanswerable by that passage. There are thus no True Answers associated with these questions, as we see in the example below. 

In [45]:
display_example(no_answer_qids[2500])

Example 4954 of 11873
---------------------
Q: Why is the theory of evolution so complex?

Context: 
The principle of faunal succession is based on the appearance of fossils in sedimentary rocks. As organisms exist at the same time period throughout the world, their presence or (sometimes) absence may be used to provide a relative age of the formations in which they are found. Based on principles laid out by William Smith almost a hundred years before the publication of Charles Darwin's theory of evolution, the principles of succession were developed independently of evolutionary thought. The principle becomes quite complex, however, given the uncertainties of fossilization, the localization of fossil types due to lateral changes in habitat (facies change in sedimentary strata), and that not all fossils may be found globally at the same time.

True Answers:
[]


# Metrics for QA

There are two dominant metrics used by many question answering datasets: exact match (EM) and F1 score. These scores are computed on individual question+answer pairs. When multiple correct answers are possible for a given question, the maximum score is computed. Scores over all examples are averaged providing a final EM and F1 score for the dev set. 


### Exact Match
This metric is as simple as it sounds. For each question/answer pair, if the characters of the model's prediction exactly match the characters of the known answer, EM = 1, otherwise EM = 0. This is a strict all-or-nothing metric; being off by a single character would result in a score of 0 for that prediction. When assessing against a negative example, if the model predicts any text at all it automatically receives a 0 for that example. 
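As a rough sketch of how the per-example maximum and the dev set average fit together, the snippet below scores three made-up predictions with a bare-bones exact match. The helper names (`simple_exact_match`, `best_score`) and the toy data are assumptions for illustration, and the comparison only strips whitespace and lowercases, so it is not the full evaluation procedure.

```python
def simple_exact_match(prediction, truth):
    """1 if the strings match after trimming and lowercasing, else 0."""
    return int(prediction.strip().lower() == truth.strip().lower())


def best_score(prediction, gold_answers, metric):
    """Score a prediction against every gold answer and keep the maximum."""
    return max(metric(prediction, gold) for gold in gold_answers)


# Invented examples: a negative example has the empty string as its only gold answer.
toy_examples = [
    {"prediction": "the world's water bodies",
     "gold": ["water", "in solution in the world's water bodies", "the world's water bodies"]},
    {"prediction": "Rayleigh scattering", "gold": [""]},  # model answered a negative example
    {"prediction": "", "gold": [""]},                     # model correctly abstained
]

per_example = [
    best_score(ex["prediction"], ex["gold"], simple_exact_match) for ex in toy_examples
]
dev_em = 100.0 * sum(per_example) / len(per_example)

print(per_example)  # [1, 0, 1]
print(dev_em)
```

The first prediction only matches one of its three gold answers, but taking the maximum means that single match is enough to earn full credit.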

### F1 
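F1 is a common metric in many kinds of analyses, so it shouldn't be a surprise to see it here. F1 is the harmonic mean of precision and recall. In this case, it's computed over the individual words in the prediction against those in the true answer. The number of shared words provides the basis of the F1 score: precision is the fraction of predicted words that appear in the true answer, and recall is the fraction of true-answer words that appear in the prediction.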


![](images/f1score.png "F1 score")
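A small worked example may help here. The sketch below computes a word-level F1 for one made-up prediction, counting shared words with a `Counter` (so repeated words are matched at most as many times as they occur on both sides). The function name `token_f1` and the example strings are illustrative assumptions, and no text normalization beyond lowercasing is applied.

```python
from collections import Counter


def token_f1(prediction, truth):
    """Word-level F1 between a predicted answer and one true answer."""
    pred_tokens = prediction.lower().split()
    truth_tokens = truth.lower().split()

    # If either side is empty (no-answer), F1 is 1 when both are empty, else 0.
    if not pred_tokens or not truth_tokens:
        return float(pred_tokens == truth_tokens)

    # Multiset intersection: number of words the two answers share.
    shared = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if shared == 0:
        return 0.0

    precision = shared / len(pred_tokens)   # shared words / predicted words
    recall = shared / len(truth_tokens)     # shared words / true-answer words
    return 2 * precision * recall / (precision + recall)


# 3 shared words: precision = 3/3, recall = 3/4, so F1 = 2 * 1 * 0.75 / 1.75 = 6/7
score = token_f1("world's water bodies", "the world's water bodies")
print(round(score, 4))  # 0.8571
```

Unlike exact match, the prediction gets substantial partial credit here even though it drops a word from the true answer.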

### Load a Transformer model fine-tuned on SQuAD 2.0

In [3]:
tokenizer = AutoTokenizer.from_pretrained("twmkn9/distilbert-base-uncased-squad2")
model = AutoModelForQuestionAnswering.from_pretrained("twmkn9/distilbert-base-uncased-squad2")

In [100]:
def get_prediction(qid):
    # given a question id (qas_id or qid), load the example, get the model outputs and generate an answer
    question = examples[qid_to_example_index[qid]].question_text
    context = examples[qid_to_example_index[qid]].context_text

    inputs = tokenizer.encode_plus(question, context, return_tensors='pt')

    outputs = model(**inputs)
    answer_start = torch.argmax(outputs[0])    # most likely start token: argmax of the start logits
    answer_end = torch.argmax(outputs[1]) + 1  # most likely end token; +1 so the slice below includes it

    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(inputs['input_ids'][0][answer_start:answer_end]))

    return answer

We can soften the blow of the Exact Match metric by normalizing the text before computation. Removing articles and punctuation, and standardizing whitespace are all typical text processing steps. 

In [98]:
# These functions are heavily influenced by the HF squad_metrics.py script
def normalize_text(s):
    """Lower text and remove punctuation, articles and extra whitespace."""
    import string, re

    def remove_articles(text):
        regex = re.compile(r"\b(a|an|the)\b", re.UNICODE)
        return re.sub(regex, " ", text)

    def white_space_fix(text):
        return " ".join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return "".join(ch for ch in text if ch not in exclude)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punc(lower(s))))

def compute_exact_match(prediction, truth):
    return int(normalize_text(prediction) == normalize_text(truth))

def compute_f1(prediction, truth):
    pred_tokens = normalize_text(prediction).split()
    truth_tokens = normalize_text(truth).split()
    
    # if either the prediction or the truth is no-answer then f1 = 1 if they agree, 0 otherwise
    if len(pred_tokens) == 0 or len(truth_tokens) == 0:
        return int(pred_tokens == truth_tokens)
    
    common_tokens = set(pred_tokens) & set(truth_tokens)
    
    # if there are no common tokens then f1 = 0
    if len(common_tokens) == 0:
        return 0
    
    prec = len(common_tokens) / len(pred_tokens)
    rec = len(common_tokens) / len(truth_tokens)
    
    return 2 * (prec * rec) / (prec + rec)
    
def get_gold_answers(example):
    gold_answers = [answer["text"] for answer in example.answers if answer["text"]]

    # if gold_answers doesn't exist it's because this is a negative example - 
    # the only correct answer is an empty string
    if not gold_answers:
        gold_answers = [""]
        
    return gold_answers

In [96]:
prediction = get_prediction(answer_qids[1300])
example = examples[qid_to_example_index[answer_qids[1300]]]

gold_answers = get_gold_answers(example)

em_score = max((compute_exact_match(prediction, answer)) for answer in gold_answers)
f1_score = max((compute_f1(prediction, answer)) for answer in gold_answers)

print(em_score, f1_score)

0 0.8


In [97]:
prediction = get_prediction(no_answer_qids[2500])
example = examples[qid_to_example_index[no_answer_qids[2500]]]

gold_answers = get_gold_answers(example)

em_score = max((compute_exact_match(prediction, answer)) for answer in gold_answers)
f1_score = max((compute_f1(prediction, answer)) for answer in gold_answers)

print(em_score, f1_score)

0 0


# Evaluating a model on the SQuAD dev set

The same `run_squad.py` script we used to fine-tune a Transformer for question answering can also be used to evaluate the model! Below are the arguments you'll need to properly evaluate a fine-tuned model for question answering on the SQuAD dev set. Because we're using SQuAD 2.0, it is **crucial** that you include the `--version_2_with_negative` flag!

In [103]:
!python run_squad.py  \
    --model_type distilbert   \
    --model_name_or_path twmkn9/distilbert-base-uncased-squad2  \
    --output_dir models/distilbert/twmkn9_distilbert-base-uncased-squad2 \
    --data_dir data/squad   \
    --predict_file dev-v2.0.json   \
    --do_eval   \
    --version_2_with_negative \
    --do_lower_case  \
    --per_gpu_eval_batch_size 12   \
    --max_seq_length 384   \
    --doc_stride 128


2020-06-02 19:14:41.314546: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:
2020-06-02 19:14:41.314635: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:
2020-06-02 19:14:41.314648: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
06/02/2020 19:14:42 - INFO - transformers.configuration_utils -   loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/twmkn9/distilbert-base-uncased-squad2/config.json from

06/02/2020 19:14:50 - INFO - __main__ -   Creating features from dataset file at data/squad
100%|███████████████████████████████████████████| 35/35 [00:05<00:00,  6.54it/s]
convert squad examples to features: 100%|█| 11873/11873 [02:02<00:00, 97.08it/s]
add example index and unique id: 100%|█| 11873/11873 [00:00<00:00, 461404.92it/s
06/02/2020 19:16:59 - INFO - __main__ -   Saving features into cached file data/squad/cached_dev_distilbert-base-uncased-squad2_384
06/02/2020 19:17:20 - INFO - __main__ -   ***** Running evaluation  *****
06/02/2020 19:17:20 - INFO - __main__ -     Num examples = 12232
06/02/2020 19:17:20 - INFO - __main__ -     Batch size = 12
Evaluating: 100%|███████████████████████████| 1020/1020 [02:23<00:00,  7.13it/s]
06/02/2020 19:19:43 - INFO - __main__ -     Evaluation done in total 143.126918 secs (0.011701 sec per example)
06/02/2020 19:19:43 - INFO - transformers.data.metrics.squad_metrics -   Writing predictions to: models/distilbert/twmkn9_distilbert-base-unc

We've evaluated a `distilbert` model that was fine-tuned on SQuAD2.0 by a member of the NLP community. When running the evaluation we see a number of steps performed: 

1. The dev set is loaded from disk
2. The examples are converted to features that can be directly fed to the model
3. These features are cached to disk
4. Evaluation proceeds in batches of 12 and finishes in about 2.5 minutes (this is because DistilBERT is much faster and more lightweight than BERT)
5. A slew of prediction outputs are written to disk
6. Overall model results are displayed
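One detail in the log is worth a closer look: it reports 12232 "examples" during evaluation even though the dev set holds 11873 questions. The larger number counts features, because a passage that doesn't fit within `--max_seq_length` is split into overlapping windows. The sketch below illustrates the idea under simplifying assumptions: `split_into_windows` is a hypothetical helper, windows are assumed to start every `doc_stride` tokens, and the room taken by question and special tokens is ignored.

```python
def split_into_windows(context_tokens, max_context_len, doc_stride):
    """Return (start, end) token spans covering the context with overlap.

    Simplified illustration: each window holds at most `max_context_len`
    context tokens and a new window starts every `doc_stride` tokens.
    """
    if doc_stride <= 0 or max_context_len <= 0:
        raise ValueError("max_context_len and doc_stride must be positive")

    windows = []
    start = 0
    while start < len(context_tokens):
        end = min(start + max_context_len, len(context_tokens))
        windows.append((start, end))
        if end == len(context_tokens):
            break  # the last window reached the end of the context
        start += doc_stride
    return windows


toy_context = [f"tok{i}" for i in range(10)]

# A short context fits in one window, so one example yields one feature.
print(split_into_windows(toy_context[:3], max_context_len=4, doc_stride=2))

# A long context is split into several overlapping windows (features).
print(split_into_windows(toy_context, max_context_len=4, doc_stride=2))
```

Each window is scored by the model separately, which is why the predictions for a single question later have to be reconciled across features.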

Let's start with the overall model results. 

In [105]:
Results = {
    # Scores averaged over all examples in the dev set
    'exact': 66.25958056093658,         
    'f1': 69.66994428499025,            
    'total': 11873,  # number of examples in the dev set
    
    # Scores averaged over only positive examples (have answers)
    'HasAns_exact': 68.91025641025641,  
    'HasAns_f1': 75.74076391627662,     
    'HasAns_total': 5928, # number of positive examples
    
    # Scores averaged over only negative examples (no answers)
    'NoAns_exact': 63.61648444070648, 
    'NoAns_f1': 63.61648444070648, 
    'NoAns_total': 5945, # number of negative examples
    
    # ***Given probabilities of no-answer for each example, what would the best scores and thresholds be? ***
    'best_exact': 66.25958056093658, 
    'best_exact_thresh': 0.0, 
    'best_f1': 69.66994428499046, 
    'best_f1_thresh': 0.0
}

The first three blocks of the `Results` output are pretty straightforward. EM and F1 scores are reported over the full dev set, the set of positive examples, and the set of negative examples. This can give you some insight into whether your model is performing adequately on both answer and no-answer questions (this particular model is pretty bad at no-answer questions). 
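As a quick sanity check on how these blocks relate, the overall score should be the average of the two subset scores weighted by the number of examples in each subset. The snippet below redoes that arithmetic using the exact-match numbers reported above; the variable names are just for illustration.

```python
# Exact-match numbers copied from the Results output above.
results = {
    "exact": 66.25958056093658,
    "total": 11873,
    "HasAns_exact": 68.91025641025641,
    "HasAns_total": 5928,
    "NoAns_exact": 63.61648444070648,
    "NoAns_total": 5945,
}

# Convert each subset percentage back into a count of correct examples.
has_ans_correct = results["HasAns_exact"] / 100 * results["HasAns_total"]
no_ans_correct = results["NoAns_exact"] / 100 * results["NoAns_total"]

# Recombine the two subsets into the overall exact-match percentage.
overall_exact = 100 * (has_ans_correct + no_ans_correct) / results["total"]

print(overall_exact)
print(abs(overall_exact - results["exact"]) < 1e-6)  # True
```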

However, what's going on with that fourth block? This portion of the output is not useful unless you supply the evaluation with additional information. And for that we'll need to dig a bit deeper into the evaluation process. 


### Computing predictions
When the tokenized question+context is passed to the model, the output consists of two sets of logits: one for the start of the answer span, the other for the end of the answer span. These logits are unnormalized scores: the larger the logit, the more likely the model considers that token to be the start or end of the answer. Every token passed to the model is assigned a logit, including special tokens (e.g., [CLS], [SEP]) and tokens corresponding to the question itself.  
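If it helps to think in terms of probabilities, logits can be turned into a distribution over token positions with a softmax. The sketch below does this in plain Python for a handful of made-up logit values; it is only meant to build intuition, since the candidate scoring described in this post works with sums of raw logits.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    largest = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up start logits for four token positions (position 0 plays the role of [CLS]).
toy_start_logits = [3.79, -9.11, 6.37, 5.73]
probs = softmax(toy_start_logits)

# The position with the largest logit also has the largest probability.
best_position = max(range(len(probs)), key=probs.__getitem__)
print(best_position)  # 2
print([round(p, 3) for p in probs])
```

Softmax preserves the ranking of the logits, so choosing the largest logit and choosing the most probable position give the same token.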

In [121]:
inputs = tokenizer.encode_plus(example.question_text, example.context_text, return_tensors='pt')
start_logits, end_logits = model(**inputs)

In [133]:
# Look at how large the logit is in the [CLS] position!  Strong possibility that this question has no answer... 
start_logits

tensor([[  3.7929,  -9.1130,  -9.6885,  -7.7843,  -8.2340, -10.3993,  -9.6347,
          -9.7876,  -9.9708,  -9.9662,  -8.7785,  -3.9508,  -5.5454,  -9.6513,
          -4.6236,  -9.8807,  -7.4872,  -8.7846,  -6.8423,  -8.9985,  -6.0774,
          -5.4852,  -9.7361,  -4.5017,  -9.2640,  -5.3205,  -7.2865,  -8.4630,
          -5.2088,  -4.2374,  -8.6602,  -9.9352,  -9.4512, -10.2235,  -9.5926,
         -10.6912, -10.3315, -10.6619,  -9.3309,  -9.5508,  -4.9999,  -6.2900,
         -10.5532,  -9.6481,  -9.9048, -10.9416,  -5.4683,  -9.1218, -10.2717,
          -9.9383,  -7.1546,  -7.1813,  -7.5937,  -6.7924,  -8.4716, -10.7609,
          -9.9768,  -9.2098,  -9.6693, -11.1971, -10.4625, -11.2761,  -8.7499,
          -8.4402,  -4.4782,  -8.3395,  -5.5925,  -9.0272, -10.4382, -10.0390,
          -7.3458,  -9.6450,  -8.6492,  -9.4535,  -9.6644, -10.3202,  -9.7984,
          -8.6443,  -9.0967,  -9.9739,  -6.8935,  -9.6771, -10.1303,  -9.9340,
          -8.6770, -10.0878,  -8.4293,  -7.1896,  -2

In our simple QA system, we predicted the best answer by selecting the start and end tokens with the largest logits, but that's not very robust. In practice, these logits are passed to a method that computes prediction scores. Tokens corresponding to the _n_ largest start_logits and the _n_ largest end_logits are selected as candidates. Any sensible combination of these start + end tokens is considered a candidate answer, but several consistency checks must be performed first. For example, an answer wherein the end token falls _before_ the start token should be excluded because that just doesn't make sense, even if these tokens were associated with the largest logits. Candidate answers wherein the start or end tokens are associated with question tokens are also excluded because the answer to the question should obviously not be in the question itself! It is important to note that the [CLS] token is **not** removed because this token indicates the null answer. 

In [159]:
# We can sort our list of start_logits by logit score and keep track of which token they're associated with
def to_list(tensor):
    return tensor.detach().cpu().tolist()

start_logits = to_list(start_logits)[0]
end_logits = to_list(end_logits)[0]

start_idx_and_logit = sorted(enumerate(start_logits), key=lambda x: x[1], reverse=True)
end_idx_and_logit = sorted(enumerate(end_logits), key=lambda x: x[1], reverse=True)


# The null answer token index (0) is in the top five, along with some other possible answer-start context tokens
print(start_idx_and_logit[:5])
print(end_idx_and_logit[:5]) 

[(109, 6.369344711303711), (107, 5.730345249176025), (108, 5.2197771072387695), (115, 4.952322483062744), (0, 3.792943000793457)]
[(113, 6.066335201263428), (126, 4.7193193435668945), (120, 4.3211588859558105), (134, 4.181004524230957), (0, 3.5367705821990967)]


In [155]:
start_indexes = [idx for idx, logit in start_idx_and_logit[:5]]
end_indexes = [idx for idx, logit in end_idx_and_logit[:5]]

tokens = to_list(inputs['input_ids'])[0]
# question tokens are defined as those between the CLS token (101, at position 0) and first SEP (102) token 
question_indexes = [i+1 for i, token in enumerate(tokens[1:tokens.index(102)])]
question_indexes

[1, 2, 3, 4, 5, 6, 7, 8, 9]

In [170]:
import collections

# keep track of all preliminary predictions
PrelimPrediction = collections.namedtuple(  # pylint: disable=invalid-name
    "PrelimPrediction", ["start_index", "end_index", "start_logit", "end_logit"]
)

prelim_preds = []
for start_index in start_indexes:
    for end_index in end_indexes:
        # throw out invalid predictions
        if start_index in question_indexes:
            continue
        if end_index in question_indexes:
            continue
        if end_index < start_index:
            continue
        prelim_preds.append(
            PrelimPrediction(
                start_index = start_index,
                end_index = end_index,
                start_logit = start_logits[start_index],
                end_logit = end_logits[end_index]
            )
        )

Prediction scores are computed on the remaining answer candidates' start and end tokens as the sum of their start and end logits. For a candidate answer i, score_i = start_logit_i + end_logit_i. The `n_best` candidate answers with the highest scores are retained (this number can be set using the `--n_best_size` flag of `run_squad.py` and defaults to 20). If the list of candidate answers does not contain the prediction score for the null answer, it is computed and added. The null answer score is computed as the sum of the logits corresponding to the [CLS] token in both the start_logits and end_logits lists. 

In [218]:
# sort preliminary predictions by their score
prelim_preds = sorted(prelim_preds, key=lambda x: (x.start_logit + x.end_logit), reverse=True)
len(prelim_preds)

20

In [219]:
tokens = to_list(inputs['input_ids'])[0]

# keep track of all best predictions
BestPrediction = collections.namedtuple(  # pylint: disable=invalid-name
    "BestPrediction", ["text", "start_logit", "end_logit"]
)

nbest = []
seen_predictions = []
for pred in prelim_preds:
    # the null answer (start_index == 0) is appended separately below
    if pred.start_index == 0:
        continue

    # NOTE: this slice stops just before the end token, which is why some answers
    # below look clipped; use pred.end_index + 1 to include the end token
    text = tokenizer.convert_tokens_to_string(
        tokenizer.convert_ids_to_tokens(
            tokens[pred.start_index:pred.end_index]
        )
    )
    # Clean whitespace
    text = text.strip()
    text = " ".join(text.split())

    # skip duplicate answer texts
    if text in seen_predictions:
        continue

    seen_predictions.append(text)
    nbest.append(BestPrediction(text=text, start_logit=pred.start_logit, end_logit=pred.end_logit))

# include the null answer
nbest.append(BestPrediction(text="", start_logit=start_logits[0], end_logit=end_logits[0]))


The top _n_ answers are kept and the null answer, along with its score, is added to the list. The list of `n_best` answers for each question is saved to disk as `nbest_predictions_.json`.

In [222]:
len(nbest)
nbest

[BestPrediction(text='uncertainties of fossil', start_logit=6.369344711303711, end_logit=6.066335201263428),
 BestPrediction(text='given the uncertainties of fossil', start_logit=5.730345249176025, end_logit=6.066335201263428),
 BestPrediction(text='the uncertainties of fossil', start_logit=5.2197771072387695, end_logit=6.066335201263428),
 BestPrediction(text='uncertainties of fossilization , the localization of fossil types due to lateral changes in', start_logit=6.369344711303711, end_logit=4.7193193435668945),
 BestPrediction(text='uncertainties of fossilization , the localization of fossil', start_logit=6.369344711303711, end_logit=4.3211588859558105),
 BestPrediction(text='uncertainties of fossilization , the localization of fossil types due to lateral changes in habitat ( facies change in sedimentary strata', start_logit=6.369344711303711, end_logit=4.181004524230957),
 BestPrediction(text='given the uncertainties of fossilization , the localization of fossil types due to latera

The last step is to compute the odds of the null answer. According to the original BERT paper, 

> We predict a non-null answer when $\hat{s}_{i,j} > s_{\text{null}} + \tau$, where the threshold $\tau$ is selected on the dev set to maximize F1.  

Restating that relationship, we can compute the difference between the null score and the non-null score:

In [221]:
# finally, we can compute the odds of the null answer
score_null = start_logits[0] + end_logits[0]
score_diff = score_null - nbest[0].start_logit - nbest[0].end_logit

score_diff

-5.105966329574585
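With the difference written this way (null score minus best non-null score), the decision rule becomes a single comparison: answer with the empty string when the difference exceeds a threshold, and with the best non-null candidate otherwise. The sketch below assumes that convention; `choose_answer` is a hypothetical helper, and its `threshold` plays the role of $-\tau$ in the BERT paper's notation.

```python
def choose_answer(best_non_null_text, score_diff, threshold=0.0):
    """Pick between the null answer and the best non-null candidate.

    score_diff is (null score) - (best non-null score). A large positive
    value means the model favors "no answer".
    """
    if score_diff > threshold:
        return ""  # null answer
    return best_non_null_text


# The example above: the null score is well below the best candidate's score,
# so with a threshold of 0 the model still returns a text answer.
print(repr(choose_answer("uncertainties of fossil", -5.105966329574585)))

# Lowering the threshold makes the system more willing to abstain.
print(repr(choose_answer("uncertainties of fossil", -5.105966329574585, threshold=-6.0)))
```

For our negative example, a threshold of 0 keeps the (incorrect) text answer, which is consistent with the EM and F1 of 0 we computed for it earlier.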

This `score_diff` is computed for every example in the dev set and these scores are saved to disk in the `null_odds_.json`. Let's pull up the score stored for the example we're using and see how we did!

In [226]:
import json
filename = 'models/distilbert/twmkn9_distilbert-base-uncased-squad2/null_odds_.json'
with open(filename, 'rb') as f:
    null_odds = json.load(f)

example.qas_id
null_odds[example.qas_id]

-5.105948209762573

Nailed it down to four decimal places! Good enough for me. 

### So what?

Now that we have an understanding of the prediction process, how to compute scores, and the difference between null scores and non-null scores, we can start to make sense of the fourth block of the results output with the help of the `null_odds_.json` file!