Model validation results
------
This notebook documents my model validation results. It compares the reference answers in a SQuAD Q&A validation dataset with the model's predictions, as generated by the `src.validate.validate` module.

The validation data comes from the file [dev-v2.0.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json) of the SQuAD dataset.

### Library imports

In [33]:
import pandas as pd
from sklearn.metrics import f1_score

### Data ingest

In [34]:
df = pd.read_csv('../data/processed/validate.csv')
df

| Unnamed: 0 | Title | Question | Answer | y_true | Prediction | Match | y_pred | PartialMatch |
|---|---|---|---|---|---|---|---|---|
| 0 | Normans | In what country is Normandy located? | france | 1 | france | 1 | 1 | 1 |
| 1 | Normans | When were the Normans in Normandy? | 10th and 11th centuries | 1 | 10th and 11th centuries | 1 | 1 | 1 |
| 2 | Normans | From which countries did the Norse originate? | denmark, iceland and norway | 1 | denmark, iceland and norway | 1 | 1 | 1 |
| 3 | Normans | Who was the Norse leader? | rollo | 1 | rollo | 1 | 1 | 1 |
| 4 | Normans | What century did the Normans first gain their ... | 10th century | 1 | 10th | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 468 | Southern_California | Which conference do the teams in southern Cali... | pac-12 | 1 | pac - 12 conference | 0 | 0 | 0 |
| 469 | Southern_California | The two listed teams play for which NCAA group? | division i | 1 | division i | 1 | 1 | 1 |
| 470 | Southern_California | What is a growing sport in southern California? | rugby | 1 | rugby | 1 | 1 | 1 |
| 471 | Southern_California | At which level of education is this sport beco... | high school | 1 | high school | 1 | 1 | 1 |
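
The `Unnamed: 0` column in the output above is what pandas produces when a CSV was written with an unnamed index and then read back without telling `read_csv` about it. As a small sketch, using a hypothetical two-row in-memory CSV rather than the real `validate.csv`, passing `index_col=0` restores that column as the index:

```python
import io

import pandas as pd

# Hypothetical stand-in for validate.csv. The empty first header field is
# what to_csv writes for an unnamed index.
csv_text = """,Title,Match
0,Normans,1
1,Normans,0
"""

default = pd.read_csv(io.StringIO(csv_text))
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default.columns))  # ['Unnamed: 0', 'Title', 'Match']
print(list(indexed.columns))  # ['Title', 'Match']
```

Whether this is worth changing in the notebook depends on whether any later cell relies on the `Unnamed: 0` column; none of the cells shown here do.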


### Calculate EM ('exact match') and F1 scores

For EM, we simply calculate the ratio of exact matches:

In [35]:
em = df['Match'].value_counts(normalize=True) * 100
print(f"We have achieved an EM score of {em[1]}%.")
em

We have achieved an EM score of 59.61945031712473%.


1    59.61945
0    40.38055
Name: Match, dtype: float64
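
Because `Match` only holds 0s and 1s, the same percentage can also be read off as the column mean. The sketch below uses a hypothetical five-row frame, not the real validation data, to show that both routes agree. The mean has the side benefit that it does not raise a `KeyError` when a column contains no 1s at all, which a lookup by label 1 on the `value_counts` result would.

```python
import pandas as pd

# Hypothetical toy data: 3 exact matches out of 5 questions.
toy = pd.DataFrame({"Match": [1, 1, 0, 1, 0]})

# Route used in this notebook: share of each value, then pick label 1.
shares = toy["Match"].value_counts(normalize=True) * 100
em_from_counts = shares.loc[1]

# Equivalent shortcut: the mean of a 0/1 column is the share of 1s.
em_from_mean = toy["Match"].mean() * 100

print(f"EM via value_counts: {em_from_counts:.2f}%")
print(f"EM via mean:         {em_from_mean:.2f}%")
```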

As a bonus, we can broaden our definition of "correct" to include partial matches. Visual inspection shows that many of the model's answers are shorter or longer versions of the SQuAD answers, so they are not necessarily bad. (Technically, only the shortest possible answer span counts as correct, but a human judge may find a slightly longer answer more useful because it provides a bit more context.)

In [38]:
partial = df['PartialMatch'].value_counts(normalize=True) * 100
print(f"We have achieved a partial-match score of {partial[1]}%.")
partial

We have achieved a partial-match score of 83.72093023255815%.


1    83.72093
0    16.27907
Name: PartialMatch, dtype: float64
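
How `src.validate.validate` derives the `PartialMatch` column is not shown in this notebook. One plausible rule, given the description of "shorter or longer versions" of the reference answer, is substring containment in either direction. The helper below is a hypothetical sketch of that rule, not the module's actual implementation; it does happen to reproduce the `PartialMatch` values of the rows shown in the table above.

```python
def is_partial_match(answer: str, prediction: str) -> bool:
    """Hypothetical rule: one string contains the other (case-insensitive)."""
    a = answer.strip().lower()
    p = prediction.strip().lower()
    if not a or not p:
        # An empty string is contained in everything, so exclude it explicitly.
        return False
    return a in p or p in a


# Rows taken from the table above.
print(is_partial_match("10th century", "10th"))          # True
print(is_partial_match("pac-12", "pac - 12 conference"))  # False
print(is_partial_match("rollo", "rollo"))                 # True
```

Note that a rule like this is sensitive to tokenization differences such as `pac-12` versus `pac - 12`, which is why that row counts as a miss even though a human reader would probably accept it.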

By default, scikit-learn's `f1_score` expects binary class labels. For F1, we therefore apply it to two proxy ('dummy') columns: the _true_ value (always 1, since the SQuAD reference answer is correct by definition) and the _predicted_ value (1 if the model's prediction matched exactly, 0 if it did not).

In [36]:
f1 = f1_score(df['y_true'], df['y_pred'])
print(f"We have achieved an F1 score of {f1:.0%}.")

We have achieved an F1 score of 75%.
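
With this proxy setup, every true label is 1, so there can be no false positives: precision is always 1 and recall equals the EM ratio. The F1 score therefore reduces to 2·EM / (1 + EM). The plain-Python sketch below (function names are my own, for illustration only) spells out that reasoning and checks it against the EM and partial-match ratios reported in this notebook.

```python
def proxy_f1(y_pred):
    """F1 for binary predictions when every true label is 1."""
    tp = sum(1 for p in y_pred if p == 1)  # predicted 1, truth 1
    fn = sum(1 for p in y_pred if p == 0)  # predicted 0, truth 1
    fp = 0                                 # impossible: truth is never 0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def f1_from_em(em_ratio):
    """Closed form of the same quantity: precision = 1, recall = EM."""
    return 2 * em_ratio / (1 + em_ratio)


toy_pred = [1, 1, 0, 1, 0]  # hypothetical predictions, EM ratio 0.6
print(proxy_f1(toy_pred), f1_from_em(0.6))

# EM and partial-match ratios as reported in this notebook.
print(round(f1_from_em(0.5961945031712473), 2))  # 0.75
print(round(f1_from_em(0.8372093023255815), 2))  # 0.91
```

In other words, this F1 carries the same information as EM on a different scale.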


We can apply the same broader definition of correct to F1, counting partial matches as accurate predictions:

In [37]:
f1 = f1_score(df['y_true'], df['PartialMatch'])
print(f"We have achieved an alternative F1 score (on partial matches) of {f1:.0%}.")

We have achieved an alternative F1 score (on partial matches) of 91%.
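
For comparison, F1 scores reported for SQuAD are usually computed per question over the tokens shared between prediction and reference, then averaged, which gives graded credit to partial answers instead of a 0/1 flag. The function below is a simplified sketch of that idea, assuming plain whitespace tokenization and lowercasing only. It is not the official evaluation script, which also normalizes punctuation and articles, and its numbers are not comparable to the scores above.

```python
from collections import Counter


def token_f1(answer: str, prediction: str) -> float:
    """Simplified token-overlap F1 between a reference and a prediction."""
    gold = answer.lower().split()
    pred = prediction.lower().split()
    # Multiset intersection: each shared token counts as often as it
    # appears in both strings.
    overlap = sum((Counter(gold) & Counter(pred)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


# Rows taken from the table above.
print(token_f1("france", "france"))               # 1.0
print(round(token_f1("10th century", "10th"), 2))  # 0.67
print(token_f1("pac-12", "pac - 12 conference"))   # 0.0
```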
