QAEval is a question-answering-based metric for estimating the content quality of a summary [1]. It generates QA pairs from reference summaries, then uses a QA model to answer those questions against a candidate summary. The final score is the proportion of questions answered correctly.
Here is a demo of using the QAEval metric.
After installing SacreROUGE, you must also install the `qaeval` package:

    pip install qaeval

The `qaeval` package uses PyTorch, Transformers, and AllenNLP. To keep SacreROUGE's required dependencies light, we chose not to incorporate the QAEval code into this repository, so you must install the `qaeval` package or the code will crash.
Then, QAEval uses pretrained question-generation and question-answering models, which must be downloaded:

    sacrerouge setup-metric qa-eval

By default, this will download the model files to `~/.sacrerouge/metrics/qaeval/models`. If you want to change this directory, set the environment variable `SACREROUGE_DATA_ROOT` to whatever directory you want (instead of `~/.sacrerouge`).
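For example, to keep the data under a custom root (the path below is a hypothetical example; substitute any directory you like):

```shell
# Store SacreROUGE data (including the QAEval models) under a custom root
# before running the setup command. /data/sacrerouge is an example path.
export SACREROUGE_DATA_ROOT=/data/sacrerouge
echo "$SACREROUGE_DATA_ROOT"
```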
To test your setup, run the following code:
>>> import json
>>> from sacrerouge.metrics import QAEval
>>>
>>> summary1 = 'Dan walked to the bakery this morning.'
>>> reference1 = 'Dan went to buy scones earlier this morning.'
>>>
>>> # This line will load the generation and answer models into memory, so it may take some time to complete.
>>> qaeval = QAEval()
>>>
>>> # Score an individual summary against a list of reference summaries. This example
>>> # uses only one reference, so it is wrapped in a list.
>>> scores = qaeval.score(summary1, [reference1])
>>> print(scores)
{'qa-eval': {'em': 0.5, 'f1': 0.5}}
>>>
>>> # To run batch scoring, use the score_all function and pass a list of summaries and
>>> # a list of reference lists. Again, each instance here has only one reference, so it
>>> # is wrapped in a list.
>>> summary2 = 'Roger Federer beat Rafael Nadal yesterday.'
>>> reference2 = 'Yesterday, Nadal lost to Federer'
>>> # scores_list is a list of size 2. scores_list[0] is the scores for summary1, and scores_list[1] for summary2
>>> scores_list = qaeval.score_all([summary1, summary2], [[reference1], [reference2]])
>>>
>>> # If you want the QA pairs used to score the summaries, add the return_qa_pairs=True
>>> # argument to any of the scoring methods. A tuple of size 2 will be returned: the
>>> # first item is the scores as above, and the second is the QA pairs.
>>> scores, qas = qaeval.score(summary2, [reference2], return_qa_pairs=True)
>>>
>>> # qas[i][j] is the j-th QA pair for the i-th reference summary. "probability" is the
>>> # QA model's probability for its prediction; "null_probability" is its probability
>>> # that there is no answer.
>>> print(json.dumps(qas[0][0], indent=2))
{
  "question": {
    "question_id": "915ed522cfe7b798bd23f299a6eca192",
    "question": "Who lost to Federer yesterday?",
    "answer": "Nadal",
    "sent_start": 0,
    "sent_end": 32,
    "answer_start": 11,
    "answer_end": 16
  },
  "prediction": {
    "prediction_id": "4a7d1ed414474e4033ac29ccb8653d9b",
    "prediction": "Rafael Nadal",
    "probability": 0.9939968367261187,
    "null_probability": 1.9474517469108735e-06,
    "em": 0,
    "f1": 0.6666666666666666
  }
}
>>>
>>> # If you pass return_qa_pairs=True to score_all, the output looks like this: "results"
>>> # is parallel to "scores_list" from before, but each entry is a tuple of the scores
>>> # and the QA pairs instead of only the scores.
>>> results = qaeval.score_all([summary1, summary2], [[reference1], [reference2]], return_qa_pairs=True)
>>>
>>> scores1, qas1 = results[0]
>>> scores2, qas2 = results[1]
>>> print(json.dumps(qas1[0][0], indent=2))
{
  "question": {
    "question_id": "7cd86ecb09aa48c6e620b340f6a74592",
    "question": "Who went to buy scones earlier this morning?",
    "answer": "Dan",
    "sent_start": 0,
    "sent_end": 44,
    "answer_start": 0,
    "answer_end": 3
  },
  "prediction": {
    "prediction_id": "4a7d1ed414474e4033ac29ccb8653d9b",
    "prediction": "Dan",
    "probability": 0.9986048904063031,
    "null_probability": 2.303598142577244e-06,
    "em": 1,
    "f1": 1.0
  }
}
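The per-question `em` and `f1` values above compare the predicted answer against the gold answer. Here is a minimal sketch, assuming the standard SQuAD-style comparison (lowercasing, stripping punctuation and articles, then token-overlap F1); the actual `qaeval` code may differ in details.

```python
import re
import string

def normalize(text):
    """SQuAD-style answer normalization: lowercase, remove punctuation,
    remove the articles a/an/the, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    return float(normalize(prediction) == normalize(gold))

def token_f1(prediction, gold):
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = sum(min(pred_tokens.count(t), gold_tokens.count(t))
                 for t in set(gold_tokens))
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Reproduces the values in the QA pair above:
print(exact_match("Rafael Nadal", "Nadal"))  # 0.0
print(token_f1("Rafael Nadal", "Nadal"))     # 0.666...
```

With prediction "Rafael Nadal" and gold answer "Nadal", one of two predicted tokens matches (precision 0.5) and the single gold token is covered (recall 1.0), giving F1 = 2/3, the value shown in the output above.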
Here are the correlations of QAEval metrics to the "overall responsiveness" scores on the TAC datasets. They differ slightly from those reported in the paper for reasons listed here.
Summary-level, peers only:
| | TAC2008 | | | TAC2009 | | | TAC2010 | | | TAC2011 | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | r | p | k | r | p | k | r | p | k | r | p | k |
| QA-EM | 0.35 | 0.35 | 0.29 | 0.44 | 0.41 | 0.33 | 0.43 | 0.43 | 0.36 | 0.41 | 0.39 | 0.32 |
| QA-F1 | 0.46 | 0.45 | 0.36 | 0.49 | 0.46 | 0.37 | 0.55 | 0.55 | 0.44 | 0.50 | 0.46 | 0.37 |
| QA-LERC | 0.50 | 0.49 | 0.40 | 0.53 | 0.48 | 0.38 | 0.61 | 0.59 | 0.48 | 0.55 | 0.49 | 0.40 |
Summary-level, peers + references:
| | TAC2008 | | | TAC2009 | | | TAC2010 | | | TAC2011 | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | r | p | k | r | p | k | r | p | k | r | p | k |
| QA-EM | 0.49 | 0.43 | 0.35 | 0.47 | 0.47 | 0.37 | 0.53 | 0.50 | 0.41 | 0.45 | 0.42 | 0.34 |
| QA-F1 | 0.61 | 0.52 | 0.43 | 0.55 | 0.53 | 0.42 | 0.65 | 0.62 | 0.51 | 0.56 | 0.51 | 0.41 |
| QA-LERC | 0.64 | 0.57 | 0.46 | 0.61 | 0.55 | 0.44 | 0.69 | 0.66 | 0.55 | 0.62 | 0.55 | 0.45 |
System-level, peers only:
| | TAC2008 | | | TAC2009 | | | TAC2010 | | | TAC2011 | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | r | p | k | r | p | k | r | p | k | r | p | k |
| QA-EM | 0.92 | 0.89 | 0.74 | 0.71 | 0.88 | 0.71 | 0.91 | 0.88 | 0.72 | 0.90 | 0.78 | 0.59 |
| QA-F1 | 0.90 | 0.86 | 0.68 | 0.78 | 0.88 | 0.72 | 0.93 | 0.88 | 0.75 | 0.94 | 0.82 | 0.64 |
| QA-LERC | 0.88 | 0.85 | 0.67 | 0.86 | 0.88 | 0.71 | 0.91 | 0.88 | 0.73 | 0.95 | 0.85 | 0.68 |
System-level, peers + references:
| | TAC2008 | | | TAC2009 | | | TAC2010 | | | TAC2011 | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | r | p | k | r | p | k | r | p | k | r | p | k |
| QA-EM | 0.97 | 0.92 | 0.79 | 0.68 | 0.92 | 0.77 | 0.96 | 0.93 | 0.79 | 0.81 | 0.81 | 0.63 |
| QA-F1 | 0.96 | 0.90 | 0.74 | 0.78 | 0.92 | 0.77 | 0.97 | 0.93 | 0.81 | 0.89 | 0.88 | 0.72 |
| QA-LERC | 0.95 | 0.89 | 0.73 | 0.88 | 0.92 | 0.76 | 0.96 | 0.93 | 0.80 | 0.92 | 0.91 | 0.76 |
Here are the correlations of the metrics to the expert-based relevance judgments from Fabbri et al. (2020).
Summary-level, peers only:
| | Fabbri2020 | | |
|---|---|---|---|
| | r | p | k |
| QA-EM | 0.23 | 0.23 | 0.19 |
| QA-F1 | 0.30 | 0.29 | 0.22 |
| QA-LERC | 0.34 | 0.31 | 0.24 |
System-level, peers only:
| | Fabbri2020 | | |
|---|---|---|---|
| | r | p | k |
| QA-EM | 0.80 | 0.91 | 0.77 |
| QA-F1 | 0.82 | 0.91 | 0.77 |
| QA-LERC | 0.80 | 0.90 | 0.77 |
Here are the correlations to the annotations provided by Bhandari et al. (2020). Summary-level:
| | Bhandari2020-Abs | | | Bhandari2020-Ext | | | Bhandari2020-Mix | | |
|---|---|---|---|---|---|---|---|---|---|
| | r | p | k | r | p | k | r | p | k |
| QA-EM | 0.34 | 0.34 | 0.29 | 0.09 | 0.07 | 0.06 | 0.25 | 0.24 | 0.20 |
| QA-F1 | 0.45 | 0.43 | 0.35 | 0.17 | 0.13 | 0.11 | 0.35 | 0.34 | 0.27 |
| QA-LERC | 0.48 | 0.46 | 0.37 | 0.19 | 0.14 | 0.11 | 0.38 | 0.36 | 0.28 |
System-level:
| | Bhandari2020-Abs | | | Bhandari2020-Ext | | | Bhandari2020-Mix | | |
|---|---|---|---|---|---|---|---|---|---|
| | r | p | k | r | p | k | r | p | k |
| QA-EM | 0.86 | 0.81 | 0.64 | 0.30 | 0.19 | 0.13 | 0.84 | 0.84 | 0.66 |
| QA-F1 | 0.92 | 0.90 | 0.77 | 0.40 | 0.35 | 0.24 | 0.90 | 0.88 | 0.72 |
| QA-LERC | 0.93 | 0.92 | 0.79 | 0.52 | 0.61 | 0.38 | 0.89 | 0.90 | 0.74 |
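The r, p, and k columns in the tables above denote Pearson's r, Spearman's ρ, and Kendall's τ, the correlation coefficients SacreROUGE reports. For reference, here is a minimal pure-Python sketch of the three (assuming no tied values for the rank-based coefficients; for real evaluations, prefer a tested library such as `scipy.stats`):

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson's r: covariance normalized by the standard deviations."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def ranks(xs):
    # Assumes distinct values; ties would need average ranks.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r

def spearman(xs, ys):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))

def kendall(xs, ys):
    """Kendall's tau: (concordant - discordant) pairs over all pairs."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Toy example: correlating metric scores with human judgments for 4 systems.
metric_scores = [0.35, 0.46, 0.50, 0.41]
human_scores = [2.1, 3.0, 3.4, 2.6]
print(round(pearson(metric_scores, human_scores), 2))  # ≈ 1.0
```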
If you use this metric, please cite the following work:
@misc{deutsch2020questionanswering,
    title={{Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary}},
    author={Daniel Deutsch and Tania Bedrax-Weiss and Dan Roth},
    year={2020},
    eprint={2010.00490},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
[1] Daniel Deutsch, Tania Bedrax-Weiss, and Dan Roth. Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary. 2020.