QAEval

QAEval is a question-answering based metric for estimating the content quality of a summary [1]. It generates QA pairs from the reference summaries, then uses a QA model to answer the questions against a candidate summary. The final score is the proportion of questions that were answered correctly.
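At a high level, the metric can be thought of as the following procedure. This is only a minimal sketch of the idea, not the actual qaeval implementation: the `generate_qa_pairs` and `answer_question` callables are hypothetical stand-ins for the pretrained question-generation and question-answering models described below, and only the exact-match variant of the score is shown.

```python
# A minimal sketch of the scoring idea only, not the qaeval implementation.
# `generate_qa_pairs` and `answer_question` are hypothetical placeholders for the
# pretrained question-generation and question-answering models.
from typing import Callable, List, Tuple

def qaeval_sketch(
    candidate: str,
    references: List[str],
    generate_qa_pairs: Callable[[str], List[Tuple[str, str]]],
    answer_question: Callable[[str, str], str],
) -> float:
    correct, total = 0, 0
    for reference in references:
        for question, gold_answer in generate_qa_pairs(reference):
            prediction = answer_question(question, candidate)
            correct += int(prediction.strip().lower() == gold_answer.strip().lower())
            total += 1
    # Exact-match variant: the fraction of reference questions answered correctly
    return correct / total if total else 0.0
```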

Here is a demo of using the QAEval metric.

Setting Up

After installing SacreROUGE, you must also install the qaeval package:

pip install qaeval

The qaeval package uses PyTorch, Transformers, and AllenNLP. In order to keep the required dependencies of SacreROUGE light, we chose not to incorporate the QAEval code into this repository. Therefore, you must install the qaeval package yourself, or else the code will crash.
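If you want to confirm the package is available before downloading the models, an optional sanity check like the following (not required by SacreROUGE) should pass:

```python
# Optional sanity check: confirm the qaeval package is importable.
import importlib.util

if importlib.util.find_spec("qaeval") is None:
    raise ImportError("The qaeval package is not installed; run `pip install qaeval` first.")
```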

Then, QAEval uses pretrained question-generation and question-answering models, which must be downloaded:

sacrerouge setup-metric qa-eval

By default, this will download the model files to ~/.sacrerouge/metrics/qaeval/models. If you want to change this directory, set the environment variable SACREROUGE_DATA_ROOT to whatever directory you want to use instead of ~/.sacrerouge.
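For example, if you drive SacreROUGE from Python, the variable can be set in-process before the metric is constructed; for the command-line setup step above, export the same variable in your shell instead. The path below is only an illustration:

```python
# Illustration only: /data/sacrerouge is a made-up path, not a default.
# Set this before constructing QAEval so the models are looked up under the custom root.
import os

os.environ["SACREROUGE_DATA_ROOT"] = "/data/sacrerouge"
```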

To test your setup, run the following code:

>>> import json
>>> from sacrerouge.metrics import QAEval
>>> 
>>> summary1 = 'Dan walked to the bakery this morning.'
>>> reference1 = 'Dan went to buy scones earlier this morning.'
>>> 
>>> # This line will load the generation and answer models into memory, so it may take some time to complete.
>>> qaeval = QAEval()
>>> 
>>> # Score an individual summary against a list of reference summaries. This example
>>> # only uses 1 reference, so it is wrapped in a list.
>>> scores = qaeval.score(summary1, [reference1])
>>> print(scores)
{'qa-eval': {'em': 0.5, 'f1': 0.5}}
>>>
>>> # To run batch scoring, use the score_all function and pass a list of summaries and
>>> # a list of lists of references. Again, each instance here only has 1 reference, so it is
>>> # wrapped in a list.
>>> summary2 = 'Roger Federer beat Rafael Nadal yesterday.'
>>> reference2 = 'Yesterday, Nadal lost to Federer'
>>> # scores_list is a list of size 2. scores_list[0] is the scores for summary1, and scores_list[1] for summary2
>>> scores_list = qaeval.score_all([summary1, summary2], [[reference1], [reference2]])
>>>
>>> # If you want the QA pairs used to score the summaries to be returned, add the return_qa_pairs=True
>>> # argument to any of the scoring methods. A tuple of size 2 will be returned. The first item is the
>>> # scores like above. The second item is the QA pairs.
>>> scores, qas = qaeval.score(summary2, [reference2], return_qa_pairs=True)
>>> 
>>> # qas[i][j] is the j-th QA pair for the i-th reference summary. The "probability" is the QA model's
>>> # probability for the prediction. "null_probability" is its probability that there is no answer.
>>> print(json.dumps(qas[0][0], indent=2))
{
  "question": {
    "question_id": "915ed522cfe7b798bd23f299a6eca192",
    "question": "Who lost to Federer yesterday?",
    "answer": "Nadal",
    "sent_start": 0,
    "sent_end": 32,
    "answer_start": 11,
    "answer_end": 16
  },
  "prediction": {
    "prediction_id": "4a7d1ed414474e4033ac29ccb8653d9b",
    "prediction": "Rafael Nadal",
    "probability": 0.9939968367261187,
    "null_probability": 1.9474517469108735e-06,
    "em": 0,
    "f1": 0.6666666666666666
  }
}
>>>
>>> # If you pass the return_qa_pairs=True flag to score_all, it looks like this. "results" is parallel to
>>> # "scores_list" from before, but each entry is a tuple of the scores and the QA pairs.
>>> results = qaeval.score_all([summary1, summary2], [[reference1], [reference2]], return_qa_pairs=True)
>>> 
>>> scores1, qas1 = results[0]
>>> scores2, qas2 = results[1]
>>> print(json.dumps(qas1[0][0], indent=2))
{
  "question": {
    "question_id": "7cd86ecb09aa48c6e620b340f6a74592",
    "question": "Who went to buy scones earlier this morning?",
    "answer": "Dan",
    "sent_start": 0,
    "sent_end": 44,
    "answer_start": 0,
    "answer_end": 3
  },
  "prediction": {
    "prediction_id": "4a7d1ed414474e4033ac29ccb8653d9b",
    "prediction": "Dan",
    "probability": 0.9986048904063031,
    "null_probability": 2.303598142577244e-06,
    "em": 1,
    "f1": 1.0
  }
}
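Continuing the session above, the nested QA-pair structure can be inspected programmatically. For example, to list each question, prediction, and F1 score for summary1 (this snippet assumes the `qas1` variable from the example above):

```python
# Continuing the example above: qas1[i][j] is the j-th QA pair for the i-th reference of summary1.
for ref_index, ref_qas in enumerate(qas1):
    for qa in ref_qas:
        question = qa["question"]["question"]
        prediction = qa["prediction"]["prediction"]
        f1 = qa["prediction"]["f1"]
        print(ref_index, question, prediction, f1)
```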

Correlations

Here are the correlations of the QAEval metrics to the "overall responsiveness" scores on the TAC datasets. They differ slightly from those reported in the paper for reasons listed here. In the tables below, r, p, and k denote the Pearson, Spearman, and Kendall correlation coefficients, respectively.
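As a rough illustration of what the r, p, and k columns measure (this is not the evaluation script used to produce the tables), the three coefficients for paired lists of metric scores and human judgments could be computed as follows; the numbers in the snippet are made up:

```python
# Rough illustration of the three correlation coefficients reported below.
# `metric_scores` and `human_scores` are made-up numbers for demonstration only.
from scipy.stats import pearsonr, spearmanr, kendalltau

metric_scores = [0.10, 0.35, 0.50, 0.20, 0.80]
human_scores = [1.0, 3.0, 4.0, 2.0, 5.0]

r, _ = pearsonr(metric_scores, human_scores)
p, _ = spearmanr(metric_scores, human_scores)
k, _ = kendalltau(metric_scores, human_scores)
print(f"r={r:.2f}, p={p:.2f}, k={k:.2f}")
```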

Summary-level, peers only:

| | TAC2008 r | TAC2008 p | TAC2008 k | TAC2009 r | TAC2009 p | TAC2009 k | TAC2010 r | TAC2010 p | TAC2010 k | TAC2011 r | TAC2011 p | TAC2011 k |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| QA-EM | 0.35 | 0.35 | 0.29 | 0.44 | 0.41 | 0.33 | 0.43 | 0.43 | 0.36 | 0.41 | 0.39 | 0.32 |
| QA-F1 | 0.46 | 0.45 | 0.36 | 0.49 | 0.46 | 0.37 | 0.55 | 0.55 | 0.44 | 0.50 | 0.46 | 0.37 |
| QA-LERC | 0.50 | 0.49 | 0.40 | 0.53 | 0.48 | 0.38 | 0.61 | 0.59 | 0.48 | 0.55 | 0.49 | 0.40 |

Summary-level, peers + references:

| | TAC2008 r | TAC2008 p | TAC2008 k | TAC2009 r | TAC2009 p | TAC2009 k | TAC2010 r | TAC2010 p | TAC2010 k | TAC2011 r | TAC2011 p | TAC2011 k |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| QA-EM | 0.49 | 0.43 | 0.35 | 0.47 | 0.47 | 0.37 | 0.53 | 0.50 | 0.41 | 0.45 | 0.42 | 0.34 |
| QA-F1 | 0.61 | 0.52 | 0.43 | 0.55 | 0.53 | 0.42 | 0.65 | 0.62 | 0.51 | 0.56 | 0.51 | 0.41 |
| QA-LERC | 0.64 | 0.57 | 0.46 | 0.61 | 0.55 | 0.44 | 0.69 | 0.66 | 0.55 | 0.62 | 0.55 | 0.45 |

System-level, peers only:

| | TAC2008 r | TAC2008 p | TAC2008 k | TAC2009 r | TAC2009 p | TAC2009 k | TAC2010 r | TAC2010 p | TAC2010 k | TAC2011 r | TAC2011 p | TAC2011 k |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| QA-EM | 0.92 | 0.89 | 0.74 | 0.71 | 0.88 | 0.71 | 0.91 | 0.88 | 0.72 | 0.90 | 0.78 | 0.59 |
| QA-F1 | 0.90 | 0.86 | 0.68 | 0.78 | 0.88 | 0.72 | 0.93 | 0.88 | 0.75 | 0.94 | 0.82 | 0.64 |
| QA-LERC | 0.88 | 0.85 | 0.67 | 0.86 | 0.88 | 0.71 | 0.91 | 0.88 | 0.73 | 0.95 | 0.85 | 0.68 |

System-level, peers + references:

| | TAC2008 r | TAC2008 p | TAC2008 k | TAC2009 r | TAC2009 p | TAC2009 k | TAC2010 r | TAC2010 p | TAC2010 k | TAC2011 r | TAC2011 p | TAC2011 k |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| QA-EM | 0.97 | 0.92 | 0.79 | 0.68 | 0.92 | 0.77 | 0.96 | 0.93 | 0.79 | 0.81 | 0.81 | 0.63 |
| QA-F1 | 0.96 | 0.90 | 0.74 | 0.78 | 0.92 | 0.77 | 0.97 | 0.93 | 0.81 | 0.89 | 0.88 | 0.72 |
| QA-LERC | 0.95 | 0.89 | 0.73 | 0.88 | 0.92 | 0.76 | 0.96 | 0.93 | 0.80 | 0.92 | 0.91 | 0.76 |

Here are the correlations of the metrics to the expert-based relevance judgments from Fabbri et al. (2020).

Summary-level, peers only:

| | Fabbri2020 r | Fabbri2020 p | Fabbri2020 k |
|---|---|---|---|
| QA-EM | 0.23 | 0.23 | 0.19 |
| QA-F1 | 0.30 | 0.29 | 0.22 |
| QA-LERC | 0.34 | 0.31 | 0.24 |

System-level, peers only:

| | Fabbri2020 r | Fabbri2020 p | Fabbri2020 k |
|---|---|---|---|
| QA-EM | 0.80 | 0.91 | 0.77 |
| QA-F1 | 0.82 | 0.91 | 0.77 |
| QA-LERC | 0.80 | 0.90 | 0.77 |

Here are the correlations to the annotations provided by Bhandari et al. (2020).

Summary-level:

| | Bhandari2020-Abs r | Bhandari2020-Abs p | Bhandari2020-Abs k | Bhandari2020-Ext r | Bhandari2020-Ext p | Bhandari2020-Ext k | Bhandari2020-Mix r | Bhandari2020-Mix p | Bhandari2020-Mix k |
|---|---|---|---|---|---|---|---|---|---|
| QA-EM | 0.34 | 0.34 | 0.29 | 0.09 | 0.07 | 0.06 | 0.25 | 0.24 | 0.20 |
| QA-F1 | 0.45 | 0.43 | 0.35 | 0.17 | 0.13 | 0.11 | 0.35 | 0.34 | 0.27 |
| QA-LERC | 0.48 | 0.46 | 0.37 | 0.19 | 0.14 | 0.11 | 0.38 | 0.36 | 0.28 |

System-level:

| | Bhandari2020-Abs r | Bhandari2020-Abs p | Bhandari2020-Abs k | Bhandari2020-Ext r | Bhandari2020-Ext p | Bhandari2020-Ext k | Bhandari2020-Mix r | Bhandari2020-Mix p | Bhandari2020-Mix k |
|---|---|---|---|---|---|---|---|---|---|
| QA-EM | 0.86 | 0.81 | 0.64 | 0.30 | 0.19 | 0.13 | 0.84 | 0.84 | 0.66 |
| QA-F1 | 0.92 | 0.90 | 0.77 | 0.40 | 0.35 | 0.24 | 0.90 | 0.88 | 0.72 |
| QA-LERC | 0.93 | 0.92 | 0.79 | 0.52 | 0.61 | 0.38 | 0.89 | 0.90 | 0.74 |

Citation

If you use this metric, please cite the following work:

@misc{deutsch2020questionanswering,
      title={{Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary}}, 
      author={Daniel Deutsch and Tania Bedrax-Weiss and Dan Roth},
      year={2020},
      eprint={2010.00490},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

References

[1] Daniel Deutsch, Tania Bedrax-Weiss, and Dan Roth. Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary. arXiv:2010.00490, 2020.