
Special eval metrics and scripts #457

Closed · 7 of 11 tasks
awebson opened this issue Sep 26, 2021 · 14 comments
@awebson
Contributor

awebson commented Sep 26, 2021

Since we have the string outputs of all tasks, in principle we should be able to run arbitrary metrics, especially for datasets that require fancy metrics. @lintangsutawika has imported the official eval scripts for ReCoRD, SQuAD v2, Natural Questions, TriviaQA, and DROP.

Update: Even when using Lintang's eval scripts, all extractive QA and closed-book (generative) QA tasks still have abnormally low numbers, namely:

  • ReCoRD
  • SQuAD2 (contains unanswerables)
  • DROP
  • CoQA (contains unanswerables, multiple questions per example)
  • Quac (contains unanswerables, multiple questions per example)
  • Natural Questions
  • TriviaQA
  • WebQuestions

Also, I think eval of the extractive QA tasks from the training mixture failed.

(Note that ARC is closed-book, but its performance is fine because it's multiple-choice. A good case in point that machine task categories depend far more on format than on human skill/knowledge.)

Others with issues to keep an eye on:

  • HellaSwag
  • Lambada
  • Winogrande
@awebson awebson self-assigned this Sep 26, 2021
@awebson awebson changed the title Special eval scripts Special eval metrics and scripts Sep 26, 2021
@lintangsutawika
Contributor

There are a few points to note for some datasets:

  1. Natural Questions - The original eval script measures text spans, while T0 generates free-form answers. We need to agree on whether to map the generated text to a span or, alternatively, convert the ground-truth span to a text string (see the sketch after this list). I think the latter would be better.
  2. CoQA - The evaluation set has 7,983 samples, spread over 500 contexts with multiple questions each. Evaluation is done on all the questions, while the HF version given to the model only has 500 samples. We might need to reformat the HF dataset so that each sample consists of one context+question pair; currently, all the questions for one context are packed into a list.
  3. QuAC - Same situation as CoQA.
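
Here is a minimal sketch of option 1's second alternative (converting gold spans to text strings), assuming the simplified NQ format where document_text is a whitespace-joined token sequence and short answers carry token indices; the field names are assumptions to check against the actual eval script:

def gold_spans_to_text(example):
    # Split the simplified-format document into its tokens.
    tokens = example["document_text"].split(" ")
    gold_strings = []
    for annotation in example["annotations"]:
        for span in annotation["short_answers"]:
            # Join the tokens covered by the gold span into a plain string.
            gold_strings.append(" ".join(tokens[span["start_token"]:span["end_token"]]))
    return gold_strings

Generated answers could then be compared against these strings with the usual normalization + exact match.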

@awebson
Contributor Author

awebson commented Sep 27, 2021

  1. Yes. For all tasks we ignore numerical span indices and always measure based on text.
  2. I will take a look at how other papers handled this and revisit it.

@craffel
Contributor

craffel commented Sep 27, 2021

For TriviaQA, NQ, and WebQuestions, we are evaluating on the open domain variant of the task and should use the evaluation procedure shown here: https://github.com/google-research/google-research/tree/master/t5_closed_book_qa
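
For reference, those scripts follow (roughly) the standard SQuAD-style recipe: normalize both strings and count a hit if the prediction matches any gold answer. A minimal sketch, with the normalization rules paraphrased rather than copied from the linked repo:

import re
import string

def normalize(text):
    # Lowercase, drop punctuation and articles, collapse whitespace.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    # Score 1 if the normalized prediction matches any normalized gold answer.
    return max(int(normalize(prediction) == normalize(gold)) for gold in gold_answers)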

@lintangsutawika
Contributor

lintangsutawika commented Sep 29, 2021

Posting a colab that calls the special eval scripts for the above datasets.

https://colab.research.google.com/drive/1G2zxbvi96qxbOv6LNYvTTrcJxsBX_4Hr

Edit:

  • Updated SQuAD v2 to exclude unanswerable questions
  • Normalized number words to numeric symbols for DROP

@awebson awebson pinned this issue Sep 30, 2021
@craffel
Contributor

craffel commented Sep 30, 2021

Dataset splits for open-domain QA are a total mess, so writing this down here for my own records:

  • Natural Questions: In H.1, the GPT-3 paper says it is reporting results on the "test" split. However, standard open-domain and closed-book QA practice here is to use the validation set as the test set. I'm guessing that's what they mean but I'm waiting for confirmation.
  • TriviaQA: GPT-3 reports results in the main paper on the test server; if they were following standard practice this is the "full-em" score in the Wikipedia tab on this leaderboard: https://competitions.codalab.org/competitions/17208#results In order to report comparable results, we need to 1) copy the templates we have for trivia_qa/rc to trivia_qa/unfiltered, 2) run inference on the test set to get predictions, 3) submit the predictions to the test server.
  • WebQuestions: This one is simple, there is just a test set.

@craffel
Contributor

craffel commented Oct 1, 2021

Leo Gao mentions that

for ARC, OpenbookQA, and RACE in particular openai claims that a different kind of normalization described in the paper works really well (they don't really provide any evidence or explanation, they just claim it is and use it for just these 3 tasks)

We aren't doing any kind of length normalization, so if we are underperforming on those tasks, we could consider it.
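
For context, length normalization here means ranking each answer choice by its average per-token log-likelihood instead of the summed log-likelihood, so longer choices aren't penalized just for having more tokens. A hedged sketch (choice_logprobs and its layout are illustrative, not our actual eval code):

def pick_choice(choice_logprobs, length_normalize=True):
    # choice_logprobs maps each answer choice to its per-token log probabilities.
    def score(token_logprobs):
        total = sum(token_logprobs)
        return total / len(token_logprobs) if length_normalize else total
    return max(choice_logprobs, key=lambda choice: score(choice_logprobs[choice]))

# Without normalization the longer choice loses on summed log-likelihood
# (-0.4 vs -0.2); with normalization it wins per token (-0.1 vs -0.2).
choices = {"yes": [-0.2], "the answer is yes": [-0.1, -0.1, -0.1, -0.1]}
print(pick_choice(choices, length_normalize=False))  # -> "yes"
print(pick_choice(choices, length_normalize=True))   # -> "the answer is yes"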

@lintangsutawika
Contributor

For DROP, I noticed that the model often predicts a number as a word instead of its numeric symbol, so I added that conversion to the normalization process (rough sketch below).
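
A minimal sketch of that kind of normalization; the word list and its range are illustrative assumptions, not the exact mapping used in the eval script:

NUMBER_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
    "ten": "10", "eleven": "11", "twelve": "12",
}

def normalize_number_words(answer):
    # Replace standalone number words with digits; leave everything else alone.
    return " ".join(NUMBER_WORDS.get(token.lower(), token) for token in answer.split())

print(normalize_number_words("seven touchdowns"))  # -> "7 touchdowns"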

Predictions are from drop_can_you_tell_me_1112200_predictions (model finetune-t5-xxl-lm-d4-091621-512).

Before:

Exact-match accuracy 26.51
F1 score 30.32
26.51   &   30.32
----
date: 152 (1.59%)
  Exact-match accuracy 64.474
  F1 score 71.829
number: 5826 (61.09%)
  Exact-match accuracy 8.393
  F1 score 8.937
span: 3069 (32.18%)
  Exact-match accuracy 63.245
  F1 score 69.709
spans: 489 (5.13%)
  Exact-match accuracy 0.000
  F1 score 24.871

After:

Exact-match accuracy 31.50
F1 score 35.29
31.50   &   35.29
----
date: 152 (1.59%)
  Exact-match accuracy 64.474
  F1 score 71.829
number: 5837 (61.21%)
  Exact-match accuracy 16.567
  F1 score 17.110
span: 3060 (32.09%)
  Exact-match accuracy 63.366
  F1 score 69.805
spans: 487 (5.11%)
  Exact-match accuracy 0.000
  F1 score 24.903

According to the GPT-3 paper, the zero-shot results are:

Name  : DROP
Metric: f1
Split : dev
Small : 9.40
Med   : 13.6
Large : 14.4
XL    : 16.4
2.7B  : 19.7
6.7B  : 17.0
13B   : 24.0
175B  : 23.6

So we are performing significantly better than GPT-3 zero-shot.

@craffel
Contributor

craffel commented Oct 1, 2021

For the record, each example in coqa and quac is actually N examples, where N is the number of turns of the dialog. Our prompts for coqa and quac will only evaluate on one turn per example. We need to create new serialized versions of the datasets if we are going to evaluate on the full dataset.
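
A sketch of what that re-serialization might look like: expand each dialog into one example per turn, folding the earlier turns into the context. Field names follow my reading of the HF coqa schema and should be treated as assumptions:

def expand_turns(example):
    expanded = []
    history = []
    for question, answer in zip(example["questions"], example["answers"]["input_text"]):
        expanded.append({
            "story": example["story"],
            "history": list(history),  # all previous (question, answer) turns
            "question": question,
            "answer": answer,
        })
        history.append((question, answer))
    return expanded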

@craffel
Contributor

craffel commented Oct 1, 2021

Natural Questions: In H.1, the GPT-3 paper says it is reporting results on the "test" split. However, standard open-domain and closed-book QA practice here is to use the validation set as the test set. I'm guessing that's what they mean but I'm waiting for confirmation.

Got confirmation that this is indeed the case. Unfortunately, this split is not available on HF except in the main nq dataset, which is utterly colossal. The one in the nq_open dataset (and the one in KILT NQ) is different.

@lintangsutawika
Copy link
Contributor

lintangsutawika commented Oct 2, 2021

@craffel I've actually made prompts for coqa that try to solve this.

{% set n=25 %}
{% if questions|length > n %}

{{story}} 

Q: {{questions[0]}}
      {% for i in range(0,n) %}
A: {{answers['input_text'][i]}}

Q: {{questions[i+1]}}
      {% endfor %}
A:
|||
{{answers['input_text'][n]}}

{% else %}

Placeholder,  Do Not Process
|||
Placeholder, Do Not Process

{% endif %}

But the downside is that it has to be a unique prompt for each number of turns. For coqa the maximum number of turns is 25, so there need to be 25 unique prompts. The idea is to then collect the predictions into a JSON file and run the official eval script.

So far I've made around 15 unique prompts (each just changes the number). I can make a pull request if this approach makes sense.

@craffel
Contributor

craffel commented Oct 2, 2021

Thanks @lintangsutawika. @zaidalyafeai actually made HFDS variants of the tasks that include prior dialog turns as context. I think we can just use those as the base datasets.

@craffel
Contributor

craffel commented Oct 3, 2021

I wrote ten GPT-3-style ReCoRD prompts, where the model has to rank all possible completions of the query sentence with every candidate entity filled in. They only make sense for rank eval. I can try to run eval on them before we cache. Not sure if it will help, but worth a try. #490
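
For clarity, a minimal sketch of how those candidates get built (field names mirror the HF super_glue/record schema and the @placeholder marker comes from ReCoRD itself; treat the details as an assumption rather than the exact prompt code):

def record_candidates(example):
    # One candidate continuation per entity, with the entity substituted
    # into the @placeholder slot of the query; rank eval scores each one.
    return [example["query"].replace("@placeholder", entity)
            for entity in example["entities"]]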

@craffel
Contributor

craffel commented Oct 5, 2021

Datasets that are mostly done but we need to re-run eval and/or compute scores manually for all models:

  • ReCoRD: Re-wrote prompts in a GPT-3 style "continuation choices" format; this raised the average score from ~20% to ~40% for EM. Still very far below GPT-3's performance.
  • SQuAD v2: Added "respond unanswerable" to the prompts; the model did sometimes say "unanswerable" but only rarely, so the average score did not change (still about 40% for EM).
  • DROP: Lintang added word/number normalization, this helped quite a bit.
  • WebQuestions: When we run eval correctly, our score is reasonable.
  • HellaSwag: We are just really bad at this dataset unless we train on it.

Datasets where there is still work to be done:

  • CoQA: Zaid created an appropriate variant of the dataset and Stephen prompted it, but we need to add it to the CSV file, cache it, and re-run eval.
  • QuAC: Same as CoQA.
  • Natural Questions: We need to run eval on the same test set (= original validation set) that has been used in open-domain QA papers. This will take a little work.
  • TriviaQA: We switched to the correct subset (unfiltered) and fixed the example filtering and re-cached, but we need to run inference on the test set and submit to the test server. Based on our validation results, we are very likely to significantly underperform GPT-3.

Datasets I don't know the status of:

  • Lambada
  • Winogrande

@stephenbach stephenbach unpinned this issue Oct 14, 2021