Reproducing deepset/roberta-base-squad2 metrics on HuggingFace #2853

Closed
1 task done
sjrl opened this issue Jul 20, 2022 · 16 comments


sjrl commented Jul 20, 2022

Describe the bug
This is not really a bug, but an investigation as to why the metrics reported for deepset/roberta-base-squad2 on HuggingFace are not reproducible using the FARMReader.eval_on_file function in Haystack version 1.6.0.

Metrics on HuggingFace

"exact": 79.87029394424324,
"f1": 82.91251169582613,

"total": 11873,
"HasAns_exact": 77.93522267206478,
"HasAns_f1": 84.02838248389763,
"HasAns_total": 5928,
"NoAns_exact": 81.79983179142137,
"NoAns_f1": 81.79983179142137,
"NoAns_total": 5945

Expected behavior
The following code should produce metrics similar to the ones reported on HuggingFace

from haystack.nodes import FARMReader

reader = FARMReader(
    model_name_or_path='deepset/roberta-base-squad2', use_gpu=True, max_seq_len=386
)
metrics = reader.eval_on_file(
    data_dir='./data/',
    test_filename='dev-v2.0.json',
    device='cuda',
)

Data for eval is available here (official SQuAD2 dev set): https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
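
A minimal sketch (assuming the requests package is available) for fetching that file into the ./data/ directory used above:

import os

import requests

# Official SQuAD2 dev set linked above.
URL = "https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json"

os.makedirs("data", exist_ok=True)
response = requests.get(URL, timeout=60)
response.raise_for_status()
with open("data/dev-v2.0.json", "wb") as f:
    f.write(response.content)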

Currently, Haystack v1.6.0 produces

    "v1.6.0": {
      "EM": 76.13913922344815,
      "f1": 78.1999299751871,
      "top_n_accuracy": 97.45641371178304,
      "top_n": 4,
      "EM_text_answer": 61.52159244264508,
      "f1_text_answer": 65.64908377115323,
      "top_n_accuracy_text_answer": 94.90553306342781,
      "top_n_EM_text_answer": 81.34278002699055,
      "top_n_f1_text_answer": 90.69119695344651,
      "Total_text_answer": 5928,
      "EM_no_answer": 90.71488645920942,
      "f1_no_answer": 90.71488645920942,
      "top_n_accuracy_no_answer": 100.0,
      "Total_no_answer": 5945
    },

and we can immediately see that the EM (79.9 vs. 76.1) and F1 (82.9 vs. 78.2) scores are lower than expected.

If we roll back to Haystack v1.3.0 we get the following metrics

    "v1.3.0": {
      "EM": 0.7841320643476796,
      "f1": 0.826529420882385,
      "top_n_accuracy": 0.9741430135601785
    }

which, once converted from fractions to percentages (EM 78.4, F1 82.7), much more closely match the ones reported on HuggingFace.

Identifying the change in Haystack
I believe the code change that caused this difference in results was introduced by the fix for issue #2410. To check this, I updated FARMReader.eval_on_file to accept the argument use_no_answer_legacy_confidence (originally added for FARMReader.eval, but not FARMReader.eval_on_file, in PR https://github.com/deepset-ai/haystack/pull/2414/files) on the branch https://github.com/deepset-ai/haystack/tree/issue_farmreader_eval and ran the following code

results = reader.eval_on_file(
    data_dir='./data/',
    test_filename='dev-v2.0.json',
    device='cuda',
    use_no_answer_legacy_confidence=True
)

which produced

{'EM': 78.37951655015581,
 'f1': 82.61101723722291,
 'top_n_accuracy': 97.41430135601786,
 'top_n': 4,
 'EM_text_answer': 75.0,
 'f1_text_answer': 83.47513624452559,
 'top_n_accuracy_text_answer': 94.82118758434548,
 'top_n_EM_text_answer': 81.00539811066126,
 'top_n_f1_text_answer': 90.50564293271806,
 'Total_text_answer': 5928,
 'EM_no_answer': 81.74936921783012,
 'f1_no_answer': 81.74936921783012,
 'top_n_accuracy_no_answer': 100.0,
 'Total_no_answer': 5945}

The EM and F1 scores now show the expected values.

Solution
I'm not entirely sure how this should be resolved, but it seems the no_answer logic used in eval should be tied to the logic the model was trained with.

  • [ ] Consider saving the value of use_no_answer_legacy_confidence with the model metadata so that when the model is reloaded it uses the same no_answer confidence logic it was trained with. This would not solve the issue, as discussed below.

FAQ Check

System:

  • OS: ubuntu
  • GPU/CPU: GPU
  • Haystack version (commit or version number): v1.6.0

sjrl commented Jul 20, 2022

@tstadel @Timoeller Would it be possible to save the value of use_no_answer_legacy_confidence with model metadata so that when the model is reloaded it uses the same no_answer confidence logic as it was trained with?


tstadel commented Jul 20, 2022

@sjrl I think it's not really solving it as this would only fix the eval() and eval_on_file() methods but not FARMReader's predict().


sjrl commented Jul 20, 2022

@tstadel Yes, my PR is incomplete. I opened it mainly so the work wouldn't need to be redone. Feel free to add commits, or we can close it if that's easier.


sjrl commented Jul 20, 2022

I think it's not really solving it as this would only fix the eval() and eval_on_file() methods but not FARMReader's predict().

Oh sorry, I misinterpreted what you meant. I'm still not too familiar with the source code of FARMReader 😅 I agree the proposed solution wouldn't fix the predict() method.

@MichelBartels

I think saving the value of use_no_answer_legacy_confidence doesn't work for all cases because it's not set during training and is only relevant during postprocessing.
The issue it addresses is that we can't compare the probabilities of different documents and splits (documents whose sequence length is longer than the reader's max_seq_len) because they are produced by a softmax that normalizes the scores separately for each split/document. This issue is not really present during eval because there is just one document, which is usually small enough that it doesn't need to be split. This means that use_no_answer_legacy_confidence=True works fine for eval, while it probably does not during inference.
However, just applying a sigmoid to the no_answer scores (which is what happens when use_no_answer_legacy_confidence is False) also creates issues, as all other values are still scaled by softmax. This is mitigated a bit by dividing by 8, but that's not an ideal solution.
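
To illustrate the point about per-split softmax, here is a standalone numpy sketch (not Haystack's actual code; the logits are made up and the divisor of 8 only mirrors the description above):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical answer-span logits from two splits of one long document.
split_a = np.array([5.0, 4.8, 4.6])    # several strong candidates compete
split_b = np.array([2.0, -3.0, -4.0])  # one mediocre candidate stands alone

print(softmax(split_a)[0])  # ~0.40: strong answer, but softmax splits the mass
print(softmax(split_b)[0])  # ~0.99: weaker answer looks far more "confident"

# A sigmoid of the raw logit (scaled down, e.g. by 8 as described above) does
# not depend on the other candidates in the same split, so scores stay
# comparable across splits: the logit 5.0 now outranks 2.0.
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z / 8))
print(sigmoid(split_a[0]), sigmoid(split_b[0]))  # ~0.65 vs ~0.56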


tstadel commented Jul 21, 2022

Just a quick status from my side. I'm still investigating: my goal is to have a clear view on what this means for:

  • Training
  • Eval
  • Inference in Haystack

Once we have that, let's focus on what needs to be fixed or parameterized.


tstadel commented Jul 21, 2022

Ok, so here's the result of my analysis; most of it confirms @MichelBartels. However, besides use_no_answer_legacy_confidence, the eval and inference results are also affected by use_confidence_scores/use_confidence_scores_for_ranking. Both default values were changed for eval in #2410:

Training:
Not affected at all, as the loss is calculated from the raw logits with CrossEntropyLoss rather than from the order of answers. However, the dev eval during training uses the same logic as FARMReader.eval/eval_on_file.

Eval:
use_no_answer_legacy_confidence affects the ranking/results of FARMReader.eval/eval_on_file, but only if use_confidence_scores=True (False was the default before #2410 and, I guess, also in FARM). Hence, dev eval during training is also affected. #2410 set use_no_answer_legacy_confidence's default to False and use_confidence_scores_for_ranking's default to True, as this is the behavior we get from FARMReader during inference.

Inference:
Currently we only support use_no_answer_legacy_confidence=False. This could be changed for FARMReader, but not for TransformersReader, as there we do not have access to confidence values.
However, we can set use_confidence_scores to False in FARMReader. With that we only use raw logit scores, which was the behavior of FARMReader.eval/eval_on_file before my changes in PR #2410. Unfortunately, we do not pass that parameter correctly down to Evaluator.eval() in eval_on_file; that's already addressed in #2856.

My suggestion would be to fix the passing of FARMReader's use_confidence_scores down to the Evaluator; with that set to False we should get the same results as in FARM.
For inference you can basically choose the value that best fits your data. Setting use_confidence_scores to True seems to favor no_answers, while setting it to False is more aggressive in showing actual answers. At least that's my impression from our training dataset in Tutorial 2 and the eval dataset in Tutorial 5.
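
For inference, a rough sketch like the following could be used to compare the two settings on your own data (the query and document here are made up, not taken from this issue):

from haystack import Document
from haystack.nodes import FARMReader

docs = [Document(content="Hamlet is a tragedy written by William Shakespeare.")]

for use_conf in (True, False):
    reader = FARMReader(
        model_name_or_path="deepset/roberta-base-squad2",
        use_gpu=True,
        use_confidence_scores=use_conf,  # True seems to favor no_answer
    )
    pred = reader.predict(query="Who wrote Hamlet?", documents=docs, top_k=3)
    print(use_conf, [(a.answer, a.score) for a in pred["answers"]])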


tstadel commented Jul 22, 2022

I ran eval_on_file with use_confidence_scores=False on #2856 branch like this:

from haystack.nodes import FARMReader

reader = FARMReader(
    model_name_or_path='deepset/roberta-base-squad2', use_gpu=True, max_seq_len=386, use_confidence_scores=False
)
metrics = reader.eval_on_file(
    data_dir='/home/tstad/Downloads',
    test_filename='dev-v2.0.json',
)

and here are the results:

{
    'EM': 78.43847384822708, 
    'f1': 82.65766204593291, 
    'top_n_accuracy': 97.41430135601786, 
    'top_n': 4, 
    'EM_text_answer': 75.01686909581646, 
    'f1_text_answer': 83.46734505252387, 
    'top_n_accuracy_text_answer': 94.82118758434548, 
    'top_n_EM_text_answer': 80.9885290148448, 
    'top_n_f1_text_answer': 90.49777068800371, 
    'Total_text_answer': 5928, 
    'EM_no_answer': 81.85029436501262, 
    'f1_no_answer': 81.85029436501262, 
    'top_n_accuracy_no_answer': 100.0, 
    'Total_no_answer': 5945
}


sjrl commented Jul 22, 2022

@tstadel Nice work! Thanks for the thorough investigation.

My suggestion would be to fix the passing of FARMReader's use_confidence_scores down to the Evaluator and with setting that to False we should get the same results as in FARM.

I also agree that we should pass down the parameters to Evaluator.eval(). Are you suggesting we should also set the default value of use_confidence_scores to False for eval and leave it at True for inference?


sjrl commented Jul 22, 2022

Whatever defaults we decide on, @julian-risch mentioned that we should include docstrings and docs on the website explaining when to set use_confidence_scores to True or False, which I think is a great idea.


tstadel commented Jul 22, 2022

Yes definitely! We have to mention that setting use_confidence_scores affects ranking of no_answers. Here's what I'd suggest:

  • Complete Resolving issue 2853: no answer logic in FARMReader #2856 by adding a new param use_confidence_scores_for_ranking to eval() and eval_on_file(). This param should be True by default for reproducibility. The docstrings should make clear that it is used instead of FARMReader's use_confidence_scores. Add to use_confidence_scores's docstring that it affects the ranking of no_answers. @sjrl I can support you on that.
  • Explain the behavior of use_confidence_scores on the website too (probably for @agnieszka-m once our docstrings are ready).
  • Create an issue about improving the handling of no_answer scores when using use_confidence_scores. This could be solved by improving the no_answer confidence calculation; however, there are cases where the current default values perform better. Sometimes it behaves very oddly: if there is one answer with confidence >95% and a no_answer with an originally calculated confidence of ~1%, we end up with a no_answer confidence of over 50%. So the current defaults seem to heavily favor no_answers. Let's investigate whether this is the case and whether this bias towards no_answers could be achieved more simply (e.g. by setting no_answer_boost to a different default value, as sketched below) without losing ranking consistency between use_confidence_scores=True and False. (That one is on me.)
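
A rough sketch of the kind of experiment meant in the last bullet, simply sweeping FARMReader's existing no_answer_boost parameter (the values below are arbitrary):

from haystack.nodes import FARMReader

# Negative values make no_answer less likely, positive values boost it.
for boost in (0.0, -1.0, -2.0):
    reader = FARMReader(
        model_name_or_path="deepset/roberta-base-squad2",
        use_gpu=True,
        use_confidence_scores=True,
        no_answer_boost=boost,
    )
    metrics = reader.eval_on_file(
        data_dir="./data/",
        test_filename="dev-v2.0.json",
        device="cuda",
    )
    print(boost, metrics["EM"], metrics["f1"], metrics["EM_no_answer"])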

@sjrl WDYT?


sjrl commented Jul 25, 2022

Hi @tstadel, I definitely agree with your second and third points. However, I'm a little uncertain about the first suggestion of adding use_confidence_scores_for_ranking to eval() and eval_on_file().

It seems unintuitive to me that we would then have use_confidence_scores to initialize a FARMReader but also a separate variable use_confidence_scores_for_ranking used only by the eval and eval_on_file functions. In that situation, use_confidence_scores would only affect the predict function, so predict could return different results than eval and eval_on_file.

So I would prefer the following:

  • Remove the variable use_no_answer_legacy_confidence from FARMReader.eval() and FARMReader.eval_on_file() and instead add it to the constructor of FARMReader. Then pass the class variables on to the Evaluator.eval() function (see the sketch after this list).
  • Update this line https://github.com/deepset-ai/haystack/blob/master/haystack/modeling/training/base.py#L308 to pass on the variables use_confidence_scores and use_no_answer_legacy_confidence from self.model if they are present. This way the eval results printed during training are consistent with evaluations performed after training is finished.
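
A purely hypothetical sketch of what the first point could look like from the caller's side (use_no_answer_legacy_confidence as a constructor argument is only the proposal here, not the API that was eventually merged):

from haystack.nodes import FARMReader

# Hypothetical: both flags live on the reader instance ...
reader = FARMReader(
    model_name_or_path="deepset/roberta-base-squad2",
    use_confidence_scores=False,
    use_no_answer_legacy_confidence=True,  # proposed constructor argument
)

# ... and eval()/eval_on_file() read them from the instance instead of taking
# their own parameters, so eval and predict can never disagree.
metrics = reader.eval_on_file(data_dir="./data/", test_filename="dev-v2.0.json")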

@tstadel WDYT?


tstadel commented Jul 26, 2022

@sjrl Adding use_no_answer_legacy_confidence to the constructor would introduce more confusion in my opinion. Even if we correctly pass it down to the prediction head, it won't have any effect on the happy path of FARMReader (i.e. inference). So this param is basically useless outside of eval[_on_file()] (which is also used during training). I'd definitely keep it as a function param (and maybe add it to train() too).
To streamline the "expert eval of FARMReader" use case further, I would add use_confidence_scores_for_ranking to the eval functions, so you always know what you actually get. Currently it's not intuitive that the constructor param use_confidence_scores would also affect ranking and thus the eval functions, even if we adjust its docstrings.
Additionally, this would make comparing these settings easier, as you could just call eval[_on_file] twice on the same reader instance without having to instantiate a new reader. Although this might seem a bit cumbersome at first and needs a bit more explanation in the docstrings, it appears to me to be the right compromise here:
eval[_on_file] really is the expert path for evaluation.
We actually nudge users towards the more realistic pipeline.eval() with the warning at the beginning of eval[_on_file], because that's the actual performance you get when using the reader in a pipeline. Further improving that real performance is addressed by the issue from point 3.


tstadel commented Aug 2, 2022

@sjrl are we good with this plan? Can we close this issue?


sjrl commented Aug 9, 2022

After some discussion with @Timoeller, @ju-gu, @mathislucka and @tstadel we agreed that we would like the FARMReader.eval and FARMReader.eval_on_file functions to behave the same way as the FARMReader.predict functions.

So this means we will remove the use_no_answer_legacy_confidence and use_confidence_scores_for_ranking options from the eval and eval_on_file functions and instead use the class variables of FARMReader to determine these values.


sjrl commented Aug 11, 2022

This has been resolved with PR #2856.

To be clear, reproducing the metrics reported on HuggingFace only works if use_confidence_scores is set to False when initializing a FARMReader in Haystack.

sjrl closed this as completed Aug 11, 2022