Reproducing deepset/roberta-base-squad2 metrics on HuggingFace #2853
Comments
@tstadel @Timoeller Would it be possible to save the value of `use_no_answer_legacy_confidence` …
@sjrl I think it's not really solving it, as this would only fix the `eval()` and `eval_on_file()` methods but not …
@tstadel Yes, my PR is incomplete. I just thought to open it so the work would not need to be reproduced. So feel free to add commits, or if it's easier we can close it as well.
Oh sorry, I misinterpreted what you meant. I'm still not too familiar with the source code of …
I think saving the value of `use_no_answer_legacy_confidence` …
Just a quick status from my side. I'm still investigating: my goal is to have a clear view on what this means for training, eval, and inference. If we have that, let's focus on what needs to be fixed or should be parameterized in any way.
Ok.
- Training: …
- Eval: …
- Inference: …

My suggestion would be to fix the passing of …
I ran `eval_on_file` with

```python
from haystack.nodes import FARMReader

reader = FARMReader(
    model_name_or_path='deepset/roberta-base-squad2', use_gpu=True, max_seq_len=386, use_confidence_scores=False
)
metrics = reader.eval_on_file(
    data_dir='/home/tstad/Downloads',
    test_filename='dev-v2.0.json',
)
```

and here are the results:

```
{
    'EM': 78.43847384822708,
    'f1': 82.65766204593291,
    'top_n_accuracy': 97.41430135601786,
    'top_n': 4,
    'EM_text_answer': 75.01686909581646,
    'f1_text_answer': 83.46734505252387,
    'top_n_accuracy_text_answer': 94.82118758434548,
    'top_n_EM_text_answer': 80.9885290148448,
    'top_n_f1_text_answer': 90.49777068800371,
    'Total_text_answer': 5928,
    'EM_no_answer': 81.85029436501262,
    'f1_no_answer': 81.85029436501262,
    'top_n_accuracy_no_answer': 100.0,
    'Total_no_answer': 5945
}
```
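As a quick sanity check on these numbers (not part of the original comment), the overall EM is simply the count-weighted average of the text-answer and no-answer EMs:

```python
# Sanity check: the overall EM is the count-weighted average of the
# text-answer and no-answer EMs reported above.
n_text, n_no = 5928, 5945
em_text, em_no = 75.01686909581646, 81.85029436501262

overall_em = (n_text * em_text + n_no * em_no) / (n_text + n_no)
print(overall_em)  # ~78.4385, matching the reported 'EM' of 78.43847384822708
```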
@tstadel Nice work! Thanks for the thorough investigation.
I also agree that we should pass down the parameters to …
Whatever defaults we decide on, @julian-risch mentioned that we should include docstrings and docs on the website explaining when to set …
Yes definitely! We have to mention that setting …
@sjrl WDYT?
Hi @tstadel I definitely agree with your second and third points. However, I'm a little uncertain about the first suggestion about adding … It seems unintuitive to me that we would then have … So I would prefer the following: …
@tstadel WDYT?
@sjrl Adding …
@sjrl are we good with this plan? Can we close this issue?
After some discussion with @Timoeller, @ju-gu, @mathislucka and @tstadel we agreed that we would like the … So this means we will remove the …
This has been resolved with PR #2856. To be clear, reproducing the metrics reported on HF only works if …
Describe the bug
This is not really a bug, but an investigation as to why the metrics reported for `deepset/roberta-base-squad2` on HuggingFace are not reproducible using the `FARMReader.eval_on_file` function in Haystack version 1.6.0.

Metrics on HuggingFace: EM ≈ 79.9, F1 ≈ 82.9.
Expected behavior
The following code should produce metrics similar to the ones reported on HuggingFace:
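The exact snippet is not preserved in this excerpt; the sketch below mirrors the `eval_on_file` call shared later in the thread (the `data_dir` path is a placeholder, and `max_seq_len=386` is taken from that later snippet):

```python
from haystack.nodes import FARMReader

# Sketch of the reproduction run (paths are placeholders).
reader = FARMReader(
    model_name_or_path='deepset/roberta-base-squad2', use_gpu=True, max_seq_len=386
)
metrics = reader.eval_on_file(
    data_dir='/path/to/squad',      # directory containing dev-v2.0.json
    test_filename='dev-v2.0.json',
)
print(metrics)
```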
Data for eval is available here (official SQuAD2 dev set): https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
Currently, Haystack v1.6.0 produces results where we can immediately see that the EM (79.9 vs. 76.1) and F1 (82.9 vs. 78.2) scores are lower than expected.
If we roll back to Haystack v1.3.0, we get metrics that much more closely match the ones reported on HuggingFace.
Identifying the change in Haystack
I believe the code change that caused this difference in results was introduced by the solution to issue #2410. To check this, I updated `FARMReader.eval_on_file` to accept the argument `use_no_answer_legacy_confidence` (which was originally added in PR https://github.com/deepset-ai/haystack/pull/2414/files for `FARMReader.eval` but not for `FARMReader.eval_on_file`) in the branch https://github.com/deepset-ai/haystack/tree/issue_farmreader_eval and ran the following code.
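The snippet itself is likewise not preserved here; a plausible sketch, assuming the patched `eval_on_file` simply accepts the new keyword argument and that `True` selects the legacy `no_answer` confidence behaviour:

```python
from haystack.nodes import FARMReader

# Sketch against the issue_farmreader_eval branch: eval_on_file is assumed to
# accept use_no_answer_legacy_confidence, as described above.
reader = FARMReader(
    model_name_or_path='deepset/roberta-base-squad2', use_gpu=True, max_seq_len=386
)
metrics = reader.eval_on_file(
    data_dir='/path/to/squad',
    test_filename='dev-v2.0.json',
    use_no_answer_legacy_confidence=True,
)
print(metrics)
```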
The EM and F1 scores now show the expected values.
Solution

I'm not entirely sure how this should be resolved, but it seems the `no_answer` logic used in eval should probably be somehow linked to the models that were trained using that logic.

- [ ] Consider saving the value of `use_no_answer_legacy_confidence` with model meta data so when the model is reloaded it uses the same `no_answer` confidence logic as it was trained with. (This would not solve the issue, as discussed below.)
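Purely as an illustration of the (later discarded) checklist item above, and not Haystack's actual save/load mechanism, persisting such a flag next to a saved model and reading it back at load time could look like this:

```python
import json
from pathlib import Path

# Hypothetical illustration only (not Haystack's real metadata handling):
# store the flag next to the saved model so eval can reuse the same
# no_answer confidence logic the model was trained with.
def save_eval_meta(model_dir: str, use_no_answer_legacy_confidence: bool) -> None:
    meta = {"use_no_answer_legacy_confidence": use_no_answer_legacy_confidence}
    (Path(model_dir) / "eval_meta.json").write_text(json.dumps(meta))

def load_eval_meta(model_dir: str) -> bool:
    meta_file = Path(model_dir) / "eval_meta.json"
    if meta_file.exists():
        return json.loads(meta_file.read_text())["use_no_answer_legacy_confidence"]
    return False  # fall back to the current default when no metadata is stored
```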
FAQ Check

System: