Add More top_k handling to EvalDocuments #1133
Conversation
Looks good to me. As discussed, I agree that the warning about fewer documents than top_k_eval_documents is nice to have, although that case does not invalidate the metrics. One question, though, about the handling of changing top_k_eval_documents. See below.
haystack/eval.py
Outdated
logger.warning(f"EvalDocuments was already run once with top_k_eval_documents={self.top_k_used} but is "
               f"being run again with top_k_eval_documents={self.top_k_eval_documents}. This will lead "
               f"to unreliable evaluation metrics")
self.inconsistent_top_k_warning = True
I was wondering whether we should call init_counts() here, so that the calculated metrics make sense at least from here on?
Yes, that's a great idea. I can imagine a user performing multiple evaluation loops with a different top_k each time. I've implemented the count reset and also removed the self.inconsistent_top_k_warning flag so that the warning can be triggered more than once.
Yes, running the evaluation multiple times with different top_k_eval_documents makes sense now. However, there might be a different problem now, depending on how we expect users to use the evaluation: top_k_used is only updated in the very first run and never later on.

If you run the evaluation with top_k_eval_documents=5 twice, the counts are aggregated, which is good, and top_k_used is set to 5. If you then run with a different top_k_eval_documents, the counts are reset, which is good too. But if you then switch back to top_k_eval_documents=5, the counts are not reset, because top_k_eval_documents is again the same as top_k_used, which is bad.

Further, from then on the counts are no longer aggregated even across multiple runs with the same top_k_eval_documents, solely because different values of top_k_eval_documents were tried out earlier. That would be confusing. Do we expect users to try out different values of top_k_eval_documents, maybe in Jupyter notebooks? If not, I am okay with the current version of the code. Otherwise, I am happy to discuss (also in a call). I would suggest simply adding the line self.top_k_used = top_k_eval_documents at the very end of the run method.
Yup, I agree with your points here. Added the line self.top_k_used = top_k_eval_documents to the end of the run method.
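Putting the pieces of this thread together, here is a minimal sketch of the agreed logic: resetting counts via init_counts() whenever top_k changes, and updating self.top_k_used at the very end of run() so that switching back to an earlier top_k is also detected. The single query_count counter is an assumption for illustration; the real EvalDocuments tracks more state.

```python
import logging

logger = logging.getLogger(__name__)


class EvalDocuments:
    """Simplified sketch of the top_k handling discussed above; not the actual haystack class."""

    def __init__(self, top_k_eval_documents=10):
        self.top_k_eval_documents = top_k_eval_documents
        self.top_k_used = 0
        self.init_counts()

    def init_counts(self):
        # Reset all aggregated evaluation counts (reduced to one counter here)
        self.query_count = 0

    def run(self, top_k_eval_documents=None):
        if top_k_eval_documents is None:
            top_k_eval_documents = self.top_k_eval_documents
        if self.top_k_used and self.top_k_used != top_k_eval_documents:
            logger.warning(
                f"EvalDocuments was last run with top_k_eval_documents={self.top_k_used} but is "
                f"being run again with top_k_eval_documents={top_k_eval_documents}. "
                f"Resetting counts so that metrics stay consistent."
            )
            self.init_counts()
        self.query_count += 1
        # ... evaluation logic would go here ...
        # Updating top_k_used on every run (not just the first) ensures that
        # switching back to a previously used top_k still triggers the reset above.
        self.top_k_used = top_k_eval_documents
```

With this version, repeated runs with the same top_k aggregate counts, and any change of top_k (including changing back) resets them.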
haystack/eval.py
Outdated
if not self.top_k_used:
    self.top_k_used = top_k_eval_documents
elif self.top_k_used != top_k_eval_documents:
    logger.warning(f"EvalDocuments was already run once with top_k_eval_documents={self.top_k_used} but is "
Minor thing: Indentation level looks odd here 🤔
Rename top_k_eval_documents and allow it to be set in EvalDocuments.run(). Perform checks to ensure that top_k_eval_documents doesn't change over the course of evaluation and that we never get a situation where EvalDocuments is passed fewer documents than its top_k.
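The "fewer documents than top_k" check mentioned in the description could look like the following hedged sketch. The function name and warning text are assumptions for illustration; the actual haystack implementation may differ.

```python
import logging

logger = logging.getLogger(__name__)


def warn_if_too_few_documents(documents, top_k_eval_documents):
    # Receiving fewer documents than top_k does not invalidate the metrics,
    # but the user should know fewer documents than expected are evaluated.
    if len(documents) < top_k_eval_documents:
        logger.warning(
            f"EvalDocuments received only {len(documents)} documents but "
            f"top_k_eval_documents={top_k_eval_documents}. Metrics remain valid, "
            f"but fewer documents than expected will be evaluated."
        )
```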