speed up slow retriever evaluation #1557
Comments
Hey @mathislucka, this is a good observation and would indeed greatly increase the speed of the evaluation. We already added batch indexing in #1231 and have an open issue for batch querying in #1239. Btw, do you have any other feedback regarding the evaluation? E.g. is it clear what closed domain vs. open domain means and how it is implemented? I am asking because we realized the evaluation is rather complex and we want to simplify it. Getting your feedback there would help.
Open vs. closed domain was clear to me. I did think that adding the eval data was slightly annoying, because the docs say: "Adds a SQuAD-style formatted file to the document store". You actually have this description in multiple places in your docs, and I think the concept is hard to grasp because now I have to go and find out how the original SQuAD dataset is formatted, then transform my data and write it to a file before loading it into the document store. I'd like to have other options for adding eval data. Maybe accept either a file or a list of dicts, or something like that? And actually giving an example of the requested structure in your docs (something like the sketch below) would be great. I'll play around with evaluation a bit more and see if I have any more feedback.
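For readers hitting the same question: the structure meant by "SQuAD-style" is roughly the following. The field names follow the public SQuAD format; the snippet shows it as a Python dict that is dumped to a JSON file before loading it via the document store's add_eval_data method (a minimal sketch for orientation, not official documentation):

```python
import json

# Minimal SQuAD-style eval data: one document ("paragraph" context)
# with one annotated question/answer pair.
eval_data = {
    "data": [
        {
            "title": "Sample document",
            "paragraphs": [
                {
                    "context": "Berlin is the capital of Germany.",
                    "qas": [
                        {
                            "id": "0",
                            "question": "What is the capital of Germany?",
                            # answer_start is the character offset of the
                            # answer text within the context.
                            "answers": [{"text": "Berlin", "answer_start": 0}],
                        }
                    ],
                }
            ],
        }
    ]
}

# Write the file that add_eval_data expects.
with open("eval_data.json", "w") as f:
    json.dump(eval_data, f)
```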
We have created a dedicated issue for eval improvements, and your suggestion of adding a SQuAD example also found its way in there: #1561
Is your feature request related to a problem? Please describe.
Recently, I played around with Haystack and its evaluation features. I followed your blog post https://www.deepset.ai/blog/how-to-evaluate-a-question-answering-system and experimented with different evaluation scenarios. I then realized that the evaluation method is quite slow; for me, it also caused issues in Colab notebooks because tqdm progress bars were printed for each query. Another thing I didn't like is that you can only measure recall at one k at a time. If you want a chart of recall at different k's, you have to run the evaluation multiple times, even though they could all be derived from a single run (see the sketch below).
To quantify it a bit more:
Evaluating a sentence-transformers model on the German DPR test set took about 8 minutes on a P100 GPU, whereas evaluating the same model in the sentence-transformers library took about 20 seconds.
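As a side note on the multiple-k point: once results are retrieved with the largest k of interest, recall at every smaller k can be read off the same ranked list without re-running retrieval. A minimal sketch of that idea (a hypothetical helper, not part of Haystack):

```python
from typing import Dict, List, Set

def recall_at_ks(ranked_doc_ids: List[str], relevant_ids: Set[str],
                 ks: List[int]) -> Dict[int, float]:
    """Compute recall@k for several k values from one ranked result list."""
    recalls = {}
    for k in ks:
        # Count how many relevant documents appear in the top k results.
        hits = len(set(ranked_doc_ids[:k]) & relevant_ids)
        recalls[k] = hits / len(relevant_ids) if relevant_ids else 0.0
    return recalls

# Retrieve once with k = max(ks), then evaluate all smaller k's for free:
# recall_at_ks(["d3", "d7", "d1"], {"d1", "d9"}, ks=[1, 3, 5])
# -> {1: 0.0, 3: 0.5, 5: 0.5}
```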
Describe the solution you'd like
I know that there is much more overhead compared to a library like sentence-transformers, because Haystack has to use a real retrieval pipeline including a document store such as Elasticsearch. What I would like is for the eval method in the retriever (at least for dense retrievers; see haystack/haystack/retriever/base.py, line 49 at 3539e6b) to call a bulk_retrieve or batch_retrieve
method on the retriever. This bulk retrieval method could at least embed all the queries at once instead of embedding them one by one; a sketch of the idea follows below. Ideally, the document store would also be queried in batches, although I realize that this might add too much complexity when accounting for different document stores. Still, embedding all queries at once (or in batches of queries, to avoid OOM errors) would make the evaluation much faster.
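To make the proposal concrete, here is a minimal sketch of what such a method could look like. This is not Haystack's actual API: batch_retrieve is the hypothetical method suggested above, and the sketch assumes the retriever exposes an embed_queries(texts) method returning one embedding per query, plus a document store with a query_by_embedding method:

```python
from typing import List, Tuple

def batch_retrieve(retriever, queries: List[str], top_k: int = 10,
                   batch_size: int = 32) -> List[Tuple[str, list]]:
    """Hypothetical batch method: embed queries in batches, then query the store."""
    all_results = []
    for start in range(0, len(queries), batch_size):
        batch = queries[start:start + batch_size]
        # One forward pass per batch instead of one per query; batching
        # (rather than embedding everything at once) also avoids OOM errors.
        embeddings = retriever.embed_queries(texts=batch)
        for query, emb in zip(batch, embeddings):
            docs = retriever.document_store.query_by_embedding(query_emb=emb,
                                                               top_k=top_k)
            all_results.append((query, docs))
    return all_results
```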
Describe alternatives you've considered
None
Additional context
This is just a feature that I think would be nice to have. It's not really urgent or anything like that.