bug: make ElasticSearchDocumentStore use batch_size in get_documents_by_id #3166
masci merged 5 commits into deepset-ai:main from anakin87:batch_size_in_ElasticSearchDocumentStore

Conversation
masci left a comment:
I have a feeling this is going to be slower because of the network roundtrips the ES and OS clients would do. Out of curiosity, did you test this in a way where the search is actually performed multiple times?
If we confirm it's slower, I would consider either not implementing this "by choice" or putting a caveat in the documentation.
anakin87:
Honestly, I saw the issue and simply submitted this PR. However, I understand and share your point of view. We can decide not to implement this behavior, or add a caveat to the documentation and raise a warning for the user. @masci please let me know if, in your opinion, it is worth making some tests to evaluate the retrieval times for different values of batch_size.
masci:
I think it's worth testing your branch to get a sense of the performance penalty, in order to make an informed decision. I don't have much bandwidth now but I'll try it out.
anakin87:
I made some tests on my branch (you can find them in this Colab notebook). I used ~17k short documents and tried several values of batch_size, measuring the retrieval times (a sketch of the kind of test is shown below).
Even if the tests are very crude, as expected they show that retrieval gets slower as batch_size gets smaller. I see two alternative possibilities:

1. not implement this behavior;
2. implement it, but add a caveat to the documentation and raise a warning for the user.

@masci WDYT?
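A minimal sketch of this kind of timing test (illustrative only; it assumes `document_store` is an ElasticsearchDocumentStore that already contains the documents and `all_ids` is the list of their ids, neither of which appears in the PR itself):

```python
import time

# Hypothetical timing loop; `document_store` and `all_ids` are assumed to exist already.
for batch_size in (100, 1_000, 10_000):
    start = time.perf_counter()
    docs = document_store.get_documents_by_id(ids=all_ids, batch_size=batch_size)
    elapsed = time.perf_counter() - start
    print(f"batch_size={batch_size}: fetched {len(docs)} documents in {elapsed:.2f}s")
```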
masci:
@anakin87 I've been thinking about a use case for this feature that's not about speed, and I found one: it would be useful to avoid sending the cluster requests that are too big for it to handle. In this case the performance penalty is a price users are willing to pay in order to reduce pressure on the cluster. Let's go with option number 2 then; I would just add a warning note in the docstrings, no need to emit warnings IMO.
anakin87:
After some usual git mess 😄, |
masci left a comment:
LGTM, waiting for the docs team to have a look at wording before merging
anakin87:
@masci could you request a review from the docs team?
The inline review comments below refer to these docstring lines of get_documents_by_id:

    Fetch documents by specifying a list of text id strings.

    :param ids: list of document IDs. Be aware that passing a large number of ids might lead
        to performance issues. Note that Elasticsearch limits the number of results to 10,000 documents by default.
masci:
Let's capitalize the beginning of argument descriptions (i.e. "List" instead of "list"). Can we give the user a sense of what a large number of ids is? Is 10K ok? 100K? Does it depend on how much is already indexed?
anakin87:
Honestly I do not know.
This passage was already part of the original docstring.
In my tests, I retrieved 17k documents with no particular issue.
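For illustration, a hedged usage sketch of what fetching a large id list with an explicit batch_size could look like (it assumes a local Elasticsearch instance, an already populated "document" index, and a pre-built `ids_to_fetch` list; none of these come from the PR itself):

```python
from haystack.document_stores import ElasticsearchDocumentStore

# Assumes Elasticsearch is running locally and the "document" index is already populated.
document_store = ElasticsearchDocumentStore(host="localhost", index="document")

# `ids_to_fetch` is assumed to be a large list of document ids (e.g. ~17k entries).
# A batch_size below Elasticsearch's default 10,000-results window keeps every
# individual request within that limit instead of issuing one oversized query.
docs = document_store.get_documents_by_id(ids=ids_to_fetch, batch_size=5_000)
print(f"Retrieved {len(docs)} documents")
```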
anakin87:
@masci @ZanSara As you can see in the logs, it seems that the CI is failing for a problem similar to the one addressed in #3199.
Related Issues

ElasticSearchDocumentStore does not use batch_size in get_documents_by_id #3153

Proposed Changes:

The batch_size parameter wasn't used in get_documents_by_id. Now the method uses batch_size, making several queries based on this parameter. The implementation is inspired by SQLDocumentStore; a rough sketch of the idea is shown below.
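This is not the actual diff, just a simplified illustration of the batching idea; the free-function form, the `_chunks` helper, and the elasticsearch-py-style `client.search` call are assumptions for the sake of the example:

```python
# Simplified sketch of batching ids in get_documents_by_id (illustrative, not the real Haystack code).
from typing import Generator, List


def _chunks(ids: List[str], batch_size: int) -> Generator[List[str], None, None]:
    """Yield successive slices of `ids` with at most `batch_size` elements."""
    for i in range(0, len(ids), batch_size):
        yield ids[i : i + batch_size]


def get_documents_by_id(client, index: str, ids: List[str], batch_size: int = 10_000) -> List[dict]:
    """Fetch documents with one search request per batch of ids."""
    documents: List[dict] = []
    for id_batch in _chunks(ids, batch_size):
        # Keeping each request at `batch_size` ids avoids hitting the default
        # 10,000-results window with a single oversized query.
        body = {"query": {"ids": {"values": id_batch}}, "size": len(id_batch)}
        response = client.search(index=index, body=body)
        documents.extend(hit["_source"] for hit in response["hits"]["hits"])
    return documents
```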
How did you test it?
Manual verification
Checklist