bug: make ElasticSearchDocumentStore use batch_size in get_documents_by_id #3166
masci merged 5 commits into deepset-ai:main from anakin87:batch_size_in_ElasticSearchDocumentStore

Conversation
masci left a comment:
I have a feeling this is going to be slower because of the network roundtrips the ES and OS clients would do. Out of curiosity, did you test this in a way where the search is actually performed multiple times?
If we confirm it's slower, I would consider either not implementing this "by choice" or putting a caveat in the documentation.
anakin87:
Honestly, I saw the issue and simply submitted this PR. However, I understand and share your point of view. We can decide not to implement this behavior, or add a caveat to the documentation and raise a warning for the user. @masci please let me know if, in your opinion, it is worth making some tests to evaluate the retrieval times for different values of batch_size.
masci:
I think it's worth testing your branch to get a sense of the performance penalty, in order to make an informed decision. I don't have much bandwidth now but I'll try it out.
anakin87:
I made some tests on my branch (you can find them in this Colab notebook). I used ~17k short documents and tried several values of batch_size, measuring the retrieval times (a sketch of the kind of test is shown below).
Even if the tests are very crude, as expected they show that retrieval gets slower as batch_size gets smaller. I see two alternative possibilities:

1. not implement this behavior;
2. implement it, but add a caveat to the documentation and raise a warning for the user.

@masci WDYT?
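A minimal sketch of this kind of timing test (illustrative only; it assumes `document_store` is an ElasticsearchDocumentStore that already contains the documents and `all_ids` is the list of their ids, neither of which appears in the PR itself):

```python
import time

# Hypothetical timing loop; `document_store` and `all_ids` are assumed to exist already.
for batch_size in (100, 1_000, 10_000):
    start = time.perf_counter()
    docs = document_store.get_documents_by_id(ids=all_ids, batch_size=batch_size)
    elapsed = time.perf_counter() - start
    print(f"batch_size={batch_size}: fetched {len(docs)} documents in {elapsed:.2f}s")
```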
masci:
@anakin87 I've been thinking about a use case for this feature that's not about speed, and I found one: it would be useful to avoid sending the cluster requests that are too big for it to handle. In this case the performance penalty is a price users are willing to pay in order to reduce pressure on the cluster. Let's go with option number 2 then; I would just add a warning note in the docstrings, no need to emit warnings IMO.
anakin87:
After some usual git mess 😄, |
masci left a comment:
LGTM, waiting for the docs team to have a look at wording before merging
anakin87:
@masci could you request a review from the docs team?
The inline review comments below refer to these docstring lines of get_documents_by_id:

    Fetch documents by specifying a list of text id strings.

    :param ids: list of document IDs. Be aware that passing a large number of ids might lead
        to performance issues. Note that Elasticsearch limits the number of results to 10,000 documents by default.
masci:
Let's capitalize the beginning of argument descriptions (i.e. "List" instead of "list"). Can we give the user a sense of what a large number of ids is? Is 10K ok? 100K? Does it depend on how much is already indexed?
anakin87:
Honestly I do not know.
This passage was already part of the original docstring.
In my tests, I retrieved 17k documents with no particular issue.
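For illustration, a hedged usage sketch of what fetching a large id list with an explicit batch_size could look like (it assumes a local Elasticsearch instance, an already populated "document" index, and a pre-built `ids_to_fetch` list; none of these come from the PR itself):

```python
from haystack.document_stores import ElasticsearchDocumentStore

# Assumes Elasticsearch is running locally and the "document" index is already populated.
document_store = ElasticsearchDocumentStore(host="localhost", index="document")

# `ids_to_fetch` is assumed to be a large list of document ids (e.g. ~17k entries).
# A batch_size below Elasticsearch's default 10,000-results window keeps every
# individual request within that limit instead of issuing one oversized query.
docs = document_store.get_documents_by_id(ids=ids_to_fetch, batch_size=5_000)
print(f"Retrieved {len(docs)} documents")
```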
anakin87:
@masci @ZanSara As you can see in the logs, it seems that the CI is failing for a problem similar to the one addressed in #3199.
Related Issues

ElasticSearchDocumentStore does not use batch_size in get_documents_by_id #3153

Proposed Changes:

The batch_size parameter wasn't used in get_documents_by_id. Now the method uses batch_size, making several queries based on this parameter. The implementation is inspired by SQLDocumentStore; a rough sketch of the idea is shown below.
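This is not the actual diff, just a simplified illustration of the batching idea; the free-function form, the `_chunks` helper, and the elasticsearch-py-style `client.search` call are assumptions for the sake of the example:

```python
# Simplified sketch of batching ids in get_documents_by_id (illustrative, not the real Haystack code).
from typing import Generator, List


def _chunks(ids: List[str], batch_size: int) -> Generator[List[str], None, None]:
    """Yield successive slices of `ids` with at most `batch_size` elements."""
    for i in range(0, len(ids), batch_size):
        yield ids[i : i + batch_size]


def get_documents_by_id(client, index: str, ids: List[str], batch_size: int = 10_000) -> List[dict]:
    """Fetch documents with one search request per batch of ids."""
    documents: List[dict] = []
    for id_batch in _chunks(ids, batch_size):
        # Keeping each request at `batch_size` ids avoids hitting the default
        # 10,000-results window with a single oversized query.
        body = {"query": {"ids": {"values": id_batch}}, "size": len(id_batch)}
        response = client.search(index=index, body=body)
        documents.extend(hit["_source"] for hit in response["hits"]["hits"])
    return documents
```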
How did you test it?
Manual verification
Checklist