New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speed up query_by_embedding in InMemoryDocumentStore. #2091
Conversation
What is changing? Why? What are limitations? Breaking changes (Example of before vs. after) Link the issue that this relates to |
Hi @baregawi thanks for creating the PR. Looks quite good to me already. I see that one of the test cases is failing: haystack/test/test_retriever.py Line 69 in 7d769d8
|
No worries, @julian-risch. I just committed the fix to that one. |
@julian-risch the failing test should pass now but I'm realizing that it needs a change to achieve the memory safety I mentioned previously. |
functions for the CPU and GPU.
That should it. I am happy with the code now. Let me know if there is anything you want me to change or explain. |
Hi @julian-risch, I am trying to run specific tests locally but keep missing one here and there. How do you all make sure all the tests are run in a reasonable amount of time before you commit? Do you use a GPU machine for the tests? Are the tests network heavy? (my connection is not that great at the moment) In any case, I committed a fix to the most recent failed test. In addition I kicked off all the tests with |
Hello @baregawi, I share your pain with the test suite... It's currently a bit hard to run it locally from start to end. A refactoring is planned soon. In the meantime, I just approved your PR to access our GitHub CI: from now on the CI will run the tests on every commit. As long as the CI is green you don't need to worry about running the tests locally, you can rely on it for a full run of the suite. Of course you should be able to run a subset of the tests (the ones most likely to be affected by your changes) locally as well. Let me know if you have a problem with that too. |
Thank you, @ZanSara ! I got on a GPU AWS instance to run the tests in parallel with much better hardware and network. Only half the tests passed at first. After a bit debugging %90 of the tests passed. And then I realized that was happening due to various missing libraries and servers that looked to be in the Dockerfile. In any case, two questions if you don't mind:
Thank you! |
Hi @baregawi regarding the dependencies required to run the tests you can use Many of our discussions happen here on GitHub. To document also architecture decisions that happened offline, we will start to use ADRs and put them here following this template: https://github.com/deepset-ai/haystack/blob/master/docs/decisions/adr-template.md |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me! 👍 Before we merge this PR, could you please double-check the batch size and the formula in the explanation? I think we should add a factor 4 in the formula. If you have any numbers from your experiments, maybe you could also share what approximate speedup to expect from CPU to GPU?
haystack/document_stores/memory.py
Outdated
@@ -34,6 +37,8 @@ def __init__( | |||
similarity: str = "dot_product", | |||
progress_bar: bool = True, | |||
duplicate_documents: str = 'overwrite', | |||
use_gpu: bool = True, | |||
scoring_batch_size: int = 500000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just double-checking: that requires about 1.5GB memory correct? In that case it's a good default value.
Hey @baregawi, thanks a lot for merging master, I was just about to do it for you 😊 Just one thing: please make sure that on your next commits the entire CI works fine. The documentation bot might fail (we're working on it), but at least Linux CI should run properly. If you notice issues please tag me, so that I'll be aware of the issue |
@ZanSara Got it! Thank you for the heads up. So I think this change will be more useful for batch APIs then for individual calls. |
8590a6f
to
0814e57
Compare
Hi @julian-risch. Do you think you would have time to take a second look at this before this weekend? It would be much appreciated! =) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks very good to me know. Thank you for your contribution to Haystack and also for going the extra mile and doing some speed comparisons.
@julian-risch It is my pleasure. My hope is to gain enough trust to build your batch APIs. =) |
Proposed changes:
Status (please check what you already did):