
Speed up query_by_embedding in InMemoryDocumentStore. #2091

Merged
merged 9 commits into deepset-ai:master on Feb 4, 2022

Conversation

@baregawi (Contributor) commented Jan 31, 2022

Proposed changes:

Status (please check what you already did):

  • First draft (up for discussions & feedback)
  • Final code
  • Added tests
  • Updated documentation

@baregawi (Contributor Author)

What is changing?
The query_by_embedding function in InMemoryDocumentStore is changed to score documents in batches and to use a GPU when one is available, which makes it significantly faster.

Why?
Running a for loop over the documents and computing one dot product per iteration on the CPU is very inefficient. The scoring should be batched and should use the GPU when available; a minimal sketch of the batched approach is below.
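(A hypothetical illustration of the batching idea, not the code in this PR; the function name and shapes are made up for the example, and numpy/torch are assumed to be installed.)

```python
# Rough sketch of batched, GPU-aware scoring (illustrative only, not the PR code).
import numpy as np
import torch

def score_by_embedding(query_emb: np.ndarray, doc_embs: np.ndarray,
                       batch_size: int = 500_000) -> np.ndarray:
    """Dot-product scores of one query embedding against all document embeddings."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    query = torch.from_numpy(query_emb).float().to(device)            # shape: (dim,)
    scores = []
    for start in range(0, len(doc_embs), batch_size):
        # One matrix-vector product scores a whole batch of documents at once.
        batch = torch.from_numpy(doc_embs[start:start + batch_size]).float().to(device)
        scores.append(batch @ query)                                   # shape: (batch,)
    return torch.cat(scores).cpu().numpy()
```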

What are limitations?
The availability of a GPU.

Breaking changes (Example of before vs. after)
None. The change is transparent to all code outside of query_by_embedding.

Link the issue that this relates to
No issue on GitHub. I just tried it the other day and felt that it was unreasonably slow.

@julian-risch (Member)

Hi @baregawi, thanks for creating the PR. Looks quite good to me already. I see that one of the test cases is failing:

def test_retrieval(retriever_with_docs, document_store_with_docs):
I can have a look at it in more detail, or you could explore it yourself.

@baregawi (Contributor Author)

No worries, @julian-risch. I just committed the fix to that one.

@baregawi (Contributor Author)

@julian-risch the failing test should pass now, but I'm realizing that it needs a change to achieve the memory safety I mentioned previously.

@CLAassistant commented Jan 31, 2022

CLA assistant check
All committers have signed the CLA.

@baregawi (Contributor Author)

That should do it. I am happy with the code now. Let me know if there is anything you want me to change or explain.

@baregawi (Contributor Author) commented Feb 1, 2022

Hi @julian-risch, I am trying to run specific tests locally but keep missing one here and there. How do you all make sure all the tests are run in a reasonable amount of time before you commit? Do you use a GPU machine for the tests? Are the tests network heavy? (my connection is not that great at the moment)

In any case, I committed a fix to the most recent failed test. In addition, I kicked off all the tests with --document_store_type="memory". I installed pytest-xdist to run the tests in parallel, but they are still running slowly on my laptop and I don't know how long they will take to finish.

@ZanSara (Contributor) commented Feb 1, 2022

Hello @baregawi, I share your pain with the test suite... It's currently a bit hard to run it locally from start to end. A refactoring is planned soon. In the meantime, I just approved your PR to access our GitHub CI: from now on the CI will run the tests on every commit. As long as the CI is green you don't need to worry about running the tests locally, you can rely on it for a full run of the suite. Of course you should be able to run a subset of the tests (the ones most likely to be affected by your changes) locally as well. Let me know if you have a problem with that too.

@baregawi (Contributor Author) commented Feb 1, 2022

Thank you, @ZanSara! I got on a GPU AWS instance to run the tests in parallel with much better hardware and network. Only half the tests passed at first. After a bit of debugging, 90% of the tests passed. Then I realized the remaining failures were due to various missing libraries and servers that looked to be provided by the Dockerfile.

In any case, two questions if you don't mind:

  1. Where can I find documentation on how to use your docker setup? I'm familiar with Docker. Is it just docker build followed by docker-compose up?

  2. Is there a place where the discussions about refactoring are happening? I feel like that would be a good opportunity to understand the design.

Thank you!

@julian-risch (Member)

Hi @baregawi, regarding the dependencies required to run the tests, you can use pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[test]. For the tests, we use various document stores with the help of Docker containers. These containers will be launched automatically if you have a Docker instance running. A step-by-step guide on how to use our Docker images for a demo can be found here, but that's a slightly different topic: https://github.com/deepset-ai/haystack#beginner-quick-demo You're right that it's just these few steps.

Many of our discussions happen here on GitHub. To also document architecture decisions that happen offline, we will start to use ADRs (Architecture Decision Records) and put them in the repository following this template: https://github.com/deepset-ai/haystack/blob/master/docs/decisions/adr-template.md

@julian-risch (Member) left a comment

Looks good to me! 👍 Before we merge this PR, could you please double-check the batch size and the formula in the explanation? I think we should add a factor of 4 in the formula. If you have any numbers from your experiments, maybe you could also share what approximate speedup to expect from CPU to GPU?

haystack/document_stores/memory.py
@@ -34,6 +37,8 @@ def __init__(
        similarity: str = "dot_product",
        progress_bar: bool = True,
        duplicate_documents: str = 'overwrite',
        use_gpu: bool = True,
        scoring_batch_size: int = 500000
@julian-risch (Member)

Just double-checking: that requires about 1.5 GB of memory, correct? In that case it's a good default value.
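(For reference, a back-of-the-envelope check of that figure; the 768-dimensional float32 embeddings are an assumption, not something stated in the PR.)

```python
# Rough memory estimate for one scoring batch.
docs_per_batch = 500_000
embedding_dim = 768           # depends on the retriever model; 768 is an assumption
bytes_per_value = 4           # float32 -- the "factor 4" mentioned in the review
batch_bytes = docs_per_batch * embedding_dim * bytes_per_value
print(batch_bytes / 1024**3)  # ~1.43 GiB, i.e. roughly 1.5 GB
```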

@ZanSara (Contributor) commented Feb 3, 2022

Hey @baregawi, thanks a lot for merging master, I was just about to do it for you 😊 Just one thing: please make sure that on your next commits the entire CI works fine. The documentation bot might fail (we're working on it), but at least the Linux CI should run properly. If you notice issues, please tag me so that I'm aware of them.

@baregawi (Contributor Author) commented Feb 3, 2022

@ZanSara Got it! Thank you for the heads up.
@julian-risch I made the update and got the dot product numbers for small inputs. When I run one query against just 500K documents, there is only about a 4x difference between the old version and my code (1.1 seconds vs. 3.8 seconds). But when you run the same dot product in batch mode, comparing 80K test samples against 500K documents, you get a 100x+ speedup on my NVIDIA K80. The batch size can really make a big difference, so I kept it at 500K for now.

So I think this change will be more useful for batch APIs than for individual calls.
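(For anyone who wants to reproduce a comparison along these lines, here is a hypothetical micro-benchmark sketch, not the script that produced the numbers above; sizes are kept small so the Python loop finishes quickly.)

```python
# Hypothetical micro-benchmark: per-document dot-product loop vs. one batched matmul.
import time
import numpy as np
import torch

num_docs, num_queries, dim = 50_000, 16, 768
doc_embs = np.random.rand(num_docs, dim).astype(np.float32)
query_embs = np.random.rand(num_queries, dim).astype(np.float32)

start = time.perf_counter()
loop_scores = [[float(np.dot(q, d)) for d in doc_embs] for q in query_embs]
print(f"per-document loop: {time.perf_counter() - start:.2f}s")

device = "cuda" if torch.cuda.is_available() else "cpu"
docs_t = torch.from_numpy(doc_embs).to(device)
queries_t = torch.from_numpy(query_embs).to(device)
start = time.perf_counter()
batched_scores = (queries_t @ docs_t.T).cpu().numpy()   # shape: (num_queries, num_docs)
print(f"batched matmul:    {time.perf_counter() - start:.2f}s")
```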

@baregawi (Contributor Author) commented Feb 4, 2022

Hi @julian-risch. Do you think you would have time to take a second look at this before this weekend? It would be much appreciated! =)

@julian-risch (Member) left a comment

Looks very good to me now. Thank you for your contribution to Haystack and also for going the extra mile and doing some speed comparisons.

@julian-risch merged commit d3c7768 into deepset-ai:master on Feb 4, 2022
@baregawi (Contributor Author) commented Feb 4, 2022

@julian-risch It is my pleasure. My hope is to gain enough trust to build your batch APIs. =)
