
Speed up query_by_embedding in InMemoryDocumentStore. #2091

Merged
merged 9 commits into deepset-ai:master on Feb 4, 2022

Conversation

@baregawi (Contributor) commented Jan 31, 2022

Proposed changes:

Status (please check what you already did):

  • First draft (up for discussions & feedback)
  • Final code
  • Added tests
  • Updated documentation

@baregawi (Contributor Author)

What is changing?
The query_by_embedding function in InMemoryDocumentStore is changed to score documents in batches and to use a GPU when one is available, which makes it significantly faster.

Why?
Running a for loop over the documents and computing one dot product per iteration on the CPU is very inefficient. The scoring should be batched and should use the GPU when available; a minimal sketch of the batched approach is below.
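(A hypothetical illustration of the batching idea, not the code in this PR; the function name and shapes are made up for the example, and numpy/torch are assumed to be installed.)

```python
# Rough sketch of batched, GPU-aware scoring (illustrative only, not the PR code).
import numpy as np
import torch

def score_by_embedding(query_emb: np.ndarray, doc_embs: np.ndarray,
                       batch_size: int = 500_000) -> np.ndarray:
    """Dot-product scores of one query embedding against all document embeddings."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    query = torch.from_numpy(query_emb).float().to(device)            # shape: (dim,)
    scores = []
    for start in range(0, len(doc_embs), batch_size):
        # One matrix-vector product scores a whole batch of documents at once.
        batch = torch.from_numpy(doc_embs[start:start + batch_size]).float().to(device)
        scores.append(batch @ query)                                   # shape: (batch,)
    return torch.cat(scores).cpu().numpy()
```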

What are limitations?
The availability of a GPU.

Breaking changes (Example of before vs. after)
None. The change is transparent to all code outside of query_by_embedding.

Link the issue that this relates to
No issue on GitHub. I just tried it the other day and felt that it was unreasonably slow.

@julian-risch (Member)

Hi @baregawi, thanks for creating the PR. Looks quite good to me already. I see that one of the test cases is failing:

def test_retrieval(retriever_with_docs, document_store_with_docs):
I can have a look at it in more detail, or you could explore it yourself.

@baregawi (Contributor Author)

No worries, @julian-risch. I just committed the fix to that one.

@baregawi (Contributor Author)

@julian-risch the failing test should pass now, but I'm realizing that it needs a change to achieve the memory safety I mentioned previously.

@CLAassistant commented Jan 31, 2022

CLA assistant check
All committers have signed the CLA.

@baregawi (Contributor Author)

That should do it. I am happy with the code now. Let me know if there is anything you want me to change or explain.

@baregawi (Contributor Author) commented Feb 1, 2022

Hi @julian-risch, I am trying to run specific tests locally but keep missing one here and there. How do you all make sure all the tests are run in a reasonable amount of time before you commit? Do you use a GPU machine for the tests? Are the tests network heavy? (my connection is not that great at the moment)

In any case, I committed a fix to the most recent failed test. In addition, I kicked off all the tests with --document_store_type="memory". I installed pytest-xdist to run the tests in parallel, but they are still running slowly on my laptop and I don't know how long they will take to finish.

@ZanSara (Contributor) commented Feb 1, 2022

Hello @baregawi, I share your pain with the test suite... It's currently a bit hard to run it locally from start to end. A refactoring is planned soon. In the meantime, I just approved your PR to access our GitHub CI: from now on the CI will run the tests on every commit. As long as the CI is green you don't need to worry about running the tests locally, you can rely on it for a full run of the suite. Of course you should be able to run a subset of the tests (the ones most likely to be affected by your changes) locally as well. Let me know if you have a problem with that too.

@baregawi (Contributor Author) commented Feb 1, 2022

Thank you, @ZanSara! I got on a GPU AWS instance to run the tests in parallel with much better hardware and network. Only half the tests passed at first. After a bit of debugging, 90% of the tests passed. Then I realized the remaining failures were due to various missing libraries and servers that looked to be provided by the Dockerfile.

In any case, two questions if you don't mind:

  1. Where can I find documentation on how to use your docker setup? I'm familiar with Docker. Is it just docker build followed by docker-compose up?

  2. Is there a place where the discussions about refactoring are happening? I feel like that would be a good opportunity to understand the design.

Thank you!

@julian-risch (Member)

Hi @baregawi, regarding the dependencies required to run the tests, you can use pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[test]. For the tests, we use various document stores with the help of Docker containers. These containers will be launched automatically if you have a Docker instance running. A step-by-step guide on how to use our Docker images for a demo can be found here, but that's a slightly different topic: https://github.com/deepset-ai/haystack#beginner-quick-demo You're right that it's just these few steps.

Many of our discussions happen here on GitHub. To also document architecture decisions that happen offline, we will start to use ADRs (Architecture Decision Records) and put them in the repository following this template: https://github.com/deepset-ai/haystack/blob/master/docs/decisions/adr-template.md

@julian-risch (Member) left a comment

Looks good to me! 👍 Before we merge this PR, could you please double-check the batch size and the formula in the explanation? I think we should add a factor of 4 in the formula. If you have any numbers from your experiments, maybe you could also share what approximate speedup to expect from CPU to GPU?

haystack/document_stores/memory.py
@@ -34,6 +37,8 @@ def __init__(
        similarity: str = "dot_product",
        progress_bar: bool = True,
        duplicate_documents: str = 'overwrite',
        use_gpu: bool = True,
        scoring_batch_size: int = 500000
@julian-risch (Member)

Just double-checking: that requires about 1.5 GB of memory, correct? In that case it's a good default value.
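(For reference, a back-of-the-envelope check of that figure; the 768-dimensional float32 embeddings are an assumption, not something stated in the PR.)

```python
# Rough memory estimate for one scoring batch.
docs_per_batch = 500_000
embedding_dim = 768           # depends on the retriever model; 768 is an assumption
bytes_per_value = 4           # float32 -- the "factor 4" mentioned in the review
batch_bytes = docs_per_batch * embedding_dim * bytes_per_value
print(batch_bytes / 1024**3)  # ~1.43 GiB, i.e. roughly 1.5 GB
```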

@ZanSara (Contributor) commented Feb 3, 2022

Hey @baregawi, thanks a lot for merging master, I was just about to do it for you 😊 Just one thing: please make sure that on your next commits the entire CI works fine. The documentation bot might fail (we're working on it), but at least the Linux CI should run properly. If you notice issues, please tag me so that I'm aware of them.

@baregawi (Contributor Author) commented Feb 3, 2022

@ZanSara Got it! Thank you for the heads up.
@julian-risch I made the update and got the dot product numbers for small inputs. When I run one query against just 500K documents, there is only about a 4x difference between the old version and my code (1.1 seconds vs. 3.8 seconds). But when you run the same dot product in batch mode, comparing 80K test samples against 500K documents, you get a 100x+ speedup on my NVIDIA K80. The batch size can really make a big difference, so I kept it at 500K for now.

So I think this change will be more useful for batch APIs than for individual calls.
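(For anyone who wants to reproduce a comparison along these lines, here is a hypothetical micro-benchmark sketch, not the script that produced the numbers above; sizes are kept small so the Python loop finishes quickly.)

```python
# Hypothetical micro-benchmark: per-document dot-product loop vs. one batched matmul.
import time
import numpy as np
import torch

num_docs, num_queries, dim = 50_000, 16, 768
doc_embs = np.random.rand(num_docs, dim).astype(np.float32)
query_embs = np.random.rand(num_queries, dim).astype(np.float32)

start = time.perf_counter()
loop_scores = [[float(np.dot(q, d)) for d in doc_embs] for q in query_embs]
print(f"per-document loop: {time.perf_counter() - start:.2f}s")

device = "cuda" if torch.cuda.is_available() else "cpu"
docs_t = torch.from_numpy(doc_embs).to(device)
queries_t = torch.from_numpy(query_embs).to(device)
start = time.perf_counter()
batched_scores = (queries_t @ docs_t.T).cpu().numpy()   # shape: (num_queries, num_docs)
print(f"batched matmul:    {time.perf_counter() - start:.2f}s")
```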

@baregawi (Contributor Author) commented Feb 4, 2022

Hi @julian-risch. Do you think you would have time to take a second look at this before this weekend? It would be much appreciated! =)

@julian-risch (Member) left a comment

Looks very good to me now. Thank you for your contribution to Haystack and also for going the extra mile and doing some speed comparisons.

@julian-risch merged commit d3c7768 into deepset-ai:master on Feb 4, 2022
@baregawi (Contributor Author) commented Feb 4, 2022

@julian-risch It is my pleasure. My hope is to gain enough trust to build your batch APIs. =)
