Benchmark milvus #850

Merged
merged 19 commits on Apr 13, 2021

Conversation

brandenchan
Contributor

Benchmark Milvus to see how fast it is compared to other DocumentStores

brandenchan self-assigned this Feb 19, 2021
@brandenchan
Contributor Author

While running benchmarks, we found that #812 was causing a noticeable slowdown in SQL speed. As a result, the benchmarks we ran showed FAISS and Milvus HNSW speeds to be very comparable to FAISS and Milvus Flat. To test whether SQL had slowed down, we ran retriever_query benchmarks with DPR-FAISS_HNSW. We commented out lines in FAISSDocumentStore.query_by_embedding so that it doesn't engage SQL and instead returns vector_ids_for_query directly. We also replaced BaseRetriever.eval() with this code:

        import time
        import random

        # Pick a random benchmark query and embed it with the retriever
        query = random.choice(["What is the capital of Sweden?", "What nationality is Michael Jackson?", "Where is the headquarters of Apple?", "When was the Second World War?"])
        query_embed = self.embed_queries([query])

        # Time the pure FAISS vector search
        tic1 = time.time()
        ten_ids = self.document_store.query_by_embedding(query_embed, index=str(doc_index))
        toc1 = time.time()
        print("FAISS time: " + str(toc1 - tic1))

        # Time the SQL lookup for the returned vector ids
        tic = time.time()
        self.document_store.get_documents_by_vector_ids(ten_ids, index="eval_document")
        toc = time.time()
        print("SQL time:  " + str(toc - tic))

        # Abort the eval run; only the timings above matter
        raise Exception

The results are as follows:

**Commit e91518ee00f2cab0fc10c4741c775d2fb5a4a1cf (benchmarks show fast query speed)**

| Docs | FAISS time (s) | SQL time (s) |
| --- | --- | --- |
| 10K | 0.0027124881744384766 | 0.0073299407958984375 |
| 100K | 0.00408625602722168 | 0.005889892578125 |

**Commit fd5c5dd23c28f30718892b39fc78dcbe2f75f776 (benchmarks show slow query speed)**

| Docs | FAISS time (s) | SQL time (s) |
| --- | --- | --- |
| 10K | 0.004038810729980469 | 0.039757728576660156 |
| 100K | 0.004107952117919922 | 0.08191871643066406 |

SQL takes an order of magnitude longer in the second commit.

@lalitpagaria
Contributor

I think the issue is mainly that the vector_ids filtering on the following line runs over the whole set of records in SQL:
https://github.com/deepset-ai/haystack/blob/master/haystack/document_store/sql.py#L227

Previously it used to apply the filter beforehand:

        for i in range(0, len(vector_ids), batch_size):
            query = self.session.query(DocumentORM).filter(
                DocumentORM.vector_id.in_(vector_ids[i: i + batch_size]),
                DocumentORM.index == index
            )
            for row in query.all():
                documents.append(self._convert_sql_row_to_document(row))

So even for a single vector id, it scans the whole DB twice and does not leverage the vector_id index.
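
To make the point concrete, here is a minimal, self-contained SQLAlchemy sketch (not Haystack's actual ORM; the table, column names and batch size are illustrative) contrasting a batched filter that the vector_id index can serve against a query that fetches every row of the index and filters afterwards:

    from sqlalchemy import Column, Integer, String, create_engine
    from sqlalchemy.orm import declarative_base, sessionmaker

    Base = declarative_base()

    class DocumentORM(Base):
        __tablename__ = "document"
        id = Column(Integer, primary_key=True)
        index = Column(String(100), index=True)
        vector_id = Column(String(100), index=True)  # indexed, so IN (...) lookups can use it
        text = Column(String(1000))

    engine = create_engine("sqlite:///:memory:")
    Base.metadata.create_all(engine)
    session = sessionmaker(bind=engine)()

    vector_ids = ["0", "1", "2"]
    batch_size = 10_000

    # Index-friendly pattern: push the vector_id filter (in batches) into the WHERE clause,
    # so the database can answer it via the vector_id index.
    documents = []
    for i in range(0, len(vector_ids), batch_size):
        query = session.query(DocumentORM).filter(
            DocumentORM.vector_id.in_(vector_ids[i: i + batch_size]),
            DocumentORM.index == "eval_document",
        )
        documents.extend(query.all())

    # Scan-everything pattern (what the criticism is about): select every row of the
    # index and only discard non-matching vector_ids afterwards, on the Python side.
    wanted = set(vector_ids)
    documents_slow = [
        row
        for row in session.query(DocumentORM).filter(DocumentORM.index == "eval_document")
        if row.vector_id in wanted
    ]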

@tholor
Member

tholor commented Mar 1, 2021

@lalitpagaria Thanks a lot for the pointer! We are currently busy with some other topics but will come back to this in the next sprint (cc @tanaysoni).

@brandenchan
Contributor Author

Looked into this with the help of @oryx1729. We were able to isolate the slowdown to SQLDocumentStore.get_documents_by_vector_ids(). When this

    def get_documents_by_vector_ids(
        self,
        vector_ids: List[str],
        index: Optional[str] = None,
        batch_size: int = 10_000
    ):
        """
        Fetch documents by specifying a list of text vector id strings

        :param vector_ids: List of vector_id strings.
        :param index: Name of the index to get the documents from. If None, the
                      DocumentStore's default index (self.index) will be used.
        :param batch_size: When working with large number of documents, batching can help reduce memory footprint.
        """

        result = self._query(
            index=index,
            vector_ids=vector_ids,
            batch_size=batch_size
        )
        documents = list(result)
        sorted_documents = sorted(documents, key=lambda doc: vector_ids.index(doc.meta["vector_id"]))
        return sorted_documents

is replaced with a version of the method from the earlier, faster commit,

    def get_documents_by_vector_ids(self, vector_ids: List[str], index: Optional[str] = None, batch_size: int = 10_000):
        """Fetch documents by specifying a list of text vector id strings"""
        index = index or self.index

        documents = []
        for i in range(0, len(vector_ids), batch_size):
            query = self.session.query(DocumentORM).filter(
                DocumentORM.vector_id.in_(vector_ids[i: i + batch_size]),
                DocumentORM.index == index
            )
            for row in query.all():
                documents.append(self._convert_sql_row_to_document(row))

        sorted_documents = sorted(documents, key=lambda doc: vector_ids.index(doc.meta["vector_id"]))
        return sorted_documents

performance improves again.

While this finding unblocks the ability to benchmark Milvus, this change alone cannot be committed yet since it is not yet clear whether it impacts other parts of our code.
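
Independent of the fix, a quick way to sanity-check that an IN (...) filter on an indexed vector_id column is served by the index rather than a full table scan is to ask SQLite for its query plan. This is a standalone sketch (plain sqlite3 with an illustrative schema, not Haystack code):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE document (id INTEGER PRIMARY KEY, vector_id TEXT, doc_index TEXT)")
    conn.execute("CREATE INDEX ix_document_vector_id ON document (vector_id)")

    # EXPLAIN QUERY PLAN reports whether SQLite will SEARCH via an index or SCAN the table
    plan = conn.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM document "
        "WHERE vector_id IN ('1', '2', '3') AND doc_index = 'eval_document'"
    ).fetchall()
    for row in plan:
        print(row)  # expect something like: SEARCH document USING INDEX ix_document_vector_id (vector_id=?)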

@brandenchan
Contributor Author

After benchmarking with the change mentioned above, FAISS speeds are back to what is expected. Milvus HNSW is only significantly faster than Milvus Flat at 500k docs and, in general, it is still slower than FAISS HNSW. More needs to be done to figure out what is wrong here.

@Timoeller
Contributor

Timoeller left a comment

LGs
