Benchmark milvus #850

Merged
merged 19 commits on Apr 13, 2021

Conversation

brandenchan
Contributor

Benchmark Milvus to see how fast it is compared to other DocumentStores

brandenchan self-assigned this Feb 19, 2021
@brandenchan
Contributor Author

While running benchmarks, we found that #812 was causing a noticeable slowdown in SQL speed. As a result, the benchmarks we ran showed FAISS and Milvus HNSW speeds to be very comparable to FAISS and Milvus Flat. To test whether SQL had slowed down, we ran retriever_query benchmarks with DPR-FAISS_HNSW. We commented out lines in FAISSDocumentStore.query_by_embedding so that it doesn't engage SQL and instead returns vector_ids_for_query directly. We also replaced BaseRetriever.eval() with this code:

        import time
        import random

        # Pick a random benchmark query and embed it with the retriever
        query = random.choice(["What is the capital of Sweden?", "What nationality is Michael Jackson?", "Where is the headquarters of Apple?", "When was the Second World War?"])
        query_embed = self.embed_queries([query])

        # Time the pure FAISS vector search
        tic1 = time.time()
        ten_ids = self.document_store.query_by_embedding(query_embed, index=str(doc_index))
        toc1 = time.time()
        print("FAISS time: " + str(toc1 - tic1))

        # Time the SQL lookup for the returned vector ids
        tic = time.time()
        self.document_store.get_documents_by_vector_ids(ten_ids, index="eval_document")
        toc = time.time()
        print("SQL time:  " + str(toc - tic))

        # Abort the eval run; only the timings above matter
        raise Exception

The results are as follows:

**Commit e91518ee00f2cab0fc10c4741c775d2fb5a4a1cf (benchmarks show fast query speed)**

| Docs | FAISS time (s) | SQL time (s) |
| --- | --- | --- |
| 10K | 0.0027124881744384766 | 0.0073299407958984375 |
| 100K | 0.00408625602722168 | 0.005889892578125 |

**Commit fd5c5dd23c28f30718892b39fc78dcbe2f75f776 (benchmarks show slow query speed)**

| Docs | FAISS time (s) | SQL time (s) |
| --- | --- | --- |
| 10K | 0.004038810729980469 | 0.039757728576660156 |
| 100K | 0.004107952117919922 | 0.08191871643066406 |

SQL takes an order of magnitude longer in the second commit.

@lalitpagaria
Contributor

I think the issue is mainly that the vector_ids filtering on the following line runs over the whole set of records in SQL:
https://github.com/deepset-ai/haystack/blob/master/haystack/document_store/sql.py#L227

Previously it used to apply the filter beforehand:

        for i in range(0, len(vector_ids), batch_size):
            query = self.session.query(DocumentORM).filter(
                DocumentORM.vector_id.in_(vector_ids[i: i + batch_size]),
                DocumentORM.index == index
            )
            for row in query.all():
                documents.append(self._convert_sql_row_to_document(row))

So even for a single vector id, it scans the whole DB twice and does not leverage the vector_id index.
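
To make the point concrete, here is a minimal, self-contained SQLAlchemy sketch (not Haystack's actual ORM; the table, column names and batch size are illustrative) contrasting a batched filter that the vector_id index can serve against a query that fetches every row of the index and filters afterwards:

    from sqlalchemy import Column, Integer, String, create_engine
    from sqlalchemy.orm import declarative_base, sessionmaker

    Base = declarative_base()

    class DocumentORM(Base):
        __tablename__ = "document"
        id = Column(Integer, primary_key=True)
        index = Column(String(100), index=True)
        vector_id = Column(String(100), index=True)  # indexed, so IN (...) lookups can use it
        text = Column(String(1000))

    engine = create_engine("sqlite:///:memory:")
    Base.metadata.create_all(engine)
    session = sessionmaker(bind=engine)()

    vector_ids = ["0", "1", "2"]
    batch_size = 10_000

    # Index-friendly pattern: push the vector_id filter (in batches) into the WHERE clause,
    # so the database can answer it via the vector_id index.
    documents = []
    for i in range(0, len(vector_ids), batch_size):
        query = session.query(DocumentORM).filter(
            DocumentORM.vector_id.in_(vector_ids[i: i + batch_size]),
            DocumentORM.index == "eval_document",
        )
        documents.extend(query.all())

    # Scan-everything pattern (what the criticism is about): select every row of the
    # index and only discard non-matching vector_ids afterwards, on the Python side.
    wanted = set(vector_ids)
    documents_slow = [
        row
        for row in session.query(DocumentORM).filter(DocumentORM.index == "eval_document")
        if row.vector_id in wanted
    ]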

@tholor
Member

tholor commented Mar 1, 2021

@lalitpagaria Thanks a lot for the pointer! We are currently busy with some other topics but will come back to this in the next sprint (cc @tanaysoni).

@brandenchan
Contributor Author

Looked into this with the help of @oryx1729. We were able to isolate the slowdown to SQLDocumentStore.get_documents_by_vector_ids(). When this

    def get_documents_by_vector_ids(
        self,
        vector_ids: List[str],
        index: Optional[str] = None,
        batch_size: int = 10_000
    ):
        """
        Fetch documents by specifying a list of text vector id strings

        :param vector_ids: List of vector_id strings.
        :param index: Name of the index to get the documents from. If None, the
                      DocumentStore's default index (self.index) will be used.
        :param batch_size: When working with large number of documents, batching can help reduce memory footprint.
        """

        result = self._query(
            index=index,
            vector_ids=vector_ids,
            batch_size=batch_size
        )
        documents = list(result)
        sorted_documents = sorted(documents, key=lambda doc: vector_ids.index(doc.meta["vector_id"]))
        return sorted_documents

is replaced with a version of the method from the earlier, faster commit,

    def get_documents_by_vector_ids(self, vector_ids: List[str], index: Optional[str] = None, batch_size: int = 10_000):
        """Fetch documents by specifying a list of text vector id strings"""
        index = index or self.index

        documents = []
        for i in range(0, len(vector_ids), batch_size):
            query = self.session.query(DocumentORM).filter(
                DocumentORM.vector_id.in_(vector_ids[i: i + batch_size]),
                DocumentORM.index == index
            )
            for row in query.all():
                documents.append(self._convert_sql_row_to_document(row))

        sorted_documents = sorted(documents, key=lambda doc: vector_ids.index(doc.meta["vector_id"]))
        return sorted_documents

performance improves again.

While this finding unblocks the ability to benchmark Milvus, this change alone cannot be committed yet since it is not yet clear whether it impacts other parts of our code.
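
Independent of the fix, a quick way to sanity-check that an IN (...) filter on an indexed vector_id column is served by the index rather than a full table scan is to ask SQLite for its query plan. This is a standalone sketch (plain sqlite3 with an illustrative schema, not Haystack code):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE document (id INTEGER PRIMARY KEY, vector_id TEXT, doc_index TEXT)")
    conn.execute("CREATE INDEX ix_document_vector_id ON document (vector_id)")

    # EXPLAIN QUERY PLAN reports whether SQLite will SEARCH via an index or SCAN the table
    plan = conn.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM document "
        "WHERE vector_id IN ('1', '2', '3') AND doc_index = 'eval_document'"
    ).fetchall()
    for row in plan:
        print(row)  # expect something like: SEARCH document USING INDEX ix_document_vector_id (vector_id=?)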

@brandenchan
Contributor Author

After benchmarking with the change mentioned above, FAISS speeds are back to what is expected. Milvus HNSW is only significantly faster than Milvus Flat at 500k docs and, in general, it is still slower than FAISS HNSW. More needs to be done to figure out what is wrong here.

@Timoeller
Contributor

Timoeller left a comment

LGs
