Benchmark milvus #850
Conversation
While running benchmarks, we found that #812 was causing a noticeable slowdown in SQL speed. As a result, the benchmarks we ran showed FAISS and Milvus HNSW speeds to be very comparable to FAISS and Milvus Flat. To test whether SQL had slowed down, we ran retriever_query benchmarks with DPR-FAISS_HNSW. We commented out lines in
The results are as follows:
SQL takes an order of magnitude longer in the second commit.
I think the issue is mainly that the `vector_ids` filtering on the following line runs over the whole set of records in SQL. Previously, it used to apply the filter beforehand -
So even for a single vector id, it scans the whole DB twice and does not leverage the `vector_id` index.
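To illustrate the point above, here is a minimal sketch (not Haystack code) using the stdlib `sqlite3` module with a made-up `document` table: pushing the `vector_id` filter into the SQL `WHERE` clause lets the database use the index, instead of scanning every row and filtering afterwards.

```python
import sqlite3

# Hypothetical schema for illustration only; Haystack's real ORM model differs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE document (id INTEGER PRIMARY KEY, vector_id TEXT, text TEXT)")
conn.execute("CREATE INDEX idx_vector_id ON document (vector_id)")
conn.executemany(
    "INSERT INTO document (vector_id, text) VALUES (?, ?)",
    [(str(i), f"doc {i}") for i in range(10_000)],
)

wanted = ["42", "7", "9999"]

# Fast path: filter inside SQL so the vector_id index is used.
placeholders = ",".join("?" for _ in wanted)
rows = conn.execute(
    f"SELECT vector_id, text FROM document WHERE vector_id IN ({placeholders})",
    wanted,
).fetchall()

# Restore the caller's requested ordering, as get_documents_by_vector_ids does.
order = {vid: pos for pos, vid in enumerate(wanted)}
rows.sort(key=lambda r: order[r[0]])
print([r[0] for r in rows])  # ['42', '7', '9999']
```

The key difference from the slow path is that only the matching rows ever leave the database, rather than the full table being materialized and filtered in Python.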
@lalitpagaria Thanks a lot for the pointer! We are currently busy with some other topics but will come back to this topic in the next sprint (cc @tanaysoni )
Looked into this with the help of @oryx1729. We could isolate the slowdown to `get_documents_by_vector_ids`. When

```python
def get_documents_by_vector_ids(
    self,
    vector_ids: List[str],
    index: Optional[str] = None,
    batch_size: int = 10_000
):
    """
    Fetch documents by specifying a list of text vector id strings

    :param vector_ids: List of vector_id strings.
    :param index: Name of the index to get the documents from. If None, the
                  DocumentStore's default index (self.index) will be used.
    :param batch_size: When working with a large number of documents, batching can help reduce memory footprint.
    """
    result = self._query(
        index=index,
        vector_ids=vector_ids,
        batch_size=batch_size
    )
    documents = list(result)
    sorted_documents = sorted(documents, key=lambda doc: vector_ids.index(doc.meta["vector_id"]))
    return sorted_documents
```

is replaced with a version of the method from the earlier, faster commit,

```python
def get_documents_by_vector_ids(self, vector_ids: List[str], index: Optional[str] = None, batch_size: int = 10_000):
    """Fetch documents by specifying a list of text vector id strings"""
    index = index or self.index
    documents = []
    for i in range(0, len(vector_ids), batch_size):
        query = self.session.query(DocumentORM).filter(
            DocumentORM.vector_id.in_(vector_ids[i: i + batch_size]),
            DocumentORM.index == index
        )
        for row in query.all():
            documents.append(self._convert_sql_row_to_document(row))
    sorted_documents = sorted(documents, key=lambda doc: vector_ids.index(doc.meta["vector_id"]))
    return sorted_documents
```

performance improves again. While this finding unblocks the ability to benchmark Milvus, this change alone cannot be committed yet, since it is not yet clear whether it impacts other parts of our code.
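As a side note (not from this PR): both versions above restore ordering with `vector_ids.index(...)`, which is O(len(vector_ids)) per document, so the sort is quadratic for large batches. A hedged sketch of a cheaper alternative, using a hypothetical standalone helper and plain dicts in place of Document objects, precomputes a position map so each key lookup is O(1):

```python
from typing import Dict, List


def sort_by_vector_ids(documents: List[dict], vector_ids: List[str]) -> List[dict]:
    # Build the vector_id -> requested-position map once, up front,
    # instead of calling vector_ids.index(...) for every document.
    position: Dict[str, int] = {vid: i for i, vid in enumerate(vector_ids)}
    return sorted(documents, key=lambda doc: position[doc["vector_id"]])


docs = [{"vector_id": "c"}, {"vector_id": "a"}, {"vector_id": "b"}]
print(sort_by_vector_ids(docs, ["a", "b", "c"]))
# [{'vector_id': 'a'}, {'vector_id': 'b'}, {'vector_id': 'c'}]
```

This is orthogonal to the slowdown isolated above, but would matter for the large `batch_size` values these methods are designed for.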
After benchmarking with the change mentioned above, FAISS speeds are back to what is expected. Milvus HNSW is only significantly faster than Milvus Flat at 500k docs and, in general, is still slower than FAISS HNSW. More needs to be done to figure out what is wrong here.
LGs
Benchmark Milvus to see how fast it is compared to other DocumentStores