Remove phi normalization from FAISS & support more index types #467

Merged
merged 11 commits into master on Oct 6, 2020

Conversation

@tholor (Member) commented Oct 6, 2020

As described in #422, we don't need the phi normalization trick anymore, as HNSW now supports the inner product metric. This simplifies the code a lot, especially for multiple subsequent writes of documents.
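For background, the phi normalization trick being removed is the standard reduction of maximum inner product search to L2 search: every document vector gets one extra dimension derived from a corpus-wide maximum squared norm phi. A minimal sketch of that reduction (illustrative names, not the removed Haystack code):

import numpy as np

def augment_for_l2(doc_vectors: np.ndarray, queries: np.ndarray):
    # phi is the largest squared norm over the whole corpus
    norms_sq = (doc_vectors ** 2).sum(axis=1)
    phi = norms_sq.max()
    # Documents get sqrt(phi - ||x||^2) appended and queries get a 0 appended, so
    # ||q' - x'||^2 = ||q||^2 + phi - 2 * (q . x): minimizing L2 distance is then
    # equivalent to maximizing the inner product.
    docs_aug = np.hstack([doc_vectors, np.sqrt(phi - norms_sq)[:, None]])
    queries_aug = np.hstack([queries, np.zeros((queries.shape[0], 1))])
    return docs_aug.astype("float32"), queries_aug.astype("float32")

Because phi is a corpus-wide maximum, later document writes can change it and invalidate already-indexed vectors, and the index has to be created with one extra dimension (the vector_size + 1 in the old _create_new_index). With HNSW supporting faiss.METRIC_INNER_PRODUCT directly, none of this bookkeeping is needed.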

Breaking changes:

vector_size -> vector_dim

Old:
FAISSDocumentStore(...vector_size=768)

New:
FAISSDocumentStore(...vector_dim=768)

Limitations / Future improvements

@tholor tholor self-assigned this Oct 6, 2020
@tholor tholor changed the title from "Remove phi normalization from FAISS & support more index types" to "WIP Remove phi normalization from FAISS & support more index types" Oct 6, 2020
@tholor tholor added the breaking change and type:feature (New feature or request) labels Oct 6, 2020
@tholor tholor changed the title from "WIP Remove phi normalization from FAISS & support more index types" to "Remove phi normalization from FAISS & support more index types" Oct 6, 2020
@tholor tholor requested a review from tanaysoni October 6, 2020 13:05
@tholor tholor merged commit 8edeb84 into master Oct 6, 2020
-    def _create_new_index(self, vector_size: int, index_factory: str = "HNSW4"):
-        index = faiss.index_factory(vector_size + 1, index_factory)
+    def _create_new_index(self, vector_dim: int, index_factory: str = "Flat", metric_type=faiss.METRIC_INNER_PRODUCT, **kwargs):
+        if index_factory == "HNSW" and metric_type == faiss.METRIC_INNER_PRODUCT:
Contributor

Sorry for adding this comment only now, but I am adding it here for more context.

I think it's better to limit FAISS-related internals in this class. Users of this class already have the freedom to pass a self-configured index via the faiss_index and faiss_index_factory_str parameters of the constructor, which keeps the class clean.

Another point: how do we prevent the index.add() call from breaking? With faiss_index_factory_str="IDMap,Flat", the vector add call will fail. This was addressed in #385 as follows:

total_indices = self.size()
ids = np.arange(total_indices, total_indices + len(embeddings), 1, dtype=np.int64)

# An IndexIDMap needs explicit ids; a plain index assigns sequential ids itself.
if isinstance(self.faiss_index, IndexIDMap):
    self.faiss_index.add_with_ids(vectors_to_add, ids)
else:
    self.faiss_index.add(vectors_to_add)

What is your view on this?

Member Author

Yes, normally I would agree that we should limit the FAISS internals in this class. However, HNSW Flat with IP does not give the same results when initialized via the index factory; the performance is totally off. See facebookresearch/faiss#1434.
Therefore, we need special-case handling here anyway and do a direct init without the factory. I then took the chance to add "good" defaults for the other params. I believe most users would be overwhelmed by deciding on a proper index type for their use case and will use either just "Flat" or our preconfigured "HNSW" variant. We will also benchmark those two and try to tune the default HNSW config to fit the most common use cases.
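A minimal sketch of what this special-case handling could look like (parameter names and the n_links / efSearch / efConstruction defaults are assumptions, not necessarily the merged code):

import faiss

def _create_new_index(vector_dim: int, index_factory: str = "Flat",
                      metric_type=faiss.METRIC_INNER_PRODUCT, **kwargs):
    if index_factory == "HNSW" and metric_type == faiss.METRIC_INNER_PRODUCT:
        # Direct init instead of going through faiss.index_factory: the factory
        # path currently gives poor results with inner product
        # (see facebookresearch/faiss#1434).
        n_links = kwargs.get("n_links", 128)
        index = faiss.IndexHNSWFlat(vector_dim, n_links, metric_type)
        index.hnsw.efSearch = kwargs.get("ef_search", 20)
        index.hnsw.efConstruction = kwargs.get("ef_construction", 80)
    else:
        index = faiss.index_factory(vector_dim, index_factory, metric_type)
    return index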

Member Author

Regarding the IDMap: what would be the use case here? I think the FAISS id handling is rather something internal that is not interesting for most Haystack users. Therefore, I don't think we need to support it, but maybe I am missing a good use case here 🤔?

Contributor

> Yes, normally I would agree that we should limit the FAISS internals in this class. However, HNSW Flat with IP does not give the same results when initialized via the index factory; the performance is totally off. See facebookresearch/faiss#1434.

Yes, I agree with you. We can move this part to another function to make it more explicit. I am also able to reproduce facebookresearch/faiss#1434, even when compiling fresh from the top of the branch. I suspect an issue related to the default value of metric_type in the struct; they recently added support for the IP metric type for HNSW, so they might have missed it in some places.

Contributor

> Regarding the IDMap: what would be the use case here? I think the FAISS id handling is rather something internal that is not interesting for most Haystack users. Therefore, I don't think we need to support it, but maybe I am missing a good use case here 🤔?

Actually, IDMap has a very good use case: it abstracts the FAISS-internal id behind an external id, so one can update or delete embeddings in the FAISS index store. As far as I understand, embeddings are currently indexed sequentially by their vector_id, so removing one id in between forces an update of the vector_id of all embeddings that come after that record.
So far Haystack only offers an add method for the FAISS document store (Elasticsearch doesn't have this issue). If users want to update or remove a particular document, they need to delete all documents and then add them again. Another option is to mark documents as stale instead of removing them, but that does not prevent a FAISS search from returning a stale document's vector_id. This is what I covered in this comment.
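For illustration, a minimal standalone FAISS sketch (not Haystack code) of how IndexIDMap decouples external ids from the sequential internal ids and enables deletion:

import faiss
import numpy as np

dim = 768
base = faiss.IndexFlatIP(dim)     # flat inner-product index
index = faiss.IndexIDMap(base)    # wrap it so we control the ids ourselves

vectors = np.random.rand(5, dim).astype("float32")
ids = np.array([101, 102, 103, 104, 105], dtype=np.int64)  # external ids
index.add_with_ids(vectors, ids)

# Remove a single embedding by its external id; the remaining ids stay valid,
# so nothing else has to be re-numbered or re-indexed.
index.remove_ids(np.array([103], dtype=np.int64))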

Contributor

@tholor Do you have a use case for IDMap? BTW, Milvus handles these cases better:
https://dzone.com/articles/how-milvus-realizes-the-delete-function

They are now also planning to support multiple storage options: https://docs.google.com/document/d/1iwwLH4Jtm3OXIVb7jFYsfmcbOyX6AWZKaNJAaXC7-cw/edit

document_objects = [Document.from_dict(d) if isinstance(d, dict) else d for d in documents]
embeddings = [doc.embedding for doc in document_objects]
embeddings = np.array(embeddings, dtype="float32")
self.faiss_index.train(embeddings)
Contributor

We need to check self.faiss_index.is_trained before calling train(); otherwise the function will fail.
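A minimal standalone sketch of that guard (index type and sizes are just for illustration):

import faiss
import numpy as np

embeddings = np.random.rand(1024, 768).astype("float32")
index = faiss.index_factory(768, "IVF16,Flat", faiss.METRIC_INNER_PRODUCT)

# Only index types that actually require training (e.g. IVF-based ones) report
# is_trained == False; for "Flat" or HNSW the train step can simply be skipped.
if not index.is_trained:
    index.train(embeddings)
index.add(embeddings)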

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, we could catch the error and make it more descriptive, but I think we should also let the user know that it's not valid to call train_index() in this case.

@@ -26,7 +26,11 @@ def elasticsearch_fixture():
except:
print("Starting Elasticsearch ...")
status = subprocess.run(
['docker run -d --name elasticsearch -p 9200:9200 -e "discovery.type=single-node" elasticsearch:7.9.1'],
['docker rm haystack_test_elastic'],
Contributor

The same is needed for the Tika docker container as well.
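For illustration, the analogous change for the Tika fixture might look roughly like this, mirroring the elasticsearch_fixture pattern above (container name, port, and image tag are assumptions):

import subprocess
import time

def tika_fixture():
    # Remove a possibly existing container first so repeated test runs don't
    # fail on a name clash, then start a fresh Tika container.
    subprocess.run(['docker rm -f tika'], shell=True)
    subprocess.run(
        ['docker run -d --name tika -p 9998:9998 apache/tika:1.24'],
        shell=True,
    )
    time.sleep(30)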

Labels: breaking change, type:feature (New feature or request)

3 participants