Remove phi normalization from FAISS & support more index types #467

Merged
merged 11 commits into master on Oct 6, 2020

Conversation

@tholor (Member) commented Oct 6, 2020

As described in #422, we don't need the phi normalization trick anymore, as HNSW now supports the inner product metric. This simplifies the code a lot, especially for multiple subsequent writes of documents.
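For background, the phi normalization trick being removed is the standard reduction of maximum inner product search to L2 search: every document vector gets one extra dimension derived from a corpus-wide maximum squared norm phi. A minimal sketch of that reduction (illustrative names, not the removed Haystack code):

import numpy as np

def augment_for_l2(doc_vectors: np.ndarray, queries: np.ndarray):
    # phi is the largest squared norm over the whole corpus
    norms_sq = (doc_vectors ** 2).sum(axis=1)
    phi = norms_sq.max()
    # Documents get sqrt(phi - ||x||^2) appended and queries get a 0 appended, so
    # ||q' - x'||^2 = ||q||^2 + phi - 2 * (q . x): minimizing L2 distance is then
    # equivalent to maximizing the inner product.
    docs_aug = np.hstack([doc_vectors, np.sqrt(phi - norms_sq)[:, None]])
    queries_aug = np.hstack([queries, np.zeros((queries.shape[0], 1))])
    return docs_aug.astype("float32"), queries_aug.astype("float32")

Because phi is a corpus-wide maximum, later document writes can change it and invalidate already-indexed vectors, and the index has to be created with one extra dimension (the vector_size + 1 in the old _create_new_index). With HNSW supporting faiss.METRIC_INNER_PRODUCT directly, none of this bookkeeping is needed.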

Breaking changes:

vector_size -> vector_dim

Old:
FAISSDocumentStore(...vector_size=768)

New:
FAISSDocumentStore(...vector_dim=768)

Limitations / Future improvements

@tholor tholor self-assigned this Oct 6, 2020
@tholor tholor changed the title from "Remove phi normalization from FAISS & support more index types" to "WIP Remove phi normalization from FAISS & support more index types" Oct 6, 2020
@tholor tholor added the breaking change and type:feature (New feature or request) labels Oct 6, 2020
@tholor tholor changed the title from "WIP Remove phi normalization from FAISS & support more index types" to "Remove phi normalization from FAISS & support more index types" Oct 6, 2020
@tholor tholor requested a review from tanaysoni October 6, 2020 13:05
@tholor tholor merged commit 8edeb84 into master Oct 6, 2020
-    def _create_new_index(self, vector_size: int, index_factory: str = "HNSW4"):
-        index = faiss.index_factory(vector_size + 1, index_factory)
+    def _create_new_index(self, vector_dim: int, index_factory: str = "Flat", metric_type=faiss.METRIC_INNER_PRODUCT, **kwargs):
+        if index_factory == "HNSW" and metric_type == faiss.METRIC_INNER_PRODUCT:
Contributor

Sorry for adding this comment only now, but I am adding it here for more context.

I think it's better to limit FAISS-related internals in this class. Users of this class already have the freedom to pass a self-configured index via the faiss_index and faiss_index_factory_str parameters of the constructor, which keeps the class clean.

Another point: how do we prevent the index.add() call from breaking? With faiss_index_factory_str="IDMap,Flat", the vector add call will fail. This was addressed in #385 as follows:

total_indices = self.size()
ids = np.arange(total_indices, total_indices + len(embeddings), 1, dtype=np.int64)

# An IndexIDMap needs explicit ids; a plain index assigns sequential ids itself.
if isinstance(self.faiss_index, IndexIDMap):
    self.faiss_index.add_with_ids(vectors_to_add, ids)
else:
    self.faiss_index.add(vectors_to_add)

What is your view on this?

Member Author

Yes, normally I would agree that we should limit the FAISS internals in this class. However, HNSW Flat with IP does not give the same results when initialized via the index factory; the performance is totally off. See facebookresearch/faiss#1434.
Therefore, we need special-case handling here anyway and do a direct init without the factory. I then took the chance to add "good" defaults for the other params. I believe most users would be overwhelmed by deciding on a proper index type for their use case and will use either just "Flat" or our preconfigured "HNSW" variant. We will also benchmark those two and try to tune the default HNSW config to fit the most common use cases.
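A minimal sketch of what this special-case handling could look like (parameter names and the n_links / efSearch / efConstruction defaults are assumptions, not necessarily the merged code):

import faiss

def _create_new_index(vector_dim: int, index_factory: str = "Flat",
                      metric_type=faiss.METRIC_INNER_PRODUCT, **kwargs):
    if index_factory == "HNSW" and metric_type == faiss.METRIC_INNER_PRODUCT:
        # Direct init instead of going through faiss.index_factory: the factory
        # path currently gives poor results with inner product
        # (see facebookresearch/faiss#1434).
        n_links = kwargs.get("n_links", 128)
        index = faiss.IndexHNSWFlat(vector_dim, n_links, metric_type)
        index.hnsw.efSearch = kwargs.get("ef_search", 20)
        index.hnsw.efConstruction = kwargs.get("ef_construction", 80)
    else:
        index = faiss.index_factory(vector_dim, index_factory, metric_type)
    return index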

Member Author

Regarding the IDMap: what would be the use case here? I think the FAISS id handling is rather something internal that is not interesting for most Haystack users. Therefore, I don't think we need to support it, but maybe I am missing a good use case here 🤔?

Contributor

> Yes, normally I would agree that we should limit the FAISS internals in this class. However, HNSW Flat with IP does not give the same results when initialized via the index factory; the performance is totally off. See facebookresearch/faiss#1434.

Yes, I agree with you. We can move this part to another function to make it more explicit. I am also able to reproduce facebookresearch/faiss#1434, even when compiling fresh from the top of the branch. I suspect an issue related to the default value of metric_type in the struct; they recently added support for the IP metric type for HNSW, so they might have missed it in some places.

Contributor

> Regarding the IDMap: what would be the use case here? I think the FAISS id handling is rather something internal that is not interesting for most Haystack users. Therefore, I don't think we need to support it, but maybe I am missing a good use case here 🤔?

Actually, IDMap has a very good use case: it abstracts the FAISS-internal id behind an external id, so one can update or delete embeddings in the FAISS index store. As far as I understand, embeddings are currently indexed sequentially by their vector_id, so removing one id in between forces an update of the vector_id of all embeddings that come after that record.
So far Haystack only offers an add method for the FAISS document store (Elasticsearch doesn't have this issue). If users want to update or remove a particular document, they need to delete all documents and then add them again. Another option is to mark documents as stale instead of removing them, but that does not prevent a FAISS search from returning a stale document's vector_id. This is what I covered in this comment.
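For illustration, a minimal standalone FAISS sketch (not Haystack code) of how IndexIDMap decouples external ids from the sequential internal ids and enables deletion:

import faiss
import numpy as np

dim = 768
base = faiss.IndexFlatIP(dim)     # flat inner-product index
index = faiss.IndexIDMap(base)    # wrap it so we control the ids ourselves

vectors = np.random.rand(5, dim).astype("float32")
ids = np.array([101, 102, 103, 104, 105], dtype=np.int64)  # external ids
index.add_with_ids(vectors, ids)

# Remove a single embedding by its external id; the remaining ids stay valid,
# so nothing else has to be re-numbered or re-indexed.
index.remove_ids(np.array([103], dtype=np.int64))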

Contributor

@tholor Do you have a use case for IDMap? BTW, Milvus handles these cases better:
https://dzone.com/articles/how-milvus-realizes-the-delete-function

They are now also planning to support multiple storage options: https://docs.google.com/document/d/1iwwLH4Jtm3OXIVb7jFYsfmcbOyX6AWZKaNJAaXC7-cw/edit

document_objects = [Document.from_dict(d) if isinstance(d, dict) else d for d in documents]
embeddings = [doc.embedding for doc in document_objects]
embeddings = np.array(embeddings, dtype="float32")
self.faiss_index.train(embeddings)
Contributor

We need to check self.faiss_index.is_trained before calling train(); otherwise the function will fail.
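A minimal standalone sketch of that guard (index type and sizes are just for illustration):

import faiss
import numpy as np

embeddings = np.random.rand(1024, 768).astype("float32")
index = faiss.index_factory(768, "IVF16,Flat", faiss.METRIC_INNER_PRODUCT)

# Only index types that actually require training (e.g. IVF-based ones) report
# is_trained == False; for "Flat" or HNSW the train step can simply be skipped.
if not index.is_trained:
    index.train(embeddings)
index.add(embeddings)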

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, we could catch the error and make it more descriptive, but I think we should also let the user know that it's not valid to call train_index() in this case.

@@ -26,7 +26,11 @@ def elasticsearch_fixture():
except:
print("Starting Elasticsearch ...")
status = subprocess.run(
['docker run -d --name elasticsearch -p 9200:9200 -e "discovery.type=single-node" elasticsearch:7.9.1'],
['docker rm haystack_test_elastic'],
Contributor

The same is needed for the Tika docker container as well.
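For illustration, the analogous change for the Tika fixture might look roughly like this, mirroring the elasticsearch_fixture pattern above (container name, port, and image tag are assumptions):

import subprocess
import time

def tika_fixture():
    # Remove a possibly existing container first so repeated test runs don't
    # fail on a name clash, then start a fresh Tika container.
    subprocess.run(['docker rm -f tika'], shell=True)
    subprocess.run(
        ['docker run -d --name tika -p 9998:9998 apache/tika:1.24'],
        shell=True,
    )
    time.sleep(30)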

Labels: breaking change, type:feature (New feature or request)

3 participants