[Feature Request:] Add new features to a previously built index #137

Open
kushalkafle opened this issue Nov 9, 2022 · 10 comments

@kushalkafle

kushalkafle commented Nov 9, 2022

Right now there does not seem to be an easy way to take an already-built index and add more embeddings to it (from the same distribution). This is already indirectly possible with autofaiss, because distributed training does it, and it is also easily supported by the FAISS backbone. But I wonder if we can expose an easy interface to take a built index and add more features from a new set of embeddings (using all the bells and whistles provided by autofaiss/embedding-reader for reading embeddings from a numpy-parquet format). Perhaps an update_index interface?

Thanks!

@hitchhicker
Contributor

Hey,

Thanks for suggesting this feature.

Why not! Working on it...

@hitchhicker
Contributor

I closed the PR that I created because we will implement something similar on our side. Once it is stable, we will consider adding it here. Stay tuned.

@kushalkafle
Author

Thanks a lot for looking into this! The update_index branch seems usable already; I might start using it and will report any hiccups/successes here!

Cheers!

@hitchhicker
Contributor

It does have issues: it won't work if the new index_key is different from the one used in the already-built index. In that case, we essentially need to retrain the index. That's something I missed in the branch, so I don't recommend using it right now.
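
A minimal sketch of the guard this implies, assuming index keys are compared as plain strings. `plan_update` is a hypothetical helper for illustration only and is not part of autofaiss; the example keys are just sample FAISS factory strings.

```python
def plan_update(existing_index_key: str, requested_index_key: str) -> str:
    """Decide how to handle new embeddings for an already-built index.

    Hypothetical helper, not an autofaiss function. Returns "append" when
    the requested key matches the built index, and "rebuild" otherwise,
    because a different key means the index has to be retrained.
    """
    if existing_index_key.strip() == requested_index_key.strip():
        return "append"
    return "rebuild"


same_key = plan_update("IVF4096,Flat", "IVF4096,Flat")
different_key = plan_update("IVF4096,Flat", "OPQ64_128,IVF16384_HNSW32,PQ64")
print(same_key, different_key)
```

With a check like this, an update call could fail fast or fall back to a full build instead of silently adding vectors to an index of the wrong type.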

@rom1504
Contributor

rom1504 commented Nov 12, 2022

If the index_key needs to be changed, then it's not possible to add items to the index at all.
So why is that method not suitable?

@kushalkafle
Author

Yes, I have the same thoughts as @rom1504. For many use cases, index_key would not be different, and it should just work fine for those cases, right? I can report back about my specific use case soon anyway :) If it does not work, I'll report here and await the revised PR later.

But even if it has issues that you'd like to solve before merging to main, I think it is already useful in many cases. So thanks for working on this 👍

@hitchhicker
Contributor

I agree with what both of you said. It is suitable if we don't need to update the clustering. What I had in mind is that the new interface would ideally handle two main use cases:

  • keep the clustering unchanged and add more embeddings to it;
  • update the clustering with the existing embeddings and the new embeddings.
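
The difference between the two use cases can be sketched with a toy inverted-file index in plain Python. This is an illustration under stated assumptions, not autofaiss or FAISS code: `ToyIVF`, `add`, and `rebuild` are hypothetical names, and the clustering is a bare-bones k-means. In FAISS itself, the first use case corresponds to calling `add` on an already trained index.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


class ToyIVF:
    """Toy inverted-file index: centroids plus one bucket of vectors each."""

    def __init__(self, n_clusters, seed=0):
        self.n_clusters = n_clusters
        self.seed = seed
        self.centroids = []
        self.buckets = []

    def _nearest(self, v):
        return min(range(len(self.centroids)),
                   key=lambda i: sq_dist(v, self.centroids[i]))

    def train(self, vectors, n_iter=10):
        # Plain k-means: this is the "clustering" step that is costly to redo.
        rng = random.Random(self.seed)
        self.centroids = [list(v) for v in rng.sample(vectors, self.n_clusters)]
        for _ in range(n_iter):
            groups = [[] for _ in self.centroids]
            for v in vectors:
                groups[self._nearest(v)].append(v)
            self.centroids = [mean(g) if g else c
                              for g, c in zip(groups, self.centroids)]
        self.buckets = [[] for _ in self.centroids]

    def add(self, vectors):
        # Use case 1: centroids stay frozen, vectors only land in buckets.
        for v in vectors:
            self.buckets[self._nearest(v)].append(v)

    def rebuild(self, new_vectors):
        # Use case 2: retrain on old + new vectors, i.e. a build from scratch.
        all_vectors = [v for b in self.buckets for v in b] + list(new_vectors)
        self.train(all_vectors)
        self.add(all_vectors)

    @property
    def ntotal(self):
        return sum(len(b) for b in self.buckets)


rng = random.Random(42)
old = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
new = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)]

index = ToyIVF(n_clusters=4)
index.train(old)
index.add(old)
frozen = [list(c) for c in index.centroids]

index.add(new)  # use case 1
centroids_after_add = [list(c) for c in index.centroids]
total_after_add = index.ntotal

rebuilt = ToyIVF(n_clusters=4)
rebuilt.train(old)
rebuilt.add(old)
rebuilt.rebuild(new)  # use case 2
print(total_after_add, rebuilt.ntotal, centroids_after_add == frozen)
```

In the toy, `add` never touches the centroids, while `rebuild` goes through the same training path as a first build.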

@kushalkafle I would be happy to see your feedback after using it. Thanks :)

@rom1504
Contributor

rom1504 commented Nov 12, 2022 via email

@hitchhicker
Contributor

@rom1504 Pardon me for replying late. We would retrain both, so it is indeed equivalent to rebuilding the index from scratch.
In my opinion, it is possible to provide an interface like the update_index branch whose only responsibility is to add more features/embeddings to a built index while keeping its clustering unchanged. It would be useful for autofaiss users.

@nateagr is working on incremental indexing right now for our internal use cases. We will revisit this topic soon.

@kushalkafle
Author

kushalkafle commented Nov 14, 2022

I think I want to chime in here as well. Overall, I think I agree with @rom1504.

  • If the index is going to be built from scratch, why does this even fall under the update_index's job? That is just the regular building of the index; there is no update happening.

  • Yes, what you described (i.e., only to add more features/embeddings on an already-built index) is exactly what I am suggesting, and I think it will be useful for multiple usages. Perhaps add_features_to_index is a better name 😄

  • I know this will change the index size and the query time(s) statistics, but that is something that can be re-benchmarked/adjusted without any retraining.
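
As a rough illustration of the last point, the sketch below re-measures recall on a toy index after vectors have been added, sweeping a search-time setting while the centroids stay fixed. Everything here is an assumption for illustration: the toy index is plain Python rather than autofaiss, and `nprobe` (the number of buckets visited per query, a search-time parameter of FAISS IVF indexes) is not mentioned in this thread.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_buckets(centroids, vectors):
    # Assign each vector to its nearest fixed centroid; no training involved.
    buckets = [[] for _ in centroids]
    for v in vectors:
        nearest = min(range(len(centroids)),
                      key=lambda i: sq_dist(v, centroids[i]))
        buckets[nearest].append(v)
    return buckets


def search(centroids, buckets, query, nprobe):
    # Scan only the nprobe buckets whose centroids are closest to the query.
    order = sorted(range(len(centroids)),
                   key=lambda i: sq_dist(query, centroids[i]))
    candidates = [v for i in order[:nprobe] for v in buckets[i]]
    return min(candidates, key=lambda v: sq_dist(query, v)) if candidates else None


def recall_at_1(centroids, buckets, vectors, queries, nprobe):
    hits = 0
    for q in queries:
        exact = min(vectors, key=lambda v: sq_dist(q, v))  # brute-force truth
        if search(centroids, buckets, q, nprobe) == exact:
            hits += 1
    return hits / len(queries)


rng = random.Random(0)


def points(n):
    return [[rng.uniform(0, 1), rng.uniform(0, 1)] for _ in range(n)]


# Stand-in for centroids learned when the index was first built.
centroids = [[0.25, 0.25], [0.25, 0.75], [0.75, 0.25], [0.75, 0.75]]
vectors = points(300) + points(100)  # original vectors plus the added ones
buckets = build_buckets(centroids, vectors)
queries = points(50)

# Re-benchmark after the addition by sweeping the search-time setting only.
recalls = {nprobe: recall_at_1(centroids, buckets, vectors, queries, nprobe)
           for nprobe in (1, 2, 4)}
print(recalls)
```

Visiting more buckets scans more candidates per query, so in this toy the recall can only stay the same or go up as `nprobe` grows, at the cost of query time.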
