IndexBinaryFlat support? #124

Open
mqudsi opened this issue Feb 24, 2024 · 0 comments

Thanks for this library. I'm playing around with it to see whether it can replace the myriad user-defined SQL functions we currently use to perform kNN search on binary features, and I have a question about using binary hashes in place of floating-point features/embeddings.

As far as I can tell, FAISS supports IndexBinaryFlat through the factory string BFlat (along with various other B-prefixed index strings), but binary indexes derive from a completely separate base class (faiss::IndexBinary) and are built by a separate factory function (index_binary_factory) rather than the regular index_factory. Indeed, trying to use the following:

CREATE VIRTUAL TABLE IF NOT EXISTS "vss_files" using vss0 (
	embedding(144) factory="BFlat,IDMap2",
);

throws an exception:

Error building index factory for embedding: Error in std::unique_ptr<faiss::Index> faiss::{anonymous}::index_factory_sub(int, std::string, faiss::MetricType) at /home/runner/work/sqlite-vss/sqlite-vss/vendor/faiss/faiss/index_factory.cpp:877: could not parse index string BFlat

(IDMap2 has, as I understand it, been implemented for IndexBinaryFlat since 2019.)

The only workaround I can think of is to treat the binary hash as a densely packed bit vector and expand it into a float embedding, inserting a 1.0 or 0.0 for each bit (so an n-byte binary vector turns into an n*8*2-byte fp16 embedding). I could then either insert that directly, at a huge storage and compute premium, or compress its features (ProductQuantizer?) into a smaller embedding, which reduces storage but costs extra compute and some accuracy.
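To make the bit-expansion workaround concrete, here is a minimal sketch in plain Python. The helper names (unpack_bits, hamming, squared_l2) are hypothetical and are not part of sqlite-vss or FAISS. It assumes the hash is expanded most-significant bit first, and it illustrates why the approach should preserve ranking: on 0.0/1.0 vectors, squared L2 distance equals the Hamming distance between the original hashes.

```python
def unpack_bits(hash_bytes: bytes) -> list:
    """Expand each bit (most significant first) into a 0.0 or 1.0 float."""
    return [
        float((byte >> shift) & 1)
        for byte in hash_bytes
        for shift in range(7, -1, -1)
    ]


def hamming(a: bytes, b: bytes) -> int:
    """Count the bits that differ between two equal-length hashes."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def squared_l2(u, v) -> float:
    """Squared Euclidean distance between two equal-length float vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


# Two illustrative 2-byte hashes (made-up values, not real data).
a = bytes([0b10110000, 0xFF])
b = bytes([0b00110001, 0x0F])

ua, ub = unpack_bits(a), unpack_bits(b)

# Each byte becomes 8 floats, so 2 bytes become 16 dimensions.
assert len(ua) == 8 * len(a)

# Squared L2 on the expanded vectors matches Hamming on the raw bytes.
assert squared_l2(ua, ub) == hamming(a, b)
```

Because L2 and squared L2 rank neighbours identically, a float index using an L2 metric over the expanded vectors should return the same nearest neighbours as a Hamming search over the hashes (up to tie ordering), just at the storage and compute cost described above.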

Ideally, we would be able to use bfactory= instead of factory= to create a binary index. Alternatively, could factory= inspect its payload for BFlat and create a binary index instead of a regular one?
