
Handle larger Indices (current limit 1GB) #1

Open
asg017 opened this issue Feb 3, 2023 · 11 comments

Comments

@asg017
Owner

asg017 commented Feb 3, 2023

The vector indices that support the vss0 virtual table are limited to 1GB. This is because they are stored as a BLOB in a single row in a shadow table, which has a limit of ~1GB.

Instead, we should store large FAISS indices across several rows, so they can (in theory) grow without a size limit. This will likely be complicated and will require the SQLite BLOB I/O API and a custom faiss IOWriter.
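
Roughly, the chunking idea looks like this. This is only a Python sketch to illustrate the approach (using faiss.serialize_index, with a made-up chunk size and table name); the real extension would use the C-level SQLite BLOB I/O API and a custom faiss IOWriter so the whole index never has to sit in one contiguous buffer:

import sqlite3

import faiss
import numpy as np

# Keep each row's BLOB comfortably under SQLite's ~1GB limit
CHUNK_SIZE = 256 * 1024 * 1024

def save_index(db: sqlite3.Connection, index: faiss.Index) -> None:
    data = faiss.serialize_index(index)  # numpy uint8 array of the whole index
    db.execute("CREATE TABLE IF NOT EXISTS index_chunks(idx INTEGER PRIMARY KEY, data BLOB)")
    db.execute("DELETE FROM index_chunks")
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE].tobytes()
        db.execute("INSERT INTO index_chunks(idx, data) VALUES (?, ?)", (i // CHUNK_SIZE, chunk))
    db.commit()

def load_index(db: sqlite3.Connection) -> faiss.Index:
    rows = db.execute("SELECT data FROM index_chunks ORDER BY idx").fetchall()
    # copy() so the reassembled buffer is writable for faiss
    buf = np.frombuffer(b"".join(r[0] for r in rows), dtype=np.uint8).copy()
    return faiss.deserialize_index(buf)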

@chetanladdha

chetanladdha commented May 12, 2023

@asg017: Do you have any plans for this fix? Any timeline?

@asg017
Owner Author

asg017 commented May 12, 2023

No timeline! I haven't had time to work on core sqlite-vss features recently.

But if anyone reading this would like to sponsor this work, let me know!

@siscia

siscia commented Jun 26, 2023

Do we have a rough estimate of how many vectors can be indexed with 1GB?

@asg017
Owner Author

asg017 commented Jun 26, 2023

It depends on the dimensions of your vectors, and on whether you use any additional faiss factory strings with the factory= option.

If you're using the default settings, the size of your index in bytes is roughly:

dimensions * 4 * number_vectors

(where 4 is sizeof(float))

Plus some additional overhead (a BTree that maps rowids to each vector, some Faiss-specific storage stuff, etc.).
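
So as a rough upper bound, ignoring that overhead, 1GB fits about 1,000,000,000 / (dimensions * 4) vectors: roughly 650,000 vectors at 384 dimensions, or roughly 160,000 vectors at 1536 dimensions.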

In the "headlines" example in this blog post, there are 209,527 vectors with 384 dimensions (using sentence transformers), and it takes up roughly 323.5MB of space.

You can also use the factory= option to change how Faiss stores and queries your vectors. There are a ton of options, but "dimension reduction" techniques could help you lower the amount of storage your index takes, to stay under the 1GB limit.

I don't have too many examples of this, but it was discussed at length in this issue. The general gist is to use the PCA directive in a custom factory string:

create virtual table vss_pca using vss0(a(1536) factory="PCA384,Flat,IDMap2");

Then train your index:

insert into vss_pca(operation, a)
  select 'training', value from vectors;

Then insert your vector data:

insert into vss_pca(rowid, a)
  select key, value from vectors;

This example reduces 1536-dimension vectors to 384 dimensions. The storage savings get better with larger datasets; in my quick test with 100,000 vectors, the reduced index was 25% the size of the original full-length index.

However, this approach does reduce the accuracy of KNN-style searches, so use caution.

@baughmann

> No timeline! I haven't had time to work on core sqlite-vss features recently.
>
> But if anyone reading this would like to sponsor this work, let me know!

I did just that

@asg017
Owner Author

asg017 commented Dec 22, 2023

Hey @baughmann, thanks for the donation! I'm going to give this a shot over the holiday.

Although to be clear, this will only bump the 1GB limit that sqlite-vss has for indices to (hopefully) an arbitrary size. This will be done by splitting the Faiss index into multiple chunks and storing them across multiple rows in shadow tables, instead of the current "store one big blob in a single row" approach.

However, the following shortcomings will still exist:

  • Faiss (and therefore sqlite-vss) requires the entire index to be stored in memory, so even if you could store >1GB per vector table, you'll also need enough memory to handle it.
  • Updates will be slow, again because of Faiss. Even if you insert/update a few vectors at a time, the entire index has to be re-written from scratch.

I've been thinking a lot about these shortcomings. There is #30 where I lay out a possible workaround, but that's separate from this issue. I'll also probably chew a bit on this over my break, but will definitely try to lift this 1GB limit in the following weeks!

@cduk

cduk commented Mar 10, 2024

Just wondering if there have been any developments on this front. I guess my use case is a simple one, as I have around 3-4 million embeddings to index, which is >1GB; however, since the data is static, I don't have the update concerns you outline in #30.

@slavakurilyak

+1 for solving this problem, as it would unlock vector use cases and possibly integrate with @tursodatabase, which scales sqlite in production workloads.

@baughmann

baughmann commented Apr 15, 2024

For everyone waiting on this, it may be better to just create an implementation using the Repository pattern and have that repository instance maintain a FAISS index alongside the SQLite database. On startup, the repository should load a BLOB column of vectors into the Index. You can utilize the IndexIDMap to track which documents were returned in your similarity search by linking that ID column to an auto-incrementing integer column on the SQLite table.

I implemented one of these myself. It will probably end up being faster and more scalable than anything that can be done in a native SQLite plugin.
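
Roughly, the shape of it looks like this (a simplified sketch, not my exact implementation; the table name, schema, and dimension are illustrative):

import sqlite3

import faiss
import numpy as np

DIM = 384  # illustrative embedding dimension

class VectorRepository:
    def __init__(self, db_path: str):
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS documents("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, body TEXT, embedding BLOB)"
        )
        # IndexIDMap ties Faiss search results back to SQLite's integer ids
        self.index = faiss.IndexIDMap(faiss.IndexFlatL2(DIM))
        # On startup, load the BLOB column of vectors into the index
        rows = self.db.execute("SELECT id, embedding FROM documents").fetchall()
        if rows:
            ids = np.array([r[0] for r in rows], dtype=np.int64)
            vecs = np.vstack([np.frombuffer(r[1], dtype=np.float32) for r in rows])
            self.index.add_with_ids(vecs, ids)

    def add(self, body: str, embedding: np.ndarray) -> int:
        cur = self.db.execute(
            "INSERT INTO documents(body, embedding) VALUES (?, ?)",
            (body, embedding.astype(np.float32).tobytes()),
        )
        self.db.commit()
        doc_id = cur.lastrowid
        self.index.add_with_ids(
            embedding.reshape(1, -1).astype(np.float32),
            np.array([doc_id], dtype=np.int64),
        )
        return doc_id

    def search(self, query: np.ndarray, k: int = 10) -> list[tuple[int, float]]:
        # Returns (document id, L2 distance) pairs; faiss uses -1 for "no result"
        distances, ids = self.index.search(query.reshape(1, -1).astype(np.float32), k)
        return [(int(i), float(d)) for i, d in zip(ids[0], distances[0]) if i != -1]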

@cchance27

Any chance you could share an example implementation of that?

@baughmann

baughmann commented Apr 27, 2024 via email
