Support memory-mapped on-disk Indices #4

Open
asg017 opened this issue Feb 7, 2023 · 5 comments

Comments

@asg017
Owner

asg017 commented Feb 7, 2023

The underlying Faiss indices are stored in SQLite shadow tables, which can't be mmapped with Faiss's IO_FLAG_MMAP flag.

One solution: introduce a new option to store a vss0 column index on disk, allowing memory-mapped indices for larger-than-memory datasets.

create virtual table articles using vss0(
  headline_embedding(1024) factory="..." on_disk=True,
  description_embedding(1024) factory="..." on_disk=True,
);

Then, your directory would look like:

$ tree .
.
├── my_data.db
├── my_data.db.vss0.articles.description_embedding.faissindex
└── my_data.db.vss0.articles.headline_embedding.faissindex

sqlite3_db_filename() would be useful here.
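As a rough sketch of how the file names above could be derived: Python's sqlite3 module doesn't expose sqlite3_db_filename() directly, so this uses `PRAGMA database_list`, which reports the same per-schema file path. The `db_filename` and `index_path` helpers are hypothetical, and the naming pattern is just copied from the directory listing above, not from any actual implementation.

```python
import os
import sqlite3
import tempfile


def db_filename(conn, schema="main"):
    """Return the file backing `schema`, or None for in-memory databases."""
    for _seq, name, path in conn.execute("PRAGMA database_list"):
        if name == schema:
            return path or None
    return None


def index_path(db_path, table, column):
    # Hypothetical naming scheme, mirroring the directory listing above.
    return f"{db_path}.vss0.{table}.{column}.faissindex"


tmpdir = tempfile.mkdtemp()
conn = sqlite3.connect(os.path.join(tmpdir, "my_data.db"))
conn.execute("create table t(x)")  # touch the database so the file exists
conn.commit()

path = db_filename(conn)
names = [
    os.path.basename(index_path(path, "articles", column))
    for column in ("description_embedding", "headline_embedding")
]
print(names)
```

An in-memory database has no file name, so an on-disk option would presumably have to be rejected (or fall back to shadow tables) in that case.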

One problem: it's kind of nice to have all Faiss indices stored in one file, the SQLite database itself. With this config option, users would have to move multiple files around instead of a single SQLite file. But since this is an "optimization" feature that's not enabled by default, I think it'll be ok.

@kroggen

kroggen commented Apr 11, 2023

I suppose that each new insertion into an indexed table causes the whole index BLOB to be updated, so database writes are done twice, which makes it slow.

And if the index files are not present in the folder, can the code recreate them from the content? (Is it stored in 2 places?)

@asg017
Owner Author

asg017 commented Apr 11, 2023

In this proposal, for memory-mapped on-disk indexes, it won't be stored twice. By default, the Faiss index is stored inside a "shadow table" in your SQLite DB, but this option would instead store it on disk as a separate file. It'll still work the same from a user's perspective (i.e. the same SELECT and INSERT statements), but under the hood the storage of the actual index would be different.

Right now the "shadow table" indexes are slow because we rewrite the entire index at the end of every transaction that INSERTs into or DELETEs from a vss0 table. That involves exporting the index to an in-memory buffer, then rewriting the shadow table with the new contents, which isn't great. But if the Faiss index were its own memory-mapped file, then updates wouldn't be as drastic.
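To put a rough number on the rewrite cost described above, here is a toy sketch. It does not use Faiss: `serialize` is a stand-in for exporting an index to a buffer, `shadow_index` is a hypothetical shadow table, and the "append" counter assumes an on-disk format where only the new vector has to be written, which this thread does not establish for Faiss's mmap path.

```python
import sqlite3
import struct

DIM = 4  # toy dimensionality


def serialize(vectors):
    """Stand-in for exporting a whole index to an in-memory buffer."""
    return b"".join(struct.pack(f"{DIM}f", *v) for v in vectors)


conn = sqlite3.connect(":memory:")
# Hypothetical shadow table holding one serialized index blob.
conn.execute("create table shadow_index(id integer primary key, idx blob)")

vectors = []
shadow_bytes_written = 0  # whole blob rewritten at the end of each transaction
append_bytes_written = 0  # assumed cost if only the new vector were written

for i in range(100):
    vectors.append([float(i)] * DIM)
    blob = serialize(vectors)  # export the entire index again
    with conn:  # one transaction per insert
        conn.execute(
            "insert or replace into shadow_index(id, idx) values (0, ?)",
            (blob,),
        )
    shadow_bytes_written += len(blob)
    append_bytes_written += DIM * 4  # one float32 vector

print(shadow_bytes_written, append_bytes_written)
```

With one insert per transaction, the bytes written through the shadow table grow quadratically with the number of vectors, while the assumed append-only cost grows linearly.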

@asg017
Owner Author

asg017 commented Aug 3, 2023

Thinking about this more: instead of an on_disk= argument, I think we should change it to storage_type=faiss_ondisk. The default would be storage_type=faiss_shadow.

This is so we can easily support future storage backends like #30
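A minimal sketch of how a storage_type= option with a default might be parsed. This is not the project's parser (that lives in the extension itself); `parse_column`, the regexes, and the returned dict shape are all hypothetical, and only the two storage_type values named in this thread are accepted.

```python
import re

# Only these two values are named in this thread; future backends would be added here.
STORAGE_TYPES = {"faiss_shadow", "faiss_ondisk"}
DEFAULT_STORAGE_TYPE = "faiss_shadow"

# name(dimensions) followed by optional key=value options
COLUMN_RE = re.compile(r"^\s*(?P<name>\w+)\s*\(\s*(?P<dims>\d+)\s*\)(?P<rest>.*)$", re.S)
OPTION_RE = re.compile(r'(\w+)\s*=\s*("[^"]*"|\S+)')


def parse_column(definition):
    """Parse one column definition such as 'a(2) storage_type=faiss_ondisk'."""
    match = COLUMN_RE.match(definition)
    if match is None:
        raise ValueError(f"invalid column definition: {definition!r}")
    options = {
        key: value.strip('"')
        for key, value in OPTION_RE.findall(match.group("rest"))
    }
    # Fall back to the shadow-table backend when no storage_type is given.
    storage_type = options.pop("storage_type", DEFAULT_STORAGE_TYPE)
    if storage_type not in STORAGE_TYPES:
        raise ValueError(f"unknown storage_type: {storage_type}")
    return {
        "name": match.group("name"),
        "dimensions": int(match.group("dims")),
        "storage_type": storage_type,
        "options": options,
    }


print(parse_column("a(2) storage_type=faiss_ondisk"))
print(parse_column('headline_embedding(1024) factory="Flat"'))
```

Compared with a boolean on_disk= flag, an enumerated value leaves room for new backends without adding a new option each time.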

@asg017 asg017 mentioned this issue Aug 17, 2023
3 tasks
@asg017
Owner Author

asg017 commented Aug 17, 2023

@dleviminzi ok, I applied the new vss0 constructor parser to the main branch. You should be able to add a mmap=True flag inside parse_vss0_column_definition(), let me know if you run into any trouble with that.

I also changed a bit of the storage_type=faiss_ondisk logic. I'll also probably remove the schema from the generated file name, so for the following vss0 table on a database called my_database.db:

create virtual table vss_demo using vss0(
  a(2) storage_type=faiss_ondisk
)

It currently saves vectors to the file:

my_database.db.main.vss_demo.a.faiss_index

But once I make that change, it'll save to:

my_database.db.vss_demo.a.faiss_index

Mostly because I don't think the schema is required in the filename. In fact, I think it'll actually break SQL that queries vss0 tables on ATTACHed databases, since it won't know to look for the file with .main. in its name.
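The ATTACH concern can be illustrated with plain SQLite. In this sketch the same database file shows up under the schema name "main" on one connection and under an arbitrary alias (here `aux_db`, chosen for illustration) on another, so a file name that embeds the schema would differ between the two. `schemas_by_file` and `index_file` are hypothetical helpers, and the file name patterns are copied from the two examples above.

```python
import os
import sqlite3
import tempfile


def schemas_by_file(conn):
    """Map each file-backed schema on this connection to its file's basename."""
    return {
        name: os.path.basename(path)
        for _seq, name, path in conn.execute("PRAGMA database_list")
        if path  # skip in-memory schemas, which have no file
    }


def index_file(db_file, table, column, schema=None):
    # Hypothetical: build the index file name with or without the schema part.
    parts = [db_file] + ([schema] if schema else []) + [table, column, "faiss_index"]
    return ".".join(parts)


tmpdir = tempfile.mkdtemp()
db_path = os.path.join(tmpdir, "my_database.db")

# Connection 1 opens the file directly, so its schema is "main".
direct = sqlite3.connect(db_path)
direct.execute("create table t(x)")
direct.commit()

# Connection 2 ATTACHes the same file under an arbitrary alias.
other = sqlite3.connect(":memory:")
other.execute("ATTACH DATABASE ? AS aux_db", (db_path,))

for conn in (direct, other):
    for schema, db_file in schemas_by_file(conn).items():
        print(
            schema,
            index_file(db_file, "vss_demo", "a", schema=schema),
            index_file(db_file, "vss_demo", "a"),
        )
```

Only the schema-free name is the same on both connections, which is the argument for dropping the schema from the file name.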

@dleviminzi
Contributor

> You should be able to add a mmap=True flag inside parse_vss0_column_definition(), let me know if you run into any trouble with that.

I'll look through the changes and give it a go today.

> I'll also probably remove the schema from the generated file name,

Yeah that makes sense.
