Support memory-mapped on-disk Indices #4

asg017 · 2023-02-07T18:11:15Z

The underlying Faiss indices are stored in SQLite shadow tables, which can't be mmapped with IO_FLAG_MMAP.

One solution: introduce a new option to store a vss0 column index on disk, allowing mmapped, larger-than-memory indices.

create virtual table articles using vss0(
  headline_embedding(1024) factory="..." on_disk=True,
  description_embedding(1024) factory="..." on_disk=True
);

Then, your directory would look like:

$ tree .
.
├── my_data.db
├── my_data.db.vss0.articles.description_embedding.faissindex
└── my_data.db.vss0.articles.headline_embedding.faissindex

sqlite3_db_filename() would be useful here.
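
A minimal Python sketch of how the file names in the listing above could be derived. Python's sqlite3 module does not expose sqlite3_db_filename(), so this uses PRAGMA database_list as a stand-in. db_filename() and index_path() are hypothetical helpers, and the naming scheme is only an assumption that mirrors the tree output.

```python
import os
import sqlite3
import tempfile


def db_filename(conn: sqlite3.Connection, schema: str = "main") -> str:
    """Return the file backing `schema`, or "" for in-memory databases."""
    for _seq, name, path in conn.execute("PRAGMA database_list"):
        if name == schema:
            return path or ""
    raise ValueError(f"unknown schema: {schema}")


def index_path(db_file: str, table: str, column: str) -> str:
    # Hypothetical naming scheme, copied from the directory listing above.
    return f"{db_file}.vss0.{table}.{column}.faissindex"


with tempfile.TemporaryDirectory() as tmp:
    conn = sqlite3.connect(os.path.join(tmp, "my_data.db"))
    conn.execute("create table placeholder(x)")  # make sure the file exists
    path = index_path(db_filename(conn), "articles", "headline_embedding")
    print(os.path.basename(path))
    conn.close()
```

Keeping the index next to the database file means the path never has to be stored anywhere; it can be recomputed from the connection each time.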

One problem: it's kind of nice to have all Faiss indices stored in one file, inside the SQLite database, and this option would mean users have to move multiple files around instead of a single SQLite file. But since this is an "optimization" feature that's not enabled by default, I think it'll be ok.


kroggen · 2023-04-11T15:07:53Z

I suppose that each new insertion into an indexed table makes the engine update the whole index BLOB, and database writes are done twice, which makes it slow.

And if the index files are not present in the folder, the code can recreate them from the content... (is it stored in 2 places?)

asg017 · 2023-04-11T19:01:37Z

In this proposal, for memory-mapped on-disk indexes, it won't be stored twice. By default, the Faiss index is stored inside a "shadow table" in your SQLite DB, but this option would instead store it on disk as a separate file. It'll still work the same from a user perspective (i.e. same SELECT and INSERT statements), but under the hood the storage of the actual index would be different.

Right now the "shadow table" indexes are slow because we rewrite the entire index at the end of every transaction that INSERTs into or DELETEs from a vss0 table. That involves exporting the index to an in-memory buffer, then rewriting the shadow table with the new contents, which isn't great. But if the Faiss index was its own file and memory-mapped, then updates wouldn't be as drastic.
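
To make the cost difference concrete, here is a rough Python sketch. It is not how sqlite-vss or Faiss work internally: the "index" is just a zero-filled byte buffer, and shadow_index, update_shadow() and update_mmap() are made-up names. It only contrasts rewriting a whole BLOB on every change with touching a few bytes of a memory-mapped file.

```python
import mmap
import os
import sqlite3
import tempfile

INDEX_SIZE = 1 << 20  # 1 MiB stand-in for a serialized index (not real Faiss data)


def update_shadow(conn: sqlite3.Connection, index: bytearray, offset: int, value: int) -> int:
    """Shadow-table style: change one byte, then rewrite the whole BLOB."""
    index[offset] = value
    blob = bytes(index)  # export the full index to an in-memory buffer
    with conn:  # commits on success
        conn.execute(
            "insert or replace into shadow_index(rowid, idx) values (0, ?)",
            (blob,),
        )
    return len(blob)  # number of bytes handed to SQLite for this one change


def update_mmap(path: str, offset: int, value: int) -> None:
    """mmap style: write one byte in place; only the touched page is dirtied."""
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as mm:
        mm[offset] = value
        mm.flush()


with tempfile.TemporaryDirectory() as tmp:
    conn = sqlite3.connect(os.path.join(tmp, "demo.db"))
    conn.execute("create table shadow_index(idx blob)")
    written = update_shadow(conn, bytearray(INDEX_SIZE), 10, 0xFF)
    conn.close()

    index_file = os.path.join(tmp, "demo.db.faissindex")
    with open(index_file, "wb") as f:
        f.write(bytes(INDEX_SIZE))
    update_mmap(index_file, 10, 0xFF)

    print(f"shadow table rewrite: {written} bytes for a 1-byte change")
```

A real index update touches more than one byte and may grow the file, so treat this as an upper bound on the benefit rather than a measurement.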

asg017 · 2023-08-03T03:10:59Z

Thinking about this more: instead of an on_disk= argument, I think we should change it to storage_type=faiss_ondisk. The default would be storage_type=faiss_shadow.

This is so we can easily support future storage backends like #30
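
A small Python sketch of why an enumerated storage_type extends more easily than a boolean flag. This is not the actual vss0 constructor parser: ColumnDef, parse_column() and the simplified name(dims) key=value grammar are assumptions for illustration. Only the two storage_type values come from the comment above.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Adding a future backend would mean adding one more name here.
STORAGE_TYPES = {"faiss_shadow", "faiss_ondisk"}

_COLUMN_RE = re.compile(r"^\s*(\w+)\s*\(\s*(\d+)\s*\)(.*)$", re.S)
_OPTION_RE = re.compile(r'(\w+)\s*=\s*(?:"([^"]*)"|(\S+))')


@dataclass
class ColumnDef:
    name: str
    dimensions: int
    factory: Optional[str] = None
    storage_type: str = "faiss_shadow"  # default, as proposed above


def parse_column(definition: str) -> ColumnDef:
    """Parse a simplified, hypothetical `name(dims) key=value ...` definition."""
    m = _COLUMN_RE.match(definition)
    if m is None:
        raise ValueError(f"invalid column definition: {definition!r}")
    col = ColumnDef(name=m.group(1), dimensions=int(m.group(2)))
    for key, quoted, bare in _OPTION_RE.findall(m.group(3)):
        value = bare or quoted  # exactly one of the two groups is non-empty
        if key == "factory":
            col.factory = value
        elif key == "storage_type":
            if value not in STORAGE_TYPES:
                raise ValueError(f"unknown storage_type: {value}")
            col.storage_type = value
        else:
            raise ValueError(f"unknown option: {key}")
    return col


print(parse_column("a(2) storage_type=faiss_ondisk"))
```

With a boolean on_disk= flag, each new backend would need its own flag plus rules for conflicting combinations; with one option, an unknown value fails in a single place.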

asg017 · 2023-08-17T01:18:20Z

@dleviminzi ok, I applied the new vss0 constructor parser to the main branch. You should be able to add a mmap=True flag inside parse_vss0_column_definition(), let me know if you run into any trouble with that.

I also changed a bit of the storage_type=faiss_ondisk logic. I'll also probably remove the schema from the generated file name, so for the following vss0 table on a database called my_database.db:

create virtual table vss_demo using vss0(
  a(2) storage_type=faiss_ondisk
)

It currently saves vectors to the file:

my_database.db.main.vss_demo.a.faiss_index

But once I remove the schema from the file name, it'll save to:

my_database.db.vss_demo.a.faiss_index

Mostly because I don't think the schema is required in the filename. In fact, I think it'll actually break on SQL that queries vss0 tables on ATTACHed databases, since it won't know to look in the .main. file.
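
A quick Python illustration of the ATTACH concern, using only the stdlib sqlite3 module. The helper names are hypothetical and no vss0 table is involved: the point is just that the same database file is called "main" on one connection and something else once ATTACHed, so a file name that embeds the schema would not match.

```python
import os
import sqlite3
import tempfile


def name_with_schema(db_file: str, schema: str, table: str, column: str) -> str:
    return f"{db_file}.{schema}.{table}.{column}.faiss_index"


def name_without_schema(db_file: str, table: str, column: str) -> str:
    return f"{db_file}.{table}.{column}.faiss_index"


def schema_of(conn: sqlite3.Connection, db_file: str) -> str:
    """Schema name under which `db_file` is open on this connection."""
    target = os.path.realpath(db_file)
    for _seq, name, path in conn.execute("PRAGMA database_list"):
        if path and os.path.realpath(path) == target:
            return name
    raise ValueError(f"{db_file} is not open on this connection")


with tempfile.TemporaryDirectory() as tmp:
    db_file = os.path.join(tmp, "my_database.db")

    direct = sqlite3.connect(db_file)
    direct.execute("create table placeholder(x)")  # make sure the file exists

    attached = sqlite3.connect(":memory:")
    attached.execute("attach database ? as other", (db_file,))

    for conn in (direct, attached):
        schema = schema_of(conn, db_file)
        print(os.path.basename(name_with_schema(db_file, schema, "vss_demo", "a")))

    # The schema-free name is the same no matter how the file was opened.
    print(os.path.basename(name_without_schema(db_file, "vss_demo", "a")))

    direct.close()
    attached.close()
```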

dleviminzi · 2023-08-17T13:20:59Z

> You should be able to add a mmap=True flag inside parse_vss0_column_definition(), let me know if you run into any trouble with that.

I'll look through the changes and give it a go today.

> I'll also probably remove the schema from the generated file name,

Yeah that makes sense.

asg017 mentioned this issue Aug 17, 2023

support on disk indices #90

Merged

3 tasks

tom-pollak mentioned this issue May 3, 2024

add mmap option for faiss on disk indices #94

Open
