
Counter-intuitive result: more RAM = slower indexing (standard inverted indexes) #2294

Closed
lintool opened this issue Dec 5, 2023 · 3 comments

Comments

@lintool
Member

lintool commented Dec 5, 2023

I'm currently on: #2275 at 3b8bee7

I've bumped up the default memory buffer size from 4G to 16G, as follows:

config.setRAMBufferSizeMB(args.memoryBuffer);
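For context, here is a minimal sketch of where that knob lives in plain Lucene (this is an illustrative fragment, not Anserini's actual indexing code; the 16 GB value just mirrors the run above):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;

// Illustrative setup: raise the in-memory buffer so Lucene flushes
// larger (and therefore fewer) segments to disk.
IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
config.setRAMBufferSizeMB(16 * 1024); // e.g. 16 GB
// Flush based on RAM usage only, not on a document count.
config.setMaxBufferedDocs(IndexWriterConfig.DISABLE_AUTO_FLUSH);
```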

But I've discovered that more RAM actually slows indexing. Here are some runs with the SPLADE++ ED model on MS MARCO v2, sweeping the buffer size over {2G, 4G, 8G, 16G, 32G}:

logs/log.msmarco-v2-passage-splade-pp-ed.02gb.1:2023-12-03 14:21:47,515 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:315) - Total 138,364,198 documents indexed in 02:33:25
logs/log.msmarco-v2-passage-splade-pp-ed.04gb.1:2023-12-03 16:56:44,825 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:315) - Total 138,364,198 documents indexed in 02:34:55
logs/log.msmarco-v2-passage-splade-pp-ed.08gb.1:2023-12-03 20:01:08,410 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:315) - Total 138,364,198 documents indexed in 03:04:20
logs/log.msmarco-v2-passage-splade-pp-ed.16gb.1:2023-12-03 23:34:57,443 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:315) - Total 138,364,198 documents indexed in 03:33:45
logs/log.msmarco-v2-passage-splade-pp-ed.32gb.1:2023-12-04 02:54:37,891 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:315) - Total 138,364,198 documents indexed in 03:19:36

It seems like more RAM actually slows indexing... is this expected behavior? (This is with spinning disks; on SSDs, the same pattern persists, although not as pronounced.)

@jpountz @benwtrent @ChrisHegarty @tteofili any ideas here?

@jpountz
Contributor

jpountz commented Dec 5, 2023

My guess is that the smaller buffer isn't actually faster overall: it just takes a bit of work off the indexing threads and adds more work to merging, which is running asynchronously in its own threads.

Indexing boils down to updating large hash tables (inverted indexes) or graphs (HNSW). The bigger they get, the slower the updates become, because you get more cache misses, etc. So flushing N segments of size S is more costly than flushing 2N segments of size S/2. But in turn, the smaller segments add more work for merging. In your case, I'm assuming that you are not maxing out your CPU, so merging can take all the CPU it wants and indexing appears to be faster. But if you were trying to max out indexing, so that indexing and merging were competing for the same resources, then you would see a slowdown when decreasing the RAM buffer. Likewise if you told Lucene to run merging in the indexing threads rather than in their own threads (SerialMergeScheduler instead of ConcurrentMergeScheduler).
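To illustrate that last point, Lucene's merge scheduler is configurable on IndexWriterConfig (a sketch for reference, not Anserini's actual configuration):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.ConcurrentMergeScheduler;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.SerialMergeScheduler;

IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());

// Default: merges run in their own background threads, so they can
// overlap with indexing whenever spare CPU is available.
config.setMergeScheduler(new ConcurrentMergeScheduler());

// Alternative: merges run inline on the indexing threads. With this
// scheduler, the extra merge work caused by a small RAM buffer would
// show up directly as slower indexing.
// config.setMergeScheduler(new SerialMergeScheduler());
```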

@lintool
Member Author

lintool commented Dec 5, 2023

Ah, makes sense! I am using ConcurrentMergeScheduler.

Also, I guess that merging is (typically) disk-throughput bound... and quite efficient, since merging sorted lists is a linear-time operation.
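As a toy illustration of that linear-time merge (plain Java, not Lucene's actual postings-merge code): two sorted docid lists combine in a single O(n + m) pass with one pointer per list, which is why segment merging is mostly sequential I/O.

```java
import java.util.Arrays;

public class PostingsMerge {
    // Merge two sorted docid arrays into one sorted array in O(n + m).
    public static int[] merge(int[] a, int[] b) {
        int[] out = new int[a.length + b.length];
        int i = 0, j = 0, k = 0;
        while (i < a.length && j < b.length) {
            out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
        }
        // Copy whichever list still has elements remaining.
        while (i < a.length) out[k++] = a[i++];
        while (j < b.length) out[k++] = b[j++];
        return out;
    }

    public static void main(String[] args) {
        int[] merged = merge(new int[]{1, 4, 7}, new int[]{2, 5, 9});
        System.out.println(Arrays.toString(merged)); // [1, 2, 4, 5, 7, 9]
    }
}
```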

@jpountz
Contributor

jpountz commented Dec 5, 2023

Right. It's rather efficient, but almost always still more expensive than avoiding some of that merging in the first place, by accumulating bigger flushed segments via a bigger RAM buffer.

@lintool lintool closed this as completed Dec 16, 2023