I've bumped up default memory buffer size from 4G to 16G, as follows:
config.setRAMBufferSizeMB(args.memoryBuffer);
But I've discovered that more RAM actually slows indexing speed. Here are some runs with the SPLADE++ ED model on MS MARCO v2, sweeping {2G, 4G, 8G, 16G, 32G}.
logs/log.msmarco-v2-passage-splade-pp-ed.02gb.1:2023-12-03 14:21:47,515 INFO [main] index.AbstractIndexer (AbstractIndexer.java:315) - Total 138,364,198 documents indexed in 02:33:25
logs/log.msmarco-v2-passage-splade-pp-ed.04gb.1:2023-12-03 16:56:44,825 INFO [main] index.AbstractIndexer (AbstractIndexer.java:315) - Total 138,364,198 documents indexed in 02:34:55
logs/log.msmarco-v2-passage-splade-pp-ed.08gb.1:2023-12-03 20:01:08,410 INFO [main] index.AbstractIndexer (AbstractIndexer.java:315) - Total 138,364,198 documents indexed in 03:04:20
logs/log.msmarco-v2-passage-splade-pp-ed.16gb.1:2023-12-03 23:34:57,443 INFO [main] index.AbstractIndexer (AbstractIndexer.java:315) - Total 138,364,198 documents indexed in 03:33:45
logs/log.msmarco-v2-passage-splade-pp-ed.32gb.1:2023-12-04 02:54:37,891 INFO [main] index.AbstractIndexer (AbstractIndexer.java:315) - Total 138,364,198 documents indexed in 03:19:36
It seems like more RAM actually slows indexing... is this expected behavior? (This is with spinning disks; on SSDs, the same pattern persists, although it's not as pronounced.)
My guess is that it's not actually faster, it's just taking a bit of work off indexing threads, and adding more work to merging, which is running asynchronously in its own threads.
Indexing boils down to updating large hash tables (inverted indexes) or graphs (HNSW). And the bigger they get, the slower the updates, because you get more cache misses, etc. So flushing N segments of size S is more costly than flushing N*2 segments of size S/2. But in turn, the smaller segments add more work for merging. In your case, I'm assuming that you are not maxing out your CPU, so merging can take all the CPU it wants and indexing appears to be faster. But if you were trying to max out indexing so that indexing and merging would be competing for the same resources, then you would see a slowdown when decreasing the RAM buffer. Likewise if you told Lucene to run merging in indexing threads rather than in their own threads (SerialMergeScheduler instead of ConcurrentMergeScheduler).
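To make the flush-versus-merge tradeoff concrete, here is a rough back-of-the-envelope sketch in plain Java. It is a toy model, not how Lucene actually schedules merges: it assumes each flush produces one segment the size of the RAM buffer, that merging combines a fixed number of segments per round (the `mergeFactor` of 10 is a made-up value), and that merging continues down to a single segment. The 64 GB total is also hypothetical. None of these names come from Lucene or Anserini.

```java
public class FlushMergeModel {

    /** Segments flushed when totalMB of data passes through a buffer of bufferMB (toy assumption: one segment per full buffer). */
    static long flushCount(long totalMB, long bufferMB) {
        return (totalMB + bufferMB - 1) / bufferMB; // ceiling division
    }

    /** Rounds needed to reduce `segments` to one segment, merging up to mergeFactor segments at a time. */
    static int mergeRounds(long segments, int mergeFactor) {
        int rounds = 0;
        while (segments > 1) {
            segments = (segments + mergeFactor - 1) / mergeFactor;
            rounds++;
        }
        return rounds;
    }

    /** MB rewritten by merging, assuming (simplistically) that every round rewrites all of the data once. */
    static long mergedMB(long totalMB, long bufferMB, int mergeFactor) {
        return totalMB * mergeRounds(flushCount(totalMB, bufferMB), mergeFactor);
    }

    public static void main(String[] args) {
        long totalMB = 64L * 1024; // hypothetical total index data, not measured
        int mergeFactor = 10;      // hypothetical, for illustration only
        for (long bufferGB : new long[] {2, 4, 8, 16, 32}) {
            long bufferMB = bufferGB * 1024;
            System.out.printf("buffer=%dG flushes=%d mergeRounds=%d mergedMB=%d%n",
                    bufferGB,
                    flushCount(totalMB, bufferMB),
                    mergeRounds(flushCount(totalMB, bufferMB), mergeFactor),
                    mergedMB(totalMB, bufferMB, mergeFactor));
        }
    }
}
```

In this model, halving the buffer doubles the number of flushed segments and can add a merge round, which is the extra work that ends up on the merge threads. It says nothing about the per-document cost of updating a larger in-memory structure, which is the other half of the tradeoff described above.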
Right. It's rather efficient, but almost always still more expensive than doing less merging by accumulating bigger flush segments in the first place by configuring a bigger RAM buffer.
I'm currently on: #2275 at 3b8bee7
@jpountz @benwtrent @ChrisHegarty @tteofili any ideas here?