
LUCENE-9827: avoid wasteful recompression for small segments #28

Merged 3 commits into apache:main on Apr 6, 2021

Conversation

@rmuir (Member) commented Mar 21, 2021

Require that the segment has enough dirty documents to create a clean
chunk before recompressing during merge: there must be at least maxChunkSize of them.

This prevents wasteful recompression with small flushes (e.g. every
document): we ensure recompression achieves some "permanent" progress.

Expose maxDocsPerChunk as a parameter for Term vectors too, matching the
stored fields format. This allows for easy testing.

See JIRA for more details: https://issues.apache.org/jira/browse/LUCENE-9827?focusedCommentId=17305712&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17305712
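The heuristic described above can be sketched as follows. This is a hypothetical illustration, not Lucene's actual merge code: the names `tooDirty`, `numDirtyDocs`, and `maxDocsPerChunk` are assumptions used to show the idea that a segment is only worth recompressing when its dirty documents could fill at least one full clean chunk, so recompression makes "permanent" progress.

```java
// Hypothetical sketch of the dirty-chunk policy described in this PR.
// Names are illustrative assumptions, not Lucene's exact fields/methods.
public class DirtyChunkPolicy {

    /**
     * Only recompress a segment during merge when its dirty documents
     * could fill at least one complete clean chunk. A segment flushed
     * one document at a time accumulates many tiny dirty chunks, but
     * recompressing it repeatedly is wasteful unless a full chunk of
     * "permanent" progress can be made.
     */
    static boolean tooDirty(long numDirtyDocs, int maxDocsPerChunk) {
        return numDirtyDocs >= maxDocsPerChunk;
    }

    public static void main(String[] args) {
        // 100 dirty docs, chunks hold up to 128: not worth recompressing yet
        System.out.println(tooDirty(100, 128));
        // 128 dirty docs: a full clean chunk can be produced, so recompress
        System.out.println(tooDirty(128, 128));
    }
}
```

Under this sketch, a merge that would previously recompress on any dirtiness now skips segments whose dirty tail is smaller than one chunk, leaving them to be cleaned up by a later, larger merge.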

@rmuir rmuir requested a review from jpountz March 21, 2021 16:53
rmuir and others added 2 commits March 21, 2021 22:06
If segment N needs recompression, we have to flush any buffered docs
before bulk-copying segment N+1. Don't just increment numDirtyChunks;
make sure numDirtyDocs is incremented as well.

This doesn't have a performance impact, and is unrelated to tooDirty()
improvements, but it is easier to reason about things with correct
statistics in the index.
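The bookkeeping fix above can be sketched like this. It is a minimal illustration, assuming hypothetical field and method names (`MergeStats`, `flushBufferedDocs`) rather than Lucene's actual writer code: when buffered docs are flushed early so the next segment can be bulk-copied, the resulting partial chunk is dirty, so both counters must move together.

```java
// Hypothetical sketch of keeping numDirtyChunks and numDirtyDocs in sync.
// Class and field names are illustrative assumptions, not Lucene's code.
public class MergeStats {
    long numDirtyChunks;
    long numDirtyDocs;
    int bufferedDocs; // docs buffered but not yet written as a chunk

    /**
     * Called before bulk-copying the next segment: any partially filled
     * chunk we force out now is a dirty chunk, and every document inside
     * it is a dirty document. Incrementing only numDirtyChunks would
     * leave the index statistics inconsistent.
     */
    void flushBufferedDocs() {
        if (bufferedDocs > 0) {
            numDirtyChunks++;             // the forced partial chunk is dirty
            numDirtyDocs += bufferedDocs; // so are all docs it contains
            bufferedDocs = 0;
        }
    }
}
```

Keeping the two counters consistent doesn't change merge performance, but it means any heuristic built on these statistics (such as the dirty-docs threshold in this PR) reasons from correct numbers.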
@rmuir (Member, Author) commented Mar 24, 2021

Thanks @jpountz for the commit! The code is simpler and does fine with the testing I have thrown at it. If anything, it seems faster: I indexed 1M docs (flushing every doc), and it went 20% faster with 4856b6f than without.

@rmuir rmuir merged commit be94a66 into apache:main Apr 6, 2021
janhoy pushed a commit to cominvent/lucene that referenced this pull request May 12, 2021