
LUCENE-9827: avoid wasteful recompression for small segments #28

Merged 3 commits into apache:main on Apr 6, 2021

Conversation

@rmuir (Member) commented Mar 21, 2021

Require that the segment has enough dirty documents to create a clean
chunk before recompressing during merge: there must be at least maxChunkSize of them.

This prevents wasteful recompression with small flushes (e.g. every
document): we ensure recompression achieves some "permanent" progress.

Expose maxDocsPerChunk as a parameter for Term vectors too, matching the
stored fields format. This allows for easy testing.

See JIRA for more details: https://issues.apache.org/jira/browse/LUCENE-9827?focusedCommentId=17305712&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17305712
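The heuristic described above can be sketched as follows. This is a hypothetical illustration, not Lucene's actual merge code: the names `tooDirty`, `numDirtyDocs`, and `maxDocsPerChunk` are assumptions used to show the idea that a segment is only worth recompressing when its dirty documents could fill at least one full clean chunk, so recompression makes "permanent" progress.

```java
// Hypothetical sketch of the dirty-chunk policy described in this PR.
// Names are illustrative assumptions, not Lucene's exact fields/methods.
public class DirtyChunkPolicy {

    /**
     * Only recompress a segment during merge when its dirty documents
     * could fill at least one complete clean chunk. A segment flushed
     * one document at a time accumulates many tiny dirty chunks, but
     * recompressing it repeatedly is wasteful unless a full chunk of
     * "permanent" progress can be made.
     */
    static boolean tooDirty(long numDirtyDocs, int maxDocsPerChunk) {
        return numDirtyDocs >= maxDocsPerChunk;
    }

    public static void main(String[] args) {
        // 100 dirty docs, chunks hold up to 128: not worth recompressing yet
        System.out.println(tooDirty(100, 128));
        // 128 dirty docs: a full clean chunk can be produced, so recompress
        System.out.println(tooDirty(128, 128));
    }
}
```

Under this sketch, a merge that would previously recompress on any dirtiness now skips segments whose dirty tail is smaller than one chunk, leaving them to be cleaned up by a later, larger merge.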

@rmuir rmuir requested a review from jpountz March 21, 2021 16:53
rmuir and others added 2 commits March 21, 2021 22:06
If segment N needs recompression, we have to flush any buffered docs
before bulk-copying segment N+1. Don't just increment numDirtyChunks;
make sure numDirtyDocs is incremented as well.

This doesn't have a performance impact, and is unrelated to tooDirty()
improvements, but it is easier to reason about things with correct
statistics in the index.
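The bookkeeping fix above can be sketched like this. It is a minimal illustration, assuming hypothetical field and method names (`MergeStats`, `flushBufferedDocs`) rather than Lucene's actual writer code: when buffered docs are flushed early so the next segment can be bulk-copied, the resulting partial chunk is dirty, so both counters must move together.

```java
// Hypothetical sketch of keeping numDirtyChunks and numDirtyDocs in sync.
// Class and field names are illustrative assumptions, not Lucene's code.
public class MergeStats {
    long numDirtyChunks;
    long numDirtyDocs;
    int bufferedDocs; // docs buffered but not yet written as a chunk

    /**
     * Called before bulk-copying the next segment: any partially filled
     * chunk we force out now is a dirty chunk, and every document inside
     * it is a dirty document. Incrementing only numDirtyChunks would
     * leave the index statistics inconsistent.
     */
    void flushBufferedDocs() {
        if (bufferedDocs > 0) {
            numDirtyChunks++;             // the forced partial chunk is dirty
            numDirtyDocs += bufferedDocs; // so are all docs it contains
            bufferedDocs = 0;
        }
    }
}
```

Keeping the two counters consistent doesn't change merge performance, but it means any heuristic built on these statistics (such as the dirty-docs threshold in this PR) reasons from correct numbers.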
@rmuir (Member, Author) commented Mar 24, 2021

Thanks @jpountz for the commit! The code is simpler and does fine with the testing I have thrown at it. If anything, it seems faster: I indexed 1M docs (flushing every doc), and it went 20% faster with 4856b6f than without.

@rmuir rmuir merged commit be94a66 into apache:main Apr 6, 2021
janhoy pushed a commit to cominvent/lucene that referenced this pull request May 12, 2021