New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduce contention in DocumentsWriterFlushControl. #12198
Conversation
lucene-util's `IndexGeoNames` benchmark is heavily contended when running with many indexing threads, 20 in my case. The main offender is `DocumentsWriterFlushControl#doAfterDocument`, which runs after every index operation to update doc and RAM accounting. This change reduces contention by only updating RAM accounting if the amount of RAM consumption that has not been committed yet by a single DWPT is at least 0.1% of the total RAM buffer size. This effectively batches updates to RAM accounting, similarly to what happens when using `IndexWriter#addDocuments` to index multiple documents at once. Since updates to RAM accounting may be batched, `FlushPolicy` can no longer distinguish between inserts, updates and deletes, so all 3 methods got merged into a single one. With this change, `IndexGeoNames` goes from ~22s to ~19s and the main offender for contention is now `DocumentsWriterPerThreadPool#getAndLock`.
Obtaining a DWPT and putting it back into the pool is subject to contention. This change reduces contention by using 8 sub pools that are tried sequentially. When applied on top of apache#12198, this reduces the time to index geonames with 20 threads from ~19s to ~16-17s.
Obtaining a DWPT and putting it back into the pool is subject to contention. This change reduces contention by using 8 sub pools that are tried sequentially. When applied on top of apache#12198, this reduces the time to index geonames with 20 threads from ~19s to ~16-17s.
…in absolute value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this change. looks good to me. I left a suggestion an a comment! Thanks @jpountz
lucene/core/src/java/org/apache/lucene/index/DocumentsWriterFlushControl.java
Outdated
Show resolved
Hide resolved
return indexWriterConfig.getMaxBufferedDocs() != IndexWriterConfig.DISABLE_AUTO_FLUSH; | ||
} | ||
|
||
@Override | ||
public long flushOnRAMGranularity() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this needs to be an abstract method. I am not sure anyone users this interface at all except of tests and unless we override this method in a test I think it can be an impl detail of the only caller.
…ushControl.java Co-authored-by: Simon Willnauer <simonw@apache.org>
lucene-util's `IndexGeoNames` benchmark is heavily contended when running with many indexing threads, 20 in my case. The main offender is `DocumentsWriterFlushControl#doAfterDocument`, which runs after every index operation to update doc and RAM accounting. This change reduces contention by only updating RAM accounting if the amount of RAM consumption that has not been committed yet by a single DWPT is at least 0.1% of the total RAM buffer size. This effectively batches updates to RAM accounting, similarly to what happens when using `IndexWriter#addDocuments` to index multiple documents at once. Since updates to RAM accounting may be batched, `FlushPolicy` can no longer distinguish between inserts, updates and deletes, so all 3 methods got merged into a single one. With this change, `IndexGeoNames` goes from ~22s to ~19s and the main offender for contention is now `DocumentsWriterPerThreadPool#getAndLock`. Co-authored-by: Simon Willnauer <simonw@apache.org>
Obtaining a DWPT and putting it back into the pool is subject to contention. This change reduces contention by using 8 sub pools that are tried sequentially. When applied on top of #12198, this reduces the time to index geonames with 20 threads from ~19s to ~16-17s.
Obtaining a DWPT and putting it back into the pool is subject to contention. This change reduces contention by using 8 sub pools that are tried sequentially. When applied on top of #12198, this reduces the time to index geonames with 20 threads from ~19s to ~16-17s.
lucene-util's
IndexGeoNames
benchmark is heavily contended when running with many indexing threads, 20 in my case. The main offender isDocumentsWriterFlushControl#doAfterDocument
, which runs after every index operation to update doc and RAM accounting.This change reduces contention by only updating RAM accounting if the amount of RAM consumption that has not been committed yet by a single DWPT is at least 0.1% of the total RAM buffer size or 16kB. This effectively batches updates to RAM accounting, similarly to what happens when using
IndexWriter#addDocuments
to index multiple documents at once. Since updates to RAM accounting may be batched,FlushPolicy
can no longer distinguish between inserts, updates and deletes, so all 3 methods got merged into a single one.With this change,
IndexGeoNames
goes from ~22s to ~19s and the main offender for contention is nowDocumentsWriterPerThreadPool#getAndLock
.