
Make sure DocumentsWriterPerThread#getAndLock never returns null on a non-empty queue. #12959

Merged

Conversation

@jpountz (Contributor) commented Dec 21, 2023

Before this change, `DocumentsWriterPerThread#getAndLock` could sometimes
return `null` even though the queue was never empty at any point in time. The
practical implication is that we can end up with more DWPTs in memory than
indexing threads, which, while not strictly a bug, may require doing more
merging than we'd like later on.

I ran luceneutil's `IndexGeonames` with this change, and
`DocumentsWriterPerThread#getAndLock` was not the main source of
contention.

Closes #12649 #12916

Make sure `ConcurrentApproximatePriorityQueue#poll` never returns `null` on a non-empty queue.

Before this change, `ConcurrentApproximatePriorityQueue#poll` could sometimes
return `null` even though the queue was never empty at any point in time. The
practical implication is that we can end up with more DWPTs in memory than
indexing threads, which, while not strictly a bug, may require doing more
merging than we'd like later on.

I ran luceneutil's `IndexGeonames` with this change, and
`ConcurrentApproximatePriorityQueue#poll` was not the main source of
contention. I instrumented the code to check how many DWPTs got pulled from the
queue using the optimistic path vs. pessimistic path and got 8525598 for the
optimistic path vs. 12247 for the pessimistic path.

Closes apache#12649 apache#12916
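
To make the optimistic vs. pessimistic distinction above concrete, here is a minimal, hypothetical sketch of a lock-striped queue with two poll paths. The class and member names are invented for illustration and this is not the actual `ConcurrentApproximatePriorityQueue` code; it only shows the general shape: a cheap `tryLock` scan first, and a blocking scan as the fallback.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch, not the Lucene implementation: entries live in a few
// lock-guarded sub-queues; poll() tries an optimistic pass before falling
// back to a pessimistic one.
final class StripedQueueSketch<T> {
  private final ReentrantLock[] locks;
  private final Deque<T>[] slots;

  @SuppressWarnings("unchecked")
  StripedQueueSketch(int stripes) {
    locks = new ReentrantLock[stripes];
    slots = (Deque<T>[]) new Deque[stripes];
    for (int i = 0; i < stripes; i++) {
      locks[i] = new ReentrantLock();
      slots[i] = new ArrayDeque<>();
    }
  }

  void add(T entry, int stripe) {
    locks[stripe].lock();
    try {
      slots[stripe].addLast(entry);
    } finally {
      locks[stripe].unlock();
    }
  }

  T poll() {
    // Optimistic path: only touch sub-queues whose lock is free right now.
    // In the instrumentation above, this kind of path served the vast
    // majority of polls (8525598 vs. 12247).
    for (int i = 0; i < locks.length; i++) {
      if (locks[i].tryLock()) {
        try {
          T entry = slots[i].pollFirst();
          if (entry != null) {
            return entry;
          }
        } finally {
          locks[i].unlock();
        }
      }
    }
    // Pessimistic path: block on each lock in turn so a sub-queue that was
    // briefly held by another thread is not skipped.
    for (int i = 0; i < locks.length; i++) {
      locks[i].lock();
      try {
        T entry = slots[i].pollFirst();
        if (entry != null) {
          return entry;
        }
      } finally {
        locks[i].unlock();
      }
    }
    return null;
  }
}
```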
@uschindler (Contributor)

It looks like this does not fix the issue in #12916. The bug is real, but it does not look like the cause of #12916.

@dweiss (Contributor) commented Dec 21, 2023

For what it's worth, I've tried reproducing this on Adrien's branch with:

gradlew -p lucene/core -Dtests.seed=F7B4CD7A5624D5EC beast --tests TestIndexWriterThreadsToSegments.testSegmentCountOnFlushRandom -Dtests.jvmargs="-XX:+UseCompressedOops" -Ptests.iters=1000 -Ptests.dups=100

but nope, no fails.

@uschindler (Contributor)

I got it to fail with both OpenJ9 and Hotspot on this branch; see #12916.

@jpountz (Contributor, Author) commented Dec 23, 2023

OK, I think I understand why the test is still failing: it seems to be because we unlock the DWPT after putting it back into the queue. But there is also a good reason to do it that way. I need to think more about it.
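
As an aside, the interleaving described above can be walked through deterministically. The sketch below is hypothetical (the `FakeDwpt` class and the `Semaphore` standing in for the DWPT lock are invented for illustration; this is not Lucene code): because the DWPT is re-enqueued before its lock is released, another thread can poll it from a non-empty queue, fail to lock it, and wrongly conclude that nothing is available.

```java
import java.util.ArrayDeque;
import java.util.concurrent.Semaphore;

// Hypothetical walk-through of the race, compressed into a single thread.
public class SpuriousNullDemo {
  static final class FakeDwpt {
    // Non-reentrant stand-in for the DWPT lock.
    final Semaphore lock = new Semaphore(1);
  }

  public static void main(String[] args) {
    ArrayDeque<FakeDwpt> pool = new ArrayDeque<>();
    FakeDwpt dwpt = new FakeDwpt();

    // "Thread A": checks the DWPT out and holds its lock while indexing.
    dwpt.lock.acquireUninterruptibly();

    // "Thread A": puts the DWPT back into the pool before releasing the lock.
    pool.addLast(dwpt);

    // "Thread B": the pool is non-empty, so polling succeeds ...
    FakeDwpt polled = pool.pollFirst();

    // ... but the lock is still held, so tryAcquire fails and a naive
    // getAndLock would conclude nothing is available and return null.
    System.out.println("B locked the polled DWPT: " + polled.lock.tryAcquire()); // false

    // Only now does "thread A" release the lock; too late for thread B.
    dwpt.lock.release();
  }
}
```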

@jpountz (Contributor, Author) commented Jan 8, 2024

I have a new iteration of this change, which now also accounts for how `DocumentsWriterPerThreadPool` may lock `DocumentsWriterPerThread` instances while they are still in the pool. I'm at 2 hours of beasting with J9 without a failure so far.

@jpountz (Contributor, Author) commented Jan 11, 2024

If there are no concerns, I plan on merging this PR soon.

@uschindler (Contributor)

Hi,
sorry, I missed testing this. I have started my beasting runs and will try to reproduce it.
Will report back.

@uschindler (Contributor) left a comment

I beasted this PR and it produced no failures on my Ryzen CPU under load. So it looks like it fixes the issue.

The AtomicInteger is only ever incremented, so at some point it will overflow. But from analyzing the code, the actual number does not matter; it just needs to be different to trigger a retry, correct? So when it overflows it does not do any harm.
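
For what the overflow question is worth, here is a minimal sketch of the retry pattern as described, under the assumption (not verified against the actual Lucene code) that the counter is bumped on every add and compared only for equality in poll. Under that assumption, wrap-around at `Integer.MAX_VALUE` is indeed harmless, since only "did the value change" matters, never its magnitude.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Assumed sketch of the counter-based retry, not the actual Lucene code.
final class VersionedPollSketch<T> {
  private final AtomicInteger version = new AtomicInteger();

  // Called whenever an entry is added; may eventually overflow, which is fine
  // because the value is only ever compared for (in)equality below.
  void onAdd() {
    version.incrementAndGet();
  }

  // tryPoll is whatever single-pass poll the underlying queue implements.
  T pollOrRetry(Supplier<T> tryPoll) {
    while (true) {
      int observed = version.get();
      T entry = tryPoll.get();
      if (entry != null) {
        return entry;
      }
      // Nothing found and no add happened since we started: genuinely empty.
      if (version.get() == observed) {
        return null;
      }
      // An add raced with our scan; retry instead of returning null.
    }
  }
}
```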

@uschindler (Contributor)

FYI, I did not try to run the beasting for hours because previously it was failing very fast on a heavily loaded Ryzen CPU with both Hotspot and OpenJ9. Now it's stable, so I trust the results even after only 2 hours (1 hour with Hotspot, 1 hour with OpenJ9).

@jpountz changed the title from "Make sure ConcurrentApproximatePriorityQueue#poll never returns null on a non-empty queue." to "Make sure DocumentsWriterPerThread#getAndLock never returns null on a non-empty queue." on Jan 12, 2024.
@jpountz (Contributor, Author) commented Jan 12, 2024

Thanks a lot @uschindler !

@jpountz merged commit e0daca1 into apache:main on Jan 12, 2024 (4 checks passed).
@jpountz deleted the concurrent_pq_poll_non_null_on_non_empty_queue branch on January 12, 2024 at 15:21.
@jpountz added a commit that referenced this pull request on Jan 12, 2024: Make sure `DocumentsWriterPerThread#getAndLock` never returns `null` on a non-empty queue. (#12959)
@uschindler added this to the 9.10.0 milestone on Jan 12, 2024.
slow-J pushed a commit to slow-J/lucene that referenced this pull request on Jan 16, 2024: Make sure `DocumentsWriterPerThread#getAndLock` never returns `null` on a non-empty queue. (apache#12959)
Successfully merging this pull request may close these issues:

TestIndexWriterThreadsToSegments.testSegmentCountOnFlushRandom fails randomly