Fix large 99 percentile match latency in Monitor during concurrent commits and purges #12801
Description
Within Lucene Monitor, there is a thread contention issue that manifests as multi-second latencies in `Monitor.match` when it is called concurrently with `Monitor.register`/`Monitor.deleteById` and `QueryIndex.purgeCache`. The source of the issue is the usage of `purgeLock`, a `ReentrantReadWriteLock`.

Within the `QueryIndex`, there are 3 usages of the `purgeLock`: `commit` (called by `Monitor.register`/`Monitor.deleteById`), `search` (called by `Monitor.match`), and `purgeCache` (called by a background thread at a consistent interval). Within `commit` and `search` the read lock is acquired, and within `purgeCache` the write lock is acquired.

`search` calls are very fast and only hold the `purgeLock` for short durations. However, `commit` calls are quite long and will hold the `purgeLock` for multi-second durations. Ostensibly, because both `commit` and `search` only need the read lock, they should be able to execute concurrently, and the long duration of `commit` should not impact `search`. However, when `purgeCache` attempts to acquire the write lock, this no longer holds true. Once `purgeCache` is waiting to acquire the write lock, some attempts to acquire the read lock may, depending on timing, have to wait until the write lock has been acquired and released (otherwise attempts to acquire the write lock would be starved).

What this means is that if `commit` is holding the read lock, and then `purgeCache` attempts to acquire the write lock, some of the calls to `match` that want to acquire the read lock will be queued behind the `purgeCache` attempt to acquire the write lock. Because `purgeCache` cannot acquire the write lock until `commit` releases the read lock, and because those `search` calls cannot acquire the read lock until `purgeCache` acquires and releases the write lock, those `search` calls end up having to wait for `commit` to complete. This makes it possible for `search` calls to take as long as `commit` does to complete (multi-second rather than millisecond).

Diagrams
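As a code-level illustration of the queuing described in the description, the following is a minimal, self-contained sketch rather than Lucene code. The class name `PurgeLockContentionDemo`, the method `measureSearchWaitMillis`, and the sleep durations are assumptions made for the example; only `ReentrantReadWriteLock` and the roles of `commit`, `search`, and `purgeCache` come from the description. It holds the read lock in a simulated `commit`, queues a writer as `purgeCache` would, and then times how long a `search`-style read lock acquisition waits.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for the purgeLock usage described above; not Lucene code.
class PurgeLockContentionDemo {

  /** Returns how long (ms) a simulated search waits for the read lock behind a queued writer. */
  static long measureSearchWaitMillis(long commitHoldMillis) throws InterruptedException {
    ReentrantReadWriteLock purgeLock = new ReentrantReadWriteLock(); // non-fair (the default)
    CountDownLatch commitHasReadLock = new CountDownLatch(1);

    // Simulated commit: takes the read lock and holds it for a long time.
    Thread commit = new Thread(() -> {
      purgeLock.readLock().lock();
      try {
        commitHasReadLock.countDown();
        Thread.sleep(commitHoldMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      } finally {
        purgeLock.readLock().unlock();
      }
    });

    // Simulated purgeCache: requests the write lock and queues behind the commit.
    Thread purge = new Thread(() -> {
      purgeLock.writeLock().lock();
      try {
        // A real purge would do its (short) work here.
      } finally {
        purgeLock.writeLock().unlock();
      }
    });

    commit.start();
    commitHasReadLock.await();
    purge.start();
    while (!purgeLock.hasQueuedThreads()) {
      Thread.sleep(1); // wait until the writer is queued
    }
    Thread.sleep(50); // small margin so the writer is fully enqueued

    // Simulated search: a plain read lock acquisition, timed.
    long start = System.nanoTime();
    purgeLock.readLock().lock();
    try {
      return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    } finally {
      purgeLock.readLock().unlock();
      commit.join();
      purge.join();
    }
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println("search waited ~" + measureSearchWaitMillis(1000) + " ms for the read lock");
  }
}
```

In this sketch the reader arrives only after the writer is queued, so it always takes the slow path; in the real workload only the `search` calls that arrive in that window are affected.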
For illustration purposes I have the following 2 diagrams of the problem. The first diagram demonstrates the wait order for the `ReentrantReadWriteLock` in the high latency match scenario. For the purpose of the diagram I am assuming that `commit` takes 3 seconds, `purgeCache` takes 3 milliseconds, and `search` takes 3 milliseconds as well. Most of the calls to `search` will not end up in this scenario, but without the fix included in this PR some of the calls to `search` will end up in this scenario and be slow.

The way in which the calls to `search` end up being slow is more clearly shown in the second diagram, which shows a timeline that has a multi-second delayed `search` call, making the same assumptions about the runtime of the 3 operations as the first diagram. In the below diagram, `Search Thread A` avoids the problem and has expected latency, but `Search Thread B` runs into the problem and has very high latency.

Solution
This issue can be resolved by ensuring that `purgeCache` never attempts to acquire the write lock when `commit` is holding the read lock. This can be done by simply using a mutual exclusion lock between the two since:
- `purgeCache` is not time-sensitive, so it can afford to wait until `commit` completes.
- `commit` is arguably time-sensitive, but `purgeCache` is a very quick operation, so sometimes having to wait on it won't make a meaningful difference in the time it takes for `commit` to complete by itself.

Test Case / Demo
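The mutual exclusion idea from the Solution section can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: the class `CommitPurgeMutexSketch`, the field `commitPurgeMutex`, and the helper methods are hypothetical names, and the `Runnable` parameters stand in for the real work of each operation. `commit` and `purgeCache` both take the mutex before touching `purgeLock`, so a writer is never queued on `purgeLock` while a commit holds the read lock, and `search` is left unchanged.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of the locking scheme; names do not come from the Lucene source.
class CommitPurgeMutexSketch {
  private final ReentrantReadWriteLock purgeLock = new ReentrantReadWriteLock();
  // Assumed name: keeps commit and purgeCache from overlapping.
  private final ReentrantLock commitPurgeMutex = new ReentrantLock();

  void commit(Runnable work) {
    commitPurgeMutex.lock(); // purgeCache cannot request the write lock while we run
    try {
      purgeLock.readLock().lock();
      try {
        work.run();
      } finally {
        purgeLock.readLock().unlock();
      }
    } finally {
      commitPurgeMutex.unlock();
    }
  }

  void search(Runnable work) {
    purgeLock.readLock().lock(); // unchanged: read lock only
    try {
      work.run();
    } finally {
      purgeLock.readLock().unlock();
    }
  }

  void purgeCache(Runnable work) {
    commitPurgeMutex.lock(); // waits here, not on purgeLock, while a commit is running
    try {
      purgeLock.writeLock().lock();
      try {
        work.run();
      } finally {
        purgeLock.writeLock().unlock();
      }
    } finally {
      commitPurgeMutex.unlock();
    }
  }

  /** Times a search issued while a long commit runs and a purge is already waiting. */
  static long measureSearchWaitMillis(long commitHoldMillis) throws InterruptedException {
    CommitPurgeMutexSketch index = new CommitPurgeMutexSketch();
    CountDownLatch commitRunning = new CountDownLatch(1);
    Thread commit = new Thread(() -> index.commit(() -> {
      commitRunning.countDown();
      sleepQuietly(commitHoldMillis);
    }));
    Thread purge = new Thread(() -> index.purgeCache(() -> {}));
    commit.start();
    commitRunning.await();
    purge.start();
    while (!index.commitPurgeMutex.hasQueuedThreads()) {
      Thread.sleep(1); // wait until the purge is parked on the mutex
    }
    long start = System.nanoTime();
    index.search(() -> {});
    long waited = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    commit.join();
    purge.join();
    return waited;
  }

  static void sleepQuietly(long millis) {
    try {
      Thread.sleep(millis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}
```

Both `commit` and `purgeCache` acquire the mutex first and `purgeLock` second, so the lock order is consistent. One side effect of this particular sketch is that concurrent `commit` calls are also serialized with each other by the mutex.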
To demonstrate the problem I have this gist that runs `commit` and `search` concurrently while recording the durations of the calls to `search`. Then it runs `commit`, `search` and `purgeCache` concurrently while recording the durations of `search`. Comparing the p99 to p100 durations between the run where `purgeCache` wasn't run concurrently and the run where it was shows the high latency behavior when all 3 operations are combined.

Without this change, running the gist on my laptop gives:
- Without `purgeCache`: mean of p99 thru p100 is 20ms
- With `purgeCache`: mean of p99 thru p100 is 651ms

With this change, running the gist on my laptop gives:
- Without `purgeCache`: mean of p99 thru p100 is 16ms
- With `purgeCache`: mean of p99 thru p100 is 9ms

Even though the latencies in this test are noisy, the order of magnitude difference in the mean of p99 thru p100 when using this change with `commit`, `search` and `purgeCache` run concurrently is statistically significant. This issue is important to fix because it prevents Lucene Monitor from having reliable low latency match performance.

Merge Request
If this PR gets merged, can you please use my `dcook96@bloomberg.net` email address for the squash+merge? Thank you.