Fix large 99 percentile match latency in Monitor during concurrent commits and purges #12801

daviscook477 · 2023-11-13T17:41:07Z

Description

Within Lucene Monitor, there is a thread contention issue that manifests as multi-second latencies in the Monitor.match function when it is called concurrently with Monitor.register/Monitor.deleteById and QueryIndex.purgeCache. The source of the issue is the usage of purgeLock, a ReentrantReadWriteLock.

Within the QueryIndex, there are 3 usages of the purgeLock. They are commit (called by Monitor.register/Monitor.deleteById), ‘search’ (called by Monitor.match), and purgeCache (called by a background thread at a consistent interval). Within commit and search, the read lock is acquired, and within purgeCache, the write lock is acquired. search calls are very fast and only hold the purgeLock for short durations. However, commit calls are quite long and will hold the purgeLock for multi-second durations. Ostensibly because both commit and search only need the read lock, they should be able to execute concurrently and the long duration of commit should not impact search. However, when purgeCache attempts to acquire the write lock this no longer holds true. This is because once purgeCache is waiting to acquire the write lock, attempts to acquire the read lock probabilistically might wait for the write lock to be acquired + released (otherwise attempts to acquire the write lock would be starved). What this means is that if commit is holding the read lock, and then purgeCache attempts to acquire the write lock, some amount of calls to match that want to acquire the read lock will be queued behind the purgeCache attempt to acquire the write lock. Because purgeCache cannot acquire the write lock until commit releases the read lock, and because those search calls cannot acquire the read lock until purgeCache acquires + releases the write lock, those search calls end up having to wait for commit to complete. This makes it possible for search calls to take as long as commit to complete (multi-second rather than millisecond).

Diagrams

For illustration purposes I have the following 2 diagrams of the problem. The first diagram demonstrates the wait order for the ReentrantReadWriteLock in the high latency match scenario. For the purpose of the diagram I am assuming that commit takes 3 seconds, purgeCache takes 3 milliseconds, and search takes 3 milliseconds as well. Most of the calls to search will not end up in this scenario, but without the fix included in this PR some of the calls to search will end up in this scenario and be slow.

The way in which the calls to search end up being slow is more clearly shown in the second diagram which shows a timeline that has a multisecond delayed search call - making the same assumptions about the runtime of the 3 operations as the first diagram. In the below diagram, Search Thread A avoids the problem and has expected latency, but Search Thread B runs into the problem and has very high latency.

Solution

This issue can be resolved by ensuring that purgeCache never attempts to acquire the write lock when commit is holding the read lock. This can be done by simply using a mutual exclusion lock between the two since:

purgeCache is not time-sensitive so it can afford to wait until commit completes.
commit is arguably time sensitive but purgeCache is a very quick operation so sometimes having to wait on it won’t make a meaningful difference in the time it takes for commit to complete by itself.

Test Case / Demo

To demonstrate the problem I have this gist that runs commit and search concurrently while recording the durations of the calls to search. Then it runs commit, search and purgeCache concurrently while recording the durations of search. Comparing the p99 - p100 durations between when purgeCache wasn’t run concurrently vs when it was run concurrently shows the high latency behavior for combining all 3 operations.

Without this change, running the gist on my laptop gives:

No purgeCache - mean of p99 thru p100 is 20ms
Yes purgeCache - mean of p99 thru p100 is 651ms

With this change, running the gist on my laptop gives:

No purgeCache - mean of p99 thru p100 is 16ms
Yes purgeCache - mean of p99 thru p100 is 9ms

Even though the latencies in this test are noisy, the order of magnitude difference in the mean p99 thru p100 when using this change with commit, search and purgeCache run concurrently is statistically significant. This issue is important to fix because it prevents Lucene Monitor from having reliable low latency match performance.

Merge Request

If this PR gets merged, can you please use my dcook96@bloomberg.net email address for the squash+merge. Thank you.

romseygeek

A simple fix with a very detailed and thorough explanation - thank you! I'll merge and backport this some time tomorrow.

daviscook477 · 2023-11-13T21:06:34Z

Thank you for taking a look!

Prevent contention on the ReentrantReadWriteLock at search time by ensuring that `commit` and `purgeCache` are never both trying to hold the write lock.

fix large p99 match request latency when purgeCache is running

bbb460e

romseygeek approved these changes Nov 13, 2023

View reviewed changes

romseygeek merged commit 72dbfb7 into apache:main Nov 14, 2023
4 checks passed

romseygeek added a commit that referenced this pull request Nov 14, 2023

Add CHANGES entry for #12801

dd46971

romseygeek added a commit that referenced this pull request Nov 14, 2023

Add CHANGES entry for #12801

b6d386c

cpoerschke added the module:monitor label Mar 1, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix large 99 percentile match latency in Monitor during concurrent commits and purges #12801

Fix large 99 percentile match latency in Monitor during concurrent commits and purges #12801

daviscook477 commented Nov 13, 2023

romseygeek left a comment

daviscook477 commented Nov 13, 2023

Fix large 99 percentile match latency in Monitor during concurrent commits and purges #12801

Fix large 99 percentile match latency in Monitor during concurrent commits and purges #12801

Conversation

daviscook477 commented Nov 13, 2023

Description

Diagrams

Solution

Test Case / Demo

Merge Request

romseygeek left a comment

Choose a reason for hiding this comment

daviscook477 commented Nov 13, 2023