
Fix large 99 percentile match latency in Monitor during concurrent commits and purges #12801

Conversation

daviscook477 (Contributor)

Description

Within Lucene Monitor, there is a thread contention issue that manifests as multi-second latencies in the Monitor.match function when it is called concurrently with Monitor.register/Monitor.deleteById and QueryIndex.purgeCache. The source of the issue is the usage of purgeLock, a ReentrantReadWriteLock.

Within the QueryIndex there are three usages of the purgeLock: commit (called by Monitor.register/Monitor.deleteById), search (called by Monitor.match), and purgeCache (called by a background thread at a regular interval). commit and search acquire the read lock, while purgeCache acquires the write lock. search calls are very fast and hold the purgeLock only for short durations, but commit calls are quite long and can hold it for multiple seconds. Since both commit and search only need the read lock, they should be able to execute concurrently, and the long duration of commit should not affect search.

That stops being true when purgeCache attempts to acquire the write lock. Once purgeCache is waiting for the write lock, new attempts to acquire the read lock may be queued behind it until the write lock has been acquired and released (otherwise writers could be starved). So if commit is holding the read lock when purgeCache requests the write lock, some calls to match that need the read lock will queue behind the pending purgeCache. purgeCache cannot take the write lock until commit releases the read lock, and those queued search calls cannot take the read lock until purgeCache has acquired and released the write lock, so those search calls end up waiting for commit to complete. This makes it possible for a search call to take as long as a commit (multiple seconds rather than milliseconds).
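The queueing behavior can be reproduced outside of Monitor. Below is a minimal, self-contained sketch (not Monitor code; the class, thread, and variable names are made up) in which a long-running read-lock holder plus a queued writer delays a later reader on a non-fair ReentrantReadWriteLock. How long the final read-lock acquisition waits depends on the lock's internal queue state, so the exact numbers can vary.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ReadWriteLockQueueingDemo {
  public static void main(String[] args) throws Exception {
    ReentrantReadWriteLock purgeLock = new ReentrantReadWriteLock();
    CountDownLatch commitHoldsReadLock = new CountDownLatch(1);

    // "commit": holds the read lock for a long time (3 seconds here).
    Thread commit = new Thread(() -> {
      purgeLock.readLock().lock();
      try {
        commitHoldsReadLock.countDown();
        sleep(3000);
      } finally {
        purgeLock.readLock().unlock();
      }
    });

    // "purgeCache": asks for the write lock while commit holds the read lock.
    Thread purgeCache = new Thread(() -> {
      purgeLock.writeLock().lock();
      try {
        sleep(3);
      } finally {
        purgeLock.writeLock().unlock();
      }
    });

    commit.start();
    commitHoldsReadLock.await();
    purgeCache.start();
    sleep(100); // give purgeCache time to queue up for the write lock

    // "search": only needs the read lock, but may now queue behind the
    // pending writer and therefore behind the multi-second commit.
    long start = System.nanoTime();
    purgeLock.readLock().lock();
    purgeLock.readLock().unlock();
    long waitedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    System.out.println("search waited " + waitedMs + " ms for the read lock");

    commit.join();
    purgeCache.join();
  }

  private static void sleep(long ms) {
    try {
      Thread.sleep(ms);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}
```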

Diagrams

For illustration purposes I have the following 2 diagrams of the problem. The first diagram demonstrates the wait order for the ReentrantReadWriteLock in the high latency match scenario. For the purpose of the diagram I am assuming that commit takes 3 seconds, purgeCache takes 3 milliseconds, and search takes 3 milliseconds as well. Most of the calls to search will not end up in this scenario, but without the fix included in this PR some of the calls to search will end up in this scenario and be slow.
[Diagram 1: wait order on the ReentrantReadWriteLock in the high-latency match scenario]
The way in which calls to search end up being slow is shown more clearly in the second diagram, a timeline containing a multi-second delayed search call, using the same assumptions about the runtime of the three operations as the first diagram. In the diagram below, Search Thread A avoids the problem and has the expected latency, while Search Thread B runs into the problem and has very high latency.
[Diagram 2: timeline of commit, purgeCache, and two search threads, with Search Thread B delayed by multiple seconds]

Solution

This issue can be resolved by ensuring that purgeCache never attempts to acquire the write lock while commit is holding the read lock. This can be done with a simple mutual exclusion lock between the two (see the sketch after this list), since:

  1. purgeCache is not time-sensitive, so it can afford to wait until commit completes.
  2. commit is arguably time-sensitive, but purgeCache is a very quick operation, so occasionally waiting on it makes no meaningful difference to how long commit takes overall.
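One way to express this (a simplified sketch of the idea, not the actual QueryIndex patch; the QueryIndexSketch class and the purgeMutex field name are hypothetical) is to have commit and purgeCache take a plain ReentrantLock before they touch the purgeLock:

```java
import java.util.concurrent.locks.ReentrantLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Simplified sketch, not the actual QueryIndex code. commit and purgeCache
// serialize against each other via purgeMutex (hypothetical name) before
// touching purgeLock, so purgeCache never queues for the write lock while
// commit is holding the read lock.
class QueryIndexSketch {
  private final ReentrantReadWriteLock purgeLock = new ReentrantReadWriteLock();
  private final ReentrantLock purgeMutex = new ReentrantLock();

  void commit() {
    purgeMutex.lock(); // exclude purgeCache while a commit is in flight
    try {
      purgeLock.readLock().lock();
      try {
        // ... long-running commit work ...
      } finally {
        purgeLock.readLock().unlock();
      }
    } finally {
      purgeMutex.unlock();
    }
  }

  void purgeCache() {
    purgeMutex.lock(); // wait for any in-flight commit before requesting the write lock
    try {
      purgeLock.writeLock().lock();
      try {
        // ... quick cache purge ...
      } finally {
        purgeLock.writeLock().unlock();
      }
    } finally {
      purgeMutex.unlock();
    }
  }

  void search() {
    // Unchanged: search only ever takes the read lock. A writer now only
    // queues when no commit holds the read lock, so any wait behind it is
    // bounded by the short purgeCache rather than the long commit.
    purgeLock.readLock().lock();
    try {
      // ... fast matching work ...
    } finally {
      purgeLock.readLock().unlock();
    }
  }
}
```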

Test Case / Demo

To demonstrate the problem I have this gist that runs commit and search concurrently while recording the durations of the search calls, and then runs commit, search, and purgeCache concurrently while again recording the search durations. Comparing the p99 - p100 durations with and without the concurrent purgeCache shows the high-latency behavior that appears when all three operations run together.
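For reference, the shape of that measurement might look roughly like the sketch below. This is not the gist itself; it reuses the hypothetical QueryIndexSketch class from the previous example, and the run length and loop structure are made up, whereas the real gist does actual indexing and matching work so that commit genuinely takes a long time.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class MatchLatencyDemo {
  public static void main(String[] args) throws Exception {
    QueryIndexSketch index = new QueryIndexSketch();
    List<Long> searchMillis = Collections.synchronizedList(new ArrayList<>());

    Thread committer = loop(index::commit);   // stands in for Monitor.register / deleteById
    Thread purger = loop(index::purgeCache);  // stands in for the background purge
    Thread searcher = loop(() -> {            // stands in for Monitor.match
      long start = System.nanoTime();
      index.search();
      searchMillis.add((System.nanoTime() - start) / 1_000_000);
    });

    Thread.sleep(10_000); // run the mixed workload for a while
    committer.interrupt(); purger.interrupt(); searcher.interrupt();
    committer.join(); purger.join(); searcher.join();

    // Report the mean of the p99 thru p100 tail of the observed search latencies.
    List<Long> sorted = new ArrayList<>(searchMillis);
    Collections.sort(sorted);
    List<Long> tail = sorted.subList((int) (sorted.size() * 0.99), sorted.size());
    double mean = tail.stream().mapToLong(Long::longValue).average().orElse(0);
    System.out.println("mean of p99 thru p100 search latency: " + mean + " ms");
  }

  // Runs the body repeatedly until the thread is interrupted.
  private static Thread loop(Runnable body) {
    Thread t = new Thread(() -> {
      while (!Thread.currentThread().isInterrupted()) {
        body.run();
      }
    });
    t.start();
    return t;
  }
}
```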

Without this change, running the gist on my laptop gives:

  1. No purgeCache - mean of p99 thru p100 is 20ms
  2. Yes purgeCache - mean of p99 thru p100 is 651ms

With this change, running the gist on my laptop gives:

  1. No purgeCache - mean of p99 thru p100 is 16ms
  2. Yes purgeCache - mean of p99 thru p100 is 9ms

Even though the latencies in this test are noisy, the order-of-magnitude drop in the mean of p99 thru p100 when this change is applied and commit, search, and purgeCache run concurrently is statistically significant. This issue is important to fix because it prevents Lucene Monitor from delivering reliably low-latency match performance.

Merge Request

If this PR gets merged, can you please use my dcook96@bloomberg.net email address for the squash+merge. Thank you.

@romseygeek (Contributor) left a comment:
A simple fix with a very detailed and thorough explanation - thank you! I'll merge and backport this some time tomorrow.

@daviscook477 (Contributor, Author)

Thank you for taking a look!

@romseygeek romseygeek merged commit 72dbfb7 into apache:main Nov 14, 2023
4 checks passed
romseygeek pushed a commit that referenced this pull request Nov 14, 2023
Prevent contention on the ReentrantReadWriteLock at search time by
ensuring that `commit` and `purgeCache` are never both trying to
hold the write lock.
romseygeek added a commit that referenced this pull request Nov 14, 2023
romseygeek added a commit that referenced this pull request Nov 14, 2023