
Fix race condition in ExponentiallyDecayingReservoir's rescale method #1033

Merged 2 commits into dropwizard:3.2-development on Jan 5, 2017


@ho3rexqj ho3rexqj commented Dec 30, 2016

Fixes issues #1005 and #793. Note that in #793 the longPeriodsOfInactivityShouldNotCorruptSamplingState test did not cover multiple concurrent requests after a long idle period.

The race condition occurs when two threads update the reservoir concurrently: after the first thread has successfully updated nextScaleTime (line 162) but before it obtains the write lock (line 163), a second thread can update the reservoir (line 95+). With several threads pushing updates concurrently, this presents a narrow window during which the second thread may obtain the lock before the first, corrupting the reservoir state. Note, however, that if a snapshot is requested just before the two concurrent updates, the read lock will already be held when the two threads enter rescaleIfNeeded, which widens the window for this race considerably.
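A minimal sketch of the vulnerable pattern, paraphrased from the description above (ReservoirSketch, RESCALE_PERIOD, and rescaleCount are placeholder names, not the library's actual code): the CAS on nextScaleTime and the acquisition of the write lock are two separate steps, so another updater can slip in between them.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class ReservoirSketch {
    static final long RESCALE_PERIOD = 3_600_000_000_000L; // placeholder threshold

    final AtomicLong nextScaleTime = new AtomicLong(0);
    final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    int rescaleCount = 0;

    void rescaleIfNeeded(long now) {
        long next = nextScaleTime.get();
        if (now >= next) {
            // Step 1: the CAS admits exactly one thread...
            if (nextScaleTime.compareAndSet(next, now + RESCALE_PERIOD)) {
                // ...but step 2 is separate: between the successful CAS above
                // and acquiring the write lock below, another updating thread
                // can observe the advanced nextScaleTime while the reservoir
                // still holds the old, un-rescaled samples.
                lock.writeLock().lock();
                try {
                    rescale(now);
                } finally {
                    lock.writeLock().unlock();
                }
            }
        }
    }

    private void rescale(long now) {
        rescaleCount++; // stands in for rewriting startTime and the sample weights
    }
}
```

The fix merged here closes the window by making the rescale decision and the rescale itself happen under the same lock, at the cost of a more expensive check on the hot path.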

@arteam arteam merged commit 4abcd4d into dropwizard:3.2-development Jan 5, 2017
1 check passed
continuous-integration/travis-ci/pr The Travis CI build passed
@arteam arteam commented Jan 5, 2017

Thank you very much for the fix and very detailed unit test!

@ryantenney ryantenney commented Jan 5, 2017

I have an issue with this. The compareAndSet operation is atomic, so it shouldn't allow more than one thread into that if block. Granted, lockForRescale() effectively does the same thing, but it is a more expensive operation than the compareAndSet.

@arteam arteam commented Jan 5, 2017

The issue as I see it:

  • Thread 1 calls update
  • Thread 2 calls update
  • Thread 1 calls rescaleIfNeeded
  • Thread 2 calls rescaleIfNeeded
  • Thread 1 reads the CAS and updates it
  • Thread 1 gets swapped out
  • Thread 2 reads the updated CAS and exits
  • Thread 2 obtains a read lock and works with a new weight, but old (not scaled) values
  • Thread 1 gets swapped in and acquires a write lock, but it's too late

Basically, the CAS only guarantees that nextScaleTime gets updated atomically, but not the rest of the reservoir's state.
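The interleaving above can be replayed deterministically. This is a sketch under simplified assumptions (RaceReplay, RESCALE_PERIOD, and the `rescaled` flag are illustrative names, not library code); both "threads" run step by step on one thread so the outcome is reproducible.

```java
import java.util.concurrent.atomic.AtomicLong;

class RaceReplay {
    static final long RESCALE_PERIOD = 1_000L;
    static final AtomicLong nextScaleTime = new AtomicLong(0);
    // Stands in for the reservoir state that is only rewritten under the write lock.
    static boolean rescaled = false;

    static boolean thread2SkipsRescale(long now) {
        // Thread 1: wins the CAS, then gets swapped out *before* acquiring
        // the write lock, so `rescaled` is still false.
        nextScaleTime.compareAndSet(0, now + RESCALE_PERIOD);

        // Thread 2: its CAS guard reads the already-advanced deadline,
        // concludes no rescale is due, and goes on to update() against the
        // stale, un-rescaled state.
        return now < nextScaleTime.get() && !rescaled;
    }
}
```

The method returning true is exactly the hazard: thread 2 proceeds past the guard even though the rescale that thread 1 is responsible for has not happened yet.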

@arteam arteam commented Jan 9, 2017

If the fix is correct, I would like to cherry-pick this change to the 3.1.3-maintenance branch to give users affected by this bug an easy upgrade path.

@arteam arteam commented Jan 23, 2017

Since no further objections to this fix came up, I'm going to move this bugfix to the 3.1.3 branch. In the worst case we have a performance regression; in the best case we fix a concurrency bug.

@arteam arteam added this to the 3.1.3 milestone Jan 23, 2017
@arteam arteam mentioned this pull request Jan 23, 2017
@arteam arteam modified the milestones: 3.2.0, 3.1.3 Jan 23, 2017
@ho3rexqj ho3rexqj deleted the ho3rexqj:fix/exponentially_decaying_reservoir_rescale_race branch Feb 3, 2017
@ahadadi ahadadi commented May 1, 2017

@arteam, I think your comment:
"Thread 2 obtains a read lock and works with a new weight, but old (not scaled) values"
is incorrect.
The weight becomes new when startTime gets updated, but it gets updated only under the write lock by thread 1.
I still think this is a good change, as it simplifies reasoning about the code.
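This point can be illustrated with a sketch of how a sample's weight relates to startTime. The exponential form and the names (WeightSketch, ALPHA) are simplifications assumed here, not the library's exact code; what matters is that the weight is computed from startTime, which only changes inside rescale(), under the write lock.

```java
class WeightSketch {
    static final double ALPHA = 0.015;
    // Rewritten only inside rescale(), under the write lock.
    static long startTime = 0;

    static double weight(long timestampSeconds) {
        // The weight depends on startTime: until rescale() advances startTime,
        // concurrent updaters keep computing weights against the old origin,
        // so thread 2 cannot see a "new" weight before thread 1's rescale runs.
        return Math.exp(ALPHA * (timestampSeconds - startTime));
    }
}
```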
