
Fix race condition in ExponentiallyDecayingReservoir's rescale method #1033

Merged
merged 2 commits into dropwizard:3.2-development from ho3rexqj:fix/exponentially_decaying_reservoir_rescale_race on Jan 5, 2017

Conversation

@ho3rexqj
Contributor

ho3rexqj commented Dec 30, 2016

Fixes #1005 and #793. Note that in #793 the longPeriodsOfInactivityShouldNotCorruptSamplingState test did not cover multiple concurrent requests after a long idle period.

The race condition occurs when two threads update the reservoir concurrently: after the first thread has successfully updated nextScaleTime (line 162) but before it obtains the write lock (line 163), a second thread can update the reservoir (line 95+). With several threads pushing updates concurrently this is a narrow window during which the second thread may obtain the lock before the first, resulting in a corrupt reservoir state. Note, however, that if a snapshot is requested just before the two concurrent updates, the read lock will already be held when the two threads enter rescaleIfNeeded, which makes the window for this race condition much larger.
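
For context, the shape of the code being described is roughly the following. This is a minimal sketch with made-up names (RescaleSketch, rescaleIfNeededRacy, rescaleIfNeededFixed), not the actual reservoir source: the racy variant performs the CAS on nextScaleTime before taking the write lock (the lines 162/163 mentioned above), while the fixed variant, matching the direction of this PR, moves that decision under the write lock.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Simplified, paraphrased sketch of the rescale path. Names and structure
// are illustrative and not the exact ExponentiallyDecayingReservoir source.
class RescaleSketch {
    static final long RESCALE_THRESHOLD = 60L * 60 * 1_000_000_000L; // roughly one hour of clock ticks

    final AtomicLong nextScaleTime = new AtomicLong();
    final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    volatile long startTime; // seconds; only reassigned during a rescale

    // Race-prone shape: the CAS on nextScaleTime succeeds *before* the write
    // lock is held, so another thread can update the reservoir (under the
    // read lock) in the window between the CAS and the lock acquisition.
    void rescaleIfNeededRacy(long now) {
        long next = nextScaleTime.get();
        if (now >= next && nextScaleTime.compareAndSet(next, now + RESCALE_THRESHOLD)) {
            lock.writeLock().lock(); // the race window sits just before this call
            try {
                rescale();
            } finally {
                lock.writeLock().unlock();
            }
        }
    }

    // Shape of the fix: acquire the write lock first, then confirm under the
    // lock that this thread is the one that should perform the rescale.
    void rescaleIfNeededFixed(long now) {
        long next = nextScaleTime.get();
        if (now >= next) {
            lock.writeLock().lock();
            try {
                if (nextScaleTime.compareAndSet(next, now + RESCALE_THRESHOLD)) {
                    rescale();
                }
            } finally {
                lock.writeLock().unlock();
            }
        }
    }

    void rescale() {
        // Placeholder: the real rescale also re-weights the stored samples
        // relative to the new startTime.
        startTime = System.currentTimeMillis() / 1000;
    }
}
```

In the fixed shape the write lock is taken whenever the threshold has passed, even if another thread then wins the CAS, which is the extra cost ryantenney raises below.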

@arteam arteam merged commit 4abcd4d into dropwizard:3.2-development Jan 5, 2017

1 check passed
continuous-integration/travis-ci/pr: The Travis CI build passed
@arteam
Member

arteam commented Jan 5, 2017

Thank you very much for the fix and very detailed unit test!

@ryantenney
Member

ryantenney commented Jan 5, 2017

I have an issue with this. The compareAndSet operation is atomic, so it shouldn't allow more than one thread into that if block. Granted, lockForRescale() does effectively the same thing; however, it is a more expensive operation than the compareAndSet.

@arteam
Member

arteam commented Jan 5, 2017

The issue as I see it:

  • Thread 1 calls update
  • Thread 2 calls update
  • Thread 1 calls rescaleIfNeeded
  • Thread 2 calls rescaleIfNeeded
  • Thread 1 reads nextScaleTime and updates it via CAS
  • Thread 1 gets swapped out
  • Thread 2 reads the updated nextScaleTime and skips the rescale
  • Thread 2 obtains a read lock and works with a new weight, but old (not scaled) values
  • Thread 1 gets swapped in and acquires a write lock, but it's too late

Basically, the CAS only guarantees that nextScaleTime gets updated atomically; it does not protect the rest of the reservoir's state.
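
To make the interleaving concrete, here is a rough sketch of the update path, continuing the illustrative RescaleSketch class from the earlier snippet (again paraphrased; ALPHA, samples, weight, and clockTick are made-up names, not the actual source). The read lock does not exclude other readers, so Thread 2 can record a sample in the window after Thread 1's CAS but before Thread 1's write-locked rescale has run.

```java
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.ThreadLocalRandom;

// Additional members of the illustrative RescaleSketch class above.
static final double ALPHA = 0.015;
final ConcurrentSkipListMap<Double, Long> samples = new ConcurrentSkipListMap<>();

void update(long value, long timestampSeconds) {
    // Thread 2 arrives here after Thread 1's CAS has already advanced
    // nextScaleTime, so this call becomes a no-op for Thread 2.
    rescaleIfNeededRacy(clockTick());

    // The read lock is compatible with other readers, so Thread 2 proceeds
    // even though Thread 1's write-locked rescale has not happened yet.
    lock.readLock().lock();
    try {
        double itemWeight = weight(timestampSeconds - startTime);
        samples.put(itemWeight / ThreadLocalRandom.current().nextDouble(), value);
    } finally {
        lock.readLock().unlock();
    }
}

double weight(long elapsedSeconds) {
    return Math.exp(ALPHA * elapsedSeconds); // forward-decay weight relative to startTime
}

long clockTick() {
    return System.nanoTime(); // stand-in for the reservoir's Clock
}
```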

@arteam
Member

arteam commented Jan 9, 2017

If the fix is correct, I would like to cherry-pick this change to the 3.1.3-maintenance branch to give users affected by this bug an easy upgrade path.

@arteam
Member

arteam commented Jan 23, 2017

I haven't received any further objections to this fix, so I'm going to move this bugfix to the 3.1.3 branch. In the worst case we introduce a performance regression; in the best case we fix a concurrency bug.

@arteam arteam added this to the 3.1.3 milestone Jan 23, 2017

@arteam arteam referenced this pull request Jan 23, 2017

Merged

Backport rescale bug #1046

@arteam arteam modified the milestones: 3.2.0, 3.1.3 Jan 23, 2017

@ho3rexqj ho3rexqj deleted the ho3rexqj:fix/exponentially_decaying_reservoir_rescale_race branch Feb 3, 2017

@ahadadi
ahadadi commented May 1, 2017

@arteam, I think your comment:
"Thread 2 obtains a read lock and works with a new weight, but old (not scaled) values"
is incorrect.
The weight only becomes new when startTime gets updated, and startTime is updated only under the write lock by thread 1.
I still think this is a good change, as it simplifies reasoning about the code.
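
That reading matches the sketches above: the only place startTime changes is inside the write-locked rescale, so an update that slips in before the rescale still computes its weight from the old startTime. Expanding the rescale placeholder from the earlier sketch (still illustrative, not the actual source):

```java
// Expanded rescale for the illustrative RescaleSketch class: startTime is
// reassigned, and the existing samples are re-weighted relative to it, only
// while the write lock is held.
void rescale() {
    long oldStartTime = startTime;
    startTime = System.currentTimeMillis() / 1000;

    // Scale existing priorities down so they stay comparable with samples
    // weighted against the new startTime.
    double scalingFactor = Math.exp(-ALPHA * (startTime - oldStartTime));
    ConcurrentSkipListMap<Double, Long> old = new ConcurrentSkipListMap<>(samples);
    samples.clear();
    old.forEach((priority, value) -> samples.put(priority * scalingFactor, value));
}
```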
