
Backport rescale bug #1046

Merged

merged 2 commits into 3.1-maintenance from backport_rescale_bug on Feb 8, 2017

3 participants
arteam (Member) commented Jan 23, 2017

Backport of #1033 to the 3.1.* branch.

@arteam arteam added the bug label Jan 23, 2017

@arteam arteam added this to the 3.1.3 milestone Jan 23, 2017

ryantenney (Member) commented Jan 23, 2017

I'd like to look into this further before we backport it.

arteam (Member) commented Jan 23, 2017

Thanks, ping me when you have more thoughts on the matter.

ryantenney (Member) commented Jan 23, 2017

While it does seem to correct an issue, I have concerns about it blocking all threads on the rescale once every hour.


ho3rexqj (Contributor) commented Jan 28, 2017

In both the original and updated implementations, while a thread is performing the rescale operation (within the write lock), all concurrent updates block until the rescale operation is complete. There are behavioral differences, but nothing fundamental changes with respect to the impact on other concurrent updates while a rescale is in progress - could you clarify your concern there?

Note that the race condition the patch fixes allowed updates to sometimes be added to the reservoir outside of the rescale window (that is, without a preceding rescale operation). The fix merely enforces that requirement: updates outside of the rescale window must wait for the rescale operation to complete before updating the reservoir's state. Other implementations could accomplish the same thing - the write-lock barrier is more than is strictly required here:

One approach would be to add a second atomic to rescale (or similar) to guard the call to lockForRescale:

// rescaleGuard is a second atomic; the first thread to claim this rescale
// window wins and performs the rescale under the write lock
if (rescaleGuard.getAndSet(next) != next) {
    lockForRescale();
    try {
        if (nextScaleTime.compareAndSet(next, now + RESCALE_THRESHOLD)) {
            // ... perform the actual rescale ...
        }
    } finally {
        unlockForRescale();
    }
} else {
    // losing threads wait until the winner bumps nextScaleTime;
    // a spin lock is used in this illustration - ideally something else
    // would be used in practice
    while (nextScaleTime.get() == next)
        ; // busy-wait until the rescale completes
}

Alternatively, rather than enforcing the above requirement, it could be relaxed and the reservoir refactored accordingly. For example, a cache could be added to the reservoir to collect updates that arrive in update outside of the rescale window (value/timestamp pairs) - those would then need to be processed as part of the rescale operation, and possibly as part of getSnapshot (although that could likely be omitted).

Note, however, both of the above would add significant complexity to the reservoir. I'm not certain it's at all worthwhile given that the window during which multiple concurrent calls to update will block on lockForRescale should either be small (two concurrent calls to update) or involve three or more competing threads (one thread executing getSnapshot and two or more executing update concurrently).
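For contrast with the guard-based snippet above, the write-lock approach the discussion centers on can be sketched roughly as follows. This is a minimal, self-contained illustration, not the actual Metrics source: the class name, the rescaleCount counter, and the second-based time values are invented for the demo; only the lock/CAS shape mirrors the pattern under discussion.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class RescaleSketch {
    static final long RESCALE_THRESHOLD = 3600; // e.g. one hour, in seconds

    final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    final AtomicLong nextScaleTime = new AtomicLong(RESCALE_THRESHOLD);
    long rescaleCount = 0; // illustrative: counts completed rescales

    void update(long now) {
        // Any update past the deadline first waits for (or performs) the
        // rescale, so no sample lands "outside the rescale window".
        rescaleIfNeeded(now);
        lock.readLock().lock();
        try {
            // ... add the sample to the reservoir ...
        } finally {
            lock.readLock().unlock();
        }
    }

    void rescaleIfNeeded(long now) {
        long next = nextScaleTime.get();
        if (now < next) {
            return; // still inside the current rescale window
        }
        lock.writeLock().lock(); // blocks until in-flight updates drain
        try {
            // Re-check under the lock: only the first thread through
            // performs the rescale; latecomers see the bumped deadline.
            if (nextScaleTime.compareAndSet(next, now + RESCALE_THRESHOLD)) {
                rescaleCount++; // ... rescale the reservoir's weights ...
            }
        } finally {
            lock.writeLock().unlock();
        }
    }

    public static void main(String[] args) {
        RescaleSketch r = new RescaleSketch();
        r.update(100);  // inside the window: no rescale
        r.update(4000); // past the deadline: triggers exactly one rescale
        r.update(4001); // new window already in effect: no rescale
        System.out.println(r.rescaleCount); // prints 1
    }
}
```

The cost being debated is visible in rescaleIfNeeded: every thread that arrives past the deadline contends on the write lock once per window, even though only one of them ends up doing the rescale.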

ho3rexqj added some commits Dec 25, 2016

@arteam arteam merged commit b97a09c into 3.1-maintenance Feb 8, 2017

2 checks passed

continuous-integration/travis-ci/pr - The Travis CI build passed
continuous-integration/travis-ci/push - The Travis CI build passed

@arteam arteam deleted the backport_rescale_bug branch Feb 8, 2017

arteam added a commit that referenced this pull request Feb 15, 2017

Merge changes from the 3.1 branch to 3.2 (#1071)
* Support for publishing metrics of BigInteger and BigDecimal type

* Update third-party.rst

New reporter - metrics-munin-reporter

* Add links to more third party libs

* Update third-party to include metrics-circonus

* Fix spelling in core.rst

* #814 always close socket

(cherry picked from commit 2762ce5)

* Revert "Add log4j2 xml-based config support"

This reverts commit 6667014.

* Revert "Added log4j2 InstrumentedAppender xml support to docs"

This reverts commit 26c4cd2.

* Revert "Fix javadoc link"

This reverts commit 28e6556.

* Revert "Add CPU profiling support to the AdminServlet (fixes #927)"

This reverts commit bbb52c1.

* Revert "Support for publishing metrics of BigInteger and BigDecimal type"

This reverts commit 7cfc23c.

* Suppress all kinds of Exceptions raised by report() (#1049) (#1056)

* Support the JDK's ThreadLocalRandom (#1052)

Add a proxy which provides a ThreadLocalRandom implementation depending on its
availability at runtime. If the JDK provides ThreadLocalRandom, use it; otherwise,
fall back to the internal implementation.

* Support JDK's LongAdder (#1055)

* Support the JDK's LongAdder

Add the ability to use the JDK's implementation if it's available at
runtime. Use a proxy for creating LongAdders which tries to use the JDK's
implementation and falls back to the internal one if the JDK doesn't
provide any.

Unfortunately, `LongAdder`s don't have a common interface, therefore we
introduce `LongAdderAdapter` as a common interface for `LongAdder`s. It
will be used in other components as an interface, but a concrete
implementation of it will be created by `LongAdderProxy`.

This pattern allows us to still run the code on older JREs, but enjoy the
benefits of the stock `LongAdder` on modern JDKs (fewer bugs, better
performance).
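The adapter-plus-factory idea described above can be sketched as follows. This is an illustrative reconstruction, not the actual LongAdderProxy source: the class and method names outside of `LongAdderAdapter` and `java.util.concurrent.atomic.LongAdder` are invented, and reflection stands in for whatever detection mechanism the real code uses.

```java
import java.util.concurrent.atomic.AtomicLong;

public class LongAdderSketch {
    // Common interface, since the JDK and internal LongAdders share none.
    interface LongAdderAdapter {
        void add(long x);
        long sum();
    }

    static LongAdderAdapter create() {
        try {
            // Probe for the JDK 8+ class without a compile-time dependency.
            final Class<?> cls = Class.forName("java.util.concurrent.atomic.LongAdder");
            final Object adder = cls.getDeclaredConstructor().newInstance();
            final java.lang.reflect.Method add = cls.getMethod("add", long.class);
            final java.lang.reflect.Method sum = cls.getMethod("sum");
            return new LongAdderAdapter() {
                public void add(long x) {
                    try { add.invoke(adder, x); } catch (Exception e) { throw new RuntimeException(e); }
                }
                public long sum() {
                    try { return (Long) sum.invoke(adder); } catch (Exception e) { throw new RuntimeException(e); }
                }
            };
        } catch (Throwable t) {
            // JDK 7 or lookup failure: fall back to an internal implementation
            // (a plain AtomicLong here, purely for illustration).
            final AtomicLong value = new AtomicLong();
            return new LongAdderAdapter() {
                public void add(long x) { value.addAndGet(x); }
                public long sum() { return value.get(); }
            };
        }
    }

    public static void main(String[] args) {
        LongAdderAdapter counter = create();
        counter.add(40);
        counter.add(2);
        System.out.println(counter.sum()); // prints 42
    }
}
```

Callers only ever see `LongAdderAdapter`, so the choice of backing implementation stays an internal detail of the factory.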

* Run tests only on JDK8

We need a compile-time dependency on the JDK's LongAdder, which is not
available in JDK 7.

* Retry DNS lookups (#1064)

Retry DNS lookups, which are currently done only in the constructor of
InetSocketAddress.

The reason might be that in some environments DNS is very dynamic, so a name
might sometimes be resolvable and sometimes not. As currently implemented, the
InetSocketAddress is always cached, which in turn caches the
return value of the last DNS lookup, even if that failed.
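The fix described above amounts to resolving the name again on each attempt instead of freezing one lookup in the constructor. A minimal sketch, assuming a hypothetical sender class (the name DnsRetrySketch and the host/port values are placeholders, not the actual reporter code):

```java
import java.net.InetSocketAddress;

public class DnsRetrySketch {
    private final String hostname;
    private final int port;

    DnsRetrySketch(String hostname, int port) {
        // Store only the raw host and port; do NOT resolve here, otherwise
        // a single failed (or stale) DNS lookup is cached forever.
        this.hostname = hostname;
        this.port = port;
    }

    InetSocketAddress currentAddress() {
        // Constructing a fresh InetSocketAddress triggers a new DNS lookup;
        // if the name cannot be resolved right now, isUnresolved() reports
        // it and the next report cycle simply retries.
        return new InetSocketAddress(hostname, port);
    }

    public static void main(String[] args) {
        DnsRetrySketch sender = new DnsRetrySketch("localhost", 2003);
        InetSocketAddress addr = sender.currentAddress();
        System.out.println(addr.getPort()); // prints 2003
    }
}
```

The design choice is simply to move resolution from construction time to use time, so transient DNS failures cost one report instead of poisoning the sender permanently.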

* Backport rescale bug (#1046)

* Added ExponentiallyDecayingReservoir test illustrating the rescale race condition.

(cherry picked from commit d9af3c1)

* Fixed the race condition in ExponentiallyDecayingReservoir's rescale method.

(cherry picked from commit b022f04)