KAFKA-5203: Metrics: fix resetting of histogram sample #3002
Without the histogram cleanup, percentiles are calculated incorrectly after one or more samples are purged: the event counts go out of sync with the counts in the histogram buckets, and a bucket with a lower value gets chosen for the given quantile. This change adds the necessary histogram cleanup.
@jkreps could you please review this?
Thanks for the PR. Fix is trivial and seems to make sense, cc @junrao in case there is some reason for the current behaviour.
Refer to this link for build results (access rights to CI server needed):
retest this please
Refer to this link for build results (access rights to CI server needed):
@iv-m : Thanks for the patch. Looks good. Just a comment on the test.
@@ -424,6 +424,14 @@ public void testPercentiles() {
assertEquals(0.0, p25.value(), 1.0);
assertEquals(0.0, p50.value(), 1.0);
assertEquals(0.0, p75.value(), 1.0);

// record two more windows worth of sequential values
Hmm, it seems that we need to advance the mocked time to force the rolling of the old window?
This test sets eventWindow to 50 (line 406 above), so we don't need to adjust the time -- it's enough to record 50 events to roll one sample, and 100 events to roll both of them. That's how the test works in trunk now, btw -- I'm just adding one more "full roll" that reproduces the issue and confirms the fix.
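The event-count rolling described above can be illustrated with a minimal sketch. It assumes two samples and an event window of 50, as in the test; the class `EventWindowSketch` and its methods are hypothetical stand-ins for illustration, not Kafka's actual classes.

```java
public class EventWindowSketch {
    // Assumed configuration, mirroring the test: 2 samples, 50 events per sample.
    static final int SAMPLES = 2;
    static final int EVENT_WINDOW = 50;

    private final int[] eventCounts = new int[SAMPLES];
    private int current = 0;
    private int rolls = 0; // how many times we advanced to (and reset) the next sample

    // Record one event; no clock is involved, only the event count.
    void record() {
        if (eventCounts[current] >= EVENT_WINDOW) {
            // Current sample is full: advance round-robin and reset the one we land on.
            current = (current + 1) % SAMPLES;
            eventCounts[current] = 0;
            rolls++;
        }
        eventCounts[current]++;
    }

    // Fill both samples first, then count the rolls caused by `events` further records.
    static int rollsAfterWarmUp(int events) {
        EventWindowSketch stat = new EventWindowSketch();
        for (int i = 0; i < SAMPLES * EVENT_WINDOW; i++) {
            stat.record();
        }
        int before = stat.rolls;
        for (int i = 0; i < events; i++) {
            stat.record();
        }
        return stat.rolls - before;
    }

    public static void main(String[] args) {
        System.out.println("rolls after 50 events:  " + rollsAfterWarmUp(50));
        System.out.println("rolls after 100 events: " + rollsAfterWarmUp(100));
    }
}
```

In this sketch, 50 further events reuse one sample and 100 reuse both, so no mocked time needs to be advanced.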
@iv-m : Thanks for the explanation. Do you know why the existing test didn't trigger this issue when we added 2 more windows of 0?
> Do you know why the existing test didn't trigger this issue when we added 2 more windows of 0?
When the histogram is not reset, an incorrect histogram bucket is selected for a given quantile: simply put, as the samples are purged, instead of p50 you get the (approximate) value for p25, then p12.5, and so on. But when two windows of zeros are recorded, it does not matter, since all percentiles have the same value: zero.
The test with zeroes looks useful though, as it clearly shows that the percentiles depend on the recent data only.
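To make the effect concrete, here is a minimal sketch of a single reusable sample, assuming 100 unit-wide buckets and a quantile lookup that walks the buckets until the cumulative fraction exceeds the quantile. `StaleHistogramSketch`, `Sample`, and `quantileBucket` are hypothetical names for illustration and are not the actual Kafka classes or the exact patched code.

```java
import java.util.Arrays;

public class StaleHistogramSketch {
    // Hypothetical stand-in for one sample: per-bucket counts plus an event count.
    static final class Sample {
        final long[] buckets;
        long eventCount;

        Sample(int bucketCount) {
            buckets = new long[bucketCount];
        }

        void record(int bucket) {
            buckets[bucket]++;
            eventCount++;
        }

        // The event count is always reset; the histogram only when asked to.
        void reset(boolean clearHistogram) {
            eventCount = 0;
            if (clearHistogram) {
                Arrays.fill(buckets, 0L);
            }
        }
    }

    // Returns the first bucket at which the cumulative fraction exceeds the quantile.
    static int quantileBucket(Sample s, double quantile) {
        long cumulative = 0;
        for (int b = 0; b < s.buckets.length; b++) {
            cumulative += s.buckets[b];
            if ((double) cumulative / s.eventCount > quantile) {
                return b;
            }
        }
        return s.buckets.length - 1;
    }

    // Record 0..99, reset the sample, record 0..99 again, then ask for the median.
    static int medianAfterReuse(boolean clearHistogram) {
        Sample s = new Sample(100);
        for (int v = 0; v < 100; v++) {
            s.record(v);
        }
        s.reset(clearHistogram);
        for (int v = 0; v < 100; v++) {
            s.record(v);
        }
        return quantileBucket(s, 0.5);
    }

    public static void main(String[] args) {
        System.out.println("median bucket with cleanup:    " + medianAfterReuse(true));  // 50
        System.out.println("median bucket without cleanup: " + medianAfterReuse(false)); // 25
    }
}
```

Without the cleanup, each bucket holds two counts while the event count says 100, so the cumulative fraction passes 0.5 at bucket 25 instead of bucket 50: the p50 query returns roughly the p25 value. If every recorded value were zero, both variants would return bucket 0, which is why the all-zero windows could not expose the bug.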
@iv-m : Thanks for the explanation. LGTM.
@junrao, thank you!