KAFKA-19888: Clamp negative values in coordinator histograms #20986
Clamp negative values in coordinator histograms, instead of throwing an exception.
    assertEquals(highestTrackableValue, hdrHistogram.max(now));

    hdrHistogram.record(-50L);
    assertEquals(0, hdrHistogram.max(now + 1000L));
There's no histogram min method. We can add one, but it'd be only used in this one test.
@squah-confluent Could you please explain the issue and the fix in the description? This must be backported to 4.2, 4.1 and 4.0.
@dajac Thanks for taking a look. I rewrote the description. Let me know your thoughts.
dajac
left a comment
lgtm, thanks
The coordinator runtime and group coordinator currently use wall clock time to measure durations for metrics. When the system clock goes backwards due to time adjustments, we attempt to record negative durations for metrics, which throws an ArrayIndexOutOfBoundsException. This causes request processing and partition loading to fail while the clock is being adjusted. If partition loading fails, the group coordinator for that partition becomes unavailable until the broker is restarted or leadership changes again.
To address this, we clamp negative durations to zero in histograms instead of throwing an ArrayIndexOutOfBoundsException. We will move towards using a monotonic clock for metrics in future work.
Reviewers: David Jacot <djacot@confluent.io>
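The clamping described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `ClampingRecorder` and its methods are hypothetical names, and the upper clamp to `highestTrackableValue` is an assumption inferred from the test fragment quoted in the review. It only tracks a maximum and a count, whereas the real histogram keeps full bucketed data.

```java
/**
 * Hypothetical stand-in for a bounded duration histogram.
 * It is NOT Kafka's histogram wrapper; it only shows the clamping idea.
 */
final class ClampingRecorder {
    private final long highestTrackableValue;
    private long max = 0L;
    private long count = 0L;

    ClampingRecorder(long highestTrackableValue) {
        if (highestTrackableValue < 0L) {
            throw new IllegalArgumentException("highestTrackableValue must be >= 0");
        }
        this.highestTrackableValue = highestTrackableValue;
    }

    /** Clamp a value into the range [0, highestTrackableValue]. */
    static long clamp(long value, long highestTrackableValue) {
        return Math.max(0L, Math.min(value, highestTrackableValue));
    }

    /** Record a duration; out-of-range values are clamped instead of throwing. */
    void record(long value) {
        long clamped = clamp(value, highestTrackableValue);
        max = Math.max(max, clamped);
        count++;
    }

    long max() {
        return max;
    }

    long count() {
        return count;
    }

    public static void main(String[] args) {
        ClampingRecorder recorder = new ClampingRecorder(1_000L);
        long startMs = 10_000L; // simulated wall-clock reading at start
        long endMs = 9_950L;    // simulated reading after the clock stepped backwards
        recorder.record(endMs - startMs); // negative duration, recorded as 0
        System.out.println("max=" + recorder.max() + " count=" + recorder.count());
    }
}
```

Clamping keeps the sample count accurate while treating the bogus negative duration as zero. Measuring with a monotonic source such as `System.nanoTime()` would avoid producing negative durations in the first place, which is the direction the commit message mentions as future work.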
Merged to trunk, 4.2, 4.1 and 4.0.