@squah-confluent squah-confluent commented Nov 25, 2025

The coordinator runtime and group coordinator currently use wall clock
time to measure durations for metrics. When the system clock goes
backwards due to time adjustments, we attempt to record negative
durations for metrics, which throws an ArrayIndexOutOfBoundsException.
This causes request processing and partition loading to fail
while the clock is being adjusted. If partition loading fails, the group
coordinator for that partition becomes unavailable until the broker
is restarted or leadership changes again.

To address this, we clamp negative durations to zero in histograms
instead of throwing ArrayIndexOutOfBoundsExceptions. We will move
towards using a monotonic clock for metrics in future work.

Reviewers: David Jacot <djacot@confluent.io>
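
To illustrate the failure mode described above, here is a minimal sketch of how a wall-clock duration can turn negative when the clock steps back, next to a measurement based on `System.nanoTime()`, which is intended for elapsed-time measurement. The class and method names (`ElapsedTimeSketch`, `wallClockElapsedMs`, `monotonicElapsedMs`) are hypothetical and are not taken from the Kafka code base; the clock step is simulated with hard-coded readings.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; not the coordinator runtime's code.
final class ElapsedTimeSketch {

    // Difference of two wall-clock readings (e.g. System.currentTimeMillis()).
    // The result is negative if the clock was set back between the readings.
    static long wallClockElapsedMs(long startMs, long endMs) {
        return endMs - startMs;
    }

    // Difference of two System.nanoTime() readings, converted to milliseconds.
    // nanoTime() is not tied to the system's wall-clock time.
    static long monotonicElapsedMs(long startNanos, long endNanos) {
        return TimeUnit.NANOSECONDS.toMillis(endNanos - startNanos);
    }

    public static void main(String[] args) {
        // Simulated readings: the clock was adjusted back by 50 ms mid-operation.
        long startMs = 1_000L;
        long endMs = 950L;
        System.out.println(wallClockElapsedMs(startMs, endMs)); // prints -50

        long startNanos = System.nanoTime();
        long endNanos = System.nanoTime();
        System.out.println(monotonicElapsedMs(startNanos, endNanos));
    }
}
```

A negative result such as the first one is what ends up being passed to the histogram when durations are derived from wall clock time.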

Clamp negative values in coordinator histograms, instead of throwing
an exception.
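
The clamping idea can be sketched with a toy array-backed histogram. This is an assumption-laden illustration, not the actual coordinator histogram: `ClampingHistogramSketch`, `recordUnchecked`, and the one-bucket-per-value layout are hypothetical, and clamping values above `highestTrackableValue` is inferred from the test fragment in this PR rather than confirmed here.

```java
// Hypothetical toy histogram; the real coordinator histogram differs.
final class ClampingHistogramSketch {
    private final long[] counts;
    private final long highestTrackableValue;

    ClampingHistogramSketch(int highestTrackableValue) {
        this.highestTrackableValue = highestTrackableValue;
        this.counts = new long[highestTrackableValue + 1]; // one bucket per value
    }

    // Unclamped recording: a negative value becomes a negative array index
    // and throws ArrayIndexOutOfBoundsException.
    void recordUnchecked(long value) {
        counts[(int) value]++;
    }

    // Clamped recording: negative values are counted as 0 and (assumption)
    // values above the trackable range are counted as highestTrackableValue.
    void record(long value) {
        long clamped = Math.max(0L, Math.min(value, highestTrackableValue));
        counts[(int) clamped]++;
    }

    // Largest recorded value, or 0 if nothing has been recorded.
    long max() {
        for (int i = counts.length - 1; i >= 0; i--) {
            if (counts[i] > 0) {
                return i;
            }
        }
        return 0L;
    }

    public static void main(String[] args) {
        ClampingHistogramSketch histogram = new ClampingHistogramSketch(100);
        histogram.record(-50L); // clamped to 0 instead of throwing
        System.out.println(histogram.max()); // prints 0
    }
}
```

The trade-off of clamping is that a negative duration is silently counted as zero, which slightly skews the histogram during a clock adjustment but keeps request processing and partition loading from failing.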
@github-actions github-actions bot added the triage, group-coordinator and small labels Nov 25, 2025
assertEquals(highestTrackableValue, hdrHistogram.max(now));

hdrHistogram.record(-50L);
assertEquals(0, hdrHistogram.max(now + 1000L));
Contributor Author

There's no histogram min method. We can add one, but it'd be only used in this one test.

@dajac
Member

dajac commented Nov 25, 2025

@squah-confluent Could you please explain the issue and the fix in the description?

This must be back ported to 4.2, 4.1 and 4.0.

@dajac dajac self-requested a review November 25, 2025 19:07
@github-actions github-actions bot removed the triage label Nov 26, 2025
@squah-confluent
Contributor Author

@dajac Thanks for taking a look. I rewrote the description. Let me know your thoughts.

Member

@dajac dajac left a comment

lgtm, thanks

@dajac dajac merged commit 889c3d4 into apache:trunk Nov 26, 2025
28 checks passed
@dajac dajac deleted the squah-clamp-coordinator-histogram-negative-values branch November 26, 2025 14:25
dajac pushed a commit that referenced this pull request Nov 26, 2025
dajac pushed a commit that referenced this pull request Nov 26, 2025
dajac pushed a commit that referenced this pull request Nov 26, 2025
@dajac
Member

dajac commented Nov 26, 2025

Merged to trunk, 4.2, 4.1 and 4.0.
