rpc: bump threshold for latency jump reporting
For months I've seen this misfire in nearly every single log line I've
looked at, and I've had to grep it out in many L2 incidents.
Maybe it works better when we suppress it for latencies <=50ms.

Touches #96262.
Fixes #98066.

Epic: none
Release note: None
tbg committed Mar 14, 2023
1 parent cb70e98 commit 02f8eaf
Showing 1 changed file with 5 additions and 2 deletions.
7 changes: 5 additions & 2 deletions pkg/rpc/clock_offset.go
@@ -264,11 +264,14 @@ func (r *RemoteClockMonitor) UpdateOffset(
 info.avgNanos.Add(newLatencyf)
 r.metrics.LatencyHistogramNanos.RecordValue(roundTripLatency.Nanoseconds())

+// See: https://github.com/cockroachdb/cockroach/issues/96262
+// See: https://github.com/cockroachdb/cockroach/issues/98066
+const thresh = 50 * 1e6 // 50ms
 // If the roundtrip jumps by 50% beyond the previously recorded average, report it in logs.
 // Don't report it again until it falls below 40% above the average.
-// (Also requires latency > 1ms to avoid trigger on noise on low-latency connections and
+// (Also requires latency > thresh to avoid trigger on noise on low-latency connections and
 // the running average to be non-zero to avoid triggering on startup.)
-if newLatencyf > 1e6 && prevAvg > 0.0 &&
+if newLatencyf > thresh && prevAvg > 0.0 &&
 info.trigger.triggers(newLatencyf, prevAvg*1.4, prevAvg*1.5) {
 log.Health.Warningf(ctx, "latency jump (prev avg %.2fms, current %.2fms)",
 prevAvg/1e6, newLatencyf/1e6)