Currently, Apache Storm provides comprehensive metrics for throughput and average latency (execute-latency, process-latency). However, in high-precision real-time systems, averages often mask critical performance instabilities.
This proposal introduces a native Jitter Metric calculated at two levels:
- Component level (Step Jitter): Measures the variation in execution time between consecutive tuples within individual Bolts and Spouts.
- Topology level (Global Jitter): Measures the variation in end-to-end completion latency between consecutive fully acked tuples.
In deterministic real-time processing, the variance of the latency is as important as the latency itself (https://ieeexplore.ieee.org/abstract/document/10877871).
Why analysing jitter matters for real-time
In deterministic real-time processing, predictable latency is as important as low latency, which makes bounded jitter a design constraint for any deterministic system. Tracking jitter helps with:
- Micro-burst detection: high jitter reveals short spikes that average latency smooths out.
- Compliance: modern SLAs rely on percentiles (e.g., P99). Jitter is a strong leading indicator of tail-latency degradation.
- Root cause analysis: high component jitter points to GC pressure or resource contention, whereas high global jitter with stable components suggests network congestion or shuffle bottlenecks.
- Bottleneck identification: jitter enables precise identification of where bottlenecks occur in the topology and helps distinguish their underlying causes, making performance issues easier to diagnose and resolve.
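The micro-burst point can be illustrated with a small sketch. The class name MeanVersusJitterDemo and the sample values are hypothetical, and the jitter measure used here is a plain, unsmoothed mean of consecutive differences rather than the EWMA proposed below; it is only meant to show that two latency series with the same average can differ sharply in jitter.

```java
import java.util.Arrays;

final class MeanVersusJitterDemo {
    /** Arithmetic mean of the latency samples. */
    static double mean(double[] samples) {
        return Arrays.stream(samples).average().orElse(0.0);
    }

    /** Mean absolute difference between consecutive samples (unsmoothed jitter). */
    static double meanConsecutiveDelta(double[] samples) {
        if (samples.length < 2) {
            return 0.0;
        }
        double sum = 0.0;
        for (int i = 1; i < samples.length; i++) {
            sum += Math.abs(samples[i] - samples[i - 1]);
        }
        return sum / (samples.length - 1);
    }

    public static void main(String[] args) {
        // Hypothetical latencies in ms: both series sum to 80 over 8 samples.
        double[] steady = {10, 10, 10, 10, 10, 10, 10, 10};
        double[] bursty = {5, 5, 5, 45, 5, 5, 5, 5};

        System.out.println("mean   steady=" + mean(steady) + " bursty=" + mean(bursty));
        System.out.println("jitter steady=" + meanConsecutiveDelta(steady)
                + " bursty=" + meanConsecutiveDelta(bursty));
    }
}
```

Both series report an average of 10 ms, but only the second one has non-zero jitter, because the single 45 ms spike produces two large consecutive differences.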
Proposed model: Exponentially Weighted Moving Average (EWMA)
To keep the performance impact negligible, I propose an Exponentially Weighted Moving Average (EWMA) that follows the interarrival jitter estimator of RFC 1889 (https://www.rfc-editor.org/rfc/rfc1889#appendix-A.8).
Mathematical Model:
J_new = J_old + (|D_current - D_previous| - J_old) / 16
GIVEN a State {ewmaJitter = 0, lastTransit = UNINITIALIZED}
GIVEN a Constant RFC1889_ALPHA = 1 / 16
PROCEDURE addValue(transitMs)
    // Ignore invalid (negative) samples
    IF transitMs < 0 THEN
        EXIT PROCEDURE
    END IF
    IF lastTransit IS NOT UNINITIALIZED THEN
        // Calculate the absolute difference between the current and previous transit time
        deviation = ABS(transitMs - lastTransit)
        // Update the Exponentially Weighted Moving Average using the RFC 1889 smoothing factor
        ewmaJitter = ewmaJitter + (deviation - ewmaJitter) * RFC1889_ALPHA
    END IF
    // Store current transit time for the next iteration
    lastTransit = transitMs
END PROCEDURE
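The procedure above could be sketched in Java roughly as follows. This is a minimal illustration, not a proposed Storm API: the class name EwmaJitterTracker and its methods are hypothetical, the initial jitter of 0 and the use of NaN as the "uninitialized" marker are assumptions, and the class is not thread-safe (it assumes a single writer per instance).

```java
final class EwmaJitterTracker {
    /** Smoothing factor 1/16, matching the "/ 16" gain in the formula. */
    static final double ALPHA = 1.0 / 16.0;

    private double ewmaJitter = 0.0;             // assumed initial value
    private double lastTransit = Double.NaN;     // NaN marks "no sample seen yet"

    /** Feeds one latency sample (in ms); negative or NaN samples are ignored. */
    void addValue(double transitMs) {
        if (transitMs < 0 || Double.isNaN(transitMs)) {
            return;
        }
        if (!Double.isNaN(lastTransit)) {
            // Absolute difference between the current and previous transit time
            double deviation = Math.abs(transitMs - lastTransit);
            // EWMA update: J_new = J_old + (deviation - J_old) / 16
            ewmaJitter += (deviation - ewmaJitter) * ALPHA;
        }
        // Remember the current transit time for the next call
        lastTransit = transitMs;
    }

    /** Current smoothed jitter estimate in ms. */
    double getJitter() {
        return ewmaJitter;
    }

    public static void main(String[] args) {
        EwmaJitterTracker tracker = new EwmaJitterTracker();
        tracker.addValue(10);   // first sample only initializes lastTransit
        tracker.addValue(26);   // deviation 16 -> jitter 0 + 16/16 = 1.0
        tracker.addValue(26);   // deviation 0  -> jitter 1.0 - 1.0/16 = 0.9375
        System.out.println(tracker.getJitter());
    }
}
```

Each update costs one subtraction, one absolute value, one multiplication and one addition, and the state is two double fields.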
Performance impact
- Minimal computational overhead: by utilizing an EWMA, we avoid the need for storing large datasets or sliding window buffers. The jitter is updated via a single linear equation, requiring only basic arithmetic.
- Memory efficiency: The EWMA algorithm is extremely memory-light. Each tracked metric needs only two persistent values per executor: the moving average itself (8 bytes as a double) and the previous latency sample.
- System calls: To avoid redundant overhead, the metric would hook into the existing latency tracking logic rather than take its own timestamps. How best to reuse the latency values that are already sampled still needs further discussion.
Limitations and constraints
- Clock skew: Global jitter may be affected when node clocks are not synchronised. However, since jitter is computed from the difference between consecutive samples, a constant offset cancels out mathematically; only skew that changes between samples (clock drift) affects the result.
- Sampling bias: Low sampling rates may miss high-frequency jitter spikes.
- Warm-up: because the metric is EWMA-based, its values may fluctuate initially and only stabilize once enough samples have been observed.
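The clock-skew and warm-up points can be checked with a short sketch. The class name JitterLimitationsDemo, the sample latencies and the 250 ms offset are hypothetical; the update rule is the 1/16 EWMA from the mathematical model, started from an assumed initial jitter of 0.

```java
final class JitterLimitationsDemo {
    static final double ALPHA = 1.0 / 16.0;

    /** Runs the EWMA jitter update over a series and returns the final estimate. */
    static double ewmaJitter(double[] transitMs) {
        double jitter = 0.0;
        for (int i = 1; i < transitMs.length; i++) {
            double deviation = Math.abs(transitMs[i] - transitMs[i - 1]);
            jitter += (deviation - jitter) * ALPHA;
        }
        return jitter;
    }

    /** Number of updates until the weight of the initial state, (1 - ALPHA)^n, drops below the threshold. */
    static int updatesUntilWeightBelow(double threshold) {
        if (threshold <= 0.0) {
            throw new IllegalArgumentException("threshold must be positive");
        }
        int n = 0;
        double weight = 1.0;
        while (weight >= threshold) {
            weight *= (1.0 - ALPHA);
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        double[] measured = {12, 15, 11, 30, 14};
        double skewMs = 250; // hypothetical constant clock offset between two nodes
        double[] skewed = new double[measured.length];
        for (int i = 0; i < measured.length; i++) {
            skewed[i] = measured[i] + skewMs;
        }
        // The offset appears in both terms of every difference, so it cancels.
        System.out.println(ewmaJitter(measured) == ewmaJitter(skewed));
        // Updates needed before the initial state contributes less than 1%.
        System.out.println(updatesUntilWeightBelow(0.01));
    }
}
```

With a gain of 1/16, the initial state still carries more than 1% of the weight until 72 updates have been applied, which gives a rough sense of how long the warm-up phase lasts at a given sampling rate.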