fix(metrics): measure call time and wait time separately #1858
Description
interpretation_stats.
Motivation
Including queuing time in latency measurements skews them when the system is saturated. In that form the measurement is not very meaningful: it will obviously grow as more work arrives.
Instead, execution latency should be measured separately, which allows applying e.g. Little's Law to estimate the system's throughput at given execution latencies.
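As a rough illustration of the Little's Law estimate mentioned above (L = λ · W, so λ = L / W), here is a minimal sketch. The function name `estimated_throughput` and its parameters are hypothetical and not part of this PR; it assumes the mean execution latency excludes queuing time and that the number of in-flight calls is known.

```rust
use std::time::Duration;

/// Little's Law: L = lambda * W, hence lambda = L / W.
/// `in_flight` is the average number of calls executing concurrently (L),
/// `mean_execution_latency` is the average pure execution time (W).
/// Returns the estimated throughput in calls per second,
/// or `None` when the latency is zero and the ratio is undefined.
fn estimated_throughput(in_flight: f64, mean_execution_latency: Duration) -> Option<f64> {
    let w = mean_execution_latency.as_secs_f64();
    if w > 0.0 {
        Some(in_flight / w)
    } else {
        None
    }
}
```

For example, 4 calls in flight at a mean execution latency of 50 ms gives an estimate of about 80 calls per second. If the latency included queuing time, W would grow under saturation and the estimate would no longer describe what the system can actually sustain.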
Another benefit of that separation is that we get a clear picture of actual Marine performance.
Concerns
We have two sets of metrics that describe execution latencies.
One is at the higher level, in Plumber-Actor. It measures Builtin call latencies as close to the call site as possible, which is good. However, for Service Calls its measurements include RWLock and Mutex wait times, which skews them.
To the rescue comes another measurement, lower in the stack, at the AppService level. That one now measures the pure Marine call latency and, separately, the time spent waiting for the Service lock to become available.
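The separation described above can be sketched as follows. This is not the code from this PR: `CallTimings` and `timed_call` are hypothetical names, and a plain `std::sync::Mutex` stands in for the Service lock.

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Hypothetical pair of timings for a single service call.
struct CallTimings {
    /// Time spent waiting for the service lock.
    wait: Duration,
    /// Time spent executing the call while holding the lock.
    call: Duration,
}

/// Runs `f` against the locked service and reports wait time
/// and call time as two separate durations.
fn timed_call<S, R>(service: &Mutex<S>, f: impl FnOnce(&mut S) -> R) -> (R, CallTimings) {
    let queued_at = Instant::now();
    // Recover the guard even if another thread panicked while holding the lock.
    let mut guard = service.lock().unwrap_or_else(|e| e.into_inner());
    let wait = queued_at.elapsed();

    let started_at = Instant::now();
    let result = f(&mut *guard);
    let call = started_at.elapsed();

    (result, CallTimings { wait, call })
}
```

Each duration can then be recorded into its own metric, so that lock contention shows up in the wait metric and leaves the call metric unaffected.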
The mixture of measurement sites and the absence of clear naming make it hard to navigate the metrics.
The Grafana dashboards must therefore be updated with great care, and documenting these caveats is crucial for external users' ability to observe Nox through these metrics.