fix(metrics): measure call time and wait time separately #1858

folex · 2023-10-26T21:52:13Z

Description

Report actor mailbox size and actor counts in every poll. Before this PR, they were only reported when there were interpretation_stats.
Measure lock wait time separately from Marine execution time in AppService
Measure the time it takes for service call execution to be scheduled on the Blocking Pool in Actor
Separate call execution time measurement from schedule wait time in Actor
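
A minimal sketch of the separation described above, using only the standard library. The names `CallTimings` and `timed_call` are hypothetical and do not come from the Nox codebase; a plain `std::sync::Mutex` stands in for the Service lock, and a closure stands in for the Marine call.

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Hypothetical pair of timings; not a type from Nox.
#[derive(Debug, Clone, Copy)]
pub struct CallTimings {
    /// Time spent waiting for the service lock to become available.
    pub lock_wait: Duration,
    /// Time spent executing the call itself, with the lock already held.
    pub call_time: Duration,
}

/// Runs `call` under the lock and reports lock wait and execution time separately.
pub fn timed_call<S, R>(
    service: &Mutex<S>,
    call: impl FnOnce(&mut S) -> R,
) -> (R, CallTimings) {
    // First timer covers only lock acquisition.
    let wait_start = Instant::now();
    let mut guard = service
        .lock()
        .unwrap_or_else(|poisoned| poisoned.into_inner());
    let lock_wait = wait_start.elapsed();

    // Second timer starts after the lock is held, so queueing is excluded.
    let call_start = Instant::now();
    let result = call(&mut *guard);
    let call_time = call_start.elapsed();

    (result, CallTimings { lock_wait, call_time })
}
```

Each duration can then be observed into its own histogram, so a saturated lock shows up as growing `lock_wait` rather than as an apparently slower call.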

Motivation

Including queuing time in latency measurements skews them when the system is saturated. In that form, the measurement doesn't make much sense: it will obviously grow as more work arrives.

Instead, execution latency should be measured separately, to allow applying e.g. Little's Law to estimate the system's throughput at given execution latencies.
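
As a rough illustration of that estimate: Little's Law states L = λW, where L is the mean number of calls in progress, λ is throughput, and W is the mean time per call. If at most N calls execute concurrently, throughput is bounded by N / W. The helper name and the numbers below are assumptions made for the example, not measurements from Nox.

```rust
use std::time::Duration;

/// Upper bound on throughput (calls per second) from Little's Law:
/// lambda = L / W, with L = number of calls executing concurrently
/// and W = mean execution latency (queueing time excluded).
pub fn max_throughput_per_sec(concurrency: f64, mean_latency: Duration) -> f64 {
    concurrency / mean_latency.as_secs_f64()
}
```

For example, with 8 concurrent executions and a mean execution latency of 50 ms, the bound is 8 / 0.05 = 160 calls per second. The estimate only holds if W excludes queueing time, which is why the two are measured separately.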

Another benefit of that separation is that we get a clear picture of actual Marine performance.

Concerns

We have two sets of metrics that describe execution latencies.

One is at the higher level: in Plumber-Actor. It measures Builtin call latencies as closely as possible, which is good. However, for Service Calls its measurements include RWLock and Mutex latencies, which skews them.

To the rescue comes another measurement, lower in the stack, at the AppService level. That one (now) measures pure Marine call latency, and separately the time spent waiting for a Service lock to become available.

The mixture of measurement sites and absence of clear naming makes it hard to navigate metrics.

The dashboards in Grafana must be upgraded with great care, then. And documenting these caveats is crucial for external users' ability to observe Nox through these metrics.

folex · 2023-10-26T23:25:30Z

crates/peer-metrics/src/services_metrics/external.rs

@@ -182,13 +183,20 @@ impl ServicesMetricsExternal {
            "number of modules per services",
        );

-        let call_time_msec: Family<_, _> = register(
+        let call_time_sec: Family<_, _> = register(
            sub_registry,
            Family::new_with_constructor(|| Histogram::new(execution_time_buckets())),
            "call_time_msec",


I don't understand why it's called msec. @kmd-fl wdyt?

fix(metrics): always report mailbox size, actor count

7140395

folex requested review from gurinderu and kmd-fl October 26, 2023 21:52

folex added 3 commits October 26, 2023 16:52

Merge branch 'master' into fix_actor_metrics

a68e6f8

fix(metrics): measure actual call time, without queueing time

800ddbc

fix(metrics): measure lock time, wait time & call time

123efef

folex changed the title ~~fix(metrics): always report mailbox size, actor count~~ fix(metrics): measure call time and wait time separately Oct 26, 2023

folex commented Oct 26, 2023


folex requested a review from ValeryAntopol October 26, 2023 23:41

gurinderu approved these changes Oct 28, 2023


folex merged commit 73bab7e into master Oct 30, 2023
14 checks passed

folex deleted the fix_actor_metrics branch October 30, 2023 17:34

fluencebot mentioned this pull request Oct 30, 2023

chore(master): release nox 0.16.0 #1875

Merged
