Expose Tokio runtime metrics #2968

divergentdave · 2024-04-05T19:50:50Z

We need visibility into how long it takes to poll futures, because slow polls can stall other tasks on the same worker and degrade the entire process. I think it would make sense to enable Tokio's unstable features (--cfg tokio_unstable) for this purpose, so long as we can still turn the flag off again and build without the enhanced observability if we need to address breaking changes. The relevant data would come from the RuntimeMetrics::poll_count_histogram_bucket_count() method, though the API requires some additional post-processing.
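To make the post-processing concrete, here is a minimal sketch using only the standard library. It assumes the bucket ranges and the per-worker bucket counts have already been read out of RuntimeMetrics into plain vectors. The `PollTimeHistogram` type and the `aggregate` function are hypothetical names for illustration, not part of Tokio.

```rust
use std::ops::Range;
use std::time::Duration;

/// Aggregated poll-time histogram in the "explicit bounds" shape that
/// Prometheus and OTel histograms use: `counts.len() == bounds.len() + 1`.
/// Hypothetical type for illustration only.
#[derive(Debug, PartialEq)]
pub struct PollTimeHistogram {
    /// Upper bounds of every bucket except the last, in seconds.
    pub bounds: Vec<f64>,
    /// Poll counts per bucket, summed over all workers.
    pub counts: Vec<u64>,
    /// Total number of polls across all buckets.
    pub total: u64,
}

/// Sums per-worker bucket counts and converts bucket ranges into bounds.
///
/// Assumption: `bucket_ranges[i]` describes bucket `i`, and
/// `per_worker_counts[w][i]` is the count for worker `w` in bucket `i`.
pub fn aggregate(
    bucket_ranges: &[Range<Duration>],
    per_worker_counts: &[Vec<u64>],
) -> PollTimeHistogram {
    let mut counts = vec![0u64; bucket_ranges.len()];
    for worker in per_worker_counts {
        for (slot, count) in counts.iter_mut().zip(worker) {
            *slot += count;
        }
    }

    // The last bucket is treated as open-ended, so it contributes no bound.
    let bounds = bucket_ranges
        .iter()
        .take(bucket_ranges.len().saturating_sub(1))
        .map(|range| range.end.as_secs_f64())
        .collect();

    let total = counts.iter().sum();
    PollTimeHistogram { bounds, counts, total }
}
```

With three buckets and two workers reporting `[1, 2, 3]` and `[10, 20, 30]`, this yields the counts `[11, 22, 33]` and two finite bounds. As far as I can tell the counts grow monotonically from runtime start, so the result would be reported as a cumulative histogram.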

I found a couple of smaller projects that pipe counters from Tokio's RuntimeMetrics to the prometheus crate, but nothing that exposes the poll time histogram, and nothing that integrates with OTel. There is a first-party tokio-metrics crate that provides a nicer frontend with its own RuntimeMetrics, taking care of the necessary post-processing (plus another API that instruments individual futures). However, it doesn't integrate with any observability libraries, and its examples print metrics to standard output.

The fundamental difficulty in exposing the poll time histogram to traditional metrics APIs is that Tokio's runtime already partially pre-aggregates the histogram. Thus, prometheus::Histogram::observe() or opentelemetry::metrics::Histogram::record() would be insufficient, since both accept individual observations rather than bucket counts. There is another way we can introduce already-aggregated data, via the MetricProducer trait from the OTel SDK. This trait is implemented by a private SDK type already to collect data from SDK instruments. Readers such as ManualReader and the opentelemetry_prometheus exporter allow adding external producers during construction, and they will combine metrics from both the SDK producer and the external producers. Thus, we can write an implementation of this trait that bridges in Tokio's runtime metrics, add it as an external producer during initialization, and we should be able to see the results from Prometheus. In addition to the data, a MetricProducer must also provide OTel scope information (i.e. name) and names, descriptions, and units of metrics. This would fit well into the OTel concept of an "instrumentation library".
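A rough sketch of the producer shape, to show how the pieces fit together. Everything here is a local stand-in so the example compiles without the SDK: `Producer`, `ScopeMetrics`, `Metric`, `HistogramPoint`, `PollHistogramSource`, and `FixedSource` are hypothetical types, and the scope and metric names are placeholders. The real MetricProducer trait and its data types come from the OTel SDK and differ in detail.

```rust
/// Pre-aggregated histogram point. Hypothetical type, loosely modeled on an
/// OTel histogram data point; it is NOT the SDK's own type.
#[derive(Debug, Clone, PartialEq)]
pub struct HistogramPoint {
    pub bounds: Vec<f64>,
    pub bucket_counts: Vec<u64>,
    pub count: u64,
}

/// One metric together with the metadata a producer has to supply.
#[derive(Debug, Clone, PartialEq)]
pub struct Metric {
    pub name: &'static str,
    pub description: &'static str,
    pub unit: &'static str,
    pub data: HistogramPoint,
}

/// Metrics grouped under an instrumentation scope name.
#[derive(Debug, Clone, PartialEq)]
pub struct ScopeMetrics {
    pub scope_name: &'static str,
    pub metrics: Vec<Metric>,
}

/// Local stand-in for the SDK's producer trait (assumed shape).
pub trait Producer: Send + Sync {
    fn produce(&self) -> ScopeMetrics;
}

/// Abstraction over "read the runtime's histogram now", so the bridge can be
/// exercised without a Tokio runtime. Hypothetical.
pub trait PollHistogramSource: Send + Sync {
    fn snapshot(&self) -> HistogramPoint;
}

/// Test double that always returns the same snapshot.
pub struct FixedSource(pub HistogramPoint);

impl PollHistogramSource for FixedSource {
    fn snapshot(&self) -> HistogramPoint {
        self.0.clone()
    }
}

/// The bridge: turns runtime snapshots into scoped, described metrics.
pub struct RuntimeMetricsProducer<S> {
    pub source: S,
}

impl<S: PollHistogramSource> Producer for RuntimeMetricsProducer<S> {
    fn produce(&self) -> ScopeMetrics {
        ScopeMetrics {
            // Placeholder scope and metric names, chosen for illustration.
            scope_name: "tokio-runtime-metrics",
            metrics: vec![Metric {
                name: "tokio.runtime.poll_time",
                description: "Time taken to poll futures",
                unit: "s",
                data: self.source.snapshot(),
            }],
        }
    }
}

/// Stand-in for a reader that merges output from every registered producer.
pub fn collect(producers: &[Box<dyn Producer>]) -> Vec<ScopeMetrics> {
    producers.iter().map(|p| p.produce()).collect()
}
```

The point of the `PollHistogramSource` indirection is that the bridge stays testable with the unstable flag turned off: only the real source implementation would need to be gated on it.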

Related to #2955.


divergentdave mentioned this issue Apr 18, 2024

Add Tokio runtime metrics #3031

Merged
