
Track and expose metrics #317

Closed
wants to merge 1 commit into from
Conversation


@fracek fracek commented May 31, 2022

I added a way to track and expose metrics. Users can decide to run the service by specifying the metrics socket address (e.g. --metrics=0.0.0.0:7060).
The metrics server exposes 3 pages:

  • /readyz which is the ready check page. At the moment it doesn't do much but in a future PR I will add proper startup checks.
  • /livez which is the liveness check page. Like the other page, in the future it will check on other modules to see if they're working correctly.
  • /metrics which is the prometheus metrics page.

At the moment the metrics tracked are related to the sync service and the RPC API.
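For context, the `/metrics` page serves the Prometheus text exposition format. A minimal std-only sketch of what one exported counter line looks like (the metric and label names here are illustrative, not the PR's actual names):

```rust
/// Render a single Prometheus counter in the text exposition format
/// served by a /metrics endpoint. Metric and label names used below
/// are illustrative only; the PR's actual metric names may differ.
fn render_counter(name: &str, labels: &[(&str, &str)], value: u64) -> String {
    // Labels are rendered as key="value" pairs, comma separated.
    let label_str = labels
        .iter()
        .map(|(k, v)| format!("{k}=\"{v}\""))
        .collect::<Vec<_>>()
        .join(",");
    if label_str.is_empty() {
        format!("{name} {value}\n")
    } else {
        format!("{name}{{{label_str}}} {value}\n")
    }
}

fn main() {
    let line = render_counter("rpc_calls_total", &[("method", "get_block")], 42);
    print!("{line}");
    // → rpc_calls_total{method="get_block"} 42
}
```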

@fracek fracek force-pushed the feature/observability branch 4 times, most recently from f2786b8 to 795c502 Compare May 31, 2022 16:19

fracek commented May 31, 2022

Looks like the failing test is unrelated to the PR?

@Mirko-von-Leipzig
Collaborator

Looks like the failing test is unrelated to the PR?

Yeah, the failures are due to external PRs not getting access to our GitHub secrets. At some point in the future we will hopefully use bors to manage / fix all this.

@Mirko-von-Leipzig
Collaborator

I haven't done a thorough review yet, but I have a couple of more general questions.

  1. Do the exposed metrics commonly form part of the semver guarantees? i.e. do we need to think carefully about which metrics we add / remove? I would presume yes. I'm thinking a bit ahead to when we add p2p, which means the sync related metrics will likely require a rework / rethink.
  2. How did you select which points to instrument? They make sense; I'm just curious if you had any specific requirements yourself?
  3. I presume you chose warp because it's a dev dependency as well? (more a comment for myself to look into alternatives before we settle).
  4. What's with the xxx-z suffix? e.g. readyz.


fracek commented May 31, 2022

  1. Do the exposed metrics commonly form part of the semver guarantees? i.e. do we need to think carefully about which metrics we add / remove? I would presume yes. I'm thinking a bit ahead to when we add p2p, which means the sync related metrics will likely require a rework / rethink.

I never thought about this to be honest.

  2. How did you select which points to instrument? They make sense; I'm just curious if you had any specific requirements yourself?

I added counters to the RPC calls because I'm interested in looking at the total calls per second, grouped by call type. Sync counters are just a way to visualize what that process is doing. As I start running the node in production, I think it will become clearer what additional metrics we need to add.

  3. I presume you chose warp because it's a dev dependency as well? (more a comment for myself to look into alternatives before we settle).

Warp seemed like the smallest HTTP server available; these endpoints don't need anything fancy. I didn't notice it was a dev dependency.

  4. What's with the xxx-z suffix? e.g. readyz.

I used the standard paths used by some software on k8s. I believe the trailing z makes it clear they're not intended for human consumption.


fracek commented May 31, 2022

I forgot to mention that in a later PR I will have /readyz and /livez output extra information like in the k8s page I linked. In my experience they help a lot when debugging why a service is not healthy.


@koivunej koivunej left a comment


Thanks for this @fracek! Sorry for leaving you without feedback for this long. I've started to look into opentelemetry a few times already but the codebase is really slow to understand. I have some questions in the meantime though :)

let metrics = warp::path!("metrics").and(warp::get()).map({
    move || {
        let mut buffer = Vec::new();
        let encoder = TextEncoder::new();

@koivunej koivunej Jun 9, 2022


I don't understand why TextEncoder comes from prometheus while otherwise opentelemetry_prometheus is used. What's more, there seems to be a dependency from opentelemetry_prometheus to prometheus, and it would seem that prometheus::{Encoder, TextEncoder} are already re-exported as opentelemetry_prometheus::{Encoder, TextEncoder}. Maybe it would be best to avoid a direct dependency on prometheus and just go with opentelemetry_prometheus only?

Related: Did you consider just using prometheus-client here?

Contributor Author

I will update it to use opentelemetry_prometheus only, I didn't notice the re-exports.

I'm using opentelemetry so that in the future metrics can be exported to other protocols, e.g. to an opentelemetry endpoint together with traces.

/// Run metrics server if `addr` is specified.
pub fn run_server(addr: Option<SocketAddr>) -> anyhow::Result<(ServerFuture, Option<SocketAddr>)> {
    if let Some(addr) = addr {
        let exporter = opentelemetry_prometheus::exporter().init();
Contributor

Do the RpcMetrics and SyncMetrics require this call to happen before their global registrations happen?

Contributor Author

Yes, that's the case. Should I add a note about that in the function's doc?

Contributor

If this is required to happen before, I'd just handle the addr: SocketAddr case here (nested in src/obs.rs) and do the required global init in the binary src/bin/pathfinder.rs so that no funny stuff can happen, and add a comment that this initialization must happen before any rpc or sync things are done.

@fracek fracek force-pushed the feature/observability branch 2 times, most recently from 3f54b16 to ae388a7 Compare June 11, 2022 21:41

fracek commented Jun 11, 2022

I updated the PR.

@fracek fracek requested a review from a team as a code owner July 1, 2022 19:40

fracek commented Jul 1, 2022

I updated the PR to use a middleware to track and update RPC metrics. For RPC calls it now tracks:

  • count: the total number of requests
  • error: the number of requests that errored
  • duration: the time in milliseconds to serve the request
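The idea behind such a middleware can be sketched with std only. This is a hypothetical illustration of the three measurements above (wrap each call, then record count, errors, and elapsed time); the names and structure are not taken from the PR:

```rust
use std::time::Instant;

/// Hypothetical sketch of what an RPC metrics middleware records per call:
/// total count, error count, and cumulative duration in milliseconds.
/// Illustrative only; the PR's actual middleware differs.
#[derive(Default, Debug)]
struct CallMetrics {
    count: u64,
    error: u64,
    duration_ms: u128,
}

impl CallMetrics {
    /// Run `call`, updating the counters based on its outcome and timing.
    fn observe<T, E>(&mut self, call: impl FnOnce() -> Result<T, E>) -> Result<T, E> {
        let start = Instant::now();
        let result = call();
        self.count += 1;
        if result.is_err() {
            self.error += 1;
        }
        self.duration_ms += start.elapsed().as_millis();
        result
    }
}

fn main() {
    let mut metrics = CallMetrics::default();
    let _ = metrics.observe(|| Ok::<_, ()>(1));
    let _ = metrics.observe(|| Err::<(), _>("boom"));
    println!("count={} error={}", metrics.count, metrics.error);
    // → count=2 error=1
}
```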


fracek commented Jul 1, 2022

Let me know if there's anything more I can do to have this PR merged. I'm running pathfinder in production and it would be great to have its metrics together with everything else in Grafana.


fracek commented Jul 7, 2022

Updated the PR to cleanly merge with the new head.


fracek commented Jul 7, 2022

I also updated the PR to format data in a way that's more useful when building dashboards in Grafana:

  • include more realistic buckets for duration
  • add a status=ERROR|SUCCESS attribute to RPC data
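On bucket choice: Prometheus histograms are cumulative, so an observation counts toward every bucket whose upper bound `le` is at least the observed value. A std-only sketch (the boundary values are made up for the example; the PR does not list its actual buckets):

```rust
/// Count, for each histogram bucket boundary, how many samples fall at or
/// below it. Prometheus buckets are cumulative: an observation increments
/// every bucket whose upper bound `le` is >= the observed value.
/// Boundary values below are illustrative only, not the PR's real buckets.
fn bucket_counts(boundaries_ms: &[f64], samples_ms: &[f64]) -> Vec<u64> {
    boundaries_ms
        .iter()
        .map(|le| samples_ms.iter().filter(|&&s| s <= *le).count() as u64)
        .collect()
}

fn main() {
    let boundaries = [5.0, 10.0, 50.0, 100.0, 500.0];
    let samples = [3.0, 7.0, 120.0];
    println!("{:?}", bucket_counts(&boundaries, &samples));
    // → [1, 2, 2, 2, 3]
}
```

Realistic boundaries matter because Grafana heatmaps and quantile estimates are only as fine-grained as the buckets chosen.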

@Mirko-von-Leipzig
Collaborator

@fracek thanks for keeping up with the PR -- and sincere apologies for the lack of communication.

We've been looking into the opentelemetry crates and are unhappy with the amount of code / complexity they involve -- not necessarily for the end user, but more that the crates themselves are quite convoluted internally. Especially since we just want to export some simple metrics.

We're now investigating alternatives like metrics + metrics-exporter-prometheus or prometheus.

We realise these are just direct Prometheus exports and not opentelemetry, but as mentioned the latter currently seems like a big support risk.


fracek commented Jul 11, 2022

I think that's fair enough; otel adds some complexity, but in return the node gets:

  • traces (not just metrics) exported, so that it's possible to debug performance issues in production. Without otel you would need another exporter like zipkin for this.
  • export to a much larger number of providers (Stackdriver, Lightstep, Datadog, etc.) through otel-collector.

If the team decides that the added complexity is not worth these 2 features, I can rewrite the PR to use the libraries you suggested.

@Mirko-von-Leipzig
Collaborator

We've decided that metrics + metrics-exporter-prometheus is probably a better fit right now.

If we need otel traces I think there is also an otel exporter specifically for tracing which we can use without changing metrics.

At this stage I suspect that refactoring this PR onto the changes we've been making may be more effort than it's worth. If you're still keen to push this through, then I would suggest starting a new PR and just moving your changes over manually. If you're over this (which I would understand), then we're also happy to attribute your work whenever we implement it ourselves.

It would also be appreciated if the PR could be just one thing, i.e. just the metrics (no livez & readyz), since you mention you still intend to flesh those out in the future.

@CHr15F0x CHr15F0x mentioned this pull request Aug 31, 2022
@koivunej
Contributor

This was superseded by #545 using metrics-rs, but we forgot to close this PR. Apologies for forgetting, and thanks for getting the ball rolling!

@koivunej koivunej closed this Sep 16, 2022