
Track and expose metrics #317

Closed
wants to merge 1 commit into from
Conversation


@fracek fracek commented May 31, 2022

I added a way to track and expose metrics. Users can decide to run the service by specifying the metrics socket address (e.g. --metrics=0.0.0.0:7060).
The metrics server exposes 3 pages:

  • /readyz which is the ready check page. At the moment it doesn't do much but in a future PR I will add proper startup checks.
  • /livez which is the liveness check page. Like the other page, in the future it will check on other modules to see if they're working correctly.
  • /metrics which is the prometheus metrics page.

At the moment the metrics tracked are related to the sync service and the RPC API.
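For context, the `/metrics` page serves the Prometheus text exposition format. A minimal std-only sketch of what one exported counter line looks like (the metric and label names here are illustrative, not the PR's actual names):

```rust
/// Render a single Prometheus counter in the text exposition format
/// served by a /metrics endpoint. Metric and label names used below
/// are illustrative only; the PR's actual metric names may differ.
fn render_counter(name: &str, labels: &[(&str, &str)], value: u64) -> String {
    // Labels are rendered as key="value" pairs, comma separated.
    let label_str = labels
        .iter()
        .map(|(k, v)| format!("{k}=\"{v}\""))
        .collect::<Vec<_>>()
        .join(",");
    if label_str.is_empty() {
        format!("{name} {value}\n")
    } else {
        format!("{name}{{{label_str}}} {value}\n")
    }
}

fn main() {
    let line = render_counter("rpc_calls_total", &[("method", "get_block")], 42);
    print!("{line}");
    // → rpc_calls_total{method="get_block"} 42
}
```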

@fracek fracek force-pushed the feature/observability branch 4 times, most recently from f2786b8 to 795c502 Compare May 31, 2022 16:19

fracek commented May 31, 2022

Looks like the failing test is unrelated to the PR?

@Mirko-von-Leipzig
Collaborator

Looks like the failing test is unrelated to the PR?

Yeah, the failures are due to external PRs not getting access to our GitHub secrets. At some point in the future we will hopefully use bors to manage / fix all this.

@Mirko-von-Leipzig
Collaborator

I haven't done a thorough review yet, but I have a couple of more general questions.

  1. Do the exposed metrics commonly form part of the semver guarantees? i.e. do we need to think carefully about which metrics we add / remove? I would presume yes. I'm thinking a bit ahead to when we add p2p, which means the sync related metrics will likely require a rework / rethink.
  2. How did you select which points to instrument? They make sense; I'm just curious if you had any specific requirements yourself?
  3. I presume you chose warp because it's a dev dependency as well? (more a comment for myself to look into alternatives before we settle).
  4. What's with the xxx-z suffix? e.g. readyz.


fracek commented May 31, 2022

  1. Do the exposed metrics commonly form part of the semver guarantees? i.e. do we need to think carefully about which metrics we add / remove? I would presume yes. I'm thinking a bit ahead to when we add p2p, which means the sync related metrics will likely require a rework / rethink.

I never thought about this to be honest.

  2. How did you select which points to instrument? They make sense; I'm just curious if you had any specific requirements yourself?

I added counters to the RPC calls because I'm interested in looking at the total calls per second, grouped by call type. Sync counters are just a way to visualize what that process is doing. As I start running the node in production, I think it will become clearer what additional metrics we need to add.

  3. I presume you chose warp because it's a dev dependency as well? (more a comment for myself to look into alternatives before we settle).

Warp seemed like the smallest HTTP server available; these endpoints don't need anything fancy. I didn't notice it was a dev dependency.

  4. What's with the xxx-z suffix? e.g. readyz.

I used the standard paths used by some software on k8s. I believe the trailing z makes it clear they're not intended for human consumption.


fracek commented May 31, 2022

I forgot to mention that in a later PR I will have /readyz and /livez output extra information like in the k8s page I linked. In my experience they help a lot when debugging why a service is not healthy.


@koivunej koivunej left a comment


Thanks for this @fracek! Sorry for leaving you without feedback for this long. I've started to look into opentelemetry a few times already but the codebase is really slow to understand. I have some questions in the meantime though :)

let metrics = warp::path!("metrics").and(warp::get()).map({
    move || {
        let mut buffer = Vec::new();
        let encoder = TextEncoder::new();

@koivunej koivunej Jun 9, 2022


I don't understand why TextEncoder comes from prometheus while otherwise opentelemetry_prometheus is used. What's more, there seems to be a dependency from opentelemetry_prometheus to prometheus, and it would seem that prometheus::{Encoder, TextEncoder} are already re-exported as opentelemetry_prometheus::{Encoder, TextEncoder}. Maybe it would be best to avoid a direct dependency on prometheus and just go with opentelemetry_prometheus only?

Related: Did you consider just using prometheus-client here?

Contributor Author

I will update it to use opentelemetry_prometheus only, I didn't notice the re-exports.

I'm using opentelemetry so that in the future metrics can be exported to other protocols, e.g. to an opentelemetry endpoint together with traces.

/// Run metrics server if `addr` is specified.
pub fn run_server(addr: Option<SocketAddr>) -> anyhow::Result<(ServerFuture, Option<SocketAddr>)> {
    if let Some(addr) = addr {
        let exporter = opentelemetry_prometheus::exporter().init();
Contributor

Do the RpcMetrics and SyncMetrics require this call to happen before their global registrations happen?

Contributor Author

Yes, that's the case. Should I add a note about that in the function's doc?

Contributor

If this is required to happen before, I'd just handle the addr: SocketAddr case here (nested in src/obs.rs) and do the required global init in the binary src/bin/pathfinder.rs so that no funny stuff can happen, and add a comment that this initialization must happen before any rpc or sync things are done.

@fracek fracek force-pushed the feature/observability branch 2 times, most recently from 3f54b16 to ae388a7 Compare June 11, 2022 21:41

fracek commented Jun 11, 2022

I updated the PR.

@fracek fracek requested a review from a team as a code owner July 1, 2022 19:40

fracek commented Jul 1, 2022

I updated the PR to use a middleware to track and update RPC metrics. For RPC calls it now tracks:

  • count: the total number of requests
  • error: the number of requests that errored
  • duration: the time in milliseconds to serve the request
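The idea behind such a middleware can be sketched with std only. This is a hypothetical illustration of the three measurements above (wrap each call, then record count, errors, and elapsed time); the names and structure are not taken from the PR:

```rust
use std::time::Instant;

/// Hypothetical sketch of what an RPC metrics middleware records per call:
/// total count, error count, and cumulative duration in milliseconds.
/// Illustrative only; the PR's actual middleware differs.
#[derive(Default, Debug)]
struct CallMetrics {
    count: u64,
    error: u64,
    duration_ms: u128,
}

impl CallMetrics {
    /// Run `call`, updating the counters based on its outcome and timing.
    fn observe<T, E>(&mut self, call: impl FnOnce() -> Result<T, E>) -> Result<T, E> {
        let start = Instant::now();
        let result = call();
        self.count += 1;
        if result.is_err() {
            self.error += 1;
        }
        self.duration_ms += start.elapsed().as_millis();
        result
    }
}

fn main() {
    let mut metrics = CallMetrics::default();
    let _ = metrics.observe(|| Ok::<_, ()>(1));
    let _ = metrics.observe(|| Err::<(), _>("boom"));
    println!("count={} error={}", metrics.count, metrics.error);
    // → count=2 error=1
}
```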


fracek commented Jul 1, 2022

Let me know if there's anything more I can do to have this PR merged. I'm running pathfinder in production and it would be great to have its metrics together with everything else in Grafana.


fracek commented Jul 7, 2022

Updated the PR to cleanly merge with the new head.


fracek commented Jul 7, 2022

I also updated the PR to format data in a way that's more useful when building dashboards in Grafana:

  • include more realistic buckets for duration
  • add a status=ERROR|SUCCESS attribute to RPC data
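On bucket choice: Prometheus histograms are cumulative, so an observation counts toward every bucket whose upper bound `le` is at least the observed value. A std-only sketch (the boundary values are made up for the example; the PR does not list its actual buckets):

```rust
/// Count, for each histogram bucket boundary, how many samples fall at or
/// below it. Prometheus buckets are cumulative: an observation increments
/// every bucket whose upper bound `le` is >= the observed value.
/// Boundary values below are illustrative only, not the PR's real buckets.
fn bucket_counts(boundaries_ms: &[f64], samples_ms: &[f64]) -> Vec<u64> {
    boundaries_ms
        .iter()
        .map(|le| samples_ms.iter().filter(|&&s| s <= *le).count() as u64)
        .collect()
}

fn main() {
    let boundaries = [5.0, 10.0, 50.0, 100.0, 500.0];
    let samples = [3.0, 7.0, 120.0];
    println!("{:?}", bucket_counts(&boundaries, &samples));
    // → [1, 2, 2, 2, 3]
}
```

Realistic boundaries matter because Grafana heatmaps and quantile estimates are only as fine-grained as the buckets chosen.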

@Mirko-von-Leipzig
Collaborator

@fracek thanks for keeping up with the PR -- and sincere apologies for the lack of communication.

We've been looking into the opentelemetry crates and are unhappy with the amount of code / complexity they involve -- not necessarily for the end user, but more that the crates themselves are quite convoluted internally. Especially since we just want to export some simple metrics.

We're now investigating alternatives like metrics + metrics-exporter-prometheus or prometheus.

We realise these are just direct Prometheus exports and not opentelemetry, but as mentioned the latter currently seems like a big support risk.


fracek commented Jul 11, 2022

I think that's fair enough; otel adds some complexity, but in return the node gets:

  • traces (not just metrics) exported, so that it's possible to debug performance issues in production. Without otel you would need another exporter like zipkin for this.
  • export to a much larger number of providers (Stackdriver, Lightstep, Datadog, etc.) through otel-collector.

If the team decides that the added complexity is not worth these 2 features, I can rewrite the PR to use the libraries you suggested.

@Mirko-von-Leipzig
Collaborator

We've decided that metrics + metrics-exporter-prometheus is probably a better fit right now.

If we need otel traces I think there is also an otel exporter specifically for tracing which we can use without changing metrics.

At this stage I suspect that refactoring this PR onto the changes we've been making may be more effort than it's worth. If you're still keen to push this through, then I would suggest starting a new PR and just moving your changes over manually. If you're over this (which I would understand), then we're also happy to attribute your work whenever we implement it ourselves.

It would also be appreciated if the PR could be just one thing, i.e. just the metrics (no livez & readyz), since you mention you still intend to flesh those out in the future.

@CHr15F0x CHr15F0x mentioned this pull request Aug 31, 2022
@koivunej
Contributor

This was superseded by #545 using metrics-rs, but we forgot to close this PR. Apologies for forgetting, and thanks for getting the ball rolling!

@koivunej koivunej closed this Sep 16, 2022