Track and expose metrics #317
Conversation
Force-pushed from f2786b8 to 795c502.
Looks like the failing test is unrelated to the PR?
Yeah, the failures are due to external PRs not getting access to our GitHub secrets. At some point in the future we will hopefully use bors to manage/fix all this.
I haven't done a thorough review yet, but I have a couple of more general questions.
I never thought about this, to be honest.
I added counters to the RPC calls because I'm interested in looking at the total calls per second, grouped by call type. Sync counters are just a way to visualize what that process is doing. As I start running the node in production, it will become clearer what additional metrics we need to add.
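To illustrate the idea of counting RPC calls grouped by call type, here is a std-only sketch. The `RpcMetrics` type and the method names below are hypothetical stand-ins, not the opentelemetry counters actually used in this PR:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Hypothetical stand-in for per-call-type RPC counters.
struct RpcMetrics {
    calls: Mutex<HashMap<&'static str, u64>>,
}

impl RpcMetrics {
    fn new() -> Self {
        RpcMetrics { calls: Mutex::new(HashMap::new()) }
    }

    // Increment the counter for one RPC method.
    fn record_call(&self, method: &'static str) {
        *self.calls.lock().unwrap().entry(method).or_insert(0) += 1;
    }

    // Read the current count for a method (0 if never seen).
    fn count(&self, method: &str) -> u64 {
        *self.calls.lock().unwrap().get(method).unwrap_or(&0)
    }
}

fn main() {
    let metrics = RpcMetrics::new();
    metrics.record_call("starknet_getBlock");
    metrics.record_call("starknet_getBlock");
    metrics.record_call("starknet_call");
    assert_eq!(metrics.count("starknet_getBlock"), 2);
    assert_eq!(metrics.count("starknet_call"), 1);
}
```

A scraper like Prometheus would then sample these totals periodically and derive the calls-per-second rate from the monotonically increasing counters.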
Warp seemed like the smallest HTTP server available; these endpoints don't need anything fancy. I didn't notice it was a dev dependency.
I used the standard paths used by some software on k8s. I believe the ending …
I forgot to mention that in a later PR I will have …
Thanks for this @fracek! Sorry for leaving you without feedback for this long. I've started to look into opentelemetry a few times already, but the codebase is really slow to understand. I have some questions in the meantime though :)
```rust
let metrics = warp::path!("metrics").and(warp::get()).map({
    move || {
        let mut buffer = Vec::new();
        let encoder = TextEncoder::new();
```
I don't understand why `TextEncoder` comes from `prometheus` while `opentelemetry_prometheus` is used otherwise. What's more, there seems to be a dependency from `opentelemetry_prometheus` to `prometheus`, and it would seem that `prometheus::{Encoder, TextEncoder}` are already re-exported as `opentelemetry_prometheus::{Encoder, TextEncoder}`. Maybe it would be best to avoid a direct dependency on `prometheus` and just go with `opentelemetry_prometheus` only?
Related: did you consider just using `prometheus-client` here?
I will update it to use `opentelemetry_prometheus` only; I didn't notice the re-exports.
I'm using opentelemetry so that in the future metrics can be exported to other protocols, e.g. to an opentelemetry endpoint together with traces.
```rust
/// Run metrics server if `addr` is specified.
pub fn run_server(addr: Option<SocketAddr>) -> anyhow::Result<(ServerFuture, Option<SocketAddr>)> {
    if let Some(addr) = addr {
        let exporter = opentelemetry_prometheus::exporter().init();
```
Do the RpcMetrics and SyncMetrics require this call to happen before their global registrations?
Yes that's the case. Should I add a note about that in the function's doc?
If this is required to happen before, I'd just handle the `addr: SocketAddr` case here (nested in src/obs.rs), do the required global init in the binary src/bin/pathfinder.rs so that no funny stuff can happen, and add a comment that this initialization must happen before any rpc or sync things are done.
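The ordering constraint being suggested here can be sketched with std only. Everything below is illustrative (the `init_metrics` and `new_rpc_handler` names are not the PR's API); the point is that constructors assert the init has already happened rather than silently registering into the wrong registry:

```rust
use std::sync::OnceLock;

// Illustrative flag standing in for "the metrics exporter has been initialized".
static METRICS_INITIALIZED: OnceLock<bool> = OnceLock::new();

// Would be called from the binary (src/bin/pathfinder.rs) before anything else.
fn init_metrics() {
    // Ignore a second call rather than panic, to keep the sketch simple.
    let _ = METRICS_INITIALIZED.set(true);
}

// RPC/sync constructors check the ordering explicitly.
fn new_rpc_handler() -> &'static str {
    assert!(
        METRICS_INITIALIZED.get().is_some(),
        "init_metrics() must run before any rpc or sync setup"
    );
    "rpc handler"
}

fn main() {
    init_metrics();
    let handler = new_rpc_handler();
    assert_eq!(handler, "rpc handler");
}
```

Doing the init in `main` of the binary, as suggested, makes the ordering obvious at a glance instead of depending on which module happens to be constructed first.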
Force-pushed from 3f54b16 to ae388a7.
I updated the PR.
Force-pushed from ae388a7 to 07c8012.
I updated the PR to use a middleware to track and update RPC metrics. For RPC calls it now tracks: …
Let me know if there's anything more I can do to have this PR merged. I'm running pathfinder in production and it would be great to have its metrics together with everything else in Grafana.
Force-pushed from 07c8012 to c3a704e.
Updated the PR to merge cleanly with the new head.
Force-pushed from c3a704e to 2113f58.
I also updated the PR to format data in a way that's more useful when building dashboards in Grafana: …
@fracek thanks for keeping up with the PR, and sincere apologies for the lack of communication. We've been looking into opentelemetry, and we're now investigating alternatives like metrics + metrics-exporter-prometheus or prometheus. We realise these are directly just Prometheus exporters and not opentelemetry, but as mentioned the latter just seems like a huge risk currently for support.
I think that's fair enough; otel adds some complexity, but in return the node gets: …
If the team decides that the added complexity is not worth these two features, I can rewrite the PR to use the libraries you suggested.
Force-pushed from 2113f58 to d2e1689.
We've decided that metrics + metrics-exporter-prometheus is probably a better fit right now. If we need otel traces, I think there is also an otel exporter specifically for tracing which we can use without changing metrics. At this stage I suspect that refactoring this PR onto the changes we've been making may be more effort than it's worth. If you're still keen to push this through, I would suggest starting a new PR and moving your changes over manually. If you're over this (which I would understand), we're also happy to attribute your work whenever we implement it ourselves. It would also be appreciated if the PR could be just one thing, i.e. just the metrics (no …
This was superseded by #545 using metrics-rs, but we forgot to close this PR. Apologies for forgetting, and thanks for getting the ball rolling!
I added a way to track and expose metrics. Users can decide to run the service by specifying the metrics socket address (e.g. `--metrics=0.0.0.0:7060`).

The metrics server exposes 3 pages:

- `/readyz`: the ready check page. At the moment it doesn't do much, but in a future PR I will add proper startup checks.
- `/livez`: the liveness check page. Like the other page, in the future it will check on other modules to see if they're working correctly.
- `/metrics`: the prometheus metrics page.

At the moment the metrics tracked are related to the sync service and the rpc api.
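The three-endpoint layout above can be sketched as a plain routing function. This is a std-only illustration (the real server in the PR is built with warp, and the `route` helper and its responses here are hypothetical):

```rust
// Map a request path to (HTTP status, body), mirroring the three pages above.
fn route(path: &str) -> (u16, &'static str) {
    match path {
        "/readyz" => (200, "ok"),  // readiness check (startup checks to come)
        "/livez" => (200, "ok"),   // liveness check
        "/metrics" => (200, "# Prometheus exposition text would go here\n"),
        _ => (404, "not found"),
    }
}

fn main() {
    assert_eq!(route("/readyz"), (200, "ok"));
    assert_eq!(route("/livez"), (200, "ok"));
    assert_eq!(route("/metrics").0, 200);
    assert_eq!(route("/nope").0, 404);
}
```

Separating readiness (`/readyz`) from liveness (`/livez`) follows the common Kubernetes probe convention: a pod can be alive but not yet ready to serve traffic.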