Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

metrics/prometheus: added prometheus metrics #17077

Merged
merged 4 commits into from Apr 11, 2019

Conversation

Projects
None yet
5 participants
@chapsuk
Copy link
Contributor

commented Jun 26, 2018

Hi, guys!

  • added 2 flags:
  --metrics.prometheus             Enable prometheus HTTP server
  --metrics.prometheus.addr value  Prometheus server listening address (default: "localhost:9455")
  • when --metrics.prometheus flag provided, start http server which returns internal geth metrics by http in prometheus format; log output:
≻ ./build/bin/geth --metrics --metrics.prometheus
INFO [06-26|04:18:12.544] Enabling metrics collection
INFO [06-26|04:18:12.544] Starting prometheus http server          addr=localhost:9455
INFO [06-26|04:18:12.546] Maximum peer count                       ETH=25 LES=0 total=25
@chapsuk

This comment has been minimized.

Copy link
Contributor Author

commented Jun 28, 2018

take a look please
@karalabe @nonsense

@karalabe

This comment has been minimized.

Copy link
Member

commented Jun 28, 2018

There's no reason to start up a new HTTP server for Prometheus. The metrics are already exposed on the pprof server, you can just add an additional endpoint there to expose further metrics.

@chapsuk

This comment has been minimized.

Copy link
Contributor Author

commented Jun 28, 2018

Thanks @karalabe!

I'm looks to #1697, metrics flags put in a separate block. Looks like all metrics settings need to be there.

Several reasons to start up a new HTTP server:

  • enable/disable pprof does not affect collect internal metrics, this is set up by --metrics flag
  • different access levels to metrics and pprof ports
  • enable Prometheus handler only if needed, this can be done through --pprof.prometheus flag,
    but it confuse, because it not part of standard pprof handler;

In my case it would be great to enable the Prometheus handler when pprof is disabled. This is true for those environments where it is not possible to use pprof but need to collect metrics and monitor geth node.

Separate settings look more obvious for configuring geth according to own requirements.

@chapsuk chapsuk force-pushed the chapsuk:prom branch 2 times, most recently from 0a29635 to ed9ae6c Jul 2, 2018

@chapsuk

This comment has been minimized.

Copy link
Contributor Author

commented Jul 4, 2018

@karalabe what you think about it #17077 (comment)?

@chapsuk chapsuk force-pushed the chapsuk:prom branch from ed9ae6c to b4d5879 Jul 6, 2018

@2color

This comment has been minimized.

Copy link

commented Apr 4, 2019

This would be greatly appreciated!

@karalabe karalabe added this to the 1.9.0 milestone Apr 4, 2019

@karalabe karalabe self-requested a review Apr 4, 2019

@karalabe karalabe removed the status:triage label Apr 4, 2019

@holiman

This comment has been minimized.

Copy link
Contributor

commented Apr 4, 2019

This is true for those environments where it is not possible to use pprof but need to collect metrics and monitor geth node.

Why would that not be possible?

We're planning to include this into 1.9, but it needs to be reworked to use the same endpoint as pprof (not a separate http server) and also needs licensing headers. @chapsuk are you willing to work more on this? Otherwise we'll put it into the backlog to try and finish it ourselves.

@chapsuk

This comment has been minimized.

Copy link
Contributor Author

commented Apr 4, 2019

Thanks for replay @holiman.
Not a problem, i will change to use pprof endpoints in 2-3 days.
See you later

@chapsuk chapsuk force-pushed the chapsuk:prom branch from b4d5879 to cdd3b0b Apr 4, 2019

@chapsuk

This comment has been minimized.

Copy link
Contributor Author

commented Apr 4, 2019

Done, PTAL

@karalabe karalabe force-pushed the chapsuk:prom branch from cdd3b0b to de5398f Apr 5, 2019

@karalabe

This comment has been minimized.

Copy link
Member

commented Apr 5, 2019

Hey @chapsuk, thanks a lot for this. I've been playing around with it a bit and have a few questions though.

Prometheus generally has each metric as its separate instance:

# Minimalistic line:
metric_without_timestamp_and_labels 12.47

# A weird metric from before the epoch:
something_weird{problem="division by zero"} +Inf -3982045

However, you grouped all everything under 4 metrics (counters, gauges, meters, timers):

timers{name="chain/account/reads",quantile="0.50"} 0
timers{name="chain/account/reads",quantile="0.75"} 0
timers{name="chain/account/reads",quantile="0.95"} 0
timers{name="chain/account/reads",quantile="0.99"} 0
timers{name="chain/account/reads",quantile="0.999"} 0
timers{name="chain/account/reads",quantile="0.9999"} 0
timers_count{name="chain/storage/commits"} 22094
timers{name="chain/storage/commits",quantile="0.50"} 0
timers{name="chain/storage/commits",quantile="0.75"} 0
timers{name="chain/storage/commits",quantile="0.95"} 0
timers{name="chain/storage/commits",quantile="0.99"} 0
timers{name="chain/storage/commits",quantile="0.999"} 0
timers{name="chain/storage/commits",quantile="0.9999"} 0
timers_count{name="chain/execution"} 22094

I don't have much experience with Prometheus, but this seems the wrong way to approach it, since then all these data points will be pushed into the same table of the time series database, instead of separate ones per metric.

It also seems to make Grafana charting more complex, since you need to fetch timers first, and then filter for chain/storage/commits.

Instead of creating these grouping, which imho only add complexity but don't provide any value, couldn't er simply convert / to _ and report every metric on it's own? Seems like that's the official way. Eg.

chain_storage_commits_count 22094
chain_storage_commits{quantile="0.50"} 0
chain_storage_commits{quantile="0.75"} 0
chain_storage_commits{quantile="0.95"} 0
chain_storage_commits{quantile="0.99"} 0
chain_storage_commits{quantile="0.999"} 0
chain_storage_commits{quantile="0.9999"} 0

This would also simplify your code a lot, since it completely removes the need for the collector type, which just facilitates the groupings. Instead you can just use a singe byte buffer in the Handler and push every meter in the order of appearance, no need to change it.


A second issue is that you seem to convert go-metrics counters into Prometheus gauges. I think we can retain the counter type in Prometheus too.

@karalabe karalabe force-pushed the chapsuk:prom branch from de5398f to e419f5c Apr 5, 2019

@2color

This comment has been minimized.

Copy link

commented Apr 5, 2019

@karalabe

You're right about naming/grouping metrics based on their semantic value rather than unit type.

Some good pointers can be found:


Regarding Prometheus counters, they can only increment unlike go-metric counters which can also decrement.

A counter is a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart. For example, you can use a counter to represent the number of requests served, tasks completed, or errors.

@chapsuk

This comment has been minimized.

Copy link
Contributor Author

commented Apr 5, 2019

@karalabe named metrics better then typed groups 👍

This would also simplify your code a lot, since it completely removes the need for the collector type, which just facilitates the groupings. Instead you can just use a singe byte buffer in the Handler and push every meter in the order of appearance, no need to change it.

Not sure about that, i need to send type of each metrics to Prometheus. But i will try


@2color fully right about second question.

@karalabe

This comment has been minimized.

Copy link
Member

commented Apr 5, 2019

Not sure about that, i need to send type of each metrics to Prometheus. But i will try

Based on the Prometheus sample, you don't need to explicitly specify the type. But even if you want to, you can still do with the type switch you currently have, you just don't need to "group" all counters together for example, you can simply output as the come along: Counter1, Meter1, Counter2, Counter3, Meter2, etc.

@chapsuk

This comment has been minimized.

Copy link
Contributor Author

commented Apr 5, 2019

Based on the Prometheus sample, you don't need to explicitly specify the type.

Yep, it works without TYPE , but I left him

Fixed names of metrics, PTAL

chapsuk and others added some commits Jun 25, 2018

@chapsuk

This comment has been minimized.

Copy link
Contributor Author

commented Apr 11, 2019

@karalabe
Need something else to do?

@karalabe karalabe force-pushed the chapsuk:prom branch from df701f6 to 88339c3 Apr 11, 2019

@karalabe

This comment has been minimized.

Copy link
Member

commented Apr 11, 2019

Nope, just pushes a tiny typo fix on top. LGTM, will merge when CI is green.

@karalabe
Copy link
Member

left a comment

LGTM

@karalabe

This comment has been minimized.

Copy link
Member

commented Apr 11, 2019

Oh @chapsuk, would you be willing to share your Grafana dashboard setup? I'd like to write up a blog post about metrics for the 1.9 release and would be nice to have a Grafana dashboard included there too.

@chapsuk

This comment has been minimized.

Copy link
Contributor Author

commented Apr 11, 2019

This repo contains dev environment and Grafana dashboard. I'm updated dashboard and test last changes with it, but README screenshot not actual. I will try to fix it tonight.

@karalabe karalabe merged commit 31bc2a2 into ethereum:master Apr 11, 2019

0 of 2 checks passed

continuous-integration/appveyor/pr Waiting for AppVeyor build to complete
Details
continuous-integration/travis-ci/pr The Travis CI build is in progress
Details
@karalabe

This comment has been minimized.

Copy link
Member

commented Apr 11, 2019

Thanks a lot for this PR! Btw, we'll probably simplify the flags a bit even further so not even --pprof should be needed. I'll ping you on the followup PR when I open it.

@chapsuk chapsuk deleted the chapsuk:prom branch Apr 11, 2019

@2color

This comment has been minimized.

Copy link

commented Apr 11, 2019

Thanks @chapsuk! Much appreciated

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.