add prometheus format status output #430

florianweyandt1337 · 2021-05-14T10:51:01Z

I monitor my plotting machines with Prometheus/Grafana. This adds the subcommand
plotman prometheus
which outputs the current status of all plots in a Prometheus-readable format, with the numeric metrics as values and the alphanumeric information as labels. It looks something like this:

# HELP plotman_plot_phase_major The phase the plot is currently in.
# TYPE plotman_plot_phase_major gauge
plotman_plot_phase_major{plot_id="6bc265b8",tmp_dir="/chia-temp",dst_dir="/mnt/plots/seagate-16-07",run_status="RUN",phase="3:3"} 3
plotman_plot_phase_major{plot_id="eea47ced",tmp_dir="/chia-temp",dst_dir="/mnt/plots/seagate-16-04",run_status="RUN",phase="3:2"} 3
# HELP plotman_plot_phase_minor The part of the phase the plot is currently in.
# TYPE plotman_plot_phase_minor gauge
plotman_plot_phase_minor{plot_id="6bc265b8",tmp_dir="/chia-temp",dst_dir="/mnt/plots/seagate-16-07",run_status="RUN",phase="3:3"} 3
plotman_plot_phase_minor{plot_id="eea47ced",tmp_dir="/chia-temp",dst_dir="/mnt/plots/seagate-16-04",run_status="RUN",phase="3:2"} 2
# HELP plotman_plot_tmp_usage Tmp dir usage in bytes.
# TYPE plotman_plot_tmp_usage gauge
plotman_plot_tmp_usage{plot_id="6bc265b8",tmp_dir="/chia-temp",dst_dir="/mnt/plots/seagate-16-07",run_status="RUN",phase="3:3"} 185829149237
plotman_plot_tmp_usage{plot_id="eea47ced",tmp_dir="/chia-temp",dst_dir="/mnt/plots/seagate-16-04",run_status="RUN",phase="3:2"} 206162494474
...
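
A minimal sketch of how output in this shape could be produced. The `Job` dataclass and its field names are hypothetical stand-ins for illustration, not plotman's actual job object; only the metric names, help texts, and label names are taken from the sample above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    """Hypothetical stand-in for a plotting job (not plotman's real class)."""
    plot_id: str
    tmp_dir: str
    dst_dir: str
    run_status: str
    phase_major: int
    phase_minor: int
    tmp_usage: int  # bytes


def escape(value: str) -> str:
    """Escape a label value for the Prometheus text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def labels(job: Job) -> str:
    """Render the alphanumeric job information as a label set."""
    pairs = {
        "plot_id": job.plot_id,
        "tmp_dir": job.tmp_dir,
        "dst_dir": job.dst_dir,
        "run_status": job.run_status,
        "phase": f"{job.phase_major}:{job.phase_minor}",
    }
    return ",".join(f'{key}="{escape(value)}"' for key, value in pairs.items())


def render(jobs: List[Job]) -> str:
    """Render one gauge per numeric field, with one sample per job."""
    metrics = [
        ("plotman_plot_phase_major", "The phase the plot is currently in.",
         lambda job: job.phase_major),
        ("plotman_plot_phase_minor", "The part of the phase the plot is currently in.",
         lambda job: job.phase_minor),
        ("plotman_plot_tmp_usage", "Tmp dir usage in bytes.",
         lambda job: job.tmp_usage),
    ]
    lines = []
    for name, help_text, getter in metrics:
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        for job in jobs:
            lines.append(f"{name}{{{labels(job)}}} {getter(job)}")
    return "\n".join(lines) + "\n"
```

Calling `print(render(jobs), end="")` with a list of such jobs would emit the HELP/TYPE header once per metric, followed by one line per plot.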

This information can then be exported, either by a web server or by a node_exporter running on the machine (that's how I have set it up), like this:

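mkdir -p /tmp/metrics
# Start node_exporter in the background with the textfile collector enabled
/usr/local/bin/node_exporter --collector.textfile.directory="/tmp/metrics" &
# Refresh the metrics file every 10 seconds
while true; do plotman prometheus > /tmp/metrics/plotman.prom; sleep 10; done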
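
One caveat with redirecting straight into the .prom file is that node_exporter could read it while it is only partly written. A hedged sketch of the same refresh loop that writes to a temporary file and renames it into place; the helper names `write_metrics` and `run_forever` are made up for this example, and the path is the one used above.

```python
import os
import subprocess
import tempfile
import time
from typing import Sequence


def write_metrics(command: Sequence[str], target: str) -> None:
    """Run `command` and atomically replace `target` with its stdout."""
    output = subprocess.run(
        command, check=True, capture_output=True, text=True
    ).stdout
    directory = os.path.dirname(os.path.abspath(target))
    # The temporary file lives in the same directory so the rename stays on
    # one filesystem. Its ".tmp" suffix keeps the textfile collector, which
    # only reads "*.prom" files, from picking it up.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(output)
        # mkstemp creates the file as 0600; node_exporter may run as another user.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, target)
    except BaseException:
        os.unlink(tmp_path)
        raise


def run_forever(interval: float = 10.0) -> None:
    """Refresh the metrics file every `interval` seconds."""
    while True:
        write_metrics(["plotman", "prometheus"], "/tmp/metrics/plotman.prom")
        time.sleep(interval)
```

Calling `run_forever()` from a small script would be the equivalent of the bash loop; a failing plotman call raises instead of leaving an empty file behind.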

drewwells · 2021-05-23T13:58:02Z

Thanks for doing this. Prometheus-style metrics are far more useful than improvements to analyze like #348.

Can you make this an HTTP endpoint, so prometheus can scrape the metrics? We would be able to make dashboards easily from many plotting machines connected over shared network space then.

florianweyandt1337 · 2021-05-24T04:19:28Z

@drewwells I could, but I'd like to hear from @ericaltendorf that he wants that in his plotman :). Having an HTTP Server running is something that you must actively want to have in your program, in my opinion.

And like I wrote, you'll probably want to measure other stuff on your plotting machine (CPU/RAM/Temps/Disk), so you'll probably have the node_exporter running on there anyways. And node_exporter can easily just pick up those metrics as well.

dylanzr · 2021-06-06T14:58:24Z

I really like the idea of having Prometheus metrics. However, it would be nice if there were a set of aggregate statistics, i.e. average duration per phase, total started/completed, etc. I suppose that could be calculated via PromQL; I just wouldn't want to keep high retention on metrics with plot IDs as labels, due to cardinality concerns.
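
To illustrate the cardinality point: an aggregate metric keyed only by phase has a small, fixed set of label values, whereas a `plot_id` label creates a new series for every plot. A minimal sketch, where the metric name `plotman_jobs_in_phase` is invented for this example and is not part of the PR:

```python
from collections import Counter
from typing import Iterable, Tuple


def render_phase_counts(phases: Iterable[Tuple[int, int]]) -> str:
    """Render the number of plots per major phase, without per-plot labels.

    `phases` is an iterable of (major, minor) pairs, one per running plot.
    """
    counts = Counter(major for major, _minor in phases)
    lines = [
        "# HELP plotman_jobs_in_phase Number of running plots per major phase.",
        "# TYPE plotman_jobs_in_phase gauge",
    ]
    for major in sorted(counts):
        lines.append(
            f'plotman_jobs_in_phase{{phase_major="{major}"}} {counts[major]}'
        )
    return "\n".join(lines) + "\n"
```

For the two plots in the sample output, `render_phase_counts([(3, 3), (3, 2)])` yields a single series for phase 3 with the value 2.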

altendky · 2021-06-06T23:20:12Z

I'm on board with the idea and expect this to get merged. I also apologize for the delay here. I do appreciate the perspective that an HTTP server maybe doesn't belong here. I suspect it doesn't at this point, unless there is some particular need for it to be integrated. That said, I have started some TUI rework that will be using anyio (on asyncio, at least for now) so... *shrug*

I haven't run prometheus and while I am curious, I don't expect to get to it now. Could people that have used this code and verified it works please provide an approving review? I believe anyone can do this from the 'Files changed' tab of this PR. I will still do my own review reading the code.

The code seems pretty straightforward and non-intrusive. I do wonder if it should be grouped with status though? I believe it is roughly outputting status but in a different format? (yeah, maybe somewhat different data as well but, mostly thinking about the concept here) We could do something like shifting the present plotman status functionality to plotman status human (or whatever) and then also have plotman status prometheus for this? Perhaps the plotman status should default to plotman status human? Don't go changing the code here, I'm not confident what I want. But let me know how you feel.

florianweyandt1337 · 2021-06-07T06:51:25Z

Hey, thanks for getting around to this. I realize this project gets a lot of attention as many people are using it, and you are doing this in your spare time :)
I agree that this fits better in a status subsection, but I didn't want to change the established behaviour of plotman status; that's why I just put it next to it.

Do I have to do something about that failing check? I am a bit inexperienced in open-source work, sorry.

florianweyandt1337 · 2021-06-07T06:57:37Z

@dylanzr I agree and would also love aggregated metrics, however as I understand the code, plotman is stateless and always just a snapshot of the current reality. So once a plot passes from phase 1 to phase 2, plotman doesn't know how long it took for phase 1. You could do this by parsing the logfiles, but I think that should be another PR. Something like plotman prometheus historic or whatever status subsection it eventually ends up in.
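
A rough sketch of the log-parsing idea. It assumes the plotter log contains summary lines of the form `Time for phase N = S seconds. ...`; that format is an assumption here and should be checked against real log files. The function name `phase_durations` is made up for illustration.

```python
import re
from typing import Dict

# Assumed log line shape: "Time for phase 1 = 1234.567 seconds. ..."
PHASE_TIME = re.compile(r"^Time for phase (\d+) = (\d+(?:\.\d+)?) seconds")


def phase_durations(log_text: str) -> Dict[int, float]:
    """Return {phase number: duration in seconds} for each finished phase."""
    durations: Dict[int, float] = {}
    for line in log_text.splitlines():
        match = PHASE_TIME.match(line)
        if match:
            durations[int(match.group(1))] = float(match.group(2))
    return durations
```

Phases that have not finished yet simply do not appear in the result, so a plot still in phase 3 would report durations for phases 1 and 2 only.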

altendky · 2021-06-07T13:17:13Z

Just ignore the failing coverage check for now. It is not required yet because I haven't gotten around to achieving sufficient test coverage to mandate it of others. Maybe 'soon'... Sorry for the confusion.

All of the progress information that plotman gets is based on parsing the log files. The jobs you pass in should already have all the information from the logs.

If anyone could opine on how this PR and #549 relate, that would be great.

florianweyandt1337 · 2021-06-07T13:32:01Z

#549 can be used to achieve the same thing as this PR by writing a bash script that parses the JSON. I think having the status output as JSON is still a very useful thing, as you can parse the output more easily than that of plotman status for various use cases. However, I think this PR would solve the original problem of #549's author, i.e. getting plot metrics to Prometheus.

But they should remain separate PRs in my opinion.

florianweyandt1337 · 2021-06-07T13:34:30Z

In Grafana the result could look something like this for example:

drewwells · 2021-06-07T15:16:22Z

> @drewwells I could, but I'd like to hear from @ericaltendorf that he wants that in his plotman :). Having an HTTP Server running is something that you must actively want to have in your program, in my opinion.
>
> And like I wrote, you'll probably want to measure other stuff on your plotting machine (CPU/RAM/Temps/Disk), so you'll probably have the node_exporter running on there anyways. And node_exporter can easily just pick up those metrics as well.

We wouldn't need host-level metrics; there are many solutions for that already: https://github.com/prometheus/node_exporter/.

> I really like the idea of having prometheus metrics. However, it would be nice if there were a set of aggregate statistics, ie. avg duration per phase, total started/completed, etc. I suppose that could be calculated via promQL, I just wouldn't want to keep high retention on metrics w/ plot IDs as labels due to cardinality concerns.

Correct, you publish raw data to Prometheus and create aggregate views from it.

> Just ignore the failing coverage check for now. It is not required yet because I haven't gotten around to achieving sufficient test coverage to mandate it of others. Maybe 'soon'... Sorry for the confusion.
>
> All of the progress information that plotman gets is based on parsing the log files. The jobs you pass in should already have all the information from the logs.
>
> If anyone could opine on how this PR and #549 relate, that would be great.

I would personally close #549. We will get a lot more value from Prometheus-style metrics than from custom ones that cannot be consumed by standard metric collectors. HTTP is the preferred way to collect Prometheus metrics. If we do not add HTTP support in this PR, it really should be done in a separate one. Without HTTP, we will require a separate exporter process just to take the CLI output and send it to the Prometheus server.

As an example, geth supports a metrics HTTP endpoint https://geth.ethereum.org/docs/interface/metrics

drewwells · 2021-06-07T15:21:49Z

As for HTTP support, I'm thinking similar to plotman archive. It runs in a separate process and converts the internal status into the prometheus metrics described in this PR.
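
A minimal sketch of what such a separate process could look like, using only the Python standard library. This is not part of the PR; the port number and the function names are arbitrary choices for illustration, and the handler simply shells out to the `plotman prometheus` subcommand on every scrape.

```python
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable


def make_handler(collect: Callable[[], str]):
    """Build a request handler that serves collect() at /metrics."""

    class MetricsHandler(BaseHTTPRequestHandler):
        def do_GET(self) -> None:
            if self.path != "/metrics":
                self.send_error(404)
                return
            body = collect().encode("utf-8")
            self.send_response(200)
            # Content type of the Prometheus text exposition format.
            self.send_header(
                "Content-Type", "text/plain; version=0.0.4; charset=utf-8"
            )
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args) -> None:
            # Silence the default per-request logging to stderr.
            pass

    return MetricsHandler


def plotman_metrics() -> str:
    """Collect metrics by running the subcommand added in this PR."""
    return subprocess.run(
        ["plotman", "prometheus"], check=True, capture_output=True, text=True
    ).stdout


def serve(port: int = 9999) -> None:
    """Serve metrics until interrupted. The port is an arbitrary example."""
    HTTPServer(("", port), make_handler(plotman_metrics)).serve_forever()
```

Running `serve()` would let Prometheus scrape `http://<host>:9999/metrics` directly. Passing the collector in as a callable keeps the handler testable without plotman installed.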

florianweyandt1337 · 2021-06-07T15:27:49Z

@drewwells you misunderstood me. The node_exporter will almost certainly already be running on any plotting machine being monitored by prometheus.
And node_exporter provides a function that collects any file with metrics from a given directory (textfile-collector). You can use the plotman prometheus command to write your metrics to that directory and node_exporter will pick them up and display them along with its other metrics for prometheus to consume.

drewwells · 2021-06-07T16:09:43Z

So we just need a job to update the text file. Is this going to run as a daemon to do that, or does it need an external job runner?

florianweyandt1337 · 2021-06-08T06:35:15Z

An external job runner of your choice :) bash while true, crontab, systemd... there are many to choose from

philippnormann

I tested this PR using my local prometheus + node_exporter setup and can confirm that it's working as desired 👍🏽

altendky · 2021-06-14T01:46:22Z

Alrighty. The PRs stand as separate things. HTTP server shouldn't be needed, and wouldn't be added 'soon' anyways. Maybe one grows later, but that would be later after #299 at least and would need significant justification I think. It does strike me as odd that every last monitored activity on a server would have to provide its own HTTP server.. But hey, I haven't gotten into this area yet, so what do I know about normal.

So, could you go ahead and do a catch-up here from development and we'll see what we can do?

altendky · 2021-06-19T02:17:32Z

I took the liberty of doing the catch up merge and adding type hints. I hope you don't mind.

altendky · 2021-06-19T02:22:46Z

Thanks for the work here. Perhaps at some point I'll find time to set up Prometheus for myself, but in the meantime it appears from all the attention here that you've already got a pack of people using this. :] My apologies to all for the delay in processing this PR.

florianweyandt1337 · 2021-06-21T06:31:39Z

Thank you for the additions and the merge :)

tedzhang2891 · 2021-07-09T09:34:52Z

@florianweyandt1337 Hi, could you share the grafana template?

florianweyandt1337 · 2021-07-12T06:36:41Z

@tedzhang2891 Unfortunately no, sorry. I moved to madmax and the dashboard didn't really make sense any more. So I threw it out.

altendky · 2021-07-17T01:40:37Z

altendky/farm@79f98f4

I just today started fiddling with this so it's really rough, but hey. Also, it uses an extra field of major + minor/10, but anyways.
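
For reference, the major + minor/10 field can be derived from the phase label as in the sketch below; the function name is made up, and the encoding only stays unambiguous while the minor phase is below 10.

```python
def phase_as_float(phase: str) -> float:
    """Turn a phase label such as "3:2" into major + minor/10."""
    major, minor = (int(part) for part in phase.split(":"))
    return major + minor / 10
```

This gives a single number that increases as a plot progresses, which is convenient for plotting progress on one axis.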

add prometheus format status output

38523b3

florianweyandt1337 changed the base branch from main to development May 24, 2021 06:21

Merge branch 'development' into main

2137c80

altendky mentioned this pull request Jun 6, 2021

req: Adding json flag to status command. #549

Merged

philippnormann approved these changes Jun 10, 2021

altendky added 2 commits June 18, 2021 21:49

Merge branch 'development' into fw

0a85343

add type hints

d2c9f08

add changelog entry

d92f397

altendky merged commit d011d9c into ericaltendorf:development Jun 19, 2021
