Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add prometheus format status output #430

Merged
merged 5 commits into from
Jun 19, 2021

Conversation

florianweyandt1337
Copy link
Contributor

I monitor my plotting machines with prometheus/grafana. This will add the subcommand
plotman prometheus
which will output the current status of all plots in a prometheus readable format with the numeric metrics als values and the alpha-numeric information as labels. It looks something like this:

# HELP plotman_plot_phase_major The phase the plot is currently in.
# TYPE plotman_plot_phase_major gauge
plotman_plot_phase_major{plot_id="6bc265b8",tmp_dir="/chia-temp",dst_dir="/mnt/plots/seagate-16-07",run_status="RUN",phase="3:3"} 3
plotman_plot_phase_major{plot_id="eea47ced",tmp_dir="/chia-temp",dst_dir="/mnt/plots/seagate-16-04",run_status="RUN",phase="3:2"} 3
# HELP plotman_plot_phase_minor The part of the phase the plot is currently in.
# TYPE plotman_plot_phase_minor gauge
plotman_plot_phase_minor{plot_id="6bc265b8",tmp_dir="/chia-temp",dst_dir="/mnt/plots/seagate-16-07",run_status="RUN",phase="3:3"} 3
plotman_plot_phase_minor{plot_id="eea47ced",tmp_dir="/chia-temp",dst_dir="/mnt/plots/seagate-16-04",run_status="RUN",phase="3:2"} 2
# HELP plotman_plot_tmp_usage Tmp dir usage in bytes.
# TYPE plotman_plot_tmp_usage gauge
plotman_plot_tmp_usage{plot_id="6bc265b8",tmp_dir="/chia-temp",dst_dir="/mnt/plots/seagate-16-07",run_status="RUN",phase="3:3"} 185829149237
plotman_plot_tmp_usage{plot_id="eea47ced",tmp_dir="/chia-temp",dst_dir="/mnt/plots/seagate-16-04",run_status="RUN",phase="3:2"} 206162494474
...

This information can then be exported, either by a webserver or by a node-exporter running on the machine (thats how I have set it up) like this:

mkdir -p /tmp/metrics
/usr/local/bin/node_exporter --collector.textfile.directory="/tmp/metrics"
while true; do plotman prometheus > /tmp/metrics/plotman.prom; sleep 10; done

@drewwells
Copy link

Thanks for doing this. Prometheus style metrics is far more useful than improvements on analyze like #348.

Can you make this an HTTP endpoint, so prometheus can scrape the metrics? We would be able to make dashboards easily from many plotting machines connected over shared network space then.

@florianweyandt1337
Copy link
Contributor Author

@drewwells I could, but I'd like to hear from @ericaltendorf that he wants that in his plotman :). Having an HTTP Server running is something that you must actively want to have in your program, in my opinion.

And like I wrote, you'll probably want to measure other stuff on your plotting machine (CPU/RAM/Temps/Disk), so you'll probably have the node_exporter running on there anyways. And node_exporter can easily just pick up those metrics as well.

@florianweyandt1337 florianweyandt1337 changed the base branch from main to development May 24, 2021 06:21
@dylanzr
Copy link

dylanzr commented Jun 6, 2021

I really like the idea of having prometheus metrics. However, it would be nice if there were a set of aggregate statistics, ie. avg duration per phase, total started/completed, etc. I suppose that could be calculated via promQL, I just wouldn't want to keep high retention on metrics w/ plot IDs as labels due to cardinality concerns.

@altendky
Copy link
Collaborator

altendky commented Jun 6, 2021

I'm on board with the idea and expect this to get merged. I also apologize for the delay here. I do appreciate the perspective that an HTTP server maybe doesn't belong here. I suspect it doesn't at this point, unless there is some particular need for it to be integrated. That said, I have started some TUI rework that will be using anyio (on asyncio, at least for now) so... *shrug*

I haven't run prometheus and while I am curious, I don't expect to get to it now. Could people that have used this code and verified it works please provide an approving review? I believe anyone can do this from the 'Files changed' tab of this PR. I will still do my own review reading the code.

The code seems pretty straightforward and non-intrusive. I do wonder if it should be grouped with status though? I believe it is roughly outputting status but in a different format? (yeah, maybe somewhat different data as well but, mostly thinking about the concept here) We could do something like shifting the present plotman status functionality to plotman status human (or whatever) and then also have plotman status prometheus for this? Perhaps the plotman status should default to plotman status human? Don't go changing the code here, I'm not confident what I want. But let me know how you feel.

@florianweyandt1337
Copy link
Contributor Author

Hey, thanks for getting around to this. I realize this projects gets a lot of attention as many people are using it, and you are doing this in your spare time :)
I agree that this fits more in a status subsection, but I didn't want to change the established behaviour of plotman status thats why I just put it next to it.

Do I have to do something about that failing check? I am a bit unexperienced in open-source work, sorry.

@florianweyandt1337
Copy link
Contributor Author

@dylanzr I agree and would also love aggregated metrics, however as I understand the code, plotman is stateless and always just a snapshot of the current reality. So once a plot passes from phase 1 to phase 2, plotman doesn't know how long it took for phase 1. You could do this by parsing the logfiles, but I think that should be another PR. Something like plotman prometheus historic or whatever status subsection it eventually ends up in.

@altendky
Copy link
Collaborator

altendky commented Jun 7, 2021

Just ignore the failing coverage check for now. It is not required yet because I haven't gotten around to achieving sufficient test coverage to mandate it of others. Maybe 'soon'... Sorry for the confusion.

All of the progress information that plotman gets is based on parsing the log files. The jobs you pass in should already have all the information from the logs.

If anyone could opine on how this PR and #549 relate, that would be great.

@florianweyandt1337
Copy link
Contributor Author

#549 can be used to achieve the same thing as this PR by writing a bash script that parses the JSON. I think have the status output as JSON is still a very useful thing to have, as you can parse the output easier than plotman status for various use-cases. I think however this PR would solve the #549 Authors original Problem, i.e. getting plot-metrics to prometheus.

But they should remain seperate PRs in my opinion.

@florianweyandt1337
Copy link
Contributor Author

In Grafana the result could look something like this for example:
Screenshot_2021-06-07_15-33-37

@drewwells
Copy link

drewwells commented Jun 7, 2021

@drewwells I could, but I'd like to hear from @ericaltendorf that he wants that in his plotman :). Having an HTTP Server running is something that you must actively want to have in your program, in my opinion.

And like I wrote, you'll probably want to measure other stuff on your plotting machine (CPU/RAM/Temps/Disk), so you'll probably have the node_exporter running on there anyways. And node_exporter can easily just pick up those metrics as well.

We wouldn't need host level metrics, there's many solutions already https://github.com/prometheus/node_exporter/.

I really like the idea of having prometheus metrics. However, it would be nice if there were a set of aggregate statistics, ie. avg duration per phase, total started/completed, etc. I suppose that could be calculated via promQL, I just wouldn't want to keep high retention on metrics w/ plot IDs as labels due to cardinality concerns.

Correct, you publish raw data to prometheus and create aggregate view from it.

Just ignore the failing coverage check for now. It is not required yet because I haven't gotten around to achieving sufficient test coverage to mandate it of others. Maybe 'soon'... Sorry for the confusion.

All of the progress information that plotman gets is based on parsing the log files. The jobs you pass in should already have all the information from the logs.

If anyone could opine on how this PR and #549 relate, that would be great.

I would personally close #549. We will get a lot more value from prometheus style metrics than custom ones that can not be consumed by standard metric collectors. HTTP is the preferred way to collect prometheus metrics. If we do not add HTTP support in this PR, it really should be done in a separate one. Without HTTP, we will require a separate exporter process just to take the cli output and send it to prometheus server.

As an example, geth supports a metrics HTTP endpoint https://geth.ethereum.org/docs/interface/metrics

@drewwells
Copy link

As for HTTP support, I'm thinking similar to plotman archive. It runs in a separate process and converts the internal status into the prometheus metrics described in this PR.

@florianweyandt1337
Copy link
Contributor Author

@drewwells you misunderstood me. The node_exporter will almost certainly already be running on any plotting machine being monitored by prometheus.
And node_exporter provides a function that collects any file with metrics from a given directory (textfile-collector). You can use the plotman prometheus command to write your metrics to that directory and node_exporter will pick them up and display them along with its other metrics for prometheus to consume.

@drewwells
Copy link

So we just need a job to update the text file. Is this going to run as a daemon to do that or it needs an external job runner?

@florianweyandt1337
Copy link
Contributor Author

An external job runner of your choice :) bash while true, crontab, systemd... there are many to choose from

Copy link

@philippnormann philippnormann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested this PR using my local prometheus + node_exporter setup and can confirm that it's working as desired 👍🏽

@altendky
Copy link
Collaborator

Alrighty. The PRs stand as separate things. HTTP server shouldn't be needed, and wouldn't be added 'soon' anyways. Maybe one grows later, but that would be later after #299 at least and would need significant justification I think. It does strike me as odd that every last monitored activity on a server would have to provide its own HTTP server.. But hey, I haven't gotten into this area yet, so what do I know about normal.

So, could you go ahead and do an catch up here from development and we'll see what we can do?

@altendky
Copy link
Collaborator

I took the liberty of doing the catch up merge and adding type hints. I hope you don't mind.

@altendky
Copy link
Collaborator

Thanks for the work here. Perhaps at some point I'll find time to setup prometheus for myself, but in the mean time it appears from all the attention here that you've already got a pack of people using this. :] My apologies to all for the delay in processing this PR.

@altendky altendky merged commit d011d9c into ericaltendorf:development Jun 19, 2021
@florianweyandt1337
Copy link
Contributor Author

Thank you for the additions and the merge :)

@tedzhang2891
Copy link

@florianweyandt1337 Hi, could you share the grafana template?

@florianweyandt1337
Copy link
Contributor Author

@tedzhang2891 Unfortunately no, sorry. I moved to madmax and the dashboard didn't really make sense any more. So I threw it out.

@altendky
Copy link
Collaborator

altendky/farm@79f98f4

I just today started fiddling with this so it's really rough, but hey. Also, it uses an extra field of major + minor/10. but anyways.

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants