
Optionally export metrics for external monitoring #530

Open · isuldor wants to merge 2 commits into master
Conversation

@isuldor (Contributor) commented Sep 18, 2019

WIP

  • use config.go for default values
  • refactor server.go function to stakepoold.go method
  • shutdown context
  • readme one-liner
  • prometheus quickstart guide
  • documentation with full samples of systemd units and useful prometheus alerts

I've added optional Prometheus instrumentation (--prometheus) for monitoring and outage alerts (#525). The intention is to improve the reliability of a VSP by enabling the operator to quickly detect outages. An external monitoring system can be set up to monitor stakepoold instances, in addition to the existing API health-check endpoint provided by dcrstakepool. The advantage of detecting stakepoold outages is that they generally happen before dcrstakepool falls over. This will likely become increasingly critical once the planned changes for fee addresses are implemented. The Prometheus client libraries also include parsers and bridges for other monitoring systems, such as InfluxDB, Sensu, and more. The Prometheus exposition format is also being used as the basis for OpenMetrics.

I think my error handling within the goroutine is suboptimal; I'd be happy to receive feedback on that or anything else. The configurable port defaults to 9101 just to follow the port that Prometheus node_exporter uses (it's not set in stone by any means, though).

When enabled, you should be able to connect to http://localhost:9101 and see your metrics:

# HELP dcr_block_height Returns the latest block height
# TYPE dcr_block_height gauge
dcr_block_height 268862
# HELP dcr_peer_connections Returns a count of connections from peers
# TYPE dcr_peer_connections gauge
dcr_peer_connections 23
# HELP dcr_tickets_live Returns a count of live tickets
# TYPE dcr_tickets_live gauge
dcr_tickets_live 42
# HELP dcr_tickets_missed Returns a count of missed tickets
# TYPE dcr_tickets_missed gauge
dcr_tickets_missed 4
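
For reference, metrics like these can be produced with prometheus/client_golang along these lines (a minimal sketch with static illustrative values, not the exact code in this PR):

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Gauges matching the exposition above; stakepoold refreshes them
	// from dcrd/dcrwallet RPC responses on a timer.
	blockHeight = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "dcr_block_height",
		Help: "Returns the latest block height",
	})
	ticketsLive = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "dcr_tickets_live",
		Help: "Returns a count of live tickets",
	})
)

func main() {
	prometheus.MustRegister(blockHeight, ticketsLive)

	// Static values for illustration only.
	blockHeight.Set(268862)
	ticketsLive.Set(42)

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9101", nil))
}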

I'll be writing a guide on how to set up some useful Prometheus alerts along with a Grafana dashboard.

One useful metric gathered here is the block height. It's pulled from dcrwallet, so it ensures both dcrd and dcrwallet are working at the tip of the chain. My testnet nodes are fairly fragile, and this metric catches any of their failures, including transient ones where processes are running but actually hung. On top of that, Prometheus can require that an error condition persists for some time before the actual alert is sent (the for: clause below), which is pretty useful for weeding out short-term false positives. My PromQL alerting rule looks like this:

- alert: TestnetBlockHeightStalled
  expr: dcr_block_height{group="testnet-voters"} < on() group_left max(dcr_block_height{group="testnet-voters"})
  for: 15m
  annotations:
    summary: Node is behind on best block height

And the other particularly useful metric is the live ticket count. You can monitor whether all of the voting nodes agree; when they don't, dcrstakepool becomes unresponsive. Live tickets are also the primary responsibility of the voting nodes, so it's pretty relevant whether they all agree about the current state of tickets (a sketch of such a rule follows below). I'll write some documentation about how to monitor these metrics, and how to address issues when those alarms go off.
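
A rule along these lines, mirroring the block-height rule above (the rule name and group label are just illustrative), could fire when the voters disagree:

- alert: TestnetLiveTicketMismatch
  expr: max(dcr_tickets_live{group="testnet-voters"}) != min(dcr_tickets_live{group="testnet-voters"})
  for: 15m
  annotations:
    summary: Voting wallets disagree on the live ticket count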

Note: CVE-2019-16276 is relevant here since we're using net/http. Patched Go versions are 1.13.1 and 1.12.10.

@dajohi (Member) commented Sep 18, 2019

env GO111MODULE=on go mod tidy

Commit the go.mod and go.sum changes.

@JoeGruffins (Member) left a comment

The error handling looks ok to me.

It works!

[screenshot: prometheus metrics]

Also, I think you should add a section to the main README, otherwise no one will know about it. The README needs to be cleaned up anyway (#508), so don't worry too much about where.

Inline comments: backend/stakepoold/server.go, backend/stakepoold/config.go (outdated, resolved)
@jholdstock (Member) commented

Hey @isuldor, I added a few in-line comments, and I also agree with Joe that we should mention this feature in the README. Maybe just a one-liner in README.md which links to a new document under docs/. This could include usage instructions and some sample Prometheus configs, like the one you posted above.

Also, I wanted to make you aware of https://github.com/lightninglabs/lndmon, which already implements a Prometheus exporter for lnd. It should be possible to port this to dcrlnd really easily.

@JoeGruffins (Member) left a comment

Left some inline comments. Looking good though!

Inline comments: backend/stakepoold/config.go, backend/stakepoold/stakepool/stakepool.go (outdated, resolved)
@jholdstock (Member) commented

Code is looking good at this point. Just awaiting the doco updates.

@JoeGruffins (Member) left a comment

Where did your changes to sample-stakepoold.conf go? We need those. Be sure to leave a small comment about how to input the times, same as in the config.

Inline comments: backend/stakepoold/server.go, backend/stakepoold/stakepool/stakepool.go (resolved)
@isuldor (Contributor, Author) commented Oct 25, 2019

How do I squash my commits? I've mostly just been cheating and rolling back HEAD, but I should probably learn to do it properly.

@JoeGruffins (Member) commented Oct 25, 2019

The problem here is that you merged master onto your branch instead of rebasing your branch onto master.

I only ever use the command line, but I think what you need to do is something like:

git checkout master
git fetch upstream (note: you may not have a remote named upstream)
git merge upstream/master

git checkout prometheus_instrumentation
git rebase -i master

In the editor, delete all the commits that aren't yours (to squash, you can also mark commits with squash instead of pick), then save and quit.

You will probably have merge conflicts; fix the conflicts and add the files:

git add foo.go
git rebase --continue

Repeat until

git status

reports a clean status. Then:

git push --force

At any time during the rebase, you can revert to before the rebase using:

git rebase --abort

@JoeGruffins (Member) commented

Ohhh. This looks awesome. Give me a little time to test it out.

@jholdstock (Member) commented

Hey @isuldor,

The /metrics endpoint seems to work well for me, but I am having some issues when I try to shut down stakepoold. Using CTRL+C I get

2019-10-26 12:28:44.435 [INF] STPK: Received signal (interrupt).  Shutting down...

and then the process just hangs indefinitely. I believe it is because the call to http.ListenAndServe is blocking and does not return. I'm chatting to @JoeGruffins about how best to resolve this, as he has been touching the shutdown signals recently.

@JoeGruffins (Member) commented

We are already waiting for a shutdown signal and then shutting down a server here:

// Cleanly shutdown server on interrupt signal.

I think doing this before listening, and also running the listener in a goroutine, may be best.
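
A rough sketch of that pattern (not the exact stakepoold code; the handler and logger are stand-ins):

package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Cancel ctx on interrupt, standing in for stakepoold's shutdown context.
	ctx, cancel := context.WithCancel(context.Background())
	c := make(chan os.Signal, 1)
	signal.Notify(c, os.Interrupt, syscall.SIGTERM)
	go func() { <-c; cancel() }()

	srv := &http.Server{Addr: ":9101", Handler: promhttp.Handler()}

	// Serve in a goroutine so the main path can block on the context.
	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Printf("prometheus listener error: %v", err)
		}
	}()

	<-ctx.Done()

	// Give in-flight scrapes a few seconds to finish, then exit.
	shutdownCtx, cancelShutdown := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancelShutdown()
	if err := srv.Shutdown(shutdownCtx); err != nil {
		log.Printf("prometheus shutdown error: %v", err)
	}
}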

@isuldor (Contributor, Author) commented Oct 29, 2019

It now handles shutdown signals. I have to push back on running the HTTP server in a separate goroutine, though, as I see no precedent for doing this (at least dcrstakepool doesn't appear to do it). And the Go documentation explains that the HTTP server already serves each request in its own goroutine, so it seems like overkill.

@JoeGruffins (Member) left a comment

Following the guides, I was able to set up a monitoring service on a different machine that emailed me when my geographically separated stakepoold hit an alert condition.

Proof: [screenshot: alert email]

One thing I'm not sure of: clicking "View in alert manager" doesn't seem to do that. Maybe a mistake on my part.

I think this is fantastic and needed work. The documentation is truly professional grade, imo. I left a few inline comments, but nothing big.

However, there are some unrelated services in Samples that I think deserve separate PRs: the dcrd and stakepool services, and the accompanying bash script. If it's not too much trouble, could you make a new PR for those?

Again, I think you put a lot of thought and effort into this, and it looks great to me.

Inline comments: docs/samples/wait-for-dcrd.sh, backend/stakepoold/stakepool/stakepool.go, docs/prometheus-quickstart.md, docs/prometheus-examples.md (resolved)
@isuldor (Contributor, Author) commented Oct 31, 2019

I've changed up the error handling a bit so that it continues on errors, i.e. it waits for the next interval period and tries again. The two RPC methods I'm calling already print errors, so I opted for log.Debugf so the same error won't be printed twice when debugging is off.
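
The loop now looks roughly like this (illustrative names; fetchBlockHeight stands in for the wallet RPC call and blockHeight for the registered gauge):

func updateMetrics(ctx context.Context, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			height, err := fetchBlockHeight()
			if err != nil {
				// The RPC layer already logs this error, so only log at
				// debug level here and retry on the next tick.
				log.Debugf("metrics update failed: %v", err)
				continue
			}
			blockHeight.Set(float64(height))
		}
	}
}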

"View in alert manager" doesn't seem to do that

Yeah, I don't recall off the top of my head which config variable it uses for the alert source. It just defaults to the Prometheus server's hostname for me.

@JoeGruffins (Member) left a comment

Looks good to me! Thank you for answering my concerns.

One thing going forward: it would be better if you did not squash your commits after reviewing has started, as it is rather difficult to see what has changed. I have your past commits in my reflog, so I can see what changed, but others will not have that locally, and you can't see the diff on GitHub (afaik). You can squash at the end after approval, or just let @dajohi do it for you. He enjoys it... idk... maybe.

@isuldor (Contributor, Author) commented Dec 9, 2019

For what it's worth, I've been running this in production for a few weeks now with a 30-second interval and have had zero missed votes in that time.

@JoeGruffins (Member) commented

@dajohi
