Serve pollable prometheus endpoint #1670

lgalabru · 2020-06-10T04:06:53Z

This PR addresses #1432 by spawning an endpoint that exposes metrics prometheus can scrape.

How to test:

The easiest way to test at this point is to spawn a testnet with nodes compiled with the monitoring feature flag, i.e.:

Start bitcoind:

bitcoind

Start a bitcoin controller:

cd testnet/bitcoin-neon-controller
DYNAMIC_GENESIS_TIMESTAMP=1 cargo run local-leader.toml.default

Assuming prometheus is installed on your machine, start it with the sample config:

prometheus --config.file="testnet/stacks-node/conf/prometheus.yml"

Start a local leader node:

cd testnet/stacks-node
cargo run  --features "monitoring" start --config=./conf/local-leader-conf.toml

Start a local follower node:

cd testnet/stacks-node
cargo run  --features "monitoring" start --config=./conf/local-follower-conf.toml

Visiting http://127.0.0.1:4000 (miner) and http://127.0.0.1:5000 (follower) shows the metrics each node exposes for prometheus to scrape.

You can also visit http://0.0.0.0:9090/ and run queries such as:

sum(stacks_node_btc_blocks_received_total)

I wanted to start with a set of simple metrics so we can settle on how prometheus is integrated. We can get fancier with more metrics, histograms and gauges in the near future.
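To make the "simple metrics" idea concrete, here is a minimal standard-library sketch of a counter that renders itself in the prometheus text exposition format. It is not the code in this PR, which relies on the prometheus crate's counters and text encoder; the `Counter` type, its methods and the help string are hypothetical, and only the metric name comes from the query above.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical stand-in for an integer counter metric:
/// a value that only ever goes up.
pub struct Counter {
    name: &'static str,
    help: &'static str,
    value: AtomicU64,
}

impl Counter {
    /// `const` so a counter can live in a `static`.
    pub const fn new(name: &'static str, help: &'static str) -> Self {
        Counter { name, help, value: AtomicU64::new(0) }
    }

    /// Add one to the counter.
    pub fn inc(&self) {
        self.value.fetch_add(1, Ordering::Relaxed);
    }

    /// Read the current value.
    pub fn get(&self) -> u64 {
        self.value.load(Ordering::Relaxed)
    }

    /// Render HELP, TYPE and sample lines in the text exposition format.
    pub fn render(&self) -> String {
        format!(
            "# HELP {n} {h}\n# TYPE {n} counter\n{n} {v}\n",
            n = self.name,
            h = self.help,
            v = self.get()
        )
    }
}

/// Metric name taken from the example query; the help text is an assumption.
pub static BTC_BLOCKS_RECEIVED: Counter = Counter::new(
    "stacks_node_btc_blocks_received_total",
    "Total number of bitcoin blocks received.",
);
```

Calling `BTC_BLOCKS_RECEIVED.inc()` wherever a block arrives and returning `BTC_BLOCKS_RECEIVED.render()` from the endpoint would give prometheus something to scrape. Histograms and gauges would need more state than a single atomic.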

Once this is debated, adjusted, accepted, merged and deployed, I'll probably follow up with another PR implementing further adjustments required by the DevOps team.

testnet/stacks-node/Cargo.toml

src/net/chat.rs

testnet/stacks-node/src/burnchains/bitcoin_regtest_controller.rs

testnet/stacks-node/src/monitoring.rs

Cargo.toml

jcnelson · 2020-06-16T15:45:20Z

testnet/stacks-node/src/monitoring.rs

+async fn accept(addr: String, stream: TcpStream) -> http_types::Result<()> {
+    println!("starting new connection from {}", stream.peer_addr()?);
+    async_h1::accept(&addr, stream.clone(), |_| async {
+        let encoder = TextEncoder::new();    


I'm glad to see that Prometheus separates the task of gathering monitoring data from the task of sending it over the network. This means we can easily add an HTTP endpoint for Prometheus statistics in the near future (e.g. /v2/monitoring/prometheus or something), and strip this module out completely.

Sure, however we talked as a team about the different possible approaches (and I presented the one you're suggesting) before working on this, and went with the consensus.

Should we revisit now?

I could see reasons for and against doing this. On one hand, if this is really a compile-time option, keeping the module separate from the HTTP code makes for a much cleaner separation. On the other hand, integrating it reduces the number of HTTP servers, though having metrics collection share a thread with real requests is generally frowned upon.

We don't have to do it right now, but we should do it at some point before mainnet, once we're sure about the endpoints. We can (and should) have a separate thread for serving monitoring data.
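As a rough illustration of serving monitoring data from its own thread, the sketch below uses only the standard library. It is an assumption-laden stand-in, not the async-h1 server in this PR: `spawn_metrics_server` and its `render` callback are hypothetical names, the request is read but never parsed, and connections are handled one at a time.

```rust
use std::io::{Read, Write};
use std::net::{SocketAddr, TcpListener};
use std::thread;

/// Bind `addr` and answer every connection with the output of `render`,
/// on a dedicated thread so scrapes never share a thread with real requests.
/// Returns the address actually bound (useful when asking for port 0).
pub fn spawn_metrics_server<F>(addr: &str, render: F) -> std::io::Result<SocketAddr>
where
    F: Fn() -> String + Send + 'static,
{
    let listener = TcpListener::bind(addr)?;
    let local = listener.local_addr()?;
    thread::spawn(move || {
        for stream in listener.incoming() {
            let mut stream = match stream {
                Ok(s) => s,
                Err(_) => continue,
            };
            // Read and discard the request head; every path gets the metrics.
            let mut buf = [0u8; 1024];
            let _ = stream.read(&mut buf);
            let body = render();
            let response = format!(
                "HTTP/1.1 200 OK\r\nContent-Type: text/plain; version=0.0.4\r\n\
                 Content-Length: {}\r\nConnection: close\r\n\r\n{}",
                body.len(),
                body
            );
            let _ = stream.write_all(response.as_bytes());
            // `stream` is dropped here, which closes the connection.
        }
    });
    Ok(local)
}
```

Because the server only needs a closure that returns the rendered text, the gathering side stays independent of the transport, which is the separation discussed above.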

jcnelson · 2020-06-26T00:01:34Z

src/net/chat.rs

@@ -1087,6 +1096,8 @@ impl ConversationP2P {
            StacksMessageType::GetNeighbors => self.handle_getneighbors(peerdb.conn(), local_peer, chain_view, &msg.preamble),
            StacksMessageType::GetBlocksInv(ref get_blocks_inv) => self.handle_getblocksinv(local_peer, burndb, chainstate, chain_view, &msg.preamble, get_blocks_inv),
            StacksMessageType::Blocks(_) => {
+                monitoring::increment_stx_blocks_downloaded_counter();


This should be a separate counter -- the blocks are being uploaded here, not downloaded.

Ha. I'm confused by the next commented line then:

we can't receive blocks too often, so close this conversation if we do.

Right -- this comment is about making sure that the remote node can't DDoS this node by sending it lots of blocks at once. The validate_blocks_push() method will send a NACK to this peer and close the socket if the remote peer has consumed too much bandwidth.

But, it's still important to count the number of blocks pushed to this peer separately from counting the number of blocks requested from this peer :)

rpc.rs is incrementing another counter - increment_stx_blocks_served_counter.
I've renamed increment_stx_blocks_downloaded_counter to increment_stx_blocks_received_counter to avoid confusion - 4aa045a
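The distinction settled on here, one counter for blocks this node receives and another for blocks it serves, could look roughly like the sketch below. The two `increment_*` names come from this thread; the atomics behind them and the two getter functions are assumptions for illustration, since the actual module uses the prometheus crate.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// One counter per direction, so pushes to this node and
// requests served by this node are never mixed together.
static STX_BLOCKS_RECEIVED: AtomicU64 = AtomicU64::new(0);
static STX_BLOCKS_SERVED: AtomicU64 = AtomicU64::new(0);

/// Called when a remote peer sends blocks to this node.
pub fn increment_stx_blocks_received_counter() {
    STX_BLOCKS_RECEIVED.fetch_add(1, Ordering::Relaxed);
}

/// Called when this node serves a block to a requester.
pub fn increment_stx_blocks_served_counter() {
    STX_BLOCKS_SERVED.fetch_add(1, Ordering::Relaxed);
}

/// Hypothetical getter for the received counter.
pub fn stx_blocks_received() -> u64 {
    STX_BLOCKS_RECEIVED.load(Ordering::Relaxed)
}

/// Hypothetical getter for the served counter.
pub fn stx_blocks_served() -> u64 {
    STX_BLOCKS_SERVED.load(Ordering::Relaxed)
}
```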

kantai

This LGTM once the last metric issue is solved!

jcnelson

LGTM!

lgalabru added 5 commits June 9, 2020 17:05

Move from reqwest to async-h1

88bc7a9

Ability to spawn prometheus server

e3a3242

Fix DNS bitcoin rpc resolve

dfd4a9a

Server now returning compliant data

5c766c1

Start counting metrics

0fdf28b

diwakergupta added this to the 2020 W25-W27 milestone Jun 15, 2020

lgalabru added 4 commits June 15, 2020 12:09

Move to IntCounters

09302bc

Re-wire prometheus integration

9630cf8

Add more data points

1aa0ab5

Fix warning

51c7cd2

lgalabru requested review from kantai and jcnelson June 16, 2020 04:18

lgalabru marked this pull request as ready for review June 16, 2020 04:19

lgalabru self-assigned this Jun 16, 2020

Enable monitoring for builds in docker images

78e154f