-
Notifications
You must be signed in to change notification settings - Fork 664
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Serve pollable prometheus endpoint #1670
Conversation
async fn accept(addr: String, stream: TcpStream) -> http_types::Result<()> { | ||
println!("starting new connection from {}", stream.peer_addr()?); | ||
async_h1::accept(&addr, stream.clone(), |_| async { | ||
let encoder = TextEncoder::new(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm glad to see that Prometheus separates the task of gathering monitoring data from the task of sending it over the network. This means we can easily add a HTTP endpoint for Prometheus statistics in the near future (e.g. /v2/monitoring/prometheus
or something), and strip this module out completely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, however we talked as a team about the different possible approaches (and I presented the one you're suggesting) before working on this, and went with the consensus.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we revisit now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could see reasons for and against doing this -- on one hand, if this is really a compiler-time option, keeping the module separate from the HTTP code will make it a lot cleaner of a separation, on the other hand, integrating it reduces the number of HTTP servers. Though having metrics collection share the same thread as real requests is generally frowned upon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't have to do it right now, but we should do it at some point before mainnet, once we're sure about the endpoints. We can (and should) have a separate thread for serving monitoring data.
1458a8c
to
94e5c83
Compare
94e5c83
to
5b1d305
Compare
src/net/chat.rs
Outdated
@@ -1087,6 +1096,8 @@ impl ConversationP2P { | |||
StacksMessageType::GetNeighbors => self.handle_getneighbors(peerdb.conn(), local_peer, chain_view, &msg.preamble), | |||
StacksMessageType::GetBlocksInv(ref get_blocks_inv) => self.handle_getblocksinv(local_peer, burndb, chainstate, chain_view, &msg.preamble, get_blocks_inv), | |||
StacksMessageType::Blocks(_) => { | |||
monitoring::increment_stx_blocks_downloaded_counter(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be a separate counter -- the blocks are being uploaded here, not downloaded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ha. I'm confused by the next commented line then:
we can't receive blocks too often, so close this conversation if we do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right -- this comment is about making sure that the remote node can't DDoS this node by sending it lots of blocks at once. The validate_blocks_push()
method will send a NACK
to this peer and close the socket if the remote peer is has consumed too much bandwidth.
But, it's still important to count the number of blocks pushed to this peer separately from counting the number of blocks requested from this peer :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rpc.rs
is incrementing another counter - increment_stx_blocks_served_counter
.
I've renamed increment_stx_blocks_downloaded_counter
to increment_stx_blocks_received_counter
to avoid confusion - 4aa045a
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This LGTM once the last metric issue is solved!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
This PR is addressing #1432, and spawning an endpoint, exposing metrics that can be scraped by prometheus.
How to test:
Easiest way to start at this point, is by spawning a testnet, with nodes compiled w/ feature flag
monitoring
, ie --Start bitcoind:
Start a bitcoin controller:
Assuming prometheus is installed on your machine, start it with the sample config:
Start a local leader node:
Start a local follower node:
When visiting
http://127.0.0.1:4000
andhttp://127.0.0.1:5000
you can see the metrics exposed (respectively miner and follower), that can be scraped by prometheus.You can also visit
http://0.0.0.0:9090/
and query things likeI wanter to start with a set of simple metrics so we can settle on the way prometheus is being integrated, we can get more fancy with more metrics, histograms and gauges in a near future.
Once debated, adjusted, accepted, merged and deployed, I'll probably be following-up with another PR, implementing some more adjustments required by DevOps team.