Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Serve pollable prometheus endpoint #1670

Merged
merged 26 commits into from
Jun 29, 2020
Merged

Serve pollable prometheus endpoint #1670

merged 26 commits into from
Jun 29, 2020

Conversation

lgalabru
Copy link
Contributor

@lgalabru lgalabru commented Jun 10, 2020

This PR is addressing #1432, and spawning an endpoint, exposing metrics that can be scraped by prometheus.

How to test:

Easiest way to start at this point, is by spawning a testnet, with nodes compiled w/ feature flag monitoring, ie --

Start bitcoind:

bitcoind

Start a bitcoin controller:

cd testnet/bitcoin-neon-controller
DYNAMIC_GENESIS_TIMESTAMP=1 cargo run local-leader.toml.default

Assuming prometheus is installed on your machine, start it with the sample config:

prometheus --config.file="testnet/stacks-node/conf/prometheus.yml" 

Start a local leader node:

cd testnet/stacks-node
cargo run  --features "monitoring" start --config=./conf/local-leader-conf.toml

Start a local follower node:

cd testnet/stacks-node
cargo run  --features "monitoring" start --config=./conf/local-follower-conf.toml

When visiting http://127.0.0.1:4000 and http://127.0.0.1:5000 you can see the metrics exposed (respectively miner and follower), that can be scraped by prometheus.

You can also visit http://0.0.0.0:9090/ and query things like

sum(stacks_node_btc_blocks_received_total)

I wanter to start with a set of simple metrics so we can settle on the way prometheus is being integrated, we can get more fancy with more metrics, histograms and gauges in a near future.

Once debated, adjusted, accepted, merged and deployed, I'll probably be following-up with another PR, implementing some more adjustments required by DevOps team.

@diwakergupta diwakergupta added this to the 2020 W25-W27 milestone Jun 15, 2020
@lgalabru lgalabru requested review from kantai and jcnelson June 16, 2020 04:18
@lgalabru lgalabru marked this pull request as ready for review June 16, 2020 04:19
@lgalabru lgalabru self-assigned this Jun 16, 2020
src/net/chat.rs Outdated Show resolved Hide resolved
Cargo.toml Outdated Show resolved Hide resolved
async fn accept(addr: String, stream: TcpStream) -> http_types::Result<()> {
println!("starting new connection from {}", stream.peer_addr()?);
async_h1::accept(&addr, stream.clone(), |_| async {
let encoder = TextEncoder::new();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm glad to see that Prometheus separates the task of gathering monitoring data from the task of sending it over the network. This means we can easily add a HTTP endpoint for Prometheus statistics in the near future (e.g. /v2/monitoring/prometheus or something), and strip this module out completely.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, however we talked as a team about the different possible approaches (and I presented the one you're suggesting) before working on this, and went with the consensus.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we revisit now?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could see reasons for and against doing this -- on one hand, if this is really a compiler-time option, keeping the module separate from the HTTP code will make it a lot cleaner of a separation, on the other hand, integrating it reduces the number of HTTP servers. Though having metrics collection share the same thread as real requests is generally frowned upon.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have to do it right now, but we should do it at some point before mainnet, once we're sure about the endpoints. We can (and should) have a separate thread for serving monitoring data.

@lgalabru lgalabru requested a review from jcnelson June 23, 2020 22:20
src/net/chat.rs Outdated
@@ -1087,6 +1096,8 @@ impl ConversationP2P {
StacksMessageType::GetNeighbors => self.handle_getneighbors(peerdb.conn(), local_peer, chain_view, &msg.preamble),
StacksMessageType::GetBlocksInv(ref get_blocks_inv) => self.handle_getblocksinv(local_peer, burndb, chainstate, chain_view, &msg.preamble, get_blocks_inv),
StacksMessageType::Blocks(_) => {
monitoring::increment_stx_blocks_downloaded_counter();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be a separate counter -- the blocks are being uploaded here, not downloaded.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ha. I'm confused by the next commented line then:

we can't receive blocks too often, so close this conversation if we do.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right -- this comment is about making sure that the remote node can't DDoS this node by sending it lots of blocks at once. The validate_blocks_push() method will send a NACK to this peer and close the socket if the remote peer is has consumed too much bandwidth.

But, it's still important to count the number of blocks pushed to this peer separately from counting the number of blocks requested from this peer :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rpc.rs is incrementing another counter - increment_stx_blocks_served_counter.
I've renamed increment_stx_blocks_downloaded_counter to increment_stx_blocks_received_counter to avoid confusion - 4aa045a

Copy link
Member

@kantai kantai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM once the last metric issue is solved!

@lgalabru lgalabru requested a review from jcnelson June 29, 2020 16:52
Copy link
Member

@jcnelson jcnelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@lgalabru lgalabru merged commit 7beddb6 into master Jun 29, 2020
@lgalabru lgalabru deleted the feature/prometheus branch June 29, 2020 17:21
@wileyj wileyj mentioned this pull request Feb 18, 2021
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants