Add fine grained metrics+logging for handling, processing, and grab l… #1301

aaronbuchwald · 2023-04-06T14:45:47Z

Why this should be merged

This PR adds more fine grained tracking for the amount of time spent in message processing. Currently, message processing tracks the amount of time spent handling a message including the time spent grabbing the engine lock. This PR breaks down the handling time into:

total handling time (includes grabbing the lock)
acquire lock time (time it takes to grab the lock when processing a message)
processing time (the time spent processing the message AFTER grabbing the lock)

Note: these metrics are observed both before and after bootstrapping, so new nodes that spend a long time bootstrapping may have very skewed metrics at the start.

How this works

This PR tracks the new finer grained metrics and updates the warn log reported in #1286 to include all three, so that we have more information when this warn occurs.

How this was tested

This PR is tested in CI and by running locally with the ANR to observe that the metrics show up as expected:

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0# HELP avalanche_P_handler_state_summary_frontier_count Total # of observations of time (in ns) of handling a state_summary_frontier
# TYPE avalanche_P_handler_state_summary_frontier_count counter
avalanche_P_handler_state_summary_frontier_count 0
# HELP avalanche_P_handler_state_summary_frontier_lock_count Total # of observations of time (in ns) of acquiring a lock to process a state_summary_frontier
# TYPE avalanche_P_handler_state_summary_frontier_lock_count counter
avalanche_P_handler_state_summary_frontier_lock_count 0
# HELP avalanche_P_handler_state_summary_frontier_lock_sum Sum of time (in ns) of acquiring a lock to process a state_summary_frontier
# TYPE avalanche_P_handler_state_summary_frontier_lock_sum gauge
avalanche_P_handler_state_summary_frontier_lock_sum 0
# HELP avalanche_P_handler_state_summary_frontier_processing_count Total # of observations of time (in ns) of processing a state_summary_frontier
# TYPE avalanche_P_handler_state_summary_frontier_processing_count counter
avalanche_P_handler_state_summary_frontier_processing_count 0
# HELP avalanche_P_handler_state_summary_frontier_processing_sum Sum of time (in ns) of processing a state_summary_frontier
# TYPE avalanche_P_handler_state_summary_frontier_processing_sum gauge
avalanche_P_handler_state_summary_frontier_processing_sum 0
# HELP avalanche_P_handler_state_summary_frontier_sum Sum of time (in ns) of handling a state_summary_frontier
# TYPE avalanche_P_handler_state_summary_frontier_sum gauge
avalanche_P_handler_state_summary_frontier_sum 0
100  484k    0  484k    0     0  23.7M      0 --:--:-- --:--:-- --:--:-- 36.3M

Note: I kept the name of the original metric for the total time spent handling the message the same, so that existing metrics are unchanged. This has the unfortunate side effect that the output now shows the count at the beginning and the sum at the end, with the other metrics in between, but this is only a problem when viewing the raw metrics with grep.

…ock times in snow handler

snow/networking/handler/handler.go

snow/networking/handler/metrics.go

snow/networking/handler/handler.go

snow/networking/handler/metrics.go

joshua-kim · 2023-04-06T17:51:32Z

super minor nits, one question on if we should add a metric on acquire time in addition to the held time we're adding

aaronbuchwald · 2023-04-06T19:06:52Z

Good call @joshua-kim

Add fine grained metrics+logging for handling, processing, and grab l…

10d5101

…ock times in snow handler

aaronbuchwald added the monitoring This primarily focuses on logs, metrics, and/or tracing label Apr 6, 2023

aaronbuchwald requested a review from danlaine as a code owner April 6, 2023 14:45

aaronbuchwald self-assigned this Apr 6, 2023

aaronbuchwald requested a review from StephenButtolph as a code owner April 6, 2023 14:45