Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add fine grained metrics+logging for handling, processing, and grab l… #1301

Merged
merged 5 commits into from
Apr 6, 2023

Conversation

aaronbuchwald
Copy link
Collaborator

Why this should be merged

This PR adds more fine grained tracking for the amount of time spent in message processing. Currently, message processing tracks the amount of time spent handling a message including the time spent grabbing the engine lock. This PR breaks down the handling time into:

  • total handling time (includes grabbing the lock)
  • acquire lock time (time it takes to grab the lock when processing a message)
  • processing time (the time spent processing the message AFTER grabbing the lock)

Note: these metrics are observed both before and after bootstrapping, so new nodes that spend a long time bootstrapping may have very skewed metrics at the start.

How this works

This PR tracks the new finer grained metrics and updates the warn log reported in #1286 to include all three, so that we have more information when this warn occurs.

How this was tested

This PR is tested in CI and by running locally with the ANR to observe that the metrics show up as expected:

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0# HELP avalanche_P_handler_state_summary_frontier_count Total # of observations of time (in ns) of handling a state_summary_frontier
# TYPE avalanche_P_handler_state_summary_frontier_count counter
avalanche_P_handler_state_summary_frontier_count 0
# HELP avalanche_P_handler_state_summary_frontier_lock_count Total # of observations of time (in ns) of acquiring a lock to process a state_summary_frontier
# TYPE avalanche_P_handler_state_summary_frontier_lock_count counter
avalanche_P_handler_state_summary_frontier_lock_count 0
# HELP avalanche_P_handler_state_summary_frontier_lock_sum Sum of time (in ns) of acquiring a lock to process a state_summary_frontier
# TYPE avalanche_P_handler_state_summary_frontier_lock_sum gauge
avalanche_P_handler_state_summary_frontier_lock_sum 0
# HELP avalanche_P_handler_state_summary_frontier_processing_count Total # of observations of time (in ns) of processing a state_summary_frontier
# TYPE avalanche_P_handler_state_summary_frontier_processing_count counter
avalanche_P_handler_state_summary_frontier_processing_count 0
# HELP avalanche_P_handler_state_summary_frontier_processing_sum Sum of time (in ns) of processing a state_summary_frontier
# TYPE avalanche_P_handler_state_summary_frontier_processing_sum gauge
avalanche_P_handler_state_summary_frontier_processing_sum 0
# HELP avalanche_P_handler_state_summary_frontier_sum Sum of time (in ns) of handling a state_summary_frontier
# TYPE avalanche_P_handler_state_summary_frontier_sum gauge
avalanche_P_handler_state_summary_frontier_sum 0
100  484k    0  484k    0     0  23.7M      0 --:--:-- --:--:-- --:--:-- 36.3M

Note: I kept the name of the original metric for the total time spent handling the message the same, so that existing metrics are unchanged. This has the unfortunate side effect that the output now shows the count at the beginning and the sum at the end, with the other metrics in between, but this is only a problem when viewing the raw metrics with grep.

@aaronbuchwald aaronbuchwald added the monitoring This primarily focuses on logs, metrics, and/or tracing label Apr 6, 2023
@aaronbuchwald aaronbuchwald self-assigned this Apr 6, 2023
snow/networking/handler/handler.go Show resolved Hide resolved
snow/networking/handler/handler.go Show resolved Hide resolved
snow/networking/handler/handler.go Outdated Show resolved Hide resolved
snow/networking/handler/metrics.go Outdated Show resolved Hide resolved
snow/networking/handler/metrics.go Outdated Show resolved Hide resolved
@joshua-kim
Copy link
Contributor

super minor nits, one question on if we should add a metric on acquire time in addition to the held time we're adding

@StephenButtolph StephenButtolph added this to the v1.10.0 (Cortina) milestone Apr 6, 2023
@aaronbuchwald
Copy link
Collaborator Author

Good call @joshua-kim

@StephenButtolph StephenButtolph merged commit 8e099b2 into dev Apr 6, 2023
14 checks passed
@StephenButtolph StephenButtolph deleted the improve-msg-processing-observability branch April 6, 2023 19:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
monitoring This primarily focuses on logs, metrics, and/or tracing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants