Add subsystem metrics for the dispatcher #13989

Merged (4 commits) on May 17, 2023


@AlanCoding AlanCoding commented May 11, 2023

SUMMARY

Connect #12776

Demo:

# HELP dispatcher_pool_scale_up_events Number of times local dispatcher scaled up a worker since startup
# TYPE dispatcher_pool_scale_up_events gauge
dispatcher_pool_scale_up_events{node="awx_1"} 12
# HELP dispatcher_pool_active_task_count Number of active tasks in the worker pool when last task was submitted
# TYPE dispatcher_pool_active_task_count gauge
dispatcher_pool_active_task_count{node="awx_1"} 0
# HELP dispatcher_pool_worker_count Highest number of workers in worker pool in last collection interval, about 20s
# TYPE dispatcher_pool_worker_count gauge
dispatcher_pool_worker_count{node="awx_1"} 6
# HELP dispatcher_availability Fraction of time last interval when dispatcher was available to receive messages
# TYPE dispatcher_availability gauge
dispatcher_availability{node="awx_1"} 0.9945928728946261
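
For anyone who wants to consume the demo output in a script, here is a minimal sketch of reading those sample lines into a dictionary. It is not code from this PR: `parse_metrics` and `SAMPLE` are hypothetical names, and the parser deliberately handles only the simple `name{labels} value` shape shown above (no timestamps, no escaped quotes in label values).

```python
import re

# A few lines copied from the demo output; comment lines start with "#".
SAMPLE = """\
# HELP dispatcher_pool_worker_count Highest number of workers in worker pool in last collection interval, about 20s
# TYPE dispatcher_pool_worker_count gauge
dispatcher_pool_worker_count{node="awx_1"} 6
# TYPE dispatcher_availability gauge
dispatcher_availability{node="awx_1"} 0.9945928728946261
"""

# name, optional {label="value",...} block, then a single numeric value.
_SAMPLE_LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)$'
)
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return {(metric_name, node): value} for each sample line in text."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE_LINE.match(line)
        if match is None:
            continue  # skip anything outside the simple shape handled here
        labels = dict(_LABEL.findall(match.group("labels") or ""))
        key = (match.group("name"), labels.get("node"))
        samples[key] = float(match.group("value"))
    return samples


if __name__ == "__main__":
    for (name, node), value in sorted(parse_metrics(SAMPLE).items()):
        print(f"{node} {name} = {value}")
```

Keying on `(metric_name, node)` keeps values from different nodes apart, since every sample in the demo carries a `node` label.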

I started with a very long list of metrics we could add, but settled on these because each one provides unique value and is simple enough to be understood. I feel like we added too much data with the task manager and callback receiver metrics.

These have absolutely red-hot implications for system stability. If availability is below 90%, message processing will lag noticeably. Scale-up events should also be infrequent, and currently they are not. That is already a reported finding, but these metrics give the perfscale team instrumentation to quantify it and hold some regression coverage over it.
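
As a rough illustration of the kind of regression check this could enable, here is a minimal sketch that compares two consecutive scrapes for one node. The 90% availability floor comes from the description above; the scale-up budget (`MAX_SCALE_UPS_PER_INTERVAL`) and the function name `check_dispatcher_health` are assumptions made up for the example, not anything defined in this PR.

```python
AVAILABILITY_FLOOR = 0.90  # threshold mentioned in the PR description
MAX_SCALE_UPS_PER_INTERVAL = 1  # hypothetical budget, tune to taste


def check_dispatcher_health(previous, current,
                            max_scale_ups=MAX_SCALE_UPS_PER_INTERVAL):
    """Compare two scrapes (dicts of metric name -> value) for one node.

    Returns a list of human-readable problems; an empty list means healthy.
    """
    problems = []

    availability = current["dispatcher_availability"]
    if availability < AVAILABILITY_FLOOR:
        problems.append(
            f"availability {availability:.3f} is below {AVAILABILITY_FLOOR:.2f}"
        )

    # The scale-up metric counts events since startup, so look at the change
    # between scrapes rather than the absolute value.
    name = "dispatcher_pool_scale_up_events"
    scale_ups = current[name] - previous[name]
    if scale_ups < 0:
        # The count went down, so the dispatcher presumably restarted.
        scale_ups = current[name]
    if scale_ups > max_scale_ups:
        problems.append(
            f"{scale_ups:g} scale-up events since last scrape "
            f"(budget {max_scale_ups})"
        )
    return problems


if __name__ == "__main__":
    before = {"dispatcher_availability": 0.99,
              "dispatcher_pool_scale_up_events": 12}
    after = {"dispatcher_availability": 0.85,
             "dispatcher_pool_scale_up_events": 20}
    for problem in check_dispatcher_health(before, after):
        print(problem)
```

Looking at the delta between scrapes, instead of the raw count, is what makes "infrequent" measurable, because the raw value only ever grows until the dispatcher restarts.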

ISSUE TYPE
  • Bug, Docs Fix or other nominal change
COMPONENT NAME
  • API
ADDITIONAL INFORMATION

I am offering this as a model for using metrics to make progress on performance. Here are patches that are intended to address these performance metrics:

@AlanCoding AlanCoding marked this pull request as ready for review May 17, 2023 17:17
@AlanCoding (Member, Author) commented

I added this to the demo dashboard so here is what it looks like.

Screenshot from 2023-05-17 13-46-42

@AlanCoding AlanCoding merged commit ef99770 into ansible:devel May 17, 2023
14 checks passed