Add subsystem metrics for the dispatcher #13989

AlanCoding · 2023-05-11T15:39:01Z

SUMMARY

Demo:

# HELP dispatcher_pool_scale_up_events Number of times local dispatcher scaled up a worker since startup
# TYPE dispatcher_pool_scale_up_events gauge
dispatcher_pool_scale_up_events{node="awx_1"} 12
# HELP dispatcher_pool_active_task_count Number of active tasks in the worker pool when last task was submitted
# TYPE dispatcher_pool_active_task_count gauge
dispatcher_pool_active_task_count{node="awx_1"} 0
# HELP dispatcher_pool_worker_count Highest number of workers in worker pool in last collection interval, about 20s
# TYPE dispatcher_pool_worker_count gauge
dispatcher_pool_worker_count{node="awx_1"} 6
# HELP dispatcher_availability Fraction of time last interval when dispatcher was available to receive messages
# TYPE dispatcher_availability gauge
dispatcher_availability{node="awx_1"} 0.9945928728946261
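For readers wondering how a per-interval availability fraction like the one above could be derived, here is a minimal sketch. It is not taken from this PR's code: the `AvailabilityTracker` class, its method names, and the injectable `clock` parameter are all hypothetical, and it assumes availability means the share of the interval not spent on blocking work.

```python
import time


class AvailabilityTracker:
    """Hypothetical helper: tracks the fraction of each interval spent NOT busy."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock              # injectable so the logic can be tested
        self._interval_start = clock()   # start of the current collection interval
        self._busy_since = None          # set while a blocking operation is running
        self._busy_total = 0.0           # busy seconds accumulated this interval

    def start_work(self):
        """Mark the start of work that keeps the main loop from reading messages."""
        if self._busy_since is None:
            self._busy_since = self._clock()

    def end_work(self):
        """Mark the end of that work and add its duration to the busy total."""
        if self._busy_since is not None:
            self._busy_total += self._clock() - self._busy_since
            self._busy_since = None

    def collect(self):
        """Return availability for the interval just ended, then reset counters."""
        now = self._clock()
        busy = self._busy_total
        if self._busy_since is not None:
            # Work still in progress: count the part that falls in this interval.
            busy += now - self._busy_since
            self._busy_since = now
        elapsed = now - self._interval_start
        self._interval_start = now
        self._busy_total = 0.0
        if elapsed <= 0:
            return 1.0
        return 1.0 - busy / elapsed
```

Resetting the counters inside `collect()` is what makes the value incremental per interval instead of an average since startup.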

I started with a very long list of candidate metrics, but settled on these because each one provides unique value and is simple enough to be understood. I feel like we added too much data with the task manager & callback receiver metrics.

These have absolutely red hot implications for system stability. If availability is below 90%, message processing will lag noticeably. Scale-up events should also be infrequent, and currently they are not. This is already a reported finding, but these metrics give the perfscale team instrumentation to quantify it and hold some regression coverage over it.
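As a rough illustration of the kind of regression check this enables, the sketch below parses metrics text in the format shown in the demo and flags nodes whose availability falls under 90%. The function names and the regex are hypothetical, and it assumes every sample line carries a single `node` label as in the demo output.

```python
import re

# Matches sample lines such as: dispatcher_availability{node="awx_1"} 0.99
# Assumes exactly one label, named "node", as in the demo output.
SAMPLE_RE = re.compile(r'^(?P<name>\w+)\{node="(?P<node>[^"]*)"\}\s+(?P<value>\S+)$')


def parse_samples(text):
    """Return {(metric_name, node): value}, skipping blank and # comment lines."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match:
            samples[(match["name"], match["node"])] = float(match["value"])
    return samples


def low_availability_nodes(samples, threshold=0.9):
    """Return sorted node names whose dispatcher_availability is below threshold."""
    return sorted(
        node
        for (name, node), value in samples.items()
        if name == "dispatcher_availability" and value < threshold
    )


DEMO = """\
# TYPE dispatcher_pool_scale_up_events gauge
dispatcher_pool_scale_up_events{node="awx_1"} 12
# TYPE dispatcher_availability gauge
dispatcher_availability{node="awx_1"} 0.9945928728946261
"""

samples = parse_samples(DEMO)
print(low_availability_nodes(samples))
```

A similar comparison against a baseline value of dispatcher_pool_scale_up_events could serve as a regression guard for bursty task submission.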

ISSUE TYPE

Bug, Docs Fix or other nominal change

COMPONENT NAME

API

ADDITIONAL INFORMATION

I am trying to offer this as a model for using the metrics to make progress on performance. Here are patches intended to improve these performance metrics:

[troubleshooting] Make dispatcher --status accurate real-time #13975 should increase (improve) dispatcher_availability by a static amount by avoiding running the debug template periodically
Spread out submission of scheduled tasks to avoid bursting #13990 should offer dramatic improvements by decreasing dispatcher_pool_scale_up_events (indeed, making it flat) and decreasing dispatcher_pool_worker_count
Reduce queue checks needed for task assignment #13993 should also increase dispatcher_availability

AlanCoding · 2023-05-17T17:52:44Z

I added this to the demo dashboard so here is what it looks like.

Add subsystem metrics for the dispatcher

3740258

AlanCoding requested review from kdelee, gamuniz and fosterseth May 11, 2023 15:39

github-actions bot added the component:api label May 11, 2023

Fix availability to be incremental for interval

b83f095

This was referenced May 11, 2023

Spread out submission of scheduled tasks to avoid bursting #13990

Closed

Reduce queue checks needed for task assignment #13993

Draft

kdelee approved these changes May 17, 2023

View reviewed changes

AlanCoding marked this pull request as ready for review May 17, 2023 17:17

AlanCoding added 2 commits May 17, 2023 13:50

Add max to pool workers and small reorg

9ea3fd4

Add grafana dashboard

ac6274e

john-westcott-iv approved these changes May 17, 2023

View reviewed changes

AlanCoding merged commit ef99770 into ansible:devel May 17, 2023
14 checks passed

AlanCoding mentioned this pull request Jun 12, 2023

Integrate scheduler into dispatcher main loop #14067

Merged

AlanCoding mentioned this pull request Jul 17, 2023

Dispatcher shutdown deadlock when redis is unavailable #14245

Closed

11 tasks
