Metrics for callback receiver job event lag #12308
Conversation
@@ -34,7 +34,7 @@ def decode(self, conn):
     def to_prometheus(self, instance_data):
         output_text = f"# HELP {self.field} {self.help_text}\n# TYPE {self.field} gauge\n"
         for instance in instance_data:
-            output_text += f'{self.field}{{node="{instance}"}} {instance_data[instance][self.field]}\n'
+            output_text += f'{self.field}{{node="{instance}"}} {instance_data[instance].get(self.field, -1)}\n'  # TODO: fix because this -1 is necessary when dealing with old instances (ex. you didn't clean up your database)
@fosterseth do you have any idea why this would have to be? I had some old instances around that necessitated adding a default, because they didn't have the new key (i.e., they didn't exist and weren't running, but still got caught in this).
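For anyone following along, here is a minimal runnable sketch of the failure mode being discussed; the field name and dict contents are illustrative stand-ins, not the exact AWX structures:

```python
field = "callback_receiver_events_queue_size_redis"  # illustrative field name
instance_data = {
    "node1": {field: 3},
    "old-node": {},  # stale instance that never reported the new metric
}

output_text = f"# HELP {field} example help text\n# TYPE {field} gauge\n"
for instance in instance_data:
    # Plain indexing (instance_data[instance][field]) would raise KeyError
    # for "old-node"; .get(field, -1) instead emits -1 as a "no data" sentinel.
    output_text += f'{field}{{node="{instance}"}} {instance_data[instance].get(field, -1)}\n'
print(output_text)
```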
awx/main/dispatch/worker/callback.py (Outdated)
                metrics_total_job_event_processing_seconds.total_seconds() / (metrics_bulk_events_saved + metrics_singular_events_saved),
            )  # (metrics_total_job_event_processing_seconds )
        else:
            self.subsystem_metrics.set('callback_receiver_event_processing_avg_seconds', 0)  # if there is no lag the avg is 0.
Suggested change:
-            self.subsystem_metrics.set('callback_receiver_event_processing_avg_seconds', 0)  # if there is no lag the avg is 0.
+            self.subsystem_metrics.set('callback_receiver_event_processing_avg_seconds', 0)  # if we wrote no events the avg processing time is 0.
I'm only just now catching up on much of what subsystem metrics is doing, and I am very worried that it's saving too many things (already, even before this). I don't agree that 0 is representative of this situation; it's more akin to 0/0. I would prefer to not write in the `else:` condition here, and let the old value hang out.
OK, my concern is it would never go back down again on a quiet system. But I do see where you're coming from, and I'm not against dropping this condition, because the downside of keeping it is we may have some idle callback receivers pushing this metric to zero while others are busy.
So +1 to dropping the whole `else` condition.
I'll run it through some manual testing without the `else`; it should work just fine.
To articulate more specifically what @kdelee is thinking:
I expect that performance testing will launch a large number of jobs all at the same time. These will finish significantly faster than the event processing can, because that's by design. Then at the time the last event is saved, the Redis queue is effectively empty, but the measured event lag will be high. Since it's the last event of a long burst, the callback system will be dead silent after that.
In any case, the graph would be discontinuous. If you do what I'm asking for here, it will remain at a relatively high value after the burst has finished, and drop down to a low (not zero) value upon the first events of the next job that starts. With what this was doing, it would drop to zero right after the burst finishes processing.
@AlanCoding We left a couple more comments below, but I wanted to leave one here too. I would prefer that in the short term we keep the else and actually drop down to zero, since in the future we'll be changing how this behaves anyway. I think it delivers more meaningful information with the included else block, but it's not a deal breaker for me if you think we should drop it in order to get this in.
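To make the trade-off concrete, here is a minimal runnable sketch of the two behaviors being debated; the function, stub class, and flag are illustrative, not the actual AWX implementation:

```python
class SubsystemMetricsStub:
    """Illustrative stand-in for self.subsystem_metrics; only set() is modeled."""

    def __init__(self):
        self.values = {}

    def set(self, field, value):
        self.values[field] = value


def record_avg_processing_seconds(metrics, total_seconds, events_saved, write_zero_when_idle=True):
    # Guard against dividing by zero when no events were saved this cycle.
    if events_saved > 0:
        metrics.set('callback_receiver_event_processing_avg_seconds', total_seconds / events_saved)
    elif write_zero_when_idle:
        # Option A (the PR as written): report 0 on an idle cycle, so the
        # graph drops back down after a burst finishes processing.
        metrics.set('callback_receiver_event_processing_avg_seconds', 0)
    # Option B (the review suggestion): skip the write on idle cycles and
    # let the previous value persist until the next busy cycle.
```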
            m.clear_value(self.conn)

    def clear_values(self, fields=None):
        if not fields:
            fields = self.METRICS.keys()
This seems a little weird. What's the reasoning behind this conditional? Might be worth a comment.
I may or may not have stolen this from @fosterseth's open PR 😅
What I believe is that it's resetting the values at the end of each partition in preparation for the next cycle.
I think the best thing to do would be to revert the changes here before merge. As it stands, it's like adding mystery code.
I have validated that without this, the average seconds just continues to rise vs. being refreshed with the partitioning. I'll leave it in so this is functional, but full credit goes to @fosterseth for it. Here's a Grafana view of jobs over approximately 25 minutes to demonstrate what happens when it's not included.
Oh, I didn't realize it was fixing something real. I guess it's in the nuance of how it interacts with the Python library?
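For context, a simplified sketch of the reset-per-cycle pattern this thread is discussing; the class and metric names are stand-ins for the real subsystem metrics objects, which store their values in Redis rather than plain attributes:

```python
class Gauge:
    """Simplified stand-in for the real metric classes (which persist to Redis)."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount):
        self.value += amount

    def clear_value(self):
        self.value = 0.0


class Metrics:
    def __init__(self):
        # Illustrative field names, not the exact AWX metric registry.
        self.METRICS = {
            'events_saved': Gauge(),
            'total_processing_seconds': Gauge(),
        }

    def clear_values(self, fields=None):
        # No explicit list means "reset every registered metric"; callers
        # can pass a subset to reset only those. Without this periodic
        # reset, the accumulators feeding the average only ever grow, so
        # the reported average seconds rises monotonically.
        if not fields:
            fields = self.METRICS.keys()
        for field in fields:
            self.METRICS[field].clear_value()
```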
Grafana via Prometheus. This metric is a good indicator of how far behind the callback receiver is: the higher the load, the further behind it falls, and the greater the number of seconds the metric will display. A persistently high value may indicate the need for horizontal scaling of the control plane or vertical scaling of the number of callback receivers.
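As a rough illustration of consuming this gauge outside Grafana, here is a hedged Python sketch that scrapes a Prometheus-format metrics endpoint and flags high lag; the URL, token, and threshold are placeholders, so verify the exact path and auth for your deployment:

```python
import requests  # third-party; pip install requests

METRICS_URL = "https://awx.example.com/api/v2/metrics/"  # assumed endpoint; confirm for your install
TOKEN = "<your-api-token>"  # placeholder credential
THRESHOLD_SECONDS = 5.0  # illustrative alerting threshold

resp = requests.get(METRICS_URL, headers={"Authorization": f"Bearer {TOKEN}"})
resp.raise_for_status()
for line in resp.text.splitlines():
    # Data lines look like: metric_name{node="hostname"} 1.23
    if line.startswith("callback_receiver_event_processing_avg_seconds"):
        value = float(line.rsplit(" ", 1)[1])
        if value > THRESHOLD_SECONDS:
            print(f"callback receiver lagging: {line}")
```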
Did a little rebase to get this into one commit; it's passed internal tests on my end as well.
SUMMARY
Spawning out of #12171
ISSUE TYPE
COMPONENT NAME
AWX VERSION
Additional Information
view of Grafana graph example: