Metrics for callback receiver job event lag #12308
Conversation
@@ -34,7 +34,7 @@ def decode(self, conn):
     def to_prometheus(self, instance_data):
         output_text = f"# HELP {self.field} {self.help_text}\n# TYPE {self.field} gauge\n"
         for instance in instance_data:
-            output_text += f'{self.field}{{node="{instance}"}} {instance_data[instance][self.field]}\n'
+            output_text += f'{self.field}{{node="{instance}"}} {instance_data[instance].get(self.field, -1)}\n'  # TODO: fix because this -1 is necessary when dealing with old instances (ex. you didn't clean up your database)
@fosterseth do you have any idea why this would have to be? I had some old instances around that necessitated adding a default, because they didn't have the new key (i.e., they didn't exist and weren't running, but still got caught in this).
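For anyone following along, here is a minimal runnable sketch of the failure mode being discussed; the field name and dict contents are illustrative stand-ins, not the exact AWX structures:

```python
field = "callback_receiver_events_queue_size_redis"  # illustrative field name
instance_data = {
    "node1": {field: 3},
    "old-node": {},  # stale instance that never reported the new metric
}

output_text = f"# HELP {field} example help text\n# TYPE {field} gauge\n"
for instance in instance_data:
    # Plain indexing (instance_data[instance][field]) would raise KeyError
    # for "old-node"; .get(field, -1) instead emits -1 as a "no data" sentinel.
    output_text += f'{field}{{node="{instance}"}} {instance_data[instance].get(field, -1)}\n'
print(output_text)
```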
awx/main/dispatch/worker/callback.py (Outdated)
                metrics_total_job_event_processing_seconds.total_seconds() / (metrics_bulk_events_saved + metrics_singular_events_saved),
            )  # (metrics_total_job_event_processing_seconds )
        else:
            self.subsystem_metrics.set('callback_receiver_event_processing_avg_seconds', 0)  # if there is no lag the avg is 0.
Suggested change:
-            self.subsystem_metrics.set('callback_receiver_event_processing_avg_seconds', 0)  # if there is no lag the avg is 0.
+            self.subsystem_metrics.set('callback_receiver_event_processing_avg_seconds', 0)  # if we wrote no events the avg processing time is 0.
I'm only just now catching up on much of what subsystem metrics is doing, and I am very worried that it's saving too many things (already, even before this). I don't agree that 0 is representative of this situation; it's more akin to 0/0. I would prefer to not write in the `else:` condition here, and let the old value hang out.
OK, my concern is it would never go back down again on a quiet system. But I do see where you're coming from, and I'm not against dropping this condition, because the downside of keeping it is we may have some idle callback receivers pushing this metric to zero while others are busy.
So +1 to dropping the whole `else` condition.
I'll run it through some manual testing without the `else`; it should work just fine.
To articulate more specifically what @kdelee is thinking:
I expect that performance testing will launch a large number of jobs all at the same time. These will finish significantly faster than the event processing can, because that's by design. Then at the time the last event is saved, the Redis queue is effectively empty, but the measured event lag will be high. Since it's the last event of a long burst, the callback system will be dead silent after that.
In any case, the graph would be discontinuous. If you do what I'm asking for here, it will remain at a relatively high value after the burst has finished, and drop down to a low (not zero) value upon the first events of the next job that starts. With what this was doing, it would drop to zero right after the burst finishes processing.
@AlanCoding We left a couple more comments below, but I wanted to leave one here too. I would prefer that in the short term we keep the else and actually drop down to zero, since in the future we'll be changing how this behaves anyway. I think it delivers more meaningful information with the included else block, but it's not a deal breaker for me if you think we should drop it in order to get this in.
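To make the trade-off concrete, here is a minimal runnable sketch of the two behaviors being debated; the function, stub class, and flag are illustrative, not the actual AWX implementation:

```python
class SubsystemMetricsStub:
    """Illustrative stand-in for self.subsystem_metrics; only set() is modeled."""

    def __init__(self):
        self.values = {}

    def set(self, field, value):
        self.values[field] = value


def record_avg_processing_seconds(metrics, total_seconds, events_saved, write_zero_when_idle=True):
    # Guard against dividing by zero when no events were saved this cycle.
    if events_saved > 0:
        metrics.set('callback_receiver_event_processing_avg_seconds', total_seconds / events_saved)
    elif write_zero_when_idle:
        # Option A (the PR as written): report 0 on an idle cycle, so the
        # graph drops back down after a burst finishes processing.
        metrics.set('callback_receiver_event_processing_avg_seconds', 0)
    # Option B (the review suggestion): skip the write on idle cycles and
    # let the previous value persist until the next busy cycle.
```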
            m.clear_value(self.conn)

    def clear_values(self, fields=None):
        if not fields:
            fields = self.METRICS.keys()
This seems a little weird. What's the reasoning behind this conditional? Might be worth a comment.
I may or may not have stolen this from @fosterseth's open PR 😅
What I believe is that it's resetting the values at the end of each partition in preparation for the next cycle.
I think the best thing to do would be to revert the changes here before merge. As it stands, it's like adding mystery code.
I have validated that without this, the average seconds just continues to rise vs. being refreshed with the partitioning. I'll leave it in so this is functional, but full credit goes to @fosterseth for it. Here's a Grafana view of jobs over approximately 25 minutes to demonstrate what happens when it's not included.
Oh, I didn't realize it was fixing something real. I guess it's in the nuance of how it interacts with the Python library?
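For context, a simplified sketch of the reset-per-cycle pattern this thread is discussing; the class and metric names are stand-ins for the real subsystem metrics objects, which store their values in Redis rather than plain attributes:

```python
class Gauge:
    """Simplified stand-in for the real metric classes (which persist to Redis)."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount):
        self.value += amount

    def clear_value(self):
        self.value = 0.0


class Metrics:
    def __init__(self):
        # Illustrative field names, not the exact AWX metric registry.
        self.METRICS = {
            'events_saved': Gauge(),
            'total_processing_seconds': Gauge(),
        }

    def clear_values(self, fields=None):
        # No explicit list means "reset every registered metric"; callers
        # can pass a subset to reset only those. Without this periodic
        # reset, the accumulators feeding the average only ever grow, so
        # the reported average seconds rises monotonically.
        if not fields:
            fields = self.METRICS.keys()
        for field in fields:
            self.METRICS[field].clear_value()
```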
Grafana via Prometheus. This metric is a good indicator of how far behind the callback receiver is: the higher the load, the further behind it falls, and the greater the number of seconds the metric will display. A persistently high value may indicate the need for horizontal scaling of the control plane or vertical scaling of the number of callback receivers.
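As a rough illustration of consuming this gauge outside Grafana, here is a hedged Python sketch that scrapes a Prometheus-format metrics endpoint and flags high lag; the URL, token, and threshold are placeholders, so verify the exact path and auth for your deployment:

```python
import requests  # third-party; pip install requests

METRICS_URL = "https://awx.example.com/api/v2/metrics/"  # assumed endpoint; confirm for your install
TOKEN = "<your-api-token>"  # placeholder credential
THRESHOLD_SECONDS = 5.0  # illustrative alerting threshold

resp = requests.get(METRICS_URL, headers={"Authorization": f"Bearer {TOKEN}"})
resp.raise_for_status()
for line in resp.text.splitlines():
    # Data lines look like: metric_name{node="hostname"} 1.23
    if line.startswith("callback_receiver_event_processing_avg_seconds"):
        value = float(line.rsplit(" ", 1)[1])
        if value > THRESHOLD_SECONDS:
            print(f"callback receiver lagging: {line}")
```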
Did a little rebase to get this into one commit; it's passed internal tests on my end as well.
SUMMARY
Spawning out of #12171
ISSUE TYPE
COMPONENT NAME
AWX VERSION
Additional Information
view of Grafana graph example: