
Metrics for callback receiver job event lag #12308

Merged: 1 commit into ansible:devel on Jun 7, 2022

Conversation

@rebeccahhh (Member) commented Jun 1, 2022

SUMMARY

Spun out of #12171

ISSUE TYPE
  • Feature Pull Request
COMPONENT NAME
  • API/Metrics
AWX VERSION
awx: 21.0.1
Additional Information

View of Grafana graph example:
[image: callback_event_processing_seconds]

@rebeccahhh rebeccahhh self-assigned this Jun 1, 2022
@rebeccahhh rebeccahhh marked this pull request as draft June 1, 2022 15:56
@@ -34,7 +34,7 @@ def decode(self, conn):
def to_prometheus(self, instance_data):
output_text = f"# HELP {self.field} {self.help_text}\n# TYPE {self.field} gauge\n"
for instance in instance_data:
output_text += f'{self.field}{{node="{instance}"}} {instance_data[instance][self.field]}\n'
output_text += f'{self.field}{{node="{instance}"}} {instance_data[instance].get(self.field, -1)}\n'  # TODO: fix; the -1 default is necessary when dealing with old instances (e.g. you didn't clean up your database)
rebeccahhh (Member, Author):
@fosterseth do you have any idea why this would have to be? I had some old instances around that necessitated adding a default because they didn't have the new key (i.e. they no longer existed and weren't running, but still got caught in this).
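
For context, a minimal standalone sketch (illustrative instance names and values, not the AWX code itself) of why the .get(self.field, -1) default avoids a KeyError for stale instance records:

field = "callback_receiver_event_processing_avg_seconds"
instance_data = {
    "awx-1": {field: 0.42},
    "stale-node": {},  # old instance record that never reported the new metric
}

output_text = f"# HELP {field} example help text\n# TYPE {field} gauge\n"
for instance in instance_data:
    # Direct indexing would raise KeyError for "stale-node"; .get() falls back to -1.
    output_text += f'{field}{{node="{instance}"}} {instance_data[instance].get(field, -1)}\n'
print(output_text)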

metrics_total_job_event_processing_seconds.total_seconds() / (metrics_bulk_events_saved + metrics_singular_events_saved),
) # (metrics_total_job_event_processing_seconds )
else:
self.subsystem_metrics.set('callback_receiver_event_processing_avg_seconds', 0) # if there is no lag the avg is 0.
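
For readers skimming the hunk, a simplified, self-contained sketch of the averaging being computed here, assuming stand-in values and a plain dict in place of AWX's subsystem metrics helper:

from datetime import timedelta

subsystem_metrics = {}  # stand-in for the Redis-backed subsystem metrics helper

metrics_total_job_event_processing_seconds = timedelta(seconds=12.5)
metrics_bulk_events_saved = 40
metrics_singular_events_saved = 10

events_saved = metrics_bulk_events_saved + metrics_singular_events_saved
if events_saved:
    subsystem_metrics['callback_receiver_event_processing_avg_seconds'] = (
        metrics_total_job_event_processing_seconds.total_seconds() / events_saved
    )
else:
    # The thread below debates whether to write 0 here or leave the previous
    # gauge value untouched when no events were saved in this cycle.
    subsystem_metrics['callback_receiver_event_processing_avg_seconds'] = 0

print(subsystem_metrics)  # {'callback_receiver_event_processing_avg_seconds': 0.25}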
Reviewer (Member):
Suggested change
self.subsystem_metrics.set('callback_receiver_event_processing_avg_seconds', 0) # if there is no lag the avg is 0.
self.subsystem_metrics.set('callback_receiver_event_processing_avg_seconds', 0) # if we wrote no events then the avg processing time is 0.

Reviewer (Member):
I'm only just now catching up on much of what subsystem metrics is doing, and I am very worried that it's saving too many things (already, even before this). I don't agree that 0 is representative of this situation, it's more akin to 0/0. I would prefer to not write in the else: condition here, and let the old value hang out.

Reviewer (Member):
OK, my concern is that it would never go back down again on a quiet system. But I do see where you're coming from, and I'm not against dropping this condition; the downside is that we may end up with some idle callback receivers pushing this metric while others are busy.

Reviewer (Member):
So +1 to dropping the whole else condition.

rebeccahhh (Member, Author):
I'll run it through some manual testing without the else; it should work just fine.

Reviewer (Member):
To articulate more specifically what @kdelee is thinking:

I expect that performance testing will launch a large number of jobs all at the same time. These will finish significantly faster than the event processing can finish, because that's by design. Then at the time the last event is saved, the event lag will be high. At the time of processing, the redis queue is effectively empty, but the measured event lag will be high. Since it's the last event of a long burst, the callback system will be dead silent after that.

In any case, the graph would be discontinuous. If you do what I'm asking for here, it will remain at a relatively high value after the burst has finished, and drop down to a low (not zero) value upon the first events of the next job that starts. With what this was doing, it would drop it down to zero right after the burst finishes processing.

rebeccahhh (Member, Author):
@AlanCoding We left a couple more comments below, but I wanted to leave one here too. I would prefer that in the short term we keep the else and actually drop down, since in the future we'll be changing how this behaves anyway. I think it delivers more meaningful information with the else block included, but it's not a deal breaker for me if you think we should drop it in order to get this in.

m.clear_value(self.conn)
def clear_values(self, fields=None):
if not fields:
fields = self.METRICS.keys()
Reviewer (Member):
This seems a little weird. What's the reasoning behind this conditional? Might be worth a comment.

rebeccahhh (Member, Author):
I may or may not have stolen this from @fosterseth 's open PR 😅
What I believe is that it's resetting the values at the end of each partition in preparation for the next cycle.
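
A hypothetical, self-contained sketch of the pattern being described (the class and metric names are stand-ins, not AWX's Redis-backed implementation): clearing the accumulators at a partition boundary so the next cycle's numbers reflect only recent work, with no fields argument meaning clear everything.

class SubsystemMetricsSketch:
    # Default values per metric; a stand-in for the real METRICS registry.
    METRICS = {
        'callback_receiver_events_in_memory': 0.0,
        'callback_receiver_event_processing_avg_seconds': 0.0,
    }

    def __init__(self):
        self.values = dict(self.METRICS)

    def inc(self, field, value):
        self.values[field] += value

    def clear_values(self, fields=None):
        # No argument: reset every metric; otherwise reset only the named fields.
        if not fields:
            fields = self.METRICS.keys()
        for field in fields:
            self.values[field] = self.METRICS[field]

metrics = SubsystemMetricsSketch()
metrics.inc('callback_receiver_events_in_memory', 500)
metrics.clear_values()  # called at the partition boundary so values don't grow forever
print(metrics.values['callback_receiver_events_in_memory'])  # 0.0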

Reviewer (Member):
I think the best thing to do would be to revert the changes here before merge. As it stands, it's like adding mystery code.

rebeccahhh (Member, Author):
I have validated that without this, the average seconds just continues to rise vs. being refreshed with the partitioning. I'll leave it in so this is functional, but full credit to @fosterseth for it. Grafana graph of jobs over approximately 25 minutes to demonstrate what happens if it's not included:
[image: Grafana graph]

Reviewer (Member):
Oh, I didn't realize it was fixing something real. I guess it's in the nuance of how it interacts with the Python library?

@cypress cypress bot commented Jun 4, 2022


Test summary

629 passed, 0 failed, 830 pending, 0 skipped, 1 flaky


Run details

Project: AWX - Functional
Status: Passed
Commit: 5f9326b
Started: Jun 6, 2022 10:41 PM
Ended: Jun 7, 2022 2:07 AM
Duration: 25:35
OS: Linux Debian 11.3
Browser: Chrome 99

View run in Cypress Dashboard


Flakiness

cypress/integration/projects/project-list.spec.js (1 flaky test)
Search operations - Projects > can sort Projects by name after performing advanced search


@rebeccahhh rebeccahhh marked this pull request as ready for review June 6, 2022 14:30
@rebeccahhh rebeccahhh changed the title from "[WIP] Metrics for callback receiver job event lag" to "Metrics for callback receiver job event lag" on Jun 6, 2022
grafana via prometheus.

This metric is a good indicator of how far behind the callback receiver
is: the higher the load, the further behind it falls and the greater the
number of seconds the metric will display.

A high value may indicate the need for horizontal scaling of the control
plane or for vertically scaling the number of callback receivers.
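
As a usage illustration (the endpoint URL, auth token, and response handling here are assumptions about a typical deployment, not part of this PR), a small script that pulls the Prometheus text output and extracts the new gauge per node, which is what the Grafana panel above graphs:

import re
import requests

# Assumption: metrics are scraped from the AWX metrics endpoint with a token;
# adjust the URL and auth for your deployment.
resp = requests.get(
    "https://awx.example.org/api/v2/metrics/",
    headers={"Authorization": "Bearer <token>", "Accept": "text/plain"},
)

pattern = re.compile(
    r'callback_receiver_event_processing_avg_seconds\{node="([^"]+)"\} (\S+)'
)
for line in resp.text.splitlines():
    match = pattern.match(line)
    if match:
        node, seconds = match.groups()
        # A persistently high value suggests the callback receiver is falling behind.
        print(f"{node}: {seconds}s average event processing time")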
rebeccahhh (Member, Author):
Did a little rebase to get this into one commit; it has passed internal tests on my end as well.

@rebeccahhh rebeccahhh merged commit 76ffdbb into ansible:devel Jun 7, 2022