Add a small flag to job templates to ignore some events #6641
Conversation
This also removes more references to rabbitmq and purges the old minishift cluster dev environment
awx/main/dispatch/worker/callback.py (Outdated)

```
@@ -98,7 +98,8 @@ def flush(self, force=False):
    # work around - if the host disappears before
    # the event can be processed
    e.host_id = None
    e.save()
    if not e.skip_save():
```
I don't think we need this check; if `skip_save()` is `False`, doesn't that mean the event won't have been placed into `self.buff` in the first place (see below)?
I think you're right, I misunderstood what was happening here as a bailout from the bulk save.
The primary use case for this code block is:
- Many events in one bulk save.
- One of the events has a `host_id` reference.
- That Host is actually deleted in a parallel request elsewhere.
- `IntegrityError` on the bulk save due to the foreign key error.
- Fall back to the (slow, but accurate) `for e in events: e.save()`.
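The fallback described in the list above can be sketched in plain Python. This is a minimal illustration of the pattern, not the actual AWX code: the class names and the fake `bulk_create` are stand-ins, and `IntegrityError` here is a local stub for `django.db.IntegrityError`.

```python
# Sketch of the bulk-save fallback: try one bulk insert; if a foreign key
# has vanished (IntegrityError), retry events one at a time so only the
# bad event has its dangling reference cleared.

class IntegrityError(Exception):
    """Stand-in for django.db.IntegrityError."""

class FakeEvent:
    def __init__(self, host_id):
        self.host_id = host_id
        self.saved = False

    def save(self):
        # Pretend host_id 999 points at a Host deleted in parallel.
        if self.host_id == 999:
            raise IntegrityError("FK violation: host no longer exists")
        self.saved = True

def bulk_create(events):
    # All-or-nothing: any bad FK fails the whole batch.
    if any(e.host_id == 999 for e in events):
        raise IntegrityError("FK violation in batch")
    for e in events:
        e.saved = True

def flush(events):
    try:
        bulk_create(events)          # fast path
    except IntegrityError:
        for e in events:             # slow, but accurate
            try:
                e.save()
            except IntegrityError:
                e.host_id = None     # clear the dangling reference
                e.save()

events = [FakeEvent(1), FakeEvent(999), FakeEvent(2)]
flush(events)
assert all(e.saved for e in events)
```

The fast path keeps the common case cheap; the per-event retry only pays its cost when a concurrent delete actually happened.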
Build failed.
awx/main/models/events.py (Outdated)

```
@@ -325,6 +327,11 @@ def _update_from_event_data(self):
    if isinstance(res, dict):
        if res.get('changed', False):
            self.changed = True
    if isinstance(self, JobEvent) and not self.job.gather_event_types == 'all':  # Event filtering state machine
        if not self.failed and self.event_level in [0,3] and self.job.gather_event_types == "errors":
```
My brain found this code kind of hard to parse. What about something like:

```python
if self.job.gather_event_types in ("none", "errors"):
    self.skip_save = self.event_level in [0, 3]  # skip runner_on and meta events (like verbose/debug/warnings)
    if self.job.gather_event_types == "errors":
        self.skip_save = self.skip_save and not self.failed  # never skip failures/errors
```
Debug too, perhaps? Deprecated and warning are also typically filtered out by folks.
it will pick up true errors as identified here.
```
@@ -159,6 +159,12 @@ class Meta:
    "If enabled, Tower will act as an Ansible Fact Cache Plugin; persisting "
    "facts at the end of a playbook run to the database and caching facts for use by Ansible."),
)
gather_event_types = models.CharField(
```
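For illustration, the new field's behavior can be sketched in plain Python (Django is omitted so the sketch runs standalone). The field name and the three values come from this PR's diff; the validation and default shown here are assumptions.

```python
# Hypothetical sketch of the gather_event_types field on a job template.
# The constants mirror the ones added in this PR; the class is a stand-in
# for the real Django model.

JOB_EVENTS_ALL = "all"
JOB_EVENTS_ONLY_ERRORS = "errors"
JOB_EVENTS_NONE = "none"
GATHER_EVENT_TYPES_CHOICES = (JOB_EVENTS_ALL, JOB_EVENTS_ONLY_ERRORS, JOB_EVENTS_NONE)

class JobTemplateSketch:
    def __init__(self, gather_event_types=JOB_EVENTS_ALL):
        if gather_event_types not in GATHER_EVENT_TYPES_CHOICES:
            raise ValueError(f"invalid gather_event_types: {gather_event_types!r}")
        self.gather_event_types = gather_event_types

jt = JobTemplateSketch(gather_event_types="errors")
assert jt.gather_event_types == "errors"
```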
The next thing people are going to ask for is an `/api/v2/settings/jobs/` toggle that lets you set the default behavior globally (not just per-JT).
I guess the question is... how visible do we want to make this? A while ago I would have said... let's not show this at all. Now I'm not so sure. It might be worth having this exposed through the UI.
why is this per-template instead of global?
I think, at least at first, this needs to be per template... having it global would affect smaller jobs that need more detail.
I feel like a global default could be set later.
```
@@ -152,7 +153,8 @@ def perform_work(self, body):
         return

     event = cls.create_from_data(**body)
-    self.buff.setdefault(cls, []).append(event)
+    if not event.skip_save:
+        self.buff.setdefault(cls, []).append(event)
```
What will the final set of counters look like for jobs that have events that are ignored with this toggle? Will they still be continuous?
Also, we have this mechanism in the API for unified jobs called `event_processing_finished`, where it compares "what runner emitted" with "what got saved", and we use that to determine whether event processing is actually done (and to detect lost events). The UI and the CLI both rely on this to determine when streaming stdout is actually done. It seems like this PR would break that expectation/contract.
With that in mind, we could implement this filtering on the runner callback side of the equation, but of course that won't work with external logging :/
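The contract described above can be sketched as a tiny predicate. This is a simplified illustration of the idea, not the actual AWX implementation; the function name and arguments are hypothetical stand-ins for the real bookkeeping.

```python
# Hedged sketch of the "event_processing_finished" contract: processing is
# considered done only when the number of saved events catches up with the
# number of events the runner reported emitting.

def event_processing_finished(job_finished, emitted_events, saved_events):
    """Return True once the job ended AND every emitted event was saved."""
    if not job_finished:
        return False
    return saved_events >= emitted_events

# If events are filtered out before saving, saved_events can never reach
# emitted_events, so consumers polling this predicate would wait forever.
assert event_processing_finished(True, 100, 100) is True
assert event_processing_finished(True, 100, 60) is False   # filtered events look "lost"
assert event_processing_finished(False, 100, 100) is False
```

This is why the counter increment has to become conditional: the "emitted" side of the comparison must only count events that were actually eligible to be saved.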
For this to all work properly, we'd need code here that conditionally incremented the counter:
https://github.com/ansible/awx/blob/devel/awx/main/tasks.py#L1234
(we'd also need to verify that this worked properly with isolated nodes): https://github.com/ansible/awx/blob/devel/awx/main/tasks.py#L1469
I think the issue about save counts is now resolved. There seems to be a UI race condition now, though.
I've shifted this code up to this event data entrypoint; to do so, I had to expose the event level at class scope.
Something I noticed when looking at data on the websocket event bus: the `final_counter` on the EOF event isn't matching up with the stats event. No frontend impact, but probably worth looking into.
I think this makes sense. The stats event is counting the tasks that come through Ansible. The `final_counter` is counted by the event processor in Tower itself, which occurs during filtering of the events.
We agree on the expected behavior you described, but I was unclear. I'm referring to the `counter` on the job_event that contains the `playbook_on_stats` data, not what the `playbook_on_stats` data is counting.

The EOF `final_counter` should match the highest job_event counter that will be emitted to a client. If you change the `gather_event_types` of a job_template to any value but `all` and run a job, the EOF `final_counter` doesn't match the highest job_event counter received.
CONTRIBUTING.md (Outdated)

````
@@ -215,18 +215,23 @@ Using `docker exec`, this will create a session in the running *awx* container,
 If you want to start and use the development environment, you'll first need to bootstrap it by running the following command:

 ```bash
-(container)# /bootstrap_development.sh
+(container)# /usr/bin/bootstrap_development.sh
```
````
Can you open a separate PR for this docs cleanup and minishift removal?
done in #6652
Build succeeded.

Force-pushed from fe3f490 to 2613947.

Build succeeded.

Force-pushed from 2613947 to 6052f08.
Let me link some stuff: the big-picture ask there is that the JT form would have a selector showing event types of "ok, changed, failed, etc." which correspond to the Ansible task status. In any sensible implementation, this would be fed into the existing Ansible settings for this.
One important thing to note here @AlanCoding is that the ask is "to save space, don't save/store certain events in AWX, but continue to send them over the external logger" (which means the solution probably can't come from Ansible).
Build failed.

recheck

Build succeeded.
The UI assigns a special meaning to gaps in the event counters. When the job is still running or processing events, these gaps are interpreted by the UI as "events you can generally expect to arrive eventually; don't immediately dump the ones you've already received to the page just yet".

This flag means that gaps in event counters of running jobs can represent a different concept. Instead of "events you can most likely expect to arrive eventually, assuming the socket stays connected and something hasn't majorly gone wrong on the server", they can now also mean "events that have been pre-filtered out of existence and won't ever be processed", and the UI has no way of distinguishing between the two types of missing event groups. There are definitely things we can do, code we can update, etc. to make the UI compatible with this feature, but it won't be a quick fix or something we can turn around in a few hours.

Out of curiosity, is there a specific GitHub issue, feature request, email, etc. attached to this PR? I can tell this PR solves a problem, but I'm not sure which one it solves. Knowing some more details here would help us correctly budget my / the UI team's time on this.
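The buffering behavior described above can be sketched with a small helper. This is an illustrative stand-in for the UI's logic, not its actual code; the function name and shape are assumptions.

```python
# Sketch of counter-gap handling: the client only renders the contiguous
# prefix of event counters, assuming every gap will eventually be filled
# by a late-arriving event.

def flushable_prefix(received_counters, last_flushed=0):
    """Return the highest counter that can be safely rendered."""
    expected = last_flushed + 1
    have = set(received_counters)
    while expected in have:
        expected += 1
    return expected - 1

# Without filtering: counters 1..5 arrive out of order but complete,
# so everything can be rendered.
assert flushable_prefix([2, 1, 3, 5, 4]) == 5

# With filtering: counter 3 was never saved, so rendering stalls at 2,
# even though the job may already be finished -- the client cannot tell
# a pre-filtered hole from an in-flight event.
assert flushable_prefix([1, 2, 4, 5]) == 2
```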
Alright, so one thing I think we will try on this branch, to at least handle the missing (not rendered) stats event problem described above, is to just dump whatever events we have and skip to the end of the job with an API call the moment the stats event is received, for jobs that have this new parameter set. That might be enough to get by for now, but we'd need to implement it and test it first. I'll push something up within the next day and we can try it out.
Excellent!
```
@@ -29,6 +29,10 @@
 PERM_INVENTORY_CHECK = 'check'
 PERM_INVENTORY_SCAN = 'scan'

+JOB_EVENTS_ALL = "all"
+JOB_EVENTS_ONLY_ERRORS = "errors"
+JOB_EVENTS_NONE = "none"
```
You have 3 options here. The settings in Ansible core that do almost the same thing have 2 toggles. I agree that we should have 1 more toggle than what they have, because verbose events are not a concept in Ansible core. Also, I think it's a reasonable feature suggestion for Ansible to add toggles for `display_changed_hosts` and `display_failed_hosts`. I can try to make a PR which does that if you're interested.

In any scenario, I don't see a justification for creating a new kind of event-status-filter logic. It's not about the difficulty of doing it: users themselves have to learn to use it, we have to maintain it, and so on...
To put forward a proposal, let's say we introduce these new options to the Ansible Runner display callback plugin:
- record_verbose_events
- record_failed_hosts
- record_changed_hosts
- record_ok_hosts
- record_skipped_hosts

(Necessary piece of technical background: our display callback inherits options like `display_ok_hosts`.) Instead of `_hosts`, maybe `_events` would be nicer.

We can easily add these to the callback plugin. Then it'll be picked up by the job template as the user puts it in their `ansible.cfg` in the project. This has the added benefit of making it configurable along with the display of events. Then if we start to surface these options in the UI at some point, it'll be a coherent unified interface, and not disjointed implementations that all seem like they were done ignorant of the others.
@AlanCoding but if we go this route, we won't be able to record the "extra" events to the external logger (maybe @matburt doesn't care about that aspect as much as I thought he did?)
I feel like you may be missing the whole point of this. The point is not to short circuit event creation entirely. We definitely want the events to flow through and come out of the Runner interface especially in order to be propagated down to the logger. We just don't want those events in the database.
It'd be a very short jump to filter these at the Ansible level or, better, the Runner level... but we need these events to come through and at least make it to the logger. That is the point of this PR.
For what it's worth, there's certainly value in short circuiting things entirely and filtering at the Runner or Ansible level, that's just not what I'm going for here.
This PR adds a toggle in the job template form in the UI, and if a user messes with that toggle, some event types stop showing up in the job standard out. Whether or not you intended it, that becomes an implementation of #4092, and since it has first-class visibility to the user, it becomes the canonical implementation.
I'm sure it's fine for the problem you're trying to address, that's not my concern.
Now, for anyone to do a hard filter of event types (the way it probably should be done in most cases), they can put it in `ansible.cfg` in their project and have it work. It's just all backwards. What you have here is a very specific and technical function for a subset of users, but that's what gets in the UI, while the Ansible settings are left to the obscure route of Googling stuff, putting it in source control, and so on.
Why not a toggle to do what you're doing here, but specifically for events that have empty standard out? Or better yet, give a range of options: (nothing), (log), or (record & log).

Users already have the ability to control what goes in standard out. They have the ability to print to standard out for the runner-on-start events. For other event types, they have the ability to silence them.

Today, that ability does not solve the problem they have, because our system still records events even if the user turns off the display of them. The users who wanted this are more sophisticated than this gives them credit for. If they wanted to hide the events, they can do it, but it won't reduce database use by much, because the server fights against them by saving everything anyway. I'm all for fixing that without changing the default behavior.
Force-pushed from d73ce89 to 72ce2ed.

Build succeeded.

Force-pushed from 42f0b4d to 124f075.

Build succeeded.
Force-pushed from 3f333bb to 52c2dfe.

Force-pushed from 52c2dfe to 6026ea8.

Force-pushed from c0b7908 to a4bcebd.

Build succeeded.
This examines the state of the flag and decides what types of events to keep from hitting the database and websocket. Events are still emitted over logging.
When a non-default gather_event_types value is found on the job details, there is no expectation that missing counters correspond to job events that are still processing on the backend. When this is the case, dump all events to the page and stop handling new job events once the stats event is received.
When a non-default gather_event_types filter is found on the job details, we can't know if a missing event group should be clickable for download so disable the feature everywhere.
Force-pushed from a6a122d to 3222bd4.

Build succeeded.
```python
elif isinstance(self, RunJob) and self.instance.gather_event_types in ("none", "errors"):
    event_level = BasePlaybookEvent._level_event(event_data.get("event"))
    failed = event_data.get("event") in BasePlaybookEvent.FAILED_EVENTS
    skip_save = event_level in [0,3]
```
What is the intent of this, as opposed to setting `skip_save = True`? YAGNI?
```python
event_level = BasePlaybookEvent._level_event(event_data.get("event"))
failed = event_data.get("event") in BasePlaybookEvent.FAILED_EVENTS
skip_save = event_level in [0,3]
if self.instance.gather_event_types == "errors":
```
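Pulling the two hunks above together, the filtering decision can be written as one standalone function. This is a consolidated sketch for readability, not the code in this branch; the failed-event check is reduced to a boolean argument.

```python
# Consolidated sketch of the skip_save decision implied by the hunks above.

def compute_skip_save(gather_event_types, event_level, failed):
    if gather_event_types == "all":
        return False                    # default: record everything
    skip = event_level in (0, 3)        # runner_on / meta events (verbose, debug, warnings)
    if gather_event_types == "errors":
        skip = skip and not failed      # failures are always kept
    return skip

assert compute_skip_save("all", 0, False) is False
assert compute_skip_save("none", 0, False) is True
assert compute_skip_save("errors", 0, True) is False   # failed event: kept
assert compute_skip_save("errors", 0, False) is True   # chatty event: dropped
assert compute_skip_save("errors", 2, False) is False  # other levels: kept
```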
You type the string "errors" here, but in other places use the variable `JOB_EVENTS_ONLY_ERRORS`. If some places are going to use the hard-coded string, maybe all places should hard-code the string.
Wrapping up everything I've said here: this PR on your branch effectively does what I suggested. The pitch for this option is that it gives the user more options, it is another way to solve the same problem that will be preferable in most cases, and its definition is technically unambiguous.
I have another question about this. Some events have the field referenced at lines 355 to 357 in 155a1d9. But at the point that ... Sorry if this question had already been asked.
Add a small api-side Job Template flag that will ignore certain events but still allow them to be sent out via logging.
This also updates some contributing docs