
Add a small flag to job templates to ignore some events #6641

Closed
wants to merge 7 commits

Conversation

matburt (Member) commented Apr 7, 2020

Add a small API-side Job Template flag that will ignore certain events but still allow them to be sent out via logging.

This also updates some contributing docs.

matburt (Member Author) commented Apr 7, 2020

This also removes more references to RabbitMQ and purges the old minishift cluster dev environment.

```diff
@@ -98,7 +98,8 @@ def flush(self, force=False):
                 # work around - if the host disappears before
                 # the event can be processed
                 e.host_id = None
-                e.save()
+                if not e.skip_save():
+                    e.save()
```

Contributor:

I don't think we need this check; if skip_save() is true, doesn't that mean the event won't have been placed into self.buff in the first place (see below)?

matburt (Member Author):

I think you're right; I misunderstood what was happening here as a bailout from the bulk save.

ryanpetrello (Contributor) commented Apr 7, 2020:

The primary use case here for this code block is:

  1. Many events in one bulk save.
  2. One of the events has a host_id reference.
  3. That Host is actually deleted in a parallel request elsewhere.
  4. IntegrityError on the bulk save due to the foreign key error.
  5. Fall back to the (slow, but accurate) per-event loop: for e in events: e.save() (sketched below).
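
A minimal sketch of that fallback pattern (illustrative Django-style code, not the actual AWX worker; the model class and field names here are assumptions):

```python
from django.db import IntegrityError, transaction


def flush_events(model_cls, events):
    """Bulk-save events, falling back to per-row saves on FK conflicts."""
    try:
        with transaction.atomic():
            model_cls.objects.bulk_create(events)   # fast path: one INSERT
    except IntegrityError:
        # e.g. a referenced Host row was deleted by a parallel request
        for e in events:
            try:
                e.save()                             # slow, but isolates the bad row
            except IntegrityError:
                e.host_id = None                     # drop the dangling FK and retry
                e.save()
```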

softwarefactory-project-zuul: Build failed.

```diff
@@ -325,6 +327,11 @@ def _update_from_event_data(self):
         if isinstance(res, dict):
             if res.get('changed', False):
                 self.changed = True
+        if isinstance(self, JobEvent) and not self.job.gather_event_types == 'all':  # Event filtering state machine
+            if not self.failed and self.event_level in [0, 3] and self.job.gather_event_types == "errors":
```

Contributor:

My brain found this code kind of hard to parse. What about something like:

```python
if self.job.gather_event_types in ("none", "errors"):
    self.skip_save = self.event_level in [0, 3]  # skip runner_on and meta events (like verbose/debug/warnings)
    if self.job.gather_event_types == "errors":
        self.skip_save = self.skip_save and not self.failed  # but never skip failures/errors
```

ryanpetrello (Contributor) commented Apr 7, 2020:

Also, should we skip level zero entirely?

[screenshot omitted]

It seems to me like most of these are actually the types of things you'd maybe want to record, especially errors, warnings, deprecation notices. Maybe we should just exclude "runner_on" (level 3) and "verbose" events?

matburt (Member Author):

Debug too, perhaps? deprecated and warning are also typically filtered out by folks?

matburt (Member Author):

It will pick up true errors as identified here.

```diff
@@ -159,6 +159,12 @@ class Meta:
                     "If enabled, Tower will act as an Ansible Fact Cache Plugin; persisting "
                     "facts at the end of a playbook run to the database and caching facts for use by Ansible."),
     )
+    gather_event_types = models.CharField(
```

ryanpetrello (Contributor) commented Apr 7, 2020:

The next thing people are going to ask for is an /api/v2/settings/jobs/ toggle that lets you set the default behavior globally (not just per-JT).

matburt (Member Author):

I guess the question is... how visible do we want to make this? A while ago I would have said let's not show this at all. Now I'm not so sure. It might even be worth exposing this through the UI.

Contributor:

Why is this per-template instead of global?

matburt (Member Author):

I think, at least at first, this needs to be per template... having it global would affect smaller jobs that need more detail.

I feel like a global default could be set later.

```diff
@@ -152,7 +153,8 @@ def perform_work(self, body):
             return
 
         event = cls.create_from_data(**body)
-        self.buff.setdefault(cls, []).append(event)
+        if not event.skip_save:
+            self.buff.setdefault(cls, []).append(event)
```

jakemcdermott (Contributor) commented Apr 7, 2020:

What will the final set of counters look like for jobs that have events that are ignored with this toggle? Will they still be continuous?

ryanpetrello (Contributor) commented Apr 7, 2020:

Also, we have a mechanism in the API for unified jobs called event_processing_finished that compares "what runner emitted" with "what got saved", and we use that to determine whether event processing is actually done (and to detect lost events). The UI and the CLI both rely on this to determine when streaming stdout is actually done. It seems like this PR would break that expectation/contract.

With that in mind, we could implement this filtering on the runner callback side of the equation, but of course that won't work with external logging :/
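
Roughly, the contract being described looks something like this (a simplified sketch, not the actual unified-job property; the field and relation names are assumptions):

```python
def event_processing_finished(job):
    """Return True once every event runner emitted has been saved."""
    if job.status in ('pending', 'waiting', 'running'):
        return False                                 # the job itself isn't done yet
    # runner reports how many events it emitted; compare with what got stored
    return job.emitted_events == job.job_events.count()
```

If events are silently dropped before being saved, those two numbers never converge, which is why the counter increment also has to become conditional (see below).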

Contributor:

For this all to work properly, we'd need code here that conditionally increments the counter:

https://github.com/ansible/awx/blob/devel/awx/main/tasks.py#L1234
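
Something along these lines, where event_ct is the counter being referred to and should_skip() stands in for the new filtering check (both names, and the dispatch call, are illustrative rather than the exact AWX code):

```python
# Illustrative fragment inside the runner event handler: count only events
# that will actually be saved, but keep forwarding every event so external
# logging still sees it.
if not should_skip(event_data):
    self.event_ct += 1
self.dispatcher.dispatch(event_data)
```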

ryanpetrello (Contributor) commented Apr 7, 2020:

(we'd also need to verify that this worked properly with isolated nodes): https://github.com/ansible/awx/blob/devel/awx/main/tasks.py#L1469

matburt (Member Author):

I think the issue about save counts is now resolved. There seems to be a UI race condition now, though.

matburt (Member Author):

I've shifted this code up to this event-data entry point; I had to expose the event level at class scope.

jakemcdermott (Contributor) commented Apr 9, 2020:

Something I noticed when looking at data on the websocket event bus: the final_counter on the EOF event isn't matching up with the stats event. No frontend impact, but probably worth looking into.

matburt (Member Author):

I think this makes sense. The stats event counts the tasks that come through Ansible. The final_counter is counted by the event processor in Tower itself, which happens during event filtering.

jakemcdermott (Contributor) commented Apr 9, 2020:

We agree on the expected behavior you described, but I was unclear: I'm referring to the counter on the job_event that contains the playbook_on_stats data, not what the playbook_on_stats data is counting.

[screenshot omitted]

The EOF final_counter should match the highest job_event counter that will be emitted to a client. If you change the gather_event_types of a job_template to any value but all and run a job, the EOF final_counter doesn't match the highest job_event counter received.

[two screenshots omitted for comparison]

CONTRIBUTING.md (outdated):

````diff
@@ -215,18 +215,23 @@ Using `docker exec`, this will create a session in the running *awx* container,
 If you want to start and use the development environment, you'll first need to bootstrap it by running the following command:
 
 ```bash
-(container)# /bootstrap_development.sh
+(container)# /usr/bin/bootstrap_development.sh
````

Contributor:

Can you open a separate PR for this docs cleanup and minishift removal?

matburt (Member Author):

Done in #6652.

matburt (Member Author) commented Apr 8, 2020

Interestingly, now that I've moved this over to make sure the event counts are right, the UI seems to be missing a task and the recap event, though I see them coming in over the websocket.

[screenshots omitted]

Also, a page refresh will show the events; this is what I would expect to see without a page refresh:

[screenshot omitted]

softwarefactory-project-zuul: Build succeeded.

softwarefactory-project-zuul: Build succeeded.

AlanCoding (Member):

Let me link some stuff:

#4092

The big-picture ask there is that the JT form would have a selector to show event types ("ok", "changed", "failed", etc.) that correspond to the Ansible task status. In any sensible implementation, this would be fed into the existing Ansible settings for this.

ryanpetrello (Contributor) commented Apr 8, 2020

One important thing to note here, @AlanCoding, is that the ask is "to save space, don't save/store certain events in AWX, but continue to send them over the external logger" (which means the solution probably can't come from Ansible).

softwarefactory-project-zuul: Build failed.

matburt (Member Author) commented Apr 8, 2020

recheck

softwarefactory-project-zuul: Build succeeded.

jakemcdermott (Contributor) commented Apr 8, 2020

The UI assigns a special meaning to gaps in the event counters. While the job is still running or processing events, the UI interprets these gaps as "events you can generally expect to arrive eventually; don't immediately dump the ones you've already received to the page just yet".

This flag means that gaps in the event counters of running jobs can represent a different concept. Instead of "events you can most likely expect to arrive eventually, assuming the socket stays connected and nothing has gone majorly wrong on the server", they can now also mean "events that have been pre-filtered out of existence and will never be processed", and the UI has no way of distinguishing between the two kinds of missing event groups.
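
A tiny illustration of the ambiguity (plain Python rather than UI code; the numbers are made up):

```python
# Counters received so far for a running job; 3 and 4 are missing.
received = {1, 2, 5}
gaps = set(range(1, max(received) + 1)) - received   # {3, 4}

# Before this change: gaps mean "still in flight, hold off on rendering".
# After this change:  gaps may also mean "filtered out and never coming".
# Nothing in the payload tells the client which interpretation applies.
```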

There are definitely things we can do and code we can update to make the UI compatible with this feature, but it won't be a quick fix or something we can turn around in a few hours.

Out of curiosity, is there a specific GitHub issue, feature request, email, etc. attached to this PR? I can tell this PR solves a problem, but I'm not sure which one. Knowing some more details would help us correctly budget my / the UI team's time on this.

jakemcdermott (Contributor):

Alright, one thing we'll try on this branch to at least handle the missing (not rendered) stats event problem described above: for jobs that have this new parameter set, the moment the stats event is received we'll just dump whatever events we have and skip to the end of the job with an API call.

That might be enough to get by for now but we'd need to implement it and test it first. I'll push something up within the next day and we can try it out.

matburt (Member Author) commented Apr 8, 2020

Excellent!

jakemcdermott (Contributor):

np

```diff
@@ -29,6 +29,10 @@
 PERM_INVENTORY_CHECK = 'check'
 PERM_INVENTORY_SCAN = 'scan'
 
+JOB_EVENTS_ALL = "all"
+JOB_EVENTS_ONLY_ERRORS = "errors"
+JOB_EVENTS_NONE = "none"
```

Member:

You have 3 options here.

The settings in Ansible core that do almost the same thing have 2 toggles.

https://github.com/ansible/ansible/blob/devel/lib/ansible/plugins/doc_fragments/default_callback.py#L11-L36.

I agree that we should have one more toggle than they have, because verbose events are not a concept in Ansible core. Also, I think it's a reasonable feature suggestion for Ansible to add toggles for display_changed_hosts and display_failed_hosts; I can try to make a PR that does that if you're interested.

In any scenario, I don't see a justification for creating a new kind of event-status-filter logic. It's not just about the difficulty of doing it: users have to learn to use it, we have to maintain it, and so on.

To put forward a proposal, let's say we introduce these new options to the Ansible runner display callback plugin.

  • record_verbose_events
  • record_failed_hosts
  • record_changed_hosts
  • record_ok_hosts
  • record_skipped_hosts

(A necessary piece of technical background: our display callback inherits options like display_ok_hosts.)

Instead of _hosts maybe _events would be nicer.

We can easily add these to the callback plugin. Then it'll be picked up by the job template when the user puts it in the ansible.cfg in their project. This has the added benefit of making it configurable alongside the display of events. Then, if we start to surface these options in the UI at some point, it'll be a coherent, unified interface rather than disjointed implementations that each look as if they were done in ignorance of the others.
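
As a rough sketch of how the proposed toggles could behave (standalone Python; the record_* option names and the event-type mapping are illustrative parts of this proposal, not existing Ansible or AWX settings):

```python
# Proposed toggles with their defaults; leaving all of them on preserves
# today's behavior of recording everything.
PROPOSED_DEFAULTS = {
    'record_verbose_events': True,
    'record_ok_events': True,
    'record_failed_events': True,
    'record_changed_events': True,   # 'changed' would be inspected on ok results
    'record_skipped_events': True,
}

# Which toggle governs which callback event type.
EVENT_TO_OPTION = {
    'verbose': 'record_verbose_events',
    'runner_on_ok': 'record_ok_events',
    'runner_on_failed': 'record_failed_events',
    'runner_on_skipped': 'record_skipped_events',
}

def should_record(event_type, options=PROPOSED_DEFAULTS):
    """Return True if an event of this type should be stored in the database."""
    option = EVENT_TO_OPTION.get(event_type)
    return options.get(option, True) if option else True
```

The user-facing knob would then live in ansible.cfg next to the existing display_* options, rather than in a new AWX-only field.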

Contributor:

@AlanCoding but if we go this route, we won't be able to record the "extra" events to the external logger (maybe @matburt doesn't care about that aspect as much as I thought he did?)

matburt (Member Author):

I feel like you may be missing the whole point of this. The point is not to short-circuit event creation entirely. We definitely want the events to flow through and come out of the Runner interface, especially so they can be propagated down to the logger. We just don't want those events in the database.

It'd be a very short jump to filter these at the Ansible level or, better, the Runner level... but we need these events to come through and at least make it to the logger. That is the point of this PR.

matburt (Member Author):

For what it's worth, there's certainly value in short-circuiting things entirely and filtering at the Runner or Ansible level; that's just not what I'm going for here.

Member:

This PR adds a toggle to the job template form in the UI, and if a user messes with that toggle, some event types stop showing up in the job standard out. Whether or not you intended it, this becomes an implementation of #4092, and because it has first-class visibility to the user, it becomes the canonical implementation.

I'm sure it's fine for the problem you're trying to address; that's not my concern.

Now, anyone who wants a hard filter of event types (the way it probably should be done in most cases) can put it in the ansible.cfg in their project and have it work. It's just all backwards: what you have here is a very specific and technical function for a subset of users, but that's what ends up in the UI, while the Ansible settings are left to the obscure path of Googling things, putting them in source control, and so on.

Member:

Why not a toggle that does what you're doing here, but specifically for events that have empty standard out? Or better yet, give a range of options: (nothing), (log), or (record & log).

Users already have the ability to control what goes to standard out. They can turn on printing for the runner_on_start events, and for other event types they can silence them.

Today, that ability doesn't solve their problem, because our system still records events even when the user turns off their display. The users who wanted this are more sophisticated than I feel this gives them credit for. If they want to hide the events, they can, but it won't reduce database use by much, because the server fights against them by saving the events anyway. I'm all for fixing that without changing the default behavior.

softwarefactory-project-zuul: Build succeeded.

jakemcdermott force-pushed the skip_some_events branch 2 times, most recently from 42f0b4d to 124f075 on April 9, 2020 02:10
softwarefactory-project-zuul: Build succeeded.

softwarefactory-project-zuul: Build succeeded.

matburt and others added 7 commits on April 9, 2020 09:12:

  • This examines the state of the flag and decides which types of events to keep from hitting the database and websocket. Events are still emitted over logging.

  • When a non-default gather_event_types value is found on the job details, there is no expectation that missing counters correspond to job events that are still processing on the backend. When this is the case, dump all events to the page and stop handling new job events once the stats event is received.

  • When a non-default gather_event_types filter is found on the job details, we can't know if a missing event group should be clickable for download, so disable the feature everywhere.
softwarefactory-project-zuul: Build succeeded.

```python
elif isinstance(self, RunJob) and self.instance.gather_event_types in ("none", "errors"):
    event_level = BasePlaybookEvent._level_event(event_data.get("event"))
    failed = event_data.get("event") in BasePlaybookEvent.FAILED_EVENTS
    skip_save = event_level in [0, 3]
```

Member:

What is the intent of this, as opposed to setting skip_save = True? YAGNI?

```python
event_level = BasePlaybookEvent._level_event(event_data.get("event"))
failed = event_data.get("event") in BasePlaybookEvent.FAILED_EVENTS
skip_save = event_level in [0, 3]
if self.instance.gather_event_types == "errors":
```

Member:

You type the string "errors" here, but in other places use the constant JOB_EVENTS_ONLY_ERRORS. If some places are going to use the hard-coded string, maybe all places should hard-code the string.

AlanCoding (Member):

Wrapping up everything I've said here: this PR on your branch effectively does what I suggested:

matburt#8

The pitch for this option is that it gives the user more options, it is another way to solve the same problem that will be preferable in most cases, and its definition is technically unambiguous.

AlanCoding (Member):

I have another question about this. Some events have the field failed switched to True after the job is finished.

```python
JobEvent.objects.filter(
    job_id=self.job_id, uuid__in=failed
).update(failed=True)
```

But at the point that skip_save is evaluated here, the failed property of the job event would still be False, because it's a parent event of another event that winds up failing. A filter in the UI for failed events (after the job finished) would return this event, but with the implementation here those events would be lost.
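
A small illustration of the ordering problem (standalone Python, not AWX code; the event dicts stand in for JobEvent rows):

```python
saved = []

def ingest(event, errors_only=True):
    # The save/skip decision happens at ingest time, before the post-run
    # back-fill that flips 'failed' on parent events.
    if errors_only and not event['failed']:
        return                       # parent event dropped while failed is still False
    saved.append(event)

ingest({'uuid': 'parent-1', 'failed': False})

# After the job finishes, the back-fill would mark the parent as failed,
# but by then there is nothing left to update:
failed_uuids = {'parent-1'}
updated = [e for e in saved if e['uuid'] in failed_uuids]
assert updated == []                 # the event was never saved
```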

Sorry if this question has already been asked.
