
Add a small flag to job templates to ignore some events #6641

Closed
wants to merge 7 commits

Conversation

matburt (Member) commented Apr 7, 2020

Add a small API-side Job Template flag that will ignore certain events but still allow them to be sent out via logging.

This also updates some contributing docs.

matburt (Member Author) commented Apr 7, 2020

This also removes more references to RabbitMQ and purges the old minishift cluster dev environment.

```diff
@@ -98,7 +98,8 @@ def flush(self, force=False):
                 # work around - if the host disappears before
                 # the event can be processed
                 e.host_id = None
-                e.save()
+                if not e.skip_save():
+                    e.save()
```

Contributor:

I don't think we need this check; if skip_save() is true, doesn't that mean the event won't have been placed into self.buff in the first place (see below)?

matburt (Member Author):

I think you're right; I misunderstood what was happening here as a bailout from the bulk save.

ryanpetrello (Contributor) commented Apr 7, 2020:

The primary use case here for this code block is:

  1. Many events in one bulk save.
  2. One of the events has a host_id reference.
  3. That Host is actually deleted in a parallel request elsewhere.
  4. IntegrityError on the bulk save due to the foreign key error.
  5. Fall back to the (slow, but accurate) per-event loop: for e in events: e.save() (sketched below).
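
A minimal sketch of that fallback pattern (illustrative Django-style code, not the actual AWX worker; the model class and field names here are assumptions):

```python
from django.db import IntegrityError, transaction


def flush_events(model_cls, events):
    """Bulk-save events, falling back to per-row saves on FK conflicts."""
    try:
        with transaction.atomic():
            model_cls.objects.bulk_create(events)   # fast path: one INSERT
    except IntegrityError:
        # e.g. a referenced Host row was deleted by a parallel request
        for e in events:
            try:
                e.save()                             # slow, but isolates the bad row
            except IntegrityError:
                e.host_id = None                     # drop the dangling FK and retry
                e.save()
```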

softwarefactory-project-zuul: Build failed.

```diff
@@ -325,6 +327,11 @@ def _update_from_event_data(self):
         if isinstance(res, dict):
             if res.get('changed', False):
                 self.changed = True
+        if isinstance(self, JobEvent) and not self.job.gather_event_types == 'all':  # Event filtering state machine
+            if not self.failed and self.event_level in [0, 3] and self.job.gather_event_types == "errors":
```

Contributor:

My brain found this code kind of hard to parse. What about something like:

```python
if self.job.gather_event_types in ("none", "errors"):
    self.skip_save = self.event_level in [0, 3]  # skip runner_on and meta events (like verbose/debug/warnings)
    if self.job.gather_event_types == "errors":
        self.skip_save = self.skip_save and not self.failed  # but never skip failures/errors
```

ryanpetrello (Contributor) commented Apr 7, 2020:

Also, should we skip level zero entirely?

[screenshot omitted]

It seems to me like most of these are actually the types of things you'd maybe want to record, especially errors, warnings, deprecation notices. Maybe we should just exclude "runner_on" (level 3) and "verbose" events?

matburt (Member Author):

Debug too, perhaps? deprecated and warning are also typically filtered out by folks?

matburt (Member Author):

It will pick up true errors as identified here.

```diff
@@ -159,6 +159,12 @@ class Meta:
                     "If enabled, Tower will act as an Ansible Fact Cache Plugin; persisting "
                     "facts at the end of a playbook run to the database and caching facts for use by Ansible."),
     )
+    gather_event_types = models.CharField(
```

ryanpetrello (Contributor) commented Apr 7, 2020:

The next thing people are going to ask for is an /api/v2/settings/jobs/ toggle that lets you set the default behavior globally (not just per-JT).

matburt (Member Author):

I guess the question is... how visible do we want to make this? A while ago I would have said let's not show this at all. Now I'm not so sure. It might even be worth exposing this through the UI.

Contributor:

Why is this per-template instead of global?

matburt (Member Author):

I think, at least at first, this needs to be per template... having it global would affect smaller jobs that need more detail.

I feel like a global default could be set later.

```diff
@@ -152,7 +153,8 @@ def perform_work(self, body):
             return
 
         event = cls.create_from_data(**body)
-        self.buff.setdefault(cls, []).append(event)
+        if not event.skip_save:
+            self.buff.setdefault(cls, []).append(event)
```

jakemcdermott (Contributor) commented Apr 7, 2020:

What will the final set of counters look like for jobs that have events that are ignored with this toggle? Will they still be continuous?

ryanpetrello (Contributor) commented Apr 7, 2020:

Also, we have a mechanism in the API for unified jobs called event_processing_finished that compares "what runner emitted" with "what got saved", and we use that to determine whether event processing is actually done (and to detect lost events). The UI and the CLI both rely on this to determine when streaming stdout is actually done. It seems like this PR would break that expectation/contract.

With that in mind, we could implement this filtering on the runner callback side of the equation, but of course that won't work with external logging :/
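
Roughly, the contract being described looks something like this (a simplified sketch, not the actual unified-job property; the field and relation names are assumptions):

```python
def event_processing_finished(job):
    """Return True once every event runner emitted has been saved."""
    if job.status in ('pending', 'waiting', 'running'):
        return False                                 # the job itself isn't done yet
    # runner reports how many events it emitted; compare with what got stored
    return job.emitted_events == job.job_events.count()
```

If events are silently dropped before being saved, those two numbers never converge, which is why the counter increment also has to become conditional (see below).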

Contributor:

For this all to work properly, we'd need code here that conditionally increments the counter:

https://github.com/ansible/awx/blob/devel/awx/main/tasks.py#L1234
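
Something along these lines, where event_ct is the counter being referred to and should_skip() stands in for the new filtering check (both names, and the dispatch call, are illustrative rather than the exact AWX code):

```python
# Illustrative fragment inside the runner event handler: count only events
# that will actually be saved, but keep forwarding every event so external
# logging still sees it.
if not should_skip(event_data):
    self.event_ct += 1
self.dispatcher.dispatch(event_data)
```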

ryanpetrello (Contributor) commented Apr 7, 2020:

(we'd also need to verify that this worked properly with isolated nodes): https://github.com/ansible/awx/blob/devel/awx/main/tasks.py#L1469

matburt (Member Author):

I think the issue about save counts is now resolved. There seems to be a UI race condition now, though.

matburt (Member Author):

I've shifted this code up to this event-data entry point; I had to expose the event level at class scope.

jakemcdermott (Contributor) commented Apr 9, 2020:

Something I noticed when looking at data on the websocket event bus: the final_counter on the EOF event isn't matching up with the stats event. No frontend impact, but probably worth looking into.

matburt (Member Author):

I think this makes sense. The stats event counts the tasks that come through Ansible. The final_counter is counted by the event processor in Tower itself, which happens during event filtering.

jakemcdermott (Contributor) commented Apr 9, 2020:

We agree on the expected behavior you described, but I was unclear: I'm referring to the counter on the job_event that contains the playbook_on_stats data, not what the playbook_on_stats data is counting.

[screenshot omitted]

The EOF final_counter should match the highest job_event counter that will be emitted to a client. If you change the gather_event_types of a job_template to any value but all and run a job, the EOF final_counter doesn't match the highest job_event counter received.

[two screenshots omitted for comparison]

CONTRIBUTING.md (outdated):

````diff
@@ -215,18 +215,23 @@ Using `docker exec`, this will create a session in the running *awx* container,
 If you want to start and use the development environment, you'll first need to bootstrap it by running the following command:
 
 ```bash
-(container)# /bootstrap_development.sh
+(container)# /usr/bin/bootstrap_development.sh
````

Contributor:

Can you open a separate PR for this docs cleanup and minishift removal?

matburt (Member Author):

Done in #6652.

matburt (Member Author) commented Apr 8, 2020

Interestingly, now that I've moved this over to make sure the event counts are right, the UI seems to be missing a task and the recap event, though I see them coming in over the websocket.

[screenshots omitted]

Also, a page refresh will show the events; this is what I would expect to see without a page refresh:

[screenshot omitted]

softwarefactory-project-zuul: Build succeeded.

softwarefactory-project-zuul: Build succeeded.

AlanCoding (Member):

Let me link some stuff:

#4092

The big-picture ask there is that the JT form would have a selector to show event types ("ok", "changed", "failed", etc.) that correspond to the Ansible task status. In any sensible implementation, this would be fed into the existing Ansible settings for this.

ryanpetrello (Contributor) commented Apr 8, 2020

One important thing to note here, @AlanCoding, is that the ask is "to save space, don't save/store certain events in AWX, but continue to send them over the external logger" (which means the solution probably can't come from Ansible).

softwarefactory-project-zuul: Build failed.

matburt (Member Author) commented Apr 8, 2020

recheck

softwarefactory-project-zuul: Build succeeded.

jakemcdermott (Contributor) commented Apr 8, 2020

The UI assigns a special meaning to gaps in the event counters. While the job is still running or processing events, the UI interprets these gaps as "events you can generally expect to arrive eventually; don't immediately dump the ones you've already received to the page just yet".

This flag means that gaps in the event counters of running jobs can represent a different concept. Instead of "events you can most likely expect to arrive eventually, assuming the socket stays connected and nothing has gone majorly wrong on the server", they can now also mean "events that have been pre-filtered out of existence and will never be processed", and the UI has no way of distinguishing between the two kinds of missing event groups.
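
A tiny illustration of the ambiguity (plain Python rather than UI code; the numbers are made up):

```python
# Counters received so far for a running job; 3 and 4 are missing.
received = {1, 2, 5}
gaps = set(range(1, max(received) + 1)) - received   # {3, 4}

# Before this change: gaps mean "still in flight, hold off on rendering".
# After this change:  gaps may also mean "filtered out and never coming".
# Nothing in the payload tells the client which interpretation applies.
```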

There are definitely things we can do and code we can update to make the UI compatible with this feature, but it won't be a quick fix or something we can turn around in a few hours.

Out of curiosity, is there a specific GitHub issue, feature request, email, etc. attached to this PR? I can tell this PR solves a problem, but I'm not sure which one. Knowing some more details would help us correctly budget my / the UI team's time on this.

jakemcdermott (Contributor):

Alright, one thing we'll try on this branch to at least handle the missing (not rendered) stats event problem described above: for jobs that have this new parameter set, the moment the stats event is received we'll just dump whatever events we have and skip to the end of the job with an API call.

That might be enough to get by for now but we'd need to implement it and test it first. I'll push something up within the next day and we can try it out.

matburt (Member Author) commented Apr 8, 2020

Excellent!

jakemcdermott (Contributor):

np

```diff
@@ -29,6 +29,10 @@
 PERM_INVENTORY_CHECK = 'check'
 PERM_INVENTORY_SCAN = 'scan'
 
+JOB_EVENTS_ALL = "all"
+JOB_EVENTS_ONLY_ERRORS = "errors"
+JOB_EVENTS_NONE = "none"
```

Member:

You have 3 options here.

The settings in Ansible core that do almost the same thing have 2 toggles.

https://github.com/ansible/ansible/blob/devel/lib/ansible/plugins/doc_fragments/default_callback.py#L11-L36.

I agree that we should have one more toggle than they have, because verbose events are not a concept in Ansible core. Also, I think it's a reasonable feature suggestion for Ansible to add toggles for display_changed_hosts and display_failed_hosts; I can try to make a PR that does that if you're interested.

In any scenario, I don't see a justification for creating a new kind of event-status-filter logic. It's not just about the difficulty of doing it: users have to learn to use it, we have to maintain it, and so on.

To put forward a proposal, let's say we introduce these new options to the Ansible runner display callback plugin.

  • record_verbose_events
  • record_failed_hosts
  • record_changed_hosts
  • record_ok_hosts
  • record_skipped_hosts

(A necessary piece of technical background: our display callback inherits options like display_ok_hosts.)

Instead of _hosts maybe _events would be nicer.

We can easily add these to the callback plugin. Then it'll be picked up by the job template when the user puts it in the ansible.cfg in their project. This has the added benefit of making it configurable alongside the display of events. Then, if we start to surface these options in the UI at some point, it'll be a coherent, unified interface rather than disjointed implementations that each look as if they were done in ignorance of the others.
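
As a rough sketch of how the proposed toggles could behave (standalone Python; the record_* option names and the event-type mapping are illustrative parts of this proposal, not existing Ansible or AWX settings):

```python
# Proposed toggles with their defaults; leaving all of them on preserves
# today's behavior of recording everything.
PROPOSED_DEFAULTS = {
    'record_verbose_events': True,
    'record_ok_events': True,
    'record_failed_events': True,
    'record_changed_events': True,   # 'changed' would be inspected on ok results
    'record_skipped_events': True,
}

# Which toggle governs which callback event type.
EVENT_TO_OPTION = {
    'verbose': 'record_verbose_events',
    'runner_on_ok': 'record_ok_events',
    'runner_on_failed': 'record_failed_events',
    'runner_on_skipped': 'record_skipped_events',
}

def should_record(event_type, options=PROPOSED_DEFAULTS):
    """Return True if an event of this type should be stored in the database."""
    option = EVENT_TO_OPTION.get(event_type)
    return options.get(option, True) if option else True
```

The user-facing knob would then live in ansible.cfg next to the existing display_* options, rather than in a new AWX-only field.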

Contributor:

@AlanCoding but if we go this route, we won't be able to record the "extra" events to the external logger (maybe @matburt doesn't care about that aspect as much as I thought he did?)

matburt (Member Author):

I feel like you may be missing the whole point of this. The point is not to short-circuit event creation entirely. We definitely want the events to flow through and come out of the Runner interface, especially so they can be propagated down to the logger. We just don't want those events in the database.

It'd be a very short jump to filter these at the Ansible level or, better, the Runner level... but we need these events to come through and at least make it to the logger. That is the point of this PR.

matburt (Member Author):

For what it's worth, there's certainly value in short-circuiting things entirely and filtering at the Runner or Ansible level; that's just not what I'm going for here.

Member:

This PR adds a toggle to the job template form in the UI, and if a user messes with that toggle, some event types stop showing up in the job standard out. Whether or not you intended it, this becomes an implementation of #4092, and because it has first-class visibility to the user, it becomes the canonical implementation.

I'm sure it's fine for the problem you're trying to address; that's not my concern.

Now, anyone who wants a hard filter of event types (the way it probably should be done in most cases) can put it in the ansible.cfg in their project and have it work. It's just all backwards: what you have here is a very specific and technical function for a subset of users, but that's what ends up in the UI, while the Ansible settings are left to the obscure path of Googling things, putting them in source control, and so on.

Member:

Why not a toggle that does what you're doing here, but specifically for events that have empty standard out? Or better yet, give a range of options: (nothing), (log), or (record & log).

Users already have the ability to control what goes to standard out. They can turn on printing for the runner_on_start events, and for other event types they can silence them.

Today, that ability doesn't solve their problem, because our system still records events even when the user turns off their display. The users who wanted this are more sophisticated than I feel this gives them credit for. If they want to hide the events, they can, but it won't reduce database use by much, because the server fights against them by saving the events anyway. I'm all for fixing that without changing the default behavior.

softwarefactory-project-zuul: Build succeeded.

jakemcdermott force-pushed the skip_some_events branch 2 times, most recently from 42f0b4d to 124f075 on April 9, 2020 02:10
softwarefactory-project-zuul: Build succeeded.

softwarefactory-project-zuul: Build succeeded.

matburt and others added 7 commits on April 9, 2020 09:12:

  • This examines the state of the flag and decides which types of events to keep from hitting the database and websocket. Events are still emitted over logging.

  • When a non-default gather_event_types value is found on the job details, there is no expectation that missing counters correspond to job events that are still processing on the backend. When this is the case, dump all events to the page and stop handling new job events once the stats event is received.

  • When a non-default gather_event_types filter is found on the job details, we can't know if a missing event group should be clickable for download, so disable the feature everywhere.
softwarefactory-project-zuul: Build succeeded.

```python
elif isinstance(self, RunJob) and self.instance.gather_event_types in ("none", "errors"):
    event_level = BasePlaybookEvent._level_event(event_data.get("event"))
    failed = event_data.get("event") in BasePlaybookEvent.FAILED_EVENTS
    skip_save = event_level in [0, 3]
```

Member:

What is the intent of this, as opposed to setting skip_save = True? YAGNI?

```python
event_level = BasePlaybookEvent._level_event(event_data.get("event"))
failed = event_data.get("event") in BasePlaybookEvent.FAILED_EVENTS
skip_save = event_level in [0, 3]
if self.instance.gather_event_types == "errors":
```

Member:

You type the string "errors" here, but in other places use the constant JOB_EVENTS_ONLY_ERRORS. If some places are going to use the hard-coded string, maybe all places should hard-code the string.

AlanCoding (Member):

Wrapping up everything I've said here: this PR on your branch effectively does what I suggested:

matburt#8

The pitch for this option is that it gives the user more options, it is another way to solve the same problem that will be preferable in most cases, and its definition is technically unambiguous.

AlanCoding (Member):

I have another question about this. Some events have the field failed switched to True after the job is finished.

```python
JobEvent.objects.filter(
    job_id=self.job_id, uuid__in=failed
).update(failed=True)
```

But at the point that skip_save is evaluated here, the failed property of the job event would still be False, because it's a parent event of another event that winds up failing. A filter in the UI for failed events (after the job finished) would return this event, but with the implementation here those events would be lost.
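
A small illustration of the ordering problem (standalone Python, not AWX code; the event dicts stand in for JobEvent rows):

```python
saved = []

def ingest(event, errors_only=True):
    # The save/skip decision happens at ingest time, before the post-run
    # back-fill that flips 'failed' on parent events.
    if errors_only and not event['failed']:
        return                       # parent event dropped while failed is still False
    saved.append(event)

ingest({'uuid': 'parent-1', 'failed': False})

# After the job finishes, the back-fill would mark the parent as failed,
# but by then there is nothing left to update:
failed_uuids = {'parent-1'}
updated = [e for e in saved if e['uuid'] in failed_uuids]
assert updated == []                 # the event was never saved
```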

Sorry if this question has already been asked.
