
Job done isn't always correct #8771

Closed
chrismeyersfsu opened this issue Dec 7, 2020 · 8 comments · Fixed by #12110

Comments

@chrismeyersfsu
Member

chrismeyersfsu commented Dec 7, 2020

After a job is complete we do post-processing work like sending a websocket notification with the status of the job, processing the fact cache, sending notifications, and maybe more. There are 5 cases that are not completely correct.

Notifications are triggered when events processed == events emitted from ansible for all job types other than Job.

  1. Notifications for Jobs are triggered via playbook_on_stats because we need data from the playbook_on_stats event. I think we only do this for convenience's sake. We could trigger the notification after all events are processed and just get the playbook_on_stats event from the DB. This would reduce the deviation in call paths. Note that it's "kind of" safer to do it on playbook_on_stats because the number of events emitted does not have to equal the number of events processed; requiring that equality could lead to notifications never being sent if an event is lost between creation by ansible and being saved to the DB.
  2. EOF. We detect when events processed == events emitted and create a special EOF event at the time we push events into Redis. When this EOF event is received in the callback receiver, we trigger notifications for all jobs other than a Job Template Job. The problem here is that when we process the EOF event in the callback receiver, not all other events for the job may have been processed yet, because we have multiple callback receiver threads. This basically makes the EOF event useless.
  3. Job timeout. Tower has a job timeout feature: if a job runs longer than the timeout, the job is forcefully killed. Combine that with sending notifications on playbook_on_stats events and you can run into a case where notifications are sent twice.
  4. Notification errors. We don't ensure that notifications are actually sent (there is no retry).
  5. Job finishes 5+ seconds after all events have been emitted and saved to the database.

Ideal Solution(s)

Detect events processed == events emitted in the callback receiver (see below for an example of how). If information is needed from playbook_on_stats, query the record from the database. Alternatively, cache the playbook_on_stats info in memory until events processed == events emitted and pass it along with the notification.

We also want to account for sending notifications for jobs when events processed never equals events emitted. This is tricky: how do we know whether the callback receiver is just slow to process events or an event was actually lost?

As for the job timeout double-send case: any time notifications are about to be sent, we should check whether notifications have already been sent and short-circuit if so. This solution can also be applied to the reaper and to the job finishing 5+ seconds after all events are processed.
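A minimal sketch of that short-circuit, assuming a hypothetical boolean field (here called notifications_sent) on the unified job and a hypothetical send_notifications() helper; a single conditional UPDATE lets exactly one caller win the right to send, no matter how many code paths (timeout handling, reaper, normal finish) reach this point:

```python
# Hypothetical sketch only: `notifications_sent` and `send_notifications()` are
# illustrative names, not existing AWX fields/functions.
def claim_notification_send(unified_job):
    """Return True for exactly one caller per job."""
    # A conditional UPDATE ... WHERE notifications_sent = FALSE is atomic at the
    # database level, so only one code path can flip the flag and "win".
    rows = type(unified_job).objects.filter(
        pk=unified_job.pk, notifications_sent=False
    ).update(notifications_sent=True)
    return rows == 1


def maybe_send_notifications(unified_job):
    # Safe to call from the timeout path, the reaper, and the normal finish path.
    if claim_notification_send(unified_job):
        send_notifications(unified_job)  # stand-in for the real notification send
```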

Detect events processed == events emitted in callback receiver

(1) The correct count of "events emitted" is known to the dispatcher process that is running the ansible-runner/playbook; we need to get that information from the dispatcher into the callback receiver. (2) The callback receiver needs to record the number of events processed per job across all of its processes. When the last event is processed, that is when notifications should be sent.

We can continue to use the method we use today to convey the "events emitted" count from the dispatcher processes to the callback receiver, specifically the EOF event. What needs to change is that we need to keep a count of "events processed" per-job in the callback receiver that is shared among all the callback receiver processes. When the last event is processed, the notification should trigger.
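As a rough illustration of the shared per-job counter (not AWX's actual implementation; the key names and the trigger_notifications() hook are made up), the callback receiver processes could share the count through Redis, which they already use for the event queue:

```python
# Sketch under assumptions: key names and trigger_notifications() are illustrative.
import redis

r = redis.Redis()


def record_event_processed(job_id):
    # INCR is atomic, so the count stays correct across all callback receiver processes.
    return r.incr(f'job-{job_id}-events-processed')


def record_eof(job_id, events_emitted):
    # The dispatcher's EOF event carries the total number of events emitted.
    r.set(f'job-{job_id}-events-emitted', events_emitted)


def job_events_complete(job_id):
    emitted = r.get(f'job-{job_id}-events-emitted')
    processed = r.get(f'job-{job_id}-events-processed')
    return emitted is not None and processed is not None and int(processed) >= int(emitted)


def after_event_processed(job_id):
    # Whichever process handles the last event (EOF or otherwise) triggers notifications.
    if job_events_complete(job_id):
        trigger_notifications(job_id)  # stand-in for the real notification path
```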

@AlanCoding
Member

events processed == events emitted

With the receptor integration, I think this is the wrong metric.

Consuming the output from a receptor work node is a serial process. There will be 1 control node emitting events as it gets them from the receptor network. Those will go into redis, and yes, the ordering could shift around in that process, but they are all still local to that node.

Maybe at some point we add in robustness for loss of a control node, but then that means that another node picks up the job with knowledge that the job was at line X.

Either way, once it gets the EOF event, that should be a clear signal that it's done. Since the process is all local to the assigned node in the control plane, it seems like it would be easier to just not do anything that would re-order the events in the first place. In the new system, the EOF event should be completely sufficient to know you have all the events that you are going to get.

@chrismeyersfsu
Member Author

In the new system, the EOF event should be completely sufficient to know you have all the events that you are going to get.

This is true in the existing system too.

Note that in both the current system and the new system, job events can be out of order. This is because we have multiple callback receiver processes processing the Redis callback event queue.

@AlanCoding
Member

Okay, computing events processed is not as bad as I was making it out to be, so it shouldn't be a problem that events remain out of order. If the per-job processed event count is maintained in the memory of the callback receiver processes, then we could know whether the job's events are finished or not.

@ryanpetrello
Contributor

@chrismeyersfsu I took this out of state:in_progress. We might want to reconsider the target milestone for this work, as I believe @chrismeyersfsu has put it down for now. cc @shanemcd @wenottingham

@chrismeyersfsu
Member Author

chrismeyersfsu commented Apr 13, 2021

Yep. I have a PR for this that is pretty much complete. However, it's far more complicated than I originally thought and touches performance-sensitive areas.

WIP https://github.com/ansible/awx/compare/devel...chrismeyersfsu:job_done_done?expand=1

@shanemcd
Member

Moving this to the backlog.

@AlanCoding
Member

I'm thinking more about this as we've seen notification failures before.

I'd prefer an approach that looks like this:

  • we establish 2 points in code that may be the last time we see the job
    • somewhere in the run method in tasks.py
    • somewhere in the processing of the EOF or playbook_on_stats event
  • in each of those places, use select_for_update
  • check a special-purpose UnifiedJob field like ready_for_notifications
  • fully atomically:
    • switch it from False to True if the prior value is False
    • if the prior value is True, then send notifications

Maybe there's something big I'm missing. I think @chrismeyersfsu's solution was trying to wait for all events to come in; I don't quite see why that is necessary as opposed to looking for a single noteworthy event like playbook_on_stats or EOF.
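For what it's worth, a minimal sketch of that atomic flip, assuming the proposed ready_for_notifications boolean exists on UnifiedJob (it does not today) and using a hypothetical send_notifications() helper in place of the real send path:

```python
# Sketch only: ready_for_notifications is a proposed field, send_notifications() a placeholder.
from django.db import transaction

from awx.main.models import UnifiedJob


def touch_notification_point(job_id):
    """Called from both places that may be the last time we see the job."""
    with transaction.atomic():
        # select_for_update blocks the other touchpoint until this transaction
        # commits, so the flip-or-send decision cannot race.
        job = UnifiedJob.objects.select_for_update().get(pk=job_id)
        if not job.ready_for_notifications:
            # First touchpoint to arrive: record that fact and do nothing else.
            job.ready_for_notifications = True
            job.save(update_fields=['ready_for_notifications'])
        else:
            # Second touchpoint: both sides have been reached, so send exactly once.
            send_notifications(job)
```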

@AlanCoding
Member

A currently relevant issue that is part of this epic: #11422
