
Job done isn't always correct #8771

Closed
chrismeyersfsu opened this issue Dec 7, 2020 · 8 comments · Fixed by #12110

Comments

@chrismeyersfsu
Member

chrismeyersfsu commented Dec 7, 2020

After a job is complete we do post-processing work like sending a websocket notification with the status of the job, processing the fact cache, sending notifications, and maybe more. There are 5 cases that are not completely correct.

Notifications are triggered when events processed == events emitted from ansible for all job types other than Job.

  1. Notifications for Jobs are triggered via playbook_on_stats because we need data from the playbook_on_stats event. I think we only do this for convenience's sake. We could trigger the notification after all events are processed and just get the playbook_on_stats event from the DB. This would reduce the deviation in call paths. Note that it's "kind of" safer to do it on playbook_on_stats because the number of events emitted does not have to equal the number of events processed; requiring that equality could lead to notifications never being sent if an event is lost between creation by ansible and being saved to the DB.
  2. EOF. We detect when events processed == events emitted and create a special EOF event at the time we push events into Redis. When this EOF event is received in the callback receiver, we trigger notifications for all jobs other than a Job Template Job. The problem here is that when we process the EOF event in the callback receiver, not all other events for the job may have been processed yet, because we have multiple callback receiver threads. This basically makes the EOF event useless.
  3. Job timeout. Tower has a job timeout feature: if a job runs longer than the timeout, the job is forcefully killed. Combine that with sending notifications on playbook_on_stats events and you can run into a case where notifications are sent twice.
  4. Notification errors. We don't ensure that notifications are actually sent (there is no retry).
  5. Job finishes 5+ seconds after all events have been emitted and saved to the database.

Ideal Solution(s)

Detect events processed == events emitted in the callback receiver (see below for an example of how). If information is needed from playbook_on_stats, query the record from the database. Alternatively, cache the playbook_on_stats info in memory until events processed == events emitted and pass it along with the notification.

We also want to account for sending notifications for jobs when events processed never equals events emitted. This is tricky: how do we know whether the callback receiver is just slow to process events or an event was actually lost?

As for the job timeout double-send case: any time notifications are about to be sent, we should check whether notifications have already been sent and short-circuit if so. This solution can also be applied to the reaper and to the job finishing 5+ seconds after all events are processed.
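A minimal sketch of that short-circuit, assuming a hypothetical boolean field (here called notifications_sent) on the unified job and a hypothetical send_notifications() helper; a single conditional UPDATE lets exactly one caller win the right to send, no matter how many code paths (timeout handling, reaper, normal finish) reach this point:

```python
# Hypothetical sketch only: `notifications_sent` and `send_notifications()` are
# illustrative names, not existing AWX fields/functions.
def claim_notification_send(unified_job):
    """Return True for exactly one caller per job."""
    # A conditional UPDATE ... WHERE notifications_sent = FALSE is atomic at the
    # database level, so only one code path can flip the flag and "win".
    rows = type(unified_job).objects.filter(
        pk=unified_job.pk, notifications_sent=False
    ).update(notifications_sent=True)
    return rows == 1


def maybe_send_notifications(unified_job):
    # Safe to call from the timeout path, the reaper, and the normal finish path.
    if claim_notification_send(unified_job):
        send_notifications(unified_job)  # stand-in for the real notification send
```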

Detect events processed == events emitted in callback receiver

(1) The correct count of "events emitted" is known to the dispatcher process that is running the ansible-runner/playbook; we need to get that information from the dispatcher into the callback receiver. (2) The callback receiver needs to record the number of events processed per job across all of its processes. When the last event is processed, that is when notifications should be sent.

We can continue to use the method we use today to convey the "events emitted" count from the dispatcher processes to the callback receiver, specifically the EOF event. What needs to change is that we need to keep a count of "events processed" per-job in the callback receiver that is shared among all the callback receiver processes. When the last event is processed, the notification should trigger.
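As a rough illustration of the shared per-job counter (not AWX's actual implementation; the key names and the trigger_notifications() hook are made up), the callback receiver processes could share the count through Redis, which they already use for the event queue:

```python
# Sketch under assumptions: key names and trigger_notifications() are illustrative.
import redis

r = redis.Redis()


def record_event_processed(job_id):
    # INCR is atomic, so the count stays correct across all callback receiver processes.
    return r.incr(f'job-{job_id}-events-processed')


def record_eof(job_id, events_emitted):
    # The dispatcher's EOF event carries the total number of events emitted.
    r.set(f'job-{job_id}-events-emitted', events_emitted)


def job_events_complete(job_id):
    emitted = r.get(f'job-{job_id}-events-emitted')
    processed = r.get(f'job-{job_id}-events-processed')
    return emitted is not None and processed is not None and int(processed) >= int(emitted)


def after_event_processed(job_id):
    # Whichever process handles the last event (EOF or otherwise) triggers notifications.
    if job_events_complete(job_id):
        trigger_notifications(job_id)  # stand-in for the real notification path
```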

@AlanCoding
Member

events processed == events emitted

With the receptor integration, I think this is the wrong metric.

Consuming the output from a receptor work node is a serial process. There will be 1 control node emitting events as it gets them from the receptor network. Those will go into redis, and yes, the ordering could shift around in that process, but they are all still local to that node.

Maybe at some point we add in robustness for loss of a control node, but then that means that another node picks up the job with knowledge that the job was at line X.

Either way, once it gets the EOF event, that should be a clear signal that it's done. Since the process is all local to the assigned node in the control plane, it seems like it would be easier to just not do anything that would re-order the events in the first place. In the new system, the EOF event should be completely sufficient to know you have all the events that you are going to get.

@chrismeyersfsu
Member Author

In the new system, the EOF event should be completely sufficient to know you have all the events that you are going to get.

This is true in the existing system too.

Note that in both the current system and the new system, job events can be out of order. This is because we have multiple callback receiver processes processing the Redis callback event queue.

@AlanCoding
Member

Okay, computing events processed is not as bad as I was making it out to be, so it shouldn't be a problem that events remain out of order. If the per-job processed event count is maintained in the memory of the callback receiver processes, then we could know whether the job's events are finished or not.

@ryanpetrello
Contributor

@chrismeyersfsu I took this out of state:in_progress. We might want to reconsider the target milestone for this work, as I believe @chrismeyersfsu has put it down for now. cc @shanemcd @wenottingham

@chrismeyersfsu
Member Author

chrismeyersfsu commented Apr 13, 2021

Yep. I have a PR for this that is pretty much complete. However, it's far more complicated than I originally thought and touches performance-sensitive areas.

WIP https://github.com/ansible/awx/compare/devel...chrismeyersfsu:job_done_done?expand=1

@shanemcd
Member

Moving this to the backlog.

@AlanCoding
Member

I'm thinking more about this as we've seen notification failures before.

I'd prefer an approach that looks like this:

  • we establish 2 points in code that may be the last time we see the job
    • somewhere in the run method in tasks.py
    • somewhere in the processing of the EOF or playbook_on_stats event
  • in each of those places, use select_for_update
  • check a special-purpose UnifiedJob field like ready_for_notifications
  • fully atomically:
    • switch it from False to True if the prior value is False
    • if the prior value is True, then send notifications

Maybe there's something big I'm missing. I think @chrismeyersfsu's solution was trying to wait for all events to come in; I don't quite see why that is necessary as opposed to looking for a single noteworthy event like playbook_on_stats or EOF.
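For what it's worth, a minimal sketch of that atomic flip, assuming the proposed ready_for_notifications boolean exists on UnifiedJob (it does not today) and using a hypothetical send_notifications() helper in place of the real send path:

```python
# Sketch only: ready_for_notifications is a proposed field, send_notifications() a placeholder.
from django.db import transaction

from awx.main.models import UnifiedJob


def touch_notification_point(job_id):
    """Called from both places that may be the last time we see the job."""
    with transaction.atomic():
        # select_for_update blocks the other touchpoint until this transaction
        # commits, so the flip-or-send decision cannot race.
        job = UnifiedJob.objects.select_for_update().get(pk=job_id)
        if not job.ready_for_notifications:
            # First touchpoint to arrive: record that fact and do nothing else.
            job.ready_for_notifications = True
            job.save(update_fields=['ready_for_notifications'])
        else:
            # Second touchpoint: both sides have been reached, so send exactly once.
            send_notifications(job)
```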

@AlanCoding
Member

A currently relevant issue that is part of this epic: #11422
