-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Job done isn't always correct #8771
Comments
With the receptor integration, I think this is the wrong metric. Consuming the output from a receptor work node is a serial process. There will be 1 control node emitting events as it gets them from the receptor network. Those will go into redis, and yes, the ordering could shift around in that process, but they are all still local to that node. Maybe at some point we add in robustness for loss of a control node, but then that means that another node picks up the job with knowledge that the job was at line X. Either way, once it gets the EOF event, that should be a clear signal that it's done. Since the process is all local to the assigned node in the control plane, it seems like it would be easier to just not do anything that would re-order the events in the first place. In the new system, the EOF event should be completely sufficient to know you have all the events that you are going to get. |
Note that in the current system, and in the new system job events can be out of order. This is because we have multiple callback receiver processes processing the Redis callback event queue. |
Okay, computing |
@chrismeyersfsu I took this out of |
Yep. I have a PR for this pretty much complete. However, it's far more complicated than I originally thought and touches performance sensitive areas. WIP https://github.com/ansible/awx/compare/devel...chrismeyersfsu:job_done_done?expand=1 |
Moving this to the backlog. |
I'm thinking more about this as we've seen notification failures before. I'd prefer an approach that looks like this:
Maybe there's something big I'm missing. I think maybe @chrismeyersfsu solution was trying to wait for all events to come in. I don't quite see why this is necessary as opposed to looking for a single noteworthy event like playbook_on_stats or EOF. |
Currently relevant issue which is a part of this epic: #11422 |
After a job is complete we do post-things like send websocket notification as to the status of the job, process the fact cache, send notifications, and maybe more. There are 5 cases that are not completely correct.
Notifications are triggered when the events processes == events emitted from ansible for all job types other than Job.
playbook_on_stats
event from the DB. This would reduce the deviation in call paths. Note that it's "kind of" safer to do it onplaybook_on_stats
because the number of events emitted does not have to equal number of events processed. This requirement could lead to notifications never being sent if an event is lost between creation from ansible and saving to the db.EOF
. We detect whenevents processes == events emitted
and create a specialEOF
event at the time we push events into Redis. When thisEOF
event is gotten in the callback receiver, we trigger notification for all jobs other than a Job Template Job. The problem here is that when we processEOF
in the callback receiver, not all other events for the job may be processed. This is because we have multiple callback receiver threads. This basically makes theEOF
event useless.playbook_on_stats
events and you could run into a case where notifications are double sent.Ideal Solution(s)
Detect
events processed == events emitted
in the callback receiver (see below with an example of how). If information is needed fromplaybook_on_stats
, query the record from the database. Alternatively, cache theplaybook_on_stats
info in memory untilevents processed == events emitted
and pass along with the notification.We also want to account for sending events for jobs when
events processed
never equalsevents emitted
. This is tricky. How do we know if the callback receiver is just slow to process events or that the event is lost?As for the job timeout double send case. Any time notifications are to be sent, we should check to see if notifications have already been sent and short-circuit if so. This solution can be applied to the reaper and the job finishing 5+ seconds after all events are processed.
Detect events processed == events emitted in callback receiver
The (1) correct count of "events emitted" is known to the dispatcher process that is running for the ansible-runner/playbook. We need to get that information from the dispatcher into the callback receiver. (2) The callback receiver needs to record number of events processed per-job across all processes. When the last event is processed, that is when notifications should be sent.
We can continue to use the method we use today to convey the "events emitted" count from the dispatcher processes to the callback receiver, specifically the
EOF
event. What needs to change is that we need to keep a count of "events processed" per-job in the callback receiver that is shared among all the callback receiver processes. When the last event is processed, the notification should trigger.The text was updated successfully, but these errors were encountered: