
Delay update of artifacts until final job save #11832

Merged (3 commits into ansible:devel) on May 3, 2022

Conversation

@AlanCoding (Member) commented Feb 28, 2022

SUMMARY

Connect #11821

The test I had that was failing with that issue is passing with this patch.

EDIT: Since I raised this PR, it appears that the artifacts issue that was causing test failures was fixed by some other change. I still advocate for this code change from the tech-debt perspective.

ISSUE TYPE
  • Bugfix Pull Request
COMPONENT NAME
  • API
ADDITIONAL INFORMATION

I don't know exactly what was happening, but some process was clearly calling .save on the job, which regressed the artifacts field back to {}. Delaying the artifacts write until the final save leaves us in a safer position, because it's less likely that we hit one of these unqualified saves after that point, and less likely that one of those saves holds a stale version of the job.

This might need some unit tests added.

I hope it also addresses the concerns that others brought up about the artifacts save in the callback class. It really shouldn't be touching the database.
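
In rough terms, the pattern is to cache field updates on the callback object and fold them into the final job save. A minimal illustrative sketch (not the merged implementation; names follow the reviewed diff below):

class DelayedUpdateSketch:
    """Illustrative only: accumulate job field updates in memory and apply
    them in a single save at the end of the job."""

    def __init__(self):
        self.extra_update_fields = {}

    def delay_update(self, **kwargs):
        # Cache the updates instead of writing them to the database right away.
        self.extra_update_fields.update(kwargs)

    def get_extra_update_fields(self):
        # Called once, right before the final job save.
        return self.extra_update_fields


# Usage: artifacts are cached during the run, then saved with the final status.
cb = DelayedUpdateSketch()
cb.delay_update(artifacts={'some_fact': 'value'})
final_fields = dict(status='successful', **cb.get_extra_update_fields())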


def update_model(self, pk, _attempt=0, **updates):
    return update_model(self.model, pk, _attempt=0, **updates)

def delay_update(self, **kwargs):

Contributor:

By caching artifact updates on the in-memory object like this, won't we be making more pain for ourselves in the future when we try to move to an async style of processing jobs?

Contributor:

If we want to wait until the end of the job, is it possible to do a final sweep over the job events for that job and accumulate the data that way? Or do we lose this data?

Member Author:

We can save things on the job record and job events in one of two places

  • the control process
  • the callback receiver

The problem with the callback receiver is that it can (notoriously) be delayed, since event processing is the long-tail action: a large queue can build up because we don't have the throughput to save events at the rate we produce them. The job status can flip long before all of its events come in, and this may have nothing to do with the output of that job. Events may come in late because the last job before this one produced more events than could be handled immediately, so all new events have to wait in line.

We cannot have artifacts wait in line with other events. The status changes, and dependent jobs in workflows will start with or without the events saved, and these can use artifacts. Likewise, the event_ct is needed to figure out if all events have been saved in the first place.

@AlanCoding (Member Author):

I realize that we have other saves in receptor.py, which should align with this mechanism. I haven't done that, and I probably should before this is a complete and coherent diff.

# If ansible-runner ran, but an error occurred at runtime, the traceback information
# is saved via the status_handler passed in to the processor.
if state_name == 'Succeeded':
    return res

Member Author:

There's an incongruence here. Yes, if the status_handler received this information then the receptor output should be this... but it's an imperfect metric. These two things are associated, but other events could cause one to be true without the other.

So I forced it by doing this:

diff --git a/awx/main/tasks/callback.py b/awx/main/tasks/callback.py
index ab27ce9f36..fd720ac985 100644
--- a/awx/main/tasks/callback.py
+++ b/awx/main/tasks/callback.py
@@ -220,6 +220,8 @@ class RunnerCallback:
             result_traceback = status_data.get('result_traceback', None)
             if result_traceback:
                 self.delay_update(result_traceback=result_traceback)
+        elif status_data['status'] == 'successful':
+            self.delay_update(result_traceback='alan set this!')
 
 
 class RunnerCallbackForProjectUpdate(RunnerCallback):

And this works as I expected. The new if condition is accurate. I like this also because it avoids running additional receptor commands when they are not necessary. That's something the if here was accomplishing before, and I didn't want to lose that.

@shanemcd (Member):

Please ping me for another review of this once it's ready to go.

@AlanCoding (Member Author):

Sorry, I had neglected to pick up a test fixture change that came from the Django upgrade. That should be fixed with the latest commit; otherwise I was happy with this, and the integration testing looked good.

@phill-holbrook:

Hi all - we've been encountering this issue since upgrading to AWX 20 earlier this week, and it's caused us a fair amount of trouble with several workflows. Are there any workarounds we can implement while waiting for the review to be completed?

@AlanCoding (Member Author):

Trivial conflicts resolved.

No, I don't know of any other easy workaround. It was flaky, but pretty easy to hit the error at least once out of 5 attempts. This patch seemed like the easiest possible fix.

@john-westcott-iv (Member) left a comment:

My only other thought is: could this induce a timing issue? I.e., I launch a job, it's almost done, I click cancel and it cancels, but then this delayed-update thread "finishes" and changes the status from canceled to something else? I haven't really thought through whether that would be possible or not.

else:
    self.extra_update_fields[key] = value

def get_extra_update_fields(self):

Member:

What about changing this name to match the delay_update function? i.e. get_delayed_update_fields.

Comment on lines 577 to 580
original_traceback = self.runner_callback.get_extra_update_fields().get('result_traceback', '')
if 'got an unexpected keyword argument' in original_traceback:
    if ANSIBLE_RUNNER_NEEDS_UPDATE_MESSAGE not in original_traceback:
        self.runner_callback.delay_update(result_traceback=ANSIBLE_RUNNER_NEEDS_UPDATE_MESSAGE)

Member:

One of the things I never really loved was this code being somewhat duplicated here and in receptor.py. Could we just move this into an if statement in delay_update?

Member Author:

Probably. What I think that would look like: if delay_update gets job_explanation or result_traceback and the given string is already in that field, then it does nothing, so we would never repeat the same string twice. Seems reasonable, but I should add a test.
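
A rough sketch of that idea (illustrative only; the exact field handling is an assumption, not the merged code):

def delay_update(self, **kwargs):
    # Sketch: for the accumulating text fields, skip a value that is
    # already recorded so the same message never repeats.
    for key, value in kwargs.items():
        if key in ('job_explanation', 'result_traceback'):
            if value in self.extra_update_fields.get(key, ''):
                continue  # string already present; do nothing
        self.extra_update_fields[key] = value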

@AlanCoding (Member Author):

> My only other thought is: could this induce a timing issue? I.e., I launch a job, it's almost done, I click cancel and it cancels, but then this delayed-update thread "finishes" and changes the status from canceled to something else?

We never call delay_update with "status". That would confuse the domains of responsibility. ansible-runner returns a result which has status, and this is what it's set to. If ansible-runner fires the status_callback with one status, but returns the final object with another status, then that would be a contradiction, but AWX would choose the final object.

It could be the case that the job is on its way to error, but a cancel is processed before it finalizes. In that case, we would get a result_traceback or job_explanation in conjunction with the canceled status. That sounds like desired behavior. I would want that information if I was the user.

Commits pushed:
  • Save tracebacks from receptor module to callback object
  • Move receptor traceback check up to be more logical
  • Use new mock_me fixture to avoid DB call with me method
  • Update the special runner message to the delay_update pattern

@AlanCoding (Member Author):

@john-westcott-iv In the last commit, I've come up with a better-thought-out answer for how to handle your special ansible-runner error message.

I can't bear adding more code into the main (monolithic) code path for running jobs. So the special case is moved to get_extra_update_fields, which is a method called right before we update the final status. This can accumulate whatever post-processing of the error fields we need without looking too messy, IMO.
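
Roughly, the shape I have in mind (a sketch using the constant and substring check from the snippet reviewed earlier; exact field handling is an assumption):

def get_extra_update_fields(self):
    # Sketch: post-process the accumulated error fields in one place,
    # right before the final job save.
    traceback = self.extra_update_fields.get('result_traceback', '')
    if 'got an unexpected keyword argument' in traceback:
        if ANSIBLE_RUNNER_NEEDS_UPDATE_MESSAGE not in traceback:
            self.delay_update(result_traceback=ANSIBLE_RUNNER_NEEDS_UPDATE_MESSAGE)
    return self.extra_update_fields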

@AlanCoding merged commit 452744b into ansible:devel on May 3, 2022