
Delay update of artifacts until final job save #11832

Merged (3 commits into ansible:devel) on May 3, 2022

Conversation

@AlanCoding (Member) commented Feb 28, 2022

SUMMARY

Connect #11821

The test I had that was failing with that issue is passing with this patch.

EDIT: Since I raised this PR, it appears that the artifacts issue that was causing test failures was fixed by some other change. I still advocate for this code change from the tech-debt perspective.

ISSUE TYPE
  • Bugfix Pull Request
COMPONENT NAME
  • API
ADDITIONAL INFORMATION

I don't know exactly what was happening, but some process was clearly calling .save on the job, which regressed the artifacts field back to {}. Delaying the artifacts write until the final save leaves us in a safer position, because it's less likely that we hit one of these unqualified saves after that point, and less likely that one of those saves holds a stale version of the job.

This might need some unit tests added.

I hope it also addresses the concerns that others brought up about the artifacts save in the callback class. It really shouldn't be touching the database.
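
In rough terms, the pattern is to cache field updates on the callback object and fold them into the final job save. A minimal illustrative sketch (not the merged implementation; names follow the reviewed diff below):

class DelayedUpdateSketch:
    """Illustrative only: accumulate job field updates in memory and apply
    them in a single save at the end of the job."""

    def __init__(self):
        self.extra_update_fields = {}

    def delay_update(self, **kwargs):
        # Cache the updates instead of writing them to the database right away.
        self.extra_update_fields.update(kwargs)

    def get_extra_update_fields(self):
        # Called once, right before the final job save.
        return self.extra_update_fields


# Usage: artifacts are cached during the run, then saved with the final status.
cb = DelayedUpdateSketch()
cb.delay_update(artifacts={'some_fact': 'value'})
final_fields = dict(status='successful', **cb.get_extra_update_fields())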


def update_model(self, pk, _attempt=0, **updates):
    return update_model(self.model, pk, _attempt=0, **updates)

def delay_update(self, **kwargs):

Contributor:

By caching artifact updates on the in-memory object like this, won't we be making more pain for ourselves in the future when we try to move to an async style of processing jobs?

Contributor:

If we want to wait until the end of the job, is it possible to do a final sweep over the job events for that job and accumulate the data that way? Or do we lose this data?

Member Author:

We can save things on the job record and job events in one of two places

  • the control process
  • the callback receiver

The problem with the callback receiver is that it can (notoriously) be delayed, since event processing is the long-tail action: a large queue can build up because we don't have the throughput to save events at the rate we produce them. The job status can flip long before all of its events come in, and this may have nothing to do with the output of that job. Events may come in late because the last job before this one produced more events than could be handled immediately, so all new events have to wait in line.

We cannot have artifacts wait in line with other events. The status changes, and dependent jobs in workflows will start with or without the events saved, and these can use artifacts. Likewise, the event_ct is needed to figure out if all events have been saved in the first place.

@AlanCoding (Member Author):

I realize that we have other saves in receptor.py, which should align with this mechanism. I haven't done that, and I probably should before this is a complete and coherent diff.

# If ansible-runner ran, but an error occurred at runtime, the traceback information
# is saved via the status_handler passed in to the processor.
if state_name == 'Succeeded':
    return res

Member Author:

There's an incongruence here. Yes, if the status_handler received this information then the receptor output should be this... but it's an imperfect metric. These two things are associated, but other events could cause one to be true without the other.

So I forced it by doing this:

diff --git a/awx/main/tasks/callback.py b/awx/main/tasks/callback.py
index ab27ce9f36..fd720ac985 100644
--- a/awx/main/tasks/callback.py
+++ b/awx/main/tasks/callback.py
@@ -220,6 +220,8 @@ class RunnerCallback:
             result_traceback = status_data.get('result_traceback', None)
             if result_traceback:
                 self.delay_update(result_traceback=result_traceback)
+        elif status_data['status'] == 'successful':
+            self.delay_update(result_traceback='alan set this!')
 
 
 class RunnerCallbackForProjectUpdate(RunnerCallback):

And this works as I expected. The new if condition is accurate. I like this also because it avoids running additional receptor commands when they are not necessary. That's something the if here was accomplishing before, and I didn't want to lose that.

@shanemcd (Member):

Please ping me for another review of this once it's ready to go.

@AlanCoding (Member Author):

Sorry, I had neglected to pick up a test fixture change that came from the Django upgrade. That should be fixed with the latest commit; otherwise I was happy with this, and the integration testing looked good.

@phill-holbrook:

Hi all - we've been encountering this issue since upgrading to AWX 20 earlier this week, and it's caused us a fair amount of trouble with several workflows. Are there any workarounds we can implement while waiting for the review to be completed?

@AlanCoding (Member Author):

Trivial conflicts resolved.

No, I don't know of any other easy workaround. It was flaky, but pretty easy to hit the error at least once out of 5 attempts. This patch seemed like the easiest possible fix.

@john-westcott-iv (Member) left a comment:

My only other thought is: could this induce a timing issue? I.e., I launch a job, it's almost done, I click cancel and it cancels, but then this delayed-update thread "finishes" and changes the status from canceled to something else? I haven't really thought through whether that would be possible or not.

else:
    self.extra_update_fields[key] = value

def get_extra_update_fields(self):

Member:

What about changing this name to match the delay_update function? i.e. get_delayed_update_fields.

Comment on lines 577 to 580
original_traceback = self.runner_callback.get_extra_update_fields().get('result_traceback', '')
if 'got an unexpected keyword argument' in original_traceback:
    if ANSIBLE_RUNNER_NEEDS_UPDATE_MESSAGE not in original_traceback:
        self.runner_callback.delay_update(result_traceback=ANSIBLE_RUNNER_NEEDS_UPDATE_MESSAGE)

Member:

One of the things I never really loved was this code being somewhat duplicated here and in receptor.py. Could we just move this into an if statement in delay_update?

Member Author:

Probably. What I think that would look like: if delay_update gets job_explanation or result_traceback and the given string is already in that field, then it does nothing, so we would never repeat the same string twice. Seems reasonable, but I should add a test.
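
A rough sketch of that idea (illustrative only; the exact field handling is an assumption, not the merged code):

def delay_update(self, **kwargs):
    # Sketch: for the accumulating text fields, skip a value that is
    # already recorded so the same message never repeats.
    for key, value in kwargs.items():
        if key in ('job_explanation', 'result_traceback'):
            if value in self.extra_update_fields.get(key, ''):
                continue  # string already present; do nothing
        self.extra_update_fields[key] = value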

@AlanCoding (Member Author):

> My only other thought is: could this induce a timing issue? I.e., I launch a job, it's almost done, I click cancel and it cancels, but then this delayed-update thread "finishes" and changes the status from canceled to something else?

We never call delay_update with "status". That would confuse the domains of responsibility. ansible-runner returns a result which has status, and this is what it's set to. If ansible-runner fires the status_callback with one status, but returns the final object with another status, then that would be a contradiction, but AWX would choose the final object.

It could be the case that the job is on its way to error, but a cancel is processed before it finalizes. In that case, we would get a result_traceback or job_explanation in conjunction with the canceled status. That sounds like desired behavior. I would want that information if I was the user.

Commits pushed:
  • Save tracebacks from receptor module to callback object
  • Move receptor traceback check up to be more logical
  • Use new mock_me fixture to avoid DB call with me method
  • Update the special runner message to the delay_update pattern

@AlanCoding (Member Author):

@john-westcott-iv In the last commit, I've come up with a better-thought-out answer for how to handle your special ansible-runner error message.

I can't bear adding more code into the main (monolithic) code path for running jobs. So the special case is moved to get_extra_update_fields, which is a method called right before we update the final status. This can accumulate whatever post-processing of the error fields we need without looking too messy, IMO.
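
Roughly, the shape I have in mind (a sketch using the constant and substring check from the snippet reviewed earlier; exact field handling is an assumption):

def get_extra_update_fields(self):
    # Sketch: post-process the accumulated error fields in one place,
    # right before the final job save.
    traceback = self.extra_update_fields.get('result_traceback', '')
    if 'got an unexpected keyword argument' in traceback:
        if ANSIBLE_RUNNER_NEEDS_UPDATE_MESSAGE not in traceback:
            self.delay_update(result_traceback=ANSIBLE_RUNNER_NEEDS_UPDATE_MESSAGE)
    return self.extra_update_fields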

@AlanCoding merged commit 452744b into ansible:devel on May 3, 2022