
automation-job completes successfully but the ops-awx-task and UI show an error and stop producing output #11511

Closed
bschaeffer opened this issue Jan 7, 2022 · 8 comments


@bschaeffer

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I am not entitled to status updates or other assurances.

Summary

We are running AWX 19.5.0 in Kubernetes, and all of the container images reference either the 19.5.0 or latest tags. On December 7th, we noticed that all our daily jobs started erroring out in the UI and the logs would just stop in the middle of the job. This led us to believe that the automation-job pod itself was hanging and had stopped producing log events, and that the control plane would then hit some timeout, kill the job, and mark it as errored.

After some excruciating investigation, we just now realized that the automation-job has been completing successfully this entire time. We couldn't believe we missed it, but it's true. What's even weirder is the resulting log from the ops-awx-task control plane container:

...
awx.analytics.job_lifecycle job-23621 notifications sent
awx.analytics.job_lifecycle job-23621 preparing playbook
awx.analytics.job_lifecycle job-23621 running playbook
awx.analytics.job_lifecycle job-23621 work unit id received
awx.analytics.job_lifecycle job-23621 work unit id assigned
awx.main.commands.run_callback_receiver Event processing is finished for Job 23621, sending notifications
awx.main.commands.run_callback_receiver Event processing is finished for Job 23621, sending notifications
awx.main.tasks job 23621 (running) finished running, producing 7410 events.
awx.analytics.job_lifecycle job-23621 post run
awx.analytics.job_lifecycle job-23621 finalize run
awx.main.dispatch job 23621 (error) encountered an error (rc=None), please see task stdout for details.
awx.main.tasks Executing error task id e5318fd0-16f1-413f-b8c2-528a1a59f29c, subtasks: [{'type': 'job', 'id': 23621}]

Specifically, the line awx.main.tasks job 23621 (running) finished running, producing 7410 events stands out. The 7410 is especially interesting because the UI stops the task output at line 6278, even though the task apparently knows about the other 1,100 or so events.

Also, awx.main.dispatch job 23621 (error) encountered an error (rc=None), please see task stdout for details doesn't make a whole lot of sense, because just by tailing the logs of the automation-job pod I can see that the task runs to completion without a single error in any step.

Wondering what could be going wrong here or if there are any additional debugging steps we can take to track down the problem.

AWX version

19.5.0

Installation method

kubernetes

Modifications

no

Ansible version

No response

Operating system

CentOS? (whatever the awx version is built on)

Web browser

Chrome

Steps to reproduce

This is almost impossible to describe.

Expected results

I would expect the AWX UI to reflect that the job completed successfully and to present all of the output from the task.

Actual results

The output stops partway through in the UI and the job is reported as errored, even though the job itself completed successfully.

Additional information

This was running with no errors prior to December 7th. Our jobs run between 7am and 9am EST and ran successfully on December 6th.

@shanemcd

shanemcd commented Jan 7, 2022

Looks like another instance of #11338 and #9961

@smullenrga

Right, I'm pretty certain I know what's causing the issue, but I have no idea where or how to fix it. AWX's raw log lines from the worker back to the GUI contain the step UUIDs, counter numbers, etc. Docker 1.13 and higher has a 16k log line limit: it splits log lines after 16k and marks the pieces as partial.

If you have a line with a lot of output (debugging hostvars, for example), the log line gets split, and it should be reassembled at the destination by concatenating the partial chunks.

Whatever in AWX is supposed to be concatenating these partial log lines is getting it wrong. It does concatenate the parts, but it's overzealous and concatenates the subsequent log line onto the chunked one too.

As a result, AWX sees a bunch of lines ... counter = 1 ... counter = 2 ... counter = 3 and so on until it hits the chunked log line. Once that large line is hit, it starts out looking normal: there's a UUID, a counter, job info, the output, and the continuation gets concatenated on as expected, so the line ends in a UUID again. Let's say this line is "counter": 21. The problem is that the log line for "counter": 22 is ALSO appended to this line. AWX never sees the line with counter 22 that it's expecting, because line 22 was concatenated onto the end of line 21, and AWX parses line by line. (More precisely, the JSON log line parser reads one line at a time, and when it gets one huge 16K+ line with another log line object appended to it, the parse blows up.) At this point the GUI gives up and reports a failure, even though the raw output of the concatenated lines contains a summary showing that the job completed successfully.
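To illustrate the failure mode (a standalone Python sketch with made-up values, not AWX's actual callback-parsing code): a strict per-line JSON parse blows up as soon as two event objects land on one physical line, while json.JSONDecoder.raw_decode shows that both objects really are there.

import json

# One physical "line" carrying two event objects back to back, the way the
# broken reassembly delivers them (UUIDs and stdout values are made up).
line = ('{"uuid": "aaa", "counter": 21, "stdout": "..."}'
        '{"uuid": "bbb", "counter": 22, "stdout": "..."}')

# Strict per-line parsing, as a naive consumer would do it:
try:
    json.loads(line)
except json.JSONDecodeError as exc:
    print("per-line parse fails:", exc)   # raises "Extra data: ..."

# raw_decode() confirms both events are present on the single line.
decoder = json.JSONDecoder()
pos = 0
while pos < len(line):
    event, pos = decoder.raw_decode(line, pos)
    print("found event with counter", event["counter"])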

For a job I can easily duplicate this with, I simply add 512 "1"s to a variable in a loop and iterate over that loop 35 times (playbook attached: eventmaker.txt).

In my AWX GUI, after running this, I get the raw output followed by the first 33 nicely formatted/expected output lines and a failed status. Copying all of the raw data out into a text editor so I can get line numbers, I see that line 36 is a total of 16,320 bytes and is fine: it starts with '{"uuid": "4ba5523a-b150-48e4-9834-4d84db2ab6bb", "counter": 34' and ends with '"uuid": "4ba5523a-b150-48e4-9834-4d84db2ab6bb"}}' as expected.

Line 37 (output attached too) breaks the 16k barrier and shows the problem. This line starts fine: '{"uuid": "484843d4-7171-4f72-8367-eefe743f4899", "counter": 35,'. The problem is that in the middle of the line, the line with counter 36 is appended to this one: '"uuid": "484843d4-7171-4f72-8367-eefe743f4899"}}{"uuid": "0fa38a44-886d-4ea8-b08e-e366e5b29781", "counter": 36,', where '"uuid": "484843d4-7171-4f72-8367-eefe743f4899"}}' should be the end of line 37 and '{"uuid": "0fa38a44-886d-4ea8-b08e-e366e5b29781", "counter": 36,' should be the start of line 38. There should be a line break between the '}}' ending line 37 and the '{' starting line 38.

16kBugOutput.txt
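To check a raw capture like the attached 16kBugOutput.txt for this symptom, a small hypothetical Python checker along these lines flags any physical line that carries more than one JSON event object:

import json
import sys

def count_events(line):
    """Count how many complete JSON objects sit on one physical line."""
    decoder, pos, count = json.JSONDecoder(), 0, 0
    line = line.strip()
    while pos < len(line):
        try:
            _, pos = decoder.raw_decode(line, pos)
            count += 1
        except json.JSONDecodeError:
            break  # truncated chunk or non-JSON trailer
    return count

# Usage: python check_events.py 16kBugOutput.txt
with open(sys.argv[1], encoding="utf-8") as fh:
    for lineno, line in enumerate(fh, 1):
        n = count_events(line)
        if n > 1:
            print(f"line {lineno}: {n} event objects concatenated together")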

@smullenrga

So, with that noted, what I've been trying to get an answer to is: what's responsible for reassembling the chunked 16k log lines from Docker? Whatever/wherever it is, it's broken. For AWX lines, a chunk ending with '"uuid": "..."}}' should signal the end of that log line, and the next log line from Docker should start the following line.

@shanemcd - do you know if there is an AWX component that's concatenating these partial Docker log lines, or if this is being done by a component elsewhere in the stack (perhaps in the Kubernetes stack)?
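For what it's worth, here is roughly the job that the reassembly step has to do. This is only a sketch under the assumption of containerd/CRI-style node logs, where each chunk on disk is tagged P (partial) or F (full); the path in the usage comment is hypothetical.

import re

# Assumed containerd/CRI node log format, one chunk per physical line:
#   <RFC3339 timestamp> <stream> <P|F> <text>
# P marks a partial chunk of a long logical line, F marks the final chunk.
CRI_CHUNK = re.compile(r"^(\S+) (stdout|stderr) (P|F) (.*)$")

def reassemble(path):
    """Yield complete logical lines, joining P chunks until an F chunk arrives."""
    pending = {"stdout": "", "stderr": ""}
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            match = CRI_CHUNK.match(raw.rstrip("\n"))
            if not match:
                continue  # not a CRI-formatted entry, skip it
            _, stream, tag, text = match.groups()
            pending[stream] += text
            if tag == "F":          # final chunk: the logical line is complete
                yield pending[stream]
                pending[stream] = ""

# Hypothetical usage against a node-level pod log file:
# for line in reassemble("/var/log/pods/<namespace>_<pod>_<uid>/worker/0.log"):
#     print(len(line))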

@smullenrga

smullenrga commented Jan 14, 2022

Quick update, as I'm having time to look at this today. This is a quick-running job, and I was finally able to grab the output from our Kubernetes stack for the spawned worker container. I'm attaching it as well. In this output, line 37 is also hosed. I'm assuming this is the same data that AWX would be receiving from Kubernetes. In short, it seems as though the Kubernetes stack could be where these partial Docker messages are being reassembled incorrectly. There are so many layers being passed through here that it's very difficult to pinpoint the code that could be causing this issue.

automation-job-4588-pl9w2.log

EDIT: The data pulled from the Kubernetes stdout is not identical to what AWX displays. Line 37 IS different/messed up, but it seems to have the first 16K of line 37, a timestamp, the rest of line 37, a timestamp, the start of line 38, a timestamp, the rest of line 38, a timestamp, and so on until we're out of lines that are > 16K (there should be several >16K lines starting at 37). If you look at line 37 in full, it's really messed up after line 38 is concatenated onto it.
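As a rough diagnostic for a capture like that, and assuming the markers interleaved into the long line are RFC3339 timestamps of the kind kubelet/containerd prepend to each chunk, a one-off Python filter can strip them so the underlying chunks can be compared with what AWX shows:

import re
import sys

# RFC3339 / ISO-8601 timestamps such as 2022-01-14T15:04:05.123456789Z
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z ?")

# Usage: python strip_timestamps.py automation-job-4588-pl9w2.log
with open(sys.argv[1], encoding="utf-8") as fh:
    for line in fh:
        sys.stdout.write(TIMESTAMP.sub("", line))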

@bschaeffer

bschaeffer commented Jan 14, 2022

FWIW, this does seem to be related to the container runtime. In GKE, we recently upgraded to 1.20, which defaults to containerd, at the same time we started seeing this error. When we rolled back to 1.19 it was still broken, but we realised it was still using containerd://1.4.8.

When we switched the runtime to docker://19.3.15, it actually fixed our problem.

So things are now in a working state, but that runtime will be EOL for us in about 5 months.

@smullenrga

Thanks for that info. I just tried using Docker Desktop with Kubernetes and AWX 19.2.2, and the job runs fine - FML. I'm guessing this is something in the Kubernetes stack, and unfortunately I'm simply a user in our larger corporate UCP cluster and don't have much freedom there.

@smullenrga

smullenrga commented Jan 18, 2022

I think I've found the root cause of my issues. As mentioned above, running in a vanilla Docker Desktop with Kubernetes, the job runs fine. On our UCP cluster it fails after 30 iterations (item=30 is the last output line displayed in the GUI).

Our UCP cluster is configured to use the GELF logging driver. After changing my Docker Desktop to use GELF, the job that had run fine breaks in the same spot as it does in UCP.

From what I can see, the GELF logging driver in docker (moby) has never been updated to handle partial log lines generated by the 16K docker line limit.
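For reference, pointing Docker Desktop at the GELF driver to reproduce this is a daemon.json change roughly like the following; the gelf-address endpoint is a placeholder for whatever GELF collector you have listening.

{
  "log-driver": "gelf",
  "log-opts": {
    "gelf-address": "udp://127.0.0.1:12201"
  }
}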

EDIT:
I've opened issue #11565 for reference, as well as Docker issue 43162, regarding my findings. The problem has been acknowledged on the Docker issue.

@shanemcd

Closing in favor of #11338
