job output no complete and appear on ERROR in UI Exceeded retries for reading stdout #11803

Closed
chris93111 opened this issue Feb 24, 2022 · 7 comments

@chris93111 (Contributor)

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.

Summary

Hello

With a large job execution, the output in the UI is not complete, and the job finishes with an error.

On the EE worker, the job completes with no problem.
In the AWX EE, "max retries for reading stdout" errors appear.
On the task, an error appears but with no details.

{"zipfile": 12569}
{"status": "successful", "runner_ident": "685"}
{"uuid": "21e6edff-7f4e-48a9-9a51-67bdcd579adb", "counter": 8670, "stdout": "\r\nPLAY RECAP XXXXXXXXXX
DEBUG 2022/02/23 23:59:31 Client connected to control service @
ERROR 2022/02/23 23:59:34 Exceeded retries for reading stdout /tmp/receptor/awx-cfg-awx-6d6cbd4847-p8bzz/LUknGGFE/stdout
WARNING 2022/02/23 23:59:34 Could not read in control service: read unix /var/run/receptor/receptor.sock->@: use of closed network connection
DEBUG 2022/02/23 23:59:34 Client disconnected from control service @
2022-02-24 01:09:08,121 INFO [00ba1d97341a4eae888fa118de06d688] awx.main.commands.run_callback_receiver Event processing is finished for Job 685, sending notifications
2022-02-24 01:09:08,121 INFO [00ba1d97341a4eae888fa118de06d688] awx.main.commands.run_callback_receiver Event processing is finished for Job 685, sending notifications
2022-02-24 01:09:08,872 DEBUG [00ba1d97341a4eae888fa118de06d688] awx.main.tasks job 685 (running) finished running, producing 6519 events.
2022-02-24 01:09:08,878 DEBUG [00ba1d97341a4eae888fa118de06d688] awx.analytics.job_lifecycle job-685 post run
2022-02-24 01:09:08,959 DEBUG [00ba1d97341a4eae888fa118de06d688] awx.analytics.job_lifecycle job-685 finalize run
2022-02-24 01:09:08,971 DEBUG [00ba1d97341a4eae888fa118de06d688] awx.main.dispatch task 67e4e210-6302-4054-a21b-94dbfa9aa0c3 starting awx.main.tasks.update_inventory_computed_fields(*[5])
2022-02-24 01:09:08,988 DEBUG [00ba1d97341a4eae888fa118de06d688] awx.main.models.inventory Going to update inventory computed fields, pk=5
2022-02-24 01:09:09,000 DEBUG [00ba1d97341a4eae888fa118de06d688] awx.main.models.inventory Finished updating inventory computed fields, pk=5, in 0.012 seconds
2022-02-24 01:09:09,122 WARNING [00ba1d97341a4eae888fa118de06d688] awx.main.dispatch job 685 (error) encountered an error (rc=None), please see task stdout for details.
2022-02-24 01:09:09,123 DEBUG [00ba1d97341a4eae888fa118de06d688] awx.main.dispatch task 15fb0ae2-dac5-47b1-b642-f2d69a0228a7 starting awx.main.tasks.handle_work_error(*['15fb0ae2-dac5-47b1-b642-f2d69a0228a7'])
2022-02-24 01:09:09,124 DEBUG [00ba1d97341a4eae888fa118de06d688] awx.main.tasks Executing error task id 15fb0ae2-dac5-47b1-b642-f2d69a0228a7, subtasks: [{'type': 'job', 'id': 685}]
2022-02-24 01:09:09,138 DEBUG [00ba1d97341a4eae888fa118de06d688] awx.main.dispatch task 15fb0ae2-dac5-47b1-b642-f2d69a0228a7 starting awx.main.tasks.handle_work_success(*[])
2022-02-24 01:09:09,138 DEBUG [00ba1d97341a4eae888fa118de06d688] awx.main.dispatch task af969bf8-7aa3-4348-a898-56ef7ba9ff2e starting awx.main.scheduler.tasks.run_task_manager(*[])


AWX version

19.3

Select the relevant components

  • UI
  • API
  • Docs

Installation method

Kubernetes

Modifications

no

Ansible version

2.9

Operating system

Red Hat 8.4

Web browser

Chrome

Steps to reproduce

Launch a large job.

Expected results

Full output with no error.

Actual results

Truncated output and the job ends in error.

Additional information

After many tries, the job output always stops at the same line.

@chris93111 (Contributor, Author) commented Feb 24, 2022

Same error on 20.0.0:

Exceeded retries for reading stdout

2022-02-24 10:56:17,711 DEBUG [e8d4348d6a9849be8edf3ec410de1316] awx.main.models.inventory Finished updating inventory computed fields, pk=5, in 0.013 seconds
2022-02-24 10:56:17,818 WARNING [e8d4348d6a9849be8edf3ec410de1316] awx.main.dispatch job 848 (error) encountered an error (rc=None), please see task stdout for details.
2022-02-24 10:56:17,819 DEBUG [e8d4348d6a9849be8edf3ec410de1316] awx.main.dispatch task 2a092168-6d90-4d5b-a6e7-4d82e7fbd41f starting awx.main.tasks.system.handle_work_error(*['2a092168-6d90-4d5b-a6e7-4d82e7fbd41f'])
2022-02-24 10:56:17,820 DEBUG [e8d4348d6a9849be8edf3ec410de1316] awx.main.tasks.system Executing error task id 2a092168-6d90-4d5b-a6e7-4d82e7fbd41f, subtasks: [{'type': 'job', 'id': 848}]

My EE for job execution is based on quay.io/ansible/ansible-runner:stable-2.9-latest.

UPDATE:
With job slicing it is OK.

Job with 8 hosts: [screenshot]

With slicing: [screenshot]
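For reference, a minimal sketch (not from this thread) of enabling job slicing outside the UI, using the awx.awx.job_template module; the template name, slice count, and connection details are placeholders, and the connection parameter names assume a recent version of the awx.awx collection:

# Hypothetical playbook: enable slicing on an existing job template.
# Each slice runs against a subset of the inventory, so each job pod
# writes a smaller log and is less likely to hit the kubelet's rotation limit.
- hosts: localhost
  gather_facts: false
  tasks:
    - name: Split the large job template into 4 slices
      awx.awx.job_template:
        name: my-large-job                 # placeholder template name
        job_slice_count: 4                 # number of slices; adjust to taste
        controller_host: https://awx.example.com
        controller_username: admin
        controller_password: "{{ awx_admin_password }}"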

@chris93111 changed the title from "job output no complete and appear on ERROR in UI" to "job output no complete and appear on ERROR in UI Exceeded retries for reading stdout" on Feb 24, 2022
@chris93111 (Contributor, Author)

#11338

@chris93111 (Contributor, Author)

Fixed with the kubelet --container-log-max-size option (a sketch of the setting follows below).
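For context, a minimal sketch of that kubelet setting (not from this thread; the file path and size values are assumptions and depend on how your nodes are provisioned):

# Hypothetical kubelet config file, e.g. /var/lib/kubelet/config.yaml on each node.
# Raising containerLogMaxSize keeps the kubelet from rotating the job pod's log
# mid-run, which is what truncates the stdout Receptor is reading.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: 100Mi    # default is 10Mi; 100Mi is only an example
containerLogMaxFiles: 5
# Equivalent command-line flags (older setups):
#   kubelet --container-log-max-size=100Mi --container-log-max-files=5

The kubelet must be restarted on each node for the change to take effect; on managed clusters this usually goes through the node group or launch template configuration rather than editing the file directly.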

@stanislav-zaprudskiy (Contributor)

We observe the same symptoms once in a while, on both 19.4 and 20.0. It is not reproducible, however, and the same job works fine on re-run. Also, the mentioned log entries show up regularly in the AWX EE container regardless of whether there were problems with jobs or not. We also tried experimenting with forks (not slices), but had no luck (even though it did help fix one recurring case). Finally, we tried to reproduce the "log rotation" problem by running a job in debug mode and against many hosts, but it did not reproduce.

What exactly do you mean by saying

With large job execution

Is it about the amount of output (does not seem to make any difference for us), the number of hosts (again, does not seem to impact us), or the execution time (some jobs failed within the first minute, others in their 20th minute)?

@chris93111 (Contributor, Author)

@stanislav-zaprudskiy Yes, for me it is related to log size: with many hosts and many tasks the log is bigger.
My job always fails without --container-log-max-size, and if I split it with slicing it works (fewer hosts per slice, so a smaller log).
In my case it is reproducible.

@astehlik

We were running into the same issue with AWX Operator running on Amazon EKS.

We finally managed to resolve it by adding this to our config as described here:

apiVersion: awx.ansible.com/v1beta1
kind: AWX
spec:
  ee_extra_env: |
    - name: RECEPTOR_KUBE_SUPPORT_RECONNECT
      value: disabled

Our Kubernetes server version:

serverVersion:
  gitCommit: abb98ec0631dfe573ec5eae40dc48fd8f2017424
  gitVersion: v1.24.8-eks-ffeb93d

We had to disable the new reconnect behavior of receptor, because it does not seem to work with EKS even though it is supposed to be compatible with Kubernetes version 1.24.8 and later.

@emoshaya

How do I set RECEPTOR_KUBE_SUPPORT_RECONNECT to disabled for a custom pod spec?
