
AWX 19.1: jobs stuck in "running" for over 12 hours but not actually doing anything at all #10151

lemmy04 opened this issue May 10, 2021 · 23 comments


@lemmy04

lemmy04 commented May 10, 2021

ISSUE TYPE
  • Bug Report
SUMMARY

Since yesterday my AWX does not work anymore - jobs get created and marked as started, but never actually run, i.e. no worker container is created and no actual work is done.

ENVIRONMENT
  • AWX version: 19.1.0
  • AWX install method: k3s on debian buster, awx deployed with operator 0.9.0
  • Ansible version: whichever is in awx-ee 0.2.0
  • Operating System: whichever is used in the images? CentOS?
  • Web Browser: any.
STEPS TO REPRODUCE

I start a new job by manually launching either an SCM update or a job/workflow template:

  1. job gets created
  2. job goes from pending to running
  3. NO new worker container is created in k3s
  4. NO actual work is done, no standard output on the job, job never finishes
EXPECTED RESULTS

I would expect the job to either work or fail at some point with a (meaningful) error message.

ACTUAL RESULTS

The job never actually starts - it just sits there in "running", but nothing is actually happening.

ADDITIONAL INFORMATION

No errors on the K3s side of things - it looks more like AWX never actually even tries to talk to K3s anymore.
BUT: ad-hoc jobs still work... this only affects templates and SCM updates (or maybe more, I have not tried every possible thing).
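A quick way to double-check the missing worker pod is to watch the namespace while launching a job (a sketch, assuming the operator's default awx namespace):

kubectl get pods -n awx -w

Normally an automation-job-<id>-* pod should appear and then terminate; in my case nothing ever shows up.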

@supremedacos

I see this happening as well, except my playbooks successfully start and complete and the k8s pod terminates cleanly, but the UI still shows the job in the Running state.

@wenottingham
Contributor

Hi - can you get the logs from the awx-task container related to the launch of the job? We'd need more information to determine what's going on here.
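For reference, something along these lines should pull them (assuming the operator's default awx namespace and deployment name):

kubectl logs -n awx deployment/awx -c awx-task --tail=500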

@lemmy04
Author

lemmy04 commented May 15, 2021

... I haven't touched Kubernetes at all before AWX 18 came out... what exactly do you need?

@lemmy04
Author

lemmy04 commented May 15, 2021

2021-05-15 06:03:29,735 DEBUG    [7d8140cba46b442caf544b74ee77a3e5] awx.main.dispatch task ac491d10-dfb9-4384-a0f7-ff0bed326388 starting awx.main.tasks.cluster_node_heartbeat(*[])
2021-05-15 06:03:29,738 DEBUG    [7d8140cba46b442caf544b74ee77a3e5] awx.main.tasks Cluster node heartbeat task.
Instance Group already registered tower
(changed: True)
Instance Group already registered tower
(changed: True)
Instance Group already registered tower
(changed: True)
Instance Group already registered tower
(changed: True)
Instance Group already registered tower
(changed: True)
2021-05-15 06:03:29,944 WARNING  [7d8140cba46b442caf544b74ee77a3e5] awx.main.dispatch scaling up worker pid:4637
2021-05-15 06:03:29,930 DEBUG    [7d8140cba46b442caf544b74ee77a3e5] awx.main.dispatch task 069050b2-4f4a-4a40-ae4d-22e3d12e8692 starting awx.main.tasks.awx_periodic_scheduler(*[])
2021-05-15 06:03:29,929 DEBUG    [7d8140cba46b442caf544b74ee77a3e5] awx.main.dispatch task 61ef1476-a7b5-4d41-abba-e4406d2f1280 starting awx.main.scheduler.tasks.run_task_manager(*[])
2021-05-15 06:03:29,959 DEBUG    [7d8140cba46b442caf544b74ee77a3e5] awx.main.scheduler Running task manager.
2021-05-15 06:03:29,932 DEBUG    [7d8140cba46b442caf544b74ee77a3e5] awx.main.dispatch task b2f8cbd1-f0cb-4bfc-832f-774099510166 starting awx.main.tasks.awx_k8s_reaper(*[])
2021-05-15 06:03:30,028 DEBUG    [7d8140cba46b442caf544b74ee77a3e5] awx.main.dispatch task 5dc51ed6-7f26-4ea4-84fb-1967617ad64a starting awx.main.analytics.analytics_tasks.send_subsystem_metrics(*[])
2021-05-15 06:03:30,053 DEBUG    [7d8140cba46b442caf544b74ee77a3e5] awx.main.tasks Starting periodic scheduler
2021-05-15 06:03:30,064 DEBUG    [7d8140cba46b442caf544b74ee77a3e5] awx.main.tasks Last scheduler run was: 2021-05-15 06:02:59.623461+00:00
2021-05-15 06:03:30,134 DEBUG    [7d8140cba46b442caf544b74ee77a3e5] awx.main.scheduler Starting Scheduler
2021-05-15 06:03:30,225 DEBUG    [7d8140cba46b442caf544b74ee77a3e5] awx.main.tasks Checking for orphaned k8s pods for tower-1.
2021-05-15 06:03:30,327 DEBUG    [7d8140cba46b442caf544b74ee77a3e5] awx.main.scheduler Finishing Scheduler
2021-05-15 06:03:32,315 WARNING  [a7372df73eb140f4b35634cc1de6f3d2] awx.main.dispatch scaling down worker pid:4628
2021-05-15 06:03:32,316 DEBUG    [a7372df73eb140f4b35634cc1de6f3d2] awx.main.dispatch task 40e94a9d-513a-45dc-8cb1-f822ceed040d starting awx.main.scheduler.tasks.run_task_manager(*[])
2021-05-15 06:03:32,318 WARNING  [7d8140cba46b442caf544b74ee77a3e5] awx.main.dispatch worker exiting gracefully pid:4628
2021-05-15 06:03:32,329 DEBUG    [a7372df73eb140f4b35634cc1de6f3d2] awx.main.scheduler Running task manager.
Instance Group already registered tower
(changed: True)
Instance Group already registered tower
(changed: True)
2021-05-15 06:03:32,364 DEBUG    [a7372df73eb140f4b35634cc1de6f3d2] awx.main.scheduler Starting Scheduler
2021-05-15 06:03:32,640 DEBUG    [a7372df73eb140f4b35634cc1de6f3d2] awx.analytics.job_lifecycle projectupdate-10030 acknowledged
2021-05-15 06:03:32,690 WARNING  [a7372df73eb140f4b35634cc1de6f3d2] awx.main.scheduler No available capacity to run containerized <project_update 10030 (waiting)>.
2021-05-15 06:03:32,830 DEBUG    [a7372df73eb140f4b35634cc1de6f3d2] awx.analytics.job_lifecycle projectupdate-10030 waiting
2021-05-15 06:03:32,831 DEBUG    [a7372df73eb140f4b35634cc1de6f3d2] awx.main.scheduler project_update 10030 (waiting) consumed 0 capacity units from tower with prior total of 0
2021-05-15 06:03:32,832 DEBUG    [a7372df73eb140f4b35634cc1de6f3d2] awx.main.scheduler Finishing Scheduler
2021-05-15 06:03:32,894 DEBUG    [a7372df73eb140f4b35634cc1de6f3d2] awx.main.dispatch task 0c40df72-e14f-47fc-9f4c-79c6aea5503a starting awx.main.tasks.RunProjectUpdate(*[10030])
2021-05-15 06:03:33,561 DEBUG    [a7372df73eb140f4b35634cc1de6f3d2] awx.analytics.job_lifecycle projectupdate-10030 pre run
2021-05-15 06:03:33,581 DEBUG    [a7372df73eb140f4b35634cc1de6f3d2] awx.analytics.job_lifecycle projectupdate-10030 preparing playbook
2021-05-15 06:03:33,725 DEBUG    [a7372df73eb140f4b35634cc1de6f3d2] awx.analytics.job_lifecycle projectupdate-10030 running playbook
2021-05-15 06:03:42,842 DEBUG    [-] awx.main.commands.run_callback_receiver 27 is alive
2021-05-15 06:03:42,842 DEBUG    [-] awx.main.commands.run_callback_receiver 27 is alive
2021-05-15 06:03:49,893 DEBUG    [48f472980b8f4f2fa2dae02f4cd9e3dd] awx.main.dispatch task c315fc3a-bc0b-45bf-ab15-a733c305fc59 starting awx.main.scheduler.tasks.run_task_manager(*[])
2021-05-15 06:03:49,895 DEBUG    [48f472980b8f4f2fa2dae02f4cd9e3dd] awx.main.scheduler Running task manager.
2021-05-15 06:03:49,985 DEBUG    [48f472980b8f4f2fa2dae02f4cd9e3dd] awx.main.dispatch task 93290266-03db-4e82-8149-cfeabb045a4d starting awx.main.analytics.analytics_tasks.send_subsystem_metrics(*[])
2021-05-15 06:03:50,020 DEBUG    [48f472980b8f4f2fa2dae02f4cd9e3dd] awx.main.scheduler Starting Scheduler
2021-05-15 06:03:50,247 DEBUG    [48f472980b8f4f2fa2dae02f4cd9e3dd] awx.main.scheduler Finishing Scheduler
2021-05-15 06:04:00,015 DEBUG    [4d60fdea098b4adc8ddd989ee154d632] awx.main.dispatch task 32e4017f-f5cf-4ac8-8e7a-af2f14c32beb starting awx.main.tasks.awx_periodic_scheduler(*[])
2021-05-15 06:04:00,046 DEBUG    [4d60fdea098b4adc8ddd989ee154d632] awx.main.tasks Starting periodic scheduler
RESULT 2
OKREADY
2021-05-15 06:04:00,055 DEBUG    [4d60fdea098b4adc8ddd989ee154d632] awx.main.tasks Last scheduler run was: 2021-05-15 06:03:30.054996+00:00
2021-05-15 06:04:10,048 DEBUG    [82e2362b5b064dfca43b4eb30c004fc8] awx.main.dispatch task 2f5448fb-6d28-4935-b776-0866084fce7d starting awx.main.scheduler.tasks.run_task_manager(*[])
2021-05-15 06:04:10,052 DEBUG    [82e2362b5b064dfca43b4eb30c004fc8] awx.main.scheduler Running task manager.
2021-05-15 06:04:10,127 DEBUG    [82e2362b5b064dfca43b4eb30c004fc8] awx.main.scheduler Starting Scheduler
2021-05-15 06:04:10,219 DEBUG    [82e2362b5b064dfca43b4eb30c004fc8] awx.main.dispatch task ee9db357-b00a-4056-ac63-a34c9a89ce68 starting awx.main.analytics.analytics_tasks.send_subsystem_metrics(*[])
2021-05-15 06:04:10,424 DEBUG    [82e2362b5b064dfca43b4eb30c004fc8] awx.main.scheduler Finishing Scheduler
2021-05-15 06:04:30,227 DEBUG    [780f73b06160467781e0d4b02f1463d9] awx.main.dispatch task 4712463f-c29f-4fc2-9f1b-f52c28dabe8d starting awx.main.tasks.cluster_node_heartbeat(*[])
2021-05-15 06:04:30,230 DEBUG    [780f73b06160467781e0d4b02f1463d9] awx.main.tasks Cluster node heartbeat task.
Instance Group already registered tower
(changed: True)
Instance Group already registered tower
(changed: True)
Instance Group already registered tower
(changed: True)
Instance Group already registered tower
(changed: True)
Instance Group already registered tower
(changed: True)
Instance Group already registered tower
(changed: True)
2021-05-15 06:04:30,431 DEBUG    [780f73b06160467781e0d4b02f1463d9] awx.main.dispatch task dd11c919-b9bb-488e-9fa7-e4e2a28d1cc1 starting awx.main.tasks.awx_k8s_reaper(*[])
2021-05-15 06:04:30,442 WARNING  [780f73b06160467781e0d4b02f1463d9] awx.main.dispatch scaling up worker pid:4644
2021-05-15 06:04:30,431 DEBUG    [780f73b06160467781e0d4b02f1463d9] awx.main.dispatch task fb8d52f3-256d-49d4-9abd-709362f63fbb starting awx.main.tasks.awx_periodic_scheduler(*[])
2021-05-15 06:04:30,531 WARNING  [780f73b06160467781e0d4b02f1463d9] awx.main.dispatch scaling up worker pid:4646
2021-05-15 06:04:30,532 DEBUG    [780f73b06160467781e0d4b02f1463d9] awx.main.dispatch task c8a824dd-97c4-4d21-b330-c37b1a631f7e starting awx.main.scheduler.tasks.run_task_manager(*[])
2021-05-15 06:04:30,543 DEBUG    [780f73b06160467781e0d4b02f1463d9] awx.main.dispatch task 18217501-f7aa-4e35-90de-e49c14d71140 starting awx.main.analytics.analytics_tasks.send_subsystem_metrics(*[])
2021-05-15 06:04:30,552 DEBUG    [780f73b06160467781e0d4b02f1463d9] awx.main.scheduler Running task manager.
2021-05-15 06:04:30,630 DEBUG    [780f73b06160467781e0d4b02f1463d9] awx.main.tasks Starting periodic scheduler
2021-05-15 06:04:30,648 DEBUG    [780f73b06160467781e0d4b02f1463d9] awx.main.tasks Checking for orphaned k8s pods for tower-1.
2021-05-15 06:04:30,649 DEBUG    [780f73b06160467781e0d4b02f1463d9] awx.main.tasks Last scheduler run was: 2021-05-15 06:04:00.052460+00:00
2021-05-15 06:04:30,741 DEBUG    [780f73b06160467781e0d4b02f1463d9] awx.main.scheduler Starting Scheduler
2021-05-15 06:04:31,229 DEBUG    [780f73b06160467781e0d4b02f1463d9] awx.main.scheduler Finishing Scheduler

kubectl -n awx logs -f -- awx-task shows this during an SCM update that never actually starts...

@supremedacos

I think I had a similar problem with SCM updates. Does your project contain a requirements.yml file? And if so, did you create an Ansible Galaxy/Automation Hub API Token credential and assign it to your organization?

@lemmy04
Author

lemmy04 commented May 17, 2021

No requirements.yml files anywhere in my projects, and my AWX does have the one Galaxy token that gets created and assigned during deployment.

@lemmy04
Author

lemmy04 commented May 21, 2021

I just deleted the AWX object on my k3s and redeployed - same issue: SCM jobs don't start at all.

@shanemcd
Member

Is there anything helpful in the logs of the in-pod awx-ee container?

@lemmy04
Author

lemmy04 commented May 21, 2021

not really:

INFO 2021/05/21 18:03:22 Client connected to control service
INFO 2021/05/21 18:03:22 Control service closed
INFO 2021/05/21 18:03:22 Client disconnected from control service
INFO 2021/05/21 18:03:22 Client connected to control service
DEBUG 2021/05/21 18:03:24 Sending service advertisement: &{awx-9b7494dfb-jgk9m control 2021-05-21 18:03:24.48428354 +0000 UTC m=+38108.977573786 1 map[] []}

that is literally all there is.

@lemmy04
Author

lemmy04 commented May 28, 2021

I have deleted the whole namespace in Kubernetes, deleted the operator, and deployed from scratch - and it's still the same: project sync jobs do not start at all, and there is nothing in the logs that would explain why.

@lemmy04
Author

lemmy04 commented May 28, 2021

I've actually found something in the logs:

2021-05-28 11:25:17,565 WARNING  [6478de4b690b4854bb03466b65b1b70d] awx.main.scheduler No available capacity to run containerized <project_update 10116 (waiting)>.
2021-05-28 11:25:17,720 DEBUG    [6478de4b690b4854bb03466b65b1b70d] awx.analytics.job_lifecycle projectupdate-10116 waiting

"No available capacity to run ..." - where does that come from, and why does it affect only SCM updates?

@lemmy04
Author

lemmy04 commented Jun 1, 2021

I've reconfigured my AWX to have much bigger resource limits, but I still get this message:

2021-06-01 06:05:35,058 WARNING [d03db904726b4c5abc80773ea8897714] awx.main.scheduler No available capacity to run containerized <project_update 10127 (waiting)>.

and the SCM update is not happening.
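The instance group capacity AWX reports can also be read from the API (a sketch - the URL and credentials are placeholders for my instance):

curl -s -u admin:PASSWORD https://awx.example.com/api/v2/instance_groups/ | python3 -m json.tool | grep -E '"(name|capacity|consumed_capacity|percent_capacity_remaining)"'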

@lemmy04
Author

lemmy04 commented Jun 1, 2021

There is also a file-not-found error on awx-ee:

ERROR 2021/06/01 06:05:17 Error updating status file /tmp/receptor/awx-5f9c75f6b6-ks7jz/K72xh22I/status: open /tmp/receptor/awx-5f9c75f6b6-ks7jz/K72xh22I/status.lock: no such file or directory.

When I look inside the actual container I find that there is no file by that name.

Instead, there is /tmp/receptor/awx-5f9c75f6b6-ks7jz/Zg7PbqHR/status
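For what it's worth, the receptor work units (which map to those directories under /tmp/receptor) can be listed from the awx-ee container - a sketch, assuming receptorctl and the default control socket path are present in the image:

kubectl exec -n awx -it deployment/awx -c awx-ee -- receptorctl --socket /var/run/receptor/receptor.sock work list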

@JRNC

JRNC commented Aug 9, 2021

This sounds like it could be the same issue that some of us are having in issue #10489

@lemmy04
Author

lemmy04 commented Aug 10, 2021

Looks pretty much like the same thing. Can't really verify though - I've given up on AWX > 17.1 for now, and since the deployment with database migration is also broken, I have no way to spin up a latest-version AWX in my Kubernetes to get enough history for it to trigger.

@JRNC

JRNC commented Aug 12, 2021

Hey @lemmy04, I have actually been bypassing the database migration workflow and manually dumping the database and then restoring it. I've added some steps below. I hope this helps!

On old server:
stop all containers except postgres first
sudo docker exec awx_postgres pg_dump -U DATABASEUSERNAME -Ft -f backup.tar awx
sudo docker cp awx_postgres:backup.tar ./
sudo docker exec awx_postgres rm backup.tar
Send the tarball to the new server and then delete the copy that was just created on the old server

On new server after deploying a bare install of AWX:
(make sure you run the below commands against the appropriate namespace)
kubectl scale --replicas=0 deployment awx
kubectl scale --replicas=0 deployment awx-operator
kubectl exec -it awx-postgres-0 -- /bin/bash
psql -U DATABASEUSERNAME postgres
DROP DATABASE awx;
CREATE DATABASE awx;
exit
exit
kubectl cp backup.tar awx/awx-postgres-0:/
kubectl exec -it awx-postgres-0 -- /bin/bash
pg_restore -U DATABASEUSERNAME -Ft backup.tar -d awx
rm backup.tar
exit
rm backup.tar
kubectl scale --replicas=1 deployment awx-operator
kubectl scale --replicas=1 deployment awx
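After scaling back up, a quick sanity check that the data made it over (a sketch - main_job is assumed to be the AWX jobs table in the restored database):

kubectl exec -it awx-postgres-0 -- psql -U DATABASEUSERNAME -d awx -c "SELECT count(*) FROM main_job;"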

@stefanpinter

stefanpinter commented Oct 14, 2021

I also have a project update stuck at running, for the second time already. It is bad because it also blocks other jobs (they stay in pending).
Using AWX 19.3.0.

Identity added: /tmp/pdd_wrapper_7925_uf780rcs/awx_7925_8hqi355q/artifacts/7925/ssh_key_data (ansible@vpadm002)
PLAY [Update source tree if necessary] *****************************************
TASK [update project using git] ************************************************
ok: [localhost]
TASK [Set the git repository version] ******************************************
ok: [localhost]
TASK [Repository Version] ******************************************************
ok: [localhost] => {
"msg": "Repository Version 6d46b547c250e7595c8c5f62c91a273eccb888c7"
}
PLAY [Install content with ansible-galaxy command if necessary] ****************
TASK [Check content sync settings] *********************************************

@backaf

backaf commented Nov 10, 2021

I have the same issue on 19.4.0 where project update jobs get stuck at "Check content sync settings". It happens randomly. All scheduled jobs are failing because the projects can't be updated. This is a fresh install using the official AWX operator.

@SudoNova

Same here

INFO 2021/12/26 06:22:48 Client connected to control service
INFO 2021/12/26 06:22:48 Control service closed
INFO 2021/12/26 06:22:48 Client disconnected from control service
ERROR 2021/12/26 06:22:49 Error updating status file /tmp/receptor/awx-5b4cff67f5-mh552/9LQAkSuW/status: open /tmp/receptor/awx-5b4cff67f5-mh552/9LQAkSuW/status.lock: no such file or directory.
ERROR 2021/12/26 06:22:49 Error creating pod: context canceled

Though I don't understand the "Error creating pod" part (what pod?), the "Error updating status file" message is the same as above.

@shanemcd
Member

Are folks still seeing this?

@navvisit

I last saw it on the 12th of April, shortly after upgrading AWX to 20.1.0. It used to fail more often before; one thing I noticed is that it gets a bit more stable if I delete the AWX pods.

I have a feeling it's better in 20.1.0; I can report back if I still see it.

@PurpEE

PurpEE commented Oct 14, 2022

We are currently seeing this on two instances after upgrading from 21.4 to 21.7.
We tried to cancel the job and also redeployed the whole deployment, but it still always reappears.

Canceling it produces this output in the web container:

[awx-77dcb9477f-2r2vf awx-web] xx.xxx.xx.x - - [14/Oct/2022:09:46:44 +0000] "POST /api/v2/workflow_jobs/7136/cancel/ HTTP/1.1" 202 0 "https://awxurl.de/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0" "xx.xxx.x.xxx, xx.xxx.xx.xxx"                                                                                                         
[awx-77dcb9477f-2r2vf awx-web] [pid: 35|app: 0|req: 253/382] xx.xxx.xx.x () {68 vars in 2074 bytes} [Fri Oct 14 09:46:44 2022] POST /api/v2/workflow_jobs/7136/cancel/ => generated 0 bytes in 265 msecs (HTTP/1.1 202) 13 headers in 552 bytes (1 switches on core 0)

The ee container gives this (the same output as the ee containers of all other pods):

[awx-77dcb9477f-g5h86 awx-ee] DEBUG 2022/10/14 09:46:46 Client connected to control service @                                                                                         
[awx-77dcb9477f-g5h86 awx-ee] DEBUG 2022/10/14 09:46:46 Client connected to control service @                                                                                                 
[awx-77dcb9477f-g5h86 awx-ee] DEBUG 2022/10/14 09:46:46 Control service closed                                                                                                                
[awx-77dcb9477f-g5h86 awx-ee] DEBUG 2022/10/14 09:46:46 Client disconnected from control service @                                                                                            
[awx-77dcb9477f-g5h86 awx-ee] DEBUG 2022/10/14 09:46:46 Control service closed                                                                                                                
[awx-77dcb9477f-g5h86 awx-ee] DEBUG 2022/10/14 09:46:46 Client disconnected from control service @                                                                                            
[awx-77dcb9477f-g5h86 awx-ee] DEBUG 2022/10/14 09:46:55 Sending service advertisement: &{awx-77dcb9477f-g5h86 control 2022-10-14 09:46:55.947685586 +0000 UTC m=+63426.817330184 1 map[type:Control Service] [{local false} {kubernetes-runtime-auth false} {kubernetes-incluster-auth false}]}

Manually trying to POST /api/v2/workflow_jobs/7136/cancel/ produces the same output as seen in the web container above.
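For reference, a manual POST of that kind can be done with a token-authenticated call roughly like this (the token is a placeholder):

curl -s -X POST -H "Authorization: Bearer MYTOKEN" https://awxurl.de/api/v2/workflow_jobs/7136/cancel/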

Trying to delete it produces:

[awx-77dcb9477f-2r2vf awx-web] 2022-10-14 09:57:11,526 WARNING  [b41cc4b4803647f1bbfdf2df5ffa0784] awx.api.generics status 403 received by user foo.bar attempting to access /api/v2/workflow_jobs/7136/ from xx.xxx.xx.x                                                                                                                                                               
[awx-77dcb9477f-2r2vf awx-web] 2022-10-14 09:57:11,562 WARNING  [b41cc4b4803647f1bbfdf2df5ffa0784] django.request Forbidden: /api/v2/workflow_jobs/7136/                                     
[awx-77dcb9477f-2r2vf awx-web] 2022-10-14 09:57:11,562 WARNING  [b41cc4b4803647f1bbfdf2df5ffa0784] django.request Forbidden: /api/v2/workflow_jobs/7136/                                      
[awx-77dcb9477f-2r2vf awx-web] xx.xxx.xx.x - - [14/Oct/2022:09:57:11 +0000] "DELETE /api/v2/workflow_jobs/7136/ HTTP/1.1" 403 9343 "https://awxurl.de/api/v2/workflow_jobs/7136/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0" "xx.xxx.x.xxx, xx.xxx.xx.xxx"                                                                                 
[awx-77dcb9477f-2r2vf awx-web] [pid: 36|app: 0|req: 21/390] xx.xxx.xx.x () {70 vars in 2207 bytes} [Fri Oct 14 09:57:11 2022] DELETE /api/v2/workflow_jobs/7136/ => generated 9343 bytes in 212 msecs (HTTP/1.1 403) 15 headers in 709 bytes (1 switches on core 0)

In both cases the last job within the workflow shows the status "Error" in AWX (looking at the status via /api/v2/jobs/7141/).

"status": "error",

It also shows the following result traceback:

"result_traceback": "Traceback (most recent call last):
  File \"/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/receptor.py\", line 372, in _run_internal
    resultsock = receptor_ctl.get_work_results(self.unit_id, return_sockfile=True)
  File \"/var/lib/awx/venv/awx/lib64/python3.9/site-packages/receptorctl/socket_interface.py\", line 248, in get_work_results
    self.writestr(f\"work results {unit_id}\
\")
  File \"/var/lib/awx/venv/awx/lib64/python3.9/site-packages/receptorctl/socket_interface.py\", line 49, in writestr
    self._sockfile.flush()
  File \"/usr/lib64/python3.9/socket.py\", line 722, in write
    return self._sock.send(b)
BrokenPipeError: [Errno 32] Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File \"/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/jobs.py\", line 537, in run
    res = receptor_job.run()
  File \"/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/receptor.py\", line 268, in run
    res = self._run_internal(receptor_ctl)
  File \"/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/receptor.py\", line 382, in _run_internal
    raise RuntimeError(detail)
RuntimeError
",

But the Kubernetes job logs say the job has been canceled.

October 10th 2022, 11:05:35.922 {"status": "canceled", "runner_ident": "7141"} automation-job-7141-77cp7

Also here is the log of the pod at the end:

October 10th 2022, 11:01:12.2352022-10-10,230 INFO      awx.main.commands.run_callback_receiver Starting EOF event processing for Job 7141
October 10th 2022, 11:01:12.2352022-10-10,230 INFO      awx.main.commands.run_callback_receiver Starting EOF event processing for Job 7141
October 10th 2022, 11:01:12.4792022-10-10,477 DEBUG     awx.main.tasks.jobs job 7141 (running) finished running, producing 24 events.
October 10th 2022, 11:01:12.4792022-10-10,478 DEBUG     awx.analytics.job_lifecycle job-7141 post run
October 10th 2022, 11:01:12.4792022-10-10,477 ERROR     awx.main.tasks.jobs job 7141 (running) Exception occurred while running task
October 10th 2022, 11:01:12.7032022-10-10,697 DEBUG     awx.analytics.job_lifecycle job-7141 finalize run
October 10th 2022, 11:01:12.7712022-10-10,737 ERROR     awx.main.models.unified_jobs job 7141 (error) failed to emit channel msg about status change
October 10th 2022, 11:01:12.7832022-10-10,782 DEBUG     awx.main.tasks.system Executing error task id a492b12a-42d4-480a-95ce-0e3f1b5f7475, subtasks: [{'type': 'job', 'id': 7141}]
October 10th 2022, 11:01:12.7832022-10-10,775 WARNING   awx.main.dispatch job 7141 (error) encountered an error (rc=None), please see task stdout for details.

@root3200

Hi, if anyone is currently hitting this stuck-task bug and is using the AWX operator, this may help resolve it.

I installed the AWX operator with k3s on a Debian 12 virtual machine. The bug appeared when the virtual machine had only 1 CPU core assigned. I am still not entirely sure why the error happens, but after I assigned 2 CPU cores, AWX tasks ran correctly without any problem. I hope this is helpful.
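If someone wants to check whether this applies to their setup, the node's allocatable CPU is visible with kubectl (a sketch):

kubectl describe node | grep -A 6 "Allocatable:"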
