AWX 19.1: jobs stuck in "running" for over 12 hours but not actually doing anything at all #10151
Comments
I see this happening as well, except my playbooks successfully start and complete and the k8s pod terminates cleanly, but the UI still shows the job in the Running state.
Hi - can you get the logs from the awx-task container related to the launch of the job? We'd need more information to determine what's going on here.
... I hadn't touched Kubernetes at all until AWX 18 came out... what exactly do you need?
`kubectl -n awx logs -f -- awx-task` shows this during an SCM update that never actually starts...
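For anyone collecting the same logs, something like the following should work. This is a sketch assuming a default operator install: the `awx` namespace and the `deployment/awx` name are assumptions; adjust them to your deployment.

```shell
# Sketch: pull logs from each container of the AWX pod
# ("awx" namespace and deployment name are assumptions).
kubectl -n awx get pods                          # find the awx pod
kubectl -n awx logs deployment/awx -c awx-task   # task dispatcher logs
kubectl -n awx logs deployment/awx -c awx-ee     # receptor / execution logs
kubectl -n awx logs deployment/awx -c awx-web    # API / UI logs
```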
I think I had a similar problem with SCM updates - does your project contain a requirements.yml?
No requirements.yml files anywhere in my projects, and my AWX does have the one Galaxy token that gets created and assigned during deployment.
I just deleted my AWX object on my k3s and redeployed - same issue: SCM jobs don't start at all.
Is there anything helpful in the logs of the in-pod awx-ee container?
Not really: `INFO 2021/05/21 18:03:22 Client connected to control service` - that is literally all there is.
I have deleted the whole namespace in kubernetes and deleted the operator and deployed from scratch - and it's still the same: project sync jobs do not start at all, and there is nothing in the logs that would explain why. |
I've actually found something in the logs:
"No available capacity to run ..." - where does that come from, and why does it affect only SCM updates?
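That message appears to come from AWX's task scheduler when no instance reports enough free capacity for the pending job. One way to check what the control instance thinks its capacity is goes through the API; the host and credentials below are placeholders:

```shell
# Sketch: inspect instance capacity via the AWX API
# (hostname and credentials are placeholders).
curl -s -u admin:password https://awx.example.com/api/v2/instances/ \
  | python3 -m json.tool \
  | grep -E '"(capacity|consumed_capacity|percent_capacity_remaining)"'
```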
I've reconfigured my AWX with much bigger resource limits, but I still get this message, and the SCM update is not happening.
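Since reported capacity is derived from the resources available to the control pod, raising the task container's requests/limits in the AWX custom resource is one lever to try. A sketch follows; the field names match recent awx-operator releases, but verify them against the operator documentation for your version, and the numbers are placeholders:

```yaml
# Sketch: raise the task container's resources in the AWX CR
# (values are placeholders; field names per recent awx-operator versions).
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
spec:
  task_resource_requirements:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: "2"
      memory: 4Gi
```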
There is also a file-not-found error on awx-ee:
When I look inside the actual container, I find that there is no file by that name. Instead there is `/tmp/receptor/awx-5f9c75f6b6-ks7jz/Zg7PbqHR/status`.
This sounds like it could be the same issue that some of us are having in issue #10489.
Looks pretty much like the same thing. Can't really verify though - I've given up on AWX > 17.1 for now, and since the deployment with database migration is also broken, I have no way to spin up a latest-version AWX in my Kubernetes to build up enough history for it to trigger.
Hey @lemmy04, I have actually been bypassing the database migration workflow by manually dumping the database and then restoring it. I've added some steps below. I hope this helps! On the old server: On the new server, after deploying a bare install of AWX:
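A generic dump-and-restore along those lines might look like the following. This is a sketch, not the commenter's exact procedure: the pod name `awx-postgres-0` and the database/user `awx` are assumptions about a default operator-managed PostgreSQL; adjust to your deployment.

```shell
# Sketch: manual database migration between two AWX installs
# (pod name "awx-postgres-0" and db/user "awx" are assumptions).
# On the old server:
kubectl -n awx exec -it awx-postgres-0 -- pg_dump -U awx -d awx -F c -f /tmp/awx.dump
kubectl -n awx cp awx-postgres-0:/tmp/awx.dump ./awx.dump
# On the new server, after deploying a bare AWX:
kubectl -n awx cp ./awx.dump awx-postgres-0:/tmp/awx.dump
kubectl -n awx exec -it awx-postgres-0 -- pg_restore -U awx -d awx --clean /tmp/awx.dump
```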
I also have a project update stuck at running, for the second time already. It is bad because it also blocks other jobs (they stay in pending). `Identity added: /tmp/pdd_wrapper_7925_uf780rcs/awx_7925_8hqi355q/artifacts/7925/ssh_key_data (ansible@vpadm002)`
I have the same issue on 19.4.0, where project update jobs get stuck at "Check content sync settings". It happens randomly. All scheduled jobs are failing because the projects can't be updated. This is a fresh install using the official AWX operator.
Same here
Though I don't get the
Are folks still seeing this?
I last saw it on the 12th of April, shortly after the upgrade to AWX 20.1.0. It used to fail more often before; one thing I noticed is that the issue becomes a bit less frequent if I delete the AWX pods. I have a feeling it's better in 20.1.0 - I can report back if I still see it.
We are currently seeing this in two instances after upgrading from 21.4 to 21.7. Canceling it produces this output in the web container:
ee-container gives (same output as ee-container of all other pods):
Manually trying to POST /api/v2/workflow_jobs/7136/cancel/ produces the same output as seen in the web container above. Trying to delete it produces:
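For reference, the manual cancel attempt mentioned above can be reproduced against the API like this. The hostname and credentials are placeholders; the workflow job id 7136 comes from the comment:

```shell
# Sketch: cancel a stuck workflow job through the AWX API
# (hostname and credentials are placeholders; id 7136 from the comment).
curl -s -u admin:password -X POST \
  https://awx.example.com/api/v2/workflow_jobs/7136/cancel/
# A GET on the same endpoint reports whether the job can be canceled:
curl -s -u admin:password \
  https://awx.example.com/api/v2/workflow_jobs/7136/cancel/
```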
In both cases the last job within the workflow shows the status Error in AWX. (Looking at the status via /api/v2/jobs/7141/)
It also shows the following result traceback:
but the kubernetes job logs say the job has been canceled.
Also here is the log of the pod at the end:
Hi, if anyone is currently hitting the stuck-task bug and is using the AWX operator, this may help resolve it. I installed the AWX operator with k3s in a Debian 12 virtual machine; the bug appeared when the virtual machine had only 1 CPU core assigned. I am still not sure why the error occurs, but after I assigned 2 CPU cores, AWX tasks worked correctly without any problem. I hope this is helpful.
ISSUE TYPE
SUMMARY
Since yesterday my AWX does not work anymore: jobs get created and started but never actually run, i.e. no worker container is created and no actual work is done.
ENVIRONMENT
STEPS TO REPRODUCE
I start a new job by manually launching either an SCM update, a job template, or a workflow template.
EXPECTED RESULTS
I would expect the job to either succeed or, at some point, fail with some (meaningful) error message.
ACTUAL RESULTS
The job never actually starts - it just sits in "running" while nothing is actually happening.
ADDITIONAL INFORMATION
No errors on the K3s side of things - it looks more like AWX never even tries to talk to K3s anymore.
BUT: ad-hoc jobs still work... this only affects templates and SCM updates (or maybe more, I have not tried every possible thing).