
Long job marked as error because log retrieval stops while the job continues and completes successfully #11451

Closed
3 tasks done
doczico opened this issue Dec 13, 2021 · 6 comments

Comments

@doczico

doczico commented Dec 13, 2021

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I am not entitled to status updates or other assurances.

Summary

Dear AWX team,
I am running AWX within a Kubernetes cluster. Jobs run on a custom awx-ee built with ansible-builder (ansible-runner:stable-2.10-devel plus a few tools needed by the playbooks).
All jobs are working fine; however, one long job is repeatedly marked as Error.
Checking the job logs, there is no failure, only an incomplete log that stops in the middle of a task.
Monitoring the job pod directly with kubectl, I can see that it completes the playbook successfully, which matches the artifacts generated by tasks that AWX shows as never executed, and contradicts what AWX displays.

AWX version

19.5.0

Installation method

kubernetes

Modifications

yes

Ansible version

No response

Operating system

No response

Web browser

No response

Steps to reproduce

Trigger a long job and wait for it to be marked as error.
Check the container through kubectl or any other tool and see that it keeps running.
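For example, a quick way to check (namespace and pod name below are placeholders, assuming the default awx namespace):

kubectl -n awx get pods --watch
kubectl -n awx logs -f automation-job-149-bjhxq

The automation-job pod keeps running and its log continues past the point where the AWX output stops.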

Expected results

The log display in AWX should be complete and the job status should reflect the real status.

Actual results

The AWX job displays an incomplete log and an Error status.

Additional information

Custom runner images that include some tools and collections required by the playbooks.

@AlanCoding
Member

Do you have the error that the job shows? There is an API field job_explanation which may describe the error, or it may show a traceback. Also, are there relevant server logs from when it errored?
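For example, something like this should show both fields (hostname, credentials, and job ID are placeholders):

curl -s -u admin:password https://awx.example.com/api/v2/jobs/149/ | jq '{status, job_explanation, result_traceback}'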

@stefanpinter

My problem from #11473 is now this one as well (after setting up a new server and upgrading from 19.3.0 to 19.5.0).
A "long"-running job goes into Error.
k3s logs from when it happened at 09:42 (the job jumped from Running to Error at that same moment):

Dec 23 09:40:45 vpadm002 k3s[915]: I1223 09:40:45.329466 915 topology_manager.go:187] "Topology Admit Handler"
Dec 23 09:40:45 vpadm002 k3s[915]: I1223 09:40:45.393250 915 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume "kube-api-access-k48w7" (UniqueName: "kubernetes.io/projected/e3b8a6f0-adcc-4abf-918a-746b884cc5c5-kube-api-access-k48w7") pod "automation-job-149-bjhxq" (UID: "e3b8a6f0-adcc-4abf-918a-746b884cc5c5") "
Dec 23 09:42:03 vpadm002 k3s[915]: I1223 09:42:03.340154 915 trace.go:205] Trace[1692521596]: "Get" url:/api/v1/namespaces/awx/pods/automation-job-149-bjhxq/log,user-agent:receptor/v0.0.0 (linux/amd64) kubernetes/$Format,client:10.42.0.63,accept:application/json, /,protocol:HTTP/1.1 (23-Dec-2021 09:40:47.965) (total time: 75374ms):
Dec 23 09:42:03 vpadm002 k3s[915]: Trace[1692521596]: ---"Transformed response object" 75372ms (09:42:00.340)
Dec 23 09:42:03 vpadm002 k3s[915]: Trace[1692521596]: [1m15.374163804s] [1m15.374163804s] END
Dec 23 09:42:04 vpadm002 k3s[915]: I1223 09:42:04.178391 915 scope.go:111] "RemoveContainer" containerID="b900f089979be234a2a120679f883e5cdba2fd03351e9b6eacaa0e0bf8201d32"
Dec 23 09:42:04 vpadm002 k3s[915]: I1223 09:42:04.189635 915 scope.go:111] "RemoveContainer" containerID="b900f089979be234a2a120679f883e5cdba2fd03351e9b6eacaa0e0bf8201d32"
Dec 23 09:42:04 vpadm002 k3s[915]: E1223 09:42:04.190105 915 remote_runtime.go:334] "ContainerStatus from runtime service failed" err="rpc error: code = NotFound desc = an error occurred when try to find container "b900f089979be234a2a120679f883e5cdba2fd03351e9b6eacaa0e0bf8201d32": not found" containerID="b900f089979be234a2a120679f883e5cdba2fd03351e9b6eacaa0e0bf8201d32"
Dec 23 09:42:04 vpadm002 k3s[915]: I1223 09:42:04.190146 915 pod_container_deletor.go:52] "DeleteContainer returned error" containerID={Type:containerd ID:b900f089979be234a2a120679f883e5cdba2fd03351e9b6eacaa0e0bf8201d32} err="failed to get container status "b900f089979be234a2a120679f883e5cdba2fd03351e9b6eacaa0e0bf8201d32": rpc error: code = NotFound desc = an error occurred when try to find container "b900f089979be234a2a120679f883e5cdba2fd03351e9b6eacaa0e0bf8201d32": not found"
Dec 23 09:42:04 vpadm002 k3s[915]: I1223 09:42:04.258173 915 reconciler.go:196] "operationExecutor.UnmountVolume started for volume "kube-api-access-k48w7" (UniqueName: "kubernetes.io/projected/e3b8a6f0-adcc-4abf-918a-746b884cc5c5-kube-api-access-k48w7") pod "e3b8a6f0-adcc-4abf-918a-746b884cc5c5" (UID: "e3b8a6f0-adcc-4abf-918a-746b884cc5c5") "
Dec 23 09:42:04 vpadm002 k3s[915]: I1223 09:42:04.260589 915 operation_generator.go:830] UnmountVolume.TearDown succeeded for volume "kubernetes.io/projected/e3b8a6f0-adcc-4abf-918a-746b884cc5c5-kube-api-access-k48w7" (OuterVolumeSpecName: "kube-api-access-k48w7") pod "e3b8a6f0-adcc-4abf-918a-746b884cc5c5" (UID: "e3b8a6f0-adcc-4abf-918a-746b884cc5c5"). InnerVolumeSpecName "kube-api-access-k48w7". PluginName "kubernetes.io/projected", VolumeGidValue ""
Dec 23 09:42:04 vpadm002 k3s[915]: I1223 09:42:04.359160 915 reconciler.go:319] "Volume detached for volume "kube-api-access-k48w7" (UniqueName: "kubernetes.io/projected/e3b8a6f0-adcc-4abf-918a-746b884cc5c5-kube-api-access-k48w7") on node "vpadm002" DevicePath """

@stefanpinter

stefanpinter commented Dec 23, 2021

Set up with https://github.com/kurokobo/awx-on-k3s
AWX 19.5.0
awx-ee (latest)
tried awx-ee 0.6.0 as well

There are jobs on my instance that run fine and are even longer (> 300 seconds).

"status": "error",
"execution_environment": 1,
"failed": true,
"started": "2021-12-23T09:40:36.129419Z",
"finished": "2021-12-23T09:42:03.868156Z",
"canceled_on": null,
"elapsed": 87.739,
"job_explanation": "",

another try
"execution_environment": 4,
"failed": true,
"started": "2021-12-23T09:51:54.750005Z",
"finished": "2021-12-23T09:53:58.982937Z",
"canceled_on": null,
"elapsed": 124.233,

@kurokobo
Contributor

Sometimes log rotation by kubelet causes a similar issue.
I don't know if this is the one you faced, but for k3s #10366 (comment) might be helpful (and #10366 (comment) for k8s).
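If it is the log rotation case, the linked comments describe raising the kubelet log rotation limits so the job log is not rotated away mid-run. A sketch for k3s (values are examples; the config path and flag syntax may vary by k3s version):

# /etc/rancher/k3s/config.yaml
kubelet-arg:
  - "containerLogMaxFiles=4"
  - "containerLogMaxSize=500Mi"

Then restart k3s (systemctl restart k3s) and re-run the job.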

@stefanpinter

A very similar problem, possibly with the same cause, is now happening with a workflow.
A node ends with success, but the next node does not start at all.
The workflow says "Running" and the next node stays gray and never runs.

@shanemcd
Member

This is a duplicate of either #11594 (comment) or #11338
