
Long job marked as error because log retrieval stops while the job continues and completes successfully #11451

Closed
3 tasks done
doczico opened this issue Dec 13, 2021 · 6 comments

Comments

@doczico

doczico commented Dec 13, 2021

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I am not entitled to status updates or other assurances.

Summary

Dear AWX team,
I am running AWX within a Kubernetes cluster. Jobs run on a custom awx-ee built with ansible-builder (ansible-runner:stable-2.10-devel plus a few tools needed by the playbooks).
All jobs are working fine; however, one long job is repeatedly marked as Error.
Checking the job logs, there is no failure, only an incomplete log that stops in the middle of a task.
Monitoring the job pod directly with kubectl, I can see that it completes the playbook successfully, which matches the artifacts generated by tasks that AWX shows as never executed, and contradicts what AWX displays.

AWX version

19.5.0

Installation method

kubernetes

Modifications

yes

Ansible version

No response

Operating system

No response

Web browser

No response

Steps to reproduce

Trigger a long job and wait for it to be marked as error.
Check the container through kubectl or any other tool and see that it keeps running.
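For example, a quick way to check (namespace and pod name below are placeholders, assuming the default awx namespace):

kubectl -n awx get pods --watch
kubectl -n awx logs -f automation-job-149-bjhxq

The automation-job pod keeps running and its log continues past the point where the AWX output stops.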

Expected results

The log display in AWX should be complete and the job status should reflect the real status.

Actual results

The AWX job displays an incomplete log and an Error status.

Additional information

Custom runner images that include some tools and collections required by the playbooks.

@AlanCoding
Member

Do you have the error that the job shows? There is an API field job_explanation which may describe the error, or it may show a traceback. Also, are there relevant server logs from when it errored?
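For example, something like this should show both fields (hostname, credentials, and job ID are placeholders):

curl -s -u admin:password https://awx.example.com/api/v2/jobs/149/ | jq '{status, job_explanation, result_traceback}'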

@stefanpinter

My problem from #11473 is now this one as well (after setting up a new server and upgrading from 19.3.0 to 19.5.0).
A "long"-running job goes into Error.
k3s logs from when it happened at 09:42 (the job jumped from Running to Error at that same moment):

Dec 23 09:40:45 vpadm002 k3s[915]: I1223 09:40:45.329466 915 topology_manager.go:187] "Topology Admit Handler"
Dec 23 09:40:45 vpadm002 k3s[915]: I1223 09:40:45.393250 915 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume "kube-api-access-k48w7" (UniqueName: "kubernetes.io/projected/e3b8a6f0-adcc-4abf-918a-746b884cc5c5-kube-api-access-k48w7") pod "automation-job-149-bjhxq" (UID: "e3b8a6f0-adcc-4abf-918a-746b884cc5c5") "
Dec 23 09:42:03 vpadm002 k3s[915]: I1223 09:42:03.340154 915 trace.go:205] Trace[1692521596]: "Get" url:/api/v1/namespaces/awx/pods/automation-job-149-bjhxq/log,user-agent:receptor/v0.0.0 (linux/amd64) kubernetes/$Format,client:10.42.0.63,accept:application/json, /,protocol:HTTP/1.1 (23-Dec-2021 09:40:47.965) (total time: 75374ms):
Dec 23 09:42:03 vpadm002 k3s[915]: Trace[1692521596]: ---"Transformed response object" 75372ms (09:42:00.340)
Dec 23 09:42:03 vpadm002 k3s[915]: Trace[1692521596]: [1m15.374163804s] [1m15.374163804s] END
Dec 23 09:42:04 vpadm002 k3s[915]: I1223 09:42:04.178391 915 scope.go:111] "RemoveContainer" containerID="b900f089979be234a2a120679f883e5cdba2fd03351e9b6eacaa0e0bf8201d32"
Dec 23 09:42:04 vpadm002 k3s[915]: I1223 09:42:04.189635 915 scope.go:111] "RemoveContainer" containerID="b900f089979be234a2a120679f883e5cdba2fd03351e9b6eacaa0e0bf8201d32"
Dec 23 09:42:04 vpadm002 k3s[915]: E1223 09:42:04.190105 915 remote_runtime.go:334] "ContainerStatus from runtime service failed" err="rpc error: code = NotFound desc = an error occurred when try to find container "b900f089979be234a2a120679f883e5cdba2fd03351e9b6eacaa0e0bf8201d32": not found" containerID="b900f089979be234a2a120679f883e5cdba2fd03351e9b6eacaa0e0bf8201d32"
Dec 23 09:42:04 vpadm002 k3s[915]: I1223 09:42:04.190146 915 pod_container_deletor.go:52] "DeleteContainer returned error" containerID={Type:containerd ID:b900f089979be234a2a120679f883e5cdba2fd03351e9b6eacaa0e0bf8201d32} err="failed to get container status "b900f089979be234a2a120679f883e5cdba2fd03351e9b6eacaa0e0bf8201d32": rpc error: code = NotFound desc = an error occurred when try to find container "b900f089979be234a2a120679f883e5cdba2fd03351e9b6eacaa0e0bf8201d32": not found"
Dec 23 09:42:04 vpadm002 k3s[915]: I1223 09:42:04.258173 915 reconciler.go:196] "operationExecutor.UnmountVolume started for volume "kube-api-access-k48w7" (UniqueName: "kubernetes.io/projected/e3b8a6f0-adcc-4abf-918a-746b884cc5c5-kube-api-access-k48w7") pod "e3b8a6f0-adcc-4abf-918a-746b884cc5c5" (UID: "e3b8a6f0-adcc-4abf-918a-746b884cc5c5") "
Dec 23 09:42:04 vpadm002 k3s[915]: I1223 09:42:04.260589 915 operation_generator.go:830] UnmountVolume.TearDown succeeded for volume "kubernetes.io/projected/e3b8a6f0-adcc-4abf-918a-746b884cc5c5-kube-api-access-k48w7" (OuterVolumeSpecName: "kube-api-access-k48w7") pod "e3b8a6f0-adcc-4abf-918a-746b884cc5c5" (UID: "e3b8a6f0-adcc-4abf-918a-746b884cc5c5"). InnerVolumeSpecName "kube-api-access-k48w7". PluginName "kubernetes.io/projected", VolumeGidValue ""
Dec 23 09:42:04 vpadm002 k3s[915]: I1223 09:42:04.359160 915 reconciler.go:319] "Volume detached for volume "kube-api-access-k48w7" (UniqueName: "kubernetes.io/projected/e3b8a6f0-adcc-4abf-918a-746b884cc5c5-kube-api-access-k48w7") on node "vpadm002" DevicePath """

@stefanpinter

stefanpinter commented Dec 23, 2021

Set up with https://github.com/kurokobo/awx-on-k3s
AWX 19.5.0
awx-ee (latest)
tried awx-ee 0.6.0 as well

There are jobs on my instance that run fine and are even longer (> 300 seconds).

"status": "error",
"execution_environment": 1,
"failed": true,
"started": "2021-12-23T09:40:36.129419Z",
"finished": "2021-12-23T09:42:03.868156Z",
"canceled_on": null,
"elapsed": 87.739,
"job_explanation": "",

another try
"execution_environment": 4,
"failed": true,
"started": "2021-12-23T09:51:54.750005Z",
"finished": "2021-12-23T09:53:58.982937Z",
"canceled_on": null,
"elapsed": 124.233,

@kurokobo
Contributor

Sometimes log rotation by kubelet causes a similar issue.
I don't know if this is the one you faced, but for k3s #10366 (comment) might be helpful (and #10366 (comment) for k8s).
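If it is the log rotation case, the linked comments describe raising the kubelet log rotation limits so the job log is not rotated away mid-run. A sketch for k3s (values are examples; the config path and flag syntax may vary by k3s version):

# /etc/rancher/k3s/config.yaml
kubelet-arg:
  - "containerLogMaxFiles=4"
  - "containerLogMaxSize=500Mi"

Then restart k3s (systemctl restart k3s) and re-run the job.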

@stefanpinter

A very similar problem, possibly with the same cause, is now happening with a workflow.
A node ends with success, but the next node does not start at all.
The workflow says "Running" and the next node stays gray and never runs.

@shanemcd
Member

This is a duplicate of either #11594 (comment) or #11338
