Jobs with long running tasks and Inventory sync terminating after 5 minutes #12530
Comments
It's not only the inventory that times out after 5 minutes: any job with long-running tasks (whether or not they produce output, e.g. a simple pause of more than 5 minutes) is killed after 5 minutes. |
Can you find any related logs from around the time it timed out? What information do the API details show at /api/v2/inventory_updates/(id)/? |
There is nothing in the job logs that shows any error. When running this simple playbook:
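The playbook itself is not preserved in this transcript; a minimal sketch that matches the symptom described in the thread (a task that stays silent for more than five minutes) could look like the following. The play and task names are illustrative assumptions, not the poster's original content.

```yaml
# Illustrative reproducer (assumed, not the original playbook from this comment):
# a single task that produces no output for longer than 5 minutes.
- name: Long silent task
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Pause for 6 minutes without printing anything
      ansible.builtin.pause:
        minutes: 6
```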
I now ran strace on the ansible-runner process in the execution environment and found that, after 5 minutes, it receives a SIGTERM. |
I did some further stracing of the processes in the execution environment pod and found that the init process, dumb-init, is receiving the SIGTERM signal, so it looks like the pod is being terminated from "outside". So the question remains: who is terminating my pod, AWX or Kubernetes? |
I have the same problem: all plays that take longer than 5 minutes (without posting any output) stop with the status: The only thing in the log is: I also tested it with the It seems that if there is no output sent to AWX from the playbook/pod for 5 minutes, some kind of timeout mechanism kicks in and the pod gets terminated. I have the k8s cluster in AKS. |
Looks like this may be somehow related to AKS; my AWX is also running in AKS. What Kubernetes version are you running? |
We are using version 1.23.x, recently upgraded, but we also had this issue before the upgrade, so it may have started with 1.22.x. I have checked all the settings in AKS; the only timeout settings are the ones I already mentioned, and they are set well beyond the 5-minute threshold. Maybe you could check the timeout settings of the LB in the AKS resource group, as mentioned here: #12297 (comment) |
I was interested in this issue and decided to set up a single-node AKS cluster to try it out, and sure enough, the issue reproduced. The automation-job pod is killed around 5 minutes after it starts.
$ kubectl -n awx describe pod automation-job-4-vqsht
Name: automation-job-4-vqsht
Namespace: awx
Priority: 0
Node: aks-agentpool-13146441-vmss000000/10.224.0.4
Start Time: Wed, 20 Jul 2022 12:55:24 +0000
Labels: ansible-awx=12e4dc0b-28c5-418d-b7a8-30803b17dfad
ansible-awx-job-id=4
Annotations: <none>
Status: Terminating (lasts <invalid>)
Termination Grace Period: 30s
IP: 10.244.0.14
IPs:
IP: 10.244.0.14
Containers:
worker:
Container ID: containerd://72b5fe1ac8c1a827755c0647ae25dfb9a8037dd40c434f10e416a1329f52bf6a
Image: quay.io/ansible/awx-ee:latest
Image ID: quay.io/ansible/awx-ee@sha256:833fcdf5211265040de6c4509130f04555f8a48548620172b73dfad949d6ea0c
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 5m10s default-scheduler Successfully assigned awx/automation-job-4-vqsht to aks-agentpool-13146441-vmss000000
Normal Pulling 5m9s kubelet Pulling image "quay.io/ansible/awx-ee:latest"
Normal Pulled 5m9s kubelet Successfully pulled image "quay.io/ansible/awx-ee:latest" in 782.62126ms
Normal Created 5m9s kubelet Created container worker
Normal Started 5m9s kubelet Started container worker 👈👈👈
Normal Killing 1s kubelet Stopping container worker 👈👈👈
In the kubelet log on the AKS node, "superfluous" HTTP responses are logged just before the termination of the automation-job pod starts.
And by increasing log level for kubelet, the
My understanding is that this log stream is opened by Receptor to follow output from pods via Stream. I still don't know the root cause of this 400 from the kubelet, but is it possible that Receptor got a 400 from the API, which disconnected Receptor's stream and led to the termination of the job? |
F.Y.I. for digging: to create an interactive shell connection to an AKS node (https://docs.microsoft.com/en-us/azure/aks/node-access):
kubectl debug node/<node> -it --image=mcr.microsoft.com/dotnet/runtime-deps:6.0
chroot /host
bash
Increasing the log level for kubelet on the AKS node:
# vi /etc/systemd/system/kubelet.service
...
ExecStart=/usr/local/bin/kubelet \
--enable-server \
--node-labels="${KUBELET_NODE_LABELS}" \
--v=4 \ 👈👈👈
--volume-plugin-dir=/etc/kubernetes/volumeplugins \
$KUBELET_TLS_BOOTSTRAP_FLAGS \
$KUBELET_CONFIG_FILE_FLAGS \
...
# systemctl daemon-reload
# systemctl restart kubelet |
I've already looked at the kubelet logs, but I'm not sure if that message is related to our issue because this "superfluous" HTTP response is also logged for successful jobs... |
I guess every playbook in AWX gets the HTTP response that kills the automation pod. The question is: what is sending this after the 5 minutes (timeout)? Since this only happens with AKS, it must be something within Azure...? @kurokobo, thank you for your time, confirmation, and investigation! May I ask what version of k8s you were running in AKS? |
@cmasopust @Parkhost In the successful case, a "superfluous" response is recorded after the termination process for the container and pod has been initiated.
13:20:05.859584 4474 manager.go:1048] Destroyed container: "/kubepods/burstable/podb86aae97-f997-4bf5-a722-9d6a306b1c43/b6db6d5e42ac4ff209d65f5ef5af916ef51e1d3c0bce27603ab9f8a15beb7f22" (aliases: [b6db6d5e42ac4ff209d65f5ef5af916ef51e1d3c0bce27603ab9f8a15beb7f22 /kubepods/burstable/podb86aae97-f997-4bf5-a722-9d6a306b1c43/b6db6d5e42ac4ff209d65f5ef5af916ef51e1d3c0bce27603ab9f8a15beb7f22], namespace: "containerd")
...
13:20:06.206528 4474 generic.go:296] "Generic (PLEG): container finished" podID=b86aae97-f997-4bf5-a722-9d6a306b1c43 containerID="b6db6d5e42ac4ff209d65f5ef5af916ef51e1d3c0bce27603ab9f8a15beb7f22" exitCode=0
...
13:20:06.206725 4474 kubelet_pods.go:1459] "Got phase for pod" pod="awx/automation-job-4-k4k5p" oldPhase=Running phase=Succeeded
13:20:06.206779 4474 kubelet.go:1546] "syncPod exit" pod="awx/automation-job-4-k4k5p" podUID=b86aae97-f997-4bf5-a722-9d6a306b1c43 isTerminal=true
13:20:06.206796 4474 pod_workers.go:978] "Pod is terminal" pod="awx/automation-job-4-k4k5p" podUID=b86aae97-f997-4bf5-a722-9d6a306b1c43 updateType=0
13:20:06.206809 4474 pod_workers.go:1023] "Pod indicated lifecycle completed naturally and should now terminate" pod="awx/automation-job-4-k4k5p" podUID=b86aae97-f997-4bf5-a722-9d6a306b1c43
...
13:20:06.228010 4474 status_manager.go:682] "Patch status for pod" pod="awx/automation-job-4-k4k5p" patch="{\"metadata\":{\"uid\":\"b86aae97-f997-4bf5-a722-9d6a306b1c43\"},\"status\":{\"$setElementOrder/conditions\":[{\"type\":\"Initialized\"},{\"type\":\"Ready\"},{\"type\":\"ContainersReady\"},{\"type\":\"PodScheduled\"}],\"conditions\":[{\"reason\":\"PodCompleted\",\"type\":\"Initialized\"},{\"lastTransitionTime\":\"2022-07-21T13:20:06Z\",\"reason\":\"PodCompleted\",\"status\":\"False\",\"type\":\"Ready\"},{\"lastTransitionTime\":\"2022-07-21T13:20:06Z\",\"reason\":\"PodCompleted\",\"status\":\"False\",\"type\":\"ContainersReady\"}],\"containerStatuses\":[{\"containerID\":\"containerd://b6db6d5e42ac4ff209d65f5ef5af916ef51e1d3c0bce27603ab9f8a15beb7f22\",\"image\":\"quay.io/ansible/awx-ee:latest\",\"imageID\":\"quay.io/ansible/awx-ee@sha256:cd8c98f825884cfb2e7765842755bde165d22bb3bb8f637646e8483391af9921\",\"lastState\":{},\"name\":\"worker\",\"ready\":false,\"restartCount\":0,\"started\":false,\"state\":{\"terminated\":{\"containerID\":\"containerd://b6db6d5e42ac4ff209d65f5ef5af916ef51e1d3c0bce27603ab9f8a15beb7f22\",\"exitCode\":0,\"finishedAt\":\"2022-07-21T13:20:05Z\",\"reason\":\"Completed\",\"startedAt\":\"2022-07-21T13:18:59Z\"}}}]}}"
...
13:20:06.228577 4474 kubelet.go:2117] "SyncLoop RECONCILE" source="api" pods=[awx/automation-job-4-k4k5p]
👉13:20:06.528412 4474 log.go:184] http: superfluous response.WriteHeader call from k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog.(*respLogger).WriteHeader (httplog.go:269)
👉13:20:06.528468 4474 httplog.go:109] "HTTP" verb="GET" URI="/containerLogs/awx/automation-job-4-k4k5p/worker?follow=true" latency="1m6.436194772s" userAgent="Go-http-client/1.1" audit-ID="" srcIP="10.244.0.12:49648" resp=400
13:20:06.536103 4474 kubelet.go:2120] "SyncLoop DELETE" source="api" pods=[awx/automation-job-4-k4k5p]
...
13:20:07.208846 4474 kubelet.go:1785] "syncTerminatingPod enter" pod="awx/automation-job-4-k4k5p" podUID=b86aae97-f997-4bf5-a722-9d6a306b1c43
13:20:07.212456 4474 kubelet_pods.go:1447] "Generating pod status" pod="awx/automation-job-4-k4k5p"
...
13:20:07.213146 4474 kubelet.go:1817] "Pod terminating with grace period" pod="awx/automation-job-4-k4k5p" podUID=b86aae97-f997-4bf5-a722-9d6a306b1c43 gracePeriod=<nil>
However, in the failing case, the "superfluous" response appears at the beginning of the pod's termination. After this log, the pod is marked for graceful deletion.
👉13:28:13.144435 4474 log.go:184] http: superfluous response.WriteHeader call from k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog.(*respLogger).WriteHeader (httplog.go:269)
👉13:28:13.144497 4474 httplog.go:109] "HTTP" verb="GET" URI="/containerLogs/awx/automation-job-6-rhs8w/worker?follow=true" latency="5m5.473488996s" userAgent="Go-http-client/1.1" audit-ID="" srcIP="10.244.0.11:33806" resp=400
...
13:28:14.755714 4474 kubelet.go:2120] "SyncLoop DELETE" source="api" pods=[awx/automation-job-6-rhs8w]
13:28:14.755747 4474 pod_workers.go:625] "Pod is marked for graceful deletion, begin teardown" pod="awx/automation-job-6-rhs8w" podUID=d51c5853-65d4-4823-8ea5-27a89272af18
13:28:14.755790 4474 pod_workers.go:888] "Processing pod event" pod="awx/automation-job-6-rhs8w" podUID=d51c5853-65d4-4823-8ea5-27a89272af18 updateType=1
13:28:14.755806 4474 pod_workers.go:1005] "Pod worker has observed request to terminate" pod="awx/automation-job-6-rhs8w" podUID=d51c5853-65d4-4823-8ea5-27a89272af18
13:28:14.755819 4474 kubelet.go:1785] "syncTerminatingPod enter" pod="awx/automation-job-6-rhs8w" podUID=d51c5853-65d4-4823-8ea5-27a89272af18
13:28:14.755831 4474 kubelet_pods.go:1447] "Generating pod status" pod="awx/automation-job-6-rhs8w"
13:28:14.755870 4474 kubelet_pods.go:1459] "Got phase for pod" pod="awx/automation-job-6-rhs8w" oldPhase=Running phase=Running
13:28:14.756073 4474 kubelet.go:1815] "Pod terminating with grace period" pod="awx/automation-job-6-rhs8w" podUID=d51c5853-65d4-4823-8ea5-27a89272af18 gracePeriod=30
13:28:14.756148 4474 kuberuntime_container.go:719] "Killing container with a grace period override" pod="awx/automation-job-6-rhs8w" podUID=d51c5853-65d4-4823-8ea5-27a89272af18 containerName="worker" containerID="containerd://0929b4635502f8e8fc5d644992fd1c288398bc79c1c084ab4ae95e1cc3c995ac" gracePeriod=30
13:28:14.756165 4474 kuberuntime_container.go:723] "Killing container with a grace period" pod="awx/automation-job-6-rhs8w" podUID=d51c5853-65d4-4823-8ea5-27a89272af18 containerName="worker" containerID="containerd://0929b4635502f8e8fc5d644992fd1c288398bc79c1c084ab4ae95e1cc3c995ac" gracePeriod=30
...
13:28:14.763017 4474 kubelet_pods.go:939] "Pod is terminated, but some containers are still running" pod="awx/automation-job-6-rhs8w"
...
13:28:18.864941 4474 manager.go:1048] Destroyed container: "/kubepods/burstable/podd51c5853-65d4-4823-8ea5-27a89272af18/0929b4635502f8e8fc5d644992fd1c288398bc79c1c084ab4ae95e1cc3c995ac" (aliases: [0929b4635502f8e8fc5d644992fd1c288398bc79c1c084ab4ae95e1cc3c995ac /kubepods/burstable/podd51c5853-65d4-4823-8ea5-27a89272af18/0929b4635502f8e8fc5d644992fd1c288398bc79c1c084ab4ae95e1cc3c995ac], namespace: "containerd")
For anyone interested in this, I have attached the full logs (/var/log/messages), including kubelet output at increased logging level. These are the pod details for my job; they are available in the log.
@Parkhost |
It seems that AKS, at least a few years ago, had a behavior that terminated the connection to the API server if no data was sent for a certain period of time, even within the cluster: Azure/AKS#1755 In fact, running Thus, a workaround seems to be to have the EE pod periodically send some data to the stream, like a keep-alive. I modified the entrypoint of the EE to keep |
Can you show me how you modified the entrypoint? I tried something similar: I just connected to the pod and ran the echo command there, but the exit status of the task is then incorrect and therefore the whole playbook stops (have you tried having an additional task following the pause task in the playbook?) |
Didn't expect this to work... I have created a new AKS cluster with version 1.24, and it seems the automation pod doesn't terminate at the 5-minute timeout. I am skeptical about it; I don't think MS even knows this bug exists in versions 1.22 and 1.23, and since 1.24 is still in preview (in Azure) I'm not sure it is a valid option... maybe it breaks other things? On the other hand, AWX isn't meant to run in production either 😅 Some proof... Cluster info:
Pod info:
The pod is still going strong; I guess it is going to reach the magic number of
@kurokobo, how did you capture the 'describe' output of the pod after it was stopped/terminated? I tried to get the describe output after the play finished, but kubectl errors out because the pod no longer exists. Would you share your kubectl voodoo? |
@cmasopust
Here is my workaround:
Entrypoint script, save as:
#!/usr/bin/env bash
# In OpenShift, containers are run as a random high number uid
# that doesn't exist in /etc/passwd, but Ansible module utils
# require a named user. So if we're in OpenShift, we need to make
# one before Ansible runs.
if [[ (`id -u` -ge 500 || -z "${CURRENT_UID}") ]]; then
# Only needed for RHEL 8. Try deleting this conditional (not the code)
# sometime in the future. Seems to be fixed on Fedora 32
# If we are running in rootless podman, this file cannot be overwritten
ROOTLESS_MODE=$(cat /proc/self/uid_map | head -n1 | awk '{ print $2; }')
if [[ "$ROOTLESS_MODE" -eq "0" ]]; then
cat << EOF > /etc/passwd
root:x:0:0:root:/root:/bin/bash
runner:x:`id -u`:`id -g`:,,,:/home/runner:/bin/bash
EOF
fi
cat <<EOF > /etc/group
root:x:0:runner
runner:x:`id -g`:
EOF
fi
if [[ -n "${LAUNCHED_BY_RUNNER}" ]]; then
RUNNER_CALLBACKS=$(python3 -c "import ansible_runner.callbacks; print(ansible_runner.callbacks.__file__)")
# TODO: respect user callback settings via
# env ANSIBLE_CALLBACK_PLUGINS or ansible.cfg
export ANSIBLE_CALLBACK_PLUGINS="$(dirname $RUNNER_CALLBACKS)"
fi
if [[ -d ${AWX_ISOLATED_DATA_DIR} ]]; then
if output=$(ansible-galaxy collection list --format json 2> /dev/null); then
echo $output > ${AWX_ISOLATED_DATA_DIR}/collections.json
fi
ansible --version | head -n 1 > ${AWX_ISOLATED_DATA_DIR}/ansible_version.txt
fi
SCRIPT=/usr/local/bin/dumb-init
# NOTE(pabelanger): Downstream we install dumb-init from RPM.
if [ -f "/usr/bin/dumb-init" ]; then
SCRIPT=/usr/bin/dumb-init
fi
while /bin/true; do
sleep 120
echo '{"event": "FLUSH", "uuid": "keepalive", "counter": 0, "end_line": 0}' > /proc/1/fd/1
done &
exec $SCRIPT -- "${@}"
Create a configmap from the script above (a sketch follows below this comment):
|
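The exact ConfigMap and Container Group steps are truncated above. As a rough sketch of how the pieces could fit together (every resource name, mount path, and the pod-spec layout here are assumptions, not taken from the thread): store the entrypoint script in a ConfigMap, then reference it from a custom pod spec in an AWX Container Group so the worker container starts through the keep-alive wrapper.

```yaml
# Hypothetical sketch only: names, paths, and layout are assumptions.
apiVersion: v1
kind: ConfigMap
metadata:
  name: ee-keepalive-entrypoint
  namespace: awx
data:
  entrypoint.sh: |
    #!/usr/bin/env bash
    # ... the entrypoint script shown above ...
---
# Custom pod spec to paste into an AWX Container Group
# (AWX > Instance Groups > Add > Container group > Customize pod specification).
apiVersion: v1
kind: Pod
metadata:
  namespace: awx
spec:
  serviceAccountName: default
  automountServiceAccountToken: false
  containers:
    - name: worker
      image: quay.io/ansible/awx-ee:latest
      # Run the wrapper instead of the image's default entrypoint;
      # the args are passed through to dumb-init/ansible-runner by the script.
      command:
        - /usr/local/bin/entrypoint.sh
      args:
        - ansible-runner
        - worker
        - --private-data-dir=/runner
      volumeMounts:
        - name: keepalive-entrypoint
          mountPath: /usr/local/bin/entrypoint.sh
          subPath: entrypoint.sh
  volumes:
    - name: keepalive-entrypoint
      configMap:
        name: ee-keepalive-entrypoint
        defaultMode: 0755
```

Job templates and inventory sources then need to be pointed at this container group so that their automation-job pods use the modified entrypoint.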
I upgraded the 'production' AKS cluster from 1.23.8 to 1.24.0, but it didn't solve the issue the way my newly created cluster did (still errors after +/- 5 minutes) 😕 |
Maybe it has something to do with the network profile settings? My "timeout/failing" cluster is using: While my "success" cluster is using: I'm going to try it out over the weekend. |
@Parkhost I don't think it's related to the Kubernetes network. We first had a cluster with Azure CNI and Kubernetes version 1.21.x where no timeout happened. Then our Azure team redeployed the cluster with Kubenet and version 1.22.6 and we now have the timeout. |
@kurokobo: thanks a lot for providing detailed instructions (for someone who's at the beginning of his Kubernetes journey 😄). I've implemented your workaround and can confirm that my long-running job now succeeds. But when looking at the awx-task logs I see the following errors each time the pod sends the keep-alive:
2022-07-23 08:49:53,460 ERROR [-] awx.main.commands.run_callback_receiver Detail: Traceback (most recent call last):
So I think we should try to find out what the receiver is expecting; maybe there's some no-op data we can send to it. I'll try to find out more during the weekend. |
Looks like adding "event": "FLUSH" to the JSON data no longer produces the errors in awx-task, and the playbook still succeeds. A quick look at the AWX code shows that an event of type FLUSH skips any further processing of the received data and therefore produces no exception 😄 (see awx/main/dispatch/worker/callback.py). But whether that's really a good idea, or the right event to send for a keep-alive, can only be answered by the AWX devs. |
Have some bad news: although the workaround works perfectly well now for playbook runs, it breaks the inventory updates! |
@cmasopust
echo '{"event": "FLUSH", "uuid": "keepalive", "counter": 0, "end_line": 0}' > /proc/1/fd/1
I've updated my
Also, regarding my concern in the previous comment: removing the sleep and allowing as many repeated echoes as possible did not corrupt the output JSON (tried dozens of times).
So, I believe it's not the smartest way, but it is a workaround that works. Of course, the best way to handle this would be for AKS to support long-lived streams by making some changes on their side. Hope Parkhost finds something 😃 |
@kurokobo you're my hero 😄 can confirm that the inventory update is now working in my AWX too! |
We also have this issue on our AKS test (running Kubernetes 1.24) and production (running Kubernetes 1.23.8) cluster. We applied the workaround supplied by @kurokobo and I can also confirm that our jobs are now finishing as expected. Great! |
AWX: 21.3.0
Kubernetes Client Version: v1.24.2
Kubernetes Server Version: v1.24.3
Kustomize Version: v4.5.4
I got very confused here, as I am new to both Kube and AWX, and tried creating a hotfix.yaml and applying it via kubectl. That's completely wrong; what's actually needed is to set up a new Instance Group under the AWX Administration tab and make sure the templates/syncs use the new instance group. Unfortunately for me, that did not resolve my issue, probably because in my case the automation-job does not get killed after 5 minutes: it just breaks after 5-6.5 minutes and gets killed about 2 minutes after that. While in the 'broken' state it stops outputting to the AWX console (at a random point in the playbook each time), but it still seems to run the playbook for a while when I observe the process list on the Kubernetes host. |
@C0rn3j AWX has an issue with jobs aborting if kubelet log rotation occurs while the job is running: #10366 and #11338. Kubelet in AKS has a default log size limit of 10 MB, so if the log of the automation-job pod exceeds this, the job will be terminated. |
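For reference, that size ceiling comes from the kubelet's container log rotation settings. The fields below are from the upstream KubeletConfiguration API; the 50Mi value is only an example, and on AKS these values are normally applied through the cluster's custom node configuration rather than by editing nodes directly.

```yaml
# Upstream kubelet configuration fields controlling container log rotation
# (example values; on AKS apply via custom node configuration).
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: 50Mi   # default is 10Mi; logs are rotated above this size
containerLogMaxFiles: 5     # number of rotated log files kept per container
```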
In this (longish) comment I will address the 5-minute timeout issue in AKS only, not the log size issue or the 4-hour timeout issue. As Microsoft partners, we have escalated this issue and engaged engineers to identify the root cause and discuss options.

The issue relates to the method by which the Azure-managed control plane is made available to the customer-managed vnet containing the worker nodes and pods. In previous versions AKS relied on tunneld/openvpn to build the tunnel to the vnet, but on the recommendation of the k8s community they now use Konnectivity. The issue is AKS-specific because they added custom code imposing a 5-minute timeout as a precaution against leaking connections. As this GitHub issue shows, it is very difficult for end users and developers to diagnose, because the problem appears to be with the API server, while it is actually with the tunnel to the API server, which is not visible to customers.

A Microsoft engineer confirmed this diagnosis with the Pause playbook: it fails after 5 minutes, and after the Konnectivity timeout was adjusted to 7 minutes it failed at 7 minutes. The feedback we received was that adjusting this timeout would have a performance impact across clusters using Konnectivity and therefore will not be done. The AKS Konnectivity feature owners suggest that AWX should instead send keepalives and/or gracefully handle terminated http2 streams.

Alternatively, the API Server VNet Integration preview feature is available (albeit currently only in USA regions), which, instead of setting up a tunnel:
The Microsoft engineer has confirmed with the same Pause playbook that the issue does not occur when AKS is deployed with this preview feature. Additional links from our ticket:
|
@kurokobo thank you for your workaround. You saved me! :) I can confirm that it is working on AWX version 21.6.0 and AKS 1.23.12. |
@kurokobo I also thank you for your workaround. It was a life saver. |
A native periodic keep-alive message has been implemented in Ansible Runner 2.3.2, and AWX has been updated to use it as of 21.14.0. To activate the keep-alive message:
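The concrete activation steps are truncated in this transcript. As a heavily hedged sketch only: it requires an execution environment built with Ansible Runner 2.3.2 or later, AWX 21.14.0 or later, and a non-zero keep-alive interval configured in AWX. The setting name used below is an assumption and is not confirmed in this thread; verify it against the AWX documentation for your version.

```yaml
# Hypothetical AWX Operator spec fragment: the setting name below is an
# assumption, not confirmed in this thread -- check your AWX version's docs.
spec:
  extra_settings:
    - setting: K8S_ANSIBLE_RUNNER_KEEPALIVE_SECONDS
      value: "60"   # emit a keep-alive event every 60 seconds (0 disables it)
```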
@AlanCoding @nitzmahone |
Please confirm the following
Bug Summary
In my AWX I have an inventory that is synchronized from my Katello server. For some time now, the synchronization has been taking approx. 9 minutes (tested offline with the ansible-inventory command).
When I now try to synchronize the inventory in AWX, the job starts but then it is killed after 5 minutes.
I already changed the "Default Inventory Update Timeout" to 900s, but the synchronization job still terminates after 5 minutes.
AWX version
21.3.0
Select the relevant components
Installation method
kubernetes
Modifications
no
Ansible version
2.13.1
Operating system
No response
Web browser
Chrome
Steps to reproduce
Synchronize the Satellite (Katello) repository; Katello contains approx. 500 hosts
Expected results
Repository synchronized successfully
Actual results
Synchronization is terminated after 5 minutes
Additional information
No response