AWX 19.2.0 - Job Output not in sync - /var/run/receptor/receptor.sock Deadlock ? #10366

Closed
lucabrasi83 opened this issue Jun 5, 2021 · 22 comments


@lucabrasi83

ISSUE TYPE
  • Bug Report
SUMMARY

Hi,

For some jobs, we're experiencing out-of-sync AWX job output, such as below, where the job is marked as failed with no summary:

[Screenshot: job detail view showing the job marked as failed with no output summary]

The issue looks similar to #9967, although I don't get any errors in the logs of the awx-web and awx-task containers.

From the logs of the actual Pod running the job, I can see it completed successfully.

However, the awx-ee container logs show what looks like a deadlock connection error to the receptor socket file:

ERROR2021/06/05 03:18:22 Read error in control service: read unix /var/run/receptor/receptor.sock->@: use of closed network connection
INFO2021/06/05 03:18:22 Client disconnected from control service
ERROR2021/06/05 03:18:22 Error closing connection: close unix /var/run/receptor/receptor.sock->@: use of closed network connection
INFO2021/06/05 03:18:22 Client connected to control service
INFO2021/06/05 03:18:23 Control service closed

From observation, it seems to happen on fast-running jobs. One workaround I found is to enable -vvv verbosity, which tends to slow down the job execution; the job output then syncs properly with AWX.

ENVIRONMENT
  • AWX version: 19.2.0
  • AWX install method: AWX Operator - Kubernetes
  • Ansible version: 2.11.1
  • Operating System: CentOS
  • Web Browser: Chrome/Edge
STEPS TO REPRODUCE

Launch a template job

EXPECTED RESULTS

Expected to see the actual summary of the job results.

ACTUAL RESULTS
ADDITIONAL INFORMATION
@awxbot added the type:bug label Jun 5, 2021
@shanemcd
Member

shanemcd commented Jun 5, 2021

Hello. A couple questions:

Is there any traceback either in the job output or in the awx-task container?

Did you specify any custom EEs when installing, or are you using the default?

@lucabrasi83
Author

lucabrasi83 commented Jun 5, 2021

@shanemcd Thanks for the reply. See updates below:

Is there any traceback either in the job output or in the awx-task container?

No traceback in either the job output or the awx-task container.

Did you specify any custom EEs when installing, or are you using the default?

Yes, I did specify a custom EE. It's based on the awx-ee:0.3.0 image with Ansible 2.11.1 and some network collections installed.

EDIT: After further investigation, I managed to get the output synced with the AWX UI by setting no_log: yes on my tasks, which reduces the amount of output in the ansible-runner job. I don't have tangible evidence from the task/web/ee container logs to prove that the amount of logging is the issue, but disabling it in my playbook fixes the problem 10 runs out of 10.
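A minimal sketch of that workaround (the task and module below are illustrative, not taken from the actual playbook):

- name: Collect command output without emitting it as job events (illustrative)
  cisco.ios.ios_command:
    commands:
      - show running-config
  register: running_config
  no_log: true   # suppress this task's output in the AWX job stdout/events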

@shanemcd
Member

shanemcd commented Jun 5, 2021

Thank you for the information. I'm wondering if you could share some more details about your workload:

  • Number of hosts in this job's inventory
  • Number of events emitted (see /api/v2/jobs/<failed-job-id>/job_events, under count)
  • Version / distro of Kubernetes

And if possible, sharing the logs from the worker pod may help us debug. If they contain sensitive information, you can send me an encrypted email instead of posting them here. My email is on my profile, and my gpg public key is at https://github.com/shanemcd.gpg.

To obtain the logs, you can prevent the worker pods from being deleted by doing something like this:

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-awx-config
data:
  custom.py:  |
      RECEPTOR_RELEASE_WORK = False
---
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
spec:
  task_extra_volume_mounts: |
    - name: custom-py
      mountPath: /etc/tower/conf.d/custom.py
      subPath: custom.py
  extra_volumes: |
    - name: custom-py
      configMap:
        defaultMode: 420
        items:
          - key: custom.py
            path: custom.py
        name: custom-awx-config

The pod will be left behind after the job is complete so you can grab the logs.

$ kubectl logs automation-job-3-psg8l > logs

Once you are done, you can undo the custom config and AWX will clean up all of the pods it created. Thank you in advance for any information you can provide.

@lucabrasi83
Author

Please find details below:

Number of hosts in this job's inventory

35 (Cisco network devices)

Number of events emitted (see /api/v2/jobs/<failed-job-id>/job_events, under count)

"count": 252

Version / distro of Kubernetes

v1.17.7 on Centos 7.9

Job output container logs sent to your email.

Thanks

@HOSTED-POWER

We also experience really strange issues; they result in:

Task was marked as running but was not present in the job queue, so it has been marked as failed.

It seems the receptor process gets killed or stops somehow.

@pawstern95

pawstern95 commented Aug 24, 2021

Setting more inotify watches on nodes helped for me:
$ sudo sysctl fs.inotify.max_user_watches=1048576

Found this because when I was trying to check logs with --follow:
$ kubectl logs <pod> -f

I was getting this error:

failed to watch file "/var/lib/docker/containers/...-json.log": no space left on device
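If the inotify limit turns out to be the culprit, the setting can also be persisted across reboots via a sysctl.d drop-in (the file name below is just a common convention, not from the original comment):

$ echo "fs.inotify.max_user_watches = 1048576" | sudo tee /etc/sysctl.d/99-inotify.conf
$ sudo sysctl --system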

@HOSTED-POWER

Setting more inotify watches on nodes helped for me:
$ sudo sysctl fs.inotify.max_user_watches=1048576

Found this because when I was trying to check logs with --follow:
$ kubectl logs <pod> -f

I was getting this error:

failed to watch file "/var/lib/docker/containers/...-json.log": no space left on device

Unfortunately no difference for us :|

@andrejzelnik

Hello there,
did anyone find out the cause of this issue, please?

We have the same problem as @lucabrasi83.

  • AWX version: 19.2.1
  • AWX install method: AWX Operator - Kubernetes
  • Kubernetes version: 1.19.11 on Ubuntu (AKS)
  • Ansible version: 2.11.3
  • job image: ansible/awx-ee:0.6.0
  • job host: 5
  • job events: ~1500

What I have found is that the problem is probably related to the amount of log output.
When a fresh environment is deployed and every task actually has to make changes, the job fails close to the end of the play. When the same job is relaunched and most tasks report no changes because everything is already in place, it passes successfully to the end.
When a verbosity of 1 or higher is used for the job, the play fails much sooner.

@HOSTED-POWER

Is this issue related to #9594?

It's really severe on our side; the solution is almost unusable.

@oliverf1

oliverf1 commented Oct 4, 2021

I have the same issue and, like @andrejzelnik, I arrived at the same conclusion: it's due to the amount of logs.
I have a job template which fails 95% of the time when it runs in debug mode, but when it runs in normal mode it always succeeds.
After testing pretty much all the AWX versions and EE combinations, I still have the same issue.
I also noticed that if I run kubectl logs -f on the automation pods, the log stops at the very same place as in the web GUI!
At this point I started to research how the automation pods were created and how AWX gets the logs from the pod.
After a bit of code reading and observation, I understood that the receptor daemon running in the awx-ee container is responsible for spawning the automation pods, getting the logs and forwarding them to the database.
I then hacked the awx-ee container a little to include the receptorctl command so I could check the job status in receptor directly. And I saw that the logs in receptor were stopping at the very same place as kubectl logs -f!
After more reading in the receptor code, I found this. So in fact receptor is doing the equivalent of kubectl logs -f!
So my understanding is that the source of the problem is not in AWX but in the Kubernetes cluster itself, and is due to kubectl logs -f hanging.
What I think is happening:

  • AWX submits the job to receptor
  • receptor starts a new pod for this job
  • receptor attaches to the pod to get the logs and hangs at some point
  • when the job finishes, the receptor process is freed and reports to AWX that the job is finished
  • since AWX does not have all the logs, it marks the job as failed
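For anyone who wants to repeat the receptorctl check mentioned above, a rough sketch of the commands (the socket path matches the one in the issue description; the work unit ID will differ):

$ receptorctl --socket /var/run/receptor/receptor.sock status
$ receptorctl --socket /var/run/receptor/receptor.sock work list
$ receptorctl --socket /var/run/receptor/receptor.sock work results <work-unit-id>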

Also, to prove that it was really not related to AWX, I saved the full logs of a problematic job template in JSON format. With that I created a very basic container that reads this file and echoes its content to stdout at the same pace as the real job template (using the timestamps in the log). Once the pod running that container starts, I attach to it with kubectl logs -f and it hangs at the same place as when it runs in AWX!

I have tested this on several k3s clusters that I own and I have the exact same issue on all of them. But if I test it on an AWS EKS cluster, it works.
So in my case it's really a k3s issue, and I have opened a support ticket since I have support for it. I will see what the outcome is.
Sorry for the long comment, but I thought it was worth sharing this in detail.

HTH

@lucabrasi83
Author

Thanks @oliverf1 for the findings.

Looking at the K8S docs regarding logging, it seems there is a default log size limit dictated by the kubelet / CRI: https://kubernetes.io/docs/concepts/cluster-administration/logging/

When using a CRI container runtime, the kubelet is responsible for rotating the logs and managing the logging directory structure. The kubelet sends this information to the CRI container runtime and the runtime writes the container logs to the given location. The two kubelet parameters containerLogMaxSize and containerLogMaxFiles in kubelet config file can be used to configure the maximum size for each log file and the maximum number of files allowed for each container respectively.

For example, that can be set in Docker as per the following doc: https://docs.docker.com/config/containers/logging/local/
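For the Docker runtime specifically, a hedged example of raising the json-file log limits in /etc/docker/daemon.json (values are illustrative; this targets the json-file driver that the kubelet reads container logs from, rather than the 'local' driver the linked page describes):

{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m",
    "max-file": "4"
  }
}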

@oliverf1

oliverf1 commented Oct 4, 2021

@lucabrasi83 I will check that in more detail tomorrow, but I'm not sure it will help. For example, if I cat the whole log file in one shot, it works, no hang. Also, if I print the log file at a slower pace, like one line per second, it also works. The only way I found to reproduce the issue is to echo the log at the same pace as the job template does.
So it does not look to be a size or a timeout issue, but a combination of the two.
Which Kubernetes distribution are you running on?

@Kardi5

Kardi5 commented Oct 11, 2021

@oliverf1 Hi, interesting that you saw the log stopping in the running kubectl as well. We observed that using kubectl logs -f actually works for us even when AWX did not. As of now we only have one reliable test case: start a script via a task in an Ansible playbook and sleep for more than 3600s (e.g. 3800s). After exactly 3600s AWX will error out and the receptorctl status for the job will read "unexpected EOF", originating from the io.Copy call in the receptor binary (compiling the binary with a different method does not change this). Other jobs will randomly be marked as failed (the conditions mainly seem to be any kind of waiting for 1-10 min and/or many hosts, 30+), making it hard to debug.

Digging deeper, we ran tcpdump on the awx-ee node, recording the traffic. For the traffic from our kubectl (v1.20) there are L5 packets even if no data is visibly sent/received by kubectl (probably some kind of application-layer heartbeat). Keepalives are also sent and received every 30s. For the client-go package (v0.18.6 per https://github.com/ansible/receptor/blob/1.0.0a2/go.mod) there are keepalives but no L5 traffic, and after exactly 3600s the kube apiserver sends a TCP FIN packet signaling the end of the connection. Because this is a managed service, we have limited access to the k8s control plane to debug (Azure).

For important jobs/tasks we have introduced the async/poll feature to create some application-layer (L5) noise; this works as a workaround for us (see the sketch below).
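A minimal sketch of that async/poll pattern (the script path and timing values are illustrative):

- name: Long-running step kept polled so the log stream stays busy (illustrative)
  ansible.builtin.command: /usr/local/bin/long_task.sh
  async: 4000   # allow the task to run for up to 4000 seconds
  poll: 30      # check on it every 30 seconds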

Azure Kubernetes (v1.20) Service / Operator 0.12 / AWX 19.2.2 / Some customization (mounts and custom user EE)

@nicovs

nicovs commented Oct 15, 2021

@oliverf1, I came to the same conclusion as you: the job fails when the logs get rotated by k3s. So it's indeed not an AWX problem, but k3s.
However, I think I found the solution for this, by changing the log file size for k3s.
Following "SUSE: How to enable container log rotation with k3s or containerd",
I 'updated' and reconfigured my k3s by running:

curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC='--kubelet-arg "container-log-max-files=4" --kubelet-arg "container-log-max-size=50Mi"' sh -

My logs didn't get rotated anymore, and my k3s instance is now configured to keep 4 log files of 50MB, allowing up to 200MB of total log space to be retained per container.

[Screenshot: /var/log/pods listing (varlogpods_2021-10-15_09h18)]

Sidenote: I haven't had any log bigger than 50MB yet, so I don't know what happens when that 2nd, 3rd, etc. log file gets created.

@oliverf1

oliverf1 commented Oct 15, 2021

@nicovs That is a very good catch! I can confirm that kubectl logs -f indeed stops when the log file at the node level is rotated.
So basically the Kubernetes API is not aware that the log file has changed and does not switch to the new file.
This is the way Kubernetes works as of now, so this can happen on any Kubernetes distribution, depending on the log file size.
This also lines up with @lucabrasi83's comment.
There is a feature request open since 2018 to implement this in Kubernetes. Reading the issue, now that the prerequisites are ready it might eventually be implemented.
But it seems to me it's an AWX/receptor bug: I think the receptor daemon should handle this case. Increasing the log file size can only be a workaround. If it's not a bug, it's a design issue.
Any thoughts?

@Kardi5

Kardi5 commented Oct 26, 2021

Just wanted to confirm that the same options @nicovs used also work for us on Azure Kubernetes Service (AKS) to fix the truncated log output.
With AKS it's a bit tricky to change the kubelet config (the nice way is still in preview). Based on work from NGINX (https://docs.nginx.com/nginx-service-mesh/get-started/kubernetes-platform/aks/) we used a DaemonSet to configure the container log size on AKS:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    component: custom-kubelet-logsize
  name: custom-kubelet-logsize
  namespace: kube-system
spec:
  selector:
    matchLabels:
      component: custom-kubelet-logsize
      tier: node
  template:
    metadata:
      labels:
        component: custom-kubelet-logsize
        tier: node
    spec:
      containers:
      - name: custom-kubelet-logsize
        image: alpine
        imagePullPolicy: IfNotPresent
        command:
          - nsenter
          - --target
          - "1"
          - --mount
          - --uts
          - --ipc
          - --net
          - --pid
          - --
          - sh
          - -c
          - |
            if ! $(grep -q container-log-max-size /etc/default/kubelet); then sed -i 's/--max-pods=30/--max-pods=30 --container-log-max-size=500Mi/g' /etc/default/kubelet; systemctl restart kubelet; fi
            sleep infinity            
        resources:
          requests:
            cpu: 50m
        securityContext:
          privileged: true
      dnsPolicy: ClusterFirst
      hostPID: true
      tolerations:
      - effect: NoSchedule
        operator: Exists
      restartPolicy: Always
  updateStrategy:
    type: OnDelete

You may need to adapt this config. We used 500Mi as we have a lot of machines and tasks; I think the old log size was 10 or 40MB.

If you want to check your kubelet config (before/after), you can use the kubectl debug command.

kubectl debug node/<nodename> -it --image=mcr.microsoft.com/aks/fundamental/base-ubuntu:v0.0.11
cat /host/etc/default/kubelet

Ref: https://docs.microsoft.com/en-us/azure/aks/ssh#create-the-ssh-connection-to-a-linux-node

Sorry if this was too off-topic/Azure specific. Maybe this helps @andrejzelnik

@omicronn88

omicronn88 commented Dec 15, 2021

I can also confirm that increasing the container log file size solves this problem. With an on-premise k8s cluster installed with kubeadm, you can do it by adding containerLogMaxSize: 500Mi to /var/lib/kubelet/config.yaml and then restarting the kubelet service.

You may also want to decrease the number of retained files by adjusting the containerLogMaxFiles option so as not to run out of disk space. But keep in mind that it cannot be less than 2.
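For reference, a minimal sketch of the relevant fields in /var/lib/kubelet/config.yaml (only these two settings are the point; the rest of an existing KubeletConfiguration is omitted), followed by the restart:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: 500Mi
containerLogMaxFiles: 2

$ sudo systemctl restart kubelet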

@andrejzelnik

Thank you @Kardi5, I have tried out your proposal and it works now.

@Tioborto
Contributor

Hello,

Here we can't modify the kubelet config on our GKE cluster. We use the official Google image and we don't want to manage our own image.

This is a real problem for us. We have users with jobs in "ERROR" status while we have the full logs in our external tool.
Why is this not a top priority? This is a regression from previous versions.

@MalfuncEddie

Increasing the log size works, but it's not much of a solution when using a managed k8s service where you cannot access this setting :-(

@oliverf1

Even if you can change the log size, increasing it over and over is not a solution. In my case I have a very large job running on many nodes that generates about 600MB of logs, so to run it successfully my log size would need to be about 800MB. That would mean each and every container running in this cluster could write up to 800MB of logs. This is not sustainable. So my solution was to run this job ... outside of AWX.
I guess the only real fix will be for kubernetes/kubernetes#59902 to be implemented.
I'm also wondering if the same issue occurs on OpenShift.

@shanemcd
Member

Closing this in favor of #11338
