AWX 19.2.0 - Job Output not in sync - /var/run/receptor/receptor.sock Deadlock ? #10366

Closed
lucabrasi83 opened this issue Jun 5, 2021 · 22 comments


@lucabrasi83

ISSUE TYPE
  • Bug Report
SUMMARY

Hi,

For some jobs, we're experiencing out-of-sync AWX job output, such as below, where the job is marked as failed with no summary:

[Screenshot: job detail view showing the job marked as failed with no output summary]

The issue looks similar to #9967, although I don't get any errors in the logs of the awx-web and awx-task containers.

From the logs of the actual Pod running the job, I can see it completed successfully.

However, the awx-ee container logs show what looks like a deadlock connection error to the receptor socket file:

ERROR2021/06/05 03:18:22 Read error in control service: read unix /var/run/receptor/receptor.sock->@: use of closed network connection
INFO2021/06/05 03:18:22 Client disconnected from control service
ERROR2021/06/05 03:18:22 Error closing connection: close unix /var/run/receptor/receptor.sock->@: use of closed network connection
INFO2021/06/05 03:18:22 Client connected to control service
INFO2021/06/05 03:18:23 Control service closed

From observation, it seems to happen on fast-running jobs. One workaround I found is to enable -vvv verbosity, which tends to slow down the job execution; the job output then syncs properly with AWX.

ENVIRONMENT
  • AWX version: 19.2.0
  • AWX install method: AWX Operator - Kubernetes
  • Ansible version: 2.11.1
  • Operating System: CentOS
  • Web Browser: Chrome/Edge
STEPS TO REPRODUCE

Launch a template job

EXPECTED RESULTS

Expected to see the actual summary of the job results.

ACTUAL RESULTS
ADDITIONAL INFORMATION
@awxbot added the type:bug label Jun 5, 2021
@shanemcd
Member

shanemcd commented Jun 5, 2021

Hello. A couple questions:

Is there any traceback either in the job output or in the awx-task container?

Did you specify any custom EEs when installing, or are you using the default?

@lucabrasi83
Author

lucabrasi83 commented Jun 5, 2021

@shanemcd Thanks for the reply. See updates below:

Is there any traceback either in the job output or in the awx-task container?

No traceback in either the job output or the awx-task container.

Did you specify any custom EEs when installing, or are you using the default?

Yes, I did specify a custom EE. It's based on the awx-ee:0.3.0 image with Ansible 2.11.1 and some network collections installed.

EDIT: After further investigation, I managed to get the output synced with the AWX UI by setting no_log: yes on my tasks, which reduces the amount of output in the ansible-runner job. I don't have tangible evidence from the task/web/ee container logs to prove that the amount of logging is the issue, but disabling it in my playbook fixes the problem 10 runs out of 10.
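A minimal sketch of that workaround (the task and module below are illustrative, not taken from the actual playbook):

- name: Collect command output without emitting it as job events (illustrative)
  cisco.ios.ios_command:
    commands:
      - show running-config
  register: running_config
  no_log: true   # suppress this task's output in the AWX job stdout/events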

@shanemcd
Member

shanemcd commented Jun 5, 2021

Thank you for the information. I'm wondering if you could share some more details about your workload:

  • Number of hosts in this job's inventory
  • Number of events emitted (see /api/v2/jobs/<failed-job-id>/job_events, under count)
  • Version / distro of Kubernetes

And if possible, sharing the logs from the worker pod may help us debug. If they contain sensitive information, you can send me an encrypted email instead of posting them here. My email is on my profile, and my gpg public key is at https://github.com/shanemcd.gpg.

To obtain the logs, you can prevent the worker pods from being deleted by doing something like this:

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-awx-config
data:
  custom.py:  |
      RECEPTOR_RELEASE_WORK = False
---
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
spec:
  task_extra_volume_mounts: |
    - name: custom-py
      mountPath: /etc/tower/conf.d/custom.py
      subPath: custom.py
  extra_volumes: |
    - name: custom-py
      configMap:
        defaultMode: 420
        items:
          - key: custom.py
            path: custom.py
        name: custom-awx-config

The pod will be left behind after the job is complete so you can grab the logs.

$ kubectl logs automation-job-3-psg8l > logs

Once you are done, you can undo the custom config and AWX will clean up all of the pods it created. Thank you in advance for any information you can provide.

@lucabrasi83
Author

Please find details below:

Number of hosts in this job's inventory

35 (Cisco network devices)

Number of events emitted (see /api/v2/jobs/<failed-job-id>/job_events, under count)

"count": 252

Version / distro of Kubernetes

v1.17.7 on Centos 7.9

Job output container logs sent to your email.

Thanks

@HOSTED-POWER

We also experience really strange issues; they result in:

Task was marked as running but was not present in the job queue, so it has been marked as failed.

It seems the receptor process gets killed or stops somehow.

@pawstern95

pawstern95 commented Aug 24, 2021

Setting more inotify watches on nodes helped for me:
$ sudo sysctl fs.inotify.max_user_watches=1048576

Found this because when I was trying to check logs with --follow:
$ kubectl logs <pod> -f

I was getting this error:

failed to watch file "/var/lib/docker/containers/...-json.log": no space left on device
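If the inotify limit turns out to be the culprit, the setting can also be persisted across reboots via a sysctl.d drop-in (the file name below is just a common convention, not from the original comment):

$ echo "fs.inotify.max_user_watches = 1048576" | sudo tee /etc/sysctl.d/99-inotify.conf
$ sudo sysctl --system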

@HOSTED-POWER

Setting more inotify watches on nodes helped for me:
$ sudo sysctl fs.inotify.max_user_watches=1048576

Found this because when I was trying to check logs with --follow:
$ kubectl logs <pod> -f

I was getting this error:

failed to watch file "/var/lib/docker/containers/...-json.log": no space left on device

Unfortunately no difference for us :|

@andrejzelnik

Hello there,
did anyone find out the cause of this issue, please?

We have the same problem as @lucabrasi83.

  • AWX version: 19.2.1
  • AWX install method: AWX Operator - Kubernetes
  • Kubernetes version: 1.19.11 on Ubuntu (AKS)
  • Ansible version: 2.11.3
  • job image: ansible/awx-ee:0.6.0
  • job host: 5
  • job events: ~1500

What I have found is that the problem is probably related to the amount of log output.
When a fresh environment is deployed and every task actually has to make changes, the job fails close to the end of the play. When the same job is relaunched and most tasks report no changes because everything is already in place, it passes successfully to the end.
When a verbosity of 1 or higher is used for the job, the play fails much sooner.

@HOSTED-POWER

Is this issue related to #9594?

It's really severe on our side; the solution is almost unusable.

@oliverf1

oliverf1 commented Oct 4, 2021

I have the same issue and, like @andrejzelnik, I arrived at the same conclusion: it's due to the amount of logs.
I have a job template which fails 95% of the time when it runs in debug mode, but when it runs in normal mode it always succeeds.
After testing pretty much all the AWX versions and EE combinations, I still have the same issue.
I also noticed that if I run kubectl logs -f on the automation pods, the log stops at the very same place as in the web GUI!
At this point I started to research how the automation pods were created and how AWX gets the logs from the pod.
After a bit of code reading and observation, I understood that the receptor daemon running in the awx-ee container is responsible for spawning the automation pods, getting the logs and forwarding them to the database.
I then hacked the awx-ee container a little to include the receptorctl command so I could check the job status in receptor directly. And I saw that the logs in receptor were stopping at the very same place as kubectl logs -f!
After more reading in the receptor code, I found this. So in fact receptor is doing the equivalent of kubectl logs -f!
So my understanding is that the source of the problem is not in AWX but in the Kubernetes cluster itself, and is due to kubectl logs -f hanging.
What I think is happening:

  • AWX submits the job to receptor
  • receptor starts a new pod for this job
  • receptor attaches to the pod to get the logs and hangs at some point
  • when the job finishes, the receptor process is freed and reports to AWX that the job is finished
  • since AWX does not have all the logs, it marks the job as failed
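For anyone who wants to repeat the receptorctl check mentioned above, a rough sketch of the commands (the socket path matches the one in the issue description; the work unit ID will differ):

$ receptorctl --socket /var/run/receptor/receptor.sock status
$ receptorctl --socket /var/run/receptor/receptor.sock work list
$ receptorctl --socket /var/run/receptor/receptor.sock work results <work-unit-id>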

Also, to prove that it was really not related to AWX, I saved the full logs of a problematic job template in JSON format. With that I created a very basic container that reads this file and echoes its content to stdout at the same pace as the real job template (using the timestamps in the log). Once the pod running that container starts, I attach to it with kubectl logs -f and it hangs at the same place as when it runs in AWX!

I have tested this on several k3s clusters that I own and I have the exact same issue on all of them. But if I test it on an AWS EKS cluster, it works.
So in my case it's really a k3s issue, and I have opened a support ticket since I have support for it. I will see what the outcome is.
Sorry for the long comment, but I thought it was worth sharing this in detail.

HTH

@lucabrasi83
Author

Thanks @oliverf1 for the findings.

Looking at the K8S docs regarding logging, it seems there is a default log size limit dictated by the kubelet / CRI: https://kubernetes.io/docs/concepts/cluster-administration/logging/

When using a CRI container runtime, the kubelet is responsible for rotating the logs and managing the logging directory structure. The kubelet sends this information to the CRI container runtime and the runtime writes the container logs to the given location. The two kubelet parameters containerLogMaxSize and containerLogMaxFiles in kubelet config file can be used to configure the maximum size for each log file and the maximum number of files allowed for each container respectively.

For example, that can be set in Docker as per the following doc: https://docs.docker.com/config/containers/logging/local/
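For the Docker runtime specifically, a hedged example of raising the json-file log limits in /etc/docker/daemon.json (values are illustrative; this targets the json-file driver that the kubelet reads container logs from, rather than the 'local' driver the linked page describes):

{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m",
    "max-file": "4"
  }
}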

@oliverf1

oliverf1 commented Oct 4, 2021

@lucabrasi83 I will check that in more detail tomorrow, but I'm not sure it will help. For example, if I cat the whole log file in one shot, it works, no hang. Also, if I print the log file at a slower pace, like one line per second, it also works. The only way I found to reproduce the issue is to echo the log at the same pace as the job template does.
So it does not look to be a size or a timeout issue, but a combination of the two.
Which Kubernetes distribution are you running on?

@Kardi5

Kardi5 commented Oct 11, 2021

@oliverf1 Hi, interesting that you saw the log stopping in the running kubectl as well. We observed that using kubectl logs -f actually works for us even when AWX did not. As of now we only have one reliable test case: start a script via a task in an Ansible playbook and sleep for more than 3600s (e.g. 3800s). After exactly 3600s AWX will error out and the receptorctl status for the job will read "unexpected EOF", originating from the io.Copy call in the receptor binary (compiling the binary with a different method does not change this). Other jobs will randomly be marked as failed (the conditions mainly seem to be any kind of waiting for 1-10 min and/or many hosts, 30+), making it hard to debug.

Digging deeper, we ran tcpdump on the awx-ee node, recording the traffic. For the traffic from our kubectl (v1.20) there are L5 packets even if no data is visibly sent/received by kubectl (probably some kind of application-layer heartbeat). Keepalives are also sent and received every 30s. For the client-go package (v0.18.6 per https://github.com/ansible/receptor/blob/1.0.0a2/go.mod) there are keepalives but no L5 traffic, and after exactly 3600s the kube apiserver sends a TCP FIN packet signaling the end of the connection. Because this is a managed service, we have limited access to the k8s control plane to debug (Azure).

For important jobs/tasks we have introduced the async/poll feature to create some application-layer (L5) noise; this works as a workaround for us (see the sketch below).
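A minimal sketch of that async/poll pattern (the script path and timing values are illustrative):

- name: Long-running step kept polled so the log stream stays busy (illustrative)
  ansible.builtin.command: /usr/local/bin/long_task.sh
  async: 4000   # allow the task to run for up to 4000 seconds
  poll: 30      # check on it every 30 seconds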

Azure Kubernetes (v1.20) Service / Operator 0.12 / AWX 19.2.2 / Some customization (mounts and custom user EE)

@nicovs

nicovs commented Oct 15, 2021

@oliverf1, I came to the same conclusion as you: the job fails when the logs get rotated by k3s. So it's indeed not an AWX problem, but k3s.
However, I think I found the solution for this, by changing the log file size for k3s.
Following "SUSE: How to enable container log rotation with k3s or containerd",
I 'updated' and reconfigured my k3s by running:

curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC='--kubelet-arg "container-log-max-files=4" --kubelet-arg "container-log-max-size=50Mi"' sh -

My logs didn't get rotated anymore, and my k3s instance is now configured to keep 4 log files of 50MB, allowing up to 200MB of total log space to be retained per container.

[Screenshot: /var/log/pods listing (varlogpods_2021-10-15_09h18)]

Sidenote: I haven't had any log bigger than 50MB yet, so I don't know what happens when that 2nd, 3rd, etc. log file gets created.

@oliverf1

oliverf1 commented Oct 15, 2021

@nicovs That is a very good catch! I can confirm that kubectl logs -f indeed stops when the log file at the node level is rotated.
So basically the Kubernetes API is not aware that the log file has changed and does not switch to the new file.
This is the way Kubernetes works as of now, so this can happen on any Kubernetes distribution, depending on the log file size.
This also lines up with @lucabrasi83's comment.
There is a feature request open since 2018 to implement this in Kubernetes. Reading the issue, now that the prerequisites are ready it might eventually be implemented.
But it seems to me it's an AWX/receptor bug: I think the receptor daemon should handle this case. Increasing the log file size can only be a workaround. If it's not a bug, it's a design issue.
Any thoughts?

@Kardi5

Kardi5 commented Oct 26, 2021

Just wanted to confirm that the same options @nicovs used also work for us on Azure Kubernetes Service (AKS) to fix the truncated log output.
With AKS it's a bit tricky to change the kubelet config (the nice way is still in preview). Based on work from NGINX (https://docs.nginx.com/nginx-service-mesh/get-started/kubernetes-platform/aks/) we used a DaemonSet to configure the container log size on AKS:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    component: custom-kubelet-logsize
  name: custom-kubelet-logsize
  namespace: kube-system
spec:
  selector:
    matchLabels:
      component: custom-kubelet-logsize
      tier: node
  template:
    metadata:
      labels:
        component: custom-kubelet-logsize
        tier: node
    spec:
      containers:
      - name: custom-kubelet-logsize
        image: alpine
        imagePullPolicy: IfNotPresent
        command:
          - nsenter
          - --target
          - "1"
          - --mount
          - --uts
          - --ipc
          - --net
          - --pid
          - --
          - sh
          - -c
          - |
            if ! $(grep -q container-log-max-size /etc/default/kubelet); then sed -i 's/--max-pods=30/--max-pods=30 --container-log-max-size=500Mi/g' /etc/default/kubelet; systemctl restart kubelet; fi
            sleep infinity            
        resources:
          requests:
            cpu: 50m
        securityContext:
          privileged: true
      dnsPolicy: ClusterFirst
      hostPID: true
      tolerations:
      - effect: NoSchedule
        operator: Exists
      restartPolicy: Always
  updateStrategy:
    type: OnDelete

You may need to adapt this config. We used 500Mi as we have a lot of machines and tasks; I think the old log size was 10 or 40MB.

If you want to check your kubelet config (before/after), you can use the kubectl debug command.

kubectl debug node/<nodename> -it --image=mcr.microsoft.com/aks/fundamental/base-ubuntu:v0.0.11
cat /host/etc/default/kubelet

Ref: https://docs.microsoft.com/en-us/azure/aks/ssh#create-the-ssh-connection-to-a-linux-node

Sorry if this was too off-topic/Azure specific. Maybe this helps @andrejzelnik

@omicronn88

omicronn88 commented Dec 15, 2021

I can also confirm that increasing the container log file size solves this problem. With an on-premise k8s cluster installed with kubeadm, you can do it by adding containerLogMaxSize: 500Mi to /var/lib/kubelet/config.yaml and then restarting the kubelet service.

You may also want to decrease the number of retained files by adjusting the containerLogMaxFiles option so as not to run out of disk space. But keep in mind that it cannot be less than 2.
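For reference, a minimal sketch of the relevant fields in /var/lib/kubelet/config.yaml (only these two settings are the point; the rest of an existing KubeletConfiguration is omitted), followed by the restart:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: 500Mi
containerLogMaxFiles: 2

$ sudo systemctl restart kubelet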

@andrejzelnik

Thank you @Kardi5, I have tried out your proposal and it works now.

@Tioborto
Contributor

Hello,

Here we can't modify the kubelet config on our GKE cluster. We use the official Google image and we don't want to manage our own image.

This is a real problem for us. We have users with jobs in "ERROR" status while we have the full logs in our external tool.
Why is this not a top priority? This is a regression from previous versions.

@MalfuncEddie

Increasing the log size works, but it's not much of a solution when using a managed k8s service where you cannot access this setting :-(

@oliverf1

Even if you can change the log size, increasing it over and over is not a solution. In my case I have a very large job running on many nodes that generates about 600MB of logs, so to run it successfully my log size would need to be about 800MB. That would mean each and every container running in this cluster could write up to 800MB of logs. This is not sustainable. So my solution was to run this job ... outside of AWX.
I guess the only real fix will be for kubernetes/kubernetes#59902 to be implemented.
I'm also wondering if the same issue occurs on OpenShift.

@shanemcd
Member

Closing this in favor of #11338
