Missing job output and log lines #14003

Closed
mamercad opened this issue May 15, 2023 · 22 comments

Comments

@mamercad
Contributor

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.
  • I am NOT reporting a (potential) security vulnerability. (These should be emailed to security@ansible.com instead.)

Bug Summary

I've run into a rather strange situation where I'm missing output from jobs. In an attempt to reproduce it, I created a relatively simple Ansible playbook which runs for a few hours and generates a few thousand lines of output. For the most recent test, it ran for about 4 hours and 20 minutes and the playbook simply counted to 40,000.

The output in the UI and when downloaded looks like this (abbreviated):

[screenshot: abbreviated job output showing a large gap of missing lines]

Notice the huge swath of missing lines.

The playbook that it's running is quite simple:

---
- name: long_log_lines
  hosts: localhost
  connection: local
  gather_facts: false
  vars:
    how_many: 40000

  tasks:
    - name: Create a list
      ansible.builtin.command:
        cmd: seq {{ how_many }}
      register: the_list

    - name: Set a fact for the list
      ansible.builtin.set_fact:
        the_list: "{{ the_list.stdout_lines }}"

    - name: Ping list times
      ansible.builtin.ping:
      loop: "{{ the_list }}"
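
For scale, the playbook can be run locally (outside AWX) to gauge how many lines it should emit; the filename and the reduced how_many below are illustrative:

# Save the playbook as long_log_lines.yml (name is illustrative), then run it
# locally; each loop iteration prints one result line, so the output volume
# scales linearly with how_many.
ansible-playbook long_log_lines.yml -e how_many=1000 | wc -l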

AWX version

21.11.14

Select the relevant components

  • UI
  • API
  • Docs
  • Collection
  • CLI
  • Other

Installation method

kubernetes

Modifications

no

Ansible version

2.9

Operating system

Linux

Web browser

Chrome

Steps to reproduce

This should be simple to reproduce; it's a very simple playbook.
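
One way to check the result is to pull the raw output through the API and count lines (hostname, credentials, and job id are placeholders):

# After launching the job template that wraps the playbook above, download the
# plain-text stdout for the resulting job; with how_many=40000 the file should
# contain well over 40,000 lines.
curl -sk -u admin:"$AWX_PASSWORD" \
  "https://awx.example.com/api/v2/jobs/<job_id>/stdout/?format=txt" -o job_output.txt
wc -l job_output.txt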

Expected results

All of the output is present.

Actual results

Many thousands of lines are missing.

Additional information

No response

@dylex

dylex commented May 19, 2023

We're seeing the same thing on 22.2.0 on a new multi-node k8s cluster. (We did not see this issue on an older single-node k8s running 21.7.0; not sure what the critical difference is.) I do see the output in the automation-job pod log, but various errors show up in the awx-ee log at the time the job ends:

WARNING 2023/05/19 15:35:33 [hfO2BPfm] Error getting pod awx/automation-job-203-h96fh. Will retry 5 more times. Error: client rate limiter Wait returned an error: context canceled
WARNING 2023/05/19 15:35:34 [hfO2BPfm] Error getting pod awx/automation-job-203-h96fh. Will retry 4 more times. Error: client rate limiter Wait returned an error: context canceled
WARNING 2023/05/19 15:35:35 [hfO2BPfm] Error getting pod awx/automation-job-203-h96fh. Will retry 3 more times. Error: client rate limiter Wait returned an error: context canceled
WARNING 2023/05/19 15:35:36 [hfO2BPfm] Error getting pod awx/automation-job-203-h96fh. Will retry 2 more times. Error: client rate limiter Wait returned an error: context canceled
ERROR 2023/05/19 15:35:37 Exceeded retries for reading stdout /tmp/receptor/awx-task-6dd69c77d5-ssn6l/hfO2BPfm/stdout
WARNING 2023/05/19 15:35:37 Could not read in control service: read unix /var/run/receptor/receptor.sock->@: use of closed network connection
WARNING 2023/05/19 15:35:37 Could not close connection: close unix /var/run/receptor/receptor.sock->@: use of closed network connection
WARNING 2023/05/19 15:35:37 [hfO2BPfm] Error getting pod awx/automation-job-203-h96fh. Will retry 1 more times. Error: client rate limiter Wait returned an error: context canceled
ERROR 2023/05/19 15:35:38 [hfO2BPfm] Error getting pod awx/automation-job-203-h96fh. Error: client rate limiter Wait returned an error: context canceled
ERROR 2023/05/19 15:35:38 Error updating status file /tmp/receptor/awx-task-6dd69c77d5-ssn6l/hfO2BPfm/status: open /tmp/receptor/awx-task-6dd69c77d5-ssn6l/hfO2BPfm/status.lock: no such file or directory.
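
One way to confirm the events are produced but lost in transit (pod name taken from the log above; the pod only exists while the job is running):

# While the job is still running, follow the automation-job pod's output, then
# compare its line count with what AWX ultimately records for the job.
kubectl logs -n awx -f automation-job-203-h96fh > pod.log
wc -l pod.log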

@shanemcd
Member

What distro of k8s are y'all using? Are you running jobs in the control plane's namespace or externally via a container group?

@dylex

dylex commented May 19, 2023

We're using k8s 1.26.1 via kubeadm (on-prem). AWX is deployed via awx-operator 2.1.0 with Helm, nothing special; jobs run in the "default" container group instance (not the control plane) with AWX EE.
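
For reference, the operator version in use can be confirmed from the cluster (deployment name and namespace are illustrative and may differ by install method):

# Print the image of the awx-operator controller deployment.
kubectl get deployment awx-operator-controller-manager -n awx \
  -o jsonpath='{.spec.template.spec.containers[*].image}'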

@mamercad
Contributor Author

> What distro of k8s are y'all using? Are you running jobs in the control plane's namespace or externally via a container group?

I'm seeing this, at least, on 1.24.8.

@mamercad
Contributor Author

mamercad commented May 20, 2023

> What distro of k8s are y'all using? Are you running jobs in the control plane's namespace or externally via a container group?

I can't remember if it was under controlplane or a separate Container Group; next week, I'll do the same experiment under both and report back.

@mamercad
Contributor Author

> What distro of k8s are y'all using? Are you running jobs in the control plane's namespace or externally via a container group?
>
> I can't remember if it was under controlplane or a separate Container Group; next week, I'll do the same experiment under both and report back.

For the output that I began this issue with, it happened under the default Container Group.

@mamercad
Contributor Author

> Are you running jobs in the control plane's namespace or externally via a container group?

Hrm, when using controlplane I'm currently getting "This job is not ready to start because there is not enough available capacity."; trying to figure out which knobs to turn.
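
In case it helps, instance capacity can be inspected via the API (hostname and credentials are placeholders):

# List each instance's total vs. consumed capacity; an instance group whose
# instances have no free capacity can yield the "not enough available
# capacity" message above.
curl -sk -u admin:"$AWX_PASSWORD" "https://awx.example.com/api/v2/instances/" \
  | jq '.results[] | {hostname, node_type, capacity, consumed_capacity}'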

@mamercad
Contributor Author

> What distro of k8s are y'all using? Are you running jobs in the control plane's namespace or externally via a container group?

Hrm, I'm getting kind of confused; this is basically a stock install, and here's what I have:
[screenshot: instance groups list from a stock install]

What/how would you like me to test?

@mabashian
Member

At first I thought this was exclusively a UI issue, but since you indicated that the downloaded output is missing the same lines, I'm going to flip this over to API. Downloaded output comes straight through the API.

@shanemcd
Member

@mamercad Sorry for the confusion here. By default, AWX will run job pods in the same namespace the control plane is running in. The "default" container group uses the k8s API to run jobs in the same namespace AWX is running in. The instance group "controlplane" forces certain types of tasks (like project updates) to run within the AWX pod rather than as an external pod. We also support running jobs in remote Kubernetes clusters using the container group mechanism. Based on your comments and the image here, it looks like you're just running things in the local cluster, which is what I was curious about.
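
To double-check where a given job actually ran, the job detail endpoint reports it (hostname, credentials, and job id are placeholders):

# Show the instance group and the controller/execution nodes for one job.
curl -sk -u admin:"$AWX_PASSWORD" "https://awx.example.com/api/v2/jobs/<job_id>/" \
  | jq '{instance_group: .summary_fields.instance_group.name, controller_node, execution_node}'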

@mamercad
Contributor Author

> @mamercad Sorry for the confusion here. By default, AWX will run job pods in the same namespace the control plane is running in. […]

Ah, okay. Yes, I'm running everything in the same cluster, and in the same namespace.

[Seven comments from @mamercad were marked as off-topic and are hidden.]

@mamercad
Contributor Author

For the new "Job terminated due to error" failures that I'm seeing on 22.3.0, I opened #14057.

@tparker00

I was able to reproduce this in my home environment as well.
k8s: 1.24.8
awx: 22.3.0

I just deployed it yesterday using operator 2.2.1

The only thing I changed in the deployment was specifying the image_version for AWX; everything else is as the operator sets it up by default.

When I download the job output, a simple wc -l shows far fewer lines than the ~16k I would have expected (around 7k).
[screenshot: terminal showing the wc -l result]

@mamercad
Contributor Author

I'm testing this version, 21.11.14, with the resolution from #14057 to see if it does the trick for this version as well.

@mamercad
Contributor Author

> I'm testing this version, 21.11.14, with the resolution from #14057 to see if it does the trick for this version as well.

Seems good; going to close this (10k lines worked just fine):

❯ wc -l ~/Downloads/job_126.txt
   10019 /Users/mark/Downloads/job_126.txt
