
AWX job fails but it completes successfully #14288

Closed
6 of 11 tasks
EsDmitrii opened this issue Jul 27, 2023 · 16 comments

Comments

@EsDmitrii

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.
  • I am NOT reporting a (potential) security vulnerability. (These should be emailed to security@ansible.com instead.)

Bug Summary

Hi!
We have a weird issue with AWX.
Our jobs fail, but all tasks in the Ansible log complete without errors.
There are also no errors in the AWX pod (we run it inside k8s).
The AWX log freezes and we can't see any updates until the AWX pod finishes the task.
In the end the AWX job fails with this error:
[screenshot of the error message]

AWX version

22.3.0

Select the relevant components

  • UI
  • UI (tech preview)
  • API
  • Docs
  • Collection
  • CLI
  • Other

Installation method

kubernetes

Modifications

no

Ansible version

ansible-core==2.15.0 ansible-runner==2.3.2

Operating system

Centos7

Web browser

Chrome

Steps to reproduce

Run a huge task on a huge inventory :)
We really don't know the root cause of this issue.

Expected results

Job succeeded without errors

Actual results

Job fails with error: "Job terminated due to error"

Additional information

We found a lot of related issues with no updates:
#12297 (open with no updates)
#12530 (closed and said that issue fixed in 21.14.0)
#13376 (closed but there is a new comment with the same error)
#12400 (open)
etc..

@jollyekb

Same problem; we got this error on a big inventory with a lot of output lines at once (skipped status).

@fosterseth
Member

fosterseth commented Aug 2, 2023

do you have an error message in your awx-ee container around the time the job fails?

are you using the Receptor reconnect feature? did you bump up the max container log size for your cluster?

can you also provide the output of api/v2/jobs/16662?

@EsDmitrii
Author

@fosterseth Hi! Sorry for the late response.

do you have an error message in you awx-ee container around the time the job fails?

There are no errors in the container; everything runs to the end successfully.

are you using the Receptor reconnect feature?

No, but we tested this feature when you asked about it.
We set it to enabled via extra env:

  ee_extra_env: |
    - name: RECEPTOR_KUBE_SUPPORT_RECONNECT
      value: enabled
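
For reference, here's roughly where that snippet sits in the full AWX custom resource managed by awx-operator (a sketch only; the resource name `awx` is a placeholder and field placement may vary by operator version):

```yaml
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx          # placeholder name
spec:
  ee_extra_env: |
    - name: RECEPTOR_KUBE_SUPPORT_RECONNECT
      value: enabled
```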

After that we started getting failing jobs with no relevant output. Jobs fail immediately, 1-5 seconds into gathering facts.

When we set it to disabled, the job fails in 1-2 minutes with status "canceled", but we don't cancel the job:
[screenshot of the canceled job]

When we reverted all the changes (removed the extra env), everything went back to the usual behavior: in the AWX UI the job fails, but the container keeps working.

The interesting thing: when we set this var on the instance group (the job pod template, not the operator), everything stays in the same state as before: the AWX UI job fails, but the container keeps working.

did you bump up max container log size for you cluster?

Yes, we increased these parameters for the kubelet earlier; no luck with it:
[screenshot of the kubelet log-size settings]
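
For reference, the knobs in question are standard KubeletConfiguration fields; a sketch with example values (the config file path varies by distribution, commonly /var/lib/kubelet/config.yaml):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: 100Mi   # max size of a container log file before rotation
containerLogMaxFiles: 5      # max number of rotated log files kept per container
```

The kubelet must be restarted for changes to take effect.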

In sum, we reverted all the changes related to RECEPTOR_KUBE_SUPPORT_RECONNECT.

We tried these approaches; nothing helped:
ansible/receptor#683
#14057
ansible/awx-operator#1203

can you also provide output of api/v2/jobs/16662?

Sure, the JSON output:

{
    "id": 16662,
    "type": "job",
    "url": "/api/v2/jobs/16662/",
    "related": {
        "labels": "/api/v2/jobs/16662/labels/",
        "inventory": "/api/v2/inventories/12/",
        "project": "/api/v2/projects/11/",
        "organization": "/api/v2/organizations/2/",
        "credentials": "/api/v2/jobs/16662/credentials/",
        "unified_job_template": "/api/v2/job_templates/42/",
        "schedule": "/api/v2/schedules/26/",
        "stdout": "/api/v2/jobs/16662/stdout/",
        "execution_environment": "/api/v2/execution_environments/9/",
        "job_events": "/api/v2/jobs/16662/job_events/",
        "job_host_summaries": "/api/v2/jobs/16662/job_host_summaries/",
        "activity_stream": "/api/v2/jobs/16662/activity_stream/",
        "notifications": "/api/v2/jobs/16662/notifications/",
        "create_schedule": "/api/v2/jobs/16662/create_schedule/",
        "job_template": "/api/v2/job_templates/42/",
        "cancel": "/api/v2/jobs/16662/cancel/",
        "relaunch": "/api/v2/jobs/16662/relaunch/"
    },
    "summary_fields": {
        "organization": {
            "id": 2,
            "name": "",
            "description": ""
        },
        "inventory": {
            "id": 12,
            "name": "inv_test",
            "description": "",
            "has_active_failures": true,
            "total_hosts": 90,
            "hosts_with_active_failures": 12,
            "total_groups": 53,
            "has_inventory_sources": true,
            "total_inventory_sources": 1,
            "inventory_sources_with_failures": 0,
            "organization_id": 2,
            "kind": ""
        },
        "execution_environment": {
            "id": 9,
            "name": "ansible2-12-10",
            "description": "",
            "image": "awx-ee:2.12.10"
        },
        "project": {
            "id": 11,
            "name": "shared_prj_infra",
            "description": "",
            "status": "successful",
            "scm_type": "git",
            "allow_override": false
        },
        "job_template": {
            "id": 42,
            "name": "daily",
            "description": ""
        },
        "schedule": {
            "id": 26,
            "name": "every_day_run_daily",
            "description": "",
            "next_run": "2023-08-04T23:15:00Z"
        },
        "unified_job_template": {
            "id": 42,
            "name": "daily",
            "description": "",
            "unified_job_type": "job"
        },
        "instance_group": {
            "id": 2,
            "name": "default",
            "is_container_group": true
        },
        "user_capabilities": {
            "delete": true,
            "start": true
        },
        "labels": {
            "count": 0,
            "results": []
        },
        "credentials": [
            {
                "id": 6,
                "name": "cred_machine",
                "description": "",
                "kind": "ssh",
                "cloud": false
            }
        ]
    },
    "created": "2023-07-26T23:15:05.734636Z",
    "modified": "2023-07-26T23:15:30.366081Z",
    "name": "tpl_daily",
    "description": "",
    "job_type": "run",
    "inventory": 12,
    "project": 11,
    "playbook": "playbook_daily.yaml",
    "scm_branch": "",
    "forks": 0,
    "limit": "",
    "verbosity": 0,
    "extra_vars": "{}",
    "job_tags": "",
    "force_handlers": false,
    "skip_tags": "",
    "start_at_task": "",
    "timeout": 3600,
    "use_fact_cache": true,
    "organization": 2,
    "unified_job_template": 42,
    "launch_type": "scheduled",
    "status": "error",
    "execution_environment": 9,
    "failed": true,
    "started": "2023-07-26T23:15:30.439479Z",
    "finished": "2023-07-26T23:33:17.968437Z",
    "canceled_on": null,
    "elapsed": 1067.529,
    "job_args": "[\"ssh-agent\", \"sh\", \"-c\", \"trap 'rm -f /runner/artifacts/16662/ssh_key_data' EXIT && ssh-add /runner/artifacts/16662/ssh_key_data && rm -f /runner/artifacts/16662/ssh_key_data && ansible-playbook -u awx..rcod --ask-pass --become --diff -l '!p00t3mon01.m9.egov.local' -i /runner/inventory/hosts -e @/runner/env/extravars playbook_daily.yaml\"]",
    "job_cwd": "/runner/project",
    "job_env": {
        "KUBERNETES_SERVICE_PORT_HTTPS": "443",
        "KUBERNETES_SERVICE_PORT": "443",
        "AWX_OPERATOR_CONTROLLER_MANAGER_METRICS_SERVICE_SERVICE_HOST": "10.233.15.99",
        "HOSTNAME": "automation-job-16662-ngqfl",
        "LANGUAGE": "en_US:en",
        "AWX_SERVICE_PORT_80_TCP": "tcp://10.233.38.219:80",
        "PWD": "/_build",
        "AWX_OPERATOR_CONTROLLER_MANAGER_METRICS_SERVICE_PORT_8443_TCP_PORT": "8443",
        "AWX_OPERATOR_CONTROLLER_MANAGER_METRICS_SERVICE_PORT_8443_TCP_PROTO": "tcp",
        "AWX_SERVICE_SERVICE_PORT": "80",
        "HOME": "/root",
        "LANG": "en_US.UTF-8",
        "KUBERNETES_PORT_443_TCP": "tcp://10.233.0.1:443",
        "AWX_SERVICE_PORT_80_TCP_ADDR": "10.233.38.219",
        "AWX_SERVICE_PORT": "tcp://10.233.38.219:80",
        "AWX_OPERATOR_CONTROLLER_MANAGER_METRICS_SERVICE_PORT": "tcp://10.233.15.99:8443",
        "AWX_SERVICE_PORT_80_TCP_PORT": "80",
        "AWX_OPERATOR_CONTROLLER_MANAGER_METRICS_SERVICE_SERVICE_PORT_HTTPS": "8443",
        "AWX_OPERATOR_CONTROLLER_MANAGER_METRICS_SERVICE_SERVICE_PORT": "8443",
        "SHLVL": "0",
        "KUBERNETES_PORT_443_TCP_PROTO": "tcp",
        "KUBERNETES_PORT_443_TCP_ADDR": "10.233.0.1",
        "AWX_SERVICE_SERVICE_HOST": "10.233.38.219",
        "AWX_OPERATOR_CONTROLLER_MANAGER_METRICS_SERVICE_PORT_8443_TCP": "tcp://10.233.15.99:8443",
        "AWX_SERVICE_SERVICE_PORT_HTTP": "80",
        "KUBERNETES_SERVICE_HOST": "10.233.0.1",
        "LC_ALL": "en_US.UTF-8",
        "KUBERNETES_PORT": "tcp://10.233.0.1:443",
        "KUBERNETES_PORT_443_TCP_PORT": "443",
        "PATH": "/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
        "AWX_SERVICE_PORT_80_TCP_PROTO": "tcp",
        "AWX_OPERATOR_CONTROLLER_MANAGER_METRICS_SERVICE_PORT_8443_TCP_ADDR": "10.233.15.99",
        "ANSIBLE_FACT_CACHE_TIMEOUT": "604800",
        "ANSIBLE_FORCE_COLOR": "True",
        "ANSIBLE_HOST_KEY_CHECKING": "False",
        "ANSIBLE_INVENTORY_UNPARSED_FAILED": "True",
        "ANSIBLE_PARAMIKO_RECORD_HOST_KEYS": "False",
        "AWX_PRIVATE_DATA_DIR": "/tmp/awx_16662_ys7_va8n",
        "JOB_ID": "16662",
        "INVENTORY_ID": "12",
        "PROJECT_REVISION": "2f30d79ae939cbef6b648f6e96e4609a6c035295",
        "ANSIBLE_RETRY_FILES_ENABLED": "False",
        "MAX_EVENT_RES": "700000",
        "AWX_HOST": "MASKED",
        "ANSIBLE_SSH_CONTROL_PATH_DIR": "/runner/cp",
        "ANSIBLE_COLLECTIONS_PATHS": "/runner/requirements_collections:~/.ansible/collections:/usr/share/ansible/collections",
        "ANSIBLE_ROLES_PATH": "/runner/requirements_roles:./roles:~/.ansible/roles:/usr/share/ansible/roles:/etc/ansible/roles",
        "ANSIBLE_CALLBACK_PLUGINS": "/usr/local/lib/python3.9/site-packages/ansible_runner/display_callback/callback",
        "ANSIBLE_STDOUT_CALLBACK": "awx_display",
        "AWX_ISOLATED_DATA_DIR": "/runner/artifacts/16662",
        "ANSIBLE_CACHE_PLUGIN": "jsonfile",
        "ANSIBLE_CACHE_PLUGIN_CONNECTION": "/runner/artifacts/16662/fact_cache",
        "RUNNER_OMIT_EVENTS": "False",
        "RUNNER_ONLY_FAILED_EVENTS": "False"
    },
    "job_explanation": "Job terminated due to error",
    "execution_node": "",
    "controller_node": "awx-task-5cd6fccf55-gcs2l",
    "result_traceback": "Finished",
    "event_processing_finished": true,
    "launched_by": {
        "id": 26,
        "name": "every_day_run__tpl_daily_t3",
        "type": "schedule",
        "url": "/api/v2/schedules/26/"
    },
    "work_unit_id": "CobuskbH",
    "job_template": 42,
    "passwords_needed_to_start": [],
    "allow_simultaneous": false,
    "artifacts": {},
    "scm_revision": "2f30d79ae939cbef6b648f6e96e4609a6c035295",
    "instance_group": 2,
    "diff_mode": true,
    "job_slice_number": 0,
    "job_slice_count": 1,
    "webhook_service": "",
    "webhook_credential": null,
    "webhook_guid": "",
    "host_status_counts": null,
    "playbook_counts": {
        "play_count": 9,
        "task_count": 19877
    },
    "custom_virtualenv": null
}
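
For triage, the fields that matter in this payload are `status`, `failed`, `job_explanation`, and `result_traceback`. A small helper (illustrative only, not part of AWX) can pull them out of any job JSON:

```python
# Hypothetical helper for summarizing why an AWX job ended, given the
# dict parsed from /api/v2/jobs/<id>/. Field names match the payload above.
def summarize_job_failure(job: dict) -> str:
    parts = [
        f"status={job.get('status')}",
        f"failed={job.get('failed')}",
        f"explanation={job.get('job_explanation') or '<none>'}",
        f"traceback={job.get('result_traceback') or '<none>'}",
    ]
    return ", ".join(parts)

sample = {
    "status": "error",
    "failed": True,
    "job_explanation": "Job terminated due to error",
    "result_traceback": "Finished",
}
print(summarize_job_failure(sample))
# status=error, failed=True, explanation=Job terminated due to error, traceback=Finished
```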

@fosterseth
Member

what version of kubelet do you have (kubectl get node should show it)? RECEPTOR_KUBE_SUPPORT_RECONNECT requires a certain minimum version, which you can find on the PR ansible/receptor#683

It still feels like you might be running into log rotation problems. 100Mi is still pretty low given your inventory/task size. Maybe bump this up? It would need to accommodate the entire stdout of the automation job pod

@EsDmitrii
Author

@fosterseth hi!
My kubelet version is 1.23.7.
I can't find a table or anything else about the minimum requirements for RECEPTOR_KUBE_SUPPORT_RECONNECT; it's not obvious from the PR you sent.
Do you mean that I need to upgrade kubelet to 1.23.14?
[screenshot]

I bumped the logs up to 1Gi and will test it, thanks for the tip.

Appreciate your assistance!

@EsDmitrii
Author

Tested with the logs bumped up to 1Gi, same result: the UI says the job failed, but the pod keeps working and giving me the actual logs.

@from88

from88 commented Aug 16, 2023

Same issue here. My last post: #12685

@EsDmitrii
Author

I'm here with updates for you.
We updated to https://quay.io/repository/ansible/awx-ee?tab=tags&tag=22.7.0
Everything looks fine at first sight.
We will watch and test it for a few days.
I'll be back with news.

@fivetran-joliveira

fivetran-joliveira commented Sep 12, 2023

Hi @EsDmitrii @jollyekb could you share your solution, in case you got it working?

@fosterseth any other data/logs I should check that could be helpful here?

The AWX job's output has only ~150 lines at the time it fails.

For me, after upgrading from 22.3.0 to 22.7.0, the error on /api/v2/jobs/<job id>/

"job_explanation": "Job terminated due to error",
...
"result_traceback": "Finished",

is now replaced with

"job_explanation": "The running ansible process received a shutdown signal.",
...
"result_traceback": "",

And AWX jobs fail after about ~4 minutes.

kubelet version v1.24.14-gke.1200

$ kubectl -n <my namespace> get node                                             

NAME       STATUS   ROLES    AGE   VERSION
gke-....   Ready    <none>   41d   v1.24.14-gke.1200

Logs from the task pod don't say much for job id 29842:

$ kubectl -n <my namespace> logs -f awx-task-<pod id>

2023-09-12 10:38:10,122 INFO     [971f5e51d0fc43f485b774ea42715c93] awx.analytics.job_lifecycle job-29842 blocked by inventoryupdate-29843
2023-09-12 10:38:10,617 INFO     [-] awx.main.wsrelay Producer 10.4.0.160-control-limit_reached_1 has no subscribers, shutting down.
2023-09-12 10:38:15,939 INFO     [971f5e51d0fc43f485b774ea42715c93] awx.main.commands.run_callback_receiver Starting EOF event processing for Job 29843
2023-09-12 10:38:16,002 INFO     [971f5e51d0fc43f485b774ea42715c93] awx.analytics.job_lifecycle inventoryupdate-29843 post run
2023-09-12 10:38:16,013 INFO     [971f5e51d0fc43f485b774ea42715c93] awx.analytics.job_lifecycle inventoryupdate-29843 stats wrapup finished
2023-09-12 10:38:18,294 INFO     [971f5e51d0fc43f485b774ea42715c93] awx.analytics.job_lifecycle inventoryupdate-29843 finalize run
2023-09-12 10:38:19,476 INFO     [971f5e51d0fc43f485b774ea42715c93] awx.analytics.job_lifecycle job-29842 waiting
2023-09-12 10:38:19,674 INFO     [971f5e51d0fc43f485b774ea42715c93] awx.analytics.job_lifecycle job-29842 pre run
2023-09-12 10:38:25,156 INFO     [971f5e51d0fc43f485b774ea42715c93] awx.analytics.job_lifecycle job-29842 preparing playbook
2023-09-12 10:38:25,497 INFO     [971f5e51d0fc43f485b774ea42715c93] awx.analytics.job_lifecycle job-29842 running playbook
2023-09-12 10:38:38,771 INFO     [-] awx.main.wsrelay Starting producer for 10.4.0.160-control-limit_reached_1
2023-09-12 10:38:38,773 INFO     [-] awx.main.wsrelay Starting producer for 10.4.0.160-jobs-status_changed
2023-09-12 10:38:43,544 INFO     [-] awx.main.wsrelay Starting producer for 10.4.0.160-schedules-changed
2023-09-12 10:38:46,472 INFO     [971f5e51d0fc43f485b774ea42715c93] awx.analytics.job_lifecycle job-29842 work unit id received
2023-09-12 10:38:46,525 INFO     [971f5e51d0fc43f485b774ea42715c93] awx.analytics.job_lifecycle job-29842 work unit id assigned
2023-09-12 10:39:13,555 INFO     [-] awx.main.wsrelay Producer 10.4.0.160-schedules-changed has no subscribers, shutting down.
2023-09-12 10:39:19,504 INFO     [-] awx.main.wsrelay Starting producer for 10.4.0.160-jobs-summary
2023-09-12 10:39:19,505 INFO     [-] awx.main.wsrelay Starting producer for 10.4.0.160-job_events-29842
2023-09-12 10:39:26,562 INFO     [-] awx.main.wsrelay Starting producer for 10.4.0.160-schedules-changed
2023-09-12 10:39:46,566 INFO     [-] awx.main.wsrelay Producer 10.4.0.160-schedules-changed has no subscribers, shutting down.
2023-09-12 10:40:08,799 INFO     [-] awx.main.wsrelay Producer 10.4.0.160-control-limit_reached_1 has no subscribers, shutting down.
2023-09-12 10:41:27,256 INFO     [-] awx.main.wsrelay Starting producer for 10.4.0.160-control-limit_reached_1
2023-09-12 10:41:27,257 INFO     [-] awx.main.wsrelay Starting producer for 10.4.0.160-schedules-changed
2023-09-12 10:41:37,262 INFO     [-] awx.main.wsrelay Producer 10.4.0.160-schedules-changed has no subscribers, shutting down.
2023-09-12 10:42:07,273 INFO     [-] awx.main.wsrelay Producer 10.4.0.160-control-limit_reached_1 has no subscribers, shutting down.
2023-09-12 10:42:37,661 INFO     [971f5e51d0fc43f485b774ea42715c93] awx.main.commands.run_callback_receiver Starting EOF event processing for Job 29842
2023-09-12 10:42:37,721 INFO     [971f5e51d0fc43f485b774ea42715c93] awx.analytics.job_lifecycle job-29842 post run
2023-09-12 10:42:37,878 INFO     [971f5e51d0fc43f485b774ea42715c93] awx.analytics.job_lifecycle job-29842 finalize run
2023-09-12 10:42:38,847 WARNING  [971f5e51d0fc43f485b774ea42715c93] awx.main.dispatch job 29842 (failed) encountered an error (rc=None), please see task stdout for details.
2023-09-12 10:42:47,015 INFO     [-] awx.main.wsrelay Producer 10.4.0.160-job_events-29842 has no subscribers, shutting down.
2023-09-12 10:42:47,668 INFO     [-] awx.main.wsrelay Producer 10.4.0.160-jobs-summary has no subscribers, shutting down.
2023-09-12 10:42:47,763 INFO     [-] awx.main.wsrelay Starting producer for 10.4.0.160-control-limit_reached_1
2023-09-12 10:43:17,892 INFO     [-] awx.main.wsrelay Producer 10.4.0.160-jobs-status_changed has no subscribers, shutting down.
2023-09-12 10:43:27,775 INFO     [-] awx.main.wsrelay Producer 10.4.0.160-control-limit_reached_1 has no subscribers, shutting down.

@EsDmitrii
Author

Hi @fivetran-joliveira!
We haven't fully resolved it yet.
We still face the issue on several hosts, but generally, upgrading to 22.7.0 resolved it on most of our AWX servers.
I also tried this, but it didn't help me.
I'm still researching this; I can't find any pattern pointing to the root cause.

@fivetran-joliveira

Hi @EsDmitrii,
Thanks for replying!
If you or anyone else have ideas, or links to AWX docs, on how to dig deeper into debugging this issue, I'd appreciate it, as I'm running out of ideas for how to keep investigating the root cause.

@chris93111
Contributor

Hi, same for us after upgrading k3s to 1.27.5 (AWX 21.14.0); before the k3s upgrade (1.24) we didn't have this problem.

@chris93111
Contributor

I noticed that the playbook container was in status Completed and that it was deleted before AWX could retrieve the logs,

so the job is successful, but it shows as an error in AWX with an incomplete stream.

@chris93111
Contributor

In my case, the streaming stop is caused by a big file with many lines (80k+); capturing its content freezes the stream.
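
One standard way to keep a single task from flooding the event stream is to suppress its output with the Ansible no_log keyword; a sketch with placeholder names:

```yaml
- name: read a large file without echoing its contents into job events
  ansible.builtin.slurp:
    src: /var/log/huge.log   # placeholder path
  register: big_file
  no_log: true               # suppress the (large) result in the job output
```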

@mogamal1

mogamal1 commented Dec 6, 2023

@EsDmitrii
Do you have any updates here? I'm having the same issue, and I'm using awx:23.0.0.

@EsDmitrii
Author

Hi all!
Sorry for the late response.
My team and I split the huge jobs into several smaller ones.
Now everything works well.
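
Two standard ways to do that kind of split: AWX job slicing (the job_slice_count field on a job template splits the inventory across several parallel job runs), or the serial keyword inside the playbook, which batches hosts per play so each batch produces less output at once. A sketch of the latter (hosts pattern and batch size are just examples):

```yaml
- hosts: all
  serial: 25        # run at most 25 hosts per batch
  tasks:
    - name: example task
      ansible.builtin.ping:
```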
