
Jobs with ID > 9999 are stuck in pending state, but nothing is happening #10489

Closed
JustInVTime opened this issue Jun 22, 2021 · 27 comments
Labels
sustaining:support Support prioritized issues type:bug

Comments

@JustInVTime

ISSUE TYPE
  • Bug Report
SUMMARY

Last week we encountered a strange problem: our AWX jobs wouldn't run anymore. They were stuck in the running (or pending) state, but nothing seemed to happen. There was no output in the job logs, no awx-ee pods were started in our Kubernetes cluster, and nothing strange appeared in the logs. Cancelling and restarting the jobs didn't work either. I noticed that job ID 10000 was the first job stuck in the running state.

I created a new database for AWX, connected our deployment to it, and our jobs completed again. Switching back to the old database made jobs get stuck again, so it must have had something to do with AWX or the database. Yesterday I created a test setup with 10 git-based projects and a script that syncs those projects periodically via the API. This morning it reached job ID 10000 again, with the same result: jobs with an ID greater than 9999 are started, but nothing seems to happen.

ENVIRONMENT
  • AWX version: 19.2.1
  • AWX install method: awx-operator
  • Ansible version: 2.9
  • Operating System: Kubernetes v1.20.4 running on Ubuntu 20.04 servers with docker engine 20.10.5
  • Web Browser: Firefox on Ubuntu Desktop
STEPS TO REPRODUCE

Run 9999 jobs in AWX (project syncs, inventory syncs, or job templates; it doesn't matter).
Run one more job: according to the UI it is starting, but it will never finish. A hypothetical sketch of such a reproduction loop follows.
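
The reporter's periodic sync script isn't attached; purely as an illustration, a reproduction loop could look like the sketch below. The host, token, and project IDs are placeholder assumptions; POSTing to /api/v2/projects/{id}/update/ is the standard way to launch a project update.

# Hypothetical reproduction sketch: keep launching project updates until the
# unified job counter passes 9999. Host, token, and project IDs are placeholders.
import time
import requests

AWX_URL = "https://awx.example.com"       # placeholder
HEADERS = {"Authorization": "Bearer REPLACE_WITH_TOKEN"}
PROJECT_IDS = range(10, 20)               # e.g. ten git-based test projects

while True:
    for pid in PROJECT_IDS:
        # POSTing to a project's update endpoint launches a project_update job
        r = requests.post(f"{AWX_URL}/api/v2/projects/{pid}/update/", headers=HEADERS)
        r.raise_for_status()
        job_id = r.json()["id"]
        print(f"launched project_update {job_id}")
        if job_id > 9999:
            print("counter passed 9999; watch for jobs stuck in pending")
    time.sleep(60)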

EXPECTED RESULTS

Jobs with ID > 9999 finish successfully.

ACTUAL RESULTS

Jobs with ID > 9999 are stuck.

ADDITIONAL INFORMATION

[Screenshot from 2021-06-22 10-58-25]

I could provide a database dump, but then I would need to run the test again with non-production credentials; of course I'm not allowed to post a dump containing our production credentials. Let me know if you need it.

@JustInVTime
Author

Maybe this is related: https://groups.google.com/g/awx-project/c/GfDZ3iXxjzc

@Zokormazo
Member

Just tested with job IDs > 10k and I can't reproduce this. Is there anything useful in the pod logs when you hit this?

@JustInVTime
Author

No, the task container logs now only show blocked jobs, because jobs 10000-10010 are running but not doing anything.

If you want, I can cancel all jobs with ID 10000+ and start a new job so I have more recent job logs. Do you need logs from the web and awx-ee containers as well?

awx_task.log

@shanemcd shanemcd changed the title Jobs with ID > 9999 are stuck in running state, but nothing is happening Jobs with ID > 9999 are stuck in pending state, but nothing is happening Jun 25, 2021
@shanemcd
Member

Hello. Did you migrate from a local docker install or did you start fresh with the operator?

@shanemcd
Member

Can you provide the json from /api/v2/project_updates/10000?
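
For anyone following along, that record can be fetched with any authenticated GET; a minimal sketch, with placeholder host and credentials:

import requests

# Placeholder host and credentials; any authenticated GET works.
r = requests.get("https://awx.example.com/api/v2/project_updates/10000/",
                 auth=("admin", "password"))
r.raise_for_status()
data = r.json()
print(data["status"], data["elapsed"], data["execution_node"])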

@JustInVTime
Author

Hi @shanemcd

Sorry for my late response, I took a long weekend off.

Hello. Did you migrate from a local docker install or did you start fresh with the operator?

We did a fresh start with the operator.

Can you provide the json from /api/v2/project_updates/10000?

Sure! (I had to replace some hostnames with ***** because of company policy, I hope that's okay.)

HTTP 200 OK
Allow: GET, DELETE, HEAD, OPTIONS
Content-Type: application/json
Vary: Accept
X-API-Node: testfast-awx-794cb77d76-mr82w
X-API-Product-Name: AWX
X-API-Product-Version: 19.2.1
X-API-Time: 0.035s

{
    "id": 10000,
    "type": "project_update",
    "url": "/api/v2/project_updates/10000/",
    "related": {
        "created_by": "/api/v2/users/1/",
        "credential": "/api/v2/credentials/3/",
        "unified_job_template": "/api/v2/projects/17/",
        "stdout": "/api/v2/project_updates/10000/stdout/",
        "execution_environment": "/api/v2/execution_environments/2/",
        "project": "/api/v2/projects/17/",
        "cancel": "/api/v2/project_updates/10000/cancel/",
        "scm_inventory_updates": "/api/v2/project_updates/10000/scm_inventory_updates/",
        "notifications": "/api/v2/project_updates/10000/notifications/",
        "events": "/api/v2/project_updates/10000/events/"
    },
    "summary_fields": {
        "organization": {
            "id": 1,
            "name": "Default",
            "description": ""
        },
        "execution_environment": {
            "id": 2,
            "name": "Control Plane Execution Environment",
            "description": "",
            "image": "quay.io/ansible/awx-ee:0.4.0"
        },
        "project": {
            "id": 17,
            "name": "test 7",
            "description": "",
            "status": "pending",
            "scm_type": "git",
            "allow_override": false
        },
        "credential": {
            "id": 3,
            "name": "*********",
            "description": "",
            "kind": "scm",
            "cloud": false,
            "kubernetes": false,
            "credential_type_id": 2
        },
        "unified_job_template": {
            "id": 17,
            "name": "test 7",
            "description": "",
            "unified_job_type": "project_update"
        },
        "instance_group": {
            "id": 1,
            "name": "controlplane",
            "is_container_group": false
        },
        "created_by": {
            "id": 1,
            "username": "admin",
            "first_name": "",
            "last_name": ""
        },
        "user_capabilities": {
            "delete": true,
            "start": true
        }
    },
    "created": "2021-06-22T06:06:06.185402Z",
    "modified": "2021-06-22T06:06:06.353727Z",
    "name": "test 7",
    "description": "",
    "local_path": "_17__test_5_84845_pm",
    "scm_type": "git",
    "scm_url": "git@*********:dev-ansible/customer-repos/4056.git",
    "scm_branch": "",
    "scm_refspec": "",
    "scm_clean": false,
    "scm_track_submodules": false,
    "scm_delete_on_update": false,
    "credential": 3,
    "timeout": 0,
    "scm_revision": "",
    "unified_job_template": 17,
    "launch_type": "manual",
    "status": "running",
    "execution_environment": 2,
    "failed": false,
    "started": "2021-06-22T06:06:06.450368Z",
    "finished": null,
    "canceled_on": null,
    "elapsed": 641785.021096,
    "job_args": "[\"ssh-agent\", \"sh\", \"-c\", \"trap 'rm -f /tmp/pdd_wrapper_10000_omy7294z/awx_10000_uur9h0_7/artifacts/10000/ssh_key_data' EXIT && ssh-add /tmp/pdd_wrapper_10000_omy7294z/awx_10000_uur9h0_7/artifacts/10000/ssh_key_data && rm -f /tmp/pdd_wrapper_10000_omy7294z/awx_10000_uur9h0_7/artifacts/10000/ssh_key_data && ansible-playbook -t update_git,install_roles,install_collections -i /tmp/pdd_wrapper_10000_omy7294z/awx_10000_uur9h0_7/inventory/hosts -e @/tmp/pdd_wrapper_10000_omy7294z/awx_10000_uur9h0_7/env/extravars project_update.yml\"]",
    "job_cwd": "/tmp/pdd_wrapper_10000_omy7294z/awx_10000_uur9h0_7/project",
    "job_env": {
        "FAST_AWX_SERVICE_PORT_80_TCP_ADDR": "10.43.75.245",
        "FAST_AWX_SERVICE_PORT_80_TCP": "tcp://10.43.75.245:80",
        "TESTFAST_AWX_SERVICE_PORT": "tcp://10.43.157.90:80",
        "HOSTNAME": "testfast-awx-794cb77d76-mr82w",
        "FAST_AWX_SERVICE_PORT_80_TCP_PROTO": "tcp",
        "TESTFAST_AWX_SERVICE_SERVICE_PORT_HTTP": "80",
        "TESTFAST_AWX_SERVICE_PORT_80_TCP_PORT": "80",
        "KUBERNETES_PORT_443_TCP_PROTO": "tcp",
        "KUBERNETES_PORT_443_TCP_ADDR": "10.43.0.1",
        "FAST_AWX_SERVICE_PORT": "tcp://10.43.75.245:80",
        "KUBERNETES_PORT": "tcp://10.43.0.1:443",
        "PWD": "/runner",
        "HOME": "/home/runner",
        "KUBERNETES_SERVICE_PORT_HTTPS": "443",
        "KUBERNETES_PORT_443_TCP_PORT": "443",
        "TESTFAST_AWX_SERVICE_SERVICE_PORT": "80",
        "FAST_AWX_SERVICE_SERVICE_PORT": "80",
        "TESTFAST_AWX_SERVICE_PORT_80_TCP_ADDR": "10.43.157.90",
        "FAST_AWX_SERVICE_PORT_80_TCP_PORT": "80",
        "KUBERNETES_PORT_443_TCP": "tcp://10.43.0.1:443",
        "TESTFAST_AWX_SERVICE_PORT_80_TCP_PROTO": "tcp",
        "SHLVL": "0",
        "KUBERNETES_SERVICE_PORT": "443",
        "TESTFAST_AWX_SERVICE_PORT_80_TCP": "tcp://10.43.157.90:80",
        "FAST_AWX_SERVICE_SERVICE_HOST": "10.43.75.245",
        "PATH": "/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
        "KUBERNETES_SERVICE_HOST": "10.43.0.1",
        "FAST_AWX_SERVICE_SERVICE_PORT_HTTP": "80",
        "TESTFAST_AWX_SERVICE_SERVICE_HOST": "10.43.157.90",
        "LC_CTYPE": "C.UTF-8",
        "ANSIBLE_FORCE_COLOR": "True",
        "ANSIBLE_HOST_KEY_CHECKING": "False",
        "ANSIBLE_INVENTORY_UNPARSED_FAILED": "True",
        "ANSIBLE_PARAMIKO_RECORD_HOST_KEYS": "False",
        "AWX_PRIVATE_DATA_DIR": "/tmp/pdd_wrapper_10000_omy7294z/awx_10000_uur9h0_7",
        "ANSIBLE_RETRY_FILES_ENABLED": "False",
        "ANSIBLE_ASK_PASS": "False",
        "ANSIBLE_BECOME_ASK_PASS": "False",
        "DISPLAY": "",
        "TMP": "/tmp",
        "PROJECT_UPDATE_ID": "10000",
        "ANSIBLE_GALAXY_SERVER_SERVER0_URL": "https://galaxy.ansible.com/",
        "ANSIBLE_GALAXY_SERVER_LIST": "server0",
        "PYTHONPATH": ":/usr/local/lib/python3.8/site-packages/ansible_runner/config/../callbacks",
        "ANSIBLE_CALLBACK_PLUGINS": "/usr/local/lib/python3.8/site-packages/ansible_runner/config/../callbacks",
        "ANSIBLE_STDOUT_CALLBACK": "awx_display",
        "AWX_ISOLATED_DATA_DIR": "/tmp/pdd_wrapper_10000_omy7294z/awx_10000_uur9h0_7/artifacts/10000",
        "RUNNER_OMIT_EVENTS": "False",
        "RUNNER_ONLY_FAILED_EVENTS": "False",
        "RECEPTOR_UNIT_ID": "lwpIw4L6"
    },
    "job_explanation": "",
    "execution_node": "testfast-awx-794cb77d76-mr82w",
    "result_traceback": "",
    "event_processing_finished": false,
    "launched_by": {
        "id": 1,
        "name": "admin",
        "type": "user",
        "url": "/api/v2/users/1/"
    },
    "project": 17,
    "job_type": "check",
    "job_tags": "update_git,install_roles,install_collections",
    "host_status_counts": {},
    "playbook_counts": {
        "play_count": 0,
        "task_count": 0
    }
}

@shanemcd shanemcd added the sustaining:support Support prioritized issues label Jul 9, 2021
@SomePati

ENVIRONMENT
AWX version: 19.2.2
AWX install method: awx-operator
Ansible version: 2.11.2.post0
Operating System: Kubernetes v1.20.7 running on Debian 10 servers with docker engine 20.10.7
Web Browser: Firefox on Windows

I hit the issue after job ID 1600.
In my awx_task log there was the same error as in the closed issue #9559.
After redeploying AWX, the issue is gone for now.

HTTP 200 OK
Allow: GET, DELETE, HEAD, OPTIONS
Content-Type: application/json
Vary: Accept
X-API-Node: awx-t-68f9bfcc59-7sngv
X-API-Product-Name: AWX
X-API-Product-Version: 19.2.2
X-API-Time: 0.287s

{
    "id": 1601,
    "type": "project_update",
    "url": "/api/v2/project_updates/1601/",
    "related": {
        "credential": "/api/v2/credentials/3/",
        "unified_job_template": "/api/v2/projects/8/",
        "stdout": "/api/v2/project_updates/1601/stdout/",
        "execution_environment": "/api/v2/execution_environments/5/",
        "project": "/api/v2/projects/8/",
        "cancel": "/api/v2/project_updates/1601/cancel/",
        "scm_inventory_updates": "/api/v2/project_updates/1601/scm_inventory_updates/",
        "notifications": "/api/v2/project_updates/1601/notifications/",
        "events": "/api/v2/project_updates/1601/events/"
    },
    "summary_fields": {
        "organization": {
            "id": 1,
            "name": "Default",
            "description": ""
        },
        "execution_environment": {
            "id": 5,
            "name": "Control Plane Execution Environment",
            "description": "",
            "image": "quay.io/ansible/awx-ee:0.5.0"
        },
        "project": {
            "id": 8,
            "name": "inventory",
            "description": "",
            "status": "successful",
            "scm_type": "git",
            "allow_override": false
        },
        "credential": {
            "id": 3,
            "name": "user",
            "description": "",
            "kind": "scm",
            "cloud": false,
            "kubernetes": false,
            "credential_type_id": 2
        },
        "unified_job_template": {
            "id": 8,
            "name": "inventory",
            "description": "",
            "unified_job_type": "project_update"
        },
        "instance_group": {
            "id": 3,
            "name": "default",
            "is_container_group": true
        },
        "user_capabilities": {
            "delete": true,
            "start": true
        }
    },
    "created": "2021-07-22T01:00:12.941652Z",
    "modified": "2021-07-22T01:00:12.941684Z",
    "name": "inventory",
    "description": "",
    "local_path": "_8__inventory",
    "scm_type": "git",
    "scm_url": "somegitsrv",
    "scm_branch": "",
    "scm_refspec": "",
    "scm_clean": true,
    "scm_track_submodules": false,
    "scm_delete_on_update": false,
    "credential": 3,
    "timeout": 0,
    "scm_revision": "a2a7f3ba6af1e7e9d5005cf18c1d7b9dd2d7367a",
    "unified_job_template": 8,
    "launch_type": "sync",
    "status": "successful",
    "execution_environment": 5,
    "failed": false,
    "started": "2021-07-22T01:00:12.941363Z",
    "finished": "2021-07-22T01:00:15.210402Z",
    "canceled_on": null,
    "elapsed": 2.269,
    "job_args": "[\"ansible-playbook\", \"-t\", \"update_git\", \"-i\", \"/tmp/pdd_wrapper_1601_xt4s735b/awx_1601_slj0nr1e/inventory/hosts\", \"-e\", \"@/tmp/pdd_wrapper_1601_xt4s735b/awx_1601_slj0nr1e/env/extravars\", \"project_update.yml\"]",
    "job_cwd": "/tmp/pdd_wrapper_1601_xt4s735b/awx_1601_slj0nr1e/project",
    "job_env": {
        "AWX_T_SERVICE_PORT_80_TCP_ADDR": "10.43.123.164",
        "AWX_SERVICE_PORT_80_TCP_ADDR": "10.43.194.113",
        "AWX_POSTGRES_CLONE_SERVICE_PORT": "5432",
        "AWX_POSTGRES_CLONE_SERVICE_HOST": "10.43.60.187",
        "HOSTNAME": "awx-t-68f9bfcc59-vcjt7",
        "AWX_SERVICE_PORT": "tcp://10.43.194.113:80",
        "AWX_SERVICE_PORT_80_TCP_PORT": "80",
        "AWX_POSTGRES_CLONE_PORT_5432_TCP_ADDR": "10.43.60.187",
        "AWX_OPERATOR_METRICS_PORT_8383_TCP_PROTO": "tcp",
        "KUBERNETES_PORT_443_TCP_PROTO": "tcp",
        "AWX_OPERATOR_METRICS_PORT_8383_TCP_ADDR": "10.43.229.136",
        "KUBERNETES_PORT_443_TCP_ADDR": "10.43.0.1",
        "AWX_OPERATOR_METRICS_SERVICE_PORT": "8383",
        "AWX_SERVICE_PORT_80_TCP_PROTO": "tcp",
        "KUBERNETES_PORT": "tcp://10.43.0.1:443",
        "AWX_T_SERVICE_PORT_80_TCP_PROTO": "tcp",
        "AWX_OPERATOR_METRICS_SERVICE_HOST": "10.43.229.136",
        "PWD": "/runner",
        "HOME": "/home/runner",
        "AWX_T_SERVICE_PORT": "tcp://10.43.123.164:80",
        "AWX_T_SERVICE_SERVICE_PORT_HTTP": "80",
        "AWX_POSTGRES_CLONE_PORT_5432_TCP": "tcp://10.43.60.187:5432",
        "AWX_SERVICE_PORT_80_TCP": "tcp://10.43.194.113:80",
        "KUBERNETES_SERVICE_PORT_HTTPS": "443",
        "AWX_OPERATOR_METRICS_PORT_8383_TCP_PORT": "8383",
        "AWX_SERVICE_SERVICE_PORT": "80",
        "KUBERNETES_PORT_443_TCP_PORT": "443",
        "AWX_SERVICE_SERVICE_HOST": "10.43.194.113",
        "AWX_OPERATOR_METRICS_PORT_8686_TCP_ADDR": "10.43.229.136",
        "AWX_OPERATOR_METRICS_PORT_8686_TCP": "tcp://10.43.229.136:8686",
        "AWX_POSTGRES_CLONE_PORT_5432_TCP_PORT": "5432",
        "AWX_POSTGRES_CLONE_PORT_5432_TCP_PROTO": "tcp",
        "KUBERNETES_PORT_443_TCP": "tcp://10.43.0.1:443",
        "AWX_OPERATOR_METRICS_PORT_8383_TCP": "tcp://10.43.229.136:8383",
        "AWX_T_SERVICE_PORT_80_TCP_PORT": "80",
        "AWX_T_SERVICE_SERVICE_PORT": "80",
        "AWX_OPERATOR_METRICS_SERVICE_PORT_HTTP_METRICS": "8383",
        "AWX_T_SERVICE_PORT_80_TCP": "tcp://10.43.123.164:80",
        "AWX_OPERATOR_METRICS_PORT": "tcp://10.43.229.136:8383",
        "AWX_POSTGRES_CLONE_PORT": "tcp://10.43.60.187:5432",
        "SHLVL": "0",
        "AWX_OPERATOR_METRICS_PORT_8686_TCP_PORT": "8686",
        "KUBERNETES_SERVICE_PORT": "443",
        "AWX_OPERATOR_METRICS_PORT_8686_TCP_PROTO": "tcp",
        "AWX_POSTGRES_CLONE_SERVICE_PORT_5432TCP02": "5432",
        "AWX_T_SERVICE_SERVICE_HOST": "10.43.123.164",
        "PATH": "/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
        "KUBERNETES_SERVICE_HOST": "10.43.0.1",
        "AWX_SERVICE_SERVICE_PORT_HTTP": "80",
        "AWX_OPERATOR_METRICS_SERVICE_PORT_CR_METRICS": "8686",
        "LC_CTYPE": "C.UTF-8",
        "ANSIBLE_FORCE_COLOR": "True",
        "ANSIBLE_HOST_KEY_CHECKING": "False",
        "ANSIBLE_INVENTORY_UNPARSED_FAILED": "True",
        "ANSIBLE_PARAMIKO_RECORD_HOST_KEYS": "False",
        "AWX_PRIVATE_DATA_DIR": "/tmp/pdd_wrapper_1601_xt4s735b/awx_1601_slj0nr1e",
        "ANSIBLE_RETRY_FILES_ENABLED": "False",
        "ANSIBLE_ASK_PASS": "False",
        "ANSIBLE_BECOME_ASK_PASS": "False",
        "DISPLAY": "",
        "TMP": "/tmp",
        "PROJECT_UPDATE_ID": "1601",
        "ANSIBLE_GALAXY_SERVER_SERVER0_URL": "https://galaxy.ansible.com/",
        "ANSIBLE_GALAXY_SERVER_LIST": "server0",
        "PYTHONPATH": ":/usr/local/lib/python3.8/site-packages/ansible_runner/config/../callbacks",
        "ANSIBLE_CALLBACK_PLUGINS": "/usr/local/lib/python3.8/site-packages/ansible_runner/config/../callbacks",
        "ANSIBLE_STDOUT_CALLBACK": "awx_display",
        "AWX_ISOLATED_DATA_DIR": "/tmp/pdd_wrapper_1601_xt4s735b/awx_1601_slj0nr1e/artifacts/1601",
        "RUNNER_OMIT_EVENTS": "False",
        "RUNNER_ONLY_FAILED_EVENTS": "False"
    },
    "job_explanation": "",
    "execution_node": "",
    "result_traceback": "",
    "event_processing_finished": false,
    "launched_by": {
        "id": 8,
        "name": "inventory",
        "type": "project",
        "url": "/api/v2/projects/8/"
    },
    "work_unit_id": "mbq6sYWJ",
    "project": 8,
    "job_type": "run",
    "job_tags": "update_git",
    "host_status_counts": {},
    "playbook_counts": {
        "play_count": 0,
        "task_count": 0
    }
}

@JustInVTime
Author

@shanemcd Today we hit the 10000-job mark again, with the same behaviour. All jobs with ID 10000+ are stuck.

What do you need from us to debug and fix this issue?

@samalv

samalv commented Aug 4, 2021

Has anyone found a solution for this one? :-)

@SomePati

SomePati commented Aug 5, 2021

Has anyone found a solution for this one? :-)

My workaround is to cancel the pending jobs and redeploy AWX. That works for a while, but the problem reoccurs. A sketch of bulk-cancelling the pending jobs via the API follows.
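
For the cancel half of that workaround, a sketch using the API, with placeholder host and token (each unified job exposes a related cancel endpoint while it is still cancellable):

import requests

AWX_URL = "https://awx.example.com"   # placeholder
HEADERS = {"Authorization": "Bearer REPLACE_WITH_TOKEN"}

# Page through pending unified jobs and POST to each one's cancel endpoint.
url = f"{AWX_URL}/api/v2/unified_jobs/?status=pending&page_size=200"
while url:
    page = requests.get(url, headers=HEADERS).json()
    for job in page["results"]:
        cancel = job["related"].get("cancel")
        if cancel:
            requests.post(f"{AWX_URL}{cancel}", headers=HEADERS)
            print(f"cancelled {job['type']} {job['id']}")
    url = f"{AWX_URL}{page['next']}" if page["next"] else None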

@JRNC

JRNC commented Aug 5, 2021

We are experiencing the same thing. We went from a 17.0.1 local Docker install to 19.2.2 on Kubernetes (k3s) and already had well over 10,000 historical jobs. When we try to run jobs, we get zero output, even after setting verbosity to level 4. We have tried 19.2.0 and 19.2.1 to no avail. I have noticed that by default we have two instance groups (controlplane and tower) and one container group (default). Our jobs defaulted to the default container group, and that is when we see zero output. If I force a job to run on controlplane or tower instead of default, it is stuck in a pending state and never runs.

@JRNC

JRNC commented Aug 6, 2021

Following up on my previous post. After changing the instance group to tower or controlplane, below is the message shown while the job is stuck in the pending state:
[screenshot: message shown while the job is pending]
This seems to be erroneous: I have tried adding additional instances to the instance groups, and I have tried doubling the requests and limits for the task and EE container resources. I also see 0% used capacity for both instance groups:
[screenshot: instance group capacity]

It is also worth noting that when the default container group is associated with a normal job template, there is no output and I never see a new pod spin up, but I can successfully run an inventory sync and see a new pod temporarily spin up with "automation-job" in its name.

@JRNC

JRNC commented Aug 12, 2021

After having this occur on k3s, I stood up a new server running RKE2 and ran into exactly the same issue. There does appear to be something wrong with instance groups and how they function on certain clusters.

@JRNC

JRNC commented Aug 12, 2021

For project updates and jobs, when they are told to run on the tower instance group (one of the default instance groups) instead of the default container group, they show the following errors in the awx-task container while stuck in the pending state (the log was grepped for "scheduler"):

2021-08-12 19:34:24,237 DEBUG [69a12d2d38974d878236a3d842c2d0e4] awx.main.scheduler Running task manager.
2021-08-12 19:34:24,244 DEBUG [69a12d2d38974d878236a3d842c2d0e4] awx.main.scheduler Starting Scheduler
2021-08-12 19:34:24,292 DEBUG [69a12d2d38974d878236a3d842c2d0e4] awx.main.scheduler Skipping group tower, task cannot run on control plane
2021-08-12 19:34:24,295 DEBUG [69a12d2d38974d878236a3d842c2d0e4] awx.main.scheduler job 16905 (pending) couldn't be scheduled on graph, waiting for next cycle
2021-08-12 19:34:24,295 DEBUG [69a12d2d38974d878236a3d842c2d0e4] awx.main.scheduler Finishing Scheduler
2021-08-12 19:34:34,252 DEBUG [57156e73c4c641688a226ea8ac2c6932] awx.main.dispatch task da83ac53-1bfa-4cb8-9801-d1b7af0a7903 starting awx.main.tasks.awx_periodic_scheduler(*[])
2021-08-12 19:34:34,260 DEBUG [57156e73c4c641688a226ea8ac2c6932] awx.main.tasks Starting periodic scheduler
2021-08-12 19:34:34,262 DEBUG [57156e73c4c641688a226ea8ac2c6932] awx.main.tasks Last scheduler run was: 2021-08-12 19:34:04.210536+00:00
2021-08-12 19:34:44,270 DEBUG [e26dbab5c9654c33b9e9ae6bb22b4d03] awx.main.dispatch task 1ec0dcea-3d39-45ce-b178-9da4b073a3f5 starting awx.main.scheduler.tasks.run_task_manager(*[])
2021-08-12 19:34:44,270 DEBUG [e26dbab5c9654c33b9e9ae6bb22b4d03] awx.main.scheduler Running task manager.
2021-08-12 19:34:44,278 DEBUG [e26dbab5c9654c33b9e9ae6bb22b4d03] awx.main.scheduler Starting Scheduler
2021-08-12 19:34:44,322 DEBUG [e26dbab5c9654c33b9e9ae6bb22b4d03] awx.main.scheduler Skipping group tower, task cannot run on control plane
2021-08-12 19:34:44,325 DEBUG [e26dbab5c9654c33b9e9ae6bb22b4d03] awx.main.scheduler job 16905 (pending) couldn't be scheduled on graph, waiting for next cycle
2021-08-12 19:34:44,326 DEBUG [e26dbab5c9654c33b9e9ae6bb22b4d03] awx.main.scheduler Finishing Scheduler

It appears as though we are being caught by the following conditional statements:

for rampart_group in preferred_instance_groups:
    if task.can_run_containerized and rampart_group.is_container_group:
        self.graph[rampart_group.name]['graph'].add_job(task)
        self.start_task(task, rampart_group, task.get_jobs_fail_chain(), None)
        found_acceptable_queue = True
        break
    if not task.can_run_on_control_plane:
        logger.debug("Skipping group {}, task cannot run on control plane".format(rampart_group.name))
        continue

It seems our tasks' can_run_containerized attribute is somehow set to False, and this is preventing our tasks from running. I don't know how this attribute is declared on a per-task basis; is there a way for us to manually change it for a task and test? A sketch of inspecting these properties from a Django shell follows.
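
For what it's worth, these look like computed model properties rather than stored fields (per the snippets below), so they can be inspected but not directly set. A sketch of checking them from a Django shell inside the task container, assuming the model and property names match this AWX version:

# Inside the awx-task container: awx-manage shell_plus
from awx.main.models import ProjectUpdate, Job

pu = ProjectUpdate.objects.get(pk=10000)        # a stuck project update
print(pu.can_run_containerized)                 # computed by the model class
print(pu.can_run_on_control_plane)              # False on K8s if the base class wins

j = Job.objects.order_by("-pk").first()         # most recent playbook job
print(j.can_run_containerized, j.can_run_on_control_plane)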

@JRNC

JRNC commented Aug 12, 2021

The class for unified jobs appears to set the can_run_containerized and can_run_on_control_plane properties to False:

@property
def can_run_on_control_plane(self):
    if settings.IS_K8S:
        return False
    return True

@property
def can_run_containerized(self):
    return False

However, for jobs specifically, both are set to True (awx/awx/main/models/jobs.py, lines 1239 to 1241 at c09cad3):

@property
def can_run_on_control_plane(self):
    return True

and (awx/awx/main/models/jobs.py, lines 746 to 748 at c09cad3):

@property
def can_run_containerized(self):
    return True

Also, for inventories, can_run_containerized is set to True (which is why inventory syncs still work):

@property
def can_run_containerized(self):
    return True

I looked into our unified job templates (https://yourawxserver.yourdomain.com/api/v2/unified_job_templates) and our job templates (https://yourawxserver.yourdomain.com/api/v2/job_templates). I determined that many of our templates and all of our projects are classified as unified job templates. It appears that any launched template will hang, because it is either a unified job template itself or is associated with a project, which is classified as a unified job template.

If anyone else who has experienced this issue can corroborate what I have mentioned above, that could be very helpful.

Does anyone know the intended distinction between "unified job templates" and plain "job templates"? Also, does anyone know whether all projects should be classified as "unified job templates", and why it would be desirable to have their can_run_containerized and can_run_on_control_plane properties set to False?

I am wondering whether we have a case of objects being erroneously identified as "unified job templates", or whether something else is going on.

@wenottingham
Contributor

All templates (job, workflow, project update, etc.) are unified job templates - that's the base class they all inherit from. (A simplified sketch of the hierarchy follows.)
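
A simplified picture of that hierarchy; illustrative only, not the real definitions in awx/main/models:

class UnifiedJobTemplate:                     # common base: status, schedules, last run, ...
    ...

class Project(UnifiedJobTemplate):            # launches ProjectUpdate jobs
    ...

class InventorySource(UnifiedJobTemplate):    # launches InventoryUpdate jobs
    ...

class JobTemplate(UnifiedJobTemplate):        # launches Job (playbook) jobs
    ...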

@JRNC

JRNC commented Aug 18, 2021

@wenottingham Is it the intention to only allow jobs to run on container groups and no longer allow jobs to run on instance groups? There seem to be two issues:

  1. The can_run_on_control_plane property is clearly being set to False here, because we are seeing the debug message "Skipping group {}, task cannot run on control plane". I don't know whether this is a precedence issue: the parent class sets can_run_on_control_plane to False since we are running on K8s, while the intent appears to be for the child class to override that property to True, which does not seem to be happening in this case. Also, as you can see, I am targeting the tower instance group. If the intent was only to prevent jobs from running on the controlplane instance group (but not on all instance groups), it appears to have had the unintended consequence of preventing execution on any instance group (as opposed to container group).

  2. It appears that nothing but an inventory refresh can successfully leverage a container group. When a job is launched on the default container group, it hangs with a running status and never produces any standard output. The logs don't show any error messages; they just end at the project update's "running playbook" line, and no automation-job pods spin up until you cancel the job. Once the job is cancelled, it appears to attempt to spin up the automation-job container and logs some lines implying it is going to run, but the pod is quickly terminated due to the cancellation request.

[screenshot: job log lines around the cancellation]

@JRNC

JRNC commented Aug 18, 2021

I think I may have tracked down the issue, although I am wondering whether this is truly exclusive to job IDs > 9999. If it is, there may be something additional going on.

The code snippet below, where the can_run_containerized property is set to True, is present in inventory.py and jobs.py, but not in projects.py. All of the issues reported here occur either when a project update is run directly, or when a job is run and automatically attempts a project update before executing the playbook.

@property
def can_run_containerized(self):
    return True

Was there any reason to exclude setting this property to True in awx/awx/main/models/projects.py? If not, I believe this could be the root of the issue. A hypothetical sketch of such an override follows.
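
If that hypothesis is right, the fix would presumably be the same override on the project update model; a hypothetical, self-contained sketch (the class names are stand-ins, not the actual upstream patch):

# Hypothetical sketch: mirror the override from jobs.py and inventory.py
# on the project update model in projects.py.

class UnifiedJob:                     # stand-in for awx.main.models.UnifiedJob
    @property
    def can_run_containerized(self):  # base-class default that causes the skip
        return False

class ProjectUpdate(UnifiedJob):      # simplified; the real model has far more
    @property
    def can_run_containerized(self):  # proposed override, as in jobs.py
        return True

assert ProjectUpdate().can_run_containerized is True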

@rbicker

rbicker commented Sep 29, 2021

FYI, I have encountered issue #11051, which I found to be connected to this one. From what I can tell, the issue occurs for jobs with IDs > 9999 that use SSH keys with passphrases (at least in our case). It happens because the jobs wait indefinitely for the passphrase. When I manually inject the passphrases, the jobs proceed. Maybe this helps further troubleshooting of the issue.

@JRNC

JRNC commented Oct 6, 2021

FYI, this issue appears to be fixed for me with v19.4.0. The changes made to the instance and container group logic seem to have resolved it.

@JustInVTime
Author

Yeah! I've updated our test environment to 19.4.0 this evening. It contained the test database with jobs with ID > 9999 stuck in a pending state, doing nothing, and they seem to run now.

I will update production soon. The current production job ID is in the 5k range (I had to reset it last week because of this bug), so we will find out soon enough whether our production environment keeps working after the upgrade to 19.4.0. If so, I will close this issue.

@hans2520

I posted a question on stackoverflow which led me here -- https://stackoverflow.com/questions/70320452/awx-5-0-0-all-jobs-stop-processing-and-hang-indefinitely-why

There's a bit of analysis in there. I don't see a reason why the job ID counter hitting 10k would trigger this condition; however, the symptoms are identical to those of other commenters here. It's the injection of credentials via the file pipe that causes the issue. Maybe the pipe name becomes invalid somehow after the job number hits 10k?

@hans2520

I posted a question on stackoverflow which led me here -- https://stackoverflow.com/questions/70320452/awx-5-0-0-all-jobs-stop-processing-and-hang-indefinitely-why

As an update to this comment, it turns out that in our case the root cause was CrowdStrike: it had recently deployed script-based execution monitoring, and this feature was blocking the FIFO file pipe AWX uses to start jobs with the needed SSH key. I opened a new issue for the AWX team to look at improving the FIFO pipe write so as to give users a better indication of what is happening if such a hang occurs. A minimal illustration of the blocking FIFO behaviour is sketched below.
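
For anyone hitting a similar hang: opening a FIFO for writing blocks until a reader attaches, so if security tooling blocks or kills the reader, the writer waits forever. A minimal standalone illustration (this deliberately hangs, mirroring the stuck-job symptom):

import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "key_fifo")
os.mkfifo(path)

# open() for writing blocks until another process opens the FIFO for reading.
# If the reader is blocked (e.g. by endpoint security), this never returns;
# that is the same symptom as the stuck AWX jobs.
with open(path, "w") as fifo:
    fifo.write("ssh key material would be streamed here\n")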

@shanemcd
Member

Nice find. Closing this in favor of your new issue.

@AlanCoding
Member

I think #11453 is the new issue.

@kakawait

I've posted a monkey-patch (please read the final ATTENTION note) to bypass that limitation if you don't want to disable CrowdStrike or tune its configuration at the enterprise level.

https://stackoverflow.com/questions/70320452/awx-all-jobs-stop-processing-and-hang-indefinitely-why/70425369#70425369

@craph
Contributor

craph commented Jan 11, 2022

The CrowdStrike issue seems to be fixed with update 6.32.12905.
A6905A95
