
Task was marked as running in Tower but was not present in the job queue, so it has been marked as failed #9594

Open
bdoublet91 opened this issue Mar 16, 2021 · 42 comments


@bdoublet91

ISSUE TYPE
  • Bug Report
SUMMARY

Sometimes I have scheduled jobs that show as failed in AWX with no logs, even though the playbook actually ran on the server.
In this case it's a system update, like apt update && apt upgrade.

Other times the same scheduled jobs succeed with logs and everything is OK.

I get the following job explanation:

Task was marked as running in Tower but was not present in the job queue, so it has been marked as failed.
Identity added: /tmp/awx_537_84tgb_ux/artifacts/537/ssh_key_data (***)
Using /etc/ansible/ansible.cfg as config file
Vault password: 

PLAY [Update and Upgrade all packages on Linux machines] ***********************

TASK [Gathering Facts] *********************************************************

[screenshot]

ENVIRONMENT
  • AWX version: 15.0.1
  • AWX install method: openshift, minishift, docker on linux, docker for mac, boot2docker
  • Ansible version: 2.9.14
  • Operating System:
cat /etc/os-release 
NAME="CentOS Linux"
VERSION="8 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="CentOS Linux 8 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:8"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-8"
CENTOS_MANTISBT_PROJECT_VERSION="8"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="8"
  • Web Browser: Chrome Version 89.0.4389.82
STEPS TO REPRODUCE

Don't really know.

EXPECTED RESULTS

Complete logs and successful jobs.

ACTUAL RESULTS

Jobs fail randomly with no logs.

I don't use the latest version of AWX because there are too many bugs that make AWX painful to use, so I don't know if the latest version resolves this issue.
If you want more information or tests, feel free to ask. Thank you.

@ryanpetrello
Contributor

Hey @bdoublet91,

Would you give this a shot on a newer version of AWX? Generally this type of issue means a playbook unexpectedly exited during execution, but it could be a bug in AWX itself that's been addressed since 15.0.1 was released.

@bdoublet91
Author

I will be able to test with AWX 17.0.1 next week, I think.
I'm not sure about the playbook unexpectedly exiting, because 80% of the time it succeeds. I will add a task to this playbook that writes logs on my server to check what's going on. I will update this issue next week.

@bdoublet91
Author

Some news:
I have tested with AWX 17.1.0 and had no problems with failed jobs; all scheduled jobs worked. I also wanted to test 18.0.0 because some bugs were fixed, but I missed the breaking change about data migration and broke my database. Fortunately I had a backup, so I will redeploy 17.1.0, migrate to 18.0.0, and check if it works.

@bdoublet91
Author

bdoublet91 commented May 6, 2021

An update:

I still have problems with my playbooks.
I schedule system updates at 1:00am in different environments. The same code is executed, but it often fails in the integration environment without logs, whereas it succeeds in the other environments.
[screenshot]

last_day:

Identity added: /tmp/bwrap_996_m019kc_5/awx_996_du4k_26v/artifacts/996/ssh_key_data
Using /etc/ansible/ansible.cfg as config file
Vault password:

day_before

Identity added: /tmp/bwrap_963_obcg3d7x/awx_963_dw8t1kj2/artifacts/963/ssh_key_data
Using /etc/ansible/ansible.cfg as config file
Vault password: 
PLAY [Update and Upgrade all packages on Linux machines] ***********************
TASK [Gathering Facts] *********************************************************
Identity added: /tmp/bwrap_941_n0haer6o/awx_941_lmzmp9d3/artifacts/941/ssh_key_data
Using /etc/ansible/ansible.cfg as config file
Vault password: 
PLAY [Update and Upgrade all packages on Linux machines] ***********************
TASK [Gathering Facts] *********************************************************
ok: [int_haproxy1]
ok: [int_console1]
ok: [int_postgres2]

and sometimes, it succeeded:

PLAY RECAP *********************************************************************
int_administration1        : ok=11   changed=2    unreachable=0    failed=0    skipped=12   rescued=0    ignored=0   
int_console1               : ok=7    changed=1    unreachable=0    failed=0    skipped=12   rescued=0    ignored=0   
int_haproxy1               : ok=7    changed=1    unreachable=0    failed=0    skipped=12   rescued=0    ignored=0   
int_haproxy2               : ok=7    changed=1    unreachable=0    failed=0    skipped=12   rescued=0    ignored=0

AWX 17.1.0 and awx_task run on Docker Swarm. There is no container restart during the execution, and the update is applied correctly even though I have no logs in AWX.
When I relaunch the playbook manually, it succeeds.

Another point:
I have set log_path=/var/log/ansible/update_system.log to keep logs in the awx_task container and check if something is wrong.
The log file is created by the playbook, but no logs are written into it. The file's date changes as well.
If I launch my playbook from the CLI inside the awx_task container, the logs are written into the log file correctly.

Do you have any idea, or a test I can run, to debug these suspicious failures?
Does AWX have a problem writing logs to log_path?

Thanks

PS: I can't update my AWX version because since 18.0.0, Docker is no longer recommended (or supported) by the AWX installer.

EDIT:

This is my SSH configuration for playbooks in ansible.cfg. I once got this error:

"msg": "Failed to connect to the host via ssh: Failed to add the host to the list of known hosts (/root/.ssh/known_hosts).\r\nmuxclient: master hello exchange failed\r\nFailed to connect to new control master"

[ssh_connection]
pipelining=True
ssh_args = -o ControlMaster=auto -o ControlPersist=10m -o PreferredAuthentications=publickey
control_path = %(directory)s/ansible-ssh-%%h-%%p-%%r

@HOSTED-POWER

HOSTED-POWER commented Jul 22, 2021

We seem to have the error "Task was marked as running in Tower but was not present in the job queue, so it has been marked as failed."

This happens as soon as we run multiple jobs at the same time; with a single job it never occurs. We use AWX 19.2.2.

I could only find this in supervisor.log after it happened:

2021-07-22 19:48:15,401 INFO waiting for awx-dispatcher to stop
2021-07-22 19:48:15,417 INFO stopped: awx-dispatcher (terminated by SIGKILL)
2021-07-22 19:48:15,438 INFO waiting for awx-receiver to stop
2021-07-22 19:48:15,446 INFO stopped: awx-receiver (terminated by SIGKILL)
2021-07-22 19:48:15,455 INFO spawned: 'awx-dispatcher' with pid 31298
2021-07-22 19:48:16,616 INFO success: awx-dispatcher entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2021-07-22 19:48:16,619 INFO spawned: 'awx-receiver' with pid 31327
2021-07-22 19:48:17,627 INFO success: awx-receiver entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2021-07-22 19:48:18,034 INFO waiting for awx-rsyslogd to stop
2021-07-22 19:48:18,035 INFO stopped: awx-rsyslogd (exit status 0)
2021-07-22 19:48:18,039 INFO spawned: 'awx-rsyslogd' with pid 31374
2021-07-22 19:48:19,049 INFO success: awx-rsyslogd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)

@JoelBChapmanII

I am seeing this as well on K8s, AWX version 19.2.2. Like bdoublet91, it occurs when we launch multiple jobs at the same time.

@HOSTED-POWER

I checked many logs but couldn't find anything. It's like something is reset on certain events and this also cancels/kills the jobs.

@afnanqureshi

Check NTP on all your nodes. I faced a similar issue: NTP on one of the Ansible Tower server nodes was out of sync. Re-syncing them fixed the issue.

@HOSTED-POWER

I just have 1 box with 3 Docker containers, so it's unlikely there is a time problem?

@HOSTED-POWER

HOSTED-POWER commented Aug 4, 2021

Found this around the timestamp of the job failure:

awx-uwsgi stdout | 172.18.0.1 GET /api/v2/jobs/13336/ - HTTP/1.1 200
awx-uwsgi stdout |
awx-uwsgi stdout | 2021-08-04 08:04:29,768 INFO     [4d2e0f6f] awx.api.authentication User hosted-power performed a GET to /api/v2/jobs/13336/job_events/ through the API using OAuth 2 token 43.
awx-uwsgi stdout |
awx-uwsgi stdout | 2021-08-04 08:04:29,811 DEBUG    [4d2e0f6f] awx.analytics.performance request: <WSGIRequest: GET '/api/v2/jobs/13336/job_events/?order_by=start_line&no_truncate=True&start_line__gte=2541'>, response_time: 0.074s
awx-uwsgi stdout |
awx-uwsgi stdout | 172.18.0.1 GET /api/v2/jobs/13336/job_events/?order_by=start_line&no_truncate=True&start_line__gte=2541 - HTTP/1.1 200
awx-uwsgi stdout |
awx-uwsgi stdout | 2021-08-04 08:04:30,115 INFO     [90ad53de] awx.api.authentication User hosted-power performed a GET to /api/v2/jobs/13336/ through the API using OAuth 2 token 43.
awx-uwsgi stdout |
awx-uwsgi stdout | 2021-08-04 08:04:30,157 DEBUG    [90ad53de] awx.analytics.performance request: <WSGIRequest: GET '/api/v2/jobs/13336/'>, response_time: 0.087s
awx-uwsgi stdout |
awx-uwsgi stdout | 172.18.0.1 GET /api/v2/jobs/13336/ - HTTP/1.1 200
awx-uwsgi stdout |
awx-uwsgi stdout | 2021-08-04 08:04:30,209 INFO     [212c08ae] awx.api.authentication User hosted-power performed a GET to /api/v2/jobs/13336/job_events/ through the API using OAuth 2 token 43.
awx-uwsgi stdout |
awx-uwsgi stdout | WSGI app 0 (mountpoint='') ready in 2 seconds on interpreter 0x2411e30 pid: 12608 (default app)
awx-uwsgi stdout |
**awx-uwsgi stdout | running "exec: supervisorctl restart tower-processes:awx-dispatcher tower-processes:awx-receiver" (accepting1)...**
awx-uwsgi stdout |
awx-uwsgi stdout | 2021-08-04 08:04:30,252 DEBUG    [212c08ae] awx.analytics.performance request: <WSGIRequest: GET '/api/v2/jobs/13336/job_events/?order_by=start_line&no_truncate=True&start_line__gte=2546'>, response_time: 0.086s
awx-uwsgi stdout |
awx-uwsgi stdout | 172.18.0.1 GET /api/v2/jobs/13336/job_events/?order_by=start_line&no_truncate=True&start_line__gte=2546 - HTTP/1.1 200
awx-uwsgi stdout |
2021-08-04 08:04:30,412 INFO waiting for awx-dispatcher to stop
2021-08-04 08:04:30,421 INFO stopped: awx-dispatcher (terminated by SIGKILL)
2021-08-04 08:04:30,431 INFO stopped: awx-receiver (terminated by SIGKILL)
2021-08-04 08:04:30,437 INFO spawned: 'awx-dispatcher' with pid 12628
awx-dispatcher stdout | make[1]: Entering directory '/awx_devel'
awx-dispatcher stdout |
awx-uwsgi stdout | 2021-08-04 08:04:30,548 INFO     [385c06bf] awx.api.authentication User hosted-power performed a GET to /api/v2/jobs/13336/ through the API using OAuth 2 token 43.
awx-uwsgi stdout |
awx-receptor stdout | ERROR2021/08/04 08:04:30 Write error in control service: write unix /var/run/receptor/receptor.sock->@: write: broken pipe
awx-receptor stdout | INFO2021/08/04 08:04:30 Client disconnected from control service
awx-receptor stdout |
awx-uwsgi stdout | 2021-08-04 08:04:30,599 DEBUG    [385c06bf] awx.analytics.performance request: <WSGIRequest: GET '/api/v2/jobs/13336/'>, response_time: 0.086s
awx-uwsgi stdout |
awx-uwsgi stdout | 172.18.0.1 GET /api/v2/jobs/13336/ - HTTP/1.1 200
awx-uwsgi stdout |
awx-uwsgi stdout | 2021-08-04 08:04:30,640 INFO     [dd7aedd8] awx.api.authentication User hosted-power performed a GET to /api/v2/jobs/13336/job_events/ through the API using OAuth 2 token 43.
awx-uwsgi stdout |
awx-uwsgi stdout | 2021-08-04 08:04:30,668 DEBUG    [dd7aedd8] awx.analytics.performance request: <WSGIRequest: GET '/api/v2/jobs/13336/job_events/?order_by=start_line&no_truncate=True&start_line__gte=2549'>, response_time: 0.062s
awx-uwsgi stdout |
awx-uwsgi stdout | 172.18.0.1 GET /api/v2/jobs/13336/job_events/?order_by=start_line&no_truncate=True&start_line__gte=2549 - HTTP/1.1 200
awx-uwsgi stdout |
awx-uwsgi stdout | 2021-08-04 08:04:30,963 INFO     [c835d51a] awx.api.authentication User hosted-power performed a GET to /api/v2/jobs/13336/ through the API using OAuth 2 token 43.
awx-uwsgi stdout |

awx-uwsgi stdout | running "exec: supervisorctl restart tower-processes:awx-dispatcher tower-processes:awx-receiver" (accepting1)...

Why would it do this? It seems to happen at exactly the moment we experience the lost task.

@kimbernator

+1, also seeing this

@HOSTED-POWER

It's really severe on our side; we can hardly use AWX at the moment. As soon as we launch multiple jobs we get in trouble :(

@HOSTED-POWER

Unfortunately the upgrade to 19.3.0 didn't fix it

@kimbernator

This isn't particularly scientific, but it appears that this was mitigated for me in some capacity by increasing the memory limit for the awx-task container to 4Gi. I'll do some more testing once I get my kubernetes cluster beefed up a little bit to handle the extra requirements.

@HOSTED-POWER

We now have 16 GB of RAM and 6 CPU cores just for this AWX setup, and our issues still aren't resolved. We still get failed tasks :(

@kimbernator

@HOSTED-POWER Are those the requests/limits for the awx-task container or the specs of the system it runs on?

@HOSTED-POWER

Those are the specs of the host machine; the containers don't have any limits configured.

@kimbernator

In a situation like that I'd consider defining SYSTEM_TASK_ABS_MEM in the operator such that it can't spin up a basically unlimited number of queue workers in the awx-task container. Each one takes 150MB+ of memory itself and spins up an EE pod which could take much more than that, eventually choking the system and preventing jobs from being re-queued.

The max worker count when that setting is defined is just the value * 5, so it may take some tweaking and will likely result in a small throughput loss, but it may end up resolving the problem.

@HOSTED-POWER

I don't fully understand; we have 16 GB of RAM on the host machine.

The container has no limits.

Where is the problem coming from exactly?

I tried this on the container:

cat /proc/meminfo
MemTotal: 16427040 kB

Looks like it reads 16 GB, so where is this problem coming from? :)

(Just double checking to make sure it applies)

@kimbernator

kimbernator commented Sep 8, 2021

I'm not certain this will be the cause for everyone seeing this error, I can only say that it was what solved it for me. I would encourage you to read my extended description in #11036, but here it is fitted to your scenario:

The awx-task container determines the maximum number of workers it can start simultaneously to handle the lifecycle of EE containers running jobs in the queue based on the available system memory. In your case it's reading 16GB as the system memory budgeted just for queue workers, which calculates to a max of 85 workers ((16GB + 1) * 5). Each worker takes ~150-200MB of memory upon initialization. Obviously given the chance to have enough jobs, the workers alone would consume between 12,750MB and 17,000MB of memory. But that's not budgeting the likely much higher memory consumption of each individual EE pod being spun up, so the 16GB will be fully consumed long before you hit 85 workers, assuming that the EE pods run on the same system. Once all available memory has been used, these workers start dying off and their queued jobs are lost, presumably because there isn't enough memory to process re-queuing them.

By defining SYSTEM_TASK_ABS_MEM, you override AWX's built-in method of determining system memory and limit the maximum queue workers so they can't fill up the available system memory. In the awx-operator, I define that with the following because I gave my awx-task container a limit of 3Gi of memory:

  extra_settings:
    - setting: SYSTEM_TASK_ABS_MEM
      value: "3"

Now I'm limited to 15 concurrent jobs in AWX and I see no more errors.
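
For illustration, here is a rough sketch of that arithmetic in Python (an approximation of the "value * 5" rule of thumb described in this thread, not AWX's actual implementation; the per-worker figure is the 150-200 MB range quoted above):

def max_queue_workers(mem_gb: int) -> int:
    # Rule of thumb from this thread: the worker cap scales with the detected
    # (or SYSTEM_TASK_ABS_MEM-overridden) memory in GB, times 5.
    return mem_gb * 5

def worker_memory_mb(mem_gb: int, per_worker_mb: int = 175) -> int:
    # Memory the dispatcher workers alone could consume, before counting the
    # EE pods each one spins up.
    return max_queue_workers(mem_gb) * per_worker_mb

print(max_queue_workers(16), worker_memory_mb(16))  # ~80 workers, ~14000 MB
print(max_queue_workers(3), worker_memory_mb(3))    # 15 workers with SYSTEM_TASK_ABS_MEM=3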

@HOSTED-POWER

OK, that would make sense. Just one thing I find strange: we never used more than 5 GB of this 16 GB of RAM... (we log statistics of the host machine)

So I don't think we are exhausting resources? Or is something else possible?

@kimbernator

kimbernator commented Sep 8, 2021

It depends on the other pods in the cluster as well. If they have memory requests that are higher than their actual memory usage, they will prevent any other pod from accessing that memory without actually "using" it. If you're easily able to recreate the situation where the error described here happens, I would suggest running a kubectl describe node <node> while errors are occurring to get the requests/limits of the node to see if requests have come close to/exceeded 100%.

@kimbernator

Also, it may be silly but I'd verify that there aren't any default limits on pods in your namespace that might be applying to your pods without you realizing - would be visible in kubectl describe namespace <namespace>

@HOSTED-POWER

Thanks, but we run them in Docker at the moment, and no metric is showing any sign of overload as far as I can see.

In which .py file can I set SYSTEM_TASK_ABS_MEM? I would like to try it inside the container as a test.

@kimbernator

I have it in /etc/tower/settings.py.

@HOSTED-POWER

Ok I set this:

cat /etc/tower/settings.py
SYSTEM_TASK_ABS_MEM = 3

Not sure how this will help... :)

It's still quite a strange bug.

@HOSTED-POWER

This doesn't make any difference; it must be something (totally) different.

We keep getting dozens of failed tasks every day; it's hell and unusable :(

@HOSTED-POWER

Well, we migrated to Kubernetes only to see the same problems keep happening. We just see "Error" now, without any extra information.

@goldyfruit

goldyfruit commented Oct 27, 2021

Facing the same issue with 19.2.1 running in Docker; not cool with a 4-hour workflow :D

@goldyfruit

goldyfruit commented Oct 27, 2021

The worker is killed for no reason (that I'm aware of)... as if uWSGI reached a limit or a timeout.

awx-uwsgi stdout | ...The work of process 530 is done. Seeya!
awx-uwsgi stdout | 
awx-uwsgi stdout | worker 1 killed successfully (pid: 530)
awx-uwsgi stdout | 
awx-uwsgi stdout | Respawned uWSGI worker 1 (new pid: 11323)
awx-uwsgi stdout | 
awx-uwsgi stdout | Python auto-reloader enabled
awx-uwsgi stdout | 
awx-uwsgi stdout | WSGI app 0 (mountpoint='') ready in 1 seconds on interpreter 0x1df6cf0 pid: 11323 (default app)
awx-uwsgi stdout | running "exec: supervisorctl restart tower-processes:awx-dispatcher tower-processes:awx-receiver" (accepting1)...
awx-uwsgi stdout | 
2021-10-27 17:16:47,970 INFO stopped: awx-dispatcher (terminated by SIGKILL)
2021-10-27 17:16:47,971 INFO waiting for awx-receiver to stop
2021-10-27 17:16:47,971 INFO stopped: awx-receiver (terminated by SIGKILL)
2021-10-27 17:16:47,976 INFO spawned: 'awx-dispatcher' with pid 11329
awx-dispatcher stdout | make[1]: Entering directory '/awx_devel'
awx-dispatcher stdout | 
2021-10-27 17:16:48,992 INFO success: awx-dispatcher entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2021-10-27 17:16:48,994 INFO spawned: 'awx-receiver' with pid 11348
awx-receiver stdout | make[1]: Entering directory '/awx_devel'
awx-receiver stdout | 
awx-uwsgi stdout | 2021-10-27 17:16:49,926 INFO     [c35361bd] awx.api.authentication User goldyfruit performed a GET to /api/v2/metrics/ through the API using OAuth 2 token 176.
awx-uwsgi stdout | 
awx-uwsgi stdout | 2021-10-27 17:16:49,982 DEBUG    [c35361bd] awx.analytics.performance request: <WSGIRequest: GET '/api/v2/metrics/'>, response_time: 0.103s
awx-uwsgi stdout | 
awx-uwsgi stdout | 24.225.217.83 GET /api/v2/metrics/ - HTTP/1.1 200
awx-uwsgi stdout | 
2021-10-27 17:16:50,984 INFO success: awx-receiver entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
awx-uwsgi stdout | tower-processes:awx-dispatcher: stopped
awx-uwsgi stdout | tower-processes:awx-receiver: stopped
awx-uwsgi stdout | tower-processes:awx-dispatcher: started
awx-uwsgi stdout | tower-processes:awx-receiver: started
awx-uwsgi stdout | 
awx-dispatcher stdout | 2021-10-27 17:16:51,826 DEBUG    [-] awx.analytics.job_lifecycle job-1656 event processing finished
awx-dispatcher stdout | 
awx-receiver stdout | 2021-10-27 17:16:52,613 WARNING  [-] awx.main.commands.run_callback_receiver scaling up worker pid:11402
awx-receiver stdout | 
awx-receiver stdout | 2021-10-27 17:16:52,618 WARNING  [-] awx.main.commands.run_callback_receiver scaling up worker pid:11403
awx-receiver stdout | 
awx-receiver stdout | 2021-10-27 17:16:52,622 WARNING  [-] awx.main.commands.run_callback_receiver scaling up worker pid:11404
awx-receiver stdout | 
awx-receiver stdout | 2021-10-27 17:16:52,627 WARNING  [-] awx.main.commands.run_callback_receiver scaling up worker pid:11405
awx-receiver stdout | 
awx-uwsgi stdout | 2021-10-27 17:16:54,911 INFO     [9b66dac4] awx.api.authentication User goldyfruit performed a GET to /api/v2/metrics/ through the API using OAuth 2 token 176.
awx-uwsgi stdout | 
awx-uwsgi stdout | 2021-10-27 17:16:54,959 DEBUG    [9b66dac4] awx.analytics.performance request: <WSGIRequest: GET '/api/v2/metrics/'>, response_time: 0.080s
awx-uwsgi stdout | 
awx-uwsgi stdout | 24.225.217.83 GET /api/v2/metrics/ - HTTP/1.1 200
awx-uwsgi stdout | 
awx-dispatcher stdout | 2021-10-27 17:16:57,066 ERROR    [-] awx.main.dispatch job 1656 (failed) is no longer running; reaping
awx-dispatcher stdout | 
awx-receptor stdout | INFO 2021/10/27 17:16:57 Client connected to control service
awx-receptor stdout | 
awx-receptor stdout | INFO 2021/10/27 17:16:57 Control service closed
awx-receptor stdout | INFO 2021/10/27 17:16:57 Client disconnected from control service
awx-receptor stdout | 
2021-10-27 17:16:57,330 INFO waiting for awx-rsyslogd to stop
2021-10-27 17:16:57,331 INFO stopped: awx-rsyslogd (exit status 0)
2021-10-27 17:16:57,334 INFO spawned: 'awx-rsyslogd' with pid 11411
2021-10-27 17:16:58,336 INFO success: awx-rsyslogd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)

@goldyfruit

goldyfruit commented Oct 27, 2021

Just happened again with a job that ran for 1 hour and 52 minutes... part of a workflow.
Same exact error in the logs as above.

awx-dispatcher stdout | 2021-10-27 19:30:36,877 ERROR    [-] awx.main.dispatch job 1675 (failed) is no longer running; reaping

@christiangfeller

The same behavior here on OpenShift. When running multiple jobs at the same time, every Pod is created and the jobs are running. Suddenly (mostly when starting one more job), all Pods terminate unexpectedly.

Operator: 0.15.0
AWX: 19.5.0

No memory issues on OpenShift Pods.

2021-11-30 09:00:56,601 DEBUG    [935f3edba0b240f3b8d35a2200679f1f] awx.analytics.job_lifecycle job-3274 preparing playbook
2021-11-30 09:00:56,742 DEBUG    [935f3edba0b240f3b8d35a2200679f1f] awx.analytics.job_lifecycle job-3274 running playbook
2021-11-30 09:00:56,779 ERROR    [464de20e20ae4ab0a83893c2beb50947] awx.main.dispatch worker pid:1303 is gone (exit=-9)
2021-11-30 09:00:56,779 DEBUG    [464de20e20ae4ab0a83893c2beb50947] awx.main.dispatch task e8059321-e9fa-41f3-a6fc-0139dcdb815d starting awx.main.tasks.update_inventory_computed_fields(*[2])
2021-11-30 09:00:56,819 DEBUG    [464de20e20ae4ab0a83893c2beb50947] awx.main.models.inventory Going to update inventory computed fields, pk=2
2021-11-30 09:00:56,856 DEBUG    [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(2)
2021-11-30 09:00:56,856 DEBUG    [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(2)
2021-11-30 09:00:56,863 DEBUG    [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(1)
2021-11-30 09:00:56,863 DEBUG    [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(1)
2021-11-30 09:00:56,871 DEBUG    [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(1)
2021-11-30 09:00:56,871 DEBUG    [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(1)
2021-11-30 09:00:56,874 DEBUG    [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(1)
2021-11-30 09:00:56,874 DEBUG    [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(1)
2021-11-30 09:00:56,877 DEBUG    [464de20e20ae4ab0a83893c2beb50947] awx.main.models.inventory Finished updating inventory computed fields, pk=2, in 0.057 seconds
2021-11-30 09:00:57,080 DEBUG    [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(6)
2021-11-30 09:00:57,080 DEBUG    [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(6)
2021-11-30 09:00:57,282 ERROR    [464de20e20ae4ab0a83893c2beb50947] awx.main.dispatch job 3269 (failed) is no longer running; reaping
2021-11-30 09:00:57,283 ERROR    [464de20e20ae4ab0a83893c2beb50947] awx.main.dispatch requeuing 1 messages from gone worker pid:1303
2021-11-30 09:00:57,284 WARNING  [464de20e20ae4ab0a83893c2beb50947] awx.main.dispatch scaling down worker pid:1495
2021-11-30 09:00:57,285 WARNING  [464de20e20ae4ab0a83893c2beb50947] awx.main.dispatch worker exiting gracefully pid:1495
2021-11-30 09:00:57,293 WARNING  [464de20e20ae4ab0a83893c2beb50947] awx.main.dispatch scaling up worker pid:1517
2021-11-30 09:00:57,358 DEBUG    [464de20e20ae4ab0a83893c2beb50947] awx.main.dispatch task 5b70e761-9bdd-43f2-a824-d6c708a15a3f starting awx.main.tasks.handle_work_error(*['1bd0f432-d461-40a0-a345-e6d7a3f7b68f'])
2021-11-30 09:00:57,360 DEBUG    [464de20e20ae4ab0a83893c2beb50947] awx.main.tasks Executing error task id 1bd0f432-d461-40a0-a345-e6d7a3f7b68f, subtasks: [{'type': 'job', 'id': 3269}]
Instance Group already registered controlplane
Instance Group already registered default
2021-11-30 09:00:57,484 WARNING  [464de20e20ae4ab0a83893c2beb50947] awx.main.dispatch scaling up worker pid:1519
2021-11-30 09:00:57,559 DEBUG    [464de20e20ae4ab0a83893c2beb50947] awx.main.dispatch task 7a1e94d8-0c1d-402f-b667-1585e0463702 starting awx.main.scheduler.tasks.run_task_manager(*[])
2021-11-30 09:00:57,561 DEBUG    [464de20e20ae4ab0a83893c2beb50947] awx.main.scheduler Running task manager.
2021-11-30 09:00:57,573 DEBUG    [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(1)
2021-11-30 09:00:57,573 DEBUG    [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(1)
2021-11-30 09:00:57,586 DEBUG    [935f3edba0b240f3b8d35a2200679f1f] awx.analytics.job_lifecycle job-3274 work unit id received
2021-11-30 09:00:57,599 DEBUG    [464de20e20ae4ab0a83893c2beb50947] awx.main.scheduler Starting Scheduler
2021-11-30 09:00:57,654 DEBUG    [464de20e20ae4ab0a83893c2beb50947] awx.main.dispatch task 84d4ea8f-fa3f-4937-9658-aca1e9c28b9f starting awx.main.scheduler.tasks.run_task_manager(*[])
2021-11-30 09:00:57,657 DEBUG    [935f3edba0b240f3b8d35a2200679f1f] awx.analytics.job_lifecycle job-3274 work unit id assigned
2021-11-30 09:00:57,659 DEBUG    [464de20e20ae4ab0a83893c2beb50947] awx.main.scheduler Running task manager.
2021-11-30 09:00:57,688 DEBUG    [464de20e20ae4ab0a83893c2beb50947] awx.main.scheduler Not running scheduler, another task holds lock
2021-11-30 09:00:57,774 DEBUG    [464de20e20ae4ab0a83893c2beb50947] awx.analytics.job_lifecycle job-3275 acknowledged
2021-11-30 09:00:57,820 DEBUG    [464de20e20ae4ab0a83893c2beb50947] awx.analytics.job_lifecycle job-3275 controller node chosen
2021-11-30 09:00:57,830 DEBUG    [464de20e20ae4ab0a83893c2beb50947] awx.analytics.job_lifecycle job-3275 waiting
2021-11-30 09:00:57,830 DEBUG    [464de20e20ae4ab0a83893c2beb50947] awx.main.scheduler job 3275 (waiting) consumed 6 capacity units from low_limits_tocp with prior total of 246
2021-11-30 09:00:57,831 DEBUG    [464de20e20ae4ab0a83893c2beb50947] awx.main.scheduler Finishing Scheduler
2021-11-30 09:00:57,913 DEBUG    [464de20e20ae4ab0a83893c2beb50947] awx.main.dispatch task 6beb4a96-be30-46ea-80bf-cc7dd47ee35c starting awx.main.tasks.RunJob(*[3275])
2021-11-30 09:00:58,197 DEBUG    [464de20e20ae4ab0a83893c2beb50947] awx.analytics.job_lifecycle job-3275 pre run
2021-11-30 09:00:58,201 INFO exited: dispatcher (terminated by SIGKILL; not expected)
2021-11-30 09:00:58,201 INFO exited: dispatcher (terminated by SIGKILL; not expected)
2021-11-30 09:00:58,202 INFO spawned: 'dispatcher' with pid 1525
2021-11-30 09:00:58,202 INFO spawned: 'dispatcher' with pid 1525
2021-11-30 09:00:58,764 DEBUG    [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(1)
2021-11-30 09:00:58,764 DEBUG    [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(1)
2021-11-30 09:00:59,358 DEBUG    [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(1)
2021-11-30 09:00:59,358 DEBUG    [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(1)
2021-11-30 09:00:59,359 INFO success: dispatcher entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2021-11-30 09:00:59,359 INFO success: dispatcher entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2021-11-30 09:00:59,372 DEBUG    [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(1)
2021-11-30 09:00:59,372 DEBUG    [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(1)
2021-11-30 09:00:59,583 DEBUG    [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(1)
2021-11-30 09:00:59,583 DEBUG    [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(1)
2021-11-30 09:00:59,584 DEBUG    [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(1)
2021-11-30 09:00:59,584 DEBUG    [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(1)
2021-11-30 09:00:59,757 DEBUG    [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(1)
2021-11-30 09:00:59,757 DEBUG    [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(1)
2021-11-30 09:00:59,758 DEBUG    [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(1)
2021-11-30 09:00:59,758 DEBUG    [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(1)
2021-11-30 09:00:59,960 DEBUG    [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(1)
2021-11-30 09:00:59,960 DEBUG    [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(1)
2021-11-30 09:00:59,964 DEBUG    [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(1)
2021-11-30 09:00:59,964 DEBUG    [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(1)
2021-11-30 09:00:59,983 DEBUG    [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(1)
2021-11-30 09:00:59,983 DEBUG    [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(1)
2021-11-30 09:00:59,988 DEBUG    [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(1)
2021-11-30 09:00:59,988 DEBUG    [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(1)
2021-11-30 09:01:01,308 DEBUG    [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(3)
2021-11-30 09:01:01,308 DEBUG    [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(3)
RESULT 2
OKREADY
2021-11-30 09:01:01,495 WARNING  [-] awx.main.dispatch.periodic periodic beat started
2021-11-30 09:01:01,786 ERROR    [-] awx.main.dispatch job 3232 (failed) is no longer running; reaping
2021-11-30 09:01:01,908 ERROR    [-] awx.main.dispatch job 3234 (failed) is no longer running; reaping
2021-11-30 09:01:02,031 ERROR    [-] awx.main.dispatch job 3235 (failed) is no longer running; reaping
2021-11-30 09:01:02,175 ERROR    [-] awx.main.dispatch job 3236 (failed) is no longer running; reaping
2021-11-30 09:01:02,297 ERROR    [-] awx.main.dispatch job 3237 (failed) is no longer running; reaping
2021-11-30 09:01:02,435 ERROR    [-] awx.main.dispatch job 3238 (failed) is no longer running; reaping
2021-11-30 09:01:02,554 ERROR    [-] awx.main.dispatch job 3239 (failed) is no longer running; reaping
2021-11-30 09:01:02,681 ERROR    [-] awx.main.dispatch job 3240 (failed) is no longer running; reaping
2021-11-30 09:01:02,779 ERROR    [-] awx.main.dispatch job 3241 (failed) is no longer running; reaping
2021-11-30 09:01:02,913 ERROR    [-] awx.main.dispatch job 3242 (failed) is no longer running; reaping
2021-11-30 09:01:02,985 DEBUG    [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(1)
2021-11-30 09:01:02,985 DEBUG    [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(1)
2021-11-30 09:01:02,986 DEBUG    [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(1)
2021-11-30 09:01:02,986 DEBUG    [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(1)

@chris93111
Contributor

Hi, any update? I have the same error with 100 jobs at the same time in AWX 20.0. I have tried the new params SYSTEM_TASK_ABS_MEM = 3 and SYSTEM_TASK_ABS_CPU = "3000mi" and got the same result.

We are migrating from AWX 15.0 to 20 and we didn't have this problem before (15 in Docker, 20 in Kubernetes).

We also notice that during heavy load on the platform the playbooks are distributed in a non-homogeneous way.
In 15.0 each node was 50% full before going further.

[screenshot]

@Haagy

Haagy commented Mar 24, 2022

I do have the same issue when running jobs from outside awx with awx-cli in a container.

podman run -it --rm my-repo/awx/cli:3.8.5 --conf.username jenkins --conf.token TOKEN projects update 123 --monitor --format human --filter status will keep running forever while AWX shows the error message

  • Task was marked as running but was not present in the job queue, so it has been marked as failed.

The AWX log keeps repeating:

awx-uwsgi stdout | 2022-03-24 14:47:08,382 DEBUG    [9a7b95a3] awx.analytics.job_lifecycle projectupdate-44258 event processing finished
awx-uwsgi stdout |
awx-uwsgi stdout | 2022-03-24 14:47:08,419 DEBUG    [9a7b95a3] awx.analytics.performance request: <WSGIRequest: GET '/api/v2/project_updates/44258/'>, response_time: 0.100s
awx-uwsgi stdout |
awx-uwsgi stdout | 172.16.130.119 GET /api/v2/project_updates/44258/ - HTTP/1.1 200
awx-uwsgi stdout |
awx-uwsgi stdout | 2022-03-24 14:47:08,454 INFO     [3530108f] awx.api.authentication User jenkins performed a GET to /api/v2/project_updates/44258/events/ through the API using OAuth 2 token 8.
awx-uwsgi stdout |
awx-uwsgi stdout | 2022-03-24 14:47:08,479 DEBUG    [3530108f] awx.analytics.performance request: <WSGIRequest: GET '/api/v2/project_updates/44258/events/?order_by=start_line&no_truncate=True&start_line__gte=27'>, response_time: 0.055s
awx-uwsgi stdout |
awx-uwsgi stdout | 172.16.130.119 GET /api/v2/project_updates/44258/events/?order_by=start_line&no_truncate=True&start_line__gte=27 - HTTP/1.1 200
awx-uwsgi stdout |
awx-uwsgi stdout | 2022-03-24 14:47:08,765 INFO     [029d9a85] awx.api.authentication User jenkins performed a GET to /api/v2/project_updates/44258/ through the API using OAuth 2 token 8.
awx-uwsgi stdout |
awx-uwsgi stdout | 2022-03-24 14:47:08,796 DEBUG    [029d9a85] awx.analytics.job_lifecycle projectupdate-44258 event processing finished
awx-uwsgi stdout |
awx-uwsgi stdout | 2022-03-24 14:47:08,829 DEBUG    [029d9a85] awx.analytics.performance request: <WSGIRequest: GET '/api/v2/project_updates/44258/'>, response_time: 0.094s
awx-uwsgi stdout |
awx-uwsgi stdout | 172.16.130.119 GET /api/v2/project_updates/44258/ - HTTP/1.1 200
awx-uwsgi stdout |
awx-uwsgi stdout | 2022-03-24 14:47:08,864 INFO     [ab04fe0b] awx.api.authentication User jenkins performed a GET to /api/v2/project_updates/44258/events/ through the API using OAuth 2 token 8.
awx-uwsgi stdout |
awx-uwsgi stdout | 2022-03-24 14:47:08,888 DEBUG    [ab04fe0b] awx.analytics.performance request: <WSGIRequest: GET '/api/v2/project_updates/44258/events/?order_by=start_line&no_truncate=True&start_line__gte=27'>, response_time: 0.055s
awx-uwsgi stdout |
awx-uwsgi stdout | 172.16.130.119 GET /api/v2/project_updates/44258/events/?order_by=start_line&no_truncate=True&start_line__gte=27 - HTTP/1.1 200

@Haagy

Haagy commented May 25, 2022

Updated our AWX from version 19.1 to 21.0 and from a dockerized installation to Kubernetes on k3s a few days ago.
Haven't had this issue since.

@mani3887

@chris93111 and @HOSTED-POWER: The calculation is based on CPU and memory. CPU capacity is calculated from the value we set in SYSTEM_TASK_ABS_CPU, and memory capacity from the value we set in SYSTEM_TASK_ABS_MEM. From these, the min and max forks are calculated. So if we set the values in the awx_task pod as:
- name: SYSTEM_TASK_ABS_MEM
value: 4Gi
- name: SYSTEM_TASK_ABS_CPU
value: 1000m

The min and max forks would be 4 (1 CPU * 4 = 4) and 40 (4 * 10 = 40). This will mirror your configuration for 1 instance:
[screenshot]

From my experimentation, this does not mean you can run 40 jobs continuously. I can run a max of 10-20 jobs, and after that I start hitting this issue. The max number of jobs you can run also depends on how much CPU/memory you have in your cluster: if an AWX EE pod has a CPU request of 1 and a memory request of 1Gi, while the underlying K8s VM cluster has a max of 12 CPUs, you can run at most 12 jobs (maybe fewer if you factor in control-plane pods).

The exact number of jobs running in the system can be viewed inside the awx task pod's container (kubectl exec -it <pod> -- bash). The AWX community has provided a utility, awx_manage; you can run awx_manage graph_jobs. This gives a graphical representation of how many jobs are running, pending, or waiting, and the capacity. This utility has been very helpful for me in understanding the capacity calculation. Hope this helps!
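
To make that arithmetic concrete, here is a small sketch (my reading of the numbers in this comment, not AWX's exact code):

def cpu_forks(cpu_cores: float, forks_per_core: int = 4) -> int:
    # CPU-based capacity: roughly 4 forks per core, per the example above.
    return int(cpu_cores * forks_per_core)

def mem_forks(mem_gib: float, forks_per_gib: int = 10) -> int:
    # Memory-based capacity: roughly 10 forks per GiB, per the example above.
    return int(mem_gib * forks_per_gib)

# With SYSTEM_TASK_ABS_CPU=1000m and SYSTEM_TASK_ABS_MEM=4Gi as above:
print(cpu_forks(1.0))  # 4  -> the minimum capacity in the example
print(mem_forks(4.0))  # 40 -> the maximum capacity in the example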

@creativemind-pluto

creativemind-pluto commented Oct 17, 2022

Still seeing this issue on AWX version 21.4.0. AWX is installed on a K8s cluster.
The task completed, but the AWX UI still shows it as failed.


Failed
Task was marked as running but was not present in the job queue, so it has been marked as failed.

@creativemind-pluto

Any update on this issue? Please suggest.

@r0b1ndot

Hi there,
I also did some high-load tests and ran a total of 3000 jobs in 2 hours,
with a variation that mixes short and long-running tasks.

I also applied that suggestion, modified: I used 6Gi of RAM and 2000m of CPU.

  • name: SYSTEM_TASK_ABS_MEM
    value: 4Gi
  • name: SYSTEM_TASK_ABS_CPU
    value: 1000m

The interesting part is that I verified via Grafana that my pods never reached that limit, neither in CPU nor in RAM.

This issue also appeared in situations where the AWX API told me that 80% of capacity was available.
So please have a look and suggest an alternative way to mitigate this issue.
Thanks!

@CMPluto

CMPluto commented Mar 7, 2023

AWX version 21.5.0. Still having this issue:
"Task was marked as running but was not present in the job queue, so it has been marked as failed."

@mani3887

mani3887 commented Mar 7, 2023

I have tested this issue multiple times. It happens when the AWX system is overwhelmed by the number of jobs sent in. This can be addressed to an extent by changing the SYSTEM_TASK_ABS_MEM and SYSTEM_TASK_ABS_CPU parameters. I was able to run close to 200 concurrent jobs by adjusting these parameters, running 2 AWX replicas, and giving the AWX task container guaranteed QoS in the K8s cluster.

While digging deep into the issue, what I observed was:

  1. When the AWX system receives a lot of jobs all at once and tries to load-balance them across its instances, it does not balance them equally; in my case I observed that it was sending most of the jobs to only one instance, so that instance gets overwhelmed and many jobs fail as a result. You can play around with this setting: START_TASK_LIMIT. By default it is set to 100. I was able to increase concurrent job execution from 150 to 200 jobs by tuning this setting (see the sketch after this list). Sadly, no matter what value I set, I was unable to make it go any higher than that.
  2. When I ran a high load, I initially saw jobs fail, and then the control plane crashed and never fully came back up; it kept restarting.
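
As a rough sketch of what I mean by tuning that setting (assuming you override it the same way SYSTEM_TASK_ABS_MEM was overridden earlier in this thread, e.g. in /etc/tower/settings.py or via the operator's extra_settings; the value is just an example):

# Hypothetical override of AWX's per-cycle task start limit.
# The default is 100, as noted above; 200 is an arbitrary example value.
START_TASK_LIMIT = 200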

@deep7861

deep7861 commented Jul 2, 2023

@mani3887 I have the same issue on version 21.3.0 (deployed on k8s), but we aren't running many concurrent jobs. For us the issue is easily reproducible when we connect to a certain number of Linux machines that in turn run and register commands on ~100 network devices and store the outputs in a per-inventory-host dictionary.

Where do I change these values? I'm willing to experiment and see if it helps.

If someone from the AWX team is still looking at this issue, it is easily reproducible and I'm happy to share any outputs/data that may help solve the mystery behind it.
