jobs running indefinitely because of receptor error #12645

Closed
4 of 9 tasks
jangel97 opened this issue Aug 11, 2022 · 2 comments

@jangel97

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.

Bug Summary

Hi, we are running AAP 2.2. We have observed that jobs hang when we hit the following receptor error:

[root@controller3 jmorenas]# cat /tmp/receptor/controller3/mzjPtyLC/status
{"State":3,"Detail":"Failed to restart: remote work had not previously started","StdoutSize":0,"WorkType":"remote","ExtraData":{"RemoteNode":"exec2","RemoteWorkType":"ansible-runner","RemoteParams":{"params":"--private-data-dir=/tmp/awx_82194_p1j4wvaz --delete"},"RemoteUnitID":"","RemoteStarted":false,"LocalCancelled":false,"LocalReleased":false,"SignWork":true,"TLSClient":"tls_client","Expiration":"0001-01-01T00:00:00Z"}}

(This error corresponds to the following issue in the receptor project: ansible/receptor#363.)
When this issue is hit, the jobs run indefinitely and cannot be canceled: they have no stdout, and no container is started on the execution node. Also, if you run the command awx-manage run_dispatcher --status, you can see the dispatcher reporting that the job is being run, when it was never even started.
When you query for the work results, you can see it was never started.
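For reference, here is a minimal sketch of how that query can be done from Python via receptorctl's control-socket interface. This is an illustration only: the socket path is the usual default and may differ per installation, and the unit ID is the one from the status output above.

from receptorctl.socket_interface import ReceptorControl

# Assumes the receptor control socket is at the default location.
receptor_ctl = ReceptorControl("/var/run/receptor/receptor.sock")

# "work list" returns a dict keyed by work unit ID, carrying the same
# data that backs the status file shown above.
work_units = receptor_ctl.simple_command("work list")
unit = work_units.get("mzjPtyLC", {})
print(unit.get("Detail"))                              # "Failed to restart: ..."
print(unit.get("ExtraData", {}).get("RemoteStarted"))  # False: never started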

I do not think this is appropriate behaviour. If the results of a work unit are not retrievable because it never started on the remote node, why does the dispatcher think the job is still running? I would expect the job to fail and leave some trace.
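To make the expectation concrete: the status file already contains enough information to detect this state. Below is a hedged sketch (the helper name and path are illustrative) of the kind of check the dispatcher could apply, using only the fields visible in the output above.

import json
from pathlib import Path

def work_unit_never_started(status_path: Path) -> bool:
    """Return True when receptor reports the work unit failed without
    the remote work ever starting, per the fields in the status file."""
    status = json.loads(status_path.read_text())
    extra = status.get("ExtraData", {})
    # State 3 is the failed state (see the "Failed to restart" Detail above),
    # and RemoteStarted=False means no container was ever launched remotely.
    return status.get("State") == 3 and not extra.get("RemoteStarted", True)

# The path is site-specific; this one matches the report above.
if work_unit_never_started(Path("/tmp/receptor/controller3/mzjPtyLC/status")):
    print("Job never started remotely; it should fail, not stay running.")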

Thoughts?

AWX version

AAP 2.2

Select the relevant components

  • UI
  • API
  • Docs
  • Collection
  • CLI
  • Other

Installation method

N/A

Modifications

no

Ansible version

No response

Operating system

RHEL 8.5

Web browser

No response

Steps to reproduce

We believe this receptor issue can happen because of network latency; see ansible/receptor#363.

Expected results

It is okay for receptor to fail because of network latency or for any other reason. What is not okay is that when receptor fails and the container never gets started, nothing reflects it: shouldn't the dispatcher somehow know that the job never started?

Actual results

Right now the job never gets started because of the receptor error, and the dispatcher then believes the job is running forever.

Additional information

No response

@AlanCoding
Member

Linking a WIP potential fix, or part of a potential fix: #12653

@AlanCoding
Member

We should triage what to do about this. My changes to the canceling mechanism have kind of blown the prior approach out of the water. We could still add logic to process the SIGTERM signal in the transmit phase, just the same as is done in the processing phase; however, that still might not address some forms of hangs. From experimentation and reading, I found that the old hang on ssh_key_data in the artifacts_callback could not be addressed by a watcher in another thread at all. Without a reproducer, or knowing exactly where the hang happens, I'm not confident the proposed fix would actually help.
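For context, the general shape of that suggestion looks something like the sketch below. This is a generic Python illustration of wrapping a blocking phase with a SIGTERM handler, not AWX's actual dispatcher code; the function names are hypothetical.

import signal

class SignalExit(Exception):
    pass

def _on_sigterm(signum, frame):
    raise SignalExit()

def run_transmit_with_cancel(transmit):
    # Install a SIGTERM handler around the blocking transmit phase so a
    # cancel request can interrupt it, mirroring the processing phase.
    original = signal.signal(signal.SIGTERM, _on_sigterm)
    try:
        transmit()
    except SignalExit:
        print("transmit phase canceled by SIGTERM")
    finally:
        signal.signal(signal.SIGTERM, original)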
