I understand that AWX is open source software provided for free and that I might not receive a timely response.
Bug Summary
Hi, we are running AAP 2.2 and we have observed the following error:
[root@controller3 jmorenas]# cat /tmp/receptor/controller3/mzjPtyLC/status
{"State":3,"Detail":"Failed to restart: remote work had not previously started","StdoutSize":0,"WorkType":"remote","ExtraData":{"RemoteNode":"exec2","RemoteWorkType":"ansible-runner","RemoteParams":{"params":"--private-data-dir=/tmp/awx_82194_p1j4wvaz --delete"},"RemoteUnitID":"","RemoteStarted":false,"LocalCancelled":false,"LocalReleased":false,"SignWork":true,"TLSClient":"tls_client","Expiration":"0001-01-01T00:00:00Z"}}
(This error corresponds to the following issue in the receptor project: ansible/receptor#363)
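As a side note on reading that status file: State is receptor's numeric work state. Here is a minimal sketch of decoding it; the numeric mapping Pending/Running/Succeeded/Failed = 0..3 is assumed from receptor's source, and summarize_status is a hypothetical helper, not AWX code:

```python
import json

# Assumed mapping of receptor's numeric work states (0..3); verify against
# the receptor version in use before relying on it.
STATE_NAMES = {0: "Pending", 1: "Running", 2: "Succeeded", 3: "Failed"}

def summarize_status(raw):
    """Decode a receptor work-unit status JSON document into the fields
    relevant to this bug: overall state and whether the remote side started."""
    status = json.loads(raw)
    return {
        "state": STATE_NAMES.get(status["State"], "Unknown"),
        "detail": status["Detail"],
        "remote_started": status["ExtraData"]["RemoteStarted"],
    }

# The status file above decodes to Failed with RemoteStarted=false, i.e. the
# work unit failed before the remote job ever started.
raw = (
    '{"State":3,"Detail":"Failed to restart: remote work had not previously '
    'started","StdoutSize":0,"WorkType":"remote","ExtraData":'
    '{"RemoteNode":"exec2","RemoteStarted":false}}'
)
print(summarize_status(raw))
```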
When this issue is hit, the jobs run indefinitely and cannot be canceled; they have no stdout, and no container is started on the execution node. Also, if you run the command awx-manage run_dispatcher --status, you can see the dispatcher reporting that the job is running, when it was never even started.
When you query for the work results you can see it was never started:
I do not think this is the appropriate behaviour. If the results of a work unit are not retrievable because it never started on the remote node, why does the dispatcher think that the job is still running? I would expect the job to fail and leave some trace.
Thoughts?
AWX version
AAP 2.2
Select the relevant components
UI
API
Docs
Collection
CLI
Other
Installation method
N/A
Modifications
no
Ansible version
No response
Operating system
RHEL 8.5
Web browser
No response
Steps to reproduce
We believe this issue in receptor can happen because of network latency: ansible/receptor#363
Expected results
It's okay for receptor to fail if there is network latency or whatever the reason. What is not okay is that when receptor fails and the container never gets started... shouldn't the dispatcher somehow know that the job never got started?
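The check being asked for could be as simple as the following sketch (a hypothetical helper, not actual dispatcher code; the value 3 == Failed is assumed from receptor's source): if the work unit has failed and RemoteStarted is false, no results will ever arrive, so the dispatcher could fail the job instead of waiting forever.

```python
def work_unit_is_dead(status):
    """Return True when a receptor work unit failed before the remote side
    ever started, so no results will ever arrive for it.

    `status` is the parsed JSON from the work unit's status file.
    """
    failed = status.get("State") == 3  # assumed: 3 == Failed in receptor
    remote_started = status.get("ExtraData", {}).get("RemoteStarted", False)
    return failed and not remote_started

# With the status shown in the bug summary this returns True, which is the
# point at which the dispatcher could fail the job and leave a trace.
status = {
    "State": 3,
    "Detail": "Failed to restart: remote work had not previously started",
    "ExtraData": {"RemoteNode": "exec2", "RemoteStarted": False},
}
print(work_unit_is_dead(status))  # True
```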
Actual results
Right now the job never gets started because of the receptor error, and the dispatcher then believes the job is running forever.
Additional information
No response
We should triage what to do about this. My changes to the canceling mechanism have more or less blown the prior approach out of the water. We could still add logic to process the SIGTERM signal in the transmit phase, the same as is done in the processing phase. However, that still might not address some forms of hangs. From experimentation and reading, I found that the old hang on ssh_key_data in the artifacts_callback could not be addressed by a watcher in another thread at all. Without a reproducer, or knowing exactly where the hang happens, I'm not confident the proposed fix would actually help.
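For what that transmit-phase change could look like, here is a minimal sketch (all names are illustrative, not the actual ansible-runner/AWX code) of a SIGTERM flag checked inside the transmit loop, mirroring what the processing phase does:

```python
import signal

class CancellableTransmit:
    """Sketch: let SIGTERM abort the transmit phase instead of hanging.

    Illustrative only; the real dispatcher/ansible-runner code is
    structured differently.
    """

    def __init__(self):
        self.cancelled = False

    def install_handler(self):
        # Record the signal instead of dying mid-write, so transmit can
        # stop at a clean chunk boundary.
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        self.cancelled = True

    def transmit(self, chunks, send):
        for chunk in chunks:
            if self.cancelled:
                raise RuntimeError("transmit cancelled by SIGTERM")
            send(chunk)
```

Note that this only helps if the hang is between send() calls; as the comment above says, a hang inside a blocking call may not be interruptible this way at all.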