jobs running indefinitely because of receptor error #12645

Closed
4 of 9 tasks
jangel97 opened this issue Aug 11, 2022 · 2 comments

@jangel97

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.

Bug Summary

Hi, we are running AAP 2.2. We have observed that jobs hang when we hit the following receptor error:

[root@controller3 jmorenas]# cat /tmp/receptor/controller3/mzjPtyLC/status
{"State":3,"Detail":"Failed to restart: remote work had not previously started","StdoutSize":0,"WorkType":"remote","ExtraData":{"RemoteNode":"exec2","RemoteWorkType":"ansible-runner","RemoteParams":{"params":"--private-data-dir=/tmp/awx_82194_p1j4wvaz --delete"},"RemoteUnitID":"","RemoteStarted":false,"LocalCancelled":false,"LocalReleased":false,"SignWork":true,"TLSClient":"tls_client","Expiration":"0001-01-01T00:00:00Z"}}

(This error corresponds to the following issue in the receptor project: ansible/receptor#363.)
When this issue is hit, the jobs run indefinitely and cannot be canceled: they have no stdout, and no container is started on the execution node. Also, if you run the command awx-manage run_dispatcher --status, you can see the dispatcher reporting that the job is being run, when it was never even started.
When you query for the work results, you can see it was never started.
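For reference, here is a minimal sketch of how that query can be done from Python via receptorctl's control-socket interface. This is an illustration only: the socket path is the usual default and may differ per installation, and the unit ID is the one from the status output above.

from receptorctl.socket_interface import ReceptorControl

# Assumes the receptor control socket is at the default location.
receptor_ctl = ReceptorControl("/var/run/receptor/receptor.sock")

# "work list" returns a dict keyed by work unit ID, carrying the same
# data that backs the status file shown above.
work_units = receptor_ctl.simple_command("work list")
unit = work_units.get("mzjPtyLC", {})
print(unit.get("Detail"))                              # "Failed to restart: ..."
print(unit.get("ExtraData", {}).get("RemoteStarted"))  # False: never started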

I do not think this is appropriate behaviour. If the results of a work unit are not retrievable because it never started on the remote node, why does the dispatcher think the job is still running? I would expect the job to fail and leave some trace.
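To make the expectation concrete: the status file already contains enough information to detect this state. Below is a hedged sketch (the helper name and path are illustrative) of the kind of check the dispatcher could apply, using only the fields visible in the output above.

import json
from pathlib import Path

def work_unit_never_started(status_path: Path) -> bool:
    """Return True when receptor reports the work unit failed without
    the remote work ever starting, per the fields in the status file."""
    status = json.loads(status_path.read_text())
    extra = status.get("ExtraData", {})
    # State 3 is the failed state (see the "Failed to restart" Detail above),
    # and RemoteStarted=False means no container was ever launched remotely.
    return status.get("State") == 3 and not extra.get("RemoteStarted", True)

# The path is site-specific; this one matches the report above.
if work_unit_never_started(Path("/tmp/receptor/controller3/mzjPtyLC/status")):
    print("Job never started remotely; it should fail, not stay running.")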

Thoughts?

AWX version

AAP 2.2

Select the relevant components

  • UI
  • API
  • Docs
  • Collection
  • CLI
  • Other

Installation method

N/A

Modifications

no

Ansible version

No response

Operating system

RHEL 8.5

Web browser

No response

Steps to reproduce

We believe this receptor issue can happen because of network latency; see ansible/receptor#363.

Expected results

It is okay for receptor to fail because of network latency or for any other reason. What is not okay is that when receptor fails and the container never gets started, nothing reflects it: shouldn't the dispatcher somehow know that the job never started?

Actual results

Right now the job never gets started because of the receptor error, and the dispatcher then believes the job is running forever.

Additional information

No response

@AlanCoding
Member

Linking a WIP potential fix, or part of a potential fix: #12653

@AlanCoding
Member

We should triage what to do about this. My changes to the canceling mechanism have kind of blown the prior approach out of the water. We could still add logic to process the SIGTERM signal in the transmit phase, just the same as is done in the processing phase; however, that still might not address some forms of hangs. From experimentation and reading, I found that the old hang on ssh_key_data in the artifacts_callback could not be addressed by a watcher in another thread at all. Without a reproducer, or knowing exactly where the hang happens, I'm not confident the proposed fix would actually help.
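For context, the general shape of that suggestion looks something like the sketch below. This is a generic Python illustration of wrapping a blocking phase with a SIGTERM handler, not AWX's actual dispatcher code; the function names are hypothetical.

import signal

class SignalExit(Exception):
    pass

def _on_sigterm(signum, frame):
    raise SignalExit()

def run_transmit_with_cancel(transmit):
    # Install a SIGTERM handler around the blocking transmit phase so a
    # cancel request can interrupt it, mirroring the processing phase.
    original = signal.signal(signal.SIGTERM, _on_sigterm)
    try:
        transmit()
    except SignalExit:
        print("transmit phase canceled by SIGTERM")
    finally:
        signal.signal(signal.SIGTERM, original)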
