Jobs using ssh credentials get stuck when loading passphrase #11051

Closed
3 tasks done
rbicker opened this issue Sep 8, 2021 · 9 comments

Comments

@rbicker

rbicker commented Sep 8, 2021

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I am not entitled to status updates or other assurances.

Summary

After migrating our docker-compose based AWX 16.0.0 installation to a Kubernetes based installation using awx-operator 0.13 (while providing the old Postgres database and the Ansible secret as instructed in the migration guide), jobs using SSH credentials have stopped working.

In the web interface, these jobs move to the "running" state and stay there forever. No output is shown.
Using awx-manage shell_plus, I was able to verify that credentials can be decrypted successfully. Other credential types are working fine.

I can see in the awx-ee container that the jobs seem to be stuck on ssh-add processes (for example, ssh-add /tmp/pdd_wrapper_50404_25q104h6/awx_50404_br45qk20/artifacts/50404/ssh_key_data). From what I can tell, the named pipe "ssh_key_data" is never receiving the passphrase which is why the job gets stuck. When I manually write the passphrase to the named pipe, the job proceeds! How do the passphrases normally get passed to the named pipes?
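For readers unfamiliar with why a missing writer makes the reader hang, here is a minimal sketch of named pipe (FIFO) behaviour on a Unix system. It is a generic illustration, not ansible-runner's or AWX's code; the function names read_via_fifo and write_to_fifo and the temporary directory prefix are made up for this example. Opening a FIFO for reading blocks until some process opens it for writing, which matches the symptom described above.

```python
import os
import tempfile
import threading


def write_to_fifo(path: str, data: bytes) -> None:
    """Open the FIFO for writing and push the data.

    The open() call blocks until a reader has opened the same path.
    """
    with open(path, "wb") as fifo:
        fifo.write(data)


def read_via_fifo(data: bytes) -> bytes:
    """Create a FIFO, feed it from a background thread, and read it back.

    Hypothetical helper for illustration only. Without the writer thread,
    the open(path, "rb") call below would block forever, much like a
    reader waiting on a pipe that nobody writes to.
    """
    workdir = tempfile.mkdtemp(prefix="fifo_demo_")
    path = os.path.join(workdir, "ssh_key_data")
    os.mkfifo(path, 0o600)  # Unix only

    writer = threading.Thread(target=write_to_fifo, args=(path, data))
    writer.start()
    try:
        # Blocks until the writer thread opens the FIFO, then reads to EOF.
        with open(path, "rb") as fifo:
            received = fifo.read()
    finally:
        writer.join()
        os.remove(path)
        os.rmdir(workdir)
    return received


if __name__ == "__main__":
    print(read_via_fifo(b"example data"))
```

If the writer opens a different path than the reader, neither side ever gets past open(), which is one way a job could end up waiting indefinitely.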

I have tried running our migrated AWX installation on minikube and k3s, and I have also tried AWX 18.0.0. We face this issue either way.

I am not sure if #10489 is connected to this issue as our job IDs are over 50000.

AWX version

19.3.0

Installation method

kubernetes

Modifications

no

Ansible version

No response

Operating system

Rocky 8.4

Web browser

No response

Steps to reproduce

Unfortunately, I am not sure how to reproduce the issue, as we only face it when migrating our docker-compose based AWX 16.0.0 installation. I am happy to help troubleshoot the issue on our installation in any way.

Expected results

Jobs should start running after successfully loading ssh credentials with passphrases

Actual results

Jobs with ssh credentials that have passphrases are stuck in "running" state forever without actually starting.

Additional information

No response

@polarroyo

We can reproduce the same issue after upgrading from 17.1.0 to 19.X.X in Kubernetes.

@nlvw

nlvw commented Sep 28, 2021

I'm seeing this issue on the latest 19.X.X but it only started happening once our job ID number exceeded 10k.

@rbicker
Author

rbicker commented Sep 29, 2021

@nlvw wrote: "I'm seeing this issue on the latest 19.X.X but it only started happening once our job ID number exceeded 10k."

I can confirm this is also the case for our issue. I lowered the auto-increment sequence for the job IDs in the database (don't do that in production!): ALTER SEQUENCE main_unifiedjob_id_seq RESTART WITH 500;

After that, the issue is gone. So our issue really seems to be connected to #10489.

Does anyone know of a "safe way" to remove all job runs and reset the counter as a workaround for now?

@AlanCoding
Member

Much of the implementation of this is in ansible-runner, like the writing to the pipe, which can be seen here:

https://github.com/ansible/ansible-runner/blob/1a8c1c59e010ac87966a552b6d3f9f52aa1abe1e/ansible_runner/config/_base.py#L244

The natural speculation is that the ssh_key_path being written to is somehow different from the path that is read from, so the read hangs.

It jumps out to me that you show ssh-add /tmp/pdd_wrapper_50404_25q104h6/awx_50404_br45qk20. In recent versions, the directory location was changed, so I would no longer expect to see the parent folder; the command would look like ssh-add /tmp/awx_50404_br45qk20 instead.

Provided that these jobs were started after the migration finished, this shouldn't happen. Anywhere you see "pdd_wrapper" is a red flag to me, and suggests something stale from prior to the migration.
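As a small aid for spotting this red flag in process listings, the check can be sketched as follows. This is only an illustration under the assumption that the stale layout is recognisable by a directory component starting with "pdd_wrapper"; the helper name has_legacy_wrapper is hypothetical and not part of AWX or ansible-runner.

```python
from pathlib import PurePosixPath


def has_legacy_wrapper(path: str, marker: str = "pdd_wrapper") -> bool:
    """Return True if any component of the path starts with the marker.

    Hypothetical helper: it only inspects the path string and does not
    touch the filesystem.
    """
    return any(part.startswith(marker) for part in PurePosixPath(path).parts)


if __name__ == "__main__":
    old_style = "/tmp/pdd_wrapper_50404_25q104h6/awx_50404_br45qk20/artifacts/50404/ssh_key_data"
    new_style = "/tmp/awx_50404_br45qk20"
    print(has_legacy_wrapper(old_style))
    print(has_legacy_wrapper(new_style))
```

Both example paths are taken from this thread; any other path would need to be checked against the layout your own installation uses.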

@AlanCoding AlanCoding self-assigned this Oct 9, 2021
@AlanCoding
Member

Any solution to this will probably resolve #11453 as well, even if the original cause is different.

@craph
Contributor

craph commented Jan 11, 2022

Hello @AlanCoding, @rbicker,

I have the same issue. I migrated from a local Docker installation (17.x) to the Kubernetes version (19.2.0), and since then none of my projects can sync anymore; they stay in the running state forever.

I opened an issue for that, #11518. The private key of my SSH credential has no passphrase, so the issue happens even when the credential has no passphrase.

Moreover, my job IDs are only at 2791...

Thank you very much for your help.

Could it be related to ansible/awx-operator#376?

@craph
Contributor

craph commented Jan 11, 2022

@rbicker,
Could you please tell me how you checked the credentials with awx-manage shell_plus?

And where in the awx-ee container did you find the log? I can't find anything about the running job, and I'm wondering whether it could be linked to ansible/awx-operator#376.

Moreover, my job IDs are only at 2791, so I don't think it's linked to #10489.

@AlanCoding AlanCoding removed their assignment Jan 17, 2022
@AlanCoding
Member

@rbicker wrote: "From what I can tell, the named pipe "ssh_key_data" is never receiving the passphrase which is why the job gets stuck."

This looks the same as a large number of issues open right now, and I would like to start consolidating them soon. I would favor #11518 as the primary issue.

@shanemcd
Member

Closing in favor of #11518
