[ansible-mitogen] 'delegate_to' logic #340

Closed
fdutheil opened this issue Aug 9, 2018 · 11 comments

@fdutheil fdutheil commented Aug 9, 2018

Hi, let me start by saying this project is quite impressive (goals, design, initial documentation). I spent half a day reading, (trying to) understand it, and trying to get it working as an alternative to the sshjail plugin (which is also great, in other respects).
Anyway, I got pretty much what I wanted: H is a FreeBSD host with jails, such as the jail J. As I use sudo, I had to use mitogen_sudo, which was quite a journey :). Anyway, as I said, my existing playbooks work on J and its siblings. Except one: at some point, I run a command on an external host E (which is OK when tested with the ansible ping module) that renews many TLS certificates in one run, so I use the ansible delegate_to logic to achieve that.

Let's see:

- name: (re)generate TLS certificates
  hosts: all_jails  # including [J]
  gather_facts: no
  vars:
    force_cert_renewal: "{{ tls_force|default(False)|bool|ternary('--force','') }}"
  tasks:
    - name: Request a TLS certificate from LE with dehydrated
      command: /home/me/progs/dehydrated/dehydrated -c {{ force_cert_renewal }} -a {{ tls_algo }} -d {{ tls_fqdn }} --lock-suffix {{ tls_fqdn }}
      async: 1800
      poll: 20
      delegate_to: [E]

And that's where the logic is strange to me, as I look at the error I get:

TASK [Request a TLS certificate from LE with dehydrated] *****************************************
fatal: [J]: UNREACHABLE! => {"changed": false, "msg": "error occurred on host [H]: Host key checking is enabled, and SSH reported an unrecognized or mismatching host key.", "unreachable": true}

As I understand it, H (hosting J) is trying to connect via SSH to E, when it is the ansible control machine that should do so. OK, might this be one case where mitogen differs from vanilla ansible behaviour? I've defined a simple "tree" of connections (meaning my control machine can reach them all), not a complete P2P web allowing any connection from any point. Is this fundamentally condemned to fail, from the mitogen point of view? Because I don't plan to allow J or H to reach E anytime soon (different security networks).

Versions information:

  • H: FreeBSD 11.x/python2.7
  • J: same
  • E: OpenBSD 6.2/python2.7
  • Control Machine: gentoo/python2.7/python3.6/ansible 2.5.6, tested both 0.2.2 and latest master version of mitogen

Meanwhile, I'll continue to dig.

Owner

@dw dw commented Aug 9, 2018

Hi, thanks for reporting! Is this using Connection Delegation (mitogen_via=...)? If so, it might be another instance of #251. I haven't been looking at bugs for the past week as I'm working hard on the next development branch :) But I will hopefully get to this on Saturday or Sunday. Sorry for the delay.

@dw dw added bug ansible labels Aug 9, 2018
Author

@fdutheil fdutheil commented Aug 9, 2018

Yes, could be another instance of #251. I'm using Connection Delegation twice:
CM ----> E ----> H --mitogen_via--> H w/Sudo --mitogen_via--> J

(E is ProxyJump in .ssh/config for H, but that's not relevant here)
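The chain above can be sketched as a small resolver that follows `mitogen_via` from one host to the next. This is only an illustration, not Mitogen's code: the `HOSTVARS` dictionary, the `H-sudo` alias and the `via_chain` helper are all made up for the example. Only the `mitogen_via` variable name comes from the thread.

```python
# Hypothetical host variables mirroring the chain described above.
# "H-sudo" is an invented alias standing for "H reached through sudo".
HOSTVARS = {
    "E": {},                          # reached directly from the control machine
    "H": {},                          # reached over SSH (ProxyJump via E lives in ssh config)
    "H-sudo": {"mitogen_via": "H"},
    "J": {"mitogen_via": "H-sudo"},
}


def via_chain(host, hostvars):
    """Return the hops from the control machine to `host`, outermost first."""
    chain = []
    seen = set()
    while host is not None:
        if host in seen:
            raise ValueError("mitogen_via loop at %r" % (host,))
        seen.add(host)
        chain.append(host)
        # Follow the next hop, if any; hosts without mitogen_via end the chain.
        host = hostvars.get(host, {}).get("mitogen_via")
    chain.reverse()
    return chain
```

With these assumed variables, `via_chain("J", HOSTVARS)` gives `["H", "H-sudo", "J"]`, while `via_chain("E", HOSTVARS)` gives `["E"]`. The expectation in this report is that a task delegated to E uses E's own chain, starting at the control machine, rather than being routed through the hops of J.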

@dw dw added the target:v0.2 label Aug 11, 2018
Owner

@dw dw commented Aug 11, 2018

Just to note I've reproduced this locally, the relevant code looks entirely wrong. Hopefully will have a fix for this before end of the weekend.

Owner

@dw dw commented Aug 12, 2018

Hi @fdutheil, can you please retry using the dmw branch and let me know if the problem persists? There was an existing fix on this branch for #251, but I found another bug and added a bunch of tests.

Author

@fdutheil fdutheil commented Aug 16, 2018

Hello @dw,

  • (case 1) I tested the dmw branch with success regarding this delegated task when it's part of its short playbook.
  • (case 2) But once this short playbook is integrated via import_playbook in a master playbook (and lots of tasks involving hosts/jails without delegation are run before this one), I reliably (~100%) get at least one timeout (as it's an async task) for some of the jails (J & its siblings). I can confirm a task is not run on the delegated host (E) when a timeout is reported by ansible for one of these J's.
  • (case 3) If I add a tag to only enable this imported playbook (to get something like case 1) and run the whole thing (same as in case 2 + tag), it's OK again.
  • (case 4) With strategy: linear around the delegated task and everything run like in case 2, no timeout. So it seems indeed related to mitogen.
Owner

@dw dw commented Aug 17, 2018

Hi Florent! Thank you for persisting with this :)

Can you please share:

  • the exact text of the timeout (I guess it's just plain old "connect timeout" with host marked unreachable?)
  • the value of ansible-config dump | grep DEFAULT_TIMEOUT
  • How many jails you are targeting
  • Is the target machine quite fast, with lots of memory?

One possibility is that it's spinning up so many interpreters that the target machine starts IO thrashing and slows way down.

Another is that because delegation is single-threaded, simply targeting many, many containers (possibly in conjunction with a slow machine) means some tasks begin timing out waiting for the single thread to become available. I can arrange to fix this as part of 0.2, it's a relatively simple change.

A final possibility is some odd race condition, but usually they do not manifest as simple timeouts.

Thanks again!
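The single-threaded hypothesis above can be illustrated with a toy queueing model. This is a sketch under invented assumptions, not a description of Mitogen internals: every connection setup is assumed to take the same time, setups are assumed to run strictly one after another, and both function names are hypothetical.

```python
def queued_start_times(n_tasks, setup_seconds):
    """Start time of each setup when setups are strictly serialized."""
    return [i * setup_seconds for i in range(n_tasks)]


def tasks_timing_out(n_tasks, setup_seconds, timeout_seconds):
    """Indices of tasks whose connection is only ready after the timeout."""
    late = []
    for i, start in enumerate(queued_start_times(n_tasks, setup_seconds)):
        ready_at = start + setup_seconds
        if ready_at > timeout_seconds:
            late.append(i)
    return late
```

For example, with 10 targets, an assumed 3 seconds per setup and a 10 second timeout, `tasks_timing_out(10, 3, 10)` returns the indices 3 to 9: only the first three setups finish in time. In this model, failures begin once the timeout has elapsed, never before it.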

Owner

@dw dw commented Aug 17, 2018

Ah! One final data point would be useful -- does the timeout appear to occur after DEFAULT_TIMEOUT seconds have elapsed? If it is earlier, that would be a strong indicator of some bug :)

dw added a commit that referenced this issue Aug 18, 2018
When inventory name did not match remote_addr, it would attempt to SSH
to the inventory name.
dw added a commit that referenced this issue Aug 18, 2018
The logic was getting too busy.
dw added a commit that referenced this issue Aug 18, 2018
dw added a commit that referenced this issue Aug 18, 2018
dw added a commit that referenced this issue Aug 18, 2018
PlayContext.delegate_to is the unexpanded template, Ansible doesn't keep
a copy of it around anywhere convenient. We either need to re-expand it
or take the expanded version that was stored on the Task, which is what
is done here.
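
The two fixes described in the commit messages above can be sketched roughly as follows. Everything here is a stand-in written for illustration: `FakeTask`, `FakePlayContext`, `expand`, `delegate_target` and `ssh_address` are hypothetical names, and the regex-based `expand` only mimics the simplest case of Ansible's template expansion. The `ansible_host` variable is the standard Ansible way of giving a host an address that differs from its inventory name.

```python
import re


def expand(template, variables):
    """Tiny stand-in for template expansion: replaces {{ name }} only."""
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda m: str(variables[m.group(1)]),
        template,
    )


class FakePlayContext(object):
    """Holds the raw, unexpanded delegate_to template."""
    def __init__(self, delegate_to):
        self.delegate_to = delegate_to


class FakeTask(object):
    """Holds the delegate_to value after expansion."""
    def __init__(self, delegate_to):
        self.delegate_to = delegate_to


def delegate_target(play_context, task):
    # Prefer the expanded value stored on the task over the raw template.
    return task.delegate_to


def ssh_address(inventory_name, hostvars):
    # Connect to the configured address, not blindly to the inventory name.
    return hostvars.get("ansible_host", inventory_name)
```

With `variables = {"cert_host": "E"}` and the template `"{{ cert_host }}"`, the play context still carries the literal template text while the task carries `"E"`, which is the value a connection plugin needs.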
Author

@fdutheil fdutheil commented Aug 20, 2018

Hello David,
Previously I was running the playbook from the physical site where H/E/J are. Now I'm running it from a remote site... And I can't reproduce it for now. Raaaaaah. Quite disturbing (I ran these case 1 vs case 2 tests more than 10 times before reporting them...). I'll dig more into it.

Anyway, here we go:

  • exact text of the timeout: fatal: [J -> E]: FAILED! => {"changed": false, "msg": "async task did not complete within the requested time"}
  • DEFAULT_TIMEOUT(default) = 10
  • 10 jails, but this particular async & delegated task is related to only 4 of them
  • the delegated host (E) is an APU2: a 4-core, 4GB RAM OpenBSD host used as a bastion, so without any resource-consuming process except this ansible delegated task. I monitor it with top + ps when running the playbook, nothing to report (I can see 1 + 4 python processes at peak when everything is OK, each one consuming less than 20MB of RAM, completing within seconds).

When the issue appeared in case 2, the "timed out" instances of the task were not run on E. It was not the DEFAULT_TIMEOUT but the task's async: <value> that was triggered.
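
To make the distinction above concrete, here is a simplified model of how an `async`/`poll` budget gets used up. It is a sketch written for this thread, not Ansible's actual implementation: `poll_async` and its parameters are hypothetical, and only the error message is copied from the output quoted above.

```python
import time


def poll_async(is_finished, async_seconds, poll_seconds, sleep=time.sleep):
    """Poll `is_finished` every `poll_seconds` until the async budget is spent."""
    time_left = async_seconds
    while time_left > 0:
        sleep(poll_seconds)
        if is_finished():
            waited = async_seconds - time_left + poll_seconds
            return {"finished": True, "waited": waited}
        time_left -= poll_seconds
    return {
        "failed": True,
        "msg": "async task did not complete within the requested time",
    }
```

With `async: 1800` and `poll: 20` this loop checks 90 times before giving up. If the job never actually started on E, every check comes back unfinished and the whole budget is consumed, which would match a failure that is reported by the async timeout rather than by DEFAULT_TIMEOUT.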

Anyway, more details ASAP when/if I can reproduce it.

Author

@fdutheil fdutheil commented Aug 21, 2018

Hello again David,
I managed to reproduce the async timeout issue with this particular async task 100% of the time when the control machine is on the physical site (tested wireless and wired networks, same result).
When on the physical site but using my phone with LTE modem feature enabled to emulate a remote connection (same setup I used yesterday), there is definitely no async timeout on the delegated task.

[Insert lack of conclusive observations here]

I think we can close #340, as it is initially related to delegation, which seems to work now. It would be good if someone else could confirm that async delegated tasks are working fine on their side.
When I have more time, I'll try to compose a generic playbook with a generic async task to trigger the async timeout, and eventually open a new issue.

Owner

@dw dw commented Jan 22, 2019

@fdutheil your async task problem may have been #414 -- please retry current master or wait a few days and 0.2.4 will be out

Thanks again!

Author

@fdutheil fdutheil commented Feb 20, 2019

Hi David, sorry, I forgot to test and report back: I updated mitogen to the master version (v2.5 + a couple of commits including the fix for #548) and updated ansible to 2.7.5.
I've set strategy = mitogen_linear in .ansible.cfg to enable mitogen everywhere (and not only for plays needing jail communications), and yes, the playbook I used for this ticket is fine (so the delegate_to issue as reported is fixed).

As a side note, I still have timeout issues (not 100% reproducible) with mitogen + delegate_to + async, so I need to spend more time on getting (conclusive) details before reporting it.

Thank you for your help.
