
bad auth ID during disconnect #533

Open
dw opened this issue Feb 10, 2019 · 9 comments

@dw (Owner) commented Feb 10, 2019

Note: this ticket got conflated; there are two separate issues here:

  • bad auth ID due to a parent->child message bouncing, described below: "bad auth_id: got N via Stream(..), not None"
  • bad auth ID due to seemingly duplicate connections between WorkerProcess and MuxProcess, described in the comments and in the kubespray ticket: "bad auth_id: got N via Stream(...), not Stream(...)"

While investigating #481, in which the sudo/python child dies prematurely, we can sometimes see:

ERROR! [pid 12203] 18:48:27.017954 E mitogen: Router(Broker(0x7f2c10f27b10)): bad auth_id: got 0 via mitogen.ssh.Stream('ssh.172.16.80.131'), not None: Message(3, 0, 0, 101, 1002, '\x80\x02(NX\x16\x00\x00\x00ansible_mitogen.targetNX\n\x00\x00\x00init_child)cm'..174)

This is due to the master not yet having received the DEL_ROUTE for the dead sudo child, and sending a function call towards the SSH account. By the time the SSH account receives the function call, it knows the sudo child is dead, and so has no route for the destination ID. Under the current routing rules, the route to the destination must in this case be via some parent or grandparent, so it sends the message back to the parent.

But the message's auth ID is stamped with that of a parent, producing this bizarre message - a security warning. Really it's just something else buried in bad logging as usual: the actual error is that no route exists for an in-flight message.
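The failure mode above can be sketched roughly as follows. This is an illustrative simplification, not Mitogen's actual implementation; the function names, signatures and the "fall back to parent" shape are hypothetical stand-ins for the behaviour described in the paragraphs above.

```python
# Illustrative sketch only -- NOT Mitogen's real code. It shows why a
# message bounced back towards the parent trips the auth_id check: the
# intermediate node has no route for the dead context, forwards the
# message upward, and the parent then sees an auth_id that the stream
# the message arrived on is not permitted to originate.

def route(routes, dst_id, parent_id):
    """Pick the next hop for dst_id; with no route, fall back to parent."""
    return routes.get(dst_id, parent_id)

def check_auth_id(msg_auth_id, permitted_auth_id):
    """Return a warning string if the message carries an auth_id the
    arriving stream may not originate, else None."""
    if msg_auth_id != permitted_auth_id:
        return 'bad auth_id: got %r, not %r' % (msg_auth_id, permitted_auth_id)
    return None

# Scenario from the report: the sudo child (context 1002) has died and
# DEL_ROUTE has already pruned it here, so the call bounces to the parent.
routes = {}
next_hop = route(routes, 1002, parent_id=0)

# The bounced message still carries the master's auth_id (0), which the
# receiving end then flags even though nothing malicious happened:
warning = check_auth_id(0, None)
```

The point of the sketch is that both steps are individually reasonable; it is their combination during the DEL_ROUTE propagation race that produces a security-looking warning for what is really a routing error.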

@Flow76320 commented Jul 23, 2019

Hi, do you have any news on that issue?

@dw (Owner) commented Aug 13, 2019

See also #368, #585. Hoping they're the same little piece of missing logic.

@Flow76320 I just want to check, when you see this error, does the run continue without otherwise failing?

The cases I know of at present do not cause the run to fail, and I'm not sure if that's possible.

@dsgnr commented Aug 16, 2019

Today is the first time I've seen this, and so far I've tried a few things. For me at least, the whole play hangs - no exit codes. It also seems, for me, that this occurs only with the command module.

It seems completely random as well, and I've yet to find a way to get past it.

@Flow76320 commented Aug 16, 2019

To be honest, I've never waited for the playbook to hang or crash.

It appeared on Kubespray (Kubernetes cluster installation with Ansible) during fact_gathering. Do you want a stack trace or similar?

@dw (Owner) commented Aug 16, 2019

Please mention host distro / Python version / target distro / Python version / Ansible version / kubespray version, and if there is some extra configuration, please include it. I will try to set up a repro for this shortly.

It does not sound like you are seeing this error due to a crash; it sounds like the children are dying unexpectedly for some reason.

@dsgnr commented Aug 16, 2019

For me, this is running in an Ubuntu 18.04 venv: Python 3.6 with Ansible 2.7.10 and the latest Mitogen from pip (0.2.7), running against newly deployed RHEL 7.7 devices. I'm now testing the same setup with Python 2.7, and will update this comment when/if I can replicate.

@dw (Owner) commented Aug 16, 2019

If this is a new bug, I'm not sure latest master will help much. Is there anything special about the target machines? Modified OS config prior to Ansible running, special third-party software, or anything like that?

If possible, try to grab a "-vvv" log of a failing run; there might be something obvious in it.

@dsgnr commented Aug 16, 2019

Nothing special about the target machines. They're vanilla OS and have just been deployed. Python 2 didn't seem to have any issues, FWIW! I doubt I'll be able to get -vvv on these due to how we run the plays, but I'll try my best!

dw added a commit that referenced this issue Aug 16, 2019
issue #533: include object identity of Stream in repr()
At least one of the causes of the #533 error appears to be that streams with the same name exist.
dw added a commit that referenced this issue Aug 17, 2019
Merge remote-tracking branch 'origin/dmw'
* origin/dmw:
  issue #615: ensure 4GB max_message_size is configured for task workers.
  issue #615: update Changelog.
  issue #615: route a dead message to recipients when no reply is expected
  issue #615: fetch_file() might be called with AnsibleUnicode.
  issue #615: redirect 'fetch' action to 'mitogen_fetch'.
  issue #615: extricate slurp brainwrong from mitogen_fetch
  issue #615: ansible: import Ansible fetch.py action plug-in
  issue #533: include object identity of Stream in repr()
  docs: lots more changelog
  issue #595: add buildah to docs and changelog.
  docs: a few more internals.rst additions
dw added a commit that referenced this issue Aug 17, 2019
Merge remote-tracking branch 'origin/dmw'
* origin/dmw:
  issue #533: update routing to account for DEL_ROUTE propagation race
  tests: use defer_sync() Rather than defer() + ancient sync_with_broker()
  tests: one case from doas_test was invoking su
  tests: hide memory-mapped files from lsof output
  issue #615: remove meaningless test
  issue #625: ignore SIGINT within MuxProcess
  issue #625: use exec() instead of subprocess in mitogen_ansible_playbook
  issue #615: regression test
  issue #615: update Changelog.
dw added a commit that referenced this issue Aug 18, 2019
Merge remote-tracking branch 'origin/v028' into stable
* origin/v028: (383 commits)
  Bump version for release.
  docs: update Changelog for 0.2.8.
  issue #627: add test and tweak Reaper behaviour.
  docs: lots more changelog concision
  docs: changelog concision
  docs: more changelog tweaks
  docs: reorder chapters
  docs: versionless <title>
  docs: update supported Ansible version, mention unsupported features
  docs: changelog fixes/tweaks
  issue #590: update Changelog.
  issue #621: send ADD_ROUTE earlier and add test for early logging.
  issue #590: whoops, import missing test modules
  issue #590: rework ParentEnumerationMethod to recursively handle bad modules
  issue #627: reduce the default pool size in a child to 2.
  tests: add a few extra service tests.
  docs: some more hyperlink joy
  docs: more hyperlinks
  docs: add domainrefs plugin to make link aliases everywhere \o/
  docs: link IS_DEAD in changelog
  docs: tweaks to better explain changelog race
  issue #533: update routing to account for DEL_ROUTE propagation race
  tests: use defer_sync() Rather than defer() + ancient sync_with_broker()
  tests: one case from doas_test was invoking su
  tests: hide memory-mapped files from lsof output
  issue #615: remove meaningless test
  issue #625: ignore SIGINT within MuxProcess
  issue #625: use exec() instead of subprocess in mitogen_ansible_playbook
  issue #615: regression test
  issue #615: update Changelog.
  issue #615: ensure 4GB max_message_size is configured for task workers.
  issue #615: update Changelog.
  issue #615: route a dead message to recipients when no reply is expected
  issue #615: fetch_file() might be called with AnsibleUnicode.
  issue #615: redirect 'fetch' action to 'mitogen_fetch'.
  issue #615: extricate slurp brainwrong from mitogen_fetch
  issue #615: ansible: import Ansible fetch.py action plug-in
  issue #533: include object identity of Stream in repr()
  docs: lots more changelog
  issue #595: add buildah to docs and changelog.
  docs: a few more internals.rst additions
  ci: update to Ansible 2.8.3
  tests: another random string changed in 2.8.3
  tests: fix sudo_flags_failure for Ansible 2.8.3
  ci: fix procps command line format warning
  Whoops, merge together lgtm.yml and .lgtm.yml
  issue #440: log Python version during bootstrap.
  docs: update changelog
  issue #558: disable test on OSX to cope with boundless mediocrity
  issue #558, #582: preserve remote tmpdir if caller did not supply one
  issue #613: must await 'exit' and 'disconnect' in wait=False test
  Import LGTM config to disable some stuff
  Fix up another handful of LGTM errors.
  tests: work around AnsibleModule.run_command() race.
  docs: mention another __main__ safeguard
  docs: tweaks
  formatting error
  docs: make Sphinx install soft fail on Python 2.
  issue #598: allow disabling preempt in terraform
  issue #598: update Changelog.
  issue #605: update Changelog.
  issue #605: ansible: share a sem_t instead of a pthread_mutex_t
  issue #613: add tests for all the weird shutdown methods
  Add mitogen.core.now() and use it everywhere; closes #614.
  docs: move decorator docs into core.py and use autodecorator
  preamble_size: make it work on Python 3.
  docs: upgrade Sphinx to 2.1.2, require Python 3 to build docs.
  docs: fix Sphinx warnings, add LogHandler, more docstrings
  docs: tidy up some Changelog text
  issue #615: fix up FileService tests for new logic
  issue #615: another Py3x fix.
  issue #615: Py3x fix.
  issue #615: update Changelog.
  issue #615: use FileService for target->controll file transfers
  issue #482: another Py3 fix
  ci: try removing exclude: to make Azure jobs work again
  compat: fix Py2.4 SyntaxError
  issue #482: remove 'ssh' from checked processes
  ci: Py3 fix
  issue #279: add one more test for max_message_size
  issue #482: ci: add stray process checks to all jobs
  tests: fix format string error
  core: MitogenProtocol.is_privileged was not set in children
  issue #482: tests: fail DockerMixin tests if stray processes exist
  docs: update Changelog.
  issue #586: update Changelog.
  docs: update Changelog.
  [security] core: undirectional routing wasn't respected in some cases
  docs: tidy up Select.all()
  issue #612: update Changelog.
  master: fix TypeError
  pkgutil: fix Python3 compatibility
  parent: use protocol for getting remote_id
  docs: merge signals.rst into internals.rst
  os_fork: do not attempt to cork the active thread.
  parent: fix get_log_level() for split out loggers.
  issue #547: fix service_test failures.
  issue #547: update Changelog.
  issue #547: core/service: race/deadlock-free service pool init
  docs: update Changelog.
  ...
@dw (Owner) commented Aug 18, 2019

I have been running a 3-VM soak on Python 3.6, Ansible 2.7 and Kubespray master for over 24 hours with no failures. There is no doubt a bug exists that I have not seen (case 2 in the bug description), so if anyone is still catching this error, please comment here.

The going theory is that something is going wrong inside an individual Ansible task worker. I /think/ it perhaps has something to do with the synchronize module or with_items loops, or maybe the meta action, but I have been unable to replicate it and have scanned the code until I'm almost blind :)

If you have logs from immediately prior to the failure, or even know which playbook it happens in, that would be extremely useful.

It's also possible that a default Kubespray run does not hit the task that is causing this, so it could depend on e.g. your networking deployment configuration. If you are only tweaking one or two items from the default Kubespray config, please share what they are!

Thanks
