
Streams disconnected in network-intensive Ansible task #593

Closed
yuki-nishida-exa opened this issue Jun 1, 2019 · 3 comments

@yuki-nishida-exa

commented Jun 1, 2019

We had a specific Ansible task fail due to a stream disconnect.

TASK [*** : Pull image from registry]
:
{"changed": false, "msg": "Mitogen was disconnected from the remote environment while a call was in-progress. If you feel this is in error, please file a bug. Original error was: the respondent Context has disconnected", "unreachable": true}

We hit this in our test environment with 19 hosts, during a task in which all hosts pull an approximately 1 GB docker image from the registry running on the 1st host. This nearly maxes out the outgoing traffic on the 1st host's 1 Gbps NIC for about 70 seconds. On each attempt a few of the 19 hosts failed, and the failing hosts changed between attempts.

Debug messages around the error are below (IPs were replaced):

[mux  8488] 19:40:32.178381 D mitogen.ctx.ssh.host18:22.sudo.root: mitogen: Importer().find_module('ansible.module_utils.common.')
[mux  8488] 19:40:32.178750 D mitogen.ctx.ssh.host18:22.sudo.root: mitogen: Importer(): master doesn't know u'ansible.module_utils.common.'
[mux  8488] 19:40:32.179137 D mitogen.ctx.ssh.host18:22.sudo.root: mitogen: Importer().find_module('ansible.module_utils.parsing.')
[mux  8488] 19:40:32.179546 D mitogen.ctx.ssh.host18:22.sudo.root: mitogen: Importer(): master doesn't know u'ansible.module_utils.parsing.'
[mux  8488] 19:40:32.180691 D mitogen.ctx.ssh.host18:22.sudo.root: mitogen: Importer().find_module('ansible.module_utils.six.')
[mux  8488] 19:40:32.181083 D mitogen.ctx.ssh.host18:22.sudo.root: mitogen: Importer(): master doesn't know u'ansible.module_utils.six.'
[mux  8488] 19:40:36.941557 D mitogen: mitogen.ssh.Stream('ssh.host4:22').on_disconnect()
[mux  8488] 19:40:36.942205 D mitogen: RouteMonitor(): mitogen.ssh.Stream('ssh.host4:22') is gone; propagating DEL_ROUTE for set([14040, 9, 32039, 14039])
[mux  8488] 19:40:36.942600 D mitogen: Router(Broker(0x7fdb3bbe6210)).del_route(14040)
[mux  8488] 19:40:36.943055 I ansible_mitogen.services: ContextService(): Forgetting Context(14040, u'ssh.host4:22.sudo.root') due to stream disconnect
[task 9534] 19:40:36.943212 D mitogen: RouteMonitor(): firing local disconnect for Context(14040, u'ssh.host4:22.sudo.root')
[mux  8488] 19:40:36.943448 D mitogen: Router(Broker(0x7fdb3bbe6210)).del_route(9)
[mux  8488] 19:40:36.943781 I ansible_mitogen.services: ContextService(): Forgetting Context(9, u'ssh.host4:22') due to stream disconnect
[task 9534] 19:40:36.943906 D mitogen: RouteMonitor(): deleting route to 14040 via mitogen.core.Stream('unix_listener.8488')
[mux  8488] 19:40:36.944137 D mitogen: Router(Broker(0x7fdb3bbe6210)).del_route(32039)
[task 9534] 19:40:36.944366 D mitogen: Router(Broker(0x7fdb3a4293d0)).del_route(14040)
[mux  8488] 19:40:36.944434 D mitogen: Router(Broker(0x7fdb3bbe6210)).del_route(14039)
[task 9534] 19:40:36.944889 D mitogen: CallChain(Context(14040, u'ssh.host4:22.sudo.root')).call_no_reply(): mitogen.core.Dispatcher.forget_chain('psr-ex3000-168-24-9534-7fdb4c9d3740-58a0bf10be265')
[mux  8488] 19:40:36.944966 D mitogen: mitogen.parent.DiagLogStream(fd=128, 'ssh.host4:22').on_disconnect()
[mux  8488] 19:40:36.945471 D mitogen: mitogen.ssh.Stream('ssh.host4:22'): PID 8610 exited with return code 255
[task 9534] 19:40:36.945478 D mitogen: Context(0, None).call_service_async(u'ansible_mitogen.services.ContextService', u'reset', {'context': Context(14040, u'ssh.host4:22.sudo.root')})
[task 9534] 19:40:36.945922 D mitogen: Context(0, None).send_async(Message(0, 40102, 40102, 110, 1004, "\x80\x02X'\x00\x00\x00ansible_mitogen.services.ContextServiceq\x01X\x05"..186))
[mux  8488] 19:40:36.946478 D mitogen: Router(Broker(0x7fdb3bbe6210)): Message(14040, 40102, 0, 101, 0, '\x80\x02(NX\x0c\x00\x00\x00mitogen.coreX\n\x00\x00\x00Dispatcherq\x01X\x0c\x00\x00\x00forget_'..143) is dead: 'no route to 14040, my ID is 0'
[mux  8488] 19:40:36.947071 D ansible_mitogen.services: ContextService().reset(Context(14040, u'ssh.host4:22.sudo.root'))
[mux  8488] 19:40:36.947425 I ansible_mitogen.services: ContextService()._shutdown_unlocked(): shutting down Context(14040, u'ssh.host4:22.sudo.root')
[mux  8488] 19:40:36.947690 D mitogen: Context(14040, u'ssh.host4:22.sudo.root').shutdown() sending SHUTDOWN
[mux  8488] 19:40:36.948014 D ansible_mitogen.services: ContextService(): attempt to forget unknown Context(14040, u'ssh.host4:22.sudo.root')
[mux  8488] 19:40:36.948399 D mitogen: Router(Broker(0x7fdb3bbe6210)): Message(14040, 0, 0, 106, None, ''..0) is dead: 'no route to 14040, my ID is 0'
[task 9534] 19:40:36.949069 D mitogen: Broker(0x7fdb3a4293d0).shutdown()
[task 9534] 19:40:36.949775 D mitogen: mitogen.core.Stream('unix_listener.8488').on_disconnect()
[mux  8488] 19:40:36.950449 D mitogen: mitogen.core.Stream('unix_client.9534').on_disconnect()
[task 9534] 19:40:36.950419 D mitogen: Context(0, None).on_disconnect()
[mux  8488] 19:40:36.950894 D mitogen: Context(40102, None).on_disconnect()
[task 9534] 19:40:36.950986 D mitogen: Waker(Broker(0x7fdb3a4293d0) rfd=30, wfd=31).on_shutdown()
[task 9534] 19:40:36.951540 D mitogen: Waker(Broker(0x7fdb3a4293d0) rfd=30, wfd=31).on_disconnect()
[task 9534] 19:40:36.952548 D mitogen: Router(Broker(0x7fdb3a4293d0)): stats: 0 module requests in 0 ms, 0 sent (0 ms minify time), 0 negative responses. Sent 0.0 kb total, 0.0 kb avg.
[task 9534] 19:40:36.954727 D ansible_mitogen.connection: Call took 60033 ms: ansible_mitogen.target.run_module(kwargs={u'module_map': {u'builtin': [u'ansible.module_utils._text', u'ansible.module_utils.basic', u'ansible.module_utils.common', u'ansible.module_utils.common._collections_compat', u'ansible.module_utils.common.collections', u'ansible.module_utils.common.file', u'ansible.module_utils.common.process', u'ansible.module_utils.parsing', u'ansible.module_utils.parsing.convert_bool', u'ansible.module_utils.pycompat24', u'ansible.module_utils.six'], u'custom': []}, u'emulate_tty': True, u'good_temp_dir': u'/root/.ansible/tmp', u'module': u'command', u'cwd': None, u'service_context': Context(0, None), u'extra_env': None, u'is_python': None, u'env': {}, u'path': u'/opt/exabeam_installer/.env/lib/python2.7/site-packages/ansible/modules/commands/command.py', u'runner_name': u'NewStyleRunner', u'interpreter_fragment': None, u'json_args': '{"_ansible_version": "2.7.10", "warn": true, "_ansible_selinux_special_fs": ["fuse", "nfs", "vboxsf", "ramfs", "9p"], "_ansible_no_log": false, "_ansible_module_name": "command", "_raw_params": "docker pull host1:5000/exabeam-dl-kafka:LMS-i32.6_4", "_ansible_verbosity": 3, "_ansible_keep_remote_files": false, "_ansible_syslog_facility": "LOG_USER", "_ansible_socket": null, "_ansible_diff": false, "_ansible_remote_tmp": "/root/.ansible/tmp", "_ansible_shell_executable": "/bin/sh", "_ansible_check_mode": false, "_ansible_tmpdir": null, "_ansible_debug": false}'})
fatal: [host4]: UNREACHABLE! => {
    "changed": false,
    "msg": "Mitogen was disconnected from the remote environment while a call was in-progress. If you feel this is in error, please file a bug. Original error was: the respondent Context has disconnected",
    "unreachable": true
}
[mux  8488] 19:40:37.332524 D mitogen: mitogen.ssh.Stream('ssh.host9:22').on_disconnect()
[mux  8488] 19:40:37.333122 D mitogen: RouteMonitor(): mitogen.ssh.Stream('ssh.host9:22') is gone; propagating DEL_ROUTE for set([11040, 18, 23039, 11039])
[mux  8488] 19:40:37.333499 D mitogen: Router(Broker(0x7fdb3bbe6210)).del_route(11040)
[mux  8488] 19:40:37.333945 I ansible_mitogen.services: ContextService(): Forgetting Context(11040, u'ssh.host9:22.sudo.root') due to stream disconnect
[task 9552] 19:40:37.334135 D mitogen: RouteMonitor(): firing local disconnect for Context(11040, u'ssh.host9:22.sudo.root')
[mux  8488] 19:40:37.334357 D mitogen: Router(Broker(0x7fdb3bbe6210)).del_route(18)
[mux  8488] 19:40:37.334701 I ansible_mitogen.services: ContextService(): Forgetting Context(18, u'ssh.host9:22') due to stream disconnect
[task 9552] 19:40:37.334695 D mitogen: RouteMonitor(): deleting route to 11040 via mitogen.core.Stream('unix_listener.8488')
[task 9552] 19:40:37.335039 D mitogen: Router(Broker(0x7fdb3a2c0410)).del_route(11040)
[mux  8488] 19:40:37.335079 D mitogen: Router(Broker(0x7fdb3bbe6210)).del_route(23039)
[mux  8488] 19:40:37.335408 D mitogen: Router(Broker(0x7fdb3bbe6210)).del_route(11039)
[task 9552] 19:40:37.335502 D mitogen: CallChain(Context(11040, u'ssh.host9:22.sudo.root')).call_no_reply(): mitogen.core.Dispatcher.forget_chain('psr-ex3000-168-24-9552-7fdb4c9d3740-58a0bf111c129')
[mux  8488] 19:40:37.335901 D mitogen: mitogen.parent.DiagLogStream(fd=143, 'ssh.host9:22').on_disconnect()
[task 9552] 19:40:37.336038 D mitogen: Context(0, None).call_service_async(u'ansible_mitogen.services.ContextService', u'reset', {'context': Context(11040, u'ssh.host9:22.sudo.root')})
[mux  8488] 19:40:37.336375 D mitogen: mitogen.ssh.Stream('ssh.host9:22'): PID 8612 exited with return code 255
[task 9552] 19:40:37.336455 D mitogen: Context(0, None).send_async(Message(0, 40107, 40107, 110, 1004, "\x80\x02X'\x00\x00\x00ansible_mitogen.services.ContextServiceq\x01X\x05"..186))

One suspicious thing was that this error always came along with a message like ansible_mitogen.connection: Call took 60XXX ms, where the duration was always around 60 seconds. It sounds like some timeout is involved. We wonder if there is a parameter we are not aware of that triggers this.

We tried adding ServerAliveInterval, ServerAliveCountMax, and also TCPKeepAlive=yes to ssh_args. They were recognized, as mitogen_get_stack showed one of those combinations (below). While trying these ssh_args the task passed for a couple of attempts, but no variation of the arguments made the task pass multiple times in a row.

$ ANSIBLE_STRATEGY=mitogen_linear ansible host1 -i ../inventory -m mitogen_get_stack -b
host1 | CHANGED => {
    "changed": true,
    "result": [
        {
            "kwargs": {
                "check_host_keys": "ignore",
                "compression": true,
                "connect_timeout": 90,
                "hostname": "<IP>",
                "identities_only": false,
                "identity_file": "<key>",
                "password": null,
                "port": 22,
                "python_path": [
                    "/usr/bin/python"
                ],
                "remote_name": null,
                "ssh_args": [
                    "-o",
                    "ControlMaster=auto",
                    "-o",
                    "ControlPersist=18000s",
                    "-o",
                    "ForwardAgent=yes",
                    "-o",
                    "ServerAliveInterval=15",
                    "-o",
                    "ServerAliveCountMax=12"
                ],
                "ssh_debug_level": null,
                "ssh_path": "ssh",
                "username": "exabeam"
            },
            "method": "ssh"
        },
        {
            "enable_lru": true,
            "kwargs": {
                "connect_timeout": 90,
                "password": null,
                "python_path": [
                    "/usr/bin/python"
                ],
                "remote_name": null,
                "sudo_args": [
                    "-H",
                    "-S",
                    "-n"
                ],
                "sudo_path": null,
                "username": "root"
            },
            "method": "sudo"
        }
    ]
}

Some more background info:

  • Which version of Ansible are you running?
    We tested the following combinations, and all had the issue:
    Ansible 2.7.10 + Mitogen 0.2.6,
    2.7.10 + 0.2.7,
    2.7.11 + 0.2.7, and
    2.7.11 + master

  • Is your version of Ansible patched in any way? - No

  • Are you running with any custom modules, or module_utils loaded? - No

  • Cluster set-up:

    • 19 onprem hosts with 1Gbps NIC
    • CentOS 7 + Python 2.7.5.
      Note: We suspected bandwidth, since host1's interface was maxed out for upload during the task (all the other 18 hosts pull the docker image from host1). However, the issue was not reproduced on AWS or GCP (20/30 VMs), even when limiting bandwidth to 1 Gbps or 100 Mbps using wondershaper.
  • Have you tried the latest master version from Git? - Yes

We really liked the speed-up with Mitogen in our cloud tests, and we hope to get this solved so we can use it on-prem too. Any suggestions for configs or things to look at would be appreciated.

@dw

Owner

commented Jun 3, 2019

This is an excellent bug report, thanks so much! Indeed, Mitogen by default configures SSH for 15 second heartbeats and times out when 3 have not been received. This setting has never been exposed to Ansible, and the default may be far too aggressive in any case.

For now, consider editing 'mitogen/ssh.py' and changing keepalive_count=3 on line 154 to, say, 30 :)
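
To put numbers on that, here is a quick sketch of the arithmetic (assuming the heartbeats map onto OpenSSH's ServerAlive options; exact attribute names and line numbers vary between versions, so check your own checkout rather than this snippet):

    # Rough arithmetic for the dead-link window; illustrative, not verbatim mitogen source.
    keepalive_interval_secs = 15   # Mitogen's default SSH heartbeat interval
    keepalive_count = 3            # default in mitogen/ssh.py; the workaround raises it to 30

    # ssh gives up after roughly interval * count seconds of silence:
    print(keepalive_interval_secs * keepalive_count)   # 45 s -> consistent with calls dying near the 60 s mark
    print(keepalive_interval_secs * 30)                # 450 s -> enough headroom to ride out a saturated NIC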

I will add new variables to control these values, and perhaps bump the default up to something more reasonable, like 5 minutes.

dw added a commit that referenced this issue Jun 3, 2019

dw added a commit that referenced this issue Jun 3, 2019

Merge remote-tracking branch 'origin/issue593'
* origin/issue593:
  issue #593: expose configurables for SSH keepalive and increase the default
@yuki-nishida-exa

Author

commented Jun 3, 2019

@dw thank you so much too for your quick response and for updating master! We will try again with master and update here.

@dw

Owner

commented Jun 3, 2019

Thanks again :)


This is now on the master branch and will make it into the next release. To be updated when a new release is made, subscribe to https://www.freelists.org/list/mitogen-announce

Thanks for reporting this!

@dw closed this Jun 3, 2019
