SSH connect failures on Mitogen 0.2.9 on WSL Ubuntu 18.04 #681

Open
gchaix opened this issue Jan 13, 2020 · 18 comments

Comments

@gchaix

gchaix commented Jan 13, 2020

I'm seeing consistent failures when trying to connect via SSH when multiple hosts are specified in the inventory:

TASK [Gathering Facts] **********************************************************************************************************************************************
ERROR! [mux  15260] 10:54:20.330539 E mitogen: <Stream ssh.stage-web1 #6e10> crashed
Traceback (most recent call last):
  File "/home/gchaix/repos/xxx/ansible/plugins/mitogen-0.2.9/mitogen/core.py", line 3481, in _call
    func(self)
  File "/home/gchaix/repos/xxx/ansible/plugins/mitogen-0.2.9/mitogen/core.py", line 1719, in on_transmit
    self.protocol.on_transmit(broker)
  File "/home/gchaix/repos/xxx/ansible/plugins/mitogen-0.2.9/mitogen/core.py", line 2167, in on_transmit
    self._writer.on_transmit(broker)
  File "/home/gchaix/repos/xxx/ansible/plugins/mitogen-0.2.9/mitogen/core.py", line 1907, in on_transmit
    written = self._protocol.stream.transmit_side.write(buf)
  File "/home/gchaix/repos/xxx/ansible/plugins/mitogen-0.2.9/mitogen/core.py", line 2033, in write 
    written, disconnected = io_op(os.write, self.fd, s)
  File "/home/gchaix/repos/xxx/ansible/plugins/mitogen-0.2.9/mitogen/core.py", line 553, in io_op
    return func(*args), None
OSError: [Errno 11] Resource temporarily unavailable
fatal: [stage-web1]: UNREACHABLE! => {"changed": false, "msg": "Mitogen was disconnected from the remote environment while a call was in-progress. If you feel this is in error, please file a bug. Original error was: the respondent Context has disconnected", "unreachable": true}
ok: [stage-web2] 

One host connects; all of the other host connections fail. If there are more than two hosts in the inventory, all but one fail with the same errors. Across repeated runs, which host succeeds appears to be random.

PLAY RECAP **********************************************************************************************************************************************************
prod-solr1         : ok=0    changed=0    unreachable=1    failed=0
prod-solr2        : ok=0    changed=0    unreachable=1    failed=0
prod-solr3         : ok=0    changed=0    unreachable=1    failed=0    
prod-util1 : ok=8    changed=0    unreachable=0    failed=0
prod-web1          : ok=0    changed=0    unreachable=1    failed=0
prod-web2          : ok=0    changed=0    unreachable=1    failed=0
prod-web3          : ok=0    changed=0    unreachable=1    failed=0

Environment:
Mitogen 0.2.9
Windows 10 Pro, V. 1809, OS build 17763.914
WSL Ubuntu 18.04.3 LTS
ansible 2.7.11
  config file = /home/gchaix/repos/xxx/ansible/ansible.cfg
  configured module search path = [u'/home/gchaix/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /home/gchaix/.local/lib/python2.7/site-packages/ansible
  executable location = /home/gchaix/.local/bin/ansible
  python version = 2.7.15+ (default, Oct 7 2019, 17:39:04) [GCC 7.4.0]
Host target OS is generally CentOS 7.x but this also appears to be happening with other distros (Ubuntu, etc.)

No patches applied to Ansible or Mitogen. I also tried the current Mitogen master and saw the same behavior. This feels like it might be related to #319, but I'm not familiar enough with the internals of WSL to say for certain. Interestingly, running Ansible with -vvv seems to bypass the issue: all host connections succeed, whereas running with just --verbose produces the failure and the output above.
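
For context on the `OSError: [Errno 11]` in the traceback: on Linux, errno 11 is `EAGAIN`, which `os.write` raises when a file descriptor is in non-blocking mode and the kernel cannot accept more data at that moment. The Python 3 sketch below reproduces that error class on an ordinary pipe. It is only an illustration, not Mitogen's code, and it makes no claim about why WSL returns the error here; the function name `fill_pipe_until_error` is made up for this example.

```python
import errno
import os


def fill_pipe_until_error(chunk_size=4096):
    """Write to a non-blocking pipe that nobody reads until the kernel refuses.

    Returns (errno_value, bytes_written_before_the_error).
    """
    read_fd, write_fd = os.pipe()
    os.set_blocking(write_fd, False)  # non-blocking: a full buffer raises instead of waiting
    written = 0
    try:
        while True:
            written += os.write(write_fd, b"x" * chunk_size)
    except OSError as exc:  # BlockingIOError is a subclass of OSError
        return exc.errno, written
    finally:
        os.close(read_fd)
        os.close(write_fd)


if __name__ == "__main__":
    err, count = fill_pipe_until_error()
    print("write failed with %s after %d bytes" % (errno.errorcode[err], count))
```

On a regular Linux kernel this error only appears once the pipe buffer is full, which is why a writer normally waits for the descriptor to become writable before retrying.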

@atoom

atoom commented Feb 11, 2020

Hi,

We are experiencing the exact same issue when running a playbook against multiple hosts from Ubuntu on WSL. There are no issues when running a playbook against a single host, or when running with -vvv against multiple hosts.

Edit: Running with MITOGEN_ROUTER_DEBUG=1 also "solves" the problem without having to use -vvv but leaves a log file behind on each target host.
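
The two workarounds reported so far (`-vvv`, or `MITOGEN_ROUTER_DEBUG=1` in the environment) can be wrapped in a small launcher, sketched below in Python 3. The helper name `build_workaround_invocation`, the playbook `site.yml` and the inventory `inventory.ini` are placeholders for this example, not names from this thread; only the `-vvv` flag and the `MITOGEN_ROUTER_DEBUG` variable come from the reports above.

```python
import os
import shlex


def build_workaround_invocation(playbook, inventory, use_router_debug=True):
    """Return (argv, env) for ansible-playbook with one of the reported workarounds."""
    env = dict(os.environ)
    argv = ["ansible-playbook", "-i", inventory, playbook]
    if use_router_debug:
        # Reported to avoid the failure, at the cost of a log file on each target host.
        env["MITOGEN_ROUTER_DEBUG"] = "1"
    else:
        # Reported to avoid the failure, at the cost of very verbose output.
        argv.append("-vvv")
    return argv, env


if __name__ == "__main__":
    argv, env = build_workaround_invocation("site.yml", "inventory.ini")
    print("MITOGEN_ROUTER_DEBUG=%s %s" % (
        env["MITOGEN_ROUTER_DEBUG"],
        " ".join(shlex.quote(arg) for arg in argv),
    ))
    # To actually run it: subprocess.run(argv, env=env)
```

Neither option fixes the underlying problem; both only change timing or logging enough that the failure has not been observed.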

I would gladly help out with additional troubleshooting but I need some pointers on where to start.

Environment:
WSL/Ubuntu: Ubuntu 18.04.1 LTS
Windows 10 V. 1809, OS build 18363.592
Ansible: 2.9.4
Mitogen: 0.2.9

@konstantin-kornienko

Same thing (

@kevinvalk

Same here, single connection works fine (--limit single host), else I get the same error.

Using WSL1 Debian Buster

@s1113950
Collaborator

Could someone try latest master again? I don't have a WSL env to test with unfortunately :( I have noticed other unrelated tasks have failed though with different amounts of -v applied; perhaps it's a bigger issue than specifically WSL-related 🤔

@gchaix
Author

gchaix commented Apr 27, 2020

I'm still seeing failures on master @ a5fe4a9

ansible-playbook 2.9.6
  config file = /home/gchaix/repos/project/ansible/ansible.cfg
  configured module search path = [u'/home/gchaix/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /home/gchaix/.local/lib/python2.7/site-packages/ansible
  executable location = /home/gchaix/.local/bin/ansible-playbook
  python version = 2.7.17 (default, Apr 15 2020, 17:20:14) [GCC 7.5.0]
Using /home/gchaix/repos/project/ansible/ansible.cfg as config file
TASK [Gathering Facts] *******************************************************************************************************************************************************************
ERROR! [mux  734] 12:05:11.015470 E mitogen: <Stream ssh.stage-web2.bak #1050> crashed
Traceback (most recent call last):
  File "/home/gchaix/repos/project/ansible/plugins/mitogen-head/mitogen/core.py", line 3481, in _call
    func(self)
  File "/home/gchaix/repos/project/ansible/plugins/mitogen-head/mitogen/core.py", line 1719, in on_transmit
    self.protocol.on_transmit(broker)
  File "/home/gchaix/repos/project/ansible/plugins/mitogen-head/mitogen/core.py", line 2167, in on_transmit
    self._writer.on_transmit(broker)
  File "/home/gchaix/repos/project/ansible/plugins/mitogen-head/mitogen/core.py", line 1907, in on_transmit
    written = self._protocol.stream.transmit_side.write(buf)
  File "/home/gchaix/repos/project/ansible/plugins/mitogen-head/mitogen/core.py", line 2033, in write
    written, disconnected = io_op(os.write, self.fd, s)
  File "/home/gchaix/repos/project/ansible/plugins/mitogen-head/mitogen/core.py", line 553, in io_op
    return func(*args), None
OSError: [Errno 11] Resource temporarily unavailable
fatal: [stage-web2.bak]: UNREACHABLE! => {"changed": false, "msg": "Mitogen was disconnected from the remote environment while a call was in-progress. If you feel this is in error, please file a bug. Original error was: the respondent Context has disconnected", "unreachable": true}
ERROR! [mux  734] 12:05:11.303791 E mitogen: <Stream ssh.stage-web1.bak #b8d0> crashed
Traceback (most recent call last):
  File "/home/gchaix/repos/project/ansible/plugins/mitogen-head/mitogen/core.py", line 3481, in _call
    func(self)
  File "/home/gchaix/repos/project/ansible/plugins/mitogen-head/mitogen/core.py", line 1719, in on_transmit
    self.protocol.on_transmit(broker)
  File "/home/gchaix/repos/project/ansible/plugins/mitogen-head/mitogen/core.py", line 2167, in on_transmit
    self._writer.on_transmit(broker)
  File "/home/gchaix/repos/project/ansible/plugins/mitogen-head/mitogen/core.py", line 1907, in on_transmit
    written = self._protocol.stream.transmit_side.write(buf)
  File "/home/gchaix/repos/project/ansible/plugins/mitogen-head/mitogen/core.py", line 2033, in write
    written, disconnected = io_op(os.write, self.fd, s)
  File "/home/gchaix/repos/project/ansible/plugins/mitogen-head/mitogen/core.py", line 553, in io_op
    return func(*args), None
OSError: [Errno 11] Resource temporarily unavailable
fatal: [stage-web1.bak]: UNREACHABLE! => {"changed": false, "msg": "Mitogen was disconnected from the remote environment while a call was in-progress. If you feel this is in error, please file a bug. Original error was: the respondent Context has disconnected", "unreachable": true}
ok: [prod-util1.bak]

@s1113950
Collaborator

Does anyone know if there's a way to get a WSL machine to test with? We use Azure Devops to test but afaik there's no WSL env we can enable

@arnemorten

> Does anyone know if there's a way to get a WSL machine to test with? We use Azure Devops to test but afaik there's no WSL env we can enable

You can probably run the azure devops agent inside a WSL instance and use that as the agent pool in your devops pipeline.

@rdghickman

Also reproducible for me most of the time, it seems much more prone to doing it on "copy" tasks for some reason.

I'm surprised because it was all working fine a while ago, so I suspect WSL has updated or something.

If I can help with any debug details let me know and I will try.

@s1113950
Collaborator

s1113950 commented Jul 1, 2020

> > Does anyone know if there's a way to get a WSL machine to test with? We use Azure Devops to test but afaik there's no WSL env we can enable
>
> You can probably run the azure devops agent inside a WSL instance and use that as the agent pool in your devops pipeline.

We'd need a WSL instance for that right? 🤔 is there an OSS-supported test env (like Travis, Circle, Azure devops, etc) that offer WSL instances?

@s1113950
Collaborator

s1113950 commented Jul 1, 2020

> Also reproducible for me most of the time, it seems much more prone to doing it on "copy" tasks for some reason.
>
> I'm surprised because it was all working fine a while ago, so I suspect WSL has updated or something.
>
> If I can help with any debug details let me know and I will try.

I wonder if WSL added a timeout on connections or something? 🤔 The error "the respondent Context has disconnected" indicates that the connection was broken somehow. Did it work on WSL1 but not WSL2?

@gchaix
Author

gchaix commented Jul 1, 2020

I'm still on WSL1 and definitely seeing the problem. Sadly, I don't know of any test envs that provide WSL instances to test.
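
Since the WSL1 versus WSL2 question keeps coming up, here is a rough Python 3 sketch for reporting which one a shell is running under, based on the kernel release string. The matching rules are an assumption drawn from commonly seen release strings (WSL1 ending in `-Microsoft`, WSL2 containing `microsoft-standard`), not an official detection method, and `classify_wsl` is a made-up name.

```python
import platform


def classify_wsl(kernel_release):
    """Heuristically classify a kernel release string as 'wsl1', 'wsl2' or 'not-wsl'."""
    release = kernel_release.lower()
    if "microsoft" not in release:
        return "not-wsl"
    # Assumption: WSL2 kernels carry "microsoft-standard" (often with a "WSL2" suffix),
    # while WSL1 reports a release that merely ends in "-Microsoft".
    if "microsoft-standard" in release or "wsl2" in release:
        return "wsl2"
    return "wsl1"


if __name__ == "__main__":
    # platform.release() returns the same string as `uname -r`.
    print(classify_wsl(platform.release()))
```

Including the raw `uname -r` output in a report alongside the classification avoids relying on the heuristic alone.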

@s1113950
Collaborator

s1113950 commented Jul 1, 2020

Could it be due to an ssh timeout error maybe? I found https://www.reddit.com/r/bashonubuntuonwindows/comments/bj617c/how_to_keep_wsl_shell_open_when_ssh_session/ . Wild shot in the dark but if it used to work with the same code and now doesn't then maybe WSL changed their default ssh session connection time?

@gchaix
Author

gchaix commented Jul 1, 2020

I'll dig through the linked post and do some experimenting, but at first glance it doesn't seem to apply: there is no delay at all between the success and the failures. One, and only one, random machine always succeeds and the others fail immediately. It feels as though it is trying to open a bunch of SSH connections in parallel but only one is allowed, with the rest immediately rejected by the underlying subsystems (networking, maybe?). It's important to note that, for me at least, I'm not sure it ever worked properly: I don't think I had tried connecting to an inventory with multiple hosts on WSL before encountering this problem.
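
The tracebacks show `os.write` raising `EAGAIN` and the stream being treated as crashed. As a point of comparison, the Python 3 sketch below shows one generic way a non-blocking writer can tolerate `EAGAIN`: wait with `select` until the descriptor is writable, then retry. This is not Mitogen's implementation, and whether something like it would be an appropriate fix for the WSL behavior is unknown; `write_all` and `roundtrip` are hypothetical names.

```python
import errno
import os
import select
import threading


def write_all(fd, data, timeout=5.0):
    """Write all of data to a non-blocking fd, waiting on EAGAIN instead of failing."""
    view = memoryview(data)
    sent = 0
    while sent < len(view):
        try:
            sent += os.write(fd, view[sent:])
        except OSError as exc:
            if exc.errno not in (errno.EAGAIN, errno.EWOULDBLOCK):
                raise
            # Kernel buffer is full: block until the fd is writable again.
            _, writable, _ = select.select([], [fd], [], timeout)
            if not writable:
                raise TimeoutError("fd %d not writable after %.1fs" % (fd, timeout))
    return sent


def roundtrip(size):
    """Push size bytes through a pipe with a concurrent reader; return bytes received."""
    read_fd, write_fd = os.pipe()
    os.set_blocking(write_fd, False)
    received = []

    def reader():
        while True:
            chunk = os.read(read_fd, 65536)
            if not chunk:  # empty read means the write end was closed
                break
            received.append(len(chunk))

    thread = threading.Thread(target=reader)
    thread.start()
    try:
        write_all(write_fd, b"x" * size)
    finally:
        os.close(write_fd)  # EOF lets the reader loop finish
    thread.join()
    os.close(read_fd)
    return sum(received)
```

Calling `roundtrip(1 << 20)` pushes 1 MiB through a pipe whose buffer is much smaller, so the writer has to cope with a full buffer along the way rather than treating it as fatal.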

@s1113950
Collaborator

s1113950 commented Jul 1, 2020

Ok. I'm not too sure why the underlying subsystems would be rejecting the other connections 😞 maybe @dw knows? He fixed WSL stuff last time: 22bab87 and 56943d3 . I do see other ssh-related WSL issues have been filed in the past: microsoft/WSL#3503, not sure if relevant though.

@rdghickman

Just as an additional point, I am seeing the failures and I am only targeting a single host. I agree it seems like a very quick failure.

@rdghickman

Anyone tried WSL2 yet with this?

@asantoni

asantoni commented Jun 3, 2021

Just to chime in with a possible workaround: disabling the Windows Defender firewall made the failure go away for me. I'm not sure why that solves it. All prior steps in the playbook execute successfully. I can also confirm that the LAN IP the playbook was run against is reachable with the firewall both on and off.

The task in the playbook is:

- name: Upload redacted package
  copy:
    dest: "/tmp/"
    src: "{{ latest_redacted_builds[ansible_distribution][ansible_distribution_major_version] }}"
    backup: yes
    owner: root
    group: root
  register: redacted_upload
  tags: [config, redacted-binary]

And the backtrace from the failed execution of the task is:

TASK [redacted: Upload redacted package] ****************************************************************************************************
ERROR! [mux  4321] 13:30:16.461182 E mitogen: <Stream ssh.192.168.122.236 #7c10> crashed
Traceback (most recent call last):
  File "/mnt/c/ansible/plugins/mitogen-0.2.10-rc.0/mitogen/core.py", line 3481, in _call
    func(self)
  File "/mnt/c/ansible/plugins/mitogen-0.2.10-rc.0/mitogen/core.py", line 1719, in on_transmit
    self.protocol.on_transmit(broker)
  File "/mnt/c/ansible/plugins/mitogen-0.2.10-rc.0/mitogen/core.py", line 2167, in on_transmit
    self._writer.on_transmit(broker)
  File "/mnt/c/ansible/plugins/mitogen-0.2.10-rc.0/mitogen/core.py", line 1907, in on_transmit
    written = self._protocol.stream.transmit_side.write(buf)
  File "/mnt/c/ansible/plugins/mitogen-0.2.10-rc.0/mitogen/core.py", line 2033, in write
    written, disconnected = io_op(os.write, self.fd, s)
  File "/mnt/c/ansible/plugins/mitogen-0.2.10-rc.0/mitogen/core.py", line 553, in io_op
    return func(*args), None
OSError: [Errno 11] Resource temporarily unavailable
fatal: [192.168.122.236]: UNREACHABLE! => {
    "changed": false,
    "unreachable": true
}

My platform is WSL1 with Ubuntu 18.04.3 LTS, on Windows 10 1904.985.

@ginolegigot

ginolegigot commented Oct 25, 2021

Hello,
Same issue here on a more recent config with WSL 1 and Ubuntu 20.04.
Tested with mitogen tag v2.10rc1 (also tested 0.2.9 unsuccessfully).
An example of error message here:
(screenshot: bugmitogen1)
As others have reported, running with -vvv works, but without it Mitogen ends up executing the Ansible tasks on only one host.
Hope it helps
