WinRM: Bails out with "[Errno 111] Connection refused" #25532

Open
dagwieers opened this issue Jun 9, 2017 · 15 comments

7 participants
@dagwieers
Member

commented Jun 9, 2017

ISSUE TYPE
  • Bug Report
COMPONENT NAME

WinRM

ANSIBLE VERSION

v2.4

OS / ENVIRONMENT

Control master: RHEL7
Target nodes: Windows 2012R2 (with Powershell 4.0, also tried Powershell 5.1)

SUMMARY

I just ran into a "Connection refused" again. The play was waiting for 3 VMs to appear (wait_for_connection doing a win_ping test); the last VM to come online then gave me a Connection refused in the next task, which runs setup.

We are using CredSSP.

I wonder if we could retry longer/delayed on Connection refused to hopefully make it survive such intermittent issues better.

It is quite possible that during the first boot the WinRM service starts, stops and then starts again, causing the "Connection refused". However, we should recover from this situation when it occurs.

TASK [Gathering Facts] ***********************************************************************************************************************************************************
Using module file /home/user/ansible.git/lib/ansible/modules/windows/setup.ps1
<1.2.3.101> ESTABLISH WINRM CONNECTION FOR USER: Administrator on PORT 5986 TO 1.2.3.101
Using module file /home/user/ansible.git/lib/ansible/modules/windows/setup.ps1
<1.2.3.103> ESTABLISH WINRM CONNECTION FOR USER: Administrator on PORT 5986 TO 1.2.3.103
Using module file /home/user/ansible.git/lib/ansible/modules/windows/setup.ps1
<1.2.3.102> ESTABLISH WINRM CONNECTION FOR USER: Administrator on PORT 5986 TO 1.2.3.102
EXEC (via pipeline wrapper)
EXEC (via pipeline wrapper)
EXEC (via pipeline wrapper)
The full traceback is:
Traceback (most recent call last):
  File "/home/user/ansible.git/lib/ansible/executor/task_executor.py", line 125, in run
    res = self._execute()
  File "/home/user/ansible.git/lib/ansible/executor/task_executor.py", line 526, in _execute
    result = self._handler.run(task_vars=variables)
  File "/home/user/ansible.git/lib/ansible/plugins/action/normal.py", line 45, in run
    results = merge_hash(results, self._execute_module(tmp=tmp, task_vars=task_vars, wrap_async=wrap_async))
  File "/home/user/ansible.git/lib/ansible/plugins/action/__init__.py", line 743, in _execute_module
    res = self._low_level_execute_command(cmd, sudoable=sudoable, in_data=in_data)
  File "/home/user/ansible.git/lib/ansible/plugins/action/__init__.py", line 892, in _low_level_execute_command
    rc, stdout, stderr = self._connection.exec_command(cmd, in_data=in_data, sudoable=sudoable)
  File "/home/user/ansible.git/lib/ansible/plugins/connection/winrm.py", line 337, in exec_command
    result = self._winrm_exec(cmd_parts[0], cmd_parts[1:], from_exec=True, stdin_iterator=self._wrapper_payload_stream(payload))
  File "/home/user/ansible.git/lib/ansible/plugins/connection/winrm.py", line 294, in _winrm_exec
    self.protocol.cleanup_command(self.shell_id, command_id)
  File "/usr/lib/python2.7/site-packages/winrm/protocol.py", line 307, in cleanup_command
    res = self.send_message(xmltodict.unparse(req))
  File "/usr/lib/python2.7/site-packages/winrm/protocol.py", line 207, in send_message
    return self.transport.send_message(message)
  File "/usr/lib/python2.7/site-packages/winrm/transport.py", line 184, in send_message
    response = self.session.send(prepared_request, timeout=self.read_timeout_sec)
  File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 609, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python2.7/site-packages/requests/adapters.py", line 487, in send
    raise ConnectionError(e, request=request)
ConnectionError: HTTPSConnectionPool(host='38.38.12.3', port=5986): Max retries exceeded with url: /wsman (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x38f9a50>: Failed to establish a new connection: [Errno 111] Connection refused',))
fatal: [VM3]: FAILED! => {
    "failed": true,
    "msg": "Unexpected failure during module execution.",
    "stdout": ""
}
ok: [VM1]
ok: [VM2]

This relates to #23320 (more examples from others there)

The playbook looks like this, and it fails on the setup task.

- name: Clone VM
  vmware_guest:
    hostname: '{{ vcenter_ipaddress }}'
    username: '{{ vcenter_username }}'
    password: '{{ vcenter_password }}'
    datacenter: '{{ vcenter_datacenter }}'
    resource_pool: '{{ vcenter_resource_pool }}'
    cluster: '{{ vcenter_cluster }}'
    folder: '{{ vcenter_folder }}'
    template: '{{ vcenter_template }}'
    name: '{{ inventory_hostname_short }}'
    state: poweredon
    validate_certs: no
    networks:
    - name: 'VLAN {{ vcenter_portgroup_prefix }}{{ pod_id }}'
      ip: '{{ ansible_host }}'
      netmask: '{{ vm_netmask }}'
      gateway: '{{ gw_ip }}'
      domain: '{{ domain }}'
      dns_servers: [ '{{ dns_ip }}' ]
    customization:
      autologon: yes
      #fullname: Administrator
      hostname: '{{ windows_shortname }}'
      orgname: '{{ windows_organization }}'
      password: '{{ windows_admin_password }}'
      productid: '{{ windows_product_id }}'
      runonce:
        - powershell.exe -ExecutionPolicy Unrestricted -File C:\Windows\Temp\ConfigureRemotingForAnsible.ps1 -CertValidityDays 3650 -EnableCredSSP -ForceNewSSLCert
      timezone: 105
  register: vm
  delegate_to: localhost

# Wait for system(s) to become reachable
- name: Wait for VM customizations
  wait_for_connection:
    delay: 240
    sleep: 15
    timeout: 900

- setup:
@dagwieers

Member Author

commented Jun 12, 2017

The same issue appears consistently when installing SCVMM (in async mode). If you have the following tasks in a playbook:

- name: Transfer System-Center ISO
  win_get_url:
    url: '{{ binaries_source }}/mu_system_center_2012_r2_virtual_machine_manager_x86_and_x64_dvd_2913737.iso'
    dest: C:\Windows\Temp\mu_system_center_2012_r2_virtual_machine_manager_x86_and_x64_dvd_2913737.iso
    force: no
    skip_certificate_validation: yes

- name: Mount System-Center ISO image
  win_disk_image:
    image_path: 'C:\Windows\Temp\mu_system_center_2012_r2_virtual_machine_manager_x86_and_x64_dvd_2913737.iso'
    state: present
  register: iso

- name: Run System-Center installer
  win_command: >
    {{ iso.mount_path }}setup.exe /server /i /f "C:\Windows\Temp\VMServer.ini"
    /SqlDBAdminDomain "{{ dc }}" /SqlDBAdminName "{{ windows_admin_user }}" /SqlDBAdminPassword "{{ windows_admin_password }}"
    /VmmServiceDomain "{{ dc }}" /VmmServiceUserName "scvmmsvc" /VmmServiceUserPassword "{{ windows_admin_password }}"
    /IACCEPTSCEULA
  args:
    creates: 'C:\Program Files\Microsoft System Center 2012 R2\Virtual Machine Manager\bin\VmmAdminUi.exe'
  vars:
    ansible_user: '{{ dc }}\{{ windows_admin_user }}'
  when: not vmadminui.stat.exists
  register: systemcenter
  async: 1000
  poll: 15
  ignore_errors: yes

- name: Run System-Center installer (again)
  win_command: >
    {{ iso.mount_path }}setup.exe /server /i /f "C:\Windows\Temp\VMServer.ini"
    /SqlDBAdminDomain "{{ dc }}" /SqlDBAdminName "{{ windows_admin_user }}" /SqlDBAdminPassword "{{ windows_admin_password }}"
    /VmmServiceDomain "{{ dc }}" /VmmServiceUserName "scvmmsvc" /VmmServiceUserPassword "{{ windows_admin_password }}"
    /IACCEPTSCEULA
  args:
    creates: 'C:\Program Files\Microsoft System Center 2012 R2\Virtual Machine Manager\bin\VmmAdminUi.exe'
  vars:
    ansible_user: '{{ dc }}\{{ windows_admin_user }}'
  when: not vmadminui.stat.exists and systemcenter|failed
  async: 1000
  poll: 15

It fails consistently with a Connection refused:

TASK [windows-scvmm-server : Run System-Center installer] ******************************************************************
fatal: [bdsol-aci51-scvmm-01]: FAILED! => {"failed": true, "msg": "credssp: HTTPSConnectionPool(host='38.38.51.3', port=5986): Max retries exceeded with url: /wsman (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x3b6f3d0>: Failed to establish a new connection: [Errno 111] Connection refused',)), ssl: HTTPSConnectionPool(host='38.38.51.3', port=5986): Max retries exceeded with url: /wsman (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x3b65b10>: Failed to establish a new connection: [Errno 111] Connection refused',)), plaintext: HTTPSConnectionPool(host='38.38.51.3', port=5986): Max retries exceeded with url: /wsman (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x3b5d9d0>: Failed to establish a new connection: [Errno 111] Connection refused',))"}
...ignoring

TASK [windows-scvmm-server : Run System-Center installer (again)] **********************************************************
fatal: [bdsol-aci51-scvmm-01]: UNREACHABLE! => {"changed": false, "msg": "credssp: HTTPSConnectionPool(host='38.38.51.3', port=5986): Max retries exceeded with url: /wsman (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x3b68c50>: Failed to establish a new connection: [Errno 111] Connection refused',)), ssl: HTTPSConnectionPool(host='38.38.51.3', port=5986): Max retries exceeded with url: /wsman (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x3b971d0>: Failed to establish a new connection: [Errno 111] Connection refused',)), plaintext: HTTPSConnectionPool(host='38.38.51.3', port=5986): Max retries exceeded with url: /wsman (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x3b97350>: Failed to establish a new connection: [Errno 111] Connection refused',))", "unreachable": true}
@dagwieers

Member Author

commented Jun 14, 2017

So I wrote a simple implementation that retries up to 5 times, 5 seconds apart, when dealing with Connection refused, and I ended up with reproducible HTTP 500 errors after that (with Windows 2012R2). When I then upgraded WMF/PS 4.0 to WMF/PS 5.1, the HTTP 500 errors were a thing of the past, and the task then succeeded!

But the task would only succeed if it was run in async mode; if not, I would get a problem related to a None value being provided to the ElementTree parser. I haven't looked into that issue.

So it seems that the HTTP 500 issues with WinRM (likely due to the WinRM service being restarted) disappear with WMF 5.1! (Potentially in other scenarios as well.)

@dagwieers dagwieers changed the title WinRM: During use bails out with "Connection refused" WinRM: Bails out with "[Errno 111] Connection refused" Jul 14, 2017

@dagwieers

This comment was marked as outdated.

Member Author

commented Aug 19, 2017

Ok, so since I stumbled upon this again when building a new environment, and this fix (ugly as it is) does help for me, here is the poor man's workaround to survive "Connection Refused"

diff --git a/winrm/transport.py b/winrm/transport.py
index bb2f881..de547e9 100644
--- a/winrm/transport.py
+++ b/winrm/transport.py
@@ -1,8 +1,10 @@
 from __future__ import unicode_literals
 from contextlib import contextmanager
+import errno
 import re
 import sys
 import os
+import time
 import weakref
 
 is_py2 = sys.version[0] == '2'
@@ -189,7 +191,16 @@ def send_message(self, message):
         prepared_request = self.session.prepare_request(request)
 
         try:
-            response = self.session.send(prepared_request, timeout=self.read_timeout_sec)
+
+            for attempt in range(5):
+                try:
+                    response = self.session.send(prepared_request, timeout=self.read_timeout_sec)
+                    break
+                except requests.exceptions.ConnectionError as e:
+                    if attempt == 4 or 'connection refused' not in str(e).lower():
+                        raise
+                    time.sleep(5)
+
             response_text = response.text
             response.raise_for_status()
             return response_text

So the first problem here is that requests makes it very hard to intercept the Connection Refused exception; its abstraction is flawed in that respect IMO. So what I am doing here is catching ConnectionError, retrying if the message mentions "connection refused", and re-raising if it does not (or if the last attempt failed too).

The other thing this implementation needs, is a way to configure the number of retries, and the delay. So in my poor man's implementation it's trying 5 times, every 5 seconds. But we may be able to deduce some better defaults when testing this in real-life scenarios (how long does it take to restart the service in general, etc.).
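
To illustrate what configurable retries and delay could look like, here is a rough sketch of the same loop pulled into a helper. This is not what pywinrm ships: `retry_on_refused`, its parameters and the injectable `sleep` are hypothetical names made up for this example. In the patch above one would pass `requests.exceptions.ConnectionError` as `exc_type`.

```python
import time


def retry_on_refused(send, retries=5, delay=5.0, exc_type=OSError, sleep=time.sleep):
    """Call send() until it succeeds or `retries` attempts have failed.

    Only errors whose message mentions 'connection refused' are retried;
    anything else is re-raised immediately. (Hypothetical helper.)
    """
    for attempt in range(1, retries + 1):
        try:
            return send()
        except exc_type as e:
            last_attempt = attempt == retries
            if last_attempt or 'connection refused' not in str(e).lower():
                raise
            # Give the remote service some time to come back before retrying.
            sleep(delay)
```

Matching on the message text is as crude as in the patch, but it keeps the helper independent of how the HTTP library wraps the underlying socket error.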

Feedback appreciated, upstream PR is here: diyan/pywinrm#174

@shilpa12345

This comment was marked as resolved.


commented Nov 21, 2017

Guys, I need help here. My /etc/ansible/hosts file has this entry for Windows:
[windows]
Lenovo-PC

Lenovo-PC entry is added in my /etc/hosts file

AND

/etc/ansible/windows.yml file is
ansible_user: Administrator
ansible_password:
ansible_port: 5985
ansible_connection: winrm
ansible_winrm_scheme: http
ansible_winrm_server_cert_validation: ignore

and run command
ansible windows -m win_ping

But Still GET ERROR

Lenovo-PC | UNREACHABLE! => {
"changed": false,
"msg": "Failed to connect to the host via ssh: ssh: connect to host lenovo-pc port 22: Connection refused\r\n",
"unreachable": true
}

PLEASE HELLPPP!!!!!!!!

@dagwieers

This comment was marked as resolved.

Member Author

commented Nov 21, 2017

@shilpa12345 This seems to be a user-induced problem. Ansible is using SSH, not WinRM. You created a file /etc/ansible/windows.yml, but Ansible does not read variables from there. You have to put your group variables in a group_vars directory, i.e. group_vars/windows.yml.

Please visit the IRC channel or the Ansible google group for support questions, this is NOT a support forum, but a development tool.

@ansibot

Contributor

commented Jan 8, 2018

@ashfaqn

This comment was marked as off-topic.


commented Sep 28, 2018

@dagwieers could you please let me know if you have managed to resolve this issue as I am facing this on EC2 (WIN 2016) instances.

@dagwieers

This comment was marked as off-topic.

Member Author

commented Sep 28, 2018

@ashfaqn I don't think we can resolve it from Ansible's side, as it clearly is something on the Windows side that makes it impossible to connect to the WinRM port (possibly temporarily). I suspect that the WinRM service (or one of its dependencies) was being restarted, or maybe the network subsystem was being restarted, ...

So the only workaround is to recover from a "Connection Refused" as shown by my patch above. Or maybe introduce a sleep, and try again (increasing the chance the system has settled).

In the case of VMware I assume that when the VM was created, Windows is doing some initialising and as part of that the WinRM service or network is being restarted. So it works briefly, then stops working, then works again.
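
For the "works briefly, stops, then works again" pattern, one option on the controller side is to only continue once the port has answered several times in a row. The sketch below is an assumption about how that could be done, not something Ansible or pywinrm provides; `port_is_open`, `wait_until_settled` and all of their parameters are hypothetical names.

```python
import socket
import time


def port_is_open(host, port, connect_timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=connect_timeout):
            return True
    except OSError:
        return False


def wait_until_settled(host, port, needed=3, interval=5.0, timeout=900.0,
                       probe=port_is_open, sleep=time.sleep):
    """Wait until `needed` consecutive probes succeed.

    Returns True once the port looks stable, False if `timeout` seconds
    pass first. A failed probe resets the streak, so a service that comes
    up, goes down and comes up again is not reported as ready too early.
    """
    deadline = time.monotonic() + timeout
    streak = 0
    while True:
        streak = streak + 1 if probe(host, port) else 0
        if streak >= needed:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

This only checks that the TCP port accepts connections (e.g. `wait_until_settled('1.2.3.101', 5986)`); it says nothing about whether WinRM authentication would succeed.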

@dagwieers

Member Author

commented Oct 7, 2018

I updated my patch to pywinrm to recover from this at: diyan/pywinrm#174

Now you can set reconnection_retries and reconnection_backoff (e.g. to 4 retries and 2.0 seconds, respectively) to recover from temporary Connection Refused situations. This can recover from e.g. installing SCVMM (which apparently makes WinRM unavailable for a short while). The backoff periods are then 2, 4, 8 and 16 seconds (30 seconds in total).
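
The arithmetic behind that schedule, as a quick sketch. It assumes the delay before retry n is backoff * 2**(n-1), which is what the numbers above imply; `backoff_schedule` is a hypothetical helper, and the delays the underlying urllib3 retry logic actually applies may differ slightly between versions.

```python
def backoff_schedule(retries=4, backoff=2.0):
    """Return the assumed delay (in seconds) before each retry."""
    return [backoff * 2 ** n for n in range(retries)]


delays = backoff_schedule(retries=4, backoff=2.0)
print(delays, sum(delays))  # [2.0, 4.0, 8.0, 16.0] 30.0
```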

@dagwieers

Member Author

commented Oct 7, 2018

I also implemented the same solution for pypsrp now at: jborean93/pypsrp#10

@dagwieers

Member Author

commented Oct 9, 2018

Here is a quick-fix for hand-editing your pywinrm installation:

--- a/winrm/transport.py
+++ b/winrm/transport.py
@@ -158,6 +158,16 @@ class Transport(object):
         settings = session.merge_environment_settings(url=self.endpoint, proxies={}, stream=None,
                                                       verify=None, cert=None)
 
+        # Retry on connection errors, with a backoff factor
+        retries = requests.packages.urllib3.util.retry.Retry(total=4,
+                                                             connect=4,
+                                                             status=4,
+                                                             read=0,
+                                                             backoff_factor=2.0,
+                                                             status_forcelist=(413, 425, 429, 503))
+        session.mount('http://', requests.adapters.HTTPAdapter(max_retries=retries))
+        session.mount('https://', requests.adapters.HTTPAdapter(max_retries=retries))
+
         # get proxy settings from env
         # FUTURE: allow proxy to be passed in directly to supersede this value
         session.proxies = settings['proxies']

Or your pypsrp installation:

--- a/pypsrp/wsman.py
+++ b/pypsrp/wsman.py
@@ -773,6 +773,18 @@ class _TransportHTTP(object):
         elif self.no_proxy:
             session.proxies = orig_proxy
 
+        # Retry on connection errors, with a backoff factor
+        retries = requests.packages.urllib3.util.retry.Retry(
+            total=4,
+            connect=4,
+            status=4,
+            read=0,
+            backoff_factor=2.0,
+            status_forcelist=(413, 425, 429, 503),
+        )
+        session.mount('http://', requests.adapters.HTTPAdapter(max_retries=retries))
+        session.mount('https://', requests.adapters.HTTPAdapter(max_retries=retries))
+
         # set cert validation config
         session.verify = self.cert_validation

This will implement 4 retries with exponential back-off, using a back-off factor of 2.0 seconds.

Please test and report back.

@VladislavPershin

This comment was marked as off-topic.

@dagwieers

This comment was marked as resolved.

Member Author

commented Dec 8, 2018

@VladislavPershin It seems it has never worked for him, so it has no relation to this.

@mayfield2333

This comment was marked as off-topic.


commented Feb 3, 2019

Interestingly I have success using ntlm and http on port 5985, but I have this extra warning about CBT:

$ ansible windows --ask-vault-pass  -m win_ping
Vault password: 
/usr/lib/python2.7/site-packages/requests_ntlm/requests_ntlm.py:200: NoCertificateRetrievedWarning: Requests is running with a non urllib3 backend, cannot retrieve server certificate for CBT
  NoCertificateRetrievedWarning)
myhost | SUCCESS => {
    "changed": false, 
    "ping": "pong"
}

Using config:

# Working config
ansible_port: 5985
ansible_winrm_scheme: http
ansible_winrm_transport: ntlm
ansible_connection: winrm
ansible_winrm_server_cert_validation: ignore

But it fails against the same machine with https on the configured port of 443.

$ ansible windows --ask-vault-pass  -m win_ping
Vault password: 
myhost | UNREACHABLE! => {
    "changed": false, 
    "msg": "ntlm: HTTPSConnectionPool(host='myhost', port=443): Max retries exceeded with url: /wsman (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7fe7f21bc610>, 'Connection to myhost timed out. (connect timeout=30)'))", 
    "unreachable": true
}

I have disabled the firewall, after seeing reports of it blocking on 443, so I know the attempted connection is making it to the machine, but being refused.

Failed configuration:

ansible_port: 443
ansible_winrm_scheme: https
ansible_winrm_transport: ntlm
ansible_connection: winrm
ansible_winrm_server_cert_validation: ignore
@dagwieers

This comment was marked as off-topic.

Member Author

commented Feb 3, 2019

Do you have WinRM configured to use port 443? Because that is not the default port for WinRM over https (just like port 80 is not the default port for WinRM over http). If you want to use https, leave out ansible_port and it will use port 5986 automatically. Anyway, this is off-topic; it is not related to this issue.
