win_updates round 2 ERROR DURING WINRM SEND INPUT WinRMOperationTimeoutError #516

agibson2 · 2023-06-14T21:58:07Z

SUMMARY

Applying windows updates to servers with ansible.windows/community.windows version 2.0.0 give the error below. When the error gets triggered, it will just sit for hours and no further output is seen until I CTRL-C the process. This is tested against Windows 2019 server.

Does not happen with ansible.windows/community.windows 1.12.0

If I install ansible.windows and collection.windows (seems to use ansible.windows going by -vvvvv output as I am not specifying which one to use... just using win_updates in playbook.

ISSUE TYPE

Ansible never completes after the error and just doesn't display any more output. I see communications going to the servers (maybe every few seconds a query of some kind is sent) but nothing appears to be done. I waited hours and nothing else is output.

COMPONENT NAME

win_updates

ANSIBLE VERSION

Using OS supplied Ansible version.

ansible [core 2.14.2]
  config file = /etc/ansible/ansible.cfg
  configured module search path = ['/root/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python3.11/site-packages/ansible
  ansible collection location = /root/.ansible/collections:/usr/share/ansible/collections
  executable location = /usr/bin/ansible
  python version = 3.11.2 (main, Feb 16 2023, 00:00:00) [GCC 11.3.1 20221121 (Red Hat 11.3.1-4)] (/usr/bin/python3.11)
  jinja version = 3.1.2
  libyaml = True

COLLECTION VERSION

Using collections installed through ansible-galaxy command.

Collection        Version
----------------- -------
ansible.posix     1.5.4  
ansible.windows   2.0.0 
community.windows 2.0.0

CONFIGURATION

CONFIG_FILE() = /etc/ansible/ansible.cfg

OS / ENVIRONMENT

Rocky Linux 9.2 (based on RHEL 9.2)

STEPS TO REPRODUCE

Apply the updates. Get the error.

Initiate update and right when round 2 starts (after reboot check is successful), the error below occurs. (only occurs if windows updates are applied to get to round 2)

- name: "{{ 'Check' if ansible_check_mode else 'Apply' }} updates"
  win_updates:
    category_names:
      - SecurityUpdates
      - CriticalUpdates
      - Updates
      - Update Rollups
      - FeaturePacks
    state: installed
    reboot: yes
    skip_optional: yes
    reboot_timeout: 1800
    log_path: C:\Support\win_updates\apply_wulog.txt
    reject_list:
      - ' Preview '
  register: update_result

EXPECTED RESULTS

Update to complete and successfully complete round 2 checks

ACTUAL RESULTS

Output where it shows the update succeeds and reboot is successful, but on round 2 you get the error.
https://gist.github.com/agibson2/a6e8747ea5042156b04f5bd6ff30385b

Last part of the output...

win_updates: resetting connection plugin
<192.168.1.35> ESTABLISH WINRM CONNECTION FOR USER: adminuser on PORT 5985 TO 192.168.1.35
<192.168.1.35> WINRM CONNECT: transport=ntlm endpoint=http://192.168.1.35:5985/wsman
<192.168.1.35> WINRM OPEN SHELL: 8B1A0B32-3BF2-4F98-8D1C-DBC3EB8E7169
win_updates: attempting to get system boot time
win_updates: getting boot time
win_updates: running command: (Get-CimInstance -ClassName Win32_OperatingSystem -Property LastBootUpTime).LastBootUpTime.ToFileTime()
EXEC (via pipeline wrapper)
<192.168.1.35> WINRM EXEC 'PowerShell' ['-NoProfile', '-NonInteractive', '-ExecutionPolicy', 'Unrestricted', '-EncodedCommand', '--CUT---']
<192.168.1.35> WINRM RESULT '<Response code 0, out "133312528185000000\r\n", err "#< CLIXML\r\n<Objs Ver">'
win_updates: command result - rc: 0, stdout: 133312528185000000, stderr: 
win_updates: last boot time: 133312528185000000
win_updates: last boot time check success
win_updates running post reboot test command
win_updates: attempting post-reboot test command
win_updates: running command: Get-Item -LiteralPath 'HKLM:\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon\AutoLogonChecked' -ErrorAction Stop; 'success-07f8a835-2512-44b5-9aa1-53f60ad2dd9d'
EXEC (via pipeline wrapper)
<192.168.1.35> WINRM EXEC 'PowerShell' ['-NoProfile', '-NonInteractive', '-ExecutionPolicy', 'Unrestricted', '-EncodedCommand', '--CUT--']
<192.168.1.35> WINRM RESULT '<Response code 0, out "\r\n\r\n    Hive: HKEY_L", err "#< CLIXML\r\n<Objs Ver">'
win_updates: command result - rc: 0, stdout: Hive: HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon


Name                           Property                                                                                
----                           --------                                                                                
AutoLogonChecked                                                                                                       
success-07f8a835-2512-44b5-9aa1-53f60ad2dd9d, stderr: 
win_updates: post-reboot test command success
win_updates: system successfully rebooted
<192.168.1.35> Running win_updates - round 2
<192.168.1.35> Starting update task
Using module file /home/ansible/.ansible/collections/ansible_collections/ansible/windows/plugins/modules/win_updates.ps1
Pipelining is enabled.
EXEC (via pipeline wrapper)
<192.168.1.35> WINRM EXEC 'PowerShell' ['-NoProfile', '-NonInteractive', '-ExecutionPolicy', 'Unrestricted', '-EncodedCommand', '---CUT---']
[WARNING]: ERROR DURING WINRM SEND INPUT - attempting to recover:
WinRMOperationTimeoutError

The text was updated successfully, but these errors were encountered:

acobaugh · 2023-06-15T17:42:03Z

We actually see this same error, at least with 2019 AWS AMIs, with 1.10.0, 1.12.0, and 2.0.0 (haven't tried any others). This started for us this week. I also wonder if #511 is the same issue. Fwiw, it seems like the 2022 AWS AMIs update fine.

jborean93 · 2023-06-27T21:50:47Z

Thanks for the bug report, we've had a few other issues reported for similar errors. It seems like we need to tighten up some of the code that handles these cases when the connection is flaky.

nvn17git · 2023-07-20T18:30:52Z

@jborean93 - has this been fixed? We are facing the same issue.

jborean93 · 2023-07-20T18:47:08Z

No, I’m currently knee deep in trying to make things better with this plugin but it’s very slow going. Try the psrp connection plugin as it should hopefully make the connection a bit more stable.

bidetzz · 2023-08-10T13:57:57Z

I'm also having this issue. any updates ?

MrKementari · 2023-08-14T15:02:55Z

Same issue here, with version 1.13.0.
Is there a workaround while this bug is being fixed ?

dolsson01 · 2023-08-14T17:39:44Z

This has worked for me though there is nothing to stop the timeout from happening in the rescue.

Improvements welcome.

    - name: Install updates
      block:
        - name: Run win_updates
          ansible.windows.win_updates:
            category_names: '*'
            reboot: yes
          timeout: 1500
      rescue:
        - name: Run win_updates again
          ansible.windows.win_updates:
            category_names: '*'
            reboot: yes

guilhermerocha000 · 2023-08-14T18:01:23Z

As I can dig into, I found this:
https://docs.ansible.com/ansible/latest/os_guide/windows_winrm.html#inventory-options

The default value it is not enough for server reboot.

There is a rule to set that time. So it is a better choice use the 'connection_timeout' option (it set both, operational and read timeout values based on that rule):
ansible_winrm_connection_timeout: 1200

Furthermore another important config:
https://docs.ansible.com/ansible/latest/os_guide/windows_winrm.html#winrm-limitations

ansible_winrm_kerberos_delegation: true

So this config worked for me (about 20 minutes):
ansible_winrm_connection_timeout: 1200
ansible_winrm_kerberos_delegation: true

agibson2 · 2023-08-14T22:16:40Z

There is already a dedicated timeout that you can set for a reboot timeout (reboot_timeout) once the current issues are resolved where winrm handles retries better.

I don't think the ones you mention are a good thing to set for a reboot timeout as that could delay individual winrm commands not related to reboot for 20 minutes during a connection issue with the server. Could delay your updates more than needed for some intermittent network connectivity (over VPN, etc). A more robust retry mechanism would be better which I think is what is being worked on.

guilhermerocha000 · 2023-08-15T01:47:09Z

Reboot timeout didn't solves the problem. It's about winRM connection timeout.
Reboot its just an action that make winRM lost connection long enough to achieve default 30 seconds.
And it's not a sleep time. So, will not delay update installation.
.
It's not a better solution. But it's a workaround that solves the problem for now.

agibson2 · 2023-08-15T02:10:47Z

I didn't mean it was a sleep timeout. I meant that adding that timeout could affect more areas than expected. What if state is lost in a firewall somewhere, it would take 20 minutes for that connection to timeout if the firewall doesn't send TCP reset (some firewalls don't). So you would have to wait for that long 20 minute timeout in a section of code that is not related to rebooting where normally it would have retried in 30 seconds.

MrKementari · 2023-08-16T07:30:27Z

I already tried to set those two variables :

ansible_winrm_operation_timeout_sec: 3600
ansible_winrm_read_timeout_sec: 3600

No changes from my side, same issue.
I'm going to run a few tests today with psrp, and hope the situation improves.

guilhermerocha000 · 2023-08-16T20:24:02Z

I already tried to set those two variables :
ansible_winrm_operation_timeout_sec: 3600
ansible_winrm_read_timeout_sec: 3600
No changes from my side, same issue. I'm going to run a few tests today with psrp, and hope the situation improves.

Just try with these two

ansible_winrm_connection_timeout: 1200 
ansible_winrm_kerberos_delegation: true

jborean93 · 2023-08-16T23:59:02Z

@MrKementari can you run with -vvv and share the raw output? It will be a big ask but if you or anyone else in this thread can capture the wireshark traffic of this exchange with the send input error it will greatly help try to identify what's happening. Unfortunately I'm very limited in what I can fix if I cannot reproduce the error. In the past 3 weeks of improving this module I have not encountered this problem once so I'm unsure what it could be.

jborean93 · 2023-08-17T00:01:09Z

@agibson2 is this something that reliably reproduces for you? Is the Server 2019 host a local VM or something else. Is it using some image that I could probably try out if it's public?

agibson2 · 2023-08-17T01:32:50Z

It is a VM. I have a few systems I test before I deploy updates to other systems and that is how I caught it. I should be able to reproduce it. I might have to try backing out the last Windows updates if I were to try it again as I don't think I have a snapshot of it from before the last update. Do you want me to try the newer code that you have been working on lately to see if that fixes the problem? I am not sure if that work is in main branch or not.

It is basically a fresh install of 2019 Server on an ESXi host (no other software installed) with updates that I apply each month. Nothing special.

I am using Rocky Linux 9.2 with all updates applied and the supplied ansible-core RPM installed (don't recall if that is EPEL rpm or what is provided by redhat). For the community modules, I am using ansible-galaxy. I had to install the python winrm of course (using the correct python pip for the python version needed).

jborean93 · 2023-08-17T01:41:46Z

While the newer code should simplify some of the cleanup operations I do believe this particular issue is still present and hasn't been fixed with any of the other changes made recently.

When looking at the code in the winrm connection plugin I can see why it is hanging but I cannot see how it gets into the state in the first place. Essentially Ansible starts a new powershell.exe process and attempts to send data to the stdin process. This send operation fails with a WinRMOperationTimeoutError which is when the WinRM service doesn't respond in the operation timeout requested. As the stdin data was never fully written by the service the subsequent wait that Ansible does for the process to finish does not occur causing the hang.

The trouble is I cannot reproduce the error myself. Creating a reproducer is key to figuring out safe ways to try and solve these problems and verify that the fix actually works. I have 1 or 2 more things to try that might work but finding out more information in user environments will greatly help figure out what might be going on.

agibson2 · 2023-08-17T01:42:10Z

I checked and the ansible-core I am using is what is provided by Rocky and not EPEL.

jborean93 · 2023-08-18T03:41:51Z

While I do not have a bullet proof solution I do have a way to try and mitigate this problem. I've opened the PR ansible/ansible#81538 to make the send input operation more resiliant by handling the WinRMOperationTimeoutError in a special way. From testing when the server is under really heavy load the WSMan Send operation can fail with this WSMan fault but in reality the data was still written to the process' stdin. Taking this into account the winrm connection plugin can ignore this exception and continue sending the remaining input (with some extra tolerance in case it's still busy). It isn't perfect but it should avoid this particular problem and from testing with a VM with 1 CPU I can see that it no longer hangs at that step.

I will mention that the psrp connection plugin is already more resiliant against this problem and while not perfect it should work today against this problem.

Unfortunately as this is a server side load issue there isn't really much else we can do here. I had to reduce my testing VM to 1 CPU and even then it didn't always reproduce the error here. At some point Ansible can't do much more except offer a way to override the operation timeout which is through the connection_timeout option or by setting ansible_winrm_operation_timeout_sec and ansible_winrm_read_timeout_sec (read timeout should be higher than operation).

david-sieg · 2023-09-15T05:32:13Z

I tried new collection version 2.1.0 without success, timeouts also occurs.

`[WARNING]: ERROR DURING WINRM SEND INPUT - attempting to recover:
WinRMOperationTimeoutError

[WARNING]: ERROR DURING WINRM SEND INPUT - attempting to recover:
WinRMOperationTimeoutError

[WARNING]: Unknown failure when polling update result - attempting to cancel
task: HTTPSConnectionPool(host='172.21.110.2', port=5986): Max retries exceeded
with url: /wsman (Caused by
NewConnectionError('<urllib3.connection.HTTPSConnection object at
0x77b15e245f10>: Failed to establish a new connection: [Errno 111] Connection
refused'))

[WARNING]: Unknown failure when cancelling update task:
HTTPSConnectionPool(host='172.21.110.2', port=5986): Max retries exceeded with
url: /wsman (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection
object at 0x77b15c5f9070>: Failed to establish a new connection: [Errno 111]
Connection refused'))

[WARNING]: ERROR DURING WINRM SEND INPUT - attempting to recover:
WinRMOperationTimeoutError

[WARNING]: ERROR DURING WINRM SEND INPUT - attempting to recover:
WinRMOperationTimeoutError`

jborean93 · 2023-09-15T05:42:59Z

The send input recovery stuff is part of Ansible and not this collection. Unfortunately it has not been merged yet as it's waiting on a review from one of my colleagues ansible/ansible#81538

waddelb2 · 2023-09-21T18:58:35Z

Does that mean there's a fix for this getting merged soon? I tried psrp and now I'm running into new, intermittent issues that I didn't experience with WinRM

jborean93 · 2023-09-26T22:56:49Z

The PR ansible/ansible#81538 has been merged and will be present in subsequent Ansible releases. Backport PRs going back to Ansible 2.14 have been opened but there are no guarantees that they will be accepted.

agibson2 · 2024-01-10T21:27:33Z

I just want to say thanks for the work on this and that I haven't experienced this anymore in months since using updated versions on Rocky Linux 9. I use ansible-core from RL9 and then use ansible-galaxy for all the rest of the things like ansible.posix, ansible.windows, community.general, community.windows.

ansible [core 2.14.9]
config file = /etc/ansible/ansible.cfg
configured module search path = ['/home/ansible/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /usr/lib/python3.9/site-packages/ansible
ansible collection location = /home/ansible/.ansible/collections:/usr/share/ansible/collections
executable location = /usr/bin/ansible
python version = 3.9.18 (main, Sep 7 2023, 00:00:00) [GCC 11.4.1 20230605 (Red Hat 11.4.1-2)] (/usr/bin/python3)
jinja version = 3.1.2
libyaml = True

Collection Version

ansible.posix 1.5.4
ansible.windows 2.2.0
community.general 8.2.0
community.windows 2.1.0

brian-lamb-software-engineer · 2024-01-17T22:57:31Z

I've been experiencing this issue still

Ansible 2.15.3
python 3.11.5

So not sure about the last comment. I cant figure out the easiest way to check ansible plugin versions with out using a task for that, i need a command so i can check easier.

I suppose we can check into psrp if needed, or if anybody has pinned down which time to increase that may be helpful, but i dont want to update the timeout ansible wide, it would be great if theres a way to update the timeout only for that task/play, etc..

jborean93 · 2024-01-17T23:15:02Z

Ansible 2.15.3

The backport for 2.15 is in 2.15.5 https://github.com/ansible/ansible/blob/11e50715a369c163c9ffe7e68926b54674d14b8f/changelogs/CHANGELOG-v2.15.rst#L87-L120.

Increasing the timeout didn't help for this scenario, the WSMan service got stuck somewhere on the server side which we have no control over. The work merged into Ansible was designed to handle such a timeout and retry when possible.

agibson2 · 2024-01-18T01:05:48Z

Maybe I just got lucky and it isn't being triggered for whatever reason now for me. Hopefully Redhat update it to 2.15.5.

@brian-lamb-software-engineer To list collection versions...
ansible-galaxy collection list

https://docs.ansible.com/ansible/latest/collections_guide/collections_listing.html

jborean93 transferred this issue from ansible-collections/community.windows Jun 14, 2023

agibson2 mentioned this issue Jun 19, 2023

Add post_reboot_delay option to win_updates module #518

Closed

This was referenced Aug 18, 2023

winrm - make command input more resiliant ansible/ansible#81538

Merged

win_updates always fails with WinRMOperationTimeoutError on starting Round 2 after reboot #511

Closed

jborean93 closed this as completed Sep 26, 2023

ChristianGeiger3 mentioned this issue Apr 18, 2024

ansible winrm connection hangs when connection is bad ansible/ansible#83081

Open

1 task

Sportcast-DE mentioned this issue May 6, 2024

WinRM Operation Timeout vmware-samples/packer-examples-for-vsphere#917

Closed

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

win_updates round 2 ERROR DURING WINRM SEND INPUT WinRMOperationTimeoutError #516

win_updates round 2 ERROR DURING WINRM SEND INPUT WinRMOperationTimeoutError #516

agibson2 commented Jun 14, 2023 •

edited

acobaugh commented Jun 15, 2023

jborean93 commented Jun 27, 2023

nvn17git commented Jul 20, 2023

jborean93 commented Jul 20, 2023

bidetzz commented Aug 10, 2023

MrKementari commented Aug 14, 2023

dolsson01 commented Aug 14, 2023 •

edited

guilhermerocha000 commented Aug 14, 2023

agibson2 commented Aug 14, 2023 •

edited

guilhermerocha000 commented Aug 15, 2023

agibson2 commented Aug 15, 2023 •

edited

MrKementari commented Aug 16, 2023

guilhermerocha000 commented Aug 16, 2023

jborean93 commented Aug 16, 2023

jborean93 commented Aug 17, 2023

agibson2 commented Aug 17, 2023 •

edited

jborean93 commented Aug 17, 2023

agibson2 commented Aug 17, 2023

jborean93 commented Aug 18, 2023

david-sieg commented Sep 15, 2023 •

edited

jborean93 commented Sep 15, 2023

waddelb2 commented Sep 21, 2023

jborean93 commented Sep 26, 2023

agibson2 commented Jan 10, 2024 •

edited

brian-lamb-software-engineer commented Jan 17, 2024 •

edited

jborean93 commented Jan 17, 2024 •

edited

agibson2 commented Jan 18, 2024

win_updates round 2 ERROR DURING WINRM SEND INPUT WinRMOperationTimeoutError #516

win_updates round 2 ERROR DURING WINRM SEND INPUT WinRMOperationTimeoutError #516

Comments

agibson2 commented Jun 14, 2023 • edited

SUMMARY

ISSUE TYPE

COMPONENT NAME

ANSIBLE VERSION

COLLECTION VERSION

CONFIGURATION

OS / ENVIRONMENT

STEPS TO REPRODUCE

EXPECTED RESULTS

ACTUAL RESULTS

acobaugh commented Jun 15, 2023

jborean93 commented Jun 27, 2023

nvn17git commented Jul 20, 2023

jborean93 commented Jul 20, 2023

bidetzz commented Aug 10, 2023

MrKementari commented Aug 14, 2023

dolsson01 commented Aug 14, 2023 • edited

guilhermerocha000 commented Aug 14, 2023

agibson2 commented Aug 14, 2023 • edited

guilhermerocha000 commented Aug 15, 2023

agibson2 commented Aug 15, 2023 • edited

MrKementari commented Aug 16, 2023

guilhermerocha000 commented Aug 16, 2023

jborean93 commented Aug 16, 2023

jborean93 commented Aug 17, 2023

agibson2 commented Aug 17, 2023 • edited

jborean93 commented Aug 17, 2023

agibson2 commented Aug 17, 2023

jborean93 commented Aug 18, 2023

david-sieg commented Sep 15, 2023 • edited

jborean93 commented Sep 15, 2023

waddelb2 commented Sep 21, 2023

jborean93 commented Sep 26, 2023

agibson2 commented Jan 10, 2024 • edited

brian-lamb-software-engineer commented Jan 17, 2024 • edited

jborean93 commented Jan 17, 2024 • edited

agibson2 commented Jan 18, 2024

agibson2 commented Jun 14, 2023 •

edited

dolsson01 commented Aug 14, 2023 •

edited

agibson2 commented Aug 14, 2023 •

edited

agibson2 commented Aug 15, 2023 •

edited

agibson2 commented Aug 17, 2023 •

edited

david-sieg commented Sep 15, 2023 •

edited

agibson2 commented Jan 10, 2024 •

edited

brian-lamb-software-engineer commented Jan 17, 2024 •

edited

jborean93 commented Jan 17, 2024 •

edited