Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

win_updates round 2 ERROR DURING WINRM SEND INPUT WinRMOperationTimeoutError #516

Closed
agibson2 opened this issue Jun 14, 2023 · 27 comments
Closed

Comments

@agibson2
Copy link
Contributor

agibson2 commented Jun 14, 2023

SUMMARY

Applying windows updates to servers with ansible.windows/community.windows version 2.0.0 give the error below. When the error gets triggered, it will just sit for hours and no further output is seen until I CTRL-C the process. This is tested against Windows 2019 server.

Does not happen with ansible.windows/community.windows 1.12.0

If I install ansible.windows and collection.windows (seems to use ansible.windows going by -vvvvv output as I am not specifying which one to use... just using win_updates in playbook.

ISSUE TYPE

Ansible never completes after the error and just doesn't display any more output. I see communications going to the servers (maybe every few seconds a query of some kind is sent) but nothing appears to be done. I waited hours and nothing else is output.

COMPONENT NAME

win_updates

ANSIBLE VERSION

Using OS supplied Ansible version.

ansible [core 2.14.2]
  config file = /etc/ansible/ansible.cfg
  configured module search path = ['/root/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python3.11/site-packages/ansible
  ansible collection location = /root/.ansible/collections:/usr/share/ansible/collections
  executable location = /usr/bin/ansible
  python version = 3.11.2 (main, Feb 16 2023, 00:00:00) [GCC 11.3.1 20221121 (Red Hat 11.3.1-4)] (/usr/bin/python3.11)
  jinja version = 3.1.2
  libyaml = True
COLLECTION VERSION

Using collections installed through ansible-galaxy command.

Collection        Version
----------------- -------
ansible.posix     1.5.4  
ansible.windows   2.0.0 
community.windows 2.0.0

CONFIGURATION
CONFIG_FILE() = /etc/ansible/ansible.cfg
OS / ENVIRONMENT

Rocky Linux 9.2 (based on RHEL 9.2)

STEPS TO REPRODUCE

Apply the updates. Get the error.

Initiate update and right when round 2 starts (after reboot check is successful), the error below occurs. (only occurs if windows updates are applied to get to round 2)

- name: "{{ 'Check' if ansible_check_mode else 'Apply' }} updates"
  win_updates:
    category_names:
      - SecurityUpdates
      - CriticalUpdates
      - Updates
      - Update Rollups
      - FeaturePacks
    state: installed
    reboot: yes
    skip_optional: yes
    reboot_timeout: 1800
    log_path: C:\Support\win_updates\apply_wulog.txt
    reject_list:
      - ' Preview '
  register: update_result

EXPECTED RESULTS

Update to complete and successfully complete round 2 checks

ACTUAL RESULTS

Output where it shows the update succeeds and reboot is successful, but on round 2 you get the error.
https://gist.github.com/agibson2/a6e8747ea5042156b04f5bd6ff30385b

Last part of the output...

win_updates: resetting connection plugin
<192.168.1.35> ESTABLISH WINRM CONNECTION FOR USER: adminuser on PORT 5985 TO 192.168.1.35
<192.168.1.35> WINRM CONNECT: transport=ntlm endpoint=http://192.168.1.35:5985/wsman
<192.168.1.35> WINRM OPEN SHELL: 8B1A0B32-3BF2-4F98-8D1C-DBC3EB8E7169
win_updates: attempting to get system boot time
win_updates: getting boot time
win_updates: running command: (Get-CimInstance -ClassName Win32_OperatingSystem -Property LastBootUpTime).LastBootUpTime.ToFileTime()
EXEC (via pipeline wrapper)
<192.168.1.35> WINRM EXEC 'PowerShell' ['-NoProfile', '-NonInteractive', '-ExecutionPolicy', 'Unrestricted', '-EncodedCommand', '--CUT---']
<192.168.1.35> WINRM RESULT '<Response code 0, out "133312528185000000\r\n", err "#< CLIXML\r\n<Objs Ver">'
win_updates: command result - rc: 0, stdout: 133312528185000000, stderr: 
win_updates: last boot time: 133312528185000000
win_updates: last boot time check success
win_updates running post reboot test command
win_updates: attempting post-reboot test command
win_updates: running command: Get-Item -LiteralPath 'HKLM:\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon\AutoLogonChecked' -ErrorAction Stop; 'success-07f8a835-2512-44b5-9aa1-53f60ad2dd9d'
EXEC (via pipeline wrapper)
<192.168.1.35> WINRM EXEC 'PowerShell' ['-NoProfile', '-NonInteractive', '-ExecutionPolicy', 'Unrestricted', '-EncodedCommand', '--CUT--']
<192.168.1.35> WINRM RESULT '<Response code 0, out "\r\n\r\n    Hive: HKEY_L", err "#< CLIXML\r\n<Objs Ver">'
win_updates: command result - rc: 0, stdout: Hive: HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon


Name                           Property                                                                                
----                           --------                                                                                
AutoLogonChecked                                                                                                       
success-07f8a835-2512-44b5-9aa1-53f60ad2dd9d, stderr: 
win_updates: post-reboot test command success
win_updates: system successfully rebooted
<192.168.1.35> Running win_updates - round 2
<192.168.1.35> Starting update task
Using module file /home/ansible/.ansible/collections/ansible_collections/ansible/windows/plugins/modules/win_updates.ps1
Pipelining is enabled.
EXEC (via pipeline wrapper)
<192.168.1.35> WINRM EXEC 'PowerShell' ['-NoProfile', '-NonInteractive', '-ExecutionPolicy', 'Unrestricted', '-EncodedCommand', '---CUT---']
[WARNING]: ERROR DURING WINRM SEND INPUT - attempting to recover:
WinRMOperationTimeoutError
@jborean93 jborean93 transferred this issue from ansible-collections/community.windows Jun 14, 2023
@acobaugh
Copy link

We actually see this same error, at least with 2019 AWS AMIs, with 1.10.0, 1.12.0, and 2.0.0 (haven't tried any others). This started for us this week. I also wonder if #511 is the same issue. Fwiw, it seems like the 2022 AWS AMIs update fine.

@jborean93
Copy link
Collaborator

Thanks for the bug report, we've had a few other issues reported for similar errors. It seems like we need to tighten up some of the code that handles these cases when the connection is flaky.

@nvn17git
Copy link

@jborean93 - has this been fixed? We are facing the same issue.

@jborean93
Copy link
Collaborator

No, I’m currently knee deep in trying to make things better with this plugin but it’s very slow going. Try the psrp connection plugin as it should hopefully make the connection a bit more stable.

@bidetzz
Copy link

bidetzz commented Aug 10, 2023

I'm also having this issue. any updates ?

@MrKementari
Copy link

Same issue here, with version 1.13.0.
Is there a workaround while this bug is being fixed ?

@dolsson01
Copy link

dolsson01 commented Aug 14, 2023

This has worked for me though there is nothing to stop the timeout from happening in the rescue.

Improvements welcome.

    - name: Install updates
      block:
        - name: Run win_updates
          ansible.windows.win_updates:
            category_names: '*'
            reboot: yes
          timeout: 1500
      rescue:
        - name: Run win_updates again
          ansible.windows.win_updates:
            category_names: '*'
            reboot: yes

@guilhermerocha000
Copy link

As I can dig into, I found this:
https://docs.ansible.com/ansible/latest/os_guide/windows_winrm.html#inventory-options
image

The default value it is not enough for server reboot.

There is a rule to set that time. So it is a better choice use the 'connection_timeout' option (it set both, operational and read timeout values based on that rule):
ansible_winrm_connection_timeout: 1200


Furthermore another important config:
https://docs.ansible.com/ansible/latest/os_guide/windows_winrm.html#winrm-limitations

image

ansible_winrm_kerberos_delegation: true


So this config worked for me (about 20 minutes):
ansible_winrm_connection_timeout: 1200
ansible_winrm_kerberos_delegation: true

@agibson2
Copy link
Contributor Author

agibson2 commented Aug 14, 2023

There is already a dedicated timeout that you can set for a reboot timeout (reboot_timeout) once the current issues are resolved where winrm handles retries better.

I don't think the ones you mention are a good thing to set for a reboot timeout as that could delay individual winrm commands not related to reboot for 20 minutes during a connection issue with the server. Could delay your updates more than needed for some intermittent network connectivity (over VPN, etc). A more robust retry mechanism would be better which I think is what is being worked on.

@guilhermerocha000
Copy link

Reboot timeout didn't solves the problem. It's about winRM connection timeout.
Reboot its just an action that make winRM lost connection long enough to achieve default 30 seconds.
And it's not a sleep time. So, will not delay update installation.
.
It's not a better solution. But it's a workaround that solves the problem for now.

@agibson2
Copy link
Contributor Author

agibson2 commented Aug 15, 2023

I didn't mean it was a sleep timeout. I meant that adding that timeout could affect more areas than expected. What if state is lost in a firewall somewhere, it would take 20 minutes for that connection to timeout if the firewall doesn't send TCP reset (some firewalls don't). So you would have to wait for that long 20 minute timeout in a section of code that is not related to rebooting where normally it would have retried in 30 seconds.

@MrKementari
Copy link

I already tried to set those two variables :

ansible_winrm_operation_timeout_sec: 3600
ansible_winrm_read_timeout_sec: 3600

No changes from my side, same issue.
I'm going to run a few tests today with psrp, and hope the situation improves.

@guilhermerocha000
Copy link

I already tried to set those two variables :

ansible_winrm_operation_timeout_sec: 3600
ansible_winrm_read_timeout_sec: 3600

No changes from my side, same issue. I'm going to run a few tests today with psrp, and hope the situation improves.

Just try with these two

ansible_winrm_connection_timeout: 1200 
ansible_winrm_kerberos_delegation: true

@jborean93
Copy link
Collaborator

@MrKementari can you run with -vvv and share the raw output? It will be a big ask but if you or anyone else in this thread can capture the wireshark traffic of this exchange with the send input error it will greatly help try to identify what's happening. Unfortunately I'm very limited in what I can fix if I cannot reproduce the error. In the past 3 weeks of improving this module I have not encountered this problem once so I'm unsure what it could be.

@jborean93
Copy link
Collaborator

@agibson2 is this something that reliably reproduces for you? Is the Server 2019 host a local VM or something else. Is it using some image that I could probably try out if it's public?

@agibson2
Copy link
Contributor Author

agibson2 commented Aug 17, 2023

It is a VM. I have a few systems I test before I deploy updates to other systems and that is how I caught it. I should be able to reproduce it. I might have to try backing out the last Windows updates if I were to try it again as I don't think I have a snapshot of it from before the last update. Do you want me to try the newer code that you have been working on lately to see if that fixes the problem? I am not sure if that work is in main branch or not.

It is basically a fresh install of 2019 Server on an ESXi host (no other software installed) with updates that I apply each month. Nothing special.

I am using Rocky Linux 9.2 with all updates applied and the supplied ansible-core RPM installed (don't recall if that is EPEL rpm or what is provided by redhat). For the community modules, I am using ansible-galaxy. I had to install the python winrm of course (using the correct python pip for the python version needed).

@jborean93
Copy link
Collaborator

While the newer code should simplify some of the cleanup operations I do believe this particular issue is still present and hasn't been fixed with any of the other changes made recently.

When looking at the code in the winrm connection plugin I can see why it is hanging but I cannot see how it gets into the state in the first place. Essentially Ansible starts a new powershell.exe process and attempts to send data to the stdin process. This send operation fails with a WinRMOperationTimeoutError which is when the WinRM service doesn't respond in the operation timeout requested. As the stdin data was never fully written by the service the subsequent wait that Ansible does for the process to finish does not occur causing the hang.

The trouble is I cannot reproduce the error myself. Creating a reproducer is key to figuring out safe ways to try and solve these problems and verify that the fix actually works. I have 1 or 2 more things to try that might work but finding out more information in user environments will greatly help figure out what might be going on.

@agibson2
Copy link
Contributor Author

I checked and the ansible-core I am using is what is provided by Rocky and not EPEL.

@jborean93
Copy link
Collaborator

While I do not have a bullet proof solution I do have a way to try and mitigate this problem. I've opened the PR ansible/ansible#81538 to make the send input operation more resiliant by handling the WinRMOperationTimeoutError in a special way. From testing when the server is under really heavy load the WSMan Send operation can fail with this WSMan fault but in reality the data was still written to the process' stdin. Taking this into account the winrm connection plugin can ignore this exception and continue sending the remaining input (with some extra tolerance in case it's still busy). It isn't perfect but it should avoid this particular problem and from testing with a VM with 1 CPU I can see that it no longer hangs at that step.

I will mention that the psrp connection plugin is already more resiliant against this problem and while not perfect it should work today against this problem.

Unfortunately as this is a server side load issue there isn't really much else we can do here. I had to reduce my testing VM to 1 CPU and even then it didn't always reproduce the error here. At some point Ansible can't do much more except offer a way to override the operation timeout which is through the connection_timeout option or by setting ansible_winrm_operation_timeout_sec and ansible_winrm_read_timeout_sec (read timeout should be higher than operation).

@david-sieg
Copy link

david-sieg commented Sep 15, 2023

I tried new collection version 2.1.0 without success, timeouts also occurs.

`[WARNING]: ERROR DURING WINRM SEND INPUT - attempting to recover:
WinRMOperationTimeoutError

[WARNING]: ERROR DURING WINRM SEND INPUT - attempting to recover:
WinRMOperationTimeoutError

[WARNING]: ERROR DURING WINRM SEND INPUT - attempting to recover:
WinRMOperationTimeoutError

[WARNING]: ERROR DURING WINRM SEND INPUT - attempting to recover:
WinRMOperationTimeoutError

[WARNING]: Unknown failure when polling update result - attempting to cancel
task: HTTPSConnectionPool(host='172.21.110.2', port=5986): Max retries exceeded
with url: /wsman (Caused by
NewConnectionError('<urllib3.connection.HTTPSConnection object at
0x77b15e245f10>: Failed to establish a new connection: [Errno 111] Connection
refused'))

[WARNING]: Unknown failure when cancelling update task:
HTTPSConnectionPool(host='172.21.110.2', port=5986): Max retries exceeded with
url: /wsman (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection
object at 0x77b15c5f9070>: Failed to establish a new connection: [Errno 111]
Connection refused'))

[WARNING]: ERROR DURING WINRM SEND INPUT - attempting to recover:
WinRMOperationTimeoutError

[WARNING]: ERROR DURING WINRM SEND INPUT - attempting to recover:
WinRMOperationTimeoutError`

@jborean93
Copy link
Collaborator

The send input recovery stuff is part of Ansible and not this collection. Unfortunately it has not been merged yet as it's waiting on a review from one of my colleagues ansible/ansible#81538

@waddelb2
Copy link

Does that mean there's a fix for this getting merged soon? I tried psrp and now I'm running into new, intermittent issues that I didn't experience with WinRM

@jborean93
Copy link
Collaborator

The PR ansible/ansible#81538 has been merged and will be present in subsequent Ansible releases. Backport PRs going back to Ansible 2.14 have been opened but there are no guarantees that they will be accepted.

@agibson2
Copy link
Contributor Author

agibson2 commented Jan 10, 2024

I just want to say thanks for the work on this and that I haven't experienced this anymore in months since using updated versions on Rocky Linux 9. I use ansible-core from RL9 and then use ansible-galaxy for all the rest of the things like ansible.posix, ansible.windows, community.general, community.windows.

ansible [core 2.14.9]
config file = /etc/ansible/ansible.cfg
configured module search path = ['/home/ansible/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /usr/lib/python3.9/site-packages/ansible
ansible collection location = /home/ansible/.ansible/collections:/usr/share/ansible/collections
executable location = /usr/bin/ansible
python version = 3.9.18 (main, Sep 7 2023, 00:00:00) [GCC 11.4.1 20230605 (Red Hat 11.4.1-2)] (/usr/bin/python3)
jinja version = 3.1.2
libyaml = True

Collection Version


ansible.posix 1.5.4
ansible.windows 2.2.0
community.general 8.2.0
community.windows 2.1.0

@brian-lamb-software-engineer
Copy link

I've been experiencing this issue still

Ansible 2.15.3
python 3.11.5

So not sure about the last comment. I cant figure out the easiest way to check ansible plugin versions with out using a task for that, i need a command so i can check easier.

I suppose we can check into psrp if needed, or if anybody has pinned down which time to increase that may be helpful, but i dont want to update the timeout ansible wide, it would be great if theres a way to update the timeout only for that task/play, etc..

@jborean93
Copy link
Collaborator

jborean93 commented Jan 17, 2024

Ansible 2.15.3

The backport for 2.15 is in 2.15.5 https://github.com/ansible/ansible/blob/11e50715a369c163c9ffe7e68926b54674d14b8f/changelogs/CHANGELOG-v2.15.rst#L87-L120.

Increasing the timeout didn't help for this scenario, the WSMan service got stuck somewhere on the server side which we have no control over. The work merged into Ansible was designed to handle such a timeout and retry when possible.

@agibson2
Copy link
Contributor Author

Maybe I just got lucky and it isn't being triggered for whatever reason now for me. Hopefully Redhat update it to 2.15.5.

@brian-lamb-software-engineer To list collection versions...
ansible-galaxy collection list

https://docs.ansible.com/ansible/latest/collections_guide/collections_listing.html

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests