win_updates round 2 ERROR DURING WINRM SEND INPUT WinRMOperationTimeoutError #516
Comments
We actually see this same error, at least with 2019 AWS AMIs, with |
Thanks for the bug report, we've had a few other issues reported for similar errors. It seems like we need to tighten up some of the code that handles these cases when the connection is flaky. |
@jborean93 - has this been fixed? We are facing the same issue. |
No, I’m currently knee deep in trying to make things better with this plugin but it’s very slow going. Try the psrp connection plugin as it should hopefully make the connection a bit more stable. |
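For anyone wanting to try that suggestion, a minimal inventory sketch for switching a group of hosts to the psrp connection plugin might look like the following. The group name, host name, and the chosen protocol and auth values are assumptions for illustration only, and the controller needs the pypsrp Python package installed.

```yaml
# Hypothetical inventory: the group and host names are placeholders.
all:
  children:
    windows:
      hosts:
        win2019.example.com:
      vars:
        # Use PowerShell Remoting (psrp) instead of the winrm plugin.
        ansible_connection: psrp
        # Assumed values; adjust to match how the host is configured.
        ansible_psrp_protocol: https
        ansible_psrp_auth: negotiate
```

Nothing else in the playbook should need to change, since the same modules run over either connection plugin.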
I'm also having this issue. Any updates? |
Same issue here, with version 1.13.0. |
This has worked for me though there is nothing to stop the timeout from happening in the rescue. Improvements welcome.
|
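The snippet from that comment is not preserved here. As a rough sketch of the general idea only (not the commenter's actual tasks), a block with a rescue section can rerun the update task once after the first attempt fails. The timeout value and task names below are assumptions.

```yaml
# Illustrative sketch only; not the original commenter's playbook.
- name: Install Windows updates, retrying once on failure
  block:
    - name: Install updates (first attempt)
      ansible.windows.win_updates:
        reboot: true
  rescue:
    - name: Wait until the host accepts connections again
      ansible.builtin.wait_for_connection:
        timeout: 600  # assumed value, in seconds

    - name: Install updates (second attempt)
      ansible.windows.win_updates:
        reboot: true
```

As the comment notes, the second attempt can hit the same timeout. Also keep in mind that a rescue section runs only when a task fails, not when the host is marked unreachable.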
After digging into this, here is what I found: The default value is not enough for a server reboot. There is a rule for setting that time. So it is better to use the 'connection_timeout' option (it sets both the operation and read timeout values based on that rule): Another important setting is ansible_winrm_kerberos_delegation: true. This config worked for me (about 20 minutes): |
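The configuration from that comment is not preserved here. A hedged sketch of what such host or group variables could look like is shown below, assuming the winrm connection plugin and taking "about 20 minutes" as 1200 seconds. The exact values the commenter used are unknown.

```yaml
# Illustrative group_vars fragment; values are assumptions, not the commenter's.
ansible_connection: winrm

# Sets both the WinRM operation timeout and the read timeout in one place.
# pywinrm expects the read timeout to be larger than the operation timeout,
# so avoid also setting ansible_winrm_operation_timeout_sec or
# ansible_winrm_read_timeout_sec alongside this variable.
ansible_winrm_connection_timeout: 1200  # seconds, roughly 20 minutes

# Mentioned in the comment above; only relevant when using Kerberos auth.
ansible_winrm_kerberos_delegation: true
```

A timeout this long applies to every WinRM operation on the host, not just the ones around a reboot.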
There is already a dedicated timeout that you can set for a reboot timeout (reboot_timeout) once the current issues are resolved where winrm handles retries better. I don't think the ones you mention are a good thing to set for a reboot timeout as that could delay individual winrm commands not related to reboot for 20 minutes during a connection issue with the server. Could delay your updates more than needed for some intermittent network connectivity (over VPN, etc). A more robust retry mechanism would be better which I think is what is being worked on. |
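For reference, the dedicated reboot timeout mentioned here is an option on the win_updates task itself rather than a connection setting. A minimal sketch, with an illustrative value:

```yaml
# Minimal sketch; the 1800 second value is only an example.
- name: Install updates and allow up to 30 minutes for each reboot
  ansible.windows.win_updates:
    reboot: true
    reboot_timeout: 1800
```

Because this only governs how long the module waits for the host to come back after a reboot, it does not lengthen unrelated WinRM commands.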
The reboot timeout didn't solve the problem. It's about the WinRM connection timeout. |
I didn't mean it was a sleep timeout. I meant that adding that timeout could affect more areas than expected. What if state is lost in a firewall somewhere, it would take 20 minutes for that connection to timeout if the firewall doesn't send TCP reset (some firewalls don't). So you would have to wait for that long 20 minute timeout in a section of code that is not related to rebooting where normally it would have retried in 30 seconds. |
I already tried setting those two variables:
No changes from my side, same issue. |
Just try with these two
|
@MrKementari can you run with |
@agibson2 is this something that reliably reproduces for you? Is the Server 2019 host a local VM or something else. Is it using some image that I could probably try out if it's public? |
It is a VM. I have a few systems I test before I deploy updates to other systems and that is how I caught it. I should be able to reproduce it. I might have to try backing out the last Windows updates if I were to try it again as I don't think I have a snapshot of it from before the last update. Do you want me to try the newer code that you have been working on lately to see if that fixes the problem? I am not sure if that work is in main branch or not. It is basically a fresh install of 2019 Server on an ESXi host (no other software installed) with updates that I apply each month. Nothing special. I am using Rocky Linux 9.2 with all updates applied and the supplied ansible-core RPM installed (don't recall if that is EPEL rpm or what is provided by redhat). For the community modules, I am using ansible-galaxy. I had to install the python winrm of course (using the correct python pip for the python version needed). |
While the newer code should simplify some of the cleanup operations I do believe this particular issue is still present and hasn't been fixed with any of the other changes made recently. When looking at the code in the The trouble is I cannot reproduce the error myself. Creating a reproducer is key to figuring out safe ways to try and solve these problems and verify that the fix actually works. I have 1 or 2 more things to try that might work but finding out more information in user environments will greatly help figure out what might be going on. |
I checked and the ansible-core I am using is what is provided by Rocky and not EPEL. |
While I do not have a bulletproof solution, I do have a way to try and mitigate this problem. I've opened the PR ansible/ansible#81538 to make the send input operation more resilient by handling the I will mention that the Unfortunately, as this is a server-side load issue, there isn't really much else we can do here. I had to reduce my testing VM to 1 CPU and even then it didn't always reproduce the error here. At some point Ansible can't do much more except offer a way to override the operation timeout, which is through the connection_timeout option or by setting |
I tried the new collection version 2.1.0 without success; the timeouts still occur.
[WARNING]: ERROR DURING WINRM SEND INPUT - attempting to recover:
[WARNING]: ERROR DURING WINRM SEND INPUT - attempting to recover:
[WARNING]: ERROR DURING WINRM SEND INPUT - attempting to recover:
[WARNING]: ERROR DURING WINRM SEND INPUT - attempting to recover:
[WARNING]: Unknown failure when polling update result - attempting to cancel
[WARNING]: Unknown failure when cancelling update task:
[WARNING]: ERROR DURING WINRM SEND INPUT - attempting to recover:
[WARNING]: ERROR DURING WINRM SEND INPUT - attempting to recover: |
The send input recovery stuff is part of Ansible and not this collection. Unfortunately it has not been merged yet as it's waiting on a review from one of my colleagues ansible/ansible#81538 |
Does that mean there's a fix for this getting merged soon? I tried psrp and now I'm running into new, intermittent issues that I didn't experience with WinRM |
The PR ansible/ansible#81538 has been merged and will be present in subsequent Ansible releases. Backport PRs going back to Ansible 2.14 have been opened but there are no guarantees that they will be accepted. |
I just want to say thanks for the work on this and that I haven't experienced this anymore in months since using updated versions on Rocky Linux 9. I use ansible-core from RL9 and then use ansible-galaxy for all the rest of the things like ansible.posix, ansible.windows, community.general, community.windows. ansible [core 2.14.9] Collection Version ansible.posix 1.5.4 |
I'm still experiencing this issue with Ansible So I'm not sure about the last comment. I can't figure out the easiest way to check Ansible plugin versions without using a task for that; I need a command so I can check more easily. I suppose we can check into |
The backport for 2.15 is in 2.15.5 https://github.com/ansible/ansible/blob/11e50715a369c163c9ffe7e68926b54674d14b8f/changelogs/CHANGELOG-v2.15.rst#L87-L120. Increasing the timeout didn't help for this scenario, the WSMan service got stuck somewhere on the server side which we have no control over. The work merged into Ansible was designed to handle such a timeout and retry when possible. |
Maybe I just got lucky and it isn't being triggered for whatever reason now for me. Hopefully Red Hat updates it to 2.15.5. @brian-lamb-software-engineer To list collection versions... https://docs.ansible.com/ansible/latest/collections_guide/collections_listing.html |
SUMMARY
Applying Windows updates to servers with ansible.windows/community.windows version 2.0.0 gives the error below. When the error is triggered, the run just sits for hours and no further output is seen until I CTRL-C the process. This was tested against Windows Server 2019.
Does not happen with ansible.windows/community.windows 1.12.0
I install both ansible.windows and community.windows (it seems to use ansible.windows going by the -vvvvv output, as I am not specifying which one to use; I am just using win_updates in the playbook).
ISSUE TYPE
Ansible never completes after the error and just doesn't display any more output. I see communications going to the servers (maybe every few seconds a query of some kind is sent) but nothing appears to be done. I waited hours and nothing else is output.
COMPONENT NAME
win_updates
ANSIBLE VERSION
Using OS supplied Ansible version.
COLLECTION VERSION
Using collections installed through ansible-galaxy command.
CONFIGURATION
OS / ENVIRONMENT
Rocky Linux 9.2 (based on RHEL 9.2)
STEPS TO REPRODUCE
Apply the updates. Get the error.
Initiate update and right when round 2 starts (after reboot check is successful), the error below occurs. (only occurs if windows updates are applied to get to round 2)
EXPECTED RESULTS
The update completes and the round 2 checks finish successfully.
ACTUAL RESULTS
Output where it shows the update succeeds and reboot is successful, but on round 2 you get the error.
https://gist.github.com/agibson2/a6e8747ea5042156b04f5bd6ff30385b
Last part of the output...