Tasks run via winrm hang on payload send error #79016
Labels
affects_2.15
bug
support:core
windows
Summary
A misbehaving or severely overloaded Windows host may raise an unexpected WinRM Operation Timeout fault while a task payload is being transferred via the `shell/Send` input operation over a `winrm` connection. If this fault occurs during a payload send, the send operation cannot be safely retried, and the state of the target payload can no longer be relied upon (WinRM does not guarantee that the buffer was not partially or completely consumed by the target process), so the overall command execution should fail. When such an error occurs, the connection plugin attempts to retrieve any command output for forensic purposes (e.g., error messages, an Ansible error response), but if the target process is still blocked waiting for a payload, the command output fetch can logically deadlock and run forever, because command output fetching in the underlying pywinrm library runs in a loop with a timeout. This situation was even posited in the original code, but doesn't seem to have been a common problem.

The "easy" fix is to make all input send failures immediately fatal and let the raw exception propagate. However, this would lose any diagnostic result information that the target process may have written to its output/error streams. A better fix would be to do a single final "best effort" output fetch without the loop, then raise `AnsibleConnectionError` with the details, if any. This will require an API change to the underlying pywinrm library, plus conditional code in the `winrm` connection plugin to use the new API if available and fall back to some more robust form of the current behavior if not.

Running tasks `async` against the problem host doesn't always help in this case, since the control-side async timeout loop doesn't start until the exec wrapper has been successfully sent to the target host. The new task `timeout` directive could possibly help, but it relies on the blocking operation running on the worker's main thread and not being in native code, neither of which is guaranteed. Threading the final output fetch with a timeout inside the error handler would also cause problems, since the `finally` block tries to clean up the connection, and the intra-request HTTP connection state would likely be broken by the cancellation operation.

Issue Type
Bug Report
Component Name
winrm
Ansible Version
Configuration
NA
OS / Environment
Any Ansible installation targeting a Windows host that hangs or times out while processing input buffers over WinRM.
Steps to Reproduce
(send a task payload to an exec wrapper that doesn't process it)
Expected Results
A relatively quick failure of the task for the offending host, along with any forensic output emitted by the target host process (maybe none, depending on why the host is misbehaving).
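The expected behavior (fail fast, but keep whatever the target managed to write) could look roughly like the sketch below. It is an illustration under stated assumptions, not the plugin's code: `StubProtocol`, `send_input`, `fetch_output_once`, `OperationTimeout`, and `ConnectionFailure` are hypothetical stand-ins, and a single-shot fetch method like this is exactly the piece that pywinrm would need to expose.

```python
class OperationTimeout(Exception):
    """Hypothetical stand-in for pywinrm's WinRMOperationTimeoutError."""


class ConnectionFailure(Exception):
    """Hypothetical stand-in for Ansible's AnsibleConnectionError."""


class StubProtocol:
    """Hypothetical transport used only for this sketch; not the pywinrm API."""

    def __init__(self, pending_stderr=b"", fetch_times_out=False):
        self.pending_stderr = pending_stderr
        self.fetch_times_out = fetch_times_out
        self.fetch_calls = 0

    def send_input(self, payload):
        # Model the misbehaving host: every payload send times out.
        raise OperationTimeout("operation timeout during send")

    def fetch_output_once(self):
        # Assumed single-shot fetch: one request, no retry loop.
        self.fetch_calls += 1
        if self.fetch_times_out:
            raise OperationTimeout("no output available")
        return b"", self.pending_stderr


def send_payload(protocol, payload):
    """Send the payload; on failure, fetch output once and then fail."""
    try:
        protocol.send_input(payload)
    except OperationTimeout as send_err:
        stdout, stderr = b"", b""
        try:
            # Best effort only: exactly one attempt, never a loop.
            stdout, stderr = protocol.fetch_output_once()
        except OperationTimeout:
            pass  # nothing to report; the send error is still fatal
        details = (stdout + stderr).decode("utf-8", errors="replace").strip()
        message = "winrm send_input failed: %s" % send_err
        if details:
            message += "; target output: %s" % details
        raise ConnectionFailure(message) from send_err
```

Because the fetch is attempted once and its own timeout is swallowed, the task fails in bounded time whether or not the target process produced any output.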
Actual Results
Ansible warning: `ERROR DURING WINRM SEND INPUT - attempting to recover: WinRMOperationTimeoutError`, followed by an indefinite hang.
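The indefinite hang can be modeled with a simplified polling loop. This is a sketch of the general pattern only, not pywinrm's implementation: `BlockedProcess`, `receive`, `fetch_output`, and the `max_polls` cap are hypothetical, and the cap exists solely so the demonstration terminates.

```python
class OperationTimeout(Exception):
    """Hypothetical stand-in for pywinrm's WinRMOperationTimeoutError."""


class BlockedProcess:
    """Models a target process that is still waiting for its payload."""

    def __init__(self):
        self.polls = 0

    def receive(self):
        # The process never writes output or exits, so every poll times out.
        self.polls += 1
        raise OperationTimeout("no output within the operation timeout")


def fetch_output(process, max_polls=None):
    """Poll for output until the command reports completion.

    With max_polls=None the loop is unbounded: a timeout is treated as
    "still running" and retried, so a blocked process is polled forever.
    """
    chunks = []
    polls = 0
    done = False
    while not done:
        if max_polls is not None and polls >= max_polls:
            return b"".join(chunks), False  # gave up; command never finished
        polls += 1
        try:
            chunk, done = process.receive()
            chunks.append(chunk)
        except OperationTimeout:
            continue  # interpreted as a long-running command; retry
    return b"".join(chunks), True


proc = BlockedProcess()
output, finished = fetch_output(proc, max_polls=5)
```

In this model, `finished` stays `False` no matter how large `max_polls` is, which is the deadlock described in the summary: the fetch waits for a process that is itself waiting for input that will never arrive.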