Tasks run via winrm hang on payload send error #79016

nitzmahone · 2022-10-04T00:03:54Z

Summary

A misbehaving or severely-overloaded Windows host may raise an unexpected WinRM Operation Timeout fault while a task payload is being transferred via the shell/Send input operation over a winrm connection. If this fault occurs during a payload send, the send operation cannot be safely retried, and the state of the target payload can no longer be relied upon (as WinRM does not guarantee that the buffer was not partially or completely consumed by the target process), thus the overall command execution should fail. When such an error occurs, the connection plugin attempts to retrieve any command output for forensic purposes (e.g., error messages, an Ansible error response), but if the target process is still blocked waiting for a payload, the command output fetch can logically deadlock and run forever (due to the nature of how command output fetching works in a loop with a timeout in the underlying pywinrm library). This situation was even posited in the original code, but doesn't seem to have been a common problem.

The "easy" fix is to just make all input send failures immediately fatal and allow the raw exception to fly. However, this would lose any diagnostic result information that the target process may have written to its output/error streams. A better fix would be to do a single final "best effort" output fetch without the loop, then raise AnsibleConnectionError with the details, if any. This will require an API change to the underlying pywinrm library and conditional code in the winrm connection plugin to use it if available, and fall back to some more robust form of the current behavior if not.

Running tasks async against the problem host doesn't always help in this case, since the control-side async timeout loop doesn't start until the exec wrapper has been successfully sent to the target host. The new task timeout directive could possibly help, but is reliant on the blocking operation running on the worker's main thread and that the blocking operation isn't in native code, neither of which are guaranteed. Threading the final output fetch with a timeout inside the error handler would also cause problems, since the finally block tries to clean up the connection and the intra-request HTTP connection state would likely be broken by the cancellation operation.

Issue Type

Bug Report

Component Name

winrm

Ansible Version

$ ansible --version

(applies to all pre-2.15 versions of Ansible)

Configuration

NA

OS / Environment

Any Ansible targeting a Windows host that hangs or times out while processing input buffers over WinRM.

Steps to Reproduce

(send a task payload to an exec wrapper that doesn't process it)

Expected Results

A relatively quick failure of the task for the offending host, along with any forensic output emitted by the target host process (maybe none, depending on why the host is misbehaving).

Actual Results

Ansible warning: `ERROR DURING WINRM SEND INPUT - attempting to recover: WinRMOperationTimeoutError`, followed by an indefinite hang.

Code of Conduct

I agree to follow the Ansible Code of Conduct

The text was updated successfully, but these errors were encountered:

ansibot · 2022-10-04T00:08:57Z

Files identified in the description:

lib/ansible/plugins/connection/winrm.py

If these files are incorrect, please update the component name section of the description or use the !component bot command.

click here for bot help

nm17 · 2023-03-22T20:20:11Z

Can this be solved by using something like asyncio.wait_for?

egvimo · 2023-06-22T05:22:41Z

I have exactly the same issue. It happens very often (not always) on performing a service restart via win_service.

dandavids · 2023-09-25T20:49:36Z

Any updates here since its been awhile and having the same issue as well.

@nitzmahone or @egvimo - were you able to find a work around/solution to this issue?

jborean93 · 2023-09-25T20:52:21Z

#81538 may be helpful. It tries to handle a send input failure a bit better than it does today.

nitzmahone · 2024-05-01T19:57:42Z

Should be addressed by the above PR...

ansibot added affects_2.15 bug This issue/PR relates to a bug. needs_triage Needs a first human triage before being processed. support:core This issue/PR relates to code supported by the Ansible Engineering Team. windows Windows community labels Oct 4, 2022

nitzmahone removed the needs_triage Needs a first human triage before being processed. label Oct 4, 2022

nitzmahone self-assigned this Oct 4, 2022

ChristianGeiger3 mentioned this issue Apr 18, 2024

ansible winrm connection hangs when connection is bad #83081

Open

1 task

nitzmahone closed this as completed May 1, 2024

ansible locked and limited conversation to collaborators May 15, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tasks run via winrm hang on payload send error #79016

Tasks run via winrm hang on payload send error #79016

nitzmahone commented Oct 4, 2022 •

edited

Loading

ansibot commented Oct 4, 2022

nm17 commented Mar 22, 2023 •

edited

Loading

egvimo commented Jun 22, 2023

dandavids commented Sep 25, 2023 •

edited

Loading

jborean93 commented Sep 25, 2023

nitzmahone commented May 1, 2024

Tasks run via winrm hang on payload send error #79016

Tasks run via winrm hang on payload send error #79016

Comments

nitzmahone commented Oct 4, 2022 • edited Loading

Summary

Issue Type

Component Name

Ansible Version

Configuration

OS / Environment

Steps to Reproduce

Expected Results

Actual Results

Code of Conduct

ansibot commented Oct 4, 2022

nm17 commented Mar 22, 2023 • edited Loading

egvimo commented Jun 22, 2023

dandavids commented Sep 25, 2023 • edited Loading

jborean93 commented Sep 25, 2023

nitzmahone commented May 1, 2024

nitzmahone commented Oct 4, 2022 •

edited

Loading

nm17 commented Mar 22, 2023 •

edited

Loading

dandavids commented Sep 25, 2023 •

edited

Loading