Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tasks run via winrm hang on payload send error #79016

Closed
1 task done
nitzmahone opened this issue Oct 4, 2022 · 6 comments
Closed
1 task done

Tasks run via winrm hang on payload send error #79016

nitzmahone opened this issue Oct 4, 2022 · 6 comments
Assignees
Labels
affects_2.15 bug This issue/PR relates to a bug. support:core This issue/PR relates to code supported by the Ansible Engineering Team. windows Windows community

Comments

@nitzmahone
Copy link
Member

nitzmahone commented Oct 4, 2022

Summary

A misbehaving or severely-overloaded Windows host may raise an unexpected WinRM Operation Timeout fault while a task payload is being transferred via the shell/Send input operation over a winrm connection. If this fault occurs during a payload send, the send operation cannot be safely retried, and the state of the target payload can no longer be relied upon (as WinRM does not guarantee that the buffer was not partially or completely consumed by the target process), thus the overall command execution should fail. When such an error occurs, the connection plugin attempts to retrieve any command output for forensic purposes (e.g., error messages, an Ansible error response), but if the target process is still blocked waiting for a payload, the command output fetch can logically deadlock and run forever (due to the nature of how command output fetching works in a loop with a timeout in the underlying pywinrm library). This situation was even posited in the original code, but doesn't seem to have been a common problem.

The "easy" fix is to just make all input send failures immediately fatal and allow the raw exception to fly. However, this would lose any diagnostic result information that the target process may have written to its output/error streams. A better fix would be to do a single final "best effort" output fetch without the loop, then raise AnsibleConnectionError with the details, if any. This will require an API change to the underlying pywinrm library and conditional code in the winrm connection plugin to use it if available, and fall back to some more robust form of the current behavior if not.

Running tasks async against the problem host doesn't always help in this case, since the control-side async timeout loop doesn't start until the exec wrapper has been successfully sent to the target host. The new task timeout directive could possibly help, but is reliant on the blocking operation running on the worker's main thread and that the blocking operation isn't in native code, neither of which are guaranteed. Threading the final output fetch with a timeout inside the error handler would also cause problems, since the finally block tries to clean up the connection and the intra-request HTTP connection state would likely be broken by the cancellation operation.

Issue Type

Bug Report

Component Name

winrm

Ansible Version

$ ansible --version

(applies to all pre-2.15 versions of Ansible)

Configuration

NA

OS / Environment

Any Ansible targeting a Windows host that hangs or times out while processing input buffers over WinRM.

Steps to Reproduce

(send a task payload to an exec wrapper that doesn't process it)

Expected Results

A relatively quick failure of the task for the offending host, along with any forensic output emitted by the target host process (maybe none, depending on why the host is misbehaving).

Actual Results

Ansible warning: `ERROR DURING WINRM SEND INPUT - attempting to recover: WinRMOperationTimeoutError`, followed by an indefinite hang.

Code of Conduct

  • I agree to follow the Ansible Code of Conduct
@ansibot
Copy link
Contributor

ansibot commented Oct 4, 2022

Files identified in the description:

If these files are incorrect, please update the component name section of the description or use the !component bot command.

click here for bot help

@ansibot ansibot added affects_2.15 bug This issue/PR relates to a bug. needs_triage Needs a first human triage before being processed. support:core This issue/PR relates to code supported by the Ansible Engineering Team. windows Windows community labels Oct 4, 2022
@nitzmahone nitzmahone removed the needs_triage Needs a first human triage before being processed. label Oct 4, 2022
@nitzmahone nitzmahone self-assigned this Oct 4, 2022
@nm17
Copy link

nm17 commented Mar 22, 2023

Can this be solved by using something like asyncio.wait_for?

@egvimo
Copy link

egvimo commented Jun 22, 2023

I have exactly the same issue. It happens very often (not always) on performing a service restart via win_service.

@dandavids
Copy link

dandavids commented Sep 25, 2023

Any updates here since its been awhile and having the same issue as well.

@nitzmahone or @egvimo - were you able to find a work around/solution to this issue?

@jborean93
Copy link
Contributor

#81538 may be helpful. It tries to handle a send input failure a bit better than it does today.

@nitzmahone
Copy link
Member Author

Should be addressed by the above PR...

@ansible ansible locked and limited conversation to collaborators May 15, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
affects_2.15 bug This issue/PR relates to a bug. support:core This issue/PR relates to code supported by the Ansible Engineering Team. windows Windows community
Projects
None yet
Development

No branches or pull requests

6 participants