Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

winrm - make command input more resiliant #81538

Merged
merged 2 commits into from
Sep 26, 2023

Conversation

jborean93
Copy link
Contributor

SUMMARY

Attempts to improve the handling of failures with the winrm send stdin input functions to better handle timeout conditions on the server when it can be safely ignored.

Related: ansible-collections/ansible.windows#516

ISSUE TYPE
  • Bugfix Pull Request
ADDITIONAL INFORMATION

The current behaviour is to treat any errors as being ignored so this new retry behaviour shouldn't introduce any extra issues. It should only improve the chances of being able to recover from these types of failures.

@ansibot ansibot added bug This issue/PR relates to a bug. needs_triage Needs a first human triage before being processed. labels Aug 18, 2023
@jborean93
Copy link
Contributor Author

I was using the following script to test out the server side behaviour when a send operation failed with an operation timeout. It verifies that the data is actually written to the stdin pipe and the 2nd attempt but need a delayed retry in case it is still busy writing. The changes in the PR have also been tested under a win_updates run with reboot: true with a host with only 1 CPU to induce a more likely load scenario when this could occur.

import base64
import time

import winrm
import winrm.exceptions

PWSH_CMD = r'''
if (Test-Path -Path C:\temp\test.txt) {
    Remove-Item C:\temp\test.txt -Force
}
Add-Content C:\temp\test.txt -Value start
$input | ForEach-Object { $_ | Add-Content C:\temp\test.txt }
Add-Content C:\temp\test.txt -Value finish
'done'
'''

def main():
    p = winrm.Protocol(
        "http://192.168.121.151:5985/wsman",
        transport="basic",
        username="vagrant",
        password="vagrant",
    )
    p.operation_timeout_sec = 2
    shell_id = p.open_shell()
    try:
        cmd_id = p.run_command(
            shell_id,
            "powershell.exe",
            ["-EncodedCommand", base64.b64encode(PWSH_CMD.encode("utf-16-le")).decode()],
        )

        inputs = [
            ((b"a" * 250000) + b"end1", False),
            ((b"b" * 250000) + b"end2", True),
        ]
        try:
            p.operation_timeout_sec = 2
            try:
                for in_data, end in inputs:
                    while True:
                        try:
                            print("Sending input")
                            p.send_command_input(
                                shell_id,
                                cmd_id,
                                in_data,
                                end=end,
                            )
                        except winrm.exceptions.WinRMOperationTimeoutError:
                            print("Operation timeout, ignoring")
                        except winrm.exceptions.WinRMError as e:
                            print(f"Transport error: {e}")
                            if "'wsmanfault_code': '170'" not in str(e):
                                raise
                            time.sleep(5)
                            continue

                        break
            finally:
                p.operation_timeout_sec = winrm.Protocol.DEFAULT_OPERATION_TIMEOUT_SEC

            print("Getting output")
            stdout, stderr, rc = p.get_command_output(shell_id, cmd_id)
            print(f"RC: {rc}\nSTDOUT:\n{stdout.decode('utf-8')}\nSTDERR:\n{stderr.decode('utf-8')}")
        finally:
            p.cleanup_command(shell_id, cmd_id)

    finally:
        p.close_shell(shell_id)

if __name__ == '__main__':
    main()

Copy link
Member

@nitzmahone nitzmahone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume the 5s delay was arrived at empirically- it feels somewhat long to me, but definitely not unreasonable, so if 3x 5s makes more sense than like 6-8x 2s, WFM.

Thanks for tracking this one down! Does pypsrp have a similar issue on input sends, or was it already more robust?

@jborean93
Copy link
Contributor Author

I assume the 5s delay was arrived at empirically- it feels somewhat long to me, but definitely not unreasonable, so if 3x 5s makes more sense than like 6-8x 2s, WFM.

In testing I found that 2-3 seconds was typically not long enough for the first operation timeout to have finished before it would fail a second time saying the resource is busy (hence the 2nd exception check). Obviously this is more dependent on the host in question but doing 5 seconds seemed to at least be more "stable" in real world testing. In the end it's mostly there as an exceptional scenario to not fail if we could continue, I would say users should be increasing their operation/read timeout values if this is being hit (hence the warning) if they are receiving this warning multiple times.

Does pypsrp have a similar issue on input sends, or was it already more robust?

While theoretically it sounds like it could suffer from the same problem in testing I never came across it. My guess is that winrm having to spawn the new process and is being delayed by the new powershell process actually starting up for it to read the stdin pipe getting buffered causing the timeout whereas psrp is more efficient as it's already started and is always reading the pipe input.

@ansibot ansibot added the stale_ci This PR has been tested by CI more than one week ago. Close and re-open this PR to get it retested. label Sep 5, 2023
@jborean93
Copy link
Contributor Author

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@jborean93 jborean93 enabled auto-merge (squash) September 26, 2023 22:19
@jborean93 jborean93 merged commit f22231d into ansible:devel Sep 26, 2023
78 checks passed
@jborean93 jborean93 deleted the winrm-input branch September 26, 2023 22:47
jborean93 added a commit to jborean93/ansible that referenced this pull request Sep 26, 2023
* winrm - make command input more resiliant

* Expand warning message

(cherry picked from commit f22231d)
jborean93 added a commit to jborean93/ansible that referenced this pull request Sep 26, 2023
* winrm - make command input more resiliant

* Expand warning message

(cherry picked from commit f22231d)
jborean93 added a commit to jborean93/ansible that referenced this pull request Sep 26, 2023
* winrm - make command input more resiliant

* Expand warning message

(cherry picked from commit f22231d)
@jborean93
Copy link
Contributor Author

@agibson2
Copy link

Thanks for this work. I do hope it gets accepted for backports so that RHEL 9 will see the update.

sivel pushed a commit that referenced this pull request Sep 27, 2023
* winrm - make command input more resiliant

* Expand warning message

(cherry picked from commit f22231d)
sivel pushed a commit that referenced this pull request Sep 27, 2023
* winrm - make command input more resiliant

* Expand warning message

(cherry picked from commit f22231d)
sivel pushed a commit that referenced this pull request Sep 27, 2023
* winrm - make command input more resiliant

* Expand warning message

(cherry picked from commit f22231d)
@ansible ansible locked and limited conversation to collaborators Oct 24, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug This issue/PR relates to a bug. stale_ci This PR has been tested by CI more than one week ago. Close and re-open this PR to get it retested.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants