Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WinRMTransport Kerberos authentication error but logged as success in windows logs #15669

Closed
MichaelBaydoun opened this issue Apr 29, 2016 · 6 comments
Labels
affects_2.0 This issue/PR affects Ansible v2.0 bug This issue/PR relates to a bug. needs_info This issue requires further information. Please answer any outstanding questions. windows Windows community

Comments

@MichaelBaydoun
Copy link
Contributor

MichaelBaydoun commented Apr 29, 2016

There are a few WinRMTransport issues that are possibly related to this issue, but can't be sure. I'm hoping the debug details in this issue will help solve some of the outstanding WinRMTransport issues.

ISSUE TYPE
  • Bug Report
ANSIBLE VERSION
ansible 2.0.1.0
  config file = /home/icansible/.ansible.cfg
  configured module search path = Default w/o overrides
CONFIGURATION

[defaults]
callback_whitelist = profile_tasks,timer
forks = 50
host_key_checking = false
inventory = ~/ansible/hosts-dev
max_fail_percentage = 1
pattern = NONE
retry_files_enabled = false
transport = ssh

OS / ENVIRONMENT

From: Red Hat Enterprise Linux Server release 6.7 (Santiago)
To: Windows 2008R2

SUMMARY

Sporadically, but frequently, WinRM connection is attempted, and 62 seconds later a failure is logged. However, the windows logs show the login was successful.

STEPS TO REPRODUCE

Unable to reproduce in any reliable way. We see the problem frequently and caught some details during a full debug run.

From the debug run, the setup module is causing a connection winrm connection attempt

09:01:17              9020 1461934877.08828: _low_level_execute_command(): executing: PowerShell -NoProfile -NonInteractive -ExecutionPolicy Unrestricted -EncodedCommand UwBlAHQALQBTAHQAcgBpAGMAdABNAG8AZABlACAALQBWAGUAcgBzAGkAbwBuACAATABhAHQAZQBzAHQACgAoAE4AZQB3AC0ASQB0AGUAbQAgAC0AVAB5AHAAZQAgAEQAaQByAGUAYwB0AG8AcgB5ACAALQBQAGEAdABoACAAJABlAG4AdgA6AHQAZQBtAHAAIAAtAE4AYQBtAGUAIAAiAGEAbgBzAGkAYgBsAGUALQB0AG0AcAAtADEANAA2ADEAOQAzADQAOAA3ADcALgAwADYALQAyADAAOQA4ADAANAA3ADgAOAA4ADYAMQAzADIAOQAiACkALgBGAHUAbABsAE4AYQBtAGUAIAB8ACAAVwByAGkAdABlAC0ASABvAHMAdAAgAC0AUwBlAHAAYQByAGEAdABvAHIAIAAnACcAOwA=
09:01:17            <wdv-servicesint> ESTABLISH WINRM CONNECTION FOR USER: ansible-svc-dev@realm on PORT 5986 TO wdv-servicesint

I believe the first number, 9020, is unique to this server and attempt. The next 9020 debug entries occur 62 seconds later

09:02:19            Traceback (most recent call last):
09:02:19              File "/usr/lib/python2.6/site-packages/ansible/plugins/connection/winrm.py", line 244, in exec_command
09:02:19                result = self._winrm_exec(cmd_parts[0], cmd_parts[1:], from_exec=True)
09:02:19              File "/usr/lib/python2.6/site-packages/ansible/plugins/connection/winrm.py", line 177, in _winrm_exec
09:02:19                self.shell_id = self.protocol.open_shell(codepage=65001) # UTF-8
09:02:19              File "/usr/lib/python2.6/site-packages/winrm/protocol.py", line 103, in open_shell
09:02:19                rs = self.send_message(xmltodict.unparse(rq))
09:02:19              File "/usr/lib/python2.6/site-packages/winrm/protocol.py", line 173, in send_message
09:02:19                return self.transport.send_message(message)
09:02:19              File "/usr/lib/python2.6/site-packages/winrm/transport.py", line 224, in send_message
09:02:19              9020 1461934939.09307: done running TaskExecutor() for wdv-servicesint/TASK: setup
09:02:19              9020 1461934939.09328: sending task result
09:02:19                raise WinRMTransportError('kerberos', error_message)
09:02:19            WinRMTransportError: 500 WinRMTransport. Kerberos-based authentication was failed. Code 500
09:02:19              9020 1461934939.10486: done sending task result
09:02:19              9020 1461934939.10628: WORKER PROCESS EXITING

It's classified as a Kerberos-based authentication error however the windows security log shows an immediate successful login

04/29/2016 09:01:18 AM
LogName=Security
SourceName=Microsoft Windows security auditing.
EventCode=4624
EventType=0
Type=Information
ComputerName=WDV-SERVICESINT.fhlbi.com
TaskCategory=Logon
OpCode=Info
RecordNumber=19604901
Keywords=Audit Success
Message=An account was successfully logged on.

Subject:
    Security ID:        NULL SID
    Account Name:       -
    Account Domain:     -
    Logon ID:       0x0

Logon Type:         3

New Logon:
    Security ID:        FHLBI-EC\ansible-svc-dev
    Account Name:       ansible-svc-dev
    Account Domain:     FHLBI-EC
    Logon ID:       0x5ffbf6f7
    Logon GUID:     {FA6AC69A-CDDF-A2A4-C6A4-043426E7A5C1}

Process Information:
    Process ID:     0x0
    Process Name:       -

Network Information:
    Workstation Name:   
    Source Network Address: -
    Source Port:        -

Detailed Authentication Information:
    Logon Process:      Kerberos
    Authentication Package: Kerberos
    Transited Services: -
    Package Name (NTLM only):   -
    Key Length:     0

This event is generated when a logon session is created. It is generated on the computer that was accessed.

The subject fields indicate the account on the local system which requested the logon. This is most commonly a service such as the Server service, or a local process such as Winlogon.exe or Services.exe.

The logon type field indicates the kind of logon that occurred. The most common types are 2 (interactive) and 3 (network).

The New Logon fields indicate the account for whom the new logon was created, i.e. the account that was logged on.

The network fields indicate where a remote logon request originated. Workstation name is not always available and may be left blank in some cases.

The authentication information fields provide detailed information about this specific logon request.
    - Logon GUID is a unique identifier that can be used to correlate this event with a KDC event.
    - Transited services indicate which intermediate services have participated in this logon request.
    - Package name indicates which sub-protocol was used among the NTLM protocols.
    - Key length indicates the length of the generated session key. This will be 0 if no session key was requested.

which a second later get's special privileges

04/29/2016 09:01:19 AM
LogName=Security
SourceName=Microsoft Windows security auditing.
EventCode=4672
EventType=0
Type=Information
ComputerName=WDV-SERVICESINT.fhlbi.com
TaskCategory=Special Logon
OpCode=Info
RecordNumber=19604902
Keywords=Audit Success
Message=Special privileges assigned to new logon.

Subject:
    Security ID:        FHLBI-EC\ansible-svc-dev
    Account Name:       ansible-svc-dev
    Account Domain:     FHLBI-EC
    Logon ID:       0x5ffbfe68

Privileges:     SeSecurityPrivilege
            SeBackupPrivilege
            SeRestorePrivilege
            SeTakeOwnershipPrivilege
            SeDebugPrivilege
            SeSystemEnvironmentPrivilege
            SeLoadDriverPrivilege
            SeImpersonatePrivilege

followed by an immediate successful logout

04/29/2016 09:01:19 AM
LogName=Security
SourceName=Microsoft Windows security auditing.
EventCode=4634
EventType=0
Type=Information
ComputerName=WDV-SERVICESINT.fhlbi.com
TaskCategory=Logoff
OpCode=Info
RecordNumber=19604904
Keywords=Audit Success
Message=An account was logged off.

Subject:
    Security ID:        FHLBI-EC\ansible-svc-dev
    Account Name:       ansible-svc-dev
    Account Domain:     FHLBI-EC
    Logon ID:       0x5ffbfe68

Logon Type:         3

This event is generated when a logon session is destroyed. It may be positively correlated with a logon event using the Logon ID value. Logon IDs are only unique between reboots on the same computer.

I looked at the pywinrm code and the timeout is 3600 seconds, so I don't think that's the problem.

EXPECTED RESULTS

I expected so see a successful login, followed by the transfer of the setup module to the windows server, followed by it's execution

ACTUAL RESULTS

actual results with ANSIBLE_DEBUG=yes and -vvvv are posted above

@jhawkesworth jhawkesworth added the windows Windows community label Apr 29, 2016
@jhawkesworth
Copy link
Contributor

This feels like an environmental issue.

I have a few suggestions of things that might be worth investigating.

Check your ansible machine's clock is synchronized with your domain controllers. I have seen random failures when the clocks have drifted.

I'd also examine the winrm configuration on the target windows box to,see if any limits or timeouts might be affecting things.

Oh and maybe the firewall configurations on both ends.

What version of Windows are you targeting?

@MichaelBaydoun
Copy link
Contributor Author

The times are in sync and using the same ntp source.

The firewalls are open on both ends, the error is sporadic, sometimes it works, sometimes it fails.

There is nothing of interest logged for winrm on the windows side, and I haven't found any limits or timeouts (yet).

Most of our windows targets are windows 2008r2, including this one. I messed up the formatting in the issue template, let me go fix that.

@jhawkesworth
Copy link
Contributor

Ok, I have a couple of other ideas.

Is the S2008R2 box fully up to date with Windows Updates? There was a bug in WMF 3.0 when S2008R2 was first released which messed up the memory allocation. There is a hotfix for it, although my preferred fix was to update to Windows Management Framework 4.0.

My other ideas boil down to health checking the hardware itself - checking network cables, hard disk errors, temperature sensors.

Other than that monitoring the CPU and memory usage might at least establish if anything else is going on at the time of the failures.

Hope this helps

@MichaelBaydoun
Copy link
Contributor Author

I found a memory/cpu issue with one of our hosts, and the ansible debug run noted the out of memory condition. However, since this issue is effecting 12 other windows servers (since I started tracking it), let's keep this open and see what else I can find.

I'll check the version of WMF on the servers in question, and look for the hotfix mentions if they are running 3.0. Thanks for the suggestion.

@ansibot ansibot added bug_report affects_2.0 This issue/PR affects Ansible v2.0 labels Sep 7, 2016
@jhawkesworth
Copy link
Contributor

just going through old windows-related issues

@MichaelBaydoun - were any of the affected hosts running WMF 3.0?

needs_info

@ansibot ansibot added the needs_info This issue requires further information. Please answer any outstanding questions. label Mar 24, 2017
@MichaelBaydoun
Copy link
Contributor Author

@jhawkesworth yes, at least some of the hosts were running WMV 3.0.

However, we haven't seen this issue in a long time. Currently running ansible 2.1.4.0. Going to close this as it's not an issue for us any longer.

@ansibot ansibot added bug This issue/PR relates to a bug. and removed bug_report labels Mar 7, 2018
@ansible ansible locked and limited conversation to collaborators Apr 25, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
affects_2.0 This issue/PR affects Ansible v2.0 bug This issue/PR relates to a bug. needs_info This issue requires further information. Please answer any outstanding questions. windows Windows community
Projects
None yet
Development

No branches or pull requests

3 participants