[openssh/linux kernel > 3.12] Intermittent error "failed to resolve remote temporary directory from" [sleep 0] #13876

Closed
chinatree opened this issue Jan 14, 2016 · 128 comments

@chinatree

commented Jan 14, 2016

After upgrading Ansible from 1.9.4 to 2.0.0.1, I intermittently hit the error "failed to resolve remote temporary directory from" when running a playbook.

$ ansible-playbook playlooks/playlook-filters.yml

PLAY ***************************************************************************

TASK [command] *****************************************************************
fatal: [YT_8_22]: FAILED! => {"failed": true, "msg": "ERROR! failed to resolve remote temporary directory from ansible-tmp-1452759681.1-95441304852350: `( umask 22 && mkdir -p \"$( echo /tmp/.ansible/tmp/ansible-tmp-1452759681.1-95441304852350 )\" && echo \"$( echo /tmp/.ansible/tmp/ansible-tmp-1452759681.1-95441304852350 )\" )` returned empty string"}
...ignoring

TASK [debug] *******************************************************************
ok: [YT_8_22] => {
    "msg": "it failed"
}

PLAY RECAP *********************************************************************
YT_8_22                    : ok=2    changed=0    unreachable=0    failed=0

It happens randomly on one of the tasks. The playbook is:


---
- hosts: YT_8_22
  tasks:
    - shell: /bin/true
      register: result
      ignore_errors: True

    - debug: msg="it failed"
      when: result|failed

I tried 1.9.4 and did not see the problem there. Can anybody tell me how to fix this?

I found another report of a similar problem: https://groups.google.com/forum/#!searchin/ansible-project/Intermittent$20error%7Csort:relevance/ansible-project/FyK6au2O9KY/tWuf31P9AQAJ

@bcoca bcoca added the needs_info label Jan 18, 2016

@riversy

commented Jan 25, 2016

Hi there,
I also caught this, with the -vvvv flag.

<45.56.96.138> ESTABLISH SSH CONNECTION FOR USER: root
<45.56.96.138> SSH: EXEC ssh -C -vvv -o ControlMaster=auto -o ControlPersist=60s -o 'IdentityFile="/home/riversy/.ssh/id_rsa"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=root -o ConnectTimeout=10 -o ControlPath=/home/riversy/.ansible/cp/ansible-ssh-%h-%p-%r -tt 45.56.96.138 'mkdir -p "$( echo $HOME/.ansible/tmp/ansible-tmp-1453707878.36-19081627291623 )" && echo "$( echo $HOME/.ansible/tmp/ansible-tmp-1453707878.36-19081627291623 )"'
fatal: [45.56.96.138]: FAILED! => {"failed": true, "msg": "ERROR! failed to resolve remote temporary directory from ansible-tmp-1453707878.36-19081627291623: `mkdir -p \"$( echo $HOME/.ansible/tmp/ansible-tmp-1453707878.36-19081627291623 )\" && echo \"$( echo $HOME/.ansible/tmp/ansible-tmp-1453707878.36-19081627291623 )\"` returned empty string"}

My version of ansible is:

ansible 2.0.0.2
  config file = /etc/ansible/ansible.cfg
  configured module search path = Default w/o overrides

I would appreciate any advice.

Update: Sometimes it even happens with a plain ad-hoc command: ansible all -i inventory/test -m ping

@rpsiv

commented Jan 25, 2016

I have noticed that if I run a playbook back to back I generally get success.

For example:

Run #1:
fatal: [1.2.3.4]: FAILED! => {"failed": true, "msg": "failed to resolve remote temporary directory from ansible-tmp-1453754269.23-53039386122234: ( umask 22 && mkdir -p \"$( echo $HOME/.ansible/tmp/ansible-tmp-1453754269.23-53039386122234 )\" && echo \"$( echo $HOME/.ansible/tmp/ansible-tmp-1453754269.23-53039386122234 )\" ) returned empty string"}

Run #2: Success

However, after waiting about a minute the failure occurs again. Monitoring the remote machine, I do not see ~/.ansible/tmp/ being created when the error condition is reached. So the error seems to be genuine; the question is what is preventing that directory from being created. This issue only started after upgrading to 2.0.0 (I also tried 2.0.0.2-1).

@jeffwidman

Contributor

commented Jan 25, 2016

Also started experiencing the error after upgrading from Ansible 1.9.4 to 2.0.0.2.

Doubt it matters, but just in case it's a CentOS 7.1 VPS hosted on Linode.

@mgaley

commented Jan 27, 2016

I encountered this on random shell/command tasks in a playbook against a CentOS VPS on Amazon. It is our only CentOS 7.2 machine; the 20+ other Amazon AMI servers haven't shown the issue so far.

I tried 2.0.0, 2.0.0.2, 2.1, pipelining on/off, controlpersist on/off, and running from OS X / amazon ami. I also tried removing the .ansible files from the target node with no effect

This might not be related, but I ended up resolving this issue for now by restarting the machine. I had noticed log entries like the ones below mentioning a corrupted btmp log, so I executed 'cat /dev/null > /var/log/btmp' and then restarted. The btmp problem might not be related, but it was being logged on every Ansible command/connection. I had moved my /var partition to a different disk the day before, so it's possible the file became corrupted from that.

sshd[24031]: pam_unix(sshd:session): session closed for user centos
sshd[24248]: Accepted publickey for centos from port 45694 ssh2:
sshd[24248]: pam_unix(sshd:session): session opened for user centos by (uid=0)
sshd[24248]: pam_lastlog(sshd:session): corruption detected in /var/log/btmp
sshd[24251]: Received disconnect from : 11: disconnected by user

@rpsiv

commented Jan 27, 2016

@mgaley based on your feedback I looked into the ssh connection to my client and found something interesting.

I am using RHEL 6.6 for both source and destination machines and Ansible 2.0.0.2

Remoteserver#> tail -f /var/log/secure

Jan 27 07:47:48 Remoteserver sshd[19329]: Accepted publickey for jenkins from Ansiblehost port 41384 ssh2
Jan 27 07:47:48 Remoteserver sshd[19329]: pam_unix(sshd:session): session opened for user jenkins by (uid=0)

<< --- >>
Ansibleserver: ERROR! failed to resolve remote temporary directory

SSH Connection remains established by the Ansible PID (but not the original pid that opens it, I assume this is expected)

Running again within the 60s window where the connection remains established produced my above result of success

"ControlPersist=60s"
<<-->>
Retry playbook (previous Established SSH session is used)

Jan 27 07:48:11 Remoteserver sshd[19333]: subsystem request for sftp
Jan 27 07:48:12 Remoteserver sshd[19333]: subsystem request for sftp
Jan 27 07:49:13 Remoteserver sshd[19333]: Received disconnect from Ansiblehost: 11: disconnected by user
Jan 27 07:49:13 Remoteserver sshd[19329]: pam_unix(sshd:session): session closed for user jenkins

Playbook completes successfully.

@djdreidel

commented Jan 29, 2016

I see the same issue as well running
ansible 2.1.0 (devel 4b1d621) last updated 2016/01/28 19:20:32 (GMT -400)

@mgaley

commented Jan 29, 2016

I also had the same issue when using the same playbook against 5 amazon ami servers in 5 separate blocks of a play, only against the first node though. Didn't see anything related in the secure log this time.

It only seems related to the shell/command modules. Running multiple playbooks at the same time locally, even against the same remote hosts, didn't make the error happen any more often.

@ghost

commented Jan 31, 2016

We get this as well. Interesting to note: https://groups.google.com/forum/#!msg/ansible-devel/5i6VDHKZ30I/0ksVpEEICwAJ raises the possibility of a race condition in directory creation?

@jkarneges

commented Feb 1, 2016

I'm seeing this too. It appears to happen randomly on arbitrary tasks in a playbook, always regarding this tmp file stuff. I just re-run the playbook until it works. I am using a network filesystem (rackspace block storage).

@lenar

commented Feb 3, 2016

Happens to me as well. Truly random it seems.

@ghost

commented Feb 3, 2016

https://github.com/ansible/ansible/blob/200f95887373e215eb41adb3dcc4128a0a0e0f88/lib/ansible/plugins/action/__init__.py

There are two issues here.

  1. Don't use a fall-through, such as setting rc to /, to flag that something has gone wrong. The error handling approach in this file is faulty. My opinion - discuss!
  2. In the function _make_tmp_path we find what I think is the root cause and why running the playbook twice fixes the issue. I suspect, though I have not confirmed, this never occurs more than once on any given server.

I suspect something to do with how function mkdtemp in file https://github.com/ansible/ansible/blob/42e312d3bd0516ceaf2b4533ac643bd9e05163cd/lib/ansible/plugins/shell/sh.py works.

Anyone feel I'm completely on the wrong track?

@dvershinin

commented Feb 5, 2016

+1 getting the same issue. Really random

@ballPointPenguin

commented Feb 6, 2016

I'm getting the same, using archlinux vagrant controller and archlinux vagrant targets.
All boxes are on ansible v 2.0.0.2

@joernheissler

Contributor

commented Feb 7, 2016

This is caused by a bug in the Linux kernel or in openssh, not sure which one is at fault.
https://bugzilla.mindrot.org/show_bug.cgi?id=2492
https://lkml.org/lkml/2015/12/9/775

One workaround is to downgrade the linux kernel on the server to a version that does not include commit 1a48632ffed61352a7810ce089dc5a8bcd505a60. For Ubuntu 14.04 this is e.g. 3.16.0-60, or 3.19.0-22 or earlier.

@rpsiv

commented Feb 7, 2016

I am running kernel 2.6.32-504 (rhel 6.6). And this bug only occurs on Ansible 2.x not previous versions.

@ballPointPenguin

commented Feb 8, 2016

I'm using kernel 4.3.3-3-ARCH and am not interested in going back to 3.x kernels on these machines.

@bcoca bcoca removed the needs_info label Feb 8, 2016

@bcoca

Member

commented Feb 8, 2016

Another workaround would be using a transport other than OpenSSH (paramiko?).
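A minimal sketch of trying that workaround, assuming paramiko is installed on the control machine. Ansible reads the default transport from the ANSIBLE_TRANSPORT environment variable (the `transport` setting in ansible.cfg), and `-c` selects it for a single run. The playbook name `site.yml` is a placeholder, not something from this thread.

```shell
# Select the paramiko transport for every Ansible command started from this shell.
export ANSIBLE_TRANSPORT=paramiko

# Equivalent per-run form (commented out; site.yml is a placeholder playbook name):
# ansible-playbook -c paramiko site.yml
```

Unset the variable again to go back to the default transport selection.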

I'm going to close this ticket as Ansible itself is not doing anything wrong here, this is just a bad interaction between tools we depend on.

@bcoca bcoca closed this Feb 8, 2016

@jeffwidman

Contributor

commented Feb 9, 2016

Why did it only start appearing after folks switched from Ansible 1.9.x to Ansible 2.0.x?

Not disputing that it's an issue with the underlying tools, just genuinely curious why it wasn't an issue before.

@MarkGavalda

commented Feb 12, 2016

I'm not sure how this is considered resolved... @rpsiv confirmed it happens with older kernels as well, so not a kernel bug. It didn't happen for anyone before Ansible 2.0.x and now it's widespread.

"between tools we depend on"

Which tools are you referring to, @bcoca? I'd be happy to find the root cause and help get it fixed by working with other open source teams, but at this point it's not clear to me where I (we) should begin. Thanks.

@evgkrsk

Contributor

commented Feb 13, 2016

@bcoca : ping?
This works perfectly in 1.9.4.
If you can see the source of the problem, please consider adding a workaround in Ansible 2.x.

@ghost

commented Feb 14, 2016

The fact this happens only once on each server is a clue I think.

@MarkGavalda

commented Feb 14, 2016

@khushil it has happened multiple times on some of our servers, even rebooting them etc. didn't help. Sometimes the issue goes away suddenly but sometimes no matter what we do it persists. It's really annoying :-/

@dvershinin

commented Feb 14, 2016

+1000.

@TomasCrhonek

commented Feb 15, 2016

I get this error every time (on a random server), but only with parallel execution (5 forks, the default):

ansible-playbook -i hosts -f 5 base_packages.yml

but never with (-f 1):

ansible-playbook -i hosts -f 1 base_packages.yml

@MarkGavalda

commented Feb 15, 2016

@TomasCrhonek I just tried with forks=1 and bumped into the same issue.

@lenar

commented Feb 15, 2016

It also happens when limited (-l) to a single host (with forks left at the default).

@TomasCrhonek

commented Feb 15, 2016

OK, after hundreds of runs I finally got this error with -f 1. It's very rare.
(Debian Testing up to date, ansible 2.0.0.2)
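Since the failure is this rare, counting hits over many runs is more reliable than watching by hand. The loop below is only a sketch: `count_hits` is a hypothetical helper, not part of Ansible. It matches on the error text rather than the exit status, because a play that uses ignore_errors still exits successfully.

```shell
#!/bin/sh
# count_hits PATTERN RUNS CMD [ARGS...]
# Runs CMD RUNS times and prints how many runs produced output containing PATTERN.
count_hits() {
    pattern=$1
    runs=$2
    shift 2
    hits=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Search stdout and stderr together for the fixed string.
        if "$@" 2>&1 | grep -F -q -- "$pattern"; then
            hits=$((hits + 1))
        fi
        i=$((i + 1))
    done
    echo "$hits"
}

# Example with the invocation from this comment (commented out; needs Ansible and an inventory):
# count_hits 'failed to resolve remote temporary directory' 200 \
#     ansible-playbook -i hosts -f 1 base_packages.yml
```

Comparing the count for -f 1 against -f 5 over the same number of runs gives a rough idea of whether parallelism changes the failure rate.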

@gundalow gundalow removed this from the 2.2.0 milestone Apr 13, 2017

@JoelFeiner

commented May 19, 2017

I am also still getting this issue regularly, mostly when setting up packer boxes. I am using Ansible 2.3.0 on Ubuntu 16.10.

@joernheissler

Contributor

commented May 19, 2017

@JoelFeiner Just out of interest, what OS / Linux Kernel Version is your managed node running?

@JoelFeiner

commented May 19, 2017

@joernheissler: output of uname -a on the machine running Ansible is below

Linux ********* 4.8.0-52-generic #55-Ubuntu SMP Fri Apr 28 13:28:50 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

The VMs are running Centos 7, and a uname -a on one of them looks like this:

Linux localhost.localdomain 3.10.0-514.16.1.el7.x86_64 #1 SMP Wed Apr 12 15:04:24 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

@joernheissler

Contributor

commented May 21, 2017

@JoelFeiner So you're running a 4-year-old kernel which doesn't include the fix (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0f40fbbcc34e093255a2b2d70b6b0fb48c3f39aa).

Btw, I haven't seen this bug myself for a long time now, since I upgraded kernels on my servers.

Why is this bug not closed anyway? It's not an ansible problem.

@JoelFeiner

commented May 22, 2017

I've gotten the go-ahead to use a newer kernel from elrepo. Hopefully that will fix the issue then.

@haubi

commented May 22, 2017

LOL: I've stumbled on a sleep implementation that fails on sleep 0, as in:

$ sleep 0; echo $?
1
$ sleep 1; echo $?
0

Hacking plugins/action/__init__.py to add "{ sleep 0 || :; }" instead of "sleep 0" does help for now (using Bourne shell here only, not sure how that could work for csh).

However, there is another sleep implementation available on that machine, but specifying an explicit /path/to/sleep seems to be unsupported as of ansible-2.2.2.0...

For the record: this is /usr/bin/sleep on Interix (a POSIX layer for Windows, abandoned in the meantime, but still part of my infra here).
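To illustrate why the "{ sleep 0 || :; }" guard helps, here is a small sketch. `broken_sleep` is a hypothetical stand-in that mimics the failing sleep described in this comment, and the `&&` chain is only an assumed shape for a command that ends in sleep 0, not Ansible's actual command line.

```shell
#!/bin/sh
# Hypothetical stand-in for a sleep that fails on "sleep 0", mimicking the behaviour reported here.
broken_sleep() {
    if [ "$1" = "0" ]; then
        return 1
    fi
    return 0
}

# Unguarded: the failing sleep becomes the exit status of the whole chain.
run_unguarded() { echo "/tmp/example" && broken_sleep 0; }

# Guarded: "|| :" swallows the failure, so the chain reports success.
run_guarded() { echo "/tmp/example" && { broken_sleep 0 || :; }; }

rc_unguarded=0
run_unguarded >/dev/null || rc_unguarded=$?
rc_guarded=0
run_guarded >/dev/null || rc_guarded=$?

echo "unguarded rc=$rc_unguarded, guarded rc=$rc_guarded"
```

The guard leaves the behaviour unchanged on systems where sleep 0 succeeds, which is why it is safe to apply unconditionally in a Bourne-style shell.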

@JoelFeiner

commented May 22, 2017

@joernheissler: unfortunately, this happened to me again on kernel 4.4.69. This kernel contains the commit with the fix.

@joernheissler

Contributor

commented May 22, 2017

Are you able to put together a complete test case so that others can reproduce it?

@JoelFeiner

commented May 22, 2017

It's a very intermittent error that can happen with any task, so I don't think that is feasible.

@jylenhofgfi

commented May 23, 2017

JoelFeiner, try upgrading ssh as written in the thread...

@JoelFeiner

commented May 23, 2017

@jylenhofgfi I don't see anything in this thread about upgrading SSH, only the kernel. What version are you using? On the remote hosts or local?

@sangrealest

commented Jun 27, 2017

In most cases, setting ControlMaster=no should fix this issue. You should also check your OpenSSH version: OpenSSH started to support multiplexing in version 5.1, but it only works really well from version 5.6. So I suggest upgrading your OpenSSH server to 5.6 or higher.

Release Notes: https://www.openssh.com/txt/release-5.1
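One way to try the ControlMaster=no suggestion without editing ansible.cfg is the ANSIBLE_SSH_ARGS environment variable (the `ssh_args` setting in the `[ssh_connection]` section). This is a sketch, not a recommended configuration. Setting it replaces the whole default argument string, so ControlPersist is dropped as well.

```shell
# Replace Ansible's default ssh_args so connections are not multiplexed.
# Note: this overrides the whole default argument string, not just ControlMaster.
export ANSIBLE_SSH_ARGS='-o ControlMaster=no'

# Check which OpenSSH client version the control machine has (commented out; prints to stderr):
# ssh -V
```

Expect slower runs with multiplexing off, since every task then opens a fresh SSH connection.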

@alikins alikins changed the title from Intermittent error "failed to resolve remote temporary directory from" to [openssh/linux kernel > 3.12] Intermittent error "failed to resolve remote temporary directory from" [sleep 0] Aug 17, 2017

@ansibotdev

commented Nov 22, 2017

@chinatree You have not responded to information requests in this issue so we will assume it no longer affects you. If you are still interested in this, please create a new issue with the requested information.

click here for bot help

@ansibotdev ansibotdev closed this Nov 22, 2017

@jctanner

Member

commented Dec 21, 2017

reopening till #31022 is merged.

!needs_info

resolved_by_pr #31022

@jctanner jctanner reopened this Dec 21, 2017

@ansibot ansibot closed this Dec 21, 2017

@jctanner

Member

commented Dec 21, 2017

bot_skip

@jctanner jctanner reopened this Dec 21, 2017

@jctanner

Member

commented Dec 21, 2017

For anyone trying to experiment with this problem in devel, we now have a use_tty config option that can stop the connection plugin from adding "-tt" without having to hack the code.

218987e
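A sketch of switching the option off from the environment. To my knowledge the environment form of use_tty is ANSIBLE_SSH_USETTY, but treat that name as an assumption and check the ssh connection plugin documentation for your version before relying on it.

```shell
# Ask the ssh connection plugin not to add "-tt" to its ssh command line.
# Assumption: ANSIBLE_SSH_USETTY is the environment form of use_tty; verify for your version.
export ANSIBLE_SSH_USETTY=False
```

Running with -vvvv afterwards shows the SSH: EXEC line, which makes it easy to confirm whether "-tt" is still being passed.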

@ansible ansible deleted a comment from ansibot Dec 21, 2017

@jctanner

Member

commented Jan 29, 2018

The original problem described in this issue with temp dir resolution should have been resolved by #31677.

@jctanner jctanner closed this Jan 29, 2018

@ansibot ansibot added bug and removed bug_report labels Mar 7, 2018

@ansible ansible locked and limited conversation to collaborators Apr 25, 2019
