Ansible hangs forever while executing playbook with no information on what is going on #30411

Open
benapetr opened this issue Sep 15, 2017 · 141 comments
Labels: affects_2.3 · bug · hang · has_pr · support:core

@benapetr

ISSUE TYPE
  • Bug Report
COMPONENT NAME

ansible-playbook

ANSIBLE VERSION
ansible 2.3.1.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = Default w/o overrides
  python version = 2.7.5 (default, Aug  2 2016, 04:20:16) [GCC 4.8.5 20150623 (Red Hat 4.8.5-4)]
CONFIGURATION
OS / ENVIRONMENT
SUMMARY

When running an Ansible playbook against all servers (1964 of them), it hangs somewhere in the middle of execution. I used -vvvvvvvvvv to trace the problem, but it doesn't print ANYTHING. The previous host finishes, the message about it finishing is the last thing I see in the console, and then there is NOTHING. It just hangs. It doesn't even print the FQDN or name of the server that is next in the queue, and it doesn't say what it is waiting for. Just nothing.

In ps I see 2 processes. Running strace on them shows:

parent:
Process 22170 attached
select(0, NULL, NULL, NULL, {0, 599}) = 0 (Timeout)
wait4(28917, 0x7ffd44d0bf34, WNOHANG, NULL) = 0
wait4(28917, 0x7ffd44d0bf64, WNOHANG, NULL) = 0
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
wait4(28917, 0x7ffd44d0bf34, WNOHANG, NULL) = 0
wait4(28917, 0x7ffd44d0bf64, WNOHANG, NULL) = 0
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
wait4(28917, 0x7ffd44d0bf34, WNOHANG, NULL) = 0
wait4(28917, 0x7ffd44d0bf64, WNOHANG, NULL) = 0
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
wait4(28917, 0x7ffd44d0bf34, WNOHANG, NULL) = 0
wait4(28917, 0x7ffd44d0bf64, WNOHANG, NULL) = 0
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
wait4(28917, 0x7ffd44d0bf34, WNOHANG, NULL) = 0
wait4(28917, 0x7ffd44d0bf64, WNOHANG, NULL) = 0

...
repeats forever

child:
Process 28917 attached
epoll_wait(40,

STEPS TO REPRODUCE
# This playbook is used to test if server works or not

---
- hosts: run-tools.homecredit.net all:!ignored
  tasks:
  - command: echo works

Looks like a bug to me; it should at least tell me what it's trying to do that causes the hang.

@ansibot ansibot added the affects_2.3, bug_report, needs_triage and support:core labels Sep 15, 2017
@jborean93 jborean93 removed the needs_triage label Sep 21, 2017
@jborean93
Contributor

You can enable debug mode to get more information than -v* gives you. To do this, just export ANSIBLE_DEBUG=1.

We would probably need some more info to try and help you with this, which hopefully the debug logs will provide. That many hosts does seem like a lot to connect to concurrently; are you running them all in parallel or in blocks of hosts?
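A minimal sketch of that, with site.yml and the log path as placeholders:

```shell
# ANSIBLE_DEBUG is a plain environment variable Ansible reads at startup;
# it enables internal debug tracing far noisier than -vvvv.
export ANSIBLE_DEBUG=1
echo "ANSIBLE_DEBUG=$ANSIBLE_DEBUG"
# Then run the playbook, teeing output so the last lines printed before
# a hang are preserved on disk:
#   ansible-playbook -vvvv site.yml 2>&1 | tee /tmp/ansible-debug.log
```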

needs_info

@ansibot ansibot added the needs_info label Sep 21, 2017
@jhawkesworth
Contributor

Hmm, what are you setting forks to? Hitting lots of hosts at once might be taking up a lot of resources - worth checking /var/log/messages too.

You might like to try wait_for_connection as a way of testing that all your hosts are available. It's nice as it's aware of the connection type, so it works even if you have a mixed inventory of, say, Windows and Linux hosts.
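A pre-flight play along those lines might look like this (a sketch; the host pattern and the 30-second timeout are placeholder values):

```yaml
# Hypothetical availability check: fail unreachable hosts up front
# instead of letting them stall the real play later.
- hosts: all
  gather_facts: no
  tasks:
    - wait_for_connection:
        timeout: 30
```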

@benapetr
Author

Hi, it connects to at most 10 hosts at the same time; this is the config:

[defaults]
host_key_checking = False
timeout = 20
retry_files_save_path = /home/ansible/retry/
# Do not put more jobs here, or ssh will fail
# Anything more than 10 kills this poor server
forks = 10
remote_user = root
log_path=/var/log/ansible.log

[ssh_connection]
retries=3
pipelining=True

I managed to find a "block of hosts" that contains the one which is probably causing this hang, so I am now removing more and more hosts from that temporary file in the hope of identifying the one that causes it.

My rough guess is that it maybe hangs on a DNS query? It's possible that one of these hosts doesn't exist and a misconfigured DNS server makes the query take forever. I have no idea, but I am fairly sure that no detailed information is available even in the debug log; I will try to provide that soonish. I see some debug lines, then I see a green ok: [hostname], and then nothing forever: not a single debug or verbose line, it just hangs.
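The slow-DNS theory can be checked outside Ansible; a rough sketch, assuming getent and coreutils timeout are available, with hosts.txt a placeholder file containing one inventory hostname per line:

```shell
# Resolve each host with a hard 5-second deadline; a resolver that never
# answers shows up as "CHECK" instead of hanging the loop.
while read -r host; do
  if timeout 5 getent hosts "$host" >/dev/null; then
    echo "OK    $host"
  else
    echo "CHECK $host"
  fi
done < hosts.txt
```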

@ansibot ansibot removed the needs_info label Sep 26, 2017
@benapetr
Author

After a lot of debugging and testing servers one by one, I figured out which one was causing this. Its DNS is OK, it responds and I can ssh there, but there is some problem:

 12502 1506511086.24143: starting run
 12502 1506511086.42742: Loading CacheModule 'memory' from /usr/lib/python2.7/site-packages/ansible/plugins/cache/memory.py
 12502 1506511086.63981: Loading CallbackModule 'default' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/default.py
 12502 1506511086.64073: Loading CallbackModule 'actionable' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/actionable.py (found_in_cache=False, class_only=True)
 12502 1506511086.64109: Loading CallbackModule 'context_demo' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/context_demo.py (found_in_cache=False, class_only=True)
 12502 1506511086.64146: Loading CallbackModule 'debug' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/debug.py (found_in_cache=False, class_only=True)
 12502 1506511086.64166: Loading CallbackModule 'default' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/default.py (found_in_cache=False, class_only=True)
 12502 1506511086.64226: Loading CallbackModule 'dense' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/dense.py (found_in_cache=False, class_only=True)
 12502 1506511086.67297: Loading CallbackModule 'foreman' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/foreman.py (found_in_cache=False, class_only=True)
 12502 1506511086.68313: Loading CallbackModule 'hipchat' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/hipchat.py (found_in_cache=False, class_only=True)
 12502 1506511086.68376: Loading CallbackModule 'jabber' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/jabber.py (found_in_cache=False, class_only=True)
 12502 1506511086.68409: Loading CallbackModule 'json' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/json.py (found_in_cache=False, class_only=True)
 12502 1506511086.68476: Loading CallbackModule 'junit' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/junit.py (found_in_cache=False, class_only=True)
 12502 1506511086.68511: Loading CallbackModule 'log_plays' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/log_plays.py (found_in_cache=False, class_only=True)
 12502 1506511086.68598: Loading CallbackModule 'logentries' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/logentries.py (found_in_cache=False, class_only=True)
 12502 1506511086.68675: Loading CallbackModule 'logstash' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/logstash.py (found_in_cache=False, class_only=True)
 12502 1506511086.68789: Loading CallbackModule 'mail' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/mail.py (found_in_cache=False, class_only=True)
 12502 1506511086.68823: Loading CallbackModule 'minimal' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/minimal.py (found_in_cache=False, class_only=True)
 12502 1506511086.68855: Loading CallbackModule 'oneline' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/oneline.py (found_in_cache=False, class_only=True)
 12502 1506511086.68892: Loading CallbackModule 'osx_say' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/osx_say.py (found_in_cache=False, class_only=True)
 12502 1506511086.68932: Loading CallbackModule 'profile_tasks' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/profile_tasks.py (found_in_cache=False, class_only=True)
 12502 1506511086.68973: Loading CallbackModule 'selective' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/selective.py (found_in_cache=False, class_only=True)
 12502 1506511086.69006: Loading CallbackModule 'skippy' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/skippy.py (found_in_cache=False, class_only=True)
 12502 1506511086.69063: Loading CallbackModule 'slack' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/slack.py (found_in_cache=False, class_only=True)
 12502 1506511086.69186: Loading CallbackModule 'syslog_json' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/syslog_json.py (found_in_cache=False, class_only=True)
 12502 1506511086.69221: Loading CallbackModule 'timer' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/timer.py (found_in_cache=False, class_only=True)
 12502 1506511086.69256: Loading CallbackModule 'tree' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/tree.py (found_in_cache=False, class_only=True)
 12502 1506511086.69296: in VariableManager get_vars()
 12502 1506511086.71899: Loading FilterModule 'core' from /usr/lib/python2.7/site-packages/ansible/plugins/filter/core.py
 12502 1506511086.71980: Loading FilterModule 'ipaddr' from /usr/lib/python2.7/site-packages/ansible/plugins/filter/ipaddr.py
 12502 1506511086.72396: Loading FilterModule 'json_query' from /usr/lib/python2.7/site-packages/ansible/plugins/filter/json_query.py
 12502 1506511086.72433: Loading FilterModule 'mathstuff' from /usr/lib/python2.7/site-packages/ansible/plugins/filter/mathstuff.py
 12502 1506511086.72655: Loading TestModule 'core' from /usr/lib/python2.7/site-packages/ansible/plugins/test/core.py
 12502 1506511086.72845: Loading TestModule 'files' from /usr/lib/python2.7/site-packages/ansible/plugins/test/files.py
 12502 1506511086.72880: Loading TestModule 'mathstuff' from /usr/lib/python2.7/site-packages/ansible/plugins/test/mathstuff.py
 12502 1506511086.73542: done with get_vars()
 12502 1506511086.73603: in VariableManager get_vars()
 12502 1506511086.73665: Loading FilterModule 'core' from /usr/lib/python2.7/site-packages/ansible/plugins/filter/core.py (found_in_cache=True, class_only=False)
 12502 1506511086.73686: Loading FilterModule 'ipaddr' from /usr/lib/python2.7/site-packages/ansible/plugins/filter/ipaddr.py (found_in_cache=True, class_only=False)
 12502 1506511086.73702: Loading FilterModule 'json_query' from /usr/lib/python2.7/site-packages/ansible/plugins/filter/json_query.py (found_in_cache=True, class_only=False)
 12502 1506511086.73717: Loading FilterModule 'mathstuff' from /usr/lib/python2.7/site-packages/ansible/plugins/filter/mathstuff.py (found_in_cache=True, class_only=False)
 12502 1506511086.73765: Loading TestModule 'core' from /usr/lib/python2.7/site-packages/ansible/plugins/test/core.py (found_in_cache=True, class_only=False)
 12502 1506511086.73783: Loading TestModule 'files' from /usr/lib/python2.7/site-packages/ansible/plugins/test/files.py (found_in_cache=True, class_only=False)
 12502 1506511086.73799: Loading TestModule 'mathstuff' from /usr/lib/python2.7/site-packages/ansible/plugins/test/mathstuff.py (found_in_cache=True, class_only=False)
 12502 1506511086.73921: done with get_vars()

PLAY [all:!ignored] **********************************************************************************************************************************************************************************************************************************
 12502 1506511086.81512: Loading StrategyModule 'linear' from /usr/lib/python2.7/site-packages/ansible/plugins/strategy/linear.py
 12502 1506511086.82241: getting the remaining hosts for this loop
 12502 1506511086.82263: done getting the remaining hosts for this loop
 12502 1506511086.82282: building list of next tasks for hosts
 12502 1506511086.82297: getting the next task for host in-terminal01.prod.domain.tld
 12502 1506511086.82315: done getting next task for host in-terminal01.prod.domain.tld
 12502 1506511086.82330:  ^ task is: TASK: Gathering Facts
 12502 1506511086.82345:  ^ state is: HOST STATE: block=0, task=0, rescue=0, always=0, run_state=ITERATING_SETUP, fail_state=FAILED_NONE, pending_setup=True, tasks child state? (None), rescue child state? (None), always child state? (None), did rescue? False, did start at task? False
 12502 1506511086.82360: done building task lists
 12502 1506511086.82376: counting tasks in each state of execution
 12502 1506511086.82390: done counting tasks in each state of execution:
        num_setups: 1
        num_tasks: 0
        num_rescue: 0
        num_always: 0
 12502 1506511086.82405: advancing hosts in ITERATING_SETUP
 12502 1506511086.82418: starting to advance hosts
 12502 1506511086.82432: getting the next task for host in-terminal01.prod.domain.tld
 12502 1506511086.82448: done getting next task for host in-terminal01.prod.domain.tld
 12502 1506511086.82462:  ^ task is: TASK: Gathering Facts
 12502 1506511086.82478:  ^ state is: HOST STATE: block=0, task=0, rescue=0, always=0, run_state=ITERATING_SETUP, fail_state=FAILED_NONE, pending_setup=True, tasks child state? (None), rescue child state? (None), always child state? (None), did rescue? False, did start at task? False
 12502 1506511086.82492: done advancing hosts to next task
 12502 1506511086.83078: getting variables
 12502 1506511086.83098: in VariableManager get_vars()
 12502 1506511086.83157: Loading FilterModule 'core' from /usr/lib/python2.7/site-packages/ansible/plugins/filter/core.py (found_in_cache=True, class_only=False)
 12502 1506511086.83176: Loading FilterModule 'ipaddr' from /usr/lib/python2.7/site-packages/ansible/plugins/filter/ipaddr.py (found_in_cache=True, class_only=False)
 12502 1506511086.83193: Loading FilterModule 'json_query' from /usr/lib/python2.7/site-packages/ansible/plugins/filter/json_query.py (found_in_cache=True, class_only=False)
 12502 1506511086.83209: Loading FilterModule 'mathstuff' from /usr/lib/python2.7/site-packages/ansible/plugins/filter/mathstuff.py (found_in_cache=True, class_only=False)
 12502 1506511086.83247: Loading TestModule 'core' from /usr/lib/python2.7/site-packages/ansible/plugins/test/core.py (found_in_cache=True, class_only=False)
 12502 1506511086.83263: Loading TestModule 'files' from /usr/lib/python2.7/site-packages/ansible/plugins/test/files.py (found_in_cache=True, class_only=False)
 12502 1506511086.83281: Loading TestModule 'mathstuff' from /usr/lib/python2.7/site-packages/ansible/plugins/test/mathstuff.py (found_in_cache=True, class_only=False)
 12502 1506511086.83394: done with get_vars()
 12502 1506511086.83421: done getting variables
 12502 1506511086.83438: sending task start callback, copying the task so we can template it temporarily
 12502 1506511086.83453: done copying, going to template now
 12502 1506511086.83470: done templating
 12502 1506511086.83484: here goes the callback...

TASK [Gathering Facts] ********************************************************************************************************************************************************************************************************************************************************
 12502 1506511086.83515: sending task start callback
 12502 1506511086.83530: entering _queue_task() for in-terminal01.prod.domain.tld/setup
 12502 1506511086.83694: worker is 1 (out of 1 available)
 12502 1506511086.83772: exiting _queue_task() for in-terminal01.prod.domain.tld/setup
 12502 1506511086.83840: done queuing things up, now waiting for results queue to drain
 12502 1506511086.83859: waiting for pending results...
 12511 1506511086.84236: running TaskExecutor() for in-terminal01.prod.domain.tld/TASK: Gathering Facts
 12511 1506511086.84316: in run()
 12511 1506511086.84379: calling self._execute()
 12511 1506511086.85119: Loading Connection 'ssh' from /usr/lib/python2.7/site-packages/ansible/plugins/connection/ssh.py
 12511 1506511086.85207: Loading ShellModule 'csh' from /usr/lib/python2.7/site-packages/ansible/plugins/shell/csh.py
 12511 1506511086.85264: Loading ShellModule 'fish' from /usr/lib/python2.7/site-packages/ansible/plugins/shell/fish.py
 12511 1506511086.85346: Loading ShellModule 'powershell' from /usr/lib/python2.7/site-packages/ansible/plugins/shell/powershell.py
 12511 1506511086.85381: Loading ShellModule 'sh' from /usr/lib/python2.7/site-packages/ansible/plugins/shell/sh.py
 12511 1506511086.85423: Loading ShellModule 'sh' from /usr/lib/python2.7/site-packages/ansible/plugins/shell/sh.py (found_in_cache=True, class_only=False)
 12511 1506511086.85479: Loading ActionModule 'normal' from /usr/lib/python2.7/site-packages/ansible/plugins/action/normal.py
 12511 1506511086.85500: starting attempt loop
 12511 1506511086.85516: running the handler
 12511 1506511086.85572: ANSIBALLZ: Using lock for setup
 12511 1506511086.85589: ANSIBALLZ: Acquiring lock
 12511 1506511086.85606: ANSIBALLZ: Lock acquired: 29697296
 12511 1506511086.85624: ANSIBALLZ: Creating module
 12511 1506511087.15592: ANSIBALLZ: Writing module
 12511 1506511087.15653: ANSIBALLZ: Renaming module
 12511 1506511087.15678: ANSIBALLZ: Done creating module
 12511 1506511087.15765: _low_level_execute_command(): starting
 12511 1506511087.15786: _low_level_execute_command(): executing: /bin/sh -c '/usr/bin/python && sleep 0'
 12511 1506511087.16363: Sending initial data
 12511 1506511089.69186: Sent initial data (103646 bytes)
 12511 1506511089.69246: stderr chunk (state=3):
>>>mux_client_request_session: session request failed: Session open refused by peer
ControlSocket /home/ansible/.ansible/cp/170b9dc5f6 already exists, disabling multiplexing
<<<

Here it hangs forever. I replaced this company's domain with "domain.tld"; the actual domain name is different.
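When a stale ControlSocket like the one in the error above is suspected, a hedged cleanup sketch (the directory is Ansible's default ControlPath location; placeholder-host is a dummy argument, since ssh -O only talks to the socket):

```shell
# Ansible's default SSH control-socket directory; adjust if yours differs.
CP_DIR="${HOME}/.ansible/cp"
mkdir -p "$CP_DIR"
for sock in "$CP_DIR"/*; do
  # Ask any live master on this socket to exit; harmless if none answers.
  [ -S "$sock" ] && ssh -O exit -o ControlPath="$sock" placeholder-host 2>/dev/null
done
# Remove whatever is left so the next run can multiplex again.
rm -f "$CP_DIR"/*
```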

@benapetr
Author

This is still a bug in Ansible, at least in the sense that Ansible should produce some verbose ERROR message letting the user know which host is causing the hang-up, so that the host can be removed from the inventory file without a complex investigation of what went wrong.

@harrytruman

harrytruman commented Nov 14, 2017

I am also running a large inventory (2400+ hosts), seeing similar errors, and going through the same lengthy troubleshooting in the hope of finding log verbosity that points me in a better direction.

ansible 2.4.1.0
python version = 2.7.5 (default, May  3 2017, 07:55:04) [GCC 4.8.5 20150623 (Red Hat 4.8.5-14)]

@jmighion
Member

I am seeing it get to the end of a playbook and then never make the futex and closing calls. From the strace of a job that shows the problem:

21:39:26 stat("/var/lib/awx/projects/<job>/host_vars", 0x7ffd4ce0dfd0) = -1 ENOENT (No such file or directory)
21:39:26 stat("/var/lib/awx/.cache/facts/<host>", 0x7ffd4ce0dae0) = -1 ENOENT (No such file or directory)
21:39:26 stat("/var/lib/awx/.cache/facts/<host>", 0x7ffd4ce0dcf0) = -1 ENOENT (No such file or directory)
21:39:26 select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
21:39:26 select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
... infinite Timeout

From a successful job:

21:39:39 stat("/var/lib/awx/projects/<job>/host_vars", 0x7ffd4ce0dfd0) = -1 ENOENT (No such file or directory)
21:39:39 stat("/var/lib/awx/.cache/facts/<host>", 0x7ffd4ce0dae0) = -1 ENOENT (No such file or directory)
21:39:39 stat("/var/lib/awx/.cache/facts/<host>", 0x7ffd4ce0dcf0) = -1 ENOENT (No such file or directory)
21:39:39 select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
21:39:39 select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
21:39:39 futex(0x7fb178001390, FUTEX_WAKE_PRIVATE, 1) = 1
21:39:39 futex(0x1e9d090, FUTEX_WAKE_PRIVATE, 1) = 1
21:39:39 futex(0x1e9d090, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
21:39:39 futex(0x1e9d090, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
21:39:39 futex(0x1e9d090, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
21:39:39 futex(0x1e9d090, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
21:39:39 futex(0x7f15690, FUTEX_WAKE_PRIVATE, 1) = 1
21:39:39 futex(0x1e9d090, FUTEX_WAKE_PRIVATE, 1) = 1
21:39:39 futex(0x711f9e0, FUTEX_WAIT_PRIVATE, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
21:39:39 close(7)                       = 0
... continues to close and wrap up

@EdwardIII

@benapetr / @jmighion how did you guys solve the underlying problem? I'm trying to figure out a workaround for when issues like this arise.

@benapetr
Author

There are multiple reasons why this can happen; I never said I found a solution. In theory you can run the playbook by hand in a loop on each host separately to figure out which one is causing problems, and then you need to investigate that host.

A good example of a problem that can cause this is a disconnected NFS mount, which would hang even the "df" command.
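Such stuck mounts can be hunted down with a hard deadline per filesystem (a sketch assuming Linux /proc/mounts, coreutils timeout, and mount points without spaces):

```shell
# Probe every mount; a dead NFS/SSHFS mount shows up as SUSPECT instead
# of hanging the shell the way a bare "df" would.
for mnt in $(awk '{print $2}' /proc/mounts); do
  timeout 5 df -P "$mnt" >/dev/null 2>&1 || echo "SUSPECT: $mnt"
done
```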

@jmighion
Member

Sorry, I don't remember finding a solution. I think ours was hanging on a device we were trying to connect to, but I honestly can't say. We're not hitting the problem anymore, though.

@leupo

leupo commented Jan 30, 2018

#30411 (comment)

Good example of problem that can cause this is disconnected NFS mount, which would hang even "df" command.

That was exactly my problem. No ansible.cfg parameter (neither timeout nor fact_caching_timeout) helped me interrupt the process. Thank you so much!
IMO this should result in an error/timeout on the control machine.

@ansibot ansibot added bug This issue/PR relates to a bug. and removed bug_report labels Mar 1, 2018
@strarsis

strarsis commented Mar 2, 2018

I had a similar issue: I set pipelining from True to False, ran the playbook (successfully after this change), and then set it back to True, after which subsequent playbook runs also worked correctly.

Edit: @benapetr: Thanks! This was actually the underlying issue. A folder is mounted over SSHFS through a reverse SSH tunnel (for pushing from the Ansible control machine). After manually connecting over SSH to the target machine and unmounting it (umount -l ...), the problem went away.

@jeroenflvr

It looks like I have a similar problem with Ansible and an NFS mount on the target: lsof, df, ansible... all hang.
Setting pipelining to false in ansible.cfg did not help.

@strarsis

strarsis commented Apr 23, 2018

@jeroenflvr: In my case the SSHFS mount hangs (also when invoking the mount command manually). I switched to a CIFS/Samba mount over the SSH reverse tunnel and now it works without hanging. This seems to be partly FUSE-related but not Ansible-related.

@benapetr
Author

It doesn't matter whether the hang on the target host is caused by FUSE, NFS or anything else; it's still an Ansible bug in the sense that one target host should not be able to lock up the whole play for all other hosts.

There must be some timeout or internal watchdog implemented that would kill the playbook on misbehaving and broken machines. One broken machine out of 5000 shouldn't break Ansible for all 5000 machines.
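Until Ansible grows such a watchdog, a crude external one can be sketched with coreutils timeout (the one-hour deadline and file names are placeholders; exit status 124 means the deadline fired):

```shell
# Kill the entire run if it exceeds the deadline, rather than letting
# one broken host stall it forever.
timeout 3600 ansible-playbook -i inventory site.yml
rc=$?
if [ "$rc" -eq 124 ]; then
  echo "watchdog killed the run"
fi
```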

@FlorianHeigl

@benapetr yes, this has to be tackled some day: not as a bug, but as a design issue. Such hangs are the big plague, and no single "fix" will resolve them for real. I have used Ansible occasionally since 2011, and IMO it will always be like this until someone finally accepts that something has to be done at a higher level.

@byumov

byumov commented Apr 26, 2018

Same issue for me.
I have six macOS hosts and I run:
ansible-playbook tcmacagents.yml -f 6
It hangs on the first step, "Gathering Facts", always on the same host (tcmacagent5):

ok: [tcmacagent1]
ok: [tcmacagent2]
ok: [tcmacagent6]
ok: [tcmacagent3]
ok: [tcmacagent4]


strace:

[pid 24072] wait4(24211, 0x7ffc608410a4, WNOHANG, NULL) = 0
[pid 24072] wait4(24211, 0x7ffc608410d4, WNOHANG, NULL) = 0
[pid 24072] select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
[pid 24072] wait4(24211, 0x7ffc608410a4, WNOHANG, NULL) = 0
[pid 24072] wait4(24211, 0x7ffc608410d4, WNOHANG, NULL) = 0
[pid 24072] select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
[pid 24072] wait4(24211, 0x7ffc608410a4, WNOHANG, NULL) = 0
[pid 24072] wait4(24211, 0x7ffc608410d4, WNOHANG, NULL) = 0
[pid 24072] select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
[pid 24072] wait4(24211, 0x7ffc608410a4, WNOHANG, NULL) = 0
[pid 24072] wait4(24211, 0x7ffc608410d4, WNOHANG, NULL) = 0
[pid 24072] select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
[pid 24072] wait4(24211, 0x7ffc608410a4, WNOHANG, NULL) = 0
[pid 24072] wait4(24211, 0x7ffc608410d4, WNOHANG, NULL) = 0
[pid 24072] select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)

How can I fix it? What is wrong with my tcmacagent5?

# ansible --version
ansible 2.5.0

@leupo

leupo commented Apr 26, 2018

If it is a mount and it is not usable (touch a file or something to check), remount it. In my particular situation I always reverted a VM to a snapshot for testing purposes, and the NFS handle didn't match within this VM, so it was in a stale state.

@jctanner jctanner added this to TODO in CEM May 20, 2018
@dentarg

dentarg commented May 23, 2018

We have also been experiencing Ansible hanging without any error message when running playbooks against 23 hosts with forks=25. We use Ansible 2.2.3.0, but have the same problem with Ansible 2.5.3. We run our playbooks from macOS 10.12.6, using Python 2.7.15. With forks=5 I don't see the problem. We also don't see the problem when using transport=paramiko (and forks=25).

@bkregmi

bkregmi commented Dec 20, 2021

This issue was reported five years ago and the problem is still as it was: some bad target host holding the play forever. It's a serious design flaw. At the very least it should give up on the bad host and continue the remaining tasks on the good hosts. It is very frustrating...

@bcoca
Member

bcoca commented Dec 21, 2021

@bkregmi for that behavior you can use the new timeout keyword, since it is very hard to programmatically distinguish between a very long-running task and a hung one.
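A sketch of that keyword in use (the command path is hypothetical, and the keyword requires Ansible 2.10 or later):

```yaml
- hosts: all
  tasks:
    - name: probe that might hang
      command: /usr/local/bin/health-check   # placeholder command
      timeout: 300   # task is failed after 300s instead of hanging the play
```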

@KeithTt

KeithTt commented Dec 22, 2021

@bkregmi for that behavior you can use the new timeout keyword, since it is very hard to programmatically distinguish between a very long running task and a hung one.

-T TIMEOUT, --timeout=TIMEOUT override the SSH timeout in seconds (default=10)

Do you mean this option of ansible-playbook? There is a default value of 10.

@bcoca
Member

bcoca commented Dec 22, 2021

@KeithTt no, that timeout only works if the problem is the connection itself; a hung process will not trip TCP or SSH timeouts.

See my previous comment for the 'keyword' timeout

#30411 (comment)

@KeithTt

KeithTt commented Dec 23, 2021

@KeithTt no, that timeout only works if the problem is the connection itself, a hung process will not break tcp nor ssh timeouts.

See my previous comment for the 'keyword' timeout

#30411 (comment)

I have checked the documentation; unfortunately the timeout keyword is not in the list for version 2.9, and on our jump server it is even 2.4.2.0...

@frank-dc

frank-dc commented May 6, 2022

Upgrade Ansible to 2.10 or later; it supports the "timeout" keyword.

@krejcir

krejcir commented May 24, 2022

This issue is still "alive". I've been playing with Ansible for a few weeks: a great tool, but it hangs almost every time. The "timeout" keyword helps avoid hanging forever, but the result for the "hanging" server is then a failure and I have to do hand work, which is not very helpful. I've tried setting the ssh params as suggested in this thread; however, it doesn't help. When this happens, there is an ssh process hanging and spinning (100% CPU, just one core); ps output below:
ssh -C -o ControlMaster=no -o ControlPersist=60s -o ServerAliveInterval=30 -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User="root" -o ConnectTimeout=10 -o ControlPath="/home/joedoe/.ansible/cp/7265f62947" -tt joedoeserver.somewhere /bin/sh -c '/usr/bin/python /root/.ansible/tmp/ansible-tmp-1653380215.1941621-7496-65793715540189/AnsiballZ_apt.py && sleep 0'

@krejcir

krejcir commented May 26, 2022

Just a note on my previous comment: it seems that this happens on Debian 9... (Ubuntu and Debian 10 look fine).

@abousselmi

abousselmi commented Jun 10, 2022

TL;DR: fixing sources.list resolved the problem.

I had the same issue. I tried tweaking the ssh connection options as suggested above, but with no luck. I checked the sshd configuration on the faulty server and compared it to another one: no difference. Then I checked the ssh server package on both servers; it was openssh-server with the same version. OK...

Then I simply tried an apt-get install of something, and there we go, the problem popped up: it was the apt install that was hanging. It turned out that there was a cdrom entry in sources.list, which made apt wait for the user to mount the disk in /media/cdrom/.

@homerl's comment helped me, but since it's a small and underrated one, it could go unnoticed.

@TumenbayarSh

I deleted the Ansible tmp folder (rm -rf ~/.ansible/tmp/*) and then it worked for me.

@krejvl

krejvl commented Jun 22, 2022

@abousselmi I have done this, as well as following @TumenbayarSh's advice, thank you. However, nothing helps :-(. It's interesting (and annoying), because it's just one server (and the only one with Debian 9) out of many, and I can handle it manually, but I would like to know what's wrong...

@kompjuteras

I have this problem when the target server is under an enormous load: if I run a playbook against 5000 servers and there is one server with a load > 100, the playbook hangs indefinitely, mostly on gathering facts. I didn't find any solution better than the dumb "remove that server from the inventory and try again".
Is there any solution for this?
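One possible workaround sketch for hosts like this: skip implicit fact gathering and run setup as an explicit task guarded by the timeout keyword (Ansible 2.10+), so an overloaded host fails fast instead of stalling the play. All values here are placeholders:

```yaml
- hosts: all
  gather_facts: no
  tasks:
    - name: gather facts with a hard deadline
      setup:
      timeout: 120        # requires Ansible 2.10+
      ignore_errors: yes  # let the rest of the play continue without this host
```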

@bcoca
Member

bcoca commented Jul 13, 2022

@Joe-Heffer-Shef

Could this be caused by an input prompt (e.g. a password request)?

@Rakesh2904

Hello All,

I have an issue; any help/suggestions would be really helpful. I will just try to summarize the issue that I have in hand; please let me know if anything is not clear or more details are needed.

I have a playbook which scans an entire server (using the UAC binary); the tasks involved are as follows.

  1. Create a workspace - based on the max available space on the drive (Windows) or filesystem (Linux)
  2. Download the binary from a centralized repo - No issues in this step on any of the servers that were tested
  3. Unzip the binary - No issues in this step on any of the servers that were tested
  4. Execute the binary - This is where I have an issue on a few of the Linux servers (approx. 15-20% of them). The binary execution completes on the server, but Ansible does not move forward to the next task. When we manually execute the same binary it works like a charm (so I claim the binary itself is not the problem), and since the playbook works on most of the servers (so I claim there is no issue in the playbook), I suspect this has something to do with the server itself, which I am unable to figure out. So on the problematic servers it does not move to the next step even though the execution has actually completed.
  5. Download the azcopy binary from the same centralized repo - No issues on any of the servers that were tested, except the problematic ones, since Ansible never reaches this task.
  6. Unzip the azcopy binary - Same as above.
  7. Upload the collected logs/artefacts from the UAC binary scan to BLOB storage - Same as above.
  8. Clean up the workspace - Same as above.

Troubleshooting that I have done on this are as below.

  1. Tried increasing the disk space - It worked (this was done since most of the problematic servers had less than 50G of available space on the FS used as the workspace). But we can't conclude this is the cause, because the issue still occurred when I used a 2T NAS volume.
  2. The binary scans the entire server, and one particular scan (rpm -Va) is the one that takes time. When we disabled it and scanned a problematic server, everything worked like a charm; yet comparing a problematic server with a working one, there is not much difference in the rpm packages. This scan is important, though, and has to stay in the binary. (In my opinion the problem lies here, but I am unsure how to fix it.)
  3. There are few other small things that I've tested which I don't think is worth mentioning here.

PS: The entire scan takes around 30 minutes to 2 hours on the servers we have tested, so we have not set any sort of timeout.

Any suggestions would be of real great help.
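One approach worth trying for a scan that legitimately runs 30 minutes to 2 hours: run the binary asynchronously with an explicit ceiling and poll for completion, so a wedged host eventually fails visibly instead of blocking the play forever. A hedged sketch, with an illustrative binary path and timings:

```yaml
- name: Run the UAC scan in the background (path is illustrative)
  ansible.builtin.command: /opt/uac/uac -p full /opt/uac/results
  async: 10800              # hard ceiling: give up on the job after 3 hours
  poll: 0                   # do not block here; collect status below
  register: uac_job

- name: Poll for completion once a minute, up to 3 hours
  ansible.builtin.async_status:
    jid: "{{ uac_job.ansible_job_id }}"
  register: uac_result
  until: uac_result.finished
  retries: 180
  delay: 60
```

With poll: 0 the SSH connection is not held open for the duration of the scan, which also sidesteps the "execution finished on the host but Ansible never noticed" failure mode.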

@BillKanawyer
Copy link

BillKanawyer commented Sep 14, 2022

I am also chasing a variant of the hanging problem, triggered by a with_nested lineinfile edit of /etc/group that includes a complex regexp with back references.

The task in question has been in use for about four years now across a wide range of OSes but has recently started hanging on every host. Clearly, something had changed in Ansible, Python, or another supporting module.

Being an old Unix developer, my first thought after "test simplification" was to chase the actual hang on the target host.

As all of you likely know, when ansible-playbook is run it copies a bundle of Python code to ~/.ansible/tmp/ansible-tmp-<UUID string> on the target host. In my case the bundle is named AnsiballZ_lineinfile.py.

As per https://docs.ansible.com/ansible/latest/dev_guide/debugging.html, it is possible to unpack this bundle (which contains some compressed python code) using the command:

python AnsiballZ_lineinfile.py explode

The unpacked code is placed into a "debug_dir" directory alongside AnsiballZ_lineinfile.py. The unpacked code can be modified and then run using the command:

python AnsiballZ_lineinfile.py execute

Obviously this execution hangs as well, but at least we have recourse to strace:

strace -o /tmp/output.txt -f -r -s 128 python AnsiballZ_lineinfile.py execute

(Note: additional strace arguments like "-r -s 128" increase the amount of useful output.)

This at least provides some system call insights into what the process is doing just before it hangs.

The following is a selected bit from one blocked-process debug session:

4902       0.000023 openat(AT_FDCWD, "/etc/group", O_RDONLY|O_CLOEXEC) = 4
4902       0.000019 newfstatat(4, "", {st_mode=S_IFREG|0644, st_size=1236, ...}, AT_EMPTY_PATH) = 0
4902       0.000020 ioctl(4, TCGETS, 0x7ffcb9818ac0) = -1 ENOTTY (Inappropriate ioctl for device)
4902       0.000017 lseek(4, 0, SEEK_CUR) = 0
4902       0.000017 read(4, "root:x:0:\ndaemon:x:1:\nbin:x:2:\nsys:x:3:\nadm:x:4:root,jeff,bill\ntty:x:5:\ndisk:x:6:\nlp:x"..., 4096) = 1236
4902       0.000021 read(4, "", 4096)   = 0
4902       0.000016 close(4)            = 0

After the close, the process appears to hang, but in my particular instance, given sufficient time, it returns and completes execution:

4902     **456.133294** newfstatat(AT_FDCWD, "/etc/group", {st_mode=S_IFREG|0644, st_size=1236, ...}, 0) = 0
4902       0.001059 write(1, "\n{\"changed\": false, \"msg\": \"\", \"backup\": \"\", \"diff\": [{\"before\": \"\", \"after\": \"\", \"before_header\": \"/etc/group (content)\", \"afte"..., 783) = 783
4902       0.000201 write(1, "\n", 1)   = 1
4902       0.000148 newfstatat(AT_FDCWD, "/tmp/ansible_ansible.builtin.lineinfile_payload_vy8b7kp2", {st_mode=S_IFDIR|0700, st_size=4096, ...}, AT_SYMLINK_NOFOLLOW) = 0
4902       0.000032 openat(AT_FDCWD, "/tmp/ansible_ansible.builtin.lineinfile_payload_vy8b7kp2", O_RDONLY|O_CLOEXEC) = 4
...
4902       0.000011 close(5)            = 0
4902       0.000020 unlinkat(4, "ansible_ansible.builtin.lineinfile_payload.zip", 0) = 0
4902       0.000045 close(4)            = 0
4902       0.000012 rmdir("/tmp/ansible_ansible.builtin.lineinfile_payload_vy8b7kp2") = 0
...
4902       0.000199 exit_group(0)       = ?
4902       0.000525 +++ exited with 0 +++

Note the comparatively huge relative time between when the process finishes reading /etc/group and when the regular expression determines that no changes are required. This points at potential inefficiencies in the regexp used with ansible.builtin.lineinfile.

This is not a substitute for Ansible failing to support a task timeout value, something badly needed.

But at least someone facing what appears to be a block can obtain some clues as to possible causes.

@bcoca
Copy link
Member

bcoca commented Sep 14, 2022

@BillKanawyer
|This is not a substitute for Ansible failing to support a task timeout value, something badly needed.

like the existing timeout keyword?
https://docs.ansible.com/ansible/latest/reference_appendices/playbooks_keywords.html#task

Which i had commented about here previously #30411 (comment)
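For readers landing here, a minimal sketch of that task-level keyword (value in seconds; the command is just a stand-in for any task that may wedge):

```yaml
- name: Kill this task if it runs longer than 10 seconds
  ansible.builtin.command: sleep 300
  timeout: 10
```

If the task exceeds the limit it fails with a timeout error instead of hanging the play, so the usual failed/rescue handling applies.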

@BillKanawyer
Copy link

@BillKanawyer |This is not a substitute for Ansible failing to support a task timeout value, something badly needed.

like the existing timeout keyword? https://docs.ansible.com/ansible/latest/reference_appendices/playbooks_keywords.html#task

Which i had commented about here previously #30411 (comment)

Thank you; I respectfully withdraw that observation.

@bcoca
Copy link
Member

bcoca commented Sep 14, 2022

@BillKanawyer you have reminded me of an idea, something we have sometimes done internally: appending 'strace' to the remote commands to give the user a way to narrow down the issue. This might be worth adding as an 'execution mode' that does this without making you jump through 8-10 steps.

@BillKanawyer
Copy link

Having an 'execution mode' option that adds strace functionality would indeed be useful.

I have also been looking into using python trace:

# python3 -m trace --ignore-dir=/usr/lib --trace AnsiballZ_lineinfile.py execute > /tmp/info.txt 2> /tmp/error.txt

This approach suggests my problem may not be the fault of regexp but rather a missing package:

# egrep -n base64 /tmp/info.txt /tmp/error.txt
/tmp/info.txt:24:AnsiballZ_lineinfile.py(17):     import base64
/tmp/error.txt:15:    import base64
/tmp/error.txt:16:ModuleNotFoundError: No module named 'base64'
/tmp/error.txt:38:    import base64
/tmp/error.txt:39:ModuleNotFoundError: No module named 'base64'

Note that "import base64" is the last line in the stdout file "info.txt"

Using locate seems to confirm this. So this approach is generating concrete debugging results.

Why AnsiballZ_lineinfile.py isn't bailing out on import failure and why the base64 module is missing are open questions.

@bcoca
Copy link
Member

bcoca commented Sep 14, 2022

It does not import it directly; that happens in the common module_utils code, which does not guard the base64 import since it is part of core Python and 'assumed' to be present.

Any import that is not part of core Python should be guarded, but core libraries are not required to be. In any case, a missing core library should result in a quick error, not a hang.
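For illustration, the guarded-import convention looks roughly like this; the third-party library name is a made-up example, and this is not the actual module_utils source:

```python
# Third-party imports are guarded so a module can report a clean
# "missing dependency" error instead of dying with a traceback.
try:
    import requests  # not part of core Python: guard it
    HAS_REQUESTS = True
except ImportError:
    HAS_REQUESTS = False

# Core-library imports are left unguarded: if base64 really is
# missing, this raises ImportError immediately rather than hanging.
import base64

print(base64.b64encode(b"ok").decode())  # → b2s=
```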

@BillKanawyer
Copy link

I thought it might be useful to some to leave a few concluding remarks now that I have reached a more complete understanding of my issue.

As mentioned before, it is possible to "explode" the AnsiballZ_lineinfile.py bundle, instrument the code, and then "execute" the result. To verify it was my regular expression that was causing the problem, I bracketed the re.compile & re.match calls with print statements.

Verifying that my real issue was with the underlying python3 re code and not with Ansible lineinfile allowed me to redirect my efforts to understanding why the supplied regular expression's match time was increasing exponentially with each character added to the string being matched.
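A minimal demonstration of that failure mode, using a textbook catastrophic-backtracking pattern (not my actual regexp): a failing match against nested quantifiers takes time roughly exponential in the input length.

```python
import re
import time

# Nested quantifiers: there are exponentially many ways to split a run
# of 'a's between the inner and outer '+', and the backtracking engine
# tries them all before the match finally fails.
pattern = re.compile(r"^(a+)+b$")

for n in (10, 16, 20):
    text = "a" * n + "c"  # the trailing 'c' guarantees the match fails
    start = time.perf_counter()
    assert pattern.match(text) is None
    print(f"n={n:2d}: {time.perf_counter() - start:.4f}s")
```

Each added 'a' roughly doubles the failure time, which is exactly the "exponential with each character" behaviour described above.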

Thank you @bcoca for providing useful feedback.

@Wehzie
Copy link

Wehzie commented Dec 23, 2022

I have a virt-manager guest running Ubuntu 22.04 LTS. Initially, the network configuration was set to NAT mode and my playbook worked fine. After changing the network configuration to Open mode, the playbook hangs on the apt module. When I comment out the apt module, while keeping the Open network mode, the playbook runs fine.

Changing back and forth between network modes yields the same outcome. Why would the apt module in particular be problematic?

Edit:
Host OS is Ubuntu 22.10, ansible installed via APT. Had the same hang in another context. Was able to fix the issue by removing the package with apt purge and reinstalling.

@mutech
Copy link

mutech commented Feb 28, 2023

Interesting to read this... I keep having variations of this issue with different target hosts, different Ansible controllers, different playbooks, at different tasks, and at different targets in the same playbook when I interrupt the play and restart it. This happens whether I am working with remote targets or targets on the local network. There is no pattern I can discern, other than that more complex playbooks tend to hang more often and more simultaneous targets create more hangs, consistently.

It happens just as often with stock Ansible from Ubuntu packages as with Ansible installed from the Ansible repo or via pip, and the same in the office (a completely different network setup) as at home. On the other end of the connection I see no running processes and no pending connections; it looks very much like the last action terminated cleanly. And again, this happens with any kind of command. I have not the slightest idea what the cause could be, because the only constant is the observable effect.

There are days when I have no such problems, but I have just concluded that this is how Ansible works: Windows 3.1 style problem solving; rinse, reboot, and repeat until it's working or until the day is over and it's somebody else's problem.

I wish this happened before I invested months learning ansible and writing scripts for everything. This is really frustrating.

@kharnden-amd
Copy link

I have also tried several iterations of -T and connect_timeout, and I still can't force a timeout for a particular part of my playbook when a host has an issue with load or hangs. It's very frustrating.

@esmacgil
Copy link

The project that I work on seems to be having this problem, or a problem like it. When I see it "live" I check ps -ef to find the stalled command and then try to execute a similar command myself. The command is usually in one of these forms:

ssh <ssh options> <target IP> /bin/sh -c 'cat <some file path> && sleep 0'
ssh <ssh options> <target IP> /bin/sh -c 'python3 <some AnsiballZ path> && sleep 0'

so I tend to try to execute the same (or similar) command. Every time I do these manual commands they stall, so this makes me wonder how it ever works for Ansible.

That said, the ssh part does connect, and if I start typing and follow with a \n, then most of the time I get an "echo" of what I typed, at least for the cat command. This makes me think that, for whatever reason, the "file path" that was meant to be passed to cat is not being passed, and cat is instead in "read from stdin" mode.

When I do the python3 command, I actually get the interactive mode of Python3 and I can type Python statements and they get executed. So again, the "command" appears not to be getting its argument and is therefore going into "read from stdin" mode.

Further, when I modify the command to be like:

ssh <ssh options> <target IP> "/bin/sh -c 'cat <some file path> && sleep 0'"
ssh <ssh options> <target IP> "/bin/sh -c 'python3 <some AnsiballZ path> && sleep 0'"

Then I never get this "read from stdin" mode and the behavior that I expect is the behavior that I get.

I can also get the desired behavior if I do commands like:

ssh <ssh options> <target IP> /bin/sh -c 'echo && cat <some file path> && sleep 0'
ssh <ssh options> <target IP> /bin/sh -c 'echo && python3 <some AnsiballZ path> && sleep 0'

Though this produces a blank line in the output, so it is not the best solution.
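The effect can be reproduced locally without ssh. ssh joins its remote-command arguments with spaces and hands the resulting string to the remote shell, so with the unquoted form the -c command string can end up being just `cat` (or `python3`), while the path becomes the inner shell's $0 and the command waits on stdin. A quick illustration:

```shell
# Unquoted: "hello" becomes $0 of the inner shell, not an argument of
# echo, so this prints an empty line.
sh -c echo hello

# Quoted: the whole string is the -c command, so this prints "hello".
sh -c 'echo hello'
```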

Has anyone else noticed this behavior?
Am I observing some other problem?

@mutech
Copy link

mutech commented Apr 11, 2023

The command is usually in the forms:

ssh <ssh options> <target IP> /bin/sh -c 'cat <some file path> && sleep 0'
ssh <ssh options> <target IP> /bin/sh -c 'python3 <some AnsiballZ path> && sleep 0'

I see these processes for tasks that are still alive. After some time, they are gone and the controller keeps hanging.

If these processes keep running, I guess you have a different problem, at least from what I observe. When your scripts consistently hang for specific tasks (which is not what I see on my side), then it looks more like an issue with the task itself. Did you try to strace the python process to see what it's doing?

Has anyone else noticed this behavior? Am I observing some other problem?

I tend to believe that this is another problem. What happens if you interrupt the playbook and restart it? Does it hang at the same task consistently? For me, this is definitely not the case; I see randomish behavior. A slow network definitely increases the probability of such hangs, as do more target hosts and more complex scripts. Then I have periods of days or even weeks without problems.

@priyankaprabhune26
Copy link

I am also facing the same issue on RHEL 8 when I execute the command through Jenkins, but it works when run from the VM.

It's getting hung on AnsiballZ_command.py:
44699 1686567241.27096: _low_level_execute_command(): executing: /bin/sh -c '/usr/bin/python /home//.ansible/tmp/ansible-tmp-1686567240.16-44699-236344823441760/AnsiballZ_command.py && sleep 0'

@techie879
Copy link

techie879 commented Jun 12, 2023 via email
