
get_mount_facts timeout is very small and not ignored as expected #10779

Closed
rosowiecki opened this issue Apr 20, 2015 · 27 comments
Labels
bug This issue/PR relates to a bug.

Comments

@rosowiecki
Contributor

rosowiecki commented Apr 20, 2015

I've analysed #10746 more deeply and found that my first suspicion was wrong: the timeout error was not really caused by the lack of lsblk. Moreover, it was not ignored as intended.

The reason is that the UUID gathering added by #10055 broke timeout handling because of code in Python's subprocess module. The _execute_child method contains an except clause that catches all exceptions; that's why the TimeoutError is not caught and ignored in Facts.populate.

failed: [SOME_HOST] => {"cmd": "/bin/lsblk -ln --output UUID /dev/sdi2", "failed": true, "rc": 257}
msg: Traceback (most recent call last):
  File "<stdin>", line 1650, in run_command
  File "/usr/lib64/python2.4/subprocess.py", line 550, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.4/subprocess.py", line 988, in _execute_child
    data = os.read(errpipe_read, 1048576) # Exceptions limited to 1 MB
  File "<stdin>", line 1796, in _handle_timeout
TimeoutError: Timer expired
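The swallowing described above can be sketched in miniature (a loose analogy, not subprocess's actual code: `child_spawner` here stands in for `_execute_child`'s broad exception handling, and the decorator is a simplified version of the one in facts.py):

```python
import signal
import time

class TimeoutError(Exception):
    # facts.py defines its own TimeoutError class
    pass

def timeout(seconds=10, error_message="Timer expired"):
    """Simplified stand-in for the SIGALRM-based decorator in facts.py."""
    def decorator(func):
        def _handle_timeout(signum, frame):
            raise TimeoutError(error_message)
        def wrapper(*args, **kwargs):
            signal.signal(signal.SIGALRM, _handle_timeout)
            signal.alarm(seconds)
            try:
                return func(*args, **kwargs)
            finally:
                signal.alarm(0)  # always cancel the pending alarm
        return wrapper
    return decorator

def child_spawner():
    """Stand-in for subprocess._execute_child: a broad except clause
    catches whatever fires inside it -- including the TimeoutError raised
    by the SIGALRM handler -- and re-raises it as a different type."""
    try:
        time.sleep(5)  # the alarm fires while we are blocked here
    except Exception as exc:
        raise RuntimeError("converted: %r" % exc)

@timeout(seconds=1)
def get_mount_facts_like():
    child_spawner()
```

An outer `except TimeoutError` around `get_mount_facts_like()` (as around get_mount_facts in Facts.populate) never fires, because a different exception type comes out of the intermediate frame.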

Suggested solution: get the UUIDs without calling lsblk through subprocess.

Margin note: the 10s timeout was enough before, but with lsblk, get_mount_facts on my system took as long as 15 seconds even with no network filesystems mounted.

@rosowiecki rosowiecki changed the title get_mount_facts timeout is very small and not ignored as intended get_mount_facts timeout is very small and not ignored as expected Apr 20, 2015
@bcoca
Member

bcoca commented Apr 20, 2015

I disagree that 10s is too little for listing UUIDs; on the many systems I've tested, lsblk takes less than 1s. Keeping this open to investigate the issue of the timeouts not being handled properly.

@benjixx
Contributor

benjixx commented May 15, 2015

I ran into the same issue some days ago. In my particular case I have an install-once-and-throw-away test environment that uses Docker containers for the underlying system.

It turns out that in this case the problem was already known (see moby/moby#12192) and that upgrading the kernel of my docker host system resolved the issue.

@epyonavenger

In our production environment we see this come up fairly often, since the servers are quite busy and not particularly high-performance. It's especially common while RAID is rebuilding. Even if the default stays the same, it would be nice if we could override the timeout, or ignore the failure when it isn't pertinent.

@infernix

infernix commented Aug 3, 2015

We are seeing this happen a lot on CEPH nodes with LSI SAS controllers. It's extremely random.

failed: [obj-005] => {"cmd": "/bin/lsblk -ln --output UUID /dev/sdb1", "failed": true, "rc": 257}
msg: Traceback (most recent call last):
File "<stdin>", line 1671, in run_command
File "<stdin>", line 1801, in _handle_timeout
TimeoutError: Timer expired

When running it again the timeout may or may not occur.

@bcoca although ideally blkid should return immediately, there are cases where the timeout may need to be increased due to high system IO load (CEPH rebuilds come to mind). Please at least consider making it a variable so that it can be adjusted in ansible.cfg.

@bcoca
Member

bcoca commented Aug 3, 2015

Yes, I've been thinking of adding a timeout option to setup, defaulting to the current value.
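At the decorator level, such an option could look roughly like this (a sketch only: `FACTS_TIMEOUT` is a made-up environment variable for illustration, not an actual Ansible setting; the fix that eventually shipped exposed a `gather_timeout` parameter instead):

```python
import os
import signal

class TimeoutError(Exception):
    pass

def timeout(seconds=None, error_message="Timer expired"):
    """Variant of the facts.py decorator whose default can be overridden
    at run time instead of being hardcoded at definition time."""
    def decorator(func):
        def _handle_timeout(signum, frame):
            raise TimeoutError(error_message)
        def wrapper(*args, **kwargs):
            # Resolve the limit at call time, so a caller (or the
            # environment) can override the decorator's default.
            limit = seconds or int(os.environ.get('FACTS_TIMEOUT', '10'))
            signal.signal(signal.SIGALRM, _handle_timeout)
            signal.alarm(limit)
            try:
                return func(*args, **kwargs)
            finally:
                signal.alarm(0)
        return wrapper
    return decorator
```

Because the limit is read inside the wrapper rather than captured when the decorator is applied, the module no longer has to be patched to change it.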

@vinsh

vinsh commented Aug 12, 2015

Any suggested workarounds to this for now? We hit this very often on swift nodes.

@infernix

@vinsh Nothing other than patching Ansible; for 1.9.1-1 @ b47d1d7 (change the 20 to whatever time you need):

--- a/lib/ansible/module_utils/facts.py
+++ b/lib/ansible/module_utils/facts.py
@@ -53,7 +53,7 @@ except ImportError:
 class TimeoutError(Exception):
     pass

-def timeout(seconds=10, error_message="Timer expired"):
+def timeout(seconds=20, error_message="Timer expired"):
     def decorator(func):
         def _handle_timeout(signum, frame):
             raise TimeoutError(error_message)

@vinsh

vinsh commented Aug 12, 2015

Thank you for the clarification, we might branch/patch in house with your suggestion for now.

@rosowiecki
Contributor Author

Update: the reason we hit this timeout was that "nofile" in limits.conf is set to a very large value on our servers (1000000), and Python's subprocess module was busy closing all those file descriptors on every command. We introduced a workaround: ansible_python_interpreter points to a wrapper script that calls "ulimit -n 1024" before starting Python. Regardless of the timeout itself, ignoring it still doesn't work because of the exception handling in the subprocess module.

@mbookman

Was there a conclusion on this issue? I have observed the same issue.

  • Is the answer that on many systems 10 seconds is indeed too tight a timeout, and so it should simply be increased in master?
  • Is the answer that many systems are improperly configured and should drop their nofile limit?

For context, I am using Elasticluster, which installs Ansible 1.9.1 as a dependency:

elasticluster/lib/python2.7/site-packages/ansible-1.9.1-py2.7.egg

Is there a workaround here, or does Elasticluster need to provide a custom build of Ansible?

@jimi-c jimi-c removed the P3 label Dec 7, 2015
@insom

insom commented Feb 4, 2016

I'd just like to add that busy hosts with hundreds of LVM logical volumes are also a scenario where I'm seeing intermittent failures caused by the ten-second timeout. I'm now running a fork with a larger value. I can see that, because it's a decorator, making this timeout configurable may be a pain, but it would be really useful for certain use cases. (I can also see that for some people 10 seconds is a really long time, so I'm not in favour of just making it bigger.)

@thenewwazoo

This has also bitten me. A 10s timeout is generous on dedicated hardware, but when we provision hundreds of VMs in shared infrastructure that are busy doing disk-intensive operations, 10s is not nearly long enough. This very much should be configurable.

@oleyka

oleyka commented Jun 24, 2016

@thenewwazoo it is configurable. "timeout = " in ansible.cfg

@jctanner
Contributor

jctanner commented Jul 1, 2016

@oleyka that would definitely alter the timeout for the ssh connection, but this issue is related to the timeout decorator inside facts.py which has no hooks into any configuration.

@alikins
Contributor

alikins commented Jul 5, 2016

For folks with storage configs that are causing timeouts:

  • are the ansible_mounts facts being used?
    If so, how. In particular, is the device uuid used?

    My first thought is 'stop running lsblk just to add the device uuids', so I'm looking for evidence that this is a used feature.

  • are you using other tools that handle this better?

    Everything seems to eventually end up reading /sys/block/*, so I'm curious whether there is a better tool for this. Ideally something that actually completes quickly rather than something that merely doesn't time out, but either way ;->

  • would filtering the types of block devices and mount points that are collected be useful? Any easy targets for skipping by default?

  • Any ideas on which parts of storage info collection are troublesome? findmnt? lsblk? All of the above?

  • Anyone have example reports from troublesome systems they could share?

    For example:

    • an 'ansible -m setup' output (or interesting subsets).
      • lsblk output ('lsblk -ln --output UUID')
      • ls -laR /sys/block/

    (any variation of those or other reports is useful)

@insom

insom commented Jul 5, 2016

@alikins I think, for me, part of the problem is that this fact is O(n) -- it forks a command for every entry in /etc/mtab, but has a constant timeout:

mtab = get_file_content('/etc/mtab', '')

On one of my troublesome machines, I have 96 entries in /etc/mtab, with 89 LVM volumes and moderately high load. Forking around 90 processes that hit the disks, one after another, on a loaded machine regularly exceeds 10 seconds. I realise my case is probably verging on pathological, so I don't expect the defaults to change, but being able to tune this would be great.

(Also I personally don't even use these facts for anything, but I am using other facts, so I can't just run with gather_facts: False)

My current workaround is to execute my playbook several times until the gathering of facts succeeds.

@jctanner
Contributor

jctanner commented Jul 8, 2016

Hi!

A change has been applied for this ticket, which should resolve this item for you.

ansible/ansible-modules-core#4093
#16551

If you believe this ticket is not resolved, or have further questions, please let us know by stopping by one of the two mailing lists, as appropriate.

Because this project is very active, we're unlikely to see comments on closed tickets, though the mailing list is a great way to get involved or discuss this one further.

Thanks!

@jctanner jctanner closed this as completed Jul 8, 2016
@bwbeach

bwbeach commented Aug 10, 2016

@alikins: To answer your questions on behalf of Backblaze...

Our storage pods have 45 or 60 drives, each mounted as a separate ext4 file system. We're seeing frequent timeouts on lsblk with the 10 second timeout built into 2.1. The new gather_timeout setting would fix the error for us, but would mean that we have to wait for lsblk on all the drives, which is annoying.

We would prefer just to disable gathering the uuids of the drives so things run faster.

@alikins alikins reopened this Aug 10, 2016
@alikins
Contributor

alikins commented Aug 10, 2016

Based on #10779 (comment) I'm working on a patch that should improve this.

The gist of the patch is to replace the per-mtab-entry 'lsblk' call with a single lsblk call that collects all of the info, then map the results onto the mtab entries. Hopefully that will be faster (unless the single lsblk call takes just as long...).
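The shape of that change can be sketched as follows (function names and the sample output format are mine, not the patch's; the lsblk flags match the command quoted later in this thread for #18327):

```python
import subprocess

LSBLK_CMD = ['lsblk', '--list', '--noheadings', '--paths',
             '--output', 'NAME,UUID']

def parse_lsblk_uuids(output):
    """Parse 'NAME UUID' lines from a single lsblk run into a dict.
    Devices with no filesystem UUID print no second column; skip them."""
    uuids = {}
    for line in output.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[1].strip():
            uuids[parts[0]] = parts[1].strip()
    return uuids

def mount_facts(mtab_entries, uuids):
    """Annotate (device, mountpoint) pairs from one lsblk pass,
    instead of forking lsblk once per mtab entry."""
    return [{'device': dev, 'mount': mnt, 'uuid': uuids.get(dev, 'N/A')}
            for dev, mnt in mtab_entries]

def gather(run=subprocess.check_output):
    """One fork total, however many mounts there are."""
    return parse_lsblk_uuids(run(LSBLK_CMD, universal_newlines=True))
```

This turns the O(n)-forks pattern @insom described into a single fork whose cost is independent of the number of mtab entries.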

@bwbeach

bwbeach commented Aug 10, 2016

This is the error I get when things time out:

fatal: [pod-000-1000-12]: FAILED! => {"changed": false, "cmd": "/bin/lsblk -ln --output UUID /dev/mapper/vg_0037-lv_0037", "failed": true, "msg": "Traceback (most recent call last):\n  File \"/tmp/ansible_Gwor7c/ansible_modlib.zip/ansible/module_utils/basic.py\", line 2109, in run_command\n    rfd, wfd, efd = select.select(rpipes, [], rpipes, 1)\n  File \"/tmp/ansible_Gwor7c/ansible_modlib.zip/ansible/module_utils/facts.py\", line 101, in _handle_timeout\n    raise TimeoutError(error_message)\nTimeoutError: Timer expired\n", "rc": 257}

The confusing thing is that when I run the lsblk command by hand, it's fast:

brianb@pod-000-1000-12:~$ time /bin/lsblk -ln --output UUID /dev/mapper/vg_0037-lv_0037


real    0m0.001s
user    0m0.000s
sys 0m0.000s

Even running /bin/lsblk -lna --output UUID to report on all drives only takes 0.01s.

@alikins
Contributor

alikins commented Aug 10, 2016

@bwbeach Take a look at #17036

If you can test it, I'm curious if it helps.

@bwbeach

bwbeach commented Aug 10, 2016

Looks like a typo:

fatal: [pod-000-1000-00]: FAILED! => {"changed": false, "failed": true, "module_stderr": "", "module_stdout": "\r\nTraceback (most recent call last):\r\n  File \"/tmp/ansible_O52w9A/ansible_module_setup.py\", line 134, in <module>\r\n    main()\r\n  File \"/tmp/ansible_O52w9A/ansible_module_setup.py\", line 126, in main\r\n    data = get_all_facts(module)\r\n  File \"/tmp/ansible_O52w9A/ansible_modlib.zip/ansible/module_utils/facts.py\", line 3378, in get_all_facts\r\n  File \"/tmp/ansible_O52w9A/ansible_modlib.zip/ansible/module_utils/facts.py\", line 3321, in ansible_facts\r\n  File \"/tmp/ansible_O52w9A/ansible_modlib.zip/ansible/module_utils/facts.py\", line 1002, in populate\r\n  File \"/tmp/ansible_O52w9A/ansible_modlib.zip/ansible/module_utils/facts.py\", line 134, in wrapper\r\n  File \"/tmp/ansible_O52w9A/ansible_modlib.zip/ansible/module_utils/facts.py\", line 1288, in get_mount_facts\r\n  File \"/tmp/ansible_O52w9A/ansible_modlib.zip/ansible/module_utils/facts.py\", line 1255, in _find_bind_mounts\r\n  File \"/tmp/ansible_O52w9A/ansible_modlib.zip/ansible/module_utils/facts.py\", line 1246, in _run_findmnt\r\nAttributeError: 'AnsibleModule' object has no attribute 'un_command'\r\n", "msg": "MODULE FAILURE", "parsed": false}

@bwbeach

bwbeach commented Aug 10, 2016

After fixing the typo in my clone, it works. The failure rate was low, so I'll let it run on a large set of hosts and see what happens.

UPDATE: I let it run over a few hundred hosts, and got no timeouts. This is an improvement.

@alikins
Contributor

alikins commented Aug 11, 2016

Looks like a typo:

Doh. And in the dozen or so lines I didn't add test coverage for ;->

Just pushed an update to #17036 with that fixed.

@robinro
Contributor

robinro commented Nov 3, 2016

I stumbled over a follow-up to this when I ran into timeouts on old machines, where the floppy drive took 12 seconds to answer lsblk.
The fix there was to add --exclude 2 to lsblk:

/usr/bin/lsblk --list --noheadings --paths --output NAME,UUID --exclude 2

For details see #18326 and #18327.

sereinity pushed a commit to sereinity-forks/ansible that referenced this issue Jan 25, 2017
Fixes ansible#10779

Refactor some of the block device, mount point, and
mtab/fstab facts collection for linux for better
performance on systems with lots of block devices.

Instead of invoking 'lsblk' for every entry in mtab,
invoke it once, then map the results to mtab entries.

Change the args used for invoking 'findmnt' since the
previous combination of args conflicts, so this would
always fail on some systems depending on version.

Add test cases for facts Hardware()/Network()/Virtual() classes
__new__ method and verify they create the proper subclass based
on the platform.system() results.

Split out all the 'invoke some command and grab its output'
bits related to linux mount paths into their own methods so
it is easier to mock them in unit tests.

Fix the DragonFly* classes that did not define a 'platform'
class attribute. This caused FreeBSD systems to potentially
get the DragonFly* subclasses incorrectly. In practice it
didn't matter much, since the DragonFly* subclasses duplicated
the FreeBSD ones. Actual DragonFly systems would end up with
the generic Hardware() etc. instead of the DragonFly* classes.

Fix Hardware.__new__() on PY3, where passing args to __new__
would cause "object() takes no parameters" errors. So
check for PY3 and just call __new__ without the args.

See
https://hg.python.org/cpython/file/44ed0cd3dc6d/Objects/typeobject.c#l2818
for some explanation.
@oskapt

oskapt commented Mar 29, 2017

FWIW (in case someone else finds this in the future), I ran into this last night on an haproxy system with a high nofile value in limits.conf. The setup module would time out when gathering facts until I lowered it.

@hjnijlunsing

hjnijlunsing commented Jun 11, 2017

I still experience this issue at home. On my file server my 4TB drive goes to sleep when not in use; my guess is that fact gathering spins up the hard drive but takes longer than 10 seconds. To me it is really weird to have a "magic number" of 10 seconds hardcoded in facts.py. Any suggestions? (As I am executing Ansible from Jenkins, I am currently using the Naginator plugin to retry the deployment if it fails the first time, but this is far from ideal...)
