
get_mount_facts timeout is very small and not ignored as expected #10779

Closed
rosowiecki opened this issue Apr 20, 2015 · 27 comments
Labels
bug This issue/PR relates to a bug.

Comments

@rosowiecki
Contributor

rosowiecki commented Apr 20, 2015

I've analysed #10746 more deeply and found that my first suspicion was wrong: the timeout error was not really caused by the lack of lsblk. Moreover, it was not ignored as intended.

The reason is that the UUID gathering added by #10055 broke timeout handling because of code in Python's subprocess module. The _execute_child method contains an except clause that catches all exceptions; that's why the TimeoutError is not caught and ignored in Facts.populate.

failed: [SOME_HOST] => {"cmd": "/bin/lsblk -ln --output UUID /dev/sdi2", "failed": true, "rc": 257}
msg: Traceback (most recent call last):
  File "<stdin>", line 1650, in run_command
  File "/usr/lib64/python2.4/subprocess.py", line 550, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.4/subprocess.py", line 988, in _execute_child
    data = os.read(errpipe_read, 1048576) # Exceptions limited to 1 MB
  File "<stdin>", line 1796, in _handle_timeout
TimeoutError: Timer expired
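The swallowing described above can be sketched in miniature (a loose analogy, not subprocess's actual code: `child_spawner` here stands in for `_execute_child`'s broad exception handling, and the decorator is a simplified version of the one in facts.py):

```python
import signal
import time

class TimeoutError(Exception):
    # facts.py defines its own TimeoutError class
    pass

def timeout(seconds=10, error_message="Timer expired"):
    """Simplified stand-in for the SIGALRM-based decorator in facts.py."""
    def decorator(func):
        def _handle_timeout(signum, frame):
            raise TimeoutError(error_message)
        def wrapper(*args, **kwargs):
            signal.signal(signal.SIGALRM, _handle_timeout)
            signal.alarm(seconds)
            try:
                return func(*args, **kwargs)
            finally:
                signal.alarm(0)  # always cancel the pending alarm
        return wrapper
    return decorator

def child_spawner():
    """Stand-in for subprocess._execute_child: a broad except clause
    catches whatever fires inside it -- including the TimeoutError raised
    by the SIGALRM handler -- and re-raises it as a different type."""
    try:
        time.sleep(5)  # the alarm fires while we are blocked here
    except Exception as exc:
        raise RuntimeError("converted: %r" % exc)

@timeout(seconds=1)
def get_mount_facts_like():
    child_spawner()
```

An outer `except TimeoutError` around `get_mount_facts_like()` (as around get_mount_facts in Facts.populate) never fires, because a different exception type comes out of the intermediate frame.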

Suggested solution: get the UUIDs without calling lsblk through subprocess.

Margin note: the 10s timeout was enough before, but with lsblk, get_mount_facts on my system took as long as 15 seconds even with no network filesystems mounted.

@rosowiecki rosowiecki changed the title get_mount_facts timeout is very small and not ignored as intended get_mount_facts timeout is very small and not ignored as expected Apr 20, 2015
@bcoca
Member

bcoca commented Apr 20, 2015

I disagree that 10s is too little for listing UUIDs; on the many systems I've tested, lsblk takes less than 1s. Keeping this open to investigate the issue of the timeouts not being handled properly.

@benjixx
Contributor

benjixx commented May 15, 2015

I ran into the same issue some days ago. In my particular case I have an install-once-and-throw-away test environment that uses Docker containers for the underlying system.

It turns out that in this case the problem was already known (see moby/moby#12192) and that upgrading the kernel of my docker host system resolved the issue.

@epyonavenger

In our production environment we see this come up fairly often, since the servers are quite busy and not particularly high-performance. It's especially common while RAID is rebuilding. Even if the default stays the same, it would be nice if we could override the timeout, or ignore the failure when it isn't pertinent.

@infernix

infernix commented Aug 3, 2015

We are seeing this happen a lot on CEPH nodes with LSI SAS controllers. It's extremely random.

failed: [obj-005] => {"cmd": "/bin/lsblk -ln --output UUID /dev/sdb1", "failed": true, "rc": 257}
msg: Traceback (most recent call last):
File "<stdin>", line 1671, in run_command
File "<stdin>", line 1801, in _handle_timeout
TimeoutError: Timer expired

When running it again the timeout may or may not occur.

@bcoca although ideally blkid should return immediately, there are cases where the timeout may need to be increased due to high system IO load (CEPH rebuilds come to mind). Please at least consider making it a variable so that it can be adjusted in ansible.cfg.

@bcoca
Member

bcoca commented Aug 3, 2015

Yes, I've been thinking of adding a timeout option to setup, defaulting to the current value.
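At the decorator level, such an option could look roughly like this (a sketch only: `FACTS_TIMEOUT` is a made-up environment variable for illustration, not an actual Ansible setting; the fix that eventually shipped exposed a `gather_timeout` parameter instead):

```python
import os
import signal

class TimeoutError(Exception):
    pass

def timeout(seconds=None, error_message="Timer expired"):
    """Variant of the facts.py decorator whose default can be overridden
    at run time instead of being hardcoded at definition time."""
    def decorator(func):
        def _handle_timeout(signum, frame):
            raise TimeoutError(error_message)
        def wrapper(*args, **kwargs):
            # Resolve the limit at call time, so a caller (or the
            # environment) can override the decorator's default.
            limit = seconds or int(os.environ.get('FACTS_TIMEOUT', '10'))
            signal.signal(signal.SIGALRM, _handle_timeout)
            signal.alarm(limit)
            try:
                return func(*args, **kwargs)
            finally:
                signal.alarm(0)
        return wrapper
    return decorator
```

Because the limit is read inside the wrapper rather than captured when the decorator is applied, the module no longer has to be patched to change it.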

@vinsh

vinsh commented Aug 12, 2015

Any suggested workarounds to this for now? We hit this very often on swift nodes.

@infernix

@vinsh Nothing other than patching Ansible; for 1.9.1-1 @ b47d1d7 (change the 20 to whatever time you need):

--- a/lib/ansible/module_utils/facts.py
+++ b/lib/ansible/module_utils/facts.py
@@ -53,7 +53,7 @@ except ImportError:
 class TimeoutError(Exception):
     pass

-def timeout(seconds=10, error_message="Timer expired"):
+def timeout(seconds=20, error_message="Timer expired"):
     def decorator(func):
         def _handle_timeout(signum, frame):
             raise TimeoutError(error_message)

@vinsh

vinsh commented Aug 12, 2015

Thank you for the clarification, we might branch/patch in house with your suggestion for now.

@rosowiecki
Contributor Author

Update: the reason we hit this timeout was that "nofile" in limits.conf is set to a very large value on our servers (1000000), and Python's subprocess module was busy closing all those file descriptors on every command. We introduced a workaround: ansible_python_interpreter points to a wrapper script that calls "ulimit -n 1024" before starting Python. Regardless of the timeout itself, ignoring it still doesn't work because of the exception handling in the subprocess module.

@mbookman

Was there a conclusion on this issue? I have observed the same issue.

  • Is the answer that on many systems 10 seconds is indeed too tight a timeout, and so it should simply be increased in master?
  • Is the answer that many systems are improperly configured and should drop their nofile limit?

For context, I am using Elasticluster, which installs Ansible 1.9.1 as a dependency:

elasticluster/lib/python2.7/site-packages/ansible-1.9.1-py2.7.egg

Is there a workaround here, or does Elasticluster need to provide a custom build of Ansible?

@jimi-c jimi-c removed the P3 label Dec 7, 2015
@insom

insom commented Feb 4, 2016

I'd just like to add that busy hosts with hundreds of LVM logical volumes are also a scenario where I'm seeing intermittent failures caused by the ten-second timeout. I'm now running a fork with a larger value. I can see that, because it's a decorator, making this timeout configurable may be a pain, but it would be really useful for certain use cases. (I can also see that for some people 10 seconds is a really long time, so I'm not in favour of just making it bigger.)

@thenewwazoo

This has also bitten me. A 10s timeout is generous on dedicated hardware, but when we provision hundreds of VMs in shared infrastructure that are busy doing disk-intensive operations, 10s is not nearly long enough. This very much should be configurable.

@oleyka

oleyka commented Jun 24, 2016

@thenewwazoo it is configurable. "timeout = " in ansible.cfg

@jctanner
Contributor

jctanner commented Jul 1, 2016

@oleyka that would definitely alter the timeout for the ssh connection, but this issue is related to the timeout decorator inside facts.py which has no hooks into any configuration.

@alikins
Contributor

alikins commented Jul 5, 2016

For folks with storage configs that are causing timeouts:

  • are the ansible_mounts facts being used?
    If so, how. In particular, is the device uuid used?

    My first thought is 'stop running lsblk just to add the device uuids', so I'm looking for evidence that this is a used feature.

  • are you using other tools that handle this better?

    Everything seems to eventually end up reading /sys/block/*, so I'm curious whether there is a better tool for this. Ideally something that actually completes quickly rather than something that merely doesn't time out, but either way ;->

  • would filtering the types of block devices and mount points that are collected be useful? Any easy targets for skipping by default?

  • Any ideas on which parts of storage info collection are troublesome? findmnt? lsblk? All of the above?

  • Anyone have example reports from troublesome systems they could share?

    For example:

    • an 'ansible -m setup' output (or interesting subsets).
      • lsblk output ('lsblk -ln --output UUID')
      • ls -laR /sys/block/

    (any variation of those or other reports is useful)

@insom

insom commented Jul 5, 2016

@alikins I think, for me, part of the problem is that this fact is O(n) -- it forks a command for every entry in /etc/mtab, but has a constant timeout:

mtab = get_file_content('/etc/mtab', '')

On one of my troublesome machines, I have 96 entries in /etc/mtab, with 89 LVM volumes and moderately high load. Forking around 90 processes that hit the disks, one after another, on a loaded machine regularly exceeds 10 seconds. I realise my case is probably verging on pathological, so I don't expect the defaults to change, but being able to tune this would be great.

(Also I personally don't even use these facts for anything, but I am using other facts, so I can't just run with gather_facts: False)

My current workaround is to execute my playbook several times until the gathering of facts succeeds.

@jctanner
Contributor

jctanner commented Jul 8, 2016

Hi!

A change has been applied for this ticket, which should resolve this item for you.

ansible/ansible-modules-core#4093
#16551

If you believe this ticket is not resolved, or have further questions, please let us know by stopping by one of the two mailing lists, as appropriate.

Because this project is very active, we're unlikely to see comments on closed tickets, though the mailing list is a great way to get involved or discuss this one further.

Thanks!

@jctanner jctanner closed this as completed Jul 8, 2016
@bwbeach

bwbeach commented Aug 10, 2016

@alikins: To answer your questions on behalf of Backblaze...

Our storage pods have 45 or 60 drives, each mounted as a separate ext4 file system. We're seeing frequent timeouts on lsblk with the 10 second timeout built into 2.1. The new gather_timeout setting would fix the error for us, but would mean that we have to wait for lsblk on all the drives, which is annoying.

We would prefer just to disable gathering the uuids of the drives so things run faster.

@alikins alikins reopened this Aug 10, 2016
@alikins
Contributor

alikins commented Aug 10, 2016

Based on #10779 (comment) I'm working on a patch that should improve this.

The gist of the patch is to replace the per-mtab-entry 'lsblk' call with a single lsblk call that collects all of the info, then map the results onto the mtab entries. Hopefully that will be faster (unless the single lsblk call takes just as long...).
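The shape of that change can be sketched as follows (function names and the sample output format are mine, not the patch's; the lsblk flags match the command quoted later in this thread for #18327):

```python
import subprocess

LSBLK_CMD = ['lsblk', '--list', '--noheadings', '--paths',
             '--output', 'NAME,UUID']

def parse_lsblk_uuids(output):
    """Parse 'NAME UUID' lines from a single lsblk run into a dict.
    Devices with no filesystem UUID print no second column; skip them."""
    uuids = {}
    for line in output.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[1].strip():
            uuids[parts[0]] = parts[1].strip()
    return uuids

def mount_facts(mtab_entries, uuids):
    """Annotate (device, mountpoint) pairs from one lsblk pass,
    instead of forking lsblk once per mtab entry."""
    return [{'device': dev, 'mount': mnt, 'uuid': uuids.get(dev, 'N/A')}
            for dev, mnt in mtab_entries]

def gather(run=subprocess.check_output):
    """One fork total, however many mounts there are."""
    return parse_lsblk_uuids(run(LSBLK_CMD, universal_newlines=True))
```

This turns the O(n)-forks pattern @insom described into a single fork whose cost is independent of the number of mtab entries.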

@bwbeach

bwbeach commented Aug 10, 2016

This is the error I get when things time out:

fatal: [pod-000-1000-12]: FAILED! => {"changed": false, "cmd": "/bin/lsblk -ln --output UUID /dev/mapper/vg_0037-lv_0037", "failed": true, "msg": "Traceback (most recent call last):\n  File \"/tmp/ansible_Gwor7c/ansible_modlib.zip/ansible/module_utils/basic.py\", line 2109, in run_command\n    rfd, wfd, efd = select.select(rpipes, [], rpipes, 1)\n  File \"/tmp/ansible_Gwor7c/ansible_modlib.zip/ansible/module_utils/facts.py\", line 101, in _handle_timeout\n    raise TimeoutError(error_message)\nTimeoutError: Timer expired\n", "rc": 257}

The confusing thing is that when I run the lsblk command by hand, it's fast:

brianb@pod-000-1000-12:~$ time /bin/lsblk -ln --output UUID /dev/mapper/vg_0037-lv_0037


real    0m0.001s
user    0m0.000s
sys 0m0.000s

Even running /bin/lsblk -lna --output UUID to report on all drives only takes 0.01s.

@alikins
Contributor

alikins commented Aug 10, 2016

@bwbeach Take a look at #17036

If you can test it, I'm curious if it helps.

@bwbeach

bwbeach commented Aug 10, 2016

Looks like a typo:

fatal: [pod-000-1000-00]: FAILED! => {"changed": false, "failed": true, "module_stderr": "", "module_stdout": "\r\nTraceback (most recent call last):\r\n  File \"/tmp/ansible_O52w9A/ansible_module_setup.py\", line 134, in <module>\r\n    main()\r\n  File \"/tmp/ansible_O52w9A/ansible_module_setup.py\", line 126, in main\r\n    data = get_all_facts(module)\r\n  File \"/tmp/ansible_O52w9A/ansible_modlib.zip/ansible/module_utils/facts.py\", line 3378, in get_all_facts\r\n  File \"/tmp/ansible_O52w9A/ansible_modlib.zip/ansible/module_utils/facts.py\", line 3321, in ansible_facts\r\n  File \"/tmp/ansible_O52w9A/ansible_modlib.zip/ansible/module_utils/facts.py\", line 1002, in populate\r\n  File \"/tmp/ansible_O52w9A/ansible_modlib.zip/ansible/module_utils/facts.py\", line 134, in wrapper\r\n  File \"/tmp/ansible_O52w9A/ansible_modlib.zip/ansible/module_utils/facts.py\", line 1288, in get_mount_facts\r\n  File \"/tmp/ansible_O52w9A/ansible_modlib.zip/ansible/module_utils/facts.py\", line 1255, in _find_bind_mounts\r\n  File \"/tmp/ansible_O52w9A/ansible_modlib.zip/ansible/module_utils/facts.py\", line 1246, in _run_findmnt\r\nAttributeError: 'AnsibleModule' object has no attribute 'un_command'\r\n", "msg": "MODULE FAILURE", "parsed": false}

@bwbeach

bwbeach commented Aug 10, 2016

After fixing the typo in my clone, it works. The failure rate was low, so I'll let it run on a large set of hosts and see what happens.

UPDATE: I let it run over a few hundred hosts, and got no timeouts. This is an improvement.

@alikins
Contributor

alikins commented Aug 11, 2016

Looks like a typo:

Doh. And in the dozen or so lines I didn't add test coverage for ;->

Just pushed an update to #17036 with that fixed.

@robinro
Contributor

robinro commented Nov 3, 2016

I stumbled over a follow-up to this when I ran into timeouts on old machines, where the floppy drive took 12 seconds to answer lsblk.
The fix there was to add --exclude 2 to lsblk:

/usr/bin/lsblk --list --noheadings --paths --output NAME,UUID --exclude 2

For details see #18326 and #18327.

sereinity pushed a commit to sereinity-forks/ansible that referenced this issue Jan 25, 2017
Fixes ansible#10779

Refactor some of the block device, mount point, and
mtab/fstab facts collection for linux for better
performance on systems with lots of block devices.

Instead of invoking 'lsblk' for every entry in mtab,
invoke it once, then map the results to mtab entries.

Change the args used for invoking 'findmnt' since the
previous combination of args conflicts, so this would
always fail on some systems depending on version.

Add test cases for facts Hardware()/Network()/Virtual() classes
__new__ method and verify they create the proper subclass based
on the platform.system() results.

Split out all the 'invoke some command and grab its output'
bits related to linux mount paths into their own methods so
it is easier to mock them in unit tests.

Fix the DragonFly* classes that did not define a 'platform'
class attribute. This caused FreeBSD systems to potentially
get the DragonFly* subclasses incorrectly. In practice it
didn't matter much, since the DragonFly* subclasses duplicated
the FreeBSD ones. Actual DragonFly systems would end up with
the generic Hardware() etc. instead of the DragonFly* classes.

Fix Hardware.__new__() on PY3, where passing args to __new__
would cause "object() takes no parameters" errors. So
check for PY3 and just call __new__ without the args.

See
https://hg.python.org/cpython/file/44ed0cd3dc6d/Objects/typeobject.c#l2818
for some explanation.
@oskapt

oskapt commented Mar 29, 2017

FWIW (in case someone else finds this in the future), I ran into this last night on an haproxy system with a high nofile value in limits.conf. The setup module would time out when gathering facts until I lowered it.

@hjnijlunsing

hjnijlunsing commented Jun 11, 2017

I still experience this issue at home. On my file server my 4TB drive goes to sleep when not in use; my guess is that fact gathering spins up the hard drive but takes longer than 10 seconds. To me it is really weird to have a "magic number" of 10 seconds hardcoded in facts.py. Any suggestions? (As I am executing Ansible from Jenkins, I am currently using the Naginator plugin to retry the deployment if it fails the first time, but this is far from ideal...)
