get_mount_facts timeout is very small and not ignored as expected #10779
Comments
I disagree that 10s is too little for listing UUIDs; on the many systems I've tested, lsblk takes less than 1s. Keeping this open to investigate the issue of the timeouts not being handled properly. |
I ran into the same issue some days ago. In my particular case I have an install-once-and-throw-away test environment where I'm using docker containers for the underlying system. It turns out that in this case the problem was already known (see moby/moby#12192) and that upgrading the kernel of my docker host system resolved the issue. |
In our production environment, we see this come up fairly often since the servers are pretty busy and not particularly high-performance. It's especially common while RAID is rebuilding. Even if the default stays the same, it would be nice to be able to override the timeout, or to ignore the failure when these facts aren't pertinent. |
We are seeing this happen a lot on CEPH nodes with LSI SAS controllers. It's extremely random: re-running may or may not hit the timeout.
@bcoca although ideally blkid should return immediately, there are cases where the timeout may need to be increased due to high system IO load (CEPH rebuilds come to mind). Please at least consider making it a variable so that it can be adjusted in ansible.cfg. |
Yes, I've been thinking of adding a timeout option to setup, defaulting to the current value. |
Any suggested workarounds to this for now? We hit this very often on swift nodes. |
@vinsh nothing other than patching ansible; for 1.9.1-1 @ b47d1d7 (the patch below raises it to 20; increase that to whatever time you need):
--- a/lib/ansible/module_utils/facts.py
+++ b/lib/ansible/module_utils/facts.py
@@ -53,7 +53,7 @@ except ImportError:
class TimeoutError(Exception):
pass
-def timeout(seconds=10, error_message="Timer expired"):
+def timeout(seconds=20, error_message="Timer expired"):
def decorator(func):
def _handle_timeout(signum, frame):
raise TimeoutError(error_message) |
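For anyone wondering what is being patched here: the decorator in facts.py is (roughly) a standard SIGALRM-based timeout. The following is a simplified reconstruction to show the mechanism, not a copy of the actual Ansible source:

import functools
import signal

class TimeoutError(Exception):
    pass

def timeout(seconds=10, error_message="Timer expired"):
    def decorator(func):
        def _handle_timeout(signum, frame):
            raise TimeoutError(error_message)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Arm a SIGALRM for `seconds`; if the wrapped function is still
            # running when it fires, the handler raises TimeoutError.
            signal.signal(signal.SIGALRM, _handle_timeout)
            signal.alarm(seconds)
            try:
                return func(*args, **kwargs)
            finally:
                signal.alarm(0)  # always disarm the timer
        return wrapper
    return decorator

# get_mount_facts is wrapped with something like @timeout(10), which is why
# the limit is a hard-coded default rather than anything read from config.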
Thank you for the clarification; we might branch/patch in-house with your suggestion for now. |
Update: the reason we had this timeout was that "nofile" in limits.conf is set to a very large value on our servers (1000000) and the Python subprocess module was busy closing all those file descriptors on every command. We introduced a workaround: ansible_python_interpreter points to a wrapper script that calls "ulimit -n 1024" before invoking python. Regardless of the timeout value, ignoring the error still doesn't work because of the exception handling in the subprocess module. |
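The slowdown described in that comment is easy to measure. A minimal sketch (assuming a POSIX system, and that older Pythons with close_fds=True loop over every descriptor up to the soft RLIMIT_NOFILE before exec):

import resource
import subprocess
import time

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("RLIMIT_NOFILE soft=%d hard=%d" % (soft, hard))

# Time a trivial command; with a soft limit around 1,000,000 on an old
# Python, this single call alone can take multiple seconds.
start = time.time()
subprocess.call(["true"], close_fds=True)
print("one subprocess call took %.3fs" % (time.time() - start))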
Was there a conclusion on this issue? I am observing the same behaviour.
For context, I am using Elasticluster, which installs Ansible 1.9.1 as a dependency.
Is there a workaround here, or does Elasticluster need to provide a custom build of Ansible? |
I'd just like to add that busy hosts with hundreds of LVM logical volumes are also a scenario where I'm seeing intermittent failures caused by the ten-second timeout. I'm now running a fork with a larger value. I can see that, because it's a decorator, making this timeout configurable may be a pain, but it would be really useful for certain use cases (while I can also see that for some people 10 seconds is already a really long time, so I'm not in favour of just making it bigger). |
This has also bitten me. A 10s timeout is generous on dedicated hardware, but when we provision hundreds of VMs in shared infrastructure that are busy doing disk-intensive operations, 10s is not nearly long enough. This very much should be configurable. |
@thenewwazoo it is configurable. "timeout = " in ansible.cfg |
@oleyka that would definitely alter the timeout for the ssh connection, but this issue is related to the timeout decorator inside facts.py which has no hooks into any configuration. |
For folks with storage configs that are causing timeouts: could you describe your storage setup (number of drives/mounts, filesystems) and where the timeout hits? |
@alikins I think, for me, part of the problem is that this fact is O(n) -- it forks a command for every mtab entry (lib/ansible/module_utils/facts.py, line 1216 at 4b5203c).
On one of my troublesome machines, I have 96 entries in mtab. (Also, I personally don't even use these facts for anything, but I am using other facts, so I can't just run with fact gathering turned off.) My current workaround is to execute my playbook several times until the gathering of facts succeeds. |
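To put rough numbers on that O(n) behaviour, a back-of-the-envelope sketch (it simply multiplies one measured lsblk invocation by the number of mtab entries, which is roughly what a fork-per-entry approach costs in the worst case; the flags used are standard util-linux lsblk options):

import subprocess
import time

with open("/etc/mtab") as f:
    entries = [line.split() for line in f if line.strip()]

# Pick the first real block device in mtab to time a single lsblk call.
block_devs = [e[0] for e in entries if e[0].startswith("/dev/")]
if block_devs:
    start = time.time()
    subprocess.call(
        ["lsblk", "--noheadings", "--output", "UUID", block_devs[0]],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    per_call = time.time() - start
    print("%d mtab entries x %.2fs per lsblk call ~= %.1fs if forked per entry"
          % (len(entries), per_call, len(entries) * per_call))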
Hi! A change has been applied for this ticket (ansible/ansible-modules-core#4093), which should resolve this item for you. If you believe this ticket is not resolved, or have further questions, please let us know by stopping by one of the two mailing lists, as appropriate:
Because this project is very active, we're unlikely to see comments on closed tickets, though the mailing list is a great way to get involved or discuss this one further. Thanks! |
@alikins: To answer your questions on behalf of Backblaze... Our storage pods have 45 or 60 drives, each mounted as a separate ext4 file system. We're seeing frequent timeouts on those hosts. We would prefer just to disable gathering the UUIDs of the drives so things run faster. |
Based on #10779 (comment) I'm working on a patch that should improve this. The gist of the patch is to replace the per-mtab-entry 'lsblk' call with a single lsblk call that collects all of the info, then map that onto the mtab entries. Hopefully that will be faster (unless the single lsblk call takes just as long...) |
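A sketch of that approach (the helper names and the mtab_entries argument are illustrative, not the actual patch, which landed as ansible/ansible-modules-core#4093; the lsblk flags are standard util-linux options): run lsblk once, build a device-to-UUID map, and annotate each mtab entry from the map instead of forking per mount.

import subprocess

def collect_device_uuids():
    # One lsblk invocation listing every block device and its UUID.
    out = subprocess.check_output(
        ["lsblk", "--list", "--noheadings", "--paths", "--output", "NAME,UUID"]
    ).decode("utf-8", "replace")
    uuids = {}
    for line in out.splitlines():
        fields = line.split(None, 1)
        if len(fields) == 2:
            device, uuid = fields
            uuids[device] = uuid.strip()
    return uuids

def annotate_mounts(mtab_entries):
    uuids = collect_device_uuids()   # one fork, not one per mount
    mounts = []
    for device, mount_point, fstype, options in mtab_entries:
        mounts.append({
            "device": device,
            "mount": mount_point,
            "fstype": fstype,
            "options": options,
            "uuid": uuids.get(device, "N/A"),
        })
    return mounts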
This is the error I get when things time out. The confusing thing is that when I run the same command by hand, it completes quickly; even running it repeatedly doesn't come anywhere near the timeout. |
Looks like a typo in the patched code. |
After fixing the typo in my clone, it works. The failure rate was low, so I'll let it run on a large set of hosts and see what happens. UPDATE: I let it run over a few hundred hosts, and got no timeouts. This is an improvement. |
Doh. And in the dozen or so lines I didn't add test coverage for ;-> Just pushed an update to #17036 with that fixed. |
Fixes ansible#10779
Refactor some of the block device, mount point, and mtab/fstab facts collection for Linux for better performance on systems with lots of block devices. Instead of invoking 'lsblk' for every entry in mtab, invoke it once, then map the results to the mtab entries.
Change the args used for invoking 'findmnt', since the previous combination of args conflicts, so it would always fail on some systems depending on version.
Add test cases for the facts Hardware()/Network()/Virtual() classes' __new__ method and verify they create the proper subclass based on the platform.system() results.
Split out all the 'invoke some command and grab its output' bits related to Linux mount paths into their own methods so they are easier to mock in unit tests.
Fix the DragonFly* classes that did not define a 'platform' class attribute. This caused FreeBSD systems to potentially get the DragonFly* subclasses incorrectly. In practice it didn't matter much, since the DragonFly* subclasses duplicated the FreeBSD ones; actual DragonFly systems would end up with the generic Hardware() etc. instead of the DragonFly* classes.
Fix Hardware.__new__() on PY3: passing args to __new__ would cause "object() takes no parameters" errors, so check for PY3 and call __new__ without the args. See https://hg.python.org/cpython/file/44ed0cd3dc6d/Objects/typeobject.c#l2818 for some explanation.
FWIW (in case someone else finds this in the future), I ran into this last night on an haproxy system with a high |
I still experience this issue at home. On my file server, my 4 TB drive goes to sleep when not in use; my guess is that this spins up the hard drive but takes longer than 10 sec. To me it is really weird to have a "magic number" of 10 seconds hardcoded in facts.py. Any suggestions? (As I am executing Ansible from Jenkins, I am currently using the Naginator plugin to retry the deployment if it fails the first time; but this is far from ideal...) |
I've analysed #10746 more deeply and found that my first suspicion was wrong: the timeout error was not really caused by the lack of lsblk. Moreover, it was not ignored as intended.
The reason is that the UUID gathering added by #10055 broke timeout handling because of Python's subprocess module code. In the _execute_child method there is an except clause catching all exceptions: that's why the TimeoutError is not caught and ignored in Facts.populate.
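A minimal illustration of that failure mode (not the actual subprocess code): if the alarm fires while execution is inside a routine with a blanket except clause, the TimeoutError is consumed there and the caller's targeted handler never sees it.

import signal

class TimeoutError(Exception):
    pass

def _handle_timeout(signum, frame):
    raise TimeoutError("Timer expired")

def library_call_with_blanket_except():
    # Stand-in for subprocess._execute_child: a broad except swallows
    # whatever the signal handler raises while we are inside it.
    try:
        signal.alarm(1)   # pretend the 10s timer fires while we are in here
        signal.pause()    # block until SIGALRM interrupts us
    except Exception as e:
        print("swallowed inside the library: %r" % e)

signal.signal(signal.SIGALRM, _handle_timeout)
try:
    library_call_with_blanket_except()
except TimeoutError:
    print("caller would have ignored this")   # never reached
else:
    print("caller never sees the timeout at all")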
Suggested solution: get the UUIDs without calling lsblk through subprocess.
Margin note: the 10s timeout was enough before, but with lsblk, get_mount_facts on my system (with no network filesystems mounted) took as long as 15 seconds.