"Too many open files" or "file descriptor out of range in select()" while gathering facts from CoreOS machine with 200 running containers #10157
Comments
This sounds pretty straightforward. The ulimit in CoreOS is set too low; to allow more open files, you have to increase it.
@sivel Thanks, I trusted the person who checked ulimit for me and said it was 3271601, but it was 1024.
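For anyone double-checking, the limit can be read from inside the Python process itself rather than trusting a shell's `ulimit -n` (a minimal sketch using the standard `resource` module):

```python
import resource

# RLIMIT_NOFILE caps how many file descriptors this process may hold
# open. The soft limit is what open()/pipe() actually enforce; an
# unprivileged process may raise it up to the hard limit.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft limit: %d, hard limit: %d" % (soft, hard))
```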
This is probably a file descriptor leak. After increasing the 'open files' limit to 65635 I'm getting another error:
How many forks do you have ansible configured to use? Although it would be slow, does it work if you set it to use 1 fork?
@abadger It was 5 forks by default. With forks=1 it gives the same error.
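For anyone reproducing this, the fork count being discussed is the `forks` setting (or `-f` on the command line); as an illustration, a minimal `ansible.cfg` override pinning it to one connection would be:

```ini
[defaults]
forks = 1
```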
It smells like we need to switch from select.select() to select.poll/epoll/kqueue.
@alpo I don't know much about select vs. poll/epoll/kqueue; could you elaborate on how moving to those would solve this issue? When the issue happens, can you see a lot of processes running? I'm wondering if something is taking too long. I tried to see if anything pointed to an fd being leaked, but I couldn't find much (no surprise there). Additionally, forks controls how many parallel connections are opened; the actual script doing the work on the remote host is not constrained by it.
epoll/kqueue have to do with scaling; we should only go there once you hit the limit. Gathering facts is mostly opening kernel info files (/proc, /dev, /sys). (Brian Coca)
The problem with select() is that it has a hard FD_SETSIZE limit on descriptors (1024 on Linux), as far as I know.
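A minimal sketch of the suggested alternative: `select.poll()` registers descriptors by number and has no FD_SETSIZE ceiling, so it keeps working however large the fd values get (poll is available on Linux and most Unixes, though not on Windows):

```python
import select

def wait_readable(fds, timeout=0.0):
    """Return the subset of fds that are readable, using poll() instead
    of select() -- poll() has no FD_SETSIZE (1024) limit on fd values."""
    poller = select.poll()
    for fd in fds:
        poller.register(fd, select.POLLIN)
    # poll() takes a timeout in milliseconds and returns (fd, event) pairs
    return [fd for fd, _event in poller.poll(int(timeout * 1000))]
```

kqueue would be the analogous call on the BSDs, and epoll the scalable variant on Linux; for this bug, plain poll() already removes the 1024 limit.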
I see this hit. I've added
hmm, never tested ansible with pypy, could it be an incompatibility?
@alpo are you using the coreos host machine as your ansible controller or are you running ansible on another machine to configure the coreos host?
@abadger I'm using a CentOS6 host to configure CoreOS machines.
Reproduced the original issue in a fedora => CoreOS environment. Not sure yet if this is a pypy issue; however, some googling turned up this bug report: https://issues.apache.org/jira/browse/QPID-5588

If that report is accurate and it's the *value* of the file descriptor, not the number of open files, then we might not be able to fix this via select(), only by moving to poll/epoll/kqueue. I'm not sure about the portability of those APIs, though, so we may be between a rock and a hard place if that proves correct.

@alpo How are you setting the open-files ulimit on your CoreOS box? I'd like to get past the ulimit issue to the ValueError, since that seems to be where we'll either have an unsolvable problem with select() or be able to find a workaround.
python 2.7 source code:
So it is about the value, not how many open file descriptors exist.
@abadger I set nofile for CoreOS using the following ansible snippet:
This limitation is actually a limitation of the select() system call itself, not of python. See the Notes section in http://linux.die.net/man/2/select: behavior is undefined if fd < 0 or fd >= FD_SETSIZE.
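That distinction is easy to demonstrate with a single descriptor: duplicate one pipe end to a high fd number, and select() fails while poll() does not. This is a sketch; it assumes the process's hard nofile limit allows raising the soft limit above 2000:

```python
import os
import resource
import select

# Raise the soft limit so dup2() may target a descriptor number >= 1024.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
target = 4096 if hard == resource.RLIM_INFINITY else min(4096, hard)
if soft < target:
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))

r, w = os.pipe()
os.dup2(r, 2000)          # still only a handful of open fds...
os.write(w, b"x")

try:
    select.select([2000], [], [], 0)
    print("select() accepted fd 2000")
except ValueError as err:  # "filedescriptor out of range in select()"
    print("select() failed:", err)

poller = select.poll()
poller.register(2000, select.POLLIN)
print("poll() returned:", poller.poll(0))  # fd value is irrelevant to poll()
```

So a process with only four open files can still blow up select() if one of them happens to carry a high descriptor number.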
Compared some invocations of run_command() via setup with your debugging in, and found that pypy vs cpython on a fedora box makes a difference here. cpython:
pypy:
However, a simple test loop based on run_command()'s code does not exhibit the increasing file descriptors, so the filehandles are likely being leaked somewhere else in the code. CPython releases filehandles as soon as the reference count drops to 0, whereas pypy doesn't use reference counting for garbage collection, so files that are out of scope may hang around for longer.
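The difference is easy to observe by counting the process's descriptors before and after a read loop. A small sketch (Linux-specific, since it counts entries in /proc/self/fd) showing that an explicit with-block keeps the count flat without depending on refcounting:

```python
import os

def open_fd_count():
    # One entry per open descriptor for this process (Linux-specific).
    return len(os.listdir("/proc/self/fd"))

def read_file(path):
    # The with-block closes the fd deterministically on exit, instead of
    # waiting for the garbage collector (which pypy runs only
    # periodically, unlike cpython's immediate refcounting).
    with open(path) as f:
        return f.read()

before = open_fd_count()
for _ in range(100):
    read_file("/proc/self/status")
after = open_fd_count()
print("before=%d after=%d" % (before, after))
```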
Helps control open file descriptor count with pypy (which is used with one coreos + ansible example). Part of a fix for #10157
@alpo If we're going down the correct path for these issues, the commit I just pushed should get you to the point where ansible can handle a coreos machine with 200 containers. The patch brought my localhost test down to 60 open file descriptors, and running with the debugging on a coreos VM with 200 running containers showed a high-water mark of 414 for the descriptor number.

Before closing this ticket I'll try to rework facts.py to explicitly close() most of the files it accesses, instead of relying on files going out of scope to close them, which should let you scale even more. Unfortunately, since cpython is ubiquitous and pypy is seen infrequently, we'll probably never catch all of the places where file.close() is needed, but hopefully we'll catch enough that this is no longer an issue.

If for some reason the above commit doesn't seem to make a difference, please let me know that I'm pursuing a red herring :-)
Okay, I think I've converted all of the file access in facts gathering to explicitly close the file. Running this on a coreos VM with 200 running containers shows a high-water mark of 7 for the file descriptor number. So at least for the setup module, I think this problem is solved :-)

Closing This Ticket

Hi! We believe recent commits (likely detailed above) should resolve this question or problem for you. This will also be included in the next major release. If you continue seeing any problems related to this issue, or if you have any further questions, please let us know by stopping by one of the two mailing lists, as appropriate:
Because this project is very active, we're unlikely to see comments made on closed tickets, but the mailing list is a great way to ask questions, or to post if you don't think this particular issue is resolved. Thank you!
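As an illustration of the explicit-close pattern applied in facts gathering (a hypothetical helper, not the actual facts.py code), reading a kernel info file without leaning on the garbage collector looks like:

```python
import os

def get_file_content(path, default=None):
    # Hypothetical sketch: open/read/close with try/finally so the
    # descriptor is released immediately, even on interpreters such as
    # pypy that do not close files as soon as they go out of scope.
    if not os.access(path, os.R_OK):
        return default
    f = open(path)
    try:
        return f.read().strip()
    finally:
        f.close()

print(get_file_content("/proc/sys/kernel/ostype"))
```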
While executing

```
ansible -i infrastructure somehost.fqdn -m setup
```

I'm getting the following result. The result is the same with the ansible 1.8.2 release and the `devel` branch. Everything works fine if I kill all containers or have only 40 containers running. I'm using the `defunctzombie.coreos-bootstrap` role with docker-py 0.6.0 to prepare the CoreOS machine.