Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"Too many open files" or "file descriptor out of range in select()" while gathering facts from CoreOS machine with 200 running containers #10157

Closed
alpo opened this issue Feb 3, 2015 · 19 comments
Labels
bug This issue/PR relates to a bug. P2 Priority 2 - Issue Blocks Release

Comments

@alpo
Copy link

alpo commented Feb 3, 2015

while executing
ansible -i infrastructure somehost.fqdn -m setup
I'm getting following result

somehost.fqdn | FAILED >> {
    "cmd": "/usr/bin/ip addr show primary vethd91fe30", 
    "failed": true, 
    "msg": "[Errno 24] Too many open files", 
    "rc": 24
}

Result is same while using ansible release 1.8.2 and devel brach. Everything works fine if I kill all containers or have 40 containers running. I'm using defunctzombie.coreos-bootstrap role with docker-py 0.6.0 to prepare CoreOS machine.

@sivel
Copy link
Member

sivel commented Feb 3, 2015

This sounds pretty straight forward. The ulimit set in CoreOS is set too low. In order to allow more open files, you have to increase the ulimit.

@alpo
Copy link
Author

alpo commented Feb 3, 2015

@sivel Thanks, I trusted the person who checked ulimit for me and said it was 3271601, but it was 1024.

@alpo alpo closed this as completed Feb 3, 2015
@alpo alpo changed the title "[Errno 24] Too many open files" while gathering facts from CoreOS machine with 200 running containers (probably descriptors leak) "Too many open files" while gathering facts from CoreOS machine with 200 running containers Feb 4, 2015
@alpo
Copy link
Author

alpo commented Feb 4, 2015

This is probably descriptors leak.

After increasing of 'open files' limit to 65635 I'm getting another error:

somehost.fqdn | FAILED >> {
    "cmd": "/usr/bin/ip addr show primary veth7abf1d2", 
    "failed": true, 
    "msg": "Traceback (most recent call last):
  File \"/home/core/.ansible/tmp/ansible-tmp-1423039255.08-164539437898303/setup\", line 1654, in run_command
    rfd, wfd, efd = select.select(rpipes, [], rpipes, 1)
ValueError: file descriptor out of range in select()
", 
    "rc": 257
}

@alpo alpo reopened this Feb 4, 2015
@alpo alpo changed the title (probably descriptors leak) "Too many open files" while gathering facts from CoreOS machine with 200 running containers (probably descriptors leak) "Too many open files" or "file descriptor out of range in select()" while gathering facts from CoreOS machine with 200 running containers Feb 4, 2015
@alpo alpo changed the title (probably descriptors leak) "Too many open files" or "file descriptor out of range in select()" while gathering facts from CoreOS machine with 200 running containers "Too many open files" or "file descriptor out of range in select()" while gathering facts from CoreOS machine with 200 running containers Feb 4, 2015
@abadger
Copy link
Contributor

abadger commented Feb 5, 2015

How many forks do you have ansible configured to use? Although it would be slow, does it work if you set it to use 1 fork?

@alpo
Copy link
Author

alpo commented Feb 5, 2015

@abadger It was 5 forks by default. With forks=1 it gives same error.

@alpo
Copy link
Author

alpo commented Feb 5, 2015

It smells like a necessity of switch from select.select() to select.poll/epoll/kqueue.

@gtirloni
Copy link
Contributor

gtirloni commented Feb 6, 2015

@alpo I don't know much about select vs. poll/epoll/kqueue, could you elaborate on how moving to those options would solve this issue?

When the issue happens, can you see a lot of processes running? I'm wondering if something isn't taking too long. I tried to see if something pointed to a fd being leaked but I couldn't find much (no surprise there).

Additionally, forks is used to control how many parallel connections are opened, the actual script doing the work in the remote host is not constrained about it.

@bcoca
Copy link
Member

bcoca commented Feb 6, 2015

epoll/kqueue have to do with scaling, we should only go there once you have
1000+ sockets open, otherwise it is actually slower than select.

gathering facts is mostly opening kernel infor files ( /proc, /dev, /sys)
in a mostly sequential manner, unless your version of python is leaking
descriptors and not closing them, I cannot see how you hit the file open

limit.

Brian Coca

@alpo
Copy link
Author

alpo commented Feb 6, 2015

The problem with select() is that it has FD_SETSIZE hard limit for number of descriptors (1024 for Linux), as I know.
@gtirloni Regarding number of processes: control machine has 342 and increase to 345 while ansible executing, machine under control has 614 and 620 respectively.

@alpo
Copy link
Author

alpo commented Feb 6, 2015

I see this hit. I've added sys.stderr.write("%r," % [x.fileno() for x in rpipes]) before https://github.com/ansible/ansible/blob/devel/lib/ansible/module_utils/basic.py#L1513 and got (using -vvvv option)

<[REDACTED]> ESTABLISH CONNECTION FOR USER: core  
<[REDACTED]> REMOTE_MODULE setup  
<[REDACTED]> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o ControlPersist=60s -o ControlPath="/root/.ansible/cp/ansible-ssh-%h-%p-%r" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=g
ssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=core -o ConnectTimeout=10 [REDACTED] /bin/sh -c 'mkdir -p $HOME/.ansible/tmp/ansible-tmp-1423249741.65-2182705941
3758 && echo $HOME/.ansible/tmp/ansible-tmp-1423249741.65-21827059413758'  
<[REDACTED]> PUT /tmp/tmpg8T7hM TO /home/core/.ansible/tmp/ansible-tmp-1423249741.65-21827059413758/setup  
<[REDACTED]> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o ControlPersist=60s -o ControlPath="/root/.ansible/cp/ansible-ssh-%h-%p-%r" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=g
ssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=core -o ConnectTimeout=10 [REDACTED] /bin/sh -c 'LANG=en_US.UTF-8 LC_CTYPE=en_US.UTF-8 PATH=/home/core/bin:$PATH 
python /home/core/.ansible/tmp/ansible-tmp-1423249741.65-21827059413758/setup; rm -rf /home/core/.ansible/tmp/ansible-tmp-1423249741.65-21827059413758/ >/dev/null 2>&1'  
[REDACTED] | FAILED >> {  
    "failed": true,   
    "msg": "/home/core/pypy/bin/pypy: /lib64/libssl.so.1.0.0: no version information available (required by /home/core/pypy/bin/pypy)\r\n/home/core/pypy/bin/pypy: /lib64/libcrypto.so.1.0.0: no version 
information available (required by /home/core/pypy/bin/pypy)\r\n[31, 33],[31, 33],[31, 33],[78, 80],[78],[78, 80],[78],[78, 80],[78],[87, 89],[87, 89],[87, 89],[87, 89],[92, 94],[92, 94],[92, 94],[97, 
99],[97, 99],[97, 99],[102, 104],[102, 104],[102, 104],[107, 109],[107, 109],[107, 109],[112, 114],[112, 114],[112, 114],[117, 119],[117, 119],[117, 119],[122, 124],[122, 124],[122, 124],[127, 129],[12
7, 129],[127, 129],[132, 134],[132, 134],[132, 134],[137, 139],[137, 139],[137, 139],[142, 144],[142, 144],[142, 144],[147, 149],[147, 149],[147, 149],[152, 154],[152, 154],[152, 154],[157, 159],[157, 
159],[157, 159],[162, 164],[162, 164],[
[REDACTED]
 916],[914, 916],[914, 916],[919, 921],[919, 921],[919, 921],[924, 926],[924, 926],[924, 926],[929, 931],[929, 931],[929, 931],[934, 936],[934, 936],[934, 936],[939, 941],[939, 941],[939, 941],[944, 946],[944, 946],[944, 946],[949, 951],[949, 951],[949, 951],[954, 956],[954, 956],[954, 956],[959, 961],[959, 961],[959, 961],[964, 966],[964, 966],[964, 966],[969, 971],[969, 971],[969, 971],[974, 976],[974, 976],[974, 976],[979, 981],[979, 981],[979, 981],[984, 986],[984, 986],[984, 986],[989, 991],[989, 991],[989, 991],[994, 996],[994, 996],[994, 996],[999, 1001],[999, 1001],[999, 1001],[1004, 1006],[1004, 1006],[1004, 1006],[1009, 1011],[1009, 1011],[1009, 1011],[1014, 1016],[1014, 1016],[1014, 1016],[1019, 1021],[1019, 1021],[1019, 1021],[1024, 1026],{\"rc\": 257, \"msg\": \"Traceback (most recent call last):\\n  File \\\"/home/core/.ansible/tmp/ansible-tmp-1423249741.65-21827059413758/setup\\\", line 1655, in run_command\\n    rfd, wfd, efd = select.select(rpipes, [], rpipes, 1)\\nValueError: file descriptor out of range in select()\\n\", \"cmd\": \"/usr/bin/ip addr show primary veth7abd778\", \"failed\": true}\r\nOpenSSH_5.3p1, OpenSSL 1.0.1e-fips 11 Feb 2013\ndebug1: Reading configuration data /etc/ssh/ssh_config\r\ndebug1: Applying options for *\r\ndebug1: auto-mux: Trying existing master\r\ndebug2: fd 3 setting O_NONBLOCK\r\ndebug2: mux_client_hello_exchange: master version 4\r\ndebug3: mux_client_request_forwards: requesting forwardings: 0 local, 0 remote\r\ndebug3: mux_client_request_session: entering\r\ndebug3: mux_client_request_alive: entering\r\ndebug3: mux_client_request_alive: done pid = 17027\r\ndebug3: mux_client_request_session: session request sent\r\ndebug1: mux_client_request_session: master session id: 2\r\ndebug1: mux_client_request_session: master session id: 2\r\ndebug3: mux_client_read_packet: read header failed: Broken pipe\r\ndebug2: Received exit status from master 0\r\nShared connection to [REDACTED] closed.\r\n", 
    "parsed": false
}

@bcoca
Copy link
Member

bcoca commented Feb 6, 2015

hmm, never tested ansible with pypy, could it be an incompatibility?

@abadger
Copy link
Contributor

abadger commented Feb 8, 2015

@alpo are you using the coreos host machine as your ansible controller or are you running ansible on another machine to configure the coreos host?

@alpo
Copy link
Author

alpo commented Feb 9, 2015

@abadger I'm using CentOS6 host to configure CoreOS machines.

@bcoca bcoca added P2 Priority 2 - Issue Blocks Release bug_report labels Feb 9, 2015
@abadger
Copy link
Contributor

abadger commented Feb 9, 2015

Reproduced the original issue in a fedora=>CoreOS environment. Not sure yet if this is a pypy issue. however, some googling turned up this bug report: https://issues.apache.org/jira/browse/QPID-5588 If the bug report is accurate and it's the value of the file descriptor, not the number of open files then we might not be able to fix this via select, only by moving to poll/epoll/kqueue. I'm not sure about the portability of those APIs though, so we may be between a rock and a hard place if that proves to be correct.

@alpo How are you setting the value of openfiles to ulimit in your CoreOS box? I'd like to get beyond the ulimit issue to the ValueError since that seems to be where we'll either have an unsolvable problem with select() or be able to find a workaround.

@abadger
Copy link
Contributor

abadger commented Feb 9, 2015

python2.7 sourcecode:

#define _PyIsSelectable_fd(FD) (((FD) >= 0) && ((FD) < FD_SETSIZE))

So it is about value, not how many open file descriptors exist.

@alpo
Copy link
Author

alpo commented Feb 9, 2015

@abadger I set nofile for CoreOS using following ansible piece:

- name: set nofile
  lineinfile: dest=/etc/systemd/system.conf
              regexp="DefaultLimitNOFILE"
              line="DefaultLimitNOFILE=65536"
              state=present

This limitation is even limitation of select() system call but not python. See Notes section in http://linux.die.net/man/2/select. Undefined behavior if fd<0 or fd>=FD_SETSIZE.

@abadger
Copy link
Contributor

abadger commented Feb 9, 2015

Compared some invocations of run_command() via setup with your debugging in. Found that pypy vs cypthon on a fedora box makes a difference here:

cpython:

0.0s run_command: "%r," % [x.fileno() for x in rpipes]='[4, 6],'
[many iterations later]
0.1s run_command: "%r," % [x.fileno() for x in rpipes]='[4, 6],'

pypy:

0.0s run_command: "%r," % [x.fileno() for x in rpipes]='[9, 11],'
[many iterations later]
0.2s run_command: "%r," % [x.fileno() for x in rpipes]='[99, 101],'

However, a simple test loop based on run_command()'s code does not exhibit the increasing file descriptors. So it's likely somewhere else in the code that the filehandles are being inflated. CPython releases filehandles once the reference count drops to 0 whereas pypy doesn't do reference counting for garbage collection so files that are out of scope may be hanging around for longer.

abadger added a commit that referenced this issue Feb 9, 2015
Helps control open file descriptor count with pypy (which is used with
one coreos + ansible example).  Part of a fix for
#10157
@abadger
Copy link
Contributor

abadger commented Feb 9, 2015

@alpo If we're going down the correct path for these issues, that commit I just pushed should get you to the point where ansible can handle a coreos machine with 200 containers. The patch brought my localhost test of file descriptors down to 60. Running with the debugging on a coreos VM with 200 running containers showed a high water mark of 414 for the descriptor number.

I'll try to work on facts.py to explicitly close() most of the files it accesses before closing this ticket instead of relying on files going out of scope to close them which should let you scale even more. Unfortunately, since cpython is ubiquitous and pypy is seen infrequently we'll probably never catch all of the places where file.close() is needed but we'll hopefully catch enough that this will no longer be an issue.

If for some reason, the above commit doesn't seem to make a difference, please let me know that I'm pursuing a red herring :-)

@abadger abadger closed this as completed in 402a6d0 Feb 9, 2015
@abadger
Copy link
Contributor

abadger commented Feb 9, 2015

Okay, I think I've converted all of the file access in facts gathering to explicitly close the file. Running this on a coreos VM with 200 running containers shows a high water mark of 7 for the file descriptor number. So at least for the setup module, I think this problem is solved :-)

Closing This Ticket

Hi!

We believe recent commits (likely detailed above) should resolve this question or problem for you.

This will also be included in the next major release.

If you continue seeing any problems related to this issue, or if you have any further questions, please let us know by stopping by one of the two mailing lists, as appropriate:

Because this project is very active, we're unlikely to see comments made on closed tickets, but the mailing list is a great way to ask questions, or post if you don't think this particular
issue is resolved.

Thank you!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug This issue/PR relates to a bug. P2 Priority 2 - Issue Blocks Release
Projects
None yet
Development

No branches or pull requests

6 participants