"Too many open files" or "file descriptor out of range in select()" while gathering facts from CoreOS machine with 200 running containers #10157

alpo · 2015-02-03T13:53:54Z

while executing
ansible -i infrastructure somehost.fqdn -m setup
I'm getting following result

somehost.fqdn | FAILED >> {
    "cmd": "/usr/bin/ip addr show primary vethd91fe30", 
    "failed": true, 
    "msg": "[Errno 24] Too many open files", 
    "rc": 24
}

Result is same while using ansible release 1.8.2 and devel brach. Everything works fine if I kill all containers or have 40 containers running. I'm using defunctzombie.coreos-bootstrap role with docker-py 0.6.0 to prepare CoreOS machine.

The text was updated successfully, but these errors were encountered:

sivel · 2015-02-03T15:58:58Z

This sounds pretty straight forward. The ulimit set in CoreOS is set too low. In order to allow more open files, you have to increase the ulimit.

alpo · 2015-02-03T19:04:09Z

@sivel Thanks, I trusted the person who checked ulimit for me and said it was 3271601, but it was 1024.

alpo · 2015-02-04T08:49:56Z

This is probably descriptors leak.

After increasing of 'open files' limit to 65635 I'm getting another error:

somehost.fqdn | FAILED >> {
    "cmd": "/usr/bin/ip addr show primary veth7abf1d2", 
    "failed": true, 
    "msg": "Traceback (most recent call last):
  File \"/home/core/.ansible/tmp/ansible-tmp-1423039255.08-164539437898303/setup\", line 1654, in run_command
    rfd, wfd, efd = select.select(rpipes, [], rpipes, 1)
ValueError: file descriptor out of range in select()
", 
    "rc": 257
}

abadger · 2015-02-05T03:00:58Z

How many forks do you have ansible configured to use? Although it would be slow, does it work if you set it to use 1 fork?

alpo · 2015-02-05T09:03:45Z

@abadger It was 5 forks by default. With forks=1 it gives same error.

alpo · 2015-02-05T09:24:01Z

It smells like a necessity of switch from select.select() to select.poll/epoll/kqueue.

gtirloni · 2015-02-06T17:54:54Z

@alpo I don't know much about select vs. poll/epoll/kqueue, could you elaborate on how moving to those options would solve this issue?

When the issue happens, can you see a lot of processes running? I'm wondering if something isn't taking too long. I tried to see if something pointed to a fd being leaked but I couldn't find much (no surprise there).

Additionally, forks is used to control how many parallel connections are opened, the actual script doing the work in the remote host is not constrained about it.

bcoca · 2015-02-06T18:34:19Z

epoll/kqueue have to do with scaling, we should only go there once you have
1000+ sockets open, otherwise it is actually slower than select.

gathering facts is mostly opening kernel infor files ( /proc, /dev, /sys)
in a mostly sequential manner, unless your version of python is leaking
descriptors and not closing them, I cannot see how you hit the file open

limit.

Brian Coca

alpo · 2015-02-06T18:57:57Z

The problem with select() is that it has FD_SETSIZE hard limit for number of descriptors (1024 for Linux), as I know.
@gtirloni Regarding number of processes: control machine has 342 and increase to 345 while ansible executing, machine under control has 614 and 620 respectively.

alpo · 2015-02-06T19:18:55Z

I see this hit. I've added sys.stderr.write("%r," % [x.fileno() for x in rpipes]) before https://github.com/ansible/ansible/blob/devel/lib/ansible/module_utils/basic.py#L1513 and got (using -vvvv option)

<[REDACTED]> ESTABLISH CONNECTION FOR USER: core  
<[REDACTED]> REMOTE_MODULE setup  
<[REDACTED]> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o ControlPersist=60s -o ControlPath="/root/.ansible/cp/ansible-ssh-%h-%p-%r" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=g
ssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=core -o ConnectTimeout=10 [REDACTED] /bin/sh -c 'mkdir -p $HOME/.ansible/tmp/ansible-tmp-1423249741.65-2182705941
3758 && echo $HOME/.ansible/tmp/ansible-tmp-1423249741.65-21827059413758'  
<[REDACTED]> PUT /tmp/tmpg8T7hM TO /home/core/.ansible/tmp/ansible-tmp-1423249741.65-21827059413758/setup  
<[REDACTED]> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o ControlPersist=60s -o ControlPath="/root/.ansible/cp/ansible-ssh-%h-%p-%r" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=g
ssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=core -o ConnectTimeout=10 [REDACTED] /bin/sh -c 'LANG=en_US.UTF-8 LC_CTYPE=en_US.UTF-8 PATH=/home/core/bin:$PATH 
python /home/core/.ansible/tmp/ansible-tmp-1423249741.65-21827059413758/setup; rm -rf /home/core/.ansible/tmp/ansible-tmp-1423249741.65-21827059413758/ >/dev/null 2>&1'  
[REDACTED] | FAILED >> {  
    "failed": true,   
    "msg": "/home/core/pypy/bin/pypy: /lib64/libssl.so.1.0.0: no version information available (required by /home/core/pypy/bin/pypy)\r\n/home/core/pypy/bin/pypy: /lib64/libcrypto.so.1.0.0: no version 
information available (required by /home/core/pypy/bin/pypy)\r\n[31, 33],[31, 33],[31, 33],[78, 80],[78],[78, 80],[78],[78, 80],[78],[87, 89],[87, 89],[87, 89],[87, 89],[92, 94],[92, 94],[92, 94],[97, 
99],[97, 99],[97, 99],[102, 104],[102, 104],[102, 104],[107, 109],[107, 109],[107, 109],[112, 114],[112, 114],[112, 114],[117, 119],[117, 119],[117, 119],[122, 124],[122, 124],[122, 124],[127, 129],[12
7, 129],[127, 129],[132, 134],[132, 134],[132, 134],[137, 139],[137, 139],[137, 139],[142, 144],[142, 144],[142, 144],[147, 149],[147, 149],[147, 149],[152, 154],[152, 154],[152, 154],[157, 159],[157, 
159],[157, 159],[162, 164],[162, 164],[
[REDACTED]
 916],[914, 916],[914, 916],[919, 921],[919, 921],[919, 921],[924, 926],[924, 926],[924, 926],[929, 931],[929, 931],[929, 931],[934, 936],[934, 936],[934, 936],[939, 941],[939, 941],[939, 941],[944, 946],[944, 946],[944, 946],[949, 951],[949, 951],[949, 951],[954, 956],[954, 956],[954, 956],[959, 961],[959, 961],[959, 961],[964, 966],[964, 966],[964, 966],[969, 971],[969, 971],[969, 971],[974, 976],[974, 976],[974, 976],[979, 981],[979, 981],[979, 981],[984, 986],[984, 986],[984, 986],[989, 991],[989, 991],[989, 991],[994, 996],[994, 996],[994, 996],[999, 1001],[999, 1001],[999, 1001],[1004, 1006],[1004, 1006],[1004, 1006],[1009, 1011],[1009, 1011],[1009, 1011],[1014, 1016],[1014, 1016],[1014, 1016],[1019, 1021],[1019, 1021],[1019, 1021],[1024, 1026],{\"rc\": 257, \"msg\": \"Traceback (most recent call last):\\n  File \\\"/home/core/.ansible/tmp/ansible-tmp-1423249741.65-21827059413758/setup\\\", line 1655, in run_command\\n    rfd, wfd, efd = select.select(rpipes, [], rpipes, 1)\\nValueError: file descriptor out of range in select()\\n\", \"cmd\": \"/usr/bin/ip addr show primary veth7abd778\", \"failed\": true}\r\nOpenSSH_5.3p1, OpenSSL 1.0.1e-fips 11 Feb 2013\ndebug1: Reading configuration data /etc/ssh/ssh_config\r\ndebug1: Applying options for *\r\ndebug1: auto-mux: Trying existing master\r\ndebug2: fd 3 setting O_NONBLOCK\r\ndebug2: mux_client_hello_exchange: master version 4\r\ndebug3: mux_client_request_forwards: requesting forwardings: 0 local, 0 remote\r\ndebug3: mux_client_request_session: entering\r\ndebug3: mux_client_request_alive: entering\r\ndebug3: mux_client_request_alive: done pid = 17027\r\ndebug3: mux_client_request_session: session request sent\r\ndebug1: mux_client_request_session: master session id: 2\r\ndebug1: mux_client_request_session: master session id: 2\r\ndebug3: mux_client_read_packet: read header failed: Broken pipe\r\ndebug2: Received exit status from master 0\r\nShared connection to [REDACTED] closed.\r\n", 
    "parsed": false
}

bcoca · 2015-02-06T19:54:45Z

hmm, never tested ansible with pypy, could it be an incompatibility?

abadger · 2015-02-08T22:32:52Z

@alpo are you using the coreos host machine as your ansible controller or are you running ansible on another machine to configure the coreos host?

alpo · 2015-02-09T08:06:49Z

@abadger I'm using CentOS6 host to configure CoreOS machines.

abadger · 2015-02-09T16:33:23Z

Reproduced the original issue in a fedora=>CoreOS environment. Not sure yet if this is a pypy issue. however, some googling turned up this bug report: https://issues.apache.org/jira/browse/QPID-5588 If the bug report is accurate and it's the value of the file descriptor, not the number of open files then we might not be able to fix this via select, only by moving to poll/epoll/kqueue. I'm not sure about the portability of those APIs though, so we may be between a rock and a hard place if that proves to be correct.

@alpo How are you setting the value of openfiles to ulimit in your CoreOS box? I'd like to get beyond the ulimit issue to the ValueError since that seems to be where we'll either have an unsolvable problem with select() or be able to find a workaround.

abadger · 2015-02-09T16:41:19Z

python2.7 sourcecode:

#define _PyIsSelectable_fd(FD) (((FD) >= 0) && ((FD) < FD_SETSIZE))

So it is about value, not how many open file descriptors exist.

alpo · 2015-02-09T17:33:58Z

@abadger I set nofile for CoreOS using following ansible piece:

- name: set nofile
  lineinfile: dest=/etc/systemd/system.conf
              regexp="DefaultLimitNOFILE"
              line="DefaultLimitNOFILE=65536"
              state=present

This limitation is even limitation of select() system call but not python. See Notes section in http://linux.die.net/man/2/select. Undefined behavior if fd<0 or fd>=FD_SETSIZE.

abadger · 2015-02-09T20:55:46Z

Compared some invocations of run_command() via setup with your debugging in. Found that pypy vs cypthon on a fedora box makes a difference here:

cpython:

0.0s run_command: "%r," % [x.fileno() for x in rpipes]='[4, 6],'
[many iterations later]
0.1s run_command: "%r," % [x.fileno() for x in rpipes]='[4, 6],'

pypy:

0.0s run_command: "%r," % [x.fileno() for x in rpipes]='[9, 11],'
[many iterations later]
0.2s run_command: "%r," % [x.fileno() for x in rpipes]='[99, 101],'

However, a simple test loop based on run_command()'s code does not exhibit the increasing file descriptors. So it's likely somewhere else in the code that the filehandles are being inflated. CPython releases filehandles once the reference count drops to 0 whereas pypy doesn't do reference counting for garbage collection so files that are out of scope may be hanging around for longer.

Helps control open file descriptor count with pypy (which is used with one coreos + ansible example). Part of a fix for #10157

abadger · 2015-02-09T21:17:26Z

@alpo If we're going down the correct path for these issues, that commit I just pushed should get you to the point where ansible can handle a coreos machine with 200 containers. The patch brought my localhost test of file descriptors down to 60. Running with the debugging on a coreos VM with 200 running containers showed a high water mark of 414 for the descriptor number.

I'll try to work on facts.py to explicitly close() most of the files it accesses before closing this ticket instead of relying on files going out of scope to close them which should let you scale even more. Unfortunately, since cpython is ubiquitous and pypy is seen infrequently we'll probably never catch all of the places where file.close() is needed but we'll hopefully catch enough that this will no longer be an issue.

If for some reason, the above commit doesn't seem to make a difference, please let me know that I'm pursuing a red herring :-)

abadger · 2015-02-09T22:34:46Z

Okay, I think I've converted all of the file access in facts gathering to explicitly close the file. Running this on a coreos VM with 200 running containers shows a high water mark of 7 for the file descriptor number. So at least for the setup module, I think this problem is solved :-)

Closing This Ticket

Hi!

We believe recent commits (likely detailed above) should resolve this question or problem for you.

This will also be included in the next major release.

If you continue seeing any problems related to this issue, or if you have any further questions, please let us know by stopping by one of the two mailing lists, as appropriate:

https://groups.google.com/forum/#!forum/ansible-project - for user questions, tips, and tricks
https://groups.google.com/forum/#!forum/ansible-devel - for strategy, future planning, and questions about writing code

Because this project is very active, we're unlikely to see comments made on closed tickets, but the mailing list is a great way to ask questions, or post if you don't think this particular
issue is resolved.

Thank you!

Helps control open file descriptor count with pypy (which is used with one coreos + ansible example). Part of a fix for #10157

alpo closed this as completed Feb 3, 2015

alpo changed the title ~~"[Errno 24] Too many open files" while gathering facts from CoreOS machine with 200 running containers~~ (probably descriptors leak) "Too many open files" while gathering facts from CoreOS machine with 200 running containers Feb 4, 2015

alpo reopened this Feb 4, 2015

alpo mentioned this issue Feb 7, 2015

CPython as option defunctzombie/ansible-coreos-bootstrap#9

Closed

bcoca added P2 Priority 2 - Issue Blocks Release bug_report labels Feb 9, 2015

abadger added a commit that referenced this issue Feb 9, 2015

Close some file handles explicitly in facts.py

425dee1

Helps control open file descriptor count with pypy (which is used with one coreos + ansible example). Part of a fix for #10157

abadger closed this as completed in 402a6d0 Feb 9, 2015

abadger added a commit that referenced this issue Feb 17, 2015

Close some file handles explicitly in facts.py

8be7716

Helps control open file descriptor count with pypy (which is used with one coreos + ansible example). Part of a fix for #10157

chrrrles mentioned this issue Jan 27, 2016

ansible 2.0.0.2 - ValueError: file descriptor out of range in select() #14143

Closed

p-l- mentioned this issue Feb 22, 2016

filedescriptor out of range in select() secdev/scapy#67

Closed

andreyto mentioned this issue Feb 26, 2018

cwltoil: Too many file descriptors and filedescriptor out of range in select() DataBiosphere/toil#2087

Closed

ansibot added bug This issue/PR relates to a bug. and removed bug_report labels Mar 6, 2018

ForWhomTheBellCodes mentioned this issue Mar 12, 2018

sshuttle fails with "filedescriptor out of range in select()" sshuttle/sshuttle#212

Open

ipv6freely mentioned this issue Jul 19, 2018

ValueError: filedescriptor out of range in select() ncclient/ncclient#254

Closed

aimeejtix mentioned this issue Jul 23, 2018

Filedescriptor out of range in select() robotframework/SSHLibrary#252

Closed

ansible locked and limited conversation to collaborators Apr 25, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

"Too many open files" or "file descriptor out of range in select()" while gathering facts from CoreOS machine with 200 running containers #10157

"Too many open files" or "file descriptor out of range in select()" while gathering facts from CoreOS machine with 200 running containers #10157

alpo commented Feb 3, 2015

sivel commented Feb 3, 2015

alpo commented Feb 3, 2015

alpo commented Feb 4, 2015

abadger commented Feb 5, 2015

alpo commented Feb 5, 2015

alpo commented Feb 5, 2015

gtirloni commented Feb 6, 2015

bcoca commented Feb 6, 2015

alpo commented Feb 6, 2015

alpo commented Feb 6, 2015

bcoca commented Feb 6, 2015

abadger commented Feb 8, 2015

alpo commented Feb 9, 2015

abadger commented Feb 9, 2015

abadger commented Feb 9, 2015

alpo commented Feb 9, 2015

abadger commented Feb 9, 2015

abadger commented Feb 9, 2015

abadger commented Feb 9, 2015

"Too many open files" or "file descriptor out of range in select()" while gathering facts from CoreOS machine with 200 running containers #10157

"Too many open files" or "file descriptor out of range in select()" while gathering facts from CoreOS machine with 200 running containers #10157

Comments

alpo commented Feb 3, 2015

sivel commented Feb 3, 2015

alpo commented Feb 3, 2015

alpo commented Feb 4, 2015

This is probably descriptors leak.

abadger commented Feb 5, 2015

alpo commented Feb 5, 2015

alpo commented Feb 5, 2015

gtirloni commented Feb 6, 2015

bcoca commented Feb 6, 2015

limit.

alpo commented Feb 6, 2015

alpo commented Feb 6, 2015

bcoca commented Feb 6, 2015

abadger commented Feb 8, 2015

alpo commented Feb 9, 2015

abadger commented Feb 9, 2015

abadger commented Feb 9, 2015

alpo commented Feb 9, 2015

abadger commented Feb 9, 2015

abadger commented Feb 9, 2015

abadger commented Feb 9, 2015

Closing This Ticket