Dies on too many open files #70

Open
mcanini opened this issue Jan 15, 2014 · 9 comments
@mcanini
Contributor

mcanini commented Jan 15, 2014

Running on mininet with 80 switches crashes with the following backtrace:

("unhandled exception"
 ((lib/monitor.ml.Error_
   ((exn (Unix.Unix_error "Too many open files" accept "((fd 4))"))
    (backtrace
     ("Raised at file \"lib/core_unix.ml\", line 48, characters 11-42"
      "Called from file \"lib/core_unix.ml\", line 2125, characters 17-40"
      "Called from file \"lib/raw_fd.ml\", line 268, characters 11-25"
      "Re-raised at file \"lib/unix_syscalls.ml\", line 786, characters 28-31"
      "Called from file \"lib/deferred.ml\", line 119, characters 6-13"
      "Called from file \"lib/raw_deferred.ml\", line 48, characters 2-10"
      "Called from file \"lib/unix_syscalls.ml\", line 790, characters 4-57"
      "Called from file \"lib/monitor.ml\", line 169, characters 25-32"
      "Called from file \"lib/jobs.ml\", line 213, characters 10-13" ""))
    (monitor
     (((name try_with) (here ()) (id 24421) (has_seen_error true)
       (someone_is_listening true) (kill_index 0))))))
  (Pid 94636)))
libgcc_s.so.1 must be installed for pthread_cancel to work
Aborted
@mcanini
Contributor Author

mcanini commented Jan 15, 2014

I've raised the limit to 4096 but now it dies with this message:

("unhandled exception"
 ((lib/monitor.ml.Error_
   ((exn
     ("key's index out of range" (1024 1024 (Should_be_between_0_and 1023))))
    (backtrace
     ("Raised at file \"lib/error.ml\", line 7, characters 21-29"
      "Called from file \"lib/bounded_int_table.ml\", line 203, characters 8-23"
      "Called from file \"lib/fd_by_descr.ml\", line 31, characters 8-50"
      "Called from file \"lib/raw_scheduler.ml\", line 131, characters 2-38"
      "Called from file \"lib/deferred.ml\", line 119, characters 6-13"
      "Called from file \"lib/raw_deferred.ml\", line 48, characters 2-10"
      "Called from file \"lib/unix_syscalls.ml\", line 790, characters 4-57"
      "Called from file \"lib/monitor.ml\", line 169, characters 25-32"
      "Called from file \"lib/jobs.ml\", line 213, characters 10-13" ""))
    (monitor
     (((name try_with) (here ()) (id 21271) (has_seen_error true)
       (someone_is_listening true) (kill_index 0))))))
  (Pid 96778)))
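
The second failure reads as if the table Async uses to track file descriptors by number has only 1024 slots, so descriptor 1024 is rejected even though the OS limit is now 4096. A rough model of that behaviour, written in Python purely for illustration (the class and its names are hypothetical and are not Async's implementation):

```python
class BoundedIntTable:
    """Toy fixed-capacity table keyed by small non-negative integers."""

    def __init__(self, num_keys):
        self.num_keys = num_keys
        self.slots = [None] * num_keys

    def add(self, key, value):
        # Keys outside [0, num_keys - 1] are rejected outright,
        # no matter what the OS-level descriptor limit is.
        if not 0 <= key < self.num_keys:
            raise IndexError(
                f"key's index out of range "
                f"({key}, should be between 0 and {self.num_keys - 1})"
            )
        self.slots[key] = value


table = BoundedIntTable(1024)
table.add(1023, "last fd that fits")
try:
    table.add(1024, "first fd past the table")
except IndexError as exc:
    print(exc)
```

If this reading is right, raising `ulimit -n` alone cannot help: the scheduler's own maximum has to be raised as well.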

@adferguson
Contributor

By the way, Marco, if you are using Mininet, try upgrading to the latest version. They added code to raise these limits automatically (e.g., open files to 10,000) back in August: mininet/mininet@b20c947
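
For reference, raising the open-files limit from Python can be sketched as follows. This is not Mininet's code; the helper name and the 10,000 target are assumptions chosen to match the figure quoted above, and the soft limit can only be raised as far as the hard limit allows without extra privileges.

```python
import resource


def raise_nofile_limit(target=10000):
    """Raise the soft RLIMIT_NOFILE toward `target`; return (soft, hard)."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return soft, hard  # already unlimited
    new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)
    if new_soft <= soft:
        return soft, hard  # nothing to raise
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    except (ValueError, OSError):
        # Some platforms cap the value below the reported hard limit.
        return soft, hard
    return new_soft, hard


soft, hard = raise_nofile_limit()
print("soft limit:", soft, "hard limit:", hard)
```

The new limit applies to the calling process and the children it starts afterwards, so the controller would have to be launched from that process to inherit it.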

@mcanini
Contributor Author

mcanini commented Jan 15, 2014

Thanks Andrew.

I'm pretty sure I'm stumbling on a fd limit inside async_unix.
Looking at https://github.com/janestreet/async_unix/blob/master/lib/scheduler.mli#L43, I'd like to be able to raise max_num_open_file_descrs instead of using the default in Config.

@seliopou
Collaborator

Using go_main allows you to set the maximum number of open file descriptors.
Possibly relevant: here's how the default is determined. It looks like async
checks for the presence of epoll and select and sets the default accordingly.
Does select need further configuration to handle more fds?


@adferguson
Contributor

easier still, here's how to change the default from the command line: https://ocaml.janestreet.com/ocaml-core/109.58.00/doc/async/#Std.Async_config
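
A minimal sketch of that approach, assuming the linked page describes the `ASYNC_CONFIG` environment variable and that the field is spelled `max_num_open_file_descrs` as in the scheduler interface mentioned earlier in the thread. The helper name is hypothetical, and the exact sexp format should be checked against the linked documentation.

```python
import os


def async_config_env(max_fds, base_env=None):
    """Return a copy of the environment with ASYNC_CONFIG set for Async."""
    env = dict(os.environ if base_env is None else base_env)
    # Assumed format: a sexp record with one field.
    env["ASYNC_CONFIG"] = f"((max_num_open_file_descrs {max_fds}))"
    return env


env = async_config_env(4096)
print(env["ASYNC_CONFIG"])  # ((max_num_open_file_descrs 4096))
```

The resulting mapping can be passed as `env=env` to `subprocess.run` when launching the controller, which is equivalent to prefixing the command with the variable assignment in a shell.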

seliopou added a commit to frenetic-lang/frenetic that referenced this issue Jan 15, 2014
The file descriptor limit seems to be set to 1024 by async. This commit
makes katnetic use a different entry point for the scheduler, which
allows the user to configure the maximum number of file descriptors.

Related to frenetic-lang/ocaml-openflow#70.
@mcanini
Contributor Author

mcanini commented Jan 15, 2014

Hmm, brute force didn't really work. It still dies with "Too many open files" accept "((fd 6))" even though I raised the limit to 4096.

@mcanini
Contributor Author

mcanini commented Jan 15, 2014

I think we are leaking file descriptors. See this execution trace:

accept(6, {sa_family=AF_INET, sin_port=htons(45375), sin_addr=inet_addr("127.0.0.1")}, [16]) = 4091
fcntl(4091, F_GETFD)                    = 0
fcntl(4091, F_SETFD, FD_CLOEXEC)        = 0
setsockopt(4091, SOL_TCP, TCP_NODELAY, [1], 4) = 0
mmap(NULL, 135168, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0d6b654000
mmap(NULL, 135168, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0d6b633000
futex(0x27ca4c4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x27ca4c0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x27ca490, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x240d874, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x240d870, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x240d840, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x122b5d4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x122b5d0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
accept(6, {sa_family=AF_INET, sin_port=htons(45376), sin_addr=inet_addr("127.0.0.1")}, [16]) = 4092
fcntl(4092, F_GETFD)                    = 0
fcntl(4092, F_SETFD, FD_CLOEXEC)        = 0
setsockopt(4092, SOL_TCP, TCP_NODELAY, [1], 4) = 0
mmap(NULL, 135168, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0d6b612000
mmap(NULL, 135168, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0d6b5f1000
futex(0x240d874, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x240d870, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x240d840, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x27ca4c4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x27ca4c0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x27ca490, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x122b5d4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x122b5d0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
accept(6, {sa_family=AF_INET, sin_port=htons(45383), sin_addr=inet_addr("127.0.0.1")}, [16]) = 4093
fcntl(4093, F_GETFD)                    = 0
fcntl(4093, F_SETFD, FD_CLOEXEC)        = 0
setsockopt(4093, SOL_TCP, TCP_NODELAY, [1], 4) = 0
mmap(NULL, 135168, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0d6b5d0000
mmap(NULL, 135168, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0d6b5af000
futex(0x27ca4c4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x27ca4c0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x27ca490, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x240d874, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x240d870, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x240d840, FUTEX_WAKE_PRIVATE, 1) = 1
rt_sigprocmask(SIG_BLOCK, [VTALRM], [], 8) = 0
futex(0x122b5d4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x122b5d0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
accept(6, {sa_family=AF_INET, sin_port=htons(45387), sin_addr=inet_addr("127.0.0.1")}, [16]) = 4094

@seliopou
Collaborator

seliopou commented Feb 3, 2014

@mcanini, this has been resolved, correct?

@mcanini
Contributor Author

mcanini commented Feb 3, 2014

@seliopou: no, this is not yet resolved. With a bigger accept queue, we only pushed back the point at which we start leaking.
There was a very small example that Arjung wrote in the fd-leak branch that could reproduce the problem. We should report the bug to Jane Street based on that example.
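
When putting such a report together, it may help to show the descriptor count growing over time. A Linux-only sketch that counts a process's open descriptors through `/proc`; this is not the example from the fd-leak branch, and the helper name is hypothetical.

```python
import os


def count_open_fds(pid="self"):
    """Count entries in /proc/<pid>/fd (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


if os.path.isdir("/proc/self/fd"):
    before = count_open_fds()
    read_end, write_end = os.pipe()  # opens two descriptors
    after = count_open_fds()
    print("descriptors opened:", after - before)
    os.close(read_end)
    os.close(write_end)
```

Called periodically with the controller's pid while switches connect and disconnect, a count that only ever rises would be consistent with a leak.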

@seliopou seliopou added the bug label Feb 5, 2014
@seliopou seliopou added this to the 0.4.0 Release milestone Feb 5, 2014
@seliopou seliopou removed this from the 0.3.0 Release milestone Mar 28, 2014