Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MacOS poll() is broken - gsi with --enable-poll consumes 100% CPU #403

Closed
lassik opened this issue Aug 13, 2019 · 50 comments
Closed

MacOS poll() is broken - gsi with --enable-poll consumes 100% CPU #403

lassik opened this issue Aug 13, 2019 · 50 comments

Comments

@lassik
Copy link
Contributor

lassik commented Aug 13, 2019

Gambit installed on Mac OS Mojave from the Homebrew package manager (brew install gambit-scheme) consumes all available CPU even when idle: just start gsi from the shell and wait a few seconds.

The Homebrew formula (which can be viewed with brew edit gambit-scheme) configures Gambit with:

--enable-single-host
--enable-multiple-versions
--enable-default-runtime-options=f8,-8,t8
--enable-openssl

Possibly related Linux problem: #126

Does this have something to do with signals and/or threads?

@lassik
Copy link
Contributor Author

lassik commented Aug 13, 2019

I assume this version is compiled with clang. gcc --version on this Mac says: Apple LLVM version 10.0.1 (clang-1001.0.46.4)

@lassik
Copy link
Contributor Author

lassik commented Aug 17, 2019

Putting gsi in (thread-sleep! 100) makes the problem go away. It only consumes all CPU when not sleeping.

@lassik
Copy link
Contributor Author

lassik commented Aug 17, 2019

Compiling with the default configure flags from the official release, the problem is gone:

tar -xf gambit-v4_9_3.tgz
cd gambit-v4_9_3/
./configure
make
./gsi/gsi

The same release 4.9.3 is used as the basis of the Homebrew build. As mentioned above, the configure flags used by Homebrew are:

--enable-single-host
--enable-multiple-versions
--enable-default-runtime-options=f8,-8,t8
--enable-openssl

Which one would be most likely to cause the problem? I can try compiling from the official source with some of those flags to find out if I can reproduce the problem.

@lassik
Copy link
Contributor Author

lassik commented Aug 17, 2019

Tried again with --enable-single-host and it does not have the problem:

tar -xf gambit-v4_9_3.tgz
cd gambit-v4_9_3/
./configure --enable-single-host
make
gsi/gsi

@lassik
Copy link
Contributor Author

lassik commented Aug 17, 2019

What does --enable-default-runtime-options=f8,-8,t8 from the Homebrew build do?

@lassik
Copy link
Contributor Author

lassik commented Aug 17, 2019

@feeley
Copy link
Member

feeley commented Aug 17, 2019

Yes, they set the default character encoding to UTF-8 (f8 is for file I/O, -8 for stdio I/O, t8 for terminal I/O).

@lassik
Copy link
Contributor Author

lassik commented Aug 17, 2019

Would threads be a likely culprit? If there a way to disable multi-threading, or a Scheme procedure / REPL command to view some kind of thread status while gsi is running?

If I understood correctly, use of multiple OS threads is disabled in the default build, but "green threads" (i.e. threads emulated by the Gambit runtime) are always on.

@feeley
Copy link
Member

feeley commented Aug 17, 2019

I'll try to reproduce on my end. It may have to do with the UTF-8 parser.

@lassik
Copy link
Contributor Author

lassik commented Aug 17, 2019

Thanks! I'll also retry on my end with ./configure --enable-default-runtime-options=f8,-8,t8 as the only non-standard configure flags.

@lassik
Copy link
Contributor Author

lassik commented Aug 17, 2019

I have now tried builds with each of these configure option sets:

./configure
./configure --enable-single-host
./configure --enable-default-runtime-options=f8,-8,t8
./configure --enable-single-host --enable-default-runtime-options=f8,-8,t8

None of those exhibit the problem that the pre-compiled build from Homebrew has.

@feeley
Copy link
Member

feeley commented Aug 17, 2019

Did you try with/without --enable-openssl? That's the only one left...

Otherwise, start in lldb or gdb and check where it is looping...

@lassik
Copy link
Contributor Author

lassik commented Aug 17, 2019

I didn't enable that one for any of my manual builds.

@lassik
Copy link
Contributor Author

lassik commented Aug 17, 2019

With ./configure --enable-openssl I don't experience problems either.

@lassik
Copy link
Contributor Author

lassik commented Aug 17, 2019

Nor with ./configure --enable-single-host --enable-openssl. I'm stumped.

@lassik
Copy link
Contributor Author

lassik commented Aug 17, 2019

If you have access to a Mac, please try brew install gambit-scheme. It installs a gsi command. If you just run that with no arguments and wait about 5-10 seconds and go to Activity Monitor's CPU tab, it should show gsi near 100% usage.

@feeley
Copy link
Member

feeley commented Aug 17, 2019

I reinstalled gambit-scheme using brew on my mac and I don't get the 100% usage...

Maybe you have to debug that on your end.

@feeley
Copy link
Member

feeley commented Aug 17, 2019

I wonder if the problem might be caused by an unusual setting of your tty. Can you start gsi from a bash shell in a plain Terminal and see if the problem goes away?

@lassik
Copy link
Contributor Author

lassik commented Aug 18, 2019

Interesting. Here's an lldb debugging session with the build that consumes all CPU. The places where the process stopped is where I pressed Ctrl+C.

$ lldb -- gsi -:d-
(lldb) target create "gsi"
Current executable set to 'gsi' (x86_64).
(lldb) settings set -- target.run-args  "-:d-"
(lldb) run
Process 73227 launched: '/Users/lassi/.local/bin/gsi' (x86_64)
Gambit v4.9.3

> Process 73227 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
    frame #0: 0x00007fff66b2d296 libsystem_kernel.dylib`sigprocmask + 10
libsystem_kernel.dylib`sigprocmask:
->  0x7fff66b2d296 <+10>: jae    0x7fff66b2d2a0            ; <+20>
    0x7fff66b2d298 <+12>: movq   %rax, %rdi
    0x7fff66b2d29b <+15>: jmp    0x7fff66b27457            ; cerror_nocancel
    0x7fff66b2d2a0 <+20>: retq
Target 0: (gsi) stopped.
(lldb) cont
Process 73227 resuming
Process 73227 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
    frame #0: 0x0000000100172ab8 gsi`___H___thread + 1707
gsi`___H___thread:
->  0x100172ab8 <+1707>: leaq   0x4500(%rdx), %rcx
    0x100172abf <+1714>: movq   %rcx, 0x2a0(%rsp)
    0x100172ac7 <+1722>: leaq   0x7660(%rdx), %rcx
    0x100172ace <+1729>: movq   %rcx, 0x1d0(%rsp)
Target 0: (gsi) stopped.
(lldb) cont
Process 73227 resuming
Process 73227 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
    frame #0: 0x00000001001754b5 gsi`___H___thread + 12456
gsi`___H___thread:
->  0x1001754b5 <+12456>: movq   0x38(%r12), %rax
    0x1001754ba <+12461>: jmp    0x10018042e               ; <+57377>
    0x1001754bf <+12466>: leaq   0x34e0(%rdx), %r9
    0x1001754c6 <+12473>: jmp    0x1001802c2               ; <+57013>
Target 0: (gsi) stopped.
(lldb) cont
Process 73227 resuming
Process 73227 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
    frame #0: 0x00007fff66b2f36e libsystem_kernel.dylib`poll + 10
libsystem_kernel.dylib`poll:
->  0x7fff66b2f36e <+10>: jae    0x7fff66b2f378            ; <+20>
    0x7fff66b2f370 <+12>: movq   %rax, %rdi
    0x7fff66b2f373 <+15>: jmp    0x7fff66b27421            ; cerror
    0x7fff66b2f378 <+20>: retq
Target 0: (gsi) stopped.
(lldb) info threads
error: 'info' is not a valid command.
(lldb) thread list
Process 73227 stopped
* thread #1: tid = 0x2e4626, 0x00007fff66b2f36e libsystem_kernel.dylib`poll + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
(lldb)

Doesn't seem all that different from the build without the problem.

@lassik
Copy link
Contributor Author

lassik commented Aug 18, 2019

Using the ordinary Mac Terminal.app doesn't make a difference, and bash is my shell.

@lassik
Copy link
Contributor Author

lassik commented Aug 18, 2019

The executable /Users/lassi/.local/bin/gsi is a symlink to /usr/local/Cellar/gambit-scheme/4.9.3/v4.9.3/bin/gsi so it's the standard Homebrew version.

@lassik
Copy link
Contributor Author

lassik commented Aug 18, 2019

Each time I stopped the process above, it was at about 97% CPU (which I guess is the maximum that the kernel gives to a normal process without a special scheduling priority). When I stopped it, the CPU usage naturally also stopped. When I continued it, the CPU usage "revs up" from zero to 97% in about 5 seconds. The rev up takes a little longer when running in the debugger. don't know why it takes several seconds to get to full usage.

@lassik
Copy link
Contributor Author

lassik commented Aug 18, 2019

I tried stopping the process when it hits 97% CPU and then single-stepping it. It runs lots of simple instructions in ___H___thread. I think at one point it got into a stack overflow handler. Here's a backtrace:

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = instruction step into
  * frame #0: 0x000000010017d12d gsi`___H___thread + 44320
    frame #1: 0x00000001000059c4 gsi`___trampoline + 26
    frame #2: 0x0000000100005aff gsi`___call + 283
    frame #3: 0x000000010000504e gsi`___run + 106
    frame #4: 0x0000000100006f28 gsi`___setup + 3980
    frame #5: 0x0000000100004a1b gsi`___main + 3099
    frame #6: 0x0000000100010f19 gsi`___main_char + 314
    frame #7: 0x00007fff669f23d5 libdyld.dylib`start + 1
    frame #8: 0x00007fff669f23d5 libdyld.dylib`start + 1

At one point it got there via device_select. Does this say anything obvious?

@lassik
Copy link
Contributor Author

lassik commented Aug 18, 2019

Seems like it is selecting on the terminal (as it should) but maybe the select or parsing doesn't work as expected.

OK, gsi -:d- causes 100% CPU usage in lldb but not without lldb! But the default gsi or gsi -:dc (without lldb) uses all CPU.

@lassik
Copy link
Contributor Author

lassik commented Aug 18, 2019

gsi -:dc is hard to run in lldb since it constantly stops with SIGTTOU.

@feeley
Copy link
Member

feeley commented Aug 18, 2019

I always wondered about that one... any idea why?

As for the 100% CPU usage problem, we will have to use some of the heavy duty debugging tricks to figure this one out, to see what Scheme code is being executed.

Please ./configure <usual-options> --enable-debug-host-changes. Then run gsi and exit after you reach 100% usage. This will generate a gambit.log file with all of the control flow changes between Scheme procedures. That will tell us where gsi is spending its time. Note that because of all the logging, the execution is very slow (maybe 100 times slower that usual) so it may take a while to reach 100% usage.

@lassik
Copy link
Contributor Author

lassik commented Aug 18, 2019

I always wondered about that one... any idea why?

You mean the SIGTTOU? I can look into it. Need to dive into Unix job control anyway.

As for the 100% CPU usage problem, we will have to use some of the heavy duty debugging tricks to figure this one out, to see what Scheme code is being executed.

Please ./configure <usual-options> --enable-debug-host-changes. Then run gsi and exit after you reach 100% usage. This will generate a gambit.log file with all of the control flow changes between Scheme procedures. That will tell us where gsi is spending its time. Note that because of all the logging, the execution is very slow (maybe 100 times slower that usual) so it may take a while to reach 100% usage.

Sounds you have some impressive debugging support built in.

Unfortunately for debugging purposes, all of the builds I have made manually with ./configure and make have worked fine. I only have the binary pre-built by the Homebrew build cluster for the problematic copy. shasum /usr/local/Cellar/gambit-scheme/4.9.3/v4.9.3/bin/gsi gives 57313de8c117de9bb5e14666dddd73b955537b70. I uploaded that exact binary here: https://misc.lassi.io/2019/gambit-homebrew-gsi-with-cpu-problem.

@feeley
Copy link
Member

feeley commented Aug 18, 2019

I confirm your binary is using 100% CPU.

@feeley
Copy link
Member

feeley commented Aug 18, 2019

--enable-poll was also used... hmmm

@lassik
Copy link
Contributor Author

lassik commented Aug 18, 2019

I missed --enable-poll because it's not in the latest Homebrew formula (which can be viewed with brew edit gambit-scheme). But brew log --patch gambit-scheme shows this Git commit:

commit a60ae6cd41ee43aa66adf5fcf9180d950e5bd10b
Author: ober <jaimef@linbsd.org>
Date:   Tue May 14 09:42:47 2019 -0700

    gambit-scheme: remove poll

    Closes #39850.

    Signed-off-by: FX Coudert <fxcoudert@gmail.com>

diff --git a/Formula/gambit-scheme.rb b/Formula/gambit-scheme.rb
index b07082cdda..fb15f14d47 100644
--- a/Formula/gambit-scheme.rb
+++ b/Formula/gambit-scheme.rb
@@ -18,7 +18,6 @@ class GambitScheme < Formula
       --enable-single-host
       --enable-multiple-versions
       --enable-default-runtime-options=f8,-8,t8
-      --enable-poll
       --enable-openssl
     ]

And indeed, here's the issue: Homebrew/homebrew-core#39850

What I don't understand is why the old gsi binary persisted on my system. I upgrade and clean all my brew packages at least weekly.

@feeley
Copy link
Member

feeley commented Aug 18, 2019

For next time, you can get the configure options of gsi with gsi -v.

The poll issue should be debugged and fixed. The use of select, poll and ppoll should be equivalent in behaviour, but possibly different performance.

@lassik
Copy link
Contributor Author

lassik commented Aug 18, 2019

gsi -v is great! I'm just about to submit a SRFI that specifies some REPL amenities mostly adapted from the Common Lisp standard. I thought it'd be good to add a standard (bug-report) procedure for REPL use. It would show the kinds of things gsi -v shows, plus a web link where to submit the report (e.g. GitHub issues). Though a command line flag is useful to have as well (especially for management scripts that want to check the version without actually starting the interpreter).

The use of select, poll and ppoll should be equivalent in behaviour, but possibly different performance.

Which one does Mac Gambit use with --enable-poll currently?

I believe the libuv source code is the canonical reference for how to do all this portable and fast polling stuff nowadays. I plan to study it myself as well.

@lassik
Copy link
Contributor Author

lassik commented Aug 18, 2019

I did brew uninstall gambit-scheme followed by brew install gambit-scheme and now I also have the newest build without --enable-poll that works flawlessly. But at no point did Homebrew say anything about a new version of Gambit being available, and didn't upgrade to it without a manual uninstall+install. Strange. Maybe it's because the Gambit version is 4.9.3 for both builds. Many distro package managers append a running number like 4.9.3_1, 4.9.3_2, etc. to disambiguate fixed builds of the same upstream version, but apparently that wasn't done here.

@lassik
Copy link
Contributor Author

lassik commented Aug 18, 2019

Just replicated the problem from the 4.9.3 source tarball:

tar -xf gambit-v4_9_3.tgz
cd gambit-v4_9_3/
./configure --enable-poll
make
gsi/gsi        # Has the bug.
gsi/gsi -:dc   # Has the bug.
gsi/gsi -:d-   # Does not have the bug.

@feeley
Copy link
Member

feeley commented Aug 18, 2019

Yes I'm looking into the issue, and am baffled that poll returns immediately when given a terminal file descriptor. I thought it was the terminal mode, but I commented out the tcsetattr and still the same problem. Perhaps you can look into this, I have other pressing things to do (non-Gambit).

@lassik
Copy link
Contributor Author

lassik commented Aug 18, 2019

poll returns immediately when given a terminal file descriptor.

In other words, poll always claims the terminal fd is available for reading but read() doesn't return any data?

@feeley
Copy link
Member

feeley commented Aug 18, 2019

Yes, that's what I understand (haven't actually tested with a trace). This document might be useful to better understand SIGTTOU: http://curiousthing.org/sigttin-sigttou-deep-dive-linux . I now think the two issues are related (i.e. lldb always saying signal SIGTTOU was received).

@feeley
Copy link
Member

feeley commented Aug 18, 2019

I wonder if linux has the same issue with --enable-poll... perhaps a macOS bug in poll (it is not consistent with select)

@lassik
Copy link
Contributor Author

lassik commented Aug 18, 2019

--enable-poll build running in Virtual Box does not have the problem. Linux kernel 4.19.0.

@lassik
Copy link
Contributor Author

lassik commented Aug 18, 2019

FreeBSD 12.0-RELEASE amd64 running in Virtual Box with --enable-poll does not have the problem. This makes it likely that MacOS is the only OS manifesting the bug.

@feeley
Copy link
Member

feeley commented Aug 18, 2019

... or macOS is itself buggy... (I have another outstanding bug with macOS concerning the interval timer when running on multi-core... the interrupt rate is divided by the number of cores being used)

@feeley
Copy link
Member

feeley commented Aug 18, 2019

hmmm... https://daniel.haxx.se/blog/2016/10/11/poll-on-mac-10-12-is-broken/

It says the bug was fixed, but maybe there is still a problem with ttys...

@feeley
Copy link
Member

feeley commented Aug 19, 2019

There seems to be a bug in poll in macOS. It returns POLLNVAL ("The file descriptor is not open.") on the controlling terminal:

% cat macos-poll-bug.c
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

#include <poll.h>

#ifndef INFTIM
#define INFTIM -1
#endif

void wait_until_readable(int fd) {
  struct pollfd fds[1];
  fds[0].fd = fd;
  fds[0].events = POLLIN;
  printf("waiting until readable with 'poll' fd=%d\n", fd);
  int n = poll(fds, 1, INFTIM);
  if (n < 0) {
    printf("poll returned with error errno=%d\n", errno);
  } else {
    printf("poll returned n=%d\n", n);
    if (n == 1) {
      if (fds[0].revents & POLLIN)   printf("POLLIN is set\n");
      if (fds[0].revents & POLLPRI)  printf("POLLPRI is set\n");
      if (fds[0].revents & POLLOUT)  printf("POLLOUT is set\n");
      if (fds[0].revents & POLLERR)  printf("POLLERR is set\n");
      if (fds[0].revents & POLLHUP)  printf("POLLHUP is set\n");
      if (fds[0].revents & POLLNVAL) printf("POLLNVAL is set\n");
    }
  }
}

int main(int argc, char *argv[]) {
  if (argc == 2) {
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
      printf("can't open %s\n", argv[1]);
    } else {
      wait_until_readable(fd);
      close(fd);
    } 
  }
  return 0;
}
% clang -o macos-poll-bug macos-poll-bug.c
% ./macos-poll-bug /etc/hosts
waiting until readable with 'poll' fd=3
poll returned n=1
POLLIN is set
% ./macos-poll-bug /dev/stdin
waiting until readable with 'poll' fd=3
#I TYPED THIS LINE AT THE TERMINAL
poll returned n=1
POLLIN is set
% #I TYPED THIS LINE AT THE TERMINAL
% ./macos-poll-bug /dev/tty
waiting until readable with 'poll' fd=3
poll returned n=1
POLLNVAL is set

@feeley
Copy link
Member

feeley commented Aug 19, 2019

This is particularly strange because /dev/stdin is an alias for the controlling terminal given that the program's stdin was not redirected.

@lassik
Copy link
Contributor Author

lassik commented Aug 19, 2019

It's very plausible that something subtle like this is broken in the Unix layer of MacOS. The curl programmers' experiences are informative. Thanks for taking the time to write that test program. I get the same results as you. Additionally:

$ ./macos-poll-bug /dev/stdin </dev/tty
waiting until readable with 'poll' fd=3
poll returned n=1
POLLNVAL is set

I would start by studying what libuv does as they have a good reputation for this stuff and it's actively maintained. However, they tend to prefer the OS-native polling framework, such as kqueue on BSD (including MacOS), to portable facilities like select and poll so it may be a lot of work to adopt their approach. They appear to have a dedicated TTY API - some example code: https://github.com/libuv/libuv/blob/master/docs/code/tty/main.c. Docs front page: http://docs.libuv.org/en/v1.x/

@lassik
Copy link
Contributor Author

lassik commented Aug 19, 2019

I just compiled Bigloo yesterday and they have simply imported libuv into their source tree. That approach may not make sense for Gambit which aims to be extremely portable (though being portable and lightweight are also among the main goals of libuv). We discussed how to handle some murky Unix signal stuff in Scheme with Göran, and both came to the conclusion that it'd be best to start by studying what libuv does and possibly just adopt it wholesale.

@lassik
Copy link
Contributor Author

lassik commented Aug 19, 2019

It's a pity that the OS native facilities are so buggy, non-standard or hard to use that third-party wrapper libraries are needed but we just have to live with that. Every Scheme is going to hit problems like this eventually.

@lassik
Copy link
Contributor Author

lassik commented Aug 19, 2019

Apparently MacOS poll() is also broken on named pipes and it's the same bug that affects TTYs: https://stackoverflow.com/questions/591826/it-really-looks-like-os-x-has-a-bug-when-using-poll-on-a-named-pipe-fifo

@lassik lassik changed the title Mac gsi installed from Homebrew consumes 100% CPU MacOS poll() is broken - gsi with --enable-poll consumes 100% CPU Aug 19, 2019
@feeley
Copy link
Member

feeley commented Aug 19, 2019

Unfortunately Apple is very slow at responding to bug reports, so I'm not even going to report this and don't have any hopes it will be resolved (it seems to be a known issue for some time). Since select of tty and named pipes works on macOS, that is the "solution" (don't --enable-poll). I've updated the configure script so that on macOS it will give this warning when --enable-poll is used:

****************************************************************************
***                                                                      ***
*** The target OS is macOS and the option "--enable-poll" was specified. ***
*** This combination should be avoided because there is a bug in macOS   ***
*** that causes "poll" on ttys and named pipes to misbehave, causing     ***
*** a busy wait loop consuming 100% CPU when waiting on those devices    ***
*** (including waiting for input at the REPL).  For more information,    ***
*** please see:                                                          ***
***   https://github.com/gambit/gambit/issues/403                        ***
***   https://stackoverflow.com/questions/591826/it-really-looks-like-os-x-has-a-bug-when-using-poll-on-a-named-pipe-fifo ***
***                                                                      ***
****************************************************************************

As for libev, it would be nice to add this as an optional feature for Gambit, i.e. --enable-libev, similarly to --enable-openssl. But Gambit can't use this unconditionally because of its objective of being independent of external libraries.

@feeley feeley closed this as completed in 3a12468 Aug 19, 2019
@lassik
Copy link
Contributor Author

lassik commented Aug 19, 2019

That's a perfectly appropriate solution. Indeed, Apple bugs can go unfixed for years. Thanks for taking good care of Gambit and being very responsive with this issue.

As for libev, it would be nice to add this as an optional feature for Gambit, i.e. --enable-libev, similarly to --enable-openssl. But Gambit can't use this unconditionally because of its objective of being independent of external libraries.

That sounds like the best of both worlds. Not sure how easy it is to integrate libuv (or libev or libevent) into Gambit's existing I/O and scheduling framework.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants