poll() on macOS 10.12 returns prematurely #1057

Closed
ngg opened this Issue Oct 7, 2016 · 25 comments

Projects

None yet
@ngg
Contributor
ngg commented Oct 7, 2016

Tested with both curl 7.50.3 (which I compiled) and with curl 7.49.1 (bundled with OSX).
It only affects 10.12 OSX version (I tried it on 10.11 and 10.7 and both of those work well).

It can be reproduced with this command:
curl --limit-rate 10 http://google.com

Programatically I could reproduce this with the easy interface by setting either CURLOPT_MAX_RECV_SPEED_LARGE or CURLOPT_MAX_SEND_SPEED_LARGE.

@peterbudai

I found an issue in Python's bug tracker, that is possibly related. There may seem a bug or change in the poll() syscall on Mac OS 12 that causes it to return prematurely.

As far as I know, libcurl preferably uses poll() system call internally on platforms where it is supported.
Though not defining HAVE_POLL_FINE for OSX builds would solve the issue I think (haven't tested), this seems rather drastic.

@ngg
Contributor
ngg commented Oct 7, 2016

I can confirm that using select() instead of poll() solves the issue.

@ngg
Contributor
ngg commented Oct 7, 2016

On Linux the manual page of poll says:
The timeout argument specifies the minimum number of milliseconds that poll() will block.

On OSX the manual page says:
If timeout is greater than zero, it specified a maximum interval (in milliseconds) to wait for any file descriptor to become ready.

This means that it's not guaranteed that calling poll without any fds will sleep at all.

Unfortunately the OSX select man page also says:
If timeout is a non-nil pointer, it specifies a maximum interval to wait for the selection to complete.

Which means that select can return immediately also, but that's not the case according to my tests.

@bagder
Member
bagder commented Oct 7, 2016

Thanks for the excellent report and researching the specifics here!

poll() is defined by POSIX and TSUS, Nobody can just arbitrarily change how it works without causing havoc in applications using it. Like in this case. Note that macOS has a history of broken poll implementations in the past as well and that's why we have the HAVE_POLL_FINE define in the first place.

The initial fix for this problem is probably to make the CURL_CHECK_FUNC_POLL configure macro return false for the check so that libcurl will switch to using select() instead.

I've reached out to a friend at Apple about this issue since it also affects Apple's own curl, maybe that'll give something.

@koczkatamas

I compared the poll implementations between 10.11 El Capitan and 10.12 Sierra and it looks like they changed this line https://github.com/opensource-apple/xnu/blob/10.11/bsd/kern/sys_generic.c#L1784 :

From: if (rfds > 0) to if (rfds == nfds).

If I understand correctly the change, poll previously exited immediately if it found error with ANY fd while the new implementation returns only if founds error with ALL fds.

In our case nfds is 0, so rfds == nfds will always be true thus it never calls kqueue_scan and poll never waits for the timeout.

This was probably an unintended side-effect of the change.

@tobypeterson

For anyone inquiring with Apple, the Radar number is 28372390.

@bagder
Member
bagder commented Oct 8, 2016

Did anyone try to make the configure check detect this problem yet? It'd be cool to have the work-around in place when we ship the next release! I don't have any 10.12 mac myself to try this out on right now.

@bagder bagder changed the title from Setting bandwidth limit causes 100% CPU usage on Mac OSX 10.12 to poll() on macOS 10.12 returns prematurely Oct 8, 2016
@bagder
Member
bagder commented Oct 11, 2016

How to detect this? The following code seems to be good enough.

#include <poll.h>
#include <stdio.h>
#include <sys/time.h>

int main(void)
{
  struct timeval before, after;
  int rc;
  size_t us;

  gettimeofday(&before, NULL);
  rc = poll(NULL, 0, 500);
  gettimeofday(&after, NULL);

  us = (after.tv_sec - before.tv_sec) * 1000000 +
    (after.tv_usec - before.tv_usec);

  if(us < 400000) {
    puts("poll() is broken");
    return 1;
  }
  else {
    puts("poll() works");
  }
  return 0;
}
@bagder bagder closed this in 9297ca4 Oct 11, 2016
@bagder
Member
bagder commented Oct 11, 2016

I added a check based on the snippet above in commit 9297ca4. It detects the problem for me on my 10.12 machine and it verifies a working poll() on my Linux machine.

Please try it out and report if you experience any problems with it!

@peterbudai

I tried out the detection code from the comment above and it seems to work.
It prints poll() works on OSX 10.11, but prints poll() is broken on OSX 10.12.

@jay
Member
jay commented Oct 12, 2016

A few questions about this from someone with no familiarity to mac. What happens when curl is built for mac using something other than autotools, if that's possible, and also what if libcurl built for mac xx is used on mac yy where poll doesn't work? I'm thinking about this from a Windows standpoint where we have several build systems and compile-time decisions in a case like this I think would be more correct to do at runtime.

@nickzman
Collaborator

curl would have to be built using either autotools or CMake. poll() was introduced 12 years ago in macOS 10.4, and very, very few people use a macOS version that old or older. Apple's XP is 10.6.

The one advantage one would get from a runtime check is, if it was built under 10.11, then it could use select() in 10.12 and poll() in 10.11 even though it wasn't built for 10.12. But then the test would have to be run every time the tool is started.

@bagder
Member
bagder commented Oct 12, 2016

If we would do the check in run-time, we would also have to actually build both versions of Curl_poll() which would break new ground.

I think one of the problems with a run-time check is that it makes it more important that it is reliable, as a build-time check can always be fixed manually by for example editing the configure generated header file. Also, the current detection logic for this poll() problems requires a waiting period, which, if we would do the test in run-time would artificiality make the curl_global_init call much slower - and since that is often not invoked directly it will delay the first curl transfer for many programs. And for the curl tool, all invokes. Not an ideal situation.

@jay
Member
jay commented Oct 13, 2016

but why run the test at runtime if apple will tell you what versions are affected? can't you do a getversion at runtime and if it's osx 10.12 or whatever then use select?

@pipping
pipping commented Oct 13, 2016

Is unconditionally switching to select on macOS not an option or highly undesirable?

@bagder
Member
bagder commented Oct 13, 2016

can't you do a getversion at runtime and if it's osx 10.12 or whatever then use select?

Ah yes, that's right. I was thinking for the case where we also have a fixed 10.12 version out, but we don't need to that until we know it is needed or wanted.

Is unconditionally switching to select on macOS not an option or highly undesirable?

That's also an option, and a pretty attractive one. Switch to select() for all macOS builds no matter which version as then you can probably run the same library on different mac os versions without problems.

And really, using select() instead of poll() in there is a very minor difference and doesn't really have a measurable impact to anything.

@craig65535
Contributor
craig65535 commented Oct 18, 2016 edited

@bagder I just wanted to voice my support for a runtime check. We have a project that for various reasons is bundled with its own statically linked libcurl. It is distributed as a binary and may run on any version of the OS from 10.8 to 10.12. While the fix in 9297ca4 appears to work for 10.12 when the project is built on macOS 10.12, we build on OSX 10.10 so I had to disable poll entirely (curl_disallow_poll=yes). I would not have known to do that if I had not been following libcurl closely.

@bagder
Member
bagder commented Oct 18, 2016

In my view that's just another voice for switching to select() for macOS unconditionally, as I don't think there's enough value in poll() over select() to justify a run-time check, but there is value for people to build one version of libcurl that runs on multiple macOS versions.

I propose this additional patch:

diff --git a/m4/curl-functions.m4 b/m4/curl-functions.m4
index e1a1e32..61ed1ac 100644
--- a/m4/curl-functions.m4
+++ b/m4/curl-functions.m4
@@ -4737,16 +4737,17 @@ AC_DEFUN([CURL_CHECK_FUNC_POLL], [
   tst_compi_poll="unknown"
   tst_works_poll="unknown"
   tst_allow_poll="unknown"
   #
   case $host_os in
-    darwin[[123456789]].*|darwin10.*|darwin11.*|darwin12.*|interix*)
+    darwin[[123456789]]*.*|interix*)
       dnl poll() does not work on these platforms
       dnl Interix: "does provide poll(), but the implementing developer must
       dnl have been in a bad mood, because poll() only works on the /proc
       dnl filesystem here"
       curl_disallow_poll="yes"
+      tst_compi_poll="no"
       ;;
   esac
   #
   AC_MSG_CHECKING([if poll can be linked])
   AC_LINK_IFELSE([
@ryandesign

@badger Doesn't your proposal neglect to handle "darwin10" and any future Darwin versions that contain "0"?

@bagder
Member
bagder commented Oct 18, 2016

Ehm, yes it would appear so, thanks @ryandesign . It should probably just use:

darwin*|interix*)

and thus match all darwin and interix versions.

@bagder bagder added a commit that referenced this issue Oct 18, 2016
@bagder bagder configure/CURL_CHECK_FUNC_POLL: disable poll completely on mac
... so that the same libcurl build easier can run on any version.

Follow-up to issue #1057
a34c7ce
@bagder
Member
bagder commented Oct 18, 2016

I pushed that version now to let people try it out for real.

@jay
Member
jay commented Oct 28, 2016

so is this still an issue for cmake, for example should it be

if(NOT APPLE)
check_symbol_exists(poll          "${CURL_INCLUDES}" HAVE_POLL)
endif()

also should there be some sort of failsafe in setup like

#if defined(__APPLE__) && (defined(HAVE_POLL) || defined(HAVE_POLL_FINE))
#error "failsafe: we don't allow poll on apple, build config should have detected this"
#endif

this way it doesn't get away from us as more build systems are added or say for example someone adds HAVE_POLL_FINE to cmake

@bagder
Member
bagder commented Oct 28, 2016

The configure script still has numerous tiny tweaks that aren't present in cmake and its hard to come up with a firm stance on how to deal with them. For this particular case, I think we should file a separate issue so that we can tag it cmake and ping the most active cmake contributors rather than to keep this thread going in an already-closed issue.

@jimjag
jimjag commented Jan 11, 2017 edited

On 10.12.2, at least, poll works:

% uname -a
Darwin jimsys.local 16.3.0 Darwin Kernel Version 16.3.0: Thu Nov 17 20:23:58 PST 2016; root:xnu-3789.31.2~1/RELEASE_X86_64 x86_64
% ./a.out
poll() works

@bagder
Member
bagder commented Jan 11, 2017

Thank for testing and confirming!

In curl we will stick with select() on macos for the foreseeable future so that the same binaries have a higher chance of working across versions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment