New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

7.64 regression: curl_multi_fdset() returns a write fd instead of read fd #3446

Closed
jeroen opened this Issue Jan 8, 2019 · 23 comments

Comments

Projects
None yet
4 participants
@jeroen
Copy link
Contributor

jeroen commented Jan 8, 2019

This is a complicated issue so bear with me.

update: see here for the test program: #3446 (comment)

Summary

I have a unit test that CURLMOPT_MAX_TOTAL_CONNECTIONS and CURLMOPT_MAX_HOST_CONNECTIONS work as expected, by making a bunch of request and inspecting curl_multi_fdset() throughout.

What the test does

  1. Create a multi handle and set CURLMOPT_MAX_TOTAL_CONNECTIONS = 4 and CURLMOPT_PIPELINING = CURLPIPE_NOTHING. (we disable multiplexing for this test).
  2. Add a bunch of requests. My unit test uses "https://eu.httpbin.org/delay/" + x where x equals [4,3,2,1,0,1,2,3,4] (so 9 request in total).
  3. Repeatedly call curl_multi_perform(). Every time a request has finished (CURLMSG_DONE) do:
    1. Call curl_multi_fdset() and read read_fd_set to get the open connections.
    2. The number of fd's in read_fd_set must be equal to MIN(CURLMOPT_MAX_TOTAL_CONNECTIONS, running_handles) where running_handles is the number of pending connections returned by curl_multi_perform().

So basically we test that at any given point, the number of open connections (read fd's) is never more than CURLMOPT_MAX_TOTAL_CONNECTIONS or the number of outstanding requests.

The test passed for many versions up and till version 7.63 (so even after http/2 is enabled by default) but fails in 7.64.0-DEV (master).

Previous behavior

At each of the 9 checks, the fd count in write_fd_set is 0 and the fd count in read_fd_set is equal to min(4, running_handles).

In master

At the first check we expect the fd count in read_fd_set to be 4. However instead, it is only 3, but now we also have 1 fd in write_fd_set ?!

@jay

This comment has been minimized.

Copy link
Member

jay commented Jan 8, 2019

Would you mind bisecting this issue and running against the offending commit a few times to see how reproducible it is. Thanks
https://github.com/curl/curl/wiki/how-to-git-bisect

@jeroen

This comment has been minimized.

Copy link
Contributor

jeroen commented Jan 8, 2019

OK this is what git bisect says:

84a30d0a419ad95c53cbdfc76eb2eb75d2e51835 is the first bad commit
commit 84a30d0a419ad95c53cbdfc76eb2eb75d2e51835
Author: Brad Spencer <bspencer@blackberry.com>
Date:   Fri Dec 14 17:18:22 2018 -0400

    curl_multi_remove_handle() don't block terminating c-ares requests

    Added Curl_resolver_kill() for all three resolver modes, which only
    blocks when necessary, along with test 1592 to confirm
    curl_multi_remove_handle() doesn't block unless it must.

    Closes #3428
    Fixes #3371
@jay

This comment has been minimized.

Copy link
Member

jay commented Jan 8, 2019

Ref: 84a30d0

What resolver?

  curl_version_info_data *ver = curl_version_info(CURLVERSION_NOW);
  
  if(!(ver->features & CURL_VERSION_ASYNCHDNS))
    printf("synchronous\n");
  else if(!ver->age || ver->ares_num)
    printf("ares\n");
  else
    printf("threaded\n");

/cc @b-spencer

@jeroen

This comment has been minimized.

Copy link
Contributor

jeroen commented Jan 8, 2019

The native configuration on MacOS (without c-ares).

@bagder

This comment has been minimized.

Copy link
Member

bagder commented Jan 8, 2019

@jeroen this doesn't use HTTP/2 (multiplexing), does it?

@jeroen

This comment has been minimized.

Copy link
Contributor

jeroen commented Jan 8, 2019

No, as described above the test disables this with CURLMOPT_PIPELINING = CURLPIPE_NOTHING and also the server I am testing with does not support http/2.

@b-spencer

This comment has been minimized.

Copy link
Contributor

b-spencer commented Jan 8, 2019

If I understand correctly, this test seems sensitive to timing.

If this test starts multiple requests and then examines the state after each finishes, won't there sometimes be a file descriptor in the write_fd_set when one of the requests' TCP connections is still trying to connect and waiting for the asynchronous "writability" notification? See for example the POLLOUT cases below below, where curl is waiting for the connection to complete.

$ strace -e trace=connect,poll curl -sq https://eu.httpbin.org/delay/2 2>&1 |egrep 'connect\(3|fd=3' |head -5
connect(3, {sa_family=AF_INET, sin_port=htons(443), sin_addr=inet_addr("34.240.217.226")}, 16) = -1 EINPROGRESS (Operation now in progress)
poll([{fd=3, events=POLLOUT|POLLWRNORM}], 1, 0) = 0 (Timeout)
poll([{fd=3, events=POLLOUT}], 1, 198)  = 1 ([{fd=3, revents=POLLOUT}])
poll([{fd=3, events=POLLOUT|POLLWRNORM}], 1, 0) = 1 ([{fd=3, revents=POLLOUT|POLLWRNORM}])
poll([{fd=3, events=POLLIN}], 1, 54)    = 0 (Timeout)

(I used a version of cURL built with the default threaded resolver for that; your fd numbers may vary.)

Is it possible to attach a minimal version of the test case here that shows the failure?

@jeroen

This comment has been minimized.

Copy link
Contributor

jeroen commented Jan 8, 2019

I'm not sure if it is expected. The test was stable for the year. At any given point the number of fd's in read_fd_set would match CURLMOPT_MAX_TOTAL_CONNECTIONS.

I can try to port my test to C but it will take some time.

@b-spencer

This comment has been minimized.

Copy link
Contributor

b-spencer commented Jan 8, 2019

I agree that it's strange that it would suddenly change. From my understanding of the test's behaviour, I am not sure if it was ever safe to assume that requests' file descriptors would only ever be readable at the points the test measures.

If it's a lot of work to port the test or otherwise minimize it, maybe you can share more detail about its behaviour?

I think the test does this:

  1. It starts 9 requests to that website. It has cURL configured to throttle the number of simultaneous connections to a single server and you have pipelinine disabled. (But cURL can of course still send multiple requests at once; up to 4.)
  2. The test drives it with the multi interface and every time a request reports as finished (and thus, presumably does not participate in the fd sets itself anymore), it checks what other requests exist by examining the fds in the sets.

When request 1 finishes, I suppose that the expectation is that request 5 starts, reusing the connection from request 1? And therefore request 5 can't be waiting for TCP connect and thus won't be in write_fd_set? But what about request 4? Is there any guarantee that it has already connected on the fourth socket by the time request 1 finishes? And if it hasn't, couldn't its fd then be in write_fd_set?

Perhaps you can strace or otherwise trace the fd use in more detail so you can see what fds are being opened, closed, and connected when?

@b-spencer

This comment has been minimized.

Copy link
Contributor

b-spencer commented Jan 8, 2019

I modified https://curl.haxx.se/libcurl/c/multi-app.html to print out what curl_multi_fdset() returns (and made it do two HTTPS requests simultaneously) and it shows file descriptors in the write_fd_set:

$ ./multi-app
read_fd_set=0 write_fd_set=0
read_fd_set=0 write_fd_set=0
read_fd_set=0 write_fd_set=0
read_fd_set=1 write_fd_set=0
read_fd_set=1 write_fd_set=1
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
...
read_fd_set=0 write_fd_set=0
read_fd_set=0 write_fd_set=0
read_fd_set=0 write_fd_set=0
read_fd_set=0 write_fd_set=0
read_fd_set=0 write_fd_set=1
read_fd_set=0 write_fd_set=0
read_fd_set=0 write_fd_set=0
read_fd_set=0 write_fd_set=0
read_fd_set=0 write_fd_set=0
read_fd_set=1 write_fd_set=0
...

If I modify the example further to call curl_multi_info_read() inside the loop (to look for CURLMSG_DONE after each curl_multi_perform()) then I can create cases where the first request "finishes" even via the cURL API (with CURLMSG_DONE) before the second one even starts connecting. It seems only a matter of timing for the second to be in the process of connecting while the first one completes.

$ ./multi-app
read_fd_set=0 write_fd_set=0
read_fd_set=0 write_fd_set=0
read_fd_set=0 write_fd_set=0
read_fd_set=1 write_fd_set=0
read_fd_set=1 write_fd_set=1
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
...
HTTP transfer completed with status 0
read_fd_set=0 write_fd_set=0
read_fd_set=0 write_fd_set=0
read_fd_set=0 write_fd_set=0
read_fd_set=0 write_fd_set=0
read_fd_set=0 write_fd_set=1
read_fd_set=0 write_fd_set=0
read_fd_set=0 write_fd_set=0
read_fd_set=0 write_fd_set=0
read_fd_set=0 write_fd_set=0
read_fd_set=1 write_fd_set=0
...
@jeroen

This comment has been minimized.

Copy link
Contributor

jeroen commented Jan 8, 2019

OK I created an example program C to illustrate the issue. You can find it here. If you build this with 7.63 or older, you will see that the length of read_fd_set always matches what we expect:

screen shot 2019-01-08 at 5 46 24 pm

However, in the master branch this test is now failing. If you use MacOS, an easy way to reproduce is using homebrew. To install the master branch do:

brew install curl --HEAD
export PKG_CONFIG_PATH=/usr/local/opt/curl/lib/pkgconfig

I am now seeing this:

screen shot 2019-01-08 at 5 47 53 pm

So I am not sure if this is expected but it is definitely a change in behavior.

@bagder

This comment has been minimized.

Copy link
Member

bagder commented Jan 8, 2019

I will agree with @b-spencer that this test sounds fragile and seems to assume a socket behavior that we don't guarantee in the API. libcurl will of course ask for write action when a connection is about to connect it when it wants to send data. I don't think that in itself can be seen as an error.

I realize this is changed behavior for you, but we do reserve ourselves the right to subtle change the behavior as long as we remain API and ABI compatible and follow the documented behavior.

@jeroen

This comment has been minimized.

Copy link
Contributor

jeroen commented Jan 8, 2019

@bagder that is fine, I just want to make sure that this is expected, and there is no bigger underlying problem. If this new behavior is expected than I can drop the unit test and you can close the issue.

However if you think about it, if we have 4 outstanding requests waiting on a response from the server, wouldn't it be expected that you have 4 read fd's? This always used to be the case at least. We are way past resolving domain names at this point, so I think this is an unexpected change.

@bagder

This comment has been minimized.

Copy link
Member

bagder commented Jan 8, 2019

if we have 4 outstanding requests waiting on a response from the server, wouldn't it be expected that you have 4 read fd's?

If they are for 4 different connections and they are truly and surely waiting for a response after having sent a full request that can be true (and most certainly should be in most cases) but since you use HTTPS it may also want to write even the HTTP state would otherwise suggest it should wait for reading.

I rather not work on symptoms or guesses on this level. If libcurl indeed would report the wrong action to wait for back, then surely the connection will be kept hanging there since then it won't get signaled properly! It would at least get some timeouts or otherwise end up weird. (Like for example in the #3436 case.) Then we could be certain that something is wrong in there. Now, it feels like a little too much of a guess to base conclusions on the fact that a socket descriptor wants to get written to at a particular moment in time.

I'm not saying with certainty that this isn't a bug, but I do say that this isn't an obvious enough sign that there is one!

@jeroen

This comment has been minimized.

Copy link
Contributor

jeroen commented Jan 8, 2019

OK well if you studied the example code and feel confident this change of behavior is harmless then we can leave it at that. It still puzzles me why this change suddenly arises but perhaps it's a side effect of changes in the resolver.

My follow up question: my goal is to test that CURLMOPT_MAX_TOTAL_CONNECTIONS is working. If my unit test is modified such that it adds up the number of fd's in both read_fd_set and write_fd_set, that should surely be equal to CURLMOPT_MAX_TOTAL_CONNECTIONS (given that there are at least that many pending requests)? Or is this also a guess?

@b-spencer

This comment has been minimized.

Copy link
Contributor

b-spencer commented Jan 8, 2019

FWIW, when using c-ares as the resolver, I think it would be a guess, because the resolver itself allocates file descriptors, and there's no promise that the lifetime of those sockets don't overlap with the connections.

If I didn't trust cURL to implement its connection limits properly, I'm not sure I'd also trust it to report on its connection count properly via its API. 😄 Perhaps it would be worth adding a unit test to cURL itself that measures the connection count from the server side if you really want to be sure?

It looks like cURL's test1506 might be a starting point for doing that.

@jeroen

This comment has been minimized.

Copy link
Contributor

jeroen commented Jan 8, 2019

I think it would be a guess, because the resolver itself allocates file descriptors, and there's no promise that the lifetime of those sockets don't overlap with the connections.

But are those fd's also returned by curl_multi_fdset()? I thought curl_multi_fdset() specifically gives the fd's to poll the connections.

@b-spencer

This comment has been minimized.

Copy link
Contributor

b-spencer commented Jan 8, 2019

But are those fd's also returned by curl_multi_fdset()?

Yes. Otherwise, your application wouldn't know to monitor them for events or report those events to cURL, which in turn reports them to the resolver.

Here's some selected strace output from multi-app.c again showing just the DNS phase of the connection and the file descriptor sets returned by curl_multi_fdset():

socket(AF_INET, SOCK_DGRAM, IPPROTO_IP) = 3
connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("...")}, 16) = 0
sendto(3, ..., 29, MSG_NOSIGNAL, NULL, 0) = 29
read_fd_set=1 write_fd_set=0 maxfd=3
recvfrom(3, ..., 4097, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("...")}, [16]) = 45
recvfrom(3, ..., 4097, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("...")}, [16]) = 57
close(3)                                = 0

When read_fd_set=1 appears, the c-ares DNS resolver's UDP socket (fd 3) is in the set of file descriptors that the application must monitor for reading.

The story is similar for curl_multi_socket_action().

(In previous comments, I was accidentally printing the fd set counts too many times as I counted them instead of after the counting was complete. That's why there were so many of them, and so many zeros.)

@bagder

This comment has been minimized.

Copy link
Member

bagder commented Jan 9, 2019

The resolver backend can add one - or more - sockets to the mix, and then there's also happy eyeballs that might make it use two sockets for the connection phase. It's simply not reliable to assume a number of connections based on what file descriptors libcurl does or doesn't return.

@jeroen

This comment has been minimized.

Copy link
Contributor

jeroen commented Jan 9, 2019

OK so I'll drop the test. It's a bit sad, it was a beautiful test which checked both curl_multi_fdset() and CURLMOPT_MAX_TOTAL_CONNECTIONS and it was stable for a very long time. But I guess it's based on false assumptions that just happened to be never violated until now.

@jeroen

This comment has been minimized.

Copy link
Contributor

jeroen commented Jan 10, 2019

I'm fine with closing this issue if everyone agrees this change in behavior is expected and harmless.

@bagder

This comment has been minimized.

Copy link
Member

bagder commented Jan 10, 2019

It is harmless and doesn't identify any bugs as far as I can tell. I think we can close.

@jeroen

This comment has been minimized.

Copy link
Contributor

jeroen commented Jan 10, 2019

OK so let's close it for now. If other users start reporting unexpected behavior in curl_multi_fdset() we can revisit.

Thank you for your time investigating this issue!

@jeroen jeroen closed this Jan 10, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment