Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

waiting for name resolver threads before quitting can cause delays even with --max-time #2975

Closed
crvv opened this issue Sep 11, 2018 · 10 comments

Comments

@crvv
Copy link

crvv commented Sep 11, 2018

I did this

  1. Edit /etc/resolve.conf to use a bad DNS server.
  2. time curl https://github.com/ --max-time 1 -v
    And curl exits after 5 seconds. The output is
time curl https://github.com/ --max-time 1 -v
* Resolving timed out after 1000 milliseconds
* Could not resolve host: github.com
* stopped the pause stream!
* Closing connection 0
curl: (28) Resolving timed out after 1000 milliseconds
curl https://github.com/ --max-time 1 -v  0.02s user 0.02s system 0% cpu 5.042 total

I expected the following

curl exits after 1 second.
Document at https://ec.haxx.se/usingcurl-timeouts.html says "When the set time has elapsed, curl will exit no matter what is going on at that moment", so I think this is a bug.

curl/libcurl version

curl 7.61.0 (x86_64-pc-linux-gnu) libcurl/7.61.0 OpenSSL/1.1.0i zlib/1.2.11 libidn2/2.0.5 libpsl/0.20.2 (+libidn2/2.0.4) nghttp2/1.32.0
Release-Date: 2018-07-11
Protocols: dict file ftp ftps gopher http https imap imaps pop3 pop3s rtsp smb smbs smtp smtps telnet tftp
Features: AsynchDNS IDN IPv6 Largefile GSS-API Kerberos SPNEGO NTLM NTLM_WB SSL libz TLS-SRP HTTP2 UnixSockets HTTPS-proxy PSL

operating system

Arch Linux

@jay
Copy link
Member

jay commented Sep 11, 2018

Yes in your case it calls getaddrinfo which is blocking in a separate thread, however it waits in multi_done for the resolve to complete:

curl/lib/multi.c

Lines 536 to 539 in 432eb5f

if(data->mstate == CURLM_STATE_WAITRESOLVE) {
/* still waiting for the resolve to complete */
(void)Curl_resolver_wait_resolv(conn, NULL);
}

I'm not sure why that is necessary because libcurl threaded resolver was designed to handle cleaning up the resolves in the background, which would happen later on when multi_done calls via Curl_resolver_cancel and in this case that would call asyn-thread's destroy_async_data

curl/lib/asyn-thread.c

Lines 351 to 362 in 432eb5f

/*
* if the thread is still blocking in the resolve syscall, detach it and
* let the thread do the cleanup...
*/
Curl_mutex_acquire(td->tsd.mtx);
done = td->tsd.done;
td->tsd.done = 1;
Curl_mutex_release(td->tsd.mtx);
if(!done) {
Curl_thread_destroy(td->thread_hnd);
}

@jay
Copy link
Member

jay commented Sep 11, 2018

git blame puts it at ac9a179

When the application just started the transfer and then stops it while
the name resolve in the background thread hasn't completed, we need to
wait for the resolve to complete and then cleanup data accordingly.

Enabled test 1553 again and added test 1590 to also check when the host
name resolves successfully.

Detected by OSS-fuzz.
Closes #1968

I think we should just let it leak

@bagder
Copy link
Member

bagder commented Sep 11, 2018

This isn't easy.

Leaking is a problem for tools that trigger on this (like OSS-Fuzz did) and for applications who'd do this several times. I don't think we can allow that to happen unconditionally. But also not responding sooner to this sort of failure is also disturbing...

@bagder bagder added the name lookup DNS and related tech label Sep 11, 2018
@jay
Copy link
Member

jay commented Sep 11, 2018

If the thread can be abandoned with ownership of all its resources which it later cleans up itself (I'm assuming here) then I don't see what benefit there is to waiting for it other than to please OSS-Fuzz. I think the fuzzer is wrong in this case.

@alexeip0
Copy link

alexeip0 commented Sep 14, 2018

Perhaps we should track the number of detached threads and have Curl_resolver_global_cleanup wait for the count to drop to 0 before returning?

@alexeip0
Copy link

alexeip0 commented Sep 14, 2018

And smarter fix would track detached threads on per-multi-handle basis and do similar wait in curl_multi_cleanup

This would have to be rather intricate, since curl_multi_cleanup is called from multi thread as well, thus, cannot do blocking wait.

@bagder
Copy link
Member

bagder commented Sep 16, 2018

Any wait for the pending thread(s) will risk blocking a call that will effect some users. We can only really fix this issue by removing the wait completely. If the threads don't leak any memory by this, I think it can be done. We just have to keep logic to make sure the OSS-Fuzz and possibly other tests don't consider that a leak.

@bagder bagder changed the title option --max-time is ineffective when DNS timeout waiting for name resolver threads before quitting can cause delays even with --max-time Oct 18, 2018
@Alexander--
Copy link

Alexander-- commented Nov 24, 2018

If the threads don't leak any memory by this, I think it can be done.

Threads leak their stacks.

This issue can't be fully fixed in libcurl, but users of libcurl (e.g. the command line curl tool) should be able to clean up by invoking exit() instead of curl_multi_cleanup(). GNU cp command does exactly this, when it isn't built for Valgrind.

@stale
Copy link

stale bot commented Oct 10, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Oct 10, 2019
@stale stale bot closed this as completed Oct 24, 2019
@nicolashohm
Copy link

nicolashohm commented Nov 11, 2019

I also encountered this issue.
I'm currently building a PHP application, accessing an external API while the user requests the page. In order to not slow down the page speed in case of any problems with the API, I want to skip the api call after a certain timeout. For that I set CURLOPT_TIMEOUT_MS to 150. Unfortunately this doesn't worked reliably. Since we faced issues with the external API or better said, with the dns resolving, I often got this error: Resolving timed out after 252 milliseconds. I've no idea where the 252 ms comes from, but not that 252ms is already too much, I saw the requests in the end takes even longer. For example when the average request took 500ms, some requests took 1200ms or even 1500ms which should never happen if the timeout would be adhered.

My guess: even that were seems to be a timeout of 252ms for the dns resolving, this timeout will be ignored. This could be valid since I figured out with dig that the dns resolving some times really took 500ms or longer.

Since I put the domain in the /etc/hosts the page speed is much more stable.

I hope this story was helpful for someone!

curl 7.38.0 (x86_64-pc-linux-gnu) libcurl/7.38.0

The php script was similar to this:

<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, '...');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FAILONERROR, true)
curl_setopt($ch, CURLOPT_TIMEOUT_MS, 150);
curl_setopt($ch, CURLOPT_PROXY, null);
curl_exec($ch);
var_dump(curl_error($ch));

@lock lock bot locked as resolved and limited conversation to collaborators Feb 9, 2020
ferrieux added a commit to ferrieux/curl that referenced this issue Jul 12, 2022
ferrieux added a commit to ferrieux/curl that referenced this issue Sep 29, 2022
ferrieux added a commit to ferrieux/curl that referenced this issue Nov 6, 2022
ferrieux added a commit to ferrieux/curl that referenced this issue Nov 17, 2022
bagder pushed a commit that referenced this issue Nov 17, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

6 participants