Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

--retry doesn't retry on connection timeout, if you don't pass --connect-timeout #4461

Closed
njsmith opened this issue Oct 4, 2019 · 5 comments

Comments

@njsmith
Copy link

commented Oct 4, 2019

I did this

240.0.0.1 is a reserved IP address, so connections to it should just time out. When using --retry, I expected curl would report a timeout, and then retry the connection as requested. Instead, it reports a timeout, and doesn't retry the connection:

~$ curl --retry 5 -v http://240.0.0.1                 
* Expire in 0 ms for 6 (transfer 0x557c953145c0)
*   Trying 240.0.0.1...
* TCP_NODELAY set
* Expire in 200 ms for 4 (transfer 0x557c953145c0)
* connect to 240.0.0.1 port 80 failed: Connection timed out
* Failed to connect to 240.0.0.1 port 80: Connection timed out
* Closing connection 0
curl: (7) Failed to connect to 240.0.0.1 port 80: Connection timed out

If I add an explicit --connect-timeout 10, then it works as expected:

~$ curl --connect-timeout 10 --retry 5 -v http://240.0.0.1                 
* Expire in 0 ms for 6 (transfer 0x5577b299d5c0)
* Expire in 10000 ms for 2 (transfer 0x5577b299d5c0)
*   Trying 240.0.0.1...
* TCP_NODELAY set
* Expire in 200 ms for 4 (transfer 0x5577b299d5c0)
* Connection timed out after 10001 milliseconds
* Closing connection 0
Warning: Transient problem: timeout Will retry in 1 seconds. 5 retries left.
* Expire in 0 ms for 6 (transfer 0x5577b299d5c0)
* Expire in 10000 ms for 2 (transfer 0x5577b299d5c0)
* Hostname 240.0.0.1 was found in DNS cache
*   Trying 240.0.0.1...
* TCP_NODELAY set
* Expire in 200 ms for 4 (transfer 0x5577b299d5c0)
^C

I suspect the problem is that curl's retry logic doesn't know that connect resolving with ETIMEDOUT should be treated as a connect that timed out.

I can sort of see the justification for this, even, if I squint. But this really confused me. In the real situation where I ran into this, I was getting transient errors in CI runs, and curl was reporting them as "connection timed out". So I checked the man page, it says --retry can be used to retry when connect times out, I added --retry, and then ... a few days later I got the error again. So then I thought there must be some kind of weird failure mode where the host was timing out all 6 connection attempts... I think ideally --retry should just work in this case, but if not then at least the man page should be updated to warn about this.

curl/libcurl version

curl 7.64.0 (x86_64-pc-linux-gnu) libcurl/7.64.0 OpenSSL/1.1.1b zlib/1.2.11 libidn2/2.0.5 libpsl/0.20.2 (+libidn2/2.0.5) libssh/0.8.6/openssl/zlib nghttp2/1.36.0 librtmp/2.3
Release-Date: 2019-02-06
Protocols: dict file ftp ftps gopher http https imap imaps ldap ldaps pop3 pop3s rtmp rtsp scp sftp smb smbs smtp smtps telnet tftp 
Features: AsynchDNS IDN IPv6 Largefile GSS-API Kerberos SPNEGO NTLM NTLM_WB SSL libz TLS-SRP HTTP2 UnixSockets HTTPS-proxy PSL 

operating system

Ubuntu 19.04

bagder added a commit that referenced this issue Oct 4, 2019
Previosly all connect() failures would return CURLE_COULDNT_CONNECT, no
matter what errno said.

This makes for example --retry work on these transfer failures.

Fixes #4461
@bagder

This comment has been minimized.

Copy link
Member

commented Oct 4, 2019

I think the logic has roughly been that the timeout error code is for when libcurl itself decides to give up due to timing constraints, while when a regular libc call fails it ends up in a plain CURLE_COULDNT_CONNECT error.

PR #4462 changes this when connect() fails with a ETIMEDOUT error and I've attempted to make it act accordingly on Windows.

I'm just a tad bit worried this changes behavior for someone or that by returning a timeout error it might hint for users that they can extend the timeout value and get a better shot at it (which they can't since this is the TCP stack's timeout value which isn't easily accessible).

@njsmith

This comment has been minimized.

Copy link
Author

commented Oct 4, 2019

Maybe the docs for --connect-timeout + its C equivalent should have a note added warning that operating systems impose their own system-specific timeout on connect, and this option can be used to shorten it but not lengthen it? That seems like a useful change in any case, and would also help with your worry here.

@bagder

This comment has been minimized.

Copy link
Member

commented Oct 7, 2019

Yeah, I'm rather convinced that this is the right fix. The little problem I've run into now is that there are three tests failing on the freebsd tests because the port it tries to connect to that it wants a connection refused from (it presumes port 60000 on 127.0.0.1 has nothing listening on it) instead causes an ETIMEDOUT so thus with this PR the return code is different and three tests fail! 😞

@njsmith

This comment has been minimized.

Copy link
Author

commented Oct 7, 2019

@bagder That sounds annoying!

I've noticed before that on macOS, if you try to connect to a port that (a) has a socket bound to it, and (b) that socket has not called listen, then you get a thirty second timeout before the connection fails. I wonder if that could be the same thing you're running into – macOS and freebsd share a lot of networking quirks. And 60000 is in the ephemeral port range so maybe some clients were auto-bound to that port?

In my case I solved it by switching to port 2 as my "nothing should be there" port. I don't remember why 2 specifically. But using one of the low numbered ports that's been assigned to some protocol that hasn't been used for 50 years seems like a reliable way to get a port where nothing will be happening.

python-trio/trio@8d40410

@bagder

This comment has been minimized.

Copy link
Member

commented Oct 8, 2019

That's a good idea, I'll try that. Thanks!

@bagder bagder closed this in 490effc Oct 9, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants
You can’t perform that action at this time.