Alter timeout ordering with connections #7178
- Check whether a connection has succeeded before checking whether it has timed out. This means that if we connected quickly but were subsequently descheduled, we allow the connection to succeed. Note that if we time out, but the connection succeeds between checking the timeout and connecting to the server, we will allow it to go ahead. This is viewed as an acceptable trade-off.
- Add additional failf logging around failed connection attempts to propagate the cause up to the caller.
build failures. example
|
I would like to learn more specifics of the exact benefits you see from using this. You're adjusting the connect timeout check, but how big a difference does this actually make, and what timeouts do you use? Couldn't you also just increase your connection timeout by a few milliseconds to avoid the problems? This code somewhat obfuscates the timeout handling, so I think it had better be worth it, and then I need you to explain to us how it is! |
Thanks for your patience - taking over from Tom on this. This issue has a particular impact on heavily loaded systems. We were seeing that the kernel had successfully set up the connection within the timeout (in ~100 microseconds), but the process that asked for it (via curl) was de-scheduled for longer than the connection timeout because the system was heavily loaded. We set a relatively aggressive timeout (50ms) so we can notice quickly that the remote is dead and switch to a different one. However, it's common to see a process being de-scheduled for >100ms. When the process is scheduled again, curl sees that this 100ms > 50ms and marks the connection as timed out, even though the connection was set up within the timeout. The alternative of setting a longer timeout to account for this means it takes an unacceptably long time to spot that a peer is down. This PR changes the order of checking for connection success: instead of checking for timeout before success, check for success before timeout. In terms of stability/testing, we've been using this change on our fork with high-load customers for years. I'm starting to look into fixing the CI complaints now :) |
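The ordering change described above can be sketched in isolation. This is a minimal illustration under stated assumptions, not curl's code: `struct attempt`, `enum outcome`, `check_old` and `check_new` are hypothetical names invented for this sketch, and the snapshot models what the process sees once it is scheduled again.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcome codes for this sketch (not curl's CURLcode). */
enum outcome { OUTCOME_PENDING, OUTCOME_CONNECTED, OUTCOME_TIMED_OUT };

/* Hypothetical snapshot of one connection attempt, taken at the moment
 * the process runs again after being descheduled. */
struct attempt {
  long elapsed_us;       /* wall-clock time since the attempt started */
  long timeout_us;       /* configured connect timeout */
  bool kernel_connected; /* has the kernel completed the connection? */
};

/* Old ordering: the timeout is checked before success. */
enum outcome check_old(const struct attempt *a)
{
  assert(a->timeout_us > 0);
  if(a->elapsed_us >= a->timeout_us)
    return OUTCOME_TIMED_OUT;
  return a->kernel_connected ? OUTCOME_CONNECTED : OUTCOME_PENDING;
}

/* New ordering: success is checked before the timeout. */
enum outcome check_new(const struct attempt *a)
{
  assert(a->timeout_us > 0);
  if(a->kernel_connected)
    return OUTCOME_CONNECTED;
  if(a->elapsed_us >= a->timeout_us)
    return OUTCOME_TIMED_OUT;
  return OUTCOME_PENDING;
}
```

With a 50 ms timeout and a process that wakes after 120 ms to find the socket already connected, `check_old` reports a timeout while `check_new` reports success. A peer that never answered still times out under both orderings.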
Looks like 2 tests still fail (2100, 575) - I can't work out how these could relate to these changes :( Are they known to be flaky tests (they didn't fail on all platforms), or could you give me some pointers? Thanks! |
A few of the Windows builds are unfortunately flaky so these fails are not because of your changes. |
lib/multi.c
Outdated
@@ -1700,47 +1751,13 @@ static CURLMcode multi_runsingle(struct Curl_multi *multi,
if(data->conn &&
(data->mstate >= MSTATE_CONNECT) &&
(data->mstate < MSTATE_COMPLETED)) {
/* We defer handling the connection timeout to later, to see if the
* connection has actually succeeded.
* See https://github.com/Metaswitch/curl/pull/2 for original changes */
Please don't refer to an external URL (that will vanish at some random point in time). If there's an additional explanation available that helps here, write it in the comment too.
I've updated this comment to be more useful
multi_ischanged(multi, false)) {
/* We now handle stream timeouts if and only if this will be the last
* loop iteration */
multi_handle_timeout(data, nowp, &stream_error, &result, TRUE);
The TRUE in there will make the function check the timeout for the connection phase, and yet you call this function also in the MSTATE_DO state, which has then already connected. Surely that makes this too trigger-happy?
This is preserving the existing behaviour. The previous check at line 1706 included checking the connect timeout in the MSTATE_DO.
Ah yes, I see. That's certainly a bug that in itself could have contributed to the problems you've had! The DO state is past the connect, so it is just wrong to check the connect timeout in that state.
Thanks - I've changed the check to data->mstate < MSTATE_DO to exclude the DO state
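The effect of moving the upper bound from "up to and including DO" to "strictly before DO" can be sketched with a simplified state list. The enum below is a hypothetical stand-in whose only relevant property is its ordering (connect-phase states come before DO); `enum xfer_state` and `connect_timeout_applies` are names assumed for this illustration and are not curl's.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified, hypothetical transfer states. Only the ordering matters:
 * every connect-phase state sorts before ST_DO. */
enum xfer_state {
  ST_INIT,
  ST_CONNECT,
  ST_CONNECTING,
  ST_PROTOCONNECT,
  ST_DO,
  ST_DOING,
  ST_COMPLETED
};

/* The connect timeout is only meaningful while the transfer is still
 * connecting, so the range is [ST_CONNECT, ST_DO). */
bool connect_timeout_applies(enum xfer_state s)
{
  assert(s <= ST_COMPLETED);
  return s >= ST_CONNECT && s < ST_DO;
}
```

Under this sketch a transfer in `ST_DO` has already connected, so it is no longer subject to the connect timeout, while every state from `ST_CONNECT` up to `ST_PROTOCONNECT` still is.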
lib/multi.c
Outdated
data->mstate >= MSTATE_CONNECT &&
data->mstate <= MSTATE_DO &&
rc != CURLM_CALL_MULTI_PERFORM &&
multi_ischanged(multi, false)) {
Why is the multi_ischanged a requirement for this check? In most "normal" cases that's never true, and then you don't check the timeout here at all...
Good question! I agree with your assessment here, and I've changed this to a !multi_ischanged(multi, false) because I think it was supposed to match the opposite of the conditions to exit the while loop, based on the comment right below.
I've also expanded the comment to explain the last two checks in this check.
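The "last loop iteration" condition discussed here can be sketched with a scripted loop. This is an assumption-laden illustration rather than the code in lib/multi.c: `struct step` and `run_loop` are hypothetical, `call_again` stands in for rc == CURLM_CALL_MULTI_PERFORM, and `changed` stands in for the result of multi_ischanged(multi, false).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical record of what one pass through the state machine reports. */
struct step {
  bool call_again; /* stands in for rc == CURLM_CALL_MULTI_PERFORM */
  bool changed;    /* stands in for multi_ischanged(multi, false) */
};

/* Replays the scripted steps and returns how many times the deferred
 * timeout check ran. The check runs only when neither reason to loop
 * again is present, i.e. on the final iteration. */
int run_loop(const struct step *steps, int nsteps, int *iterations)
{
  int timeout_checks = 0;
  int i = 0;
  bool again;

  assert(nsteps > 0);
  /* The script must end with a step that asks for no further iteration. */
  assert(!steps[nsteps - 1].call_again && !steps[nsteps - 1].changed);

  do {
    const struct step *s = &steps[i++];
    again = s->call_again || s->changed;
    if(!s->call_again && !s->changed)
      timeout_checks++; /* deferred timeout handling would go here */
  } while(again && i < nsteps);

  *iterations = i;
  return timeout_checks;
}
```

For a script of three steps where only the last one reports neither flag, the check runs exactly once, on the third pass. With the earlier form of the condition (requiring `changed` to be true), the same script would run the check on the second pass and skip it on the final one.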
|
I think the remaining failures are also flaky windows tests - is there anything else I need to do to get this ready to merge? |
Let's just have the rest of the CI run, then I'll prepare for merge |
Thanks! |
@richardwhiuk made this fix in the @Metaswitch fork of curl at https://github.com/Metaswitch/curl a few years ago - see Metaswitch#2. Since then it's had a great deal of soak testing in our production code. As part of getting ourselves off the fork, we'd like to upstream the fix.
This change:
- Checks whether a connection has succeeded before checking whether it has timed out. This means that if we connected quickly but were subsequently descheduled, we allow the connection to succeed. Note that if we time out, but the connection succeeds between checking the timeout and connecting to the server, we will allow it to go ahead. This is viewed as an acceptable trade-off.
- Adds additional failf logging around failed connection attempts to propagate the cause up to the caller.
More info in the linked PR.