Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alter timeout ordering with connections #7178

Closed

Conversation

@tomwatson1024
Copy link

@tomwatson1024 tomwatson1024 commented Jun 2, 2021

@richardwhiuk made this fix in the @Metaswitch fork of curl at https://github.com/Metaswitch/curl a few years ago - see Metaswitch#2. Since then it's had a great deal of soak testing in our production code. As part of getting ourselves off the fork, we'd like to upstream the fix.

This change:

  • Checks whether a connection has succeded before checking whether it's timed
    out.

    This means if we've connected quickly, but subsequently been descheduled, we
    allow the connection to succeed. Note, if we timeout, but between checking the
    timeout, and connecting to the server the connection succeeds, we will allow
    it to go ahead. This is viewed as an acceptable trade off.

  • Add additional failf logging around failed connection attempts to propogate
    the cause up to the caller.

More info in the linked PR.

- Check whether a connection has succeded before checking whether it's timed
  out.

  This means if we've connected quickly, but subsequently been descheduled, we
  allow the connection to succeed. Note, if we timeout, but between checking the
  timeout, and connecting to the server the connection succeeds, we will allow
  it to go ahead. This is viewed as an acceptable trade off.

- Add additional failf logging around failed connection attempts to propogate
  the cause up to the caller.
@tomwatson1024 tomwatson1024 force-pushed the tomwatson1024:tomw/timeout-ordering branch from 40aabab to 7449865 Jun 2, 2021
@bagder
Copy link
Member

@bagder bagder commented Jun 2, 2021

build failures. example

connect.c: In function 'Curl_is_connected':
connect.c:989:17: error: 'allow' may be used uninitialized in this function [-Werror=maybe-uninitialized]
  988 |         conn->timeoutms_per_addr[i] = conn->tempaddr[i]->ai_next == NULL ?
      |                                       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  989 |           allow : allow / 2;
      |           ~~~~~~^~~~~~~~~~~
@bagder
Copy link
Member

@bagder bagder commented Jun 2, 2021

I would like to learn more specifics of the exact benefits you see from using this. You're adjusting the connect timeout check, but how big difference does this actually make and what timeouts do you use? Couldn't you also just increase your connection timeout a few milliseconds to avoid the problems?

This code somewhat obfuscates the timeout handling so I think it better be worth it and then I need you to explain to us how it is!

@MartinHowarth
Copy link

@MartinHowarth MartinHowarth commented Jun 18, 2021

Thanks for your patience - taking over from Tom on this.

This issue has a particular impact on heavily loaded systems. We were seeing that the kernel had successfully set up the connection within the timeout (in ~100 microseconds), but the process that asked for it (via curl) was de-scheduled for longer than the connection timeout because the system was heavily loaded. We set a relatively aggressive timeout (50ms) so we can notice that the remote is dead quickly and switch to a different one. However, it's common to see a process being de-scheduled for >100ms. When the process is scheduled again, curl sees that this 100ms > 50ms and marks the connection as timed out, even though the connection was setup within the timeout.

The alternative of setting the timeout to be longer to account for this means we have an unacceptably long time to spot that a peer is down.

This PR changes the order of checking for connection success: instead of checking for timeout before success, check for success before timeout.

In terms of stability/testing, we've been using this change on our fork with high-load customers for years.

I'm starting to look into fixing the CI complaints now :)

@MartinHowarth
Copy link

@MartinHowarth MartinHowarth commented Jun 21, 2021

Looks like 2 tests still fail (2100, 575) - I can't work out how these could relate to these changes :(

Are they known to be flaky tests (they didn't fail on all platforms), or could you give me some pointers? Thanks!

@bagder
Copy link
Member

@bagder bagder commented Jun 21, 2021

Looks like 2 tests still fail (2100, 575) - I can't work out how these could relate to these changes

A few of the Windows builds are unfortunately flaky so these fails are not because of your changes.

lib/multi.c Outdated
@@ -1700,47 +1751,13 @@ static CURLMcode multi_runsingle(struct Curl_multi *multi,
if(data->conn &&
(data->mstate >= MSTATE_CONNECT) &&
(data->mstate < MSTATE_COMPLETED)) {
/* We defer handling the connection timeout to later, to see if the
* connection has actually succeeded.
* See https://github.com/Metaswitch/curl/pull/2 for original changes */

This comment has been minimized.

@bagder

bagder Jun 21, 2021
Member

Please don't refer to an external URL (that will vanish at some random point in time). If there's an additional explanation available that helps here, write it in the comment too.

This comment has been minimized.

@MartinHowarth

MartinHowarth Jun 21, 2021

I've updated this comment to be more useful

multi_ischanged(multi, false)) {
/* We now handle stream timeouts if and only if this will be the last
* loop iteration */
multi_handle_timeout(data, nowp, &stream_error, &result, TRUE);

This comment has been minimized.

@bagder

bagder Jun 21, 2021
Member

The TRUE in there will make the function check the timeout for the connection phase, and yet you call this function also in the MSTATE_DO state which then already has connected. Surely that makes this too trigger-happy?

This comment has been minimized.

@MartinHowarth

MartinHowarth Jun 21, 2021

This is preserving the existing behaviour. The previous check at line 1706 included checking the connect timeout in the MSTATE_DO.

This comment has been minimized.

@bagder

bagder Jun 21, 2021
Member

Ah yes I see. That's certainly a bug that surely in itself could've contributed to you having the problems you've had! The DO state is passed the connect so it is just wrong to check the connect timeout in that state.

This comment has been minimized.

@MartinHowarth

MartinHowarth Jun 21, 2021

Thanks - I've changed the check to data->mstate < MSTATE_DO to exclude the DO state

lib/multi.c Outdated
data->mstate >= MSTATE_CONNECT &&
data->mstate <= MSTATE_DO &&
rc != CURLM_CALL_MULTI_PERFORM &&
multi_ischanged(multi, false)) {

This comment has been minimized.

@bagder

bagder Jun 21, 2021
Member

Why is the multi_ischanged a requirement for this check? For most "normal" case that's never true and then you don't check the timeout here at all...

This comment has been minimized.

@MartinHowarth

MartinHowarth Jun 21, 2021

Good question! I agree with your assessment here, and I've changed this to a !multi_ischanged(multi, false) because I think it was supposed to match the opposite of the conditions to exit the while loop based on the comment right below.

I've also expanded the comment to explain the last two checks in this check.

@bagder
Copy link
Member

@bagder bagder commented Jun 21, 2021

./multi.c:1754:73: warning: Trailing whitespace (TRAILINGSPACE)
       /* Check for overall operation timeout here but defer handling the 
                                                                         ^
./multi.c:1756:71: warning: Trailing whitespace (TRAILINGSPACE)
        * in the window since we last checked timeout. This prevents us 
                                                                       ^
checksrc: 0 errors and 2 warnings
@MartinHowarth
Copy link

@MartinHowarth MartinHowarth commented Jun 23, 2021

I think the remaining failures are also flaky windows tests - is there anything else I need to do to get this ready to merge?

@bagder
Copy link
Member

@bagder bagder commented Jun 24, 2021

Let's just have the rest of the CI run, then I'll prepare for merge

@bagder bagder closed this in 0842175 Jun 24, 2021
@bagder
Copy link
Member

@bagder bagder commented Jun 24, 2021

Thanks!

@bagder bagder removed the needs-info label Jun 24, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

6 participants