Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

T22555 Retry downloads on flaky networks #120

Merged
merged 14 commits into from
May 30, 2018
Merged

Conversation

pwithnall
Copy link
Contributor

This is a straight backport of upstream’s:

This introduces no functional changes, but will make some upcoming
refactoring a little easier.

Signed-off-by: Philip Withnall <withnall@endlessm.com>

Closes: #1599
Approved by: jlebon
This introduces no functional changes, but does make the code a little
cleaner.

Signed-off-by: Philip Withnall <withnall@endlessm.com>

Closes: #1599
Approved by: jlebon
Rename from `fdata` to `fetch_data` to clarify things and make it
consistent with other similar functionality in the file.

This introduces no functional changes.

Signed-off-by: Philip Withnall <withnall@endlessm.com>

Closes: #1599
Approved by: jlebon
This introduces no functional changes, but will make upcoming support
for retrying downloads easier to add.

Signed-off-by: Philip Withnall <withnall@endlessm.com>

Closes: #1599
Approved by: jlebon
This introduces no functional changes, but will make upcoming support
for retrying downloads easier to add.

Signed-off-by: Philip Withnall <withnall@endlessm.com>

Closes: #1599
Approved by: jlebon
This introduces no functional changes, but will make upcoming support
for retrying downloads easier to add.

Signed-off-by: Philip Withnall <withnall@endlessm.com>

Closes: #1599
Approved by: jlebon
Allow network requests to be re-queued if they failed with a transient
error, such as a socket timeout. Retry each request up to a limit
(default: 5), and only then fail the entire pull and propagate the error
to the caller.

Add a new ostree_repo_pull_with_options() option, n-network-retries, to
control the number of retries (including setting it back to the old
default of 0, if the caller wants).

Currently, retries are not supported for FetchDeltaSuperData requests,
as they are not queued. Once they are queued, adding support for retries
should be trivial. A FIXME comment has been left for this.

Signed-off-by: Philip Withnall <withnall@endlessm.com>

Closes: #1594
Approved by: jlebon
This allows the retry code in ostree-repo-pull.c to recover from (for
example) timeouts at the libsoup layer in the stack, as well as from the
GSocket layer in the stack.

Signed-off-by: Philip Withnall <withnall@endlessm.com>

Closes: #1594
Approved by: jlebon
This is exactly like the --random-500s option, except that it will cause
error 408 (request timeout) to be returned, rather than error 500
(internal server error).

This will be used in a following commit to test pull behaviour when
timeouts occur.

Signed-off-by: Philip Withnall <withnall@endlessm.com>

Closes: #1594
Approved by: jlebon
Extend test-pull-repeated.sh to test error 408 as well as error 500, to
ensure that the new retry-on-network-timeout code in ostree-repo-pull.c
correctly retries.

Rather than the 200 iterations needed for the error 500 tests, only do 5
iterations. The pull code internally does 5 retries (by default), which
means a full iteration count of 25. That seems to be sufficient to make
the tests reliably pass, in my testing — we can always bump it up to 200
/ 5 = 40 in future if needed (to put it in parity with the error 500
tests).

Signed-off-by: Philip Withnall <withnall@endlessm.com>

Closes: #1594
Approved by: jlebon
Various of the counters already have assertions like this; add some more
for total paranoia.

Signed-off-by: Philip Withnall <withnall@endlessm.com>

Closes: #1594
Approved by: jlebon
Use the same G_IO_ERROR_* values for HTTP status codes in both fetchers.
The libsoup fetcher still handles a few more internal error codes than
the libcurl one; this could be built on in future.

Signed-off-by: Philip Withnall <withnall@endlessm.com>

Closes: #1594
Approved by: jlebon
Just like all the other requests made for delta parts and objects by the
pull code, use a queue for delta superblocks. Currently this doesn’t do
any prioritisation or retries after transient failures, but it could do
in future.

This means that delta superblocks are now subject to the parallel
request limit in the fetcher, which was a problem highlighted here:
ostreedev/ostree#1453 (comment).

Signed-off-by: Philip Withnall <withnall@endlessm.com>
Use the recently introduced architecture for retrying network requests
on transient failure to do the same for delta superblock requests, now
that they’re queued.

Signed-off-by: Philip Withnall <withnall@endlessm.com>
@pwithnall pwithnall self-assigned this May 30, 2018
@rshuler
Copy link
Contributor

rshuler commented May 30, 2018

I've confirmed that these commits match those of the upstream PRs. Since this has all been reviewed upstream, I'm going to go ahead and accept the PR so we can get started with testing. If anyone else feels like reviewing more in-depth, go for it, and we can clean up later if any further changes needed.

@rshuler rshuler merged commit b8d7b8f into master May 30, 2018
@rshuler rshuler deleted the T22555-retry-downloads branch May 30, 2018 19:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants