T22555 Retry downloads on flaky networks #120

pwithnall · 2018-05-30T16:50:13Z

This is a straight backport of upstream’s:

- Preparation for supporting retrying downloads ostreedev/ostree#1599 (omitting this commit because it didn’t apply cleanly, but it’s just a code cleanup)
- Support retrying downloads if they transiently fail ostreedev/ostree#1594
- Queue static delta superblocks ostreedev/ostree#1600 (not yet merged upstream; one merge conflict due to this downstream patch)

Allow network requests to be re-queued if they failed with a transient error, such as a socket timeout. Retry each request up to a limit (default: 5), and only then fail the entire pull and propagate the error to the caller. Add a new ostree_repo_pull_with_options() option, n-network-retries, to control the number of retries (including setting it back to the old default of 0, if the caller wants). Currently, retries are not supported for FetchDeltaSuperData requests, as they are not queued. Once they are queued, adding support for retries should be trivial. A FIXME comment has been left for this. Signed-off-by: Philip Withnall <withnall@endlessm.com> Closes: #1594 Approved by: jlebon
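The retry behaviour described above can be sketched roughly as follows. This is a simplified, synchronous illustration and not the code from ostree-repo-pull.c, which re-queues requests asynchronously. The names `FetchResult`, `FetchFunc`, `fetch_with_retries` and `flaky_fetch` are hypothetical, and the sketch assumes the retry limit counts attempts made after the first one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result classes; these are not the real G_IO_ERROR_* values. */
typedef enum {
  FETCH_OK,
  FETCH_ERR_TRANSIENT,  /* e.g. a socket timeout: worth retrying */
  FETCH_ERR_FATAL       /* retrying will not help */
} FetchResult;

/* Hypothetical callback which performs a single request attempt. */
typedef FetchResult (*FetchFunc) (void *user_data);

/* Run one request, retrying transient failures up to n_network_retries
 * times (assumption: the first attempt does not count as a retry).
 * With n_network_retries == 0, the first error is returned immediately. */
static FetchResult
fetch_with_retries (FetchFunc     fetch,
                    void         *user_data,
                    unsigned int  n_network_retries,
                    unsigned int *out_attempts)
{
  unsigned int attempts = 0;
  FetchResult result;

  assert (fetch != NULL);

  do
    {
      result = fetch (user_data);
      attempts++;
    }
  while (result == FETCH_ERR_TRANSIENT && attempts <= n_network_retries);

  if (out_attempts != NULL)
    *out_attempts = attempts;

  return result;
}

/* Stand-in for a flaky network: fails transiently until the counter
 * pointed to by user_data reaches zero, then succeeds. */
static FetchResult
flaky_fetch (void *user_data)
{
  unsigned int *failures_left = user_data;

  if (*failures_left > 0)
    {
      (*failures_left)--;
      return FETCH_ERR_TRANSIENT;
    }

  return FETCH_OK;
}
```

With a limit of 5, a request that times out three times succeeds on its fourth attempt. With a limit of 0, the first timeout is propagated straight to the caller, which matches the old behaviour.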

This allows the retry code in ostree-repo-pull.c to recover from (for example) timeouts at the libsoup layer in the stack, as well as from the GSocket layer in the stack. Signed-off-by: Philip Withnall <withnall@endlessm.com> Closes: #1594 Approved by: jlebon

This is exactly like the --random-500s option, except that it will cause error 408 (request timeout) to be returned, rather than error 500 (internal server error). This will be used in a following commit to test pull behaviour when timeouts occur. Signed-off-by: Philip Withnall <withnall@endlessm.com> Closes: #1594 Approved by: jlebon

Extend test-pull-repeated.sh to test error 408 as well as error 500, to ensure that the new retry-on-network-timeout code in ostree-repo-pull.c correctly retries. Rather than the 200 iterations needed for the error 500 tests, only do 5 iterations. The pull code internally does 5 retries (by default), which means a full iteration count of 25. That seems to be sufficient to make the tests reliably pass, in my testing — we can always bump it up to 200 / 5 = 40 in future if needed (to put it in parity with the error 500 tests). Signed-off-by: Philip Withnall <withnall@endlessm.com> Closes: #1594 Approved by: jlebon

Various of the counters already have assertions like this; add some more for total paranoia. Signed-off-by: Philip Withnall <withnall@endlessm.com> Closes: #1594 Approved by: jlebon

Use the same G_IO_ERROR_* values for HTTP status codes in both fetchers. The libsoup fetcher still handles a few more internal error codes than the libcurl one; this could be built on in future. Signed-off-by: Philip Withnall <withnall@endlessm.com> Closes: #1594 Approved by: jlebon
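As an illustration of sharing one status-code mapping between both fetchers, here is a minimal sketch. The enum `FetchErrorKind` and both function names are hypothetical stand-ins rather than the real G_IO_ERROR_* values or ostree functions, and the exact set of codes handled is an assumption. Only the treatment of 408 as a retryable timeout and of 500 as a plain failure follows from the commit messages above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical error kinds standing in for G_IO_ERROR_* values. */
typedef enum {
  ERR_NOT_FOUND,
  ERR_TIMED_OUT,
  ERR_FAILED
} FetchErrorKind;

/* One mapping used by every fetcher backend, so the pull code sees the
 * same error kind regardless of which HTTP library produced it. */
static FetchErrorKind
http_status_to_error_kind (unsigned int status_code)
{
  /* Only HTTP error statuses should reach this function. */
  assert (status_code >= 400);

  switch (status_code)
    {
    case 404:
    case 410:
      return ERR_NOT_FOUND;   /* assumption: treated as "not found" */
    case 408:
      return ERR_TIMED_OUT;   /* request timeout: transient */
    default:
      return ERR_FAILED;      /* includes 500 */
    }
}

/* Whether the retry code should re-queue a request which failed this way. */
static bool
error_kind_is_transient (FetchErrorKind kind)
{
  return kind == ERR_TIMED_OUT;
}
```

Keeping the mapping in one place means the retry logic only has to check the error kind, not which backend raised it.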

Just like all the other requests made for delta parts and objects by the pull code, use a queue for delta superblocks. Currently this doesn’t do any prioritisation or retries after transient failures, but it could do in future. This means that delta superblocks are now subject to the parallel request limit in the fetcher, which was a problem highlighted here: ostreedev/ostree#1453 (comment). Signed-off-by: Philip Withnall <withnall@endlessm.com>
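To show what queueing buys here, a small sketch of a request queue that honours a parallel request limit. It is a counting model only, not the fetcher's implementation. `RequestQueue`, its functions and the `MAX_OUTSTANDING` value are hypothetical, and the real limit is not stated in this PR.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical limit on requests in flight at once. */
#define MAX_OUTSTANDING 8

typedef struct {
  unsigned int n_pending;      /* queued, not yet started */
  unsigned int n_outstanding;  /* currently in flight */
} RequestQueue;

/* Queue a request instead of starting it immediately. */
static void
queue_enqueue (RequestQueue *q)
{
  assert (q != NULL);
  q->n_pending++;
}

/* Start as many pending requests as the limit allows.
 * Returns the number of requests started. */
static unsigned int
queue_dispatch (RequestQueue *q)
{
  unsigned int started = 0;

  assert (q != NULL);

  while (q->n_pending > 0 && q->n_outstanding < MAX_OUTSTANDING)
    {
      q->n_pending--;
      q->n_outstanding++;
      started++;
    }

  return started;
}

/* Called when an in-flight request completes, freeing a slot. */
static void
queue_request_finished (RequestQueue *q)
{
  assert (q != NULL);
  assert (q->n_outstanding > 0);
  q->n_outstanding--;
}
```

A request started directly bypasses this accounting. Once it goes through the queue, it has to wait for a free slot like every other request.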

Use the recently introduced architecture for retrying network requests on transient failure to do the same for delta superblock requests, now that they’re queued. Signed-off-by: Philip Withnall <withnall@endlessm.com>

rshuler · 2018-05-30T19:31:55Z

I've confirmed that these commits match those of the upstream PRs. Since this has all been reviewed upstream, I'm going to go ahead and accept the PR so we can get started with testing. If anyone else feels like reviewing more in-depth, go for it, and we can clean up later if any further changes are needed.

pwithnall added 14 commits May 30, 2018 17:42

lib/repo-pull: Use values from struct in enqueue_one_object_request()

624fb5b

This introduces no functional changes, but will make some upcoming refactoring a little easier. Signed-off-by: Philip Withnall <withnall@endlessm.com> Closes: #1599 Approved by: jlebon

lib/repo-pull: Factor out free function for FetchDeltaSuperData

7e9dadf

This introduces no functional changes, but does make the code a little cleaner. Signed-off-by: Philip Withnall <withnall@endlessm.com> Closes: #1599 Approved by: jlebon

lib/repo-pull: Rename a variable

dbd3470

Rename from `fdata` to `fetch_data` to clarify things and make it consistent with other similar functionality in the file. This introduces no functional changes. Signed-off-by: Philip Withnall <withnall@endlessm.com> Closes: #1599 Approved by: jlebon

lib/repo-pull: Factor out enqueue function for ScanObjectQueueData

fd60d46

This introduces no functional changes, but will make upcoming support for retrying downloads easier to add. Signed-off-by: Philip Withnall <withnall@endlessm.com> Closes: #1599 Approved by: jlebon

lib/repo-pull: Factor out enqueue function for FetchObjectData

d9bc598

This introduces no functional changes, but will make upcoming support for retrying downloads easier to add. Signed-off-by: Philip Withnall <withnall@endlessm.com> Closes: #1599 Approved by: jlebon

lib/repo-pull: Factor out enqueue function for FetchStaticDeltaData

02bc550

This introduces no functional changes, but will make upcoming support for retrying downloads easier to add. Signed-off-by: Philip Withnall <withnall@endlessm.com> Closes: #1599 Approved by: jlebon

lib/repo-pull: Add some missing assertions for progress statistics

2c974d0

Various of the counters already have assertions like this; add some more for total paranoia. Signed-off-by: Philip Withnall <withnall@endlessm.com> Closes: #1594 Approved by: jlebon

lib/repo-pull: Support retries for delta superblocks

50a9cf2

Use the recently introduced architecture for retrying network requests on transient failure to do the same for delta superblock requests, now that they’re queued. Signed-off-by: Philip Withnall <withnall@endlessm.com>

pwithnall self-assigned this May 30, 2018

pwithnall requested review from wjt, dbnicholson and mwleeds May 30, 2018 16:50

rshuler merged commit b8d7b8f into master May 30, 2018

rshuler deleted the T22555-retry-downloads branch May 30, 2018 19:32
