
Aws::S3::Client#get_object can use range requests to resume interrupted downloads #1535

Closed
janko opened this issue Jun 16, 2017 · 6 comments
Labels
feature-request A feature should be added or improved.

Comments

@janko
Contributor

janko commented Jun 16, 2017

This Amazon article describes various ways in which you can download objects from S3. The Seahorse client currently retries any failed request, including an interrupted download, like any other request: by re-issuing the same request from the beginning. As a result, streaming downloads (i.e. Aws::S3::Client#get_object with a block) currently cannot be made resilient to network errors: when :response_target is an IO object it can be truncated before the request is retried, but with a block (BlockIO) that isn't possible, because the chunks have already been yielded.

I propose that for Aws::S3::Client#get_object requests, instead of retrying the entire download from the beginning, aws-sdk uses the Range header to resume downloading from the last downloaded byte. That way retrying interrupted downloads will be faster, and it will also work for streaming downloads (where a block is passed to Aws::S3::Client#get_object).
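
A minimal sketch of the idea, done at the application level rather than inside the SDK: if a streaming #get_object is interrupted, re-issue it with a Range header starting at the last fully yielded byte instead of starting over. The bucket/key names and the retry limit below are only illustrative.

require "aws-sdk-s3"

s3 = Aws::S3::Client.new
bytes_received = 0
attempts = 0

begin
  params = { bucket: "my-bucket", key: "big-object" }
  # on a retry, ask only for the bytes we haven't received yet
  params[:range] = "bytes=#{bytes_received}-" if bytes_received > 0

  s3.get_object(params) do |chunk|
    $stdout.write(chunk)               # hand the chunk to the consumer
    bytes_received += chunk.bytesize   # track progress for a possible resume
  end
rescue Seahorse::Client::NetworkingError
  attempts += 1
  retry if attempts < 3
end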

@cjyclaire cjyclaire added enhancement feature-request A feature should be added or improved. and removed enhancement feature-request A feature should be added or improved. labels Jun 19, 2017
@cjyclaire
Contributor

cjyclaire commented Jun 19, 2017

Thanks for bringing up the idea, I appreciate it. I'd be happy to dig in more and see how we can shape this into a clear feature request or otherwise improve the user experience here.

For downloads, we released the multipart downloader in April. It makes requests by :range or :part, and if a single request fails, it makes sure the completed chunks are cleaned up. For those requests, if a retry happens, it retries the same request with the same :range or :part.
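
Roughly, using the downloader for a file on disk looks like this (assuming the Aws::S3::Object#download_file interface with its :mode and :chunk_size options; the bucket, key, and destination path are placeholders):

require "aws-sdk-s3"

obj = Aws::S3::Resource.new.bucket("my-bucket").object("big-object")

# "get_range" downloads the object in ranged chunks that can each be retried
# on their own, instead of in one large single request
obj.download_file("/tmp/big-object", mode: "get_range", chunk_size: 5 * 1024 * 1024)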

From my understanding of your proposal (feel free to correct me if I'm missing something : )), you are suggesting special handling of retries for the "basic" :get_object operation (no :range or :part in the parameters) to make it more robust. The problem is that if :range or :part_number is not specified in the #get_object API call, the response contains nil values for content_range and parts_count, which leaves us with nothing to resume the range from.

Personally, I feel the multipart downloader works better for this case when the file is big; for a single #get_object request, I don't see a clear way to make recovery less error-prone. Thoughts?

@janko
Contributor Author

janko commented Jun 19, 2017

Thanks for a thorough answer!

I did consider the new multipart downloader; however, my use case is a bit different from a normal download. I actually need to stream the content (i.e. pass a block to #get_object), because I'm allowing users to partially retrieve an object's content without fully downloading it. Here is where I'm using this; that part of the code transforms the yielded chunks into an IO-like object, which downloads content (fetches the next chunks) as it is being read. For that I cannot use the multipart downloader feature, because it doesn't stream content.
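
For reference, the block form of #get_object I'm referring to looks roughly like this (bucket/key names are placeholders):

require "aws-sdk-s3"

s3 = Aws::S3::Client.new

# each chunk is yielded as soon as it arrives, without buffering the whole
# object in memory or writing it to disk
s3.get_object(bucket: "my-bucket", key: "video.mp4") do |chunk|
  $stdout.write(chunk) # e.g. pipe it straight into another IO
end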

@cjyclaire
Contributor

cjyclaire commented Jun 19, 2017

@janko-m Thanks for the information, I see.

Well, as that blog post suggests:

Please note, when using blocks to downloading objects, the Ruby SDK will NOT retry failed requests after the first chunk of data has been yielded. Doing so could cause file corruption on the client end by starting over mid-stream. For this reason, I recommend using one of the preceding methods for specifying the target file path or IO object.

However, if partial retrieval is your main concern, I'm thinking perhaps we could add an extra :recover mode to the multipart downloader that writes partial contents to the specified file and returns the last part/range byte information if the retry still fails; accordingly, it should also have an easy interface for resuming when that's wanted.

Does that sound good to you? If so, I can put it in the feature backlog : )
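
Purely as a sketch of the idea (none of the :recover / :resume_from options or the result object below exist in the SDK today; they only illustrate the proposed interface):

require "aws-sdk-s3"

obj = Aws::S3::Resource.new.bucket("my-bucket").object("big-object")

result = obj.download_file("/tmp/big-object.partial",
                           mode: "get_range",
                           chunk_size: 5 * 1024 * 1024,
                           recover: true)            # hypothetical option

unless result.complete?                              # hypothetical: retries exhausted, partial file kept
  obj.download_file("/tmp/big-object.partial",
                    mode: "get_range",
                    chunk_size: 5 * 1024 * 1024,
                    resume_from: result.last_byte)   # hypothetical option
end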

@janko
Contributor Author

janko commented Jun 20, 2017

I'm not sure that would have all the benefits of using #get_object. For one, we don't know in advance how much the user wants to download (the user chooses that), and I don't think it would be easy to add such on-demand behaviour to the multipart downloader.

io = Down::ChunkedIO.new(chunks: object.enum_for(:get))
io.read(1*1024*1024) # downloads and returns first ~1MB
io.close # `#get_object` request is terminated, and nothing more gets downloaded

Secondly, using the multipart downloader would mean the contents necessarily get downloaded to disk, which isn't ideal in all cases. For example, Shrine (a gem that I maintain) has a download endpoint which streams the file directly from S3 into the response body. Because #get_object lets me retrieve chunks as they are downloaded, I can write them to the response body without ever writing to disk, which means many S3 objects can be downloaded through that endpoint at the same time without any worry about running out of disk space.
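
Roughly the pattern I mean (not Shrine's actual code; the class name and bucket/key are just for illustration):

require "aws-sdk-s3"

# A Rack-style body object that streams an S3 object chunk by chunk, so the
# response never touches the filesystem.
class S3StreamingBody
  def initialize(bucket:, key:)
    @bucket, @key, @client = bucket, key, Aws::S3::Client.new
  end

  # Rack calls #each and writes every yielded chunk to the response socket
  def each(&block)
    @client.get_object(bucket: @bucket, key: @key, &block)
  end
end

# e.g. as a Rack response:
# [200, { "Content-Type" => "video/mp4" }, S3StreamingBody.new(bucket: "b", key: "video.mp4")]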

Thirdly (this should probably be discussed in a separate ticket specific to the multipart downloader feature), I haven't managed to achieve the same performance as with #get_object for larger objects (I tested with up to 200MB). It was some time ago and I meant to open an issue for it, but I've forgotten the exact numbers; I think downloading larger objects with the multipart downloader took roughly twice as long as using #get_object directly (or maybe a bit less). I tried increasing the chunk size, but I don't remember there being much improvement. IIRC, most of the time (about 60%) was spent in IO.copy_stream when merging the files at the end. So I would also prefer #get_object because it's currently faster.

@cjyclaire
Contributor

@janko-m Ah, I see, thanks for the clarification. I agree that's a separate issue; I'm tagging this one as a feature request and would be happy to take a PR for review. I'll do some benchmarking/exploring myself : )

@cjyclaire cjyclaire added feature-request A feature should be added or improved. and removed enhancement labels Jun 20, 2017
@cjyclaire
Contributor

Tracked in backlog to be prioritized : )
