
Aws::S3::Client#get_object can use range requests to resume interrupted downloads #1535

Closed
janko opened this issue Jun 16, 2017 · 6 comments
Labels
feature-request A feature should be added or improved.

Comments

@janko
Contributor

janko commented Jun 16, 2017

This Amazon article describes various ways in which you can download objects from S3. The Seahorse client currently retries any failed request, including an interrupted download, like any other request: by re-issuing the same request from the beginning. As a result, streaming downloads (i.e. Aws::S3::Client#get_object with a block) currently cannot be made resilient to network errors: when :response_target is an IO object it can be truncated before the request is retried, but with a block (BlockIO) that isn't possible, because the chunks have already been yielded.

I propose that for Aws::S3::Client#get_object requests, instead of retrying the entire download from the beginning, aws-sdk uses the Range header to resume downloading from the last downloaded byte. That way retrying interrupted downloads will be faster, and it will also work for streaming downloads (where a block is passed to Aws::S3::Client#get_object).
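
A minimal sketch of the idea, done at the application level rather than inside the SDK: if a streaming #get_object is interrupted, re-issue it with a Range header starting at the last fully yielded byte instead of starting over. The bucket/key names and the retry limit below are only illustrative.

require "aws-sdk-s3"

s3 = Aws::S3::Client.new
bytes_received = 0
attempts = 0

begin
  params = { bucket: "my-bucket", key: "big-object" }
  # on a retry, ask only for the bytes we haven't received yet
  params[:range] = "bytes=#{bytes_received}-" if bytes_received > 0

  s3.get_object(params) do |chunk|
    $stdout.write(chunk)               # hand the chunk to the consumer
    bytes_received += chunk.bytesize   # track progress for a possible resume
  end
rescue Seahorse::Client::NetworkingError
  attempts += 1
  retry if attempts < 3
end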

@cjyclaire cjyclaire added enhancement feature-request A feature should be added or improved. and removed enhancement feature-request A feature should be added or improved. labels Jun 19, 2017
@cjyclaire
Contributor

cjyclaire commented Jun 19, 2017

Thanks for bringing up the idea, I appreciate it. I'd be happy to dig in more and see how we can shape this into a clear feature request or otherwise improve the user experience here.

For downloads, we released the multipart downloader in April. It makes requests by :range or :part, and if a single request fails, it makes sure the completed chunks are cleaned up. For those requests, if a retry happens, it retries the same request with the same :range or :part.
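
Roughly, using the downloader for a file on disk looks like this (assuming the Aws::S3::Object#download_file interface with its :mode and :chunk_size options; the bucket, key, and destination path are placeholders):

require "aws-sdk-s3"

obj = Aws::S3::Resource.new.bucket("my-bucket").object("big-object")

# "get_range" downloads the object in ranged chunks that can each be retried
# on their own, instead of in one large single request
obj.download_file("/tmp/big-object", mode: "get_range", chunk_size: 5 * 1024 * 1024)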

From my understanding of your proposal (feel free to correct me if I'm missing something : )), you are suggesting special handling of retries for the "basic" :get_object operation (no :range or :part in the parameters) to make it more robust. The problem is that if :range or :part_number is not specified in the #get_object API call, the response contains nil values for content_range and parts_count, which leaves us with nothing to resume the range from.

Personally, I feel the multipart downloader works better for this case when the file is big; for a single #get_object request, I don't see a clear way to make recovery less error-prone. Thoughts?

@janko
Contributor Author

janko commented Jun 19, 2017

Thanks for a thorough answer!

I did consider the new multipart downloader; however, my use case is a bit different from a normal download. I actually need to stream the content (i.e. pass a block to #get_object), because I'm allowing users to partially retrieve an object's content without fully downloading it. Here is where I'm using this; that part of the code transforms the yielded chunks into an IO-like object, which downloads content (fetches the next chunks) as it is being read. For that I cannot use the multipart downloader feature, because it doesn't stream content.
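
For reference, the block form of #get_object I'm referring to looks roughly like this (bucket/key names are placeholders):

require "aws-sdk-s3"

s3 = Aws::S3::Client.new

# each chunk is yielded as soon as it arrives, without buffering the whole
# object in memory or writing it to disk
s3.get_object(bucket: "my-bucket", key: "video.mp4") do |chunk|
  $stdout.write(chunk) # e.g. pipe it straight into another IO
end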

@cjyclaire
Contributor

cjyclaire commented Jun 19, 2017

@janko-m Thanks for the information, I see.

Well, as that blog post suggests:

Please note, when using blocks to downloading objects, the Ruby SDK will NOT retry failed requests after the first chunk of data has been yielded. Doing so could cause file corruption on the client end by starting over mid-stream. For this reason, I recommend using one of the preceding methods for specifying the target file path or IO object.

However, if partial retrieval is your main concern, I'm thinking perhaps we could add an extra :recover mode to the multipart downloader that writes partial contents to the specified file and returns the last part/range byte information if the retry still fails; accordingly, it should also have an easy interface for resuming when that's wanted.

Does that sound good to you? If so, I can put it in the feature backlog : )
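
Purely as a sketch of the idea (none of the :recover / :resume_from options or the result object below exist in the SDK today; they only illustrate the proposed interface):

require "aws-sdk-s3"

obj = Aws::S3::Resource.new.bucket("my-bucket").object("big-object")

result = obj.download_file("/tmp/big-object.partial",
                           mode: "get_range",
                           chunk_size: 5 * 1024 * 1024,
                           recover: true)            # hypothetical option

unless result.complete?                              # hypothetical: retries exhausted, partial file kept
  obj.download_file("/tmp/big-object.partial",
                    mode: "get_range",
                    chunk_size: 5 * 1024 * 1024,
                    resume_from: result.last_byte)   # hypothetical option
end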

@janko
Contributor Author

janko commented Jun 20, 2017

I'm not sure that would have all the benefits of using #get_object. For one, we don't know in advance how much the user wants to download (the user chooses that), and I don't think it would be easy to add such on-demand behaviour to the multipart downloader.

io = Down::ChunkedIO.new(chunks: object.enum_for(:get))
io.read(1*1024*1024) # downloads and returns first ~1MB
io.close # `#get_object` request is terminated, and nothing more gets downloaded

Secondly, using the multipart downloader would mean the contents necessarily get downloaded to disk, which isn't ideal in all cases. For example, Shrine (a gem that I maintain) has a download endpoint which streams the file directly from S3 into the response body. Because #get_object lets me retrieve chunks as they are downloaded, I can write them to the response body without ever writing to disk, which means many S3 objects can be downloaded through that endpoint at the same time without any worry about running out of disk space.
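
Roughly the pattern I mean (not Shrine's actual code; the class name and bucket/key are just for illustration):

require "aws-sdk-s3"

# A Rack-style body object that streams an S3 object chunk by chunk, so the
# response never touches the filesystem.
class S3StreamingBody
  def initialize(bucket:, key:)
    @bucket, @key, @client = bucket, key, Aws::S3::Client.new
  end

  # Rack calls #each and writes every yielded chunk to the response socket
  def each(&block)
    @client.get_object(bucket: @bucket, key: @key, &block)
  end
end

# e.g. as a Rack response:
# [200, { "Content-Type" => "video/mp4" }, S3StreamingBody.new(bucket: "b", key: "video.mp4")]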

Thirdly (this should probably be discussed in a separate ticket specific to the multipart downloader feature), I haven't managed to achieve the same performance as with #get_object for larger objects (I tested with up to 200MB). It was some time ago and I meant to open an issue for it, but I've forgotten the exact numbers; I think downloading larger objects with the multipart downloader took roughly twice as long as using #get_object directly (or maybe a bit less). I tried increasing the chunk size, but I don't remember there being much improvement. IIRC, most of the time (about 60%) was spent in IO.copy_stream when merging the files at the end. So I would also prefer #get_object because it's currently faster.

@cjyclaire
Contributor

@janko-m Ah, I see, thanks for the clarification. I agree that's a separate issue; I'm tagging this one as a feature request and would be happy to take a PR for review. I'll do some benchmarking/exploring myself : )

@cjyclaire cjyclaire added feature-request A feature should be added or improved. and removed enhancement labels Jun 20, 2017
@cjyclaire
Contributor

Tracked in backlog to be prioritized : )
