
S3 TransferManager Should Allow Downloading to Stream #893

Closed
jeffquinn-nuna opened this issue Oct 15, 2016 · 10 comments

@jeffquinn-nuna commented Oct 15, 2016

The current implementation of TransferManager only allows a File as the download destination. It would be helpful to have an interface for supplying an OutputStream as the final sink for the downloaded data. This would give greater flexibility in usage, and anyone who simply wants to download to a file could still supply a FileOutputStream.

@kiiadi (Contributor) commented Oct 15, 2016

You can get at the stream by using the AmazonS3Client.getObject method directly. This returns an S3Object, which has a getObjectContent() method for accessing the stream. Does this solve your use case?
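In SDK v1 that looks roughly like the sketch below. The bucket and key names in the comment are placeholders, and the copy helper itself is plain java.io, so it works with any InputStream:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class StreamCopy {
    // The S3 stream would be obtained with something like:
    //   S3Object obj = s3.getObject("my-bucket", "my-key");
    //   try (InputStream in = obj.getObjectContent()) { copy(in, out); }
    // Fully draining (or closing) the stream matters: the SDK holds the
    // underlying HTTP connection open until the content is consumed.
    public static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
            total += n;
        }
        return total;
    }
}
```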

@jeffquinn-nuna (Author) commented Oct 15, 2016

Hi @kiiadi, thanks for your response. It's true that the S3Object API works well for my use case; however, my understanding is that my download will always be single-threaded if I use that API.

For my use case, the consumer of my stream can consume data much faster than a single thread can download from S3, so I expect performance gains if multiple threads download from S3 simultaneously, buffer their parts, and then dump their contents to a single stream sequentially as they complete.

I tried to achieve this within the existing API by passing a Unix named pipe as the file parameter to TransferManager, but this fails because TransferManager's design assumes a regular file.
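The design described here (workers fetch and buffer parts concurrently; a single writer drains them in order) can be sketched without any S3 calls. `partFetcher` is a hypothetical stand-in for a ranged S3 GET that returns one part's bytes:

```java
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntFunction;

public class OrderedDownloader {
    // Fetch all parts concurrently, but write them to the sink in order.
    // partFetcher stands in for a ranged S3 GET returning one part's bytes.
    public static void download(int numParts,
                                IntFunction<byte[]> partFetcher,
                                OutputStream sink) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<byte[]>> parts = new ArrayList<>();
            for (int i = 0; i < numParts; i++) {
                final int part = i;
                parts.add(pool.submit(() -> partFetcher.apply(part)));
            }
            // Sequential writer: blocks on each part in submission order,
            // so the output is correct even though downloads may complete
            // out of order.
            for (Future<byte[]> f : parts) {
                sink.write(f.get());
            }
        } finally {
            pool.shutdown();
        }
    }
}
```

The buffering here is unbounded (every part is held until written); a production version would cap in-flight parts with a bounded queue or semaphore.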

@kiiadi (Contributor) commented Oct 17, 2016

As you may already be aware, TransferManager will only parallelize downloads for S3 objects that were uploaded in multiple parts (which in turn only happens if we know the content length, either through a File or an InputStream with the content length specified in the ObjectMetadata).

Historically we've stuck with File because it makes it easier for us to control retry logic and to parallelize uploads (see above). That said, we've had a few similar requests, so I can add this as a feature request to our backlog.

@jeffquinn-nuna (Author) commented Oct 17, 2016

Ah yes, I did not know that when I wrote this issue, but I noticed it while playing around with TransferManager. I have seen many libraries in the wild that use Range requests to achieve better parallelism, and they seem to get good results. I believe the SDK does not provide this functionality; is there a reason why? Is there something inherently more robust about the multipart approach versus Range requests (maybe it makes retries more robust, etc.)?

Happy to contribute a Range-request parallel downloader if there's no fundamental reason against it. We have written one ourselves several times, in different languages, within my company.

@kiiadi (Contributor) commented Oct 18, 2016

We're somewhat limited by what S3 itself supports: as it works today, S3 only supports multipart download on objects that were uploaded in multiple parts. Range is not currently supported.

Unless I've misunderstood your proposal...

@kiiadi (Contributor) commented Oct 18, 2016

Oops: @varunnvs92 just pointed out to me that you may have meant using the range property on the GetObjectRequest (which I admit I didn't know was there), in which case this could work! A PR would be great!
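A range-based downloader would first fetch the object's content length (e.g. via `getObjectMetadata`), split it into byte ranges, and issue one `new GetObjectRequest(bucket, key).withRange(start, end)` per chunk; note that, per HTTP Range semantics, both ends of the range are inclusive. The splitting itself is pure arithmetic and can be sketched as:

```java
import java.util.ArrayList;
import java.util.List;

public class RangeSplitter {
    // Split [0, contentLength) into inclusive byte ranges of at most
    // chunkSize bytes each, matching the inclusive semantics of
    // GetObjectRequest.withRange(start, end) in SDK v1.
    public static List<long[]> splitRanges(long contentLength, long chunkSize) {
        List<long[]> ranges = new ArrayList<>();
        for (long start = 0; start < contentLength; start += chunkSize) {
            long end = Math.min(start + chunkSize, contentLength) - 1;
            ranges.add(new long[]{start, end});
        }
        return ranges;
    }
}
```

For example, a 100-byte object split into 40-byte chunks yields the ranges 0-39, 40-79, and 80-99.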

@jeffquinn-nuna (Author) commented Oct 18, 2016

OK, great! I'm glad to hear we aren't doing anything fundamentally wrong by using the Range property (overwhelming the API, etc.). I will review the contributing guidelines and open a PR soon :)

@stevematyas commented Nov 4, 2016

+1 (the sooner the better; even an unofficial workaround would be appreciated)

@kiiadi: Completely understand the history here:

#893 (comment)

@kiiadi: Per your earlier suggestion, community users are using AmazonS3Client.getObject to access the underlying stream (to achieve streaming support). Unfortunately, due to #856 (java.net.SocketTimeoutException: Read timed out), we're forced to retry our large streams from the beginning, which is a poor experience. And storing a File is ill-advised in our use case.
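One unofficial workaround for the read timeouts in #856 is to track how many bytes have been consumed and, on an IOException, reopen the object with a Range starting at the last good offset. A sketch, where `openAt` is a hypothetical stand-in for something like `s3.getObject(new GetObjectRequest(bucket, key).withRange(offset)).getObjectContent()`:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.function.LongFunction;

public class ResumingReader {
    // Copy expectedLength bytes to out, reopening the source at the last
    // good offset whenever a read fails. openAt(offset) stands in for a
    // ranged S3 GET starting at that byte offset.
    public static long copyWithResume(LongFunction<InputStream> openAt,
                                      OutputStream out,
                                      long expectedLength,
                                      int maxRetries) throws IOException {
        long offset = 0;
        int failures = 0;
        while (offset < expectedLength) {
            long before = offset;
            try (InputStream in = openAt.apply(offset)) {
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                    offset += n;
                }
            } catch (IOException e) {
                if (failures >= maxRetries) throw e;
            }
            // Only count attempts that made no progress against the retry
            // budget; a partial read resets the counter.
            failures = (offset == before) ? failures + 1 : 0;
            if (failures > maxRetries) {
                throw new IOException("no progress after " + maxRetries + " retries");
            }
        }
        return offset;
    }
}
```

This still downloads single-threaded; it only removes the restart-from-zero penalty on timeouts.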

@justinuang commented Feb 21, 2019

I have a proof-of-concept implementation here: https://issues.apache.org/jira/browse/HADOOP-16132, with the PR being here: palantir/hadoop#47

@debora-ito (Collaborator) commented Aug 12, 2019

@jeffquinn-nuna @stevematyas @justinuang

The SDK team has reviewed the feature request list for V1, and since they're concentrating on new features for V2, they have decided not to implement this one in V1. It's still being considered for the TransferManager refactor in V2; see the referenced issue above. I'll go ahead and close this.

Please feel free to comment on the V2 tracking issue with your use case, and reach out if you have further questions.
