Skip to content

AWS: Add progressive multipart upload to S3FileIO#1767

Merged
danielcweeks merged 13 commits intoapache:masterfrom
danielcweeks:s3fileio-progressive-upload
Nov 21, 2020
Merged

AWS: Add progressive multipart upload to S3FileIO#1767
danielcweeks merged 13 commits intoapache:masterfrom
danielcweeks:s3fileio-progressive-upload

Conversation

@danielcweeks
Copy link
Contributor

Add progressive upload to S3OutputStream using multipart upload.

A few key changes are:

  1. The output stream will switch from single PUT based upload to multipart when the multipart size * threshold is reached.
  2. Staging files will be rotated and uploaded async as output written
  3. Staging files will be deleted when the upload completes to reduce local disk requirements
  4. An attempt will be made to abort the upload upon failure, but this is not guaranteed (Bucket lifecycle rules should be implemented to ensure data is cleaned up.)

@github-actions github-actions bot added the AWS label Nov 13, 2020
@danielcweeks
Copy link
Contributor Author

@jackye1995 it would be great to get your thoughts on this approach. Still needs some work and lots more testing.

@danielcweeks danielcweeks linked an issue Nov 13, 2020 that may be closed by this pull request
Copy link
Contributor

@jackye1995 jackye1995 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Looks good to me, the two big questions I have regarding the upload logic are (1) can we progressively delete staging files, and (2) can we leverage S3AsyncClient for async upload.

@danielcweeks danielcweeks marked this pull request as draft November 16, 2020 22:00
@danielcweeks
Copy link
Contributor Author

@jackye1995 I've moved this PR to draft for now. I rebased on top of your changes in #1754, but it's somewhat complicated due to the fact that now we have 3 separate requests (create multipart, upload part, put object) that all require setting the properties, but they don't inherit from a common interface. I feel like reflection is probably the best way do simplify this, but I'll push the current state of things.

@jackye1995
Copy link
Contributor

@jackye1995 I've moved this PR to draft for now. I rebased on top of your changes in #1754, but it's somewhat complicated due to the fact that now we have 3 separate requests (create multipart, upload part, put object) that all require setting the properties, but they don't inherit from a common interface. I feel like reflection is probably the best way do simplify this, but I'll push the current state of things.

Sorry I was dealing with some other errands yesterday. Please see #1786 to see if that solves this issue, I am trying to not use reflection, but we can also switch to that approach if it is simpler.

@danielcweeks
Copy link
Contributor Author

@jackye1995 I've moved this PR to draft for now. I rebased on top of your changes in #1754, but it's somewhat complicated due to the fact that now we have 3 separate requests (create multipart, upload part, put object) that all require setting the properties, but they don't inherit from a common interface. I feel like reflection is probably the best way do simplify this, but I'll push the current state of things.

Sorry I was dealing with some other errands yesterday. Please see #1786 to see if that solves this issue, I am trying to not use reflection, but we can also switch to that approach if it is simpler.

Turns out I did something very similar here: https://github.com/apache/iceberg/pull/1767/files#diff-133c36e9cbb025f7cb44c2daac330c501d85a5a6eb44126c0eb6155f2bad7407R30

@jackye1995
Copy link
Contributor

@jackye1995 I've moved this PR to draft for now. I rebased on top of your changes in #1754, but it's somewhat complicated due to the fact that now we have 3 separate requests (create multipart, upload part, put object) that all require setting the properties, but they don't inherit from a common interface. I feel like reflection is probably the best way do simplify this, but I'll push the current state of things.

Sorry I was dealing with some other errands yesterday. Please see #1786 to see if that solves this issue, I am trying to not use reflection, but we can also switch to that approach if it is simpler.

Turns out I did something very similar here: https://github.com/apache/iceberg/pull/1767/files#diff-133c36e9cbb025f7cb44c2daac330c501d85a5a6eb44126c0eb6155f2bad7407R30

Oh yes, looks like It is a generalization for that class. For the name, I would prefer the class to be more generic just to avoid creating other utils for future use cases. For reflection, I am not sure what is the community guideline here but personally I would avoid using that, which resulted in that solution using functional interface.

@danielcweeks
Copy link
Contributor Author

@jackye1995 I've moved this PR to draft for now. I rebased on top of your changes in #1754, but it's somewhat complicated due to the fact that now we have 3 separate requests (create multipart, upload part, put object) that all require setting the properties, but they don't inherit from a common interface. I feel like reflection is probably the best way do simplify this, but I'll push the current state of things.

Sorry I was dealing with some other errands yesterday. Please see #1786 to see if that solves this issue, I am trying to not use reflection, but we can also switch to that approach if it is simpler.

Turns out I did something very similar here: https://github.com/apache/iceberg/pull/1767/files#diff-133c36e9cbb025f7cb44c2daac330c501d85a5a6eb44126c0eb6155f2bad7407R30

Oh yes, looks like It is a generalization for that class. For the name, I would prefer the class to be more generic just to avoid creating other utils for future use cases. For reflection, I am not sure what is the community guideline here but personally I would avoid using that, which resulted in that solution using functional interface.

@jackye1995 I actually stepped back from using reflection as I found out that some request types do not set all of the same parameters (e.g. UploadPartRequest, GetObjectRequest aren't exactly the same as PutObjectRequest and CreateMultipartUploadRequest). Since there are no common interfaces, it seems reasonable to just separate out the behavior for now.

As we get into other instances of this (like ACL, etc.), then maybe it would make sense to find a more concise solution.

I've got a few small changes to make and want to expand the testing, but hopefully have something ready for review today.

@jackye1995
Copy link
Contributor

The ACL support #1788 should not be a big issue, it is only a single additional method call acl(...) and only affects PutObjectRequest and CreateMultipartUploadRequest.

I was thinking there is still a way to use reflection, using a similar logic as #1786 . We need to check if the method exists or not when setting the value. The biggest concern I have with reflection is that I don't know how much it affects the performance since it is in the critical path for upload. Let me run some tests and come back with some data for that.

@danielcweeks danielcweeks marked this pull request as ready for review November 20, 2020 20:40
@danielcweeks danielcweeks force-pushed the s3fileio-progressive-upload branch from 3679ea7 to ac1aac2 Compare November 20, 2020 20:47
@danielcweeks danielcweeks requested a review from rdblue November 20, 2020 20:49
Copy link
Contributor

@rdblue rdblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

@danielcweeks danielcweeks merged commit 5e3f919 into apache:master Nov 21, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support multipart upload in S3FileIO

4 participants