AWS: Add progressive multipart upload to S3FileIO#1767
danielcweeks merged 13 commits into apache:master
@jackye1995 it would be great to get your thoughts on this approach. Still needs some work and lots more testing.
jackye1995 left a comment
Thank you! This looks good to me. The two big questions I have regarding the upload logic are (1) whether we can progressively delete staging files, and (2) whether we can leverage S3AsyncClient for async upload.
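To make the two questions above concrete, here is a minimal, self-contained sketch. It is not the PR's code and does not use the AWS SDK: `PartSink` is a hypothetical stand-in for an S3 upload-part call, and `ProgressiveUploadSketch`, `finish()` and the part size are assumptions made for illustration. Each part is staged in a temp file, uploaded in the background with `CompletableFuture`, and the staging file is deleted as soon as its own upload finishes rather than at the end of the stream.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;

// Hypothetical stand-in for an S3 "upload part" call; returns an ETag-like token.
interface PartSink {
  String uploadPart(int partNumber, Path stagedPart) throws IOException;
}

// Sketch only: stages bytes in fixed-size temp files and uploads each one asynchronously.
class ProgressiveUploadSketch extends OutputStream {
  private final PartSink sink;
  private final ExecutorService executor;
  private final long partSize;
  private final List<CompletableFuture<String>> pending = new ArrayList<>();

  private Path currentFile;
  private OutputStream current;
  private long currentBytes;

  ProgressiveUploadSketch(PartSink sink, ExecutorService executor, long partSize) {
    this.sink = sink;
    this.executor = executor;
    this.partSize = partSize;
  }

  @Override
  public void write(int b) throws IOException {
    if (current == null) {
      // Start a new staging file for the next part.
      currentFile = Files.createTempFile("s3-part-", ".tmp");
      current = new BufferedOutputStream(Files.newOutputStream(currentFile));
      currentBytes = 0;
    }
    current.write(b);
    currentBytes++;
    if (currentBytes >= partSize) {
      submitCurrentPart();
    }
  }

  private void submitCurrentPart() throws IOException {
    current.close(); // flushes buffered bytes to the staging file
    final Path staged = currentFile;
    final int partNumber = pending.size() + 1; // part numbers start at 1
    current = null;
    currentFile = null;
    pending.add(CompletableFuture.supplyAsync(() -> {
      try {
        return sink.uploadPart(partNumber, staged);
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      } finally {
        // Progressive cleanup: drop the staging file once this part is done.
        try {
          Files.deleteIfExists(staged);
        } catch (IOException ignored) {
          // best effort in this sketch
        }
      }
    }, executor));
  }

  // Submits the final (possibly short) part, waits for all uploads,
  // and returns the tokens in part order.
  List<String> finish() throws IOException {
    if (current != null) {
      submitCurrentPart();
    }
    List<String> tokens = new ArrayList<>();
    for (CompletableFuture<String> upload : pending) {
      tokens.add(upload.join());
    }
    return tokens;
  }
}
```

A real implementation would also have to create and complete (or abort) the multipart upload and respect S3's part-size limits; those steps are left out here on purpose.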
(9 resolved, outdated review threads on aws/src/main/java/org/apache/iceberg/aws/s3/S3OutputStream.java)
@jackye1995 I've moved this PR to draft for now. I rebased on top of your changes in #1754, but it's somewhat complicated because we now have 3 separate requests (create multipart, upload part, put object) that all require setting the properties but don't inherit from a common interface. I feel like reflection is probably the best way to simplify this, but I'll push the current state of things.
Branch updated from ba3f8e5 to c6ecb61.
Sorry, I was dealing with some other errands yesterday. Please see #1786 to see if that solves this issue. I am trying not to use reflection, but we can also switch to that approach if it is simpler.
Turns out I did something very similar here: https://github.com/apache/iceberg/pull/1767/files#diff-133c36e9cbb025f7cb44c2daac330c501d85a5a6eb44126c0eb6155f2bad7407R30
Oh yes, it looks like it is a generalization of that class. For the name, I would prefer the class to be more generic, just to avoid creating other utils for future use cases. As for reflection, I am not sure what the community guideline is here, but personally I would avoid using it, which is why that solution uses a functional interface.
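For readers following along, the functional-interface idea can be sketched roughly as below. This is an illustration under assumptions, not the code from #1786: `PutBuilderStub` and `CreateMultipartBuilderStub` are hypothetical stand-ins for two SDK request builders that expose same-named setters without sharing an interface, and `EncryptionConfigSketch.configure` and its `"kms"`/`"s3"` mode strings are invented for the example. The helper receives the setters as method references, so no reflection is needed.

```java
import java.util.function.Consumer;

// Hypothetical stand-ins for two request builders with identical setter names
// but no common interface.
class PutBuilderStub {
  String sse;
  String kmsKeyId;

  PutBuilderStub serverSideEncryption(String value) { this.sse = value; return this; }
  PutBuilderStub ssekmsKeyId(String value) { this.kmsKeyId = value; return this; }
}

class CreateMultipartBuilderStub {
  String sse;
  String kmsKeyId;

  CreateMultipartBuilderStub serverSideEncryption(String value) { this.sse = value; return this; }
  CreateMultipartBuilderStub ssekmsKeyId(String value) { this.kmsKeyId = value; return this; }
}

// Sketch only: one place that knows the encryption rules, applied through
// setters passed in as functional interfaces.
class EncryptionConfigSketch {
  static void configure(
      String mode, String kmsKey, Consumer<String> sseSetter, Consumer<String> kmsKeySetter) {
    switch (mode) {
      case "kms":
        sseSetter.accept("aws:kms");
        kmsKeySetter.accept(kmsKey);
        break;
      case "s3":
        sseSetter.accept("AES256");
        break;
      default:
        // no encryption requested: leave the builder untouched
        break;
    }
  }
}
```

Usage would look like `EncryptionConfigSketch.configure("kms", key, builder::serverSideEncryption, builder::ssekmsKeyId)` for either builder type. The trade-off is that each call site must list the setters it supports, which is also what keeps request types with fewer setters safe.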
@jackye1995 I actually stepped back from using reflection as I found out that some request types do not set all of the same parameters (e.g. UploadPartRequest and GetObjectRequest aren't exactly the same as PutObjectRequest and CreateMultipartUploadRequest). Since there are no common interfaces, it seems reasonable to just separate out the behavior for now. As we get into other instances of this (like ACL, etc.), then maybe it would make sense to find a more concise solution. I've got a few small changes to make and want to expand the testing, but hopefully I'll have something ready for review today.
The ACL support in #1788 should not be a big issue; it is only a single additional method call. I was thinking there is still a way to use reflection, using logic similar to #1786: we need to check whether the method exists when setting the value. The biggest concern I have with reflection is that I don't know how much it affects performance, since it is in the critical path for upload. Let me run some tests and come back with some data.
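The "check whether the method exists" idea could look roughly like the sketch below. Everything here is an assumption for illustration: `PutObjectBuilderStub` and `UploadPartBuilderStub` are hypothetical stand-ins for SDK builders, and `ReflectiveSetterSketch.setIfSupported` is an invented helper. Caching the `Method` lookup is one common way to reduce reflection overhead on a hot path; nothing here was measured, so it does not answer the performance question.

```java
import java.lang.reflect.Method;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-ins: the put builder supports acl(), the upload-part builder does not.
class PutObjectBuilderStub {
  String acl;
  String sseCustomerKey;

  public PutObjectBuilderStub acl(String value) { this.acl = value; return this; }
  public PutObjectBuilderStub sseCustomerKey(String value) { this.sseCustomerKey = value; return this; }
}

class UploadPartBuilderStub {
  String sseCustomerKey;

  public UploadPartBuilderStub sseCustomerKey(String value) { this.sseCustomerKey = value; return this; }
}

// Sketch only: applies a String-valued setter when the builder type has one.
class ReflectiveSetterSketch {
  // Caches both hits and misses so the lookup cost is paid once per (type, setter) pair.
  private static final Map<String, Optional<Method>> CACHE = new ConcurrentHashMap<>();

  // Returns true if the setter existed and was invoked, false if the type lacks it.
  static boolean setIfSupported(Object builder, String setterName, String value) {
    Class<?> type = builder.getClass();
    Optional<Method> setter = CACHE.computeIfAbsent(type.getName() + "#" + setterName, key -> {
      try {
        return Optional.of(type.getMethod(setterName, String.class));
      } catch (NoSuchMethodException e) {
        return Optional.empty();
      }
    });
    if (!setter.isPresent()) {
      return false;
    }
    try {
      setter.get().invoke(builder, value);
      return true;
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("Failed to call " + setterName, e);
    }
  }
}
```

With this shape, `setIfSupported(uploadPartBuilder, "acl", ...)` is a no-op that returns `false`, which covers the case of request types that do not share every parameter.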
Branch updated from 3679ea7 to ac1aac2.
Add progressive upload to S3OutputStream using multipart upload.
A few key changes are: