[BEAM-3681] Make S3FileSystem copy atomic for smaller than 5GB objects #4739
Conversation
R: @lukecwik
LGTM
Force-pushed from 5291c5d to a893068
LGTM
One small change and a question. Can you fix them and I'll merge once you have squashed. Thx
// Amazon parts are 1-indexed, not zero-indexed.
for (int partNumber = 1; bytePosition < objectSize; partNumber++) {

final long objectSize = objectMetadata.getContentLength();
if (objectSize == 0) {
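The two excerpts above are the heart of the discussion: 1-indexed part numbers and the zero-length edge case. As a minimal, self-contained sketch of the part arithmetic (the class and method names here are illustrative, not the actual S3FileSystem code, and the part size is an assumed constant):

```java
// Hypothetical sketch of multipart part counting; names are illustrative,
// not Beam's actual S3FileSystem implementation.
public class MultipartCopySketch {

  // Number of copy parts needed to cover objectSize bytes using parts of
  // at most partSize bytes. Amazon part numbers start at 1, so a loop over
  // parts would run from partNumber = 1 to countParts(...) inclusive.
  static int countParts(long objectSize, long partSize) {
    if (objectSize == 0) {
      // The edge case raised in the review: a zero-byte object has no byte
      // ranges to copy, so the multipart flow must special-case it.
      return 0;
    }
    return (int) ((objectSize + partSize - 1) / partSize); // ceiling division
  }

  public static void main(String[] args) {
    long partSize = 5L * 1024 * 1024; // assume 5 MB parts for illustration
    System.out.println(countParts(0, partSize));                // 0
    System.out.println(countParts(partSize, partSize));         // 1
    System.out.println(countParts(2 * partSize + 1, partSize)); // 3
  }
}
```

The ceiling division makes the final, possibly shorter part count as a full part, which is why the loop condition checks `bytePosition < objectSize` rather than a fixed part count.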
I think that this code will never be called. Indeed, you call multipartCopy only if objectMetadata.getContentLength() > 5GB, so in multipartCopy, objectMetadata.getContentLength() can never be 0.
Yes, if you follow the flow of the copy method this won't happen. However, the problem was detected because multipartCopy did not invoke the Amazon APIs correctly when copying an empty object. The extra validation makes multipart copy robust to files of size 0 as well, so it is worth having in case someone relies on this method directly in the future.
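The dispatch being debated here can be summarized in a tiny self-contained sketch. This is an assumption-laden illustration, not the PR's actual code: the 5 GB threshold comes from the reviewer's remark that multipartCopy is only invoked above it, and the method name `chooseCopyPath` is invented for the example:

```java
// Sketch of the copy dispatch discussed in this thread. All names are
// illustrative assumptions, not the actual S3FileSystem implementation.
public class CopyDispatchSketch {

  // S3's CopyObject operation supports objects up to 5 GB in one atomic call.
  static final long MAX_ATOMIC_COPY_BYTES = 5L * 1024 * 1024 * 1024;

  // Which path a copy of objectSize bytes would take. Per the review
  // comment, the multipart path is only taken above the 5 GB limit, so a
  // size-0 object never reaches it via copy; the guard inside multipartCopy
  // protects direct callers of that method.
  static String chooseCopyPath(long objectSize) {
    if (objectSize > MAX_ATOMIC_COPY_BYTES) {
      return "multipart"; // CopyPart calls, plus the size-0 guard
    }
    return "atomic"; // a single atomic CopyObject call
  }

  public static void main(String[] args) {
    System.out.println(chooseCopyPath(0));                       // atomic
    System.out.println(chooseCopyPath(6L * 1024 * 1024 * 1024)); // multipart
  }
}
```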
Ok, so add a comment that says something like "it is an extra validation in case a caller calls directly S3FileSystem.multipartCopy without using S3FileSystem.copy in the future"
}

@Test
public void testMultipartCopy() throws IOException {
This test asserts that only one call to the S3 client is issued. Can we add a test for a bigger object that will be split into several pieces, and assert that the number of calls to the S3 client is correct (I guess more than 0)?
This test is exactly the original test that existed before; it invokes the multipartCopy method (previously called copy) with a 'big' object of two parts (notice that we don't need a huge >5GB object to test that multipartCopy works; Amazon supports multipart copy even for really small objects). The multipartCopy method calls the AmazonS3.copyPart method twice internally, so I think this covers the case you mention via:

when(s3FileSystem.getAmazonS3Client().copyPart(argThat(notNullValue(CopyPartRequest.class))))
    .thenReturn(copyPartResult1)
    .thenReturn(copyPartResult2);
Notice that the validation at the end of the test:

verify(s3FileSystem.getAmazonS3Client(), times(1))
    .completeMultipartUpload(argThat(notNullValue(CompleteMultipartUploadRequest.class)));

guarantees that multipartCopy finishes the copy by issuing the CompleteMultipartUploadRequest. This guarantees that all the calls to the API were done (Initialize -> CopyPart* -> Complete).
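The Initialize -> CopyPart* -> Complete sequence that the mocked test verifies can also be shown with a plain recording stub, no Mockito needed. This is a hedged, self-contained sketch; the stub and method names mirror the S3 operations but are not Beam's actual code:

```java
import java.util.ArrayList;
import java.util.List;

// Self-contained sketch of the call sequence the test verifies
// (Initialize -> CopyPart* -> Complete). A recording stub stands in for
// the mocked AmazonS3 client; names are illustrative, not Beam's code.
public class MultipartCallSequenceSketch {

  static class RecordingClient {
    final List<String> calls = new ArrayList<>();
    void initiateMultipartUpload() { calls.add("initiate"); }
    void copyPart(int partNumber)  { calls.add("copyPart:" + partNumber); }
    void completeMultipartUpload() { calls.add("complete"); }
  }

  // A two-part copy issues exactly: initiate, copyPart:1, copyPart:2, complete.
  static List<String> multipartCopy(RecordingClient client, int numParts) {
    client.initiateMultipartUpload();
    // Amazon part numbers are 1-indexed.
    for (int partNumber = 1; partNumber <= numParts; partNumber++) {
      client.copyPart(partNumber);
    }
    client.completeMultipartUpload();
    return client.calls;
  }

  public static void main(String[] args) {
    List<String> calls = multipartCopy(new RecordingClient(), 2);
    System.out.println(calls); // [initiate, copyPart:1, copyPart:2, complete]
  }
}
```

Counting the recorded calls is the stub-based analogue of the `verify(..., times(1)).completeMultipartUpload(...)` check quoted above: if `complete` appears, every preceding part copy must already have been issued.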
Notice that I prefer not to squash the current two commits because they are totally unrelated, but if you want to squash them, go ahead; you can also squash them on merge.
Ok, fair enough for the multipart test, thanks.
For the squash, I prefer not to keep the first commit on its own since it is just formatting. Can you squash it with the other commit and append ", and reformat FileBasedSink" to the commit message?
LGTM modulo squash. Ping me when done; I'll merge the PR just after so that it can be in 2.4.
…s and fix wrong indentation on FileBasedSink
…System.multipartCopy
Force-pushed from a893068 to ffebd65
Run Java Gradle PreCommit