
[BEAM-3681] Make S3FileSystem copy atomic for smaller than 5GB objects #4739

Merged: 2 commits merged into apache:master on Feb 28, 2018

Conversation

@iemejia iemejia (Member) commented Feb 23, 2018

DESCRIPTION HERE


Follow this checklist to help us incorporate your contribution quickly and easily:

  • Make sure there is a JIRA issue filed for the change (usually before you start working on it). Trivial changes like typos do not require a JIRA issue. Your pull request should address just this issue, without pulling in other changes.
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue.
  • Write a pull request description that is detailed enough to understand:
    • What the pull request does
    • Why it does it
    • How it does it
    • Why this approach
  • Each commit in the pull request should have a meaningful subject line and body.
  • Run mvn clean verify to make sure basic checks pass. A more thorough check will be performed on your pull request automatically.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

@iemejia iemejia (Member, Author) commented Feb 23, 2018

R: @lukecwik
CC: @jacobmarble

@jacobmarble commented

LGTM

@aromanenko-dev (Contributor) commented

LGTM

@echauchot echauchot (Contributor) left a comment

One small change and a question. Can you fix them? I'll merge once you have squashed. Thanks.

    // Amazon parts are 1-indexed, not zero-indexed.
    for (int partNumber = 1; bytePosition < objectSize; partNumber++) {

    final long objectSize = objectMetadata.getContentLength();
    if (objectSize == 0) {
@echauchot (Contributor) commented:

I think that this code will never be called. Indeed, you call multipartCopy only if objectMetadata.getContentLength() > 5GB, so in multipartCopy objectMetadata.getContentLength() can never be 0.
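
To make the size-based dispatch concrete, here is a minimal sketch of the copy logic this thread describes, assuming the AWS SDK v1 client (com.amazonaws.services.s3) that Beam's S3FileSystem uses; the constant MAX_ATOMIC_COPY_SIZE and the surrounding method shape are illustrative, not the exact PR code:

    // Hypothetical sketch: objects up to 5 GiB are copied atomically with a
    // single CopyObject request; only larger objects go through multipartCopy.
    // amazonS3 is an AmazonS3 client; request/metadata classes come from
    // com.amazonaws.services.s3.model.
    private static final long MAX_ATOMIC_COPY_SIZE = 5L * 1024 * 1024 * 1024; // 5 GiB

    private void copy(S3ResourceId source, S3ResourceId destination) {
      ObjectMetadata metadata =
          amazonS3.getObjectMetadata(source.getBucket(), source.getKey());
      if (metadata.getContentLength() <= MAX_ATOMIC_COPY_SIZE) {
        // Atomic: one CopyObject call, so no partial state is visible on failure.
        amazonS3.copyObject(new CopyObjectRequest(
            source.getBucket(), source.getKey(),
            destination.getBucket(), destination.getKey()));
      } else {
        multipartCopy(source, destination, metadata);
      }
    }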

@iemejia (Member, Author) replied Feb 28, 2018:

Yes, if you follow the flow of the copy method this won't happen. However, the problem was detected because multipartCopy did not invoke the Amazon APIs correctly when copying an empty object. The extra validation makes multipart copy robust enough to also handle files of size 0, so it is worth having in case someone relies on this method directly in the future.

@echauchot (Contributor) replied:

OK, so add a comment that says something like "this is an extra validation in case a caller directly calls S3FileSystem.multipartCopy without using S3FileSystem.copy in the future".
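
One plausible shape for that guard, with the suggested comment inlined; the CopyPartRequest setters are real AWS SDK v1 API, but the surrounding variable names (sourcePath, destinationPath, uploadId, eTags) are assumptions, not the exact PR code:

    final long objectSize = objectMetadata.getContentLength();
    // Extra validation in case a caller directly calls S3FileSystem.multipartCopy
    // without using S3FileSystem.copy in the future.
    if (objectSize == 0) {
      // An empty object cannot be split into ranged parts, so copy it as a
      // single part with no byte range set.
      CopyPartRequest copyPartRequest = new CopyPartRequest()
          .withSourceBucketName(sourcePath.getBucket())
          .withSourceKey(sourcePath.getKey())
          .withDestinationBucketName(destinationPath.getBucket())
          .withDestinationKey(destinationPath.getKey())
          .withUploadId(uploadId)
          .withPartNumber(1);
      eTags.add(amazonS3.copyPart(copyPartRequest).getPartETag());
    }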

    }

    @Test
    public void testMultipartCopy() throws IOException {
@echauchot (Contributor) commented:

This test asserts that only one call to the S3 client is issued. Can we add a test for a bigger object that will be split into several parts, and assert that the number of calls to the S3 client is correct (I guess more than 0)?

@iemejia (Member, Author) replied Feb 28, 2018:

This test is exactly the original test that existed before; it invokes the multipartCopy method (previously called copy) with a 'big' object split into two parts (notice that we don't need a huge >5GB object to test that multipartCopy works; Amazon supports multipart copy even for really small objects). The multipartCopy method calls the AmazonS3.copyPart method twice internally, so I think this covers the case you mention via:

    when(s3FileSystem.getAmazonS3Client().copyPart(argThat(notNullValue(CopyPartRequest.class))))
         .thenReturn(copyPartResult1)
         .thenReturn(copyPartResult2);

Notice that the validation at the end of the test:

    verify(s3FileSystem.getAmazonS3Client(), times(1))
         .completeMultipartUpload(argThat(notNullValue(CompleteMultipartUploadRequest.class)));

guarantees that the multipartCopy method finishes the copy by issuing the CompleteMultipartUploadRequest, which in turn guarantees that all the API calls were made (Initiate -> CopyPart* -> Complete).
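
For reference, a minimal sketch of that Initiate -> CopyPart* -> Complete sequence against the AWS SDK v1 API; the part size and the variable names (srcBucket, dstKey, objectSize, and so on) are illustrative, not the exact Beam implementation:

    // Step 1: initiate the multipart upload and keep the upload id.
    String uploadId = amazonS3
        .initiateMultipartUpload(new InitiateMultipartUploadRequest(dstBucket, dstKey))
        .getUploadId();

    // Step 2: copy the source in ranged parts. Amazon parts are 1-indexed.
    List<PartETag> eTags = new ArrayList<>();
    long partSize = 64L * 1024 * 1024; // assumed; parts must be >= 5 MiB except the last
    long bytePosition = 0;
    for (int partNumber = 1; bytePosition < objectSize; partNumber++) {
      CopyPartRequest request = new CopyPartRequest()
          .withSourceBucketName(srcBucket)
          .withSourceKey(srcKey)
          .withDestinationBucketName(dstBucket)
          .withDestinationKey(dstKey)
          .withUploadId(uploadId)
          .withPartNumber(partNumber)
          .withFirstByte(bytePosition)
          .withLastByte(Math.min(bytePosition + partSize - 1, objectSize - 1));
      eTags.add(amazonS3.copyPart(request).getPartETag());
      bytePosition += partSize;
    }

    // Step 3: complete the upload; only now does the destination object appear,
    // which is what the times(1) verification on completeMultipartUpload checks.
    amazonS3.completeMultipartUpload(
        new CompleteMultipartUploadRequest(dstBucket, dstKey, uploadId, eTags));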

@iemejia (Member, Author) commented Feb 28, 2018:

Notice that I prefer not to squash the current two commits because they are totally unrelated, but if you want to do it, go ahead; you can squash them on merge.

@echauchot (Contributor) replied Feb 28, 2018:

OK, fair enough on the multipart copy, thanks.
For the squash, I prefer not to leave the first commit on its own, as it is just formatting. Can you squash it with the other and add ", and reformat FileBasedSink" to the commit message?

@echauchot (Contributor) commented Feb 28, 2018:

LGTM modulo the squash. Ping me when done; I'll merge the PR right after so that it can be in 2.4.

@echauchot echauchot (Contributor) commented:

Run Java Gradle PreCommit

@iemejia iemejia merged commit 7fa6292 into apache:master Feb 28, 2018
@iemejia iemejia deleted the BEAM-3681-s3-write-error branch February 28, 2018 23:18