Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BEAM-978] Support bulk get file size in GcsUtil. #1359

Closed
wants to merge 4 commits into from

Conversation

peihe
Copy link
Contributor

@peihe peihe commented Nov 15, 2016

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

  • Make sure the PR title is formatted like:
    [BEAM-<Jira issue #>] Description of pull request
  • Make sure tests pass via mvn clean verify. (Even better, enable
    Travis-CI on your fork and ensure the whole test matrix passes).
  • Replace <Jira issue #> in the title with the actual Jira issue
    number, if there is one.
  • If this contribution is large, please file an Apache
    Individual Contributor License Agreement.

@peihe
Copy link
Contributor Author

peihe commented Nov 15, 2016

R: @kennknowles @tgroh

Copy link
Member

@kennknowles kennknowles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Failures are checkstyle, FYI.

}
throw new IOException("Unable to get file size", e);
}
List<Long> fileSize(Collection<GcsPath> paths) throws IOException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Readability nit: I would prefer fileSizes. At call sites, this makes the meaning of the collection more clear.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer sizes; is files not implicit here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to fileSizes:

I feel there could be ambiguity for "GcsUtil.sizes(paths)".

@@ -483,12 +475,39 @@ public Void call() throws IOException {
Thread.currentThread().interrupt();
throw new IOException("Interrupted while executing batch GCS request", e);
} catch (ExecutionException e) {
if (e.getCause() instanceof FileNotFoundException) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to get all the exceptions from failed requests? Are they already contained in this one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, e will be the first exception from all Futures.

I am going to propose a bulk get metadata API. And, with Metadata, we can wait for all requests to finish, and store the exception in Metadata if any request fails.

MockLowLevelHttpResponse notFoundResponse = new MockLowLevelHttpResponse();
notFoundResponse.setContent("");
notFoundResponse.setStatusCode(HttpStatusCodes.STATUS_CODE_NOT_FOUND);
String content = "Content-Type: application/http\n"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test seems overspecified. It should not require any of the details, just the HTTP 404 error code and perhaps the nested {error:...} bit.

+ " \"generation\": \"1470763581808000\",\n"
+ " \"metageneration\": \"2\",\n"
+ " \"contentType\": \"application/txt\",\n"
+ " \"timeCreated\": \"2016-08-09T17:26:21.801Z\",\n"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very overspecified. Anything that could change between requests, such as creation time, etc, should not be included.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems as though we should be able to generate a response of the type we desire instead of dumping in a response string. Is that not the case?

String content) throws Exception {
MockLowLevelHttpResponse notFoundResponse = new MockLowLevelHttpResponse()
.setContentType("multipart/mixed; boundary=batch_foobarbaz")
.setContent(content)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, where you mock the low level response, I think you should either omit or fill in junk for pieces that GcsUtil should not depend on. But you shouldn't mock the low level response unless our code relies on it. Instead, mock the storageClient.

@@ -266,16 +266,16 @@ Integer getUploadBufferSizeBytes() {
* Returns the file size from GCS or throws {@link FileNotFoundException}
* if the resource does not exist.
*/
public long fileSize(GcsPath path) throws IOException {
return fileSize(ImmutableList.of(path)).get(0);
public long fileSizes(GcsPath path) throws IOException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should still be fileSize

@peihe peihe force-pushed the bulk-size branch 2 times, most recently from c95aa4e to 57b8c74 Compare November 16, 2016 04:16
Copy link
Contributor Author

@peihe peihe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to defer improving the bulk get fileSizes together with https://issues.apache.org/jira/browse/BEAM-989

Since copy() and remove() need similar tests, I will just do them together.
I propose to merge this PR. And, I will send another one to fix all tests. Otherwise, I can append a commit maybe few days later.

}
throw new IOException("Unable to get file size", e);
}
List<Long> fileSize(Collection<GcsPath> paths) throws IOException {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to fileSizes:

I feel there could be ambiguity for "GcsUtil.sizes(paths)".

@@ -483,12 +475,39 @@ public Void call() throws IOException {
Thread.currentThread().interrupt();
throw new IOException("Interrupted while executing batch GCS request", e);
} catch (ExecutionException e) {
if (e.getCause() instanceof FileNotFoundException) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, e will be the first exception from all Futures.

I am going to propose a bulk get metadata API. And, with Metadata, we can wait for all requests to finish, and store the exception in Metadata if any request fails.

@asfbot
Copy link

asfbot commented Nov 19, 2016

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/beam_PreCommit_Java_Build/4/
--none--

@peihe
Copy link
Contributor Author

peihe commented Nov 19, 2016

@kennknowles @tgroh

I need mockito 2 to mock the final class BatchRequest.
However, mockito upgrading is blocked by a bytes-buddy versions conflict.
I think it will take us for a while to get the fix.

So, I suggest to move forward with this PR and address https://issues.apache.org/jira/browse/BEAM-989 later on.

thoughts?

@kennknowles
Copy link
Member

Our use of bytebuddy should be entirely private. Can you please file a JIRA to shade it? Unfortunately, we have hit blockers in getting the SDK's private dependencies shaded by default, so we need to still just aggressively shade things right away when they cause any trouble.

@peihe
Copy link
Contributor Author

peihe commented Nov 21, 2016

Filed https://issues.apache.org/jira/browse/BEAM-1019

More issues came out when updating to mockito 2. (Looks like there are few tests depends on the internal details of mock classes.)

I am not going to jump into the rabbit hole right now, will address them one by one.

@kennknowles
Copy link
Member

Following up on my last comment: good shading by Mockito would help us, but our own shading cannot help. But we can probably upgrade bytebuddy without too much difficulty, despite backwards-incompatible changes between 1.4.x and 1.5.x

+ " }\n"
+ "}\n"
+ "\n"
+ "--batch_CnzAnRit7rQ_AAax2S5XfdM--\n";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This boundary should be in a constant, and match below.

+ "HTTP/1.1 200 OK\n"
+ "Content-Length: 708\n"
+ "\n"
+ "{\n"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does there exist a Java API to create an object that will dump this JSON, so you don't have to know and maintain the exact format?

@kennknowles
Copy link
Member

I'm OK with moving on with this since PRs to clean it up are already in place.

@asfbot
Copy link

asfbot commented Dec 5, 2016

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/5529/
--none--

@peihe
Copy link
Contributor Author

peihe commented Dec 5, 2016

PTAL

I removed the test that uses long fragile string, but kept the shorter one.
I think the test coverage is enough for current state (better than current code). I will improve the tests with mockito 2 with a separate PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants