
[BEAM-314] Add zip compression support in TextIO #400

Closed
wants to merge 4 commits

Conversation

jbonofre
Member

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

  • Make sure the PR title is formatted like:
    [BEAM-<Jira issue #>] Description of pull request
  • Make sure tests pass via mvn clean verify. (Even better, enable
    Travis-CI on your fork and ensure the whole test matrix passes).
  • Replace <Jira issue #> in the title with the actual Jira issue
    number, if there is one.
  • If this contribution is large, please file an Apache
    Individual Contributor License Agreement.

Add zip compression support in TextIO.

@davorbonaci
Member

R: @dhalperi

@@ -404,6 +406,37 @@ public void testCompressedRead() throws Exception {

@Test
@Category(NeedsRunner.class)
public void testZipCompressedRead() throws Exception {
Contributor

please add a test with an empty (but valid) zip file.

Member Author

Good idea, let me do it.

@jbonofre
Member Author

jbonofre commented Jun 6, 2016

Rebased and updated. I have to figure out the expectedException issue in the test.

if (zip.getNextEntry() != null) {
return Channels.newChannel(zip);
}
throw new IllegalArgumentException("ZIP file doesn't contain any entry");
Contributor

What will the behavior be today on a multi-entry zip? Will it silently produce bad data? Fail in some way?

Please comment, and then also add a test.

@jbonofre
Member Author

jbonofre commented Jun 8, 2016

Rebased and updated.

p.run();

// test with auto-detect ZIP based on extension.
p = TestPipeline.create();
Contributor

please make this a separate unit test -- that way they can pass or fail independently :)

@dhalperi
Contributor

dhalperi commented Jun 8, 2016

Hi JB,

This is looking pretty good!

But I have some questions about the tests. Specifically, since we mostly test empty files it seems tough to validate that the decompressor does exactly what we expect.

I've added some suggestions for improvements.

Thanks,
Dan

@jbonofre
Member Author

jbonofre commented Jun 9, 2016

Updated. I added explanations on each test.
However, I have two points:

  1. Using ZipInputStream, there's no way to get the number of entries. So, unfortunately, I don't see a good way to raise an exception when the zip stream contains multiple entries.
  2. In the testZipCompressedReadWithEmptyFile test, I got the right IllegalArgumentException but wrapped in a RuntimeException. That's why I'm doing expectedException.expect(RuntimeException.class) instead of expectedException.expect(IllegalArgumentException.class). Any idea why?
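As an aside on point 1 (an illustrative sketch, not code from this PR): the entry count is only available through java.util.zip.ZipFile, which reads the zip central directory but requires a seekable file. ZipInputStream, which works on arbitrary streams, only discovers entries one at a time via getNextEntry(), so the total is never known up front:

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

public class ZipEntryCount {

  /** Returns the number of entries by consulting the zip central directory. */
  static int countEntries(File zipFile) throws IOException {
    try (ZipFile zf = new ZipFile(zipFile)) {
      return zf.size();
    }
  }

  public static void main(String[] args) throws IOException {
    // Build a two-entry zip file purely for illustration.
    File tmp = File.createTempFile("multi", ".zip");
    try (ZipOutputStream zos = new ZipOutputStream(new FileOutputStream(tmp))) {
      zos.putNextEntry(new ZipEntry("a.txt"));
      zos.write("first\n".getBytes());
      zos.closeEntry();
      zos.putNextEntry(new ZipEntry("b.txt"));
      zos.write("second\n".getBytes());
      zos.closeEntry();
    }
    System.out.println(countEntries(tmp)); // prints 2
    tmp.delete();
  }
}
```

This works only because ZipFile can seek to the central directory at the end of the file; a TextIO source reading from a channel does not have that option.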

PCollection<String> output = p.apply(read);

PAssert.that(output).empty();
p.run();
Contributor

I'd really like to find a way that this case either concats all the files or throws an exception. The current behavior is effectively silent data loss, the worst possible case!

One way to handle this could be a utility InputStream class that wraps ZipInputStream and under the hood concats all the different entry streams. This is probably the best case.

A second possibility is that once the input stream hits EOF, we check for a next entry and only then throw an exception. But this is less desirable, as we don't fail until pretty late.

Can you look into it?
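A rough sketch of what such a wrapping InputStream might look like (hypothetical names; not the code that ultimately landed in the PR): it concatenates all entries by advancing to the next entry whenever the current one hits EOF, which also covers several empty entries in a row.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipInputStream;

/**
 * Hypothetical sketch: wraps a ZipInputStream and transparently concatenates
 * all entries, advancing to the next entry whenever the current one hits EOF.
 */
public class ConcatenatingZipInputStream extends InputStream {
  private final ZipInputStream zip;

  public ConcatenatingZipInputStream(ZipInputStream zip) throws IOException {
    this.zip = zip;
    zip.getNextEntry(); // position on the first entry (null if the zip is empty)
  }

  @Override
  public int read() throws IOException {
    int b;
    // Loop past entry boundaries (and empty entries) until data or true EOF.
    while ((b = zip.read()) == -1) {
      if (zip.getNextEntry() == null) {
        return -1; // no more entries: real end of stream
      }
    }
    return b;
  }

  @Override
  public int read(byte[] buf, int off, int len) throws IOException {
    int n;
    while ((n = zip.read(buf, off, len)) == -1) {
      if (zip.getNextEntry() == null) {
        return -1;
      }
    }
    return n;
  }

  @Override
  public void close() throws IOException {
    zip.close();
  }
}
```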

Member Author

I like your idea of wrapping the stream. Let me figure it out.

@jbonofre
Member Author

I updated the PR to use another approach: I'm using the ZipInputStream directly, handling the entries internally. It allows users to read a multi-entry zip file as if it were a single-entry one. I also updated the tests to show the "new" behavior.


/**
* Read a ZIP compressed file with multiple entries. Only the first entry is actually read.
* We expect an empty PCollection.
Contributor

update the javadoc to be correct.

@dhalperi
Contributor

Thanks JB -- looking great.

  • Please ensure that the multi-entry test is actually passing by fixing the way the writers are used.
  • Fix the javadoc with new semantics.

Other than that, I'd love to see less code duplication in the tests. This is a strong recommendation, but I won't withhold an LGTM on that basis ;).

@jbonofre
Member Author

My bad about the javadoc. I will fix that. I'm also fixing the multi-entry test. Then I will refactor a bit to use a shared method for zip file creation ;)

@jbonofre
Member Author

PR rebased and updated based on Dan's comments.

String tmpFileName = tmpFile.getPath();

ZipOutputStream out = new ZipOutputStream(new FileOutputStream(tmpFile));
PrintStream writer = new PrintStream(out);
Contributor

Please use

PrintStream writer = new PrintStream(out, true /* auto-flush on write */);

this way we can be sure that the PrintStream itself does not buffer any data. See the javadoc.
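A tiny illustration (a hypothetical sketch, not code from this PR) of what the autoFlush flag buys you: on every println, byte-array write, or '\n', the PrintStream forwards flush() to the underlying stream, so nothing lingers in its intermediate buffers when the zip entry is closed.

```java
import java.io.OutputStream;
import java.io.PrintStream;

public class AutoFlushDemo {
  /** Counts how often flush() is propagated to the underlying stream. */
  static class FlushCountingStream extends OutputStream {
    int flushes = 0;
    @Override public void write(int b) {}
    @Override public void flush() { flushes++; }
  }

  public static void main(String[] args) {
    FlushCountingStream sink = new FlushCountingStream();
    // autoFlush = true: flush() is forwarded on every println, byte-array
    // write, or '\n' written through the stream.
    PrintStream writer = new PrintStream(sink, true);
    writer.println("entry contents");
    System.out.println(sink.flushes > 0); // prints true
  }
}
```

Without the second constructor argument, println does not trigger a flush of the underlying stream; the data only reaches it when the PrintStream is flushed or closed explicitly.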

@dhalperi
Contributor

Looking great! Only trivial changes left.

@jbonofre
Member Author

Rebased and updated according to Dan's comments.

@jbonofre
Member Author

There are test failures in the DirectRunner due to the latest changes and rebase. I will fix that.

@dhalperi
Contributor

Hi JB!

One small request:

  • Please don't rebase every time. It's much easier to review if you simply add new incremental CLs to the existing pull request, as this way I can see what changed ;) (This is mentioned briefly in the contribution guide, but is not obvious and may not be standard practice everywhere).

I checked out your PR and tried to get the tests to pass -- but I was never able to make both the single-file and multi-file tests pass. I think that right now the ZipInputStream does not automatically concat all file contents; getting that behavior needs more effort. It might need a new class that wraps the ZipInputStream and manually calls getNextEntry whenever the current entry reaches EOF.

@jbonofre
Member Author

Hi Dan,

Thanks for the update and ok for the rebase (sorry about that). I'm checking and fixing the issue.

@jbonofre
Member Author

Working on a ZipInputStream wrapper (which calls getNextEntry() under the hood). Update soon.

@jbonofre
Member Author

@dhalperi It should be OK now.

currentEntry = getNextEntry();
}

public int read(byte[] b, int off, int len) throws IOException {
Contributor

There are a lot of variants of read in an InputStream. Is it obvious that this is the only variant you need to override? (May well be, I just don't know.)

Member Author

Yes, it's the only read used in the IOChannel.

Contributor

I do think you should override all applicable functions from InputStream -- implementation of IOChannel might change and/or this stream might be used in a different way.

Member

This is the only read that has to be implemented; all the others are just wrappers for this read method.

int read() is the only one you may want to handle specially, since it's very inefficient if it's not implemented directly. It's really inefficient for people to use anyway, but having an efficient implementation is useful.
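A minimal sketch of that pattern (illustrative, not the PR's class): override read(byte[], int, int) as the workhorse, which also covers read(byte[]) since InputStream delegates it to the three-argument variant, and give int read() a direct implementation so single-byte reads don't go through a temporary buffer.

```java
import java.io.IOException;
import java.io.InputStream;

/**
 * Illustrative wrapper. InputStream.read(byte[]) delegates to
 * read(byte[], int, int), so overriding the three-argument variant covers it;
 * int read() is abstract in InputStream and must be implemented anyway, and a
 * direct forward keeps it efficient.
 */
public class ForwardingInputStream extends InputStream {
  private final InputStream delegate;

  public ForwardingInputStream(InputStream delegate) {
    this.delegate = delegate;
  }

  @Override
  public int read(byte[] buf, int off, int len) throws IOException {
    // Workhorse method: bulk reads go straight to the wrapped stream.
    return delegate.read(buf, off, len);
  }

  @Override
  public int read() throws IOException {
    // Efficient single-byte read, forwarded directly.
    return delegate.read();
  }

  @Override
  public void close() throws IOException {
    delegate.close();
  }
}
```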

@dhalperi
Contributor

JB, this is awesome. One small fix to handle the multiple-EOF-in-a-row case and let's merge it. A big accomplishment!

@jbonofre
Member Author

Great feedback, guys! Much appreciated. Fixing the multi-entry case. Thanks!

@jbonofre
Member Author

Fixed the read() methods; the wrapper now extends InputStream and wraps the ZipInputStream instead of extending it.

@dhalperi
Contributor

dhalperi commented Jun 23, 2016

LGTM. Jenkins flaked on network errors, but Travis is fine. Merging.

@asfgit asfgit closed this in f2d2ce5 Jun 23, 2016
@jbonofre jbonofre deleted the BEAM-314 branch June 23, 2016 07:54
dhalperi pushed a commit to dhalperi/beam that referenced this pull request Sep 27, 2016
…pache#400)

* Add Additional Exists check to FileIOChannelFactory#create

This ensures that if the folder did not exist when first checked, but
did by the time mkdirs was executed (and thus mkdirs returned false) the
create will not fail.

* Dynamically choose number of shards in the InProcessPipelineRunner

Add a Write Override Factory that limits the number of shards if
unspecified. This ensures that we will not write an output file per-key
due to bundling.

Do so by obtaining a count of the elements and obtaining the number of
shards based on the number of outputs.
tvalentyn pushed a commit to tvalentyn/beam that referenced this pull request May 15, 2018
Abacn pushed a commit to Abacn/beam that referenced this pull request Jan 31, 2023
pl04351820 pushed a commit to pl04351820/beam that referenced this pull request Dec 20, 2023
…dev pin (apache#400)

Expand pins on library dependencies in preparation for these dependencies taking a new major version. See googleapis/google-cloud-python#10566.