
[BEAM-8564] Add LZO compression and decompression support #10254

Merged
merged 29 commits into apache:master on Feb 25, 2020

Conversation

@amoght (Contributor) commented Dec 2, 2019

LZO is a lossless data compression algorithm focused on compression and decompression speed.

This will enable the Apache Beam SDK to compress and decompress files using the LZO/LZOP compression algorithms.

This includes the following functionalities:

  1. Appropriate input and output streams to enable working with LZO/LZOP files.
  2. Compression using the LZO/LZOP compression algorithm in Apache Beam.
  3. Decompression using the LZO/LZOP decompression algorithm in Apache Beam.
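As an analogy for item 1, the JDK's deflate streams show the wrapping pattern such compression streams follow (a sketch only; the actual PR wires aircompressor's LZO streams into Beam, which are not shown here):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

public class StreamWrapDemo {
    /** Compress by wrapping the destination, decompress by wrapping the source. */
    public static String roundTrip(String text) throws IOException {
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        // Writing through the wrapper compresses transparently.
        try (OutputStream out = new DeflaterOutputStream(compressed)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        ByteArrayOutputStream plain = new ByteArrayOutputStream();
        // Reading through the wrapper decompresses transparently.
        try (InputStream in =
                new InflaterInputStream(new ByteArrayInputStream(compressed.toByteArray()))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                plain.write(buf, 0, n);
            }
        }
        return new String(plain.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

The LZO/LZOP streams in the PR play the same role as DeflaterOutputStream/InflaterInputStream here: callers keep writing and reading plain bytes while the wrapper handles the codec.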

Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make the review process smoother.

Post-Commit Tests Status (on master branch)

[Build-status badge grid omitted: rows Go, Java, Python, XLang against columns SDK, Apex, Dataflow, Flink, Gearpump, Samza, Spark.]

Pre-Commit Tests Status (on master branch)

[Build-status badge grid omitted: rows Non-portable, Portable against columns Java, Python, Go, Website.]

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

@lukecwik (Member) left a comment

Why the Go SDK changes?

@@ -90,4 +90,6 @@ dependencies {
shadowTest library.java.avro_tests
shadowTest library.java.zstd_jni
testRuntimeOnly library.java.slf4j_jdk14
compile 'io.airlift:aircompressor:0.16'
Member

This library requires a Java 1.8+ virtual machine containing the sun.misc.Unsafe interface running on a little-endian platform. Is there a different implementation we could use?

It also depends on the library below, which is 21 MB.

Contributor Author

The other implementations available are either not pure Java (they contain JNI, .c, and .h files) or have licensing issues.
This one solves both problems, since it is implemented in pure Java and is under the Apache License.
I am not aware of any other pure-Java implementation under the Apache License.

@lukecwik (Member) Dec 3, 2019

I asked on your original review request email thread if there were any alternative suggestions and hopefully the community may provide some suggestions. LZO implementations that contain C code would be ok if they ran on the three most popular platforms (Linux, Mac, Windows).

@@ -152,6 +156,38 @@ public WritableByteChannel writeCompressed(WritableByteChannel channel) throws I
}
},

/** LZO compression. */
LZO(".lzo", ".lzo") {
Member

Should this use the .lzo_deflate extension for suggested and detected extensions?

Contributor Author

Thanks for pointing this out. I have made the appropriate changes on this.

@@ -152,6 +156,38 @@ public WritableByteChannel writeCompressed(WritableByteChannel channel) throws I
}
},

/** LZO compression. */
Member

It may not be obvious to users that the difference between LZO and LZOP is that one has headers and one is just the LZO algorithm. Please expand on this comment and the one below.
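The distinction could be spelled out along these lines (an illustrative, hypothetical sketch; the enum name, field, and wording are mine, not Beam's actual Compression enum):

```java
/** Hypothetical sketch of the expanded documentation for the two LZO variants. */
enum LzoVariant {
    /**
     * Raw LZO: just the stream of compressed blocks produced by the LZO algorithm,
     * with no file header, so the data cannot identify itself as LZO.
     * Conventionally uses the ".lzo_deflate" extension.
     */
    LZO(".lzo_deflate"),

    /**
     * LZOP: the lzop tool's framing around LZO data, which adds a header with
     * magic bytes, flags, and checksums. Conventionally uses the ".lzo" extension.
     */
    LZOP(".lzo");

    final String extension;

    LzoVariant(String extension) {
        this.extension = extension;
    }
}
```

The point of the comment is that LZOP is a container format around the bare LZO algorithm, which is why the two get different file extensions.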

Contributor Author

Agreed. I have added comments for both modules (LZO and LZOP). This should make things clear.

@lukecwik lukecwik changed the title [Beam-8564] Add LZO compression and decompression support [BEAM-8564] Add LZO compression and decompression support Dec 2, 2019
@lukecwik (Member) commented Dec 2, 2019

R: @lukecwik

@amoght amoght requested a review from lukecwik December 4, 2019 08:08
Comment on lines 90 to 93
if (len == 0) {
return 0;
}
final int ret = lzoIS.read(buf, off, len);


What is the reason for not delegating the call in the case of 0 length? Does lzoIS.read() not handle that case cleanly?

Contributor Author

No, that case is handled. The check exists so that when the requested length is 0, read() returns immediately without delegating; basically, to avoid unnecessary method-call overhead.
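A minimal sketch of the guard being discussed (a hypothetical wrapper for illustration, not the PR's actual class; the delegatedReads counter exists only to make the short-circuit observable):

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Input stream that short-circuits zero-length reads before delegating. */
public class ZeroLenGuardStream extends FilterInputStream {
    int delegatedReads = 0;  // demo-only: counts calls that reached the delegate

    public ZeroLenGuardStream(InputStream in) {
        super(in);
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;  // short-circuit: skip the delegate call entirely
        }
        delegatedReads++;
        return in.read(buf, off, len);
    }
}
```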

@@ -90,4 +90,6 @@ dependencies {
shadowTest library.java.avro_tests
shadowTest library.java.zstd_jni
testRuntimeOnly library.java.slf4j_jdk14
compile 'io.airlift:aircompressor:0.16'
compile 'com.facebook.presto.hadoop:hadoop-apache2:3.2.0-1'


LZO itself should have no dependency on anything related to Hadoop, Presto, or Facebook.

  1. Why do we need to include this?
  2. If aircompressor really needs this, why does it need it?

Contributor Author

This is included because the LzoCodec class used to create the input and output streams uses some classes from the org.apache.hadoop package, which is part of com.facebook.presto.hadoop.
Since aircompressor is designed to also support optional Hadoop configurations, Hadoop comes into the picture (in our case, the Hadoop config is null).

@@ -161,6 +189,28 @@ public void testGzipSplittable() throws Exception {
assertFalse(source.isSplittable());
}

/** Test splittability of files in LZO mode -- none should be splittable. */
@Test
public void testLzoSplittable() throws Exception {


Please add a similar test for testLzopSplittable().

Contributor Author

Thanks for pointing this out, this has been added.

Comment on lines +321 to +322
* <p>A concatenation of lzo files as one file is a valid lzo file and should decompress to be the
* concatenation of those individual files.


I think we need an enclosing </p>, or we can simply remove the <p> tag. Either way, can you please ensure testReadConcatenatedGzip() and testReadMultiStreamBzip2() follow the same Javadoc format?

@amoght (Contributor Author) Dec 5, 2019

This is happening when we run the spotlessApply task. When the <p> tag is closed (i.e. a </p> is added), the spotlessCheck fails. Not sure of the reason behind that.

Member

The closing </p> tag isn't needed in Javadoc even if your editor is inserting it.

* concatenation of those individual files.
*/
@Test
public void testReadConcatenatedLzo() throws IOException {


Please add a testReadConcatenatedLzop() as well if it is applicable. Not sure if it is due to headers.

Contributor Author

The current behaviour of the LZOP codec is that, when concatenated files are given, it returns the contents of the first file only because of the presence of headers. This causes the test to fail, which is why we have not added this test.


Perhaps it would be a good idea to add a test with an expected failure then?

Contributor Author

We have added this in this update.
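For background on the multi-member semantics under test, the JDK's gzip streams exhibit the behavior the concatenation tests expect; this sketch uses java.util.zip as a stand-in for the LZO codec (concatenating two independently compressed members should decompress to the concatenation of the originals):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class ConcatDemo {
    /** Gzip-compress a byte array into a single gzip member. */
    public static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.toByteArray();
    }

    /** Concatenate two independently compressed members, then decompress the whole. */
    public static String concatAndDecompress(String a, String b) throws IOException {
        ByteArrayOutputStream concat = new ByteArrayOutputStream();
        concat.write(gzip(a.getBytes(StandardCharsets.UTF_8)));
        concat.write(gzip(b.getBytes(StandardCharsets.UTF_8)));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // GZIPInputStream reads across member boundaries, yielding "a" + "b".
        try (GZIPInputStream in =
                new GZIPInputStream(new ByteArrayInputStream(concat.toByteArray()))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

This is the behavior the raw-LZO tests assert; LZOP, as discussed above, stops after the first member because of its per-file headers, which is what the expected-failure test captures.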

Member

Can we either add support for multistream or throw an exception if the stream isn't finished?

It would be dangerous for users to have part of their data silently dropped in this scenario. We should also add to the comment that concatenated streams aren't supported.
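The safeguard suggested above could look roughly like this (a hypothetical helper, not code from the PR; the stream arguments and message are illustrative):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class TrailingDataCheck {
    /**
     * Drain the decompressed view, then verify the raw stream is also exhausted.
     * If trailing bytes remain, a concatenated stream would otherwise be
     * silently dropped, so fail loudly instead.
     */
    public static byte[] readFully(InputStream decompressed, InputStream raw)
            throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = decompressed.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        if (raw.read() != -1) {
            throw new IOException(
                "Unconsumed trailing data: concatenated streams are not supported");
        }
        return out.toByteArray();
    }
}
```

Throwing here trades a hard failure for silent data loss, which is the safer default for a data-processing SDK.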

Comment on lines +377 to +378
* <p>A lzo file may contain multiple streams and should decompress as the concatenation of those
* streams.
@gsteelman Dec 4, 2019

I think we need an enclosing </p>, or we can simply remove the <p> tag. Either way, please be consistent with testReadConcatenatedLzo().

@amoght (Contributor Author) Dec 5, 2019

This is happening due to spotlessApply.

assertThat(readerOrig, instanceOf(CompressedReader.class));
CompressedReader<Byte> reader = (CompressedReader<Byte>) readerOrig;
// before starting
assertEquals(0.0, reader.getFractionConsumed(), 1e-6);


I see 1e-6 is used by many of the test cases as a threshold. Should it be factored out into a constant?

Contributor Author

It can be done, but that would require altering all the tests that use this constant value. Will that be fine if we do that?


I think we can add the constant for CompressedSourceTest.java at least.

Contributor Author

We have added this in the update.

@gsteelman left a comment

Thank you for opening this PR; happy to work with you to get these changes in.

@amoght (Contributor Author) commented Dec 10, 2019

While studying the code, we found that the airlift/aircompressor library only requires some classes that are also present in the Apache Hadoop Common package (~3.9 MB). Therefore, we are now thinking of making changes in the airlift/aircompressor package: replacing com.facebook.presto.hadoop with org.apache.hadoop.common and removing the other compression mechanisms present in airlift/aircompressor (like zstd, gzip, etc.), keeping only the required LZO package.
But if we go ahead with this approach, we will have to manually update this library whenever any changes are made to airlift/aircompressor's LZO package.
@lukecwik @gsteelman please provide your thoughts on this.

@gsteelman

Is it possible to instead add the dependencies on the org.apache.hadoop.common package directly in these changes, and not add a dependency on airlift/aircompressor in this change? I would prefer to stick with direct dependencies when possible, rather than relying on transitive dependencies to bring in the classes we need.

Relying on the transitive dependencies brought in by airlift/aircompressor has its own set of issues, including having to update our libraries whenever changes are made to airlift.

@amoght (Contributor Author) commented Dec 13, 2019

@gsteelman we have used the airlift/aircompressor library only to get the compression and decompression mechanism; the input/output stream implementation there introduces the transitive dependency, which can be removed and replaced with the Apache Hadoop Common library. This significantly reduces the size as well.
So, here are the two possible options:

  1. We use only the compression and decompression mechanism from airlift/aircompressor and design the input/output streams for Beam accordingly. These will need to be updated if those classes change on airlift/aircompressor's end, but since we will only be using the compression and decompression mechanism, the updates will be small and quite rare, so this won't be a big issue.
  2. We introduce LZO as an optional package for Beam. This will give users the option to manage their Beam size (if that is a constraint) or to omit LZO if it is not required.

@gsteelman

@amoght I don't have enough context to make the call on that, as I am very new to Beam. I have reached out to some others at Twitter to also review this change, as they will have more context.

@amoght (Contributor Author) commented Dec 16, 2019

Thanks Gary :) appreciate your help!

@amoght (Contributor Author) commented Jan 9, 2020

@lukecwik I've updated the PR based on the discussion that we had. Please let me know your thoughts and suggestions.

amoght and others added 10 commits February 20, 2020 15:46
…n.java

Co-Authored-By: Lukasz Cwik <lcwik@google.com>
…n.java

Co-Authored-By: Lukasz Cwik <lcwik@google.com>
…n.java

Co-Authored-By: Lukasz Cwik <lcwik@google.com>
…n.java

Co-Authored-By: Lukasz Cwik <lcwik@google.com>
…n.java

Co-Authored-By: Lukasz Cwik <lcwik@google.com>
…n.java

Co-Authored-By: Lukasz Cwik <lcwik@google.com>
…n.java

Co-Authored-By: Lukasz Cwik <lcwik@google.com>
…n.java

Co-Authored-By: Lukasz Cwik <lcwik@google.com>
beam merge 20/02/2020 3:54 PM
@amoght (Contributor Author) commented Feb 20, 2020

After committing the comments, you may need to run spotlessApply again.

Done

@lukecwik (Member)

retest this please

@lukecwik (Member)

retest this please

@lukecwik (Member)

Run JavaPortabilityApi PreCommit

@lukecwik (Member)

Run Java_Examples_Dataflow PreCommit

@lukecwik (Member)

Run Java PreCommit

@lukecwik (Member)

Run JavaPortabilityApi PreCommit

@lukecwik (Member)

Run Java_Examples_Dataflow PreCommit

@shubham-srivastav (Contributor) commented Feb 24, 2020

@lukecwik Do we need to add a test dependency for facebook-presto and airlift in /beam/examples/java/build.gradle?

@lukecwik (Member)

WordCount doesn't depend on using LZO so it shouldn't be a dependency and the pipeline should execute successfully without it. The test may be picking up a legitimate case which users would hit as well.

@shubham-srivastav (Contributor)

@lukecwik We observed that replacing the Compression I/O streams with java.io I/O streams in LzoCompression.java can resolve the issue. Should we go ahead and do that?

@lukecwik (Member) commented Feb 24, 2020

That sounds great. I should have caught that earlier.

@lukecwik (Member)

retest this please

@lukecwik (Member) left a comment

will merge when tests are green

provided some minor suggestions

@@ -83,7 +82,7 @@
@RunWith(JUnit4.class)
public class CompressedSourceTest {

private final double DELTA = 1e-6;
private final double delta = 1e-6;
Member

nit: you should have declared this static and kept the capital letters instead of making it a member variable of CompressedSourceTest
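The suggested form would be roughly the following (a sketch; the class name and the approxEquals helper are illustrative, not from the PR, which uses the constant directly in assertEquals calls):

```java
public class DeltaConstantDemo {
    /** Shared tolerance for floating-point comparisons, static and uppercase as suggested. */
    private static final double DELTA = 1e-6;

    /** Compare two doubles within the shared tolerance. */
    public static boolean approxEquals(double expected, double actual) {
        return Math.abs(expected - actual) <= DELTA;
    }
}
```

A static final field is the Java convention for a compile-time constant: no per-instance storage, and the ALL_CAPS name signals that it is a constant rather than mutable state.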

amoght and others added 3 commits February 25, 2020 23:37
…ession.java

Co-Authored-By: Lukasz Cwik <lcwik@google.com>
…ession.java

Co-Authored-By: Lukasz Cwik <lcwik@google.com>
…ession.java

Co-Authored-By: Lukasz Cwik <lcwik@google.com>
@lukecwik lukecwik merged commit db6eff8 into apache:master Feb 25, 2020

6 participants