
HADOOP-17124. Support LZO Codec using aircompressor #2159

Open · wants to merge 8 commits into trunk

Conversation

@dbtsai (Member) commented Jul 20, 2020

@sunchao (Member) left a comment:

Thanks @dbtsai. Overall looks good; I left a few comments. Also, there are a lot of style issues that need to be addressed. Please check "checkstyle" in the CI result. For instance, Hadoop limits line width to 80 chars.

* limitations under the License.
*/

package com.hadoop.compression.lzo;
Member:

Hmm, why do we need this bridging class in the Hadoop repo when the class is from the hadoop-lzo library?

Member Author (@dbtsai):

Historically, Hadoop included LZO until it was removed in HADOOP-4874 due to GPL licensing concerns. The GPL LZO codec was then maintained as a separate project at https://github.com/twitter/hadoop-lzo with a new codec class, com.hadoop.compression.lzo.LzoCodec. In a Hadoop sequence file, the first few bytes of the file include the class name of the compression codec used when writing, and Hadoop uses this information to pick the right codec to read the data. As a result, we need to bridge it in order to enable Hadoop to read old LZO-compressed data.
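
For illustration, a minimal sketch (not code from this PR) of how the recorded codec class surfaces at read time; constructing the reader fails if the recorded class cannot be loaded:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;

    // Print the codec class recorded in a SequenceFile's header. If that
    // class (e.g. com.hadoop.compression.lzo.LzoCodec) is not on the
    // classpath, the reader cannot be constructed, which is why the
    // bridge class must keep the old name.
    public class ShowCodec {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (SequenceFile.Reader reader = new SequenceFile.Reader(
            conf, SequenceFile.Reader.file(new Path(args[0])))) {
          System.out.println("codec recorded in header: "
              + reader.getCompressionCodec().getClass().getName());
        }
      }
    }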

Member Author (@dbtsai):

@steveloughran this might answer your "Not sure why the com.hadoop classes are there at all."

Contributor:

OK. In that case the classes should have the reason explained, but tag the new classes as @Deprecated.
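
Something like this sketch (assumed javadoc wording, not the PR's exact code):

    package com.hadoop.compression.lzo;

    /**
     * Bridge class kept under the old hadoop-lzo name so that
     * SequenceFiles which recorded com.hadoop.compression.lzo.LzoCodec
     * in their header remain readable. New code should use
     * org.apache.hadoop.io.compress.LzoCodec instead.
     */
    @Deprecated
    public class LzoCodec extends org.apache.hadoop.io.compress.LzoCodec {
    }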

@Override
public Class<? extends Compressor> getCompressorType()
{
    return LzopCodec.HadoopLzopCompressor.class;
Member:

I see that createCompressor returns HadoopLzoCompressor(). Should we keep these two in sync?

Member Author (@dbtsai):

Good catch. Ideally yes, but HadoopLzoCompressor is private in airlift, so I cannot easily do it. I added a new override:

    @Override
    public Compressor createCompressor()
    {
        return new HadoopLzopCompressor();
    }

to keep them in sync, and I will try to work with aircompressor to move this to their codebase.

@steveloughran (Contributor) left a comment:

This is a bit beyond my area of expertise. All I know about codecs tends to be related to stack traces.

We cannot add any new jars to the classpath of hadoop-common unless they are absolutely essential. See #1948 as an example of the problems we've hit there.

This means that

  • the codec needs to use reflection to load classes, and fail with some useful message (or at least have the docs mention it in troubleshooting); see the sketch after this list
  • or a compile-time import of aircompressor, tagged as provided, and then requiring people who want to use this codec to add the JAR and any (documented) dependencies.
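
A rough sketch of the reflection option (a hypothetical helper, not code from this PR; io.airlift.compress.lzo.LzoCodec is aircompressor's Hadoop-compatible LZO codec class):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.util.ReflectionUtils;

    final class AirliftLzoLoader {
      private static final String AIRLIFT_LZO =
          "io.airlift.compress.lzo.LzoCodec";

      // Load the codec reflectively so hadoop-common compiles and runs
      // without aircompressor on the classpath, and fail with a useful
      // message when the JAR is missing.
      static CompressionCodec load(Configuration conf) {
        try {
          Class<?> clazz = Class.forName(AIRLIFT_LZO);
          return (CompressionCodec) ReflectionUtils.newInstance(clazz, conf);
        } catch (ClassNotFoundException e) {
          throw new IllegalStateException("LZO codec requested but the"
              + " aircompressor JAR is not on the classpath; add"
              + " io.airlift:aircompressor to use it", e);
        }
      }
    }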

Other than this

  • to use a codec you have to explicitly ask for it, correct? Which means that people need to know of it. Wherever codecs are documented, this will have to be added (see the config snippet after this list).
  • Not sure why the com.hadoop classes are there at all.
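
On the first point: codecs are typically registered via the io.compression.codecs key, so the documentation would show something like this core-site.xml sketch (class list abbreviated):

    <property>
      <name>io.compression.codecs</name>
      <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.LzoCodec,com.hadoop.compression.lzo.LzoCodec</value>
    </property>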

@@ -1727,6 +1728,11 @@
      <artifactId>jna</artifactId>
      <version>${jna.version}</version>
    </dependency>
    <dependency>
Contributor:

Does this add any transitive dependencies?

Contributor:

Looks okay to me: https://mvnrepository.com/artifact/io.airlift/aircompressor/0.16
No compile-time dependencies. A shaded Hadoop 2 jar as a provided dependency.

What's tricky is that it has a test-scope dependency on jmh-core, which is GPL 2.0.

Member Author (@dbtsai):

Since we don't depend on the test scope, I think we should be fine.

Member Author (@dbtsai):

@steveloughran we would actually like to bundle this jar into hadoop-common, since it is a very clean jar and is used in many projects, such as Presto and ORC, to provide their compression codecs.

In fact, we would like Snappy to fall back to aircompressor if no native lib is provided. For many of our devs, it's non-trivial to set up native libs in their development environment, since it requires compiling many native libs from source and installing them into the LD path. If we can fall back to the pure-Java Snappy implementation in aircompressor, it will make developers' lives much easier.
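
A minimal sketch of that fallback (assuming aircompressor's io.airlift.compress.snappy.SnappyCodec, which implements Hadoop's CompressionCodec interface; an illustration, not the PR's code):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.util.NativeCodeLoader;
    import org.apache.hadoop.util.ReflectionUtils;

    final class SnappyCodecChooser {
      // Prefer the JNI-backed codec when the native Hadoop libs are
      // loaded; otherwise fall back to aircompressor's pure-Java
      // implementation so devs don't have to build native libs locally.
      static CompressionCodec choose(Configuration conf) {
        Class<? extends CompressionCodec> impl =
            NativeCodeLoader.isNativeCodeLoaded()
                ? org.apache.hadoop.io.compress.SnappyCodec.class
                : io.airlift.compress.snappy.SnappyCodec.class;
        return ReflectionUtils.newInstance(impl, conf);
      }
    }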

Contributor:

My point was that aircompressor has a non-optional dependency on a GPL artifact, which would make it GPL as well.
That said, I just verified jmh-core is not in the Hadoop dependency tree after the patch.
We should talk to the aircompressor developers to clarify the license issue.
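
For the record, one way to double-check that from the source tree (a standard maven-dependency-plugin invocation):

    mvn -pl hadoop-common-project/hadoop-common dependency:tree -Dincludes=org.openjdk.jmh

An empty result confirms jmh-core does not leak into the hadoop-common dependency tree.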

Member Author (@dbtsai):

I would like to mention that other projects such as Presto and Apache Spark all include this as a dependency, so it would be great to get the clarification, since it's heavily used in the Apache community.

import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.Compressor;

import org.apache.commons.logging.Log;
Contributor:

Use SLF4J
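
i.e. a sketch of the requested change:

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class LzoCodec extends org.apache.hadoop.io.compress.LzoCodec {
      // SLF4J instead of commons-logging, per Hadoop convention
      private static final Logger LOG =
          LoggerFactory.getLogger(LzoCodec.class);
    }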

Member Author (@dbtsai):

Addressed. Thanks.

public class LzoCodec extends org.apache.hadoop.io.compress.LzoCodec {
  private static final Log LOG = LogFactory.getLog(LzoCodec.class);

  static final String gplLzoCodec = LzoCodec.class.getName();
Contributor:

private scope

Member Author (@dbtsai):

Addressed. Thanks.

@Override
public CompressionOutputStream createOutputStream(OutputStream out,
    Compressor compressor) throws IOException {
  if (!warned) {
Contributor:

There's a risk of more than one warning in a multithreaded env, but not something I'm worried about. AtomicBoolean would be the purist way to do it.
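
A sketch of the AtomicBoolean variant (field name assumed; not the PR's exact code):

    package com.hadoop.compression.lzo;

    import java.io.IOException;
    import java.io.OutputStream;
    import java.util.concurrent.atomic.AtomicBoolean;

    import org.apache.hadoop.io.compress.CompressionOutputStream;
    import org.apache.hadoop.io.compress.Compressor;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    @Deprecated
    public class LzoCodec extends org.apache.hadoop.io.compress.LzoCodec {
      private static final Logger LOG =
          LoggerFactory.getLogger(LzoCodec.class);

      // compareAndSet wins exactly once, so the warning is logged exactly
      // once even when several threads create output streams concurrently.
      private static final AtomicBoolean WARNED = new AtomicBoolean(false);

      @Override
      public CompressionOutputStream createOutputStream(OutputStream out,
          Compressor compressor) throws IOException {
        if (WARNED.compareAndSet(false, true)) {
          LOG.warn("{} is deprecated; use {} instead.",
              LzoCodec.class.getName(),
              org.apache.hadoop.io.compress.LzoCodec.class.getName());
        }
        return super.createOutputStream(out, compressor);
      }
    }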

Member Author (@dbtsai):

Addressed. Thanks.

 */
@DoNotPool
static class HadoopLzopCompressor
    implements Compressor
Contributor:

K&R / Java { placement, please.
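
For reference, the two styles on the declaration from the excerpt above:

    // as submitted (Allman / airlift style):
    static class HadoopLzopCompressor
        implements Compressor
    {
    }

    // K&R, as Hadoop style expects:
    static class HadoopLzopCompressor implements Compressor {
    }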

Member Author (@dbtsai):

Done. I'm using IntelliJ with the default Java formatter. Does Hadoop provide a code-style formatter that IntelliJ can use? Thanks.

Contributor:

'fraid not.

* limitations under the License.
*/

package com.hadoop.compression.lzo;
Contributor:

com.hadoop?

Member:

I think this is to offer a bridge for those who are using the hadoop-lzo library.

}

/**
 * No Hadoop code seems to actually use the compressor, so just return a dummy one so the createOutputStream method
Contributor:

does anyone know about downstream uses?

Member:

I only know that Presto uses the Lzop codec. Maybe we can skip this in this PR and focus on LzoCodec, @dbtsai?

Member Author (@dbtsai):

The test code calls createCompressor without using the actual implementation. Without this, the test will not pass.

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 25m 53s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ trunk Compile Tests _
+0 🆗 mvndep 0m 25s Maven dependency ordering for branch
+1 💚 mvninstall 18m 59s trunk passed
+1 💚 compile 20m 34s trunk passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1
+1 💚 compile 17m 33s trunk passed with JDK Private Build-1.8.0_252-8u252-b09-1~18.04-b09
+1 💚 checkstyle 2m 37s trunk passed
+1 💚 mvnsite 1m 59s trunk passed
+1 💚 shadedclient 19m 2s branch has no errors when building and testing our client artifacts.
+1 💚 javadoc 1m 5s trunk passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1
+1 💚 javadoc 2m 3s trunk passed with JDK Private Build-1.8.0_252-8u252-b09-1~18.04-b09
+0 🆗 spotbugs 2m 11s Used deprecated FindBugs config; considering switching to SpotBugs.
+0 🆗 findbugs 0m 27s branch/hadoop-project no findbugs output file (findbugsXml.xml)
_ Patch Compile Tests _
+0 🆗 mvndep 0m 26s Maven dependency ordering for patch
+1 💚 mvninstall 1m 9s the patch passed
+1 💚 compile 20m 43s the patch passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1
+1 💚 javac 20m 43s the patch passed
+1 💚 compile 17m 27s the patch passed with JDK Private Build-1.8.0_252-8u252-b09-1~18.04-b09
+1 💚 javac 17m 27s the patch passed
-0 ⚠️ checkstyle 3m 25s root: The patch generated 121 new + 142 unchanged - 1 fixed = 263 total (was 143)
+1 💚 mvnsite 2m 3s the patch passed
+1 💚 whitespace 0m 0s The patch has no whitespace issues.
+1 💚 xml 0m 2s The patch has no ill-formed XML file.
+1 💚 shadedclient 14m 1s patch has no errors when building and testing our client artifacts.
+1 💚 javadoc 1m 13s the patch passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1
+1 💚 javadoc 2m 9s the patch passed with JDK Private Build-1.8.0_252-8u252-b09-1~18.04-b09
+0 🆗 findbugs 0m 35s hadoop-project has no data from findbugs
-1 ❌ findbugs 2m 22s hadoop-common-project/hadoop-common generated 2 new + 0 unchanged - 0 fixed = 2 total (was 0)
_ Other Tests _
+1 💚 unit 0m 33s hadoop-project in the patch passed.
-1 ❌ unit 9m 24s hadoop-common in the patch failed.
-1 ❌ asflicense 0m 55s The patch generated 4 ASF License warnings.
189m 55s
Reason Tests
FindBugs module:hadoop-common-project/hadoop-common
The class name com.hadoop.compression.lzo.LzoCodec shadows the simple name of the superclass org.apache.hadoop.io.compress.LzoCodec. At LzoCodec.java:[lines 34-51]
The class name com.hadoop.compression.lzo.LzopCodec shadows the simple name of the superclass org.apache.hadoop.io.compress.LzopCodec. At LzopCodec.java:[lines 34-51]
Failed junit tests hadoop.io.file.tfile.TestTFileLzoCodecsByteArrays
Subsystem Report/Notes
Docker ClientAPI=1.40 ServerAPI=1.40 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2159/6/artifact/out/Dockerfile
GITHUB PR #2159
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient xml findbugs checkstyle
uname Linux 5a32de837f7c 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality personality/hadoop.sh
git revision trunk / ed3ab4b
Default Java Private Build-1.8.0_252-8u252-b09-1~18.04-b09
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_252-8u252-b09-1~18.04-b09
checkstyle https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2159/6/artifact/out/diff-checkstyle-root.txt
findbugs https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2159/6/artifact/out/new-findbugs-hadoop-common-project_hadoop-common.html
unit https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2159/6/artifact/out/patch-unit-hadoop-common-project_hadoop-common.txt
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2159/6/testReport/
asflicense https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2159/6/artifact/out/patch-asflicense-problems.txt
Max. process+thread count 1427 (vs. ulimit of 5500)
modules C: hadoop-project hadoop-common-project/hadoop-common U: .
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2159/6/console
versions git=2.17.1 maven=3.6.0 findbugs=4.0.6
Powered by Apache Yetus 0.13.0-SNAPSHOT https://yetus.apache.org

This message was automatically generated.

@apache deleted 6 comments from hadoop-yetus Aug 10, 2020
@steveloughran (Contributor) commented:

I can see why adding this to hadoop-common appeals in terms of ease of redistribution, but I want it in its own module, such as hadoop-tools/hadoop-compression:

  • hadoop-common gets everywhere, so a new JAR will too, potentially causing issues downstream.
  • if it is in its own module, we can be 100% sure that changes in that module will not have any side effects in any part of the tree which does not import it.
  • If you followed up this patch with some backports of the module to 3.1.x and 3.2.x, then again it would be nicely isolated.

There is a cost to this: you don't get it without changing your pom/sbt/gradle dependencies. We could think about including a stub hadoop-compression module even in those releases which don't get a backport.
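
Concretely, downstream builds would need an entry like this (artifactId hypothetical until the module exists):

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-compression</artifactId>
      <version>${hadoop.version}</version>
    </dependency>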

There is a variant option: add it to hadoop-extras, which has been around for a long time and currently doesn't do much. Add it there, and while downstream apps have to extend their pom, everything will still build against older versions if they do that?

The more I think about that, the more I like it.

@dbtsai (Member Author) commented Sep 15, 2020

cc @viirya

@steveloughran We are thinking the same. For example, if we add a hadoop-compression module and move all the codecs, such as Snappy, LZ4, gzip, and bzip2, there, other projects such as Parquet can depend on it directly without reimplementing their compression codecs. Many projects reimplement the codecs against the same interface, which we would be able to avoid by moving them into a separate module.

@steveloughran
Copy link
Contributor

hadoop-compression would be cleanest. I was looking at hadoop-extras as it is already bundled everywhere, so if a project declares a dependency on it, it will still build against older releases, just lacking the new codecs.

@steveloughran
Copy link
Contributor

OK, so what to do?

  • tools/hadoop-compression would be cleanest; it could be backported as a module to 3.2.x
  • the snappy binding & tests would go there

@dbtsai
Copy link
Member Author

dbtsai commented Sep 24, 2020

@steveloughran +1 on tools/hadoop-compression, and we will work on it.

Since #2297 is almost ready to merge, is it okay if we merge it first and then work on creating tools/hadoop-compression?

cc @viirya

@steveloughran
Copy link
Contributor

@dbtsai yes, let's do Snappy first.
