
HADOOP-17124. Support LZO Codec using aircompressor #2159

Open · wants to merge 8 commits into trunk

Conversation

@dbtsai (Member) commented Jul 20, 2020

@sunchao (Member) left a comment:

Thanks @dbtsai. Overall looks good; I left a few comments. Also, there are a lot of style issues that need to be addressed. Please check "checkstyle" in the CI result. For instance, Hadoop limits line width to 80 chars.

* limitations under the License.
*/

package com.hadoop.compression.lzo;
Member:

Hmm, why do we need this bridging class in the Hadoop repo when the class is from the hadoop-lzo library?

Member Author (@dbtsai):

Historically, Hadoop included LZO until it was removed in HADOOP-4874 due to GPL licensing concerns. The GPL LZO codec was then maintained as a separate project at https://github.com/twitter/hadoop-lzo with a new codec class, com.hadoop.compression.lzo.LzoCodec. In a Hadoop sequence file, the first few bytes of the file include the class name of the compression codec used when writing, and Hadoop uses this information to pick the right codec to read the data. As a result, we need to bridge it in order to enable Hadoop to read old LZO-compressed data.
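
For illustration, a minimal sketch (not code from this PR) of how the recorded codec class surfaces at read time; constructing the reader fails if the recorded class cannot be loaded:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;

    // Print the codec class recorded in a SequenceFile's header. If that
    // class (e.g. com.hadoop.compression.lzo.LzoCodec) is not on the
    // classpath, the reader cannot be constructed, which is why the
    // bridge class must keep the old name.
    public class ShowCodec {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (SequenceFile.Reader reader = new SequenceFile.Reader(
            conf, SequenceFile.Reader.file(new Path(args[0])))) {
          System.out.println("codec recorded in header: "
              + reader.getCompressionCodec().getClass().getName());
        }
      }
    }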

Member Author (@dbtsai):

@steveloughran this might answer your "Not sure why the com.hadoop classes are there at all."

Contributor:

OK. In that case the classes should have the reason explained, but tag the new classes as @Deprecated.
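
Something like this sketch (assumed javadoc wording, not the PR's exact code):

    package com.hadoop.compression.lzo;

    /**
     * Bridge class kept under the old hadoop-lzo name so that
     * SequenceFiles which recorded com.hadoop.compression.lzo.LzoCodec
     * in their header remain readable. New code should use
     * org.apache.hadoop.io.compress.LzoCodec instead.
     */
    @Deprecated
    public class LzoCodec extends org.apache.hadoop.io.compress.LzoCodec {
    }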

@Override
public Class<? extends Compressor> getCompressorType()
{
    return LzopCodec.HadoopLzopCompressor.class;
Member:

I see that createCompressor returns HadoopLzoCompressor(). Should we keep these two in sync?

Member Author (@dbtsai):

Good catch. Ideally yes, but HadoopLzoCompressor is private in airlift, so I cannot easily do it. I added a new override:

    @Override
    public Compressor createCompressor()
    {
        return new HadoopLzopCompressor();
    }

to keep them in sync, and I will try to work with aircompressor to move this to their codebase.

@steveloughran (Contributor) left a comment:

This is a bit beyond my area of expertise. All I know about codecs tends to be related to stack traces.

We cannot add any new jars to the classpath of hadoop-common unless they are absolutely essential. See #1948 as an example of the problems we've hit there.

This means that

  • the codec needs to use reflection to load classes, and fail with some useful message (or at least have the docs mention it in troubleshooting); see the sketch after this list
  • or a compile-time import of aircompressor, tagged as provided, and then requiring people who want to use this codec to add the JAR and any (documented) dependencies.
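
A rough sketch of the reflection option (a hypothetical helper, not code from this PR; io.airlift.compress.lzo.LzoCodec is aircompressor's Hadoop-compatible LZO codec class):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.util.ReflectionUtils;

    final class AirliftLzoLoader {
      private static final String AIRLIFT_LZO =
          "io.airlift.compress.lzo.LzoCodec";

      // Load the codec reflectively so hadoop-common compiles and runs
      // without aircompressor on the classpath, and fail with a useful
      // message when the JAR is missing.
      static CompressionCodec load(Configuration conf) {
        try {
          Class<?> clazz = Class.forName(AIRLIFT_LZO);
          return (CompressionCodec) ReflectionUtils.newInstance(clazz, conf);
        } catch (ClassNotFoundException e) {
          throw new IllegalStateException("LZO codec requested but the"
              + " aircompressor JAR is not on the classpath; add"
              + " io.airlift:aircompressor to use it", e);
        }
      }
    }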

Other than this

  • to use a codec you have to explicitly ask for it, correct? Which means that people need to know of it. Wherever codecs are documented, this will have to be added (see the config snippet after this list).
  • Not sure why the com.hadoop classes are there at all.
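
On the first point: codecs are typically registered via the io.compression.codecs key, so the documentation would show something like this core-site.xml sketch (class list abbreviated):

    <property>
      <name>io.compression.codecs</name>
      <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.LzoCodec,com.hadoop.compression.lzo.LzoCodec</value>
    </property>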

@@ -1727,6 +1728,11 @@
      <artifactId>jna</artifactId>
      <version>${jna.version}</version>
    </dependency>
    <dependency>
Contributor:

Does this add any transitive dependencies?

Contributor:

Looks okay to me: https://mvnrepository.com/artifact/io.airlift/aircompressor/0.16
No compile-time dependencies. A shaded Hadoop 2 jar as a provided dependency.

What's tricky is that it has a test-scope dependency on jmh-core, which is GPL 2.0.

Member Author (@dbtsai):

Since we don't depend on the test scope, I think we should be fine.

Member Author (@dbtsai):

@steveloughran we would actually like to bundle this jar into hadoop-common, since it is a very clean jar and is used in many projects, such as Presto and ORC, to provide their compression codecs.

In fact, we would like Snappy to fall back to aircompressor if no native lib is provided. For many of our devs, it's non-trivial to set up native libs in their development environment, since it requires compiling many native libs from source and installing them into the LD path. If we can fall back to the pure-Java Snappy implementation in aircompressor, it will make developers' lives much easier.
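
A minimal sketch of that fallback (assuming aircompressor's io.airlift.compress.snappy.SnappyCodec, which implements Hadoop's CompressionCodec interface; an illustration, not the PR's code):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.util.NativeCodeLoader;
    import org.apache.hadoop.util.ReflectionUtils;

    final class SnappyCodecChooser {
      // Prefer the JNI-backed codec when the native Hadoop libs are
      // loaded; otherwise fall back to aircompressor's pure-Java
      // implementation so devs don't have to build native libs locally.
      static CompressionCodec choose(Configuration conf) {
        Class<? extends CompressionCodec> impl =
            NativeCodeLoader.isNativeCodeLoaded()
                ? org.apache.hadoop.io.compress.SnappyCodec.class
                : io.airlift.compress.snappy.SnappyCodec.class;
        return ReflectionUtils.newInstance(impl, conf);
      }
    }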

Contributor:

My point was that aircompressor has a non-optional dependency on a GPL artifact, which would make it GPL as well.
That said, I just verified jmh-core is not in the Hadoop dependency tree after the patch.
We should talk to the aircompressor developers to clarify the license issue.
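
For the record, one way to double-check that from the source tree (a standard maven-dependency-plugin invocation):

    mvn -pl hadoop-common-project/hadoop-common dependency:tree -Dincludes=org.openjdk.jmh

An empty result confirms jmh-core does not leak into the hadoop-common dependency tree.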

Member Author (@dbtsai):

I would like to mention that other projects such as Presto and Apache Spark all include this as a dependency, so it would be great to get the clarification, since it's heavily used in the Apache community.

import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.Compressor;

import org.apache.commons.logging.Log;
Contributor:

Use SLF4J
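
i.e. a sketch of the requested change:

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class LzoCodec extends org.apache.hadoop.io.compress.LzoCodec {
      // SLF4J instead of commons-logging, per Hadoop convention
      private static final Logger LOG =
          LoggerFactory.getLogger(LzoCodec.class);
    }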

Member Author (@dbtsai):

Addressed. Thanks.

public class LzoCodec extends org.apache.hadoop.io.compress.LzoCodec {
  private static final Log LOG = LogFactory.getLog(LzoCodec.class);

  static final String gplLzoCodec = LzoCodec.class.getName();
Contributor:

private scope

Member Author (@dbtsai):

Addressed. Thanks.

@Override
public CompressionOutputStream createOutputStream(OutputStream out,
    Compressor compressor) throws IOException {
  if (!warned) {
Contributor:

There's a risk of more than one warning in a multithreaded env, but not something I'm worried about. AtomicBoolean would be the purist way to do it.
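
A sketch of the AtomicBoolean variant (field name assumed; not the PR's exact code):

    package com.hadoop.compression.lzo;

    import java.io.IOException;
    import java.io.OutputStream;
    import java.util.concurrent.atomic.AtomicBoolean;

    import org.apache.hadoop.io.compress.CompressionOutputStream;
    import org.apache.hadoop.io.compress.Compressor;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    @Deprecated
    public class LzoCodec extends org.apache.hadoop.io.compress.LzoCodec {
      private static final Logger LOG =
          LoggerFactory.getLogger(LzoCodec.class);

      // compareAndSet wins exactly once, so the warning is logged exactly
      // once even when several threads create output streams concurrently.
      private static final AtomicBoolean WARNED = new AtomicBoolean(false);

      @Override
      public CompressionOutputStream createOutputStream(OutputStream out,
          Compressor compressor) throws IOException {
        if (WARNED.compareAndSet(false, true)) {
          LOG.warn("{} is deprecated; use {} instead.",
              LzoCodec.class.getName(),
              org.apache.hadoop.io.compress.LzoCodec.class.getName());
        }
        return super.createOutputStream(out, compressor);
      }
    }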

Member Author (@dbtsai):

Addressed. Thanks.

 */
@DoNotPool
static class HadoopLzopCompressor
    implements Compressor
Contributor:

K&R / Java { placement, please.
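
For reference, the two styles on the declaration from the excerpt above:

    // as submitted (Allman / airlift style):
    static class HadoopLzopCompressor
        implements Compressor
    {
    }

    // K&R, as Hadoop style expects:
    static class HadoopLzopCompressor implements Compressor {
    }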

Member Author (@dbtsai):

Done. I'm using IntelliJ with the default Java formatter. Does Hadoop provide a code-style formatter that IntelliJ can use? Thanks.

Contributor:

'fraid not.

* limitations under the License.
*/

package com.hadoop.compression.lzo;
Contributor:

com.hadoop?

Member:

I think this is to offer a bridge for those who are using the hadoop-lzo library.

}

/**
 * No Hadoop code seems to actually use the compressor, so just return a dummy one so the createOutputStream method
Contributor:

does anyone know about downstream uses?

Member:

I only know that Presto uses the Lzop codec. Maybe we can skip this in this PR and focus on LzoCodec, @dbtsai?

Member Author (@dbtsai):

The test code calls createCompressor without using the actual implementation. Without this, the test will not pass.

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 25m 53s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ trunk Compile Tests _
+0 🆗 mvndep 0m 25s Maven dependency ordering for branch
+1 💚 mvninstall 18m 59s trunk passed
+1 💚 compile 20m 34s trunk passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1
+1 💚 compile 17m 33s trunk passed with JDK Private Build-1.8.0_252-8u252-b09-1~18.04-b09
+1 💚 checkstyle 2m 37s trunk passed
+1 💚 mvnsite 1m 59s trunk passed
+1 💚 shadedclient 19m 2s branch has no errors when building and testing our client artifacts.
+1 💚 javadoc 1m 5s trunk passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1
+1 💚 javadoc 2m 3s trunk passed with JDK Private Build-1.8.0_252-8u252-b09-1~18.04-b09
+0 🆗 spotbugs 2m 11s Used deprecated FindBugs config; considering switching to SpotBugs.
+0 🆗 findbugs 0m 27s branch/hadoop-project no findbugs output file (findbugsXml.xml)
_ Patch Compile Tests _
+0 🆗 mvndep 0m 26s Maven dependency ordering for patch
+1 💚 mvninstall 1m 9s the patch passed
+1 💚 compile 20m 43s the patch passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1
+1 💚 javac 20m 43s the patch passed
+1 💚 compile 17m 27s the patch passed with JDK Private Build-1.8.0_252-8u252-b09-1~18.04-b09
+1 💚 javac 17m 27s the patch passed
-0 ⚠️ checkstyle 3m 25s root: The patch generated 121 new + 142 unchanged - 1 fixed = 263 total (was 143)
+1 💚 mvnsite 2m 3s the patch passed
+1 💚 whitespace 0m 0s The patch has no whitespace issues.
+1 💚 xml 0m 2s The patch has no ill-formed XML file.
+1 💚 shadedclient 14m 1s patch has no errors when building and testing our client artifacts.
+1 💚 javadoc 1m 13s the patch passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1
+1 💚 javadoc 2m 9s the patch passed with JDK Private Build-1.8.0_252-8u252-b09-1~18.04-b09
+0 🆗 findbugs 0m 35s hadoop-project has no data from findbugs
-1 ❌ findbugs 2m 22s hadoop-common-project/hadoop-common generated 2 new + 0 unchanged - 0 fixed = 2 total (was 0)
_ Other Tests _
+1 💚 unit 0m 33s hadoop-project in the patch passed.
-1 ❌ unit 9m 24s hadoop-common in the patch failed.
-1 ❌ asflicense 0m 55s The patch generated 4 ASF License warnings.
189m 55s
Reason Tests
FindBugs module:hadoop-common-project/hadoop-common
The class name com.hadoop.compression.lzo.LzoCodec shadows the simple name of the superclass org.apache.hadoop.io.compress.LzoCodec. At LzoCodec.java:[lines 34-51]
The class name com.hadoop.compression.lzo.LzopCodec shadows the simple name of the superclass org.apache.hadoop.io.compress.LzopCodec. At LzopCodec.java:[lines 34-51]
Failed junit tests hadoop.io.file.tfile.TestTFileLzoCodecsByteArrays
Subsystem Report/Notes
Docker ClientAPI=1.40 ServerAPI=1.40 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2159/6/artifact/out/Dockerfile
GITHUB PR #2159
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient xml findbugs checkstyle
uname Linux 5a32de837f7c 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality personality/hadoop.sh
git revision trunk / ed3ab4b
Default Java Private Build-1.8.0_252-8u252-b09-1~18.04-b09
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_252-8u252-b09-1~18.04-b09
checkstyle https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2159/6/artifact/out/diff-checkstyle-root.txt
findbugs https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2159/6/artifact/out/new-findbugs-hadoop-common-project_hadoop-common.html
unit https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2159/6/artifact/out/patch-unit-hadoop-common-project_hadoop-common.txt
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2159/6/testReport/
asflicense https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2159/6/artifact/out/patch-asflicense-problems.txt
Max. process+thread count 1427 (vs. ulimit of 5500)
modules C: hadoop-project hadoop-common-project/hadoop-common U: .
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2159/6/console
versions git=2.17.1 maven=3.6.0 findbugs=4.0.6
Powered by Apache Yetus 0.13.0-SNAPSHOT https://yetus.apache.org

This message was automatically generated.

@apache deleted 6 comments from hadoop-yetus Aug 10, 2020
@steveloughran (Contributor) commented:

I can see why adding this to hadoop-common appeals in terms of ease of redistribution, but I want it in its own module, such as hadoop-tools/hadoop-compression:

  • hadoop-common gets everywhere, so a new JAR will too, potentially causing issues downstream.
  • if it is in its own module, we can be 100% sure that changes in that module will not have any side effects in any part of the tree which does not import it.
  • If you followed up this patch with some backports of the module to 3.1.x and 3.2.x, then again it would be nicely isolated.

There is a cost to this: you don't get it without changing your pom/sbt/gradle dependencies. We could think about including a stub hadoop-compression module even in those releases which don't get a backport.
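
Concretely, downstream builds would need an entry like this (artifactId hypothetical until the module exists):

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-compression</artifactId>
      <version>${hadoop.version}</version>
    </dependency>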

There is a variant option: add it to hadoop-extras, which has been around for a long time and currently doesn't do much. Add it there, and while downstream apps have to extend their pom, everything will still build against older versions if they do that?

The more I think about that, the more I like it.

@dbtsai (Member Author) commented Sep 15, 2020

cc @viirya

@steveloughran We are thinking the same. For example, if we add a hadoop-compression module and move all the codecs, such as Snappy, LZ4, gzip, and bzip2, there, other projects such as Parquet can depend on it directly without reimplementing their compression codecs. Many projects reimplement the codecs against the same interface, which we would be able to avoid by moving them into a separate module.

@steveloughran
Copy link
Contributor

hadoop-compression would be cleanest. I was looking at hadoop-extras as it is already bundled everywhere, so if a project declares a dependency on it, it will still build against older releases, just lacking the new codecs.

@steveloughran
Copy link
Contributor

OK, so what to do?

  • tools/hadoop-compression would be cleanest; it could be backported as a module to 3.2.x
  • the snappy binding & tests would go there

@dbtsai
Copy link
Member Author

dbtsai commented Sep 24, 2020

@steveloughran +1 on tools/hadoop-compression, and we will work on it.

Since #2297 is almost ready to merge, is it okay if we merge it first and then work on creating tools/hadoop-compression?

cc @viirya

@steveloughran
Copy link
Contributor

@dbtsai yes, let's do Snappy first.
