HADOOP-17124. Support LZO Codec using aircompressor #2159
base: trunk
Conversation
Thanks @dbtsai. Overall looks good; I left a few comments. Also, there are a lot of style issues that need to be addressed. Please check "checkstyle" in the CI result. For instance, Hadoop limits line width to 80 characters.
 * limitations under the License.
 */

package com.hadoop.compression.lzo;
Hmm, why do we need this bridging class in the Hadoop repo when the class comes from the hadoop-lzo library?
Historically, Hadoop included LZO until it was removed in HADOOP-4874 due to GPL licensing concerns. The GPL LZO codec was then maintained as a separate project at https://github.com/twitter/hadoop-lzo with the new codec class com.hadoop.compression.lzo.LzoCodec. In Hadoop's sequence files, the first few bytes of the file include the class name of the compression codec used when writing, and Hadoop uses this information to pick the right codec to read the data. As a result, we need this bridge to enable Hadoop to read old LZO-compressed data.
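The bridging approach described above can be sketched as follows. This is an illustrative stand-in, not the PR's actual code: `NewLzoCodec` stands in for the Apache-side implementation and `OldNameLzoCodec` for the class published under the legacy hadoop-lzo name, which inherits everything and exists only so the old class name recorded in file headers still resolves.

```java
public class BridgeDemo {
    /** Stand-in for the new org.apache.hadoop.io.compress implementation. */
    public static class NewLzoCodec {
        public String name() {
            return "new-lzo";
        }
    }

    /**
     * Stand-in for the bridge class kept under the old hadoop-lzo name.
     * No body: it inherits all behaviour from the new codec.
     */
    @Deprecated
    public static class OldNameLzoCodec extends NewLzoCodec {
    }

    public static void main(String[] args) {
        // Code that resolves the old class name gets the new behaviour.
        NewLzoCodec codec = new OldNameLzoCodec();
        System.out.println(codec.name());
    }
}
```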
@steveloughran this might answer your question "Not sure why the com.hadoop classes are there at all".
OK. In that case the classes should have the reason explained, and the new classes tagged as @Deprecated.
@Override
public Class<? extends Compressor> getCompressorType()
{
  return LzopCodec.HadoopLzopCompressor.class;
I see that createCompressor returns HadoopLzoCompressor(). Should we keep these two in sync?
Good catch. Ideally yes, but HadoopLzoCompressor is private in airlift, so I cannot easily do it. I added a new override
@Override
public Compressor createCompressor()
{
return new HadoopLzopCompressor();
}
to keep them in sync, and I will try to work with aircompressor to move this into their codebase.
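The sync issue being discussed can be sketched with stand-in names (`SketchCodec`, `DummyCompressor` are illustrative, not the PR's classes): the contract is that `getCompressorType()` should return exactly the class that `createCompressor()` instantiates, otherwise pooling and reflection-based creation can diverge from direct creation.

```java
public class SketchCodec {
    /** Stand-in for a Compressor implementation. */
    public static class DummyCompressor {
    }

    // Both methods below refer to the same concrete class; if one is
    // changed without the other, callers that use the type token and
    // callers that use the factory method would get different classes.
    public Class<?> getCompressorType() {
        return DummyCompressor.class;
    }

    public Object createCompressor() {
        return new DummyCompressor();
    }

    public static void main(String[] args) {
        SketchCodec codec = new SketchCodec();
        // Verify the two stay consistent.
        System.out.println(
            codec.getCompressorType().isInstance(codec.createCompressor()));
    }
}
```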
This is a bit beyond my area of expertise. All I know about codecs tends to be related to stack traces
We cannot add any new jars to the classpath of hadoop-common unless they are absolutely essential. See #1948 as an example of the problems we've hit there.
This means that
- the codec needs to use reflection to load classes and fail with some useful message (or at least the docs should mention it in troubleshooting)
- or a compile-time import of aircompressor, tagged as provided, then requiring people who want to use this codec to add the JAR and any (documented) dependencies.
Other than this
- to use a codec you have to explicitly ask for it, correct? Which means people need to know of it. Wherever codecs are documented, this will have to be added.
- Not sure why the com.hadoop classes are there at all.
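The reflection-loading option above can be sketched like this. The wrapper class and method names are illustrative; the point is that an optional dependency is resolved at runtime and a missing jar produces an actionable message instead of a bare ClassNotFoundException.

```java
public class OptionalCodecLoader {
    /**
     * Load an optional codec class by name, or fail with a message that
     * tells the user which jar to add. The class name passed in would be
     * the aircompressor implementation in the scenario discussed above.
     */
    public static Class<?> loadOrExplain(String className) {
        try {
            return Class.forName(className);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(className
                + " not found on the classpath: add the aircompressor jar"
                + " (and any documented dependencies) to use this codec", e);
        }
    }

    public static void main(String[] args) {
        try {
            loadOrExplain("io.airlift.compress.lzo.LzoCodec");
        } catch (IllegalStateException e) {
            // Without the aircompressor jar, the user sees guidance.
            System.out.println(e.getMessage());
        }
    }
}
```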
@@ -1727,6 +1728,11 @@
  <artifactId>jna</artifactId>
  <version>${jna.version}</version>
</dependency>
<dependency>
Does this add any transitive dependencies?
Looks okay to me: https://mvnrepository.com/artifact/io.airlift/aircompressor/0.16
No compile-time dependencies. A shaded Hadoop 2 jar as a provided dependency.
What's tricky is that it has a test-scope dependency on jmh-core, which is GPL 2.0.
Since we don't depend on the test scope, I think we should be fine.
@steveloughran we would like to actually bundle this jar into hadoop-common, since it is a very clean jar used in many projects such as Presto and ORC to provide their compression codecs.
In fact, we would like Snappy to fall back to aircompressor if no native lib is available. For many of our devs, it's non-trivial to set up native libs in their development environment, since it requires compiling many native libs from source and installing them into the LD path. If we can fall back to the pure-Java Snappy implementation in aircompressor, it will make developers' lives much easier.
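The fallback idea above can be sketched as a simple selection between two implementations. All names here are illustrative stand-ins, not Hadoop's actual API: the real version would probe whether the native library loaded and otherwise hand back the pure-Java aircompressor codec.

```java
public class SnappyFallbackDemo {
    interface Codec {
        String impl();
    }

    /** Stand-in for the pure-Java aircompressor implementation. */
    static class PureJavaCodec implements Codec {
        public String impl() {
            return "pure-java";
        }
    }

    /** Stand-in for the JNI-backed native implementation. */
    static class NativeCodec implements Codec {
        public String impl() {
            return "native";
        }
    }

    /**
     * Prefer the native codec when the native library is available,
     * otherwise fall back to pure Java so development environments
     * without native libs still work.
     */
    public static Codec pick(boolean nativeLoaded) {
        return nativeLoaded ? new NativeCodec() : new PureJavaCodec();
    }

    public static void main(String[] args) {
        System.out.println(pick(false).impl());
    }
}
```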
My point was that aircompressor has a non-optional dependency on a GPL artifact, thus making it GPL as well.
That said, I just verified jmh-core is not in the Hadoop dependency tree after the patch.
We should talk to the aircompressor developers to clarify the license issue.
I would like to mention that other projects such as Presto and Spark include this as a dependency, so it would be great to get the clarification since it's heavily used in the Apache community.
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.Compressor;

import org.apache.commons.logging.Log;
Use SLF4J
Addressed. Thanks.
hadoop-common-project/hadoop-common/src/main/java/com/hadoop/compression/lzo/LzoCodec.java
public class LzoCodec extends org.apache.hadoop.io.compress.LzoCodec {
  private static final Log LOG = LogFactory.getLog(LzoCodec.class);

  static final String gplLzoCodec = LzoCodec.class.getName();
private scope
Addressed. Thanks.
@Override
public CompressionOutputStream createOutputStream(OutputStream out,
    Compressor compressor) throws IOException {
  if (!warned) {
There's a risk of more than one warning in a multithreaded environment, but that's not something I'm worried about. AtomicBoolean would be the purist way to do it.
Addressed. Thanks.
hadoop-common-project/hadoop-common/src/main/java/com/hadoop/compression/lzo/LzoCodec.java
 */
@DoNotPool
static class HadoopLzopCompressor
    implements Compressor
K&R / Java `{` placement, please.
Done. I'm using IntelliJ with the default Java formatter. Does Hadoop provide a code style configuration that IntelliJ can use? Thanks.
'fraid not.
 * limitations under the License.
 */

package com.hadoop.compression.lzo;
com.hadoop?
I think this is to offer a bridge for those who are using the hadoop-lzo library.
}

/**
 * No Hadoop code seems to actually use the compressor, so just return a dummy one so the createOutputStream method
does anyone know about downstream uses?
I only know Presto uses the Lzop codec. Maybe we can skip this for this PR and focus on LzoCodec @dbtsai?
The test code is calling createCompressor without using the actual implementation. Without this, the test will not pass.
💔 -1 overall
This message was automatically generated.
I can see why adding this to hadoop-common appeals in terms of ease of redistribution, but I want it in its own package, such as hadoop-tools/hadoop-compression.
There is a cost to this: you don't get it without changing your pom/sbt/gradle dependencies. We could think about including a stub hadoop-compression module even in those releases which don't get a backport. There is a variant option: add it to hadoop-extras, which has been around for a long time and currently doesn't do much. Add it there, and while downstream apps have to extend their pom, everything will still build against older versions if they do that. The more I think about it, the more I like it.
cc @viirya @steveloughran We are thinking the same. For example, if we add
hadoop-compression would be cleanest. I was looking at hadoop-extras as it is already bundled everywhere, so if a project declares a dependency on it, it will still build against older releases, just lacking the new codecs.
OK, so what to do?
@steveloughran +1 on tools/hadoop-compression, and we will work on it. Since #2297 is almost ready to merge, is it okay if we merge it first and work on creating the new module after? cc @viirya
@dbtsai yes, let's do Snappy first.
See https://issues.apache.org/jira/browse/HADOOP-17124 for details.