Use Hadoop compression codec override (#2416) #2417
Conversation
Properly set Hadoop compression codec override and log error on failure. Continue to return null for failures to preserve existing logic.
I think this looks like a good fix. And thanks for the additions to the test. The only concern I have is with the exception handling. With your changes, I am now seeing an exception printed when …
Thank you for the review. I am concerned that the error which was previously ignored is rather fatal in certain contexts. For example, if data is written with a codec available on a MR system (such as lzo), but not available on the Accumulo system, the data read will be trash. There's not a good immediate fix here, the risk is relatively low, and the exception will not occur in normal operations; however, if it does happen, I'd rather know instead of attempting to track down some other irrelevant (data corruption) issue.
@milleruntime I do not see an exception printed either in my local run or as part of the GH action. Could you please provide the reproduction steps where you see the exception?
Run:

```
[main] ERROR org.apache.accumulo.core.file.rfile.bcfile.Compression [] - Unable to load codec class org.apache.hadoop.io.compress.LzoCodec for io.compression.codec.lzo.class
java.lang.ClassNotFoundException: org.apache.hadoop.io.compress.LzoCodec
    at jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581) ~[?:?]
    at jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178) ~[?:?]
    at java.lang.ClassLoader.loadClass(ClassLoader.java:522) ~[?:?]
    at java.lang.Class.forName0(Native Method) ~[?:?]
    at java.lang.Class.forName(Class.java:315) ~[?:?]
    at org.apache.accumulo.core.file.rfile.bcfile.Compression$Algorithm.createNewCodec(Compression.java:773) [classes/:?]
    at org.apache.accumulo.core.file.rfile.bcfile.Compression$Algorithm$2.createNewCodec(Compression.java:268) [classes/:?]
    at org.apache.accumulo.core.file.rfile.bcfile.Compression$Algorithm.initCodec(Compression.java:750) [classes/:?]
    at org.apache.accumulo.core.file.rfile.bcfile.Compression$Algorithm$2.initializeDefaultCodec(Compression.java:263) [classes/:?]
    at org.apache.accumulo.core.file.rfile.bcfile.Compression$Algorithm.<init>(Compression.java:635) [classes/:?]
    at org.apache.accumulo.core.file.rfile.bcfile.CompressionTest.testSupport(CompressionTest.java:57) [test-classes/:?]
```
Thank you - this makes sense as LZO is not bundled with Hadoop due to licensing issues. In this case, the exception would be useful in determining which codec is missing and how to resolve the missing library.
I agree but I would like to avoid printing an error if the codec is not there by default and most users aren't going to override the implementation. |
Understood. I would suggest the following, without the stack trace but with enough context:

```java
log.warn("Unable to load codec class {} for {}, reason: {}", clazz, codecClazzProp,
    e.getMessage());
```
ctubbsii left a comment
I think the logic can be further simplified
```java
String extClazz = conf.get(codecClazzProp);
if (extClazz == null) {
  extClazz = System.getProperty(codecClazzProp);
}
String clazz = (extClazz != null) ? extClazz : defaultClazz;
```
I don't think this behaves the way it should. This would get the class from the config file, and only use the system property if it's not set in the config file. I think it should work the other way around. Setting the system property on the command-line, for example, should be able to override what is set in the config file.
Suggested change — replace those five lines with a single line:

```java
String clazz = System.getProperty(codecClazzProp, conf.get(codecClazzProp, defaultClazz));
```
Also, this syntax is much more readable: try system props, fall back to config, fall back to default.
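The precedence being discussed can be sketched with a plain `Map` standing in for Hadoop's `Configuration` (the class and the example codec names below are illustrative, not the PR's actual code): the system property, typically set with a `-D` flag on the command line, wins over the config file, which wins over the compile-time default.

```java
import java.util.Map;

// Hypothetical sketch of the codec-class precedence resolution.
// A Map stands in for Hadoop's Configuration; the real code would use
// conf.get(key, defaultValue) instead of getOrDefault.
public class CodecResolution {
  static String resolve(Map<String, String> conf, String prop, String defaultClazz) {
    // System property (command line) > config file > built-in default.
    return System.getProperty(prop, conf.getOrDefault(prop, defaultClazz));
  }

  public static void main(String[] args) {
    String prop = "io.compression.codec.lzo.class";
    Map<String, String> conf = Map.of(prop, "com.example.ConfCodec");
    // No system property set: the config-file value is used.
    System.out.println(resolve(conf, prop, "com.example.DefaultCodec"));
    // With the system property set, it overrides the config file.
    System.setProperty(prop, "com.example.CliCodec");
    System.out.println(resolve(conf, prop, "com.example.DefaultCodec"));
  }
}
```

Reading the system property first is what makes the command-line override effective, which is the behavior ctubbsii argues for below.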
The behavior comes from the existing Accumulo code which didn't work and the Hadoop TFile example linked here: https://github.com/apache/hadoop/blob/e103c83765898f756f88c27b2243c8dd3098a989/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/file/tfile/Compression.java#L88 - It is a bit backwards I agree and may be more Hadoop semantics rather than Accumulo. I would agree with your approach.
Yeah, the simpler version of that implementation would look like:

```java
String clazz = conf.get(codecClazzProp, System.getProperty(codecClazzProp, defaultClazz));
```

But that would make the value in the config file override a Java property set on the command line, and I think it should go the other way around: the command line should take precedence.
```java
// This is not okay.
log.warn("Unable to load codec class {} for {}, reason: {}", clazz, codecClazzProp,
    e.getMessage());
```
Suggested change — replace that snippet with:

```java
log.warn("Unable to load codec class {} for {}", clazz, codecClazzProp, e);
```
This was discussed here: #2417 (comment) The exception stacktrace is not required since having the message is sufficient.
Varargs doesn't capture the exception object and it is unbound in your requested change.
I don't see harm in having a stack trace in the logs with a warning. I would leave it unbound so it does the stack trace, primarily because I think the code looks cleaner, but could go either way if somebody felt strongly about it.
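The "unbound" convention at issue here is SLF4J's rule that a trailing `Throwable` with no matching `{}` placeholder is treated as the exception to log, stack trace included. A simplified, hypothetical stand-in for SLF4J's message formatter (not the real `MessageFormatter`, and ignoring `{}` escaping) illustrates the difference between the two variants discussed above:

```java
public class Slf4jConvention {
  // Simplified stand-in for SLF4J's parameterized-message handling.
  static String render(String msg, Object... args) {
    int placeholders = msg.split("\\{\\}", -1).length - 1;
    // SLF4J convention: a trailing Throwable with no matching {} is the
    // exception whose stack trace gets logged.
    boolean trailingThrowable =
        args.length > placeholders && args[args.length - 1] instanceof Throwable;
    int bindable = trailingThrowable ? args.length - 1 : args.length;
    StringBuilder out = new StringBuilder();
    int i = 0, from = 0, idx;
    while (i < bindable && (idx = msg.indexOf("{}", from)) >= 0) {
      out.append(msg, from, idx).append(args[i++]);
      from = idx + 2;
    }
    out.append(msg.substring(from));
    if (trailingThrowable) {
      out.append(" [stack trace of ")
          .append(args[args.length - 1].getClass().getSimpleName()).append(']');
    }
    return out.toString();
  }

  public static void main(String[] args) {
    Throwable e = new ClassNotFoundException("org.apache.hadoop.io.compress.LzoCodec");
    // All three args bound to {}: only the message text, no stack trace.
    System.out.println(render("Unable to load codec class {} for {}, reason: {}",
        "LzoCodec", "io.compression.codec.lzo.class", e.getMessage()));
    // Throwable left unbound after the placeholders: stack trace is logged too.
    System.out.println(render("Unable to load codec class {} for {}",
        "LzoCodec", "io.compression.codec.lzo.class", e));
  }
}
```

So the earlier concern about varargs "not capturing" the exception does not apply: SLF4J checks the last argument itself, which is why leaving it unbound produces the stack trace.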
```java
Algorithm.conf.clear();
if (extLz4 != null) {
  System.setProperty(Compression.Algorithm.CONF_LZ4_CLASS, extLz4);
} else {
  System.clearProperty(Compression.Algorithm.CONF_LZ4_CLASS);
}
```
We run our unit tests in parallel and reuse JVM forks to do so. Messing with system properties like this could make these tests not thread safe. But, I'm not sure.
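One way tests commonly mitigate this is a save-and-restore guard around the system property; the sketch below (a hypothetical helper, not code from this PR) shows the pattern, including a lock, since even careful restoration still races between parallel tests in the same JVM without one:

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical try-with-resources guard for mutating a system property in a test.
// The lock serializes tests that touch the same property within one JVM fork;
// without it, parallel tests would still observe each other's values.
public class SysPropGuard implements AutoCloseable {
  private static final ReentrantLock LOCK = new ReentrantLock();
  private final String key;
  private final String previous;

  public SysPropGuard(String key, String value) {
    LOCK.lock();
    this.key = key;
    this.previous = System.getProperty(key);
    System.setProperty(key, value);
  }

  @Override
  public void close() {
    // Restore the prior value (or clear it) before releasing the lock.
    if (previous == null) {
      System.clearProperty(key);
    } else {
      System.setProperty(key, previous);
    }
    LOCK.unlock();
  }

  public static void main(String[] args) {
    String key = "io.compression.codec.lz4.class";
    try (SysPropGuard g = new SysPropGuard(key, "com.example.DummyCodec")) {
      System.out.println(System.getProperty(key)); // overridden inside the guard
    }
    System.out.println(System.getProperty(key)); // restored afterwards (null here)
  }
}
```

Even with this pattern, dropping the system-property tests, as discussed below, avoids the problem entirely.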
Parallel tests may cause issues here. I can back these tests out.
```java
Algorithm.conf.set(Compression.Algorithm.CONF_LZ4_CLASS, DummyCodec.class.getName());
CompressionCodec dummyCodec = Compression.Algorithm.LZ4.createNewCodec(4096);
assertEquals("Hadoop override DummyCodec not loaded", DummyCodec.class, dummyCodec.getClass());
```
These tests seem a bit heavyweight for what they are effectively testing. They are effectively testing that we can read the class name from the config files and/or system property. We don't actually need to create the codec to test that. These tests can probably be made more lightweight. However, given the triviality of my proposed one-liner above to get the class name, I'm not sure there's much to test here at all.
Eh, for someone who had to chase down this bug, I would typically argue for more test cases and code coverage, but I get your point.
Given the potential problems with JVM reuse and setting system properties, I'd probably just remove the test in this PR, but I agree with you in the general case about more unit testing for greater code coverage.
Yeah, I just don't want us to get to the point in our testing eagerness that we're doing things like `assertTrue(Set.of(x).contains(x));`
At a certain point, we're not even testing Accumulo code, but that conf.get() and System.getProperty() work correctly.
Closing in favor of #2800
Implements fix for #2416.
Computes correct override class and adds unit test for both Hadoop and System property configuration.