
[HUDI-268] Shade and relocate Avro dependency in hadoop-mr-bundle #915

Closed (2 commits, not merged)

Conversation

@umehrot2 (Contributor)

As of now, Hudi depends on Parquet 1.8.1 and Avro 1.7.7, which might work fine for older versions of Spark and Hive.

But when we build it against Spark 2.4.3 (which uses Parquet 1.10.1 and Avro 1.8.2) with:

mvn clean install -DskipTests -DskipITs -Dhadoop.version=2.8.5 -Dspark.version=2.4.3 -Dhbase.version=1.4.10 -Dhive.version=2.3.5 -Dparquet.version=1.10.1 -Davro.version=1.8.2

We run into a runtime issue on Hive 2.3.5 when querying realtime (RT) tables:

hive> select record_key from mytable_mor_sep20_01_rt limit 10;
OK
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/avro/LogicalType
	at org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.init(AbstractRealtimeRecordReader.java:323)
	at org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.<init>(AbstractRealtimeRecordReader.java:105)
	at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.<init>(RealtimeCompactedRecordReader.java:48)
	at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.constructRecordReader(HoodieRealtimeRecordReader.java:67)
	at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.<init>(HoodieRealtimeRecordReader.java:45)
	at org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat.getRecordReader(HoodieParquetRealtimeInputFormat.java:234)
	at org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:695)
	at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:333)
	at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:459)
	at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:428)
	at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:147)
	at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2208)
	at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:253)
	at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
	at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:821)
	at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
	at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:239)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:153)
Caused by: java.lang.ClassNotFoundException: org.apache.avro.LogicalType
	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)

This is happening because we shade parquet-avro, which is now 1.10.1 and requires Avro 1.8.2, the version that introduced the LogicalType class. However, Hive 2.3.5 only has Avro 1.7.7 available at runtime, which does not have the LogicalType class.

To avoid these scenarios, and to at least allow usage of higher versions of Spark without affecting the Hive integration, we propose the following:

  • Always compile Hudi with the Parquet/Avro versions used by Spark.
  • Shade and relocate Avro in hadoop-mr-bundle, to avoid issues due to an older version of Avro being available there (sketched below).

This will also help with our other issues, where we want to upgrade to Spark 2.4 and deprecate the use of databricks-avro. Thoughts?
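
To make the second bullet concrete, the kind of change I have in mind in the hadoop-mr-bundle shade plugin is roughly the following (a sketch only, not the exact diff):

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <configuration>
        <artifactSet>
          <includes>
            <!-- bundle Avro so the relocated classes actually ship inside the jar -->
            <include>org.apache.avro:avro</include>
          </includes>
        </artifactSet>
        <relocations>
          <relocation>
            <!-- rewrite Hudi's Avro references away from the Avro 1.7.7 on Hive's classpath -->
            <pattern>org.apache.avro.</pattern>
            <shadedPattern>org.apache.hudi.org.apache.avro.</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </plugin>

With this, Hive's own Avro 1.7.7 stays untouched, while the Hudi record readers resolve LogicalType from the bundled, relocated Avro 1.8.2.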

@vinothchandar (Member)

Hive 2.3.5 has Avro 1.7.7 available at runtime, which does not have the LogicalType class.

@umehrot2 I am a little hesitant to head down this path of bundling higher versions than the target system itself supports. This is what landed us in the mess we were in before.

My question to you is: can Hive 2.3.5, as is, support Avro tables (not Parquet) that have logical types? If yes, we can look into what we can do to get parity.

If you still think we should do this, can we control bundling by making the scope of Avro in the mr-bundle pom configurable during the build, e.g. -Dmr.bundle.avro.scope=compile (with the default being provided in the pom)?
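
i.e. something roughly like this in the mr-bundle pom (property name is just a suggestion):

    <properties>
      <!-- default: keep Avro provided, i.e. not bundled -->
      <mr.bundle.avro.scope>provided</mr.bundle.avro.scope>
    </properties>

    <dependency>
      <groupId>org.apache.avro</groupId>
      <artifactId>avro</artifactId>
      <scope>${mr.bundle.avro.scope}</scope>
    </dependency>

and anyone who wants Avro bundled passes -Dmr.bundle.avro.scope=compile at build time.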

@vinothchandar (Member)

Also please prefix the PR with the JIRA it is related to :)

@umehrot2 (Contributor, Author)

My question to you is: can Hive 2.3.5, as is, support Avro tables (not Parquet) that have logical types? If yes, we can look into what we can do to get parity.

@vinothchandar I don't think we need to be concerned about Hive 2.3.5 being able to support Avro tables with logical types. If that were a problem, it would exist even now. For example, Spark 2.4.3 supports a higher version of Avro and handles logical types by converting them to fixed-length byte arrays. On the Hive 2.3.5 side, I believe it will convert these fixed-length byte arrays back to its own decimal type; it should not necessarily have to understand LogicalType (if I understand correctly).

The problem is that we are already bundling parquet-avro within the bundle jars, which makes it really difficult to upgrade the Parquet version. I think Hudi should strive to work with its own versions of Parquet/Avro irrespective of the consuming application. This particular change would at least make the Avro version used by Hudi common with that of Spark, and we could claim to always compile Hudi against the version of Spark that is actually writing the dataset.

If you are not confident about this change, I can definitely make it configurable like you said. But on the EMR side we will have to maintain this to be able to support Hudi with Spark 2.4.3 and Hive 2.3.5.

umehrot2 changed the title from "Shade and relocate Avro dependency in hadoop-mr-bundle" to "[HUDI-268] Shade and relocate Avro dependency in hadoop-mr-bundle" on Sep 23, 2019
@umehrot2 (Contributor, Author)

Also please prefix the PR with the JIRA it is related to :)

Done. Thanks!

@vinothchandar (Member) commented Sep 23, 2019

I think Hudi should strive to work with its own versions of parquet/avro irrespective of the consuming application

I think we differ here. Speaking from experience of trying to do so, we ran into multiple issues with that approach:

  • There is always a disparity between what works on a default Parquet table in Hive/Spark vs. what Hudi tables do.
  • Shading is not always a viable option, especially with Avro and the public interfaces. cc @bvaradar

we can claim to always compile Hudi against the version of Spark that is actually writing the dataset

With Avro 1.7.7 and Spark 2.1, I think that's where we were. Bundling Avro 1.7.7 was the problem, since on higher Spark versions we are still stuck with it.

I can definitely make it configurable like you said.

For now, I would recommend doing that, and we can document build instructions for different Spark/Hive combinations. We can also maintain and evolve these in the Hudi project itself.
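
For example, the documented build for Spark 2.4.3 / Hive 2.3.5 could look something like this (the -Dmr.bundle.avro.scope flag being the suggested, not-yet-existing switch from above):

    mvn clean install -DskipTests -DskipITs \
      -Dhadoop.version=2.8.5 -Dspark.version=2.4.3 -Dhive.version=2.3.5 \
      -Dparquet.version=1.10.1 -Davro.version=1.8.2 \
      -Dmr.bundle.avro.scope=compile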

@vinothchandar (Member)

@bvaradar what do you think? @umehrot2's point is that since we need parquet-avro, we have to either:

  • downgrade parquet-avro to match Hive, or
  • bundle our own version of Avro.

But the logical type handling is resolved in a different manner by Spark's Hive registration (that's what I understood from the explanation above), and the actual fix may be to mimic that?

@bvaradar (Contributor)

@umehrot2 Shading Avro will cause some Realtime table use cases to break. This was one of the reasons we ended up not using this approach.

Hudi allows for pluggable record payload implementations. HoodieRecordPayload (a public-facing interface) exposes Avro GenericRecord types as part of its interface.

We have some deployments where custom implementations of this interface (which reside in a different jar) are used to perform on-the-fly merges when reading Realtime tables. If we make shading Avro mandatory for hudi-hadoop-mr-bundle, then these plugins also need to shade Avro in the same way.

I think keeping the bundling optional (default = not bundled) would be better.

@vinothchandar: Assuming this is a one-off case, if it makes things easier for everyone, would it make sense to publish a new jar type (hudi-hadoop-mr-bundle-with-avro-shaded) along with hudi-hadoop-mr-bundle?

@vinothchandar (Member)

@umehrot2 are you okay with generating an additional jar?

@umehrot2 (Contributor, Author)

@bvaradar @vinothchandar I can make changes to create another jar, hudi-hadoop-mr-bundle-with-avro-shaded.

However, based on @bvaradar's concerns, I am now skeptical about whether shading Avro is going to break any use case on the EMR side if we use hudi-hadoop-mr-bundle-with-avro-shaded. If you can point us to use cases that might break, we can test and verify those. We would not want to introduce any regressions on our side by using the avro-shaded jar.

@bvaradar (Contributor)

@umehrot2: Unfortunately, the examples that I mentioned are not in the open source world. If you want to reproduce it, a high-level direction would be:

  1. Implement a HoodieRecordPayload class in a separate jar. (For testing: simply extend OverwriteWithLatestAvroPayload.java; a minimal sketch follows these steps.)
  2. If you want to understand how this is used in the reader, see: https://github.com/apache/incubator-hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/table/log/AbstractHoodieLogRecordScanner.java#L109
  3. A quick and dirty way to repro this: simply use the docker-based demo steps as described in https://hudi.incubator.apache.org/docker_demo.html, but with the following changes:
    (a) Pass --payload-class to all delta-streamer invocations with the class that you created in (1).
    (b) After you ingest the data once using the delta-streamer, simply add an entry to the hoodie.properties file in the dataset's .hoodie folder: hoodie.compaction.payload.class=<the class from (1)>.
    (c) Continue the steps.
    (d) In step (6a), where you query the Realtime table, you should see the class getting instantiated.
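
For step (1), the payload can be as trivial as the sketch below. The package name is made up, and the import path and constructor signature are from memory and may differ across Hudi versions, so treat this only as a starting point:

    package com.example.hudi; // hypothetical external jar, outside the Hudi bundles

    import org.apache.avro.generic.GenericRecord;
    import org.apache.hudi.OverwriteWithLatestAvroPayload; // class location may differ by Hudi version

    /**
     * Minimal custom payload: behaves exactly like OverwriteWithLatestAvroPayload, but lives
     * in a separate jar, so the RT reader has to instantiate it via --payload-class /
     * hoodie.compaction.payload.class. Because the base class exposes GenericRecord, relocating
     * Avro inside the mr-bundle would force this jar to be shaded the same way.
     */
    public class MyCustomPayload extends OverwriteWithLatestAvroPayload {

      // The reader instantiates payloads reflectively with (GenericRecord, Comparable).
      public MyCustomPayload(GenericRecord record, Comparable orderingVal) {
        super(record, orderingVal);
      }
    }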

@vinothchandar (Member)

@umehrot2 I am not sure you want to spend a lot of time reproducing this. Maybe I can summarize the implications instead.

  • The HoodieRecordPayload abstraction has return types and parameters that take GenericRecord.
  • Payload implementations that live in Hudi, e.g. OverwriteWithLatestAvroPayload, will be properly shaded, since they are in the bundle.
  • If a user implemented their own payload outside of Hudi and dropped the jar into Hive (like how we drop the mr-bundle), then for it to work Avro needs to be shaded there as well. Otherwise the Hudi parent class will reference org.apache.hudi.org.apache.avro while the user implementation references org.apache.avro.

So it seems like we are choosing between custom payloads and logical type support. In the short term, I'd still vote for making the shading controllable via a -D and getting all the data types to work, then looking into how to untangle the shading aspects in a follow-on.

By no means am I trying to pick one over the other; just trying to first get Spark 2.4 and all the data types working.

@umehrot2 (Contributor, Author)

@vinothchandar @bvaradar Thank you for providing the steps and the implications of this change. I think on our side we should be fine if custom payload implementations require customers to do this additional shading.

I will go ahead with creating another jar, hudi-hadoop-mr-bundle-with-avro-shaded, alongside hudi-hadoop-mr-bundle every time Hudi is built. We can then brainstorm a more long-term solution.

@umehrot2 (Contributor, Author)

@vinothchandar @bvaradar updated the PR

@vinothchandar (Member) left a review comment

Just a few nits. But the integration test seems to fail? Ideas?

@@ -30,6 +30,7 @@
<checkstyle.skip>true</checkstyle.skip>
<notice.dir>${project.basedir}/src/main/resources/META-INF</notice.dir>
<notice.file>HUDI_NOTICE.txt</notice.file>
<avro.scope>provided</avro.scope>
Review comment (Member):

nit: rename to mr.bundle.avro.scope, in case we decide to use this strategy for other bundles too?


<profiles>
<profile>
<id>shade-avro</id>
Review comment (Member):

Same here: include the bundle name in the profile id? And probably move this to the root pom as, maybe, <id>aws-emr-profile</id>; that way you can control other overrides as well.

@umehrot2 (Contributor, Author)

@vinothchandar @bvaradar updated the PR

Just a few nits. But the integration test seems to fail? Ideas?

@vinothchandar I noticed an issue with this approach today, and that's probably why the integration tests are failing.

When we build with the shade-avro profile things would be fine, but building without the profile can break other things. We always add an <include> for the Avro dependency as well as a <relocation> for Avro, and only try to turn shading on or off via the scope. However, when the scope is provided, the relocation section still rewrites all references to org.apache.avro in our code/jar to org.apache.hudi.org.apache.avro, yet the relocated classes will not actually be found, because Avro was never bundled and relocated due to the scope. This will cause runtime errors.

Essentially, I am not able to find a good way to activate/deactivate the shading. The only way I can think of now is putting the whole shade plugin, with all its contents plus Avro, inside the profile, but that seems like a bad approach to me.

Have you achieved anything like this in the past?

@cdmikechen (Contributor)

cdmikechen commented Sep 29, 2019

I found that some of the code does not use Avro to process the data structure during stream processing. Can we refer to that part of the code to convert from Avro when writing Hudi data, and save the data with Parquet's basic API?
Right now, the main problem is that some data types cannot be converted correctly after going through Avro. That problem might go away if the data were stored without Avro.

@@ -77,6 +79,10 @@
<pattern>com.esotericsoftware.minlog.</pattern>
<shadedPattern>org.apache.hudi.com.esotericsoftware.minlog.</shadedPattern>
</relocation>
<relocation>
<pattern>org.apache.avro.</pattern>
<shadedPattern>org.apache.hudi.org.apache.avro.</shadedPattern>
Review comment (Contributor):

Add one more property and use it here:

mr.bundle.avro.shade.prefix
${mr.bundle.avro.shade.prefix}org.apache.avro.

The default value should be the empty string:
<mr.bundle.avro.shade.prefix></mr.bundle.avro.shade.prefix>

and in the profile activation step (shade-avro), set it to:
<mr.bundle.avro.shade.prefix>org.apache.hudi.</mr.bundle.avro.shade.prefix>

Can you try this setup and check?
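
Concretely, the relevant pieces of the mr-bundle pom would then look roughly like this (sketch only, keeping the existing avro.scope property name):

    <properties>
      <avro.scope>provided</avro.scope>
      <!-- empty prefix => the relocation below rewrites org.apache.avro. to itself, i.e. a no-op -->
      <mr.bundle.avro.shade.prefix></mr.bundle.avro.shade.prefix>
    </properties>

    <!-- inside the maven-shade-plugin relocations -->
    <relocation>
      <pattern>org.apache.avro.</pattern>
      <shadedPattern>${mr.bundle.avro.shade.prefix}org.apache.avro.</shadedPattern>
    </relocation>

    <profiles>
      <profile>
        <id>shade-avro</id>
        <properties>
          <!-- bundle Avro and actually relocate it -->
          <avro.scope>compile</avro.scope>
          <mr.bundle.avro.shade.prefix>org.apache.hudi.</mr.bundle.avro.shade.prefix>
        </properties>
      </profile>
    </profiles>

That way the relocation section can stay in place unconditionally, and the profile only flips the two properties.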

@bvaradar (Contributor)

bvaradar commented Oct 1, 2019


@umehrot2: This can be fixed by adding another Maven property; I have explained it in the review comments above. Can you please try it and see if it works?

umehrot2 closed this on Oct 11, 2019
umehrot2 deleted the umehrot2-avro-shade branch on October 11, 2019 19:58
@bvaradar (Contributor)

@umehrot2: Trying to understand why this was closed. Is this no longer needed?

@umehrot2 (Contributor, Author)

@umehrot2: Trying to understand why this was closed. Is this no longer needed?

@bvaradar It was by accident. I saw that there were merge conflicts, so I thought I would delete my branch, fix the conflicts, and push again. But that closed this PR instead, and now the Reopen functionality does not seem to work either. I might have to create a new PR :(

BTW, your review comments make sense, and I will address them.
