
[HUDI-268] Shade and relocate Avro dependency in hadoop-mr-bundle #915

Closed (2 commits, not merged)

Conversation

@umehrot2 (Contributor)

As of now, Hudi depends on Parquet 1.8.1 and Avro 1.7.7, which might work fine for older versions of Spark and Hive.

But when we build it against Spark 2.4.3 (which uses Parquet 1.10.1 and Avro 1.8.2) with:

mvn clean install -DskipTests -DskipITs -Dhadoop.version=2.8.5 -Dspark.version=2.4.3 -Dhbase.version=1.4.10 -Dhive.version=2.3.5 -Dparquet.version=1.10.1 -Davro.version=1.8.2

We run into a runtime issue on Hive 2.3.5 when querying realtime (RT) tables:

hive> select record_key from mytable_mor_sep20_01_rt limit 10;
OK
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/avro/LogicalType
	at org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.init(AbstractRealtimeRecordReader.java:323)
	at org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.<init>(AbstractRealtimeRecordReader.java:105)
	at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.<init>(RealtimeCompactedRecordReader.java:48)
	at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.constructRecordReader(HoodieRealtimeRecordReader.java:67)
	at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.<init>(HoodieRealtimeRecordReader.java:45)
	at org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat.getRecordReader(HoodieParquetRealtimeInputFormat.java:234)
	at org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:695)
	at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:333)
	at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:459)
	at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:428)
	at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:147)
	at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2208)
	at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:253)
	at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
	at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:821)
	at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
	at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:239)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:153)
Caused by: java.lang.ClassNotFoundException: org.apache.avro.LogicalType
	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)

This is happening because we shade parquet-avro, which is now 1.10.1 and requires Avro 1.8.2, the version that introduced the LogicalType class. However, Hive 2.3.5 only has Avro 1.7.7 available at runtime, which does not have the LogicalType class.

To avoid these scenarios, and to at least allow usage of higher versions of Spark without affecting the Hive integration, we propose the following:

  • Always compile Hudi with the Parquet/Avro versions used by Spark.
  • Shade and relocate Avro in hadoop-mr-bundle, to avoid issues due to an older version of Avro being available there (sketched below).

This will also help with our other issues, where we want to upgrade to Spark 2.4 and deprecate the use of databricks-avro. Thoughts?
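
To make the second bullet concrete, the kind of change I have in mind in the hadoop-mr-bundle shade plugin is roughly the following (a sketch only, not the exact diff):

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <configuration>
        <artifactSet>
          <includes>
            <!-- bundle Avro so the relocated classes actually ship inside the jar -->
            <include>org.apache.avro:avro</include>
          </includes>
        </artifactSet>
        <relocations>
          <relocation>
            <!-- rewrite Hudi's Avro references away from the Avro 1.7.7 on Hive's classpath -->
            <pattern>org.apache.avro.</pattern>
            <shadedPattern>org.apache.hudi.org.apache.avro.</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </plugin>

With this, Hive's own Avro 1.7.7 stays untouched, while the Hudi record readers resolve LogicalType from the bundled, relocated Avro 1.8.2.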

@vinothchandar (Member)

Hive 2.3.5 has Avro 1.7.7 available at runtime, which does not have the LogicalType class.

@umehrot2 I am a little hesitant to head down this path of bundling higher versions than the target system itself supports. This is what landed us in the mess we were in before.

My question to you is: can Hive 2.3.5, as is, support Avro tables (not Parquet) that have logical types? If yes, we can look into what we can do to get parity.

If you still think we should do this, can we control bundling by making the scope of Avro in the mr-bundle pom configurable during the build, e.g. -Dmr.bundle.avro.scope=compile (with the default being provided in the pom)?
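
i.e. something roughly like this in the mr-bundle pom (property name is just a suggestion):

    <properties>
      <!-- default: keep Avro provided, i.e. not bundled -->
      <mr.bundle.avro.scope>provided</mr.bundle.avro.scope>
    </properties>

    <dependency>
      <groupId>org.apache.avro</groupId>
      <artifactId>avro</artifactId>
      <scope>${mr.bundle.avro.scope}</scope>
    </dependency>

and anyone who wants Avro bundled passes -Dmr.bundle.avro.scope=compile at build time.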

@vinothchandar (Member)

Also please prefix the PR with the JIRA it is related to :)

@umehrot2 (Contributor, Author)

My question to you is: can Hive 2.3.5, as is, support Avro tables (not Parquet) that have logical types? If yes, we can look into what we can do to get parity.

@vinothchandar I don't think we need to be concerned about Hive 2.3.5 being able to support Avro tables with logical types. If that were a problem, it would exist even now. For example, Spark 2.4.3 supports a higher version of Avro and handles logical types by converting them to fixed-length byte arrays. On the Hive 2.3.5 side, I believe it will convert these fixed-length byte arrays back to its own decimal type; it should not necessarily have to understand LogicalType (if I understand correctly).

The problem is that we are already bundling parquet-avro within the bundle jars, which makes it really difficult to upgrade the Parquet version. I think Hudi should strive to work with its own versions of Parquet/Avro irrespective of the consuming application. This particular change would at least make the Avro version used by Hudi common with that of Spark, and we could claim to always compile Hudi against the version of Spark that is actually writing the dataset.

If you are not confident about this change, I can definitely make it configurable like you said. But on the EMR side we will have to maintain this to be able to support Hudi with Spark 2.4.3 and Hive 2.3.5.

umehrot2 changed the title from "Shade and relocate Avro dependency in hadoop-mr-bundle" to "[HUDI-268] Shade and relocate Avro dependency in hadoop-mr-bundle" on Sep 23, 2019
@umehrot2 (Contributor, Author)

Also please prefix the PR with the JIRA it is related to :)

Done. Thanks!

@vinothchandar (Member) commented Sep 23, 2019

I think Hudi should strive to work with its own versions of parquet/avro irrespective of the consuming application

I think we differ here. Speaking from experience of trying to do so, we ran into multiple issues with that approach:

  • There is always a disparity between what works on a default Parquet table in Hive/Spark vs. what Hudi tables do.
  • Shading is not always a viable option, especially with Avro and the public interfaces. cc @bvaradar

we can claim to always compile Hudi against the version of Spark that is actually writing the dataset

With Avro 1.7.7 and Spark 2.1, I think that's where we were. Bundling Avro 1.7.7 was the problem, since on higher Spark versions we are still stuck with it.

I can definitely make it configurable like you said.

For now, I would recommend doing that, and we can document build instructions for different Spark/Hive combinations. We can also maintain and evolve these in the Hudi project itself.
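
For example, the documented build for Spark 2.4.3 / Hive 2.3.5 could look something like this (the -Dmr.bundle.avro.scope flag being the suggested, not-yet-existing switch from above):

    mvn clean install -DskipTests -DskipITs \
      -Dhadoop.version=2.8.5 -Dspark.version=2.4.3 -Dhive.version=2.3.5 \
      -Dparquet.version=1.10.1 -Davro.version=1.8.2 \
      -Dmr.bundle.avro.scope=compile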

@vinothchandar (Member)

@bvaradar what do you think? @umehrot2's point is that since we need parquet-avro, we have to either:

  • downgrade parquet-avro to match Hive, or
  • bundle our own version of Avro.

But the logical type handling is resolved in a different manner by Spark's Hive registration (that's what I understood from the explanation above), and the actual fix may be to mimic that?

@bvaradar (Contributor)

@umehrot2 Shading Avro will cause some Realtime table use cases to break. This was one of the reasons we ended up not using this approach.

Hudi allows for pluggable record payload implementations. HoodieRecordPayload (a public-facing interface) exposes Avro GenericRecord types as part of its interface.

We have some deployments where custom implementations of this interface (which reside in a different jar) are used to perform on-the-fly merges when reading Realtime tables. If we make shading Avro mandatory for hudi-hadoop-mr-bundle, then these plugins also need to shade Avro in the same way.

I think keeping the bundling optional (default = not bundled) would be better.

@vinothchandar: Assuming this is a one-off case, if it makes things easier for everyone, would it make sense to publish a new jar type (hudi-hadoop-mr-bundle-with-avro-shaded) along with hudi-hadoop-mr-bundle?

@vinothchandar (Member)

@umehrot2 are you okay with generating an additional jar?

@umehrot2 (Contributor, Author)

@bvaradar @vinothchandar I can make changes to create another jar, hudi-hadoop-mr-bundle-with-avro-shaded.

However, based on @bvaradar's concerns, I am now skeptical about whether shading Avro is going to break any use case on the EMR side if we use hudi-hadoop-mr-bundle-with-avro-shaded. If you can point us to use cases that might break, we can test and verify those. We would not want to introduce any regressions on our side by using the avro-shaded jar.

@bvaradar (Contributor)

@umehrot2: Unfortunately, the examples that I mentioned are not in the open source world. If you want to reproduce it, a high-level direction would be:

  1. Implement a HoodieRecordPayload class in a separate jar. (For testing: simply extend OverwriteWithLatestAvroPayload.java; a minimal sketch follows these steps.)
  2. If you want to understand how this is used in the reader, see: https://github.com/apache/incubator-hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/table/log/AbstractHoodieLogRecordScanner.java#L109
  3. A quick and dirty way to repro this: simply use the docker-based demo steps as described in https://hudi.incubator.apache.org/docker_demo.html, but with the following changes:
    (a) Pass --payload-class to all delta-streamer invocations with the class that you created in (1).
    (b) After you ingest the data once using the delta-streamer, simply add an entry to the hoodie.properties file in the dataset's .hoodie folder: hoodie.compaction.payload.class=<the class from (1)>.
    (c) Continue the steps.
    (d) In step (6a), where you query the Realtime table, you should see the class getting instantiated.
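
For step (1), the payload can be as trivial as the sketch below. The package name is made up, and the import path and constructor signature are from memory and may differ across Hudi versions, so treat this only as a starting point:

    package com.example.hudi; // hypothetical external jar, outside the Hudi bundles

    import org.apache.avro.generic.GenericRecord;
    import org.apache.hudi.OverwriteWithLatestAvroPayload; // class location may differ by Hudi version

    /**
     * Minimal custom payload: behaves exactly like OverwriteWithLatestAvroPayload, but lives
     * in a separate jar, so the RT reader has to instantiate it via --payload-class /
     * hoodie.compaction.payload.class. Because the base class exposes GenericRecord, relocating
     * Avro inside the mr-bundle would force this jar to be shaded the same way.
     */
    public class MyCustomPayload extends OverwriteWithLatestAvroPayload {

      // The reader instantiates payloads reflectively with (GenericRecord, Comparable).
      public MyCustomPayload(GenericRecord record, Comparable orderingVal) {
        super(record, orderingVal);
      }
    }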

@vinothchandar (Member)

@umehrot2 I am not sure you want to spend a lot of time reproducing this. Maybe I can summarize the implications instead.

  • The HoodieRecordPayload abstraction has return types and parameters that take GenericRecord.
  • Payload implementations that live in Hudi, e.g. OverwriteWithLatestAvroPayload, will be properly shaded, since they are in the bundle.
  • If a user implemented their own payload outside of Hudi and dropped the jar into Hive (like how we drop the mr-bundle), then for it to work Avro needs to be shaded there as well. Otherwise the Hudi parent class will reference org.apache.hudi.org.apache.avro while the user implementation references org.apache.avro.

So it seems like we are choosing between custom payloads and logical type support. In the short term, I'd still vote for making the shading controllable via a -D and getting all the data types to work, then looking into how to untangle the shading aspects in a follow-on.

By no means am I trying to pick one over the other; just trying to first get Spark 2.4 and all the data types working.

@umehrot2 (Contributor, Author)

@vinothchandar @bvaradar Thank you for providing the steps and the implications of this change. I think on our side we should be fine if custom payload implementations require customers to do this additional shading.

I will go ahead with creating another jar, hudi-hadoop-mr-bundle-with-avro-shaded, alongside hudi-hadoop-mr-bundle every time Hudi is built. We can then brainstorm a more long-term solution.

@umehrot2 (Contributor, Author)

@vinothchandar @bvaradar updated the PR

@vinothchandar (Member) left a review comment

Just a few nits. But the integration test seems to fail? Ideas?

@@ -30,6 +30,7 @@
<checkstyle.skip>true</checkstyle.skip>
<notice.dir>${project.basedir}/src/main/resources/META-INF</notice.dir>
<notice.file>HUDI_NOTICE.txt</notice.file>
<avro.scope>provided</avro.scope>
Review comment (Member):

nit: rename to mr.bundle.avro.scope, in case we decide to use this strategy for other bundles too?


<profiles>
<profile>
<id>shade-avro</id>
Review comment (Member):

Same here: include the bundle name in the profile id? And probably move this to the root pom as, maybe, <id>aws-emr-profile</id>; that way you can control other overrides as well.

@umehrot2 (Contributor, Author)

@vinothchandar @bvaradar updated the PR

Just a few nits. But the integration test seems to fail? Ideas?

@vinothchandar I noticed an issue with this approach today, and that's probably why the integration tests are failing.

When we build with the shade-avro profile things would be fine, but building without the profile can break other things. We always add an <include> for the Avro dependency as well as a <relocation> for Avro, and only try to turn shading on or off via the scope. However, when the scope is provided, the relocation section still rewrites all references to org.apache.avro in our code/jar to org.apache.hudi.org.apache.avro, yet the relocated classes will not actually be found, because Avro was never bundled and relocated due to the scope. This will cause runtime errors.

Essentially, I am not able to find a good way to activate/deactivate the shading. The only way I can think of now is putting the whole shade plugin, with all its contents plus Avro, inside the profile, but that seems like a bad approach to me.

Have you achieved anything like this in the past?

@cdmikechen (Contributor)

cdmikechen commented Sep 29, 2019

I found that some of the code does not use Avro to process the data structure during stream processing. Can we refer to that part of the code to convert from Avro when writing Hudi data, and save the data with Parquet's basic API?
Right now, the main problem is that some data types cannot be converted correctly after going through Avro. That problem might go away if the data were stored without Avro.

@@ -77,6 +79,10 @@
<pattern>com.esotericsoftware.minlog.</pattern>
<shadedPattern>org.apache.hudi.com.esotericsoftware.minlog.</shadedPattern>
</relocation>
<relocation>
<pattern>org.apache.avro.</pattern>
<shadedPattern>org.apache.hudi.org.apache.avro.</shadedPattern>
Review comment (Contributor):

Add one more property and use it here:

mr.bundle.avro.shade.prefix
${mr.bundle.avro.shade.prefix}org.apache.avro.

The default value should be the empty string:
<mr.bundle.avro.shade.prefix></mr.bundle.avro.shade.prefix>

and in the profile activation step (shade-avro), set it to:
<mr.bundle.avro.shade.prefix>org.apache.hudi.</mr.bundle.avro.shade.prefix>

Can you try this setup and check?
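
Concretely, the relevant pieces of the mr-bundle pom would then look roughly like this (sketch only, keeping the existing avro.scope property name):

    <properties>
      <avro.scope>provided</avro.scope>
      <!-- empty prefix => the relocation below rewrites org.apache.avro. to itself, i.e. a no-op -->
      <mr.bundle.avro.shade.prefix></mr.bundle.avro.shade.prefix>
    </properties>

    <!-- inside the maven-shade-plugin relocations -->
    <relocation>
      <pattern>org.apache.avro.</pattern>
      <shadedPattern>${mr.bundle.avro.shade.prefix}org.apache.avro.</shadedPattern>
    </relocation>

    <profiles>
      <profile>
        <id>shade-avro</id>
        <properties>
          <!-- bundle Avro and actually relocate it -->
          <avro.scope>compile</avro.scope>
          <mr.bundle.avro.shade.prefix>org.apache.hudi.</mr.bundle.avro.shade.prefix>
        </properties>
      </profile>
    </profiles>

That way the relocation section can stay in place unconditionally, and the profile only flips the two properties.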

@bvaradar (Contributor)

bvaradar commented Oct 1, 2019


@umehrot2: This can be fixed by adding another Maven property; I have explained it in the review comments above. Can you please try it and see if it works?

umehrot2 closed this on Oct 11, 2019
umehrot2 deleted the umehrot2-avro-shade branch on October 11, 2019 19:58
@bvaradar (Contributor)

@umehrot2: Trying to understand why this was closed. Is this no longer needed?

@umehrot2 (Contributor, Author)

@umehrot2: Trying to understand why this was closed. Is this no longer needed?

@bvaradar It was by accident. I saw that there were merge conflicts, so I thought I would delete my branch, fix the conflicts, and push again. But that closed this PR instead, and now the Reopen functionality does not seem to work either. I might have to create a new PR :(

BTW, your review comments make sense, and I will address them.
