[HUDI-268] Shade and relocate Avro dependency in hadoop-mr-bundle #915
Conversation
My question to you is: can Hive 2.3.5, as is, support Avro tables (not Parquet) that have logical types? If yes, we can look into what we can do to get parity. If you still think we should do this, can we control bundling by making the shading configurable?
Also, please prefix the PR with the JIRA it is related to :)
@vinothchandar I don't think we need to be concerned about Hive 2.3.5 being able to support Avro tables having logical types. If this were a problem, it should exist even now. Spark 2.4.3 supports a higher version of Avro and handles logical types by converting them to fixed-length byte arrays. On the Hive 2.3.5 side, I believe it will try to convert these fixed-length byte arrays back to its own decimal type; it should not necessarily have to understand LogicalType (if I understand correctly). The problem is that we are already bundling parquet-avro within the bundle jars, which makes it really difficult to upgrade the Parquet version. I think Hudi should strive to work with its own versions of Parquet/Avro irrespective of the consuming application. This particular change should at least make the Avro version used by Hudi common with that of Spark, and we can claim to always compile Hudi with the version of Spark that is actually writing the dataset. If you are not confident about this change, I can definitely make it configurable like you said. But on the EMR side we will have to maintain this to be able to support Hudi with Spark 2.4.3 and Hive 2.3.5.
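For reference, a decimal encoded as a fixed-length byte array carrying a decimal logical type looks roughly like this in an Avro schema (the field name, size, precision, and scale below are illustrative, not taken from this thread):

```json
{
  "type": "fixed",
  "name": "price",
  "size": 16,
  "logicalType": "decimal",
  "precision": 38,
  "scale": 10
}
```

An Avro 1.7.7 reader treats this simply as a 16-byte fixed field and ignores the unrecognized logicalType attribute, which is consistent with the point above that Hive does not necessarily have to understand LogicalType itself.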
Done. Thanks!
I think we differ here. Speaking from the experience of trying to do so, we ran into multiple issues with that approach.
With Avro 1.7.7 and Spark 2.1, I think that's where we were at. Bundling Avro 1.7.7 was the problem, since we were stuck with it even on higher Spark versions.
For now, I would recommend doing that, and we can document build instructions for different Spark/Hive combinations. We can also maintain and evolve these in the Hudi project itself.
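Such documented build instructions could look roughly like the following (a sketch only; the property names are assumptions and are not flags confirmed anywhere in this thread):

```sh
# Illustrative only: build Hudi against a newer Spark/Avro combination
# by overriding version properties, assuming the poms expose them.
mvn clean package -DskipTests \
    -Dspark.version=2.4.3 \
    -Davro.version=1.8.2
```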
@bvaradar what do you think? @umehrot2's point is that since we need parquet-avro, we have to either
But the logical type handling is resolved in a different manner by Spark's Hive registration (that's what I understood from the explanation above), and the actual fix may be to mimic that?
@umehrot2 Shading Avro will cause some Realtime Table use cases to break. This was one of the reasons why we ended up not using this approach. Hudi allows for pluggable record payload implementations. HoodieRecordPayload (a public-facing interface) exposes Avro GenericRecord types as part of its interface. We have some deployments where custom implementations of this interface (which reside in a different jar) are used to perform on-the-fly merges for Realtime Table reading. If we make shading Avro mandatory for hoodie-hadoop-mr-table, then these plugins also need to shade Avro in the same way. I think keeping the bundling optional (default = do not bundle) would be better. @vinothchandar: assuming this is a one-off case, if it makes things easier for everyone, would it make sense to publish a new jar type (hudi-hadoop-mr-bundle-with-avro-shaded) along with hudi-hadoop-mr-bundle?
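If a second, avro-shaded artifact were published, one way to do it would be via the maven-shade-plugin's attached-artifact support, sketched below (the classifier name is hypothetical; this is not the configuration from the PR):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <!-- attach the shaded jar as a secondary artifact instead of replacing the main one -->
    <shadedArtifactAttached>true</shadedArtifactAttached>
    <!-- hypothetical classifier for the avro-shaded variant -->
    <shadedClassifierName>avro-shaded</shadedClassifierName>
  </configuration>
</plugin>
```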
@umehrot2 are you okay with generating an additional jar?
@bvaradar @vinothchandar I can make changes to create another jar. However, based on @bvaradar's concerns, I am now skeptical whether just shading Avro in the bundle is the right approach.
@umehrot2: Unfortunately, the examples that I mentioned are not in the open-source world. If you want to reproduce it, a high-level direction would be
@umehrot2 I am not sure if you want to spend a lot of time reproducing this. Maybe I can summarize the implications already.
So it seems like we are choosing between custom payloads and logical type support. In the short term, I'd still vote for making the shading controllable via a -D flag and getting all the data types to work, then look into how to untangle the shading aspects in a follow-on. By no means am I trying to pick one over the other; I'm just trying to first get Spark 2.4 and all the data types working.
@vinothchandar @bvaradar Thank you for providing the steps and implications of this change. I think on our side, we should be fine if custom payload implementations require customers to do this additional shading. I will go ahead with creating another jar.
@vinothchandar @bvaradar updated the PR.
Just a few nits. But the integration test seems to fail? Ideas?
@@ -30,6 +30,7 @@
    <checkstyle.skip>true</checkstyle.skip>
    <notice.dir>${project.basedir}/src/main/resources/META-INF</notice.dir>
    <notice.file>HUDI_NOTICE.txt</notice.file>
    <avro.scope>provided</avro.scope>
nit: rename to mr.bundle.avro.scope? In case we decide on this strategy for other bundles too.
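For context, a property like the <avro.scope> one added in the diff above would typically be consumed by the Avro dependency declaration, roughly as follows (the actual dependency block from the PR is not shown in this thread):

```xml
<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro</artifactId>
  <!-- "provided" by default keeps Avro out of the bundle;
       a profile can override the property to "compile" to bundle it -->
  <scope>${avro.scope}</scope>
</dependency>
```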
<profiles>
  <profile>
    <id>shade-avro</id>
Same here: have the bundle name in the profile id? And probably move this to the root pom, maybe as <id>aws-emr-profile</id>; that way you can control other overrides as well.
@vinothchandar I noticed an issue with this approach today, and that is probably why the integration tests are failing. Even when we build without the profile, the relocation block still applies. Essentially, I am not able to find a good way to activate/deactivate the shading. The only way I can think of now is adding the whole plugin configuration inside the profile. Have you achieved anything like this in the past?
I found that some code does not use Avro to process data structures when doing stream processing. Can we refer to that part of the code to convert from Avro when writing Hudi data, and save the data with Parquet's basic API?
@@ -77,6 +79,10 @@
            <pattern>com.esotericsoftware.minlog.</pattern>
            <shadedPattern>org.apache.hudi.com.esotericsoftware.minlog.</shadedPattern>
          </relocation>
          <relocation>
            <pattern>org.apache.avro.</pattern>
            <shadedPattern>org.apache.hudi.org.apache.avro.</shadedPattern>
Add one more property, mr.bundle.avro.shade.prefix, and use it here: ${mr.bundle.avro.shade.prefix}org.apache.avro. The default value should be the empty string, <mr.bundle.avro.shade.prefix></mr.bundle.avro.shade.prefix>, and in the profile activation step (shade-avro), set it to <mr.bundle.avro.shade.prefix>org.apache.hudi.</mr.bundle.avro.shade.prefix>. Can you try this setup and check?
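The suggestion above could fit together roughly as follows; this is a sketch of the separate pom pieces, not the literal configuration merged in the PR, and the element placement is illustrative:

```xml
<!-- in <properties>: empty by default, so the relocation below is a no-op -->
<mr.bundle.avro.shade.prefix></mr.bundle.avro.shade.prefix>

<!-- in <profiles>: activating shade-avro turns the relocation on -->
<profile>
  <id>shade-avro</id>
  <properties>
    <mr.bundle.avro.shade.prefix>org.apache.hudi.</mr.bundle.avro.shade.prefix>
  </properties>
</profile>

<!-- inside the maven-shade-plugin configuration: with an empty prefix,
     org.apache.avro. is rewritten to itself, i.e. left unchanged -->
<relocation>
  <pattern>org.apache.avro.</pattern>
  <shadedPattern>${mr.bundle.avro.shade.prefix}org.apache.avro.</shadedPattern>
</relocation>
```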
@umehrot2: This can be fixed by adding another Maven property; I have explained it in the comments. Can you please try it and see if it works?
@umehrot2: Trying to understand why this was closed. Is this no longer needed?
@bvaradar It was an accident. I saw that there were merge conflicts, so I thought I would delete my branch, fix the conflicts, and push again. But that closed this PR instead, and now the Reopen functionality does not seem to work either. I might have to create a new PR :( BTW, your comments make sense, and I will update accordingly.
As of now, Hudi depends on Parquet 1.8.1 and Avro 1.7.7, which might work fine for older versions of Spark and Hive.
But when we build it with Spark 2.4.3, which uses Parquet 1.10.1 and Avro 1.8.2, using:
We run into a runtime issue on Hive 2.3.5 when querying RT tables:
This is happening because we are shading parquet-avro, which is now 1.10.1, and it requires Avro 1.8.2, which has the LogicalType class. However, Hive 2.3.5 has Avro 1.7.7 available at runtime, which does not have the LogicalType class.
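The failure mode described above is a class missing at runtime, which can be demonstrated with a plain classpath probe. This helper is not part of Hudi; it just illustrates the mechanism:

```java
// Probe whether a class is loadable on the current classpath.
public class AvroClassProbe {
    /** Returns true if the named class can be loaded, false otherwise. */
    public static boolean hasClass(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // org.apache.avro.LogicalType exists in Avro 1.8.x but not 1.7.7,
        // so on a Hive 2.3.5 classpath this probe would report false.
        System.out.println(hasClass("org.apache.avro.LogicalType"));
    }
}
```

When Hudi's shaded parquet-avro 1.10.1 tries to touch LogicalType on such a classpath, the JVM raises the corresponding linkage error instead of returning false, because the reference is resolved eagerly rather than probed.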
To avoid these scenarios, and to at least allow usage of higher versions of Spark without affecting Hive integrations, we propose the following:
This will also help with our other issues, where we want to upgrade to Spark 2.4 and also deprecate the use of databricks-avro. Thoughts?