Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HUDI-7005] Fix hudi-aws-bundle relocation issue with avro #9946

Merged
merged 5 commits into from
Nov 4, 2023
Merged

[HUDI-7005] Fix hudi-aws-bundle relocation issue with avro #9946

merged 5 commits into from
Nov 4, 2023

Conversation

PrabhuJoseph
Copy link
Contributor

Change Logs

Relocates org.apache.avro to org.apache.hudi.org.apache.avro while shading hudi-common classes in hudi-hive-sync-bundle and hudi-aws-bundle to avoid conflicts while working with hudi-flink-bundle and any other bundle where it is relocated already.

Impact

Flink SQL Client Fixes.

Risk level (write none, low medium or high below)

low

Documentation Update

NA

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

Signed-off-by: Prabhu Joseph <joprabhu@amazon.com>
@PrabhuJoseph PrabhuJoseph changed the title HUDI-7005: Fix hudi-aws-bundle relocation issue with avro [HUDI-7005] Fix hudi-aws-bundle relocation issue with avro Oct 30, 2023
@PrabhuJoseph
Copy link
Contributor Author

@danny0405 Could you review this patch when you get time. Thanks.

@PrabhuJoseph
Copy link
Contributor Author

@umehrot2 has reviewed the patch and found a problem in relocating avro classes in the hudi-aws-bundle. It will affect the hudi-spark bundle as which has not relocated avro. The right fix would be to remove including hudi-common classes from hudi-aws-bundle and hudi-aws classes from hudi-flink-bundle. I will make a new revision. Thanks.

@@ -84,7 +84,6 @@
<include>org.apache.hudi:hudi-sync-common</include>
<include>org.apache.hudi:hudi-hadoop-mr</include>
<include>org.apache.hudi:hudi-timeline-service</include>
<include>org.apache.hudi:hudi-aws</include>

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flink code HiveSyncContext is dependent on AwsGlueCatalogSyncTool.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good Point. The idea is to have hudi-aws classes in the hudi-aws bundle. hudi-flink-bundle won't have any hudi-aws classes to avoid the below issue. We will ask users to place both hudi-flink-bundle and hudi-aws-bundle in the Flink/lib. Do you have any concerns with this approach?

Issue when hudi-flink-bundle includes hudi-aws

2023-10-07 14:47:03,463 ERROR org.apache.hudi.sink.StreamWriteOperatorCoordinator          [] - Executor executes action [sync hive metadata for instant 20231007144701183] error
java.lang.NoClassDefFoundError: software/amazon/awssdk/services/glue/model/EntityNotFoundException
    at org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool.initSyncClient(AwsGlueCatalogSyncTool.java:52) ~[hudi-flink-bundle.jar:0.13.1-amzn-1]
    at org.apache.hudi.hive.HiveSyncTool.<init>(HiveSyncTool.java:114) ~[hudi-flink-bundle.jar:0.13.1-amzn-1]

This issue happens as AwsGlueCatalogSyncTool (hudi-aws module) part of hudi-flink-bundle has not relocated AWS SDK classes whereas the hudi-aws-bundle (which includes AWS SDK) has relocated it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then what is the suggested way to load the AwsGlueCatalogSyncTool, maybe by reflection?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I also think reflection will be a better choice.

  1. If the user configures "hive_sync.mode" to "hms", then only hudi-flink-bundle is needed.
  2. If the user configures "hive_sync.mode" to "glue", then both hudi-flink-bundle and hudi-aws-bundle is needed. This needs to be documented here.

I see ReflectionUtils and SyncUtilHelpers have provided ways to easily load the class. I will use them and submit a new revision if you are also fine with this discussed approach.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me.

Copy link
Contributor

@danny0405 danny0405 Nov 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix is to remove including hudi-aws from hudi-flink-bundle. hudi-flink-bundle need not bring hudi-aws classes as hudi-aws-bundle jar can be used instead at runtime.

Is this the recommended or standard style how engine bundle jar should be used when AWS related function is turned on? cc @umehrot2

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the hudi-aws-bundle needs to be used to access the features present in hudi-aws and its aws dependencies.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine with it, please keep @umehrot2 as noted.

@hudi-bot
Copy link

hudi-bot commented Nov 3, 2023

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@danny0405 danny0405 merged commit b14f9e4 into apache:master Nov 4, 2023
26 of 28 checks passed
@PrabhuJoseph
Copy link
Contributor Author

Thanks, @danny0405, for the review and commit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants