
[SUPPORT] Should we shade all aws dependencies to avoid class conflicts? #4474

Closed
boneanxs opened this issue Dec 30, 2021 · 16 comments · Fixed by #4542
Labels
aws-support · priority:critical (production down; pipelines stalled; need help asap)

Comments

@boneanxs
Contributor

Since we introduced support for the DynamoDB-based lock in HUDI-2314, can we shade all AWS dependencies in all our bundled jars (Spark, Flink)? Many users import their own AWS packages while also using Hudi, which can cause class conflicts like the following error:

java.lang.NoSuchMethodError: com.amazonaws.http.HttpResponse.getHttpRequest()Lcom/amazonaws/thirdparty/apache/http/client/methods/HttpRequestBase;
	at com.amazonaws.services.s3.internal.S3ObjectResponseHandler.abortableIs(S3ObjectResponseHandler.java:64)
	at com.amazonaws.services.s3.internal.S3ObjectResponseHandler.handle(S3ObjectResponseHandler.java:57)

I'm not sure whether shading these jars will introduce other issues or not. @zhedoubushishi Can you take a look at this issue?

@LuPan2015

I seem to have encountered the same problem: #4475. Is there a good way to solve it quickly?

@boneanxs
Contributor Author

For our internal Hudi version we shade the AWS dependencies; you can add a new relocation and build a new bundle package.

For example, to shade the AWS dependencies in the Spark bundle, add the following to packaging/hudi-spark-bundle/pom.xml:

<!-- line 185-->
<relocation>
 <pattern>com.amazonaws.</pattern>
 <shadedPattern>${spark.bundle.spark.shade.prefix}com.amazonaws.</shadedPattern>
</relocation>
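
For context, a `<relocation>` entry like the one above belongs inside the maven-shade-plugin's `<relocations>` list. A minimal sketch of the surrounding plugin configuration (the element layout follows the standard shade-plugin schema; the exact plugin version and the neighboring relocations in Hudi's pom may differ):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <relocations>
      <!-- Rewrites com.amazonaws.* references inside the bundle jar -->
      <relocation>
        <pattern>com.amazonaws.</pattern>
        <shadedPattern>${spark.bundle.spark.shade.prefix}com.amazonaws.</shadedPattern>
      </relocation>
      <!-- ...the bundle's existing relocations... -->
    </relocations>
  </configuration>
</plugin>
```

After rebuilding, the bundle's own copies of the AWS classes live under the shaded prefix, so a user-supplied aws-java-sdk on the classpath no longer collides with them.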

@LuPan2015

LuPan2015 commented Dec 30, 2021

I tried it, but the following exception was still thrown.

spark/bin/spark-sql --packages org.apache.hadoop:hadoop-aws:3.2.0,com.amazonaws:aws-java-sdk:1.12.22 --jars hudi-spark3-bundle_2.12-0.11.0-SNAPSHOT.jar \
   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
   --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/12/30 13:46:32 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
21/12/30 13:46:32 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
21/12/30 13:46:34 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0
21/12/30 13:46:34 WARN ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore lupan@127.0.1.1
Spark master: local[*], Application Id: local-1640843189215
spark-sql> create table default.hudi_mor_s32 (
         >   id bigint,
         >   name string,
         >   dt string
         > ) using hudi
         > tblproperties (
         >   type = 'mor',
         >   primaryKey = 'id'
         >  )
         > partitioned by (dt)
         > location 's3a://iceberg-bucket/hudi-warehouse/';
ANTLR Tool version 4.7 used for code generation does not match the current runtime version 4.8
ANTLR Tool version 4.7 used for code generation does not match the current runtime version 4.8
21/12/30 13:46:45 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
Error in query: Specified schema in create table statement is not equal to the table schema.You should not specify the schema for an exist table: `default`.`hudi_mor_s32`

@boneanxs
Contributor Author

Error in query: Specified schema in create table statement is not equal to the table schema.You should not specify the schema for an exist table: default.hudi_mor_s32

That's not the same exception?

@LuPan2015

Yes, but it works fine.
Next I need to store the metadata in Glue.
Thanks.

@boneanxs
Contributor Author

boneanxs commented Jan 4, 2022

@xushiyan @zhedoubushishi Looks like this is a common case for many users. Does shading these have any other side effects? If not, I'm willing to raise a ticket to solve this.

@nsivabalan added the priority:critical (production down; pipelines stalled; need help asap) label Jan 4, 2022
@codope
Member

codope commented Jan 4, 2022

@boneanxs Shading is fine. Do consider adding a new profile so that users can build according to their use case.
cc @umehrot2 for more inputs.
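
Following up on the profile idea: since the relocation keys off a Maven property, a build profile could supply the prefix so AWS shading becomes opt-in. A sketch only (the profile id and property name here are hypothetical, not taken from Hudi's pom):

```xml
<!-- Default: empty prefix, so the AWS relocation is effectively a no-op. -->
<properties>
  <aws.shade.prefix/>
</properties>

<!-- Activate AWS shading with: mvn clean package -Pshade-aws -->
<profiles>
  <profile>
    <id>shade-aws</id>
    <properties>
      <aws.shade.prefix>org.apache.hudi.shaded.</aws.shade.prefix>
    </properties>
  </profile>
</profiles>
```

The relocation's `<shadedPattern>` would then reference `${aws.shade.prefix}com.amazonaws.` instead of a hard-coded prefix, mirroring how the bundle's existing `spark.bundle.spark.shade.prefix` property works.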

@xushiyan
Member

xushiyan commented Jan 4, 2022

@boneanxs +1 for shading. We already do that for other dependencies (global-search "shade-unbundle"). Feel free to file a JIRA and a patch. Thanks! @umehrot2 @zhedoubushishi please chime in.

@boneanxs
Contributor Author

boneanxs commented Jan 4, 2022

After some investigation, I'm curious why the Flink bundle removed the amazonaws packages in HUDI-2803 (the JIRA doesn't have much description) and now only includes hudi-aws as of PR 4127. Can we follow Flink and do the same for spark-bundle and utilities-bundle? (I'm afraid the Flink bundle might be missing some dependencies when using the DynamoDB-based lock.)

@codope @xushiyan Also, could you please add me as a contributor to the project? I cannot assign myself. My username is: Bone An. Thanks in advance.

@a0x

a0x commented Jan 5, 2022

Hi, I solved my issue by removing the AWS deps in the pom file. #4442
What's more, I found that in Hudi 0.8 there are no AWS deps in the same pom file.

Are there any plans to add AWS deps to support more functions in the future, or why were these added in the first place?

@xushiyan
Member

xushiyan commented Jan 5, 2022

@boneanxs @a0x Thanks for sharing the info and ideas. I've filed https://issues.apache.org/jira/browse/HUDI-3157
I'll defer to @zhedoubushishi to give some guidance from aws :)

@zhedoubushishi
Contributor

Thanks for bringing up this issue. My initial idea is to relocate the aws jars with a Hudi prefix to avoid jar conflicts.

If we just directly remove the AWS jars from the bundles, then users need to manually pass the AWS jars on the Spark/Flink classpath when using the AWS DynamoDB/CloudWatch features.
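
Concretely, if the bundles stopped carrying the AWS classes, a user enabling the DynamoDB lock or CloudWatch metrics would have to supply the SDK themselves, roughly like this (a sketch only; artifact versions are illustrative and should match your environment):

```sh
# Pass the AWS SDK pieces explicitly when the bundle no longer carries them.
spark-sql \
  --packages com.amazonaws:aws-java-sdk-dynamodb:1.12.22,com.amazonaws:aws-java-sdk-cloudwatch:1.12.22 \
  --jars hudi-spark3-bundle_2.12-0.11.0-SNAPSHOT.jar \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
```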

@boneanxs
Contributor Author

boneanxs commented Jan 5, 2022

Looks like flink-bundle already removed this... HUDI-2803

@xushiyan
Member

xushiyan commented Jan 7, 2022

After some discussions, we think we should keep cloud providers' jars out of the open source bundle jars. Any cloud provider can create its own specific Hudi module and bundle jars (like hudi-aws and hudi-spark-aws-bundle, for example), but the open source bundle jars should stay neutral. cc @danny0405 @nsivabalan @codope @vinothchandar @zhedoubushishi @umehrot2

I've pivoted this ticket to removing the bundled deps to align with the flink-bundle changes. https://issues.apache.org/jira/browse/HUDI-3157

If we just directly remove the AWS jars from the bundles, then users need to manually pass the AWS jars on the Spark/Flink classpath when using the AWS DynamoDB/CloudWatch features.

@zhedoubushishi To make the bundle a bit easier for users, as I suggested above, please consider adding an AWS-specific Hudi bundle to resolve the dependency problem. Hope this aligns with your thoughts too.

@nsivabalan
Contributor

Closing the GitHub issue as we have a tracking JIRA. Thank you folks for chiming in.

@parisni
Contributor

parisni commented Jan 10, 2022

For our internal Hudi version we shade the AWS dependencies; you can add a new relocation and build a new bundle package.

For example, to shade the AWS dependencies in the Spark bundle, add the following to packaging/hudi-spark-bundle/pom.xml:

<!-- line 185-->
<relocation>
 <pattern>com.amazonaws.</pattern>
 <shadedPattern>${spark.bundle.spark.shade.prefix}com.amazonaws.</shadedPattern>
</relocation>

Could this be included in the future separate AWS bundle?

@xushiyan xushiyan linked a pull request Jan 10, 2022 that will close this issue