[HUDI-4647] Keep the hive sync settings in spark sql consistent #6448

dongkelun · 2022-08-19T08:47:24Z

By default, Hive sync fails with the error below. When using Spark SQL, it is troublesome to fill in the JDBC URL parameter every time.

Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Cannot create hive connection jdbc:hive2://localhost:10000/
        at org.apache.hudi.hive.ddl.JDBCExecutor.createHiveConnection(JDBCExecutor.java:107)
        at org.apache.hudi.hive.ddl.JDBCExecutor.<init>(JDBCExecutor.java:59)
        at org.apache.hudi.hive.HoodieHiveSyncClient.<init>(HoodieHiveSyncClient.java:91)
        ... 48 more
Caused by: java.sql.SQLException: Could not open client transport with JDBC Uri: jdbc:hive2://localhost:10000: java.net.ConnectException: Connection refused (Connection refused)
        at org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:234)
        at org.apache.hive.jdbc.HiveConnection.<init>(HiveConnection.java:178)
        at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
        at java.sql.DriverManager.getConnection(DriverManager.java:664)
        at java.sql.DriverManager.getConnection(DriverManager.java:247)
        at org.apache.hudi.hive.ddl.JDBCExecutor.createHiveConnection(JDBCExecutor.java:104)
        ... 50 more
Caused by: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused (Connection

Change Logs

Keep the hive sync settings in spark sql consistent

Impact

Keep the hive sync settings in spark sql consistent

Risk level (write none, low medium or high below)

none

Contributor's checklist

Read through contributor's guide
Change Logs and Impact were stated clearly
Adequate tests were added if applicable
CI passed

dongkelun · 2022-08-21T10:04:48Z

@hudi-bot run azure

yihua · 2022-09-06T01:54:12Z

...udi-spark/src/main/scala/org/apache/spark/sql/hudi/command/MergeIntoHoodieTableCommand.scala

@@ -495,7 +496,7 @@ case class MergeIntoHoodieTableCommand(mergeInto: MergeIntoTable) extends Hoodie
        KEYGENERATOR_CLASS_NAME.key -> classOf[SqlKeyGenerator].getCanonicalName,
        SqlKeyGenerator.ORIGIN_KEYGEN_CLASS_NAME -> tableConfig.getKeyGeneratorClassName,
        HoodieSyncConfig.META_SYNC_ENABLED.key -> enableHive.toString,
-        HiveSyncConfigHolder.HIVE_SYNC_MODE.key -> hiveSyncConfig.getString(HiveSyncConfigHolder.HIVE_SYNC_MODE),
+        HiveSyncConfigHolder.HIVE_SYNC_MODE.key -> hiveSyncConfig.getStringOrDefault(HiveSyncConfigHolder.HIVE_SYNC_MODE, HiveSyncMode.HMS.name()),


Should we just add the default value below inside the HiveSyncConfigHolder class?

public static final ConfigProperty<String> HIVE_SYNC_MODE = ConfigProperty
    .key("hoodie.datasource.hive_sync.mode")
    .noDefaultValue()
    .withDocumentation("Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql.");

cc @xushiyan

dongkelun · 2022-09-06

That would change the global default value, which could affect many modules. I am not sure whether it is reasonable for every module. This PR only aims to change the default value used by Spark SQL.
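The trade-off discussed above can be sketched as follows. This is a minimal illustration, not Hudi's implementation: the class `SyncModeDefaults` and its helper methods are hypothetical, a plain `Map` stands in for the real config object, and only the key `hoodie.datasource.hive_sync.mode` comes from the thread. It shows a fallback applied at one call site (Spark SQL) while other callers of the same property still see no default.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a call-site default versus a property with no global default.
public class SyncModeDefaults {
    // Key taken from the HIVE_SYNC_MODE definition quoted in the review.
    public static final String HIVE_SYNC_MODE_KEY = "hoodie.datasource.hive_sync.mode";

    // Stand-in for a lookup on a property declared with no default: null when unset.
    public static String getString(Map<String, String> conf, String key) {
        return conf.get(key);
    }

    // Stand-in for a lookup with a caller-supplied fallback.
    public static String getStringOrDefault(Map<String, String> conf, String key, String fallback) {
        return conf.getOrDefault(key, fallback);
    }

    // Spark SQL call site: falls back to "hms" only here.
    public static String syncModeForSparkSql(Map<String, String> conf) {
        return getStringOrDefault(conf, HIVE_SYNC_MODE_KEY, "hms");
    }

    // Any other module: unaffected, still sees no default.
    public static String syncModeElsewhere(Map<String, String> conf) {
        return getString(conf, HIVE_SYNC_MODE_KEY);
    }

    public static void main(String[] args) {
        Map<String, String> unset = new HashMap<>();
        System.out.println(syncModeForSparkSql(unset)); // hms
        System.out.println(syncModeElsewhere(unset));   // null

        Map<String, String> explicit = new HashMap<>();
        explicit.put(HIVE_SYNC_MODE_KEY, "jdbc");
        System.out.println(syncModeForSparkSql(explicit)); // jdbc (explicit setting wins)
    }
}
```

Putting the default on the property definition instead would change the result of `syncModeElsewhere` for every caller, which is the wider impact the reply is concerned about.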

xushiyan

@dongkelun thanks for the patch. There are 2 reasons why we have to close this for now:

1. We should keep the default value consistent across different scenarios.
2. We don't want to introduce breaking changes (unless there is a strong reason) until 1.0, when we batch these breaking changes together.

Hence I'm tracking the tasks here: https://issues.apache.org/jira/browse/HUDI-5062

dongkelun · 2022-10-19T13:41:23Z

OK, got it.
One thing I want to point out is that in previous versions of mergeInto, the default value of HIVE_SYNC_MODE was HMS. In other SQL statements, such as insert and update, the default value is also HMS.

xushiyan · 2022-10-20T04:17:47Z

@dongkelun OK, in this case it's a different story. We should keep it aligned for all SQL scenarios. I'm reopening this PR. Can you please repurpose this PR to move org.apache.spark.sql.hudi.command.MergeIntoHoodieTableCommand#buildMergeIntoConfig into org.apache.spark.sql.hudi.ProvidesHoodieConfig? We should fix the sync mode and make all Hive sync settings aligned with the others wherever applicable.
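The refactoring requested above can be pictured with a small sketch. It is an assumption-laden Java analog, not the actual Scala code: `SharedSyncConfig`, `ProvidesSyncConfig`, `MergeIntoCommand`, and `InsertIntoCommand` are hypothetical names, and only the key `hoodie.datasource.hive_sync.mode` and the method name `buildHiveSyncConfig` appear in the thread. The idea is that every SQL command resolves its Hive sync settings through one shared builder, so the default cannot drift between commands.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: all commands share one builder for Hive sync settings.
public class SharedSyncConfig {
    public static final String SYNC_MODE_KEY = "hoodie.datasource.hive_sync.mode";

    // Java analog of a shared config trait; the name is an assumption.
    public interface ProvidesSyncConfig {
        default Map<String, String> buildHiveSyncConfig(Map<String, String> userOptions) {
            Map<String, String> resolved = new HashMap<>();
            resolved.put(SYNC_MODE_KEY, "hms"); // the default lives in exactly one place
            resolved.putAll(userOptions);       // explicit user options override it
            return resolved;
        }
    }

    // Both commands delegate to the shared builder instead of building their own map.
    public static class MergeIntoCommand implements ProvidesSyncConfig {
        public Map<String, String> buildConfig(Map<String, String> userOptions) {
            return buildHiveSyncConfig(userOptions);
        }
    }

    public static class InsertIntoCommand implements ProvidesSyncConfig {
        public Map<String, String> buildConfig(Map<String, String> userOptions) {
            return buildHiveSyncConfig(userOptions);
        }
    }

    public static void main(String[] args) {
        Map<String, String> noOptions = new HashMap<>();
        String mergeMode = new MergeIntoCommand().buildConfig(noOptions).get(SYNC_MODE_KEY);
        String insertMode = new InsertIntoCommand().buildConfig(noOptions).get(SYNC_MODE_KEY);
        System.out.println(mergeMode.equals(insertMode)); // true
    }
}
```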

dongkelun · 2022-10-20T05:44:09Z

@xushiyan OK, I will try my best.

xushiyan · 2022-10-20T09:09:33Z

...source/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/ProvidesHoodieConfig.scala

+    Map(
+      HoodieSyncConfig.META_SYNC_ENABLED.key -> hiveSyncConfig.getString(HoodieSyncConfig.META_SYNC_ENABLED.key),
+      HiveSyncConfigHolder.HIVE_SYNC_ENABLED.key -> hiveSyncConfig.getString(HiveSyncConfigHolder.HIVE_SYNC_ENABLED.key),
+      HiveSyncConfigHolder.HIVE_SYNC_MODE.key -> hiveSyncConfig.getString(HiveSyncConfigHolder.HIVE_SYNC_MODE),


Since most SQL scenarios already use HMS mode, we should ensure the existing SQL behavior is not affected. So we need to keep it as HMS, and the behavior changes only for merge into (which is acceptable for consistency reasons).

dongkelun · 2022-10-20

@xushiyan Its default value is already set in the buildHiveSyncConfig method; it was set in PR #6322. I will add comments to remind everyone to be careful when modifying it.

xushiyan · 2022-10-27T18:44:11Z

@dongkelun can you please rebase?

dongkelun · 2022-10-28T10:14:24Z

The job running on agent Azure Pipelines 8 ran longer than the maximum time of 150 minutes

hudi-bot · 2022-10-28T16:30:20Z

CI report:

c51f59c Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure: re-run the last Azure build

dongkelun · 2022-10-29T02:24:04Z

@xushiyan Hi, the CI has passed

dongkelun · 2022-11-01T06:05:55Z

@xushiyan Hello, can you please help review this?

dongkelun force-pushed the HUDI-4647 branch 3 times, most recently from ec9886f to b2c2844 Compare August 22, 2022 11:53

yihua added the meta-sync and priority:minor ("everything else; usability gaps; questions; feature reqs") labels Sep 6, 2022

yihua reviewed Sep 6, 2022

yihua assigned xushiyan Sep 6, 2022

xushiyan reviewed Oct 19, 2022

xushiyan closed this Oct 19, 2022

xushiyan reopened this Oct 20, 2022

[HUDI-4647] Keep the hive sync settings in spark sql consistent

56c859d

dongkelun force-pushed the HUDI-4647 branch from b2c2844 to 56c859d Compare October 20, 2022 08:50

dongkelun changed the title ~~[HUDI-4647] Change the default value of HIVE_SYNC_MODE in MergeInto to HMS~~ [HUDI-4647] Keep the hive sync settings in spark sql consistent Oct 20, 2022

xushiyan reviewed Oct 20, 2022

Add comments and modify the order of payloadClassName

2e3520f

dongkelun force-pushed the HUDI-4647 branch 2 times, most recently from 4644aaa to 1767369 Compare October 26, 2022 06:17

Merge branch 'master' of https://github.com/apache/hudi into HUDI-4647

36ea5a5

dongkelun force-pushed the HUDI-4647 branch from 1767369 to 36ea5a5 Compare October 27, 2022 00:57

dongkelun added 2 commits October 28, 2022 09:25

Merge branch 'master' of https://github.com/apache/hudi into HUDI-4647

55da037

Modify payloadClassName can be overwritten because the test case fails

c51f59c

dongkelun force-pushed the HUDI-4647 branch from 745b577 to c51f59c Compare October 28, 2022 10:14

nsivabalan added the release-0.12.2 ("Patches targeted for 0.12.2") label Dec 6, 2022

codope added the priority:major ("degraded perf; unable to move forward; potential bugs") label and removed the priority:minor and release-0.12.2 labels Dec 7, 2022

github-actions bot added the size:M ("PR with lines of changes in (100, 300]") label Feb 26, 2024
