Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HUDI-4647] Keep the hive sync settings in spark sql consistent #6448

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

dongkelun
Copy link
Contributor

@dongkelun dongkelun commented Aug 19, 2022

By default, an error will be reported when synchronizing hive. When using SQL, it is troublesome to fill in the JDBC URL parameter every time

Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Cannot create hive connection jdbc:hive2://localhost:10000/
        at org.apache.hudi.hive.ddl.JDBCExecutor.createHiveConnection(JDBCExecutor.java:107)
        at org.apache.hudi.hive.ddl.JDBCExecutor.<init>(JDBCExecutor.java:59)
        at org.apache.hudi.hive.HoodieHiveSyncClient.<init>(HoodieHiveSyncClient.java:91)
        ... 48 more
Caused by: java.sql.SQLException: Could not open client transport with JDBC Uri: jdbc:hive2://localhost:10000: java.net.ConnectException: Connection refused (Connection refused)
        at org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:234)
        at org.apache.hive.jdbc.HiveConnection.<init>(HiveConnection.java:178)
        at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
        at java.sql.DriverManager.getConnection(DriverManager.java:664)
        at java.sql.DriverManager.getConnection(DriverManager.java:247)
        at org.apache.hudi.hive.ddl.JDBCExecutor.createHiveConnection(JDBCExecutor.java:104)
        ... 50 more
Caused by: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused (Connection 

Change Logs

Keep the hive sync settings in spark sql consistent

Impact

Keep the hive sync settings in spark sql consistent

Risk level (write none, low medium or high below)

none

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@dongkelun
Copy link
Contributor Author

@hudi-bot run azure

@dongkelun dongkelun force-pushed the HUDI-4647 branch 3 times, most recently from ec9886f to b2c2844 Compare August 22, 2022 11:53
@yihua yihua added meta-sync priority:minor everything else; usability gaps; questions; feature reqs labels Sep 6, 2022
@@ -495,7 +496,7 @@ case class MergeIntoHoodieTableCommand(mergeInto: MergeIntoTable) extends Hoodie
KEYGENERATOR_CLASS_NAME.key -> classOf[SqlKeyGenerator].getCanonicalName,
SqlKeyGenerator.ORIGIN_KEYGEN_CLASS_NAME -> tableConfig.getKeyGeneratorClassName,
HoodieSyncConfig.META_SYNC_ENABLED.key -> enableHive.toString,
HiveSyncConfigHolder.HIVE_SYNC_MODE.key -> hiveSyncConfig.getString(HiveSyncConfigHolder.HIVE_SYNC_MODE),
HiveSyncConfigHolder.HIVE_SYNC_MODE.key -> hiveSyncConfig.getStringOrDefault(HiveSyncConfigHolder.HIVE_SYNC_MODE, HiveSyncMode.HMS.name()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we just add the default value below inside the HiveSyncConfigHolder class?

public static final ConfigProperty<String> HIVE_SYNC_MODE = ConfigProperty
      .key("hoodie.datasource.hive_sync.mode")
      .noDefaultValue()
      .withDocumentation("Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql.");

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we just add the default value below inside the HiveSyncConfigHolder class?

public static final ConfigProperty<String> HIVE_SYNC_MODE = ConfigProperty
      .key("hoodie.datasource.hive_sync.mode")
      .noDefaultValue()
      .withDocumentation("Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql.");

Here is to modify the global default value, which may affect a lot. I am not sure whether it is reasonable for each module. This PR only wants to change the default value corresponding to Spark SQL

Copy link
Member

@xushiyan xushiyan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dongkelun thanks for the patch. there are 2 reasons to why we have to close this for now:

  1. we should keep default value consistent for different scenarios
  2. we don't want to introduce breaking changes (unless with strong reason) until 1.0 when we batch these breakings together

hence i'm tracking the tasks here https://issues.apache.org/jira/browse/HUDI-5062

@xushiyan xushiyan closed this Oct 19, 2022
@dongkelun
Copy link
Contributor Author

@dongkelun thanks for the patch. there are 2 reasons to why we have to close this for now:

  1. we should keep default value consistent for different scenarios
  2. we don't want to introduce breaking changes (unless with strong reason) until 1.0 when we batch these breakings together

hence i'm tracking the tasks here https://issues.apache.org/jira/browse/HUDI-5062
OK. got it
One thing I want to say is that in previous versions of mergeInto, the default value of HIVE_SYNC_MODE is HMS. In other SQL statements, such as insert update, the default value is also HMS

@xushiyan
Copy link
Member

@dongkelun thanks for the patch. there are 2 reasons to why we have to close this for now:

  1. we should keep default value consistent for different scenarios
  2. we don't want to introduce breaking changes (unless with strong reason) until 1.0 when we batch these breakings together

hence i'm tracking the tasks here https://issues.apache.org/jira/browse/HUDI-5062
OK. got it
One thing I want to say is that in previous versions of mergeInto, the default value of HIVE_SYNC_MODE is HMS. In other SQL statements, such as insert update, the default value is also HMS

@dongkelun ok in this case it's a different story. we should keep it aligned for all sql scenarios. I'm re-openning this PR. Can you please re-purpose this PR to move org.apache.spark.sql.hudi.command.MergeIntoHoodieTableCommand#buildMergeIntoConfig into org.apache.spark.sql.hudi.ProvidesHoodieConfig? we should fix the sync mode and make all hive sync settings aligned with others wherever applicable

@xushiyan xushiyan reopened this Oct 20, 2022
@dongkelun
Copy link
Contributor Author

@dongkelun thanks for the patch. there are 2 reasons to why we have to close this for now:

  1. we should keep default value consistent for different scenarios
  2. we don't want to introduce breaking changes (unless with strong reason) until 1.0 when we batch these breakings together

hence i'm tracking the tasks here https://issues.apache.org/jira/browse/HUDI-5062
OK. got it
One thing I want to say is that in previous versions of mergeInto, the default value of HIVE_SYNC_MODE is HMS. In other SQL statements, such as insert update, the default value is also HMS

@dongkelun ok in this case it's a different story. we should keep it aligned for all sql scenarios. I'm re-openning this PR. Can you please re-purpose this PR to move org.apache.spark.sql.hudi.command.MergeIntoHoodieTableCommand#buildMergeIntoConfig into org.apache.spark.sql.hudi.ProvidesHoodieConfig? we should fix the sync mode and make all hive sync settings aligned with others wherever applicable
@xushiyan Ok, I will try my best

@dongkelun dongkelun changed the title [HUDI-4647] Change the default value of HIVE_SYNC_MODE in MergeInto to HMS [HUDI-4647] Keep the hive sync settings in spark sql consistent Oct 20, 2022
Map(
HoodieSyncConfig.META_SYNC_ENABLED.key -> hiveSyncConfig.getString(HoodieSyncConfig.META_SYNC_ENABLED.key),
HiveSyncConfigHolder.HIVE_SYNC_ENABLED.key -> hiveSyncConfig.getString(HiveSyncConfigHolder.HIVE_SYNC_ENABLED.key),
HiveSyncConfigHolder.HIVE_SYNC_MODE.key -> hiveSyncConfig.getString(HiveSyncConfigHolder.HIVE_SYNC_MODE),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since most of sql scenarios have already been using HMS mode, we should ensure the existing sql behavior is not affected. So we need to keep it HMS and only for merge into this is changed. (which is acceptable for consistency reason)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xushiyan Its default value has been set in the buildHiveSyncConfig method. It is set in this pr #6322. I will add comments remind everybody to be careful when modifying

@dongkelun dongkelun force-pushed the HUDI-4647 branch 2 times, most recently from 4644aaa to 1767369 Compare October 26, 2022 06:17
@xushiyan
Copy link
Member

@dongkelun can you please rebase?

@dongkelun
Copy link
Contributor Author

The job running on agent Azure Pipelines 8 ran longer than the maximum time of 150 minutes

@hudi-bot
Copy link

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@dongkelun
Copy link
Contributor Author

@dongkelun can you please rebase?

@xushiyan Hi, the CI has passed

@dongkelun
Copy link
Contributor Author

@xushiyan hello,can you please help me take a review?

@nsivabalan nsivabalan added the release-0.12.2 Patches targetted for 0.12.2 label Dec 6, 2022
@codope codope added priority:major degraded perf; unable to move forward; potential bugs and removed priority:minor everything else; usability gaps; questions; feature reqs release-0.12.2 Patches targetted for 0.12.2 labels Dec 7, 2022
@github-actions github-actions bot added the size:M PR with lines of changes in (100, 300] label Feb 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
meta-sync priority:major degraded perf; unable to move forward; potential bugs size:M PR with lines of changes in (100, 300]
Projects
Status: 🏗 Under discussion
Development

Successfully merging this pull request may close these issues.

None yet

6 participants