[HUDI-1194] Refactor HoodieHiveClient based on the way to call Hive API #1975

Closed

Conversation

zhedoubushishi
Contributor


What is the purpose of the pull request

JIRA https://issues.apache.org/jira/browse/HUDI-1194

Brief change log

Separate HoodieHiveClient into three classes:

  • HoodieHiveClient, which implements all the APIs through the Metastore API.
  • HoodieHiveJDBCClient, which extends HoodieHiveClient and overrides several of the APIs to go through Hive JDBC.
  • HoodieHiveDriverClient, which extends HoodieHiveClient and overrides several of the APIs to go through the Hive Driver.

Also introduce a new parameter, hoodie.datasource.hive_sync.hive_client_class, which lets you choose which Hive client class to use; see the sketch below.
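
A minimal usage sketch (assumes a Spark DataFrame df and a hypothetical base path; the class name comes from this PR, but its package and the other values here are illustrative, and the required Hudi write options such as record key and table name are omitted for brevity):

// Scala sketch: pick the Hive client implementation when writing a Hudi table.
// assumes: val df: org.apache.spark.sql.DataFrame
df.write.format("org.apache.hudi")
  .option("hoodie.datasource.hive_sync.enable", "true")
  .option("hoodie.datasource.hive_sync.hive_client_class",
    "org.apache.hudi.hive.HoodieHiveJDBCClient") // hypothetical fully-qualified name
  .mode("append")
  .save("/tmp/hudi/example_table") // hypothetical base path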

Verify this pull request

This change added tests and can be verified as follows:


  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

if (optParams(HIVE_USE_JDBC_OPT_KEY).equals("true")) {
  optParams ++ Map(HIVE_CLIENT_CLASS_OPT_KEY -> DEFAULT_HIVE_CLIENT_CLASS_OPT_VAL)
} else if (optParams(HIVE_USE_JDBC_OPT_KEY).equals("false")) {
  optParams ++ Map(HIVE_CLIENT_CLASS_OPT_KEY -> classOf[HoodieHiveDriverClient].getCanonicalName)
}
Contributor

Could you add a comment, either here or in HoodieHiveDriverClient, explaining why this is used when HIVE_USE_JDBC_OPT_KEY is false?

Contributor Author

Sure. Here I just want to keep the same behavior as before. Will add a comment.

@@ -94,7 +94,7 @@
 <joda.version>2.9.9</joda.version>
 <hadoop.version>2.7.3</hadoop.version>
 <hive.groupid>org.apache.hive</hive.groupid>
-<hive.version>2.3.1</hive.version>
+<hive.version>2.3.6</hive.version>
Contributor

Why are we updating Hive?

Contributor Author

When running Hive sync through Spark 2.x, Spark uses hive-spark (version 1.2.1-spark2) as the Hive dependency. So I need to make sure all the Hive APIs used in HoodieHiveClient are compatible with both hive-spark 1.2.1-spark2 and Hive 2.3.x.

For the alter_partition API, client.alter_partition(String, String, Partition), I couldn't find a signature compatible between Hive 2.3.1 and 1.2.1, but I could find one compatible with both Hive 2.3.6 and hive-spark 1.2.1.

So I am thinking of bumping the Hive version to 2.3.6. Is this acceptable to the community?
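
For context, a minimal sketch of the overload in question (grounded only in the signature quoted above; assumes an already-connected IMetaStoreClient, and the wrapper name is illustrative):

import org.apache.hadoop.hive.metastore.IMetaStoreClient
import org.apache.hadoop.hive.metastore.api.Partition

// Sketch: push an updated partition definition to the metastore via the
// alter_partition(String, String, Partition) overload discussed above.
def updatePartition(client: IMetaStoreClient, db: String, table: String, updated: Partition): Unit = {
  client.alter_partition(db, table, updated)
}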

@zhedoubushishi zhedoubushishi changed the title [HUDI-1194][WIP] Reorganize HoodieHiveClient based on the way to call Hive API [HUDI-1194][WIP] Refactor HoodieHiveClient based on the way to call Hive API Aug 27, 2020
@vinothchandar
Member

@umehrot2 to take a first pass as well.
@modi95 I am assuming you can review this, with Uber's setup in mind.

@vinothchandar vinothchandar removed their assignment Sep 8, 2020
@vinothchandar vinothchandar added the pr:wip Work in Progress/PRs label Oct 4, 2020
@zhedoubushishi zhedoubushishi changed the title [HUDI-1194][WIP] Refactor HoodieHiveClient based on the way to call Hive API [HUDI-1194] Refactor HoodieHiveClient based on the way to call Hive API Nov 14, 2020
@codecov-io

codecov-io commented Nov 16, 2020

Codecov Report

Merging #1975 (22757e3) into master (430d4b4) will decrease coverage by 0.01%.
The diff coverage is 35.29%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #1975      +/-   ##
============================================
- Coverage     53.54%   53.52%   -0.02%     
  Complexity     2770     2770              
============================================
  Files           348      348              
  Lines         16109    16120      +11     
  Branches       1643     1646       +3     
============================================
+ Hits           8626     8629       +3     
- Misses         6785     6792       +7     
- Partials        698      699       +1     
Flag                  Coverage Δ                Complexity Δ
hudicli               38.37% <ø> (ø)            0.00 <ø> (ø)
hudiclient            100.00% <ø> (ø)           0.00 <ø> (ø)
hudicommon            55.33% <ø> (ø)            0.00 <ø> (ø)
hudihadoopmr          32.94% <ø> (ø)            0.00 <ø> (ø)
hudispark             65.29% <35.29%> (-0.30%)  0.00 <0.00> (ø)
huditimelineservice   65.30% <ø> (ø)            0.00 <ø> (ø)
hudiutilities         70.09% <ø> (ø)            0.00 <ø> (ø)

Flags with carried forward coverage won't be shown.

Impacted Files                                           Coverage Δ                Complexity Δ
...src/main/java/org/apache/hudi/DataSourceUtils.java   40.56% <0.00%> (-0.39%)   19.00 <0.00> (ø)
...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala    50.95% <0.00%> (ø)        0.00 <0.00> (ø)
...main/scala/org/apache/hudi/DataSourceOptions.scala    89.84% <36.36%> (-5.08%)  0.00 <0.00> (ø)
...main/scala/org/apache/hudi/HoodieWriterUtils.scala    87.87% <100.00%> (ø)      0.00 <0.00> (ø)

@zhedoubushishi
Contributor Author

@vinothchandar Can you remove the WIP label? It seems I cannot remove it from my side. This PR is ready for review.

@vinothchandar vinothchandar removed the pr:wip Work in Progress/PRs label Nov 18, 2020
@vinothchandar
Member

@zhedoubushishi done

@vinothchandar
Member

@lw309637554 could you review this as well?

@vinothchandar vinothchandar added this to Ready For Review in PR Tracker Board Apr 15, 2021
@nsivabalan (Contributor) left a comment

@n3nash: I started reviewing this, but it looks like modi already reviewed it. Can you ask one of the Uber folks to review and take this to the finish line? It's been open for quite some time.

val DEFAULT_HIVE_USE_JDBC_OPT_VAL = "true"
val DEFAULT_HIVE_AUTO_CREATE_DATABASE_OPT_KEY = "true"
val DEFAULT_HIVE_SKIP_RO_SUFFIX_VAL = "false"
val DEFAULT_HIVE_SUPPORT_TIMESTAMP = "false"

def translateUseJDBCToHiveClientClass(optParams: Map[String, String]): Map[String, String] = {
  if (optParams.contains(HIVE_USE_JDBC_OPT_KEY) && !optParams.contains(HIVE_CLIENT_CLASS_OPT_KEY)) {
Contributor

Minor optimization: all this matters only if hoodie.datasource.hive_sync.enable is enabled, right? Otherwise we don't need to translate any of these.
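
A sketch of the guard the reviewer is suggesting (assumes HIVE_SYNC_ENABLED_OPT_KEY is the constant behind hoodie.datasource.hive_sync.enable; the surrounding names are illustrative, not the PR's actual code):

// Skip the JDBC-flag translation entirely when hive sync is disabled.
val resolvedParams =
  if (optParams.getOrElse(HIVE_SYNC_ENABLED_OPT_KEY, "false").toBoolean) {
    translateUseJDBCToHiveClientClass(optParams)
  } else {
    optParams
  }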

hiveSyncConfig.autoCreateDatabase = Boolean.valueOf(props.getString(DataSourceWriteOptions.HIVE_AUTO_CREATE_DATABASE_OPT_KEY(),
    DataSourceWriteOptions.DEFAULT_HIVE_AUTO_CREATE_DATABASE_OPT_KEY()));
hiveSyncConfig.skipROSuffix = Boolean.valueOf(props.getString(DataSourceWriteOptions.HIVE_SKIP_RO_SUFFIX(),
    DataSourceWriteOptions.DEFAULT_HIVE_SKIP_RO_SUFFIX_VAL()));
hiveSyncConfig.supportTimestamp = Boolean.valueOf(props.getString(DataSourceWriteOptions.HIVE_SUPPORT_TIMESTAMP(),
    DataSourceWriteOptions.DEFAULT_HIVE_SUPPORT_TIMESTAMP()));
hiveSyncConfig.hiveClientClass =
Contributor

I see these are used in some test data files as well. Can you fix those in this patch or create a follow-up JIRA?
docker/demo/sparksql-incremental.commands

import java.util.Collections;
import java.util.List;

public class HoodieHiveDriverClient extends HoodieHiveClient {
Contributor

Javadocs, please.

import java.util.Map;
import java.util.stream.Collectors;

public class HoodieHiveClient extends AbstractSyncHoodieClient {
Contributor

Javadocs, please.

@n3nash
Contributor

n3nash commented May 31, 2021

@jsbali Can you review this diff, since you are looking to add ways to invoke JDBC as well as the Metastore?

@jsbali
Contributor

jsbali commented Jul 28, 2021

Sorry for the delay in responding. A similar PR was in progress and has been merged: #2879.
We can close this PR.

@hudi-bot

CI report:

Bot commands
@hudi-bot supports the following commands:
  • @hudi-bot run travis: re-run the last Travis build
  • @hudi-bot run azure: re-run the last Azure build

@zhedoubushishi
Contributor Author

zhedoubushishi commented Aug 5, 2021

> Sorry for the delay in responding. A similar PR was in progress and has been merged: #2879.
> We can close this PR.

Yes, since it's duplicated work, I closed this PR.

PR Tracker Board automation moved this from Under Discussion PRs to Done Aug 5, 2021