[HUDI-1194] Refactor HoodieHiveClient based on the way to call Hive API #1975

Closed

Conversation

zhedoubushishi
Contributor


What is the purpose of the pull request

JIRA https://issues.apache.org/jira/browse/HUDI-1194

Brief change log

Separate HoodieHiveClient into three classes:

  • HoodieHiveClient, which implements all the APIs through the Metastore API.
  • HoodieHiveJDBCClient, which extends HoodieHiveClient and overrides several of the APIs to go through Hive JDBC.
  • HoodieHiveDriverClient, which extends HoodieHiveClient and overrides several of the APIs to go through the Hive Driver.

Also introduce a new parameter, hoodie.datasource.hive_sync.hive_client_class, which lets you choose which Hive client class to use; see the sketch below.
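
A minimal usage sketch (assumes a Spark DataFrame df and a hypothetical base path; the class name comes from this PR, but its package and the other values here are illustrative, and the required Hudi write options such as record key and table name are omitted for brevity):

// Scala sketch: pick the Hive client implementation when writing a Hudi table.
// assumes: val df: org.apache.spark.sql.DataFrame
df.write.format("org.apache.hudi")
  .option("hoodie.datasource.hive_sync.enable", "true")
  .option("hoodie.datasource.hive_sync.hive_client_class",
    "org.apache.hudi.hive.HoodieHiveJDBCClient") // hypothetical fully-qualified name
  .mode("append")
  .save("/tmp/hudi/example_table") // hypothetical base path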

Verify this pull request

This change added tests and can be verified as follows:


  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

if (optParams(HIVE_USE_JDBC_OPT_KEY).equals("true")) {
  optParams ++ Map(HIVE_CLIENT_CLASS_OPT_KEY -> DEFAULT_HIVE_CLIENT_CLASS_OPT_VAL)
} else if (optParams(HIVE_USE_JDBC_OPT_KEY).equals("false")) {
  optParams ++ Map(HIVE_CLIENT_CLASS_OPT_KEY -> classOf[HoodieHiveDriverClient].getCanonicalName)
}
Contributor

Could you add a comment, either here or in HoodieHiveDriverClient, explaining why this is used when HIVE_USE_JDBC_OPT_KEY is false?

Contributor Author

Sure. Here I just want to keep the same behavior as before. Will add a comment.

@@ -94,7 +94,7 @@
 <joda.version>2.9.9</joda.version>
 <hadoop.version>2.7.3</hadoop.version>
 <hive.groupid>org.apache.hive</hive.groupid>
-<hive.version>2.3.1</hive.version>
+<hive.version>2.3.6</hive.version>
Contributor

Why are we updating Hive?

Contributor Author

When running Hive sync through Spark 2.x, Spark uses hive-spark (version 1.2.1-spark2) as the Hive dependency. So I need to make sure all the Hive APIs used in HoodieHiveClient are compatible with both hive-spark 1.2.1-spark2 and Hive 2.3.x.

For the alter_partition API, client.alter_partition(String, String, Partition), I couldn't find a signature compatible between Hive 2.3.1 and 1.2.1, but I could find one compatible with both Hive 2.3.6 and hive-spark 1.2.1.

So I am thinking of bumping the Hive version to 2.3.6. Is this acceptable to the community?
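
For context, a minimal sketch of the overload in question (grounded only in the signature quoted above; assumes an already-connected IMetaStoreClient, and the wrapper name is illustrative):

import org.apache.hadoop.hive.metastore.IMetaStoreClient
import org.apache.hadoop.hive.metastore.api.Partition

// Sketch: push an updated partition definition to the metastore via the
// alter_partition(String, String, Partition) overload discussed above.
def updatePartition(client: IMetaStoreClient, db: String, table: String, updated: Partition): Unit = {
  client.alter_partition(db, table, updated)
}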

@zhedoubushishi zhedoubushishi changed the title [HUDI-1194][WIP] Reorganize HoodieHiveClient based on the way to call Hive API [HUDI-1194][WIP] Refactor HoodieHiveClient based on the way to call Hive API Aug 27, 2020
@vinothchandar
Member

@umehrot2 to take a first pass as well.
@modi95 I am assuming you can review this, with Uber's setup in mind.

@vinothchandar vinothchandar removed their assignment Sep 8, 2020
@vinothchandar vinothchandar added the pr:wip Work in Progress/PRs label Oct 4, 2020
@zhedoubushishi zhedoubushishi changed the title [HUDI-1194][WIP] Refactor HoodieHiveClient based on the way to call Hive API [HUDI-1194] Refactor HoodieHiveClient based on the way to call Hive API Nov 14, 2020
@codecov-io

codecov-io commented Nov 16, 2020

Codecov Report

Merging #1975 (22757e3) into master (430d4b4) will decrease coverage by 0.01%.
The diff coverage is 35.29%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #1975      +/-   ##
============================================
- Coverage     53.54%   53.52%   -0.02%     
  Complexity     2770     2770              
============================================
  Files           348      348              
  Lines         16109    16120      +11     
  Branches       1643     1646       +3     
============================================
+ Hits           8626     8629       +3     
- Misses         6785     6792       +7     
- Partials        698      699       +1     
Flag                  Coverage Δ                Complexity Δ
hudicli               38.37% <ø> (ø)            0.00 <ø> (ø)
hudiclient            100.00% <ø> (ø)           0.00 <ø> (ø)
hudicommon            55.33% <ø> (ø)            0.00 <ø> (ø)
hudihadoopmr          32.94% <ø> (ø)            0.00 <ø> (ø)
hudispark             65.29% <35.29%> (-0.30%)  0.00 <0.00> (ø)
huditimelineservice   65.30% <ø> (ø)            0.00 <ø> (ø)
hudiutilities         70.09% <ø> (ø)            0.00 <ø> (ø)

Flags with carried forward coverage won't be shown.

Impacted Files                                           Coverage Δ                Complexity Δ
...src/main/java/org/apache/hudi/DataSourceUtils.java   40.56% <0.00%> (-0.39%)   19.00 <0.00> (ø)
...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala    50.95% <0.00%> (ø)        0.00 <0.00> (ø)
...main/scala/org/apache/hudi/DataSourceOptions.scala    89.84% <36.36%> (-5.08%)  0.00 <0.00> (ø)
...main/scala/org/apache/hudi/HoodieWriterUtils.scala    87.87% <100.00%> (ø)      0.00 <0.00> (ø)

@zhedoubushishi
Contributor Author

@vinothchandar Can you remove the WIP label? It seems I cannot remove it from my side. This PR is ready for review.

@vinothchandar vinothchandar removed the pr:wip Work in Progress/PRs label Nov 18, 2020
@vinothchandar
Member

@zhedoubushishi done

@vinothchandar
Member

@lw309637554 could you review this as well?

@vinothchandar vinothchandar added this to Ready For Review in PR Tracker Board Apr 15, 2021
@nsivabalan (Contributor) left a comment

@n3nash: I started reviewing this, but it looks like modi already reviewed it. Can you ask one of the Uber folks to review and take this to the finish line? It's been open for quite some time.

val DEFAULT_HIVE_USE_JDBC_OPT_VAL = "true"
val DEFAULT_HIVE_AUTO_CREATE_DATABASE_OPT_KEY = "true"
val DEFAULT_HIVE_SKIP_RO_SUFFIX_VAL = "false"
val DEFAULT_HIVE_SUPPORT_TIMESTAMP = "false"

def translateUseJDBCToHiveClientClass(optParams: Map[String, String]): Map[String, String] = {
  if (optParams.contains(HIVE_USE_JDBC_OPT_KEY) && !optParams.contains(HIVE_CLIENT_CLASS_OPT_KEY)) {
Contributor

Minor optimization: all this matters only if hoodie.datasource.hive_sync.enable is enabled, right? Otherwise we don't need to translate any of these.
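
A sketch of the guard the reviewer is suggesting (assumes HIVE_SYNC_ENABLED_OPT_KEY is the constant behind hoodie.datasource.hive_sync.enable; the surrounding names are illustrative, not the PR's actual code):

// Skip the JDBC-flag translation entirely when hive sync is disabled.
val resolvedParams =
  if (optParams.getOrElse(HIVE_SYNC_ENABLED_OPT_KEY, "false").toBoolean) {
    translateUseJDBCToHiveClientClass(optParams)
  } else {
    optParams
  }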

hiveSyncConfig.autoCreateDatabase = Boolean.valueOf(props.getString(DataSourceWriteOptions.HIVE_AUTO_CREATE_DATABASE_OPT_KEY(),
    DataSourceWriteOptions.DEFAULT_HIVE_AUTO_CREATE_DATABASE_OPT_KEY()));
hiveSyncConfig.skipROSuffix = Boolean.valueOf(props.getString(DataSourceWriteOptions.HIVE_SKIP_RO_SUFFIX(),
    DataSourceWriteOptions.DEFAULT_HIVE_SKIP_RO_SUFFIX_VAL()));
hiveSyncConfig.supportTimestamp = Boolean.valueOf(props.getString(DataSourceWriteOptions.HIVE_SUPPORT_TIMESTAMP(),
    DataSourceWriteOptions.DEFAULT_HIVE_SUPPORT_TIMESTAMP()));
hiveSyncConfig.hiveClientClass =
Contributor

I see these are used in some test data files as well. Can you fix those in this patch or create a follow-up JIRA?
docker/demo/sparksql-incremental.commands

import java.util.Collections;
import java.util.List;

public class HoodieHiveDriverClient extends HoodieHiveClient {
Contributor

Javadocs, please.

import java.util.Map;
import java.util.stream.Collectors;

public class HoodieHiveClient extends AbstractSyncHoodieClient {
Contributor

Javadocs, please.

@n3nash
Contributor

n3nash commented May 31, 2021

@jsbali Can you review this diff, since you are looking to add ways to invoke JDBC as well as the Metastore?

@jsbali
Contributor

jsbali commented Jul 28, 2021

Sorry for the delay in responding. A similar PR was in progress and has been merged: #2879.
We can close this PR.

@hudi-bot

CI report:

Bot commands
@hudi-bot supports the following commands:
  • @hudi-bot run travis: re-run the last Travis build
  • @hudi-bot run azure: re-run the last Azure build

@zhedoubushishi
Contributor Author

zhedoubushishi commented Aug 5, 2021

> Sorry for the delay in responding. A similar PR was in progress and has been merged: #2879.
> We can close this PR.

Yes, since it's duplicated work, I closed this PR.

PR Tracker Board automation moved this from Under Discussion PRs to Done Aug 5, 2021