
[HUDI-1848] Adding support for HMS for running DDL queries in hive-sync-tool #2879

Merged
10 commits merged into master on Jul 23, 2021

Conversation

@jsbali (Contributor) commented on Apr 26, 2021


What is the purpose of the pull request

This PR takes the work done in https://github.com/apache/hudi/pull/2532/files and builds on top of it, fixing some small issues in that PR and adding tests.
It adds support for using HMS for running DDL queries in hive-sync-tool, configured by the useHMS flag.

Brief change log

Added a DDLExecutor interface and three implementations of it, namely:
JDBCExecutor
HiveQueryDDLExecutor
HMSDDLExecutor

There are some small things still remaining, but the PR can be reviewed alongside those changes.
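
For context while reviewing, a rough sketch of what such a DDL abstraction could look like is below; the method names are illustrative and may differ from the actual interface in this PR.

  import java.util.List;
  import java.util.Map;
  import org.apache.parquet.schema.MessageType;

  // Illustrative sketch only; the real DDLExecutor in this PR may declare
  // different methods or signatures.
  public interface DDLExecutor {
    void createDatabase(String databaseName);
    void createTable(String tableName, MessageType storageSchema,
                     String inputFormatClass, String outputFormatClass, String serdeClass);
    void updateTableDefinition(String tableName, MessageType newSchema);
    Map<String, String> getTableSchema(String tableName);
    void addPartitionsToTable(String tableName, List<String> partitionsToAdd);
    void updatePartitionsToTable(String tableName, List<String> changedPartitions);
    void close();
  }

  // JDBCExecutor, HiveQueryDDLExecutor and HMSDDLExecutor would each implement
  // this interface, so HiveSyncTool only has to pick one at startup.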

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@codecov-commenter
Copy link

codecov-commenter commented Apr 26, 2021

Codecov Report

Merging #2879 (f360b8b) into master (2099bf4) will decrease coverage by 2.03%.
The diff coverage is 79.48%.


@@             Coverage Diff              @@
##             master    #2879      +/-   ##
============================================
- Coverage     47.84%   45.80%   -2.04%     
- Complexity     5564     5647      +83     
============================================
  Files           936     1003      +67     
  Lines         41652    43968    +2316     
  Branches       4195     4415     +220     
============================================
+ Hits          19928    20140     +212     
- Misses        19952    22037    +2085     
- Partials       1772     1791      +19     
Flag                  Coverage Δ
hudicli               39.97% <ø> (ø)
hudiclient            34.55% <ø> (+0.03%) ⬆️
hudicommon            48.63% <ø> (-0.06%) ⬇️
hudiflink             59.65% <ø> (+0.21%) ⬆️
hudihadoopmr          52.02% <ø> (ø)
hudiintegtest          0.00% <ø> (?)
hudisparkdatasource   67.04% <ø> (-0.24%) ⬇️
hudisync              59.28% <79.48%> (+3.30%) ⬆️
huditimelineservice   64.07% <ø> (ø)
hudiutilities         59.87% <ø> (+0.10%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
...org/apache/hudi/hive/ddl/HiveQueryDDLExecutor.java 71.15% <71.15%> (ø)
...in/java/org/apache/hudi/hive/ddl/JDBCExecutor.java 73.13% <73.13%> (ø)
...in/java/org/apache/hudi/hive/HoodieHiveClient.java 66.40% <73.91%> (-3.74%) ⬇️
.../java/org/apache/hudi/hive/ddl/HMSDDLExecutor.java 81.45% <81.45%> (ø)
...rg/apache/hudi/hive/ddl/QueryBasedDDLExecutor.java 84.94% <84.94%> (ø)
...java/org/apache/hudi/hive/util/HiveSchemaUtil.java 70.90% <86.20%> (+1.53%) ⬆️
...main/java/org/apache/hudi/hive/HiveSyncConfig.java 98.24% <100.00%> (ø)
...c/main/java/org/apache/hudi/hive/HiveSyncTool.java 78.12% <100.00%> (ø)
...va/org/apache/hudi/keygen/BuiltinKeyGenerator.java 62.12% <0.00%> (-10.86%) ⬇️
...g/apache/hudi/keygen/GlobalDeleteKeyGenerator.java 90.90% <0.00%> (-9.10%) ⬇️
... and 115 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2099bf4...f360b8b.

@vinothchandar (Member) commented:

@jsbali @satishkotha do we need #2532 to land for this?

@jsbali (Contributor, Author) commented on May 11, 2021

@vinothchandar The two PRs are different. This PR adds a configurable option for how DDL queries are run within hive-sync-tool, whereas the other PR removes all alternatives and keeps only the HMS one.
The two PRs are independent of each other.
The only similarity is that the HMS code has been taken from #2532, which I will add as co-authored-by in the next commit.
Also, as far as this PR is concerned, there is a test case I need to fix. Once that is done, this can be reviewed.

@nsivabalan added the priority:minor label (everything else; usability gaps; questions; feature reqs) on May 11, 2021
@jsbali (Contributor, Author) commented on May 18, 2021

@satishkotha I have fixed the test cases, but there are a few minor merge conflicts which I will fix along with the code review changes.

@satishkotha (Member) left a comment:

Took a first pass. I haven't reviewed the individual executors yet; I'll do that over the next week.

@@ -73,6 +73,9 @@
@Parameter(names = {"--use-jdbc"}, description = "Hive jdbc connect url")
public Boolean useJdbc = true;

@Parameter(names = {"--use-hms"}, description = "Use hms client for ddl commands")
public Boolean useHMS = false;
Review comment (Member):

any thoughts on introducing other option instead of using booleans? only one of useJdbc/useHms is allowed to be true. can we introduce something like sync_mode=jdbc/hms etc? We can still keep the flags for backward compatibility for few more versions. wdyt?
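
To make the suggestion concrete, here is a hedged sketch of a sync_mode option next to the legacy booleans; the flag spelling, field names and class name are illustrative, not the final shape in this PR.

  import com.beust.jcommander.Parameter;

  // Illustrative sketch of the suggestion: a single sync mode option, with the
  // old boolean flags kept only for backward compatibility.
  public class HiveSyncConfigSketch {
    @Parameter(names = {"--sync-mode"},
        description = "Mode for running DDL commands: one of hms, jdbc, hiveql")
    public String syncMode;

    // Legacy flags, kept for a few more versions.
    @Parameter(names = {"--use-jdbc"}, description = "Use JDBC for DDL commands (deprecated)")
    public Boolean useJdbc = true;

    @Parameter(names = {"--use-hms"}, description = "Use the HMS client for DDL commands (deprecated)")
    public Boolean useHMS = false;
  }

If syncMode is set it would win; otherwise the tool would fall back to the legacy booleans.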

LOG.info("Creating hive connection " + cfg.jdbcUrl);
createHiveConnection();

if (cfg.useHMS) {
Review comment (Member):

nit: probably better to keep cfg.useJdbc as first check to ensure jdbc takes precedence over hms.
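
A minimal sketch of the ordering being suggested; the constructor arguments are illustrative, not the exact signatures in the PR.

  // Illustrative only: check useJdbc first so JDBC takes precedence when both
  // flags are set, then HMS, then fall back to HiveQL.
  DDLExecutor ddlExecutor;
  if (cfg.useJdbc) {
    ddlExecutor = new JDBCExecutor(cfg, fs);
  } else if (cfg.useHMS) {
    ddlExecutor = new HMSDDLExecutor(configuration, cfg, fs);
  } else {
    ddlExecutor = new HiveQueryDDLExecutor(cfg, fs, configuration);
  }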

@@ -68,7 +70,7 @@ public HiveSyncTool(HiveSyncConfig cfg, HiveConf configuration, FileSystem fs) {

try {
this.hoodieHiveClient = new HoodieHiveClient(cfg, configuration, fs);
} catch (RuntimeException e) {
} catch (RuntimeException | HiveException | MetaException e) { //TODO-jsbali FIx this
Review comment (Member):

minor: revert this change?

return schema;
} catch (Exception e) {
throw new HoodieHiveSyncException("Failed to get table schema for : " + tableName, e);
if (!doesTableExist(tableName)) {
Review comment (Member):

Earlier this check seems to have been present only for the JDBC executor. Do we need it for all cases? (Probably not a big deal, just making sure there are no redundant checks.)

Comment on lines 89 to 90
List<FieldSchema> fieldSchema = HiveSchemaUtil.convertParquetSchemaToHiveFieldSchema(storageSchema, syncConfig);
Map<String, String> hiveSchema = HiveSchemaUtil.convertParquetSchemaToHiveSchema(storageSchema, syncConfig.supportTimestamp);
Review comment (Member):

For certain large schemas it is expensive to do these multiple conversions: 1) convertParquetSchemaToHiveFieldSchema and 2) convertParquetSchemaToHiveSchema.

Can we avoid one of them? Maybe change getPartitionKeyType to work with a List? Or change convertParquetSchemaToHiveFieldSchema to return the partitionSchema as well?
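
One hypothetical shape for that refactor, shown as a fragment: reuse the already-converted field list to derive the partition columns instead of converting the Parquet schema a second time. The helper names follow the ones quoted above; everything else is illustrative.

  // Illustrative sketch: a single conversion pass, then both the table columns
  // and the partition columns are derived from it.
  List<FieldSchema> allFields =
      HiveSchemaUtil.convertParquetSchemaToHiveFieldSchema(storageSchema, syncConfig);
  Map<String, FieldSchema> fieldsByName = allFields.stream()
      .collect(Collectors.toMap(FieldSchema::getName, f -> f));
  List<FieldSchema> partitionSchema = syncConfig.partitionFields.stream()
      // Assumed default: partition keys missing from the data schema become strings.
      .map(name -> fieldsByName.getOrDefault(name, new FieldSchema(name, "string", "")))
      .collect(Collectors.toList());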

@jsbali (Contributor, Author) commented on Jun 15, 2021

@satishkotha Made the changes. PTAL

@hudi-bot commented on Jun 15, 2021

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run travis re-run the last Travis build
  • @hudi-bot run azure re-run the last Azure build

@satishkotha (Member) left a comment:

LGTM at a high level. Please also test with the internal version, verify all executors work, and share the test results.

@vinothchandar or @n3nash This is a somewhat high-impact change. Do one of you want to take a pass as well?

private static Map<String, String> convertParquetSchemaToHiveSchema(MessageType messageType, boolean supportTimestamp) throws IOException {
Map<String, String> schema = new LinkedHashMap<>();
public static Map<String, String> convertParquetSchemaToHiveSchema(MessageType messageType, boolean supportTimestamp) throws IOException {
return convertMapSchemaToHiveSchema(parquetSchemaToMapSchema(messageType, supportTimestamp));
Review comment (Member):

It seems like we are always doing two translations
parquet -> map and map -> hive schema? Is this extra step needed for all executors?

Reply (Contributor Author):

We were doing it before anyway. I have just broken it up for now to save work when we do the create-table call via the HMS API, where we might otherwise do redundant work. Let me see if we can refactor it better.

Map<String, String> hiveSchema = HiveSchemaUtil.convertMapSchemaToHiveSchema(mapSchema);

List<FieldSchema> partitionSchema = syncConfig.partitionFields.stream().map(partitionKey -> {
String partitionKeyType = HiveSchemaUtil.getPartitionKeyType(hiveSchema, partitionKey);
Review comment (Member):

Can we change getPartitionKeyType to work with fieldSchema/mapSchema? Can we remove the 'hiveSchema' variable with that?

Follow-up (Member):

@jsbali any thoughts on this?

@jsbali (Contributor, Author) commented on Jul 1, 2021

Fixed a bug where complex schemas could cause the HMS create API to fail. The fix: the FieldSchemas that have to be provided explicitly to HMS don't work well with spaces and don't need backticks.

@n3nash (Contributor) commented on Jul 2, 2021

@n3nash To review this

@satishkotha (Member) left a comment:

@n3nash the change LGTM except for a few minor javadoc etc. comments below. Please take a look.

Map<String, String> hiveSchema = HiveSchemaUtil.convertMapSchemaToHiveSchema(mapSchema);

List<FieldSchema> partitionSchema = syncConfig.partitionFields.stream().map(partitionKey -> {
String partitionKeyType = HiveSchemaUtil.getPartitionKeyType(hiveSchema, partitionKey);
Review comment (Member):

@jsbali any thoughts on this?

@vinothchandar added the priority:critical label (production down; pipelines stalled; need help asap) and removed the priority:minor label (everything else; usability gaps; questions; feature reqs) on Jul 15, 2021
@jsbali (Contributor, Author) commented on Jul 21, 2021

I have added a doFormat boolean to the parquetSchemaToMapSchema function. The motivation is that, when running with complex or nested schemas, spaces can trip up the HMS create-table call.
After going through the HiveQL driver code, which in turn calls HMS, this is exactly how the create-table params get passed there.
Also @n3nash and @satishkotha: I have resolved all merge conflicts and added a complex-schema test as well.
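
A hedged sketch of how such a flag would be used (the return type and exact signature are assumptions based on the snippets quoted earlier): formatted output feeds the backticked HiveQL DDL path, while the unformatted output feeds plain FieldSchema objects for the HMS API.

  // Illustrative only: HiveQL DDL wants backticked, padded column definitions,
  // while HMS FieldSchema objects want plain, unquoted names and type strings.
  LinkedHashMap<String, String> mapSchema =
      HiveSchemaUtil.parquetSchemaToMapSchema(messageType, supportTimestamp, /* doFormat = */ false);
  List<FieldSchema> fieldSchema = mapSchema.entrySet().stream()
      .map(e -> new FieldSchema(e.getKey(), e.getValue(), ""))
      .collect(Collectors.toList());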

@stym06 (Contributor) commented on Jul 29, 2021

Thanks for this! Are there any docs on the usage?

@jsbali (Contributor, Author) commented on Jul 29, 2021

Previously, the useJdbc flag was used to choose between JDBC and HiveQL for DDL commands.
With this diff we have added the option of running DDL commands via HMS as well.
Usage: the new option is syncMode, which can take three values:

  • hms
  • hiveql (the old way, useJdbc set to false)
  • jdbc (the old way, useJdbc set to true)
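
For illustration, a hedged usage sketch in code; the HiveSyncTool constructor matches the one quoted in the diff above, but the config field names other than syncMode, and the exact CLI flag spelling (presumably --sync-mode), are assumptions.

  // Illustrative sketch; field names other than syncMode are assumed.
  HiveSyncConfig cfg = new HiveSyncConfig();
  cfg.syncMode = "hms";            // or "jdbc" / "hiveql"
  cfg.databaseName = "default";    // hypothetical database
  cfg.tableName = "my_hudi_table"; // hypothetical table
  new HiveSyncTool(cfg, hiveConf, fs).syncHoodieTable();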

@stym06 (Contributor) commented on Jul 30, 2021

What parameters are required to be passed to sync with HMS? Can we use the thrift URL?

liujinhui1994 pushed a commit to liujinhui1994/hudi that referenced this pull request on Aug 12, 2021 (apache#2879):

* [HUDI-1848] Adding support for HMS for running DDL queries in hive-sync-tool

* [HUDI-1848] Fixing test cases

* [HUDI-1848] CR changes

* [HUDI-1848] Fix checkstyle violations

* [HUDI-1848] Fixed a bug when metastore api fails for complex schemas with multiple levels.

* [HUDI-1848] Adding the complex schema and resolving merge conflicts

* [HUDI-1848] Adding some more javadocs

* [HUDI-1848] Added javadocs for DDLExecutor impls

* [HUDI-1848] Fixed style issue
Labels
priority:critical production down; pipelines stalled; Need help asap.
8 participants