
[HUDI-1848] Adding support for HMS for running DDL queries in hive-sync-tool #2879

Merged
10 commits merged into master on Jul 23, 2021

Conversation

@jsbali (Contributor) commented on Apr 26, 2021


What is the purpose of the pull request

This PR takes the work done in https://github.com/apache/hudi/pull/2532/files and builds on top of it, fixing some small issues in that PR and adding tests.
It adds support for using HMS for running DDL queries in hive-sync-tool, configured by the useHMS flag.

Brief change log

Added a DDLExecutor interface and three implementations of it, namely:
JDBCExecutor
HiveQueryDDLExecutor
HMSDDLExecutor

There are some small things still remaining, but the PR can be reviewed alongside those changes.
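
For context while reviewing, a rough sketch of what such a DDL abstraction could look like is below; the method names are illustrative and may differ from the actual interface in this PR.

  import java.util.List;
  import java.util.Map;
  import org.apache.parquet.schema.MessageType;

  // Illustrative sketch only; the real DDLExecutor in this PR may declare
  // different methods or signatures.
  public interface DDLExecutor {
    void createDatabase(String databaseName);
    void createTable(String tableName, MessageType storageSchema,
                     String inputFormatClass, String outputFormatClass, String serdeClass);
    void updateTableDefinition(String tableName, MessageType newSchema);
    Map<String, String> getTableSchema(String tableName);
    void addPartitionsToTable(String tableName, List<String> partitionsToAdd);
    void updatePartitionsToTable(String tableName, List<String> changedPartitions);
    void close();
  }

  // JDBCExecutor, HiveQueryDDLExecutor and HMSDDLExecutor would each implement
  // this interface, so HiveSyncTool only has to pick one at startup.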

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@codecov-commenter
Copy link

codecov-commenter commented Apr 26, 2021

Codecov Report

Merging #2879 (f360b8b) into master (2099bf4) will decrease coverage by 2.03%.
The diff coverage is 79.48%.


@@             Coverage Diff              @@
##             master    #2879      +/-   ##
============================================
- Coverage     47.84%   45.80%   -2.04%     
- Complexity     5564     5647      +83     
============================================
  Files           936     1003      +67     
  Lines         41652    43968    +2316     
  Branches       4195     4415     +220     
============================================
+ Hits          19928    20140     +212     
- Misses        19952    22037    +2085     
- Partials       1772     1791      +19     
Flag                  Coverage Δ
hudicli               39.97% <ø> (ø)
hudiclient            34.55% <ø> (+0.03%) ⬆️
hudicommon            48.63% <ø> (-0.06%) ⬇️
hudiflink             59.65% <ø> (+0.21%) ⬆️
hudihadoopmr          52.02% <ø> (ø)
hudiintegtest          0.00% <ø> (?)
hudisparkdatasource   67.04% <ø> (-0.24%) ⬇️
hudisync              59.28% <79.48%> (+3.30%) ⬆️
huditimelineservice   64.07% <ø> (ø)
hudiutilities         59.87% <ø> (+0.10%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
...org/apache/hudi/hive/ddl/HiveQueryDDLExecutor.java 71.15% <71.15%> (ø)
...in/java/org/apache/hudi/hive/ddl/JDBCExecutor.java 73.13% <73.13%> (ø)
...in/java/org/apache/hudi/hive/HoodieHiveClient.java 66.40% <73.91%> (-3.74%) ⬇️
.../java/org/apache/hudi/hive/ddl/HMSDDLExecutor.java 81.45% <81.45%> (ø)
...rg/apache/hudi/hive/ddl/QueryBasedDDLExecutor.java 84.94% <84.94%> (ø)
...java/org/apache/hudi/hive/util/HiveSchemaUtil.java 70.90% <86.20%> (+1.53%) ⬆️
...main/java/org/apache/hudi/hive/HiveSyncConfig.java 98.24% <100.00%> (ø)
...c/main/java/org/apache/hudi/hive/HiveSyncTool.java 78.12% <100.00%> (ø)
...va/org/apache/hudi/keygen/BuiltinKeyGenerator.java 62.12% <0.00%> (-10.86%) ⬇️
...g/apache/hudi/keygen/GlobalDeleteKeyGenerator.java 90.90% <0.00%> (-9.10%) ⬇️
... and 115 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2099bf4...f360b8b.

@vinothchandar (Member) commented:

@jsbali @satishkotha do we need #2532 to land for this?

@jsbali (Contributor, Author) commented on May 11, 2021

@vinothchandar The two PRs are different. This PR adds a configurable option for how DDL queries are run within hive-sync-tool, whereas the other PR removes all alternatives and keeps only the HMS one.
The two PRs are independent of each other.
The only similarity is that the HMS code has been taken from #2532, which I will add as co-authored-by in the next commit.
Also, as far as this PR is concerned, there is a test case I need to fix. Once that is done, this can be reviewed.

@nsivabalan added the priority:minor label (everything else; usability gaps; questions; feature reqs) on May 11, 2021
@jsbali (Contributor, Author) commented on May 18, 2021

@satishkotha I have fixed the test cases, but there are a few minor merge conflicts which I will fix along with the code review changes.

@satishkotha (Member) left a comment:

Took a first pass. I haven't reviewed the individual executors yet; I'll do that over the next week.

@@ -73,6 +73,9 @@
@Parameter(names = {"--use-jdbc"}, description = "Hive jdbc connect url")
public Boolean useJdbc = true;

@Parameter(names = {"--use-hms"}, description = "Use hms client for ddl commands")
public Boolean useHMS = false;
Review comment (Member):

any thoughts on introducing other option instead of using booleans? only one of useJdbc/useHms is allowed to be true. can we introduce something like sync_mode=jdbc/hms etc? We can still keep the flags for backward compatibility for few more versions. wdyt?
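
To make the suggestion concrete, here is a hedged sketch of a sync_mode option next to the legacy booleans; the flag spelling, field names and class name are illustrative, not the final shape in this PR.

  import com.beust.jcommander.Parameter;

  // Illustrative sketch of the suggestion: a single sync mode option, with the
  // old boolean flags kept only for backward compatibility.
  public class HiveSyncConfigSketch {
    @Parameter(names = {"--sync-mode"},
        description = "Mode for running DDL commands: one of hms, jdbc, hiveql")
    public String syncMode;

    // Legacy flags, kept for a few more versions.
    @Parameter(names = {"--use-jdbc"}, description = "Use JDBC for DDL commands (deprecated)")
    public Boolean useJdbc = true;

    @Parameter(names = {"--use-hms"}, description = "Use the HMS client for DDL commands (deprecated)")
    public Boolean useHMS = false;
  }

If syncMode is set it would win; otherwise the tool would fall back to the legacy booleans.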

LOG.info("Creating hive connection " + cfg.jdbcUrl);
createHiveConnection();

if (cfg.useHMS) {
Review comment (Member):

nit: probably better to keep cfg.useJdbc as first check to ensure jdbc takes precedence over hms.
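
A minimal sketch of the ordering being suggested; the constructor arguments are illustrative, not the exact signatures in the PR.

  // Illustrative only: check useJdbc first so JDBC takes precedence when both
  // flags are set, then HMS, then fall back to HiveQL.
  DDLExecutor ddlExecutor;
  if (cfg.useJdbc) {
    ddlExecutor = new JDBCExecutor(cfg, fs);
  } else if (cfg.useHMS) {
    ddlExecutor = new HMSDDLExecutor(configuration, cfg, fs);
  } else {
    ddlExecutor = new HiveQueryDDLExecutor(cfg, fs, configuration);
  }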

@@ -68,7 +70,7 @@ public HiveSyncTool(HiveSyncConfig cfg, HiveConf configuration, FileSystem fs) {

try {
this.hoodieHiveClient = new HoodieHiveClient(cfg, configuration, fs);
} catch (RuntimeException e) {
} catch (RuntimeException | HiveException | MetaException e) { //TODO-jsbali FIx this
Review comment (Member):

minor: revert this change?

return schema;
} catch (Exception e) {
throw new HoodieHiveSyncException("Failed to get table schema for : " + tableName, e);
if (!doesTableExist(tableName)) {
Review comment (Member):

Earlier this check seems to have been present only for the JDBC executor. Do we need it for all cases? (Probably not a big deal, just making sure there are no redundant checks.)

Comment on lines 89 to 90
List<FieldSchema> fieldSchema = HiveSchemaUtil.convertParquetSchemaToHiveFieldSchema(storageSchema, syncConfig);
Map<String, String> hiveSchema = HiveSchemaUtil.convertParquetSchemaToHiveSchema(storageSchema, syncConfig.supportTimestamp);
Review comment (Member):

For certain large schemas it is expensive to do these multiple conversions: 1) convertParquetSchemaToHiveFieldSchema and 2) convertParquetSchemaToHiveSchema.

Can we avoid one of them? Maybe change getPartitionKeyType to work with a List? Or change convertParquetSchemaToHiveFieldSchema to return the partitionSchema as well?
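
One hypothetical shape for that refactor, shown as a fragment: reuse the already-converted field list to derive the partition columns instead of converting the Parquet schema a second time. The helper names follow the ones quoted above; everything else is illustrative.

  // Illustrative sketch: a single conversion pass, then both the table columns
  // and the partition columns are derived from it.
  List<FieldSchema> allFields =
      HiveSchemaUtil.convertParquetSchemaToHiveFieldSchema(storageSchema, syncConfig);
  Map<String, FieldSchema> fieldsByName = allFields.stream()
      .collect(Collectors.toMap(FieldSchema::getName, f -> f));
  List<FieldSchema> partitionSchema = syncConfig.partitionFields.stream()
      // Assumed default: partition keys missing from the data schema become strings.
      .map(name -> fieldsByName.getOrDefault(name, new FieldSchema(name, "string", "")))
      .collect(Collectors.toList());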

@jsbali (Contributor, Author) commented on Jun 15, 2021

@satishkotha Made the changes. PTAL

@hudi-bot commented on Jun 15, 2021

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run travis re-run the last Travis build
  • @hudi-bot run azure re-run the last Azure build

@satishkotha (Member) left a comment:

LGTM at a high level. Please also test with the internal version, verify all executors work, and share the test results.

@vinothchandar or @n3nash This is a somewhat high-impact change. Do one of you want to take a pass as well?

private static Map<String, String> convertParquetSchemaToHiveSchema(MessageType messageType, boolean supportTimestamp) throws IOException {
Map<String, String> schema = new LinkedHashMap<>();
public static Map<String, String> convertParquetSchemaToHiveSchema(MessageType messageType, boolean supportTimestamp) throws IOException {
return convertMapSchemaToHiveSchema(parquetSchemaToMapSchema(messageType, supportTimestamp));
Review comment (Member):

It seems like we are always doing two translations
parquet -> map and map -> hive schema? Is this extra step needed for all executors?

Reply (Contributor Author):

We were doing it before anyway. I have just broken it up for now to save work when we do the create-table call via the HMS API, where we might otherwise do redundant work. Let me see if we can refactor it better.

Map<String, String> hiveSchema = HiveSchemaUtil.convertMapSchemaToHiveSchema(mapSchema);

List<FieldSchema> partitionSchema = syncConfig.partitionFields.stream().map(partitionKey -> {
String partitionKeyType = HiveSchemaUtil.getPartitionKeyType(hiveSchema, partitionKey);
Review comment (Member):

Can we change getPartitionKeyType to work with fieldSchema/mapSchema? Can we remove the 'hiveSchema' variable with that?

Follow-up (Member):

@jsbali any thoughts on this?

@jsbali (Contributor, Author) commented on Jul 1, 2021

Fixed a bug where complex schemas could cause the HMS create API to fail. The fix: the FieldSchemas that have to be provided explicitly to HMS don't work well with spaces and don't need backticks.

@n3nash (Contributor) commented on Jul 2, 2021

@n3nash To review this

@satishkotha (Member) left a comment:

@n3nash the change LGTM except for a few minor javadoc etc. comments below. Please take a look.

Map<String, String> hiveSchema = HiveSchemaUtil.convertMapSchemaToHiveSchema(mapSchema);

List<FieldSchema> partitionSchema = syncConfig.partitionFields.stream().map(partitionKey -> {
String partitionKeyType = HiveSchemaUtil.getPartitionKeyType(hiveSchema, partitionKey);
Review comment (Member):

@jsbali any thoughts on this?

@vinothchandar added the priority:critical label (production down; pipelines stalled; need help asap) and removed the priority:minor label (everything else; usability gaps; questions; feature reqs) on Jul 15, 2021
@jsbali (Contributor, Author) commented on Jul 21, 2021

I have added a doFormat boolean to the parquetSchemaToMapSchema function. The motivation is that, when running with complex or nested schemas, spaces can trip up the HMS create-table call.
After going through the HiveQL driver code, which in turn calls HMS, this is exactly how the create-table params get passed there.
Also @n3nash and @satishkotha: I have resolved all merge conflicts and added a complex-schema test as well.
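
A hedged sketch of how such a flag would be used (the return type and exact signature are assumptions based on the snippets quoted earlier): formatted output feeds the backticked HiveQL DDL path, while the unformatted output feeds plain FieldSchema objects for the HMS API.

  // Illustrative only: HiveQL DDL wants backticked, padded column definitions,
  // while HMS FieldSchema objects want plain, unquoted names and type strings.
  LinkedHashMap<String, String> mapSchema =
      HiveSchemaUtil.parquetSchemaToMapSchema(messageType, supportTimestamp, /* doFormat = */ false);
  List<FieldSchema> fieldSchema = mapSchema.entrySet().stream()
      .map(e -> new FieldSchema(e.getKey(), e.getValue(), ""))
      .collect(Collectors.toList());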

@stym06 (Contributor) commented on Jul 29, 2021

Thanks for this! Are there any docs on the usage?

@jsbali (Contributor, Author) commented on Jul 29, 2021

Previously, the useJdbc flag was used to choose between JDBC and HiveQL for DDL commands.
With this diff we have added the option of running DDL commands via HMS as well.
Usage: the new option is syncMode, which can take three values:

  • hms
  • hiveql (the old way, useJdbc set to false)
  • jdbc (the old way, useJdbc set to true)
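
For illustration, a hedged usage sketch in code; the HiveSyncTool constructor matches the one quoted in the diff above, but the config field names other than syncMode, and the exact CLI flag spelling (presumably --sync-mode), are assumptions.

  // Illustrative sketch; field names other than syncMode are assumed.
  HiveSyncConfig cfg = new HiveSyncConfig();
  cfg.syncMode = "hms";            // or "jdbc" / "hiveql"
  cfg.databaseName = "default";    // hypothetical database
  cfg.tableName = "my_hudi_table"; // hypothetical table
  new HiveSyncTool(cfg, hiveConf, fs).syncHoodieTable();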

@stym06 (Contributor) commented on Jul 30, 2021

What parameters are required to be passed to sync with HMS? Can we use the thrift URL?

liujinhui1994 pushed a commit to liujinhui1994/hudi that referenced this pull request on Aug 12, 2021 (apache#2879):

* [HUDI-1848] Adding support for HMS for running DDL queries in hive-sync-tool

* [HUDI-1848] Fixing test cases

* [HUDI-1848] CR changes

* [HUDI-1848] Fix checkstyle violations

* [HUDI-1848] Fixed a bug when metastore api fails for complex schemas with multiple levels.

* [HUDI-1848] Adding the complex schema and resolving merge conflicts

* [HUDI-1848] Adding some more javadocs

* [HUDI-1848] Added javadocs for DDLExecutor impls

* [HUDI-1848] Fixed style issue
Labels
priority:critical production down; pipelines stalled; Need help asap.
8 participants