[HUDI-1848] Adding support for HMS for running DDL queries in hive-sy… #2879
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2879 +/- ##
============================================
- Coverage 47.84% 45.80% -2.04%
- Complexity 5564 5647 +83
============================================
Files 936 1003 +67
Lines 41652 43968 +2316
Branches 4195 4415 +220
============================================
+ Hits 19928 20140 +212
- Misses 19952 22037 +2085
- Partials 1772 1791 +19
@jsbali @satishkotha do we need #2532 to land for this?
@vinothchandar The two PRs are different. This PR adds support for configurable options for running DDL queries within hive-sync-tool, whereas the other PR removes all alternatives and keeps only the HMS one.
@satishkotha I have fixed the test cases, but there are a few minor merge conflicts which I will fix along with the code review changes.
Took a first pass. I haven't reviewed the individual executors yet. I'll do that over the next week.
hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/ddl/HMSDDLExecutor.java
@@ -73,6 +73,9 @@
@Parameter(names = {"--use-jdbc"}, description = "Hive jdbc connect url")
public Boolean useJdbc = true;

@Parameter(names = {"--use-hms"}, description = "Use hms client for ddl commands")
public Boolean useHMS = false;
Any thoughts on introducing another option instead of using booleans? Only one of useJdbc/useHms is allowed to be true. Can we introduce something like sync_mode=jdbc/hms etc.? We can still keep the flags for backward compatibility for a few more versions. wdyt?
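The sync_mode suggestion above could look something like this minimal sketch: a single enum-valued option that, when absent, falls back to the legacy boolean flags. All names here are illustrative, not the actual Hudi API.

```java
// Hypothetical sketch of a sync_mode option replacing the useJdbc/useHMS
// booleans, while still honoring the legacy flags for backward compatibility.
enum SyncMode { JDBC, HMS, HIVEQL }

class SyncModeResolver {
  // An explicit sync mode wins; otherwise fall back to the legacy flags,
  // with jdbc taking precedence over hms.
  static SyncMode resolve(String syncMode, boolean useJdbc, boolean useHms) {
    if (syncMode != null && !syncMode.isEmpty()) {
      return SyncMode.valueOf(syncMode.toUpperCase());
    }
    if (useJdbc) {
      return SyncMode.JDBC;
    }
    return useHms ? SyncMode.HMS : SyncMode.HIVEQL;
  }
}
```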
LOG.info("Creating hive connection " + cfg.jdbcUrl);
createHiveConnection();

if (cfg.useHMS) {
nit: probably better to keep cfg.useJdbc as first check to ensure jdbc takes precedence over hms.
@@ -68,7 +70,7 @@ public HiveSyncTool(HiveSyncConfig cfg, HiveConf configuration, FileSystem fs) {

try {
  this.hoodieHiveClient = new HoodieHiveClient(cfg, configuration, fs);
} catch (RuntimeException e) {
} catch (RuntimeException | HiveException | MetaException e) { //TODO-jsbali FIx this
minor: revert this change?
return schema;
} catch (Exception e) {
  throw new HoodieHiveSyncException("Failed to get table schema for : " + tableName, e);
if (!doesTableExist(tableName)) {
Earlier this check seems to have been present only for the jdbc executor. Do we need it for all cases? (Probably not a big deal, just making sure there are no redundant checks.)
hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/ddl/DDLExecutor.java
List<FieldSchema> fieldSchema = HiveSchemaUtil.convertParquetSchemaToHiveFieldSchema(storageSchema, syncConfig);
Map<String, String> hiveSchema = HiveSchemaUtil.convertParquetSchemaToHiveSchema(storageSchema, syncConfig.supportTimestamp);
For certain large schemas it is expensive to do these multiple conversions: 1) convertParquetSchemaToHiveFieldSchema and 2) convertParquetSchemaToHiveSchema.
Can we avoid one of them? Maybe change getPartitionKeyType to work with a List? Or change convertParquetSchemaToHiveFieldSchema to return the partitionSchema as well?
@satishkotha Made the changes. PTAL
LGTM at a high level. Please also test with the internal version, verify all executors work, and share test results.
@vinothchandar or @n3nash This is somewhat high impact change. Do one of you guys want to take a pass as well?
hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/ddl/DDLExecutor.java
hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/util/HiveSchemaUtil.java
private static Map<String, String> convertParquetSchemaToHiveSchema(MessageType messageType, boolean supportTimestamp) throws IOException {
  Map<String, String> schema = new LinkedHashMap<>();
public static Map<String, String> convertParquetSchemaToHiveSchema(MessageType messageType, boolean supportTimestamp) throws IOException {
  return convertMapSchemaToHiveSchema(parquetSchemaToMapSchema(messageType, supportTimestamp));
It seems like we are always doing two translations: parquet -> map and then map -> hive schema. Is this extra step needed for all executors?
We were doing it before anyway. I have just broken it up for now to save work in the HMS create-table call, where we might otherwise do redundant conversions. Let me see if we can refactor it better.
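The split described above can be sketched as follows: the parquet -> map translation happens once, and rendering the Hive schema becomes a cheap second stage over the intermediate map, which the HMS create-table path can also consume directly. The class and method names here are hypothetical.

```java
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical sketch: stage 2 of the two-step conversion only. Given the
// intermediate (column -> type) map produced by the parquet -> map stage,
// render the column list of a CREATE TABLE statement.
class TwoStepSchema {
  static String renderHiveColumns(Map<String, String> mapSchema) {
    StringJoiner cols = new StringJoiner(", ");
    for (Map.Entry<String, String> e : mapSchema.entrySet()) {
      // Backticks and spacing are cosmetic, for the DDL-string path.
      cols.add("`" + e.getKey() + "` " + e.getValue());
    }
    return cols.toString();
  }
}
```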
Map<String, String> hiveSchema = HiveSchemaUtil.convertMapSchemaToHiveSchema(mapSchema);

List<FieldSchema> partitionSchema = syncConfig.partitionFields.stream().map(partitionKey -> {
  String partitionKeyType = HiveSchemaUtil.getPartitionKeyType(hiveSchema, partitionKey);
Can we change getPartitionKeyType to work with fieldSchema/mapSchema? Can we remove the 'hiveSchema' variable with that?
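One possible shape for this suggestion, as a hedged sketch: look the partition key up directly in the intermediate map schema, which would let the caller drop the extra 'hiveSchema' variable. Names and the string default are illustrative.

```java
import java.util.Map;

// Hypothetical sketch: getPartitionKeyType reading the type straight from
// the intermediate map schema instead of a separately rendered hive schema.
class PartitionKeyTypes {
  static String getPartitionKeyType(Map<String, String> mapSchema, String partitionKey) {
    if (mapSchema.containsKey(partitionKey)) {
      return mapSchema.get(partitionKey);
    }
    // Partition keys absent from the data schema (e.g. derived partition
    // path segments) fall back to string here.
    return "string";
  }
}
```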
@jsbali any thoughts on this?
Fixed a bug where complex schemas with multiple levels could cause the HMS create API to fail. The fix: the FieldSchemas that need to be provided explicitly for HMS don't work well with spaces and don't need backticks.
@n3nash To review this
@n3nash the change LGTM except for a few minor javadoc etc. comments below. Please take a look.
hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/ddl/DDLExecutor.java
hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/util/HiveSchemaUtil.java
I have added a doFormat boolean to the parquetSchemaToMapSchema function. The motivation is that when running with complex or nested schemas, spaces can trip up the HMS create-table call.
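The difference the doFormat flag controls can be illustrated with a small hedged sketch: the DDL-string path renders types with backticks and spaces for readability, while the HMS path needs the compact form. The helper name is hypothetical, not the actual Hudi code.

```java
// Hypothetical sketch: strip the cosmetic formatting (backticks, spaces
// after commas/colons) from an already-rendered type string, producing the
// compact form that the HMS create-table API can parse for nested types.
class HmsTypeSanitizer {
  static String compactType(String formattedType) {
    return formattedType.replace("`", "").replace(", ", ",").replace(": ", ":");
  }
}
```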
hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/util/HiveSchemaUtil.java
Thanks for this! Are there any docs on the usage?
Previously the useJdbc flag was used to choose between JDBC and HiveQL for DDL commands.
What parameters are required to be passed to sync with HMS? Can we use the thrift URL?
(apache#2879)
* [HUDI-1848] Adding support for HMS for running DDL queries in hive-sync-tool
* [HUDI-1848] Fixing test cases
* [HUDI-1848] CR changes
* [HUDI-1848] Fix checkstyle violations
* [HUDI-1848] Fixed a bug when metastore api fails for complex schemas with multiple levels.
* [HUDI-1848] Adding the complex schema and resolving merge conflicts
* [HUDI-1848] Adding some more javadocs
* [HUDI-1848] Added javadocs for DDLExecutor impls
* [HUDI-1848] Fixed style issue
What is the purpose of the pull request
This PR takes the work done in https://github.com/apache/hudi/pull/2532/files and builds on top of it, fixing some small issues in that PR and adding tests.
It adds support for using the HMS for running DDL queries in hive-sync-tool, configured by the useHMS flag.
Brief change log
Added a DDLExecutor interface and three implementations of it:
JDBCExecutor
HiveQueryExecutor
HMSDDLExecutor
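The executor abstraction described above can be sketched roughly as follows: one interface, three backends, selected by configuration. Method and class names here are illustrative, not the exact Hudi API, and the bodies are stubs.

```java
// Hypothetical sketch of the DDLExecutor abstraction: one interface with a
// per-backend implementation chosen from the sync configuration.
interface DdlExecutor {
  String name();
  void runDdl(String statement);
}

// Runs DDL statements over a Hive JDBC connection.
class JdbcExecutor implements DdlExecutor {
  public String name() { return "jdbc"; }
  public void runDdl(String statement) { /* execute via java.sql.Statement */ }
}

// Runs DDL through a HiveQL session.
class HiveQueryExecutor implements DdlExecutor {
  public String name() { return "hiveql"; }
  public void runDdl(String statement) { /* execute via a Hive driver session */ }
}

// Talks to the Hive Metastore client directly, skipping SQL entirely.
class HmsDdlExecutor implements DdlExecutor {
  public String name() { return "hms"; }
  public void runDdl(String statement) { /* translate to metastore client calls */ }
}

class DdlExecutorFactory {
  // jdbc takes precedence, then hms, else plain HiveQL.
  static DdlExecutor create(boolean useJdbc, boolean useHms) {
    if (useJdbc) { return new JdbcExecutor(); }
    return useHms ? new HmsDdlExecutor() : new HiveQueryExecutor();
  }
}
```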
There are some small things still remaining, but the PR can be reviewed alongside those changes.
Verify this pull request
This change added tests, as described above.
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.