Skip to content

[HUDI-1534]HiveSyncTool-It is not necessary to use JDBC and MetaStoreClient at the same time#2532

Closed
vinnielhj wants to merge 24 commits intoapache:masterfrom
vinnielhj:HUDI-1534
Closed

[HUDI-1534]HiveSyncTool-It is not necessary to use JDBC and MetaStoreClient at the same time#2532
vinnielhj wants to merge 24 commits intoapache:masterfrom
vinnielhj:HUDI-1534

Conversation

@vinnielhj
Copy link

@vinnielhj vinnielhj commented Feb 4, 2021

To synchronize the hudi meta information to the Hive metastore, it is not necessary to use both JDBC and metastoreClient. Now modified some methods, unified use metastoreClient to synchronize, mainly the createTable, updateTableDefinition, addPartitionsToTable and updatePartitionsToTable methods of the HoodieHiveClient class. At the same time, a new class DLASchemaUtil was created to isolate the synchronization of DLA tables.
please review tihs pull, thanks.

@codecov-io
Copy link

codecov-io commented Feb 4, 2021

Codecov Report

Merging #2532 (afd7303) into master (c30481f) will decrease coverage by 30.43%.
The diff coverage is n/a.

Impacted file tree graph

@@              Coverage Diff              @@
##             master    #2532       +/-   ##
=============================================
- Coverage     50.90%   20.47%   -30.44%     
+ Complexity     3167      101     -3066     
=============================================
  Files           433       53      -380     
  Lines         19806     1929    -17877     
  Branches       2032      230     -1802     
=============================================
- Hits          10083      395     -9688     
+ Misses         8904     1516     -7388     
+ Partials        819       18      -801     
Flag Coverage Δ Complexity Δ
hudicli ? ?
hudiclient ? ?
hudicommon ? ?
hudiflink ? ?
hudihadoopmr ? ?
hudisparkdatasource ? ?
hudisync ? ?
huditimelineservice ? ?
hudiutilities 20.47% <ø> (-48.99%) 0.00 <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ Complexity Δ
...va/org/apache/hudi/utilities/IdentitySplitter.java 0.00% <0.00%> (-100.00%) 0.00% <0.00%> (-2.00%)
...va/org/apache/hudi/utilities/schema/SchemaSet.java 0.00% <0.00%> (-100.00%) 0.00% <0.00%> (-3.00%)
...a/org/apache/hudi/utilities/sources/RowSource.java 0.00% <0.00%> (-100.00%) 0.00% <0.00%> (-4.00%)
.../org/apache/hudi/utilities/sources/AvroSource.java 0.00% <0.00%> (-100.00%) 0.00% <0.00%> (-1.00%)
.../org/apache/hudi/utilities/sources/JsonSource.java 0.00% <0.00%> (-100.00%) 0.00% <0.00%> (-1.00%)
...rg/apache/hudi/utilities/sources/CsvDFSSource.java 0.00% <0.00%> (-100.00%) 0.00% <0.00%> (-10.00%)
...g/apache/hudi/utilities/sources/JsonDFSSource.java 0.00% <0.00%> (-100.00%) 0.00% <0.00%> (-4.00%)
...apache/hudi/utilities/sources/JsonKafkaSource.java 0.00% <0.00%> (-100.00%) 0.00% <0.00%> (-6.00%)
...pache/hudi/utilities/sources/ParquetDFSSource.java 0.00% <0.00%> (-100.00%) 0.00% <0.00%> (-5.00%)
...lities/schema/SchemaProviderWithPostProcessor.java 0.00% <0.00%> (-100.00%) 0.00% <0.00%> (-4.00%)
... and 397 more

@n3nash
Copy link
Contributor

n3nash commented Feb 4, 2021

@satishkotha can you help review ?

@vinnielhj
Copy link
Author

hi @satishkotha ,please review,thank you very much

@nsivabalan nsivabalan added the priority:high Significant impact; potential bugs label Feb 11, 2021
Copy link
Member

@satishkotha satishkotha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for delay in reviewing. I was away from work for last week.

import java.util.Set;

/**
* Schema Utilities.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks identical to HiveSchemaUtil. Can you help me understand why we need this separate util class? Is 'tickSupport' only the difference? Do you think we can pass that as a flag/config to avoid code duplication?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the master branch, the realization of DLA table synchronization reuses many methods of the HiveSchemaUtil class, such as: createHiveStruct, convertField, etc. This change only modifies the synchronization of Hive and does not involve DLA. So the DLASchemaUtil class is added for DLA table synchronization. (It is actually the former HiveSchemaUtil.java).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@satishkotha Sorry, in the branch I submitted before, DLASyncTool.java still uses HiveSchemaUtil. This is my mistake. I modified this to use DLASchemaUtil and resubmitted the code.

this.configuration = configuration;
// Support both JDBC and metastore based implementations for backwards compatiblity. Future users should
// disable jdbc and depend on metastore client for all hive registrations
if (cfg.useJdbc) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this is good to have backward compatibility and can be disabled using config easily (we can also set it to false by default?)

Do you think this adds a lot of overhead? Is this change discussed in open source meetup/email lists?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For Hive synchronization, any JDBC-related code and configuration have not been used after this change, so this content is removed from HoodieHiveClient.java. Welcome to continue the discussion @satishkotha

}
} else {
updateHiveSQLUsingHiveDriver(s);
public void createDataBase(String databaseName, String location, String description) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Database instead of DataBase everywhere?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is a problem. I have changed the spelling of Database used by Hive synchronization to Database (the second b is lowercase) and resubmit the code.


String schemaString = HiveSchemaUtil.generateSchemaString(schema);
assertEquals("`int_list` ARRAY< int>", schemaString);
assertEquals("int_list ARRAY<int>", schemaString);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add test coverage for DLASchemaUtil#generateSchemaString (and other methods in DLASchemaUtil?)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now I have added the TestDLASyncTool class to test the methods of the DLASchemaUtil class.please review.thanks.

@satishkotha
Copy link
Member

@lhjzmn Can you start a discuss thread in dev and users channel. I think removing JDBC support requires consensus in community

@rubenssoto
Copy link

Hello People,

Is this PR could help in my problem?

#2588

@satishkotha @lhjzmn

@satishkotha
Copy link
Member

Hello People,

Is this PR could help in my problem?

#2588

@satishkotha @lhjzmn

@rubenssoto I dont think so. This PR is removing jdbc support to connect to Hive. Hudi already supports a way to disable jdbc (See https://hudi.apache.org/docs/configurations.html#HIVE_USE_JDBC_OPT_KEY). I dont know how you are connecting, if your hive version, supports thrift metastore interface, you can try disabling jdbc and see if that helps.

@satishkotha
Copy link
Member

@lhjzmn we discussed this in one of the OSS meetings. we want to keep jdbc support for backward compatibility reasons. we can separate out MetastoreClient as a separate class though. please let me know what you think.

@jsbali
Copy link
Contributor

jsbali commented Apr 22, 2021

Hey @lhjzmn If you are not working on it. I can take it up and build on top your code.
One way we can do it is support all three mode ie via

  1. JDBC
  2. HiveDriver
  3. HMS

Or we can get rid of HiveDriver code and replace it with HMS calls and provide only two options like

  1. useJDBC
  2. useHMS

Or if we are sure that jdbc is not required we can completely get rid of it as well which is what this PR does.
@n3nash @vinothchandar @satishkotha What do you guys think?

@vinothchandar
Copy link
Member

@jsbali Love to see this home. Do you want to grab since wokr if you are still interested?

@jsbali
Copy link
Contributor

jsbali commented Sep 13, 2021

@vinothchandar This PR was about removing all support for jdbc. hiveql and supporting only HMS. We went ahead with a different approach in #2879 and added support for HMS along with jdbc and hiveql.
This PR can be closed.

@hudi-bot
Copy link
Collaborator

hudi-bot commented Nov 5, 2021

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@nsivabalan
Copy link
Contributor

closing this PR as we have added support for hive sync modes and for hms, users don't have to supply any additional configs like one has to provide for jdbc way.

@nsivabalan nsivabalan closed this Dec 13, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

priority:high Significant impact; potential bugs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants