[HUDI-4384] fix hive style partition and record key prefix missing in bulk_insert #6085

TengHuo · 2022-07-12T09:20:40Z

Tips

Thank you very much for contributing to Apache Hudi.
Please review https://hudi.apache.org/contribute/how-to-contribute before opening a pull request.

What is the purpose of the pull request

Fix [HUDI-4384]: the hive-style partition path and the record key prefix are missing when writing with bulk_insert in Spark.
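
To make the symptom concrete, here is a minimal sketch of the two formats involved. It is not Hudi code: the class and method names are hypothetical, and the `field:value` record key form is an assumption based on how ComplexKeyGenerator is generally described. The `field=value` directory form is the usual hive-style layout.

```java
/** Illustrative sketch only; these helpers are hypothetical and not part of Hudi. */
final class KeyFormatSketch {

    /** Partition path with or without the hive-style "field=" prefix. */
    static String partitionPath(String field, String value, boolean hiveStyle) {
        return hiveStyle ? field + "=" + value : value;
    }

    /** Record key carrying the "field:" prefix (assumed ComplexKeyGenerator form). */
    static String prefixedRecordKey(String field, String value) {
        return field + ":" + value;
    }

    public static void main(String[] args) {
        // Expected when hive-style partitioning is enabled.
        System.out.println(partitionPath("dt", "2022-07-12", true));   // dt=2022-07-12
        // What a write produces when the prefix is dropped.
        System.out.println(partitionPath("dt", "2022-07-12", false));  // 2022-07-12
        // Record key with its field prefix.
        System.out.println(prefixedRecordKey("id", "42"));             // id:42
    }
}
```

In these terms, the reported bug is that bulk_insert produced the unprefixed forms even when the prefixed ones were requested.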

Brief change log

Remove ComplexKeyGenerator bypass in HoodieDatasetBulkInsertHelper.prepareHoodieDatasetForBulkInsert
Add a new unit test method named testBulkInsertWithHiveStylePartition in TestHoodieDatasetBulkInsertHelper

Verify this pull request

This change added tests and can be verified as follows:

Added a new unit test method named testBulkInsertWithHiveStylePartition in TestHoodieDatasetBulkInsertHelper
Manually verified the change by running a job locally.

Committer checklist

Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR (no need)
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. (no need)

TengHuo · 2022-07-12T09:24:43Z

Fix #6070

TengHuo · 2022-07-12T09:43:13Z

Manually verified the change by running a job locally. The reproduction code can be found in the JIRA ticket HUDI-4384.

Result after fix: [image]

Result before fix: [image]

TengHuo · 2022-07-12T11:33:00Z

@hudi-bot run azure

TengHuo · 2022-07-12T11:40:15Z

@hudi-bot run azure

hudi-bot · 2022-07-12T12:27:26Z

CI report:

5b64362 Azure: FAILURE

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure: re-run the last Azure build

TengHuo · 2022-07-13T03:01:27Z

The CI job "UT FT clients & cli & utilities & sync" failed with java.net.ConnectException: Connection refused in the hudi-hive-sync tests. This should not be caused by this PR.

May I know how I can fix it, or should I simply re-run the tests?

TengHuo · 2022-07-14T08:12:21Z

Duplicate of #6049. I will close this PR if #6049 is merged.

nsivabalan · 2022-07-18T10:49:40Z

...atasource/hudi-spark-common/src/main/java/org/apache/hudi/HoodieDatasetBulkInsertHelper.java

@@ -87,18 +86,15 @@ public static Dataset<Row> prepareHoodieDatasetForBulkInsert(SQLContext sqlConte
    BuiltinKeyGenerator keyGenerator = (BuiltinKeyGenerator) ReflectionUtils.loadClass(keyGeneratorClass, properties);

    Dataset<Row> rowDatasetWithRecordKeysAndPartitionPath;
-    if (keyGeneratorClass.equals(NonpartitionedKeyGenerator.class.getName())) {
+    if (keyGeneratorClass.equals(NonpartitionedKeyGenerator.class.getName())
+        || (keyGeneratorClass.equals(SimpleKeyGenerator.class.getName()) && !config.isHiveStylePartitioningEnabled())) {


If hive-style partitioning is enabled, we fall back to the UDF flow, is that right? I guess the intent was to use UDF-based key generation only for non-simple use cases. Can we honor the same even with hive-style partitioning enabled, please?

TengHuo · 2022-07-18

Yes, this PR falls back to the UDF flow when hive-style partitioning is enabled, which is the same logic as in 0.10.

I think PR #6049 is a better fix, since it can improve performance by using withColumn.
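
The condition under discussion can be sketched in isolation as follows. This is a simplified illustration rather than the actual helper: the class and method names are hypothetical, the key generator classes are compared by their fully qualified names as strings, and it assumes that anything not matching the condition takes the UDF flow, as the reply above describes.

```java
/** Illustrative sketch of the branch condition in the hunk above; not Hudi code. */
final class KeyGenPathSketch {
    static final String NON_PARTITIONED = "org.apache.hudi.keygen.NonpartitionedKeyGenerator";
    static final String SIMPLE = "org.apache.hudi.keygen.SimpleKeyGenerator";
    static final String COMPLEX = "org.apache.hudi.keygen.ComplexKeyGenerator";

    /**
     * Returns true when the non-UDF branch would be taken, mirroring the
     * condition added in this PR. A false result means the UDF flow (assumed).
     */
    static boolean takesNonUdfBranch(String keyGeneratorClass, boolean hiveStylePartitioning) {
        return keyGeneratorClass.equals(NON_PARTITIONED)
            || (keyGeneratorClass.equals(SIMPLE) && !hiveStylePartitioning);
    }

    public static void main(String[] args) {
        System.out.println(takesNonUdfBranch(SIMPLE, false));  // true
        System.out.println(takesNonUdfBranch(SIMPLE, true));   // false: falls back to the UDF flow
        System.out.println(takesNonUdfBranch(COMPLEX, false)); // false: bypass removed by this PR
    }
}
```

The second case is the one the reviewer questions: a simple key generator loses the non-UDF branch as soon as hive-style partitioning is turned on.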

nsivabalan · 2022-07-18T10:51:19Z

Since the other PR is marked as a priority blocker, closing this one.

Commit 5b64362: [HUDI-4384] fix hive style partition and record key prefix missing in bulk_insert

TengHuo mentioned this pull request Jul 12, 2022

[SUPPORT]'hoodie.datasource.write.hive_style_partitioning':'true' does not take effect in hudi-0.11.1 & spark 3.2.1 #6070

Closed

xushiyan added the priority:blocker label Jul 12, 2022

xushiyan added this to Under Discussion PRs in PR Tracker Board via automation Jul 12, 2022

xushiyan linked an issue Jul 12, 2022 that may be closed by this pull request

[SUPPORT]'hoodie.datasource.write.hive_style_partitioning':'true' does not take effect in hudi-0.11.1 & spark 3.2.1 #6070

Closed

nsivabalan self-requested a review July 18, 2022 10:40

nsivabalan self-assigned this Jul 18, 2022

nsivabalan requested changes Jul 18, 2022

PR Tracker Board automation moved this from Under Discussion PRs to Nearing Landing Jul 18, 2022

nsivabalan closed this Jul 18, 2022

PR Tracker Board automation moved this from Nearing Landing to Done Jul 18, 2022
