[HUDI-4384] fix hive style partition and record key prefix missing in bulk_insert #6085

Closed

Contributor

@TengHuo TengHuo commented Jul 12, 2022

What is the purpose of the pull request

Fix [HUDI-4384]: the hive-style partition prefix and the record key prefix are missing when using bulk_insert in Spark.

Brief change log

  • Remove ComplexKeyGenerator bypass in HoodieDatasetBulkInsertHelper.prepareHoodieDatasetForBulkInsert
  • Add a new unit test method named testBulkInsertWithHiveStylePartition in TestHoodieDatasetBulkInsertHelper
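
For readers unfamiliar with the symptom, here is a minimal plain-Java sketch of the two string formats involved. It does not use any Hudi classes; `KeyFormatSketch`, `partitionPath`, and `prefixedRecordKey` are hypothetical names, and the `field=value` and `field:value` layouts are assumptions about what the prefixes in the title refer to.

```java
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical helper for illustration only; not part of Hudi. */
final class KeyFormatSketch {

    private KeyFormatSketch() {}

    /**
     * Builds a partition path for a single partition field.
     * Assumption: hive-style partitioning means "field=value",
     * otherwise the path is the bare value.
     */
    static String partitionPath(String field, String value, boolean hiveStyle) {
        return hiveStyle ? field + "=" + value : value;
    }

    /**
     * Builds a record key in which every value carries its field name as a prefix.
     * Assumption: the layout is "field1:value1,field2:value2".
     */
    static String prefixedRecordKey(Map<String, String> keyFields) {
        return keyFields.entrySet().stream()
            .map(e -> e.getKey() + ":" + e.getValue())
            .collect(Collectors.joining(","));
    }
}
```

Under these assumptions, `partitionPath("dt", "2022-07-12", true)` gives `dt=2022-07-12`, while passing `false` gives the bare `2022-07-12`, which is the kind of output the issue title describes as missing its prefix.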

Verify this pull request

This change added tests and can be verified as follows:

  • Added a new unit test method named testBulkInsertWithHiveStylePartition in TestHoodieDatasetBulkInsertHelper
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR (no need)

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. (no need)

@TengHuo
Contributor Author

TengHuo commented Jul 12, 2022

Fix #6070

@TengHuo
Contributor Author

TengHuo commented Jul 12, 2022

Manually verified the change by running a job locally. The code to reproduce the issue is in the JIRA ticket HUDI-4384.

Result after fix:

Screenshot 2022-07-12 at 17 39 15

Result before fix:

Screenshot 2022-07-12 at 16 39 58

@TengHuo
Contributor Author

TengHuo commented Jul 12, 2022

@hudi-bot run azure

@TengHuo
Contributor Author

TengHuo commented Jul 13, 2022

The "UT FT clients & cli & utilities & sync" test failed with the exception java.net.ConnectException: Connection refused in the hudi-hive-sync tests. This should not be caused by this PR.

May I know how I can fix it, or should I re-run the test?

@TengHuo
Contributor Author

TengHuo commented Jul 14, 2022

Duplicate of #6049; I will close this PR if #6049 is merged.

@nsivabalan nsivabalan self-requested a review July 18, 2022 10:40
@nsivabalan nsivabalan self-assigned this Jul 18, 2022
@@ -87,18 +86,15 @@ public static Dataset<Row> prepareHoodieDatasetForBulkInsert(SQLContext sqlConte
BuiltinKeyGenerator keyGenerator = (BuiltinKeyGenerator) ReflectionUtils.loadClass(keyGeneratorClass, properties);

Dataset<Row> rowDatasetWithRecordKeysAndPartitionPath;
if (keyGeneratorClass.equals(NonpartitionedKeyGenerator.class.getName())) {
if (keyGeneratorClass.equals(NonpartitionedKeyGenerator.class.getName())
|| (keyGeneratorClass.equals(SimpleKeyGenerator.class.getName()) && !config.isHiveStylePartitioningEnabled())) {
Contributor

If hive-style partitioning is enabled, we fall back to the UDF flow, is that right? I guess the intent was to use UDF-based key generation only for non-simple use cases. Can we honor the same even with hive-style partitioning enabled, please?

Contributor Author

Yeah, this PR falls back to the UDF if hive-style partitioning is enabled, the same logic as in 0.10.

I think PR #6049 is a better fix, since it can improve performance by using withColumn.
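
The branching discussed in this thread can be sketched in isolation as follows. This is only an illustration of the condition quoted in the diff hunk, written in plain Java with no Hudi or Spark dependencies. `BulkInsertPathSketch`, `KeyGenPath`, and `choose` are hypothetical names, the key generator classes are represented by their simple names as plain strings, and `DIRECT` is a stand-in label for the non-UDF branch.

```java
/** Hypothetical sketch of the path selection; not the actual Hudi implementation. */
final class BulkInsertPathSketch {

    /** DIRECT stands in for the non-UDF branch, UDF_BASED for the UDF fallback. */
    enum KeyGenPath { DIRECT, UDF_BASED }

    // Simple class names used as stand-ins for the fully qualified names.
    static final String NON_PARTITIONED = "NonpartitionedKeyGenerator";
    static final String SIMPLE = "SimpleKeyGenerator";

    private BulkInsertPathSketch() {}

    /**
     * Mirrors the quoted condition: the non-UDF branch is taken for the
     * non-partitioned key generator, or for the simple key generator when
     * hive-style partitioning is disabled. Everything else uses the UDF flow.
     */
    static KeyGenPath choose(String keyGeneratorClass, boolean hiveStylePartitioning) {
        boolean direct = keyGeneratorClass.equals(NON_PARTITIONED)
            || (keyGeneratorClass.equals(SIMPLE) && !hiveStylePartitioning);
        return direct ? KeyGenPath.DIRECT : KeyGenPath.UDF_BASED;
    }
}
```

In this sketch, `choose("SimpleKeyGenerator", true)` returns `UDF_BASED`, which is the fallback the review comment asks about.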

PR Tracker Board automation moved this from Under Discussion PRs to Nearing Landing Jul 18, 2022
@nsivabalan
Contributor

Since the other PR is marked as a priority blocker, closing this one.

@nsivabalan nsivabalan closed this Jul 18, 2022
PR Tracker Board automation moved this from Nearing Landing to Done Jul 18, 2022