
[HUDI-9211] Fix bug with config in DataHubSyncTool #13018

Merged
yihua merged 1 commit into apache:master from the-other-tim-brown:minor-datahub-properties-bug on Mar 25, 2025

Conversation

@the-other-tim-brown
Contributor

Change Logs

  • Updates the getInt and getString calls to return the default value when the config is not set, avoiding NullPointerExceptions
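The failure mode being fixed can be illustrated with a minimal, self-contained sketch. This is not the actual Hudi `TypedProperties`/`HoodieConfig` API; the class and method names below mirror the pattern only. A plain `getInt` returns `null` for an unset key, so unboxing it into a primitive `int` parameter throws a NullPointerException, while the `...OrDefault` variant falls back safely:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch (not the real Hudi config classes) of why the
// "...OrDefault" accessors avoid the NullPointerException.
public class ConfigDefaultSketch {
  private final Map<String, String> props = new HashMap<>();

  public void put(String key, String value) {
    props.put(key, value);
  }

  // Mirrors getInt: returns null when the key is absent.
  public Integer getInt(String key) {
    String v = props.get(key);
    return v == null ? null : Integer.parseInt(v);
  }

  // Mirrors getIntOrDefault: falls back instead of returning null.
  public int getIntOrDefault(String key, int defaultValue) {
    Integer v = getInt(key);
    return v == null ? defaultValue : v;
  }

  public static void main(String[] args) {
    ConfigDefaultSketch config = new ConfigDefaultSketch();
    // Unset key: getIntOrDefault is safe...
    System.out.println(config.getIntOrDefault("hoodie.unset.key", 4000)); // 4000
    // ...while unboxing the null from getInt would throw:
    // int threshold = config.getInt("hoodie.unset.key"); // NullPointerException
  }
}
```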

Impact

  • Fixes runtime issue if the config is not set for HIVE_SYNC_SCHEMA_STRING_LENGTH_THRESHOLD

Risk level (write none, low, medium or high below)

None

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of a config is changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:S PR with lines of changes in (10, 100] label Mar 23, 2025
String path = "file:///tmp/path";
Map<String, String> expected = new HashMap<>();
expected.put(HUDI_TABLE_TYPE, "MERGE_ON_READ");
expected.put(HUDI_TABLE_VERSION, "SIX");
Contributor Author


@sgomezvillamor is the intention for this to be 6 or is SIX spelled out the expected value?

Contributor


HUDI_TABLE_VERSION is set in

properties.put(HUDI_TABLE_VERSION, tableMetadata.getTableVersion());

which comes from the table metadata, which in turn reads

public static final ConfigProperty<HoodieTableVersion> VERSION = ConfigProperty
    .key("hoodie.table.version")
    .defaultValue(HoodieTableVersion.current());

whose values are defined as

SIX(6, CollectionUtils.createImmutableList("0.14.0"), TimelineLayoutVersion.LAYOUT_VERSION_1),

So the content is managed by Hudi itself and matches the toString serialization of the SIX enum.

Hope this helps.

Contributor Author


Thanks, I am aware of all this already. Just wanted to make sure this is intentional since it is a number in the hoodie.properties.

Contributor


What I mean is that, in any case, this is the responsibility of the datahub-sync controller.

About this being a number in the hoodie.properties, that matches the code here

public static HoodieTableVersion getTableVersion(HoodieConfig config) {
  return contains(VERSION, config)
      ? HoodieTableVersion.fromVersionCode(config.getInt(VERSION))
      : VERSION.defaultValue();
}

So there is some misalignment in Hudi: the value is loaded from the numeric version code but reported as the enum's string representation.
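The mismatch described above can be sketched in a few lines. This is a hypothetical, stripped-down stand-in for Hudi's `HoodieTableVersion` (only the version-code field is kept): the value is parsed from the numeric code stored in hoodie.properties, but once the enum is written into a `String`-valued property map, its name ("SIX") is what gets serialized, not the number:

```java
// Hypothetical sketch of the mismatch: hoodie.properties stores the
// numeric code (6), but the enum serializes as its name ("SIX").
public class VersionMismatchSketch {
  enum TableVersion {
    SIX(6), EIGHT(8);

    final int code;

    TableVersion(int code) {
      this.code = code;
    }

    // Mirrors HoodieTableVersion.fromVersionCode: numeric code -> enum.
    static TableVersion fromVersionCode(int code) {
      for (TableVersion v : values()) {
        if (v.code == code) {
          return v;
        }
      }
      throw new IllegalArgumentException("Unknown version code: " + code);
    }
  }

  public static void main(String[] args) {
    // Loaded from the numeric code in hoodie.properties...
    TableVersion v = TableVersion.fromVersionCode(6);
    // ...but reported downstream via the enum's string form.
    System.out.println(v);      // SIX
    System.out.println(v.code); // 6
  }
}
```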

Contributor


@sgomezvillamor regardless of the misalignment, could you enhance the docs to specify the expected properties synced to the datahub catalog, especially the table version so the user does not get confused (on top of #12504)?

Contributor


As requested: #13067

@hudi-bot
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Comment on lines +89 to +90
config.getStringOrDefault(META_SYNC_SPARK_VERSION),
config.getIntOrDefault(HIVE_SYNC_SCHEMA_STRING_LENGTH_THRESHOLD),
Contributor


I see ADB sync also uses config.getString(META_SYNC_SPARK_VERSION). Should that be fixed separately?

Map<String, String> sparkTableProperties = SparkDataSourceTableUtils.getSparkTableProperties(
    config.getSplitStrings(META_SYNC_PARTITION_FIELDS),
    config.getString(META_SYNC_SPARK_VERSION),
    config.getInt(ADB_SYNC_SCHEMA_STRING_LENGTH_THRESHOLD),
    schema);


@yihua yihua merged commit 72bc770 into apache:master Mar 25, 2025
43 checks passed
voonhous pushed a commit to voonhous/hudi that referenced this pull request Apr 8, 2025
voonhous pushed a commit to voonhous/hudi that referenced this pull request Apr 9, 2025
voonhous pushed a commit to voonhous/hudi that referenced this pull request Apr 15, 2025

Labels

release-1.0.2 size:S PR with lines of changes in (10, 100]

5 participants