fix: filter EXTERNAL property in SparkCatalogMetaStoreClient.toCatalogTable#18672

Merged
danny0405 merged 1 commit into apache:master from prashantwason:fix-external-property-filter
May 5, 2026

Conversation

@prashantwason
Member

Describe the issue this Pull Request addresses

Hudi's HMSDDLExecutor.createTable (HMSDDLExecutor.java#L128-L131) sets BOTH of the following on the Hive Table object when the table is external:

if (!syncConfig.getBoolean(HIVE_CREATE_MANAGED_TABLE)) {
  newTb.putToParameters("EXTERNAL", "TRUE");          // ← duplicate encoding
  newTb.setTableType(TableType.EXTERNAL_TABLE.toString());
}

When that Table flows through SparkCatalogMetaStoreClient.createTable → toCatalogTable → HiveExternalCatalog.createTable, HiveExternalCatalog.verifyTableProperties throws:

org.apache.spark.sql.AnalysisException: Cannot set or change the preserved property key: 'EXTERNAL'
  at org.apache.spark.sql.hive.HiveExternalCatalog.verifyTableProperties(HiveExternalCatalog.scala:147)
  ...
  at org.apache.spark.sql.hive.SparkCatalogMetaStoreClient.createTable(SparkCatalogMetaStoreClient.scala:59)
  at org.apache.hudi.hive.ddl.HMSDDLExecutor.createTable(HMSDDLExecutor.java:136)

Spark uses CatalogTableType.EXTERNAL on the CatalogTable itself to encode external-ness, and rejects EXTERNAL=... as a duplicate (and forbidden) encoding. So Hudi's normal HMSDDLExecutor-shaped Table cannot be created at all when SparkCatalogMetaStoreClient is in use (i.e. hoodie.datasource.hive_sync.use_spark_catalog=true), unless the user explicitly sets HIVE_CREATE_MANAGED_TABLE=true.
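The rejection described above can be illustrated with a small standalone sketch. This is a hypothetical simplification of the kind of reserved-key check that Spark's verifyTableProperties performs; the class name, exception type, and structure here are illustrative and are not Spark's actual implementation (only the property key `EXTERNAL` and the error message text come from the stack trace above):

```java
import java.util.Map;

// Hypothetical sketch of a reserved-property check, modeled on the behavior
// HiveExternalCatalog.verifyTableProperties exhibits in the stack trace above.
// Not Spark source code: class/method names here are illustrative only.
public class ReservedPropertyCheck {
    static final String RESERVED_KEY = "EXTERNAL";

    // Reject table properties that try to set the reserved key directly.
    static void verifyTableProperties(Map<String, String> properties) {
        if (properties.containsKey(RESERVED_KEY)) {
            throw new IllegalStateException(
                "Cannot set or change the preserved property key: '" + RESERVED_KEY + "'");
        }
    }

    public static void main(String[] args) {
        // Parameters shaped like HMSDDLExecutor.createTable produces:
        // the redundant EXTERNAL=TRUE entry alongside ordinary properties.
        Map<String, String> params = Map.of("EXTERNAL", "TRUE", "comment", "v1");
        try {
            verifyTableProperties(params);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With the HMSDDLExecutor-shaped parameters, the check fires and the create fails before the table reaches the catalog.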

Summary and Changelog

  • SparkCatalogMetaStoreClient.toCatalogTable: extend the existing filter that strips spark.sql.* keys to also strip EXTERNAL. The tableType field below already maps EXTERNAL_TABLE → CatalogTableType.EXTERNAL, so dropping the redundant property is safe and avoids the verifyTableProperties rejection.
  • TestSparkCatalogMetaStoreClient: add a regression test mirroring the real HMSDDLExecutor shape (tableType=EXTERNAL_TABLE AND parameters[EXTERNAL]=TRUE).

Same family as #18654 (filter spark.sql.*).
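The shape of the extended filter can be sketched in isolation. This is a hypothetical standalone illustration, not the actual Scala change in SparkCatalogMetaStoreClient.toCatalogTable; the class and method names below are made up for the example:

```java
import java.util.HashMap;
import java.util.Map;

// Standalone sketch of the extended parameter filter described above:
// drop spark.sql.* keys (the pre-existing behavior from #18654) and the
// redundant EXTERNAL key (this PR). Names here are illustrative only.
public class ExternalPropertyFilter {
    static Map<String, String> filterProperties(Map<String, String> params) {
        Map<String, String> filtered = new HashMap<>();
        for (Map.Entry<String, String> e : params.entrySet()) {
            String key = e.getKey();
            // Already stripped before this PR: Spark-internal keys.
            if (key.startsWith("spark.sql.")) continue;
            // New in this PR: redundant with CatalogTableType.EXTERNAL.
            if (key.equalsIgnoreCase("EXTERNAL")) continue;
            filtered.put(key, e.getValue());
        }
        return filtered;
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        params.put("EXTERNAL", "TRUE");
        params.put("spark.sql.sources.provider", "hudi");
        params.put("comment", "v1");
        // Both reserved-ish keys are dropped; only "comment" survives.
        System.out.println(filterProperties(params));
    }
}
```

Because external-ness is already carried by the CatalogTable's tableType, dropping the property loses no information.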

Impact

Restores the ability to use SparkCatalogMetaStoreClient with the default Hive-sync codepath (external tables). No previously successful path changes behavior; this only converts a previously thrown exception into a successful create for the same input shape.

Risk Level

low — one-line filter extension mirroring existing logic, covered by a new unit test.

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable
  • CI passed

🤖 Generated with Claude Code

…gTable

Hudi's `HMSDDLExecutor.createTable` sets both `tableType=EXTERNAL_TABLE`
and `parameters[EXTERNAL]=TRUE` on the Hive Table object when the table
is external. When that Table flows through `SparkCatalogMetaStoreClient`
into `HiveExternalCatalog`, `verifyTableProperties` rejects:

  AnalysisException: Cannot set or change the preserved property key:
  'EXTERNAL'

Spark uses `CatalogTableType.EXTERNAL` on the `CatalogTable` itself to
encode external-ness, and treats `EXTERNAL=...` as a duplicate (and
forbidden) encoding. We already map `tableType` correctly via
`if ("EXTERNAL_TABLE".equalsIgnoreCase(table.getTableType))`, so dropping
the property in the same filter that already strips `spark.sql.*` is safe.

Same family as apache#18654 (filter `spark.sql.*`).

Adds a regression test mirroring the real `HMSDDLExecutor` shape:
`tableType=EXTERNAL_TABLE` AND `parameters[EXTERNAL]=TRUE`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Contributor

@hudi-agent hudi-agent left a comment


🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR strips the redundant EXTERNAL parameter in SparkCatalogMetaStoreClient.toCatalogTable so tables created via HMSDDLExecutor flow through Spark's HiveExternalCatalog.verifyTableProperties without tripping the reserved-key check, with a regression test mirroring the real call shape. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. One suggestion below: the new test could be strengthened with an explicit assertion that the EXTERNAL property was actually stripped from the stored table parameters.

cc @yihua


client.createTable(createdTable)
assertTrue(client.tableExists(databaseName, tableName))
assertEquals("v1", client.getTable(databaseName, tableName).getParameters.get("comment"))
Contributor


🤖 nit: the test verifies createTable doesn't throw and that comment survives, but it never explicitly asserts that EXTERNAL was removed. Could you add something like assertFalse(client.getTable(databaseName, tableName).getParameters.containsKey("EXTERNAL")) to directly validate the actual fix?


@codecov-commenter

Codecov Report

❌ Patch coverage is 0% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 68.08%. Comparing base (38db5ed) to head (f8f8716).
⚠️ Report is 2 commits behind head on master.

Files with missing lines Patch % Lines
...e/spark/sql/hive/SparkCatalogMetaStoreClient.scala 0.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18672      +/-   ##
============================================
+ Coverage     68.06%   68.08%   +0.01%     
+ Complexity    28922    28919       -3     
============================================
  Files          2518     2519       +1     
  Lines        140574   140611      +37     
  Branches      17419    17423       +4     
============================================
+ Hits          95682    95731      +49     
+ Misses        37036    37022      -14     
- Partials       7856     7858       +2     
Flag Coverage Δ
common-and-other-modules 44.35% <0.00%> (-0.02%) ⬇️
hadoop-mr-java-client 44.95% <ø> (+0.06%) ⬆️
spark-client-hadoop-common 48.44% <ø> (+0.01%) ⬆️
spark-java-tests 48.61% <0.00%> (-0.02%) ⬇️
spark-scala-tests 44.75% <0.00%> (+0.05%) ⬆️
utilities 37.67% <0.00%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
...e/spark/sql/hive/SparkCatalogMetaStoreClient.scala 30.07% <0.00%> (ø)

... and 14 files with indirect coverage changes


@github-actions github-actions Bot added the size:S PR with lines of changes in (10, 100] label May 1, 2026
@hudi-bot
Collaborator

hudi-bot commented May 1, 2026

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@danny0405 danny0405 merged commit 91f341f into apache:master May 5, 2026
63 checks passed
