Skip to content

fix(spark): align CTAS partition fields by table partition order#18899

Merged
voonhous merged 2 commits into
apache:masterfrom
fhan688:fix-CTAS-partition-field-order
Jun 3, 2026
Merged

fix(spark): align CTAS partition fields by table partition order#18899
voonhous merged 2 commits into
apache:masterfrom
fhan688:fix-CTAS-partition-field-order

Conversation

@fhan688
Copy link
Copy Markdown
Contributor

@fhan688 fhan688 commented Jun 2, 2026

Describe the issue this Pull Request addresses

Spark SQL CTAS for Hudi tables can write incorrect values for multi-level partition fields when the partition columns in the SELECT output are not ordered the same as the table partition spec.

For example, a table created with:

partitioned by (year, month, day)

can receive a CTAS query whose output is:

select ..., month, day, year

The CTAS path currently forwards the resolved query output as-is, so the downstream write path may interpret partition field values by position instead of the declared table partition order.

This PR fixes the issue inline.

Summary and Changelog

This change aligns CTAS query output with the Hudi table partition field order before creating CreateHoodieTableAsSelectCommand.

Changes:

  • Reorder CTAS partition attributes according to table.partitionColumnNames in ResolveImplementationsEarly.
  • Preserve non-partition columns in their original query output order.
  • Use Spark's session resolver for partition field matching.
  • Avoid adding a projection when the CTAS output is already aligned.
  • Add Spark SQL DDL tests for multi-level partition CTAS with both ordered and out-of-order partition columns.

Impact

No public API, config, or storage format changes.

This fixes Spark SQL CTAS behavior for Hudi partitioned tables. CTAS now correctly handles multi-level partition columns even when the SELECT list orders partition fields differently from the PARTITIONED BY clause.

Risk Level

low

The change is scoped to Hudi Spark SQL CTAS analysis for resolved Hudi tables. Non-partitioned CTAS and already-aligned CTAS plans keep the existing behavior. Verification was added for both COW and MOR table types
through the existing TestCreateTable CTAS coverage.

Documentation Update

none

This is a bug fix with no new feature, config, or public API change.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions github-actions Bot added the size:S PR with lines of changes in (10, 100] label Jun 2, 2026
Copy link
Copy Markdown
Member

@voonhous voonhous left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the fix!

LGTM

Will love another set of eyes to take a look at this before it gets merged in :)

Copy link
Copy Markdown
Contributor

@hudi-agent hudi-agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR aligns the CTAS query output partition column order with the table's declared partition spec to fix a positional misinterpretation bug in multi-level partitioned tables. The logic is straightforward, scoped to the analyzer rule, and includes regression tests for both ordered and out-of-order partition columns. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. One minor readability suggestion below, otherwise the code is clean.

cc @yihua

@voonhous
Copy link
Copy Markdown
Member

voonhous commented Jun 3, 2026

@fhan688 Can you also try to address the nit comments if possible?

@codecov-commenter
Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 77.77778% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.81%. Comparing base (af38b88) to head (1abd797).
⚠️ Report is 3 commits behind head on master.

Files with missing lines Patch % Lines
...pache/spark/sql/hudi/analysis/HoodieAnalysis.scala 77.77% 2 Missing and 2 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18899      +/-   ##
============================================
+ Coverage     67.01%   68.81%   +1.80%     
- Complexity    28461    29172     +711     
============================================
  Files          2520     2520              
  Lines        140046   140073      +27     
  Branches      17197    17213      +16     
============================================
+ Hits          93850    96393    +2543     
+ Misses        38529    35902    -2627     
- Partials       7667     7778     +111     
Flag Coverage Δ
common-and-other-modules 44.31% <0.00%> (-0.03%) ⬇️
hadoop-mr-java-client 44.87% <ø> (-0.04%) ⬇️
spark-client-hadoop-common 48.16% <ø> (-0.07%) ⬇️
spark-java-tests 49.36% <0.00%> (+0.02%) ⬆️
spark-scala-tests 45.25% <77.77%> (-0.01%) ⬇️
utilities 37.40% <0.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...pache/spark/sql/hudi/analysis/HoodieAnalysis.scala 73.89% <77.77%> (+0.17%) ⬆️

... and 160 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@hudi-bot
Copy link
Copy Markdown
Collaborator

hudi-bot commented Jun 3, 2026

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@voonhous voonhous merged commit ba8c4c7 into apache:master Jun 3, 2026
58 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:S PR with lines of changes in (10, 100]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants