fix(spark): align CTAS partition fields by table partition order #18899
hudi-agent
left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for the contribution! This PR aligns the CTAS query output partition column order with the table's declared partition spec to fix a positional misinterpretation bug in multi-level partitioned tables. The logic is straightforward, scoped to the analyzer rule, and includes regression tests for both ordered and out-of-order partition columns. No correctness issues found; there are a few style/readability suggestions in the inline comments. Please take a look, after which this should be ready for a Hudi committer or PMC member to take it from here.
cc @yihua
@fhan688 Can you also try to address the nit comments if possible?
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18899      +/-   ##
============================================
+ Coverage     67.01%   68.81%   +1.80%
- Complexity    28461    29172     +711
============================================
  Files          2520     2520
  Lines        140046   140073      +27
  Branches      17197    17213      +16
============================================
+ Hits          93850    96393    +2543
+ Misses        38529    35902    -2627
- Partials       7667     7778     +111
Describe the issue this Pull Request addresses
Spark SQL CTAS for Hudi tables can write incorrect values for multi-level partition fields when the partition columns in the SELECT output are not ordered the same as the table partition spec.
For example, a table created with:
can receive a CTAS query whose output is:
select ..., month, day, year
The CTAS path currently forwards the resolved query output as-is, so the downstream write path may interpret partition field values by position instead of the declared table partition order.
This PR fixes the issue inline.
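To illustrate the idea, here is a minimal, Spark-free sketch of reordering a query's output columns to match a declared partition order. Everything in it is hypothetical: the object name `AlignPartitionColumns`, the `align` helper, the `year, month, day` partition spec, and the choice to place partition columns last are assumptions made for the example, not the code in this PR. It also matches column names exactly, whereas Spark's resolver may be case-insensitive.

```scala
// Hypothetical sketch: a plain-Scala model of partition column alignment.
// This is NOT the implementation in the PR; names and layout are assumptions.
object AlignPartitionColumns {

  /** Keeps non-partition columns in their original order, then appends the
    * partition columns in the order declared by the table.
    */
  def align(queryOutput: Seq[String], partitionColumns: Seq[String]): Seq[String] = {
    val missing = partitionColumns.filterNot(queryOutput.contains)
    require(missing.isEmpty,
      s"query output is missing partition columns: ${missing.mkString(", ")}")

    val partitionSet = partitionColumns.toSet
    val dataColumns = queryOutput.filterNot(partitionSet.contains)
    dataColumns ++ partitionColumns
  }

  def main(args: Array[String]): Unit = {
    // Assumed table spec: PARTITIONED BY (year, month, day)
    val declared = Seq("year", "month", "day")
    // SELECT list that orders the partition fields differently
    val output = Seq("id", "name", "month", "day", "year")

    // prints "id, name, year, month, day"
    println(align(output, declared).mkString(", "))
  }
}
```

Without such a step, a writer that reads the trailing columns by position would pair `year` with the value of `month`, `month` with `day`, and `day` with `year` in this example.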
Summary and Changelog
This change aligns CTAS query output with the Hudi table partition field order before creating CreateHoodieTableAsSelectCommand.
Changes:
table.partitionColumnNames in ResolveImplementationsEarly.
Impact
No public API, config, or storage format changes.
This fixes Spark SQL CTAS behavior for Hudi partitioned tables. CTAS now correctly handles multi-level partition columns even when the SELECT list orders partition fields differently from the PARTITIONED BY clause.
Risk Level
low
The change is scoped to Hudi Spark SQL CTAS analysis for resolved Hudi tables. Non-partitioned CTAS and already-aligned CTAS plans keep the existing behavior. Verification was added for both COW and MOR table types through the existing TestCreateTableCTAS coverage.
Documentation Update
none
This is a bug fix with no new feature, config, or public API change.
Contributor's checklist