fix: Honor SparkSession overrides for rebase mode and timezone in compaction tasks #18675
Conversation
When MOR compaction runs outside a Spark SQL execution context (e.g. a
standalone CompactTask runner), `SQLConf.get` on the executor task thread
returns a fresh fallback `SQLConf` with default values, not the user's
SparkSession overrides. As a result, `Spark{3_3,3_4,3_5,4_0}Adapter
.getDateTimeRebaseMode()` resolved to `EXCEPTION` even when the user had
set `spark.sql.parquet.datetimeRebaseModeInWrite=LEGACY`, producing
`SparkUpgradeException [INCONSISTENT_BEHAVIOR_CROSS_VERSION
.WRITE_ANCIENT_DATETIME]` during compaction of MOR tables containing
pre-1900 timestamps. The same gap affected
`HoodieRowParquetWriteSupport.init()`'s `sessionLocalTimeZone` read.
Adapter and WriteSupport now resolve the value in this order:
1. SQLConf override (so `spark.conf.set(...)` on the SparkSession takes
effect on the driver and inside SQL execution contexts).
2. SparkConf via SparkEnv.get.conf (broadcast to every executor at
startup, so user-set keys are honored on executor tasks running
outside a SQL execution context).
3. The ConfigEntry's own default (or SQLConf.sessionLocalTimeZone for
the timezone helper).
Adds TestSparkAdapterRebaseModePropagation (3 methods) covering rebase
mode and timezone propagation into vanilla parallelize().map() task
closures. Each test fails without the fix.
hudi-agent left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for working on this! The PR adds a SparkConf fallback for rebase-mode and session timezone reads so they survive Spark's missing SQLConf propagation in non-SQL-execution executor tasks (e.g., compaction dispatched via `parallelize().map()`). The 3-step resolution looks correct, the new helper is exposed for testing in a clean way, and the tests are well-constructed (the JVM-default-vs-custom timezone trick is a nice touch).

One thing worth double-checking is whether `Spark4_1Adapter.getDateTimeRebaseMode()` should be in this fix as well; see the inline comment. Please take a look at the inline comment, and this should be ready for a Hudi committer or PMC member to take it from here. A single minor Scala idiom nit; otherwise the code is clean and well-documented.
Codecov Report
❌ Patch coverage is
Additional details and impacted files

@@             Coverage Diff              @@
##             master   #18675      +/-   ##
============================================
  Coverage     68.07%   68.08%
- Complexity    28916    28937      +21
============================================
  Files          2519     2519
  Lines        140611   140645      +34
  Branches      17423    17426       +3
============================================
+ Hits          95726    95754      +28
- Misses        37026    37031       +5
- Partials       7859     7860       +1
Describe the issue this Pull Request addresses
When MOR compaction runs outside a Spark SQL execution context (for example, a standalone compaction runner that dispatches per-file-group work via `JavaSparkContext.parallelize(...).map(...)`), `SQLConf.get` on the executor task thread returns a fresh fallback `SQLConf` with default values rather than the user's SparkSession overrides. As a result, `Spark{3_3,3_4,3_5,4_0}Adapter.getDateTimeRebaseMode()` resolves to `EXCEPTION` even when the user has set `spark.sql.parquet.datetimeRebaseModeInWrite=LEGACY`, producing `SparkUpgradeException [INCONSISTENT_BEHAVIOR_CROSS_VERSION.WRITE_ANCIENT_DATETIME]` during compaction of MOR tables containing pre-1900 timestamps (a common Oracle DATE sentinel pattern). The same gap affects `HoodieRowParquetWriteSupport.init()`'s `sessionLocalTimeZone` read when LEGACY rebase metadata is being recorded.
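To illustrate the gap (a hand-written sketch, not code from this PR): reading the config through `SQLConf.get` inside a plain RDD closure sees only defaults, even though the session override is visible on the driver.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.internal.SQLConf

val spark = SparkSession.builder().master("local[2]").getOrCreate()
spark.conf.set("spark.sql.parquet.datetimeRebaseModeInWrite", "LEGACY")

// No SQL execution context here, so Spark does not propagate the session's
// SQLConf to the task thread; SQLConf.get falls back to a default-valued conf.
val seen = spark.sparkContext.parallelize(Seq(0), 1).map { _ =>
  SQLConf.get.getConfString("spark.sql.parquet.datetimeRebaseModeInWrite", "EXCEPTION")
}.collect().head

// seen == "EXCEPTION" (the default), while spark.conf.get on the driver
// still returns "LEGACY": the override never reached the task thread.
```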
Summary and Changelog

This change resolves the value in three steps (a sketch of the lookup follows the list):
1. SQLConf override: `spark.conf.set(...)` on the SparkSession (and any task thread that IS inside a Spark SQL execution context, where Spark already propagated SQLConf for us).
2. SparkConf via `SparkEnv.get.conf`: catches values set in SparkConf at startup. SparkConf is broadcast to every executor, so user overrides reach tasks dispatched outside any SQL execution context.
3. The `ConfigEntry`'s own default (or `SQLConf.sessionLocalTimeZone` for the timezone helper).
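A minimal sketch of that lookup order, with a hypothetical helper name (the PR's adapter code may differ in shape):

```scala
import org.apache.spark.SparkEnv
import org.apache.spark.sql.internal.SQLConf

// Hypothetical helper illustrating the three-step resolution.
def resolveWithSparkConfFallback(key: String, entryDefault: String): String = {
  val sqlConf = SQLConf.get
  if (sqlConf.contains(key)) {
    // 1. SQLConf override: covers the driver and task threads inside a SQL
    //    execution context, where Spark already propagated the session conf.
    sqlConf.getConfString(key)
  } else {
    // 2. SparkConf via SparkEnv: set at startup and shipped to every executor,
    //    so it is visible even to tasks outside any SQL execution context.
    Option(SparkEnv.get).flatMap(_.conf.getOption(key))
      // 3. The ConfigEntry's own default.
      .getOrElse(entryDefault)
  }
}

// e.g. resolveWithSparkConfFallback("spark.sql.parquet.datetimeRebaseModeInWrite", "EXCEPTION")
```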
The inline compaction path via `df.write` was already correct because Spark's SQL execution wrapper had already populated the task thread's SQLConf. No call-site changes; the fix is localized to the adapter read and a small helper in `HoodieRowParquetWriteSupport`:

- `Spark{3_3,3_4,3_5,4_0}Adapter.getDateTimeRebaseMode()`: SQLConf → SparkConf → ConfigEntry default.
- `HoodieRowParquetWriteSupport`: extracted Parquet metadata key strings as named constants; a new `resolveSessionLocalTimeZone()` helper applies the same resolution order (sketched below); called from `init()` when LEGACY rebase metadata is being recorded.
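A possible shape for the timezone helper (hypothetical body; the real `resolveSessionLocalTimeZone()` in `HoodieRowParquetWriteSupport` may differ):

```scala
import org.apache.spark.SparkEnv
import org.apache.spark.sql.internal.SQLConf

// Same three-step order, but step 3 falls back to SQLConf.sessionLocalTimeZone
// (which itself defaults to the JVM timezone) instead of a literal default.
def resolveSessionLocalTimeZone(): String = {
  val key = "spark.sql.session.timeZone" // SQLConf.SESSION_LOCAL_TIMEZONE's key
  val sqlConf = SQLConf.get
  if (sqlConf.contains(key)) {
    sqlConf.getConfString(key)
  } else {
    Option(SparkEnv.get).flatMap(_.conf.getOption(key))
      .getOrElse(sqlConf.sessionLocalTimeZone)
  }
}
```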
New tests: `TestSparkAdapterRebaseModePropagation` (3 methods), covering:

- rebase mode propagation into vanilla `parallelize().map()` task closures,
- `SparkEnv.get.conf` as the delivery path (mechanism check),
- `sessionLocalTimeZone` reaching executor task threads.

Each test fails without the fix.
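Roughly, the propagation check looks like this (our sketch, not the PR's exact test code; assertion wiring simplified):

```scala
import org.apache.spark.SparkEnv
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .config("spark.sql.parquet.datetimeRebaseModeInWrite", "LEGACY")
  .getOrCreate()

// The closure runs as a plain task, outside any SQL execution context.
val modes = spark.sparkContext.parallelize(Seq(1), 1).map { _ =>
  // SparkConf is broadcast to executors, so the override is visible here
  // even though the task thread's SQLConf is not.
  SparkEnv.get.conf
    .getOption("spark.sql.parquet.datetimeRebaseModeInWrite")
    .getOrElse("EXCEPTION")
}.collect()

assert(modes.sameElements(Array("LEGACY")))
```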
Impact
User-facing: users who set
spark.sql.parquet.datetimeRebaseModeInWrite(orspark.sql.session.timeZone) on the SparkSession — at startup via SparkConf or at runtime viaspark.conf.set(...)— will have the override honored by Hudi compaction tasks dispatched outside a Spark SQL execution context. No behavior change for callers already inside a SQL execution context, and no change to default behavior when no override is set.Risk Level
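For concreteness, both override styles are covered (a short sketch; the keys are standard Spark configs):

```scala
import org.apache.spark.sql.SparkSession

// Style 1: at startup, via SparkConf (e.g. --conf on spark-submit, or builder config).
val spark = SparkSession.builder()
  .config("spark.sql.parquet.datetimeRebaseModeInWrite", "LEGACY")
  .getOrCreate()

// Style 2: at runtime, on an existing session.
spark.conf.set("spark.sql.session.timeZone", "UTC")
```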
Risk Level

Low. The SparkConf fallback only fires when `SQLConf.get` does not have the key set; SparkConf is the canonical broadcast source for cluster-wide configuration. No call-site or API changes.
Documentation Update
None required; the fix preserves the documented behavior of the underlying Spark configs.
Contributor's checklist