
fix: Honor SparkSession overrides for rebase mode and timezone in compaction tasks #18675

Merged
voonhous merged 11 commits into apache:master from yihua:yihua/sql-conf-executor-rebase-fix on May 2, 2026

Conversation

@yihua
Contributor

@yihua yihua commented May 1, 2026

Describe the issue this Pull Request addresses

When MOR compaction runs outside a Spark SQL execution context (for example, a standalone compaction runner that dispatches per-file-group work via JavaSparkContext.parallelize(...).map(...)), SQLConf.get on the executor task thread returns a fresh fallback SQLConf with default values rather than the user's SparkSession overrides. As a result, Spark{3_3,3_4,3_5,4_0}Adapter.getDateTimeRebaseMode() resolves to EXCEPTION even when the user has set spark.sql.parquet.datetimeRebaseModeInWrite=LEGACY, producing SparkUpgradeException [INCONSISTENT_BEHAVIOR_CROSS_VERSION.WRITE_ANCIENT_DATETIME] during compaction of MOR tables containing pre-1900 timestamps (a common Oracle DATE sentinel pattern). The same gap affects HoodieRowParquetWriteSupport.init()'s sessionLocalTimeZone read when LEGACY rebase metadata is being recorded.
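The failure mechanism can be illustrated without Spark. This is a minimal plain-Java analogy (not Spark's actual `SQLConf` implementation): a thread-local config whose initial value is "empty" behaves like `SQLConf.get` on a task thread that never received the session's overrides — reads fall through to the hard-coded default.

```java
import java.util.Optional;

public class FallbackConfDemo {
    // Analogy for SQLConf.get: a per-thread value that starts empty, so a
    // fresh thread that never saw the override falls back to the default.
    static final ThreadLocal<Optional<String>> REBASE_MODE =
        ThreadLocal.withInitial(Optional::empty);

    static String getRebaseMode() {
        return REBASE_MODE.get().orElse("EXCEPTION"); // ConfigEntry default
    }

    public static void main(String[] args) throws Exception {
        // Driver thread: the user's override is visible.
        REBASE_MODE.set(Optional.of("LEGACY"));
        System.out.println("driver thread: " + getRebaseMode());

        // A fresh task thread never saw the override and gets the default.
        Thread task = new Thread(() ->
            System.out.println("task thread: " + getRebaseMode()));
        task.start();
        task.join();
    }
}
```

Running this prints `LEGACY` on the driver thread and `EXCEPTION` on the task thread, mirroring the compaction failure described above.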

Summary and Changelog

This change resolves the value in three steps:

  1. SQLConf override — catches values set via spark.conf.set(...) on the SparkSession (and any task thread that IS inside a Spark SQL execution context, where Spark already propagated SQLConf for us).
  2. SparkConf via SparkEnv.get.conf — catches values set in SparkConf at startup. SparkConf is broadcast to every executor, so user overrides reach tasks dispatched outside any SQL execution context.
  3. The ConfigEntry's own default (or SQLConf.sessionLocalTimeZone for the timezone helper).
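The three-step order can be sketched as a simple fallback chain. This is a hypothetical illustration, not the actual adapter code: the two maps stand in for `SQLConf.get` and `SparkEnv.get.conf`, which are the real Spark sources.

```java
import java.util.Map;
import java.util.Optional;

public class ResolutionOrderSketch {
    // sqlConf models SQLConf.get on the current thread; sparkConf models the
    // SparkConf broadcast to executors via SparkEnv.get.conf.
    static String resolve(Map<String, String> sqlConf,
                          Map<String, String> sparkConf,
                          String key, String configEntryDefault) {
        return Optional.ofNullable(sqlConf.get(key))           // 1. SQLConf override
            .or(() -> Optional.ofNullable(sparkConf.get(key))) // 2. SparkConf fallback
            .orElse(configEntryDefault);                       // 3. ConfigEntry default
    }

    public static void main(String[] args) {
        String key = "spark.sql.parquet.datetimeRebaseModeInWrite";
        // Executor task outside a SQL execution context: empty SQLConf,
        // but the user set the key in SparkConf at startup.
        System.out.println(resolve(Map.of(), Map.of(key, "LEGACY"), key, "EXCEPTION"));
        // No override anywhere: the ConfigEntry default wins.
        System.out.println(resolve(Map.of(), Map.of(), key, "EXCEPTION"));
    }
}
```

The ordering matters: a runtime `spark.conf.set(...)` (step 1) must shadow a startup `SparkConf` value (step 2), which in turn shadows the built-in default.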

The inline-compaction-via-df.write path was already correct because Spark's SQL execution wrapper had already populated the task thread's SQLConf. No call-site changes; the fix is localized to the adapter read and a small helper in HoodieRowParquetWriteSupport.

  • Spark{3_3,3_4,3_5,4_0}Adapter.getDateTimeRebaseMode(): SQLConf → SparkConf → ConfigEntry default.
  • HoodieRowParquetWriteSupport: extracted Parquet metadata key strings as named constants; new resolveSessionLocalTimeZone() helper applies the same resolution order; called from init() when LEGACY rebase metadata is being recorded.
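The timezone variant follows the same two-step lookup, ending in the session-local timezone default rather than a fixed string. The sketch below is illustrative only (the names are not the real Hudi/Spark API); when `spark.sql.session.timeZone` is unset anywhere, `SQLConf.sessionLocalTimeZone` resolves to the JVM default zone.

```java
import java.util.Map;
import java.util.Optional;
import java.util.TimeZone;

public class TimeZoneResolutionSketch {
    static String resolveSessionLocalTimeZone(Map<String, String> sqlConf,
                                              Map<String, String> sparkConf) {
        String key = "spark.sql.session.timeZone";
        return Optional.ofNullable(sqlConf.get(key))           // 1. SQLConf override
            .or(() -> Optional.ofNullable(sparkConf.get(key))) // 2. SparkConf fallback
            .orElseGet(() -> TimeZone.getDefault().getID());   // 3. session default
    }

    public static void main(String[] args) {
        // Zone set in SparkConf at startup; executor task has no SQLConf.
        System.out.println(resolveSessionLocalTimeZone(
            Map.of(), Map.of("spark.sql.session.timeZone", "UTC")));
        // No override anywhere: falls back to the JVM default zone.
        System.out.println(resolveSessionLocalTimeZone(Map.of(), Map.of())
            .equals(TimeZone.getDefault().getID()));
    }
}
```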

New tests: TestSparkAdapterRebaseModePropagation (3 methods)

  • rebase mode reaches executor task threads under parallelize().map(),
  • SparkConf is reachable on executors via SparkEnv.get.conf (mechanism check),
  • sessionLocalTimeZone reaches executor task threads.

Each test fails without the fix.
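The review below highlights the "JVM-default-vs-custom timezone trick" used by these tests. The idea, sketched here in plain Java rather than the actual Hudi test, is to choose an override guaranteed to differ from the JVM default, so a bug that silently falls back to the default cannot masquerade as correct propagation.

```java
import java.util.TimeZone;
import java.util.concurrent.atomic.AtomicReference;

public class PropagationTestSketch {
    public static void main(String[] args) throws Exception {
        // Pick a zone that cannot equal the JVM default.
        String jvmDefault = TimeZone.getDefault().getID();
        String custom = "UTC".equals(jvmDefault) ? "America/Los_Angeles" : "UTC";

        // Stand-in for the value a task thread resolves after the fix; in the
        // real test this comes from SparkEnv.get.conf inside parallelize().map().
        final String resolvedByFix = custom;

        AtomicReference<String> seenOnTask = new AtomicReference<>();
        Thread task = new Thread(() -> seenOnTask.set(resolvedByFix));
        task.start();
        task.join();

        // Without the fix, a task thread would observe jvmDefault instead.
        if (!custom.equals(seenOnTask.get())) throw new AssertionError();
        System.out.println("propagated: " + !jvmDefault.equals(seenOnTask.get()));
    }
}
```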

Impact

User-facing: users who set spark.sql.parquet.datetimeRebaseModeInWrite (or spark.sql.session.timeZone) on the SparkSession — at startup via SparkConf or at runtime via spark.conf.set(...) — will have the override honored by Hudi compaction tasks dispatched outside a Spark SQL execution context. No behavior change for callers already inside a SQL execution context, and no change to default behavior when no override is set.

Risk Level

Low. The SparkConf fallback only fires when SQLConf.get does not have the key set; SparkConf is the canonical broadcast source for cluster-wide configuration. No call-site or API changes.

Documentation Update

None required — fix preserves the documented behavior of the underlying Spark configs.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

fix: Honor SparkSession overrides for rebase mode and timezone in compaction tasks

When MOR compaction runs outside a Spark SQL execution context (e.g. a
standalone CompactTask runner), `SQLConf.get` on the executor task thread
returns a fresh fallback `SQLConf` with default values, not the user's
SparkSession overrides. As a result, `Spark{3_3,3_4,3_5,4_0}Adapter
.getDateTimeRebaseMode()` resolved to `EXCEPTION` even when the user had
set `spark.sql.parquet.datetimeRebaseModeInWrite=LEGACY`, producing
`SparkUpgradeException [INCONSISTENT_BEHAVIOR_CROSS_VERSION
.WRITE_ANCIENT_DATETIME]` during compaction of MOR tables containing
pre-1900 timestamps. The same gap affected
`HoodieRowParquetWriteSupport.init()`'s `sessionLocalTimeZone` read.

Adapter and WriteSupport now resolve the value in this order:
  1. SQLConf override (so `spark.conf.set(...)` on the SparkSession takes
     effect on the driver and inside SQL execution contexts).
  2. SparkConf via SparkEnv.get.conf (broadcast to every executor at
     startup, so user-set keys are honored on executor tasks running
     outside a SQL execution context).
  3. The ConfigEntry's own default (or SQLConf.sessionLocalTimeZone for
     the timezone helper).

Adds TestSparkAdapterRebaseModePropagation (3 methods) covering rebase
mode and timezone propagation into vanilla parallelize().map() task
closures. Each test fails without the fix.
@github-actions github-actions Bot added the size:M PR with lines of changes in (100, 300] label May 1, 2026
Contributor

@hudi-agent hudi-agent left a comment


🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! The PR adds a SparkConf fallback for rebase-mode and session timezone reads so they survive Spark's missing SQLConf propagation in non-SQL-execution executor tasks (e.g., compaction dispatched via parallelize().map()). The 3-step resolution looks correct, the new helper is exposed for testing in a clean way, and the tests are well-constructed (the JVM-default-vs-custom timezone trick is a nice touch). One thing worth double-checking is whether Spark4_1Adapter.getDateTimeRebaseMode() should be in this fix as well — see the inline comment. Please take a look at the inline comment, and this should be ready for a Hudi committer or PMC member to take it from here. A single minor Scala idiom nit — otherwise the code is clean and well-documented.

@yihua yihua changed the title Honor SparkSession overrides for rebase mode and timezone in compaction tasks fix: Honor SparkSession overrides for rebase mode and timezone in compaction tasks May 1, 2026
@github-actions github-actions Bot added size:L PR with lines of changes in (300, 1000] and removed size:M PR with lines of changes in (100, 300] labels May 2, 2026
@codecov-commenter

Codecov Report

❌ Patch coverage is 97.50000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 68.08%. Comparing base (f7508de) to head (c46cdfd).

Files with missing lines Patch % Lines
...i/io/storage/row/HoodieRowParquetWriteSupport.java 90.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##             master   #18675   +/-   ##
=========================================
  Coverage     68.07%   68.08%           
- Complexity    28916    28937   +21     
=========================================
  Files          2519     2519           
  Lines        140611   140645   +34     
  Branches      17423    17426    +3     
=========================================
+ Hits          95726    95754   +28     
- Misses        37026    37031    +5     
- Partials       7859     7860    +1     
Flag Coverage Δ
common-and-other-modules 44.35% <0.00%> (+<0.01%) ⬆️
hadoop-mr-java-client 44.95% <ø> (-0.01%) ⬇️
spark-client-hadoop-common 48.44% <80.00%> (+<0.01%) ⬆️
spark-java-tests 48.64% <95.00%> (-0.01%) ⬇️
spark-scala-tests 44.76% <75.00%> (+<0.01%) ⬆️
utilities 37.68% <56.25%> (-0.04%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
...org/apache/spark/sql/adapter/Spark3_3Adapter.scala 65.62% <100.00%> (+2.91%) ⬆️
...org/apache/spark/sql/adapter/Spark3_4Adapter.scala 64.06% <100.00%> (+3.04%) ⬆️
...org/apache/spark/sql/adapter/Spark3_5Adapter.scala 64.61% <100.00%> (+2.94%) ⬆️
...org/apache/spark/sql/adapter/Spark4_0Adapter.scala 62.50% <100.00%> (+3.17%) ⬆️
...org/apache/spark/sql/adapter/Spark4_1Adapter.scala 62.50% <100.00%> (+3.17%) ⬆️
...i/io/storage/row/HoodieRowParquetWriteSupport.java 73.21% <90.00%> (+0.33%) ⬆️

... and 8 files with indirect coverage changes


@hudi-bot
Collaborator

hudi-bot commented May 2, 2026

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@voonhous voonhous merged commit 695294c into apache:master May 2, 2026
63 checks passed
