test(spark): repro for bytes→string promotion crashing data-skipping read by linliu-code · Pull Request #18810 · apache/hudi

linliu-code · 2026-05-22T02:30:24Z

Describe the issue this Pull Request addresses

The Hudi schema-evolution docs at https://hudi.apache.org/docs/schema_evolution/ list bytes → string as a supported type promotion. Empirically on Hudi 1.x master:

✅ The WRITE succeeds (initial bytes batch + evolved string batch both commit).
✅ Reading WITHOUT data skipping succeeds and returns the matching row.

❌ Reading WITH data skipping (the default) throws:

java.lang.ClassCastException:
  class java.nio.HeapByteBuffer cannot be cast to class [B
  (java.nio.HeapByteBuffer and [B are in module java.base of loader 'bootstrap')
  at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getBinary(rows.scala:46)
  at org.apache.spark.sql.execution.LocalTableScanExec.$anonfun$unsafeRows$1(LocalTableScanExec.scala:44)
  ...

Cross-version evidence: this is a regression introduced in 1.1.0

I ran the same minimal repro (bytes write → string evolve → string-predicate read with data skipping enabled) against every Hudi bundle published to Maven Central between 0.15.0 and master HEAD:

Version	DS=enabled	DS=disabled
0.15.0	✅ PASS	✅ PASS
0.15.1-rc1	✅ PASS	✅ PASS
1.0.0	✅ PASS	✅ PASS
1.0.1	✅ PASS	✅ PASS
1.0.2	✅ PASS	✅ PASS
1.1.0	❌ CRASH	✅ PASS
1.1.1	❌ CRASH	✅ PASS
master HEAD (1.3-SNAPSHOT)	❌ CRASH	✅ PASS

The regression falls between 1.0.2 (works) and 1.1.0 (crashes), so the 1.0.2 → 1.1.0 commit window is the bisect target for finding the responsible change.

This re-frames the question. It is not just "is this a documented limitation or a bug?" but "this worked in every release tested from 0.15.0 through 1.0.2, then 1.1.0 broke it."

This PR contains only a test that documents the observed behavior. It does NOT include a fix. The intent is to ask reviewers to confirm:

(a) the test correctly demonstrates a bug — bytes → string is documented as supported, so data-skipping queries should not crash after the promotion, OR
(b) the test setup is missing a config or usage detail that the empirical crash depends on (in which case the docs probably need a clarification on this corner of the promotion matrix).

Summary and Changelog

Adds one new test file: hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestBytesToStringPromotionDataSkipping.scala.

The test is testBytesToStringPromotionReadAfterEvolution, parameterized via @CsvSource across:

Dimension	Values
`tableType`	`COPY_ON_WRITE`, `MERGE_ON_READ`
`dataSkippingEnabled`	`true`, `false`

→ 4 cells total. Per the docs, all 4 should PASS.

Observed on this branch (off latest master facb517ef957):

Tests run: 4, Failures: 0, Errors: 2, Skipped: 0

Cell	Result
(COW, dataSkipping=true)	❌ ERROR — `ClassCastException`
(COW, dataSkipping=false)	✅ PASS
(MOR, dataSkipping=true)	❌ ERROR — `ClassCastException`
(MOR, dataSkipping=false)	✅ PASS

The crash reproduces consistently on both COW and MOR with data skipping enabled.

Test details

1. Write a 3-row initial batch with col_promote as BinaryType (arrays like Array[Byte](0x01, 0x02)).
2. Write a 2-row evolved batch with col_promote as StringType (values "zz_alpha", "zz_beta").
3. Read with a string predicate: SELECT _row_key FROM t WHERE col_promote = 'zz_alpha'.

Expected: 1 matching row returned.
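
The steps above can be sketched as a standalone Spark job. This is not the test file from the PR, only an approximation of it: it assumes a Hudi Spark bundle is on the classpath, and the base path, table name, `ts` ordering column, and key values are hypothetical placeholders. The configuration keys are the string forms of standard Hudi options.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object BytesToStringReproSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("bytes-to-string-repro-sketch")
      .master("local[2]")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical location and table name, chosen only for this sketch.
    val basePath = "/tmp/hudi_bytes_to_string_repro"
    val writeOpts = Map(
      "hoodie.table.name" -> "bytes_to_string_repro",
      "hoodie.datasource.write.recordkey.field" -> "_row_key",
      "hoodie.datasource.write.precombine.field" -> "ts",
      "hoodie.datasource.write.table.type" -> "COPY_ON_WRITE",
      // Enable MDT col_stats and index the promoted column.
      "hoodie.metadata.enable" -> "true",
      "hoodie.metadata.index.column.stats.enable" -> "true",
      "hoodie.metadata.index.column.stats.column.list" -> "col_promote"
    )

    // Step 1: initial batch, col_promote is BinaryType.
    Seq(
      ("k1", 1L, Array[Byte](0x01, 0x02)),
      ("k2", 1L, Array[Byte](0x03, 0x04)),
      ("k3", 1L, Array[Byte](0x05, 0x06))
    ).toDF("_row_key", "ts", "col_promote")
      .write.format("hudi").options(writeOpts).mode(SaveMode.Overwrite).save(basePath)

    // Step 2: evolved batch, col_promote is now StringType.
    Seq(
      ("k4", 2L, "zz_alpha"),
      ("k5", 2L, "zz_beta")
    ).toDF("_row_key", "ts", "col_promote")
      .write.format("hudi").options(writeOpts).mode(SaveMode.Append).save(basePath)

    // Step 3: read with a string predicate and data skipping enabled.
    // Per the report in this PR, this is the read that fails on 1.1.0 and later.
    spark.conf.set("hoodie.enable.data.skipping", "true")
    spark.read.format("hudi").load(basePath).createOrReplaceTempView("t")
    spark.sql("SELECT _row_key FROM t WHERE col_promote = 'zz_alpha'").show()

    spark.stop()
  }
}
```

Switching `hoodie.enable.data.skipping` to `"false"` in step 3 corresponds to the passing cells of the matrix.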

MDT col_stats is explicitly enabled and col_promote is included in the indexed-columns list, so the read path consults col_stats. The col_stats records for the pre-evolution files carry stats in the bytes union-member; the post-evolution file carries stats in the string union-member. The crash appears to happen when the comparator/projection path retrieves a HeapByteBuffer (Avro's bytes representation) where it expects a Java byte[].
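
The type mismatch behind the exception can be shown without Spark or Hudi. The sketch below is illustrative only: `blindCast` mirrors the kind of cast that fails in the stack trace, and `toByteArray` is a hypothetical defensive conversion, not code from Hudi and not a proposed fix.

```scala
import java.nio.ByteBuffer

object ByteBufferCastSketch {
  // Assumes the value is already a byte[]; fails for any ByteBuffer.
  def blindCast(value: Any): Array[Byte] = value.asInstanceOf[Array[Byte]]

  // Hypothetical helper: accepts either representation of a bytes value.
  def toByteArray(value: Any): Array[Byte] = value match {
    case bytes: Array[Byte] => bytes
    case buffer: ByteBuffer =>
      // duplicate() so that reading does not move the caller's buffer position.
      val view = buffer.duplicate()
      val out = new Array[Byte](view.remaining())
      view.get(out)
      out
    case other =>
      throw new IllegalArgumentException(s"Unsupported bytes representation: $other")
  }

  def main(args: Array[String]): Unit = {
    // ByteBuffer.wrap returns a heap buffer, the same class named in the exception.
    val avroStyle: Any = ByteBuffer.wrap(Array[Byte](0x01, 0x02))
    println(toByteArray(avroStyle).mkString(","))
    try blindCast(avroStyle)
    catch { case e: ClassCastException => println(s"blind cast failed: ${e.getMessage}") }
  }
}
```

Whether the real fix belongs at the point where col_stats values are read from Avro or further along the projection path is a question for the bisect; the sketch only isolates why a `HeapByteBuffer` cannot stand in for a `byte[]`.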

Impact

No source code change. New test only. CI will show 2 failing cells (the data-skipping=true cells) until either:

the production code is fixed to handle the bytes→string promotion in the data-skipping path, OR
the test is removed because the documented expectation was misread.

Risk Level

None — test-only.

Documentation Update

If the resolution is (b) (expected behavior), the schema-evolution docs should note that the promotion matrix's bytes → string row has a data-skipping limitation, or that queries on a column that has gone through this promotion must set hoodie.enable.data.skipping=false.

If the resolution is (a) (bug), no docs change needed — fix the comparator path to handle the bytes-to-string union-member transition.
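
For resolution (b), the workaround of disabling data skipping could be applied per read rather than session-wide. A minimal sketch, assuming a Hudi Spark bundle on the classpath; the object and method names are hypothetical:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DataSkippingWorkaroundSketch {
  // Loads a Hudi table with data skipping turned off for this read only,
  // so col_stats are not consulted when pruning files.
  def readWithoutDataSkipping(spark: SparkSession, basePath: String): DataFrame =
    spark.read
      .format("hudi")
      .option("hoodie.enable.data.skipping", "false")
      .load(basePath)
}
```

This trades file pruning for correctness on the affected column, so it is a stopgap rather than a substitute for a fix.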

Related PRs in this series

This is the third "repro for community confirmation" PR on schema-evolution behavior:

test(spark): repro for reconcile.schema=true blocking documented type promotion #18806 — reconcile.schema=true blocks documented type promotions (int→long, int→double)
test(spark): codify MDT col_stats behavior on ADD COLUMN schema evolution #18807 — codifies MDT col_stats auto-extend behavior on ADD COLUMN
this PR — bytes → string promotion crashes data-skipping read

Contributor's checklist

Read through contributor's guide
Change Logs and Impact were stated clearly
Adequate tests were added if applicable (this PR IS the test; no production code change)
CI passed — EXPECTED to FAIL on the 2 dataSkipping=true cells; that's the repro this PR is opening for discussion

… read

The Hudi schema-evolution docs at https://hudi.apache.org/docs/schema_evolution/ list bytes -> string as a supported type promotion. Empirically, on Hudi 1.x master:

- The WRITE succeeds.
- Reading WITHOUT data skipping succeeds.
- Reading WITH data skipping enabled throws:
  java.lang.ClassCastException: class java.nio.HeapByteBuffer cannot be cast to class [B

This PR adds ONLY a test that documents the empirical behavior. It does NOT include a fix. The intent is for reviewers to confirm one of:

(a) the test correctly demonstrates a bug: bytes->string is documented as supported, so data-skipping queries should not crash after the promotion, OR
(b) the test setup is missing a config / usage detail and the crash is expected (in which case the docs should clarify the limitation).

The test parameterizes across (tableType x dataSkipping) for 4 cells:

- tableType in {COPY_ON_WRITE, MERGE_ON_READ}
- dataSkipping in {true, false}

Expected per docs: all 4 PASS. Observed on current master:

- 2 dataSkipping=true cells throw ClassCastException
- 2 dataSkipping=false cells PASS

The crash happens during data-skipping evaluation of a string predicate against a column whose MDT col_stats records carry mixed bytes + string union members (from the pre- and post-evolution batches). The comparator path appears to retrieve a HeapByteBuffer where it expects a Java byte[].

This is the third in a series of "repro PRs for community confirmation" on schema-evolution behavior; see also apache#18806 (reconcile.schema blocks documented type promotions) and apache#18807 (col_stats auto-extend behavior on ADD COLUMN).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codecov-commenter · 2026-05-22T03:38:14Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.91%. Comparing base (facb517) to head (9b3c86d).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff            @@
##             master   #18810   +/-   ##
=========================================
  Coverage     68.91%   68.91%           
- Complexity    29089    29091    +2     
=========================================
  Files          2509     2509           
  Lines        139470   139470           
  Branches      17114    17114           
=========================================
+ Hits          96115    96116    +1     
- Misses        35601    35602    +1     
+ Partials       7754     7752    -2

Flag	Coverage Δ
common-and-other-modules	`44.43% <ø> (+<0.01%)`	⬆️
hadoop-mr-java-client	`44.91% <ø> (+0.06%)`	⬆️
spark-client-hadoop-common	`48.23% <ø> (+<0.01%)`	⬆️
spark-java-tests	`49.29% <ø> (-0.07%)`	⬇️
spark-scala-tests	`45.26% <ø> (-0.02%)`	⬇️
utilities	`37.46% <ø> (+0.01%)`	⬆️

Flags with carried forward coverage won't be shown.
see 16 files with indirect coverage changes

hudi-bot · 2026-05-22T03:39:08Z

CI report:

9b3c86d Azure: FAILURE

hudi-agent

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR adds a parameterized repro test documenting a regression where bytes→string schema promotion causes a ClassCastException when data-skipping reads consult MDT col_stats. No correctness issues found. There are a couple of small style/readability suggestions in the inline comments below. Please take a look; after that, this should be ready for a Hudi committer or PMC member to take it from here.

cc @yihua

hudi-agent · 2026-05-22T04:58:42Z

+    // Per the Hudi schema-evolution docs, bytes → string is supported. The
+    // read should succeed regardless of data-skipping value and return the
+    // matching row.
+    spark.conf.set("hoodie.enable.data.skipping", dataSkippingEnabled.toString)


🤖 nit: could you use the typed constant for this key instead of the raw string? "hoodie.enable.data.skipping" is easy to silently drift if the key is ever renamed — something like HoodieMetadataConfig.ENABLE_DATA_SKIPPING.key() (or whichever constant owns it) would be safer.

AI-generated; verify before applying. React 👍/👎 to flag quality.

hudi-agent · 2026-05-22T04:58:42Z

+    // read should succeed regardless of data-skipping value and return the
+    // matching row.
+    spark.conf.set("hoodie.enable.data.skipping", dataSkippingEnabled.toString)
+    val df = spark.read.format("org.apache.hudi").load(basePath)


🤖 nit: df is only used on the very next line — have you considered collapsing these two into spark.read.format("org.apache.hudi").load(basePath).createOrReplaceTempView("t")?

AI-generated; verify before applying. React 👍/👎 to flag quality.

github-actions bot added the size:M label (PR with lines of changes in (100, 300]) on May 22, 2026

hudi-agent reviewed May 22, 2026

linliu-code wants to merge 1 commit into apache:master from linliu-code:repro-bytes-to-string-crashes-data-skipping
