test(spark): repro for bytes→string promotion crashing data-skipping read #18810

Open
linliu-code wants to merge 1 commit into apache:master from linliu-code:repro-bytes-to-string-crashes-data-skipping

Collaborator

@linliu-code linliu-code commented May 22, 2026

Describe the issue this Pull Request addresses

The Hudi schema-evolution docs at https://hudi.apache.org/docs/schema_evolution/ list bytes → string as a supported type promotion. Empirically on Hudi 1.x master:

  • ✅ The WRITE succeeds (initial bytes batch + evolved string batch both commit).
  • ✅ Reading WITHOUT data skipping succeeds and returns the matching row.
  • ❌ Reading WITH data skipping (the default) throws:
    java.lang.ClassCastException:
      class java.nio.HeapByteBuffer cannot be cast to class [B
      (java.nio.HeapByteBuffer and [B are in module java.base of loader 'bootstrap')
      at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getBinary(rows.scala:46)
      at org.apache.spark.sql.execution.LocalTableScanExec.$anonfun$unsafeRows$1(LocalTableScanExec.scala:44)
      ...
    

Cross-version evidence: this is a regression introduced in 1.1.0

I ran the same minimal repro (bytes write → string evolve → string-predicate read with data skipping enabled) against every Hudi bundle published to Maven Central between 0.15.0 and master HEAD:

| Version | DS=enabled | DS=disabled |
| --- | --- | --- |
| 0.15.0 | ✅ PASS | ✅ PASS |
| 0.15.1-rc1 | ✅ PASS | ✅ PASS |
| 1.0.0 | ✅ PASS | ✅ PASS |
| 1.0.1 | ✅ PASS | ✅ PASS |
| 1.0.2 | ✅ PASS | ✅ PASS |
| 1.1.0 | ❌ CRASH | ✅ PASS |
| 1.1.1 | ❌ CRASH | ✅ PASS |
| master HEAD (1.3-SNAPSHOT) | ❌ CRASH | ✅ PASS |

The regression was introduced in the 1.1.0 release — between 1.0.2 (works) and 1.1.0 (crashes). The 1.0.2 → 1.1.0 commit window is the bisect target for finding the responsible change.

This reframes the question. It is not just "is this a documented limitation or a bug?" but "this used to work for ~3 years, then 1.1.0 broke it."

This PR contains only a test that documents the observed behavior. It does NOT include a fix. The intent is to ask reviewers to confirm:

  • (a) the test correctly demonstrates a bug — bytes → string is documented as supported, so data-skipping queries should not crash after the promotion, OR
  • (b) the test setup is missing a config or usage detail that the empirical crash depends on (in which case the docs probably need a clarification on this corner of the promotion matrix).

Summary and Changelog

Adds one new test file: hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestBytesToStringPromotionDataSkipping.scala.

The test is testBytesToStringPromotionReadAfterEvolution, parameterized via @CsvSource across:

| Dimension | Values |
| --- | --- |
| tableType | COPY_ON_WRITE, MERGE_ON_READ |
| dataSkippingEnabled | true, false |

→ 4 cells total. Per the docs, all 4 should PASS.
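The parameter matrix above can be sketched as a plain cross product. This is an illustration only: `PromotionTestMatrix` and `csvRows` are hypothetical names, not part of the PR, and the actual test declares its rows through `@CsvSource` rather than generating them.

```scala
// Hypothetical helper (not the PR's code): enumerates the
// tableType x dataSkippingEnabled matrix described above.
object PromotionTestMatrix {
  val tableTypes: Seq[String] = Seq("COPY_ON_WRITE", "MERGE_ON_READ")
  val dataSkippingValues: Seq[Boolean] = Seq(true, false)

  // One "tableType,dataSkippingEnabled" row per cell, in the shape a
  // CSV-style parameter source would take.
  val csvRows: Seq[String] =
    for {
      tableType <- tableTypes
      dataSkipping <- dataSkippingValues
    } yield s"$tableType,$dataSkipping"
}
```

Each of the four rows corresponds to one invocation of the parameterized test.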

Observed on this branch (off latest master facb517ef957):

Tests run: 4, Failures: 0, Errors: 2, Skipped: 0

| Cell | Result |
| --- | --- |
| (COW, dataSkipping=true) | ❌ ERROR (ClassCastException) |
| (COW, dataSkipping=false) | ✅ PASS |
| (MOR, dataSkipping=true) | ❌ ERROR (ClassCastException) |
| (MOR, dataSkipping=false) | ✅ PASS |

The crash reproduces consistently on both COW and MOR with data skipping enabled.

Test details

  1. Write a 3-row initial batch with col_promote as BinaryType (arrays like Array[Byte](0x01, 0x02)).
  2. Write a 2-row evolved batch with col_promote as StringType (values "zz_alpha", "zz_beta").
  3. Read with a string predicate: SELECT _row_key FROM t WHERE col_promote = 'zz_alpha'.
  4. Expected: 1 matching row returned.

MDT col_stats is explicitly enabled and col_promote is included in the indexed-columns list, so the read path consults col_stats. The col_stats records for the pre-evolution files carry stats in the bytes union-member; the post-evolution file carries stats in the string union-member. The crash appears to happen when the comparator/projection path retrieves a HeapByteBuffer (Avro's bytes representation) where it expects a Java byte[].
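The type mismatch described above can be reproduced in isolation on the JVM. The sketch below is not Hudi or Spark code: `ColStatValueSketch`, `unsafeAsBytes`, and `toBytes` are hypothetical names used only to show why an unchecked cast from a `ByteBuffer` to `byte[]` fails, and what a normalizing conversion could look like. Where such a conversion would belong in the data-skipping path is exactly what this PR leaves open.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets

// Hypothetical illustration (not Hudi's code path).
object ColStatValueSketch {
  // Mirrors the failing pattern: an unchecked cast to byte[].
  // Throws ClassCastException when `value` is a ByteBuffer.
  def unsafeAsBytes(value: Any): Array[Byte] =
    value.asInstanceOf[Array[Byte]]

  // Normalizes the representations a stat value might arrive in.
  def toBytes(value: Any): Array[Byte] = value match {
    case bytes: Array[Byte] => bytes
    case buf: ByteBuffer =>
      // duplicate() keeps the caller's buffer position untouched.
      val copy = new Array[Byte](buf.remaining())
      buf.duplicate().get(copy)
      copy
    case s: String => s.getBytes(StandardCharsets.UTF_8)
    case other =>
      throw new IllegalArgumentException(s"Unsupported stat value: $other")
  }
}
```

Calling `unsafeAsBytes(ByteBuffer.wrap(Array[Byte](0x01, 0x02)))` raises the same kind of `ClassCastException` as in the stack trace, while `toBytes` returns the two bytes.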

Impact

No source code change. New test only. CI will show 2 failing cells (the data-skipping=true cells) until either:

  • the production code is fixed to handle the bytes→string promotion in the data-skipping path, OR
  • the test is removed because the documented expectation was misread.

Risk Level

None — test-only.

Documentation Update

If the resolution is (b) (expected behavior), the schema-evolution docs should note that the promotion matrix's bytes → string row has a data-skipping limitation, or that queries on a column that has gone through this promotion must set hoodie.enable.data.skipping=false.

If the resolution is (a) (bug), no docs change needed — fix the comparator path to handle the bytes-to-string union-member transition.
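For reference, the workaround mentioned under resolution (b) would look roughly like the fragment below. It assumes an existing SparkSession named `spark` and a table path `basePath` (both placeholders), and reuses the config key, view name, and query from the test description. It trades away file pruning for the affected query, so it is a stopgap rather than a fix.

```scala
// Assumes `spark` (SparkSession) and `basePath` (String) already exist.
// Disables data skipping so the read does not consult col_stats.
spark.conf.set("hoodie.enable.data.skipping", "false")

spark.read.format("org.apache.hudi").load(basePath).createOrReplaceTempView("t")
spark.sql("SELECT _row_key FROM t WHERE col_promote = 'zz_alpha'").show()
```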

Related PRs in this series

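This is the third "repro for community confirmation" PR on schema-evolution behavior. See also:

  • apache#18806 (reconcile.schema blocks documented type promotions)
  • apache#18807 (col_stats auto-extend behavior on ADD COLUMN)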

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable (this PR IS the test; no production code change)
  • CI passed — EXPECTED to FAIL on the 2 dataSkipping=true cells; that's the repro this PR is opening for discussion

… read

The Hudi schema-evolution docs at https://hudi.apache.org/docs/schema_evolution/
list bytes -> string as a supported type promotion. Empirically, on Hudi 1.x
master:

  - The WRITE succeeds.
  - Reading WITHOUT data skipping succeeds.
  - Reading WITH data skipping enabled throws:
      java.lang.ClassCastException:
        class java.nio.HeapByteBuffer cannot be cast to class [B

This PR adds ONLY a test that documents the empirical behavior. It does NOT
include a fix. The intent is for reviewers to confirm one of:

  (a) the test correctly demonstrates a bug — bytes->string is documented as
      supported, so data-skipping queries should not crash after the promotion,
      OR
  (b) the test setup is missing a config / usage detail and the crash is
      expected (in which case the docs should clarify the limitation).

The test parameterizes across (tableType x dataSkipping) for 4 cells:
  - tableType  in {COPY_ON_WRITE, MERGE_ON_READ}
  - dataSkipping in {true, false}

Expected per docs: all 4 PASS.
Observed on current master:
  - 2 dataSkipping=true  cells throw ClassCastException
  - 2 dataSkipping=false cells PASS

The crash happens during data-skipping evaluation of a string predicate
against a column whose MDT col_stats records carry mixed bytes + string union
members (from the pre- and post-evolution batches). The comparator path
appears to retrieve a HeapByteBuffer where it expects a Java byte[].

This is the third in a series of "repro PRs for community confirmation" on
schema-evolution behavior — see also apache#18806 (reconcile.schema
blocks documented type promotions) and apache#18807 (col_stats
auto-extend behavior on ADD COLUMN).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the size:M PR with lines of changes in (100, 300] label May 22, 2026
@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.91%. Comparing base (facb517) to head (9b3c86d).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##             master   #18810   +/-   ##
=========================================
  Coverage     68.91%   68.91%           
- Complexity    29089    29091    +2     
=========================================
  Files          2509     2509           
  Lines        139470   139470           
  Branches      17114    17114           
=========================================
+ Hits          96115    96116    +1     
- Misses        35601    35602    +1     
+ Partials       7754     7752    -2     
| Flag | Coverage Δ | |
| --- | --- | --- |
| common-and-other-modules | 44.43% <ø> (+<0.01%) | ⬆️ |
| hadoop-mr-java-client | 44.91% <ø> (+0.06%) | ⬆️ |
| spark-client-hadoop-common | 48.23% <ø> (+<0.01%) | ⬆️ |
| spark-java-tests | 49.29% <ø> (-0.07%) | ⬇️ |
| spark-scala-tests | 45.26% <ø> (-0.02%) | ⬇️ |
| utilities | 37.46% <ø> (+0.01%) | ⬆️ |

see 16 files with indirect coverage changes


@hudi-bot
Collaborator

CI report:

Bot commands

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

Contributor

@hudi-agent hudi-agent left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR adds a parameterized repro test documenting a regression where bytes→string schema promotion causes a ClassCastException when data-skipping reads consult MDT col_stats. No correctness issues found; there are a couple of small style/readability suggestions in the inline comments below. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here.

cc @yihua

// Per the Hudi schema-evolution docs, bytes → string is supported. The
// read should succeed regardless of data-skipping value and return the
// matching row.
spark.conf.set("hoodie.enable.data.skipping", dataSkippingEnabled.toString)
Contributor

🤖 nit: could you use the typed constant for this key instead of the raw string? "hoodie.enable.data.skipping" is easy to silently drift if the key is ever renamed — something like HoodieMetadataConfig.ENABLE_DATA_SKIPPING.key() (or whichever constant owns it) would be safer.

- AI-generated; verify before applying. React 👍/👎 to flag quality.

// read should succeed regardless of data-skipping value and return the
// matching row.
spark.conf.set("hoodie.enable.data.skipping", dataSkippingEnabled.toString)
val df = spark.read.format("org.apache.hudi").load(basePath)
Contributor

🤖 nit: df is only used on the very next line — have you considered collapsing these two into spark.read.format("org.apache.hudi").load(basePath).createOrReplaceTempView("t")?

- AI-generated; verify before applying. React 👍/👎 to flag quality.

Labels

size:M PR with lines of changes in (100, 300]

4 participants