test(spark): repro for bytes→string promotion crashing data-skipping read #18810
linliu-code wants to merge 1 commit into
… read

The Hudi schema-evolution docs at https://hudi.apache.org/docs/schema_evolution/ list bytes -> string as a supported type promotion. Empirically, on Hudi 1.x master:
- The WRITE succeeds.
- Reading WITHOUT data skipping succeeds.
- Reading WITH data skipping enabled throws:
  java.lang.ClassCastException: class java.nio.HeapByteBuffer cannot be cast to class [B

This PR adds ONLY a test that documents the empirical behavior. It does NOT include a fix. The intent is for reviewers to confirm one of:
(a) the test correctly demonstrates a bug: bytes->string is documented as supported, so data-skipping queries should not crash after the promotion, OR
(b) the test setup is missing a config / usage detail and the crash is expected (in which case the docs should clarify the limitation).

The test parameterizes across (tableType x dataSkipping) for 4 cells:
- tableType in {COPY_ON_WRITE, MERGE_ON_READ}
- dataSkipping in {true, false}

Expected per docs: all 4 PASS.

Observed on current master:
- 2 dataSkipping=true cells throw ClassCastException
- 2 dataSkipping=false cells PASS

The crash happens during data-skipping evaluation of a string predicate against a column whose MDT col_stats records carry mixed bytes + string union members (from the pre- and post-evolution batches). The comparator path appears to retrieve a HeapByteBuffer where it expects a Java byte[].

This is the third in a series of "repro PRs for community confirmation" on schema-evolution behavior; see also apache#18806 (reconcile.schema blocks documented type promotions) and apache#18807 (col_stats auto-extend behavior on ADD COLUMN).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
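The cast failure quoted in the commit message can be reproduced with the JDK alone, which may help reviewers reason about where the comparator path goes wrong. The sketch below is not Hudi code: `BytesCastSketch`, `naiveCastFails`, and `toByteArray` are hypothetical names, and the assumption is only that the stats value arrives as a `java.nio.ByteBuffer` (the concrete class returned by `ByteBuffer.wrap` is `HeapByteBuffer`) while the caller casts it to `byte[]`.

```scala
import java.nio.ByteBuffer

object BytesCastSketch {
  // Mirrors the shape of the reported failure: an unchecked cast of a
  // bytes value to Array[Byte]. Returns true when the cast throws.
  def naiveCastFails(value: Any): Boolean =
    try {
      val arr = value.asInstanceOf[Array[Byte]]
      arr.length < 0 // always false; only reached when the cast succeeded
    } catch {
      case _: ClassCastException => true
    }

  // Hypothetical tolerant accessor: accepts either representation and
  // always hands back a plain byte array.
  def toByteArray(value: Any): Array[Byte] = value match {
    case arr: Array[Byte] => arr
    case buf: ByteBuffer =>
      val view = buf.duplicate() // leave the caller's position untouched
      val out = new Array[Byte](view.remaining())
      view.get(out)
      out
    case other =>
      throw new IllegalArgumentException(
        s"Unsupported bytes representation: ${other.getClass.getName}")
  }
}
```

Passing `ByteBuffer.wrap(Array[Byte](0x01, 0x02))` to `naiveCastFails` takes the `ClassCastException` branch, while `toByteArray` returns the two bytes for either input type. Whether the actual fix belongs in a helper like this or earlier in the col_stats decoding path is a question for the bisect.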
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## master #18810 +/- ##
=========================================
Coverage 68.91% 68.91%
- Complexity 29089 29091 +2
=========================================
Files 2509 2509
Lines 139470 139470
Branches 17114 17114
=========================================
+ Hits 96115 96116 +1
- Misses 35601 35602 +1
+ Partials 7754 7752 -2
hudi-agent left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for the contribution! This PR adds a parameterized repro test documenting a regression where bytes→string schema promotion causes a ClassCastException when data-skipping reads consult MDT col_stats. No correctness issues found. A couple of small style/readability suggestions are in the inline comments below. Please take a look; after that, this should be ready for a Hudi committer or PMC member to take it from here.
cc @yihua
// Per the Hudi schema-evolution docs, bytes → string is supported. The
// read should succeed regardless of data-skipping value and return the
// matching row.
spark.conf.set("hoodie.enable.data.skipping", dataSkippingEnabled.toString)
🤖 nit: could you use the typed constant for this key instead of the raw string? "hoodie.enable.data.skipping" is easy to silently drift if the key is ever renamed — something like HoodieMetadataConfig.ENABLE_DATA_SKIPPING.key() (or whichever constant owns it) would be safer.
- AI-generated; verify before applying. React 👍/👎 to flag quality.
// read should succeed regardless of data-skipping value and return the
// matching row.
spark.conf.set("hoodie.enable.data.skipping", dataSkippingEnabled.toString)
val df = spark.read.format("org.apache.hudi").load(basePath)
🤖 nit: df is only used on the very next line — have you considered collapsing these two into spark.read.format("org.apache.hudi").load(basePath).createOrReplaceTempView("t")?
- AI-generated; verify before applying. React 👍/👎 to flag quality.
Describe the issue this Pull Request addresses
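The Hudi schema-evolution docs at https://hudi.apache.org/docs/schema_evolution/ list bytes → string as a supported type promotion. Empirically on Hudi 1.x master:
- The write succeeds.
- Reading without data skipping succeeds.
- Reading with data skipping enabled throws java.lang.ClassCastException: class java.nio.HeapByteBuffer cannot be cast to class [B

Cross-version evidence: this is a regression introduced in 1.1.0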
I ran the same minimal repro (bytes write → string evolve → string-predicate read with data skipping enabled) against every Hudi bundle published to Maven Central between 0.15.0 and master HEAD:
The regression was introduced in the 1.1.0 release — between 1.0.2 (works) and 1.1.0 (crashes). The 1.0.2 → 1.1.0 commit window is the bisect target for finding the responsible change.
This re-frames the question: it's not just "is this a documented limitation or a bug?" — it's "this used to work for ~3 years, then 1.1.0 broke it." A bisect within the 1.0.2 → 1.1.0 window would identify the regressing commit.
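This PR contains only a test that documents the observed behavior. It does NOT include a fix. The intent is to ask reviewers to confirm one of:
(a) the test correctly demonstrates a bug: bytes → string is documented as supported, so data-skipping queries should not crash after the promotion, OR
(b) the test setup is missing a config / usage detail and the crash is expected (in which case the docs should clarify the limitation).

Summary and Changelog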
Adds one new test file: hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestBytesToStringPromotionDataSkipping.scala.

The test is testBytesToStringPromotionReadAfterEvolution, parameterized via @CsvSource across:
- tableType: COPY_ON_WRITE, MERGE_ON_READ
- dataSkippingEnabled: true, false

→ 4 cells total. Per the docs, all 4 should PASS.
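Observed on this branch (off latest master facb517ef957):
- COPY_ON_WRITE, dataSkippingEnabled=true: ClassCastException
- MERGE_ON_READ, dataSkippingEnabled=true: ClassCastException
- COPY_ON_WRITE, dataSkippingEnabled=false: PASS
- MERGE_ON_READ, dataSkippingEnabled=false: PASS

The crash reproduces consistently on both COW and MOR with data skipping enabled.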
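The 2 x 2 parameterization described above can be enumerated in a few lines, which makes it easy to see which cells are expected to fail. This is an illustrative sketch only: `PromotionTestMatrix` and `observedToFail` are hypothetical names, and the `tableType,dataSkippingEnabled` string layout is an assumption about how the CSV rows are written, not a copy of the test file.

```scala
object PromotionTestMatrix {
  val tableTypes: Seq[String] = Seq("COPY_ON_WRITE", "MERGE_ON_READ")
  val dataSkipping: Seq[Boolean] = Seq(true, false)

  // Cross product of the two axes, one "tableType,dataSkippingEnabled"
  // string per cell (assumed CSV layout).
  val cells: Seq[String] =
    for {
      tableType <- tableTypes
      skipping  <- dataSkipping
    } yield s"$tableType,$skipping"

  // Observed outcome per the PR description: only the cells with data
  // skipping enabled crash, regardless of table type.
  def observedToFail(cell: String): Boolean = cell.endsWith(",true")
}
```

Filtering `PromotionTestMatrix.cells` with `observedToFail` leaves the two data-skipping cells, one per table type, matching the observed results listed above.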
Test details
- Pre-evolution write: col_promote as BinaryType (arrays like Array[Byte](0x01, 0x02)).
- Post-evolution write: col_promote as StringType (values "zz_alpha", "zz_beta").
- Query: SELECT _row_key FROM t WHERE col_promote = 'zz_alpha'.

MDT col_stats is explicitly enabled and col_promote is included in the indexed-columns list, so the read path consults col_stats. The col_stats records for the pre-evolution files carry stats in the bytes union-member; the post-evolution file carries stats in the string union-member. The crash appears to happen when the comparator/projection path retrieves a HeapByteBuffer (Avro's bytes representation) where it expects a Java byte[].

Impact
No source code change. New test only. CI will show 2 failing cells (the data-skipping=true cells) until either:
Risk Level
None — test-only.
Documentation Update
If the resolution is (b) (expected behavior), the schema-evolution docs should note that the promotion matrix's bytes → string row has a data-skipping limitation, or that queries on a column that has gone through this promotion must set hoodie.enable.data.skipping=false.

If the resolution is (a) (bug), no docs change is needed; the fix is to make the comparator path handle the bytes-to-string union-member transition.
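For resolution (b), the workaround named above might look like the following configuration fragment. It assumes Spark and the Hudi Spark bundle are on the classpath; `readWithoutDataSkipping` is a hypothetical helper name, and passing the key as a per-read option (rather than through `spark.conf.set`, as the test does) is an assumption about how a user would scope the setting.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DataSkippingWorkaround {
  // Hypothetical helper: read a Hudi table with data skipping turned off
  // for this read only, so col_stats is not consulted for file pruning.
  def readWithoutDataSkipping(spark: SparkSession, basePath: String): DataFrame =
    spark.read
      .format("org.apache.hudi")
      .option("hoodie.enable.data.skipping", "false")
      .load(basePath)
}
```

Scoping the option to a single read keeps data skipping available for other tables in the same session, at the cost of full file listing for the affected table.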
Related PRs in this series
This is the third "repro for community confirmation" PR on schema-evolution behavior:
- apache#18806: reconcile.schema=true blocks documented type promotions (int→long, int→double)
- apache#18807: col_stats auto-extend behavior on ADD COLUMN
- This PR: bytes → string promotion crashes data-skipping read

Contributor's checklist