[SPARK-54916][ML] Fix Parquet footer error in DecisionTree test suites by caching RDDs by azmatsiddique · Pull Request #54941 · apache/spark

azmatsiddique · 2026-03-22T13:12:54Z

What changes were proposed in this pull request?
This PR fixes a CANNOT_READ_FILE_FOOTER error encountered in DecisionTreeClassifierSuite and DecisionTreeRegressorSuite when running with Scala 2.13 in Spark 4.2.0-preview3.

The fix explicitly calls .cache() on the RDDs initialized in beforeAll() (and in TreeTests.getTreeReadWriteData). The intent is for the underlying ArraySeq (introduced via .toImmutableArraySeq) to be materialized once and reused, rather than re-evaluated, when the test cases fit models, which involves writing temporary Parquet files.

Why are the changes needed?
Following the migration to Scala 2.13 collection conversions (specifically toImmutableArraySeq), certain ML test suites began failing with corrupted Parquet files.
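
For readers unfamiliar with the conversion mentioned above, the sketch below shows how an array-to-immutable-ArraySeq wrapper behaves in plain Scala 2.13. It assumes that .toImmutableArraySeq is a thin wrapper over ArraySeq.unsafeWrapArray; that is an assumption about Spark's internal helper, not something this PR states. The object and method names are hypothetical. Whether this behavior is related to the reported failure is not established here.

```scala
import scala.collection.immutable.ArraySeq

// Hypothetical helper object, for illustration only.
object WrapSketch {
  // Wraps the array without copying: the ArraySeq shares the backing array.
  def wrapped(xs: Array[Double]): ArraySeq[Double] =
    ArraySeq.unsafeWrapArray(xs)

  // Wraps a defensive copy: later writes to `xs` are not visible.
  def copied(xs: Array[Double]): ArraySeq[Double] =
    ArraySeq.unsafeWrapArray(xs.clone())

  def main(args: Array[String]): Unit = {
    val source = Array(1.0, 2.0, 3.0)
    val shared = wrapped(source)
    val isolated = copied(source)

    // Mutate the original array after wrapping.
    source(0) = 9.0

    println(s"shared head:   ${shared.head}")   // reflects the mutation
    println(s"isolated head: ${isolated.head}") // still the original value
  }
}
```

The wrapper is "immutable" only through its own API; code that still holds the original array can change what the sequence contains.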

The root cause appears to be a race condition or serialization issue where the lazy evaluation of the ImmutableArraySeq coincided with Spark's background Parquet writing during DecisionTree.fit. By caching the RDDs immediately after creation in beforeAll, we ensure the data is fully materialized and stable before any Parquet write operations begin.
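
As a rough illustration of the caching pattern described above, the sketch below builds an RDD from an immutable ArraySeq, marks it with .cache(), and then runs an action so that the cache is actually populated. It is not the PR's diff: the object name, the sample data, and the local SparkSession are assumptions for illustration, and it uses ArraySeq.unsafeWrapArray in place of Spark's internal .toImmutableArraySeq helper. Note that .cache() on its own only marks the RDD for persistence; the partitions are computed and stored when the first action runs.

```scala
import org.apache.spark.sql.SparkSession
import scala.collection.immutable.ArraySeq

// Hypothetical stand-alone sketch; a real suite would do this in beforeAll().
object CacheBeforeFitSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("cache-before-fit-sketch")
      .getOrCreate()
    try {
      val sc = spark.sparkContext

      // Made-up (label, feature) pairs standing in for the suite's test data.
      val points: ArraySeq[(Double, Double)] =
        ArraySeq.unsafeWrapArray(Array((0.0, 1.0), (1.0, 2.0), (0.0, 3.0)))

      // cache() marks the RDD for in-memory persistence; nothing runs yet.
      val rdd = sc.parallelize(points, numSlices = 2).cache()

      // An action triggers the job that computes and stores the partitions.
      val n = rdd.count()

      assert(n == points.length)
      assert(rdd.getStorageLevel.useMemory)
    } finally {
      spark.stop()
    }
  }
}
```

In a suite's beforeAll(), the same pattern would apply to the RDD fields that the test cases share. Whether an explicit action such as count() is needed depends on when the first job would otherwise run.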

Does this PR introduce any user-facing change?
No. This is a fix for internal test suites only.

How was this patch tested?
Verified locally by running the affected test suite:

build/sbt "mllib/testOnly org.apache.spark.ml.classification.DecisionTreeClassifierSuite"

The tests now pass consistently without Parquet footer errors.

Was this patch authored or co-authored using generative AI tooling?
No

…n last CSV column

### What changes were proposed in this pull request?
This PR fixes an issue where the CSV reader inconsistently parses empty quoted strings (`""`) when the `escape` option is set to an empty string (`""`). Previously, mid-line empty quoted strings correctly resolved to null/empty, but the last column resolved to a literal `"` character due to univocity parser behavior.

### Why are the changes needed?
To ensure consistent parsing of CSV data regardless of column position.

### Does this PR introduce _any_ user-facing change?
Yes, it fixes a bug where users were receiving incorrect data (a literal quote instead of an empty/null value) for the last column in a row under specific CSV configurations.

### How was this patch tested?
Added a new regression test in `CSVSuite` that verifies consistent parsing of both mid-line and end-of-line empty quoted fields.

azmatsiddique mentioned this pull request Mar 22, 2026

DecisionTreeClassifierSuite fails in Spark 4.2.0-preview3 (Scala 2.13) with corrupted Parquet file error #54916

Open

azmatsiddique added 5 commits March 22, 2026 22:38

[SPARK-55968][SQL] Do not treat vectorized reader capacity overflow a…

762186c

…s corrupted file

Trigger Github Actions

14797e9

[SPARK-55559][SQL] Fix BIT_COUNT for negative tinyint/smallint/int

2096f5b

[SPARK-54916][ML] Fix Parquet footer error in DecisionTree test suites

2ca1250

azmatsiddique force-pushed the fix/spark-54916-parquet-footer branch from dcf3825 to 2ca1250 Compare March 22, 2026 17:10

azmatsiddique wants to merge 5 commits into apache:master from azmatsiddique:fix/spark-54916-parquet-footer
