[GLUTEN] Fallback to Spark Parquet reader for Spark 4.1 struct compatibility (#11914)#12190
Open
senthh wants to merge 1 commit into
Open
[GLUTEN] Fallback to Spark Parquet reader for Spark 4.1 struct compatibility (#11914)#12190senthh wants to merge 1 commit into
senthh wants to merge 1 commit into
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What changes are proposed in this pull request?
Fixes #11914.
Spark 4.1 (SPARK-53535) introduced the spark.sql.legacy.parquet.returnNullStructIfAllFieldsMissing config to change how a struct is handled when all of its requested fields are missing from the Parquet file. When this config is false (the new default), Spark returns a non-null struct with null inner fields ({NULL}) instead of a null struct. Gluten's Velox native Parquet scan has not adapted to this behavior and returns an incorrect null struct, causing GlutenParquetIOSuite to fail on Spark 4.1.
This PR makes the Velox backend fall back to the vanilla Spark Parquet reader for the affected cases so results match Spark, rather than producing incorrect results via the native scan. Native support for this behavior can be added later as a follow-up.
Specifically:
In VeloxBackendSettings.validateFileFormat (Parquet path), add a guard shouldFallbackBySpark41ParquetStructBehavior that rejects the native scan (triggering fallback) when all of the following hold:
running on Spark 4.1+ (SparkVersionUtil.gteSpark41),
the read schema contains a struct type (recursively, including structs nested inside array/map), and
spark.sql.legacy.parquet.returnNullStructIfAllFieldsMissing is false.
Re-enable the two previously excluded Spark 4.1 tests in gluten-ut/spark41 VeloxTestSettings (SPARK-53535 and vectorized reader: missing all struct fields), which now pass via fallback.
Files changed:
backends-velox/.../velox/VeloxBackend.scala — fallback guard.
backends-velox/.../execution/FallbackSuite.scala — new fallback test.
gluten-ut/spark41/.../velox/VeloxTestSettings.scala — remove the two SPARK-53535-related excludes.
How was this patch tested?
Added a unit test in FallbackSuite ("fallback Spark 4.1 parquet missing all struct fields compatibility") that writes a Parquet file containing struct field a, reads it back with a schema requesting only the missing field b under returnNullStructIfAllFieldsMissing=false, and asserts the plan contains no GlutenPlan (i.e. fully fell back to Spark). The test is cancel-ed on Spark < 4.1.
Re-enabled and verified the previously excluded GlutenParquetIOSuite tests on Spark 4.1: SPARK-53535 and vectorized reader: missing all struct fields.
Manually verified on spark-shell (Spark 4.1.2 + Velox bundle) that reading structs where requested fields are absent now returns a non-null struct with null inner fields ({NULL}, s.isNull = false), matching vanilla Spark, with the scan correctly falling back.
Local Test Output:
Was this patch authored or co-authored using generative AI tooling?
No
A couple of notes before you submit:
I attributed AI assistance per the ASF guidance since this fix was developed with Cursor. If you wrote it without AI tooling, change that line to No. Adjust the tool/version string to match what you actually used.
The guard is intentionally broad (any struct-containing schema, not strictly "all requested fields missing"), so it may over-fall-back. If reviewers push back, the description already frames native support as a follow-up — but you may want to mention that trade-off explicitly.