[GLUTEN] Fallback to Spark Parquet reader for Spark 4.1 struct compatibility (#11914) #12190

Open: senthh wants to merge 1 commit into apache:main from senthh:11914_parq_fallback1

senthh commented May 29, 2026

What changes are proposed in this pull request?
Fixes #11914.

Spark 4.1 (SPARK-53535) introduced the spark.sql.legacy.parquet.returnNullStructIfAllFieldsMissing config to change how a struct is handled when all of its requested fields are missing from the Parquet file. When this config is false (the new default), Spark returns a non-null struct with null inner fields ({NULL}) instead of a null struct. Gluten's Velox native Parquet scan has not adapted to this behavior and returns an incorrect null struct, causing GlutenParquetIOSuite to fail on Spark 4.1.

This PR makes the Velox backend fall back to the vanilla Spark Parquet reader for the affected cases so results match Spark, rather than producing incorrect results via the native scan. Native support for this behavior can be added later as a follow-up.

Specifically:

In VeloxBackendSettings.validateFileFormat (Parquet path), add a guard shouldFallbackBySpark41ParquetStructBehavior that rejects the native scan (triggering fallback) when all of the following hold:
    running on Spark 4.1+ (SparkVersionUtil.gteSpark41),
    the read schema contains a struct type (recursively, including structs nested inside array/map), and
    spark.sql.legacy.parquet.returnNullStructIfAllFieldsMissing is false.
Re-enable the two previously excluded Spark 4.1 tests in gluten-ut/spark41 VeloxTestSettings (SPARK-53535 and vectorized reader: missing all struct fields), which now pass via fallback.
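
The three guard conditions listed above can be sketched roughly as follows. This is an illustrative sketch rather than the code in this PR: the object name, `containsStruct`, and `shouldFallback` are hypothetical, and the Spark version and config value are passed in as plain booleans instead of being read from SparkVersionUtil or SQLConf. Only the `org.apache.spark.sql.types` classes are real Spark API.

```scala
import org.apache.spark.sql.types._

// Illustrative sketch only; every name here except the Spark types is hypothetical.
object Spark41StructFallbackSketch {

  /** True if `dt` is a struct, or has a struct nested inside an array or map. */
  def containsStruct(dt: DataType): Boolean = dt match {
    case _: StructType => true
    case ArrayType(elementType, _) => containsStruct(elementType)
    case MapType(keyType, valueType, _) =>
      containsStruct(keyType) || containsStruct(valueType)
    case _ => false
  }

  /**
   * Combines the three conditions from the description. The read schema is
   * itself a StructType, so only its fields are inspected; checking the schema
   * object directly would make every scan match.
   */
  def shouldFallback(
      isSpark41OrLater: Boolean,
      returnNullStructIfAllFieldsMissing: Boolean,
      readSchema: StructType): Boolean =
    isSpark41OrLater &&
      !returnNullStructIfAllFieldsMissing &&
      readSchema.fields.exists(f => containsStruct(f.dataType))
}
```

Under these assumptions, a schema such as `id: int, xs: array<struct<a: int>>` triggers fallback on Spark 4.1 with the legacy flag off, while a flat `id: int` schema does not.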

Files changed:

backends-velox/.../velox/VeloxBackend.scala: fallback guard.
backends-velox/.../execution/FallbackSuite.scala: new fallback test.
gluten-ut/spark41/.../velox/VeloxTestSettings.scala: remove the two SPARK-53535-related excludes.

How was this patch tested?
Added a unit test in FallbackSuite ("fallback Spark 4.1 parquet missing all struct fields compatibility") that writes a Parquet file whose struct contains only field a, reads it back with a schema requesting only the missing field b under returnNullStructIfAllFieldsMissing=false, and asserts the plan contains no GlutenPlan (i.e. the query fell back to Spark entirely). The test is canceled on Spark versions below 4.1.
Re-enabled and verified the previously excluded GlutenParquetIOSuite tests on Spark 4.1: SPARK-53535 and vectorized reader: missing all struct fields.
Manually verified on spark-shell (Spark 4.1.2 + Velox bundle) that reading structs where requested fields are absent now returns a non-null struct with null inner fields ({NULL}, s.isNull = false), matching vanilla Spark, with the scan correctly falling back.
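
The manual check above (a non-null struct with null inner fields, rather than a null struct) can also be expressed programmatically. The sketch below is a hypothetical standalone check, not the FallbackSuite test from this PR: the object name and `isStructOfNulls` are made up for illustration, and it assumes a local Spark session in which the config key from the description is settable at runtime. It does not inspect the plan for Gluten operators.

```scala
import java.nio.file.Files

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

// Hypothetical helper for illustration; not part of the PR.
object MissingStructFieldsCheck {

  /** True when column `i` holds a non-null struct whose fields are all null, i.e. {NULL}. */
  def isStructOfNulls(row: Row, i: Int): Boolean =
    !row.isNullAt(i) && {
      val s = row.getStruct(i)
      (0 until s.length).forall(s.isNullAt)
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("missing-struct-fields")
      .getOrCreate()
    // Config key taken from the PR description (new default in Spark 4.1).
    spark.conf.set("spark.sql.legacy.parquet.returnNullStructIfAllFieldsMissing", "false")

    // Write a file whose struct has only field `a`; use a fresh, non-existent path.
    val path = Files.createTempDirectory("missing_col_test").resolve("data").toString
    spark.sql("SELECT 1 AS id, named_struct('a', 1) AS s").write.parquet(path)

    // Read it back requesting only the missing field `b`.
    val schemaB = new StructType()
      .add("id", IntegerType)
      .add("s", new StructType().add("b", IntegerType))
    val row = spark.read.schema(schemaB).parquet(path).head()

    println(s"s is a struct of nulls: ${isStructOfNulls(row, 1)}")
    spark.stop()
  }
}
```

The printed value depends on the Spark version and on the flag, so no expected output is given here. Per the description, Spark 4.1 with the flag set to false should yield a struct of nulls.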

Local Test Output:

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 4.1.2
      /_/
         
Using Scala version 2.13.17 (OpenJDK 64-Bit Server VM, Java 17.0.19)
Type in expressions to have them evaluated.
Type :help for more information.
W20260529 18:08:05.766484 328511 MemoryArbitrator.cpp:84] Query memory capacity[1.80GB] is set for NOOP arbitrator which has no capacity enforcement
26/05/29 18:08:07 WARN SparkShimProvider: Spark runtime version 4.1.2 is not matched with Gluten's fully tested version 4.1.1
Spark context Web UI available at http://node82.acceldata.ce:4040
Spark context available as 'sc' (master = local[*], app id = local-1780058285818).
Spark session available as 'spark'.

scala> import org.apache.spark.sql.types._
     | import org.apache.spark.sql.functions._
     | 
     | val df_a = sql("""SELECT 1 as id, named_struct('a', 1) AS s""")
     | val df_b = sql("""SELECT 2 as id, named_struct('b', 3) AS s""")
     | val df_ab = sql("""SELECT 2 as id, named_struct('a', 2, 'b', 3) AS s""")
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
val df_a: org.apache.spark.sql.DataFrame = [id: int, s: struct<a: int>]
val df_b: org.apache.spark.sql.DataFrame = [id: int, s: struct<b: int>]
val df_ab: org.apache.spark.sql.DataFrame = [id: int, s: struct<a: int, b: int>]

scala> val path = "/tmp/missing_col_test"
     | df_a.write.format("parquet").mode("overwrite").save(path)
26/05/29 18:08:33 WARN IndicatorVectorPool: There are still unreleased native columnar batches during ending the task. Will close them automatically however the batches should be better released manually to minimize memory pressure.
val path: String = /tmp/missing_col_test                                        

scala> // Schema A: matches written data → GOOD
     | spark.read.format("parquet").schema(df_a.schema).load(path).show()
     | 
     | // Schema B: struct has no matching fields → WRONG (returns NULL instead of {NULL})
     | spark.read.format("parquet").schema(df_b.schema).load(path).show()
     | 
     | // Schema AB: superset of written schema → GOOD
     | spark.read.format("parquet").schema(df_ab.schema).load(path).show()
     
+---+---+
| id|  s|
+---+---+
|  1|{1}|
+---+---+

26/05/29 18:08:43 WARN GlutenFallbackReporter: Validation failed for plan: Scan parquet [QueryId=2], due to: 
 - Spark 4.1 Parquet struct compatibility (all requested struct fields missing) is not supported by Velox native scan yet when spark.sql.legacy.parquet.returnNullStructIfAllFieldsMissing=false
26/05/29 18:08:43 WARN GlutenFallbackReporter: Validation failed for plan: ColumnarToRow[QueryId=2], due to: 
 - Spark 4.1 Parquet struct compatibility (all requested struct fields missing) is not supported by Velox native scan yet when spark.sql.legacy.parquet.returnNullStructIfAllFieldsMissing=false
26/05/29 18:08:43 WARN GlutenFallbackReporter: Validation failed for plan: Project[QueryId=2], due to: 
 - Native validation failed: 
   |- Validation failed due to exception caught at file:SubstraitToVeloxPlanValidator.cc line:1466 function:validate, thrown from file:ExprCompiler.cpp line:348 function:compileCall, reason:Scalar function to_pretty_string not registered with arguments: (ROW<b:INTEGER>). Found function registered with the following signatures:
((unknown) -> varchar)
((date) -> varchar)
((varbinary) -> varchar)
((boolean) -> varchar)
((double) -> varchar)
((decimal(i1,i5)) -> varchar)
((bigint) -> varchar)
((varchar) -> varchar)
((integer) -> varchar)
((timestamp) -> varchar)
((smallint) -> varchar)
((real) -> varchar)
((tinyint) -> varchar)
+---+------+
| id|     s|
+---+------+
|  1|{NULL}|
+---+------+

Was this patch authored or co-authored using generative AI tooling?
No

A couple of notes before you submit:

I attributed AI assistance per the ASF guidance since this fix was developed with Cursor. If you wrote it without AI tooling, change that line to No. Adjust the tool/version string to match what you actually used.
The guard is intentionally broad (any struct-containing schema, not strictly "all requested fields missing"), so it may over-fall-back. If reviewers push back, the description already frames native support as a follow-up — but you may want to mention that trade-off explicitly.

github-actions bot added the CORE (works for Gluten Core) and VELOX labels May 29, 2026

senthh commented May 29, 2026

@FelixYBW @baibaichen

I have raised PR #12190 to address issue #11914. Could you please review it?

senthh changed the title from "[VL] Fallback to Spark Parquet reader for Spark 4.1 struct compatibility (#11914)" to "[GLUTEN] Fallback to Spark Parquet reader for Spark 4.1 struct compatibility (#11914)" May 29, 2026
Development

Successfully merging this pull request may close these issues.

[VL] Spark 4.1: Support Parquet struct field compatibility improvements
