[SPARK-53535][SQL] Fix missing structs always being assumed as nulls #52557
What changes were proposed in this pull request?
Currently, if all fields of a struct mentioned in the read schema are missing in a Parquet file, the reader populates the struct with nulls.
This PR modifies the scan behavior: if the struct exists in the Parquet schema but none of the fields from the read schema are present in the file, we instead pick an arbitrary field from the Parquet file to read and use it to populate nulls (as well as outer nulls and array sizes when the struct is nested inside another nested type).
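For illustration only (hypothetical paths and data, not the example from the Jira ticket or the PR tests), a round trip that exercises the changed behavior could look like this:

```scala
// Illustration only: hypothetical file layout, not taken from the PR tests.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local[1]").getOrCreate()
import spark.implicits._

val path = "/tmp/spark-53535-example"

// The Parquet file stores s = struct<a: int, b: int>.
Seq((1, 2)).toDF("a", "b")
  .selectExpr("named_struct('a', a, 'b', b) AS s")
  .write.mode("overwrite").parquet(path)

// The read schema asks only for s.c, which does not exist in the file.
val readSchema = new StructType()
  .add("s", new StructType().add("c", IntegerType))

// Before this fix: s was assumed missing and returned as NULL.
// After this fix: the reader picks an existing field (e.g. s.a) to recover
// definition levels, so s comes back non-null with c = null.
spark.read.schema(readSchema).parquet(path).show()
```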
This is done by changing the schema requested by the readers. We add an additional field to the requested schema when clipping the Parquet file schema according to the Spark schema. This means that the readers actually read and return more data than requested, which can cause problems. This is only a problem for the VectorizedParquetRecordReader, since for the other read code path via parquet-mr we already have an UnsafeProjection in ParquetFileFormat for outputting only the requested schema fields.

To ensure the VectorizedParquetRecordReader only returns the fields Spark requested, we create the ColumnarBatch with vectors that match the requested schema (we get rid of the additional fields by recursively matching sparkSchema with sparkRequestedSchema and ensuring structs have the same length in both). The ParquetColumnVectors are then responsible for allocating dummy vectors to hold the data temporarily while reading, but these are not exposed to the outside.

The heuristic to pick the arbitrary field is as follows: we pick one at the lowest array nesting level (i.e., any scalar field is preferred to array, which is preferred to array<array>), and we prefer narrower scalar fields over wider scalar fields, which in turn are preferred over strings.
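The following is only a rough sketch of that preference order as described above, not the actual helper added in the PR; names such as pickFallbackField, arrayDepth, and scalarWidth are hypothetical:

```scala
// Sketch of the selection heuristic: lowest array nesting level first,
// then narrower scalar types, with strings/binaries ranked last.
import org.apache.spark.sql.types._

// How many array levels wrap the field's type.
def arrayDepth(dt: DataType): Int = dt match {
  case ArrayType(elem, _) => 1 + arrayDepth(elem)
  case _                  => 0
}

// The innermost element type once array wrappers are stripped.
def innermost(dt: DataType): DataType = dt match {
  case ArrayType(elem, _) => innermost(elem)
  case other              => other
}

// Rough "width" ranking; strings and binaries sort after fixed-width scalars.
def scalarWidth(dt: DataType): Int = dt match {
  case BooleanType | ByteType  => 1
  case ShortType               => 2
  case IntegerType | FloatType => 4
  case LongType | DoubleType   => 8
  case StringType | BinaryType => Int.MaxValue
  case _                       => 16
}

def pickFallbackField(candidates: Seq[StructField]): StructField =
  candidates.minBy(f => (arrayDepth(f.dataType), scalarWidth(innermost(f.dataType))))
```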
Why are the changes needed?
This is a bug fix: depending on which fields were requested, we incorrectly assumed non-null struct values to be missing from the file and returned them as nulls.
Does this PR introduce any user-facing change?
Yes. We previously assumed structs to be null if all the fields we were trying to read from a Parquet file were missing from that file, even if the file contained other fields that definition levels could be taken from. In the example from the Jira ticket, the struct used to be returned as null; it is now returned as a non-null struct, with nulls only for the requested fields that are genuinely absent from the file.
How was this patch tested?
Added new unit tests and updated an existing test to expect the new behavior.
Was this patch authored or co-authored using generative AI tooling?
No.