Conversation

@ZiyaZa ZiyaZa commented Oct 9, 2025

What changes were proposed in this pull request?

Currently, if all fields of a struct mentioned in the read schema are missing in a Parquet file, the reader populates the struct with nulls.

This PR modifies the scan behavior so that if the struct exists in the Parquet schema but none of the fields from the read schema are present in it, we instead pick an arbitrary field of that struct from the Parquet file to read and use its definition levels to populate NULLs (as well as outer NULLs and array sizes when the struct is nested inside another nested type).

This is done by changing the schema requested by the readers: we add an additional field to the requested schema when clipping the Parquet file schema according to the Spark schema. This means the readers actually read and return more data than requested. That is only an issue for VectorizedParquetRecordReader, since the other read code path, via parquet-mr, already has an UnsafeProjection in ParquetFileFormat that outputs only the requested schema fields.
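
As a rough illustration of the clipping step (plain Python dicts standing in for the Parquet and Spark schemas; clip_struct and pick_fallback are hypothetical names, not the actual Spark internals): when none of a struct's requested fields exist in the file, one extra file field is kept so the reader can still derive the struct's nullability.

# Minimal sketch, assuming dict-shaped schemas at a single struct level.
def clip_struct(file_fields, requested_fields, pick_fallback):
    # Keep only the file fields that the read schema actually asks for.
    clipped = {name: t for name, t in file_fields.items() if name in requested_fields}
    if not clipped and file_fields:
        # No requested field exists in the file: read one arbitrary file field
        # purely to recover definition levels for the struct itself.
        fallback = pick_fallback(file_fields)
        clipped[fallback] = file_fields[fallback]
    return clipped

file_schema = {"b": "int32", "c": "binary"}   # fields physically present in the file
read_schema = {"a": "int32"}                  # none of these exist in the file
print(clip_struct(file_schema, read_schema, lambda fs: min(fs)))  # {'b': 'int32'}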

To ensure VectorizedParquetRecordReader returns only the fields Spark requested, we create the ColumnarBatch with vectors that match the requested schema (we get rid of the additional fields by recursively matching sparkSchema with sparkRequestedSchema and ensuring structs have the same length in both). The ParquetColumnVectors are then responsible for allocating dummy vectors that hold the extra data temporarily while reading, but these are not exposed to the outside.
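
A minimal sketch of that pruning idea, using nested dicts in place of Spark's schema and column-vector classes (visible_vectors is a hypothetical helper, not the actual code): walk the requested schema and expose only the vectors it mentions, so dummy fields read from the file never leave the batch.

def visible_vectors(requested_schema, vectors):
    # requested_schema: {name: None for a scalar, nested dict for a struct}
    # vectors: all vectors produced by the reader, keyed the same way,
    #          possibly containing extra entries for dummy fields
    out = {}
    for name, child in requested_schema.items():
        if isinstance(child, dict):
            out[name] = visible_vectors(child, vectors[name])
        else:
            out[name] = vectors[name]
    return out

read_vectors = {"id": [1], "s": {"a": [None], "b": [3]}}   # "b" read only as a fallback
requested    = {"id": None, "s": {"a": None}}
print(visible_vectors(requested, read_vectors))            # {'id': [1], 's': {'a': [None]}}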

The heuristic for picking the arbitrary field is as follows: we pick a field at the lowest array nesting level (i.e., any scalar field is preferred over an array, which is preferred over an array<array>), and among scalars we prefer narrower fields over wider ones, which in turn are preferred over strings.
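
A toy sketch of this selection heuristic in Python; the field tuples and the cost table below are illustrative assumptions, not Spark's actual implementation.

# Approximate per-value "cost" of reading each scalar type; lower is preferred,
# strings/binary ranked last.
SCALAR_COST = {"boolean": 1, "int32": 4, "float": 4, "int64": 8, "double": 8,
               "binary": 100}

def field_cost(field):
    # field: (name, scalar_type, array_nesting_depth)
    _, scalar_type, array_depth = field
    return (array_depth, SCALAR_COST.get(scalar_type, 100))

candidates = [("tags", "binary", 1),   # array<string>
              ("name", "binary", 0),   # string
              ("age", "int32", 0)]     # int
print(min(candidates, key=field_cost)) # ('age', 'int32', 0)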

Why are the changes needed?

This is a bug fix: depending on which fields were requested, we incorrectly treated non-null struct values as missing from the file and returned NULLs for them.

Does this PR introduce any user-facing change?

Yes. We previously assumed a struct to be null if all the fields we were trying to read from a Parquet file were missing from that file, even if the file contained other fields of that struct whose definition levels could be used. See the example from the Jira ticket below:

df_a = spark.sql('SELECT 1 as id, named_struct("a", 1) AS s')
path = "/tmp/missing_col_test"
df_a.write.format("parquet").save(path)

df_b = spark.sql('SELECT 2 as id, named_struct("b", 3) AS s')
spark.read.format("parquet").schema(df_b.schema).load(path).show()

This used to return:

+---+----+
| id|   s|
+---+----+
|  1|NULL|
+---+----+

It now returns:

+---+------+
| id|     s|
+---+------+
|  1|{NULL}|
+---+------+

How was this patch tested?

Added new unit tests and fixed an old test to expect the new behavior.

Was this patch authored or co-authored using generative AI tooling?

No.

@ZiyaZa ZiyaZa changed the title [WIP][SPARK-53535][SQL] Fix missing structs always being assumed as nulls [SPARK-53535][SQL] Fix missing structs always being assumed as nulls Oct 9, 2025
@Kimahriman
Contributor

Wow, this has been a problem for us for so long; especially when you read a non-nullable struct, this actually throws an NPE instead of just giving you the wrong data. Thanks for the fix!
