[SPARK-53535][SQL] Fix missing structs always being assumed as nulls #52557
What changes were proposed in this pull request?
Currently, if all fields of a struct mentioned in the read schema are missing in a Parquet file, the reader populates the struct with nulls.
This PR modifies the scan behavior: if the struct exists in the Parquet schema but none of the fields from the read schema are present in the file, we instead pick an arbitrary field from the Parquet file to read and use it to populate nulls (as well as outer nulls and array sizes when the struct is nested inside another nested type).
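For illustration only (hypothetical paths and data, not the example from the Jira ticket or the PR tests), a round trip that exercises the changed behavior could look like this:

```scala
// Illustration only: hypothetical file layout, not taken from the PR tests.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local[1]").getOrCreate()
import spark.implicits._

val path = "/tmp/spark-53535-example"

// The Parquet file stores s = struct<a: int, b: int>.
Seq((1, 2)).toDF("a", "b")
  .selectExpr("named_struct('a', a, 'b', b) AS s")
  .write.mode("overwrite").parquet(path)

// The read schema asks only for s.c, which does not exist in the file.
val readSchema = new StructType()
  .add("s", new StructType().add("c", IntegerType))

// Before this fix: s was assumed missing and returned as NULL.
// After this fix: the reader picks an existing field (e.g. s.a) to recover
// definition levels, so s comes back non-null with c = null.
spark.read.schema(readSchema).parquet(path).show()
```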
This is done by changing the schema requested by the readers. We add an additional field to the requested schema when clipping the Parquet file schema according to the Spark schema. This means that the readers actually read and return more data than requested, which can cause problems. This is only a problem for the VectorizedParquetRecordReader, since for the other read code path via parquet-mr we already have an UnsafeProjection in ParquetFileFormat for outputting only the requested schema fields.

To ensure the VectorizedParquetRecordReader only returns the fields Spark requested, we create the ColumnarBatch with vectors that match the requested schema (we get rid of the additional fields by recursively matching sparkSchema with sparkRequestedSchema and ensuring structs have the same length in both). The ParquetColumnVectors are then responsible for allocating dummy vectors to hold the data temporarily while reading, but these are not exposed to the outside.

The heuristic to pick the arbitrary field is as follows: we pick one at the lowest array nesting level (i.e., any scalar field is preferred to array, which is preferred to array<array>), and we prefer narrower scalar fields over wider scalar fields, which in turn are preferred over strings.
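The following is only a rough sketch of that preference order as described above, not the actual helper added in the PR; names such as pickFallbackField, arrayDepth, and scalarWidth are hypothetical:

```scala
// Sketch of the selection heuristic: lowest array nesting level first,
// then narrower scalar types, with strings/binaries ranked last.
import org.apache.spark.sql.types._

// How many array levels wrap the field's type.
def arrayDepth(dt: DataType): Int = dt match {
  case ArrayType(elem, _) => 1 + arrayDepth(elem)
  case _                  => 0
}

// The innermost element type once array wrappers are stripped.
def innermost(dt: DataType): DataType = dt match {
  case ArrayType(elem, _) => innermost(elem)
  case other              => other
}

// Rough "width" ranking; strings and binaries sort after fixed-width scalars.
def scalarWidth(dt: DataType): Int = dt match {
  case BooleanType | ByteType  => 1
  case ShortType               => 2
  case IntegerType | FloatType => 4
  case LongType | DoubleType   => 8
  case StringType | BinaryType => Int.MaxValue
  case _                       => 16
}

def pickFallbackField(candidates: Seq[StructField]): StructField =
  candidates.minBy(f => (arrayDepth(f.dataType), scalarWidth(innermost(f.dataType))))
```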
Why are the changes needed?
This is a bug fix: depending on which fields were requested, we incorrectly assumed non-null struct values to be missing from the file and returned them as nulls.
Does this PR introduce any user-facing change?
Yes. We previously assumed structs to be null if all the fields we were trying to read from a Parquet file were missing from that file, even if the file contained other fields that definition levels could be taken from. In the example from the Jira ticket, the struct used to be returned as null; it is now returned as a non-null struct, with nulls only for the requested fields that are genuinely absent from the file.
How was this patch tested?
Added new unit tests and updated an existing test to expect the new behavior.
Was this patch authored or co-authored using generative AI tooling?
No.