[SPARK-56385][SQL][FOLLOW-UP] Fix FIELD_NOT_FOUND when remapping pushed filters after nested schema pruning #55477
Closed
anton5798 wants to merge 2 commits into
…d filters after nested schema pruning

### What changes were proposed in this pull request?

Wrap `projectionFunc` in `scala.util.Try` when remapping `pushedFilterExpressions` against the pruned scan output in `V2ScanRelationPushDown.pruneColumns`, and drop filters whose remap fails. The accompanying `.subsetOf(AttributeSet(output))` filter is retained for the top-level-column pruning case.

### Why are the changes needed?

After SPARK-56385, `pushedFilterExpressions` are remapped through `ProjectionOverSchema` to match the post-pruning scan output. When a pushed filter references a nested struct field that nested schema pruning has dropped, `ProjectionOverSchema` calls `StructType.fieldIndex` on the narrowed struct and throws `SparkIllegalArgumentException: [FIELD_NOT_FOUND]`.

Repro (exercised by the new test):

```
Schema: s: struct<a: int, b: int>, i: int
Query:  SELECT s.b FROM t WHERE s.a > 3   (s.a fully pushed)
```

Column pruning narrows `s` to `struct<b>`. The parent `s` is still in the output, so the existing `.subsetOf` guard passes, but remapping `GetStructField(s, "a")` through `ProjectionOverSchema` throws because field `a` is gone.

This does not crash for top-level pruning: when the pruned column is entirely absent from the output, `ProjectionOverSchema.getProjection` returns `None` and `transformDown` leaves the expression unchanged, which `.subsetOf` then drops cleanly.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added a unit test in `DataSourceV2Suite` that reproduces the crash via a new `NestedSchemaDataSourceV2` + `SELECT s.b WHERE s.a > 3` pattern.
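To make the failure mode concrete, here is a minimal, self-contained sketch of the remap-then-drop idea. It does not use Spark: `Struct`, `NestedFilter`, `remap`, and `remapOrDrop` are hypothetical stand-ins invented for illustration, and the real code in `V2ScanRelationPushDown.pruneColumns` works on Catalyst expressions instead.

```scala
import scala.util.Try

// Toy stand-ins for illustration only; these are NOT the real Catalyst classes.
object PrunedRemapSketch {
  final case class Struct(fields: Vector[String]) {
    // Like the behaviour described for StructType.fieldIndex: a missing field throws.
    def fieldIndex(name: String): Int = {
      val i = fields.indexOf(name)
      if (i < 0) throw new IllegalArgumentException(s"FIELD_NOT_FOUND: $name")
      i
    }
  }

  // A pushed filter that reads one nested field, e.g. `s.a > 3`.
  final case class NestedFilter(column: String, field: String, ordinal: Int)

  // Rebind the filter's field ordinal against the pruned struct of its column.
  def remap(f: NestedFilter, pruned: Map[String, Struct]): NestedFilter =
    pruned.get(f.column) match {
      case Some(struct) => f.copy(ordinal = struct.fieldIndex(f.field))
      case None         => f // column absent from the output: left unchanged
    }

  // Remap each filter, drop the ones whose remap throws, then drop the ones
  // whose column is no longer in the output (the role played by `.subsetOf`).
  def remapOrDrop(
      filters: Seq[NestedFilter],
      pruned: Map[String, Struct]): Seq[NestedFilter] =
    filters
      .flatMap(f => Try(remap(f, pruned)).toOption)
      .filter(f => pruned.contains(f.column))
}
```

With `s` pruned to `struct<b>`, a filter on `s.a` is dropped because its remap throws, a filter on `s.b` is rebound to ordinal 0, and a filter on a column that is entirely absent passes through `remap` unchanged and is removed by the final membership check.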
yyanyy
approved these changes
Apr 22, 2026
Contributor
yyanyy
left a comment
Thank you for helping fix this!
cloud-fan
approved these changes
Apr 23, 2026
Contributor
cloud-fan
left a comment
LGTM, with one optional nit. Clean, well-targeted follow-up. The fix is localized to the one call site that actually needs to tolerate remap failures (fully-pushed filters), and correctly leaves the post-scan remap at line 820 alone — those filter references are considered by SchemaPruning.identifyRootFields, so their nested fields are preserved.
Address review feedback from Wenchen: catch only the specific `SparkIllegalArgumentException` with condition `FIELD_NOT_FOUND` thrown by `StructType.fieldIndex` when a pushed filter references a pruned nested field, instead of swallowing every `Throwable` via `scala.util.Try`. Other failure modes (e.g., `SparkException.internalError` from `ProjectionOverSchema`'s "unmatched child schema" branches) now surface instead of being silently dropped.
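The difference between swallowing every `Throwable` and catching one specific condition can be sketched without Spark as follows. `ConditionException` and `remapOrNone` are hypothetical names used only for this illustration; the actual change matches on `SparkIllegalArgumentException` with condition `FIELD_NOT_FOUND`, whose exact accessor is not shown in this conversation.

```scala
// Illustrative sketch only; not the code from this PR.
object NarrowCatchSketch {
  // Hypothetical stand-in for an exception that carries an error condition.
  final class ConditionException(val condition: String, msg: String)
      extends IllegalArgumentException(msg)

  // Evaluate `remap`. Only a FIELD_NOT_FOUND failure becomes None;
  // any other exception propagates to the caller instead of being hidden.
  def remapOrNone[A](remap: => A): Option[A] =
    try Some(remap)
    catch {
      case e: ConditionException if e.condition == "FIELD_NOT_FOUND" => None
    }
}
```

Compared with `scala.util.Try(...).toOption`, the guard on the `catch` case keeps unrelated failures, such as an internal error, visible to the caller.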
49a1510 to
92b01b3
Contributor
thanks, merging to master!