[SPARK-56385][SQL][FOLLOW-UP] Fix FIELD_NOT_FOUND when remapping pushed filters after nested schema pruning #55477

Closed
anton5798 wants to merge 2 commits into apache:master from anton5798:fix-pushed-filter-nested-pruning

Contributor

@anton5798 anton5798 commented Apr 22, 2026

What changes were proposed in this pull request?

Wrap `projectionFunc` in `scala.util.Try` when remapping `pushedFilterExpressions` against the pruned scan output in `V2ScanRelationPushDown.pruneColumns`, and drop filters whose remap fails. The accompanying `.subsetOf(AttributeSet(output))` filter is retained for the top-level-column pruning case.
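
The drop-on-failure behavior described above can be illustrated with a small, Spark-free sketch. Everything in it is a hypothetical stand-in rather than the PR's actual code: `Filter`, `remap`, and `remapPushedFilters` are invented names, the pruned output is modeled as a map from column name to surviving nested field names, and the thrown exception is a plain `IllegalArgumentException` instead of Spark's exception type.

```scala
import scala.util.Try

object TryRemapSketch {
  // Hypothetical stand-in for a pushed filter: a top-level column plus an
  // optional nested field (None means the filter reads the column directly).
  final case class Filter(column: String, field: Option[String])

  // Hypothetical stand-in for projectionFunc. If the column is absent from the
  // pruned output the filter is returned unchanged; if the column is present
  // but the nested field was pruned, the lookup throws.
  def remap(prunedOutput: Map[String, Set[String]])(f: Filter): Filter =
    prunedOutput.get(f.column) match {
      case None => f
      case Some(fields) =>
        f.field match {
          case Some(name) if !fields.contains(name) =>
            throw new IllegalArgumentException(s"[FIELD_NOT_FOUND] No such struct field $name")
          case _ => f
        }
    }

  // Remap every pushed filter, dropping those whose remap throws, then keep
  // only filters whose column is still in the output (the subset-style guard).
  def remapPushedFilters(
      prunedOutput: Map[String, Set[String]],
      pushed: Seq[Filter]): Seq[Filter] =
    pushed
      .flatMap(f => Try(remap(prunedOutput)(f)).toOption)
      .filter(f => prunedOutput.contains(f.column))
}
```

In this toy model, with `s` narrowed to `struct<b>` and `i` pruned entirely, a filter on `s.a` is dropped by the `Try`, a filter on `i` survives the remap untouched and is then dropped by the guard, and a filter on `s.b` is kept.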

Why are the changes needed?

After SPARK-56385, `pushedFilterExpressions` are remapped through `ProjectionOverSchema` to match the post-pruning scan output. When a pushed filter references a nested struct field that nested schema pruning has dropped, `ProjectionOverSchema` calls `StructType.fieldIndex` on the narrowed struct and throws `SparkIllegalArgumentException: [FIELD_NOT_FOUND]`.

Repro (exercised by the new test):

Schema:  s: struct<a: int, b: int>, i: int
Query:   SELECT s.b FROM t WHERE s.a > 3   (s.a fully pushed)

Column pruning narrows `s` to `struct<b>`. The parent `s` is still in the output, so the existing `.subsetOf` guard passes, but remapping `GetStructField(s, "a")` through `ProjectionOverSchema` throws because field `a` is gone.

This does not crash for top-level pruning: when the pruned column is entirely absent from the output, `ProjectionOverSchema.getProjection` returns `None` and `transformDown` leaves the expression unchanged, which `.subsetOf` then drops cleanly.
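
The difference between the two pruning cases can be sketched without Spark. This is only a toy model under stated assumptions: `prunedOutput`, `passesSubsetGuard`, and `fieldIndex` are hypothetical helpers that mimic, respectively, the pruned scan output from the repro, a guard that only sees top-level column names, and a field lookup on the narrowed struct.

```scala
object GuardVsFieldLookupSketch {
  // Pruned scan output for the repro: `s` narrowed to struct<b>, `i` pruned.
  val prunedOutput: Map[String, Seq[String]] = Map("s" -> Seq("b"))

  // Toy counterpart of the subset guard: it compares top-level column names
  // only, so it cannot notice that a nested field has disappeared.
  def passesSubsetGuard(referencedColumns: Set[String]): Boolean =
    referencedColumns.subsetOf(prunedOutput.keySet)

  // Toy counterpart of a field lookup on the narrowed struct: throws when the
  // requested nested field is no longer part of the struct.
  def fieldIndex(column: String, field: String): Int = {
    val idx = prunedOutput(column).indexOf(field)
    if (idx < 0) {
      throw new IllegalArgumentException(s"[FIELD_NOT_FOUND] No such struct field $field")
    }
    idx
  }
}
```

Here a filter on `s.a` references only the column `s`, so `passesSubsetGuard(Set("s"))` holds even though `fieldIndex("s", "a")` throws. A filter on the fully pruned column `i` never reaches a field lookup and is rejected by the guard instead.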

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added a unit test in `DataSourceV2Suite` that reproduces the crash via a new `NestedSchemaDataSourceV2` + `SELECT s.b WHERE s.a > 3` pattern.

Contributor

@yyanyy yyanyy left a comment

Thank you for helping fix this!

@HyukjinKwon HyukjinKwon changed the title from "[SPARK-56385][SQL][FOLLOWUP] Fix FIELD_NOT_FOUND when remapping pushed filters after nested schema pruning" to "[SPARK-56385][SQL][FOLLOW-UP] Fix FIELD_NOT_FOUND when remapping pushed filters after nested schema pruning" on Apr 23, 2026
Contributor

@cloud-fan cloud-fan left a comment

LGTM, with one optional nit. Clean, well-targeted follow-up. The fix is localized to the one call site that actually needs to tolerate remap failures (fully-pushed filters), and correctly leaves the post-scan remap at line 820 alone — those filter references are considered by SchemaPruning.identifyRootFields, so their nested fields are preserved.

Address review feedback from Wenchen: catch only the specific
`SparkIllegalArgumentException` with condition `FIELD_NOT_FOUND`
thrown by `StructType.fieldIndex` when a pushed filter references a
pruned nested field, instead of swallowing every `Throwable` via
`scala.util.Try`. Other failure modes (e.g., `SparkException.internalError`
from `ProjectionOverSchema`'s "unmatched child schema" branches) now
surface instead of being silently dropped.
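
The contrast between the narrow catch and a blanket `scala.util.Try` can be sketched as follows. This is a minimal illustration under assumptions, not the code from this commit: `ConditionException` is a hypothetical exception type carrying an error condition string, and `remapOrDrop` is an invented helper name.

```scala
object NarrowCatchSketch {
  // Hypothetical stand-in for an exception that carries an error condition.
  final class ConditionException(val condition: String, message: String)
      extends IllegalArgumentException(message)

  // Apply `remap`; return None only when it fails with FIELD_NOT_FOUND.
  // Any other exception is not matched by the catch and propagates to the caller.
  def remapOrDrop[A](expr: A)(remap: A => A): Option[A] =
    try Some(remap(expr))
    catch {
      case e: ConditionException if e.condition == "FIELD_NOT_FOUND" => None
    }
}
```

With this shape, a filter that hits a pruned nested field is dropped, while an unrelated failure such as an internal error still surfaces instead of being hidden.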
@anton5798 anton5798 force-pushed the fix-pushed-filter-nested-pruning branch from 49a1510 to 92b01b3 on April 23, 2026 08:38
@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 875a2f2 Apr 23, 2026
3 participants