[SPARK-30761][SQL] Nested column pruning should not prune on required child outputs in Generate #27503
val requiredAttrs = AttributeSet(g.requiredChildOutput)
NestedColumnAliasing.getAliasSubMap(g.generator.children, requiredAttrs).map {
This case should normally be handled by the case pattern above (Project + Generate). But if all nested fields are selected in the top Project, that case does not prune. Then, when the Optimizer transforms down to the underlying Generate, only the referenced nested columns are kept and the others are pruned from the child. This leaves the accessors in the top Project unresolved.
@@ -333,6 +333,14 @@ abstract class SchemaPruningSuite
}
}

testSchemaPruning("select explode of nested field of array of struct and " +
"all remaining nested fields") {
Instead of fixing this case by case, can we try to find all the possible cases and make sure we cover all the possible query plans, including both negative and positive cases?
Also, we need unit test cases for these optimizer rules.
Do we traverse all the ancestor nodes?
@gatorsmile I'm ok to revert it.
Test build #118085 has finished for PR 27503 at commit
What changes were proposed in this pull request?
We prune nested fields from Generate. If a child output is required by an operator above Generate, we should not prune nested fields from it. Otherwise, the accessors in that operator could become unresolved.
Why are the changes needed?
A required child output is one that is referenced, either as a whole or through its nested fields, by an operator above Generate. If the rule prunes other nested fields from it, the accessors in that operator will be unresolved.
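To make the failure mode concrete, here is a minimal, self-contained sketch in plain Scala. It is a toy model, not the Catalyst implementation: `StructColumn`, `FieldAccess`, `resolves`, and `pruneForGenerator` are hypothetical names invented for illustration, and the column and field names are made up. It only shows why a column listed in the required child output has to be left unpruned.

```scala
// Toy model only: these types are hypothetical and are NOT Spark's Catalyst classes.
object GeneratePruningSketch {
  // A struct-typed column, reduced to its name and the names of its nested fields.
  final case class StructColumn(name: String, fields: Set[String])

  // A nested field access such as `contact.name`.
  final case class FieldAccess(column: String, field: String)

  // Illustrative data: the generator reads `contact.friends`,
  // while an operator above Generate reads `contact.name`.
  val child: Seq[StructColumn] =
    Seq(StructColumn("contact", Set("friends", "name", "address")))
  val generatorAccesses: Seq[FieldAccess] = Seq(FieldAccess("contact", "friends"))
  val topAccess: FieldAccess = FieldAccess("contact", "name")

  // An accessor resolves only if its field still exists in the child's output.
  def resolves(access: FieldAccess, childOutput: Seq[StructColumn]): Boolean =
    childOutput.exists(c => c.name == access.column && c.fields.contains(access.field))

  // Prune each column down to the fields the generator reads, except for columns
  // named in requiredChildOutput, which operators above Generate may still use.
  def pruneForGenerator(
      childOutput: Seq[StructColumn],
      accesses: Seq[FieldAccess],
      requiredChildOutput: Set[String]): Seq[StructColumn] =
    childOutput.map { col =>
      val used = accesses.filter(_.column == col.name).map(_.field).toSet
      if (requiredChildOutput.contains(col.name) || used.isEmpty) col
      else col.copy(fields = col.fields.intersect(used))
    }

  def main(args: Array[String]): Unit = {
    // Without the guard, `contact` is narrowed to `friends` only.
    val unguarded = pruneForGenerator(child, generatorAccesses, Set.empty)
    println(s"unguarded: top accessor resolves = ${resolves(topAccess, unguarded)}")

    // With the guard, `contact` is required above Generate and is left intact.
    val guarded = pruneForGenerator(child, generatorAccesses, Set("contact"))
    println(s"guarded: top accessor resolves = ${resolves(topAccess, guarded)}")
  }
}
```

In this sketch the unguarded run drops `name` from `contact`, so the top accessor no longer resolves, while the guarded run keeps the column whole. The real rule works on Catalyst expressions and attributes rather than on field-name sets.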
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Unit tests.