[SPARK-42163][SQL] Fix schema pruning for non-foldable array index or map key #39718
Closed
sigmod
approved these changes
Jan 30, 2023
@cloud-fan @gengliangwang @viirya: can any of you help to merge this PR?
viirya
changed the title
[SPARK-42163] Fix schema pruning for non-foldable array index or map key
[SPARK-42163][SQL] Fix schema pruning for non-foldable array index or map key
Jan 31, 2023
viirya
approved these changes
Jan 31, 2023
cloud-fan
approved these changes
Jan 31, 2023
thanks, merging to master/3.4!
cloud-fan
pushed a commit
that referenced
this pull request
Jan 31, 2023
… map key ### What changes were proposed in this pull request? In parquet schema pruning, we use SelectedField to try to extract the field that is used in a struct. It looks through GetArrayItem/GetMapItem, but when doing so, it ignores the index/key, which may itself be a struct field. If it is a struct field that is not selected in some other expression, and another field of the same attribute is selected, then pruning will drop the field, resulting in an optimizer error. This change modifies SelectedField to only look through GetArrayItem/GetMapItem if the index/key argument is foldable. The equivalent code for `ElementAt` was already doing the same thing, so this just makes them consistent. In principle, we could continue to traverse through these expressions, we'd just need to make sure that the index/key expression was also surfaced to column pruning as an expression that needs to be examined. But this seems like a fairly non-trivial change to the design of the SelectedField class. There is some risk that the current approach could result in a regression e.g. if there is an existing GetArrayItem that is being successfully pruned, where a non-foldable index argument happens to not trigger an error (because it is not a struct field, or it is preserved due to some other expression). ### Why are the changes needed? Allows queries that previously would fail in the optimizer to pass. ### Does this PR introduce _any_ user-facing change? Yes, as described above, there could be a performance regression if a query was previously pruning through a GetArrayItem/GetMapItem, and happened to not fail. ### How was this patch tested? Unit test included in patch, fails without the patch and passes with it. Closes #39718 from cashmand/fix_selected_field. Authored-by: cashmand <david.cashman@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 16cfa09) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
HyukjinKwon
reviewed
Feb 3, 2023
LGTM2!
snmvaughan
pushed a commit
to snmvaughan/spark
that referenced
this pull request
Jun 20, 2023
What changes were proposed in this pull request?
In Parquet schema pruning, we use SelectedField to try to extract the field that is used in a struct. It looks through GetArrayItem/GetMapItem, but when doing so, it ignores the index/key, which may itself be a struct field. If the index/key is a struct field that is not selected in some other expression, and another field of the same attribute is selected, then pruning will drop the index/key field, resulting in an optimizer error.
This change modifies SelectedField to only look through GetArrayItem/GetMapItem if the index/key argument is foldable. The equivalent code for `ElementAt` was already doing the same thing, so this just makes them consistent.
In principle, we could continue to traverse through these expressions; we'd just need to make sure that the index/key expression was also surfaced to column pruning as an expression that needs to be examined. But this seems like a fairly non-trivial change to the design of the SelectedField class.
There is some risk that the current approach could result in a regression e.g. if there is an existing GetArrayItem that is being successfully pruned, where a non-foldable index argument happens to not trigger an error (because it is not a struct field, or it is preserved due to some other expression).
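To make the failure mode and the fix concrete, here is a minimal Python sketch of the idea. It is a toy model, not Spark's implementation: the node classes only borrow Catalyst's names for readability, `is_foldable`, `selected_field`, `required_paths` and the `require_foldable` flag are hypothetical helpers, "foldable" is assumed to mean "is a literal", and the fallback to child expressions is an assumption of this model. GetMapItem is not modelled; it would be handled the same way as GetArrayItem.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union


# Toy expression nodes. The names mirror Catalyst's, but this is not Spark code.
@dataclass(frozen=True)
class Attr:
    name: str


@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class GetStructField:
    child: "Expr"
    field: str


@dataclass(frozen=True)
class GetArrayItem:
    child: "Expr"
    index: "Expr"


Expr = Union[Attr, Literal, GetStructField, GetArrayItem]
Path = Tuple[str, ...]


def is_foldable(e: Expr) -> bool:
    # Assumption: only literals count as foldable in this toy model.
    return isinstance(e, Literal)


def selected_field(e: Expr, require_foldable: bool) -> Optional[Path]:
    """Return the attribute/field path selected by e, or None if e cannot be looked through."""
    if isinstance(e, Attr):
        return (e.name,)
    if isinstance(e, GetStructField):
        parent = selected_field(e.child, require_foldable)
        return None if parent is None else parent + (e.field,)
    if isinstance(e, GetArrayItem):
        if require_foldable and not is_foldable(e.index):
            return None  # patched behaviour: refuse to look through
        return selected_field(e.child, require_foldable)  # the index is ignored here
    return None


def required_paths(e: Expr, require_foldable: bool) -> List[Path]:
    """Collect the paths pruning must keep, falling back to children when selection fails."""
    path = selected_field(e, require_foldable)
    if path is not None:
        return [path]
    if isinstance(e, GetStructField):
        return required_paths(e.child, require_foldable)
    if isinstance(e, GetArrayItem):
        return (required_paths(e.child, require_foldable)
                + required_paths(e.index, require_foldable))
    return []


s = Attr("s")
# Models s.arr[s.idx].x: the index is itself a field of the same struct.
expr = GetStructField(GetArrayItem(GetStructField(s, "arr"), GetStructField(s, "idx")), "x")

old = required_paths(expr, require_foldable=False)  # s.idx is never recorded
new = required_paths(expr, require_foldable=True)   # s.arr and s.idx are both kept

# A literal index is foldable, so pruning still narrows to s.arr.x.
literal_expr = GetStructField(GetArrayItem(GetStructField(s, "arr"), Literal(0)), "x")
literal_index = required_paths(literal_expr, require_foldable=True)
```

In this model the unpatched traversal records only `s.arr.x`, so `s.idx` would be pruned away even though the expression reads it. With the foldable check, `s.idx` survives, at the cost of keeping all of `s.arr` rather than just `s.arr.x`, which is the kind of reduced pruning the regression note above refers to.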
Why are the changes needed?
Allows queries that previously would fail in the optimizer to pass.
Does this PR introduce any user-facing change?
Yes, as described above, there could be a performance regression if a query was previously pruning through a GetArrayItem/GetMapItem, and happened to not fail.
How was this patch tested?
Unit test included in patch, fails without the patch and passes with it.