[SPARK-42163][SQL] Fix schema pruning for non-foldable array index or map key #39718

cashmand · 2023-01-24T15:03:02Z

What changes were proposed in this pull request?

In Parquet schema pruning, we use SelectedField to extract the field of a struct that an expression uses. It looks through GetArrayItem/GetMapItem, but in doing so it ignores the index/key, which may itself be a struct field. If that struct field is not selected by some other expression, and another field of the same attribute is selected, then pruning drops the field and the optimizer fails with an error.
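
To make the failure mode concrete, here is a minimal toy model in Python. It is not Spark's code: the classes `Attr`, `GetField`, `GetArrayItem` and the function `selected_path` are hypothetical stand-ins, and the column `s` with fields `arr` and `idx` is an assumed example. The sketch only shows how following the child side of an array access, while ignoring the index, loses track of a field the expression still reads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attr:
    """A top-level column, e.g. `s` (hypothetical toy class)."""
    name: str


@dataclass(frozen=True)
class GetField:
    """Struct field access: child.field (hypothetical toy class)."""
    child: object
    field: str


@dataclass(frozen=True)
class GetArrayItem:
    """Array element access: child[index] (hypothetical toy class)."""
    child: object
    index: object


def selected_path(expr):
    """Return the field path reached by following only the child side."""
    if isinstance(expr, Attr):
        return (expr.name,)
    if isinstance(expr, GetField):
        return selected_path(expr.child) + (expr.field,)
    if isinstance(expr, GetArrayItem):
        # The index is ignored here, mirroring the behaviour described above.
        return selected_path(expr.child)
    raise TypeError(f"unsupported expression: {expr!r}")


# s.arr[s.idx]: the index is itself a field of the same struct column.
expr = GetArrayItem(GetField(Attr("s"), "arr"), GetField(Attr("s"), "idx"))

kept = {selected_path(expr)}             # paths that pruning would keep
index_path = selected_path(expr.index)   # path that the index actually reads
```

In this toy, `kept` contains only the path to `s.arr`, while the expression also needs `s.idx`. A pruned schema built from `kept` alone would no longer contain the field that the index refers to.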

This change modifies SelectedField to only look through GetArrayItem/GetMapItem if the index/key argument is foldable. The equivalent code for ElementAt was already doing the same thing, so this just makes them consistent.
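
The rule can be sketched with a small toy model in Python. This is an illustration under stated assumptions, not the patch itself: the classes, `is_foldable`, and `selected_path` are hypothetical names, and the toy treats only literals as foldable, which is narrower than Spark's notion of a foldable expression. Returning `None` stands for "do not prune through this expression".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attr:
    """A top-level column (hypothetical toy class)."""
    name: str


@dataclass(frozen=True)
class Literal:
    """A constant value (hypothetical toy class)."""
    value: object


@dataclass(frozen=True)
class GetField:
    """Struct field access: child.field (hypothetical toy class)."""
    child: object
    field: str


@dataclass(frozen=True)
class GetArrayItem:
    """Array element access: child[index] (hypothetical toy class)."""
    child: object
    index: object


def is_foldable(expr):
    # Simplifying assumption: only literals count as foldable in this toy.
    return isinstance(expr, Literal)


def selected_path(expr):
    """Return the selected field path, or None if pruning should not apply."""
    if isinstance(expr, Attr):
        return (expr.name,)
    if isinstance(expr, GetField):
        parent = selected_path(expr.child)
        return None if parent is None else parent + (expr.field,)
    if isinstance(expr, GetArrayItem):
        # Only look through the array access when the index is foldable.
        if not is_foldable(expr.index):
            return None
        return selected_path(expr.child)
    raise TypeError(f"unsupported expression: {expr!r}")


constant_index = GetArrayItem(GetField(Attr("s"), "arr"), Literal(0))
field_index = GetArrayItem(GetField(Attr("s"), "arr"), GetField(Attr("s"), "idx"))

path_constant = selected_path(constant_index)  # still prunable
path_field = selected_path(field_index)        # pruning is skipped
```

With a constant index the toy still reports the path to `s.arr`, so pruning can proceed. With an index that reads another field it reports nothing, so the caller would fall back to keeping the unpruned schema.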

In principle, we could continue to traverse through these expressions; we would just need to make sure that the index/key expression is also surfaced to column pruning as an expression that needs to be examined. However, this seems like a fairly non-trivial change to the design of the SelectedField class.
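
For illustration only, the alternative might look roughly like the following toy sketch in Python. It is not part of this patch and not Spark's API: the classes and the `collect` function are hypothetical, and the toy again treats only literals as constant. The idea is that traversal returns both the path followed through the child side and the extra paths read by any index expression, so that both can be kept.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attr:
    """A top-level column (hypothetical toy class)."""
    name: str


@dataclass(frozen=True)
class Literal:
    """A constant value (hypothetical toy class)."""
    value: object


@dataclass(frozen=True)
class GetField:
    """Struct field access: child.field (hypothetical toy class)."""
    child: object
    field: str


@dataclass(frozen=True)
class GetArrayItem:
    """Array element access: child[index] (hypothetical toy class)."""
    child: object
    index: object


def collect(expr):
    """Return (path, extras): the child-side path and the index-side paths."""
    if isinstance(expr, Attr):
        return (expr.name,), set()
    if isinstance(expr, GetField):
        path, extras = collect(expr.child)
        return path + (expr.field,), extras
    if isinstance(expr, GetArrayItem):
        path, extras = collect(expr.child)
        if isinstance(expr.index, Literal):
            # A constant index reads no columns, so there is nothing to add.
            return path, extras
        # Surface the paths read by the index so they are not pruned away.
        idx_path, idx_extras = collect(expr.index)
        return path, extras | {idx_path} | idx_extras
    raise TypeError(f"unsupported expression: {expr!r}")


# s.arr[s.idx].x
expr = GetField(
    GetArrayItem(GetField(Attr("s"), "arr"), GetField(Attr("s"), "idx")),
    "x",
)
path, extras = collect(expr)
```

Even in this toy the return type has to change from a single path to a path plus a set of side paths, which hints at why the same change in SelectedField would touch its design rather than a single case.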

There is some risk that this approach could cause a regression, e.g. if an existing GetArrayItem is being pruned successfully today because its non-foldable index argument happens not to trigger an error (either it is not a struct field, or it is preserved by some other expression).

Why are the changes needed?

They allow queries that previously failed in the optimizer to succeed.

Does this PR introduce any user-facing change?

Yes, as described above, there could be a performance regression if a query was previously pruning through a GetArrayItem/GetMapItem, and happened to not fail.

How was this patch tested?

Unit test included in patch, fails without the patch and passes with it.

cashmand · 2023-01-24T15:03:44Z

@sigmod @viirya do you want to take a look at this patch?

sigmod · 2023-01-30T23:38:42Z

cc @rkkorlapati-db @jchen5

sigmod · 2023-01-30T23:40:07Z

@cloud-fan @gengliangwang @viirya: can any of you help to merge this PR?

cloud-fan · 2023-01-31T02:16:13Z

thanks, merging to master/3.4!

… map key Closes #39718 from cashmand/fix_selected_field. Authored-by: cashmand <david.cashman@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 16cfa09) Signed-off-by: Wenchen Fan <wenchen@databricks.com>

HyukjinKwon

LGTM2!

cashmand added 3 commits January 23, 2023 18:02

Add test. It fails

4ca4a85

Fix test

de04b87

Add ticket name

778d912

github-actions bot added the SQL label Jan 24, 2023

sigmod approved these changes Jan 30, 2023

viirya changed the title ~~[SPARK-42163] Fix schema pruning for non-foldable array index or map key~~ [SPARK-42163][SQL] Fix schema pruning for non-foldable array index or map key Jan 31, 2023

viirya approved these changes Jan 31, 2023

cloud-fan approved these changes Jan 31, 2023

cloud-fan closed this in 16cfa09 Jan 31, 2023

HyukjinKwon reviewed Feb 3, 2023
