Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-42163][SQL] Fix schema pruning for non-foldable array index or map key #39718

Closed
wants to merge 3 commits into from

Conversation

cashmand
Copy link
Contributor

What changes were proposed in this pull request?

In parquet schema pruning, we use SelectedField to try to extract the field that is used in a struct. It looks through GetArrayItem/GetMapItem, but when doing so, it ignores the index/key, which may itself be a struct field. If it is a struct field that is not selected in some other expression, and another field of the same attribute is selected, then pruning will drop the field, resulting in an optimizer error.

This change modifies SelectedField to only look through GetArrayItem/GetMapItem if the index/key argument is foldable. The equivalent code for ElementAt was already doing the same thing, so this just makes them consistent.

In principle, we could continue to traverse through these expressions, we'd just need to make sure that the index/key expression was also surfaced to column pruning as an expression that needs to be examined. But this seems like a fairly non-trivial change to the design of the SelectedField class.

There is some risk that the current approach could result in a regression e.g. if there is an existing GetArrayItem that is being successfully pruned, where a non-foldable index argument happens to not trigger an error (because it is not a struct field, or it is preserved due to some other expression).

Why are the changes needed?

Allows queries that previously would fail in the optimizer to pass.

Does this PR introduce any user-facing change?

Yes, as described above, there could be a performance regression if a query was previously pruning through a GetArrayItem/GetMapItem, and happened to not fail.

How was this patch tested?

Unit test included in patch, fails without the patch and passes with it.

@github-actions github-actions bot added the SQL label Jan 24, 2023
@cashmand
Copy link
Contributor Author

@sigmod @viirya do you want to take a look at this patch?

@sigmod
Copy link
Contributor

sigmod commented Jan 30, 2023

cc @rkkorlapati-db @jchen5

@sigmod
Copy link
Contributor

sigmod commented Jan 30, 2023

@cloud-fan @gengliangwang @viirya: can any of you help to merge this PR?

@viirya viirya changed the title [SPARK-42163] Fix schema pruning for non-foldable array index or map key [SPARK-42163][SQL] Fix schema pruning for non-foldable array index or map key Jan 31, 2023
@cloud-fan
Copy link
Contributor

thanks, merging to master/3.4!

@cloud-fan cloud-fan closed this in 16cfa09 Jan 31, 2023
cloud-fan pushed a commit that referenced this pull request Jan 31, 2023
… map key

### What changes were proposed in this pull request?

In parquet schema pruning, we use SelectedField to try to extract the field that is used in a struct. It looks through GetArrayItem/GetMapItem, but when doing so, it ignores the index/key, which may itself be a struct field. If it is a struct field that is not selected in some other expression, and another field of the same attribute is selected, then pruning will drop the field, resulting in an optimizer error.

This change modifies SelectedField to only look through GetArrayItem/GetMapItem if the index/key argument is foldable. The equivalent code for `ElementAt` was already doing the same thing, so this just makes them consistent.

In principle, we could continue to traverse through these expressions, we'd just need to make sure that the index/key expression was also surfaced to column pruning as an expression that needs to be examined. But this seems like a fairly non-trivial change to the design of the SelectedField class.

There is some risk that the current approach could result in a regression e.g. if there is an existing GetArrayItem that is being successfully pruned, where a non-foldable index argument happens to not trigger an error (because it is not a struct field, or it is preserved due to some other expression).

### Why are the changes needed?

Allows queries that previously would fail in the optimizer to pass.

### Does this PR introduce _any_ user-facing change?

Yes, as described above, there could be a performance regression if a query was previously pruning through a GetArrayItem/GetMapItem, and happened to not fail.

### How was this patch tested?

Unit test included in patch, fails without the patch and passes with it.

Closes #39718 from cashmand/fix_selected_field.

Authored-by: cashmand <david.cashman@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 16cfa09)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM2!

snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023
… map key

### What changes were proposed in this pull request?

In parquet schema pruning, we use SelectedField to try to extract the field that is used in a struct. It looks through GetArrayItem/GetMapItem, but when doing so, it ignores the index/key, which may itself be a struct field. If it is a struct field that is not selected in some other expression, and another field of the same attribute is selected, then pruning will drop the field, resulting in an optimizer error.

This change modifies SelectedField to only look through GetArrayItem/GetMapItem if the index/key argument is foldable. The equivalent code for `ElementAt` was already doing the same thing, so this just makes them consistent.

In principle, we could continue to traverse through these expressions, we'd just need to make sure that the index/key expression was also surfaced to column pruning as an expression that needs to be examined. But this seems like a fairly non-trivial change to the design of the SelectedField class.

There is some risk that the current approach could result in a regression e.g. if there is an existing GetArrayItem that is being successfully pruned, where a non-foldable index argument happens to not trigger an error (because it is not a struct field, or it is preserved due to some other expression).

### Why are the changes needed?

Allows queries that previously would fail in the optimizer to pass.

### Does this PR introduce _any_ user-facing change?

Yes, as described above, there could be a performance regression if a query was previously pruning through a GetArrayItem/GetMapItem, and happened to not fail.

### How was this patch tested?

Unit test included in patch, fails without the patch and passes with it.

Closes apache#39718 from cashmand/fix_selected_field.

Authored-by: cashmand <david.cashman@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 16cfa09)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
5 participants