Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-42084][SQL] Avoid leaking the qualified-access-only restriction #39596

Closed
wants to merge 2 commits into from

Conversation

cloud-fan
Copy link
Contributor

@cloud-fan cloud-fan commented Jan 16, 2023

What changes were proposed in this pull request?

This is a better fix than #39077 and #38862

The special attribute metadata __qualified_access_only is very risky, as it breaks normal column resolution. The aforementioned 2 PRs remove the restriction in SubqueryAlias and Alias, but it's not good enough as we may forget to do the same thing for new logical plans/expressions in the future. It's also problematic if advanced users manipulate logical plans and expressions directly, when there is no SubqueryAlias and Alias to remove the restriction.

To be safe, we should only apply this restriction when resolving join hidden columns, which means the plan node right above Project(Join(using or natural join)). This PR simply removes the restriction when a column is resolved from a sequence of Attributes, or from star expansion, and also when adding the Project hidden columns to its output. This makes sure that the qualified-access-only restriction will not be leaked to normal column resolution, but only metadata column resolution.

Why are the changes needed?

To make the join hidden column feature more robust

Does this PR introduce any user-facing change?

No

How was this patch tested?

existing tests

@cloud-fan
Copy link
Contributor Author

cc @viirya @gengliangwang

@gengliangwang
Copy link
Member

image

image

@cloud-fan I feel the qualifiedAccessOnly attributes are still complicated. We have to check them here and there. Since it is for certain Joins only, shall we decouple it from the metadataOutput and have a new TreeNode method to infer these hidden lists? E.g. hiddenOutputFromJoin. In this way, we don't need to mark whether an attribute is qualifiedAccessOnly.

@cloud-fan
Copy link
Contributor Author

cloud-fan commented Jan 18, 2023

You will need to re-implement a propagation framework, as not only the direct parent of the join can access the hidden columns. e.g. Sort -> Project -> Join, we can sort by the join hidden column

import org.apache.spark.sql.catalyst.util._
!Utils.isTesting || (LogicalPlanIntegrity.checkIfExprIdsAreGloballyUnique(currentPlan) &&
(!LogicalPlanIntegrity.canGetOutputAttrs(currentPlan) ||
!currentPlan.output.exists(_.qualifiedAccessOnly)))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sure that only metadata column resolution will apply the qualified-access-only restriction.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may not be straightforward to read/understand what it means. Maybe put your comment in the code.

@gengliangwang
Copy link
Member

You will need to re-implement a propagation framework, as not only the direct parent of the join can access the hidden columns. e.g. Sort -> Project -> Join, we can sort by the join hidden column

OK, let me think more on this recently. I feel that mixing the V2 metadata columns and the hidden join columns are hard to maintain. A bit of duplicate code for the propagation can make future development easier.

Copy link
Member

@gengliangwang gengliangwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the best fix for the hidden column issues so far.

@cloud-fan
Copy link
Contributor Author

thanks for the review, merging to master/3.3!

@cloud-fan cloud-fan closed this in 8fde8bd Jan 18, 2023
cloud-fan added a commit that referenced this pull request Jan 18, 2023
This is a better fix than #39077 and #38862

The special attribute metadata `__qualified_access_only` is very risky, as it breaks normal column resolution. The aforementioned 2 PRs remove the restriction in `SubqueryAlias` and `Alias`, but it's not good enough as we may forget to do the same thing for new logical plans/expressions in the future. It's also problematic if advanced users manipulate logical plans and expressions directly, when there is no `SubqueryAlias` and `Alias` to remove the restriction.

To be safe, we should only apply this restriction when resolving join hidden columns, which means the plan node right above `Project(Join(using or natural join))`. This PR simply removes the restriction when a column is resolved from a sequence of `Attributes`, or from star expansion, and also when adding the `Project` hidden columns to its output. This makes sure that the qualified-access-only restriction will not be leaked to normal column resolution, but only metadata column resolution.

To make the join hidden column feature more robust

No

existing tests

Closes #39596 from cloud-fan/join.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan added a commit that referenced this pull request Apr 27, 2023
…mal columns

### What changes were proposed in this pull request?

This is a followup of #39596 to fix more corner cases. It ignores the special column flag that requires qualified access for normal output attributes, as the flag should be effective only to metadata columns.

### Why are the changes needed?

It's very hard to make sure that we don't leak the special column flag. Since the bug has been in the Spark release for a while, there may be tables created with CTAS and the table schema contains the special flag.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

new analysis test

Closes #40961 from cloud-fan/col.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan added a commit that referenced this pull request Apr 27, 2023
…mal columns

This is a followup of #39596 to fix more corner cases. It ignores the special column flag that requires qualified access for normal output attributes, as the flag should be effective only to metadata columns.

It's very hard to make sure that we don't leak the special column flag. Since the bug has been in the Spark release for a while, there may be tables created with CTAS and the table schema contains the special flag.

No

new analysis test

Closes #40961 from cloud-fan/col.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 021f02e)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan added a commit that referenced this pull request Apr 27, 2023
…mal columns

This is a followup of #39596 to fix more corner cases. It ignores the special column flag that requires qualified access for normal output attributes, as the flag should be effective only to metadata columns.

It's very hard to make sure that we don't leak the special column flag. Since the bug has been in the Spark release for a while, there may be tables created with CTAS and the table schema contains the special flag.

No

new analysis test

Closes #40961 from cloud-fan/col.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 021f02e)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
LuciferYang pushed a commit to LuciferYang/spark that referenced this pull request May 10, 2023
…mal columns

### What changes were proposed in this pull request?

This is a followup of apache#39596 to fix more corner cases. It ignores the special column flag that requires qualified access for normal output attributes, as the flag should be effective only to metadata columns.

### Why are the changes needed?

It's very hard to make sure that we don't leak the special column flag. Since the bug has been in the Spark release for a while, there may be tables created with CTAS and the table schema contains the special flag.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

new analysis test

Closes apache#40961 from cloud-fan/col.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023
…mal columns

This is a followup of apache#39596 to fix more corner cases. It ignores the special column flag that requires qualified access for normal output attributes, as the flag should be effective only to metadata columns.

It's very hard to make sure that we don't leak the special column flag. Since the bug has been in the Spark release for a while, there may be tables created with CTAS and the table schema contains the special flag.

No

new analysis test

Closes apache#40961 from cloud-fan/col.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 021f02e)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
GladwinLee pushed a commit to lyft/spark that referenced this pull request Oct 10, 2023
…mal columns

This is a followup of apache#39596 to fix more corner cases. It ignores the special column flag that requires qualified access for normal output attributes, as the flag should be effective only to metadata columns.

It's very hard to make sure that we don't leak the special column flag. Since the bug has been in the Spark release for a while, there may be tables created with CTAS and the table schema contains the special flag.

No

new analysis test

Closes apache#40961 from cloud-fan/col.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 021f02e)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
catalinii pushed a commit to lyft/spark that referenced this pull request Oct 10, 2023
…mal columns

This is a followup of apache#39596 to fix more corner cases. It ignores the special column flag that requires qualified access for normal output attributes, as the flag should be effective only to metadata columns.

It's very hard to make sure that we don't leak the special column flag. Since the bug has been in the Spark release for a while, there may be tables created with CTAS and the table schema contains the special flag.

No

new analysis test

Closes apache#40961 from cloud-fan/col.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 021f02e)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants