-
Notifications
You must be signed in to change notification settings - Fork 28k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-42084][SQL] Avoid leaking the qualified-access-only restriction #39596
Conversation
@cloud-fan I feel the |
You will need to re-implement a propagation framework, as not only the direct parent of the join can access the hidden columns. e.g. |
import org.apache.spark.sql.catalyst.util._ | ||
!Utils.isTesting || (LogicalPlanIntegrity.checkIfExprIdsAreGloballyUnique(currentPlan) && | ||
(!LogicalPlanIntegrity.canGetOutputAttrs(currentPlan) || | ||
!currentPlan.output.exists(_.qualifiedAccessOnly))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sure that only metadata column resolution will apply the qualified-access-only restriction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may not be straightforward to read/understand what it means. Maybe put your comment in the code.
OK, let me think more on this recently. I feel that mixing the V2 metadata columns and the hidden join columns are hard to maintain. A bit of duplicate code for the propagation can make future development easier. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the best fix for the hidden column issues so far.
thanks for the review, merging to master/3.3! |
This is a better fix than #39077 and #38862 The special attribute metadata `__qualified_access_only` is very risky, as it breaks normal column resolution. The aforementioned 2 PRs remove the restriction in `SubqueryAlias` and `Alias`, but it's not good enough as we may forget to do the same thing for new logical plans/expressions in the future. It's also problematic if advanced users manipulate logical plans and expressions directly, when there is no `SubqueryAlias` and `Alias` to remove the restriction. To be safe, we should only apply this restriction when resolving join hidden columns, which means the plan node right above `Project(Join(using or natural join))`. This PR simply removes the restriction when a column is resolved from a sequence of `Attributes`, or from star expansion, and also when adding the `Project` hidden columns to its output. This makes sure that the qualified-access-only restriction will not be leaked to normal column resolution, but only metadata column resolution. To make the join hidden column feature more robust No existing tests Closes #39596 from cloud-fan/join. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…mal columns ### What changes were proposed in this pull request? This is a followup of #39596 to fix more corner cases. It ignores the special column flag that requires qualified access for normal output attributes, as the flag should be effective only to metadata columns. ### Why are the changes needed? It's very hard to make sure that we don't leak the special column flag. Since the bug has been in the Spark release for a while, there may be tables created with CTAS and the table schema contains the special flag. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? new analysis test Closes #40961 from cloud-fan/col. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…mal columns This is a followup of #39596 to fix more corner cases. It ignores the special column flag that requires qualified access for normal output attributes, as the flag should be effective only to metadata columns. It's very hard to make sure that we don't leak the special column flag. Since the bug has been in the Spark release for a while, there may be tables created with CTAS and the table schema contains the special flag. No new analysis test Closes #40961 from cloud-fan/col. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 021f02e) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…mal columns This is a followup of #39596 to fix more corner cases. It ignores the special column flag that requires qualified access for normal output attributes, as the flag should be effective only to metadata columns. It's very hard to make sure that we don't leak the special column flag. Since the bug has been in the Spark release for a while, there may be tables created with CTAS and the table schema contains the special flag. No new analysis test Closes #40961 from cloud-fan/col. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 021f02e) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…mal columns ### What changes were proposed in this pull request? This is a followup of apache#39596 to fix more corner cases. It ignores the special column flag that requires qualified access for normal output attributes, as the flag should be effective only to metadata columns. ### Why are the changes needed? It's very hard to make sure that we don't leak the special column flag. Since the bug has been in the Spark release for a while, there may be tables created with CTAS and the table schema contains the special flag. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? new analysis test Closes apache#40961 from cloud-fan/col. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…mal columns This is a followup of apache#39596 to fix more corner cases. It ignores the special column flag that requires qualified access for normal output attributes, as the flag should be effective only to metadata columns. It's very hard to make sure that we don't leak the special column flag. Since the bug has been in the Spark release for a while, there may be tables created with CTAS and the table schema contains the special flag. No new analysis test Closes apache#40961 from cloud-fan/col. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 021f02e) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…mal columns This is a followup of apache#39596 to fix more corner cases. It ignores the special column flag that requires qualified access for normal output attributes, as the flag should be effective only to metadata columns. It's very hard to make sure that we don't leak the special column flag. Since the bug has been in the Spark release for a while, there may be tables created with CTAS and the table schema contains the special flag. No new analysis test Closes apache#40961 from cloud-fan/col. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 021f02e) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…mal columns This is a followup of apache#39596 to fix more corner cases. It ignores the special column flag that requires qualified access for normal output attributes, as the flag should be effective only to metadata columns. It's very hard to make sure that we don't leak the special column flag. Since the bug has been in the Spark release for a while, there may be tables created with CTAS and the table schema contains the special flag. No new analysis test Closes apache#40961 from cloud-fan/col. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 021f02e) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
This is a better fix than #39077 and #38862
The special attribute metadata
__qualified_access_only
is very risky, as it breaks normal column resolution. The aforementioned 2 PRs remove the restriction inSubqueryAlias
andAlias
, but it's not good enough as we may forget to do the same thing for new logical plans/expressions in the future. It's also problematic if advanced users manipulate logical plans and expressions directly, when there is noSubqueryAlias
andAlias
to remove the restriction.To be safe, we should only apply this restriction when resolving join hidden columns, which means the plan node right above
Project(Join(using or natural join))
. This PR simply removes the restriction when a column is resolved from a sequence ofAttributes
, or from star expansion, and also when adding theProject
hidden columns to its output. This makes sure that the qualified-access-only restriction will not be leaked to normal column resolution, but only metadata column resolution.Why are the changes needed?
To make the join hidden column feature more robust
Does this PR introduce any user-facing change?
No
How was this patch tested?
existing tests