New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-21555][SQL] RuntimeReplaceable should be compared semantically by its canonicalized child #18761
Conversation
Test build #80025 has finished for PR 18761 at commit
|
cc @cloud-fan @gatorsmile Please take a look when you have some time. Thanks. |
@@ -241,6 +241,7 @@ trait RuntimeReplaceable extends UnaryExpression with Unevaluable { | |||
override def nullable: Boolean = child.nullable | |||
override def foldable: Boolean = child.foldable | |||
override def dataType: DataType = child.dataType | |||
override lazy val canonicalized: Expression = child.canonicalized |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although this fixes the issue, the root cause is the unneeded aliases (these aliases were added for nested fields) are not properly cleaned up in these RuntimeReplaceable
expressions by the rule CleanupAliases
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know it. However, the aliases actually are replaced by Optimizer because RuntimeReplaceable
. So they are no harm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, those aliases are not children of RuntimeReplaceable
which is an UnaryExpression
. So we can't trim the aliases out by simple transforming the expressions in CleanupAliases
.
If we want to replace the non-children aliases in RuntimeReplaceable
, we need to add more codes to RuntimeReplaceable
and modify all expressions of RuntimeReplaceable
. It makes the interface ugly IMO.
Consider those aliases will be replaced soon and no harm of them, that's why I choose to simply override canonicalized
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First, my only concern is CleanupAliases
might be called in the other Analyzer rules. These unneeded aliases could break the assumptions made by the caller of CleanupAliases
.
Second, you should post what I said above in the PR description. So far, the description is not clear to the reviewers.
Third, I am ok about the fix, but we should write the comments to explain why we did it here. We need to add two comments. One is in CleanupAliases
; another is above this line to explain why we add canonicalized
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated PR description and comments accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you meant there is an assumption that CleanupAliases
would clean up ALL aliases in a plan, that is not correct. It can only clean up the aliases reachable by expression transformation. I added a comment to clarify this.
CREATE TABLE test(a INT, foo STRUCT<foo1:STRING,foo2:STRING>) USING parquet; | ||
INSERT INTO test VALUES(1, ("value1", "value2")); | ||
SELECT nvl(foo.foo1, "value"), count(*) FROM test GROUP BY nvl(foo.foo1, "value"); | ||
DROP TABLE test; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can simplify the test case to something like
CREATE TEMPORARY VIEW tempView1 AS VALUES (1, NAMED_STRUCT('col1', 'gamma', 'col2', 'delta')) AS T(id, st)")
SELECT nvl(st.col1, "value"), count(*) FROM from tempView1 GROUP BY nvl(st.col1, "value")
BTW, please also update the PR title. |
I hope this PR description can be good for you. |
Test build #80039 has finished for PR 18761 at commit
|
It looks much better now. Since |
LGTM |
… by its canonicalized child ## What changes were proposed in this pull request? When there are aliases (these aliases were added for nested fields) as parameters in `RuntimeReplaceable`, as they are not in the children expression, those aliases can't be cleaned up in analyzer rule `CleanupAliases`. An expression `nvl(foo.foo1, "value")` can be resolved to two semantically different expressions in a group by query because they contain different aliases. Because those aliases are not children of `RuntimeReplaceable` which is an `UnaryExpression`. So we can't trim the aliases out by simple transforming the expressions in `CleanupAliases`. If we want to replace the non-children aliases in `RuntimeReplaceable`, we need to add more codes to `RuntimeReplaceable` and modify all expressions of `RuntimeReplaceable`. It makes the interface ugly IMO. Consider those aliases will be replaced later at optimization and so they're no harm, this patch chooses to simply override `canonicalized` of `RuntimeReplaceable`. One concern is about `CleanupAliases`. Because it actually cannot clean up ALL aliases inside a plan. To make caller of this rule notice that, this patch adds a comment to `CleanupAliases`. ## How was this patch tested? Added test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #18761 from viirya/SPARK-21555. (cherry picked from commit 9c8109e) Signed-off-by: gatorsmile <gatorsmile@gmail.com>
Thanks! Merging to master/2.2/2.1 |
… by its canonicalized child ## What changes were proposed in this pull request? When there are aliases (these aliases were added for nested fields) as parameters in `RuntimeReplaceable`, as they are not in the children expression, those aliases can't be cleaned up in analyzer rule `CleanupAliases`. An expression `nvl(foo.foo1, "value")` can be resolved to two semantically different expressions in a group by query because they contain different aliases. Because those aliases are not children of `RuntimeReplaceable` which is an `UnaryExpression`. So we can't trim the aliases out by simple transforming the expressions in `CleanupAliases`. If we want to replace the non-children aliases in `RuntimeReplaceable`, we need to add more codes to `RuntimeReplaceable` and modify all expressions of `RuntimeReplaceable`. It makes the interface ugly IMO. Consider those aliases will be replaced later at optimization and so they're no harm, this patch chooses to simply override `canonicalized` of `RuntimeReplaceable`. One concern is about `CleanupAliases`. Because it actually cannot clean up ALL aliases inside a plan. To make caller of this rule notice that, this patch adds a comment to `CleanupAliases`. ## How was this patch tested? Added test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #18761 from viirya/SPARK-21555. (cherry picked from commit 9c8109e) Signed-off-by: gatorsmile <gatorsmile@gmail.com>
@gatorsmile OK. Usually the constructor expression parameters should also be in children. |
… by its canonicalized child ## What changes were proposed in this pull request? When there are aliases (these aliases were added for nested fields) as parameters in `RuntimeReplaceable`, as they are not in the children expression, those aliases can't be cleaned up in analyzer rule `CleanupAliases`. An expression `nvl(foo.foo1, "value")` can be resolved to two semantically different expressions in a group by query because they contain different aliases. Because those aliases are not children of `RuntimeReplaceable` which is an `UnaryExpression`. So we can't trim the aliases out by simple transforming the expressions in `CleanupAliases`. If we want to replace the non-children aliases in `RuntimeReplaceable`, we need to add more codes to `RuntimeReplaceable` and modify all expressions of `RuntimeReplaceable`. It makes the interface ugly IMO. Consider those aliases will be replaced later at optimization and so they're no harm, this patch chooses to simply override `canonicalized` of `RuntimeReplaceable`. One concern is about `CleanupAliases`. Because it actually cannot clean up ALL aliases inside a plan. To make caller of this rule notice that, this patch adds a comment to `CleanupAliases`. ## How was this patch tested? Added test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes apache#18761 from viirya/SPARK-21555. (cherry picked from commit 9c8109e) Signed-off-by: gatorsmile <gatorsmile@gmail.com>
… by its canonicalized child When there are aliases (these aliases were added for nested fields) as parameters in `RuntimeReplaceable`, as they are not in the children expression, those aliases can't be cleaned up in analyzer rule `CleanupAliases`. An expression `nvl(foo.foo1, "value")` can be resolved to two semantically different expressions in a group by query because they contain different aliases. Because those aliases are not children of `RuntimeReplaceable` which is an `UnaryExpression`. So we can't trim the aliases out by simple transforming the expressions in `CleanupAliases`. If we want to replace the non-children aliases in `RuntimeReplaceable`, we need to add more codes to `RuntimeReplaceable` and modify all expressions of `RuntimeReplaceable`. It makes the interface ugly IMO. Consider those aliases will be replaced later at optimization and so they're no harm, this patch chooses to simply override `canonicalized` of `RuntimeReplaceable`. One concern is about `CleanupAliases`. Because it actually cannot clean up ALL aliases inside a plan. To make caller of this rule notice that, this patch adds a comment to `CleanupAliases`. Added test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes apache#18761 from viirya/SPARK-21555. (cherry picked from commit 9c8109e) Signed-off-by: gatorsmile <gatorsmile@gmail.com>
What changes were proposed in this pull request?
When there are aliases (these aliases were added for nested fields) as parameters in
RuntimeReplaceable
, as they are not in the children expression, those aliases can't be cleaned up in analyzer ruleCleanupAliases
.An expression
nvl(foo.foo1, "value")
can be resolved to two semantically different expressions in a group by query because they contain different aliases.Because those aliases are not children of
RuntimeReplaceable
which is anUnaryExpression
. So we can't trim the aliases out by simple transforming the expressions inCleanupAliases
.If we want to replace the non-children aliases in
RuntimeReplaceable
, we need to add more codes toRuntimeReplaceable
and modify all expressions ofRuntimeReplaceable
. It makes the interface ugly IMO.Consider those aliases will be replaced later at optimization and so they're no harm, this patch chooses to simply override
canonicalized
ofRuntimeReplaceable
.One concern is about
CleanupAliases
. Because it actually cannot clean up ALL aliases inside a plan. To make caller of this rule notice that, this patch adds a comment toCleanupAliases
.How was this patch tested?
Added test.