Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-21555][SQL] RuntimeReplaceable should be compared semantically by its canonicalized child #18761

Closed
wants to merge 2 commits into from

Conversation

viirya
Copy link
Member

@viirya viirya commented Jul 28, 2017

What changes were proposed in this pull request?

When there are aliases (these aliases were added for nested fields) as parameters in RuntimeReplaceable, as they are not in the children expression, those aliases can't be cleaned up in analyzer rule CleanupAliases.

An expression nvl(foo.foo1, "value") can be resolved to two semantically different expressions in a group by query because they contain different aliases.

Because those aliases are not children of RuntimeReplaceable which is an UnaryExpression. So we can't trim the aliases out by simple transforming the expressions in CleanupAliases.

If we want to replace the non-children aliases in RuntimeReplaceable, we need to add more codes to RuntimeReplaceable and modify all expressions of RuntimeReplaceable. It makes the interface ugly IMO.

Consider those aliases will be replaced later at optimization and so they're no harm, this patch chooses to simply override canonicalized of RuntimeReplaceable.

One concern is about CleanupAliases. Because it actually cannot clean up ALL aliases inside a plan. To make caller of this rule notice that, this patch adds a comment to CleanupAliases.

How was this patch tested?

Added test.

@SparkQA
Copy link

SparkQA commented Jul 28, 2017

Test build #80025 has finished for PR 18761 at commit d0d5804.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Jul 29, 2017

cc @cloud-fan @gatorsmile Please take a look when you have some time. Thanks.

@@ -241,6 +241,7 @@ trait RuntimeReplaceable extends UnaryExpression with Unevaluable {
override def nullable: Boolean = child.nullable
override def foldable: Boolean = child.foldable
override def dataType: DataType = child.dataType
override lazy val canonicalized: Expression = child.canonicalized
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although this fixes the issue, the root cause is the unneeded aliases (these aliases were added for nested fields) are not properly cleaned up in these RuntimeReplaceable expressions by the rule CleanupAliases.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it. However, the aliases actually are replaced by Optimizer because RuntimeReplaceable. So they are no harm.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, those aliases are not children of RuntimeReplaceable which is an UnaryExpression. So we can't trim the aliases out by simple transforming the expressions in CleanupAliases.

If we want to replace the non-children aliases in RuntimeReplaceable, we need to add more codes to RuntimeReplaceable and modify all expressions of RuntimeReplaceable. It makes the interface ugly IMO.

Consider those aliases will be replaced soon and no harm of them, that's why I choose to simply override canonicalized.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First, my only concern is CleanupAliases might be called in the other Analyzer rules. These unneeded aliases could break the assumptions made by the caller of CleanupAliases.

Second, you should post what I said above in the PR description. So far, the description is not clear to the reviewers.

Third, I am ok about the fix, but we should write the comments to explain why we did it here. We need to add two comments. One is in CleanupAliases ; another is above this line to explain why we add canonicalized.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated PR description and comments accordingly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you meant there is an assumption that CleanupAliases would clean up ALL aliases in a plan, that is not correct. It can only clean up the aliases reachable by expression transformation. I added a comment to clarify this.

CREATE TABLE test(a INT, foo STRUCT<foo1:STRING,foo2:STRING>) USING parquet;
INSERT INTO test VALUES(1, ("value1", "value2"));
SELECT nvl(foo.foo1, "value"), count(*) FROM test GROUP BY nvl(foo.foo1, "value");
DROP TABLE test;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can simplify the test case to something like

CREATE TEMPORARY VIEW tempView1 AS VALUES (1, NAMED_STRUCT('col1', 'gamma', 'col2', 'delta')) AS T(id, st)")
SELECT nvl(st.col1, "value"), count(*) FROM from tempView1 GROUP BY nvl(st.col1, "value")

@gatorsmile
Copy link
Member

BTW, please also update the PR title.

@viirya
Copy link
Member Author

viirya commented Jul 29, 2017

I hope this PR description can be good for you.

@viirya viirya changed the title [SPARK-21555][SQL] GROUP BY should work with expressions with Nvl [SPARK-21555][SQL] RuntimeReplaceable should be compared semantically by its canonicalized child Jul 29, 2017
@SparkQA
Copy link

SparkQA commented Jul 29, 2017

Test build #80039 has finished for PR 18761 at commit 82d54ed.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

It looks much better now. Since CleanupAliases has such an issue, could you check whether the other expressions are facing the same issue?

@gatorsmile
Copy link
Member

LGTM

asfgit pushed a commit that referenced this pull request Jul 29, 2017
… by its canonicalized child

## What changes were proposed in this pull request?

When there are aliases (these aliases were added for nested fields) as parameters in `RuntimeReplaceable`, as they are not in the children expression, those aliases can't be cleaned up in analyzer rule `CleanupAliases`.

An expression `nvl(foo.foo1, "value")` can be resolved to two semantically different expressions in a group by query because they contain different aliases.

Because those aliases are not children of `RuntimeReplaceable` which is an `UnaryExpression`. So we can't trim the aliases out by simple transforming the expressions in `CleanupAliases`.

If we want to replace the non-children aliases in `RuntimeReplaceable`, we need to add more codes to `RuntimeReplaceable` and modify all expressions of `RuntimeReplaceable`. It makes the interface ugly IMO.

Consider those aliases will be replaced later at optimization and so they're no harm, this patch chooses to simply override `canonicalized` of `RuntimeReplaceable`.

One concern is about `CleanupAliases`. Because it actually cannot clean up ALL aliases inside a plan. To make caller of this rule notice that, this patch adds a comment to `CleanupAliases`.

## How was this patch tested?

Added test.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #18761 from viirya/SPARK-21555.

(cherry picked from commit 9c8109e)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
@gatorsmile
Copy link
Member

Thanks! Merging to master/2.2/2.1

@asfgit asfgit closed this in 9c8109e Jul 29, 2017
asfgit pushed a commit that referenced this pull request Jul 29, 2017
… by its canonicalized child

## What changes were proposed in this pull request?

When there are aliases (these aliases were added for nested fields) as parameters in `RuntimeReplaceable`, as they are not in the children expression, those aliases can't be cleaned up in analyzer rule `CleanupAliases`.

An expression `nvl(foo.foo1, "value")` can be resolved to two semantically different expressions in a group by query because they contain different aliases.

Because those aliases are not children of `RuntimeReplaceable` which is an `UnaryExpression`. So we can't trim the aliases out by simple transforming the expressions in `CleanupAliases`.

If we want to replace the non-children aliases in `RuntimeReplaceable`, we need to add more codes to `RuntimeReplaceable` and modify all expressions of `RuntimeReplaceable`. It makes the interface ugly IMO.

Consider those aliases will be replaced later at optimization and so they're no harm, this patch chooses to simply override `canonicalized` of `RuntimeReplaceable`.

One concern is about `CleanupAliases`. Because it actually cannot clean up ALL aliases inside a plan. To make caller of this rule notice that, this patch adds a comment to `CleanupAliases`.

## How was this patch tested?

Added test.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #18761 from viirya/SPARK-21555.

(cherry picked from commit 9c8109e)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
@viirya
Copy link
Member Author

viirya commented Jul 30, 2017

@gatorsmile OK. Usually the constructor expression parameters should also be in children. RuntimeReplaceable is a special case. I'd check if there are other expressions like it.

MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018
… by its canonicalized child

## What changes were proposed in this pull request?

When there are aliases (these aliases were added for nested fields) as parameters in `RuntimeReplaceable`, as they are not in the children expression, those aliases can't be cleaned up in analyzer rule `CleanupAliases`.

An expression `nvl(foo.foo1, "value")` can be resolved to two semantically different expressions in a group by query because they contain different aliases.

Because those aliases are not children of `RuntimeReplaceable` which is an `UnaryExpression`. So we can't trim the aliases out by simple transforming the expressions in `CleanupAliases`.

If we want to replace the non-children aliases in `RuntimeReplaceable`, we need to add more codes to `RuntimeReplaceable` and modify all expressions of `RuntimeReplaceable`. It makes the interface ugly IMO.

Consider those aliases will be replaced later at optimization and so they're no harm, this patch chooses to simply override `canonicalized` of `RuntimeReplaceable`.

One concern is about `CleanupAliases`. Because it actually cannot clean up ALL aliases inside a plan. To make caller of this rule notice that, this patch adds a comment to `CleanupAliases`.

## How was this patch tested?

Added test.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes apache#18761 from viirya/SPARK-21555.

(cherry picked from commit 9c8109e)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
jzhuge pushed a commit to jzhuge/spark that referenced this pull request Aug 20, 2018
… by its canonicalized child

When there are aliases (these aliases were added for nested fields) as parameters in `RuntimeReplaceable`, as they are not in the children expression, those aliases can't be cleaned up in analyzer rule `CleanupAliases`.

An expression `nvl(foo.foo1, "value")` can be resolved to two semantically different expressions in a group by query because they contain different aliases.

Because those aliases are not children of `RuntimeReplaceable` which is an `UnaryExpression`. So we can't trim the aliases out by simple transforming the expressions in `CleanupAliases`.

If we want to replace the non-children aliases in `RuntimeReplaceable`, we need to add more codes to `RuntimeReplaceable` and modify all expressions of `RuntimeReplaceable`. It makes the interface ugly IMO.

Consider those aliases will be replaced later at optimization and so they're no harm, this patch chooses to simply override `canonicalized` of `RuntimeReplaceable`.

One concern is about `CleanupAliases`. Because it actually cannot clean up ALL aliases inside a plan. To make caller of this rule notice that, this patch adds a comment to `CleanupAliases`.

Added test.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes apache#18761 from viirya/SPARK-21555.

(cherry picked from commit 9c8109e)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
@viirya viirya deleted the SPARK-21555 branch December 27, 2023 18:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants