[SPARK-30870][SQL] Column pruning shouldn't alias a nested column if it means the whole structure by peter-toth · Pull Request #27675 · apache/spark

peter-toth · 2020-02-22T12:07:49Z

What changes were proposed in this pull request?

This PR fixes a bug in nested column aliasing by taking the data type of the referenced nested fields into account when calculating the number of extracted columns. After this PR this query runs without issues:

SELECT explodedvalue.*
FROM VALUES array(named_struct('nested', named_struct('a', 1, 'b', 2))) AS (value)
LATERAL VIEW explode(value) AS explodedvalue

This is a regression from Spark 2.4.
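
For readers following the fix, here is a minimal, self-contained sketch of why counting referenced fields by number differs from counting them by data type. It is only an illustration: `DType`, `IntT`, `StructT`, `FieldCountSketch`, `shouldAliasOld` and `shouldAliasNew` are hypothetical stand-ins invented for this example and are not Catalyst classes. Only the name `totalFieldNum` and the shape of the two comparisons come from the diff in this PR.

```scala
// Toy stand-ins for Spark's data types (assumption: NOT the real Catalyst classes).
sealed trait DType
case object IntT extends DType
final case class StructT(fields: List[(String, DType)]) extends DType

object FieldCountSketch {
  // Counts the leaf fields reachable under a type; any non-struct type counts as 1.
  def totalFieldNum(dt: DType): Int = dt match {
    case StructT(fields) => fields.map { case (_, t) => totalFieldNum(t) }.sum
    case _               => 1
  }

  // Old condition: compares the NUMBER of referenced nested fields with the leaf count.
  def shouldAliasOld(referenced: List[DType], attr: DType): Boolean =
    referenced.length < totalFieldNum(attr)

  // New condition: sums the leaf counts of the referenced nested fields instead.
  def shouldAliasNew(referenced: List[DType], attr: DType): Boolean =
    referenced.map(totalFieldNum).sum < totalFieldNum(attr)

  def main(args: Array[String]): Unit = {
    // Mirrors the query above: the attribute is struct<nested: struct<a:int, b:int>>
    // and the only referenced nested field is `nested`, i.e. the whole structure.
    val nested = StructT(List("a" -> IntT, "b" -> IntT))
    val attr   = StructT(List("nested" -> nested))

    println(shouldAliasOld(List(nested), attr)) // true  (1 < 2): aliases although nothing can be pruned
    println(shouldAliasNew(List(nested), attr)) // false (2 < 2): no aliasing, the whole structure is needed
  }
}
```

In this toy model, the new condition still returns `true` when only a proper subset of leaves is referenced, for example a single `IntT` field out of two.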

Why are the changes needed?

To fix a bug.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added new UT.

SparkQA · 2020-02-22T15:42:35Z

Test build #118817 has finished for PR 27675 at commit b09e19b.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

peter-toth · 2020-02-22T19:12:26Z

...alyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasingSuite.scala

-      .analyze
-
-    comparePlans(optimized, expected)
+    comparePlans(optimized, query)


Since we need the whole structure, why did we expect the local relation to be column pruned?

Yea, it seems that's just a mistake. cc: @dongjoon-hyun

btw, can you add tests in this suite, too?

I could add something similar to the SQL test to here as well:

test("SPARK-30870: Don't alias a nested column if it means the whole attribute") {
  val valueStructType = StructType.fromDDL("field struct<a:int, b:int>")
  val r = LocalRelation('value.struct(valueStructType))
  val field = GetStructField('value, 0, Some("field"))
  val query = r
    .limit(5)
    .select(field)
    .analyze
  val optimized = Optimize.execute(query)
  comparePlans(optimized, query)
}

but it wouldn't be much different to this particular test (Some nested column means the whole structure).

Hi, @peter-toth and @maropu .
The original code is correct, this is not about column pruning. This is about limit push down.

@dongjoon-hyun hmm, does this test have anything to do with limit push down? There is no LimitPushDown in the optimizer of this suite:

spark/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasingSuite.scala

Lines 34 to 39 in 5a51b94

object Optimize extends RuleExecutor[LogicalPlan] {
  val batches = Batch("Nested column pruning", FixedPoint(100),
    ColumnPruning,
    CollapseProject,
    RemoveNoopOperators) :: Nil
}

and actually limit is closer to the relation in the original query than in expected, but I might be wrong.

Sorry, I meant a pushdown over limit.

[SPARK-26975][SQL] Support nested-column pruning over limit/sample/repartition is about that.

Hmm. I got it. So, this is the result of the bug fix, isn't it?

Yes, it is. In this test case there is no point in pushing down the project over the limit.

SparkQA · 2020-02-22T23:26:56Z

Test build #118819 has finished for PR 27675 at commit 5a51b94.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2020-02-23T06:40:10Z

sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala

Can you make the test title clearer?

+1 for @maropu 's comment. Please revise the PR title together.

I've changed the name of the test and the PR.

Sorry, I've changed it again. Let me know if a different name would fit better.

maropu · 2020-02-23T06:41:13Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala

-            nestedFieldToAlias.length < totalFieldNum(attr.dataType)) {
+            nestedFieldToAlias
+              .map { case (nestedField, _) => totalFieldNum(nestedField.dataType) }
+              .sum < totalFieldNum(attr.dataType)) {


Ur, I see. nice catch.

dongjoon-hyun · 2020-02-23T10:19:49Z

Thank you, @peter-toth .

HyukjinKwon · 2020-02-24T05:20:48Z

cc @viirya fyi

viirya

Seems good. And please update the PR title as suggested.

HyukjinKwon

Looks fine to me too

peter-toth · 2020-02-24T08:56:14Z

Thanks for your review, I will try to address your comments today.

SparkQA · 2020-02-24T17:20:20Z

Test build #118868 has finished for PR 27675 at commit 6a6ea0d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-02-24T18:01:37Z

Test build #118869 has finished for PR 27675 at commit e4c9009.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2020-02-24T18:51:39Z

cc @dbtsai

dongjoon-hyun · 2020-02-24T18:56:47Z

dongjoon-hyun

+1, LGTM. Merged to master/3.0.
Thank you, @peter-toth , @maropu , @HyukjinKwon , @viirya .

…it means the whole structure

### What changes were proposed in this pull request?

This PR fixes a bug in nested column aliasing by taking the data type of the referenced nested fields into account when calculating the number of extracted columns. After this PR this query runs without issues:

```
SELECT explodedvalue.*
FROM VALUES array(named_struct('nested', named_struct('a', 1, 'b', 2))) AS (value)
LATERAL VIEW explode(value) AS explodedvalue
```

This is a regression from Spark 2.4.

### Why are the changes needed?

To fix a bug.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added new UT.

Closes #27675 from peter-toth/SPARK-30870.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 1a4e242)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

peter-toth · 2020-02-25T07:18:39Z

Thanks for the review @dongjoon-hyun, @HyukjinKwon, @maropu, @viirya.

gatorsmile · 2020-02-26T03:24:45Z

@peter-toth @dongjoon-hyun Can we backport this to 2.4?

peter-toth · 2020-02-26T07:28:27Z

> @peter-toth @dongjoon-hyun Can we backport this to 2.4?

@gatorsmile only Spark 3 is affected.

dongjoon-hyun · 2020-02-26T17:09:55Z

Yes. It's only for 3.0 in Apache Spark, @gatorsmile .

[SPARK-30870][SQL] Fix nested column aliasing

b09e19b

fix existing UT

5a51b94

peter-toth commented Feb 22, 2020

maropu reviewed Feb 23, 2020

viirya reviewed Feb 24, 2020

HyukjinKwon approved these changes Feb 24, 2020

fix test name

6a6ea0d

peter-toth changed the title ~~[SPARK-30870][SQL] Fix nested column aliasing~~ [SPARK-30870][SQL] Don't alias a nested column if it means the whole attribute Feb 24, 2020

peter-toth changed the title ~~[SPARK-30870][SQL] Don't alias a nested column if it means the whole attribute~~ [SPARK-30870][SQL] Column pruning shouldn't alias a nested column if it means the whole attribute Feb 24, 2020

fix test name 2

e4c9009

peter-toth changed the title ~~[SPARK-30870][SQL] Column pruning shouldn't alias a nested column if it means the whole attribute~~ [SPARK-30870][SQL] Column pruning shouldn't alias a nested column if it means the whole structure Feb 24, 2020

dongjoon-hyun added the SQL label Feb 24, 2020

dongjoon-hyun requested changes Feb 24, 2020

dongjoon-hyun approved these changes Feb 24, 2020

dongjoon-hyun closed this in 1a4e242 Feb 24, 2020
