Skip to content

[SPARK-30870][SQL] Column pruning shouldn't alias a nested column if it means the whole structure#27675

Closed
peter-toth wants to merge 4 commits intoapache:masterfrom
peter-toth:SPARK-30870
Closed

[SPARK-30870][SQL] Column pruning shouldn't alias a nested column if it means the whole structure#27675
peter-toth wants to merge 4 commits intoapache:masterfrom
peter-toth:SPARK-30870

Conversation

@peter-toth
Copy link
Contributor

@peter-toth peter-toth commented Feb 22, 2020

What changes were proposed in this pull request?

This PR fixes a bug in nested column aliasing by taking the data type of the referenced nested fields into account when calculating the number of extracted columns. After this PR this query runs without issues:

SELECT explodedvalue.*
FROM VALUES array(named_struct('nested', named_struct('a', 1, 'b', 2))) AS (value)
LATERAL VIEW explode(value) AS explodedvalue

This is a regression from Spark 2.4.

Why are the changes needed?

To fix a bug.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added new UT.

@SparkQA
Copy link

SparkQA commented Feb 22, 2020

Test build #118817 has finished for PR 27675 at commit b09e19b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

.analyze

comparePlans(optimized, expected)
comparePlans(optimized, query)
Copy link
Contributor Author

@peter-toth peter-toth Feb 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we need the whole structure, why we expected the local relation to be column pruned?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, it seems that's just a mistake. cc: @dongjoon-hyun

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw, can you add tests in this suite, too?

Copy link
Contributor Author

@peter-toth peter-toth Feb 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could add something similar to the SQL test to here as well:

  test("SPARK-30870: Don't alias a nested column if it means the whole attribute") {
    val valueStructType = StructType.fromDDL("field struct<a:int, b:int>")
    val r = LocalRelation('value.struct(valueStructType))

    val field = GetStructField('value, 0, Some("field"))

    val query = r
      .limit(5)
      .select(field)
      .analyze

    val optimized = Optimize.execute(query)

    comparePlans(optimized, query)
  }

but it wouldn't be much different to this particular test (Some nested column means the whole structure).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @peter-toth and @maropu .
The original code is correct, this is not about column pruning. This is about limit push down.

Copy link
Contributor Author

@peter-toth peter-toth Feb 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dongjoon-hyun hmm, does this test have anything to do with limit push down? There is no LimitPushDown in the optimizer of this suite:

object Optimize extends RuleExecutor[LogicalPlan] {
val batches = Batch("Nested column pruning", FixedPoint(100),
ColumnPruning,
CollapseProject,
RemoveNoopOperators) :: Nil
}
and actually limit is closer to the relation in the original query than in expected, but I might be wrong.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I meant a pushdown over limit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[SPARK-26975][SQL] Support nested-column pruning over limit/sample/repartition is about that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. I got it. So, this is the result of bug fix, isn't it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is. In this test case there is no point in pushing down the project over the limit.

@SparkQA
Copy link

SparkQA commented Feb 22, 2020

Test build #118819 has finished for PR 27675 at commit 5a51b94.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make the test title clearer?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for @maropu 's comment. Please revise the PR title together.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed the name of the test and the PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I've changed it again. Let me now if a different name would fit better.

nestedFieldToAlias.length < totalFieldNum(attr.dataType)) {
nestedFieldToAlias
.map { case (nestedField, _) => totalFieldNum(nestedField.dataType) }
.sum < totalFieldNum(attr.dataType)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ur, I see. nice catch.

@dongjoon-hyun
Copy link
Member

Thank you, @peter-toth .

@HyukjinKwon
Copy link
Member

cc @viirya fyi

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems good. And please update the PR title as suggested.

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine to me too

@peter-toth
Copy link
Contributor Author

Thanks for your review, I will try to address your comments today.

@peter-toth peter-toth changed the title [SPARK-30870][SQL] Fix nested column aliasing [SPARK-30870][SQL] Don't alias a nested column if it means the whole attribute Feb 24, 2020
@peter-toth peter-toth changed the title [SPARK-30870][SQL] Don't alias a nested column if it means the whole attribute [SPARK-30870][SQL] Column pruning shouldn't alias a nested column if it means the whole attribute Feb 24, 2020
@peter-toth peter-toth changed the title [SPARK-30870][SQL] Column pruning shouldn't alias a nested column if it means the whole attribute [SPARK-30870][SQL] Column pruning shouldn't alias a nested column if it means the whole structure Feb 24, 2020
@SparkQA
Copy link

SparkQA commented Feb 24, 2020

Test build #118868 has finished for PR 27675 at commit 6a6ea0d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 24, 2020

Test build #118869 has finished for PR 27675 at commit e4c9009.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

cc @dbtsai

.analyze

comparePlans(optimized, expected)
comparePlans(optimized, query)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @peter-toth and @maropu .
The original code is correct, this is not about column pruning. This is about limit push down.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Merged to master/3.0.
Thank you, @peter-toth , @maropu , @HyukjinKwon , @viirya .

dongjoon-hyun pushed a commit that referenced this pull request Feb 24, 2020
…it means the whole structure

### What changes were proposed in this pull request?
This PR fixes a bug in nested column aliasing by taking the data type of the referenced nested fields into account when calculating the number of extracted columns. After this PR this query runs without issues:
```
SELECT explodedvalue.*
FROM VALUES array(named_struct('nested', named_struct('a', 1, 'b', 2))) AS (value)
LATERAL VIEW explode(value) AS explodedvalue
```
This is a regression from Spark 2.4.

### Why are the changes needed?
To fix a bug.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Added new UT.

Closes #27675 from peter-toth/SPARK-30870.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 1a4e242)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@peter-toth
Copy link
Contributor Author

Thanks for the review @dongjoon-hyun, @HyukjinKwon, @maropu, @viirya.

@gatorsmile
Copy link
Member

gatorsmile commented Feb 26, 2020

@peter-toth @dongjoon-hyun Can we backport this to 2.4?

@peter-toth
Copy link
Contributor Author

@peter-toth @dongjoon-hyun Can we backport this to 2.4?

@gatorsmile only Spark 3 is affected.

@dongjoon-hyun
Copy link
Member

Yes. It's only for 3.0 in Apache Spark, @gatorsmile .

sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
…it means the whole structure

### What changes were proposed in this pull request?
This PR fixes a bug in nested column aliasing by taking the data type of the referenced nested fields into account when calculating the number of extracted columns. After this PR this query runs without issues:
```
SELECT explodedvalue.*
FROM VALUES array(named_struct('nested', named_struct('a', 1, 'b', 2))) AS (value)
LATERAL VIEW explode(value) AS explodedvalue
```
This is a regression from Spark 2.4.

### Why are the changes needed?
To fix a bug.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Added new UT.

Closes apache#27675 from peter-toth/SPARK-30870.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants

Comments