[SPARK-30870][SQL] Column pruning shouldn't alias a nested column if it means the whole structure #27675
peter-toth wants to merge 4 commits into apache:master
Test build #118817 has finished for PR 27675 at commit
.analyze

- comparePlans(optimized, expected)
+ comparePlans(optimized, query)
Since we need the whole structure, why did we expect the local relation to be column pruned?
Yea, it seems that's just a mistake. cc: @dongjoon-hyun
btw, can you add tests in this suite, too?
I could add something similar to the SQL test here as well:
test("SPARK-30870: Don't alias a nested column if it means the whole attribute") {
  val valueStructType = StructType.fromDDL("field struct<a:int, b:int>")
  val r = LocalRelation('value.struct(valueStructType))
  val field = GetStructField('value, 0, Some("field"))
  val query = r
    .limit(5)
    .select(field)
    .analyze
  val optimized = Optimize.execute(query)
  comparePlans(optimized, query)
}
but it wouldn't be much different from this particular test ("Some nested column means the whole structure").
Hi, @peter-toth and @maropu .
The original code is correct, this is not about column pruning. This is about limit push down.
@dongjoon-hyun hmm, does this test have anything to do with limit push down? There is no LimitPushDown in the optimizer of this suite:
limit is closer to the relation in the original query than in expected, but I might be wrong.
Sorry, I meant a pushdown over limit.
[SPARK-26975][SQL] Support nested-column pruning over limit/sample/repartition is about that.
Hmm. I got it. So, this is the result of the bug fix, isn't it?
Yes, it is. In this test case there is no point in pushing down the project over the limit.
Test build #118819 has finished for PR 27675 at commit
Can you make the test title clearer?
+1 for @maropu's comment. Please revise the PR title as well.
I've changed the name of the test and the PR.
Sorry, I've changed it again. Let me know if a different name would fit better.
- nestedFieldToAlias.length < totalFieldNum(attr.dataType)) {
+ nestedFieldToAlias
+   .map { case (nestedField, _) => totalFieldNum(nestedField.dataType) }
+   .sum < totalFieldNum(attr.dataType)) {
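To see why summing field counts matters, here is a minimal, self-contained Scala sketch of the two comparisons. It does not use Spark: `DType`, `IntT`, `StructT`, `ArrayT`, `oldCheck`, `newCheck` and the `Example` object are hypothetical stand-ins introduced for illustration, and this `totalFieldNum` only assumes that the real helper counts leaf fields.

```scala
// Hypothetical stand-ins for Spark's data types; these are not the real classes.
sealed trait DType
case object IntT extends DType
final case class StructT(fields: List[(String, DType)]) extends DType
final case class ArrayT(element: DType) extends DType

object NestedFieldCount {
  // Assumption: count the leaf fields reachable from a type.
  def totalFieldNum(dt: DType): Int = dt match {
    case StructT(fields) => fields.map { case (_, t) => totalFieldNum(t) }.sum
    case ArrayT(element) => totalFieldNum(element)
    case _               => 1
  }

  // Old comparison: every referenced nested field counts as one column,
  // no matter how many leaves it contains.
  def oldCheck(referenced: Seq[DType], attr: DType): Boolean =
    referenced.length < totalFieldNum(attr)

  // New comparison: every referenced nested field counts as many columns
  // as it has leaves, so a reference covering the whole attribute is detected.
  def newCheck(referenced: Seq[DType], attr: DType): Boolean =
    referenced.map(totalFieldNum).sum < totalFieldNum(attr)
}

object Example {
  // value: struct<field: struct<a:int, b:int>>
  val inner: DType = StructT(List("a" -> IntT, "b" -> IntT))
  val value: DType = StructT(List("field" -> inner))

  // Selecting value.field references one nested field that covers both leaves.
  val aliasedBefore: Boolean = NestedFieldCount.oldCheck(Seq(inner), value)
  val aliasedAfter: Boolean = NestedFieldCount.newCheck(Seq(inner), value)
}
```

In this model, selecting `value.field` makes the old comparison (1 < 2) ask for an alias even though the whole attribute is needed, while the new comparison (2 < 2) does not. A reference to a single leaf such as `a` inside `struct<a:int, b:int>` still satisfies both comparisons.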
Thank you, @peter-toth.
cc @viirya fyi
viirya left a comment
Seems good. And please update the PR title as suggested.
Thanks for your review, I will try to address your comments today.
Test build #118868 has finished for PR 27675 at commit
Test build #118869 has finished for PR 27675 at commit
cc @dbtsai
dongjoon-hyun left a comment
+1, LGTM. Merged to master/3.0.
Thank you, @peter-toth , @maropu , @HyukjinKwon , @viirya .
…it means the whole structure
### What changes were proposed in this pull request?
This PR fixes a bug in nested column aliasing by taking the data type of the referenced nested fields into account when calculating the number of extracted columns. After this PR this query runs without issues:
```
SELECT explodedvalue.*
FROM VALUES array(named_struct('nested', named_struct('a', 1, 'b', 2))) AS (value)
LATERAL VIEW explode(value) AS explodedvalue
```
This is a regression from Spark 2.4.
### Why are the changes needed?
To fix a bug.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added new UT.
Closes #27675 from peter-toth/SPARK-30870.
Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 1a4e242)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Thanks for the review @dongjoon-hyun, @HyukjinKwon, @maropu, @viirya.
@peter-toth @dongjoon-hyun Can we backport this to 2.4?
@gatorsmile only Spark 3 is affected.
Yes. It's only for 3.0 in Apache Spark, @gatorsmile.