[SPARK-34897][SQL][3.1] Support reconcile schemas based on index after nested column pruning #32279

Closed
wangyum wants to merge 1 commit into apache:branch-3.1 from wangyum:SPARK-34897-3.1

Conversation

wangyum (Member) commented Apr 21, 2021

This PR backports #31993 to branch-3.1. The original PR description:

What changes were proposed in this pull request?

[Nested column pruning](https://github.com/apache/spark/blob/0f2c0b53e8fb18c86c67b5dd679c006db93f94a5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala#L28-L42) can remove a whole top-level `StructField` from the data schema. For example:

```scala
spark.sql(
  """
    |CREATE TABLE t1 (
    |  _col0 INT,
    |  _col1 STRING,
    |  _col2 STRUCT<c1: STRING, c2: STRING, c3: STRING, c4: BIGINT>)
    |USING ORC
    |""".stripMargin)

spark.sql("INSERT INTO t1 values(1, '2', struct('a', 'b', 'c', 10L))")

spark.sql("SELECT _col0, _col2.c1 FROM t1").show
```

Before this PR, the returned schema is `_col0` INT,`_col2` STRUCT<`c1`: STRING>, and the read throws an exception:

```
java.lang.AssertionError: assertion failed: The given data schema struct<_col0:int,_col2:struct<c1:string>> has less fields than the actual ORC physical schema, no idea which columns were dropped, fail to read.
	at scala.Predef$.assert(Predef.scala:223)
	at org.apache.spark.sql.execution.datasources.orc.OrcUtils$.requestedColumnIds(OrcUtils.scala:160)
```
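The assertion comes from `OrcUtils.requestedColumnIds`: ORC files written by Hive carry only positional field names (`_col0`, `_col1`, ...), so Spark has to match the data schema to the physical schema by index, which only works if no top-level field was dropped. A condensed sketch of that logic (the real method has a different signature and handles more cases):

```scala
import org.apache.spark.sql.types.StructType

// Condensed sketch of the index-based matching in OrcUtils.requestedColumnIds.
def requestedColumnIdsSketch(orcFieldNames: Seq[String], dataSchema: StructType): Seq[Int] = {
  if (orcFieldNames.forall(_.startsWith("_col"))) {
    // Only positional names: the physical schema can be matched to the data
    // schema by index alone. If pruning dropped a top-level field, the counts
    // no longer line up and there is no way to tell which column was dropped.
    assert(orcFieldNames.length <= dataSchema.length,
      s"The given data schema ${dataSchema.catalogString} has less fields than " +
        "the actual ORC physical schema, no idea which columns were dropped, fail to read.")
    dataSchema.indices // field i of the data schema reads physical column i
  } else {
    // Real field names are present: resolve each field by name instead.
    dataSchema.fieldNames.toSeq.map(name => orcFieldNames.indexOf(name))
  }
}
```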

After this PR, the returned schema is `_col0` INT,`_col1` STRING,`_col2` STRUCT<`c1`: STRING>: every top-level field is kept, so the data schema still lines up with the ORC physical schema by index.
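With the fix, the `show` call from the example then completes; its expected output (a sketch, exact `DataFrame.show` formatting aside) is:

```
+-----+---+
|_col0| c1|
+-----+---+
|    1|  a|
+-----+---+
```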

The final schema is `_col0` INT,`_col2` STRUCT<`c1`: STRING> after the remaining column pruning:

In `FileSourceStrategy` ([source](https://github.com/apache/spark/blob/7a5647a93aaea9d1d78d9262e24fc8c010db04d0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L208-L213)):

```scala
val readDataColumns =
  dataColumns
    .filter(requiredAttributes.contains)
    .filterNot(partitionColumns.contains)
val outputSchema = readDataColumns.toStructType
logInfo(s"Output Data Schema: ${outputSchema.simpleString(5)}")
```

And in `PushDownUtils` ([source](https://github.com/apache/spark/blob/e64eb75aede71a5403a4d4436e63b1fcfdeca14d/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/PushDownUtils.scala#L96-L97)):

```scala
val neededFieldNames = neededOutput.map(_.name).toSet
r.pruneColumns(StructType(prunedSchema.filter(f => neededFieldNames.contains(f.name))))
```
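Putting the two steps together: nested pruning must keep every top-level field (possibly with a narrowed struct type) so the ORC reader can reconcile by index, and the later output pruning still drops unused top-level columns. A hypothetical helper, not the patch's actual code, illustrating the index-preserving reconciliation the title refers to:

```scala
import org.apache.spark.sql.types.{StructField, StructType}

// Hypothetical helper (for illustration only): walk the full table schema in
// order, substituting the narrowed type where a field survived nested pruning
// and keeping the original field otherwise, so top-level positions still line
// up with the physical file schema. Pruning preserves relative field order,
// so a single cursor over the pruned schema suffices.
def reconcileByIndex(fullSchema: StructType, prunedSchema: StructType): StructType = {
  var cursor = 0
  val fields: Array[StructField] = fullSchema.fields.map { original =>
    if (cursor < prunedSchema.length && prunedSchema(cursor).name == original.name) {
      val reconciled = prunedSchema(cursor)
      cursor += 1
      reconciled // e.g. _col2 keeps only STRUCT<c1: STRING>
    } else {
      original // e.g. _col1 is kept untouched to preserve indices
    }
  }
  StructType(fields)
}
```

Applied to the example, this yields `_col0` INT,`_col1` STRING,`_col2` STRUCT<`c1`: STRING> for the reader, while the `readDataColumns` filter above still removes `_col1` from the query output.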

Why are the changes needed?

Fix a bug: nested column pruning could drop top-level fields, breaking the index-based schema reconciliation in the ORC reader.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test.
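The failure is also easy to reproduce outside the test suite. A minimal standalone sketch (the object name is made up; it assumes Spark 3.1 defaults, where nested schema pruning and the native ORC reader are enabled):

```scala
import org.apache.spark.sql.SparkSession

object Spark34897Repro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("SPARK-34897 repro")
      .getOrCreate()

    spark.sql(
      """CREATE TABLE t1 (
        |  _col0 INT,
        |  _col1 STRING,
        |  _col2 STRUCT<c1: STRING, c2: STRING, c3: STRING, c4: BIGINT>)
        |USING ORC""".stripMargin)
    spark.sql("INSERT INTO t1 VALUES (1, '2', struct('a', 'b', 'c', 10L))")

    // Without the fix this query fails with the AssertionError from
    // OrcUtils.requestedColumnIds shown above; with it, the read succeeds.
    val rows = spark.sql("SELECT _col0, _col2.c1 FROM t1").collect()
    assert(rows.length == 1 && rows(0).getInt(0) == 1 && rows(0).getString(1) == "a")

    spark.stop()
  }
}
```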

github-actions bot added the SQL label Apr 21, 2021
SparkQA commented Apr 22, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42281/

SparkQA commented Apr 22, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42281/

SparkQA commented Apr 22, 2021

Test build #137754 has finished for PR 32279 at commit 040de22.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

wangyum (Member, Author) commented Apr 23, 2021

retest this please.

SparkQA commented Apr 23, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42370/

SparkQA commented Apr 23, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42370/

SparkQA commented Apr 23, 2021

Test build #137840 has finished for PR 32279 at commit 040de22.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

wangyum requested review from cloud-fan and viirya and removed the request for viirya April 23, 2021 06:34
wangyum added a commit that referenced this pull request Apr 23, 2021

[SPARK-34897][SQL][3.1] Support reconcile schemas based on index after nested column pruning

Closes #32279 from wangyum/SPARK-34897-3.1.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
wangyum (Member, Author) commented Apr 23, 2021

Merged to branch-3.1.

wangyum closed this Apr 23, 2021
wangyum deleted the SPARK-34897-3.1 branch April 23, 2021 07:23
flyrain pushed a commit to flyrain/spark that referenced this pull request Sep 21, 2021

[SPARK-34897][SQL][3.1] Support reconcile schemas based on index after nested column pruning

Closes apache#32279 from wangyum/SPARK-34897-3.1.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
fishcus pushed a commit to fishcus/spark that referenced this pull request Jan 12, 2022

[SPARK-34897][SQL][3.1] Support reconcile schemas based on index after nested column pruning

Closes apache#32279 from wangyum/SPARK-34897-3.1.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>