
[SPARK-35096][SQL] SchemaPruning should adhere spark.sql.caseSensitive config #32194

Closed
wants to merge 4 commits

Conversation

sandeep-katta
Contributor

What changes were proposed in this pull request?

SPARK-26837 added support for pruning nested fields from object serializers, but it missed handling Spark's case-insensitive mode.

In this PR, the column names to be pruned are resolved according to the spark.sql.caseSensitive config.
Exception Before Fix

```
Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
  at org.apache.spark.sql.types.StructType.apply(StructType.scala:414)
  at org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.$anonfun$applyOrElse$3(objects.scala:216)
  at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at scala.collection.TraversableLike.map(TraversableLike.scala:238)
  at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
  at scala.collection.immutable.List.map(List.scala:298)
  at org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.applyOrElse(objects.scala:215)
  at org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.applyOrElse(objects.scala:203)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:309)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:309)
  at
```
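At its core, the fix routes the field lookup through the session resolver instead of the case-sensitive StructType.apply lookup. A minimal sketch of the idea (following the shape of the review diff further down; names here are illustrative, not the verbatim patch):

```
import org.apache.spark.sql.catalyst.analysis.Resolver // (String, String) => Boolean
import org.apache.spark.sql.types.{StructField, StructType}

// Resolve a field name against a struct with the configured resolver
// (case-insensitive unless spark.sql.caseSensitive=true), rather than
// an exact-match lookup like leftStruct(fieldName).
def resolveField(struct: StructType, name: String, resolver: Resolver): Option[StructField] =
  struct.find(field => resolver(field.name, name))
```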

Why are the changes needed?

After upgrading to Spark 3, the foreachBatch API throws java.lang.ArrayIndexOutOfBoundsException. This PR fixes the issue.
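A hypothetical minimal reproducer of that failure mode (illustrative only; the class and column names are assumptions, not taken from the JIRA):

```
import spark.implicits._

// Case-class field "value" differs only in case from column "VALUE".
case class Record(value: String)

val ds = spark.range(1)
  .selectExpr("CAST(id AS STRING) AS VALUE")
  .as[Record]               // resolved case-insensitively by default
ds.map(_.value).collect()   // serializer-pruning path; hit the AIOOBE before the fix
```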

Does this PR introduce any user-facing change?

No. In fact, this fixes a regression.

How was this patch tested?

Added tests and also verified manually.

@github-actions github-actions bot added the SQL label Apr 15, 2021
@github-actions

Test build #752951989 for PR 32194 at commit db4a74a.

@github-actions

Test build #752985634 for PR 32194 at commit f6e4b6b.

@sandeep-katta
Contributor Author

CC @viirya @cloud-fan

@SparkQA

SparkQA commented Apr 15, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42011/

@SparkQA

SparkQA commented Apr 15, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42011/

@SparkQA

SparkQA commented Apr 15, 2021

Test build #137434 has finished for PR 32194 at commit db4a74a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 15, 2021

Test build #137436 has finished for PR 32194 at commit f6e4b6b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@viirya viirya left a comment

Good catch! Thanks for the fix.

Member

@dongjoon-hyun dongjoon-hyun left a comment

Thank you, @sandeep-katta. BTW, this is a [SQL] PR, not [CORE].

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-35096][Core] SchemaPruning should adhere spark.sql.caseSensitive config [SPARK-35096][SQL] SchemaPruning should adhere spark.sql.caseSensitive config Apr 16, 2021
@github-actions

Test build #754432050 for PR 32194 at commit e477e91.

@github-actions

Test build #754495960 for PR 32194 at commit baf8125.

@SparkQA

SparkQA commented Apr 16, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42039/

@SparkQA

SparkQA commented Apr 16, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42039/

```
 val sortedLeftFields = filteredRightFieldNames.map { fieldName =>
-  val leftFieldType = leftStruct(fieldName).dataType
+  val resolvedLeftStruct = leftStruct.filter(p => resolver(p.name, fieldName)).head
```
Member

filter -> find?

Contributor Author

Done, updated. Just curious: is there any performance difference between these two?
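For the record, the difference is short-circuiting rather than asymptotics: find stops at the first match and returns an Option, while filter traverses the whole collection and allocates a new one. A plain-Scala illustration:

```
val fields = Seq("id", "name", "NAME")

fields.filter(_.equalsIgnoreCase("name"))  // Seq("name", "NAME"): full scan, new collection
fields.find(_.equalsIgnoreCase("name"))    // Some("name"): stops at the first match
```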

@github-actions

Test build #754870665 for PR 32194 at commit 004d56c.

@github-actions

Test build #754876451 for PR 32194 at commit 04b24c9.

@SparkQA

SparkQA commented Apr 16, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42049/

@SparkQA

SparkQA commented Apr 16, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42051/

@SparkQA

SparkQA commented Apr 16, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42051/

@SparkQA

SparkQA commented Apr 16, 2021

Test build #137464 has finished for PR 32194 at commit baf8125.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 16, 2021

Test build #137475 has finished for PR 32194 at commit 04b24c9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sandeep-katta
Contributor Author

sandeep-katta commented Apr 20, 2021

@dongjoon-hyun @viirya, can this PR be merged? If not, I am happy to address any review comments.

@cloud-fan
Contributor

thanks, merging to master/3.1/3.0!

@cloud-fan cloud-fan closed this in 4f309ce Apr 21, 2021
cloud-fan pushed a commit that referenced this pull request Apr 21, 2021
[SPARK-35096][SQL] SchemaPruning should adhere spark.sql.caseSensitive config

Closes #32194 from sandeep-katta/SPARK-35096.

Authored-by: sandeep.katta <sandeep.katta2007@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 4f309ce)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Apr 21, 2021
[SPARK-35096][SQL] SchemaPruning should adhere spark.sql.caseSensitive config

Closes #32194 from sandeep-katta/SPARK-35096.

Authored-by: sandeep.katta <sandeep.katta2007@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@dongjoon-hyun
Member

Hi, All.
This broke branch-3.0 because SQLConfHelper does not exist there.

```
Error: ] /home/runner/work/spark/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala:20: object SQLConfHelper is not a member of package org.apache.spark.sql.catalyst
Error: ] /home/runner/work/spark/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala:23: not found: type SQLConfHelper
Error: ] /home/runner/work/spark/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala:32: not found: value conf

spark-3.0:branch-3.0 $ git grep SQLConfHelper
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala:import org.apache.spark.sql.catalyst.SQLConfHelper
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala:object SchemaPruning extends SQLConfHelper {
```

I'll revert this from branch-3.0. Please make a backporting PR to branch-3.0.
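
For context, in later branches SQLConfHelper is a thin trait whose main job is def conf: SQLConf = SQLConf.get, so a branch-3.0 adaptation could plausibly inline that access instead of mixing in the trait. A sketch under that assumption (not the actual backport PR):

```
import org.apache.spark.sql.internal.SQLConf

object SchemaPruning {
  // branch-3.0 predates SQLConfHelper; read the active session conf directly.
  private def conf: SQLConf = SQLConf.get
  // ...rest of the object unchanged, using conf.resolver for field resolution
}
```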

@cloud-fan
Contributor

@sandeep-katta can you help resubmit the PR for 3.0? Thanks!

@sandeep-katta
Contributor Author

Sure, will raise it soon.

@dongjoon-hyun
Member

Thank you!

sandeep-katta added a commit to sandeep-katta/spark that referenced this pull request Apr 22, 2021
[SPARK-35096][SQL] SchemaPruning should adhere spark.sql.caseSensitive config

Closes apache#32194 from sandeep-katta/SPARK-35096.

Authored-by: sandeep.katta <sandeep.katta2007@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@sandeep-katta
Contributor Author

backport PR #32284

flyrain pushed a commit to flyrain/spark that referenced this pull request Sep 21, 2021
[SPARK-35096][SQL] SchemaPruning should adhere spark.sql.caseSensitive config

Closes apache#32194 from sandeep-katta/SPARK-35096.

Authored-by: sandeep.katta <sandeep.katta2007@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 4f309ce)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
fishcus pushed a commit to fishcus/spark that referenced this pull request Jan 12, 2022
[SPARK-35096][SQL] SchemaPruning should adhere spark.sql.caseSensitive config

Closes apache#32194 from sandeep-katta/SPARK-35096.

Authored-by: sandeep.katta <sandeep.katta2007@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 4f309ce)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>