[SPARK-27814][SQL] The cast operation for partition key may push down uncorrect filter, which is fatal. #24685

turboFei · 2019-05-23T09:10:44Z

What changes were proposed in this pull request?

For a partitioned table, such as:

table test (c1 int, c2 string) partitioned by (c3 Int)

If a query casts the partition key, such as:

select * from test where cast(c3 as string) = '0'

one predicate of this query is cast(c3 as string) = '0'.
Spark uses the following case to convert it into a filter:

     case op @ SpecialBinaryComparison(
          ExtractAttribute(NonVarcharAttribute(name)), ExtractableLiteral(value)) =>
        Some(s"$name ${op.symbol} $value")

First, ExtractAttribute.unapply checks whether c3 can be cast to string, and the answer is yes.
Then NonVarcharAttribute is applied to the original attribute. Because the Hive type of c3 is not varchar,
the predicate is converted to the filter c3 = "0" and pushed down.

However, the metastore reports "Filtering is supported only on partition keys of type string", so the pushed-down filter triggers an exception.
This PR additionally checks whether the attribute's Catalyst type is StringType.
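
The pushdown path described above can be sketched with a small self-contained model. This is only an illustration: the types below are simplified stand-ins rather than Spark's real Catalyst classes, the cast check is reduced to "target type is string", and PushdownSketch, convert and quoted are hypothetical names.

```scala
// Simplified stand-ins for Catalyst types; NOT Spark's real classes.
sealed trait DataType
case object IntegerType extends DataType
case object StringType extends DataType

sealed trait Expression
case class Attribute(name: String, dataType: DataType) extends Expression
case class Cast(child: Expression, to: DataType) extends Expression
case class Literal(value: String) extends Expression
case class EqualTo(left: Expression, right: Expression) extends Expression

object PushdownSketch {
  // Hypothetical helper: render a string literal for the filter text.
  private def quoted(v: String): String = "\"" + v + "\""

  // Unwraps a cast to string and returns the attribute underneath,
  // without looking at the attribute's own type.
  object ExtractAttribute {
    def unapply(expr: Expression): Option[Attribute] = expr match {
      case attr: Attribute         => Some(attr)
      case Cast(child, StringType) => unapply(child)
      case _                       => None
    }
  }

  // Converts an equality predicate into a filter string, if possible.
  def convert(pred: Expression): Option[String] = pred match {
    case EqualTo(ExtractAttribute(attr), Literal(v)) =>
      Some(s"${attr.name} = ${quoted(v)}")
    case _ => None
  }
}
```

In this model, with c3 declared as Attribute("c3", IntegerType), calling PushdownSketch.convert(EqualTo(Cast(c3, StringType), Literal("0"))) produces the filter text c3 = "0": a string comparison against an integer key, which is the shape of filter the metastore rejects.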

How was this patch tested?

unit test.

turboFei · 2019-05-23T09:29:19Z

@cloud-fan

cloud-fan · 2019-05-23T10:35:04Z

ok to test

cloud-fan · 2019-05-23T10:35:22Z

can you add a test?

SparkQA · 2019-05-23T11:36:20Z

Test build #105718 has finished for PR 24685 at commit e14a651.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

turboFei · 2019-05-23T12:03:26Z

I will check it.

sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala

SparkQA · 2019-05-23T12:40:48Z

Test build #105721 has finished for PR 24685 at commit 5a587bb.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-05-23T15:58:09Z

Test build #105722 has finished for PR 24685 at commit 6ae2479.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-05-23T17:28:29Z

Test build #105731 has finished for PR 24685 at commit 74654d6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

turboFei · 2019-05-23T17:29:42Z

Test passed. Thanks. @cloud-fan

SparkQA · 2019-05-24T04:33:57Z

Test build #105746 has finished for PR 24685 at commit 570bbb7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-05-24T04:41:32Z

Test build #105747 has finished for PR 24685 at commit 369c48b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

turboFei · 2019-05-27T01:38:26Z

gentle ping @cloud-fan

sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala

cloud-fan · 2019-05-27T08:09:17Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala

      def unapply(expr: Expression): Option[Attribute] = {
        expr match {
-          case attr: Attribute => Some(attr)
+          case attr: Attribute


can't we simply do case attr: Attribute if attr.dataType == StringType?

I tried that, but it fails some tests in PartitionedTablePerfStatsSuite.
It seems that a predicate such as p1 = '0' (p1 is an Int partition key and '0' is an Integer) can filter some partitions, but p1 = "0" (p1 is an Int partition key and "0" is a String) can't be pushed down.
So I added a check here for whether the expression contains a cast-to-string operation.
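
The difference between the two guard placements discussed here can be sketched as follows. This is a hypothetical model, not the code in this PR: the types are simplified stand-ins for Catalyst classes, and GuardSketch, stringOnly and guardCastOnly are names made up for illustration.

```scala
// Simplified stand-ins for Catalyst types; NOT Spark's real classes.
sealed trait DataType
case object IntegerType extends DataType
case object StringType extends DataType

sealed trait Expression
case class Attribute(name: String, dataType: DataType) extends Expression
case class Cast(child: Expression, to: DataType) extends Expression

object GuardSketch {
  // Option A: accept only string-typed attributes, cast or not.
  // This also rejects a bare integer key, so it is never pushed down.
  def stringOnly(expr: Expression): Option[Attribute] = expr match {
    case attr: Attribute if attr.dataType == StringType => Some(attr)
    case Cast(child, StringType)                        => stringOnly(child)
    case _                                              => None
  }

  // Option B: accept any bare attribute, but look through a cast to
  // string only when the attribute underneath is itself a string.
  def guardCastOnly(expr: Expression): Option[Attribute] = expr match {
    case attr: Attribute => Some(attr)
    case Cast(attr: Attribute, StringType) if attr.dataType == StringType =>
      Some(attr)
    case _ => None
  }
}
```

Under these assumptions, a bare integer key is rejected by stringOnly but accepted by guardCastOnly, while a cast of an integer key to string is rejected by both.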

I will remove the !partitionKeys.contains(attr.name) check; it is always false.
I didn't have a good understanding of this.

can you point to the problematic test cases?

For example, in PartitionedTablePerfStatsSuite:

genericTest("lazy partition pruning reads only necessary partition data")

The relevant query is (partCol1 is an Int partition key):

spark.sql("select * from test where partCol1 = 999").count()
assert(HiveCatalogMetrics.METRIC_PARTITIONS_FETCHED.getCount() == 0)

The relevant log is:

5 did not equal 0
ScalaTestFailureLocation: org.apache.spark.sql.hive.PartitionedTablePerfStatsSuite at (PartitionedTablePerfStatsSuite.scala:139)
Expected :0
Actual :5
org.scalatest.exceptions.TestFailedException: 5 did not equal 0

Now I'm a little confused. It seems Hive does support filtering on non-string-type partition columns.

seems we need to revisit #19602

But if you execute a query like

sql("SELECT c1 FROM t1 WHERE CAST(p1 as STRING) = '5'").show

it throws an exception:

Caused by: java.lang.reflect.InvocationTargetException
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:759)
  ... 57 more
Caused by: MetaException(message:Filtering is supported only on partition keys of type string)

SparkQA · 2019-05-27T10:04:27Z

Test build #105827 has finished for PR 24685 at commit 957bf13.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-05-27T10:42:54Z

Test build #105828 has finished for PR 24685 at commit 19b9f6b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2019-11-15T14:20:53Z

Can one of the admins verify this patch?

turboFei changed the title ~~[SPARK-27814] The cast operation may push down uncorrect filter, which is fatal.~~ [SPARK-27814] The cast operation for partitioned column may push down uncorrect filter, which is fatal. May 23, 2019

cloud-fan reviewed May 23, 2019

sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala (outdated)

turboFei force-pushed the SPARK-27814 branch from 5a587bb to 6ae2479 Compare May 23, 2019 14:11

turboFei changed the title ~~[SPARK-27814] The cast operation for partitioned column may push down uncorrect filter, which is fatal.~~ [SPARK-27814] The cast operation for partition key may push down uncorrect filter, which is fatal. May 24, 2019

turboFei force-pushed the SPARK-27814 branch from 570bbb7 to 369c48b Compare May 24, 2019 02:53

turboFei changed the title ~~[SPARK-27814] The cast operation for partition key may push down uncorrect filter, which is fatal.~~ [SPARK-27814][SQL][HIVE] The cast operation for partition key may push down uncorrect filter, which is fatal. May 26, 2019

turboFei changed the title ~~[SPARK-27814][SQL][HIVE] The cast operation for partition key may push down uncorrect filter, which is fatal.~~ [SPARK-27814][SQL] The cast operation for partition key may push down uncorrect filter, which is fatal. May 26, 2019

cloud-fan reviewed May 27, 2019

sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala (outdated)

cloud-fan reviewed May 27, 2019

[SPARK-27814] The cast operation may push down uncorrect filter, which is fatal.

957bf13

turboFei force-pushed the SPARK-27814 branch from 369c48b to 957bf13 Compare May 27, 2019 08:16

fix code

19b9f6b

dongjoon-hyun added the SQL label Jun 14, 2019

turboFei closed this Nov 17, 2019
