Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-27814][SQL] The cast operation for partition key may push down uncorrect filter, which is fatal. #24685

Closed
wants to merge 2 commits into from

Conversation

turboFei
Copy link
Member

@turboFei turboFei commented May 23, 2019

What changes were proposed in this pull request?

For a partitioned table, such as:

table test (c1 int, c2 string) partitioned by (c3 Int)

If we use a cast operation in query, which casts the partition key, such as :

select * from test where (cast c3 as string)  = '0'

One predication of this query is cast(c3 as string) = ’0‘.
It would invoke this method to convert to a filter.

     case op @ SpecialBinaryComparison(
          ExtractAttribute(NonVarcharAttribute(name)), ExtractableLiteral(value)) =>
        Some(s"$name ${op.symbol} $value")

First, it invokes the ExtractAttribute.unapply to judge whether c3 can be casted to string, the result is yes.
Then it would invoke the origin NonVarcharAttribute, because the hivevar type of c3 is not varchar,
this prediction will be converted to filter c3 = "0", and pushed down.

But, Filtering is supported only on partition keys of type string, so it would trigger an exception.
In this PR, I judge whether the attribute's catalyst type is StringType additionally.

How was this patch tested?

unit test.

@turboFei turboFei changed the title [SPARK-27814] The cast operation may push down uncorrect filter, which is fatal. [SPARK-27814] The cast operation for partitioned column may push down uncorrect filter, which is fatal. May 23, 2019
@turboFei
Copy link
Member Author

@cloud-fan

@cloud-fan
Copy link
Contributor

ok to test

@cloud-fan
Copy link
Contributor

can you add a test?

@SparkQA
Copy link

SparkQA commented May 23, 2019

Test build #105718 has finished for PR 24685 at commit e14a651.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@turboFei
Copy link
Member Author

I will check it.

@SparkQA
Copy link

SparkQA commented May 23, 2019

Test build #105721 has finished for PR 24685 at commit 5a587bb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 23, 2019

Test build #105722 has finished for PR 24685 at commit 6ae2479.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 23, 2019

Test build #105731 has finished for PR 24685 at commit 74654d6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@turboFei
Copy link
Member Author

turboFei commented May 23, 2019

Test passed. Thanks. @cloud-fan

@turboFei turboFei changed the title [SPARK-27814] The cast operation for partitioned column may push down uncorrect filter, which is fatal. [SPARK-27814] The cast operation for partition key may push down uncorrect filter, which is fatal. May 24, 2019
@SparkQA
Copy link

SparkQA commented May 24, 2019

Test build #105746 has finished for PR 24685 at commit 570bbb7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 24, 2019

Test build #105747 has finished for PR 24685 at commit 369c48b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@turboFei turboFei changed the title [SPARK-27814] The cast operation for partition key may push down uncorrect filter, which is fatal. [SPARK-27814][SQL][HIVE] The cast operation for partition key may push down uncorrect filter, which is fatal. May 26, 2019
@turboFei turboFei changed the title [SPARK-27814][SQL][HIVE] The cast operation for partition key may push down uncorrect filter, which is fatal. [SPARK-27814][SQL] The cast operation for partition key may push down uncorrect filter, which is fatal. May 26, 2019
@turboFei
Copy link
Member Author

gentle ping @cloud-fan

def unapply(expr: Expression): Option[Attribute] = {
expr match {
case attr: Attribute => Some(attr)
case attr: Attribute
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't we simply do case attr: Attribute if attr.dataType == StringType?

Copy link
Member Author

@turboFei turboFei May 27, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used to test that, but it can't pass some tests in PartitionedTablePerfStatsSuite.
It seems that such as p1 = '0' (p1 is a Int partition key and '0' is an Integer) can filter some partitions, but p1 = "0" (p1 is a Int partition key and "0" is a String) can't be pushed down.
So, I add the prediction here to judge whether it has a castToString operation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will remove the !partitionKeys.contains(attr.name), it is false always.
I did't have a good understanding about this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you point to the problematic test cases?

Copy link
Member Author

@turboFei turboFei May 27, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Such as(PartitionedTablePerfStatsSuite) :

genericTest("lazy partition pruning reads only necessary partition data")

Relative query is( partCol1 is an Int type partition key):

          spark.sql("select * from test where partCol1 = 999").count()
          assert(HiveCatalogMetrics.METRIC_PARTITIONS_FETCHED.getCount() == 0)

The relative log is

5 did not equal 0
ScalaTestFailureLocation: org.apache.spark.sql.hive.PartitionedTablePerfStatsSuite at (PartitionedTablePerfStatsSuite.scala:139)
Expected :0
Actual   :5
<Click to see difference>

org.scalatest.exceptions.TestFailedException: 5 did not equal 0

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now I'm a little confused. Seems hive does support to filter non-string-type partition columns.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems we need to revisit #19602

Copy link
Member Author

@turboFei turboFei May 27, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if you execute a sql likes

sql("SELECT c1 FROM t1 WHERE CAST(p1 as STRING) = '5'").show

It will throw an exception:

Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:759)
	... 57 more
Caused by: MetaException(message:Filtering is supported only on partition keys of type string)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok.

@SparkQA
Copy link

SparkQA commented May 27, 2019

Test build #105827 has finished for PR 24685 at commit 957bf13.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 27, 2019

Test build #105828 has finished for PR 24685 at commit 19b9f6b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@turboFei turboFei closed this Nov 17, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
5 participants