New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-27814][SQL] The cast operation for partition key may push down uncorrect filter, which is fatal. #24685
Conversation
ok to test |
can you add a test? |
Test build #105718 has finished for PR 24685 at commit
|
I will check it. |
sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala
Outdated
Show resolved
Hide resolved
Test build #105721 has finished for PR 24685 at commit
|
Test build #105722 has finished for PR 24685 at commit
|
Test build #105731 has finished for PR 24685 at commit
|
Test passed. Thanks. @cloud-fan |
Test build #105746 has finished for PR 24685 at commit
|
Test build #105747 has finished for PR 24685 at commit
|
gentle ping @cloud-fan |
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala
Outdated
Show resolved
Hide resolved
def unapply(expr: Expression): Option[Attribute] = { | ||
expr match { | ||
case attr: Attribute => Some(attr) | ||
case attr: Attribute |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't we simply do case attr: Attribute if attr.dataType == StringType
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used to test that, but it can't pass some tests in PartitionedTablePerfStatsSuite
.
It seems that such as p1 = '0'
(p1 is a Int partition key and '0' is an Integer) can filter some partitions, but p1 = "0"
(p1 is a Int partition key and "0" is a String) can't be pushed down.
So, I add the prediction here to judge whether it has a castToString operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will remove the !partitionKeys.contains(attr.name)
, it is false always.
I did't have a good understanding about this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you point to the problematic test cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Such as(PartitionedTablePerfStatsSuite) :
genericTest("lazy partition pruning reads only necessary partition data")
Relative query is( partCol1 is an Int type partition key):
spark.sql("select * from test where partCol1 = 999").count()
assert(HiveCatalogMetrics.METRIC_PARTITIONS_FETCHED.getCount() == 0)
The relative log is
5 did not equal 0
ScalaTestFailureLocation: org.apache.spark.sql.hive.PartitionedTablePerfStatsSuite at (PartitionedTablePerfStatsSuite.scala:139)
Expected :0
Actual :5
<Click to see difference>
org.scalatest.exceptions.TestFailedException: 5 did not equal 0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now I'm a little confused. Seems hive does support to filter non-string-type partition columns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems we need to revisit #19602
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But if you execute a sql likes
sql("SELECT c1 FROM t1 WHERE CAST(p1 as STRING) = '5'").show
It will throw an exception:
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:759)
... 57 more
Caused by: MetaException(message:Filtering is supported only on partition keys of type string)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok.
Test build #105827 has finished for PR 24685 at commit
|
Test build #105828 has finished for PR 24685 at commit
|
Can one of the admins verify this patch? |
What changes were proposed in this pull request?
For a partitioned table, such as:
If we use a cast operation in query, which casts the partition key, such as :
One predication of this query is
cast(c3 as string) = ’0‘
.It would invoke this method to convert to a filter.
First, it invokes the ExtractAttribute.unapply to judge whether c3 can be casted to string, the result is yes.
Then it would invoke the origin NonVarcharAttribute, because the hivevar type of c3 is not varchar,
this prediction will be converted to filter
c3 = "0"
, and pushed down.But, Filtering is supported only on partition keys of type string, so it would trigger an exception.
In this PR, I judge whether the attribute's catalyst type is StringType additionally.
How was this patch tested?
unit test.