-
Notifications
You must be signed in to change notification settings - Fork 28.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-34538][SQL] Hive Metastore support filter by not-in #31646
Conversation
@@ -748,6 +748,10 @@ private[client] class Shim_v0_13 extends Shim_v0_12 { | |||
values.map(value => s"$name = $value").mkString("(", " or ", ")") | |||
} | |||
|
|||
def convertNotInToAnd(name: String, values: Seq[String]): String = { | |||
values.map(value => s"$name != $value").mkString("(", " and ", ")") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More than 10,000 values will cause the Hive Metastore stack overflow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about stoping push it If it's values size exceeds the threshold ? In this case it can not be covert like >= and <=
.
Kubernetes integration test starting |
Kubernetes integration test status success |
Test build #135465 has finished for PR 31646 at commit
|
@@ -108,6 +108,28 @@ class FiltersSuite extends SparkFunSuite with Logging with PlanTest { | |||
(a("datecol", DateType) =!= Literal(Date.valueOf("2019-01-01"))) :: Nil, | |||
"datecol != 2019-01-01") | |||
|
|||
filterTest("not-in int filter", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we add a test case for NULL
as while, e.g. NOT IN (1, 2, ..., NULL)? Given we have bug before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it should be, add the null value for test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems exists some issue about null value. Created #31659.
Kubernetes integration test starting |
Kubernetes integration test status success |
Test build #135487 has finished for PR 31646 at commit
|
Kubernetes integration test starting |
Kubernetes integration test status success |
Test build #135514 has finished for PR 31646 at commit
|
@maropu @cloud-fan Do you have time to take a look this, thanks! |
retest this please |
Kubernetes integration test starting |
Kubernetes integration test status success |
Test build #135794 has finished for PR 31646 at commit
|
@@ -863,7 +863,9 @@ object SQLConf { | |||
.doc("The threshold of set size for InSet predicate when pruning partitions through Hive " + | |||
"Metastore. When the set size exceeds the threshold, we rewrite the InSet predicate " + | |||
"to be greater than or equal to the minimum value in set and less than or equal to the " + | |||
"maximum value in set. Larger values may cause Hive Metastore stack overflow.") | |||
"maximum value in set. Larger values may cause Hive Metastore stack overflow. But for " + | |||
"the predicate of Not InSet which values exceeds the threshold, we won't to push it " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
won't to push
-> won't push
@@ -767,6 +771,10 @@ private[client] class Shim_v0_13 extends Shim_v0_12 { | |||
if useAdvanced => | |||
Some(convertInToOr(name, values)) | |||
|
|||
case Not(In(ExtractAttribute(SupportedAttribute(name)), ExtractableLiterals(values))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not(a IN (null, 2))
: if a = 1, the final result is null.
a != 2
: if a = 1, the final result is true
I think this rewrite is incorrect. We need to make sure the values of IN are all not null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad, missed this. For Not(InSet)
case, null value can not be skip safely, it should convert to and(xx != null)
.
@@ -775,14 +791,25 @@ private[client] class Shim_v0_13 extends Shim_v0_12 { | |||
convert(And(GreaterThanOrEqual(child, Literal(sortedValues.head, dataType)), | |||
LessThanOrEqual(child, Literal(sortedValues.last, dataType)))) | |||
|
|||
case Not(InSet(_, values)) if values.size > inSetThreshold => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we move this to the beginning so that unsupported cases are grouped together?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved.
sql/hive/src/test/scala/org/apache/spark/sql/hive/client/FiltersSuite.scala
Show resolved
Hide resolved
Kubernetes integration test starting |
Kubernetes integration test status failure |
Test build #135915 has finished for PR 31646 at commit
|
Test build #135922 has finished for PR 31646 at commit
|
@@ -863,7 +863,9 @@ object SQLConf { | |||
.doc("The threshold of set size for InSet predicate when pruning partitions through Hive " + | |||
"Metastore. When the set size exceeds the threshold, we rewrite the InSet predicate " + | |||
"to be greater than or equal to the minimum value in set and less than or equal to the " + | |||
"maximum value in set. Larger values may cause Hive Metastore stack overflow.") | |||
"maximum value in set. Larger values may cause Hive Metastore stack overflow. But for " + | |||
"the predicate of Not InSet which values exceeds the threshold, we won't push it to " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But for InSet inside Not with values exceeding ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
sql/hive/src/test/scala/org/apache/spark/sql/hive/client/FiltersSuite.scala
Show resolved
Hide resolved
@@ -418,6 +418,52 @@ class HivePartitionFilteringSuite(version: String) | |||
dateStrValue) | |||
} | |||
|
|||
test("getPartitionsByFilter: not in chunk") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the effective test that can catch correctness bugs. Does it test null?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The coverage of previous test is not good. I rewrite the test and include you mentioned.
) | ||
check( | ||
Not(In(attr("chunk"), Seq(Literal("aa"), Literal("ab"), Literal(null)))), | ||
chunkValue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea, this test can detect the correctness bug about null handling.
) | ||
} | ||
|
||
test("getPartitionsByFilter: not in/inset int type") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think int and string have much difference about this new feature. It should be sufficient to test string type and date type. Same to the FiltersSuite
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed these.
Test build #135968 has finished for PR 31646 at commit
|
thanks, merging to master! |
thanks for review and merging ! |
Add `Not(In)` and `Not(InSet)` pattern when convert filter to metastore. `NOT IN` is a useful condition to prune partition, it would be better to support it. Technically, we can convert `c not in(x,y)` to `c != x and c != y`, then push it to metastore. Avoid metastore overflow and respect the config `spark.sql.hive.metastorePartitionPruningInSetThreshold`, `Not(InSet)` won't push to metastore if it's value exceeds the threshold. No. Add test. Closes apache#31646 from ulysses-you/SPARK-34538. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 744a73d)
What changes were proposed in this pull request?
Add
Not(In)
andNot(InSet)
pattern when convert filter to metastore.Why are the changes needed?
NOT IN
is a useful condition to prune partition, it would be better to support it.Technically, we can convert
c not in(x,y)
toc != x and c != y
, then push it to metastore.Avoid metastore overflow and respect the config
spark.sql.hive.metastorePartitionPruningInSetThreshold
,Not(InSet)
won't push to metastore if it's value exceeds the threshold.Does this PR introduce any user-facing change?
No.
How was this patch tested?
Add test.