Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-34538][SQL] Hive Metastore support filter by not-in #31646

Closed
wants to merge 15 commits into from

Conversation

ulysses-you
Copy link
Contributor

@ulysses-you ulysses-you commented Feb 25, 2021

What changes were proposed in this pull request?

Add Not(In) and Not(InSet) pattern when convert filter to metastore.

Why are the changes needed?

NOT IN is a useful condition to prune partition, it would be better to support it.

Technically, we can convert c not in(x,y) to c != x and c != y, then push it to metastore.

Avoid metastore overflow and respect the config spark.sql.hive.metastorePartitionPruningInSetThreshold, Not(InSet) won't push to metastore if it's value exceeds the threshold.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Add test.

@ulysses-you ulysses-you changed the title [SPARK-33537][SQL] Hive Metastore support filter by not-in [SPARK-34538][SQL] Hive Metastore support filter by not-in Feb 25, 2021
@@ -748,6 +748,10 @@ private[client] class Shim_v0_13 extends Shim_v0_12 {
values.map(value => s"$name = $value").mkString("(", " or ", ")")
}

def convertNotInToAnd(name: String, values: Seq[String]): String = {
values.map(value => s"$name != $value").mkString("(", " and ", ")")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More than 10,000 values will cause the Hive Metastore stack overflow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about stoping push it If it's values size exceeds the threshold ? In this case it can not be covert like >= and <=.

@SparkQA
Copy link

SparkQA commented Feb 25, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40045/

@github-actions github-actions bot added the SQL label Feb 25, 2021
@SparkQA
Copy link

SparkQA commented Feb 25, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40045/

@SparkQA
Copy link

SparkQA commented Feb 25, 2021

Test build #135465 has finished for PR 31646 at commit 3f95df6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -108,6 +108,28 @@ class FiltersSuite extends SparkFunSuite with Logging with PlanTest {
(a("datecol", DateType) =!= Literal(Date.valueOf("2019-01-01"))) :: Nil,
"datecol != 2019-01-01")

filterTest("not-in int filter",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we add a test case for NULL as while, e.g. NOT IN (1, 2, ..., NULL)? Given we have bug before.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it should be, add the null value for test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems exists some issue about null value. Created #31659.

@SparkQA
Copy link

SparkQA commented Feb 26, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40068/

@SparkQA
Copy link

SparkQA commented Feb 26, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40068/

@SparkQA
Copy link

SparkQA commented Feb 26, 2021

Test build #135487 has finished for PR 31646 at commit cf3ba56.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 26, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40095/

@SparkQA
Copy link

SparkQA commented Feb 26, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40095/

@SparkQA
Copy link

SparkQA commented Feb 26, 2021

Test build #135514 has finished for PR 31646 at commit 0415448.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • sealed trait PartitionSpec extends LeafExpression with Unevaluable
  • trait V2PartitionCommand extends Command
  • case class TruncateTable(table: LogicalPlan) extends Command
  • case class TruncatePartition(
  • case class TruncatePartitionExec(

@ulysses-you
Copy link
Contributor Author

@maropu @cloud-fan Do you have time to take a look this, thanks!

@ulysses-you
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Mar 5, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40376/

@SparkQA
Copy link

SparkQA commented Mar 5, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40376/

@SparkQA
Copy link

SparkQA commented Mar 5, 2021

Test build #135794 has finished for PR 31646 at commit 0415448.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • sealed trait PartitionSpec extends LeafExpression with Unevaluable
  • trait V2PartitionCommand extends Command
  • case class TruncateTable(table: LogicalPlan) extends Command
  • case class TruncatePartition(
  • case class TruncatePartitionExec(

@@ -863,7 +863,9 @@ object SQLConf {
.doc("The threshold of set size for InSet predicate when pruning partitions through Hive " +
"Metastore. When the set size exceeds the threshold, we rewrite the InSet predicate " +
"to be greater than or equal to the minimum value in set and less than or equal to the " +
"maximum value in set. Larger values may cause Hive Metastore stack overflow.")
"maximum value in set. Larger values may cause Hive Metastore stack overflow. But for " +
"the predicate of Not InSet which values exceeds the threshold, we won't to push it " +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

won't to push -> won't push

@@ -767,6 +771,10 @@ private[client] class Shim_v0_13 extends Shim_v0_12 {
if useAdvanced =>
Some(convertInToOr(name, values))

case Not(In(ExtractAttribute(SupportedAttribute(name)), ExtractableLiterals(values)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not(a IN (null, 2)): if a = 1, the final result is null.
a != 2: if a = 1, the final result is true

I think this rewrite is incorrect. We need to make sure the values of IN are all not null.

Copy link
Contributor Author

@ulysses-you ulysses-you Mar 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad, missed this. For Not(InSet) case, null value can not be skip safely, it should convert to and(xx != null).

@@ -775,14 +791,25 @@ private[client] class Shim_v0_13 extends Shim_v0_12 {
convert(And(GreaterThanOrEqual(child, Literal(sortedValues.head, dataType)),
LessThanOrEqual(child, Literal(sortedValues.last, dataType))))

case Not(InSet(_, values)) if values.size > inSetThreshold =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we move this to the beginning so that unsupported cases are grouped together?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved.

@SparkQA
Copy link

SparkQA commented Mar 10, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40505/

@SparkQA
Copy link

SparkQA commented Mar 10, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40505/

@SparkQA
Copy link

SparkQA commented Mar 10, 2021

Test build #135915 has finished for PR 31646 at commit 93eedd3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 10, 2021

Test build #135922 has finished for PR 31646 at commit e8c7b6c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -863,7 +863,9 @@ object SQLConf {
.doc("The threshold of set size for InSet predicate when pruning partitions through Hive " +
"Metastore. When the set size exceeds the threshold, we rewrite the InSet predicate " +
"to be greater than or equal to the minimum value in set and less than or equal to the " +
"maximum value in set. Larger values may cause Hive Metastore stack overflow.")
"maximum value in set. Larger values may cause Hive Metastore stack overflow. But for " +
"the predicate of Not InSet which values exceeds the threshold, we won't push it to " +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But for InSet inside Not with values exceeding ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

@@ -418,6 +418,52 @@ class HivePartitionFilteringSuite(version: String)
dateStrValue)
}

test("getPartitionsByFilter: not in chunk") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the effective test that can catch correctness bugs. Does it test null?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The coverage of previous test is not good. I rewrite the test and include you mentioned.

)
check(
Not(In(attr("chunk"), Seq(Literal("aa"), Literal("ab"), Literal(null)))),
chunkValue
Copy link
Contributor

@cloud-fan cloud-fan Mar 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, this test can detect the correctness bug about null handling.

)
}

test("getPartitionsByFilter: not in/inset int type") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think int and string have much difference about this new feature. It should be sufficient to test string type and date type. Same to the FiltersSuite

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed these.

@SparkQA
Copy link

SparkQA commented Mar 11, 2021

Test build #135968 has finished for PR 31646 at commit 17aea47.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 744a73d Mar 11, 2021
@ulysses-you
Copy link
Contributor Author

thanks for review and merging !

@ulysses-you ulysses-you deleted the SPARK-34538 branch March 12, 2021 01:16
domybest11 pushed a commit to domybest11/spark that referenced this pull request Jun 15, 2022
Add `Not(In)` and `Not(InSet)` pattern when convert filter to metastore.

`NOT IN` is a useful condition to prune partition, it would be better to support it.

Technically, we can convert `c not in(x,y)` to `c != x and c != y`, then push it to metastore.

Avoid metastore overflow and respect the config `spark.sql.hive.metastorePartitionPruningInSetThreshold`, `Not(InSet)` won't push to metastore if it's value exceeds the threshold.

No.

Add test.

Closes apache#31646 from ulysses-you/SPARK-34538.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 744a73d)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
5 participants