Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-23564][SQL] infer additional filters from constraints for join's children #21083

Closed
wants to merge 4 commits into from

Conversation

cloud-fan
Copy link
Contributor

What changes were proposed in this pull request?

The existing query constraints framework has 2 steps:

  1. propagate constraints bottom up.
  2. use constraints to infer additional filters for better data pruning.

For step 2, it mostly helps with Join, because we can connect the constraints from children to the join condition and infer powerful filters to prune the data of the join sides. e.g., the left side has constraints a = 1, the join condition is left.a = right.a, then we can infer right.a = 1 to the right side and prune the right side a lot.

However, the current logic of inferring filters from constraints for Join is pretty weak. It infers the filters from Join's constraints. Some joins like left semi/anti exclude output from right side and the right side constraints will be lost here.

This PR propose to check the left and right constraints individually, expand the constraints with join condition and add filters to children of join directly, instead of adding to the join condition.

This reverts #20670 , covers #20717 and #20816

This is inspired by the original PRs and the tests are all from these PRs. Thanks to the authors @mgaido91 @maryannxue @KaiXinXiaoLei !

How was this patch tested?

new tests

@cloud-fan
Copy link
Contributor Author

@maryannxue
Copy link
Contributor

cc @mgaido91 @maryannxue @KaiXinXiaoLei @gatorsmile @jiangxb1987 @gengliangwang: I do not think this is the right way to do things, @cloud-fan. Looks like you have been aware of my and others' work like #20816, you could have, or I'd say, should have, given your input/suggestions on related PRs. People who have worked on this should deserve more credit than being mentioned as "inspired" here.

@SparkQA
Copy link

SparkQA commented Apr 17, 2018

Test build #89431 has finished for PR 21083 at commit b967955.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 17, 2018

Test build #89432 has finished for PR 21083 at commit 371c1df.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor Author

It's a totally different approach to do additional filters inference from constraints for join, and I'm not sure if it works until I finished this PR. That's the main reason I didn't post my suggestion to the related PRs.

I actually don't care about the credites and was planning to give credites to @mgaido91 since it's mostly inspired by the discussion with him. However someone mentioned your PR to me at the last minute, I pulled in your tests and found it's already covered. Then I was a little hesitate about who should get the credites and didn't mention it in the PR desription.

If you don't mind, are you OK with giving credites to @mgaido91 ? Thanks for your understanding!

@mgaido91
Copy link
Contributor

@cloud-fan the approach itself seems OK to me. Indeed, I prefer this one over the previous status where we had constraints enforced on the output and allConstraints containing also the others: I think it was pretty confusing and I think that moving those additional constraints to the optimizer is the right thing to do. Anyway, I will review it more carefully in the next days, asap.

As far as the credit is regarded, there is no issue for me. You can give credit to @maryannxue and @KaiXinXiaoLei.

I think I can close #20717 then, since it is going to be covered here, am I right?

@SparkQA
Copy link

SparkQA commented Apr 17, 2018

Test build #89435 has finished for PR 21083 at commit 561db44.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -54,13 +51,42 @@ trait QueryPlanConstraints { self: LogicalPlan =>
* See [[Canonicalize]] for more details.
*/
protected def validConstraints: Set[Expression] = Set.empty
}

object ConstraintsUtils {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: in order to follow the pattern I see also for PredicateHelper, what about having a ConstraintHelper trait instead of this object?

baseConstraints.union(ConstraintsUtils.inferAdditionalConstraints(baseConstraints))
}

private def inferNewFilter(plan: LogicalPlan, constraints: Set[Expression]): LogicalPlan = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may we return here the additional constraints instead of the new plan, so that in L669 and similar we can copy the plan only if it is needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copying the plan is very cheap.

@maryannxue
Copy link
Contributor

Thank you for you reply, @cloud-fan! I was not clear when you had become aware of the effort on SPARK-21479 so it might be a misunderstanding on my side and I apologize. Anyway, if you had had a closer look at the PR, you would have probably got the idea that it's basically the same approach as what you have here, only that you have covered more join types.
Here's another note. There's two types of constraint-to-filter inference for joins going on here:

  1. Infer from the Join node constraints, which is covered by the PushPredicateThroughJoin rule;
  2. Infer from the sibling child node combined with the join condition, which is what you've added here.
    That said, the InnerLike joins should already be covered by 1 and might not be worth being considered again in this optimization rule. Not sure about LeftSemi joins, so it would be nice if we could have a test case that proves this optimization does something that has not yet been covered by PushPredicateThroughJoin rule.

@cloud-fan
Copy link
Contributor Author

That said, the InnerLike joins should already be covered by 1 and might not be worth being considered again in this optimization rule.

Previously the InferFiltersFromConstraints adds the additional filters to join condition, so inner joins are covered by 1. Here I changed this rule to directly add additional filters to join children, so inner joins also need to be considered here.

@SparkQA
Copy link

SparkQA commented Apr 18, 2018

Test build #89480 has finished for PR 21083 at commit 787cddf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait QueryPlanConstraints extends ConstraintHelper
  • trait ConstraintHelper

@cloud-fan cloud-fan changed the title [SPARK-21479][SPARK-23564][SQL] infer additional filters from constraints for join's children [SPARK-23564][SQL] infer additional filters from constraints for join's children Apr 23, 2018
@cloud-fan
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Apr 23, 2018

Test build #89701 has finished for PR 21083 at commit 787cddf.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait QueryPlanConstraints extends ConstraintHelper
  • trait ConstraintHelper

@cloud-fan
Copy link
Contributor Author

retest this please


private def inferNewFilter(plan: LogicalPlan, constraints: Set[Expression]): LogicalPlan = {
val newPredicates = constraints
.union(constructIsNotNullConstraints(constraints, plan.output))
Copy link
Member

@gengliangwang gengliangwang Apr 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we put the code

constraints
    .union(inferAdditionalConstraints(constraints))
    .union(constructIsNotNullConstraints(constraints, output))
    .filter { c =>
      c.references.nonEmpty && c.references.subsetOf(outputSet) && c.deterministic
    }

into the a function in ConstraintHelper

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

case _: InnerLike | LeftSemi =>
          val allConstraints = getAllConstraints(left, right, conditionOpt)
          val newLeft = inferNewFilter(left, allConstraints)
          val newRight = inferNewFilter(right, allConstraints)
          join.copy(left = newLeft, right = newRight)

For this pattern, if we reuse the code you mentioned, we need to do constraints expanding twice, for left and right.

Copy link
Member

@gengliangwang gengliangwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except one comment.

@mgaido91
Copy link
Contributor

LGTM

@jiangxb1987
Copy link
Contributor

lgtm

@SparkQA
Copy link

SparkQA commented Apr 23, 2018

Test build #89705 has finished for PR 21083 at commit 787cddf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait QueryPlanConstraints extends ConstraintHelper
  • trait ConstraintHelper

@cloud-fan
Copy link
Contributor Author

thanks, merging to master!

@asfgit asfgit closed this in d87d30e Apr 23, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants