
[SPARK-12957][SQL] Initial support for constraint propagation in SparkSQL #10844

Closed
wants to merge 15 commits

Conversation

sameeragarwal
Member

Based on the semantics of a query, we can derive a number of data constraints on the output of each (logical or physical) operator. For instance, if a filter defines 'a > 10, we know that the output data of this filter satisfies two constraints:

  1. 'a > 10
  2. isNotNull('a)

This PR proposes a possible way of keeping track of these constraints and propagating them through the logical plan, which can then help us build more advanced optimizations (such as pruning redundant filters and optimizing joins). We define constraints as a set of (implicitly conjunctive) expressions. For example, if a filter operator has constraints = Set('a > 10, 'b < 100), it is implied that the output satisfies both individual constraints (i.e., 'a > 10 AND 'b < 100).

Design Document: https://docs.google.com/a/databricks.com/document/d/1WQRgDurUBV9Y6CWOBS75PQIqJwT-6WftVa18xzm7nCo/edit?usp=sharing
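To make this concrete, here is a minimal sketch using toy expression types (Attr, Lit, GreaterThan, IsNotNull are stand-ins for illustration, not Catalyst's actual classes) of how a single filter predicate implies a conjunctive set of constraints:

```scala
// Toy sketch: a null-rejecting comparison like 'a > 10 can never evaluate
// to true for a null input, so the filter's output also satisfies isNotNull('a).
sealed trait Expr
case class Attr(name: String) extends Expr
case class Lit(value: Int) extends Expr
case class GreaterThan(left: Expr, right: Expr) extends Expr
case class IsNotNull(child: Expr) extends Expr

// Constraints are an implicitly conjunctive set of expressions.
def filterConstraints(condition: Expr): Set[Expr] = condition match {
  case gt @ GreaterThan(a: Attr, _) => Set(gt, IsNotNull(a))
  case other                        => Set(other)
}

// filterConstraints(GreaterThan(Attr("a"), Lit(10)))
//   ==> Set('a > 10, isNotNull('a))
```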

@SparkQA

SparkQA commented Jan 20, 2016

Test build #49770 has finished for PR 10844 at commit f7251dd.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 20, 2016

Test build #49775 has finished for PR 10844 at commit 04ff99a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Contributor

Could you add the ID of the JIRA ticket to the title?

Could you also add a description explaining why we want this? Seems cool though!

@sameeragarwal sameeragarwal changed the title [WIP][SQL] Initial support for constraint propagation in SparkSQL [WIP][SPARK-12957][SQL] Initial support for constraint propagation in SparkSQL Jan 22, 2016
@sameeragarwal
Member Author

@hvanhovell added, thanks!

@gatorsmile
Member

@sameeragarwal I just saw it. Thank you! I will do the code changes after this is merged.

@SparkQA

SparkQA commented Jan 26, 2016

Test build #50083 has finished for PR 10844 at commit 7fb2f9c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 27, 2016

Test build #50161 has finished for PR 10844 at commit f15ef96.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

self: PlanType =>

def output: Seq[Attribute]

/**
* Extracts the output property from a given child.
*/
def extractConstraintsFromChild(child: QueryPlan[PlanType]): Set[Expression] = {
Contributor


protected?

Also, I'm not sure I get the Scala doc. Maybe getRelevantConstraints is a better name? It is taking the constraints and removing those that don't apply anymore because we removed columns, right?
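For illustration, a rough sketch of what such a renamed helper could look like, assuming Catalyst's existing API (every Expression exposes the attributes it mentions via references, and the plan exposes outputSet):

```scala
// Sketch only, not the merged implementation: keep a constraint only if
// every attribute it references is still present in this operator's output
// (e.g. drop 'b > 0 once a Project has removed column b).
protected def getRelevantConstraints(constraints: Set[Expression]): Set[Expression] =
  constraints.filter(_.references.subsetOf(outputSet))
```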

@sameeragarwal sameeragarwal changed the title [WIP][SPARK-12957][SQL] Initial support for constraint propagation in SparkSQL [SPARK-12957][SQL] Initial support for constraint propagation in SparkSQL Jan 29, 2016
@SparkQA

SparkQA commented Jan 29, 2016

Test build #50371 has finished for PR 10844 at commit 53be837.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sameeragarwal
Member Author

Thanks @marmbrus, all comments addressed!

@SparkQA

SparkQA commented Jan 29, 2016

Test build #50404 has finished for PR 10844 at commit 8c6bb70.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sameeragarwal
Member Author

test this please

* operator. For example, if the output of this operator is column `a`, an example `constraints`
* can be `Set(a > 10, a < 20)`.
*/
lazy val constraints: Set[Expression] = validConstraints
Contributor


I would consider making this getRelevantConstraints(validConstraints) so that each implementor of validConstraints doesn't have to remember to do the filtering / canonicalization. They can just focus on augmenting or passing through constraints from children based on the operator's semantics.

Member Author


Great idea, added.
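For reference, a sketch of the resulting pattern (an approximation of the change; the Set.empty default for operators that add nothing is an assumption):

```scala
// The relevance filter now runs in exactly one place, so implementors of
// validConstraints can focus purely on the operator's semantics.
lazy val constraints: Set[Expression] = getRelevantConstraints(validConstraints)

// Default: no constraints; each operator overrides this to augment or pass
// through its children's constraints.
protected def validConstraints: Set[Expression] = Set.empty
```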

@SparkQA

SparkQA commented Jan 30, 2016

Test build #50417 has finished for PR 10844 at commit 8c6bb70.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sameeragarwal
Member Author

Thanks @marmbrus, all comments addressed!

@SparkQA

SparkQA commented Jan 30, 2016

Test build #50429 has finished for PR 10844 at commit 302444f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
}

def selfJoinResolved: Boolean = left.outputSet.intersect(right.outputSet).isEmpty
Contributor


I think this was a merge mistake, as it duplicates the method below.

Member Author


oops, fixed!

@sameeragarwal
Member Author

@marmbrus comments addressed!

@sameeragarwal
Member Author

test this please

@SparkQA

SparkQA commented Feb 2, 2016

Test build #50518 has finished for PR 10844 at commit b52742a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -301,10 +301,12 @@ abstract class LeafNode extends LogicalPlan {
 /**
  * A logical plan node with single child.
  */
-abstract class UnaryNode extends LogicalPlan {
+abstract class UnaryNode extends LogicalPlan with PredicateHelper {
Contributor


Looks like we do not need PredicateHelper here. Maybe it is better to mix in PredicateHelper only for Filter and Join?

Member Author


Sounds good; added this only for Filter and Join, which need splitConjunctivePredicates.
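As a sketch of why these two operators need the mixin (an approximation, not necessarily the merged code): PredicateHelper provides splitConjunctivePredicates, which breaks a conjunctive condition into its individual constraint expressions.

```scala
// Sketch: Filter mixes in PredicateHelper so a condition like
// 'a > 10 AND 'b < 100 contributes the constraints Set('a > 10, 'b < 100)
// on top of everything already guaranteed by the child.
case class Filter(condition: Expression, child: LogicalPlan)
  extends UnaryNode with PredicateHelper {

  override def output: Seq[Attribute] = child.output

  override protected def validConstraints: Set[Expression] =
    child.constraints.union(splitConjunctivePredicates(condition).toSet)
}
```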

@sameeragarwal
Member Author

comments addressed!

@SparkQA

SparkQA commented Feb 3, 2016

Test build #50603 has finished for PR 10844 at commit 2bd2735.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class QueryPlan[PlanType <: TreeNode[PlanType]] extends TreeNode[PlanType]
    • abstract class UnaryNode extends LogicalPlan
    • case class Filter(condition: Expression, child: LogicalPlan)

@marmbrus
Contributor

marmbrus commented Feb 3, 2016

Thanks! Merging to master.
