
[SPARK-13495][SQL] Add Null Filters in the query plan for Filters/Joins based on their data constraints #11372

Closed
wants to merge 7 commits from sameeragarwal/gen-isnotnull

Conversation

sameeragarwal
Member

What changes were proposed in this pull request?

This PR adds an optimizer rule to eliminate reading (unnecessary) NULL values if they are not required for correctness by inserting isNotNull filters in the query plan. These filters are currently inserted beneath existing Filter and Join operators and are inferred based on their data constraints.

Note: While this optimization is applicable to all types of joins, it primarily benefits Inner and LeftSemi joins.
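
For illustration, a rule along these lines could be sketched as shown below. This is a minimal, hypothetical sketch and not the code in this PR: the object name `NullFilteringSketch` is made up, it only handles the `Filter` case (not `Join`), and it folds the inferred predicates into the existing condition rather than inserting a separate filter beneath the operator.

```scala
import org.apache.spark.sql.catalyst.expressions.{And, IsNotNull, PredicateHelper}
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical sketch, not the PR's actual rule: use the plan's `constraints`
// to find IsNotNull predicates implied by a Filter (e.g. `a > 1` implies
// `IsNotNull(a)`) and add any that are not already enforced.
object NullFilteringSketch extends Rule[LogicalPlan] with PredicateHelper {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case filter @ Filter(condition, child) =>
      // Predicates already guaranteed by the child or present in the condition.
      val alreadyEnforced = child.constraints ++ splitConjunctivePredicates(condition)
      val inferred = filter.constraints
        .filter(_.isInstanceOf[IsNotNull])
        .filterNot(c => alreadyEnforced.contains(c))
      if (inferred.isEmpty) filter
      else Filter(And(inferred.reduce(And), condition), child)
  }
}
```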

How was this patch tested?

  1. Added a new NullFilteringSuite that tests for IsNotNull filters in the query plan for joins and filters. It also tests the interaction with the CombineFilters optimizer rule (a rough sketch of such a test follows this list).
  2. Tested the generated ExpressionTrees via OrcFilterSuite.
  3. Tested the filter source pushdown logic via SimpleTextHadoopFsRelationSuite.
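
For item 1, a test could look roughly like the sketch below. It reuses the hypothetical `NullFilteringSketch` rule from the sketch above together with the Catalyst test DSL; the actual `NullFilteringSuite` in this PR may be organized quite differently.

```scala
import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.dsl.plans._
import org.apache.spark.sql.catalyst.expressions.IsNotNull
import org.apache.spark.sql.catalyst.plans.PlanTest
import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.RuleExecutor

// Hypothetical test sketch, not the PR's NullFilteringSuite.
class NullFilteringSketchSuite extends PlanTest {
  object Optimize extends RuleExecutor[LogicalPlan] {
    val batches = Batch("Null Filtering", FixedPoint(10), NullFilteringSketch) :: Nil
  }

  test("a filter on a > 1 implies IsNotNull(a)") {
    val relation = LocalRelation('a.int, 'b.int)
    val optimized = Optimize.execute(relation.where('a > 1).analyze)
    // The rule should place IsNotNull(a) in front of the original predicate.
    val expected = relation.where(IsNotNull('a) && 'a > 1).analyze
    comparePlans(optimized, expected)
  }
}
```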

cc @yhuai @nongli

@SparkQA

SparkQA commented Feb 25, 2016

Test build #51980 has finished for PR 11372 at commit 06d74da.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 26, 2016

Test build #51994 has finished for PR 11372 at commit 2345075.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

.reduce(And)
Filter(reorderedCondition, child)
}
}
Contributor

I am wondering if the optimizer is the right place for this rule. My main concern is whether we can preserve this ordering through the rest of query compilation. Would it be better to do it inside the physical Filter operator (just before we start to generate the code)?

Member Author

Yes, that sounds like a good idea! Could there be any other downside to not doing it in the optimizer? /cc @nongli
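
For context, the reordering discussed in this thread boils down to evaluating the cheap IsNotNull conjuncts before the remaining predicates so that null rows are rejected early, whether that happens in the optimizer or in the physical Filter operator. A minimal sketch of that idea (the object and method names are assumptions, not the PR's code):

```scala
import org.apache.spark.sql.catalyst.expressions.{And, Expression, IsNotNull, PredicateHelper}

// Hypothetical sketch: rebuild a conjunctive condition so IsNotNull checks come
// first, mirroring the `.reduce(And)` / `Filter(reorderedCondition, child)` shape
// in the hunk above.
object PredicateReorderingSketch extends PredicateHelper {
  def reorder(condition: Expression): Expression = {
    val (nullChecks, others) =
      splitConjunctivePredicates(condition).partition(_.isInstanceOf[IsNotNull])
    (nullChecks ++ others).reduce(And)
  }
}
```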

@sameeragarwal sameeragarwal changed the title [WIP][SPARK-13495][SQL] Add Null Filters in the query plan for Filters/Joins based on their data constraints [SPARK-13495][SQL] Add Null Filters in the query plan for Filters/Joins based on their data constraints Mar 2, 2016
@SparkQA

SparkQA commented Mar 2, 2016

Test build #52338 has finished for PR 11372 at commit 28050b3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class PlanTest extends SparkFunSuite with PredicateHelper

@SparkQA

SparkQA commented Mar 3, 2016

Test build #52383 has finished for PR 11372 at commit 2a469e8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 3, 2016

Test build #52406 has finished for PR 11372 at commit 80dab7e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 4, 2016

Test build #52416 has finished for PR 11372 at commit 013f97a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sameeragarwal
Member Author

test this please

@SparkQA

SparkQA commented Mar 4, 2016

Test build #52431 has finished for PR 11372 at commit 013f97a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -586,6 +587,52 @@ object NullPropagation extends Rule[LogicalPlan] {
}

/**
* Attempts to eliminate reading (unnecessary) NULL values if they are not required for correctness
* by inserting isNotNull filters is the query plan. These filters are currently inserted beneath
Contributor

"in the query plan"

@sameeragarwal
Member Author

Thanks @nongli, all comments addressed.

@SparkQA

SparkQA commented Mar 5, 2016

Test build #52494 has finished for PR 11372 at commit 31b1700.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nongli
Contributor

nongli commented Mar 7, 2016

LGTM

@asfgit asfgit closed this in ef77003 Mar 7, 2016
roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016
[SPARK-13495][SQL] Add Null Filters in the query plan for Filters/Joins based on their data constraints

## What changes were proposed in this pull request?

This PR adds an optimizer rule to eliminate reading (unnecessary) NULL values if they are not required for correctness by inserting `isNotNull` filters in the query plan. These filters are currently inserted beneath existing `Filter` and `Join` operators and are inferred based on their data constraints.

Note: While this optimization is applicable to all types of joins, it primarily benefits `Inner` and `LeftSemi` joins.

## How was this patch tested?

1. Added a new `NullFilteringSuite` that tests for `IsNotNull` filters in the query plan for joins and filters. It also tests the interaction with the `CombineFilters` optimizer rule.
2. Tested the generated ExpressionTrees via `OrcFilterSuite`.
3. Tested the filter source pushdown logic via `SimpleTextHadoopFsRelationSuite`.

cc yhuai nongli

Author: Sameer Agarwal <sameer@databricks.com>

Closes apache#11372 from sameeragarwal/gen-isnotnull.