-
Notifications
You must be signed in to change notification settings - Fork 28.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-13495][SQL] Add Null Filters in the query plan for Filters/Joins based on their data constraints #11372
Conversation
Test build #51980 has finished for PR 11372 at commit
|
06d74da
to
2345075
Compare
Test build #51994 has finished for PR 11372 at commit
|
.reduce(And) | ||
Filter(reorderedCondition, child) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am wondering if the optimizer is the right place for this rule. My main concern is that if we can preserve this ordering through the rest of query compilation. Will it be better to do it inside the physical Filter operator (just before we start to generate the code)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, that sounds like a good idea! Could there be any other downside of not doing it in the optimizer? /cc @nongli
2345075
to
c08b7fb
Compare
c08b7fb
to
28050b3
Compare
Test build #52338 has finished for PR 11372 at commit
|
Test build #52383 has finished for PR 11372 at commit
|
Test build #52406 has finished for PR 11372 at commit
|
Test build #52416 has finished for PR 11372 at commit
|
test this please |
Test build #52431 has finished for PR 11372 at commit
|
@@ -586,6 +587,52 @@ object NullPropagation extends Rule[LogicalPlan] { | |||
} | |||
|
|||
/** | |||
* Attempts to eliminate reading (unnecessary) NULL values if they are not required for correctness | |||
* by inserting isNotNull filters is the query plan. These filters are currently inserted beneath |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"in the query plan"
Thanks @nongli, all comments addressed. |
Test build #52494 has finished for PR 11372 at commit
|
LGTM |
…ns based on their data constraints ## What changes were proposed in this pull request? This PR adds an optimizer rule to eliminate reading (unnecessary) NULL values if they are not required for correctness by inserting `isNotNull` filters is the query plan. These filters are currently inserted beneath existing `Filter` and `Join` operators and are inferred based on their data constraints. Note: While this optimization is applicable to all types of join, it primarily benefits `Inner` and `LeftSemi` joins. ## How was this patch tested? 1. Added a new `NullFilteringSuite` that tests for `IsNotNull` filters in the query plan for joins and filters. Also, tests interaction with the `CombineFilters` optimizer rules. 2. Test generated ExpressionTrees via `OrcFilterSuite` 3. Test filter source pushdown logic via `SimpleTextHadoopFsRelationSuite` cc yhuai nongli Author: Sameer Agarwal <sameer@databricks.com> Closes apache#11372 from sameeragarwal/gen-isnotnull.
What changes were proposed in this pull request?
This PR adds an optimizer rule to eliminate reading (unnecessary) NULL values if they are not required for correctness by inserting
isNotNull
filters is the query plan. These filters are currently inserted beneath existingFilter
andJoin
operators and are inferred based on their data constraints.Note: While this optimization is applicable to all types of join, it primarily benefits
Inner
andLeftSemi
joins.How was this patch tested?
NullFilteringSuite
that tests forIsNotNull
filters in the query plan for joins and filters. Also, tests interaction with theCombineFilters
optimizer rules.OrcFilterSuite
SimpleTextHadoopFsRelationSuite
cc @yhuai @nongli