[SPARK-13249][SQL] Add Filter checking nullability of keys for inner join #11235

viirya · 2016-02-17T06:32:48Z

JIRA: https://issues.apache.org/jira/browse/SPARK-13249

For an inner join, rows whose join key is null never match any row on the other side, so we can insert a Filter that removes them before the join (and that Filter can be pushed down further). The join itself then no longer needs to check the nullability of its keys.
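To make the equivalence concrete, here is a minimal sketch in plain Scala. It is a toy model, not Catalyst code: `Row`, `innerJoin`, and `dropNullKeys` are hypothetical names, and a null key is modelled as `None`.

```scala
// Toy model (not Spark code): a join key that may be NULL is an Option[Int].
object NullKeyJoinSketch {
  final case class Row(key: Option[Int], value: String)

  // Inner equi-join with SQL semantics: a NULL key never equals anything,
  // including another NULL, so rows with key == None produce no output.
  def innerJoin(left: Seq[Row], right: Seq[Row]): Seq[(Int, String, String)] =
    for {
      l <- left
      r <- right
      k <- l.key.toList          // skips left rows whose key is NULL
      if r.key.contains(k)       // false when the right key is NULL
    } yield (k, l.value, r.value)

  // The proposed Filter: keep only rows whose key is not NULL.
  def dropNullKeys(rows: Seq[Row]): Seq[Row] = rows.filter(_.key.isDefined)

  def main(args: Array[String]): Unit = {
    val left  = Seq(Row(Some(1), "a"), Row(None, "b"), Row(Some(2), "c"))
    val right = Seq(Row(Some(1), "x"), Row(None, "y"), Row(Some(3), "z"))

    val direct   = innerJoin(left, right)
    val filtered = innerJoin(dropNullKeys(left), dropNullKeys(right))

    // Filtering null keys first does not change the inner join result.
    assert(direct == filtered)
    println(direct)
  }
}
```

Because the filtered inputs give the same result, the join operator can assume non-null keys once the Filter is in place.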

SparkQA · 2016-02-17T07:45:35Z

Test build #51414 has finished for PR 11235 at commit 132890b.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2016-02-17T10:08:06Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

@@ -23,7 +23,7 @@ import org.apache.spark.sql.catalyst.analysis.{CleanupAliases, EliminateSubQueri
 import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.expressions.Literal.{FalseLiteral, TrueLiteral}
 import org.apache.spark.sql.catalyst.expressions.aggregate._
-import org.apache.spark.sql.catalyst.planning.{ExtractFiltersAndInnerJoins, Unions}
+import org.apache.spark.sql.catalyst.planning._


Nit: would it be better to keep the import explicit, i.e. import org.apache.spark.sql.catalyst.planning.{ExtractEquiJoinKeys, ExtractFiltersAndInnerJoins, Unions}?

maropu · 2016-02-17T10:46:57Z

ISTM we can also add NULL filters on one side of left/right outer joins. Is that wrong?
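A minimal sketch of why only one side qualifies, again in plain Scala as a toy model rather than Catalyst code (`Row`, `leftOuterJoin`, and `dropNullKeys` are hypothetical names). For a left outer join, the left side is preserved, so only the right side can be filtered safely; the right outer case is symmetric.

```scala
// Toy model (not Spark code): a join key that may be NULL is an Option[Int].
object OuterJoinNullFilterSketch {
  final case class Row(key: Option[Int], value: String)

  // Left outer join: every left row survives; a left row with no match
  // (including one whose key is NULL) is paired with None.
  def leftOuterJoin(left: Seq[Row], right: Seq[Row]): Seq[(Row, Option[Row])] =
    left.flatMap { l =>
      val matches = right.filter(r => l.key.isDefined && r.key == l.key)
      if (matches.isEmpty) Seq((l, Option.empty[Row]))
      else matches.map(r => (l, Option(r)))
    }

  def dropNullKeys(rows: Seq[Row]): Seq[Row] = rows.filter(_.key.isDefined)

  def main(args: Array[String]): Unit = {
    val left  = Seq(Row(Some(1), "a"), Row(None, "b"))
    val right = Seq(Row(Some(1), "x"), Row(None, "y"))
    val base  = leftOuterJoin(left, right)

    // Safe: null-keyed rows on the non-preserved (right) side never match.
    assert(leftOuterJoin(left, dropNullKeys(right)) == base)
    // Unsafe: filtering the preserved (left) side drops the row with value "b".
    assert(leftOuterJoin(dropNullKeys(left), right) != base)
  }
}
```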

SparkQA · 2016-02-17T10:57:25Z

Test build #51425 has finished for PR 11235 at commit f8505de.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-02-17T11:30:12Z

Test build #51426 has finished for PR 11235 at commit 7766d17.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class AddFilterOfNullForInnerJoinSuite extends PlanTest

viirya · 2016-02-18T04:15:38Z

retest this please.

SparkQA · 2016-02-18T06:05:50Z

Test build #51468 has finished for PR 11235 at commit 0c14be5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2016-02-19T07:01:10Z

@maropu yeah, I think so.

SparkQA · 2016-02-19T08:29:11Z

Test build #51531 has finished for PR 11235 at commit 4f15778.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2016-02-19T09:50:40Z

ping @davies

SparkQA · 2016-02-21T14:56:52Z

Test build #51631 has finished for PR 11235 at commit aada320.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- abstract class SubqueryExpression extends LeafExpression
- case class ScalarSubquery(
- case class Subquery(name: String, child: SparkPlan) extends UnaryNode
- case class ScalarSubquery(

viirya · 2016-02-22T02:43:39Z

cc @davies @marmbrus @liancheng @rxin

marmbrus · 2016-02-22T18:37:38Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

@@ -57,6 +57,8 @@ abstract class Optimizer extends RuleExecutor[LogicalPlan] {
      ReplaceDistinctWithAggregate) ::
    Batch("Aggregate", FixedPoint(100),
      RemoveLiteralFromGroupExpressions) ::
+    Batch("Join", Once,
+      AddFilterOfNullForInnerJoin) ::


We should make this rule idempotent instead of hacking around it by running it only once. With this implementation you are losing the benefits of emergent optimizations.

I would directly construct the filter in the left/right child, but only when it's not already present in the constraints of the child. This is the whole reason we added the ability to reason about what constraints are already present on a subtree.
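A rough sketch of what an idempotent version could look like, using a toy plan algebra in plain Scala rather than Catalyst's `LogicalPlan`. All names here (`Plan`, `Relation`, `FilterNotNull`, `InnerJoin`, `addNullFilters`) are hypothetical, and constraints are simplified to the set of column names known to be non-null.

```scala
// Toy plan algebra (not Catalyst): constraints = columns known to be non-null.
object IdempotentNullFilterSketch {
  sealed trait Plan { def constraints: Set[String] }

  final case class Relation(name: String) extends Plan {
    def constraints: Set[String] = Set.empty
  }
  final case class FilterNotNull(column: String, child: Plan) extends Plan {
    def constraints: Set[String] = child.constraints + column
  }
  final case class InnerJoin(leftKey: String, rightKey: String,
                             left: Plan, right: Plan) extends Plan {
    def constraints: Set[String] = left.constraints ++ right.constraints
  }

  // Wrap the child in a filter only if the constraint is not already there.
  private def guard(key: String, child: Plan): Plan =
    if (child.constraints.contains(key)) child else FilterNotNull(key, child)

  def addNullFilters(plan: Plan): Plan = plan match {
    case InnerJoin(lk, rk, l, r) =>
      InnerJoin(lk, rk, guard(lk, addNullFilters(l)), guard(rk, addNullFilters(r)))
    case FilterNotNull(c, child) => FilterNotNull(c, addNullFilters(child))
    case r: Relation             => r
  }

  def main(args: Array[String]): Unit = {
    val plan  = InnerJoin("a.id", "b.id", Relation("a"), Relation("b"))
    val once  = addNullFilters(plan)
    val twice = addNullFilters(once)
    assert(once != plan)   // the first pass adds the filters
    assert(once == twice)  // the second pass changes nothing
  }
}
```

Since a second application leaves the plan unchanged, a rule written this way can sit in a fixed-point batch without stacking duplicate filters.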

SparkQA · 2016-02-25T11:46:32Z

Test build #51959 has finished for PR 11235 at commit 687e948.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-02-25T11:52:08Z

Test build #51960 has finished for PR 11235 at commit 88f5020.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-02-26T05:42:25Z

Test build #52021 has finished for PR 11235 at commit 8291831.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-02-26T08:55:02Z

Test build #52028 has finished for PR 11235 at commit bf4777c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2016-02-26T14:48:10Z

@marmbrus I've addressed the comments. Please see if this change is appropriate. Thanks!

viirya · 2016-02-29T13:00:59Z

ping @marmbrus @davies @rxin

viirya · 2016-03-02T03:13:49Z

@marmbrus @davies @rxin any comments for this? Thanks!

viirya · 2016-03-05T12:54:18Z

@liancheng Can you review this too? I think I've addressed previous comments. Thanks!

viirya · 2016-03-07T02:35:51Z

ping @marmbrus @rxin @davies @liancheng Is this ready to go? Or do you have other comments? Thanks!

davies · 2016-03-07T18:10:34Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

@@ -57,6 +57,8 @@ abstract class Optimizer extends RuleExecutor[LogicalPlan] {
      ReplaceDistinctWithAggregate) ::
    Batch("Aggregate", FixedPoint(100),
      RemoveLiteralFromGroupExpressions) ::
+    Batch("Join", FixedPoint(100),


OuterJoinElimination may produce more inner joins; should we move this rule after it?

If we can make this rule idempotent, we don't need to put it in a separate batch.

SparkQA · 2016-03-08T03:04:00Z

Test build #52617 has finished for PR 11235 at commit 312cb32.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2016-03-08T10:46:47Z

Comments addressed, please check if the change is good to merge now. Thanks!

davies · 2016-03-08T17:56:36Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

+object AddNullFilterForEquiJoin extends Rule[LogicalPlan] with PredicateHelper {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+    case ExtractEquiJoinKeys(Inner, leftKeys, rightKeys, condition, left, right) =>
+      val leftConditions = leftKeys.distinct.map { l =>


We should only add the predicate if the key is nullable and there is no IsNotNull constraint on the key.
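As a sketch of that condition, in plain Scala with hypothetical names (`Key`, `needsNullFilter`, `filtersToAdd`) and with the child's IsNotNull constraints simplified to a set of column names:

```scala
// Toy model (not Catalyst): decide which join keys still need a null filter.
object NullFilterConditionSketch {
  final case class Key(name: String, nullable: Boolean)

  // A filter is needed only when the key can be NULL and the child does not
  // already guarantee that it is non-null.
  def needsNullFilter(key: Key, notNullConstraints: Set[String]): Boolean =
    key.nullable && !notNullConstraints.contains(key.name)

  // Render the predicates to add, skipping duplicate keys.
  def filtersToAdd(keys: Seq[Key], notNullConstraints: Set[String]): Seq[String] =
    keys.distinct
      .filter(needsNullFilter(_, notNullConstraints))
      .map(k => s"isnotnull(${k.name})")

  def main(args: Array[String]): Unit = {
    val keys = Seq(Key("id", nullable = true), Key("pk", nullable = false),
                   Key("ts", nullable = true), Key("id", nullable = true))
    // "ts" is already constrained and "pk" is not nullable, so only "id" remains.
    println(filtersToAdd(keys, Set("ts")))
  }
}
```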

davies · 2016-03-09T00:49:40Z

I did not realize that we had merged https://github.com/apache/spark/pull/11372/files. Do we still need this?

viirya · 2016-03-09T01:00:50Z

@davies yea, looks like it is doing the same. Let me close this now. Thanks for reviewing this anyway!

viirya added 2 commits February 17, 2016 06:31

Add Filter checking nullability of keys for inner join.

216305f

Add comment.

132890b

viirya added 2 commits February 17, 2016 09:41

Use correct expresion.

f8505de

Add test.

7766d17

maropu reviewed Feb 17, 2016

Take care of broadcasthint. Also remove nullability check in codgen B…

0c14be5

…roadcastHashJoin.

viirya force-pushed the add-filter-for-innerjoin branch from 045121e to 0c14be5 Compare February 18, 2016 03:41

Merge remote-tracking branch 'upstream/master' into add-filter-for-in…

4f15778

…nerjoin Conflicts: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoin.scala

Merge remote-tracking branch 'upstream/master' into add-filter-for-in…

aada320

…nerjoin Conflicts: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

marmbrus reviewed Feb 22, 2016

viirya added 3 commits February 25, 2016 08:39

Merge remote-tracking branch 'upstream/master' into add-filter-for-in…

7308a08

…nerjoin

Check constraints to see if it is needed to add filters.

687e948

Revert previous changes.

88f5020

viirya added 2 commits February 26, 2016 03:21

Merge remote-tracking branch 'upstream/master' into add-filter-for-in…

96c7093

…nerjoin

Fix test.

8291831

Fix test.

bf4777c

davies reviewed Mar 7, 2016

viirya added 2 commits March 8, 2016 08:47

Merge remote-tracking branch 'upstream/master' into add-filter-for-in…

2c92f90

…nerjoin

Address comments.

312cb32

davies reviewed Mar 8, 2016

viirya closed this Mar 9, 2016

viirya deleted the add-filter-for-innerjoin branch December 27, 2023 18:19
