[SPARK-13249][SQL] Add Filter checking nullability of keys for inner join #11235

Closed
wants to merge 15 commits into from

Member

viirya commented Feb 17, 2016

JIRA: https://issues.apache.org/jira/browse/SPARK-13249

For an inner join, rows whose join key is null can never match, so we can insert a Filter that removes null keys below the inner join (where it can be pushed down further). The join itself then no longer needs to check the keys for null.
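
A minimal sketch of why such a filter is safe, using plain Scala collections rather than Spark. `Row`, `innerJoin`, and `dropNullKeys` are hypothetical names chosen for illustration, and `None` is assumed to stand in for SQL NULL:

```scala
// Toy model of SQL equi-join semantics (no Spark dependency).
// A key of None stands for SQL NULL; NULL = NULL is not true, so such rows never match.
object NullKeyInnerJoin {
  type Row = (Option[Int], String)

  // Inner equi-join: a pair matches only when both keys are present and equal.
  def innerJoin(left: Seq[Row], right: Seq[Row]): Seq[(String, String)] =
    for {
      l <- left
      r <- right
      if l._1.isDefined && l._1 == r._1
    } yield (l._2, r._2)

  // The proposed pre-join filter: drop rows whose join key is NULL.
  def dropNullKeys(rows: Seq[Row]): Seq[Row] = rows.filter(_._1.isDefined)

  val left: Seq[Row]  = Seq((Some(1), "a"), (None, "b"), (Some(2), "c"))
  val right: Seq[Row] = Seq((Some(1), "x"), (None, "y"), (Some(3), "z"))

  // Join without and with the null-key filter applied to both inputs.
  val plain: Seq[(String, String)]    = innerJoin(left, right)
  val filtered: Seq[(String, String)] = innerJoin(dropNullKeys(left), dropNullKeys(right))
}
```

In this model `plain` and `filtered` are equal, which is the property the proposed rule relies on: the filter only removes rows that could not have produced output anyway.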

SparkQA commented Feb 17, 2016

Test build #51414 has finished for PR 11235 at commit 132890b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -23,7 +23,7 @@ import org.apache.spark.sql.catalyst.analysis.{CleanupAliases, EliminateSubQueri
 import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.expressions.Literal.{FalseLiteral, TrueLiteral}
 import org.apache.spark.sql.catalyst.expressions.aggregate._
-import org.apache.spark.sql.catalyst.planning.{ExtractFiltersAndInnerJoins, Unions}
+import org.apache.spark.sql.catalyst.planning._
Member

Nit: would it be better to write import org.apache.spark.sql.catalyst.planning.{ExtractEquiJoinKeys, ExtractFiltersAndInnerJoins, Unions} instead?

Member

maropu commented Feb 17, 2016

ISTM we can also add NULL filters on one side of left/right outer joins. Is that wrong?
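
A small sketch of the outer-join case on plain Scala collections, assuming `None` models SQL NULL. `leftOuterJoin` and `dropNullKeys` are hypothetical helpers, not Spark APIs. For a left outer join, the filter appears safe only on the right (non-preserved) side:

```scala
// Toy model of a left outer equi-join (no Spark dependency).
object NullKeyLeftOuterJoin {
  type Row = (Option[Int], String)

  // Every left row is kept; a left row without a match is paired with None.
  def leftOuterJoin(left: Seq[Row], right: Seq[Row]): Seq[(String, Option[String])] =
    left.flatMap { l =>
      val matches = right.filter(r => l._1.isDefined && l._1 == r._1)
      if (matches.isEmpty) Seq((l._2, Option.empty[String]))
      else matches.map(r => (l._2, Option(r._2)))
    }

  // Drop rows whose join key is NULL.
  def dropNullKeys(rows: Seq[Row]): Seq[Row] = rows.filter(_._1.isDefined)

  val left: Seq[Row]  = Seq((Some(1), "a"), (None, "b"), (Some(2), "c"))
  val right: Seq[Row] = Seq((Some(1), "x"), (None, "y"), (Some(3), "z"))

  val plain         = leftOuterJoin(left, right)
  // Filtering the right side leaves the result unchanged.
  val rightFiltered = leftOuterJoin(left, dropNullKeys(right))
  // Filtering the left side drops the NULL-keyed left row, which changes the result.
  val leftFiltered  = leftOuterJoin(dropNullKeys(left), right)
}
```

By symmetry, a right outer join would allow the filter only on its left input.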

SparkQA commented Feb 17, 2016

Test build #51425 has finished for PR 11235 at commit f8505de.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Feb 17, 2016

Test build #51426 has finished for PR 11235 at commit 7766d17.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class AddFilterOfNullForInnerJoinSuite extends PlanTest

Member Author

viirya commented Feb 18, 2016

retest this please.

SparkQA commented Feb 18, 2016

Test build #51468 has finished for PR 11235 at commit 0c14be5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member Author

viirya commented Feb 19, 2016

@maropu yeah, I think so.

…nerjoin

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoin.scala

SparkQA commented Feb 19, 2016

Test build #51531 has finished for PR 11235 at commit 4f15778.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member Author

viirya commented Feb 19, 2016

ping @davies

…nerjoin

Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

SparkQA commented Feb 21, 2016

Test build #51631 has finished for PR 11235 at commit aada320.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class SubqueryExpression extends LeafExpression
    • case class ScalarSubquery(
    • case class Subquery(name: String, child: SparkPlan) extends UnaryNode
    • case class ScalarSubquery(

Member Author

viirya commented Feb 22, 2016

cc @davies @marmbrus @liancheng @rxin

@@ -57,6 +57,8 @@ abstract class Optimizer extends RuleExecutor[LogicalPlan] {
       ReplaceDistinctWithAggregate) ::
     Batch("Aggregate", FixedPoint(100),
       RemoveLiteralFromGroupExpressions) ::
+    Batch("Join", Once,
+      AddFilterOfNullForInnerJoin) ::
Contributor

We should make this rule idempotent instead of hacking it by making it run only once. You are losing the benefits of emergent optimizations with this implementation.

I would directly construct the filter in the left/right child, but only when it's not already present in the constraints of the child. This is the whole reason we added the ability to reason about what constraints are already present on a subtree.
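
One way to picture the suggestion is the toy sketch below. The `Plan` types and the `constraints` function are simplified stand-ins invented for illustration, not the Catalyst classes, and the rule is assumed to guard one key per side:

```scala
// Toy logical plan with a constraint check that makes the rule idempotent.
object IdempotentNullFilter {
  // Minimal stand-ins for logical plan nodes; these are NOT Catalyst classes.
  sealed trait Plan
  case class Relation(name: String) extends Plan
  case class NotNullFilter(column: String, child: Plan) extends Plan
  case class InnerJoin(leftKey: String, rightKey: String, left: Plan, right: Plan) extends Plan

  // Columns already known to be non-null in the output of a subtree.
  def constraints(plan: Plan): Set[String] = plan match {
    case Relation(_)               => Set.empty
    case NotNullFilter(col, child) => constraints(child) + col
    case InnerJoin(_, _, l, r)     => constraints(l) ++ constraints(r)
  }

  // Wrap the child in a filter only when the constraint is not already implied.
  def ensureNotNull(key: String, child: Plan): Plan =
    if (constraints(child).contains(key)) child else NotNullFilter(key, child)

  // The rule: rewrite children first, then guard both join keys.
  def addNullFilters(plan: Plan): Plan = plan match {
    case InnerJoin(lk, rk, l, r) =>
      InnerJoin(lk, rk, ensureNotNull(lk, addNullFilters(l)), ensureNotNull(rk, addNullFilters(r)))
    case NotNullFilter(col, child) => NotNullFilter(col, addNullFilters(child))
    case rel: Relation             => rel
  }

  val join: Plan  = InnerJoin("a.id", "b.id", Relation("a"), Relation("b"))
  val once: Plan  = addNullFilters(join)
  val twice: Plan = addNullFilters(once)
}
```

Because the second application finds both constraints already present, `twice` equals `once`, so a rule of this shape could sit in a fixed-point batch without stacking duplicate filters.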

SparkQA commented Feb 25, 2016

Test build #51959 has finished for PR 11235 at commit 687e948.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Feb 25, 2016

Test build #51960 has finished for PR 11235 at commit 88f5020.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Feb 26, 2016

Test build #52021 has finished for PR 11235 at commit 8291831.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Feb 26, 2016

Test build #52028 has finished for PR 11235 at commit bf4777c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member Author

viirya commented Feb 26, 2016

@marmbrus I've addressed the comments. Please see if this change is appropriate. Thanks!

Member Author

viirya commented Feb 29, 2016

ping @marmbrus @davies @rxin

Member Author

viirya commented Mar 2, 2016

@marmbrus @davies @rxin any comments on this? Thanks!

Member Author

viirya commented Mar 5, 2016

@liancheng Can you review this too? I think I've addressed previous comments. Thanks!

Member Author

viirya commented Mar 7, 2016

ping @marmbrus @rxin @davies @liancheng Is this ready to go, or do you have other comments? Thanks!

@@ -57,6 +57,8 @@ abstract class Optimizer extends RuleExecutor[LogicalPlan] {
       ReplaceDistinctWithAggregate) ::
     Batch("Aggregate", FixedPoint(100),
       RemoveLiteralFromGroupExpressions) ::
+    Batch("Join", FixedPoint(100),
Contributor

We may get more inner joins from OuterJoinElimination; should we move this rule after it?

Contributor

If we can make this rule idempotent, we don't need to put it in a separate group.

SparkQA commented Mar 8, 2016

Test build #52617 has finished for PR 11235 at commit 312cb32.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member Author

viirya commented Mar 8, 2016

Comments addressed, please check if the change is good to merge now. Thanks!

object AddNullFilterForEquiJoin extends Rule[LogicalPlan] with PredicateHelper {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case ExtractEquiJoinKeys(Inner, leftKeys, rightKeys, condition, left, right) =>
      val leftConditions = leftKeys.distinct.map { l =>
Contributor

We should only add the predicate if the key is nullable and there is no IsNotNull constraint on the key.

Contributor

davies commented Mar 9, 2016

I did not realize that we had merged https://github.com/apache/spark/pull/11372/files. Do we still need this?

Member Author

viirya commented Mar 9, 2016

@davies yea, looks like it is doing the same. Let me close this now. Thanks for reviewing this anyway!

viirya closed this Mar 9, 2016
viirya deleted the add-filter-for-innerjoin branch December 27, 2023 18:19