Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-23087][SQL] CheckCartesianProduct too restrictive when condition is false/null #20333

Closed
wants to merge 2 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
Expand Up @@ -1108,15 +1108,19 @@ object CheckCartesianProducts extends Rule[LogicalPlan] with PredicateHelper {
*/
def isCartesianProduct(join: Join): Boolean = {
val conditions = join.condition.map(splitConjunctivePredicates).getOrElse(Nil)
!conditions.map(_.references).exists(refs => refs.exists(join.left.outputSet.contains)
&& refs.exists(join.right.outputSet.contains))

conditions match {
case Seq(Literal.FalseLiteral) | Seq(Literal(null, BooleanType)) => false
case _ => !conditions.map(_.references).exists(refs =>
refs.exists(join.left.outputSet.contains) && refs.exists(join.right.outputSet.contains))
}
}

def apply(plan: LogicalPlan): LogicalPlan =
if (SQLConf.get.crossJoinEnabled) {
plan
} else plan transform {
case j @ Join(left, right, Inner | LeftOuter | RightOuter | FullOuter, condition)
case j @ Join(left, right, Inner | LeftOuter | RightOuter | FullOuter, _)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For inner joins, we will not hit this, because it is already optimized to an empty relation. For the other outer join types, we face the exactly same issue as the condition is true. That is, the size of the join result sets is still the same.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you saying that the size of the result set is the same?
If you have a relation A (of size n, let's say 1M rows) in outer join with a relation B (of size m, let's say 1M rows). If the condition is true, the output relation is 1M * 1M (ie. (n * m)); if the condition is false, the result is 1M (n) for a left join, 1M (m) for a right join, 1M + 1M (m +n) for a full outer join. Therefore the size is not the same at all. But maybe you meant something different, am I missing something?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. For outer join, it makes sense to remove this check

if isCartesianProduct(j) =>
throw new AnalysisException(
s"""Detected cartesian product for ${j.joinType.sql} join between logical plans
Expand Down
Expand Up @@ -274,4 +274,18 @@ class DataFrameJoinSuite extends QueryTest with SharedSQLContext {
checkAnswer(innerJoin, Row(1) :: Nil)
}

test("SPARK-23087: don't throw Analysis Exception in CheckCartesianProduct when join condition " +
"is false or null") {
val df = spark.range(10)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

withSQLConf(CROSS_JOINS_ENABLED.key -> "true") {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't it be false?

val dfNull = spark.range(10).select(lit(null).as("b"))
val planNull = df.join(dfNull, $"id" === $"b", "left").queryExecution.analyzed

spark.sessionState.executePlan(planNull).optimizedPlan

val dfOne = df.select(lit(1).as("a"))
val dfTwo = spark.range(10).select(lit(2).as("b"))
val planFalse = dfOne.join(dfTwo, $"a" === $"b", "left").queryExecution.analyzed

spark.sessionState.executePlan(planFalse).optimizedPlan
}
}