[SPARK-17698] [SQL] Join predicates should not contain filter clauses #15272

tejasapatil · 2016-09-28T01:38:25Z

What changes were proposed in this pull request?

Jira : https://issues.apache.org/jira/browse/SPARK-17698

ExtractEquiJoinKeys incorrectly uses filter predicates as join keys. canEvaluate [0] checks whether an Expression can be evaluated using the output of a given plan. For a filter predicate (e.g. a.id='1'), the Expression passed for the right-hand side (i.e. '1') is a Literal, which has no attribute references. Thus expr.references is an empty set, which is trivially a subset of any set. As a result, canEvaluate returns true and a.id='1' is treated as a join predicate. This does not lead to incorrect results, but for bucketed + sorted tables we might miss the chance to avoid an unnecessary shuffle + sort. See the example below:

[0] : https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala#L91
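
The empty-set behaviour described above can be reproduced without Spark. The following is a minimal sketch, assuming toy stand-ins (`Attr`, `Lit`, and plain `Set[String]` outputs are hypothetical, not Catalyst classes) and a `canEvaluate` that performs only the subset test quoted in this PR:

```scala
object CanEvaluateDemo {
  // Hypothetical stand-ins for Catalyst expressions; only `references` matters here.
  sealed trait Expr { def references: Set[String] }
  case class Attr(name: String) extends Expr { def references: Set[String] = Set(name) }
  case class Lit(value: Any) extends Expr { def references: Set[String] = Set.empty }

  // Same shape as the check in predicates.scala: a plain subset test.
  def canEvaluate(expr: Expr, output: Set[String]): Boolean =
    expr.references.subsetOf(output)

  val leftOutput: Set[String] = Set("a.id")
  val rightOutput: Set[String] = Set("b.id")

  // a.id = b.id: each side can be evaluated on exactly one child.
  val genuineKey: Boolean =
    canEvaluate(Attr("a.id"), leftOutput) && canEvaluate(Attr("b.id"), rightOutput)

  // a.id = '1': the literal references nothing, so the subset test holds vacuously
  // and the predicate is indistinguishable from a real equi-join key.
  val literalLooksLikeKey: Boolean =
    canEvaluate(Attr("a.id"), leftOutput) && canEvaluate(Lit("1"), rightOutput)
}
```

Both `genuineKey` and `literalLooksLikeKey` evaluate to `true` in this sketch, which is the confusion the PR sets out to remove.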

eg.

val df = (1 until 10).toDF("id").coalesce(1)
hc.sql("DROP TABLE IF EXISTS table1").collect
df.write.bucketBy(8, "id").sortBy("id").saveAsTable("table1")
hc.sql("DROP TABLE IF EXISTS table2").collect
df.write.bucketBy(8, "id").sortBy("id").saveAsTable("table2")

sqlContext.sql("""
  SELECT a.id, b.id
  FROM table1 a
  FULL OUTER JOIN table2 b
  ON a.id = b.id AND a.id='1' AND b.id='1'
""").explain(true)

BEFORE: This does a shuffle + sort over the table scan outputs, which is not needed because both tables are bucketed and sorted on the same column and have the same number of buckets. This should be a single-stage job.

SortMergeJoin [id#38, cast(id#38 as double), 1.0], [id#39, 1.0, cast(id#39 as double)], FullOuter
:- *Sort [id#38 ASC NULLS FIRST, cast(id#38 as double) ASC NULLS FIRST, 1.0 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(id#38, cast(id#38 as double), 1.0, 200)
:     +- *FileScan parquet default.table1[id#38] Batched: true, Format: ParquetFormat, InputPaths: file:spark-warehouse/table1, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int>
+- *Sort [id#39 ASC NULLS FIRST, 1.0 ASC NULLS FIRST, cast(id#39 as double) ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(id#39, 1.0, cast(id#39 as double), 200)
      +- *FileScan parquet default.table2[id#39] Batched: true, Format: ParquetFormat, InputPaths: file:spark-warehouse/table2, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int>

AFTER :

SortMergeJoin [id#32], [id#33], FullOuter, ((cast(id#32 as double) = 1.0) && (cast(id#33 as double) = 1.0))
:- *FileScan parquet default.table1[id#32] Batched: true, Format: ParquetFormat, InputPaths: file:spark-warehouse/table1, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int>
+- *FileScan parquet default.table2[id#33] Batched: true, Format: ParquetFormat, InputPaths: file:spark-warehouse/table2, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int>

How was this patch tested?

Added a new test case for this scenario : SPARK-17698 Join predicates should not contain filter clauses
Ran all the tests in BucketedReadSuite

SparkQA · 2016-09-28T03:58:43Z

Test build #66010 has finished for PR 15272 at commit b65c926.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2016-09-29T06:52:45Z

cc @cloud-fan

cloud-fan · 2016-09-29T13:02:10Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala

@@ -88,7 +88,7 @@ trait PredicateHelper {
   * `false`.
   */
  protected def canEvaluate(expr: Expression, plan: LogicalPlan): Boolean =
-    expr.references.subsetOf(plan.outputSet)
+    !expr.references.isEmpty && expr.references.subsetOf(plan.outputSet)


This does fix the problem with minimal code changes, but it doesn't match the semantics of canEvaluate, as literals CAN be evaluated on any plan.

How about we fix ExtractEquiJoinKeys instead?

@cloud-fan : OK. Made that change.
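
One way to picture the alternative suggested here is to leave `canEvaluate` untouched and have the key extraction skip equalities where one side references no attributes. The following is only a sketch under that assumption, not the merged patch: `Attr`, `Lit`, `EqualTo`, `joinKey`, and `extract` are hypothetical toy definitions, and plan outputs are modelled as `Set[String]`.

```scala
object EquiJoinKeySketch {
  // Hypothetical stand-ins for Catalyst expressions.
  sealed trait Expr { def references: Set[String] }
  case class Attr(name: String) extends Expr { def references: Set[String] = Set(name) }
  case class Lit(value: Any) extends Expr { def references: Set[String] = Set.empty }
  case class EqualTo(left: Expr, right: Expr) extends Expr {
    def references: Set[String] = left.references ++ right.references
  }

  // Semantics kept as-is: a literal can be evaluated on any plan.
  def canEvaluate(expr: Expr, output: Set[String]): Boolean =
    expr.references.subsetOf(output)

  // Returns the (leftKey, rightKey) pair if `e` is a genuine equi-join predicate.
  def joinKey(e: Expr, leftOut: Set[String], rightOut: Set[String]): Option[(Expr, Expr)] =
    e match {
      // An equality with a reference-free side is a filter, never a join key.
      case EqualTo(l, r) if l.references.isEmpty || r.references.isEmpty => None
      case EqualTo(l, r) if canEvaluate(l, leftOut) && canEvaluate(r, rightOut) => Some((l, r))
      case EqualTo(l, r) if canEvaluate(l, rightOut) && canEvaluate(r, leftOut) => Some((r, l))
      case _ => None
    }

  // Splits the conjuncts of a join condition into join keys and remaining predicates.
  def extract(
      conjuncts: Seq[Expr],
      leftOut: Set[String],
      rightOut: Set[String]): (Seq[(Expr, Expr)], Seq[Expr]) = {
    val keys = conjuncts.flatMap(c => joinKey(c, leftOut, rightOut))
    val others = conjuncts.filter(c => joinKey(c, leftOut, rightOut).isEmpty)
    (keys, others)
  }
}
```

Applied to the three conjuncts of the example query (a.id = b.id, a.id = '1', b.id = '1'), this sketch yields a single key pair and two remaining predicates, which mirrors the shape of the AFTER plan in the description.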

tejasapatil · 2016-10-11T02:48:04Z

Jenkins test this please

SparkQA · 2016-10-11T04:19:23Z

Test build #66705 has finished for PR 15272 at commit e9f9378.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2016-10-13T04:56:41Z

hm looks like another legitimate failing test too

tejasapatil · 2016-10-13T15:06:58Z

@rxin : Yes. I looked at it but could not find the root cause. I have been busy with other stuff so could not invest more time. I plan to get this fixed over the weekend.

tejasapatil · 2016-10-16T00:25:28Z

Jenkins test this please

SparkQA · 2016-10-16T02:49:13Z

Test build #67023 has finished for PR 15272 at commit d84a15f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

tejasapatil · 2016-10-16T03:45:46Z

@cloud-fan + @rxin : Fixed the test case. Ready for review.

cloud-fan · 2016-10-20T12:18:51Z

sql/hive/src/test/scala/org/apache/spark/sql/sources/BucketedReadSuite.scala

+      bucketSpecRight = bucketSpec,
+      joinType = "fullouter",
+      joinCondition = (left: DataFrame, right: DataFrame) => {
+        val joinPredicates = Seq("i").map(col => left(col) === right(col)).reduce(_ && _)


isn't it just val joinPredicates = left(col) === right(col)?

yea @tejasapatil mind fixing this? We can merge it then.

Actually @tejasapatil given there is a need for backport, I'd let you fix this in your other prs since this is fairly cosmetic.
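
The point of this review comment is that `reduce` over a one-element sequence returns that element without ever calling the combining function. A small sketch, assuming strings in place of Spark `Column` objects (`equalOn` is a hypothetical helper standing in for `left(col) === right(col)`):

```scala
object SingleKeyReduce {
  // Stand-in for building `left(col) === right(col)`; returns a readable string.
  def equalOn(col: String): String = s"left.$col = right.$col"

  // With one column, the combiner is never invoked, so this equals equalOn("i").
  val viaReduce: String = Seq("i").map(equalOn).reduce((a, b) => s"($a AND $b)")
  val direct: String = equalOn("i")

  // The map/reduce form only pays off once there are several join columns.
  val twoColumns: String = Seq("i", "j").map(equalOn).reduce((a, b) => s"($a AND $b)")
}
```

In the test under review only the column "i" is used, so the direct form is equivalent and easier to read.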

cloud-fan · 2016-10-20T12:20:08Z

LGTM

rxin · 2016-10-20T16:49:40Z

Actually let me just merge it. I will fix this in my other pr.

rxin · 2016-10-20T16:51:50Z

Merged in master.

@tejasapatil can you create a backport for branch-2.0?

tejasapatil · 2016-10-22T20:30:38Z

@rxin : Here is the backport for 2.0 branch: #15600

This is a backport of #15272 to 2.0 branch.

Jira : https://issues.apache.org/jira/browse/SPARK-17698

Author: Tejas Patil <tejasp@fb.com>

Closes #15600 from tejasapatil/SPARK-17698_2.0_backport.

Author: Tejas Patil <tejasp@fb.com>

Closes apache#15272 from tejasapatil/SPARK-17698_join_predicate_filter_clause.

[SPARK-17698] [SQL] Join predicates should not contain filter clauses

b65c926

cloud-fan reviewed Sep 29, 2016

review comment

e9f9378

ignore literals for otherPredicates

d84a15f

cloud-fan reviewed Oct 20, 2016

asfgit closed this in fb0894b Oct 20, 2016

tejasapatil deleted the SPARK-17698_join_predicate_filter_clause branch October 22, 2016 19:35

tejasapatil mentioned this pull request Oct 22, 2016

[SPARK-17698] [SQL] Join predicates should not contain filter clauses #15600

Closed
