[SPARK-13235] [SQL] Removed an Extra Distinct from the Plan when Using Union in SQL #11120

gatorsmile · 2016-02-08T21:04:41Z

Currently, the parser adds two Distinct operators to the plan when Union or Union Distinct is used in SQL. This PR removes the extra Distinct from the plan.

For example, before the fix, the following query produces a plan with two Distinct operators:

sql("select * from t0 union select * from t0").explain(true)

== Parsed Logical Plan ==
'Project [unresolvedalias(*,None)]
+- 'Subquery u_2
   +- 'Distinct
      +- 'Project [unresolvedalias(*,None)]
         +- 'Subquery u_1
            +- 'Distinct
               +- 'Union
                  :- 'Project [unresolvedalias(*,None)]
                  :  +- 'UnresolvedRelation `t0`, None
                  +- 'Project [unresolvedalias(*,None)]
                     +- 'UnresolvedRelation `t0`, None

== Analyzed Logical Plan ==
id: bigint
Project [id#16L]
+- Subquery u_2
   +- Distinct
      +- Project [id#16L]
         +- Subquery u_1
            +- Distinct
               +- Union
                  :- Project [id#16L]
                  :  +- Subquery t0
                  :     +- Relation[id#16L] ParquetRelation
                  +- Project [id#16L]
                     +- Subquery t0
                        +- Relation[id#16L] ParquetRelation

== Optimized Logical Plan ==
Aggregate [id#16L], [id#16L]
+- Aggregate [id#16L], [id#16L]
   +- Union
      :- Project [id#16L]
      :  +- Relation[id#16L] ParquetRelation
      +- Project [id#16L]
         +- Relation[id#16L] ParquetRelation

After the fix, the plan no longer contains the extra Distinct:

== Parsed Logical Plan ==
'Project [unresolvedalias(*,None)]
+- 'Subquery u_1
   +- 'Distinct
      +- 'Union
         :- 'Project [unresolvedalias(*,None)]
         :  +- 'UnresolvedRelation `t0`, None
         +- 'Project [unresolvedalias(*,None)]
            +- 'UnresolvedRelation `t0`, None

== Analyzed Logical Plan ==
id: bigint
Project [id#17L]
+- Subquery u_1
   +- Distinct
      +- Union
         :- Project [id#16L]
         :  +- Subquery t0
         :     +- Relation[id#16L] ParquetRelation
         +- Project [id#16L]
            +- Subquery t0
               +- Relation[id#16L] ParquetRelation

== Optimized Logical Plan ==
Aggregate [id#17L], [id#17L]
+- Union
   :- Project [id#16L]
   :  +- Relation[id#16L] ParquetRelation
   +- Project [id#16L]
      +- Relation[id#16L] ParquetRelation
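
As a rough illustration of why the second Distinct is redundant, here is a minimal, self-contained sketch. The `Plan`, `Relation`, `Union`, and `Distinct` types are hypothetical stand-ins written for this example; they are not Catalyst's classes, and the sample rows are made up. The sketch only shows that stacking Distinct on Distinct yields the same rows as a single Distinct over the Union.

```scala
object DistinctSketch {
  // Hypothetical mini logical plan (assumption: NOT Catalyst's classes).
  sealed trait Plan
  case class Relation(rows: Seq[Long]) extends Plan
  case class Union(left: Plan, right: Plan) extends Plan
  case class Distinct(child: Plan) extends Plan

  // Evaluate a plan to its rows; Union keeps duplicates, Distinct drops them.
  def eval(plan: Plan): Seq[Long] = plan match {
    case Relation(rows)     => rows
    case Union(left, right) => eval(left) ++ eval(right)
    case Distinct(child)    => eval(child).distinct
  }

  // Count the Distinct operators anywhere in the plan.
  def countDistinct(plan: Plan): Int = plan match {
    case Relation(_)        => 0
    case Union(left, right) => countDistinct(left) + countDistinct(right)
    case Distinct(child)    => 1 + countDistinct(child)
  }

  // Made-up contents for t0, only for illustration.
  val t0: Plan = Relation(Seq(1L, 2L, 2L, 3L))

  // Shape of the plan before the fix: two stacked Distinct operators.
  val before: Plan = Distinct(Distinct(Union(t0, t0)))

  // Shape of the plan after the fix: a single Distinct.
  val after: Plan = Distinct(Union(t0, t0))

  def main(args: Array[String]): Unit = {
    println(s"before: ${countDistinct(before)} Distinct, rows = ${eval(before)}")
    println(s"after:  ${countDistinct(after)} Distinct, rows = ${eval(after)}")
  }
}
```

Both shapes evaluate to the same rows, so the outer Distinct in the "before" shape only adds work. The Subquery and Project nodes from the real plans are left out of the sketch to keep it short.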

SparkQA · 2016-02-08T22:50:19Z

Test build #50938 has finished for PR 11120 at commit 2e78d2c.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- s\"Unable to generate an encoder for inner class$`
- case class AssertNotNull(child: Expression, walkedTypePath: Seq[String])
- case class ReturnAnswer(child: LogicalPlan) extends UnaryNode
- case class CollectLimit(limit: Int, child: SparkPlan) extends UnaryNode
- trait BaseLimit extends UnaryNode
- case class LocalLimit(limit: Int, child: SparkPlan) extends BaseLimit
- case class GlobalLimit(limit: Int, child: SparkPlan) extends BaseLimit
- case class TakeOrderedAndProject(

gatorsmile · 2016-02-08T23:19:11Z

This change is in the parser. @hvanhovell Could you please take a look? Thanks!

hvanhovell · 2016-02-09T06:38:44Z

@gatorsmile This looks pretty solid. Could you add a test for this to CatalystQlSuite? I'll have a better look at the grammar change tonight.

viirya · 2016-02-09T12:58:14Z

sql/catalyst/src/main/antlr3/org/apache/spark/sql/catalyst/parser/SparkSqlParser.g

@@ -2358,34 +2358,8 @@ setOpSelectStatement[CommonTree t, boolean topLevel]
    u=setOperator LPAREN b=simpleSelectStatement RPAREN
    |
    u=setOperator b=simpleSelectStatement)
-   -> {$setOpSelectStatement.tree != null && $u.tree.getType()==SparkSqlParser.TOK_UNIONDISTINCT}?


Yes, this is redundant. Originally TOK_UNIONALL was used here, so an additional Distinct was necessary. Since we use TOK_UNIONDISTINCT now, we can skip it.
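
To make the reasoning above concrete, here is a small sketch under stated assumptions. `SetOp`, `translate`, and `rewriteWithSelectDistinct` are hypothetical names invented for this example; they only model the idea that a UNION DISTINCT token already becomes a Distinct over a Union, while the removed grammar rewrite wrapped the result in one more Distinct. The real grammar and AST translation are more involved.

```scala
object SetOpSketch {
  // Hypothetical stand-ins for the parser's set-operator tokens.
  sealed trait SetOp
  case object UnionAll extends SetOp      // models TOK_UNIONALL
  case object UnionDistinct extends SetOp // models TOK_UNIONDISTINCT

  // Hypothetical mini logical plan (assumption: NOT Catalyst's classes).
  sealed trait Plan
  case class Table(name: String) extends Plan
  case class Union(left: Plan, right: Plan) extends Plan
  case class Distinct(child: Plan) extends Plan

  // Assumed translation step: UNION DISTINCT carries its own Distinct.
  def translate(op: SetOp, left: Plan, right: Plan): Plan = op match {
    case UnionAll      => Union(left, right)
    case UnionDistinct => Distinct(Union(left, right))
  }

  // Models the removed grammar rewrite: wrap the set operation in an
  // additional Distinct on top of whatever the inner token produces.
  def rewriteWithSelectDistinct(inner: SetOp, left: Plan, right: Plan): Plan =
    Distinct(translate(inner, left, right))

  val t0: Plan = Table("t0")

  // Rewrite over TOK_UNIONALL: the extra Distinct is the only one, so it is needed.
  val rewriteOverUnionAll: Plan = rewriteWithSelectDistinct(UnionAll, t0, t0)

  // Rewrite over TOK_UNIONDISTINCT: the extra Distinct duplicates the inner one.
  val rewriteOverUnionDistinct: Plan = rewriteWithSelectDistinct(UnionDistinct, t0, t0)

  // Without the rewrite: TOK_UNIONDISTINCT alone yields a single Distinct.
  val withoutRewrite: Plan = translate(UnionDistinct, t0, t0)
}
```

In this model, dropping the rewrite for the UNION DISTINCT token gives the same plan shape that the rewrite over UNION ALL used to give, which matches the point that the extra Distinct is no longer needed.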

gatorsmile · 2016-02-09T14:42:16Z

@hvanhovell @viirya Thank you for your reviews!

I just added two test cases for this, as suggested by @hvanhovell.

Yeah, I did check the original Hive JIRA: https://issues.apache.org/jira/browse/HIVE-9039 . Their parser added this because they do not add another Distinct above Union All.

SparkQA · 2016-02-09T16:10:08Z

Test build #50975 has finished for PR 11120 at commit a5e81f4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

hvanhovell · 2016-02-09T21:08:21Z

LGTM

viirya · 2016-02-10T00:09:08Z

LGTM

hvanhovell · 2016-02-11T07:41:50Z

Merging to master. Thanks!

gatorsmile added 2 commits February 8, 2016 12:58

removed an extra Distinct.

644e0da

Merge remote-tracking branch 'upstream/master' into unionDistinctToSQL

2e78d2c

gatorsmile changed the title ~~[SPARK-13235] [SQL] Removed an Extra Distinct from the Plan with Union Distinct~~ [SPARK-13235] [SQL] Removed an Extra Distinct from the Plan with Union when Using SQL Feb 8, 2016

gatorsmile changed the title ~~[SPARK-13235] [SQL] Removed an Extra Distinct from the Plan with Union when Using SQL~~ [SPARK-13235] [SQL] Removed an Extra Distinct from the Plan when Using Union in SQL Feb 8, 2016

viirya reviewed Feb 9, 2016

added test cases

a5e81f4

asfgit closed this in e88bff1 Feb 11, 2016

gatorsmile deleted the unionDistinct branch February 11, 2016 14:32
