Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-13235] [SQL] Removed an Extra Distinct from the Plan when Using Union in SQL #11120

Closed
wants to merge 3 commits into from

Conversation

gatorsmile
Copy link
Member

Currently, the parser added two Distinct operators in the plan if we are using Union or Union Distinct in the SQL. This PR is to remove the extra Distinct from the plan.

For example, before the fix, the following query has a plan with two Distinct

sql("select * from t0 union select * from t0").explain(true)
== Parsed Logical Plan ==
'Project [unresolvedalias(*,None)]
+- 'Subquery u_2
   +- 'Distinct
      +- 'Project [unresolvedalias(*,None)]
         +- 'Subquery u_1
            +- 'Distinct
               +- 'Union
                  :- 'Project [unresolvedalias(*,None)]
                  :  +- 'UnresolvedRelation `t0`, None
                  +- 'Project [unresolvedalias(*,None)]
                     +- 'UnresolvedRelation `t0`, None

== Analyzed Logical Plan ==
id: bigint
Project [id#16L]
+- Subquery u_2
   +- Distinct
      +- Project [id#16L]
         +- Subquery u_1
            +- Distinct
               +- Union
                  :- Project [id#16L]
                  :  +- Subquery t0
                  :     +- Relation[id#16L] ParquetRelation
                  +- Project [id#16L]
                     +- Subquery t0
                        +- Relation[id#16L] ParquetRelation

== Optimized Logical Plan ==
Aggregate [id#16L], [id#16L]
+- Aggregate [id#16L], [id#16L]
   +- Union
      :- Project [id#16L]
      :  +- Relation[id#16L] ParquetRelation
      +- Project [id#16L]
         +- Relation[id#16L] ParquetRelation

After the fix, the plan is changed without the extra Distinct as follows:

== Parsed Logical Plan ==
'Project [unresolvedalias(*,None)]
+- 'Subquery u_1
   +- 'Distinct
      +- 'Union
         :- 'Project [unresolvedalias(*,None)]
         :  +- 'UnresolvedRelation `t0`, None
         +- 'Project [unresolvedalias(*,None)]
           +- 'UnresolvedRelation `t0`, None

== Analyzed Logical Plan ==
id: bigint
Project [id#17L]
+- Subquery u_1
   +- Distinct
      +- Union
        :- Project [id#16L]
        :  +- Subquery t0
        :     +- Relation[id#16L] ParquetRelation
        +- Project [id#16L]
          +- Subquery t0
          +- Relation[id#16L] ParquetRelation

== Optimized Logical Plan ==
Aggregate [id#17L], [id#17L]
+- Union
  :- Project [id#16L]
  :  +- Relation[id#16L] ParquetRelation
  +- Project [id#16L]
    +- Relation[id#16L] ParquetRelation

@gatorsmile gatorsmile changed the title [SPARK-13235] [SQL] Removed an Extra Distinct from the Plan with Union Distinct [SPARK-13235] [SQL] Removed an Extra Distinct from the Plan with Union when Using SQL Feb 8, 2016
@gatorsmile gatorsmile changed the title [SPARK-13235] [SQL] Removed an Extra Distinct from the Plan with Union when Using SQL [SPARK-13235] [SQL] Removed an Extra Distinct from the Plan when Using Union in SQL Feb 8, 2016
@SparkQA
Copy link

SparkQA commented Feb 8, 2016

Test build #50938 has finished for PR 11120 at commit 2e78d2c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • s\"Unable to generate an encoder for inner class$`
    • case class AssertNotNull(child: Expression, walkedTypePath: Seq[String])
    • case class ReturnAnswer(child: LogicalPlan) extends UnaryNode
    • case class CollectLimit(limit: Int, child: SparkPlan) extends UnaryNode
    • trait BaseLimit extends UnaryNode
    • case class LocalLimit(limit: Int, child: SparkPlan) extends BaseLimit
    • case class GlobalLimit(limit: Int, child: SparkPlan) extends BaseLimit
    • case class TakeOrderedAndProject(

@gatorsmile
Copy link
Member Author

This change is on the parser. @hvanhovell Could you please take a look? Thanks!

@hvanhovell
Copy link
Contributor

@gatorsmile This looks pretty solid. Could you add a test for this to CatalystQlSuite? I'll have a better look at the grammar change tonight.

@@ -2358,34 +2358,8 @@ setOpSelectStatement[CommonTree t, boolean topLevel]
u=setOperator LPAREN b=simpleSelectStatement RPAREN
|
u=setOperator b=simpleSelectStatement)
-> {$setOpSelectStatement.tree != null && $u.tree.getType()==SparkSqlParser.TOK_UNIONDISTINCT}?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, this is redundant because originally TOK_UNIONALL is used here. So an additional distinct is necessary. As we use TOK_UNIONDISTINCT now, we can skip it.

@gatorsmile
Copy link
Member Author

@hvanhovell @viirya Thank you for your reviews!

Just added two test cases for this, as suggested by @hvanhovell

Yeah. I did check the original Hive JIRA: https://issues.apache.org/jira/browse/HIVE-9039 . The reason their parser added this is that they do not add another Distinct above Union All.

@SparkQA
Copy link

SparkQA commented Feb 9, 2016

Test build #50975 has finished for PR 11120 at commit a5e81f4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Copy link
Contributor

LGTM

1 similar comment
@viirya
Copy link
Member

viirya commented Feb 10, 2016

LGTM

@hvanhovell
Copy link
Contributor

Merging to master. thanks!

@asfgit asfgit closed this in e88bff1 Feb 11, 2016
@gatorsmile gatorsmile deleted the unionDistinct branch February 11, 2016 14:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants