
[SPARK-12718][SPARK-13720][SQL] SQL generation support for window functions #11555

Closed
wants to merge 9 commits

Conversation

cloud-fan
Contributor

What changes were proposed in this pull request?

Add SQL generation support for window functions. The idea is simple: treat the `Window` operator like `Project`, i.e. add a subquery to its child when necessary, generate a `SELECT ... FROM ...` SQL string, and implement the `sql` method for window-related expressions, e.g. `WindowSpecDefinition`, `WindowFrame`, etc.
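For illustration, here is a minimal, self-contained sketch (not the actual Catalyst classes; the names and structure are assumptions) of how a `sql` method on a window-spec-like node can compose the OVER clause from its optional pieces:

  // Hypothetical stand-in for WindowSpecDefinition.sql: render the optional
  // PARTITION BY / ORDER BY / frame clauses as a single parenthesized spec.
  case class SimpleWindowSpec(
      partitionBy: Seq[String],
      orderBy: Seq[String],
      frame: Option[String]) {
    def sql: String = {
      val partition =
        if (partitionBy.isEmpty) "" else partitionBy.mkString("PARTITION BY ", ", ", "")
      val order =
        if (orderBy.isEmpty) "" else orderBy.mkString("ORDER BY ", ", ", "")
      // Keep only the non-empty clauses and join them with single spaces.
      Seq(partition, order, frame.getOrElse("")).filter(_.nonEmpty).mkString("(", " ", ")")
    }
  }

  // SimpleWindowSpec(Seq("key"), Seq("key ASC"), None).sql
  // => "(PARTITION BY key ORDER BY key ASC)"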

This PR also fixes SPARK-13720 by improving the process of adding an extra `SubqueryAlias` (the `RecoverScopingInfo` rule). Before this PR, we updated the qualifiers in the project list while adding the subquery. However, this is incomplete, as we also need to update the qualifiers in all ancestors that refer to attributes here. In this PR, we split `RecoverScopingInfo` into two rules: `AddSubQuery` and `UpdateQualifier`. `AddSubQuery` only adds a subquery when necessary, and `UpdateQualifier` re-propagates and updates qualifiers bottom-up.

Ideally we should put the bug fix part in an individual PR, but this bug also blocks the window stuff, so I put them together here.

Many thanks to @gatorsmile for the initial discussion and test cases!

How was this patch tested?

New tests in `LogicalPlanToSQLSuite`.

@cloud-fan
Contributor Author

cc @liancheng @gatorsmile @yhuai

@SparkQA

SparkQA commented Mar 7, 2016

Test build #52544 has finished for PR 11555 at commit 559bbc5.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

@cloud-fan The issue turned out to be much more complex in my implementation. As you saw in the JIRA, I originally wanted to add an extra SubqueryAlias between each Window. However, I hit a couple of problems caused by SubqueryAlias, so I eventually decided to recover the original SQL statement as closely as possible. Below is my code draft, without cleanup or refactoring.

  // Walks down a chain of adjacent Window operators, accumulating their window
  // expressions, and returns the first non-Window child along with them.
  private def getAllWindowExprs(
      plan: Window,
      windowExprs: ArrayBuffer[NamedExpression]): (LogicalPlan, ArrayBuffer[NamedExpression]) = {
    plan.child match {
      case w: Window =>
        getAllWindowExprs(w, windowExprs ++ plan.windowExpressions)
      case _ => (plan.child, windowExprs ++ plan.windowExpressions)
    }
  }

  // Replace the attributes of aliased expressions in window expressions
  // with the original expressions from the Project or Aggregate
  private def replaceAliasedByExpr(
      projectList: Seq[NamedExpression],
      windowExprs: Seq[NamedExpression]): Seq[Expression] = {
    val aliasMap = AttributeMap(projectList.collect {
      case a: Alias => (a.toAttribute, a.child)
    })

    windowExprs.map { expr =>
      expr.transformDown {
        case ar: AttributeReference if aliasMap.contains(ar) => aliasMap(ar)
      }
    }
  }

  private def buildProjectListForWindow(plan: Window): (String, String, String, LogicalPlan) = {
    // get all the windowExpressions from all the adjacent Window
    val (child, windowExpressions) = getAllWindowExprs(plan, ArrayBuffer.empty[NamedExpression])

    child match {
      case p: Project =>
        val newWindowExpr = replaceAliasedByExpr(p.projectList, windowExpressions)
        ((p.projectList ++ newWindowExpr).map(_.sql).mkString(", "), "", "", p.child)

      case _: Aggregate | Filter(_, _: Aggregate) =>
        val agg: Aggregate = child match {
          case a: Aggregate => a
          case Filter(_, a: Aggregate) => a
        }

        val newWindowExpr = replaceAliasedByExpr(agg.aggregateExpressions, windowExpressions)

        val groupingSQL = agg.groupingExpressions.map(_.sql).mkString(", ")

        val havingSQL = child match {
          case _: Aggregate => ""
          case Filter(condition, _: Aggregate) => "HAVING " + condition.sql
        }

        ((agg.aggregateExpressions ++ newWindowExpr).map(_.sql).mkString(", "),
          groupingSQL,
          havingSQL,
          agg.child)
    }
  }

  private def windowToSQL(plan: Window): String = {
    val (selectList, groupingSQL, havingSQL, nextPlan) = buildProjectListForWindow(plan)

    build(
      "SELECT",
      selectList,
      if (nextPlan == OneRowRelation) "" else "FROM",
      toSQL(nextPlan),
      if (groupingSQL.isEmpty) "" else "GROUP BY",
      groupingSQL,
      havingSQL
    )
  }

@gatorsmile
Member

So far, the test cases I wrote are listed below. I think we still need to add more to cover all the cases. I am not sure if your current implementation can pass the first one.

  test("window basic") {
    checkHiveQl(
      s"""
         |select key, value,
         |round(avg(value) over (), 2)
         |from parquet_t1 order by key
      """.stripMargin)
  }

  test("window with different window specification") {
    checkHiveQl(
      s"""
         |select key, value,
         |dense_rank() over (order by key, value) as dr,
         |sum(value) over (partition by key order by key) as sum
         |from parquet_t1
      """.stripMargin)
  }

  test("window with the same window specification with aggregate + having") {
    checkHiveQl(
      s"""
        |select key, value,
        |sum(value) over (partition by key % 5 order by key) as dr
        |from parquet_t1 group by key, value having key > 5
      """.stripMargin)
  }

  test("window with the same window specification with aggregate functions") {
    checkHiveQl(
      s"""
        |select key, value,
        |sum(value) over (partition by key % 5 order by key) as dr
        |from parquet_t1 group by key, value
      """.stripMargin)
  }

  test("window with the same window specification with aggregate") {
    checkHiveQl(
      s"""
        |select key, value,
        |dense_rank() over (distribute by key sort by key, value) as dr,
        |count(key)
        |from parquet_t1 group by key, value
      """.stripMargin)
  }

  test("window with the same window specification without aggregate and filter") {
    checkHiveQl(
      s"""
        |select key, value,
        |dense_rank() over (distribute by key sort by key, value) as dr,
        |count(key) over(distribute by key sort by key, value) as ca
        |from parquet_t1
      """.stripMargin)
  }

@gatorsmile
Member

Sorry, I have an early morning conference call with the patent attorneys. Will reply to your response tomorrow. Thanks!

@cloud-fan
Contributor Author

Have a good rest, we can discuss more tomorrow :)

@gatorsmile
Member

Thank you! Talk to you tomorrow.

BTW, we also need to fix a couple of window expressions, for example row_number, cume_dist, rank, dense_rank, and percent_rank. We need to override their default sql methods.
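For illustration, a minimal, self-contained sketch of that override pattern (a toy model, not the actual Catalyst expression hierarchy; the class and method names are stand-ins):

  // Toy model: the generic default rendering vs. an explicit per-function override.
  trait Expr {
    def prettyName: String = getClass.getSimpleName.stripSuffix("$").toLowerCase
    def sql: String = s"$prettyName()" // default rendering, e.g. "rownumber()"
  }

  case object RowNumber extends Expr {
    override def sql: String = "ROW_NUMBER()"
  }

  case object PercentRank extends Expr {
    override def sql: String = "PERCENT_RANK()"
  }

  // RowNumber.sql => "ROW_NUMBER()"; without the override it would render as "rownumber()"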

@SparkQA

SparkQA commented Mar 7, 2016

Test build #52545 has finished for PR 11555 at commit 3ce072b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan changed the title from "[SPARK-12718][SQL] SQL generation support for window functions" to "[SPARK-12718][SPARK-13720][SQL] SQL generation support for window functions" on Mar 7, 2016
}

results.toSeq
}
Contributor

So this method is basically a DFS search for all the outermost SubqueryAlias operators. Maybe the following version is clearer:

def findOutermostQualifiers(input: LogicalPlan): Seq[(String, AttributeSet)] = {
  input.collectFirst {
    case SubqueryAlias(alias, child) => Seq(alias -> child.outputSet)
    case plan => plan.children.flatMap(findOutermostQualifiers)
  }.toSeq.flatten
}

@SparkQA

SparkQA commented Mar 7, 2016

Test build #52557 has finished for PR 11555 at commit e037814.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* Finds the outermost [[SubqueryAlias]] nodes in the input logical plan and returns their alias
* names and outputSet.
*/
private def findOutermostQualifiers(input: LogicalPlan): Seq[(String, AttributeSet)] = {
Member

I have another alternative. We face the same issue everywhere we add or remove an extra qualifier. How about adding another rule/batch below the existing Batch("Canonicalizer")? For example:

      Batch("Replace Qualifier", Once,
        ReplaceQualifier)

The rule is simple: we can always get the qualifier from the inputSet if we traverse bottom-up. I did not do a full test last night; below is the code draft:

    object ReplaceQualifier extends Rule[LogicalPlan] {
      override def apply(tree: LogicalPlan): LogicalPlan = tree transformUp { case plan =>
        plan transformExpressions {
          case e: AttributeReference => e.withQualifiers(getQualifier(plan.inputSet, e))
        }
      }

      private def getQualifier(inputSet: AttributeSet, e: AttributeReference): Seq[String] = {
        inputSet.collectFirst {
          case a if a.semanticEquals(e) => a.qualifiers
        }.getOrElse(Seq.empty[String])
      }
    }

Contributor

Thanks, I like this one :)

Contributor Author

This is really a good idea! Thanks, updated.

@SparkQA

SparkQA commented Mar 7, 2016

Test build #52558 has finished for PR 11555 at commit 9a66fbb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 7, 2016

Test build #52560 has finished for PR 11555 at commit 40bd17a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 7, 2016

Test build #52566 has finished for PR 11555 at commit 276a870.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 7, 2016

Test build #52571 has finished for PR 11555 at commit 656a13a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 7, 2016

Test build #52569 has finished for PR 11555 at commit f968a33.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

retest this please

@SparkQA

SparkQA commented Mar 8, 2016

Test build #52616 has finished for PR 11555 at commit 656a13a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 8, 2016

Test build #52626 has finished for PR 11555 at commit c82229a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 8, 2016

Test build #52649 has finished for PR 11555 at commit 1ebb3c5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -499,12 +520,25 @@ case class CumeDist() extends RowNumberLike with SizeBasedWindowFunction {
case class NTile(buckets: Expression) extends RowNumberLike with SizeBasedWindowFunction {
Contributor

Maybe I missed it, but where is the sql method for NTile?

Member

Contributor

Ah I see. Thanks!

@yhuai
Contributor

yhuai commented Mar 8, 2016

Thanks @cloud-fan for working on it! Overall, it looks good. It would be great to have more test cases, like the following:

  • Multiple window functions used in a single expression, e.g. sum(...) OVER (...) / count(...) OVER (...).
  • An expression combining ordinary expressions and window functions, e.g. 1 + 2 + count(...) OVER (...).
  • A regular aggregate function used together with a window function, e.g. sum(...) - sum(...) OVER (...).
  • ORDER BY clauses with ASC or DESC specified.

Also, maybe we are missing some window functions (like LEAD and LAG)? The supported window functions are listed in https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html.
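For illustration, two of these cases might look like the following (hypothetical tests written in the style of the existing LogicalPlanToSQLSuite tests above; parquet_t1 is the table those tests use):

  test("multiple window functions in one expression") {
    checkHiveQl(
      s"""
         |select key,
         |sum(value) over (partition by key) / count(value) over (partition by key)
         |from parquet_t1
      """.stripMargin)
  }

  test("window spec with explicit sort direction") {
    checkHiveQl(
      s"""
         |select key, value,
         |dense_rank() over (partition by key order by value desc) as dr
         |from parquet_t1
      """.stripMargin)
  }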

@SparkQA

SparkQA commented Mar 9, 2016

Test build #52718 has finished for PR 11555 at commit dab7a2f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 9, 2016

Test build #52720 has finished for PR 11555 at commit aa0a32b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 9, 2016

Test build #52728 has finished for PR 11555 at commit dab7a2f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

@dilipbiswal and I just had an offline discussion about this. Sorry to mention this at the last minute.

Adding extra subqueries could be a big issue if the column names are the same but the original qualifiers are different. For example, we can join two tables that have the same column names; normally, we use different qualifiers to differentiate them. If we just replace those qualifiers with a single subquery name, the columns lose their original qualifiers, and the generated SQL statement will be rejected by the Analyzer due to name ambiguity.

We are facing this issue in multiple SQL generation cases. Please correct us if our understanding is wrong. Thanks! @cloud-fan @liancheng

@gatorsmile
Member

BTW, we are having another related discussion in the JIRA: https://issues.apache.org/jira/browse/SPARK-13393.

Not sure if you are interested in this. Please feel free to jump in, if you have better ideas. Thanks!

@cloud-fan
Contributor Author

Now, if we just replace it by the identical subquery name, they will lose the original qualifiers.

I don't think that's true. Every added subquery will have a unique name, so we won't get the same qualifiers from the left and right children of a Join.

@gatorsmile
Member

For example, given the following sub-plan:

Project a.key, b.key 
   Join 

Assuming we still have multiple operators above this sub-plan, and these operators use both a.key and b.key, we will hit an issue if we add an extra subquery: in the generated SQL, both of them become t.key.

@cloud-fan
Contributor Author

@gatorsmile, can you give a more detailed example? Where does the t come from? We won't insert a subquery between Join and Project.

@gatorsmile
Member

t is the subquery name that SQLBuilder generates.

For example, consider the following query:

sqlContext.range(10).select('id as 'key, 'id as 'value).write.saveAsTable("test1")
sqlContext.range(10).select('id as 'key, 'id as 'value).write.saveAsTable("test2")

sql("SELECT sum(a.value) over (ORDER BY a.key), sum(b.value) over (ORDER BY b.key) FROM test1 a JOIN test2 b ON a.key = b.key").explain(true)

The plan looks like this:

+- Project [value#29L,key#28L,value#31L,key#30L,windowexpression(sum(value), windowspecdefinition(sortorder(key)))#35L,windowexpression(sum(value), windowspecdefinition(sortorder(key)))#36L,windowexpression(sum(value), windowspecdefinition(sortorder(key)))#35L,windowexpression(sum(value), windowspecdefinition(sortorder(key)))#36L]
   +- Window [value#29L,key#28L,value#31L,key#30L,windowexpression(sum(value), windowspecdefinition(sortorder(key)))#35L], [(sum(value#31L),mode=Complete,isDistinct=false) windowspecdefinition(key#30L ASC, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS windowexpression(sum(value), windowspecdefinition(sortorder(key)))#36L], [key#30L ASC]
      +- Window [value#29L,key#28L,value#31L,key#30L], [(sum(value#29L),mode=Complete,isDistinct=false) windowspecdefinition(key#28L ASC, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS windowexpression(sum(value), windowspecdefinition(sortorder(key)))#35L], [key#28L ASC]
         +- Project [value#29L,key#28L,value#31L,key#30L]
            +- Join Inner, Some((key#28L = key#30L))
               :- SubqueryAlias a
               :  +- SubqueryAlias test1
               :     +- Relation[key#28L,value#29L] ParquetRelation
               +- SubqueryAlias b
                  +- SubqueryAlias test2
                     +- Relation[key#30L,value#31L] ParquetRelation

@cloud-fan
Contributor Author

Ah, that makes sense, thanks for the explanation!
I think we need a better fix for SPARK-13720; let me send a separate PR.

@cloud-fan
Contributor Author

Hmm, it's not really specific to SPARK-13720, but a fundamental bug in the SQL builder infrastructure. How about we merge this PR first and fix it later?

@gatorsmile
Member

Yeah, this is a fundamental issue. I am afraid we are unable to add any extra subqueries for SQL generation. I will check whether SQL generation in traditional RDBMSs also uses subqueries, and will post the answer in this PR.

BTW, I am fine with merging this first. Thank you!

@cloud-fan
Contributor Author

retest this please

@SparkQA

SparkQA commented Mar 10, 2016

Test build #52813 has finished for PR 11555 at commit dab7a2f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

When adding an extra subquery, we can always detect whether duplicate names exist. If we find any, how about adding another Project with unique alias names for the columns with duplicate names?

BTW, I am still waiting for input from the RDBMS experts. Will keep you posted. cc @ioana-delaney Thanks!

@gatorsmile
Member

The generated alias names change the column names, though. To keep the original column names, we need another top-level Project that converts the names back.
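For illustration, the detect-and-rename idea might look like this hypothetical helper (not code from the PR; the names are made up):

  import scala.collection.mutable

  // Assign a unique alias to each duplicated column name; the outermost
  // SELECT can then rename the aliases back to the original names.
  def dedupeColumns(columns: Seq[String]): Seq[(String, String)] = {
    val seen = mutable.Map.empty[String, Int].withDefaultValue(0)
    columns.map { name =>
      val n = seen(name)
      seen(name) = n + 1
      if (n == 0) (name, name) else (name, s"${name}_$n") // (original, alias)
    }
  }

  // dedupeColumns(Seq("key", "value", "key"))
  // => Seq(("key", "key"), ("value", "value"), ("key", "key_1"))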

@gatorsmile
Member

Got the offline input from @ioana-delaney:

Using subqueries is not common; it is only used if the runtime doesn't support a certain sequence of operations.
Internally, when projecting columns with the same name coming from different tables, we can use aliases to distinguish among them. That should be the default behavior, irrespective of any further optimizations applied to the generated SQL.

Basically, I think we can safely merge this PR and fix the naming-ambiguity issues in a separate PR. Thanks!

@cloud-fan
Contributor Author

I'm going to merge this PR as it blocks my next piece of work. cc @liancheng, I'll address your comments in follow-up PRs if you have any.
And thanks @gatorsmile for your review! I'll send a PR to fix the fundamental issue, and we can keep discussing there.

@asfgit closed this in 6871cc8 on Mar 11, 2016
asfgit pushed a commit that referenced this pull request Mar 11, 2016
Fix the compilation failure introduced by #11555 because of a merge conflict.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #11648 from cloud-fan/hotbug.
roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016
[SPARK-12718][SPARK-13720][SQL] SQL generation support for window functions

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#11555 from cloud-fan/window.
roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016
Fix the compilation failure introduced by apache#11555 because of a merge conflict.

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#11648 from cloud-fan/hotbug.