Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP][SPARK-24497][SQL] Support recursive SQL query with adaptive replanning #23531

Closed
wants to merge 10 commits into from

Conversation

peter-toth
Copy link
Contributor

@peter-toth peter-toth commented Jan 13, 2019

What changes were proposed in this pull request?

This PR adds recursive query feature to Spark SQL.

A recursive query is defined using the WITH RECURSIVE keywords and referring the name of the common table expression within the query.
The implementation complies with SQL standard and follows similar rules to other relational databases:

  • A query is made of an anchor followed by a recursive term.
  • The anchor terms doesn't contain self reference and it is used to initialize the query.
  • The recursive term contains a self reference and it is used to expand the current set of rows with new ones.
  • The anchor and recursive terms must be joined with each other by UNION or UNION ALL operators.
  • New rows can only be derived from the newly added rows of the previous iteration (or from the initial set of rows of anchor terms). This limitation implies that recursive references can't be used with some of the joins, aggregations or subqueries.

Please see cte-recursive.sql and with.sql for some examples.

Please note that this PR focuses on the minimal working implementation which means:

  • SQL recursion is actually loop where the current iteration is computed based on the previous one's result and when an iteration returns no rows the loop is over. The final result is the union of all iteration results. This means that caching intermediate results could speed up the process, but caching was removed from this PR to reduce complexity and can be added back in a follow-up PR.
  • A common way to stop SQL recursion is using the LIMIT operator to stop computing more than the required number of rows. LIMIT support was removed from this PR to reduce complexity and can be added back in a follow-up PR.
  • Some relational databases are more relaxed in terms how many anchor and recursive terms can be in a recursion. This PR allows the most simple case and allows only 1-1 of them. A follow-up PR can target to relax this limitation.

Why are the changes needed?

Recursive query is an ANSI SQL feature that is useful to process hierarchical data.

Does this PR introduce any user-facing change?

Yes, adds recursive query feature.

How was this patch tested?

Added new UTs and tests in cte-recursion.sql and with.sql.

@gatorsmile
Copy link
Member

ok to test

@gatorsmile
Copy link
Member

cc @maryannxue

@SparkQA
Copy link

SparkQA commented Jan 14, 2019

Test build #101149 has finished for PR 23531 at commit 1c81909.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class PlanTraverseStatus(
  • case class UnresolvedRecursiveReference(name: String) extends LeafNode
  • case class RecursiveTable(
  • case class RecursiveReference(name: String, output: Seq[Attribute]) extends LeafNode
  • case class With(
  • case class RecursiveTableExec(
  • case class RecursiveReferenceExec(name: String, output: Seq[Attribute]) extends LeafExecNode

@SparkQA
Copy link

SparkQA commented Jan 14, 2019

Test build #101190 has started for PR 23531 at commit a69473f.

@SparkQA
Copy link

SparkQA commented Jan 14, 2019

Test build #101191 has started for PR 23531 at commit f6d8c17.

@shaneknapp
Copy link
Contributor

test this please

1 similar comment
@shaneknapp
Copy link
Contributor

test this please

@SparkQA
Copy link

SparkQA commented Jan 14, 2019

Test build #101197 has finished for PR 23531 at commit f6d8c17.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 16, 2019

Test build #101306 has finished for PR 23531 at commit 5f69d60.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 16, 2019

Test build #101308 has finished for PR 23531 at commit d9cc864.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 16, 2019

Test build #101304 has finished for PR 23531 at commit 4ea3a4a.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 16, 2019

Test build #101310 has finished for PR 23531 at commit 8f9a673.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 16, 2019

Test build #101327 has finished for PR 23531 at commit f5feb63.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 22, 2019

Test build #101546 has finished for PR 23531 at commit 24af7b2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 23, 2019

Test build #101555 has finished for PR 23531 at commit 0c6e957.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 23, 2019

Test build #101559 has finished for PR 23531 at commit 31574d7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 23, 2019

Test build #101585 has finished for PR 23531 at commit aab0886.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@peter-toth
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Mar 26, 2020

Test build #120411 has finished for PR 23531 at commit 656995b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

retest this please

@SparkQA
Copy link

SparkQA commented Apr 3, 2020

Test build #120774 has finished for PR 23531 at commit 656995b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dmateusp
Copy link

dmateusp commented Apr 8, 2020

seems like the PR addressed the feedback, can someone review?

@SparkQA
Copy link

SparkQA commented Apr 14, 2020

Test build #121283 has finished for PR 23531 at commit 8205e61.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


override def children: Seq[SparkPlan] = anchorTerm :: Nil

override def innerChildren: Seq[QueryPlan[_]] = logicalRecursiveTerm +: super.innerChildren
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any specific reason we need to include logicalResursiveTerm as innerChildren other than for display?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No specific reason, just for display. I can find a better way to do it.

* Notify the listeners of the physical plan change.
*/
private def onUpdatePlan(executionId: Long): Unit = {
val queryExecution = SQLExecution.getQueryExecution(executionId)
Copy link
Contributor

@maryannxue maryannxue Apr 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if we should introduce the idea of iterations in metrics here. Or maybe we need a better way of expressing recursion in UI.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I liked the idea that we can see what happened in each iteration, but I agree that it can be too much information to present on UI, especially when recursion level is high.
Do you have any idea how to present recursion on the UI?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For performance and UI readability reasons, we don't wanna really update all those physical nodes to the UI. AQE has metric-only UI updates, and I assume we can do sth. similar here.
Would you mind doing a little research on how other SQL engines or tools present recursive queries in GUIs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see.
Sure, I'm happy to look into it how other engines show recursion on their UI.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, I didn't have time for this yet, but it is still on my list.

@peter-toth
Copy link
Contributor Author

Thanks @maryannxue for your comments, really appreciated. I will try to address them soon.

@SparkQA
Copy link

SparkQA commented Jun 18, 2020

Test build #124213 has finished for PR 23531 at commit fc7b04a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 18, 2020

Test build #124214 has finished for PR 23531 at commit 607aa4d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class TimeFormatters(date: DateFormatter, timestamp: TimestampFormatter)

@SparkQA
Copy link

SparkQA commented Jun 19, 2020

Test build #124233 has finished for PR 23531 at commit adf120c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 20, 2020

Test build #124299 has finished for PR 23531 at commit 69202d5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@peter-toth peter-toth changed the title [SPARK-24497][SQL] Support recursive SQL query [WIP][SPARK-24497][SQL] Support recursive SQL query with adaptive replanning Jul 23, 2020
@peter-toth
Copy link
Contributor Author

peter-toth commented Jul 23, 2020

I've opened a simplified PR (without adaptive replanning in each iteration) here: #29210. That implementation is bit simpler and hopefully will get merged sooner than this PR.

@github-actions
Copy link

github-actions bot commented Nov 1, 2020

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Nov 1, 2020
@github-actions github-actions bot closed this Nov 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet