[WIP][SPARK-24497][SQL] Support recursive SQL query #29210

Closed

peter-toth
Contributor

What changes were proposed in this pull request?

This PR adds a recursive query feature to Spark SQL.

A recursive query is defined using the WITH RECURSIVE keywords and by referring to the name of the common table expression within the query.
The implementation complies with the SQL standard and follows rules similar to those of other relational databases:

  • A query is made of an anchor followed by a recursive term.
  • The anchor term doesn't contain a self-reference and is used to initialize the query.
  • The recursive term contains a self-reference and is used to expand the current set of rows with new ones.
  • The anchor and recursive terms must be joined with each other by UNION or UNION ALL operators.
  • New rows can only be derived from the newly added rows of the previous iteration (or from the initial set of rows of anchor terms). This limitation implies that recursive references can't be used with some of the joins, aggregations or subqueries.
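
As a rough illustration of the anchor / recursive-term structure described above, here is a minimal sketch. It runs against SQLite through Python's built-in sqlite3 module, not against Spark: that is an assumption made purely so the example is runnable, since SQLite also accepts the standard WITH RECURSIVE syntax. The CTE name r and the column name level are made up for the example.

```python
import sqlite3

# Illustrative only: SQLite stands in for Spark SQL here (assumption of this sketch).
conn = sqlite3.connect(":memory:")

query = """
WITH RECURSIVE r(level) AS (
    SELECT 1                                  -- anchor term: no self-reference
    UNION ALL                                 -- anchor and recursive term joined by UNION ALL
    SELECT level + 1 FROM r WHERE level < 5   -- recursive term: refers to r itself
)
SELECT level FROM r ORDER BY level
"""

# Each recursive step only sees the rows added by the previous step.
levels = [row[0] for row in conn.execute(query)]
print(levels)
conn.close()
```

The WHERE clause in the recursive term is what eventually produces an empty iteration and ends the recursion.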

Please see cte-recursive.sql and with.sql for some examples.

Please note that this PR focuses on a minimal working implementation, which means:

  • SQL recursion is actually a loop: each iteration is computed from the previous iteration's result, and the loop ends when an iteration returns no rows. The final result is the union of all iteration results. Caching intermediate results could therefore speed up the process, but caching was removed from this PR to reduce complexity and can be added back in a follow-up PR.
  • A common way to stop SQL recursion is to use the LIMIT operator so that no more than the required number of rows is computed. LIMIT support was removed from this PR to reduce complexity and can be added back in a follow-up PR.
  • Some relational databases are more relaxed about how many anchor and recursive terms a recursion can contain. This PR supports only the simplest case: exactly one anchor term and one recursive term. A follow-up PR can relax this limitation.
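
The loop described in the first point above can be sketched in plain Python. This is not the code from this PR: run_recursive_query, recursive_step and max_iterations are hypothetical names introduced only for illustration, and the iteration cap is an assumption of the sketch rather than something the PR describes.

```python
def run_recursive_query(anchor_rows, recursive_step, max_iterations=100):
    """Evaluate a recursive query as a loop (hypothetical helper, not Spark code).

    anchor_rows: rows produced by the anchor term.
    recursive_step: function mapping the previous iteration's rows to new rows.
    max_iterations: safety cap; the loop simply stops when it is reached.
    """
    result = list(anchor_rows)    # final result: union of all iterations (UNION ALL semantics)
    previous = list(anchor_rows)  # only the newest rows feed the next iteration
    for _ in range(max_iterations):
        current = recursive_step(previous)
        if not current:           # an iteration with no rows ends the loop
            break
        result.extend(current)
        previous = current
    return result


# Mirrors: SELECT 1 AS level UNION ALL SELECT level + 1 FROM r WHERE level < 5
rows = run_recursive_query([1], lambda prev: [x + 1 for x in prev if x < 5])
print(rows)
```

Because every iteration's rows are kept in result, holding them in memory between iterations is where the caching mentioned above would pay off.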

Why are the changes needed?

Recursive queries are an ANSI SQL feature that is useful for processing hierarchical data.

Does this PR introduce any user-facing change?

Yes, it adds the recursive query feature.

How was this patch tested?

Added new UTs and tests in cte-recursion.sql and with.sql.

@peter-toth
Contributor Author

peter-toth commented Jul 23, 2020

This is the static version of #23531: it doesn't do adaptive replanning in each recursive iteration, but in exchange the implementation is simpler. Common relational database implementations don't do replanning either.

@maropu this is very close to what you suggested before, if you have some time please review.

@maryannxue I think we can add adaptive support in a follow-up PR if needed.

cc @cloud-fan @viirya @gatorsmile

@SparkQA

SparkQA commented Jul 23, 2020

Test build #126431 has finished for PR 29210 at commit 016b952.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class UnresolvedRecursiveReference(cteName: String, accumulated: Boolean) extends LeafNode
  • case class RecursiveRelation(
  • case class RecursiveReference(
  • abstract class UnionBase extends LogicalPlan
  • case class Union(children: Seq[LogicalPlan]) extends UnionBase
  • case class With(
  • case class RecursiveRelationExec(
  • case class RecursiveReferenceExec(

@maropu
Member

maropu commented Jul 24, 2020

The simpler design as a first step looks fine to me. Does anyone prefer the adaptive one for this? I think we need to choose which approach to take first. @maryannxue @viirya @gatorsmile

@SparkQA

SparkQA commented Jul 24, 2020

Test build #126490 has finished for PR 29210 at commit 3646400.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class UnschedulableTaskSetAdded(stageId: Int, stageAttemptId: Int)
  • case class UnschedulableTaskSetRemoved(stageId: Int, stageAttemptId: Int)
  • case class SparkListenerUnschedulableTaskSetAdded(
  • case class SparkListenerUnschedulableTaskSetRemoved(
  • case class Union(

@peter-toth peter-toth changed the title [WIP][SPARK-24497][SQL] Support recursive SQL query [SPARK-24497][SQL] Support recursive SQL query Jul 26, 2020

CREATE TEMPORARY VIEW t AS SELECT * FROM VALUES 0, 1, 2 AS t(id);

-- fails due to recursion isn't allowed with RECURSIVE keyword
Contributor

without RECURSIVE ?

Contributor Author

Thanks @manuzhang. Fixed.

@SparkQA

SparkQA commented Sep 1, 2020

Test build #128152 has finished for PR 29210 at commit 1a2826e.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@peter-toth
Contributor Author

retest this please

@SparkQA

SparkQA commented Sep 2, 2020

Test build #128162 has finished for PR 29210 at commit 1a2826e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@SparkQA

SparkQA commented Feb 4, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39470/

@SparkQA

SparkQA commented Feb 4, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39470/

@SparkQA

SparkQA commented Feb 4, 2021

Test build #134884 has finished for PR 29210 at commit 794300c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 4, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39473/

@SparkQA

SparkQA commented Feb 4, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39473/

@SparkQA

SparkQA commented Feb 5, 2021

Test build #134890 has finished for PR 29210 at commit 1e364d3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 5, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39516/

@SparkQA

SparkQA commented Feb 5, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39516/

@SparkQA

SparkQA commented Feb 5, 2021

Test build #134933 has finished for PR 29210 at commit 8f79020.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label May 17, 2021
@peter-toth
Contributor Author

peter-toth commented May 17, 2021

I'm happy to update this PR if anyone is willing to review it. Just let me know...

@github-actions github-actions bot closed this May 18, 2021
@phancey

phancey commented Jul 19, 2021

I would like to see recursive queries - why did this not go through? Is there an alternative?

@timschwab

@peter-toth

Pinging this to ensure it doesn't fall through the cracks. I'm currently hitting the common use case of wanting to create a materialized path from an adjacency list, and not having recursive queries built into Spark makes this more painful and less performant.

@sixdimensionalarray

sixdimensionalarray commented Jul 19, 2022

Another +1. I ran into a use case where recursive queries were needed, and an alternative technology was used to solve it because recursive queries were not available in Spark SQL.

@peter-toth
Contributor Author

Thanks for the feedback. I will try to rebase this PR on the latest master (Spark 3.4) in a few weeks.

@peter-toth
Contributor Author

Sorry guys, this is unlikely to land in Spark 3.4, maybe in 3.5...

@wangyum wangyum removed the Stale label Jan 18, 2023
@wangyum wangyum reopened this Jan 18, 2023
@wangyum
Member

wangyum commented Jan 18, 2023

@peter-toth Could you rebase this PR on the master branch? I have removed the Stale tag.

@peter-toth peter-toth changed the title [SPARK-24497][SQL] Support recursive SQL query [WIP][SPARK-24497][SQL] Support recursive SQL query Jan 18, 2023
@peter-toth
Contributor Author

@peter-toth Could you rebase this PR on the master branch. I have removed the Stale tag.

@wangyum, unfortunately this is a very old PR and a lot of changes are needed to make it work on the latest Spark.
I started rebasing and fixing it last year, but it requires quite some time and I haven't finished yet.

Let me close the PR for now and reopen once I have a working solution.

@peter-toth peter-toth closed this Jan 18, 2023
@peter-toth
Contributor Author

This PR is rebased on latest master here: #40744

9 participants