
[SPARK-37259][SQL] Support CTE and TempTable queries with MSSQL JDBC #34693

Closed

Conversation

peter-toth
Contributor

@peter-toth peter-toth commented Nov 23, 2021

What changes were proposed in this pull request?

Currently, CTE queries from Spark are not supported with MSSQL server via JDBC. This is because MSSQL server doesn't support the nested CTE syntax (SELECT * FROM (WITH t AS (...) SELECT ... FROM t) WHERE 1=0) that Spark builds from the original query (options.tableOrQuery) in JDBCRDD.resolveTable() and in JDBCRDD.compute().
Unfortunately, it is non-trivial to split an arbitrary query into "with" and "regular" clauses in MsSqlServerDialect. So instead, I'm proposing a new general JDBC option "withClause" that users can use if they have complex queries with CTEs:

val withClause = "WITH t AS (SELECT x, y FROM tbl)"
val query = "SELECT * FROM t WHERE x > 10"
val df = spark.read.format("jdbc")
  .option("url", jdbcUrl)
  .option("withClause", withClause)
  .option("query", query)
  .load()

This change also works with MSSQL's temp table syntax:

val withClause = "(SELECT * INTO #TempTable FROM (SELECT * FROM tbl WHERE x > 10) t)"
val query = "SELECT * FROM #TempTable"
val df = spark.read.format("jdbc")
  .option("url", jdbcUrl)
  .option("withClause", withClause)
  .option("query", query)
  .load()
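
For illustration, with this option Spark prepends the withClause to the wrapped queries it builds internally. Reusing the CTE example above, the schema probe would look roughly like this (the exact string Spark emits may differ slightly):

val schemaProbe = s"$withClause SELECT * FROM ($query) WHERE 1=0"
// => WITH t AS (SELECT x, y FROM tbl) SELECT * FROM (SELECT * FROM t WHERE x > 10) WHERE 1=0
// MSSQL accepts this form, unlike the fully nested one.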

Why are the changes needed?

To support CTE queries with MSSQL.

Does this PR introduce any user-facing change?

Yes, CTE queries are supported from now on.

How was this patch tested?

Added new integration UTs.
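
For reference, the added tests follow roughly this shape (a sketch, not the exact suite code; jdbcUrl and tbl are placeholders for the test fixture):

test("query JDBC source with a WITH clause") {
  val df = spark.read.format("jdbc")
    .option("url", jdbcUrl)
    .option("withClause", "WITH t AS (SELECT x, y FROM tbl)")
    .option("query", "SELECT * FROM t WHERE x > 10")
    .load()
  // Both schema resolution and the actual read must succeed.
  assert(df.columns.toSeq == Seq("x", "y"))
}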

@github-actions github-actions bot added the SQL label Nov 23, 2021
@SparkQA

SparkQA commented Nov 23, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50020/

@SparkQA

SparkQA commented Nov 23, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50020/

@peter-toth
Contributor Author

peter-toth commented Nov 23, 2021

This change also seems to work with MSSQL's temp table syntax:

val withClause = "(SELECT * INTO #TempTable FROM (SELECT * FROM tbl WHERE x > 10) t)"
val query = "SELECT * FROM #TempTable"
val df = spark.read.format("jdbc")
  .option("url", jdbcUrl)
  .option("withClause", withClause)
  .option("query", query)
  .load()

@SparkQA

SparkQA commented Nov 23, 2021

Test build #145548 has finished for PR 34693 at commit e2c9577.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@peter-toth
Contributor Author

Hmm, failures in ExpressionsSchemaSuite look unrelated...

@sumeetgajjar
Contributor

This change also seems to work with MSSQL's temp table syntax:

val withClause = "(SELECT * INTO #TempTable FROM (SELECT * FROM tbl WHERE x > 10) t)"
val query = "SELECT * FROM #TempTable"
val df = spark.read.format("jdbc")
  .option("url", jdbcUrl)
  .option("withClause", withClause)
  .option("query", query)
  .load()

Since it also works with temp table syntax, do you think it would be a good idea to include it in the title of this PR and modify the title to "Support CTE and TempTable queries with MSSQL JDBC"?

@@ -325,6 +325,6 @@ private[sql] case class JDBCRelation(
   override def toString: String = {
     val partitioningInfo = if (parts.nonEmpty) s" [numPartitions=${parts.length}]" else ""
     // credentials should not be included in the plan output, table information is sufficient.
-    s"JDBCRelation(${jdbcOptions.tableOrQuery})" + partitioningInfo
+    s"JDBCRelation(${jdbcOptions.withClause}${jdbcOptions.tableOrQuery})" + partitioningInfo
Contributor

Since partitioningInfo is also a string, should we create the final string as
s"JDBCRelation(${jdbcOptions.withClause}${jdbcOptions.tableOrQuery})$partitioningInfo"?

Contributor Author

Fixed in b53ef47

.option("dbtable", dbtable)
.load()
assert(df.collect.toSet === expectedResult)
}
Contributor

Since it already works with temp table syntax, should we also add the corresponding UTs?

Contributor Author

Ok. Added in dea730a

@peter-toth
Contributor Author

peter-toth commented Nov 24, 2021

This change also seems to work with MSSQL's temp table syntax:

val withClause = "(SELECT * INTO #TempTable FROM (SELECT * FROM tbl WHERE x > 10) t)"
val query = "SELECT * FROM #TempTable"
val df = spark.read.format("jdbc")
  .option("url", jdbcUrl)
  .option("withClause", withClause)
  .option("query", query)
  .load()

Since it also works with temp table syntax, do you think it would be a good idea to include it in the title of this PR and modify the title to "Support CTE and TempTable queries with MSSQL JDBC"?

Thanks, makes sense. I've modified the PR title and description and added a new test in dea730a.

Actually, I wonder if it makes sense to rename withClause to a more general queryPrefix option.

@peter-toth peter-toth changed the title [SPARK-37259][SQL] Support CTE queries with MSSQL JDBC [SPARK-37259][SQL] Support CTE and TempTable queries with MSSQL JDBC Nov 24, 2021
@SparkQA

SparkQA commented Nov 24, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50044/

@SparkQA

SparkQA commented Nov 24, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50045/

@SparkQA

SparkQA commented Nov 24, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50045/

@SparkQA

SparkQA commented Nov 24, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50044/

@SparkQA

SparkQA commented Nov 24, 2021

Test build #145572 has finished for PR 34693 at commit b53ef47.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 24, 2021

Test build #145571 has finished for PR 34693 at commit dea730a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@KevinAppelBofa

@peter-toth thank you for working on this. I was able to get a Spark 3.3.0-SNAPSHOT compiled and test the changes you made. I first ran both sample queries and those worked; then I ran the temp table query and that also works, and it was easy to split into the withClause and query. I am running into an issue getting the CTE query to run, though: I have tried to split it up a few ways but I keep getting the same error, which is below. I'm going to try to add a log warning to dump out the query it is trying to run to get the schema and see if I can get that to run directly in the SQL server. This was the issue I ran into originally: I was able to get the test CTE to work, and doing a $CTEQUERY where 1=0; was working, but in this more complex CTE I can't find a spot to add the 1=0 to get back only a schema.

py4j.protocol.Py4JJavaError: An error occurred while calling o85.load.
: com.microsoft.sqlserver.jdbc.SQLServerException: Incorrect syntax near the keyword 'WITH'.
    at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:262)
    at com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1632)
    at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement.java:602)
    at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement$PrepStmtExecCmd.doExecute(SQLServerPreparedStatement.java:524)
    at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:7418)
    at com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:3272)
    at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:247)
    at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:222)
    at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.executeQuery(SQLServerPreparedStatement.java:446)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.getQueryOutputSchema(JDBCRDD.scala:69)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:59)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:240)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:36)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:227)
    at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:209)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:209)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:170)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.lang.Thread.run(Thread.java:748)

@KevinAppelBofa

@peter-toth I was able to get the CTE query to work using the split method you have. It took a little trial and error to find the right place to split, but it produces the same results as the other method that uses the useRawQuery option, and appears to have a similar run time.

@attilapiros
Contributor

attilapiros commented Dec 2, 2021

Unfortunately, it is non-trivial to split an arbitrary query into "with" and "regular" clauses in MsSqlServerDialect.

Could you please show us some example queries where the split is non-trivial?

@peter-toth
Contributor Author

Unfortunately, it is non-trivial to split an arbitrary query into "with" and "regular" clauses in MsSqlServerDialect.

Could you please show us some example queries where the split is non-trivial?

For example:

  WITH t AS (SELECT x FROM tbl), t2 AS (SELECT y FROM tbl2) SELECT * FROM t WHERE x > (SELECT max(y) FROM t2)

You need to split this into WITH t AS (SELECT x FROM tbl), t2 AS (SELECT y FROM tbl2) and SELECT * FROM t WHERE x > (SELECT max(y) FROM t2). Obviously, there can be more brackets in the query, and some might not be paired/closed if they appear in strings, like: t2 AS (SELECT y, ')' AS dummy FROM tbl2). We could use ANTLR to parse the query just like we use it to parse Spark SQL statements, but I'm not sure it is worth it...
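
To make the failure mode concrete, here is a hypothetical naive splitter (not part of this PR) that counts parentheses to find the first top-level SELECT:

def splitCte(sql: String): (String, String) = {
  var depth = 0
  var i = 0
  while (i < sql.length) {
    sql(i) match {
      case '(' => depth += 1
      case ')' => depth -= 1
      case _ if depth == 0 && sql.regionMatches(true, i, "SELECT", 0, 6) =>
        // Assume the first SELECT at nesting depth 0 ends the WITH clause.
        return (sql.substring(0, i), sql.substring(i))
      case _ =>
    }
    i += 1
  }
  ("", sql)
}

On t2 AS (SELECT y, ')' AS dummy FROM tbl2) the ')' inside the string literal unbalances the count, so the split lands in the wrong place; handling quoting correctly is exactly where a real parser becomes necessary.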

@attilapiros
Contributor

We could use ANTLR to parse the query just like we use it to parse Spark SQL statements but I'm not sure it is worth it...

I agree, that would be too much here, and without complex parsing logic we might fail at the splitting.

@KevinAppelBofa

KevinAppelBofa commented Dec 2, 2021

@attilapiros Finding where the WITH piece stops and where the SELECT begins is how I found the place to split. In the test query

query2 = """
WITH DummyCTE AS
(
SELECT 1 as DummyCOL
)
SELECT *
FROM DummyCTE
"""

This splits into

withClause = """
WITH DummyCTE AS
(
SELECT 1 as DummyCOL
)

"""
query = """
SELECT *
FROM DummyCTE
"""

The actual query we are running is more complex, a bunch of chained WITHs together. In that one I took the same approach: I found where the actual WITH part ends, stuck that into the withClause, and put the rest into the query.

The same technique works for the temp table query: split it up so that the part generating the temp table goes into the withClause and the rest goes into the query.

query3 = """
(SELECT *
INTO #Temp1a
FROM
(SELECT @@VERSION as version) data
)

(SELECT *
FROM
#Temp1a)
"""

Turns into

withClause = """
(SELECT *
INTO #Temp1a
FROM
(SELECT @@VERSION as version) data
)
"""

query = """
(SELECT *
FROM
#Temp1a)
"""

@attilapiros
Contributor

@attilapiros Finding where the WITH piece stops and where the SELECT begins is how I found the place to split.

Thanks, I see that. But it is still really hard to automate. So I think what Peter came up with is the best we have right now.

I was thinking about how to avoid SELECT * FROM $table WHERE 1=0. One of my ideas was simply replacing every SELECT (ignoring case) with SELECT top(0), as that could be done even inside string literals without changing the schema. But if top was already used somewhere, this ugly hack fails. And this is just one part of the problem of getting the schema without running the query. The other part (in JDBCRDD.compute()), where the partitioning and pushed-down group by are handled, is even harder to crack.
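
A sketch of that rejected idea (hypothetical, never implemented):

def topZeroHack(sql: String): String =
  sql.replaceAll("(?i)\\bSELECT\\b", "SELECT top(0)")

// WITH t AS (SELECT x FROM tbl) SELECT * FROM t becomes
// WITH t AS (SELECT top(0) x FROM tbl) SELECT top(0) * FROM t,
// which returns zero rows but the right schema. However, a query that already
// uses TOP, e.g. SELECT top(5) x FROM tbl, would become the invalid
// SELECT top(0) top(5) x FROM tbl.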

So based on this LGTM.

cc @viirya, @HyukjinKwon

Contributor

@sumeetgajjar sumeetgajjar left a comment

Thanks for incorporating the suggestions.

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Mar 13, 2022
@github-actions github-actions bot closed this Mar 14, 2022
@WaterKnight1998

Any news here?

@peter-toth
Contributor Author

I rebased the PR on top of master and already pushed it to my branch, but for some reason I can't reopen the PR.

@HyukjinKwon, @allisonwang-db, @cloud-fan could you please take a look and judge whether it makes sense to reopen this PR?

@cloud-fan
Contributor

I'm not sure what the actual requirement is. To use the JDBC option that directly passes a SQL string to a JDBC database, we must make sure the SQL is fully supported by the database. Or people should run the SQL with Spark and only read/write JDBC tables.

If we only want to run a custom SQL statement before running the query, can we use the sessionInitStatement option?
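
For context, sessionInitStatement executes a custom statement on each opened connection before reading data; a minimal usage sketch (jdbcUrl and the init statement are placeholders):

val df = spark.read.format("jdbc")
  .option("url", jdbcUrl)
  .option("sessionInitStatement", "SET NOCOUNT ON") // runs separately, once per connection
  .option("query", "SELECT * FROM tbl")
  .load()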

@KevinAppelBofa

@cloud-fan the requirement is to be able to use Spark JDBC to access Microsoft SQL Server and use features that are unique to SQL Server, such as temp tables or common table expressions (WITH statements). Oracle and MySQL also support CTEs, but their dialects allow the statement to begin with a SELECT.
SQL Server only accepts a CTE one way, starting with WITH, and because of how Spark wraps the query this is what causes the issue. I had opened a case with Microsoft and got to their Spark team, but they were not able to provide any feedback or commits on how to fix Spark to handle this.

This fix that Peter created allows both of these items to work, the CTE query and also the temp table query. In both cases it is very difficult to split the SQL into parts to place into sessionInitStatement versus being able to run the query as is.

The sample queries look like:
query2 = """
WITH DummyCTE AS
(
SELECT 1 as DummyCOL
)
SELECT *
FROM DummyCTE
"""

query3 = """
(SELECT *
INTO #Temp1a
FROM
(SELECT @@Version as version) data
)
(SELECT *
FROM
#Temp1a)
"""

@peter-toth
Contributor Author

peter-toth commented Apr 27, 2022

I think the requirement is to be able to use CTE queries with JDBC sources on MSSQL.
Currently it doesn't work because Spark wraps the original query into a SELECT statement (e.g. when it queries the schema, Spark wraps the query into SELECT * FROM (<query>) WHERE 1=0), and this construct is not supported by MSSQL.

  • sessionInitStatement is a separate statement, so you can't put the WITH clause there and keep the SELECT clause in query.
  • We could put the whole CTE query into sessionInitStatement as CREATE TEMPORARY VIEW v AS <query> and use SELECT * FROM v in query, but temporary views are also not supported by MSSQL.
  • We could improve Spark to identify CTE queries and assemble the schema query in a way that is compatible with MSSQL, but splitting an arbitrary query into WITH and SELECT clauses programmatically is not that simple.
  • This PR offers a new withClause option where the user can split the query manually. (I should probably call it queryPrefix as it also works with MSSQL's temp table syntax.)

@cloud-fan
Contributor

Why does Spark JDBC source issue SELECT * FROM (<query>) WHERE 1=0 instead of simply <query>? Sorry I'm not very familiar with the details. cc @huaxingao @beliefer

@huaxingao
Contributor

@cloud-fan
When JDBC resolves a relation, it needs to get the schema of the relation first. JDBC uses this query

s"SELECT * FROM $table WHERE 1=0"

to discover the schema of the relation. In the user's case, the table is WITH t AS (SELECT x, y FROM tbl) SELECT * FROM t WHERE x > 10 so the query to get the schema is

SELECT * FROM (WITH t AS (SELECT x, y FROM tbl) SELECT * FROM t WHERE x > 10) WHERE 1=0

but MSSQL server doesn't support the syntax SELECT * FROM (WITH t AS (...) SELECT ... FROM t) WHERE 1=0.
This PR offers a withClause option so the user can split the query manually. The query will be changed to the following after the split:

WITH t AS (SELECT x, y FROM tbl) SELECT * FROM (SELECT * FROM t WHERE x > 10) WHERE 1=0
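
Under the hood, the probe only needs the ResultSet metadata, not any rows; a simplified sketch of that flow using plain JDBC (not Spark's exact code; jdbcUrl is a placeholder):

import java.sql.DriverManager

val conn = DriverManager.getConnection(jdbcUrl)
val stmt = conn.prepareStatement("SELECT * FROM tbl WHERE 1=0")
val rs = stmt.executeQuery() // returns immediately with zero rows
val meta = rs.getMetaData    // column names and types are still available
(1 to meta.getColumnCount).foreach { i =>
  println(s"${meta.getColumnName(i)}: ${meta.getColumnTypeName(i)}")
}
rs.close(); stmt.close(); conn.close()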

@cloud-fan
Contributor

Can we send WITH t AS (SELECT x, y FROM tbl) SELECT * FROM t WHERE x > 10 directly to the database and get the schema?

@beliefer
Contributor

beliefer commented Apr 28, 2022

I investigated the behavior of PostgreSQL.
WITH t AS (select dept, name, salary from employee) SELECT * FROM t; works well.
SELECT * FROM (WITH t AS (select dept, name, salary from employee) SELECT * FROM t) WHERE 1=0; fails.
The option "withClause" looks strange. If so, why not use WITH t AS (SELECT x, y FROM tbl) SELECT * FROM t WHERE x > 10 directly?

@beliefer
Contributor

Can we send WITH t AS (SELECT x, y FROM tbl) SELECT * FROM t WHERE x > 10 directly to the database and get the schema?

+1

@beliefer
Contributor

The test case shown below works well.

  test("investigate ctes") {
    checkAnswer(sql("WITH t AS (SELECT id, name FROM h2.test.empty_table) SELECT * FROM t"), Seq())
    checkAnswer(sql("WITH t AS (SELECT id, name FROM h2.test.people) SELECT * FROM t"),
      Seq(Row(1, "fred"), Row(2, "mary")))
  }

@peter-toth
Contributor Author

Why does Spark JDBC source issue SELECT * FROM (<query>) WHERE 1=0 instead of simply <query>?

Because that way we let the MSSQL (or other database's) optimizer kick in and return an empty result set with the schema very quickly.

Can we send WITH t AS (SELECT x, y FROM tbl) SELECT * FROM t WHERE x > 10 directly to the database and get the schema?

Well, we could do that at the cost of losing the above optimization, but besides the "schema query", Spark also wraps the original query in other places. For example, when the query is actually executed: https://github.com/apache/spark/pull/34693/files#diff-ecf5b374060c1222d3a0a1295b4ec2cb5d07603db460273484b1753e1cab9f90L370-L371 so that JDBC sources can support different pushdowns and partitioning.
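
Roughly how the executed query gets assembled (a simplified sketch with placeholder values, not Spark's exact internals):

val columnList = "x, y"                                  // pruned columns
val tableOrQuery = "(SELECT * FROM t WHERE x > 10) sub"  // hypothetical alias
val whereClause = "WHERE x < 100"                        // pushed-down filters / partition predicate
val sqlText = s"SELECT $columnList FROM $tableOrQuery $whereClause"
// If tableOrQuery contained a WITH clause, it would again end up nested inside
// FROM (...), which MSSQL rejects; prepending the user-supplied withClause to
// sqlText avoids this.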

@cloud-fan
Contributor

Makes sense. I'm convinced; let's add something like a prepareQuery JDBC option, with clear documentation.

@peter-toth
Contributor Author

Sounds good, I can do it early next week.
How can I reopen this PR? Or shall I open a new one?

@cloud-fan cloud-fan removed the Stale label Apr 28, 2022
@cloud-fan
Contributor

I can't reopen the PR; I think we need to create a new one.

@peter-toth
Contributor Author

I opened a new PR here: #36440
