[SPARK-37259][SQL] Support CTE and TempTable queries with MSSQL JDBC #34693
Conversation
Kubernetes integration test starting
Kubernetes integration test status failure
This change also seems to work with MSSQL's temp table syntax.
Test build #145548 has finished for PR 34693 at commit
Hmm, failures in
Since it also works with temp table syntax, do you think it would be a good idea to include it in the title of this PR and modify the title to "Support CTE and TempTable queries with MSSQL JDBC"?
```diff
@@ -325,6 +325,6 @@ private[sql] case class JDBCRelation(
   override def toString: String = {
     val partitioningInfo = if (parts.nonEmpty) s" [numPartitions=${parts.length}]" else ""
     // credentials should not be included in the plan output, table information is sufficient.
-    s"JDBCRelation(${jdbcOptions.tableOrQuery})" + partitioningInfo
+    s"JDBCRelation(${jdbcOptions.withClause}${jdbcOptions.tableOrQuery})" + partitioningInfo
```
Since partitioningInfo is also a string, should we create the final string as s"JDBCRelation(${jdbcOptions.withClause}${jdbcOptions.tableOrQuery})$partitioningInfo"?
Fixed in b53ef47
```scala
      .option("dbtable", dbtable)
      .load()
    assert(df.collect.toSet === expectedResult)
  }
```
Since it already works with temp table syntax, should we also add the corresponding UTs?
Ok. Added in dea730a
Thanks, makes sense. I've modified the PR title and description and added a new test in dea730a. Actually, I wonder if it makes sense to rename
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #145572 has finished for PR 34693 at commit
Test build #145571 has finished for PR 34693 at commit
@peter-toth thank you for working on this. I was able to get a Spark 3.3.0-SNAPSHOT compiled and test your changes. I ran both of the sample queries first and those worked; then I ran the temp table query and that also works. That one was easy to split into the withClause and the query. I am running into an issue, though, getting the CTE query to run. I have tried to split it up a few ways but I keep getting the same error, which is below. I'm going to try adding a log warning to dump the query it is trying to run to get the schema, and see if I can run that directly against the SQL Server. This was the issue I ran into originally: I was able to get the test CTE to work, and doing a `$CTEQUERY where 1=0;` was working, but in this more complex CTE I can't find a spot to add the `1=0` to get only a schema back.
@peter-toth I was able to get the CTE query to work using the split method you have. It took a little trial and error to find the right place to split, and it produces the same results as the other method that uses the useRawQuery option, with what appears to be a similar run time.
Could you please show us some example queries where the split is non-trivial?
For example:
You need to split this to
I agree that would be too much here, and without complex parsing logic we might fail at the splitting.
@attilapiros Finding where the WITH piece stops and where the SELECT begins is how I found the place to split. In the test query:

```scala
query2 = """
  WITH DummyCTE AS
  (
    SELECT 1 as DummyCOL
  )
  SELECT *
  FROM DummyCTE
  """
```

This splits into:

```scala
withClause = """
  WITH DummyCTE AS
  (
    SELECT 1 as DummyCOL
  )
  """
query = """
  SELECT *
  FROM DummyCTE
  """
```

The actual query we are running is more complex, a bunch of WITHs chained together. There I took the same approach: everything up to where the WITH part ends goes into the withClause, and the rest goes into the query.

The same technique works for the temp table query; the part generating the temp table goes into the withClause and the rest goes into the query:

```scala
query3 = """
  (SELECT *
  INTO #Temp1a
  FROM
  (SELECT @@VERSION as version) data
  )
  (SELECT *
  FROM
  #Temp1a)
  """
```

This turns into:

```scala
withClause = """
  (SELECT *
  INTO #Temp1a
  FROM
  (SELECT @@VERSION as version) data
  )
  """
query = """
  (SELECT *
  FROM
  #Temp1a)
  """
```
Thanks, I see that. But it is still really hard to automate. So I think what Peter came up with is the best we have right now. I was thinking about how to avoid
So based on this, LGTM. cc @viirya, @HyukjinKwon
Thanks for incorporating the suggestions.
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way; it's just a way of keeping the PR queue manageable.
Any news here?
I rebased the PR on top of
@HyukjinKwon, @allisonwang-db, @cloud-fan could you please take a look and judge whether it makes sense to reopen this PR.
I'm not sure what the actual requirement is. To use the JDBC option that directly passes a SQL string to a JDBC database, we must make sure the SQL is fully supported by the database. Or people should run the SQL with Spark and only read/write JDBC tables. If we only want to run a custom SQL statement before running the query, can we use the
@cloud-fan the requirement is to be able to use the Spark JDBC source to access Microsoft SQL Server and use features that are unique to SQL Server, such as temp tables or common table expressions (WITH statements). Oracle and MySQL also support CTEs, but their syntax allows the query to begin with a SELECT statement. The fix that Peter created allows both of these to work, the CTE query and also the temp table query; in both cases it is very difficult to split the SQL into parts to place into sessionInitStatement versus running the query as is. The sample queries look like query3 = """
I think the requirement is to be able to use CTE queries with JDBC sources on MSSQL.
Why does the Spark JDBC source issue
@cloud-fan Spark wraps the query as SELECT * FROM (query) WHERE 1=0 to discover the schema of the relation. In the user's case, the table is a CTE (WITH) query, but MSSQL server doesn't support this syntax.
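To make the failure mode concrete, here is a minimal sketch of the default schema-discovery wrapping as plain string manipulation (a simplification; Spark's actual logic lives in `JdbcDialect.getSchemaQuery`, and the subquery alias below is made up):

```python
def schema_query(table_or_query: str) -> str:
    # Default dialect behavior: wrap the table or subquery and filter
    # out all rows, so the database returns only the result schema.
    return f"SELECT * FROM {table_or_query} WHERE 1=0"

# A plain table name works on every database:
schema_query("employees")  # -> "SELECT * FROM employees WHERE 1=0"

# But when the "table" is a subquery that starts with WITH, MSSQL
# rejects the result, because T-SQL does not allow a CTE inside a
# derived table (alias "q" here is illustrative):
schema_query("(WITH t AS (SELECT 1 c) SELECT * FROM t) q")
```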
Can we use
I investigated the behavior of PostgreSQL.
+1 |
The case shown below works well.
Because that way we can let the MSSQL (or other database's) optimizer kick in and return an empty result set with the schema very quickly.
Well, we could do that at the cost of losing the above optimization. But besides the "schema query", Spark also wraps the original query in other places, for example when the query is actually executed: https://github.com/apache/spark/pull/34693/files#diff-ecf5b374060c1222d3a0a1295b4ec2cb5d07603db460273484b1753e1cab9f90L370-L371 so that JDBC sources can support different pushdowns and partitioning.
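The execution-time wrapping mentioned here can be sketched as follows (a simplified illustration, not Spark's actual code; the real logic is in `JDBCRDD.compute`, and the names here are made up). It shows why the user's query cannot keep its WITH clause: the whole query becomes the FROM part of a generated statement:

```python
def build_partition_query(columns: list[str], table_or_query: str,
                          predicates: list[str]) -> str:
    # Spark pushes column pruning and partition predicates down by
    # wrapping the user's table/subquery in a generated SELECT. A WITH
    # clause cannot survive inside this wrapper on MSSQL, hence the
    # separate withClause option proposed in this PR.
    column_list = ", ".join(columns) if columns else "1"
    where = f" WHERE {' AND '.join(predicates)}" if predicates else ""
    return f"SELECT {column_list} FROM {table_or_query}{where}"

build_partition_query(["a", "b"], "(SELECT * FROM t) q",
                      ["id >= 0", "id < 10"])
# -> "SELECT a, b FROM (SELECT * FROM t) q WHERE id >= 0 AND id < 10"
```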
Makes sense, I'm convinced to add something like
Sounds good, I can do it early next week.
I can't reopen the PR, I think we need to create a new one.
I opened a new PR here: #36440 |
What changes were proposed in this pull request?
Currently, CTE queries from Spark are not supported with MSSQL server via JDBC. This is because MSSQL server doesn't support the nested CTE syntax (SELECT * FROM (WITH t AS (...) SELECT ... FROM t) WHERE 1=0) that Spark builds from the original query (options.tableOrQuery) in JDBCRDD.resolveTable() and in JDBCRDD.compute().
Unfortunately, it is non-trivial to split an arbitrary query into "with" and "regular" clauses in MsSqlServerDialect. So instead, I'm proposing a new general JDBC option, "withClause", that users can use if they have complex queries with CTEs.
This change also works with MSSQL's temp table syntax.
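The mechanics of the proposed option can be sketched with a small helper (this is an illustration, not the PR's actual code; the function and the subquery alias are made up). Keeping the WITH clause at the top level and wrapping only the regular part of the query produces valid T-SQL:

```python
def build_mssql_queries(with_clause: str, query: str) -> tuple[str, str]:
    """Sketch of prepending a user-supplied withClause (illustrative).

    Instead of nesting WITH inside a subquery (which MSSQL rejects),
    the WITH clause stays at the top level and only the regular query
    is wrapped, both for schema discovery and for execution.
    """
    schema_query = f"{with_clause} SELECT * FROM ({query}) q WHERE 1=0"
    data_query = f"{with_clause} SELECT * FROM ({query}) q"
    return schema_query, data_query

schema_q, data_q = build_mssql_queries(
    "WITH DummyCTE AS (SELECT 1 as DummyCOL)",
    "SELECT * FROM DummyCTE",
)
# schema_q keeps the WITH ahead of the generated wrapper, so MSSQL
# accepts it and returns an empty result set carrying the schema.
```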
Why are the changes needed?
To support CTE queries with MSSQL.
Does this PR introduce any user-facing change?
Yes, CTE queries are supported from now on.
How was this patch tested?
Added new integration UTs.