-
Notifications
You must be signed in to change notification settings - Fork 28k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-27209][SQL] Split parsing of SELECT and INSERT into two top-level rules in the grammar file. #24150
Conversation
Test build #103708 has finished for PR 24150 at commit
|
retest this please |
: insertInto? queryTerm queryOrganization #singleInsertQuery | ||
| fromClause multiInsertQueryBody+ #multiInsertQuery | ||
: queryTerm queryOrganization #noWithQuery | ||
| fromClause querySpecification queryOrganization #queryWithFrom |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where does this come from?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan This is required to parse statement of the form code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why isn't it covered by insertStatement
? We have (ctes)? fromClause multiInsertQueryBody+
there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan Actually this is sort of tricky.. I initially didn't have this rule thinking insertStatement should take care of this. But it didn't .
FROM (FROM TEST SELECT TRANSFORM(KEY, VALUE) USING 'CAT' AS (`THING1` INT, THING2 STRING)) T SELECT THING1 + 1
In this case, we have nested from clause. and the inner from is handled by rule aliasedQuery
. If we see the definition, it calls queryNoWith
and we resolve the nested FROM along with querySpecification(the transform query) by using queryNoWith:queryWithFrom rule.
val sql1 = | ||
""" | ||
|CREATE VIEW testView AS FROM jt | ||
|INSERT INTO tbl1 SELECT * WHERE jt.id < 5 " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
INSERT INTO tbl1 SELECT * WHERE jt.id < 5 " + INSERT INTO tbl2 SELECT * WHERE jt.id > 4
what are we testing for it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan oops.. will fix.
Test build #103713 has finished for PR 24150 at commit
|
Test build #103715 has finished for PR 24150 at commit
|
@@ -354,9 +355,14 @@ resource | |||
: identifier STRING | |||
; | |||
|
|||
insertStatement | |||
: (ctes)? insertInto queryTerm queryOrganization #singleInsertQuery | |||
| (ctes)? fromClause multiInsertQueryBody+ #multiInsertQuery |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we change the multiInsertQueryBody
and make the insertInto
non-optional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan did you mean make insertInto non-optional in multiInsertQueryBody
? If so, i did try it and we need it to be optional to parse statement referenced in test
Also , we we can have a mixture of insert and selects in the multiInsertQueryBody. Thats why we have it under a "multi insert" rule where insert is optional. Not sure, but i think multiStatementBody
may be a better name for the rule.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we still need https://github.com/apache/spark/pull/24150/files#diff-8c1cb2af4aa1109e08481dae79052cc3R365 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan If you don't mind, could you please paste the block or line here ? When i click on that link, it shows me pretty much all the changes so i am not sure which one i should focus on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dilipbiswal . He means line 365. And, that's the same question with the below comment, https://github.com/apache/spark/pull/24150/files#r267283726.
| fromClause querySpecification queryOrganization #queryWithFrom
Line 365 is required for FROM ... SELECT ...
statement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dongjoon-hyun Thanks for helping out. When i clicked.. it took me to AstBuilder ... so i was confused.
@cloud-fan So we need the rule to parse statement i mentioned earlier.
from (from test select transform(....) using .. as ...) T select ...
It is parsed sort of recursively in the queryNoWith
rule .. calling queryNoWith
from relationPrimary:aliasedQuery
rule.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fromClause multiInsertQueryBody+
-> fromClause (insertInto? querySpecification queryOrganization)+
-> fromClause (querySpecification queryOrganization)+
and in queryNoWith
, we have fromClause querySpecification queryOrganization
. Seems we should change it to fromClause (querySpecification queryOrganization)+
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan Currently after this change, fromClause and multiple querySpecification queryOrganization will be handled by 2nd alternative of the insertStatement. I know that this may look confusing. In this pass, i wanted to handle single queries in one rule which is query
. This would help us catch quite a few errors i.e when a user could plugin an insert in any part of the query. The rest is currently handled by insertStatement
which includes multi insert, multi select etc. Please let me know what you think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan instead of name insertStatement
, should we change to insertOrMultiQueries
to clear the confusion ?
queryNoWith | ||
: insertInto? queryTerm queryOrganization #singleInsertQuery | ||
| fromClause multiInsertQueryBody+ #multiInsertQuery | ||
: queryTerm queryOrganization #noWithQuery |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, queryNoWith
has these two patterns. And both don't have With
. Can we revise this name #noWithQuery
more narrowly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dongjoon-hyun would you have any suggestion ? I was thinking selectQuery
but this matches select, table query, inline table etc. So a little confused :-)
@@ -754,4 +754,29 @@ class PlanParserSuite extends AnalysisTest { | |||
assertEqual(query2, Distinct(a.union(b)).except(c.intersect(d, isAll = true), isAll = true)) | |||
} | |||
} | |||
|
|||
test("create/alter view as insert into table") { | |||
intercept[ParseException](parsePlan("CREATE VIEW testView AS INSERT INTO jt VALUES(1, 1)")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ur, is this a behavior change? Previously, this seems to be My bad. It was the same AnalysisException
with Operation not allowed: CREATE VIEW ... AS INSERT INTO
.ParseException
.
BTW, let's check the exception message always.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dongjoon-hyun For parser exceptions, we would have list of expected tokens which may change in future to make this a little fragile (we may need to update the error message frequently) ? Thats why i didn't check the exact message text here.
if you want it checked, would it be better to put it in SQlQueryTestSuite (perhaps easy to regenerate the output file) ? Let me know what you think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, that means a subtle behavior change. There is no more kind message like Operation not allowed:
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dongjoon-hyun Let me paste a sample error from the describe query test.
DESCRIBE
FROM desc_temp1 a
insert into desc_temp1 select *
insert into desc_temp2 select *
-- !query 13 schema
struct<>
-- !query 13 output
org.apache.spark.sql.catalyst.parser.ParseException
mismatched input 'insert' expecting {<EOF>, '(', ',', 'ANTI', 'CLUSTER', 'CROSS', 'DISTRIBUTE', 'EXCEPT', 'FULL', 'GROUP', 'HAVING', 'INNER', 'INTERSECT', 'JOIN', 'LATERAL', 'LEFT', 'LIMIT', 'NATURAL', 'ORDER', 'PIVOT', 'RIGHT', 'SELECT', 'SEMI', 'MINUS', 'SORT', 'UNION', 'WHERE', 'WINDOW'}(line 3, pos 5)
I am thinking, this is coming from antlr itself as it knows the expected grammar ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dongjoon-hyun I will do a contains and look for "mismatched input" like we do in test "select hint syntax". hopefully its ok.
|INSERT INTO tbl2 SELECT * WHERE jt.id > 4 | ||
""".stripMargin | ||
intercept[ParseException](parsePlan(sql1)) | ||
intercept[ParseException](parsePlan("ALTER VIEW testView AS INSERT INTO jt VALUES(1, 1)")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto.
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/PlanParserSuite.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/PlanParserSuite.scala
Outdated
Show resolved
Hide resolved
Test build #103726 has finished for PR 24150 at commit
|
Test build #103742 has finished for PR 24150 at commit
|
: insertInto? queryTerm queryOrganization #singleInsertQuery | ||
| fromClause multiInsertQueryBody+ #multiInsertQuery | ||
: queryTerm queryOrganization #noWithQuery | ||
| fromClause querySpecification queryOrganization #queryWithFrom |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously there can be multiple multiInsertQueryBody
. Looks like there will be an Union for the inserts. Because the insertInto
is optional, I think it can a fromClause
and multiple querySpecification queryOrganization
. But looks like the new rule doesn't allow it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@viirya Currently after this change, fromClause
and multiple querySpecification queryOrganization
will be handled by 2nd alternative of the insertStatement
rule ?
override def visitQueryWithFrom(ctx: QueryWithFromContext): LogicalPlan = withOrigin(ctx) { | ||
val from = visitFromClause(ctx.fromClause) | ||
validate(ctx.querySpecification.fromClause == null, | ||
"Script transformation queries cannot have a FROM clause in their" + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am I missing anything, is this related to script transformation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@viirya First of all you are absolutely right. I could use your suggestion here. So what this is guarding against is of query of following form :
FROM foo select c1 from foo
So this has nothing to do with script transformation :-). I don't know what to call this kind of query as ? Should we say "Individual select statement can not have FROM cause as its already specified in the outer query block" ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Individual select statement can not have FROM cause as its already specified in the outer query block
SGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan Thank you. will change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @viirya
@@ -354,9 +355,14 @@ resource | |||
: identifier STRING | |||
; | |||
|
|||
insertStatement | |||
: (ctes)? insertInto queryTerm queryOrganization #singleInsertQuery | |||
| (ctes)? fromClause multiInsertQueryBody+ #multiInsertQuery |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also think why we keep insertInto
as optional in multiInsertQueryBody
after this change, if we want to split select and insert to individual rule.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We cannot write it like this?
insertStatement
: ...
| (ctes)? fromClause insertInto someName (insertInto? someName) #multiInsertQuery
;
someName
: querySpecification queryOrganization
;
@@ -354,9 +355,14 @@ resource | |||
: identifier STRING | |||
; | |||
|
|||
insertStatement | |||
: (ctes)? insertInto queryTerm queryOrganization #singleInsertQuery |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you move (ctes)?
to the line 80 like | (ctes)? insertStatement
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maropu then i may have to implement/override another visit method to apply the cte to an insert statement, okay ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the duplicate patten?
queryToDesc
: queryTerm queryOrganization
;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maropu Thats right. After this PR is in, i will have a follow-up to allow CTEs for DESCRIBE QUERY (currently it is a limitation) and that would remove this pattern all together (i hope :-) ).
@cloud-fan @viirya The only test that fails after making
AND
Given only the planParseSuite is affected, my thinking is , this multi select support is NOT really something we were designing to support. What we wanted to support is a |
I don't think the multi-select is valid syntax, what is the expected result? And what we get if we run a multi-select in real? Do we get an exception? |
@cloud-fan It works :-) .. here is the output snippet :
It basically turns into :
What i don't know is if it is designed to work or its simply a side effect of supporting multi insert. |
@cloud-fan and the error if we specify the from clause on a multi select query to add to the confusion :-)
Please note the reference to "insert" .. even though there is no insert in the query. |
damn, then this is something we can't break, even if it's not designed this way. Can we create a special parser rule for it? Also add a comment to say that we support it for history reasons. |
@cloud-fan Sure. will do. |
e25f2d0
to
f7aaebd
Compare
I knew for compatibility we usually don't break such thing. But for a major version like 3.0, can't we break it? |
Test build #103846 has finished for PR 24150 at commit
|
@cloud-fan Please let me know if this looks ok now. Thank you. |
thanks, merging to master! |
Thank you very much @cloud-fan @viirya @maropu @dongjoon-hyun |
…vel rules in the grammar file. Currently in the grammar file the rule `query` is responsible to parse both select and insert statements. As a result, we need to have more semantic checks in the code to guard against in-valid insert constructs in a query. Couple of examples are in the `visitCreateView` and `visitAlterView` functions. One other issue is that, we don't catch the `invalid insert constructs` in all the places until checkAnalysis (the errors we raise can be confusing as well). Here are couple of examples : ```SQL select * from (insert into bar values (2)); ``` ``` Error in query: unresolved operator 'Project [*]; 'Project [*] +- SubqueryAlias `__auto_generated_subquery_name` +- InsertIntoHiveTable `default`.`bar`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, false, false, [c1] +- Project [cast(col1#18 as int) AS c1#20] +- LocalRelation [col1#18] ``` ```SQL select * from foo where c1 in (insert into bar values (2)) ``` ``` Error in query: cannot resolve '(default.foo.`c1` IN (listquery()))' due to data type mismatch: The number of columns in the left hand side of an IN subquery does not match the number of columns in the output of subquery. Left side columns: [default.foo.`c1`]. Right side columns: [].;; 'Project [*] +- 'Filter c1#6 IN (list#5 []) : +- InsertIntoHiveTable `default`.`bar`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, false, false, [c1] : +- Project [cast(col1#7 as int) AS c1#9] : +- LocalRelation [col1#7] +- SubqueryAlias `default`.`foo` +- HiveTableRelation `default`.`foo`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#6] ``` For both the cases above, we should reject the syntax at parser level. In this PR, we create two top-level parser rules to parse `SELECT` and `INSERT` respectively. I will create a small PR to allow CTEs in DESCRIBE QUERY after this PR is in. Added tests to PlanParserSuite and removed the semantic check tests from SparkSqlParserSuites. Closes apache#24150 from dilipbiswal/split-query-insert. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…vel rules in the grammar file. Currently in the grammar file the rule `query` is responsible to parse both select and insert statements. As a result, we need to have more semantic checks in the code to guard against in-valid insert constructs in a query. Couple of examples are in the `visitCreateView` and `visitAlterView` functions. One other issue is that, we don't catch the `invalid insert constructs` in all the places until checkAnalysis (the errors we raise can be confusing as well). Here are couple of examples : ```SQL select * from (insert into bar values (2)); ``` ``` Error in query: unresolved operator 'Project [*]; 'Project [*] +- SubqueryAlias `__auto_generated_subquery_name` +- InsertIntoHiveTable `default`.`bar`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, false, false, [c1] +- Project [cast(col1#18 as int) AS c1#20] +- LocalRelation [col1#18] ``` ```SQL select * from foo where c1 in (insert into bar values (2)) ``` ``` Error in query: cannot resolve '(default.foo.`c1` IN (listquery()))' due to data type mismatch: The number of columns in the left hand side of an IN subquery does not match the number of columns in the output of subquery. Left side columns: [default.foo.`c1`]. Right side columns: [].;; 'Project [*] +- 'Filter c1#6 IN (list#5 []) : +- InsertIntoHiveTable `default`.`bar`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, false, false, [c1] : +- Project [cast(col1#7 as int) AS c1#9] : +- LocalRelation [col1#7] +- SubqueryAlias `default`.`foo` +- HiveTableRelation `default`.`foo`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#6] ``` For both the cases above, we should reject the syntax at parser level. In this PR, we create two top-level parser rules to parse `SELECT` and `INSERT` respectively. I will create a small PR to allow CTEs in DESCRIBE QUERY after this PR is in. Added tests to PlanParserSuite and removed the semantic check tests from SparkSqlParserSuites. Closes apache#24150 from dilipbiswal/split-query-insert. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Apologies on crashing the party on this closed PR, but my question relates directly to the parsing of INSERT statements: Related to the ANSI SQL specification for insert syntax, could the parsing and underlying engine support the syntax of: I think I read somewhere that there's some underlying technical detail where the columns inserted into SPARK tables must have the selected columns match the order of the table definition. But, if this is the case, isn't' there a place in the parser-layer and execution-layer where the parser can translate something like:
Where someTable has 3 columns (col3,col2,col1) (note the order here), the query is rewritten and sent to the engine as:
Note, the reordering and adding of the null column was done based on some table metadata on someTable so it knew which columns from the INSERT() map over to the columns from the select. Is this possible? The lack of specifying column values is preventing our project from SPARK being a supported platform. |
This is unrelated to this PR but is a reasonable feature request. One missing part is the default column value, we should support it as well instead of always using null as the default value. It is still valuable to support this SQL syntax so that we can reorder the columns during insertion. Feel free to open a JIRA ticket/PR! |
I remember we already have the tickets for the related topics, e.g., https://issues.apache.org/jira/browse/SPARK-21548 (#18756) and https://issues.apache.org/jira/browse/SPARK-20845 (#22532). Could you check them first? If there is a difference between your suggestion and these tickets, please file a jira! |
You are correct, apologies I didn't know about the jira issue tracking that was separate from github. You are correct that those issues do cover what I'm talking about here. The closest PR was this one: #22532 but has been closed to inactivity. So, sadface on that. |
What changes were proposed in this pull request?
Currently in the grammar file the rule
query
is responsible to parse both select and insert statements. As a result, we need to have more semantic checks in the code to guard against in-valid insert constructs in a query. Couple of examples are in thevisitCreateView
andvisitAlterView
functions. One other issue is that, we don't catch theinvalid insert constructs
in all the places until checkAnalysis (the errors we raise can be confusing as well). Here are couple of examples :For both the cases above, we should reject the syntax at parser level.
In this PR, we create two top-level parser rules to parse
SELECT
andINSERT
respectively.I will create a small PR to allow CTEs in DESCRIBE QUERY after this PR is in.
How was this patch tested?
Added tests to PlanParserSuite and removed the semantic check tests from SparkSqlParserSuites.