
[SPARK-27209][SQL] Split parsing of SELECT and INSERT into two top-level rules in the grammar file. #24150

Closed
wants to merge 5 commits

Conversation

dilipbiswal
Contributor

@dilipbiswal commented Mar 20, 2019

What changes were proposed in this pull request?

Currently, in the grammar file, the rule query is responsible for parsing both select and insert statements. As a result, we need extra semantic checks in the code to guard against invalid insert constructs in a query; a couple of examples are in the visitCreateView and visitAlterView functions. Another issue is that we don't catch the invalid insert constructs everywhere until checkAnalysis (and the errors we raise can be confusing as well). Here are a couple of examples:

select * from (insert into bar values (2));
Error in query: unresolved operator 'Project [*];
'Project [*]
+- SubqueryAlias `__auto_generated_subquery_name`
   +- InsertIntoHiveTable `default`.`bar`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, false, false, [c1]
      +- Project [cast(col1#18 as int) AS c1#20]
         +- LocalRelation [col1#18]
select * from foo where c1 in (insert into bar values (2))
Error in query: cannot resolve '(default.foo.`c1` IN (listquery()))' due to data type mismatch: 
The number of columns in the left hand side of an IN subquery does not match the
number of columns in the output of subquery.
#columns in left hand side: 1.
#columns in right hand side: 0.

Left side columns:
[default.foo.`c1`].
Right side columns:
[].;;
'Project [*]
+- 'Filter c1#6 IN (list#5 [])
   :  +- InsertIntoHiveTable `default`.`bar`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, false, false, [c1]
   :     +- Project [cast(col1#7 as int) AS c1#9]
   :        +- LocalRelation [col1#7]
   +- SubqueryAlias `default`.`foo`
      +- HiveTableRelation `default`.`foo`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#6]

For both the cases above, we should reject the syntax at parser level.

In this PR, we create two top-level parser rules to parse SELECT and INSERT respectively.
I will create a small PR to allow CTEs in DESCRIBE QUERY after this PR is in.
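Roughly, the split looks like the sketch below (rule and label names follow this PR's diff, but treat this as an illustration rather than the exact merged grammar):

```antlr
// Sketch only: plain SELECT queries enter through `query`,
// while inserts (and the legacy multi-insert form) enter through `insertStatement`.
query
    : ctes? queryNoWith
    ;

insertStatement
    : (ctes)? insertInto queryTerm queryOrganization    #singleInsertQuery
    | (ctes)? fromClause multiInsertQueryBody+          #multiInsertQuery
    ;
```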

How was this patch tested?

Added tests to PlanParserSuite and removed the semantic check tests from SparkSqlParserSuites.

@dilipbiswal
Contributor Author

dilipbiswal commented Mar 20, 2019

cc @cloud-fan @maropu @dongjoon-hyun

@SparkQA

SparkQA commented Mar 20, 2019

Test build #103708 has finished for PR 24150 at commit edb7440.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Contributor Author

retest this please

: insertInto? queryTerm queryOrganization #singleInsertQuery
| fromClause multiInsertQueryBody+ #multiInsertQuery
: queryTerm queryOrganization #noWithQuery
| fromClause querySpecification queryOrganization #queryWithFrom
Contributor

where does this come from?

Contributor Author

@cloud-fan This is required to parse statements of the form discussed below.

Contributor

Why isn't it covered by insertStatement? We have (ctes)? fromClause multiInsertQueryBody+ there

Contributor Author

@cloud-fan Actually this is sort of tricky. I initially didn't have this rule, thinking insertStatement would take care of it, but it didn't.

FROM (FROM TEST SELECT TRANSFORM(KEY, VALUE) USING 'CAT' AS (`THING1` INT, THING2 STRING)) T SELECT THING1 + 1

In this case, we have a nested FROM clause, and the inner FROM is handled by the rule aliasedQuery. If we look at its definition, it calls queryNoWith, and we resolve the nested FROM along with querySpecification (the transform query) using the queryNoWith:queryWithFrom rule.

val sql1 =
"""
|CREATE VIEW testView AS FROM jt
|INSERT INTO tbl1 SELECT * WHERE jt.id < 5 " +
Contributor

INSERT INTO tbl1 SELECT * WHERE jt.id < 5 " + INSERT INTO tbl2 SELECT * WHERE jt.id > 4

what are we testing here?

Contributor Author

@cloud-fan oops.. will fix.

@SparkQA

SparkQA commented Mar 20, 2019

Test build #103713 has finished for PR 24150 at commit edb7440.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 20, 2019

Test build #103715 has finished for PR 24150 at commit edb7440.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -354,9 +355,14 @@ resource
: identifier STRING
;

insertStatement
: (ctes)? insertInto queryTerm queryOrganization #singleInsertQuery
| (ctes)? fromClause multiInsertQueryBody+ #multiInsertQuery
Contributor

@cloud-fan commented Mar 20, 2019

shall we change the multiInsertQueryBody and make the insertInto non-optional?

Contributor Author

@cloud-fan did you mean making insertInto non-optional in multiInsertQueryBody? If so, I did try it, and we need it to be optional to parse the statement referenced in the test.
Also, we can have a mixture of inserts and selects in the multiInsertQueryBody. That's why we have it under a "multi insert" rule where the insert is optional. Not sure, but I think multiStatementBody may be a better name for the rule.

Contributor

Contributor Author

@cloud-fan If you don't mind, could you please paste the block or line here? When I click on that link, it shows me pretty much all the changes, so I am not sure which one I should focus on.

Member

@dilipbiswal. He means line 365. And that's the same question as in the comment below, https://github.com/apache/spark/pull/24150/files#r267283726.

| fromClause querySpecification queryOrganization                           #queryWithFrom

Line 365 is required for the FROM ... SELECT ... statement.

Contributor Author

@dongjoon-hyun Thanks for helping out. When I clicked, it took me to AstBuilder, so I was confused.

@cloud-fan So we need the rule to parse the statement I mentioned earlier:

from (from test select transform(....) using .. as ...) T select ...

It is parsed sort of recursively in the queryNoWith rule, calling queryNoWith from the relationPrimary:aliasedQuery rule.

Contributor

fromClause multiInsertQueryBody+ -> fromClause (insertInto? querySpecification queryOrganization)+ -> fromClause (querySpecification queryOrganization)+

and in queryNoWith, we have fromClause querySpecification queryOrganization. Seems we should change it to fromClause (querySpecification queryOrganization)+

Contributor Author

@cloud-fan Currently, after this change, a fromClause with multiple querySpecification queryOrganization will be handled by the 2nd alternative of insertStatement. I know that this may look confusing. In this pass, I wanted to handle single queries in one rule, which is query. This helps us catch quite a few errors, i.e. cases where a user could plug an insert into any part of the query. The rest is currently handled by insertStatement, which includes multi-insert, multi-select, etc. Please let me know what you think.

Contributor Author

@cloud-fan instead of the name insertStatement, should we change it to insertOrMultiQueries to clear up the confusion?

queryNoWith
: insertInto? queryTerm queryOrganization #singleInsertQuery
| fromClause multiInsertQueryBody+ #multiInsertQuery
: queryTerm queryOrganization #noWithQuery
Member

Actually, queryNoWith has these two patterns, and neither has a WITH clause. Can we give #noWithQuery a narrower name?

Contributor Author

@dongjoon-hyun would you have any suggestions? I was thinking selectQuery, but this matches select, table query, inline table, etc. So I'm a little confused :-)

@@ -754,4 +754,29 @@ class PlanParserSuite extends AnalysisTest {
assertEqual(query2, Distinct(a.union(b)).except(c.intersect(d, isAll = true), isAll = true))
}
}

test("create/alter view as insert into table") {
intercept[ParseException](parsePlan("CREATE VIEW testView AS INSERT INTO jt VALUES(1, 1)"))
Member

Ur, is this a behavior change? Previously, this seemed to be an AnalysisException with Operation not allowed: CREATE VIEW ... AS INSERT INTO. My bad. It was the same ParseException.

BTW, let's check the exception message always.

Contributor Author

@dongjoon-hyun For parser exceptions, we would have a list of expected tokens, which may change in the future, making this a little fragile (we may need to update the error message frequently). That's why I didn't check the exact message text here.
If you want it checked, would it be better to put it in SQLQueryTestSuite (it's perhaps easier to regenerate the output file)? Let me know what you think.

Member

Oh, that means a subtle behavior change. There is no longer a kind message like Operation not allowed:?

Contributor Author

@dongjoon-hyun Let me paste a sample error from the describe query test.

DESCRIBE
   FROM desc_temp1 a
     insert into desc_temp1 select *
     insert into desc_temp2 select *
-- !query 13 schema
struct<>
-- !query 13 output
org.apache.spark.sql.catalyst.parser.ParseException

mismatched input 'insert' expecting {<EOF>, '(', ',', 'ANTI', 'CLUSTER', 'CROSS', 'DISTRIBUTE', 'EXCEPT', 'FULL', 'GROUP', 'HAVING', 'INNER', 'INTERSECT', 'JOIN', 'LATERAL', 'LEFT', 'LIMIT', 'NATURAL', 'ORDER', 'PIVOT', 'RIGHT', 'SELECT', 'SEMI', 'MINUS', 'SORT', 'UNION', 'WHERE', 'WINDOW'}(line 3, pos 5)

I am thinking this is coming from ANTLR itself, as it knows the expected grammar?

Contributor Author

@dongjoon-hyun I will do a contains check and look for "mismatched input", like we do in the test "select hint syntax". Hopefully that's OK.

|INSERT INTO tbl2 SELECT * WHERE jt.id > 4
""".stripMargin
intercept[ParseException](parsePlan(sql1))
intercept[ParseException](parsePlan("ALTER VIEW testView AS INSERT INTO jt VALUES(1, 1)"))
Member

ditto.

@SparkQA

SparkQA commented Mar 20, 2019

Test build #103726 has finished for PR 24150 at commit 5033981.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 21, 2019

Test build #103742 has finished for PR 24150 at commit e651e14.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

: insertInto? queryTerm queryOrganization #singleInsertQuery
| fromClause multiInsertQueryBody+ #multiInsertQuery
: queryTerm queryOrganization #noWithQuery
| fromClause querySpecification queryOrganization #queryWithFrom
Member

Previously there could be multiple multiInsertQueryBody, and it looks like there will be a Union for the inserts. Because insertInto is optional, I think it can be a fromClause with multiple querySpecification queryOrganization. But it looks like the new rule doesn't allow that?

Contributor Author

@viirya Currently, after this change, a fromClause with multiple querySpecification queryOrganization will be handled by the 2nd alternative of the insertStatement rule.

override def visitQueryWithFrom(ctx: QueryWithFromContext): LogicalPlan = withOrigin(ctx) {
val from = visitFromClause(ctx.fromClause)
validate(ctx.querySpecification.fromClause == null,
"Script transformation queries cannot have a FROM clause in their" +
Member

Am I missing something? Is this related to script transformation?

Contributor Author

@viirya First of all, you are absolutely right; I could use your suggestion here. What this is guarding against is a query of the following form:

FROM foo select c1 from foo

So this has nothing to do with script transformation :-). I don't know what to call this kind of query. Should we say "Individual select statement can not have a FROM clause as it's already specified in the outer query block"?

Contributor

"Individual select statement can not have a FROM clause as it's already specified in the outer query block" SGTM

Contributor Author

@cloud-fan Thank you. will change.

Member

SGTM too.

Contributor Author

Thank you @viirya

@@ -354,9 +355,14 @@ resource
: identifier STRING
;

insertStatement
: (ctes)? insertInto queryTerm queryOrganization #singleInsertQuery
| (ctes)? fromClause multiInsertQueryBody+ #multiInsertQuery
Member

I also wonder why we keep insertInto optional in multiInsertQueryBody after this change, if we want to split select and insert into individual rules.

Member

Can't we write it like this?


insertStatement
    : ...
    | (ctes)? fromClause insertInto someName (insertInto? someName)      #multiInsertQuery
    ;

someName
    : querySpecification queryOrganization
    ;

@@ -354,9 +355,14 @@ resource
: identifier STRING
;

insertStatement
: (ctes)? insertInto queryTerm queryOrganization #singleInsertQuery
Member

Can you move (ctes)? to line 80, like | (ctes)? insertStatement?

Contributor Author

@maropu Then I may have to implement/override another visit method to apply the CTE to an insert statement. Is that okay?

Member

the duplicate pattern?

queryToDesc
    : queryTerm queryOrganization
    ;

Contributor Author

@maropu That's right. After this PR is in, I will have a follow-up to allow CTEs for DESCRIBE QUERY (currently a limitation), and that should remove this pattern altogether (I hope :-)).

@dilipbiswal
Contributor Author

dilipbiswal commented Mar 22, 2019

@cloud-fan @viirya
Update
Basically, I was trying to understand and answer the question both of you have raised, which is:
why make insertInto optional if we are splitting insert and select?

The only test that fails after making insertInto mandatory in multiInsertQueryBody is PlanParserSuite: test("multi select query"). Currently we allow multi-insert and multi-select queries like the following:

FROM T1
INSERT INTO T2 SELECT WHERE <condition>
INSERT INTO T3 SELECT WHERE <another_condition>

AND

FROM T1
SELECT C1 WHERE <condition>
SELECT C1 WHERE <another_condition>

Given that only PlanParserSuite is affected, my thinking is that this multi-select support is NOT really something we designed for; what we wanted to support is INSERT INTO ... SELECT FROM. Currently we plan the multi-select queries by simply unioning each branch. My question is: should we continue to support this? Since there are better ways to write union queries, I am wondering why users would ever want to use this form. WDYT?

@cloud-fan
Contributor

I don't think the multi-select is valid syntax. What is the expected result? And what do we get if we actually run a multi-select? Do we get an exception?

@dilipbiswal
Contributor Author

@cloud-fan It works :-) Here is the output snippet:

spark-sql> select * from foo;
1
2
Time taken: 0.112 seconds, Fetched 2 row(s)
spark-sql> from foo
         > select c1 where c1 = 1
         > select c1 where c1 = 2;
1
2
Time taken: 0.218 seconds, Fetched 2 row(s)

It basically turns into:

  select c1 from foo where c1 = 1
  union all
  select c1 from foo where c1 = 2

What I don't know is whether it was designed to work or is simply a side effect of supporting multi-insert.
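To make the rewrite concrete, here is a small illustrative sketch (plain Python, not Spark code; the function name and the string-based approach are made up for this example) that expands a multi-select block into the equivalent UNION ALL the planner effectively produces:

```python
import re

def expand_multi_select(sql: str) -> str:
    """Rewrite 'FROM <rel> SELECT ... SELECT ...' into an explicit UNION ALL."""
    m = re.match(r"(?is)\s*from\s+(\S+)\s+(select\b.*)", sql)
    if not m:
        raise ValueError("not a multi-select block")
    relation, rest = m.group(1), m.group(2)
    # Split on each SELECT keyword; each branch becomes a standalone query.
    branches = re.split(r"(?i)\bselect\b", rest)[1:]
    queries = []
    for b in branches:
        b = b.strip()
        # Push the shared FROM clause into each branch, before any WHERE.
        wm = re.search(r"(?i)\bwhere\b", b)
        if wm:
            proj, cond = b[: wm.start()].strip(), b[wm.start():].strip()
            queries.append(f"SELECT {proj} FROM {relation} {cond}")
        else:
            queries.append(f"SELECT {b} FROM {relation}")
    return "\nUNION ALL\n".join(queries)
```

For instance, `expand_multi_select("from foo select c1 where c1 = 1 select c1 where c1 = 2")` yields the two single-branch selects joined by UNION ALL, matching the output shown above.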

@dilipbiswal
Contributor Author

@cloud-fan And, to add to the confusion, here is the error when we specify the FROM clause in a multi-select query :-)

spark-sql> from foo
         > select c1 from foo where c1 = 1
         > select c1 from foo where c1 = 2;
Error in query: 
Multi-Insert queries cannot have a FROM clause in their individual SELECT statements(line 2, pos 0)

== SQL ==
from foo
select c1 from foo where c1 = 1
^^^
select c1 from foo where c1 = 2

Please note the reference to "Insert", even though there is no insert in the query.

@cloud-fan
Contributor

damn, then this is something we can't break, even if it's not designed this way.

Can we create a special parser rule for it? Also add a comment saying that we support it for historical reasons.

@dilipbiswal
Contributor Author

@cloud-fan Sure. will do.

@viirya
Member

viirya commented Mar 23, 2019

damn, then this is something we can't break, even if it's not designed this way.
Can we create a special parser rule for it? Also add a comment to say that we support it for history reasons.

I know that for compatibility we usually don't break such things. But for a major version like 3.0, can't we break it?

@SparkQA

SparkQA commented Mar 23, 2019

Test build #103846 has finished for PR 24150 at commit ea05263.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Contributor Author

@cloud-fan Please let me know if this looks ok now. Thank you.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 9cc925c Mar 26, 2019
@dilipbiswal
Contributor Author

Thank you very much @cloud-fan @viirya @maropu @dongjoon-hyun

mccheah pushed a commit to palantir/spark that referenced this pull request May 15, 2019
…vel rules in the grammar file.

Closes apache#24150 from dilipbiswal/split-query-insert.

Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@chrisknoll

Apologies for crashing the party on this closed PR, but my question relates directly to the parsing of INSERT statements:

Related to the ANSI SQL specification for insert syntax, could the parser and underlying engine support the syntax:
INSERT INTO <table> (<column list>) SELECT <column list> FROM <otherTable>

I think I read somewhere that there's some underlying technical detail whereby columns inserted into Spark tables must have the selected columns match the order of the table definition. But if this is the case, isn't there a place in the parser layer and execution layer where the parser can translate something like:

insert into someTable (col1,col2)
select someCol1, someCol2 from otherTable

where someTable has 3 columns (col3, col2, col1) (note the order here), so the query is rewritten and sent to the engine as:

insert into someTable
select null, someCol2, someCol1 from otherTable

Note: the reordering and the added null column are based on table metadata for someTable, so the engine knows which columns from the INSERT list map over to the columns from the SELECT.

Is this possible? The inability to specify a column list is preventing our project from supporting Spark as a platform.

@cloud-fan
Contributor

This is unrelated to this PR but is a reasonable feature request. One missing piece is the default column value; we should support that as well, instead of always using null as the default. It is still valuable to support this SQL syntax so that we can reorder columns during insertion. Feel free to open a JIRA ticket/PR!
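The rewrite being asked for can be sketched in a few lines (a hypothetical helper, not Spark code; a real engine would consult catalog metadata and, as noted above, per-column defaults rather than always NULL):

```python
from typing import List

def align_insert_columns(table_columns: List[str],
                         insert_columns: List[str],
                         select_exprs: List[str]) -> List[str]:
    """Reorder SELECT expressions to match the table's declared column order.

    Columns missing from the INSERT column list are filled with NULL here;
    a real engine would use each column's declared default instead, if any.
    """
    if len(insert_columns) != len(select_exprs):
        raise ValueError("INSERT column list and SELECT list differ in length")
    unknown = set(insert_columns) - set(table_columns)
    if unknown:
        raise ValueError(f"unknown columns: {sorted(unknown)}")
    mapping = dict(zip(insert_columns, select_exprs))
    return [mapping.get(col, "NULL") for col in table_columns]

# someTable is declared as (col3, col2, col1); the user writes
#   INSERT INTO someTable (col1, col2) SELECT someCol1, someCol2 FROM otherTable
exprs = align_insert_columns(["col3", "col2", "col1"],
                             ["col1", "col2"],
                             ["someCol1", "someCol2"])
# exprs == ["NULL", "someCol2", "someCol1"]
```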

@maropu
Member

maropu commented Feb 5, 2020

I remember we already have tickets for the related topics, e.g., https://issues.apache.org/jira/browse/SPARK-21548 (#18756) and https://issues.apache.org/jira/browse/SPARK-20845 (#22532). Could you check them first? If there is a difference between your suggestion and these tickets, please file a JIRA!

@chrisknoll

chrisknoll commented Feb 5, 2020

You are correct; apologies, I didn't know about the JIRA issue tracking separate from GitHub. Those issues do cover what I'm talking about here.

The closest PR was this one: #22532, but it has been closed due to inactivity. So, sad face on that.
