[SPARK-30822][SQL] Remove semicolon at the end of a SQL query #27567
samredai wants to merge 1 commit into apache:master from samredai:semicolon
Conversation
@dongjoon-hyun sure thing! I'll update the PR.
In database systems, a semicolon is the standard way to separate multiple statements. On the other hand, in
@maropu I think the two issues are related but very much separate at the same time. There is the question of whether spark.sql should accept multiple statements separated by a semicolon, which is the discussion posed by [SPARK-24260]. However, even if the spark.sql API were never to accept multiple statements, the issue presented here would still remain: should spark.sql fail when a single valid SQL statement is provided with a terminal semicolon? As examples, here are two popular MySQL clients for two different languages that do not accept multiple SQL statements yet do not fail when a single statement is provided with a terminal semicolon. Even when using JDBC directly:

```scala
val connection = DriverManager.getConnection(url, username, password)
val statement = connection.createStatement()
val results = statement.executeQuery("select 'Bojack' as horseman;")
```
Ah, I see. Do most JDBC clients ignore a semicolon at the end? Could you check the other JDBC client behaviours, too?
The acceptance of the terminating semicolon happens at the database layer, so most (if not all) JDBC clients ship the query to the database with the semicolon included. Even if the client does not support multiple statements, the database does. Here's an example in the Hive CLI source code where a statement is split at the semicolon and each statement has a terminating semicolon appended to it. There's also some logic to prevent confusion. Terminating semicolons are also commonly accepted by NoSQL DBMSs, e.g. Cassandra, which treats the semicolon as the terminator of a CQL statement.
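The Hive CLI behaviour described above — split on semicolons, drop the empty pieces, and re-append a terminator to each statement — can be sketched roughly as follows (a simplified illustration, not the actual Hive CLI code; `split_statements` is a made-up name, and the sketch ignores semicolons inside quoted strings):

```python
def split_statements(source: str) -> list[str]:
    # Split a multi-statement string on ';', drop the empty pieces that
    # extra semicolons produce, and ship each statement with its own
    # terminating semicolon, as the Hive CLI splitting described above does.
    pieces = (p.strip() for p in source.split(";"))
    return [p + ";" for p in pieces if p]
```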
ok, so I'll accept this change after some tests are added in this PR.

@maropu Thanks! I added a unit test to
ok to test
Test build #118934 has finished for PR 27567 at commit

Test build #118936 has finished for PR 27567 at commit
Can we fix this on the SqlBase.g4 side? Then, add tests in PlanParserSuite?
There was a problem hiding this comment.
Yeah, that would actually be much cleaner and is the more appropriate place for the fix. What do you think about adding the quantifier as (';')*? to allow for any number of semicolons, e.g. `sql("select name from people;; ;; ; ")`. It goes beyond handling the accidental extra semicolon, but I don't see why any number of semicolons at the end would make a query ambiguous.
@maropu I updated the PR as described, including some tests; let me know what you think. To expand on the example in my earlier reply:

```scala
// Below query is understood to be a single statement
sql("select 1, 2, 3;; ;; ; ")
// res1: org.apache.spark.sql.DataFrame = [1: int, 2: int ... 1 more field]

// Below multi-statement query appropriately fails
sql("select 1, 2, 3;; ;; select 4, 5, 6; ")
// org.apache.spark.sql.catalyst.parser.ParseException:
// extraneous input 'select' expecting {<EOF>, ';'}(line 1, pos 20)
// == SQL ==
// select 1, 2, 3;; ;; select 4, 5, 6
// --------------------^^^
```
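The intended terminal-semicolon handling can be mimicked outside the parser with a short sketch (a hypothetical illustration, not Spark code; `strip_terminal_semicolons` and `is_single_statement` are made-up names, and the check naively ignores semicolons inside string literals):

```python
import re

def strip_terminal_semicolons(query: str) -> str:
    # Drop any trailing run of semicolons and whitespace, mirroring the
    # grammar quantifier discussed above: "select 1;; ;; ; " -> "select 1".
    return re.sub(r"[;\s]+$", "", query)

def is_single_statement(query: str) -> bool:
    # After stripping terminal semicolons, any remaining ';' means the
    # string still holds more than one statement, which the parser rejects.
    return ";" not in strip_terminal_semicolons(query)
```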
How about this new behaviour? That looks fine to me though. @cloud-fan @dongjoon-hyun @HyukjinKwon
Test build #118991 has finished for PR 27567 at commit

Test build #119004 has finished for PR 27567 at commit

Test build #119013 has finished for PR 27567 at commit
All of the unit tests passed for this, but it looks like there was some intermittent connection issue while installing R.

retest this please
Test build #119033 has finished for PR 27567 at commit

retest this please

Test build #119055 has finished for PR 27567 at commit
should a single statement support multiple ;?
@cloud-fan I would expect the most common scenario to be that the user unintentionally submitted an extra semicolon. Do you think the query should fail in that case, or is the intention so obvious that it's essentially explicit? In cases where multiple statements are allowed, e.g. in spark-sql, would the additional semicolons just equate to empty statements that are ignored, or could something else happen that affects performance?
seems fine.
BTW, should ';'* be good enough? * means 0 or more occurrences.
Yes @cloud-fan you're right, I updated this and also squashed all the commits.
…icolons in SQL statements
When a user submits a SQL query that is terminated with a semicolon, they are currently met with an `org.apache.spark.sql.catalyst.parser.ParseException` of `extraneous input ';' expecting <EOF>`. This change fixes that by updating the ANTLR grammar to allow any number of consecutive terminating semicolons for a SQL statement.
Added tests to PlanParserSuite for terminal semicolons. For `describe-query.sql`, `grouping_set.sql`, `interval.sql`, and DDLParserSuite, the changes to the grammar rule for `singleStatement` require the portion of the exception message that reads "...expecting [<EOF>]..." to be updated to "...expecting [{<EOF>, ';'}]...".
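Concretely, the grammar change amounts to something like the following rule in SqlBase.g4 (a sketch inferred from the `';'*` quantifier discussed in this thread, not a verbatim diff of the PR):

```
singleStatement
    : statement ';'* EOF
    ;
```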
Test build #120220 has finished for PR 27567 at commit

retest this please

Test build #120268 has finished for PR 27567 at commit
thanks, merging to master/3.0!

Late LGTM. Thank you all.
(cherry picked from commit 44431d4)
# What changes were proposed in this pull request?
This change proposes ignoring a terminating semicolon from queries submitted by the user (if included) instead of raising a parse exception.
# Why are the changes needed?
When a user submits a directly executable SQL statement terminated with a semicolon, they receive an `org.apache.spark.sql.catalyst.parser.ParseException` of `extraneous input ';' expecting <EOF>`. SQL-92 describes a direct SQL statement as having the format of `<directly executable statement> <semicolon>` and the majority of SQL implementations either require the semicolon as a statement terminator, or make it optional (meaning not raising an exception when it's included, seemingly in recognition that it's a common behavior).
# Does this PR introduce any user-facing change?
No
# How was this patch tested?
Unit test added to `PlanParserSuite`
```
sbt> project catalyst
sbt> testOnly *PlanParserSuite
[info] - case insensitive (565 milliseconds)
[info] - explain (9 milliseconds)
[info] - set operations (41 milliseconds)
[info] - common table expressions (31 milliseconds)
[info] - simple select query (47 milliseconds)
[info] - hive-style single-FROM statement (11 milliseconds)
[info] - multi select query (32 milliseconds)
[info] - query organization (41 milliseconds)
[info] - insert into (12 milliseconds)
[info] - aggregation (24 milliseconds)
[info] - limit (11 milliseconds)
[info] - window spec (11 milliseconds)
[info] - lateral view (17 milliseconds)
[info] - joins (62 milliseconds)
[info] - sampled relations (11 milliseconds)
[info] - sub-query (11 milliseconds)
[info] - scalar sub-query (9 milliseconds)
[info] - table reference (2 milliseconds)
[info] - table valued function (8 milliseconds)
[info] - SPARK-20311 range(N) as alias (2 milliseconds)
[info] - SPARK-20841 Support table column aliases in FROM clause (3 milliseconds)
[info] - SPARK-20962 Support subquery column aliases in FROM clause (4 milliseconds)
[info] - SPARK-20963 Support aliases for join relations in FROM clause (3 milliseconds)
[info] - inline table (23 milliseconds)
[info] - simple select query with !> and !< (5 milliseconds)
[info] - select hint syntax (34 milliseconds)
[info] - SPARK-20854: select hint syntax with expressions (12 milliseconds)
[info] - SPARK-20854: multiple hints (4 milliseconds)
[info] - TRIM function (16 milliseconds)
[info] - OVERLAY function (16 milliseconds)
[info] - precedence of set operations (18 milliseconds)
[info] - create/alter view as insert into table (4 milliseconds)
[info] - Invalid insert constructs in the query (10 milliseconds)
[info] - relation in v2 catalog (3 milliseconds)
[info] - CTE with column alias (2 milliseconds)
[info] - statement containing terminal semicolons (3 milliseconds)
[info] ScalaTest
[info] Run completed in 3 seconds, 129 milliseconds.
[info] Total number of tests run: 36
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 36, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[info] Passed: Total 36, Failed 0, Errors 0, Passed 36
```
### Current behavior:
#### scala
```scala
scala> val df = sql("select 1")
// df: org.apache.spark.sql.DataFrame = [1: int]
scala> df.show()
// +---+
// | 1|
// +---+
// | 1|
// +---+
scala> val df = sql("select 1;")
// org.apache.spark.sql.catalyst.parser.ParseException:
// extraneous input ';' expecting <EOF>(line 1, pos 8)
// == SQL ==
// select 1;
// --------^^^
// at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:263)
// at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:130)
// at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:52)
// at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:76)
// at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:605)
// at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
// at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:605)
// ... 47 elided
```
#### pyspark
```python
df = spark.sql('select 1')
df.show()
#+---+
#| 1|
#+---+
#| 1|
#+---+
df = spark.sql('select 1;')
# Traceback (most recent call last):
# File "<stdin>", line 1, in <module>
# File "/Users/ssetegne/spark/python/pyspark/sql/session.py", line 646, in sql
# return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
# File "/Users/ssetegne/spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1286, in __call__
# File "/Users/ssetegne/spark/python/pyspark/sql/utils.py", line 102, in deco
# raise converted
# pyspark.sql.utils.ParseException:
# extraneous input ';' expecting <EOF>(line 1, pos 8)
# == SQL ==
# select 1;
# --------^^^
```
### Behavior after proposed fix:
#### scala
```scala
scala> val df = sql("select 1")
// df: org.apache.spark.sql.DataFrame = [1: int]
scala> df.show()
// +---+
// | 1|
// +---+
// | 1|
// +---+
scala> val df = sql("select 1;")
// df: org.apache.spark.sql.DataFrame = [1: int]
scala> df.show()
// +---+
// | 1|
// +---+
// | 1|
// +---+
```
#### pyspark
```python
df = spark.sql('select 1')
df.show()
#+---+
#| 1|
#+---+
#| 1|
#+---+
df = spark.sql('select 1;')
df.show()
#+---+
#| 1|
#+---+
#| 1|
#+---+
```
Closes apache#27567 from samsetegne/semicolon.
Authored-by: samsetegne <samuelsetegne@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>