
[SPARK-30098][SQL] Use default datasource as provider for CREATE TABLE syntax

### What changes were proposed in this pull request?

In this PR, we propose to use the value of `spark.sql.sources.default` as the provider for `CREATE TABLE` syntax instead of `hive` in Spark 3.0.

To help with migration, we introduce a legacy conf `spark.sql.legacy.createHiveTableByDefault.enabled`, whose default is `false`.
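As a rough illustration (not taken from the PR itself), here is a minimal sketch of toggling the legacy flag from a Spark session; it assumes the conf can be set at runtime, and the table names are made up:

```
// Sketch only: assumes a SparkSession `spark` built with Hive support.
spark.conf.set("spark.sql.legacy.createHiveTableByDefault.enabled", "true")
spark.sql("CREATE TABLE legacy_tab (c1 INT)")   // created with the hive provider, as in Spark 2.4
spark.conf.set("spark.sql.legacy.createHiveTableByDefault.enabled", "false")
spark.sql("CREATE TABLE default_tab (c1 INT)")  // uses spark.sql.sources.default (parquet out of the box)
```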

### Why are the changes needed?

1. Currently, the `CREATE TABLE` syntax uses the hive provider to create a table, while the `DataFrameWriter.saveAsTable` API uses the value of `spark.sql.sources.default` as the provider. It would be better to make them consistent.

2. Users may get confused in some cases. For example:

```
CREATE TABLE t1 (c1 INT) USING PARQUET;
CREATE TABLE t2 (c1 INT);
```

In these two DDLs, users may think that `t2` also uses parquet as its provider, since Spark always advertises parquet as the default format. However, in this case it is hive.

On the other hand, if we omit the USING clause in a CTAS statement, we do pick parquet by default when `spark.sql.hive.convertCTAS=true`:

```
CREATE TABLE t3 USING PARQUET AS SELECT 1 AS VALUE;
CREATE TABLE t4 AS SELECT 1 AS VALUE;
```
Taken together, these two cases can be really confusing (see the sketch after this list for one way to inspect each table's provider).

3. Spark SQL is now mature and widely used in its own right; we do not need to stay fully consistent with Hive's behavior.
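To make points 1 and 2 concrete, here is a hedged sketch (table names are hypothetical) of how one could compare the providers picked by the SQL and `DataFrameWriter` paths on Spark 2.4, reading the `Provider` row from `DESCRIBE TABLE EXTENDED`:

```
// Sketch: assumes a SparkSession `spark` built with Hive support.
spark.sql("CREATE TABLE sql_tab (c1 INT)")            // hive provider in Spark 2.4
spark.range(1).write.saveAsTable("api_tab")           // spark.sql.sources.default (parquet)
Seq("sql_tab", "api_tab").foreach { t =>
  spark.sql(s"DESCRIBE TABLE EXTENDED $t")
    .filter("col_name = 'Provider'")
    .show(truncate = false)
}
```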

### Does this PR introduce any user-facing change?

Yes. Before this PR, the `CREATE TABLE` syntax used the hive provider. Now it uses the value of `spark.sql.sources.default` as its provider.
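A rough sketch of the new behavior (the orc choice and table names are arbitrary; it assumes the session default can be changed at runtime):

```
// Sketch: after this change, CREATE TABLE without USING follows spark.sql.sources.default.
spark.conf.set("spark.sql.sources.default", "orc")
spark.sql("CREATE TABLE orc_tab (c1 INT)")       // picks up orc
spark.conf.set("spark.sql.sources.default", "parquet")
spark.sql("CREATE TABLE parquet_tab (c1 INT)")   // picks up parquet, the shipped default
```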

### How was this patch tested?

Added tests in `DDLParserSuite` and `HiveDDLSuite`.

Closes #26736 from Ngone51/dev-create-table-using-parquet-by-default.

Lead-authored-by: wuyi <yi.wu@databricks.com>
Co-authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Ngone51 authored and cloud-fan committed Dec 6, 2019
1 parent c1a5f94 commit 58be82ad4b98fc17e821e916e69e77a6aa36209d
Showing with 261 additions and 186 deletions.
  1. +2 −0 docs/sql-migration-guide.md
  2. +23 −13 sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
  3. +33 −33 sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
  4. +2 −0 sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/ParseDriver.scala
  5. +12 −0 sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
  6. +21 −0 sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/DDLParserSuite.scala
  7. +3 −10 sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala
  8. +2 −3 sql/core/src/test/resources/sql-tests/results/postgreSQL/create_view.sql.out
  9. +3 −2 sql/core/src/test/scala/org/apache/spark/sql/execution/SparkSqlParserSuite.scala
  10. +76 −78 sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLParserSuite.scala
  11. +3 −3 sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala
  12. +1 −1 sql/core/src/test/scala/org/apache/spark/sql/sources/CreateTableAsSelectSuite.scala
  13. +4 −4 sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala
  14. +11 −11 ...-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala
  15. +4 −0 ...ive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveCompatibilitySuite.scala
  16. +1 −1 ...mpatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveWindowFunctionQuerySuite.scala
  17. +2 −2 sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveShowCreateTableSuite.scala
  18. +2 −2 sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala
  19. +1 −1 sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveCommandSuite.scala
  20. +33 −6 sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala
  21. +3 −2 sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveExplainSuite.scala
  22. +6 −3 sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveSerDeSuite.scala
  23. +10 −8 sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala
  24. +1 −1 sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/WindowQuerySuite.scala
  25. +2 −2 sql/hive/src/test/scala/org/apache/spark/sql/hive/test/TestHive.scala
@@ -253,6 +253,8 @@ license: |
</td>
</tr>
</table>

- Since Spark 3.0, CREATE TABLE without a specific provider will use the value of `spark.sql.sources.default` as its provider. In Spark version 2.4 and earlier, it was hive. To restore the behavior before Spark 3.0, you can set `spark.sql.legacy.createHiveTableByDefault.enabled` to `true`.

- Since Spark 3.0, the unary arithmetic operator plus(`+`) only accepts string, numeric and interval type values as inputs. Besides, `+` with a integral string representation will be coerced to double value, e.g. `+'1'` results `1.0`. In Spark version 2.4 and earlier, this operator is ignored. There is no type checking for it, thus, all type values with a `+` prefix are valid, e.g. `+ array(1, 2)` is valid and results `[1, 2]`. Besides, there is no type coercion for it at all, e.g. in Spark 2.4, the result of `+'1'` is string `1`.

@@ -29,6 +29,12 @@ grammar SqlBase;
*/
public boolean legacy_exponent_literal_as_decimal_enabled = false;
/**
* When false, CREATE TABLE syntax without a provider will use
* the value of spark.sql.sources.default as its provider.
*/
public boolean legacy_create_hive_table_by_default_enabled = false;
/**
* Verify whether current token is a valid decimal token (which contains dot).
* Returns true if the character that follows the token is not a digit or letter or underscore.
@@ -101,13 +107,13 @@ statement
(RESTRICT | CASCADE)? #dropNamespace
| SHOW (DATABASES | NAMESPACES) ((FROM | IN) multipartIdentifier)?
(LIKE? pattern=STRING)? #showNamespaces
| createTableHeader ('(' colTypeList ')')? tableProvider
((OPTIONS options=tablePropertyList) |
(PARTITIONED BY partitioning=transformList) |
bucketSpec |
locationSpec |
(COMMENT comment=STRING) |
(TBLPROPERTIES tableProps=tablePropertyList))*
| {!legacy_create_hive_table_by_default_enabled}?
createTableHeader ('(' colTypeList ')')? tableProvider?
createTableClauses
(AS? query)? #createTable
| {legacy_create_hive_table_by_default_enabled}?
createTableHeader ('(' colTypeList ')')? tableProvider
createTableClauses
(AS? query)? #createTable
| createTableHeader ('(' columns=colTypeList ')')?
((COMMENT comment=STRING) |
@@ -128,12 +134,7 @@ statement
locationSpec |
(TBLPROPERTIES tableProps=tablePropertyList))* #createTableLike
| replaceTableHeader ('(' colTypeList ')')? tableProvider
((OPTIONS options=tablePropertyList) |
(PARTITIONED BY partitioning=transformList) |
bucketSpec |
locationSpec |
(COMMENT comment=STRING) |
(TBLPROPERTIES tableProps=tablePropertyList))*
createTableClauses
(AS? query)? #replaceTable
| ANALYZE TABLE multipartIdentifier partitionSpec? COMPUTE STATISTICS
(identifier | FOR COLUMNS identifierSeq | FOR ALL COLUMNS)? #analyze
@@ -352,6 +353,15 @@ tableProvider
: USING multipartIdentifier
;

createTableClauses
:((OPTIONS options=tablePropertyList) |
(PARTITIONED BY partitioning=transformList) |
bucketSpec |
locationSpec |
(COMMENT comment=STRING) |
(TBLPROPERTIES tableProps=tablePropertyList))*
;

tablePropertyList
: '(' tableProperty (',' tableProperty)* ')'
;
@@ -2379,6 +2379,13 @@ class AstBuilder(conf: SQLConf) extends SqlBaseBaseVisitor[AnyRef] with Logging
*/
type TableHeader = (Seq[String], Boolean, Boolean, Boolean)

/**
* Type to keep track of table clauses:
* (partitioning, bucketSpec, options, locationSpec, properties, comment).
*/
type TableClauses = (Seq[Transform], Option[BucketSpec], Map[String, String],
Map[String, String], Option[String], Option[String])

/**
* Validate a create table statement and return the [[TableIdentifier]].
*/
@@ -2614,6 +2621,24 @@ class AstBuilder(conf: SQLConf) extends SqlBaseBaseVisitor[AnyRef] with Logging
ctx.EXTENDED != null)
}

override def visitCreateTableClauses(ctx: CreateTableClausesContext): TableClauses = {
checkDuplicateClauses(ctx.TBLPROPERTIES, "TBLPROPERTIES", ctx)
checkDuplicateClauses(ctx.OPTIONS, "OPTIONS", ctx)
checkDuplicateClauses(ctx.PARTITIONED, "PARTITIONED BY", ctx)
checkDuplicateClauses(ctx.COMMENT, "COMMENT", ctx)
checkDuplicateClauses(ctx.bucketSpec(), "CLUSTERED BY", ctx)
checkDuplicateClauses(ctx.locationSpec, "LOCATION", ctx)

val partitioning: Seq[Transform] =
Option(ctx.partitioning).map(visitTransformList).getOrElse(Nil)
val bucketSpec = ctx.bucketSpec().asScala.headOption.map(visitBucketSpec)
val properties = Option(ctx.tableProps).map(visitPropertyKeyValues).getOrElse(Map.empty)
val options = Option(ctx.options).map(visitPropertyKeyValues).getOrElse(Map.empty)
val location = ctx.locationSpec.asScala.headOption.map(visitLocationSpec)
val comment = Option(ctx.comment).map(string)
(partitioning, bucketSpec, properties, options, location, comment)
}

/**
* Create a table, returning a [[CreateTableStatement]] logical plan.
*
@@ -2639,26 +2664,14 @@ class AstBuilder(conf: SQLConf) extends SqlBaseBaseVisitor[AnyRef] with Logging
override def visitCreateTable(ctx: CreateTableContext): LogicalPlan = withOrigin(ctx) {
val (table, temp, ifNotExists, external) = visitCreateTableHeader(ctx.createTableHeader)
if (external) {
operationNotAllowed("CREATE EXTERNAL TABLE ... USING", ctx)
operationNotAllowed("CREATE EXTERNAL TABLE ...", ctx)
}

checkDuplicateClauses(ctx.TBLPROPERTIES, "TBLPROPERTIES", ctx)
checkDuplicateClauses(ctx.OPTIONS, "OPTIONS", ctx)
checkDuplicateClauses(ctx.PARTITIONED, "PARTITIONED BY", ctx)
checkDuplicateClauses(ctx.COMMENT, "COMMENT", ctx)
checkDuplicateClauses(ctx.bucketSpec(), "CLUSTERED BY", ctx)
checkDuplicateClauses(ctx.locationSpec, "LOCATION", ctx)

val schema = Option(ctx.colTypeList()).map(createSchema)
val partitioning: Seq[Transform] =
Option(ctx.partitioning).map(visitTransformList).getOrElse(Nil)
val bucketSpec = ctx.bucketSpec().asScala.headOption.map(visitBucketSpec)
val properties = Option(ctx.tableProps).map(visitPropertyKeyValues).getOrElse(Map.empty)
val options = Option(ctx.options).map(visitPropertyKeyValues).getOrElse(Map.empty)

val provider = ctx.tableProvider.multipartIdentifier.getText
val location = ctx.locationSpec.asScala.headOption.map(visitLocationSpec)
val comment = Option(ctx.comment).map(string)
val defaultProvider = conf.defaultDataSourceName
val provider =
Option(ctx.tableProvider).map(_.multipartIdentifier.getText).getOrElse(defaultProvider)
val (partitioning, bucketSpec, properties, options, location, comment) =
visitCreateTableClauses(ctx.createTableClauses())

Option(ctx.query).map(plan) match {
case Some(_) if temp =>
@@ -2713,23 +2726,10 @@ class AstBuilder(conf: SQLConf) extends SqlBaseBaseVisitor[AnyRef] with Logging
operationNotAllowed("REPLACE EXTERNAL TABLE ... USING", ctx)
}

checkDuplicateClauses(ctx.TBLPROPERTIES, "TBLPROPERTIES", ctx)
checkDuplicateClauses(ctx.OPTIONS, "OPTIONS", ctx)
checkDuplicateClauses(ctx.PARTITIONED, "PARTITIONED BY", ctx)
checkDuplicateClauses(ctx.COMMENT, "COMMENT", ctx)
checkDuplicateClauses(ctx.bucketSpec(), "CLUSTERED BY", ctx)
checkDuplicateClauses(ctx.locationSpec, "LOCATION", ctx)

val (partitioning, bucketSpec, properties, options, location, comment) =
visitCreateTableClauses(ctx.createTableClauses())
val schema = Option(ctx.colTypeList()).map(createSchema)
val partitioning: Seq[Transform] =
Option(ctx.partitioning).map(visitTransformList).getOrElse(Nil)
val bucketSpec = ctx.bucketSpec().asScala.headOption.map(visitBucketSpec)
val properties = Option(ctx.tableProps).map(visitPropertyKeyValues).getOrElse(Map.empty)
val options = Option(ctx.options).map(visitPropertyKeyValues).getOrElse(Map.empty)

val provider = ctx.tableProvider.multipartIdentifier.getText
val location = ctx.locationSpec.asScala.headOption.map(visitLocationSpec)
val comment = Option(ctx.comment).map(string)
val orCreate = ctx.replaceTableHeader().CREATE() != null

Option(ctx.query).map(plan) match {
@@ -101,6 +101,7 @@ abstract class AbstractSqlParser(conf: SQLConf) extends ParserInterface with Log
lexer.addErrorListener(ParseErrorListener)
lexer.legacy_setops_precedence_enbled = conf.setOpsPrecedenceEnforced
lexer.legacy_exponent_literal_as_decimal_enabled = conf.exponentLiteralAsDecimalEnabled
lexer.legacy_create_hive_table_by_default_enabled = conf.createHiveTableByDefaultEnabled
lexer.SQL_standard_keyword_behavior = SQLStandardKeywordBehavior

val tokenStream = new CommonTokenStream(lexer)
@@ -110,6 +111,7 @@ abstract class AbstractSqlParser(conf: SQLConf) extends ParserInterface with Log
parser.addErrorListener(ParseErrorListener)
parser.legacy_setops_precedence_enbled = conf.setOpsPrecedenceEnforced
parser.legacy_exponent_literal_as_decimal_enabled = conf.exponentLiteralAsDecimalEnabled
parser.legacy_create_hive_table_by_default_enabled = conf.createHiveTableByDefaultEnabled
parser.SQL_standard_keyword_behavior = SQLStandardKeywordBehavior

try {
@@ -1966,6 +1966,15 @@ object SQLConf {
.booleanConf
.createWithDefault(false)

val LEGACY_CREATE_HIVE_TABLE_BY_DEFAULT_ENABLED =
buildConf("spark.sql.legacy.createHiveTableByDefault.enabled")
.internal()
.doc("When set to true, CREATE TABLE syntax without a provider will use hive " +
s"instead of the value of ${DEFAULT_DATA_SOURCE_NAME.key}.")
.booleanConf
.createWithDefault(false)


val LEGACY_INTEGRALDIVIDE_RETURN_LONG = buildConf("spark.sql.legacy.integralDivide.returnBigint")
.doc("If it is set to true, the div operator returns always a bigint. This behavior was " +
"inherited from Hive. Otherwise, the return type is the data type of the operands.")
@@ -2583,6 +2592,9 @@ class SQLConf extends Serializable with Logging {
def exponentLiteralAsDecimalEnabled: Boolean =
getConf(SQLConf.LEGACY_EXPONENT_LITERAL_AS_DECIMAL_ENABLED)

def createHiveTableByDefaultEnabled: Boolean =
getConf(SQLConf.LEGACY_CREATE_HIVE_TABLE_BY_DEFAULT_ENABLED)

def integralDivideReturnLong: Boolean = getConf(SQLConf.LEGACY_INTEGRALDIVIDE_RETURN_LONG)

def nameNonStructGroupingKeyAsValue: Boolean =
@@ -25,6 +25,7 @@ import org.apache.spark.sql.catalyst.catalog.BucketSpec
import org.apache.spark.sql.catalyst.expressions.{EqualTo, Literal}
import org.apache.spark.sql.catalyst.plans.logical._
import org.apache.spark.sql.connector.expressions.{ApplyTransform, BucketTransform, DaysTransform, FieldReference, HoursTransform, IdentityTransform, LiteralValue, MonthsTransform, Transform, YearsTransform}
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.{IntegerType, LongType, StringType, StructType, TimestampType}
import org.apache.spark.unsafe.types.UTF8String

@@ -48,6 +49,26 @@ class DDLParserSuite extends AnalysisTest {
comparePlans(parsePlan(sql), expected, checkAnalysis = false)
}

test("SPARK-30098: create table without provider should " +
"use default data source under non-legacy mode") {
val createSql = "CREATE TABLE my_tab(a INT COMMENT 'test', b STRING)"
val defaultProvider = conf.defaultDataSourceName
val expectedPlan = CreateTableStatement(
Seq("my_tab"),
new StructType()
.add("a", IntegerType, nullable = true, "test")
.add("b", StringType),
Seq.empty[Transform],
None,
Map.empty[String, String],
defaultProvider,
Map.empty[String, String],
None,
None,
false)
parseCompare(createSql, expectedPlan)
}

test("create/replace table using - schema") {
val createSql = "CREATE TABLE my_tab(a INT COMMENT 'test', b STRING) USING parquet"
val replaceSql = "REPLACE TABLE my_tab(a INT COMMENT 'test', b STRING) USING parquet"
@@ -187,22 +187,15 @@ class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder(conf) {
if (external) {
operationNotAllowed("CREATE EXTERNAL TABLE ... USING", ctx)
}

checkDuplicateClauses(ctx.TBLPROPERTIES, "TBLPROPERTIES", ctx)
checkDuplicateClauses(ctx.OPTIONS, "OPTIONS", ctx)
checkDuplicateClauses(ctx.PARTITIONED, "PARTITIONED BY", ctx)
checkDuplicateClauses(ctx.COMMENT, "COMMENT", ctx)
checkDuplicateClauses(ctx.bucketSpec(), "CLUSTERED BY", ctx)
checkDuplicateClauses(ctx.locationSpec, "LOCATION", ctx)

if (ifNotExists) {
// Unlike CREATE TEMPORARY VIEW USING, CREATE TEMPORARY TABLE USING does not support
// IF NOT EXISTS. Users are not allowed to replace the existing temp table.
operationNotAllowed("CREATE TEMPORARY TABLE IF NOT EXISTS", ctx)
}

val options = Option(ctx.options).map(visitPropertyKeyValues).getOrElse(Map.empty)
val provider = ctx.tableProvider.multipartIdentifier.getText
val (_, _, _, options, _, _) = visitCreateTableClauses(ctx.createTableClauses())
val provider = Option(ctx.tableProvider).map(_.multipartIdentifier.getText).getOrElse(
throw new ParseException("CREATE TEMPORARY TABLE without a provider is not allowed.", ctx))
val schema = Option(ctx.colTypeList()).map(createSchema)

logWarning(s"CREATE TEMPORARY TABLE ... USING ... is deprecated, please use " +
@@ -46,8 +46,7 @@ CREATE TABLE view_base_table (key int /* PRIMARY KEY */, data varchar(20))
-- !query 4 schema
struct<>
-- !query 4 output
org.apache.spark.sql.AnalysisException
Hive support is required to CREATE Hive TABLE (AS SELECT);



-- !query 5
@@ -57,7 +56,7 @@ CREATE VIEW key_dependent_view AS
struct<>
-- !query 5 output
org.apache.spark.sql.AnalysisException
Table or view not found: view_base_table; line 2 pos 17
expression 'default.view_base_table.`data`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;


-- !query 6
@@ -159,7 +159,7 @@ class SparkSqlParserSuite extends AnalysisTest {
}

test("create table - schema") {
assertEqual("CREATE TABLE my_tab(a INT COMMENT 'test', b STRING)",
assertEqual("CREATE TABLE my_tab(a INT COMMENT 'test', b STRING) STORED AS textfile",
createTable(
table = "my_tab",
schema = (new StructType)
@@ -179,7 +179,8 @@ class SparkSqlParserSuite extends AnalysisTest {
partitionColumnNames = Seq("c", "d")
)
)
assertEqual("CREATE TABLE my_tab(id BIGINT, nested STRUCT<col1: STRING,col2: INT>)",
assertEqual("CREATE TABLE my_tab(id BIGINT, nested STRUCT<col1: STRING,col2: INT>) " +
"STORED AS textfile",
createTable(
table = "my_tab",
schema = (new StructType)
