[SPARK-16552] [SQL] Store the Inferred Schemas into External Catalog Tables when Creating Tables #14207
@@ -52,7 +52,7 @@ case class CreateDataSourceTableCommand(
     userSpecifiedSchema: Option[StructType],
     provider: String,
     options: Map[String, String],
-    partitionColumns: Array[String],
+    userSpecifiedPartitionColumns: Array[String],
     bucketSpec: Option[BucketSpec],
     ignoreIfExists: Boolean,
     managedIfNoPath: Boolean)
@@ -95,17 +95,39 @@ case class CreateDataSourceTableCommand(
     }

     // Create the relation to validate the arguments before writing the metadata to the metastore.
-    DataSource(
-      sparkSession = sparkSession,
-      userSpecifiedSchema = userSpecifiedSchema,
-      className = provider,
-      bucketSpec = None,
-      options = optionsWithPath).resolveRelation(checkPathExist = false)
+    val dataSource: HadoopFsRelation =
+      DataSource(
+        sparkSession = sparkSession,
+        userSpecifiedSchema = userSpecifiedSchema,
+        className = provider,
+        bucketSpec = None,
+        options = optionsWithPath)
+        .resolveRelation(checkPathExist = false).asInstanceOf[HadoopFsRelation]
Review comment: is it safe to cast it?
Review comment: I think a safer way is to do a pattern match here, if it's …
Author reply: Sure, will do it.
+    if (userSpecifiedSchema.isEmpty && userSpecifiedPartitionColumns.length > 0) {
+      // The table does not have a specified schema, which means that the schema will be inferred
+      // when we load the table. So, we are not expecting partition columns and we will discover
+      // partitions when we load the table. However, if there are specified partition columns,
+      // we simply ignore them and provide a warning message.
+      logWarning(
+        s"Specified partition columns (${userSpecifiedPartitionColumns.mkString(",")}) will be " +
+          s"ignored. The schema and partition columns of table $tableIdent are inferred. " +
+          s"Schema: ${dataSource.schema.simpleString}; " +
+          s"Partition columns: ${dataSource.partitionSchema.fieldNames}")
+    }
+
+    val partitionColumns =
+      if (userSpecifiedSchema.isEmpty) {
+        dataSource.partitionSchema.fieldNames
+      } else {
+        userSpecifiedPartitionColumns
+      }
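As a standalone illustration of the branch above — inferred partition columns are used only when no schema was specified, and any user-specified columns are then ignored with a warning — here is a minimal Scala sketch. The helper name is my own and `println` stands in for `logWarning`; this is not Spark code.

```scala
// Hypothetical helper mirroring the partition-column resolution above.
// A Seq of field names stands in for Option[StructType].
def resolvePartitionColumns(
    userSpecifiedSchema: Option[Seq[String]],
    userSpecifiedPartitionColumns: Array[String],
    inferredPartitionColumns: Array[String]): Array[String] = {
  if (userSpecifiedSchema.isEmpty) {
    if (userSpecifiedPartitionColumns.nonEmpty) {
      // In the real command this is a logWarning.
      println(s"Specified partition columns " +
        s"(${userSpecifiedPartitionColumns.mkString(",")}) will be ignored.")
    }
    inferredPartitionColumns // discovered from the data at load time
  } else {
    userSpecifiedPartitionColumns
  }
}
```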
     CreateDataSourceTableUtils.createDataSourceTable(
       sparkSession = sparkSession,
       tableIdent = tableIdent,
-      userSpecifiedSchema = userSpecifiedSchema,
+      schema = dataSource.schema,
Review comment: seems we should still use the user-specified schema, right?
Review comment: I think from the code, it is not very clear that …
Author reply: Here, … Actually, after re-checking the code, I found the schema might be adjusted a little even if users specify the schema. For example, the nullability could be changed: spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala, line 407 in 64529b1. I think we should make such a change, but maybe we should test and log it?
+      isSchemaInferred = userSpecifiedSchema.isEmpty,
       partitionColumns = partitionColumns,
       bucketSpec = bucketSpec,
       provider = provider,
@@ -256,7 +278,8 @@ case class CreateDataSourceTableAsSelectCommand(
         CreateDataSourceTableUtils.createDataSourceTable(
           sparkSession = sparkSession,
           tableIdent = tableIdent,
-          userSpecifiedSchema = Some(result.schema),
+          schema = result.schema,
+          isSchemaInferred = false,
           partitionColumns = partitionColumns,
           bucketSpec = bucketSpec,
           provider = provider,
@@ -270,7 +293,6 @@ case class CreateDataSourceTableAsSelectCommand(
   }
 }

 object CreateDataSourceTableUtils extends Logging {

   val DATASOURCE_PREFIX = "spark.sql.sources."
@@ -279,6 +301,7 @@ object CreateDataSourceTableUtils extends Logging {
   val DATASOURCE_OUTPUTPATH = DATASOURCE_PREFIX + "output.path"
   val DATASOURCE_SCHEMA = DATASOURCE_PREFIX + "schema"
   val DATASOURCE_SCHEMA_PREFIX = DATASOURCE_SCHEMA + "."
+  val DATASOURCE_SCHEMA_ISINFERRED = DATASOURCE_SCHEMA_PREFIX + "isInferred"
   val DATASOURCE_SCHEMA_NUMPARTS = DATASOURCE_SCHEMA_PREFIX + "numParts"
   val DATASOURCE_SCHEMA_NUMPARTCOLS = DATASOURCE_SCHEMA_PREFIX + "numPartCols"
   val DATASOURCE_SCHEMA_NUMSORTCOLS = DATASOURCE_SCHEMA_PREFIX + "numSortCols"
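For reference, the `DATASOURCE_*` constants above are plain string concatenations; the sketch below shows the resulting key names. Only `isInferred` is new in this PR. The per-part key prefix (`part.`) is an assumption on my side, since its definition sits outside this diff.

```scala
// Reconstructed key layout for the table-property names used by this PR.
val DATASOURCE_PREFIX = "spark.sql.sources."
val DATASOURCE_SCHEMA = DATASOURCE_PREFIX + "schema"
val DATASOURCE_SCHEMA_PREFIX = DATASOURCE_SCHEMA + "."
val DATASOURCE_SCHEMA_ISINFERRED = DATASOURCE_SCHEMA_PREFIX + "isInferred"

// Assumed shape of the per-part schema key (prefix defined outside this diff):
def schemaPartKey(index: Int): String = s"${DATASOURCE_SCHEMA_PREFIX}part.$index"
```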
@@ -303,10 +326,40 @@ object CreateDataSourceTableUtils extends Logging {
     matcher.matches()
   }

+  /**
+   * Saves the schema (including partition info) into the table properties.
+   * Overwrites the schema if it already exists.
+   */
+  def saveSchema(
+      sparkSession: SparkSession,
+      schema: StructType,
+      partitionColumns: Array[String],
+      tableProperties: mutable.HashMap[String, String]): Unit = {
+    // The serialized JSON schema string may be too long to be stored in a single
+    // metastore SerDe property. In this case, we split the JSON string and store
+    // each part as a separate table property.
+    val threshold = sparkSession.sessionState.conf.schemaStringLengthThreshold
+    val schemaJsonString = schema.json
+    // Split the JSON string.
+    val parts = schemaJsonString.grouped(threshold).toSeq
+    tableProperties.put(DATASOURCE_SCHEMA_NUMPARTS, parts.size.toString)
+    parts.zipWithIndex.foreach { case (part, index) =>
+      tableProperties.put(s"$DATASOURCE_SCHEMA_PART_PREFIX$index", part)
+    }
+
+    if (partitionColumns.length > 0) {
+      tableProperties.put(DATASOURCE_SCHEMA_NUMPARTCOLS, partitionColumns.length.toString)
+      partitionColumns.zipWithIndex.foreach { case (partCol, index) =>
+        tableProperties.put(s"$DATASOURCE_SCHEMA_PARTCOL_PREFIX$index", partCol)
+      }
+    }
+  }
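The split/reassemble round-trip that `saveSchema` relies on can be sketched without Spark. The property key names below are simplified placeholders, not the real `DATASOURCE_*` keys, and the reassembly half is my addition to show the round-trip is lossless.

```scala
import scala.collection.mutable

// Chunk a long JSON string with grouped(threshold) and store each chunk as a
// numbered property, as saveSchema does above.
def splitIntoProperties(json: String, threshold: Int): mutable.HashMap[String, String] = {
  val props = mutable.HashMap[String, String]()
  val parts = json.grouped(threshold).toSeq
  props.put("schema.numParts", parts.size.toString)
  parts.zipWithIndex.foreach { case (part, i) => props.put(s"schema.part.$i", part) }
  props
}

// Rebuild the original string by concatenating the numbered parts in order.
def reassembleSchema(props: mutable.HashMap[String, String]): String =
  (0 until props("schema.numParts").toInt).map(i => props(s"schema.part.$i")).mkString
```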
   def createDataSourceTable(
       sparkSession: SparkSession,
       tableIdent: TableIdentifier,
-      userSpecifiedSchema: Option[StructType],
+      schema: StructType,
+      isSchemaInferred: Boolean,
       partitionColumns: Array[String],
       bucketSpec: Option[BucketSpec],
       provider: String,
@@ -315,28 +368,10 @@ object CreateDataSourceTableUtils extends Logging {
     val tableProperties = new mutable.HashMap[String, String]
     tableProperties.put(DATASOURCE_PROVIDER, provider)

-    // Saves optional user specified schema. Serialized JSON schema string may be too long to be
-    // stored into a single metastore SerDe property. In this case, we split the JSON string and
-    // store each part as a separate SerDe property.
-    userSpecifiedSchema.foreach { schema =>
-      val threshold = sparkSession.sessionState.conf.schemaStringLengthThreshold
-      val schemaJsonString = schema.json
-      // Split the JSON string.
-      val parts = schemaJsonString.grouped(threshold).toSeq
-      tableProperties.put(DATASOURCE_SCHEMA_NUMPARTS, parts.size.toString)
-      parts.zipWithIndex.foreach { case (part, index) =>
-        tableProperties.put(s"$DATASOURCE_SCHEMA_PART_PREFIX$index", part)
-      }
-    }
+    tableProperties.put(DATASOURCE_SCHEMA_ISINFERRED, isSchemaInferred.toString.toUpperCase)
+    saveSchema(sparkSession, schema, partitionColumns, tableProperties)

-    if (userSpecifiedSchema.isDefined && partitionColumns.length > 0) {
-      tableProperties.put(DATASOURCE_SCHEMA_NUMPARTCOLS, partitionColumns.length.toString)
-      partitionColumns.zipWithIndex.foreach { case (partCol, index) =>
-        tableProperties.put(s"$DATASOURCE_SCHEMA_PARTCOL_PREFIX$index", partCol)
-      }
-    }

-    if (userSpecifiedSchema.isDefined && bucketSpec.isDefined) {
+    if (bucketSpec.isDefined) {
       val BucketSpec(numBuckets, bucketColumnNames, sortColumnNames) = bucketSpec.get

       tableProperties.put(DATASOURCE_SCHEMA_NUMBUCKETS, numBuckets.toString)
@@ -353,16 +388,6 @@ object CreateDataSourceTableUtils extends Logging {
       }
     }

-    if (userSpecifiedSchema.isEmpty && partitionColumns.length > 0) {
-      // The table does not have a specified schema, which means that the schema will be inferred
-      // when we load the table. So, we are not expecting partition columns and we will discover
-      // partitions when we load the table. However, if there are specified partition columns,
-      // we simply ignore them and provide a warning message.
-      logWarning(
-        s"The schema and partitions of table $tableIdent will be inferred when it is loaded. " +
-          s"Specified partition columns (${partitionColumns.mkString(",")}) will be ignored.")
-    }

     val tableType = if (isExternal) {
       tableProperties.put("EXTERNAL", "TRUE")
       CatalogTableType.EXTERNAL
@@ -375,7 +400,7 @@ object CreateDataSourceTableUtils extends Logging {
     val dataSource =
       DataSource(
         sparkSession,
-        userSpecifiedSchema = userSpecifiedSchema,
+        userSpecifiedSchema = Some(schema),
         partitionColumns = partitionColumns,
         bucketSpec = bucketSpec,
         className = provider,
@@ -413,15 +413,7 @@ case class DescribeTableCommand(table: TableIdentifier, isExtended: Boolean, isF
     } else {
       val metadata = catalog.getTableMetadata(table)

-      if (DDLUtils.isDatasourceTable(metadata)) {
-        DDLUtils.getSchemaFromTableProperties(metadata) match {
-          case Some(userSpecifiedSchema) => describeSchema(userSpecifiedSchema, result)
-          case None => describeSchema(catalog.lookupRelation(table).schema, result)
-        }
-      } else {
-        describeSchema(metadata.schema, result)
-      }
+      describeSchema(metadata, result)

       if (isExtended) {
         describeExtended(metadata, result)
       } else if (isFormatted) {
@@ -518,6 +510,19 @@ case class DescribeTableCommand(table: TableIdentifier, isExtended: Boolean, isF
     }
   }

+  private def describeSchema(
+      tableDesc: CatalogTable,
+      buffer: ArrayBuffer[Row]): Unit = {
+    if (DDLUtils.isDatasourceTable(tableDesc)) {
+      DDLUtils.getSchemaFromTableProperties(tableDesc) match {
Review comment: Now …
Author reply: For all types of data source tables, we store the schema in the table properties. Thus, we should not return None; unless the table properties are modified by users using the … Sorry, forgot to update the message.
Author reply: Now, the message is changed to …
Review comment: Can we make …
Author reply: Sure, will do.
+        case Some(userSpecifiedSchema) => describeSchema(userSpecifiedSchema, buffer)
+        case None => append(buffer, "# Schema of this table is inferred at runtime", "", "")
+      }
+    } else {
+      describeSchema(tableDesc.schema, buffer)
+    }
+  }
|
||
private def describeSchema(schema: Seq[CatalogColumn], buffer: ArrayBuffer[Row]): Unit = { | ||
schema.foreach { column => | ||
append(buffer, column.name, column.dataType.toLowerCase, column.comment.orNull) | ||
|
Review comment: cc @rxin, I'm thinking of what's the main reason to allow inferring the table schema at run time. IIRC, it's mainly because we want to save some typing when creating an external data source table with a SQL string, since such tables usually have a very long schema, e.g. json files. If this is true, then the table schema is not supposed to change. If users do want to change it, I'd argue that it's a different table and they should drop this table and create a new one. Then we don't need to make refresh table support schema changes, and thus don't need to store the DATASOURCE_SCHEMA_ISINFERRED flag.

Review comment: refreshTable shouldn't run schema inference. Only run schema inference when creating the table. And don't make this a config flag. Just run schema inference when creating the table. For managed tables, store the schema explicitly. Users must explicitly change it.

Author reply: @rxin @cloud-fan I see. Will make a change. FYI, this will change the existing external behavior.

Review comment: Yes, unfortunately I found out about this one too late. I will add it to the release notes for 2.0 that this change will come.

Author reply: Thanks!
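The policy the reviewers converge on — infer the schema once at CREATE TABLE, persist it, and never re-infer on refresh — can be sketched with a toy catalog. Nothing here is Spark API; every name is illustrative, and a plain string stands in for the schema.

```scala
import scala.collection.mutable

// Toy catalog: schema inference runs exactly once, at table creation;
// refresh reads the persisted schema instead of re-inferring it.
final case class TableMeta(schema: String)
val catalog = mutable.Map[String, TableMeta]()
var inferenceRuns = 0

def inferSchema(): String = { inferenceRuns += 1; "a INT, b STRING" }

def createTable(name: String): Unit =
  catalog(name) = TableMeta(inferSchema()) // inference happens here, once

def refreshTable(name: String): TableMeta =
  catalog(name)                            // no inference on refresh
```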