
[SPARK-16552] [SQL] Store the Inferred Schemas into External Catalog Tables when Creating Tables #14207

Closed · 15 commits

Conversation

gatorsmile
Member

@gatorsmile gatorsmile commented Jul 14, 2016

What changes were proposed in this pull request?

Currently, in Spark SQL, the initial creation of a table's schema can be classified into two groups. This applies to both Hive tables and data source tables:

Group A. Users specify the schema.

Case 1 CREATE TABLE AS SELECT: the schema is determined by the result schema of the SELECT clause. For example,

CREATE TABLE tab STORED AS TEXTFILE
AS SELECT * from input

Case 2 CREATE TABLE: users explicitly specify the schema. For example,

CREATE TABLE jsonTable (_1 string, _2 string)
USING org.apache.spark.sql.json

Group B. Spark SQL infers the schema at runtime.

Case 3 CREATE TABLE: users do not specify the schema, only the path to the file location. For example,

CREATE TABLE jsonTable 
USING org.apache.spark.sql.json
OPTIONS (path '${tempDir.getCanonicalPath}')

Before this PR, Spark SQL does not store the inferred schema in the external catalog for the cases in Group B. When users refresh the metadata cache or access the table for the first time after (re)starting Spark, Spark SQL infers the schema and stores it in the metadata cache to improve the performance of subsequent metadata requests. However, this runtime schema inference could cause undesirable schema changes after each reboot of Spark.

This PR stores the inferred schema in the external catalog when the table is created. The original plan was that, when users intend to refresh the schema after possible changes to the external files (the table location), they would issue REFRESH TABLE, and Spark SQL would infer the schema again based on the previously specified table location and update/refresh the schema in the external catalog and the metadata cache. Following the review discussion, for data source tables the schema will not be inferred and refreshed; if the schema changes, users need to recreate the table. For managed tables, the schema can be changed through DDL statements.

In this PR, we do not use the inferred schema to replace the user-specified schema, to avoid external behavior changes. Based on the design, user-specified schemas (as described in Group A) can be changed by ALTER TABLE commands, although we do not support that yet.
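For illustration, here is a minimal spark-shell-style sketch of the intended end-to-end behavior; the table name, path, and sample data are hypothetical, and `spark` is the pre-defined SparkSession:

```scala
// Write some sample JSON files to a scratch location (hypothetical path).
spark.range(3).selectExpr("id AS _1", "CAST(id AS STRING) AS _2")
  .write.json("/tmp/spark16552_demo")

// Group B: create a table without a schema. Spark infers the schema from the
// files and, with this PR, persists it into the external catalog.
spark.sql(
  """
    |CREATE TABLE jsonTable
    |USING org.apache.spark.sql.json
    |OPTIONS (path '/tmp/spark16552_demo')
  """.stripMargin)

// The schema is now served from the catalog rather than re-inferred, so it
// stays stable across metadata cache refreshes and Spark restarts.
spark.table("jsonTable").printSchema()
```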

How was this patch tested?

TODO: add more cases to cover the changes.

@SparkQA

SparkQA commented Jul 14, 2016

Test build #62344 has finished for PR 14207 at commit 3be0dc0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

@rxin @cloud-fan @yhuai

This PR introduces a new concept, SchemaType, for recording the original source of a table's schema. When the SchemaType is USER, the table belongs to Group A; when it is INFERRED, the table requires schema inference, i.e., Group B.

Not sure whether this solution sounds OK to you. Let me know whether this is the right direction to resolve the issue. Thanks!
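A rough sketch of what such a marker could look like, following the description above and the diff below; the exact definition in the patch may differ:

```scala
// Records where a table's schema originally came from (sketch only).
case class SchemaType private (name: String)

object SchemaType {
  // Group A: the schema was supplied by the user (explicitly or via AS SELECT).
  val USER: SchemaType = new SchemaType("USER")
  // Group B: the schema was inferred by Spark SQL from the files at the table location.
  val INFERRED: SchemaType = new SchemaType("INFERRED")
}
```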

@@ -487,6 +487,10 @@ object DDLUtils {
isDatasourceTable(table.properties)
}

def isSchemaInferred(table: CatalogTable): Boolean = {
table.properties.get(DATASOURCE_SCHEMA_TYPE) == Option(SchemaType.INFERRED.name)
Contributor

Consider contains.

Contributor

Please don't use contains. It makes it much harder to read, and to understand that the return type is an Option.

Member Author

Thanks! @rxin @jaceklaskowski

I will not change it, because using contains would break compilation with Scala 2.10.
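For context, a small standalone illustration of the two styles discussed here; the property key below is a placeholder, not the actual Spark constant (the real one is DATASOURCE_SCHEMA_TYPE). Option.contains only exists since Scala 2.11, which is why it could not be used while Spark was still cross-built for Scala 2.10:

```scala
// Placeholder key for illustration only.
val schemaTypeKey = "schema.type"
val props: Map[String, String] = Map(schemaTypeKey -> "INFERRED")

// Style used in the patch: compare the Option against an expected Option value.
val viaEquals = props.get(schemaTypeKey) == Option("INFERRED")

// Style suggested in review; clearer for some readers, but requires Scala 2.11+.
val viaContains = props.get(schemaTypeKey).contains("INFERRED")
```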

@viirya
Member

viirya commented Jul 18, 2016

I think it is not clear what problem this PR tries to solve. It just says it proposes to save the inferred schema in the external catalog.

@gatorsmile
Member Author

@viirya The problem it tries to resolve comes from @rxin's comment in another PR: #14148 (comment)

@viirya
Member

viirya commented Jul 18, 2016

Does it mean that if users do not issue REFRESH when the table location is changed, the schema will be wrong after Spark restarts?

@gatorsmile
Member Author

gatorsmile commented Jul 18, 2016

The table location is not allowed to change, right?

With the changes in this PR, if changes to the data/files (pointed to by the table location) affect the table schema, users need to manually issue the REFRESH command. Restarting Spark will not cause schema changes.

Before this PR, if users restarted Spark or the corresponding metadata cache entry was replaced, the table schema could change without notice. This could be a potential issue when reads and writes are conducted in parallel. This undocumented behavior could complicate Spark applications.

Such unexpected changes should be avoided. If the schema has changed and the table is ready to be read with the new schema, users should manually issue the REFRESH command.

@@ -270,6 +291,11 @@ case class CreateDataSourceTableAsSelectCommand(
}
}

case class SchemaType private(name: String)
object SchemaType {
Contributor

Will we have more schema types? If not, I think a boolean flag isSchemaInferred should be good enough.

Member Author

Sure, will do it.

@viirya
Member

viirya commented Jul 19, 2016

@gatorsmile When the data/files are produced by an external system and Spark is just used to process them in batch, does it mean the schema can be inconsistent? Or should we call REFRESH every time before querying the table?

@gatorsmile
Member Author

@viirya Schema inference is time-consuming, especially when the number of files is huge. Thus, we should avoid refreshing it every time. That is one of the major reasons why we have a metadata cache for data source tables.

@viirya
Member

viirya commented Jul 19, 2016

@gatorsmile Yeah. I meant that since we use the stored schema for the table instead of inferring it, when the data/files are changed by an external system (e.g., appended to by a streaming system), the stored schema can become inconsistent with the actual schema of the data.

@cloud-fan
Contributor

when the data/files are changed by an external system (e.g., appended to by a streaming system), the stored schema can become inconsistent with the actual schema of the data.

I think this problem already exists, as we use the cached schema instead of inferring it every time. The only difference is that, after a reboot, this PR will still use the stored schema and require users to refresh the table manually.

@SparkQA

SparkQA commented Jul 19, 2016

Test build #62512 has finished for PR 14207 at commit c6afbbb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 19, 2016

Test build #62513 has finished for PR 14207 at commit 55c2c5e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -223,6 +223,9 @@ abstract class Catalog {
* If this table is cached as an InMemoryRelation, drop the original cached version and make the
* new version cached lazily.
*
* If the table's schema is inferred at runtime, infer the schema again and update the schema
Contributor

@cloud-fan cloud-fan commented Jul 19, 2016

cc @rxin, I'm thinking about what the main reason is for allowing the table schema to be inferred at runtime. IIRC, it's mainly because we want to save some typing when creating an external data source table via a SQL string, which usually has a very long schema, e.g. for JSON files.

If this is true, then the table schema is not supposed to change. If users do want to change it, I'd argue that it's a different table: they should drop this table and create a new one. Then we don't need to make REFRESH TABLE support schema changes, and thus don't need to store the DATASOURCE_SCHEMA_ISINFERRED flag.

Contributor

refreshTable shouldn't run schema inference. Only run schema inference when creating the table.

And don't make this a config flag. Just run schema inference when creating the table. For managed tables, store the schema explicitly. Users must explicitly change it.
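A rough sketch of the creation-time flow being described; the method name and helpers below are illustrative rather than the actual CreateDataSourceTableCommand code, and inference is shown through the public DataFrameReader API instead of Spark's internal DataSource resolution:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

// Decide the schema to persist into the external catalog at CREATE TABLE time.
// `userSpecifiedSchema` is Some(...) for Group A tables and None for Group B.
def schemaToStore(
    spark: SparkSession,
    provider: String,
    options: Map[String, String],
    userSpecifiedSchema: Option[StructType]): StructType = {
  userSpecifiedSchema.getOrElse {
    // Group B: run schema inference exactly once, at creation time.
    spark.read.format(provider).options(options).load().schema
  }
}
```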

Member Author

@rxin @cloud-fan I see. Will make a change

FYI, this will change the existing external behavior.

Contributor

Yes, unfortunately I found out about this one too late. I will add a note to the 2.0 release notes that this change is coming.

Member Author

Thanks!

@SparkQA

SparkQA commented Jul 19, 2016

Test build #62552 has finished for PR 14207 at commit a043ca2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile gatorsmile changed the title [SPARK-16552] [SQL] [WIP] Store the Inferred Schemas into External Catalog Tables when Creating Tables [SPARK-16552] [SQL] Store the Inferred Schemas into External Catalog Tables when Creating Tables Jul 19, 2016
@gatorsmile
Member Author

@cloud-fan @rxin @yhuai The code is ready to review. Thanks!

tableDesc: CatalogTable,
buffer: ArrayBuffer[Row]): Unit = {
if (DDLUtils.isDatasourceTable(tableDesc)) {
DDLUtils.getSchemaFromTableProperties(tableDesc) match {
Contributor

Now getSchemaFromTableProperties should never return None?

Member Author

For all types of data source tables, we store the schema in the table properties. Thus, we should not return None, unless the table properties have been modified by users via the ALTER TABLE command.

Sorry, forgot to update the message.

Member Author

Now, the message is changed to "# Schema of this table is corrupted"

Contributor

Can we make DDLUtils.getSchemaFromTableProperties always return a schema and throw an exception if it's corrupted? I think that's more consistent with the previous behaviour, i.e. throwing an exception if the expected schema properties don't exist.

Member Author

Sure, will do.
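A minimal sketch of the agreed-upon shape; the property key, messages, and exception type below are placeholders (Spark actually splits the serialized schema across several properties), so this only illustrates the "always return or throw" contract:

```scala
import org.apache.spark.sql.catalyst.catalog.CatalogTable
import org.apache.spark.sql.types.{DataType, StructType}

// Sketch: always return a schema; fail loudly if the stored properties are
// missing or corrupted, instead of returning an Option.
def getSchemaFromTableProperties(table: CatalogTable): StructType = {
  val json = table.properties.getOrElse("schema.json",  // placeholder key
    throw new IllegalStateException(
      s"Could not read the schema of table ${table.identifier} from its table properties"))
  DataType.fromJson(json) match {
    case s: StructType => s
    case _ => throw new IllegalStateException(
      s"The schema of table ${table.identifier} is corrupted")
  }
}
```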

@SparkQA

SparkQA commented Jul 20, 2016

Test build #62574 has finished for PR 14207 at commit e930819.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 21, 2016

Test build #62647 has finished for PR 14207 at commit 264ad35.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

Could you please review it again? @yhuai @liancheng @rxin Thanks!

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Jul 25, 2016

Test build #62810 has finished for PR 14207 at commit 264ad35.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

private def createDataSourceTable(
path: File,
userSpecifiedSchema: Option[String],
userSpecifiedPartitionCols: Option[String]): (StructType, Seq[String]) = {
Contributor

How about we pass in the expected schema and partition columns, and do the check in this method?

Member Author

Sure, will do. Thanks!
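A rough sketch of the reviewer's suggestion, written as it might appear inside a hypothetical SQL test suite where `spark` is the active SparkSession (table name, provider, and helper shape are illustrative); the helper takes the expected schema and partition columns and performs the checks itself instead of returning them:

```scala
import java.io.File
import org.apache.spark.sql.types.StructType

private def checkCreatedDataSourceTable(
    path: File,
    userSpecifiedSchema: Option[String],
    userSpecifiedPartitionCols: Option[String],
    expectedSchema: StructType,
    expectedPartitionCols: Seq[String]): Unit = {
  val schemaClause = userSpecifiedSchema.map(s => s"($s)").getOrElse("")
  val partitionClause =
    userSpecifiedPartitionCols.map(p => s"PARTITIONED BY ($p)").getOrElse("")

  spark.sql(
    s"""
       |CREATE TABLE tab $schemaClause
       |USING parquet
       |OPTIONS (path '${path.getCanonicalPath}')
       |$partitionClause
     """.stripMargin)

  // The schema visible to queries should be the one recorded in the catalog.
  assert(spark.table("tab").schema == expectedSchema)

  // Partition columns, as reported by the catalog.
  val partCols =
    spark.catalog.listColumns("tab").collect().filter(_.isPartition).map(_.name).toSeq
  assert(partCols == expectedPartitionCols)
}
```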

@SparkQA

SparkQA commented Jul 27, 2016

Test build #62914 has finished for PR 14207 at commit 6492e98.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 27, 2016

Test build #62926 has finished for PR 14207 at commit b694d8b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

cc @yhuai @liancheng, we will address your comments in follow-up PRs if you have any.

@asfgit asfgit closed this in 762366f Jul 28, 2016
case r: HadoopFsRelation => r.partitionSchema.fieldNames
case _ => Array.empty[String]
}
if (userSpecifiedPartitionColumns.length > 0) {
Contributor

Should we throw an exception for this case?

Member Author

Here, I just keep the existing behavior.

To be honest, I think we should throw an exception whenever it makes sense. It seems the job log is not read by most users. I will submit a follow-up PR to make this change. Thanks!
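For reference, a sketch of what the follow-up change amounts to; the method name, message, and exception type are illustrative rather than the exact code that was merged:

```scala
import org.apache.spark.sql.types.StructType

// Reject, at CREATE TABLE time, partition columns that are specified while the
// schema itself is left to be inferred, instead of only logging a warning.
def validatePartitionColumns(
    userSpecifiedSchema: Option[StructType],
    userSpecifiedPartitionColumns: Seq[String]): Unit = {
  if (userSpecifiedSchema.isEmpty && userSpecifiedPartitionColumns.nonEmpty) {
    throw new IllegalArgumentException(
      "It is not allowed to specify partition columns when the table schema is not " +
        "defined; the schema and the partition columns will be inferred.")
  }
}
```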

@yhuai
Contributor

yhuai commented Aug 8, 2016

@gatorsmile

Where is the change for the following description?

When users intend to refresh the schema after possible changes on external files (table location),
they issue REFRESH TABLE.
Spark SQL will infer the schema again based on the previously specified table location and 
update/refresh the schema in the external catalog and metadata cache.

@gatorsmile gatorsmile deleted the userSpecifiedSchema branch August 8, 2016 22:33
@gatorsmile
Member Author

gatorsmile commented Aug 8, 2016

@yhuai Forgot to change the PR description. For data source tables, the schema will not be inferred and refreshed. This is based on the comment: #14207 (comment)

BTW: just updated the PR description. Sorry for that.

asfgit pushed a commit that referenced this pull request Aug 26, 2016
…g Columns without a Given Schema

### What changes were proposed in this pull request?
Address the comments by yhuai in the original PR: #14207

First, issue an exception instead of logging a warning when users specify the partitioning columns without a given schema.

Second, refactor the code a little.

### How was this patch tested?
Fixed the test cases.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #14572 from gatorsmile/followup16552.
ghost pushed a commit to dbtsai/spark that referenced this pull request Oct 27, 2017
…s between data and partition schema

## What changes were proposed in this pull request?

This is a regression introduced by apache#14207. Since Spark 2.1, we store the inferred schema when creating the table, to avoid inferring the schema again at the read path. However, there is one special case: overlapping columns between the data and partition schema. This case breaks the assumptions about the table schema, namely that there is no overlap between the data and partition schema and that partition columns should be at the end. The result is that, for Spark 2.1, the table scan has an incorrect schema that puts the partition columns at the end. For Spark 2.2, we added a check in CatalogTable to validate the table schema, which fails in this case.

To fix this issue, a simple and safe approach is to fall back to the old behavior when overlapping columns are detected, i.e. store an empty schema in the metastore.

## How was this patch tested?

new regression test

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#19579 from cloud-fan/bug2.
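A condensed sketch of the fallback described in this commit message; the helper and its signature are illustrative, not the actual catalog-integration code:

```scala
import org.apache.spark.sql.types.StructType

// If any partition column also appears among the data columns, fall back to
// persisting an empty schema so the read path re-infers it (pre-2.1 behavior);
// otherwise persist the full table schema.
def schemaToPersist(
    tableSchema: StructType,
    dataSchema: StructType,
    partitionColumnNames: Seq[String]): StructType = {
  val overlapped = dataSchema.fieldNames.toSet.intersect(partitionColumnNames.toSet)
  if (overlapped.nonEmpty) new StructType() else tableSchema
}
```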
asfgit pushed a commit that referenced this pull request Oct 27, 2017
…s between data and partition schema

MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018
…s between data and partition schema
