
[SPARK-16552] [SQL] Store the Inferred Schemas into External Catalog Tables when Creating Tables #14207

Closed · 15 commits

Conversation

gatorsmile
Member

@gatorsmile gatorsmile commented Jul 14, 2016

What changes were proposed in this pull request?

Currently, in Spark SQL, the initial creation of a table's schema can be classified into two groups. This applies to both Hive tables and data source tables:

Group A. Users specify the schema.

Case 1 CREATE TABLE AS SELECT: the schema is determined by the result schema of the SELECT clause. For example,

CREATE TABLE tab STORED AS TEXTFILE
AS SELECT * from input

Case 2 CREATE TABLE: users explicitly specify the schema. For example,

CREATE TABLE jsonTable (_1 string, _2 string)
USING org.apache.spark.sql.json

Group B. Spark SQL infers the schema at runtime.

Case 3 CREATE TABLE: users do not specify the schema, only the path to the file location. For example,

CREATE TABLE jsonTable 
USING org.apache.spark.sql.json
OPTIONS (path '${tempDir.getCanonicalPath}')

Before this PR, Spark SQL does not store the inferred schema in the external catalog for the cases in Group B. When users refresh the metadata cache or access the table for the first time after (re)starting Spark, Spark SQL infers the schema and stores it in the metadata cache to improve the performance of subsequent metadata requests. However, this runtime schema inference could cause undesirable schema changes after each reboot of Spark.

This PR stores the inferred schema in the external catalog when the table is created. The original plan was that, when users intend to refresh the schema after possible changes to the external files (the table location), they would issue REFRESH TABLE, and Spark SQL would infer the schema again based on the previously specified table location and update/refresh the schema in the external catalog and the metadata cache. Following the review discussion, for data source tables the schema will not be inferred and refreshed; if the schema changes, users need to recreate the table. For managed tables, the schema can be changed through DDL statements.

In this PR, we do not use the inferred schema to replace the user-specified schema, to avoid external behavior changes. Based on the design, user-specified schemas (as described in Group A) can be changed by ALTER TABLE commands, although we do not support that yet.
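For illustration, here is a minimal spark-shell-style sketch of the intended end-to-end behavior; the table name, path, and sample data are hypothetical, and `spark` is the pre-defined SparkSession:

```scala
// Write some sample JSON files to a scratch location (hypothetical path).
spark.range(3).selectExpr("id AS _1", "CAST(id AS STRING) AS _2")
  .write.json("/tmp/spark16552_demo")

// Group B: create a table without a schema. Spark infers the schema from the
// files and, with this PR, persists it into the external catalog.
spark.sql(
  """
    |CREATE TABLE jsonTable
    |USING org.apache.spark.sql.json
    |OPTIONS (path '/tmp/spark16552_demo')
  """.stripMargin)

// The schema is now served from the catalog rather than re-inferred, so it
// stays stable across metadata cache refreshes and Spark restarts.
spark.table("jsonTable").printSchema()
```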

How was this patch tested?

TODO: add more cases to cover the changes.

@SparkQA

SparkQA commented Jul 14, 2016

Test build #62344 has finished for PR 14207 at commit 3be0dc0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

@rxin @cloud-fan @yhuai

This PR introduces a new concept, SchemaType, for recording the original source of a table's schema. When the SchemaType is USER, the table belongs to Group A; when it is INFERRED, the table requires schema inference, i.e., Group B.

Not sure whether this solution sounds OK to you. Let me know whether this is the right direction to resolve the issue. Thanks!
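A rough sketch of what such a marker could look like, following the description above and the diff below; the exact definition in the patch may differ:

```scala
// Records where a table's schema originally came from (sketch only).
case class SchemaType private (name: String)

object SchemaType {
  // Group A: the schema was supplied by the user (explicitly or via AS SELECT).
  val USER: SchemaType = new SchemaType("USER")
  // Group B: the schema was inferred by Spark SQL from the files at the table location.
  val INFERRED: SchemaType = new SchemaType("INFERRED")
}
```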

@@ -487,6 +487,10 @@ object DDLUtils {
isDatasourceTable(table.properties)
}

def isSchemaInferred(table: CatalogTable): Boolean = {
table.properties.get(DATASOURCE_SCHEMA_TYPE) == Option(SchemaType.INFERRED.name)
Contributor

Consider contains.

Contributor

Please don't use contains. It makes it much harder to read, and to understand that the return type is an Option.

Member Author

Thanks! @rxin @jaceklaskowski

I will not change it, because using contains would break compilation with Scala 2.10.
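For context, a small standalone illustration of the two styles discussed here; the property key below is a placeholder, not the actual Spark constant (the real one is DATASOURCE_SCHEMA_TYPE). Option.contains only exists since Scala 2.11, which is why it could not be used while Spark was still cross-built for Scala 2.10:

```scala
// Placeholder key for illustration only.
val schemaTypeKey = "schema.type"
val props: Map[String, String] = Map(schemaTypeKey -> "INFERRED")

// Style used in the patch: compare the Option against an expected Option value.
val viaEquals = props.get(schemaTypeKey) == Option("INFERRED")

// Style suggested in review; clearer for some readers, but requires Scala 2.11+.
val viaContains = props.get(schemaTypeKey).contains("INFERRED")
```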

@viirya
Member

viirya commented Jul 18, 2016

I think it is not clear what problem this PR tries to solve. It just says it proposes to save the inferred schema in the external catalog.

@gatorsmile
Member Author

@viirya The problem it tries to resolve comes from @rxin's comment in another PR: #14148 (comment)

@viirya
Member

viirya commented Jul 18, 2016

Does it mean that if users do not issue REFRESH when the table location is changed, the schema will be wrong after Spark restarts?

@gatorsmile
Member Author

gatorsmile commented Jul 18, 2016

The table location is not allowed to change, right?

With the changes in this PR, if changes to the data/files (pointed to by the table location) affect the table schema, users need to manually issue the REFRESH command. Restarting Spark will not cause schema changes.

Before this PR, if users restarted Spark or the corresponding metadata cache entry was replaced, the table schema could change without notice. This could be a potential issue when reads and writes are conducted in parallel. This undocumented behavior could complicate Spark applications.

Such unexpected changes should be avoided. If the schema has changed and the table is ready to be read with the new schema, users should manually issue the REFRESH command.

@@ -270,6 +291,11 @@ case class CreateDataSourceTableAsSelectCommand(
}
}

case class SchemaType private(name: String)
object SchemaType {
Contributor

Will we have more schema types? If not, I think a boolean flag isSchemaInferred should be good enough.

Member Author

Sure, will do it.

@viirya
Member

viirya commented Jul 19, 2016

@gatorsmile When the data/files are produced by an external system and Spark is just used to process them in batch, does it mean the schema can be inconsistent? Or should we call REFRESH every time before querying the table?

@gatorsmile
Member Author

@viirya Schema inference is time-consuming, especially when the number of files is huge. Thus, we should avoid refreshing it every time. That is one of the major reasons why we have a metadata cache for data source tables.

@viirya
Member

viirya commented Jul 19, 2016

@gatorsmile Yeah. I meant that since we use the stored schema for the table instead of inferring it, when the data/files are changed by an external system (e.g., appended to by a streaming system), the stored schema can become inconsistent with the actual schema of the data.

@cloud-fan
Contributor

when the data/files are changed by an external system (e.g., appended to by a streaming system), the stored schema can become inconsistent with the actual schema of the data.

I think this problem already exists, as we use the cached schema instead of inferring it every time. The only difference is that, after a reboot, this PR will still use the stored schema and require users to refresh the table manually.

@SparkQA

SparkQA commented Jul 19, 2016

Test build #62512 has finished for PR 14207 at commit c6afbbb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 19, 2016

Test build #62513 has finished for PR 14207 at commit 55c2c5e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -223,6 +223,9 @@ abstract class Catalog {
* If this table is cached as an InMemoryRelation, drop the original cached version and make the
* new version cached lazily.
*
* If the table's schema is inferred at runtime, infer the schema again and update the schema
Contributor

@cloud-fan cloud-fan commented Jul 19, 2016

cc @rxin, I'm thinking about what the main reason is for allowing the table schema to be inferred at runtime. IIRC, it's mainly because we want to save some typing when creating an external data source table via a SQL string, which usually has a very long schema, e.g. for JSON files.

If this is true, then the table schema is not supposed to change. If users do want to change it, I'd argue that it's a different table: they should drop this table and create a new one. Then we don't need to make REFRESH TABLE support schema changes, and thus don't need to store the DATASOURCE_SCHEMA_ISINFERRED flag.

Contributor

refreshTable shouldn't run schema inference. Only run schema inference when creating the table.

And don't make this a config flag. Just run schema inference when creating the table. For managed tables, store the schema explicitly. Users must explicitly change it.
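A rough sketch of the creation-time flow being described; the method name and helpers below are illustrative rather than the actual CreateDataSourceTableCommand code, and inference is shown through the public DataFrameReader API instead of Spark's internal DataSource resolution:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

// Decide the schema to persist into the external catalog at CREATE TABLE time.
// `userSpecifiedSchema` is Some(...) for Group A tables and None for Group B.
def schemaToStore(
    spark: SparkSession,
    provider: String,
    options: Map[String, String],
    userSpecifiedSchema: Option[StructType]): StructType = {
  userSpecifiedSchema.getOrElse {
    // Group B: run schema inference exactly once, at creation time.
    spark.read.format(provider).options(options).load().schema
  }
}
```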

Member Author

@rxin @cloud-fan I see. Will make a change

FYI, this will change the existing external behavior.

Contributor

Yes, unfortunately I found out about this one too late. I will add a note to the 2.0 release notes that this change is coming.

Member Author

Thanks!

@SparkQA

SparkQA commented Jul 19, 2016

Test build #62552 has finished for PR 14207 at commit a043ca2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile gatorsmile changed the title [SPARK-16552] [SQL] [WIP] Store the Inferred Schemas into External Catalog Tables when Creating Tables [SPARK-16552] [SQL] Store the Inferred Schemas into External Catalog Tables when Creating Tables Jul 19, 2016
@gatorsmile
Member Author

@cloud-fan @rxin @yhuai The code is ready to review. Thanks!

tableDesc: CatalogTable,
buffer: ArrayBuffer[Row]): Unit = {
if (DDLUtils.isDatasourceTable(tableDesc)) {
DDLUtils.getSchemaFromTableProperties(tableDesc) match {
Contributor

Now getSchemaFromTableProperties should never return None?

Member Author

For all types of data source tables, we store the schema in the table properties. Thus, we should not return None, unless the table properties have been modified by users via the ALTER TABLE command.

Sorry, forgot to update the message.

Member Author

Now, the message is changed to "# Schema of this table is corrupted"

Contributor

Can we make DDLUtils.getSchemaFromTableProperties always return a schema and throw an exception if it's corrupted? I think that's more consistent with the previous behaviour, i.e. throwing an exception if the expected schema properties don't exist.

Member Author

Sure, will do.
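A minimal sketch of the agreed-upon shape; the property key, messages, and exception type below are placeholders (Spark actually splits the serialized schema across several properties), so this only illustrates the "always return or throw" contract:

```scala
import org.apache.spark.sql.catalyst.catalog.CatalogTable
import org.apache.spark.sql.types.{DataType, StructType}

// Sketch: always return a schema; fail loudly if the stored properties are
// missing or corrupted, instead of returning an Option.
def getSchemaFromTableProperties(table: CatalogTable): StructType = {
  val json = table.properties.getOrElse("schema.json",  // placeholder key
    throw new IllegalStateException(
      s"Could not read the schema of table ${table.identifier} from its table properties"))
  DataType.fromJson(json) match {
    case s: StructType => s
    case _ => throw new IllegalStateException(
      s"The schema of table ${table.identifier} is corrupted")
  }
}
```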

@SparkQA

SparkQA commented Jul 20, 2016

Test build #62574 has finished for PR 14207 at commit e930819.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 21, 2016

Test build #62647 has finished for PR 14207 at commit 264ad35.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

Could you please review it again? @yhuai @liancheng @rxin Thanks!

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Jul 25, 2016

Test build #62810 has finished for PR 14207 at commit 264ad35.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

private def createDataSourceTable(
path: File,
userSpecifiedSchema: Option[String],
userSpecifiedPartitionCols: Option[String]): (StructType, Seq[String]) = {
Contributor

How about we pass in the expected schema and partition columns, and do the check in this method?

Member Author

Sure, will do. Thanks!
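A rough sketch of the reviewer's suggestion, written as it might appear inside a hypothetical SQL test suite where `spark` is the active SparkSession (table name, provider, and helper shape are illustrative); the helper takes the expected schema and partition columns and performs the checks itself instead of returning them:

```scala
import java.io.File
import org.apache.spark.sql.types.StructType

private def checkCreatedDataSourceTable(
    path: File,
    userSpecifiedSchema: Option[String],
    userSpecifiedPartitionCols: Option[String],
    expectedSchema: StructType,
    expectedPartitionCols: Seq[String]): Unit = {
  val schemaClause = userSpecifiedSchema.map(s => s"($s)").getOrElse("")
  val partitionClause =
    userSpecifiedPartitionCols.map(p => s"PARTITIONED BY ($p)").getOrElse("")

  spark.sql(
    s"""
       |CREATE TABLE tab $schemaClause
       |USING parquet
       |OPTIONS (path '${path.getCanonicalPath}')
       |$partitionClause
     """.stripMargin)

  // The schema visible to queries should be the one recorded in the catalog.
  assert(spark.table("tab").schema == expectedSchema)

  // Partition columns, as reported by the catalog.
  val partCols =
    spark.catalog.listColumns("tab").collect().filter(_.isPartition).map(_.name).toSeq
  assert(partCols == expectedPartitionCols)
}
```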

@SparkQA

SparkQA commented Jul 27, 2016

Test build #62914 has finished for PR 14207 at commit 6492e98.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 27, 2016

Test build #62926 has finished for PR 14207 at commit b694d8b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

cc @yhuai @liancheng, we will address your comments in follow-up PRs if you have any.

@asfgit asfgit closed this in 762366f Jul 28, 2016
case r: HadoopFsRelation => r.partitionSchema.fieldNames
case _ => Array.empty[String]
}
if (userSpecifiedPartitionColumns.length > 0) {
Contributor

Should we throw an exception for this case?

Member Author

Here, I just keep the existing behavior.

To be honest, I think we should throw an exception whenever it makes sense. It seems the job log is not read by most users. I will submit a follow-up PR to make this change. Thanks!
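For reference, a sketch of what the follow-up change amounts to; the method name, message, and exception type are illustrative rather than the exact code that was merged:

```scala
import org.apache.spark.sql.types.StructType

// Reject, at CREATE TABLE time, partition columns that are specified while the
// schema itself is left to be inferred, instead of only logging a warning.
def validatePartitionColumns(
    userSpecifiedSchema: Option[StructType],
    userSpecifiedPartitionColumns: Seq[String]): Unit = {
  if (userSpecifiedSchema.isEmpty && userSpecifiedPartitionColumns.nonEmpty) {
    throw new IllegalArgumentException(
      "It is not allowed to specify partition columns when the table schema is not " +
        "defined; the schema and the partition columns will be inferred.")
  }
}
```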

@yhuai
Contributor

yhuai commented Aug 8, 2016

@gatorsmile

Where is the change for the following description?

When users intend to refresh the schema after possible changes on external files (table location),
they issue REFRESH TABLE.
Spark SQL will infer the schema again based on the previously specified table location and 
update/refresh the schema in the external catalog and metadata cache.

@gatorsmile gatorsmile deleted the userSpecifiedSchema branch August 8, 2016 22:33
@gatorsmile
Member Author

gatorsmile commented Aug 8, 2016

@yhuai Forgot to change the PR description. For data source tables, the schema will not be inferred and refreshed. This is based on the comment: #14207 (comment)

BTW: just updated the PR description. Sorry for that.

asfgit pushed a commit that referenced this pull request Aug 26, 2016
…g Columns without a Given Schema

### What changes were proposed in this pull request?
Address the comments by yhuai in the original PR: #14207

First, issue an exception instead of logging a warning when users specify the partitioning columns without a given schema.

Second, refactor the code a little.

### How was this patch tested?
Fixed the test cases.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #14572 from gatorsmile/followup16552.
ghost pushed a commit to dbtsai/spark that referenced this pull request Oct 27, 2017
…s between data and partition schema

## What changes were proposed in this pull request?

This is a regression introduced by apache#14207. Since Spark 2.1, we store the inferred schema when creating the table, to avoid inferring the schema again at the read path. However, there is one special case: overlapping columns between the data and partition schema. This case breaks the assumptions about the table schema, namely that there is no overlap between the data and partition schema and that partition columns should be at the end. The result is that, for Spark 2.1, the table scan has an incorrect schema that puts the partition columns at the end. For Spark 2.2, we added a check in CatalogTable to validate the table schema, which fails in this case.

To fix this issue, a simple and safe approach is to fall back to the old behavior when overlapping columns are detected, i.e. store an empty schema in the metastore.

## How was this patch tested?

new regression test

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#19579 from cloud-fan/bug2.
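A condensed sketch of the fallback described in this commit message; the helper and its signature are illustrative, not the actual catalog-integration code:

```scala
import org.apache.spark.sql.types.StructType

// If any partition column also appears among the data columns, fall back to
// persisting an empty schema so the read path re-infers it (pre-2.1 behavior);
// otherwise persist the full table schema.
def schemaToPersist(
    tableSchema: StructType,
    dataSchema: StructType,
    partitionColumnNames: Seq[String]): StructType = {
  val overlapped = dataSchema.fieldNames.toSet.intersect(partitionColumnNames.toSet)
  if (overlapped.nonEmpty) new StructType() else tableSchema
}
```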
asfgit pushed a commit that referenced this pull request Oct 27, 2017
…s between data and partition schema

MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018
…s between data and partition schema
