
[SPARK-18464][SQL] support old table which doesn't store schema in metastore #15900

Closed

Conversation

cloud-fan
Contributor

What changes were proposed in this pull request?

Before Spark 2.1, users could create an external data source table without a schema, and Spark would infer the table schema at runtime. In Spark 2.1, we decided to infer the schema when the table is created, so that we don't need to infer it again and again at runtime.

This is a good improvement, but we should still respect and support old tables that don't store the table schema in the metastore.

How was this patch tested?

regression test.

@cloud-fan
Contributor Author

cc @yhuai @ericl @gatorsmile

@SparkQA

SparkQA commented Nov 16, 2016

Test build #68697 has finished for PR 15900 at commit a75cf30.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 16, 2016

Test build #68717 has finished for PR 15900 at commit 4094a72.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -1023,6 +1023,11 @@ object HiveExternalCatalog {
// After SPARK-6024, we removed this flag.
// Although we are not using `spark.sql.sources.schema` any more, we need to still support.
DataType.fromJson(schema.get).asInstanceOf[StructType]
} else if (props.filterKeys(_.startsWith(DATASOURCE_SCHEMA_PREFIX)).isEmpty) {
// If there is no schema information in table properties, it means the schema of this table
// was empty when saving into metastore, which is possible in older version of Spark. We
Contributor

nit: Please mention version number

@@ -64,7 +64,9 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
val dataSource =
DataSource(
sparkSession,
userSpecifiedSchema = Some(table.schema),
// In older version of Spark, the table schema can be empty and should be inferred at
Contributor

nit: Please mention version number
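
The diff above wires the stored schema into `DataSource`; per the PR description, an empty stored schema now means "infer at runtime". A rough sketch of that decision, using a made-up stand-in type (`FakeSchema` is hypothetical, not Spark's `StructType`):

```scala
// Hedged sketch: when the stored schema is empty (a table written by an old
// Spark version), pass None as the user-specified schema so the data source
// falls back to runtime inference; otherwise use the stored schema.
case class FakeSchema(fields: Seq[String]) {
  def isEmpty: Boolean = fields.isEmpty
}

object InferenceFallback {
  def userSpecifiedSchema(stored: FakeSchema): Option[FakeSchema] =
    if (stored.isEmpty) None else Some(stored)
}
```

Passing `None` lets the data source run its own inference path, which mirrors the pre-2.1 behavior described in the PR description.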

checkAnswer(spark.table("old"), Row(1, "a"))
}
}
}
Contributor

It would be good to create a set of compatibility tests to make sure a new version of Spark can access table metadata created by an older version (starting from Spark 1.3) without problems. Let's create a follow-up JIRA for this task and do it during the QA period of Spark 2.1.

// If there is no schema information in table properties, it means the schema of this table
// was empty when saving into metastore, which is possible in older version of Spark. We
// should respect it.
new StructType()
Contributor

@yhuai yhuai Nov 16, 2016

btw, a clarification question. This function is only needed for data source tables, right?

Contributor Author

No. Since we also store the schema for Hive tables, Hive tables will call this function as well. But a Hive table will never go into this branch, as it always has a schema. (The removal of runtime schema inference happened before we started storing the schema of Hive tables.)
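
The branch being discussed can be sketched roughly as follows. This is a simplified stand-in: the real code returns a `StructType`, and while the property names below match Spark's conventions (`spark.sql.sources.schema.numParts`, `…part.N`), treat them as an assumption here:

```scala
// Hedged sketch of restoring a schema from metastore table properties.
// Returning Option[String] of the schema JSON instead of a StructType is a
// simplification for illustration.
object SchemaRestoreSketch {
  val DATASOURCE_SCHEMA_PREFIX = "spark.sql.sources.schema."

  // None means: no schema info at all in the table properties, so the table
  // was written by an old Spark version and the schema must be inferred later.
  def schemaFromProps(props: Map[String, String]): Option[String] =
    props.get("spark.sql.sources.schema") match {
      case Some(json) =>
        Some(json) // legacy layout: whole schema JSON in a single property
      case None if props.keys.forall(k => !k.startsWith(DATASOURCE_SCHEMA_PREFIX)) =>
        None // old table with no stored schema: infer at runtime
      case None =>
        // current layout: schema JSON split across numbered part properties
        val numParts = props(DATASOURCE_SCHEMA_PREFIX + "numParts").toInt
        Some((0 until numParts)
          .map(i => props(s"${DATASOURCE_SCHEMA_PREFIX}part.$i"))
          .mkString)
    }
}
```

A Hive table always has a schema stored, so it would never hit the `None`-returning branch, matching the explanation above.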

properties = Map(
HiveExternalCatalog.DATASOURCE_PROVIDER -> "parquet"))
hiveClient.createTable(tableDesc, ignoreIfExists = false)
checkAnswer(spark.table("old"), Row(1, "a"))
Contributor

Can we also test `DESCRIBE TABLE` and make sure it provides correct column info?

@SparkQA

SparkQA commented Nov 17, 2016

Test build #68745 has finished for PR 15900 at commit 847dada.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Nov 17, 2016

Merging in master/branch-2.1.

@asfgit asfgit closed this in 07b3f04 Nov 17, 2016
asfgit pushed a commit that referenced this pull request Nov 17, 2016
…tastore

## What changes were proposed in this pull request?

Before Spark 2.1, users could create an external data source table without a schema, and Spark would infer the table schema at runtime. In Spark 2.1, we decided to infer the schema when the table is created, so that we don't need to infer it again and again at runtime.

This is a good improvement, but we should still respect and support old tables that don't store the table schema in the metastore.

## How was this patch tested?

regression test.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #15900 from cloud-fan/hive-catalog.

(cherry picked from commit 07b3f04)
Signed-off-by: Reynold Xin <rxin@databricks.com>
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…tastore

Closes apache#15900 from cloud-fan/hive-catalog.
ghost pushed a commit to dbtsai/spark that referenced this pull request Aug 15, 2017
…hema in table properties

## What changes were proposed in this pull request?

This is a follow-up of apache#15900, to fix one more bug:
When the table schema is empty and needs to be inferred at runtime, we should not resolve parent plans before the schema has been inferred, or the parent plans will be resolved against an empty schema and may produce wrong results for queries like `select *`.

The fix: introduce `UnresolvedCatalogRelation` as a placeholder, then replace it with `LogicalRelation` or `HiveTableRelation` during analysis, so that parent plans are guaranteed not to be resolved until the schema has been inferred.

## How was this patch tested?

regression test

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#18907 from cloud-fan/bug.
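
The placeholder approach in this follow-up can be illustrated with a plain-Scala sketch (hypothetical types, not Spark's actual `UnresolvedCatalogRelation`, `LogicalRelation`, or analyzer classes):

```scala
// Hedged sketch: a placeholder relation reports itself as unresolved, so parent
// operators (e.g. a `select *` projection) cannot resolve against an empty
// schema; a later analysis step swaps it for a concrete relation once the
// schema has been inferred.
sealed trait Plan { def resolved: Boolean }

case class UnresolvedCatalogRelation(table: String) extends Plan {
  val resolved = false // blocks parents from resolving too early
}

case class ConcreteRelation(table: String, schema: Seq[String]) extends Plan {
  val resolved = true
}

object ResolveRelations {
  // Stand-in for the analysis rule: replace the placeholder with a relation
  // whose schema has been inferred (inference itself is abstracted away).
  def apply(plan: Plan, inferSchema: String => Seq[String]): Plan = plan match {
    case UnresolvedCatalogRelation(t) => ConcreteRelation(t, inferSchema(t))
    case other => other
  }
}
```

The key design point is that resolution is gated on the placeholder: nothing above it in the plan can bind column references until the rule has run.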
cloud-fan added a commit to cloud-fan/spark that referenced this pull request Aug 16, 2017
…hema in table properties
