[SPARK-18464][SQL] support old table which doesn't store schema in metastore #15900
Conversation
Test build #68697 has finished for PR 15900 at commit
Force-pushed from a75cf30 to 4094a72.
Test build #68717 has finished for PR 15900 at commit
@@ -1023,6 +1023,11 @@ object HiveExternalCatalog {
      // After SPARK-6024, we removed this flag.
      // Although we are not using `spark.sql.sources.schema` any more, we need to still support.
      DataType.fromJson(schema.get).asInstanceOf[StructType]
    } else if (props.filterKeys(_.startsWith(DATASOURCE_SCHEMA_PREFIX)).isEmpty) {
      // If there is no schema information in table properties, it means the schema of this table
      // was empty when saving into metastore, which is possible in older version of Spark. We
nit: Please mention version number
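The fallback under discussion can be illustrated with a small, dependency-free sketch. The object name, helper signature, and string-based return type are simplified stand-ins for the real `HiveExternalCatalog` code (which reassembles a `StructType`), but the branch order mirrors the hunk above:

```scala
// Hypothetical sketch of the schema-restore fallback, not the real
// HiveExternalCatalog implementation.
object SchemaRestoreSketch {
  // Stand-in for the real property key prefix constant.
  val DATASOURCE_SCHEMA_PREFIX = "spark.sql.sources.schema"

  // Returns Some(schemaJson) when a schema is stored in the table
  // properties, None when the table predates schema storage and the
  // schema must be inferred at runtime.
  def restoreSchema(props: Map[String, String]): Option[String] = {
    props.get(DATASOURCE_SCHEMA_PREFIX) match {
      case Some(json) =>
        // Very old tables stored the whole schema JSON under a single key.
        Some(json)
      case None if props.keys.forall(!_.startsWith(DATASOURCE_SCHEMA_PREFIX)) =>
        // No schema information at all: the schema was empty when the table
        // was saved, which is possible in older versions of Spark.
        None
      case None =>
        // Newer layout: the schema JSON is split into numbered parts to fit
        // metastore property-size limits; reassemble them in order.
        val numParts = props(s"$DATASOURCE_SCHEMA_PREFIX.numParts").toInt
        Some((0 until numParts)
          .map(i => props(s"$DATASOURCE_SCHEMA_PREFIX.part.$i"))
          .mkString)
    }
  }
}
```

The key point is that the "no schema at all" case is distinct from the "schema stored in parts" case, which is exactly what the new `filterKeys(...).isEmpty` guard separates.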
@@ -64,7 +64,9 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
    val dataSource =
      DataSource(
        sparkSession,
        userSpecifiedSchema = Some(table.schema),
        // In older version of Spark, the table schema can be empty and should be inferred at
nit: Please mention version number
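The change around `userSpecifiedSchema` boils down to one decision: if the catalog handed back an empty schema (an old table), pass `None` so `DataSource` infers the schema at runtime. A minimal sketch of that decision, using a placeholder type rather than Spark's real `StructType`:

```scala
// Stand-in for StructType: just a list of field names.
final case class SchemaSketch(fieldNames: Seq[String]) {
  def isEmpty: Boolean = fieldNames.isEmpty
}

// Sketch of the HiveMetastoreCatalog change: an empty stored schema means
// "infer at runtime", so DataSource gets None instead of Some(empty).
def toUserSpecifiedSchema(stored: SchemaSketch): Option[SchemaSketch] =
  if (stored.isEmpty) None else Some(stored)
```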
    checkAnswer(spark.table("old"), Row(1, "a"))
  }
}
It would be good to actually create a set of compatibility tests to make sure a new version of Spark can access table metadata created by an older version (starting from Spark 1.3) without problems. Let's create a follow-up JIRA for this task and do it during the QA period of Spark 2.1.
// If there is no schema information in table properties, it means the schema of this table
// was empty when saving into metastore, which is possible in older version of Spark. We
// should respect it.
new StructType()
btw, a clarification question. This function is only needed for data source tables, right?
No, since we also store the schema for Hive tables, Hive tables will also call this function. But a Hive table will never go into this branch, as it always has a schema (the removal of runtime schema inference happened before we started storing the schema of Hive tables).
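The invariant described here can be stated as a tiny predicate. The type and function names below are hypothetical, invented just to pin down the claim: only a data source table can lack a stored schema, so only a data source table can reach the empty-schema branch.

```scala
// Hypothetical model of the invariant: which tables can hit the
// "no schema in table properties" branch?
sealed trait TableKindSketch
case object HiveTableKind extends TableKindSketch
case object DataSourceTableKind extends TableKindSketch

// Hive tables always persist a schema, so an empty property map can only
// come from an old data source table.
def mayHitEmptySchemaBranch(kind: TableKindSketch, hasStoredSchema: Boolean): Boolean =
  kind == DataSourceTableKind && !hasStoredSchema
```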
  properties = Map(
    HiveExternalCatalog.DATASOURCE_PROVIDER -> "parquet"))
hiveClient.createTable(tableDesc, ignoreIfExists = false)
checkAnswer(spark.table("old"), Row(1, "a"))
Can we also test `describe table` and make sure it provides the correct column info?
Test build #68745 has finished for PR 15900 at commit
Merging in master/branch-2.1.
…tastore

## What changes were proposed in this pull request?

Before Spark 2.1, users can create an external data source table without schema, and we will infer the table schema at runtime. In Spark 2.1, we decided to infer the schema when the table was created, so that we don't need to infer it again and again at runtime.

This is a good improvement, but we should still respect and support old tables which doesn't store table schema in metastore.

## How was this patch tested?

regression test.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #15900 from cloud-fan/hive-catalog.

(cherry picked from commit 07b3f04)
Signed-off-by: Reynold Xin <rxin@databricks.com>
…hema in table properties

## What changes were proposed in this pull request?

This is a follow-up of apache#15900, to fix one more bug: when the table schema is empty and needs to be inferred at runtime, we should not resolve parent plans before the schema has been inferred, or the parent plans will be resolved against an empty schema and may get wrong results for something like `select *`.

The fix logic is: introduce `UnresolvedCatalogRelation` as a placeholder. Then we replace it with `LogicalRelation` or `HiveTableRelation` during analysis, so that it's guaranteed that we won't resolve parent plans until the schema has been inferred.

## How was this patch tested?

regression test

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#18907 from cloud-fan/bug.
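The follow-up's placeholder mechanism can be modeled in miniature: an unresolved placeholder node is only swapped for a concrete relation once the schema is known, so parent plans can never resolve against an empty schema. All names below are simplified stand-ins, not the real Catalyst classes or analyzer rule:

```scala
// Miniature model of the follow-up fix: parents must not resolve
// until the relation's schema is known.
sealed trait PlanSketch { def resolved: Boolean }

// Placeholder inserted when the table is looked up; deliberately unresolved.
final case class UnresolvedCatalogRelationSketch(table: String) extends PlanSketch {
  val resolved = false
}

// Relation with a concrete (possibly inferred) schema.
final case class RelationSketch(table: String, schema: Seq[String]) extends PlanSketch {
  val resolved = true
}

// Analysis-time rule: replace the placeholder once the schema is available,
// inferring it only when the metastore stored none.
def resolveRelation(plan: PlanSketch,
                    stored: Option[Seq[String]],
                    infer: String => Seq[String]): PlanSketch = plan match {
  case UnresolvedCatalogRelationSketch(t) => RelationSketch(t, stored.getOrElse(infer(t)))
  case other => other
}
```

Because the placeholder reports `resolved = false`, any rule that requires resolved children is forced to wait until the swap has happened, which is the guarantee the commit message describes.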
What changes were proposed in this pull request?
Before Spark 2.1, users could create an external data source table without a schema, and we would infer the table schema at runtime. In Spark 2.1, we decided to infer the schema when the table is created, so that we don't need to infer it again and again at runtime.
This is a good improvement, but we should still respect and support old tables that don't store the table schema in the metastore.
How was this patch tested?
Regression test.