[SPARK-18464][SQL] support old table which doesn't store schema in metastore #15900
Conversation
Test build #68697 has finished for PR 15900 at commit
Force-pushed from a75cf30 to 4094a72.
Test build #68717 has finished for PR 15900 at commit
@@ -1023,6 +1023,11 @@ object HiveExternalCatalog {
      // After SPARK-6024, we removed this flag.
      // Although we are not using `spark.sql.sources.schema` any more, we need to still support.
      DataType.fromJson(schema.get).asInstanceOf[StructType]
    } else if (props.filterKeys(_.startsWith(DATASOURCE_SCHEMA_PREFIX)).isEmpty) {
      // If there is no schema information in table properties, it means the schema of this table
      // was empty when saving into metastore, which is possible in older version of Spark. We
nit: Please mention version number
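The fallback under discussion can be illustrated with a small, dependency-free sketch. The object name, helper signature, and string-based return type are simplified stand-ins for the real `HiveExternalCatalog` code (which reassembles a `StructType`), but the branch order mirrors the hunk above:

```scala
// Hypothetical sketch of the schema-restore fallback, not the real
// HiveExternalCatalog implementation.
object SchemaRestoreSketch {
  // Stand-in for the real property key prefix constant.
  val DATASOURCE_SCHEMA_PREFIX = "spark.sql.sources.schema"

  // Returns Some(schemaJson) when a schema is stored in the table
  // properties, None when the table predates schema storage and the
  // schema must be inferred at runtime.
  def restoreSchema(props: Map[String, String]): Option[String] = {
    props.get(DATASOURCE_SCHEMA_PREFIX) match {
      case Some(json) =>
        // Very old tables stored the whole schema JSON under a single key.
        Some(json)
      case None if props.keys.forall(!_.startsWith(DATASOURCE_SCHEMA_PREFIX)) =>
        // No schema information at all: the schema was empty when the table
        // was saved, which is possible in older versions of Spark.
        None
      case None =>
        // Newer layout: the schema JSON is split into numbered parts to fit
        // metastore property-size limits; reassemble them in order.
        val numParts = props(s"$DATASOURCE_SCHEMA_PREFIX.numParts").toInt
        Some((0 until numParts)
          .map(i => props(s"$DATASOURCE_SCHEMA_PREFIX.part.$i"))
          .mkString)
    }
  }
}
```

The key point is that the "no schema at all" case is distinct from the "schema stored in parts" case, which is exactly what the new `filterKeys(...).isEmpty` guard separates.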
@@ -64,7 +64,9 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
    val dataSource =
      DataSource(
        sparkSession,
        userSpecifiedSchema = Some(table.schema),
        // In older version of Spark, the table schema can be empty and should be inferred at
nit: Please mention version number
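The change around `userSpecifiedSchema` boils down to one decision: if the catalog handed back an empty schema (an old table), pass `None` so `DataSource` infers the schema at runtime. A minimal sketch of that decision, using a placeholder type rather than Spark's real `StructType`:

```scala
// Stand-in for StructType: just a list of field names.
final case class SchemaSketch(fieldNames: Seq[String]) {
  def isEmpty: Boolean = fieldNames.isEmpty
}

// Sketch of the HiveMetastoreCatalog change: an empty stored schema means
// "infer at runtime", so DataSource gets None instead of Some(empty).
def toUserSpecifiedSchema(stored: SchemaSketch): Option[SchemaSketch] =
  if (stored.isEmpty) None else Some(stored)
```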
    checkAnswer(spark.table("old"), Row(1, "a"))
  }
}
It would be good to actually create a set of compatibility tests to make sure a new version of Spark can access table metadata created by an older version (starting from Spark 1.3) without problems. Let's create a follow-up JIRA for this task and do it during the QA period of Spark 2.1.
// If there is no schema information in table properties, it means the schema of this table
// was empty when saving into metastore, which is possible in older version of Spark. We
// should respect it.
new StructType()
btw, a clarification question. This function is only needed for data source tables, right?
No, since we also store the schema for Hive tables, Hive tables will also call this function. But a Hive table will never go into this branch, as it always has a schema (the removal of runtime schema inference happened before we started storing the schema of Hive tables).
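The invariant described here can be stated as a tiny predicate. The type and function names below are hypothetical, invented just to pin down the claim: only a data source table can lack a stored schema, so only a data source table can reach the empty-schema branch.

```scala
// Hypothetical model of the invariant: which tables can hit the
// "no schema in table properties" branch?
sealed trait TableKindSketch
case object HiveTableKind extends TableKindSketch
case object DataSourceTableKind extends TableKindSketch

// Hive tables always persist a schema, so an empty property map can only
// come from an old data source table.
def mayHitEmptySchemaBranch(kind: TableKindSketch, hasStoredSchema: Boolean): Boolean =
  kind == DataSourceTableKind && !hasStoredSchema
```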
  properties = Map(
    HiveExternalCatalog.DATASOURCE_PROVIDER -> "parquet"))
hiveClient.createTable(tableDesc, ignoreIfExists = false)
checkAnswer(spark.table("old"), Row(1, "a"))
Can we also test `describe table` and make sure it provides the correct column info?
Test build #68745 has finished for PR 15900 at commit
Merging in master/branch-2.1.
…tastore

## What changes were proposed in this pull request?

Before Spark 2.1, users can create an external data source table without schema, and we will infer the table schema at runtime. In Spark 2.1, we decided to infer the schema when the table was created, so that we don't need to infer it again and again at runtime.

This is a good improvement, but we should still respect and support old tables which doesn't store table schema in metastore.

## How was this patch tested?

regression test.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #15900 from cloud-fan/hive-catalog.

(cherry picked from commit 07b3f04)
Signed-off-by: Reynold Xin <rxin@databricks.com>
…hema in table properties

## What changes were proposed in this pull request?

This is a follow-up of apache#15900, to fix one more bug: when the table schema is empty and needs to be inferred at runtime, we should not resolve parent plans before the schema has been inferred, or the parent plans will be resolved against an empty schema and may get wrong results for something like `select *`.

The fix logic is: introduce `UnresolvedCatalogRelation` as a placeholder. Then we replace it with `LogicalRelation` or `HiveTableRelation` during analysis, so that it's guaranteed that we won't resolve parent plans until the schema has been inferred.

## How was this patch tested?

regression test

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#18907 from cloud-fan/bug.
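The follow-up's placeholder mechanism can be modeled in miniature: an unresolved placeholder node is only swapped for a concrete relation once the schema is known, so parent plans can never resolve against an empty schema. All names below are simplified stand-ins, not the real Catalyst classes or analyzer rule:

```scala
// Miniature model of the follow-up fix: parents must not resolve
// until the relation's schema is known.
sealed trait PlanSketch { def resolved: Boolean }

// Placeholder inserted when the table is looked up; deliberately unresolved.
final case class UnresolvedCatalogRelationSketch(table: String) extends PlanSketch {
  val resolved = false
}

// Relation with a concrete (possibly inferred) schema.
final case class RelationSketch(table: String, schema: Seq[String]) extends PlanSketch {
  val resolved = true
}

// Analysis-time rule: replace the placeholder once the schema is available,
// inferring it only when the metastore stored none.
def resolveRelation(plan: PlanSketch,
                    stored: Option[Seq[String]],
                    infer: String => Seq[String]): PlanSketch = plan match {
  case UnresolvedCatalogRelationSketch(t) => RelationSketch(t, stored.getOrElse(infer(t)))
  case other => other
}
```

Because the placeholder reports `resolved = false`, any rule that requires resolved children is forced to wait until the swap has happened, which is the guarantee the commit message describes.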
What changes were proposed in this pull request?
Before Spark 2.1, users could create an external data source table without a schema, and we would infer the table schema at runtime. In Spark 2.1, we decided to infer the schema when the table is created, so that we don't need to infer it again and again at runtime.
This is a good improvement, but we should still respect and support old tables that don't store the table schema in the metastore.
How was this patch tested?
Regression test.