
[SPARK-16482] [SQL] Describe Table Command for Tables Requiring Runtime Inferred Schema #14148

Closed
wants to merge 3 commits

Conversation

@gatorsmile (Member) commented Jul 12, 2016

What changes were proposed in this pull request?

If we create a table pointing to a parquet/json dataset without specifying the schema, the DESCRIBE TABLE command does not show the schema at all. It only shows `# Schema of this table is inferred at runtime`. In 1.6, DESCRIBE TABLE does show the schema of such a table.

~~For data source tables, to infer the schema, we need to load the data source tables at runtime. Thus, this PR calls the function `lookupRelation`.~~

For data source tables, we infer the schema before table creation. Thus, this PR sets the inferred schema as the table schema at table creation.
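
To make the reported behavior concrete, here is a minimal sketch (the path and table name are illustrative, and `spark` is assumed to be an active SparkSession):

    // Write a small Parquet dataset, then create a table over it without specifying a schema.
    spark.range(3).write.parquet("/tmp/describe_demo")
    spark.sql("CREATE TABLE demo USING parquet OPTIONS (PATH '/tmp/describe_demo')")

    // Before this patch, the output contained only the placeholder
    // "# Schema of this table is inferred at runtime" instead of the columns;
    // with this patch, it lists the inferred columns (here: id bigint).
    spark.sql("DESCRIBE demo").show(false)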

How was this patch tested?

Added test cases

@rxin (Contributor) commented Jul 12, 2016

Shouldn't schema inference run as soon as the table is created?

@gatorsmile (Member Author)

@rxin The created table could be empty. Thus, we are unable to cover all the cases even if we try schema inference when creating tables. You know, this is just my understanding. No clue about the original intention. : )

@gatorsmile (Member Author)

Did a quick check. My understanding is wrong. We did the schema inference when creating the table. Let me fix it. Thanks!

@SparkQA commented Jul 12, 2016

Test build #62141 has finished for PR 14148 at commit d92ebcd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 12, 2016

Test build #62144 has finished for PR 14148 at commit a05383c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 12, 2016

Test build #62143 has finished for PR 14148 at commit 473b27d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member Author) commented Jul 12, 2016

@rxin The failed test case is interesting! The REFRESH TABLE command does not refresh the metadata stored in the external catalog. Is this a bug when the tables are data source tables?

Please let me know if this is by design. Thanks!

Update: Maybe I understand the original intention now. If the schema is not specified by users, we do not persistently store the runtime-inferred schema of data source tables in the external catalog; we use the metadata cache to store the schema. Thus, currently, the REFRESH TABLE command just needs to refresh the metadata cache and data cache when users change the external files. However, this design is not documented, and it has a bug when users manually specify the schema.
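
A small sketch of the scenario described above (the path and table name are illustrative):

    // The schema of this data source table is inferred at runtime and kept in the
    // session's metadata cache; it is not persisted in the external catalog.
    spark.sql("CREATE TABLE src USING json OPTIONS (PATH '/tmp/src_json')")

    // After the underlying files change, REFRESH TABLE invalidates the cached
    // metadata and data, so the schema is re-inferred on the next access;
    // the entry in the external catalog itself is not rewritten.
    spark.sql("REFRESH TABLE src")
    spark.table("src").printSchema()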

@gatorsmile (Member Author) commented Jul 12, 2016

    Seq("parquet", "json", "orc").foreach { fileFormat =>
      withTable("t1") {
        withTempPath { dir =>
          val path = dir.getCanonicalPath
          spark.range(1).write.format(fileFormat).save(path)
          sql(s"CREATE TABLE t1(a int, b int) USING $fileFormat OPTIONS (PATH '$path') ")
          sql("select * from t1").show(false)
        }
      }
    }

If users specify a schema that does not match the underlying data, we do not check it. I think we should at least report an error as early as possible. For the formats parquet and json, the outputs are

+----+----+
|a   |b   |
+----+----+
|null|null|
+----+----+

For the format orc, we get a stage failure.

Job aborted due to stage failure: Task 0 in stage 5.0 failed 1 times, most recent failure: Lost task 0.0 in stage 5.0 (TID 5, localhost): java.lang.IllegalArgumentException: Field "a" does not exist.

Because this is another issue, I will submit a separate PR for it.

@gatorsmile (Member Author) commented Jul 12, 2016

Also cc @yhuai @cloud-fan @liancheng @marmbrus

@@ -431,7 +431,7 @@ case class DescribeTableCommand(table: TableIdentifier, isExtended: Boolean, isF
val schema = DDLUtils.getSchemaFromTableProperties(table)

if (schema.isEmpty) {
-      append(buffer, "# Schema of this table is inferred at runtime", "", "")
+      append(buffer, "# Schema of this table in catalog is corrupted", "", "")
Contributor:

Should we just use catalog.lookupRelation(table).schema to get the schema?

Member Author:

Do you like the last patch? d92ebcd

@gatorsmile (Member Author)

retest this please

@cloud-fan (Contributor)

LGTM, pending jenkins

@rxin (Contributor) commented Jul 13, 2016

Thanks. Just FYI for when you make future changes: when a table is added to the catalog (regardless of whether it is temporary, non-temporary, external, or internal), we should save its schema. We should not rely on schema inference every time the user runs a query, and the schema should not change depending on time or the underlying data. For tables in the catalog, the schema should be specified by the user. It is OK, as a convenience measure, for users to rely on schema inference during table creation, but it is not OK to rely on schema inference every time.

@rxin (Contributor) commented Jul 13, 2016

@cloud-fan, @gatorsmile, and @yhuai - how difficult would it be to change Spark so that it runs schema inference during table creation, and saves the table schema when we create the table?

}
} else {
describeSchema(metadata.schema, result)
}
Contributor:

How about we try to put these into describeSchema? Or, maybe we can add a describeSchema(tableName, result)? It seems weird that describeExtended and describeFormatted do not contain the code for describing the schema.

Member Author:

Sure. Let me do it now

BTW, previously, describeExtended and describeFormatted also contained the schema; both called the original function describe.

Member Author:

@yhuai I just gave it a try. We have to pass CatalogTable to avoid another call to getTableMetadata. We also need to pass SessionCatalog for calling lookupRelation. Do you like this function, or should we keep the existing one? Thanks!

  private def describeSchema(
      tableDesc: CatalogTable,
      catalog: SessionCatalog,
      buffer: ArrayBuffer[Row]): Unit = {
    if (DDLUtils.isDatasourceTable(tableDesc)) {
      // For data source tables, prefer the user-specified schema stored in the
      // table properties; fall back to runtime inference via lookupRelation.
      DDLUtils.getSchemaFromTableProperties(tableDesc) match {
        case Some(userSpecifiedSchema) => describeSchema(userSpecifiedSchema, buffer)
        case None => describeSchema(catalog.lookupRelation(tableDesc.identifier).schema, buffer)
      }
    } else {
      // Non data source (e.g., Hive) tables already carry their schema in the catalog.
      describeSchema(tableDesc.schema, buffer)
    }
  }

@gatorsmile (Member Author) commented Jul 13, 2016

@rxin Currently, we do not run schema inference every time when the metadata cache contains the plan. We do runtime schema inference only on a cache miss (e.g., REFRESH TABLE is triggered, the cache entry is replaced, or the table is read for the first time). Based on my understanding, that is the major reason why we introduced the metadata cache in the first place.

I think it is not hard to store the schema of data source tables in the external catalog (Hive metastore). However, REFRESH TABLE only refreshes the metadata cache and the data cache; it does not update the schema stored in the external catalog. If we do not store the schema in the external catalog, this works well. Otherwise, we have to refresh the schema info in the external catalog as well.
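
A rough, self-contained sketch of the cache-miss behavior described here; the class and the inferSchema function are illustrative, not Spark's actual internals:

    import scala.collection.mutable
    import org.apache.spark.sql.types.StructType

    // Hypothetical per-session metadata cache keyed by table name. Schema
    // inference only runs on a cache miss; REFRESH TABLE corresponds to
    // calling refresh, which drops the cached entry but never rewrites the
    // entry in the external catalog.
    class MetadataCache(inferSchema: String => StructType) {
      private val cache = mutable.Map.empty[String, StructType]

      def schemaOf(table: String): StructType =
        cache.getOrElseUpdate(table, inferSchema(table))

      def refresh(table: String): Unit = cache.remove(table)
    }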

To implement your idea, I can submit a PR for the release 2.1 tomorrow. We can discuss it in a separate PR.

@rxin (Contributor) commented Jul 13, 2016

I was not talking about caching here. Caching is transient. I want the behavior to be the same regardless of how many times I'm restarting Spark ...

And this has nothing to do with refresh. For tables in the catalog, NEVER change the schema implicitly; only change it when the user specifies it.

@gatorsmile (Member Author)

uh... I see what you mean. Agree.

@gatorsmile (Member Author)

Tomorrow, I will try to dig deeper into it and check whether schema evolution could be an issue if the schema is fixed at table creation.

@cloud-fan (Contributor)

It's easy to infer the schema once when we create the table and store it in the external catalog. However, it's a breaking change, which means users can't change the underlying data file schema after the table is created. It's a bad design that we need to fix, but we also need to go through the code path to make sure we don't break other things.

@SparkQA commented Jul 13, 2016

Test build #62217 has finished for PR 14148 at commit d92ebcd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member Author)

@rxin @cloud-fan @yhuai Will do more investigation and submit a separate PR for solution review. Thanks!

@gatorsmile (Member Author)

Many interesting observations after further investigation. Will post the findings tonight. Thanks!

@yhuai (Contributor) commented Jul 13, 2016

LGTM. Merging to master and branch 2.0

asfgit pushed a commit that referenced this pull request Jul 13, 2016
[SPARK-16482] [SQL] Describe Table Command for Tables Requiring Runtime Inferred Schema

#### What changes were proposed in this pull request?
If we create a table pointing to a parquet/json dataset without specifying the schema, the DESCRIBE TABLE command does not show the schema at all. It only shows `# Schema of this table is inferred at runtime`. In 1.6, DESCRIBE TABLE does show the schema of such a table.

~~For data source tables, to infer the schema, we need to load the data source tables at runtime. Thus, this PR calls the function `lookupRelation`.~~

For data source tables, we infer the schema before table creation. Thus, this PR sets the inferred schema as the table schema at table creation.

#### How was this patch tested?
Added test cases

Author: gatorsmile <gatorsmile@gmail.com>

Closes #14148 from gatorsmile/describeSchema.

(cherry picked from commit c5ec879)
Signed-off-by: Yin Huai <yhuai@databricks.com>