
[SPARK-16482] [SQL] Describe Table Command for Tables Requiring Runtime Inferred Schema #14148

Closed
wants to merge 3 commits

Conversation

@gatorsmile (Member) commented Jul 12, 2016

What changes were proposed in this pull request?

If we create a table pointing to a parquet/json dataset without specifying the schema, the DESCRIBE TABLE command does not show the schema at all. It only shows `# Schema of this table is inferred at runtime`. In 1.6, DESCRIBE TABLE does show the schema of such a table.

~~For data source tables, to infer the schema, we need to load the data source tables at runtime. Thus, this PR calls the function `lookupRelation`.~~

For data source tables, we infer the schema before table creation. Thus, this PR sets the inferred schema as the table schema at table creation.
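
To make the reported behavior concrete, here is a minimal sketch (the path and table name are illustrative, and `spark` is assumed to be an active SparkSession):

    // Write a small Parquet dataset, then create a table over it without specifying a schema.
    spark.range(3).write.parquet("/tmp/describe_demo")
    spark.sql("CREATE TABLE demo USING parquet OPTIONS (PATH '/tmp/describe_demo')")

    // Before this patch, the output contained only the placeholder
    // "# Schema of this table is inferred at runtime" instead of the columns;
    // with this patch, it lists the inferred columns (here: id bigint).
    spark.sql("DESCRIBE demo").show(false)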

How was this patch tested?

Added test cases

@rxin (Contributor) commented Jul 12, 2016

Shouldn't schema inference run as soon as the table is created?

@gatorsmile (Member Author)

@rxin The created table could be empty. Thus, we are unable to cover all the cases even if we try schema inference when creating tables. You know, this is just my understanding. No clue about the original intention. : )

@gatorsmile (Member Author)

Did a quick check. My understanding is wrong. We did the schema inference when creating the table. Let me fix it. Thanks!

@SparkQA commented Jul 12, 2016

Test build #62141 has finished for PR 14148 at commit d92ebcd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 12, 2016

Test build #62144 has finished for PR 14148 at commit a05383c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 12, 2016

Test build #62143 has finished for PR 14148 at commit 473b27d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member Author) commented Jul 12, 2016

@rxin The failed test case is interesting! The REFRESH TABLE command does not refresh the metadata stored in the external catalog. Is this a bug when the tables are data source tables?

Please let me know if this is by design. Thanks!

Update: Maybe I understand the original intention now. If the schema is not specified by users, we do not persistently store the runtime-inferred schema of data source tables in the external catalog; we use the metadata cache to store the schema. Thus, currently, the REFRESH TABLE command just needs to refresh the metadata cache and data cache when users change the external files. However, this design is not documented, and it has a bug when users manually specify the schema.
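
A small sketch of the scenario described above (the path and table name are illustrative):

    // The schema of this data source table is inferred at runtime and kept in the
    // session's metadata cache; it is not persisted in the external catalog.
    spark.sql("CREATE TABLE src USING json OPTIONS (PATH '/tmp/src_json')")

    // After the underlying files change, REFRESH TABLE invalidates the cached
    // metadata and data, so the schema is re-inferred on the next access;
    // the entry in the external catalog itself is not rewritten.
    spark.sql("REFRESH TABLE src")
    spark.table("src").printSchema()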

@gatorsmile (Member Author) commented Jul 12, 2016

    Seq("parquet", "json", "orc").foreach { fileFormat =>
      withTable("t1") {
        withTempPath { dir =>
          val path = dir.getCanonicalPath
          spark.range(1).write.format(fileFormat).save(path)
          sql(s"CREATE TABLE t1(a int, b int) USING $fileFormat OPTIONS (PATH '$path') ")
          sql("select * from t1").show(false)
        }
      }
    }

If users specify a schema that does not match the underlying data, we do not check it. I think we should at least report an error as early as possible. For the formats parquet and json, the outputs are

+----+----+
|a   |b   |
+----+----+
|null|null|
+----+----+

For the format orc, we get a stage failure.

Job aborted due to stage failure: Task 0 in stage 5.0 failed 1 times, most recent failure: Lost task 0.0 in stage 5.0 (TID 5, localhost): java.lang.IllegalArgumentException: Field "a" does not exist.

Because this is another issue, I will submit a separate PR for it.

@gatorsmile (Member Author) commented Jul 12, 2016

Also cc @yhuai @cloud-fan @liancheng @marmbrus

@@ -431,7 +431,7 @@ case class DescribeTableCommand(table: TableIdentifier, isExtended: Boolean, isF
val schema = DDLUtils.getSchemaFromTableProperties(table)

if (schema.isEmpty) {
-      append(buffer, "# Schema of this table is inferred at runtime", "", "")
+      append(buffer, "# Schema of this table in catalog is corrupted", "", "")
Contributor:

Should we just use catalog.lookupRelation(table).schema to get the schema?

Member Author:

Do you like the last patch? d92ebcd

@gatorsmile (Member Author)

retest this please

@cloud-fan (Contributor)

LGTM, pending jenkins

@rxin (Contributor) commented Jul 13, 2016

Thanks. Just FYI for when you make future changes: when a table is added to the catalog (regardless of whether it is temporary, non-temporary, external, or internal), we should save its schema. We should not rely on schema inference every time the user runs a query, and the schema should not change depending on time or the underlying data. For tables in the catalog, the schema should be specified by the user. It is OK, as a convenience measure, for users to rely on schema inference during table creation, but it is not OK to rely on schema inference every time.

@rxin (Contributor) commented Jul 13, 2016

@cloud-fan, @gatorsmile, and @yhuai - how difficult would it be to change Spark so that it runs schema inference during table creation, and saves the table schema when we create the table?

}
} else {
describeSchema(metadata.schema, result)
}
Contributor:

How about we try to put these into describeSchema? Or, maybe we can add a describeSchema(tableName, result)? It seems weird that describeExtended and describeFormatted do not contain the code for describing the schema.

Member Author:

Sure. Let me do it now

BTW, previously, describeExtended and describeFormatted also contained the schema; both called the original function describe.

Member Author:

@yhuai I just gave it a try. We have to pass CatalogTable to avoid another call to getTableMetadata. We also need to pass SessionCatalog for calling lookupRelation. Do you like this function, or should we keep the existing one? Thanks!

  private def describeSchema(
      tableDesc: CatalogTable,
      catalog: SessionCatalog,
      buffer: ArrayBuffer[Row]): Unit = {
    if (DDLUtils.isDatasourceTable(tableDesc)) {
      // For data source tables, prefer the user-specified schema stored in the
      // table properties; fall back to runtime inference via lookupRelation.
      DDLUtils.getSchemaFromTableProperties(tableDesc) match {
        case Some(userSpecifiedSchema) => describeSchema(userSpecifiedSchema, buffer)
        case None => describeSchema(catalog.lookupRelation(tableDesc.identifier).schema, buffer)
      }
    } else {
      // Non data source (e.g., Hive) tables already carry their schema in the catalog.
      describeSchema(tableDesc.schema, buffer)
    }
  }

@gatorsmile (Member Author) commented Jul 13, 2016

@rxin Currently, we do not run schema inference every time when the metadata cache contains the plan. We do runtime schema inference only on a cache miss (e.g., REFRESH TABLE is triggered, the cache entry is replaced, or the table is read for the first time). Based on my understanding, that is the major reason why we introduced the metadata cache in the first place.

I think it is not hard to store the schema of data source tables in the external catalog (Hive metastore). However, REFRESH TABLE only refreshes the metadata cache and the data cache; it does not update the schema stored in the external catalog. If we do not store the schema in the external catalog, this works well. Otherwise, we have to refresh the schema info in the external catalog as well.
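
A rough, self-contained sketch of the cache-miss behavior described here; the class and the inferSchema function are illustrative, not Spark's actual internals:

    import scala.collection.mutable
    import org.apache.spark.sql.types.StructType

    // Hypothetical per-session metadata cache keyed by table name. Schema
    // inference only runs on a cache miss; REFRESH TABLE corresponds to
    // calling refresh, which drops the cached entry but never rewrites the
    // entry in the external catalog.
    class MetadataCache(inferSchema: String => StructType) {
      private val cache = mutable.Map.empty[String, StructType]

      def schemaOf(table: String): StructType =
        cache.getOrElseUpdate(table, inferSchema(table))

      def refresh(table: String): Unit = cache.remove(table)
    }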

To implement your idea, I can submit a PR for the release 2.1 tomorrow. We can discuss it in a separate PR.

@rxin (Contributor) commented Jul 13, 2016

I was not talking about caching here. Caching is transient. I want the behavior to be the same regardless of how many times I'm restarting Spark ...

And this has nothing to do with refresh. For tables in the catalog, NEVER change the schema implicitly; only change it when the user specifies it.

@gatorsmile (Member Author)

uh... I see what you mean. Agree.

@gatorsmile (Member Author)

Tomorrow, I will try to dig deeper into it and check whether schema evolution could be an issue if the schema is fixed at table creation.

@cloud-fan (Contributor)

It's easy to infer the schema once when we create the table and store it in the external catalog. However, it's a breaking change, which means users can't change the underlying data file schema after the table is created. It's a bad design that we need to fix, but we also need to go through the code path to make sure we don't break other things.

@SparkQA commented Jul 13, 2016

Test build #62217 has finished for PR 14148 at commit d92ebcd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member Author)

@rxin @cloud-fan @yhuai Will do more investigation and submit a separate PR for solution review. Thanks!

@gatorsmile (Member Author)

Many interesting observations after further investigation. Will post the findings tonight. Thanks!

@yhuai (Contributor) commented Jul 13, 2016

LGTM. Merging to master and branch 2.0

asfgit pushed a commit that referenced this pull request Jul 13, 2016
[SPARK-16482] [SQL] Describe Table Command for Tables Requiring Runtime Inferred Schema

#### What changes were proposed in this pull request?
If we create a table pointing to a parquet/json dataset without specifying the schema, the DESCRIBE TABLE command does not show the schema at all. It only shows `# Schema of this table is inferred at runtime`. In 1.6, DESCRIBE TABLE does show the schema of such a table.

~~For data source tables, to infer the schema, we need to load the data source tables at runtime. Thus, this PR calls the function `lookupRelation`.~~

For data source tables, we infer the schema before table creation. Thus, this PR sets the inferred schema as the table schema at table creation.

#### How was this patch tested?
Added test cases

Author: gatorsmile <gatorsmile@gmail.com>

Closes #14148 from gatorsmile/describeSchema.

(cherry picked from commit c5ec879)
Signed-off-by: Yin Huai <yhuai@databricks.com>