[SPARK-16482] [SQL] Describe Table Command for Tables Requiring Runtime Inferred Schema #14148
Conversation
Shouldn't schema inference run as soon as the table is created?
@rxin The created table could be empty, so we cannot cover all the cases even if we run schema inference when creating tables. This is just my understanding; I have no clue about the original intention. :)

Did a quick check — my understanding was wrong. We do run schema inference when creating the table. Let me fix it. Thanks!
Test build #62141 has finished for PR 14148 at commit …

Test build #62144 has finished for PR 14148 at commit …

Test build #62143 has finished for PR 14148 at commit …
@rxin The failed test case is interesting! Please let me know if this is by design. Thanks!

Update: I can understand the original intention, maybe. If the schema is not specified by users, we do not persistently store the runtime-inferred schema of data source tables in the external catalog; we use the metadata cache to store the schema. Thus, the following currently runs without error:

```scala
Seq("parquet", "json", "orc").foreach { fileFormat =>
  withTable("t1") {
    withTempPath { dir =>
      val path = dir.getCanonicalPath
      spark.range(1).write.format(fileFormat).save(path)
      sql(s"CREATE TABLE t1(a int, b int) USING $fileFormat OPTIONS (PATH '$path')")
      sql("select * from t1").show(false)
    }
  }
}
```

If users specify an unmatched schema, we do not perform the check. I think we should at least report an error as early as possible. For the formats … For the format … Since this is another issue, I will submit a separate PR for it.
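A minimal sketch of the early check suggested above — comparing a user-specified schema against the inferred one at CREATE TABLE time rather than failing later at query time. The helper name and error wording are hypothetical, not this PR's code:

```scala
import org.apache.spark.sql.types.StructType

// Hypothetical helper: fail fast at table-creation time if the schema the
// user declared does not line up with the schema inferred from the files.
def checkSchemaCompatibility(userSpecified: StructType, inferred: StructType): Unit = {
  require(
    userSpecified.fieldNames.sameElements(inferred.fieldNames),
    s"Specified schema $userSpecified does not match inferred schema $inferred")
}
```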
Also cc @yhuai @cloud-fan @liancheng @marmbrus
```diff
@@ -431,7 +431,7 @@ case class DescribeTableCommand(table: TableIdentifier, isExtended: Boolean, isF
     val schema = DDLUtils.getSchemaFromTableProperties(table)

     if (schema.isEmpty) {
-      append(buffer, "# Schema of this table is inferred at runtime", "", "")
+      append(buffer, "# Schema of this table in catalog is corrupted", "", "")
```
Should we just use `catalog.lookupRelation(table).schema` to get the schema?
Do you like the last patch? d92ebcd
retest this please
LGTM, pending jenkins
Thanks. Just FYI for future changes: when a table is added to the catalog (regardless of whether it is temporary, non-temporary, external, or internal), we should save its schema. We should not rely on schema inference every time the user runs a query, and the schema should not change depending on time or the underlying data. For tables in the catalog, the schema should be specified by the user. It is OK, as a convenience, for users to rely on schema inference during table creation, but it is not OK to rely on schema inference on every query.
@cloud-fan, @gatorsmile, and @yhuai - how difficult would it be to change Spark so that it runs schema inference during table creation, and saves the table schema when we create the table? |
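For illustration, a minimal spark-shell sketch of the behavior being proposed (the table name and path are hypothetical; `spark` and `sql` are the shell's built-ins):

```scala
val path = "/tmp/describe-demo"  // hypothetical scratch location
spark.range(1).write.mode("overwrite").parquet(path)

// No schema is given, so it is inferred once here, at creation time,
// and persisted in the catalog rather than re-inferred on every query.
sql(s"CREATE TABLE demo USING parquet OPTIONS (PATH '$path')")

// The catalog should then report the columns even after a Spark restart.
spark.catalog.listColumns("demo").show(truncate = false)
```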
```scala
      }
    } else {
      describeSchema(metadata.schema, result)
    }
```
How about we try to put these into `describeSchema`? Or maybe we can add a `describeSchema(tableName, result)`? It seems weird that `describeExtended` and `describeFormatted` do not contain the code for describing the schema.
Sure, let me do it now. BTW, previously `describeExtended` and `describeFormatted` also contained the schema; both called the original function `describe`.
@yhuai I just gave it a try. We have to pass `CatalogTable` to avoid another call to `getTableMetadata`, and we also need to pass `SessionCatalog` to call `lookupRelation`. Do you like this function, or should we keep the existing one? Thanks!
```scala
private def describeSchema(
    tableDesc: CatalogTable,
    catalog: SessionCatalog,
    buffer: ArrayBuffer[Row]): Unit = {
  if (DDLUtils.isDatasourceTable(tableDesc)) {
    // For data source tables, prefer the user-specified schema stored in the
    // table properties; otherwise resolve the relation to get the inferred one.
    DDLUtils.getSchemaFromTableProperties(tableDesc) match {
      case Some(userSpecifiedSchema) => describeSchema(userSpecifiedSchema, buffer)
      case None => describeSchema(catalog.lookupRelation(tableDesc.identifier).schema, buffer)
    }
  } else {
    describeSchema(tableDesc.schema, buffer)
  }
}
```
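For comparison, a sketch of the `describeSchema(tableName, result)` shape suggested above — hypothetical, and it still needs the catalog, so the signature gains a parameter:

```scala
// Hypothetical variant keyed on the table name alone: it always resolves the
// relation, trading the CatalogTable parameter for a second catalog lookup.
private def describeSchema(
    tableName: TableIdentifier,
    catalog: SessionCatalog,
    buffer: ArrayBuffer[Row]): Unit = {
  describeSchema(catalog.lookupRelation(tableName).schema, buffer)
}
```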
@rxin Currently, we do not run schema inference every time; when the metadata cache contains the plan, we reuse it. We do runtime schema inference only in case of a cache miss (e.g., …). I think it is not hard to store the schema of data source tables in the external catalog (Hive metastore). However, to implement your idea, I can submit a PR for the 2.1 release tomorrow. We can discuss it in a separate PR.
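One concrete way to force such a cache miss in a spark-shell session (assuming the `t1` table from the earlier snippet):

```scala
// Invalidate the cached plan/metadata for t1; the next query re-resolves
// the relation and, pre-fix, would re-run schema inference over the files.
spark.catalog.refreshTable("t1")
sql("SELECT * FROM t1").show()
```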
I was not talking about caching here. Caching is transient. I want the behavior to be the same regardless of how many times I restart Spark... And this has nothing to do with refresh. For tables in the catalog, NEVER change the schema implicitly; only do it when it is specified by the user.
Uh... I see what you mean. Agree.

Tomorrow I will dig deeper and check whether schema evolution could be an issue if the schema is fixed when creating tables.
It's easy to infer the schema once when we create the table and store it in the external catalog. However, it's a breaking change: users can't change the underlying data file schema after the table is created. It's a bad design that we need to fix, but we also need to go through the code path to make sure we don't break other things.
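To make the trade-off concrete, a hypothetical spark-shell sequence (path and table name are illustrative):

```scala
val path = "/tmp/evolve-demo"  // hypothetical scratch location
spark.range(1).toDF("a").write.mode("overwrite").parquet(path)
sql(s"CREATE TABLE evolve USING parquet OPTIONS (PATH '$path')")

// Later, the files on disk gain a new column b...
spark.range(1).selectExpr("id AS a", "id AS b")
  .write.mode("overwrite").parquet(path)

// ...but with the schema fixed at creation, the catalog still reports only
// column a, so b stays invisible until the table is altered or recreated.
sql("SELECT * FROM evolve").printSchema()
```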
Test build #62217 has finished for PR 14148 at commit …
@rxin @cloud-fan @yhuai Will do more investigation and submit a separate PR for solution review. Thanks!
Many interesting observations after further investigation. Will post the findings tonight. Thanks!
LGTM. Merging to master and branch-2.0.
What changes were proposed in this pull request?
If we create a table pointing to a parquet/json dataset without specifying the schema, the describe table command does not show the schema at all; it only shows `# Schema of this table is inferred at runtime`. In 1.6, describe table does show the schema of such a table.

~~For data source tables, to infer the schema, we need to load the data source tables at runtime. Thus, this PR calls the function `lookupRelation`.~~ For data source tables, we infer the schema before table creation. Thus, this PR sets the inferred schema as the table schema at table creation.
How was this patch tested?
Added test cases
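As a quick usage check of the fix (hypothetical spark-shell session; table name and path are illustrative):

```scala
val path = "/tmp/json-demo"  // hypothetical scratch location
spark.range(1).write.mode("overwrite").json(path)
sql(s"CREATE TABLE jt USING json OPTIONS (PATH '$path')")

// Before this PR: only "# Schema of this table is inferred at runtime".
// After this PR: the actual inferred columns, as in 1.6.
sql("DESCRIBE jt").show(truncate = false)
```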