[SPARK-19484][SQL]continue work to create hive table with an empty schema #16828
Conversation
Test build #72488 has finished for PR 16828 at commit
@@ -832,7 +832,8 @@ private[hive] class HiveClientImpl(
     val (partCols, schema) = table.schema.map(toHiveColumn).partition { c =>
       table.partitionColumnNames.contains(c.getName)
     }
-    if (schema.isEmpty) {
+    if (schema.isEmpty && table.properties.getOrElse(
+        HiveExternalCatalog.DATASOURCE_SCHEMA_NUMPARTS, "0").toInt != 0) {
Why do we need this condition? I think we store the schema in table properties for all kinds of tables.
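For context, the check being discussed can be sketched as below. This is a simplified stand-in, not the real HiveClientImpl code; the property key matches the `spark.sql.sources.schema.numParts` value shown in the diffs, and the helper name is made up:

```scala
// Simplified sketch of the condition added in this diff: Spark serializes a
// data source table's schema into numbered parts in the table properties, so a
// missing numParts property (or a value of 0) means no serialized schema exists.
val DATASOURCE_SCHEMA_NUMPARTS = "spark.sql.sources.schema.numParts"

def hasSerializedSchema(properties: Map[String, String]): Boolean =
  properties.getOrElse(DATASOURCE_SCHEMA_NUMPARTS, "0").toInt != 0

println(hasSerializedSchema(Map(DATASOURCE_SCHEMA_NUMPARTS -> "1")))
println(hasSerializedSchema(Map.empty))
```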
@@ -187,15 +187,6 @@ class HiveExternalCatalogBackwardCompatibilitySuite extends QueryTest
       "spark.sql.sources.schema.numParts" -> "1",
       "spark.sql.sources.schema.part.0" -> simpleSchemaJson))

-  lazy val dataSourceTableWithoutSchema = CatalogTable(
This is still a valid case we need to support
Test build #72495 has finished for PR 16828 at commit
Test build #72502 has finished for PR 16828 at commit
@@ -1328,7 +1330,7 @@ class MetastoreDataSourcesSuite extends QueryTest with SQLTestUtils with TestHiv
       storage = CatalogStorageFormat.empty.copy(
         properties = Map("path" -> path.getAbsolutePath)
       ),
-      schema = new StructType(),
+      schema = simpleSchema,
we were testing empty schema here.
    // spark table. the SPARK_TEST_OLD_SOURCE_TABLE_CREATE property is used to resolve this.
    if (schema.isEmpty && (table.properties.getOrElse(
        HiveExternalCatalog.DATASOURCE_SCHEMA_NUMPARTS, "0").toInt != 0) || table.properties
        .getOrElse(HiveExternalCatalog.SPARK_TEST_OLD_SOURCE_TABLE_CREATE, "false").toBoolean) {
When will we hit this branch? Only when we create an empty-schema table in tests?
If we create a data source table whose schema is empty while the table properties contain DATASOURCE_SCHEMA_NUMPARTS, we will also hit this branch.
But the table properties should always contain DATASOURCE_SCHEMA_NUMPARTS, shouldn't they?
If we infer the schema here
https://github.com/apache/spark/pull/16636/files#diff-c4ed9859978dd6ac271b6a40ee945e4bR95
the catalog table does not have DATASOURCE_SCHEMA_NUMPARTS in its properties.
So we can just check DDLUtils.isHiveTable?
Yes, let me modify this, thanks!
Test build #72517 has finished for PR 16828 at commit
Test build #72519 has finished for PR 16828 at commit
@@ -832,7 +833,10 @@ private[hive] class HiveClientImpl(
     val (partCols, schema) = table.schema.map(toHiveColumn).partition { c =>
       table.partitionColumnNames.contains(c.getName)
     }
-    if (schema.isEmpty) {
+
+    // after SPARK-19279, it is not allowed to create a hive table with an empty schema,
Uh... I found a bug in my original code for inferSchema. Let me fix it now. The schema could be non-empty when it is a partitioned table.
Remove the || table.schema.nonEmpty condition?
We just need to check whether the dataSchema is not empty, right?
If we have a partition schema, the || table.schema.nonEmpty condition will be hit and we return the table directly. Maybe the Hive table's data schema can be determined by the Hive serde?
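To illustrate the partitioned-table case raised here, a minimal sketch of the column split that toHiveTable performs (column names are made up for illustration):

```scala
// A table's columns are split into partition columns and data columns, as in
// toHiveTable's partition(...) call. A table with only partition columns would
// end up with a non-empty overall schema but an empty data schema, which is why
// checking table.schema.nonEmpty alone is not enough.
val allColumns = Seq("a", "b", "pt")
val partitionColumnNames = Set("pt")
val (partCols, dataCols) = allColumns.partition(partitionColumnNames.contains)

println(partCols)  // the partition columns
println(dataCols)  // the data columns
```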
See the PR: #16848
ok~ thanks~
    // after SPARK-19279, it is not allowed to create a hive table with an empty schema,
    // so here we should not add a default col schema
    if (schema.isEmpty && !DDLUtils.isHiveTable(table)) {
Change !DDLUtils.isHiveTable(table) to DDLUtils.isDatasourceTable(table)?
ok~ thanks~
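A hypothetical model of the check agreed on above: only a data source table with an empty schema gets the placeholder column. MiniTable and the two helpers are simplified stand-ins for Spark's CatalogTable and DDLUtils, not the real classes:

```scala
// A data source table records its provider (e.g. "parquet"); a Hive serde
// table uses the "hive" provider. The default col schema should be added only
// when the schema is empty AND the table is a data source table.
case class MiniTable(provider: Option[String], schema: Seq[String])

def isHiveTable(t: MiniTable): Boolean = t.provider.contains("hive")
def isDatasourceTable(t: MiniTable): Boolean = t.provider.exists(_ != "hive")

def needsDefaultColSchema(t: MiniTable): Boolean =
  t.schema.isEmpty && isDatasourceTable(t)

println(needsDefaultColSchema(MiniTable(Some("parquet"), Seq.empty)))
println(needsDefaultColSchema(MiniTable(Some("hive"), Seq.empty)))
```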
Test build #72551 has finished for PR 16828 at commit
Looks like this PR is not so self-contained; can you embed it into your PR about unifying
I think this is also an improvement to avoid some later potential problems, can we keep it?
The later potential problems? If we do not unify
OK, I will close it. Thanks! @cloud-fan @gatorsmile
Test build #72557 has finished for PR 16828 at commit
What changes were proposed in this pull request?

After SPARK-19279, we can no longer create a Hive table with an empty schema, so we should tighten up the condition used when creating a Hive table in
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L835

That is, if a CatalogTable has an empty schema and there is no spark.sql.sources.schema.numParts property (or its value is 0), we should not add a default col schema; if we did, a table with an empty schema would be created, which is not what we expect.

An additional reason for this PR is that, while deduplicating functions in MetaStoreRelation in #16787, I need to merge the toHiveTable functions of HiveClientImpl and HiveUtils:
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L818
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala#L494

The problem with this merge is that HiveClientImpl's toHiveTable adds a default col schema for a table with an empty schema, which breaks the inferSchema logic that checks whether the user wanted to create a table with an empty schema:
https://github.com/apache/spark/pull/16636/files#diff-842e3447fc453de26c706db1cac8f2c4R586
https://github.com/apache/spark/pull/16636/files#diff-c4ed9859978dd6ac271b6a40ee945e4bR95
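The interaction described above can be sketched as follows (simplified; "col" mirrors the default column name mentioned in the diffs, everything else is illustrative):

```scala
// Once an empty schema is padded with a placeholder "col" column, a later
// schema.isEmpty check (as used by the inferSchema path) can no longer detect
// that the user actually created the table without a schema.
val userSchema: Seq[String] = Seq.empty
val padded = if (userSchema.isEmpty) Seq("col") else userSchema

println(userSchema.isEmpty)  // the user really gave no schema
println(padded.isEmpty)      // but the padded schema hides that fact
```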
How was this patch tested?
N/A