
[SPARK-19484][SQL]continue work to create hive table with an empty schema #16828

Closed

Conversation

windpiger
Contributor

@windpiger windpiger commented Feb 7, 2017

What changes were proposed in this pull request?

After SPARK-19279, we can no longer create a Hive table with an empty schema, so we should tighten the condition used when creating a Hive table in
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L835

That is, if a CatalogTable t has an empty schema and its properties contain no spark.sql.sources.schema.numParts key (or its value is 0), we should not add a default column schema. If we did, a table with an empty schema would be created, which is not what we expect.

An additional reason for this PR: while deduplicating functions in MetastoreRelation in #16787, there is a toHiveTable function to merge between HiveClientImpl and HiveUtils:

https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L818
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala#L494

The problem with this merge is that HiveClientImpl's toHiveTable adds a default column schema for a table with an empty schema, which affects the inferSchema check that determines whether the user wants to create a table with an empty schema:
https://github.com/apache/spark/pull/16636/files#diff-842e3447fc453de26c706db1cac8f2c4R586
https://github.com/apache/spark/pull/16636/files#diff-c4ed9859978dd6ac271b6a40ee945e4bR95
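The tightened guard this PR proposes can be sketched as a standalone predicate. This is a minimal sketch, not the actual Spark code: the key name mirrors HiveExternalCatalog.DATASOURCE_SCHEMA_NUMPARTS, while the object and helper names are made up for illustration.

```scala
// Hypothetical standalone sketch of the tightened guard in toHiveTable.
object DefaultSchemaGuard {
  // Mirrors HiveExternalCatalog.DATASOURCE_SCHEMA_NUMPARTS.
  val DATASOURCE_SCHEMA_NUMPARTS = "spark.sql.sources.schema.numParts"

  // Add the placeholder column only when the schema is empty AND the table
  // properties indicate a real schema is serialized there (numParts != 0).
  def shouldAddDefaultCol(schemaIsEmpty: Boolean, props: Map[String, String]): Boolean =
    schemaIsEmpty && props.getOrElse(DATASOURCE_SCHEMA_NUMPARTS, "0").toInt != 0
}
```

Under this sketch, an empty-schema table with no numParts property (the case SPARK-19279 forbids) no longer gets a default column.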

How was this patch tested?

N/A

@windpiger
Contributor Author

cc @gatorsmile @cloud-fan

@SparkQA

SparkQA commented Feb 7, 2017

Test build #72488 has finished for PR 16828 at commit 40f2896.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -832,7 +832,8 @@ private[hive] class HiveClientImpl(
     val (partCols, schema) = table.schema.map(toHiveColumn).partition { c =>
       table.partitionColumnNames.contains(c.getName)
     }
-    if (schema.isEmpty) {
+    if (schema.isEmpty && table.properties.getOrElse(
+        HiveExternalCatalog.DATASOURCE_SCHEMA_NUMPARTS, "0").toInt != 0) {
Contributor

Why do we need this condition? I think we store the schema in table properties for all kinds of tables.

@@ -187,15 +187,6 @@ class HiveExternalCatalogBackwardCompatibilitySuite extends QueryTest
       "spark.sql.sources.schema.numParts" -> "1",
       "spark.sql.sources.schema.part.0" -> simpleSchemaJson))

-  lazy val dataSourceTableWithoutSchema = CatalogTable(
Contributor

This is still a valid case we need to support
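For context on where spark.sql.sources.schema.numParts comes from, here is a hedged sketch of the splitting scheme those properties encode: the schema JSON is chunked into numbered parts plus a part count. The helper names and the part-size threshold are illustrative, not Spark's actual implementation.

```scala
// Illustrative re-implementation of how a schema JSON string could be split
// across table properties (spark.sql.sources.schema.numParts / part.N).
object SchemaProps {
  def splitSchemaJson(json: String, partSize: Int): Map[String, String] = {
    val parts = json.grouped(partSize).toSeq
    Map("spark.sql.sources.schema.numParts" -> parts.size.toString) ++
      parts.zipWithIndex.map { case (p, i) =>
        s"spark.sql.sources.schema.part.$i" -> p
      }
  }

  // Reassemble the schema JSON from the properties, if the count key is present.
  def readSchemaJson(props: Map[String, String]): Option[String] =
    props.get("spark.sql.sources.schema.numParts").map { n =>
      (0 until n.toInt).map(i => props(s"spark.sql.sources.schema.part.$i")).mkString
    }
}
```

With this encoding, a table whose schema lives in the properties always carries a non-zero numParts key, which is what the guard in the diff above tests for.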

@SparkQA

SparkQA commented Feb 7, 2017

Test build #72495 has finished for PR 16828 at commit 56a01e7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 7, 2017

Test build #72502 has finished for PR 16828 at commit 02f3147.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -1328,7 +1330,7 @@ class MetastoreDataSourcesSuite extends QueryTest with SQLTestUtils with TestHiv
       storage = CatalogStorageFormat.empty.copy(
         properties = Map("path" -> path.getAbsolutePath)
       ),
-      schema = new StructType(),
+      schema = simpleSchema,
@cloud-fan cloud-fan commented Feb 7, 2017

We were testing an empty schema here.

// spark table. the SPARK_TEST_OLD_SOURCE_TABLE_CREATE property is used to resolve this.
if (schema.isEmpty && (table.properties.getOrElse(
HiveExternalCatalog.DATASOURCE_SCHEMA_NUMPARTS, "0").toInt != 0) || table.properties
.getOrElse(HiveExternalCatalog.SPARK_TEST_OLD_SOURCE_TABLE_CREATE, "false").toBoolean) {
Contributor

When will we hit this branch? Only when we create an empty-schema table in a test?

@windpiger windpiger commented Feb 7, 2017

If we create a datasource table whose schema is empty while its table properties contain DATASOURCE_SCHEMA_NUMPARTS, it will also hit this branch.

Contributor

But the table properties should always contain DATASOURCE_SCHEMA_NUMPARTS, shouldn't they?

Contributor Author

If we inferSchema here:
https://github.com/apache/spark/pull/16636/files#diff-c4ed9859978dd6ac271b6a40ee945e4bR95

the catalog table does not have DATASOURCE_SCHEMA_NUMPARTS in its properties.

Contributor

So can we just check DDLUtils.isHiveTable?

Contributor Author

Yes, let me modify this, thanks!

@SparkQA

SparkQA commented Feb 7, 2017

Test build #72517 has finished for PR 16828 at commit c4e54c1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 7, 2017

Test build #72519 has finished for PR 16828 at commit 8fba58d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -832,7 +833,10 @@ private[hive] class HiveClientImpl(
     val (partCols, schema) = table.schema.map(toHiveColumn).partition { c =>
       table.partitionColumnNames.contains(c.getName)
     }
-    if (schema.isEmpty) {
+
+    // after SPARK-19279, it is not allowed to create a hive table with an empty schema,
Member

Uh... I found a bug in my original code for inferSchema. Let me fix it now: the schema could be non-empty when it is a partitioned table.

Contributor Author

Remove the || table.schema.nonEmpty condition?

Member

We just need to check whether the dataSchema is not empty, right?

@windpiger windpiger commented Feb 8, 2017

If we have a partition schema, the || table.schema.nonEmpty condition will be hit and the table returned directly. Maybe the Hive table's data schema can be determined by the Hive serde?

Member

See the PR: #16848

Contributor Author

ok~ thanks~
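The bug being discussed, that a partitioned table's full schema can be non-empty even when its data schema is empty, can be illustrated with a simplified stand-in. The case class and dataSchema helper below are illustrative, not CatalogTable's real definition.

```scala
// Simplified stand-in: a table whose columns are all partition columns has a
// non-empty full schema but an empty data schema.
case class SimpleTable(schema: Seq[String], partitionColumnNames: Seq[String]) {
  // Data schema = full schema minus partition columns.
  def dataSchema: Seq[String] = schema.filterNot(partitionColumnNames.contains)
}

val t = SimpleTable(schema = Seq("p1", "p2"), partitionColumnNames = Seq("p1", "p2"))
// t.schema.nonEmpty is true, yet t.dataSchema is empty, so a guard on
// table.schema.nonEmpty does not prove a data schema was specified.
```

This is why the thread concludes the check should be on the data schema rather than on the full schema.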


// after SPARK-19279, it is not allowed to create a hive table with an empty schema,
// so here we should not add a default col schema
if (schema.isEmpty && !DDLUtils.isHiveTable(table)) {
Member

change !DDLUtils.isHiveTable(table) to DDLUtils.isDatasourceTable(table)?

Contributor Author

ok~ thanks~
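The suggested change from !DDLUtils.isHiveTable(table) to DDLUtils.isDatasourceTable(table) can be sketched with simplified stand-ins. The assumption here is that tables are distinguished by an optional provider string, "hive" versus anything else; these helpers are illustrative, not Spark's actual API.

```scala
// Illustrative stand-ins for the two provider checks discussed above.
object ProviderChecks {
  def isHiveTable(provider: Option[String]): Boolean =
    provider.exists(_.equalsIgnoreCase("hive"))

  def isDatasourceTable(provider: Option[String]): Boolean =
    provider.exists(!_.equalsIgnoreCase("hive"))
}
```

Under this sketch the two checks are not complements when no provider is set: !isHiveTable(None) is true while isDatasourceTable(None) is false, so the suggested form is the stricter guard.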

@SparkQA

SparkQA commented Feb 8, 2017

Test build #72551 has finished for PR 16828 at commit cf2c382.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

It looks like this PR is not very self-contained. Can you fold it into your PR about unifying toHiveTable?

@windpiger
Contributor Author

I think this is also an improvement that avoids some potential problems later. Can we keep it?

@gatorsmile
Member

Which potential problems? If we do not unify toHiveTable, can you write a test case that hits this issue?

@windpiger
Contributor Author

ok, I will close it. Thanks! @cloud-fan @gatorsmile

@windpiger windpiger closed this Feb 8, 2017
@SparkQA
Copy link

SparkQA commented Feb 8, 2017

Test build #72557 has finished for PR 16828 at commit cc5141a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
