[SPARK-31751][SQL]Serde property path overwrites hive table property location #28882

Closed

Conversation

TJX2014
Contributor

@TJX2014 TJX2014 commented Jun 21, 2020

What changes were proposed in this pull request?

1. Add a unit test in org.apache.spark.sql.hive.HiveExternalCatalogSuite.
2. Throw an AnalysisException when the table storage location URI does not equal the path storage property.

Why are the changes needed?

Step 1: With Hive support enabled, we use Spark SQL to create and save a Hive table as follows:
df = spark.createDataFrame([{"a": "x", "b": "y", "c": "3"}])
df.write.format("orc").option("compression", "ZLIB").mode("overwrite").saveAsTable('test_spark')
Step 2: We then rename the table through the Hive interface rather than Spark:
alter table test_spark rename to test_spark2;
Hive changes the table location from test_spark to test_spark2, but Spark still keeps test_spark in the serde property path, so Spark reads test_spark when test_spark2 is expected.
Step 3: Since we cannot forbid ALTER TABLE on the Hive side, we need to inform users once this happens.
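
The drift described in these steps can be illustrated with a toy model. This is a minimal sketch in plain Python, not Spark's or Hive's catalog code: `TableMeta`, `spark_save_as_table`, `hive_rename`, and `spark_read_path` are hypothetical names, and the `/warehouse` directory is made up.

```python
from dataclasses import dataclass, field


@dataclass
class TableMeta:
    """Toy stand-in for a metastore entry (hypothetical, not a Spark class)."""
    name: str
    location: str                                    # table location tracked by Hive
    serde_props: dict = field(default_factory=dict)  # holds Spark's "path" property


def spark_save_as_table(name: str, warehouse: str = "/warehouse") -> TableMeta:
    # Step 1: Spark records the same directory in both places.
    loc = f"{warehouse}/{name}"
    return TableMeta(name=name, location=loc, serde_props={"path": loc})


def hive_rename(table: TableMeta, new_name: str, warehouse: str = "/warehouse") -> TableMeta:
    # Step 2: a rename on the Hive side moves the location but leaves
    # the serde properties untouched.
    return TableMeta(name=new_name, location=f"{warehouse}/{new_name}",
                     serde_props=dict(table.serde_props))


def spark_read_path(table: TableMeta) -> str:
    # The reported behavior: the "path" property wins over the location.
    return table.serde_props.get("path", table.location)


renamed = hive_rename(spark_save_as_table("test_spark"), "test_spark2")
print(renamed.location)          # /warehouse/test_spark2
print(spark_read_path(renamed))  # /warehouse/test_spark
```

In this model the two values diverge after the rename, which is the situation the proposed AnalysisException is meant to surface.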

Does this PR introduce any user-facing change?

Yes. Once a table is altered on the Hive side, users get an AnalysisException asking them to make the table location and the path property consistent.

How was this patch tested?

Unit test.

@TJX2014
Contributor Author

TJX2014 commented Jun 24, 2020

@cloud-fan Could you please help review this PR? :-)

Member

@dongjoon-hyun dongjoon-hyun left a comment

Could you provide a test case for this, @TJX2014 ? Thanks!

@TJX2014
Contributor Author

TJX2014 commented Jun 28, 2020

Could you provide a test case for this, @TJX2014 ? Thanks!

Kindly ping @dongjoon-hyun. Thanks for your suggestion; I have added a UT for spark.sql.follow.hive.table.location.

@dongjoon-hyun
Member

ok to test

tableType = CatalogTableType.MANAGED,
storage = storageFormat,
schema = new StructType().add("col1", "int"),
provider = Some("orc"))
Member

Conventionally, we use provider = Some("parquet") for tables whose names start with parq_.

@@ -226,4 +226,11 @@ object StaticSQLConf {
       .version("3.0.0")
       .intConf
       .createWithDefault(100)
+
+  val FOLLOW_HIVE_TABLE_LOCATION_ENABLED =
+    buildStaticConf("spark.sql.follow.hive.table.location")
Member

Hi, @TJX2014 . Since you are working on these, I'll share some community rules. Not every rule is written down clearly, so I understand that you may not be aware of them. This is just a tip for this and your future PRs.

  1. NAMING: Configuration naming follows a namespace concept. Please don't create a new namespace such as spark.sql.follow unless it introduces a new set of concepts. All Hive-related configs use spark.sql.hive as the namespace. Please see the other existing Hive configurations.
  2. MODULE: Apache Spark follows a modular design. In general, a new Hive-specific configuration belongs in the sql/hive module rather than the sql/catalyst module. Please try to put it into HiveUtils.scala. If HiveUtils.scala is insufficient for your PR, you can then promote your conf into the sql/core or sql/catalyst modules.
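
As a rough illustration of the naming rule in point 1, the sketch below checks whether a key sits under the spark.sql.hive namespace. `is_hive_namespaced` is a hypothetical helper written for this example, and `spark.sql.hive.exampleFlag` is a made-up key; Spark itself does not ship a check like this.

```python
HIVE_NAMESPACE = "spark.sql.hive."


def is_hive_namespaced(key: str) -> bool:
    """Return True when a config key lives under spark.sql.hive."""
    return key.startswith(HIVE_NAMESPACE) and len(key) > len(HIVE_NAMESPACE)


# The key proposed in this PR opens a new spark.sql.follow namespace.
print(is_hive_namespaced("spark.sql.follow.hive.table.location"))  # False
# A made-up key under the existing Hive namespace.
print(is_hive_namespaced("spark.sql.hive.exampleFlag"))            # True
```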

@dongjoon-hyun
Member

Thank you for your contribution, @TJX2014 . For this feature, it seems that we need more time to review.

@SparkQA

SparkQA commented Jun 29, 2020

Test build #124610 has finished for PR 28882 at commit 161d54e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@TJX2014
Contributor Author

TJX2014 commented Jun 29, 2020

@dongjoon-hyun Thank you for your suggestion, I have corrected it.

@@ -545,7 +545,11 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
   }

   private def getLocationFromStorageProps(table: CatalogTable): Option[String] = {
-    CaseInsensitiveMap(table.storage.properties).get("path")
+    if (conf.get(HiveUtils.FOLLOW_TABLE_LOCATION)) {
Contributor

It looks tricky to have a config for such a subtle behavior. I think it makes more sense to fail when the path serde property doesn't match the table location, and ask users to either correct the path property or the table location.
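
The fail-fast behavior suggested here could look roughly like the sketch below. It is plain Python rather than the Scala in HiveExternalCatalog: `check_location_matches_path` is a hypothetical helper, `ValueError` stands in for AnalysisException, the message wording is illustrative, and plain string equality is used only for brevity.

```python
from typing import Optional


def check_location_matches_path(location: Optional[str],
                                path_prop: Optional[str]) -> Optional[str]:
    """Return the path property, failing when it disagrees with the location."""
    if path_prop is not None and location is not None and path_prop != location:
        # Ask the user to fix one side instead of silently picking a winner.
        raise ValueError(
            f"table location {location} does not match serde property "
            f"path {path_prop}; correct the path property or the table location"
        )
    return path_prop


print(check_location_matches_path("/warehouse/t", "/warehouse/t"))  # /warehouse/t

try:
    check_location_matches_path("/warehouse/t2", "/warehouse/t")
except ValueError as err:
    print("rejected:", err)
```

The design choice is that no config flag decides which side wins; any mismatch is reported and left for the user to resolve.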

Contributor Author

@cloud-fan Thank you for your suggestion, I have corrected it.

@SparkQA

SparkQA commented Jun 29, 2020

Test build #124624 has started for PR 28882 at commit 43a949c.

@SparkQA

SparkQA commented Jun 29, 2020

Test build #124631 has started for PR 28882 at commit 81a5eec.

Comment on lines +235 to +227
externalCatalog.client.runSqlHive(
"alter table db1.parq_alter rename to db1.parq_alter2")

val e = intercept[AnalysisException](
externalCatalog.getTable("db1", "parq_alter2")
)
assert(e.getMessage.contains("not equal to table prop path")
&& e.getMessage.contains("parq_alter2"))
}
Contributor Author

@TJX2014 TJX2014 Jun 29, 2020

We will get an exception when the path property is not consistent with the storage location.

@TJX2014 TJX2014 requested a review from cloud-fan June 30, 2020 13:16
@@ -545,7 +545,14 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
   }

   private def getLocationFromStorageProps(table: CatalogTable): Option[String] = {
-    CaseInsensitiveMap(table.storage.properties).get("path")
+    val storageLoc = table.storage.locationUri.map(_.toString)
+    val storageProp = CaseInsensitiveMap(table.storage.properties).get("path")
Contributor

I'm a bit worried about string comparison. Are you sure the path string is always equal to the URI string? Shall we do normalization before comparing?

Contributor Author

Would it be better to normalize the URI by replacing toString with CatalogUtils.URIToString(_)?

Contributor

To be safe, can we compare the URIs? We can convert the path string to a URI with CatalogUtils.stringToURI.
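
The difference between string comparison and URI comparison can be shown with a small analogue. This sketch uses Python's urllib.parse in place of java.net.URI and CatalogUtils.stringToURI; `same_location` is a hypothetical helper, and its normalization rules (default scheme of file, ignoring a trailing slash) are assumptions of the example, not Spark's behavior.

```python
from urllib.parse import urlsplit


def same_location(a: str, b: str) -> bool:
    """Compare two locations by parsed URI components, not raw strings."""
    ua, ub = urlsplit(a), urlsplit(b)
    # Treat a missing scheme as a local file path (an assumption of this sketch).
    scheme_a, scheme_b = ua.scheme or "file", ub.scheme or "file"
    # Ignore a trailing slash, which does not change the directory referred to.
    path_a, path_b = ua.path.rstrip("/") or "/", ub.path.rstrip("/") or "/"
    return (scheme_a, ua.netloc, path_a) == (scheme_b, ub.netloc, path_b)


print("file:/some/path" == "file:///some/path")               # False
print(same_location("file:/some/path", "file:///some/path"))  # True
print(same_location("/some/path", "file:/some/path/"))        # True
print(same_location("file:/some/path", "file:/other/path"))   # False
```

Two spellings of the same location differ as strings but agree once parsed, which is why a raw string check could raise spurious mismatches.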

@SparkQA

SparkQA commented Jul 1, 2020

Test build #124697 has finished for PR 28882 at commit 51e1cb4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Comment on lines 186 to 192
val parquetTable = CatalogTable(
identifier = TableIdentifier("parq_tbl", Some("db1")),
tableType = CatalogTableType.MANAGED,
storage = storageFormat.copy(locationUri = Some(new URI("file:/some/path"))),
schema = new StructType().add("col1", "int").add("col2", "string"),
provider = Some("parquet"))
catalog.createTable(parquetTable, ignoreIfExists = false)
Contributor Author

@TJX2014 TJX2014 Jul 1, 2020

@brkyvz @cloud-fan #27822
I find that storage = storageFormat.copy(locationUri = Some(new URI("file:/some/path"))) may be unnecessary, and it makes the location inconsistent with the path property. Shall we remove it and merge the two test cases into one?

Contributor

Why does it cause path incompatibility? storageFormat doesn't have a path property.

Contributor Author

@TJX2014 TJX2014 Jul 1, 2020

Because this UT failed with the message: path in location Some(file:/tmp/spark-cc7f24a5-e626-4d91-afc6-4f2702482980/parq_tbl) not equal to table prop path Some(file:/some/path). Could I send another PR to polish this?

Contributor

Yes please. This test looks like a valid one and should not fail.

Contributor Author

Yes, it is a valid case, and it seems there is something to improve: #28980

@SparkQA

SparkQA commented Jul 1, 2020

Test build #124754 has finished for PR 28882 at commit 35ec888.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@TJX2014 TJX2014 force-pushed the master-SPARK-31751-hive-table-location branch from 35ec888 to 1496c66 Compare July 2, 2020 22:35
@TJX2014 TJX2014 requested a review from cloud-fan July 2, 2020 22:40
@SparkQA

SparkQA commented Jul 5, 2020

Test build #124946 has finished for PR 28882 at commit 1496c66.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

Can we test non-Hive-compatible file sources like json? E.g. CREATE TABLE t USING json LOCATION .... It seems the serde property path will always differ from the table location, as Spark doesn't store the table location at all.
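
This concern can be shown with a toy model. The sketch assumes, as the comment states, that for a non-Hive-compatible source the metastore entry does not carry the real data location, so only the path property is meaningful. The placeholder string, the `/data/t` path, and both helper functions are made up for illustration, and `source_aware_check` is only one possible direction, not something proposed in this PR.

```python
from typing import Optional

# Made-up marker for "the metastore location is not the real data location".
PLACEHOLDER = "/warehouse/t-placeholder"


def naive_check(location: Optional[str], path_prop: Optional[str]) -> bool:
    """The strict check: location and path property must be identical."""
    return location == path_prop


def source_aware_check(location: Optional[str], path_prop: Optional[str],
                       hive_compatible: bool) -> bool:
    """Only compare when the metastore location is expected to be real."""
    if not hive_compatible:
        return True  # nothing trustworthy to compare against
    return location == path_prop


# A json table created with an explicit LOCATION, in this toy model:
print(naive_check(PLACEHOLDER, "/data/t"))                        # False
print(source_aware_check(PLACEHOLDER, "/data/t", False))          # True
print(source_aware_check("/warehouse/t2", "/warehouse/t", True))  # False
```

Under that assumption the strict check rejects every such table, so the comparison would need to be limited to Hive-compatible tables or covered by a dedicated test.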

@TJX2014
Contributor Author

TJX2014 commented Jul 8, 2020

It seems we should consider CREATE TABLE t USING json LOCATION ...

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Oct 17, 2020
@github-actions github-actions bot closed this Oct 18, 2020
4 participants