[SPARK-31751][SQL] Serde property path overwrites hive table property location #28882
Conversation
@cloud-fan Could you please help me check this PR? :-)
Could you provide a test case for this, @TJX2014 ? Thanks!
Kindly ping @dongjoon-hyun. Thanks for your suggestion, I have added a UT in:
sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogSuite.scala
ok to test
tableType = CatalogTableType.MANAGED,
storage = storageFormat,
schema = new StructType().add("col1", "int"),
provider = Some("orc"))
Conventionally, we use provider = Some("parquet") for tables whose names start with parq_.
@@ -226,4 +226,11 @@ object StaticSQLConf {
    .version("3.0.0")
    .intConf
    .createWithDefault(100)

  val FOLLOW_HIVE_TABLE_LOCATION_ENABLED =
    buildStaticConf("spark.sql.follow.hive.table.location")
Hi, @TJX2014. Since you are working on these, I'll share some community rules. Not every rule is written down clearly, so I understand that you may not be aware of them. This is just a tip for this and your next PRs.
- NAMING: Configuration naming follows a namespace concept. Please don't create a new namespace such as spark.sql.follow for a new set of concepts. For all Hive-related configs, spark.sql.hive is used as the namespace. Please see the other existing Hive configurations.
- MODULE: Apache Spark follows a modular design. In general, for Hive-specific configurations, the sql/hive module is a better place for a new conf than the sql/catalyst module. Please try to put it into HiveUtils.scala. If HiveUtils.scala is insufficient for your PR, then you can promote your conf into the sql/core or sql/catalyst modules.
Thank you for your contribution, @TJX2014. For this feature, it seems that we need more time to review.
Test build #124610 has finished for PR 28882 at commit
sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogSuite.scala
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
@dongjoon-hyun Thank you for your suggestion, I have corrected it.
@@ -545,7 +545,11 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
  }

  private def getLocationFromStorageProps(table: CatalogTable): Option[String] = {
    CaseInsensitiveMap(table.storage.properties).get("path")
    if (conf.get(HiveUtils.FOLLOW_TABLE_LOCATION)) {
It looks tricky to have a config for such a subtle behavior. I think it makes more sense to fail when the path serde property doesn't match the table location, and ask users to either correct the path property or the table location.
@cloud-fan Thank you for your suggestion, I have corrected it.
Test build #124624 has started for PR 28882 at commit
Test build #124631 has started for PR 28882 at commit
externalCatalog.client.runSqlHive(
  "alter table db1.parq_alter rename to db1.parq_alter2")

val e = intercept[AnalysisException](
  externalCatalog.getTable("db1", "parq_alter2")
)
assert(e.getMessage.contains("not equal to table prop path")
  && e.getMessage.contains("parq_alter2"))
}
We will get an exception when the path property is not consistent with the storage location.
@@ -545,7 +545,14 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
  }

  private def getLocationFromStorageProps(table: CatalogTable): Option[String] = {
    CaseInsensitiveMap(table.storage.properties).get("path")
    val storageLoc = table.storage.locationUri.map(_.toString)
    val storageProp = CaseInsensitiveMap(table.storage.properties).get("path")
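The diff above reads both the table location URI and the case-insensitive "path" serde property so the two can be compared. As a rough, language-neutral sketch of that lookup (Python stand-ins for Spark's CaseInsensitiveMap and the table state; the paths and names here are illustrative, not Spark's API):

```python
# Minimal stand-in for Spark's CaseInsensitiveMap: keys match case-insensitively.
class CaseInsensitiveMap(dict):
    def __init__(self, data):
        super().__init__((k.lower(), v) for k, v in data.items())

    def get(self, key, default=None):
        return dict.get(self, key.lower(), default)

# Hypothetical table state after a Hive-side rename: the location moved,
# but the serde "path" property still points at the old directory.
storage_location = "file:/warehouse/db1.db/parq_alter2"
storage_props = {"Path": "file:/warehouse/db1.db/parq_alter"}

storage_prop = CaseInsensitiveMap(storage_props).get("path")
print(storage_prop == storage_location)  # -> False: the mismatch this PR detects
```

Note the lookup succeeds even though the stored key is "Path", which is the point of the case-insensitive wrapper.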
I'm a bit worried about string comparison. Are you sure the path string is always equal to the URI string? Shall we do normalization before comparing?
We normalize the URI via toString; would CatalogUtils.URIToString(_) be better?
To be safe, can we compare the URIs? We can convert the path string to a URI with CatalogUtils.stringToURI.
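The concern above is that textually different strings can name the same location (for example file:/some/path versus file:///some/path). A small sketch of why URI-level comparison is safer than raw string equality, using Python's urllib.parse as a stand-in for CatalogUtils.stringToURI (the paths are made up):

```python
from urllib.parse import urlparse

def to_uri_key(s):
    # Parse into (scheme, path) so spelling differences in the authority
    # component do not break equality; default to the file scheme for bare paths.
    u = urlparse(s)
    return (u.scheme or "file", u.path)

a = "file:/some/path"
b = "file:///some/path"

print(a == b)                          # -> False: raw strings differ
print(to_uri_key(a) == to_uri_key(b))  # -> True: same scheme and path once parsed
```

This is only a sketch; Spark's actual normalization also has to handle default filesystems, qualified authorities, and relative paths.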
Test build #124697 has finished for PR 28882 at commit
val parquetTable = CatalogTable(
  identifier = TableIdentifier("parq_tbl", Some("db1")),
  tableType = CatalogTableType.MANAGED,
  storage = storageFormat.copy(locationUri = Some(new URI("file:/some/path"))),
  schema = new StructType().add("col1", "int").add("col2", "string"),
  provider = Some("parquet"))
catalog.createTable(parquetTable, ignoreIfExists = false)
@brkyvz @cloud-fan #27822
I find storage = storageFormat.copy(locationUri = Some(new URI("file:/some/path"))) may be unnecessary, and it makes the location incompatible with the path property. So shall we remove it and merge the two test cases into one?
Why does it cause path incompatibility? storageFormat doesn't have a path property.
Because this UT failed and tells us: path in location Some(file:/tmp/spark-cc7f24a5-e626-4d91-afc6-4f2702482980/parq_tbl) not equal to table prop path Some(file:/some/path). So could I send another PR to polish this?
Yes please. This test looks like a valid one and should not fail.
Yes, it is a valid case, and it seems there is something to improve; see #28980.
Test build #124754 has finished for PR 28882 at commit
Force-pushed from 35ec888 to 1496c66
Test build #124946 has finished for PR 28882 at commit
Can we test non-Hive-compatible file sources like json? e.g.
Seems we should consider
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
1. Add a UT in org.apache.spark.sql.hive.HiveExternalCatalogSuite.
2. Throw AnalysisException when the table storage location URL is not equal to the storage path property.

Why are the changes needed?
Step 1: When Hive support is enabled and we use Spark SQL to create and save a Hive table as follows:
df = spark.createDataFrame([{"a": "x", "b": "y", "c": "3"}])
df.write.format("orc").option("compression", "ZLIB").mode("overwrite").saveAsTable('test_spark');
Step 2: When we rename the table through the Hive interface instead of Spark, like below:
alter table test_spark rename to test_spark2;
the location of test_spark is changed to test_spark2, but Spark keeps the serde path test_spark. This leads Spark to read test_spark while test_spark2 is expected.
Step 3: Since we cannot forbid altering the table from the Hive side, once this happens we need to inform users.
Does this PR introduce any user-facing change?
Yes. Once users alter the table on the Hive side, they will get an AnalysisException telling them to keep the location and the path property consistent.

How was this patch tested?
Unit test.