-
Notifications
You must be signed in to change notification settings - Fork 28k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-19359][SQL]clear useless path after rename a partition with upper-case by HiveExternalCatalog #16700
[SPARK-19359][SQL]clear useless path after rename a partition with upper-case by HiveExternalCatalog #16700
Conversation
…per-case in HiveExternalCatalog
Test build #71976 has started for PR 16700 at commit |
retest this please |
Just a general point that it seems odd to have a method whose name suggests it finds useless partition dirs. Can this be done more simply - is it not just a question of one extra delete somewhere? |
thanks, what about |
Test build #71978 has finished for PR 16700 at commit
|
Test build #71987 has finished for PR 16700 at commit
|
def getExtraPartPathCreatedByHive( | ||
lowerCaseSpec: TablePartitionSpec, | ||
partitionColumnNames: Seq[String], | ||
tablePath: Path): Path = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please fix the indent issues.
@@ -120,6 +120,17 @@ object ExternalCatalogUtils { | |||
new Path(totalPath, nextPartPath) | |||
} | |||
} | |||
|
|||
def getExtraPartPathCreatedByHive( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A general suggestion. When the function names become not self-descriptive. Write the function comments.
For example, here, please add comments for generatePartitionPath
and getExtraPartPathCreatedByHive
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok ,thanks very much~ BTW, Happy Chinese New Year~
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy Chinese New Year
What happens if the partitioning columns have more than two columns? |
the example showed A/B are two partition columns |
Yeah, if we having three columns, does your solution resolve all the issues? |
renamePartition: renamePartition: it will delete the first [upper-case partition col name] path which create by hive with lower-case path. more tests added to cover this ,thanks~ |
Test build #72017 has started for PR 16700 at commit |
retest this please |
Test build #72020 has finished for PR 16700 at commit
|
* e.g. /path/A=1/B=2/C=3 rename to /path/A=4/B=5/C=6, the extra path returned is | ||
* /path/a=4, which also include all its' child path, such as /path/a=4/b=2 | ||
*/ | ||
def getExtraPartPathCreatedByHive( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a hive only feature, why do we put it in ExternalCatalogUtils
? How about the object HiveExternalCatalog
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks !I will move it
Test build #72025 has finished for PR 16700 at commit
|
Test build #72026 has finished for PR 16700 at commit
|
def getExtraPartPathCreatedByHive( | ||
lowerCaseSpec: TablePartitionSpec, | ||
partitionColumnNames: Seq[String], | ||
tablePath: Path): Path = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indent issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh...sorry...thanks!
/** | ||
* partition path created by Hive is lower-case, while Spark SQL will | ||
* rename it with the partition name in partitionColumnNames, and this function | ||
* return the extra lower-case path created by Hive, and then we can delete it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return
-> returns
* partition path created by Hive is lower-case, while Spark SQL will | ||
* rename it with the partition name in partitionColumnNames, and this function | ||
* return the extra lower-case path created by Hive, and then we can delete it. | ||
* e.g. /path/A=1/B=2/C=3 rename to /path/A=4/B=5/C=6, the extra path returned is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename to
-> is changed to
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the extra path returned is
-> this function returns
* rename it with the partition name in partitionColumnNames, and this function | ||
* return the extra lower-case path created by Hive, and then we can delete it. | ||
* e.g. /path/A=1/B=2/C=3 rename to /path/A=4/B=5/C=6, the extra path returned is | ||
* /path/a=4, which also include all its' child path, such as /path/a=4/b=2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/path/a=4, which also include all its' child path, such as /path/a=4/b=2
-> /path/a=4
@@ -839,6 +839,25 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat | |||
spec.map { case (k, v) => partCols.find(_.equalsIgnoreCase(k)).get -> v } | |||
} | |||
|
|||
|
|||
/** | |||
* partition path created by Hive is lower-case, while Spark SQL will |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
partition path
-> The partition path
is lower-case
-> is in lowercase
@@ -899,6 +918,21 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat | |||
spec, partitionColumnNames, tablePath) | |||
try { | |||
tablePath.getFileSystem(hadoopConf).rename(wrongPath, rightPath) | |||
|
|||
// if the newSpec contains more than one depth partitoin, FileSystem.rename just delete |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
partitoin
-> partition
if
-> If
delete
-> deletes
@@ -899,6 +918,21 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat | |||
spec, partitionColumnNames, tablePath) | |||
try { | |||
tablePath.getFileSystem(hadoopConf).rename(wrongPath, rightPath) | |||
|
|||
// if the newSpec contains more than one depth partitoin, FileSystem.rename just delete | |||
// only one path(wrongPath), we should check if wrongPath's parents need to be deleted. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only one path(wrongPath)
-> the leaf (i.e., wrongPath)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot!
// 'a=1/b=2' in FileSystem is deleted, while 'a=1' is already exists, | ||
// which should also be deleted | ||
val delHivePartPathAfterRename = getExtraPartPathCreatedByHive( | ||
lowerCasePartitionSpec(spec), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The last comment: I prefer to calling lowerCasePartitionSpec
in the func getExtraPartPathCreatedByHive
.
// newSpec is 'A=1/B=2', after renamePartitions by Hive, the location path in FileSystem | ||
// is changed to 'a=1/b=2', which is wrongPath, then we renamed to 'A=1/B=2', and | ||
// 'a=1/b=2' in FileSystem is deleted, while 'a=1' is already exists, | ||
// which should also be deleted |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about?
For example, give a newSpec 'A=1/B=2', after calling Hive's client.renamePartitions, the location path in FileSystem is changed to 'a=1/b=2', which is wrongPath. Then, although we renamed it to 'A=1/B=2', 'a=1/b=2' in FileSystem is deleted but 'a=1' still exists. We also need to delete the useless directory 'a=1'.
* The partition path created by Hive is in lowercase, while Spark SQL will | ||
* rename it with the partition name in partitionColumnNames, and this function | ||
* returns the extra lowercase path created by Hive, and then we can delete it. | ||
* e.g. /path/A=1/B=2/C=3 is changed to /path/A=4/B=5/C=6, this function returns is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
returns is
-> returns
LGTM except three comments. |
Test build #72062 has finished for PR 16700 at commit
|
Test build #72066 has started for PR 16700 at commit |
retest this please |
Test build #72069 has finished for PR 16700 at commit
|
/** | ||
* The partition path created by Hive is in lowercase, while Spark SQL will | ||
* rename it with the partition name in partitionColumnNames, and this function | ||
* returns the extra lowercase path created by Hive, and then we can delete it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: all of them are commas. You need to use periods. : )
* The partition path created by Hive is in lowercase, while Spark SQL will | ||
* rename it with the partition name in partitionColumnNames, and this function | ||
* returns the extra lowercase path created by Hive, and then we can delete it. | ||
* e.g. /path/A=1/B=2/C=3 is changed to /path/A=4/B=5/C=6, this function returns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same issue here.
|
||
// If the newSpec contains more than one depth partition, FileSystem.rename just deletes | ||
// the leaf(i.e. wrongPath), we should check if wrongPath's parents need to be deleted. | ||
// For example, give a newSpec 'A=1/B=2', after calling Hive's client.renamePartitions, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
give
-> given
// If the newSpec contains more than one depth partition, FileSystem.rename just deletes | ||
// the leaf(i.e. wrongPath), we should check if wrongPath's parents need to be deleted. | ||
// For example, give a newSpec 'A=1/B=2', after calling Hive's client.renamePartitions, | ||
// the location path in FileSystem is changed to 'a=1/b=2', which is wrongPath, then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
, then
-> . Then
// the leaf(i.e. wrongPath), we should check if wrongPath's parents need to be deleted. | ||
// For example, give a newSpec 'A=1/B=2', after calling Hive's client.renamePartitions, | ||
// the location path in FileSystem is changed to 'a=1/b=2', which is wrongPath, then | ||
// although we renamed it to 'A=1/B=2', 'a=1/b=2' in FileSystem is deleted, but 'a=1' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to use a period
here.
Thanks! Merging it to master. You can fix the minor comments in your other PRs. |
// the location path in FileSystem is changed to 'a=1/b=2', which is wrongPath, then | ||
// although we renamed it to 'A=1/B=2', 'a=1/b=2' in FileSystem is deleted, but 'a=1' | ||
// is still exists, which we also need to delete | ||
val delHivePartPathAfterRename = getExtraPartPathCreatedByHive( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, could it possibly have multiple specs sharing the same parent directory, e.g., 'A=1/B=2', 'A=1/B=3', ...?
If so, when you delete the path 'a=1' here, in processing the next spec 'A=1/B=3', I think the rename will fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The path a=1
was created when you call client.renamePartitions
, right? Based on my understanding, when you rename A=1/B=3
, Hive will create the directory a=1
and a=1/b=3
. Thus, the rename will not fail. Have you made a try?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
client.renamePartitions
is called at the beginning of renamePartitions
for all specs at once. It creates the directory a=1
and a=1/b=2
and a=1/b=3
.
When you iterates specs and rename the directories with FileSystem.rename, in the first iteration, a=1/b=2
is renamed, and a=1
is deleted in this change, then a=1/b=3
will be deleted too. So in next iteration, the renaming of a=1/b=3
to A=1/B=3
will fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So far, the partition rename DDL we support is for a single pair of partition spec. That is, ALTER TABLE table PARTITION spec1 RENAME TO PARTITION spec2
. This PR will not introduce a bug to end users.
However, your concern looks reasonable. I think we should not support the partition renaming for multiple partitions in a single DDL in the SessionCatalog and ExternalCatalog. It just makes the code more complex for error handling. Let me remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can be worse. If we already have a partition A=1/B=2
, and we rename some other partition to A=1/B=3
, then we will have A=1/B=2
and a=1/b=3
, and we have a lot of work to do, instead of just a renaming.
@@ -899,6 +919,21 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat | |||
spec, partitionColumnNames, tablePath) | |||
try { | |||
tablePath.getFileSystem(hadoopConf).rename(wrongPath, rightPath) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Found an issue here... When we call rename
, not all the file systems have the same behaviors. For example, on mac OS, when we doing this .../tbl/a=5/b=6
-> .../tbl/A=5/B=6
. The result is .../tbl/a=5/B=6
. Thus, it is not recursive. However, the file system used in Jenkin does not have such an issue. You can hit this issue if you are using macOS. Thus, this fix causes an regression, but the bug is not in your fix.
… with upper-case by HiveExternalCatalog ### What changes were proposed in this pull request? This PR is to revert the changes made in #16700. It could cause the data loss after partition rename, because we have a bug in the file renaming. Not all the OSs have the same behaviors. For example, on mac OS, if we renaming a path from `.../tbl/a=5/b=6` to `.../tbl/A=5/B=6`. The result is `.../tbl/a=5/B=6`. The expected result is `.../tbl/A=5/B=6`. Thus, renaming on mac OS is not recursive. However, the systems used in Jenkin does not have such an issue. Although this PR is not the root cause, it exposes an existing issue on the code `tablePath.getFileSystem(hadoopConf).rename(wrongPath, rightPath)` --- Hive metastore is not case preserving and keep partition columns with lower case names. If SparkSQL create a table with upper-case partion name use HiveExternalCatalog, when we rename partition, it first call the HiveClient to renamePartition, which will create a new lower case partition path, then SparkSql rename the lower case path to the upper-case. while if the renamed partition contains more than one depth partition ,e.g. A=1/B=2, hive renamePartition change to a=1/b=2, then SparkSql rename it to A=1/B=2, but the a=1 still exists in the filesystem, we should also delete it. ### How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #16728 from gatorsmile/revert-pr-16700.
Just revert it. Let us wait for the decision how we plan to deal with the file renaming. My major concern is the errors in file renaming could cause the data loss, unless we can introduce a robust rollback solution. |
oh, sorry I see your comments just now... when I see the pr #16837 |
…pper-case by HiveExternalCatalog ## What changes were proposed in this pull request? Hive metastore is not case preserving and keep partition columns with lower case names. If SparkSQL create a table with upper-case partion name use HiveExternalCatalog, when we rename partition, it first call the HiveClient to renamePartition, which will create a new lower case partition path, then SparkSql rename the lower case path to the upper-case. while if the renamed partition contains more than one depth partition ,e.g. A=1/B=2, hive renamePartition change to a=1/b=2, then SparkSql rename it to A=1/B=2, but the a=1 still exists in the filesystem, we should also delete it. ## How was this patch tested? unit test added Author: windpiger <songjun@outlook.com> Closes apache#16700 from windpiger/clearUselessPathAfterRenamPartition.
… with upper-case by HiveExternalCatalog ### What changes were proposed in this pull request? This PR is to revert the changes made in apache#16700. It could cause the data loss after partition rename, because we have a bug in the file renaming. Not all the OSs have the same behaviors. For example, on mac OS, if we renaming a path from `.../tbl/a=5/b=6` to `.../tbl/A=5/B=6`. The result is `.../tbl/a=5/B=6`. The expected result is `.../tbl/A=5/B=6`. Thus, renaming on mac OS is not recursive. However, the systems used in Jenkin does not have such an issue. Although this PR is not the root cause, it exposes an existing issue on the code `tablePath.getFileSystem(hadoopConf).rename(wrongPath, rightPath)` --- Hive metastore is not case preserving and keep partition columns with lower case names. If SparkSQL create a table with upper-case partion name use HiveExternalCatalog, when we rename partition, it first call the HiveClient to renamePartition, which will create a new lower case partition path, then SparkSql rename the lower case path to the upper-case. while if the renamed partition contains more than one depth partition ,e.g. A=1/B=2, hive renamePartition change to a=1/b=2, then SparkSql rename it to A=1/B=2, but the a=1 still exists in the filesystem, we should also delete it. ### How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes apache#16728 from gatorsmile/revert-pr-16700.
What changes were proposed in this pull request?
Hive metastore is not case preserving and keep partition columns with lower case names.
If SparkSQL create a table with upper-case partion name use HiveExternalCatalog, when we rename partition, it first call the HiveClient to renamePartition, which will create a new lower case partition path, then SparkSql rename the lower case path to the upper-case.
while if the renamed partition contains more than one depth partition ,e.g. A=1/B=2, hive renamePartition change to a=1/b=2, then SparkSql rename it to A=1/B=2, but the a=1 still exists in the filesystem, we should also delete it.
How was this patch tested?
unit test added