[SPARK-19359][SQL]clear useless path after rename a partition with upper-case by HiveExternalCatalog #16700

windpiger · 2017-01-25T07:51:23Z

What changes were proposed in this pull request?

The Hive metastore is not case-preserving and keeps partition column names in lowercase.

If Spark SQL creates a table with upper-case partition names through HiveExternalCatalog, renaming a partition first calls HiveClient's renamePartition, which creates a new lowercase partition path. Spark SQL then renames the lowercase path to the upper-case one.

However, if the renamed partition has more than one level, e.g. A=1/B=2, Hive's renamePartition changes it to a=1/b=2 and Spark SQL renames that to A=1/B=2, but the directory a=1 still exists in the file system. We should delete it as well.

How was this patch tested?

A unit test was added.

SparkQA · 2017-01-25T07:53:36Z

Test build #71976 has started for PR 16700 at commit 878d45e.

windpiger · 2017-01-25T08:31:29Z

retest this please

srowen · 2017-01-25T10:25:55Z

Just a general point that it seems odd to have a method whose name suggests it finds useless partition dirs. Can this be done more simply - is it not just a question of one extra delete somewhere?

windpiger · 2017-01-25T11:00:17Z

Thanks. What about getExtraPartPathCreatedByHive?

SparkQA · 2017-01-25T11:02:48Z

Test build #71978 has finished for PR 16700 at commit 878d45e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-01-25T13:33:30Z

Test build #71987 has finished for PR 16700 at commit 5403595.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

windpiger · 2017-01-25T15:48:51Z

cc @gatorsmile @cloud-fan

gatorsmile · 2017-01-26T05:33:58Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalogUtils.scala

+  def getExtraPartPathCreatedByHive(
+                             lowerCaseSpec: TablePartitionSpec,
+                             partitionColumnNames: Seq[String],
+                             tablePath: Path): Path = {


Please fix the indent issues.

gatorsmile · 2017-01-26T05:35:17Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalogUtils.scala

@@ -120,6 +120,17 @@ object ExternalCatalogUtils {
      new Path(totalPath, nextPartPath)
    }
  }
+
+  def getExtraPartPathCreatedByHive(


A general suggestion: when function names are not self-descriptive, write function comments.

For example, here, please add comments for generatePartitionPath and getExtraPartPathCreatedByHive

OK, thanks very much. BTW, Happy Chinese New Year!

Happy Chinese New Year

gatorsmile · 2017-01-26T05:41:55Z

What happens if there are more than two partitioning columns?

windpiger · 2017-01-26T06:52:59Z

In the example, A and B are the two partition columns.

gatorsmile · 2017-01-26T06:54:09Z

Yeah, if we have three columns, does your solution resolve all the issues?

windpiger · 2017-01-26T07:01:32Z

renamePartition:
A=1/B=2/C=3 -> A=4/B=5/C=6
Path created by Hive after renamePartition:
/path/a=4/b=5/c=6
Spark SQL renames it to /path/A=4/B=5/C=6, and this PR deletes /path/a=4.

renamePartition:
a=1/B=2/C=3 -> a=4/B=5/C=6
Path created by Hive after renamePartition:
/path/a=4/b=5/c=6
Spark SQL renames it to /path/a=4/B=5/C=6, and this PR deletes /path/a=4/b=5.

In general, it deletes the lowercase path that Hive created for the first partition column whose name contains upper-case letters.

More tests were added to cover this, thanks.
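
The rule described above can be sketched as follows. This is only an illustration, not the code from this PR: `ExtraPathSketch` and `extraPathCreatedByHive` are hypothetical names, plain strings stand in for Hadoop `Path` objects, and escaping of partition values is ignored.

```scala
import java.util.Locale

// Hypothetical sketch, not the PR's implementation.
object ExtraPathSketch {
  /**
   * Returns the lowercase directory created by Hive that is left behind after
   * Spark SQL renames the partition path, or None when every partition column
   * name is already lowercase.
   */
  def extraPathCreatedByHive(
      tablePath: String,
      partitionColumnNames: Seq[String],
      spec: Map[String, String]): Option[String] = {
    def lower(s: String): String = s.toLowerCase(Locale.ROOT)
    // Index of the first partition column whose name is not all lowercase.
    val firstUpper = partitionColumnNames.indexWhere(c => c != lower(c))
    if (firstUpper < 0) {
      None
    } else {
      // Keep every level up to and including that column, spelled in lowercase.
      val segments = partitionColumnNames.take(firstUpper + 1).map { col =>
        s"${lower(col)}=${spec(col)}"
      }
      Some((tablePath +: segments).mkString("/"))
    }
  }
}
```

With the two cases above, columns `A, B, C` and values `4, 5, 6` give `/path/a=4`, while columns `a, B, C` give `/path/a=4/b=5`.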

SparkQA · 2017-01-26T07:48:44Z

Test build #72017 has started for PR 16700 at commit 40efce2.

windpiger · 2017-01-26T08:23:14Z

retest this please

SparkQA · 2017-01-26T10:57:17Z

Test build #72020 has finished for PR 16700 at commit 40efce2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-01-26T11:39:54Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalogUtils.scala

+   * e.g. /path/A=1/B=2/C=3 rename to /path/A=4/B=5/C=6, the extra path returned is
+   * /path/a=4, which also include all its' child path, such as /path/a=4/b=2
+   */
+  def getExtraPartPathCreatedByHive(


This is a Hive-only feature, so why do we put it in ExternalCatalogUtils? How about the object HiveExternalCatalog?

Thanks! I will move it.

SparkQA · 2017-01-26T15:20:41Z

Test build #72025 has finished for PR 16700 at commit 12acdc6.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-01-26T17:08:20Z

Test build #72026 has finished for PR 16700 at commit 7aba059.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-01-26T18:14:27Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala

+  def getExtraPartPathCreatedByHive(
+                                     lowerCaseSpec: TablePartitionSpec,
+                                     partitionColumnNames: Seq[String],
+                                     tablePath: Path): Path = {


indent issues.

oh...sorry...thanks!

gatorsmile · 2017-01-26T18:15:00Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala

+  /**
+   * partition path created by Hive is lower-case, while Spark SQL will
+   * rename it with the partition name in partitionColumnNames, and this function
+   * return the extra lower-case path created by Hive, and then we can delete it.


return -> returns

gatorsmile · 2017-01-26T18:17:06Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala

+   * partition path created by Hive is lower-case, while Spark SQL will
+   * rename it with the partition name in partitionColumnNames, and this function
+   * return the extra lower-case path created by Hive, and then we can delete it.
+   * e.g. /path/A=1/B=2/C=3 rename to /path/A=4/B=5/C=6, the extra path returned is


rename to -> is changed to

the extra path returned is -> this function returns

gatorsmile · 2017-01-26T18:19:16Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala

+   * rename it with the partition name in partitionColumnNames, and this function
+   * return the extra lower-case path created by Hive, and then we can delete it.
+   * e.g. /path/A=1/B=2/C=3 rename to /path/A=4/B=5/C=6, the extra path returned is
+   * /path/a=4, which also include all its' child path, such as /path/a=4/b=2


/path/a=4, which also include all its' child path, such as /path/a=4/b=2 -> /path/a=4

gatorsmile · 2017-01-26T18:20:25Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala

@@ -839,6 +839,25 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
    spec.map { case (k, v) => partCols.find(_.equalsIgnoreCase(k)).get -> v }
  }

+
+  /**
+   * partition path created by Hive is lower-case, while Spark SQL will


partition path -> The partition path
is lower-case -> is in lowercase

gatorsmile · 2017-01-26T18:21:59Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala

@@ -899,6 +918,21 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
          spec, partitionColumnNames, tablePath)
        try {
          tablePath.getFileSystem(hadoopConf).rename(wrongPath, rightPath)
+
+          // if the newSpec contains more than one depth partitoin, FileSystem.rename just delete


partitoin -> partition
if -> If
delete -> deletes

gatorsmile · 2017-01-26T18:23:15Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala

@@ -899,6 +918,21 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
          spec, partitionColumnNames, tablePath)
        try {
          tablePath.getFileSystem(hadoopConf).rename(wrongPath, rightPath)
+
+          // if the newSpec contains more than one depth partitoin, FileSystem.rename just delete
+          // only one path(wrongPath), we should check if wrongPath's parents need to be deleted.


only one path(wrongPath) -> the leaf (i.e., wrongPath)

Thanks a lot!

gatorsmile · 2017-01-27T01:56:41Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala

+          // 'a=1/b=2' in FileSystem is deleted, while 'a=1' is already exists,
+          // which should also be deleted
+          val delHivePartPathAfterRename = getExtraPartPathCreatedByHive(
+            lowerCasePartitionSpec(spec),


One last comment: I prefer calling lowerCasePartitionSpec inside the function getExtraPartPathCreatedByHive.

gatorsmile · 2017-01-27T02:03:54Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala

+          // newSpec is 'A=1/B=2', after renamePartitions by Hive, the location path in FileSystem
+          // is changed to 'a=1/b=2', which is wrongPath, then we renamed to 'A=1/B=2', and
+          // 'a=1/b=2' in FileSystem is deleted, while 'a=1' is already exists,
+          // which should also be deleted


How about?

For example, give a newSpec 'A=1/B=2', after calling Hive's client.renamePartitions, the location path in FileSystem is changed to 'a=1/b=2', which is wrongPath. Then, although we renamed it to 'A=1/B=2', 'a=1/b=2' in FileSystem is deleted but 'a=1' still exists. We also need to delete the useless directory 'a=1'.

gatorsmile · 2017-01-27T02:05:47Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala

+   * The partition path created by Hive is in lowercase, while Spark SQL will
+   * rename it with the partition name in partitionColumnNames, and this function
+   * returns the extra lowercase path created by Hive, and then we can delete it.
+   * e.g. /path/A=1/B=2/C=3 is changed to /path/A=4/B=5/C=6, this function returns is


returns is -> returns

gatorsmile · 2017-01-27T02:06:24Z

LGTM except three comments.

SparkQA · 2017-01-27T03:11:49Z

Test build #72062 has finished for PR 16700 at commit 0136388.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-01-27T07:18:39Z

Test build #72066 has started for PR 16700 at commit de4c409.

gatorsmile · 2017-01-27T08:51:18Z

retest this please

SparkQA · 2017-01-27T10:30:31Z

Test build #72069 has finished for PR 16700 at commit de4c409.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-01-28T01:00:36Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala

+  /**
+   * The partition path created by Hive is in lowercase, while Spark SQL will
+   * rename it with the partition name in partitionColumnNames, and this function
+   * returns the extra lowercase path created by Hive, and then we can delete it.


Nit: all of them are commas. You need to use periods. : )

gatorsmile · 2017-01-28T01:00:52Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala

+   * The partition path created by Hive is in lowercase, while Spark SQL will
+   * rename it with the partition name in partitionColumnNames, and this function
+   * returns the extra lowercase path created by Hive, and then we can delete it.
+   * e.g. /path/A=1/B=2/C=3 is changed to /path/A=4/B=5/C=6, this function returns


The same issue here.

gatorsmile · 2017-01-28T01:01:16Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala

+
+          // If the newSpec contains more than one depth partition, FileSystem.rename just deletes
+          // the leaf(i.e. wrongPath), we should check if wrongPath's parents need to be deleted.
+          // For example, give a newSpec 'A=1/B=2', after calling Hive's client.renamePartitions,


give -> given

gatorsmile · 2017-01-28T01:03:00Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala

+          // If the newSpec contains more than one depth partition, FileSystem.rename just deletes
+          // the leaf(i.e. wrongPath), we should check if wrongPath's parents need to be deleted.
+          // For example, give a newSpec 'A=1/B=2', after calling Hive's client.renamePartitions,
+          // the location path in FileSystem is changed to 'a=1/b=2', which is wrongPath, then


, then -> . Then

gatorsmile · 2017-01-28T01:04:30Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala

+          // the leaf(i.e. wrongPath), we should check if wrongPath's parents need to be deleted.
+          // For example, give a newSpec 'A=1/B=2', after calling Hive's client.renamePartitions,
+          // the location path in FileSystem is changed to 'a=1/b=2', which is wrongPath, then
+          // although we renamed it to 'A=1/B=2', 'a=1/b=2' in FileSystem is deleted, but 'a=1'


You need to use a period here.

gatorsmile · 2017-01-28T02:09:08Z

Thanks! Merging it to master.

You can fix the minor comments in your other PRs.

viirya · 2017-01-28T02:52:38Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala

+          // the location path in FileSystem is changed to 'a=1/b=2', which is wrongPath, then
+          // although we renamed it to 'A=1/B=2', 'a=1/b=2' in FileSystem is deleted, but 'a=1'
+          // is still exists, which we also need to delete
+          val delHivePartPathAfterRename = getExtraPartPathCreatedByHive(


Hmmm, could it possibly have multiple specs sharing the same parent directory, e.g., 'A=1/B=2', 'A=1/B=3', ...?

If so, when you delete the path 'a=1' here, in processing the next spec 'A=1/B=3', I think the rename will fail.

The path a=1 was created when you call client.renamePartitions, right? Based on my understanding, when you rename A=1/B=3, Hive will create the directories a=1 and a=1/b=3. Thus, the rename will not fail. Have you tried it?

client.renamePartitions is called at the beginning of renamePartitions for all specs at once. It creates the directories a=1, a=1/b=2, and a=1/b=3.

When you iterate over the specs and rename the directories with FileSystem.rename, the first iteration renames a=1/b=2 and, with this change, deletes a=1, which also deletes a=1/b=3. So in the next iteration, renaming a=1/b=3 to A=1/B=3 will fail.

So far, the partition rename DDL we support takes a single pair of partition specs, that is, ALTER TABLE table PARTITION spec1 RENAME TO PARTITION spec2. This PR will not introduce a bug for end users.

However, your concern looks reasonable. I think we should not support renaming multiple partitions in a single DDL in the SessionCatalog and ExternalCatalog. It just makes the error handling more complex. Let me remove it.

This can be worse. If we already have a partition A=1/B=2 and we rename some other partition to A=1/B=3, then we will have A=1/B=2 and a=1/b=3, and we have a lot of work to do instead of just a rename.
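
The failure mode raised in this thread can be reproduced with a toy in-memory model, sketched below. It is not Spark or Hadoop code: `MultiSpecRenameSketch` and its helpers are hypothetical, a directory tree is modelled as a set of path strings, and the toy `rename` creates missing parents of the destination, which a real `FileSystem` may not do.

```scala
// Hypothetical toy model, not Spark or Hadoop code.
object MultiSpecRenameSketch {
  // A directory tree is just the set of directory paths that exist.
  type Dirs = Set[String]

  // Creates `path` and all of its parents.
  def mkdirs(dirs: Dirs, path: String): Dirs = {
    val parts = path.split("/").toSeq
    dirs ++ (1 to parts.length).map(n => parts.take(n).mkString("/"))
  }

  // Removes `path` together with everything below it.
  def deleteRecursively(dirs: Dirs, path: String): Dirs =
    dirs.filterNot(d => d == path || d.startsWith(path + "/"))

  // Moves `from` and its children to `to`; returns None when `from` is missing.
  // Assumption: missing parents of `to` are created by the rename.
  def rename(dirs: Dirs, from: String, to: String): Option[Dirs] =
    if (!dirs.contains(from)) {
      None
    } else {
      val moved = dirs.filter(d => d == from || d.startsWith(from + "/"))
      val renamed = moved.map(d => to + d.drop(from.length))
      Some(mkdirs(dirs -- moved, to) ++ renamed)
    }

  // Scenario from the thread: specs A=1/B=2 and A=1/B=3 handled in one call.
  def secondRenameSucceeds: Boolean = {
    // Hive creates the lowercase paths for all specs up front.
    val afterHive = mkdirs(mkdirs(Set.empty[String], "tbl/a=1/b=2"), "tbl/a=1/b=3")
    // First iteration: rename the leaf, then delete the leftover parent a=1.
    val afterFirst = rename(afterHive, "tbl/a=1/b=2", "tbl/A=1/B=2").get
    val afterCleanup = deleteRecursively(afterFirst, "tbl/a=1")
    // Second iteration: the source a=1/b=3 was removed along with a=1.
    rename(afterCleanup, "tbl/a=1/b=3", "tbl/A=1/B=3").isDefined
  }
}
```

In this model `secondRenameSucceeds` evaluates to `false`, because deleting `tbl/a=1` also removes `tbl/a=1/b=3` before the second rename runs.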

gatorsmile · 2017-01-28T08:59:10Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala

@@ -899,6 +919,21 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
          spec, partitionColumnNames, tablePath)
        try {
          tablePath.getFileSystem(hadoopConf).rename(wrongPath, rightPath)


Found an issue here. When we call rename, not all file systems behave the same way. For example, on macOS, renaming .../tbl/a=5/b=6 to .../tbl/A=5/B=6 results in .../tbl/a=5/B=6. Thus, it is not recursive. However, the file system used in Jenkins does not have this issue. You can hit it if you are using macOS. Thus, this fix causes a regression, but the bug is not in your fix.
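
A rough way to picture the reported result is the toy model below. It rests on an assumption not stated in this thread: that the file system is case-insensitive but case-preserving, so the destination's parent `A=5` resolves to the existing directory `a=5` and only the last segment gets a new spelling. `CaseInsensitiveRenameSketch` is a hypothetical name, and paths are modelled as sequences of segments.

```scala
import java.util.Locale

// Hypothetical toy model of a case-insensitive, case-preserving rename.
object CaseInsensitiveRenameSketch {
  /**
   * `existing` is the stored spelling of the source directory and `target` is
   * the requested destination, one element per path segment. Returns the
   * spelling that ends up on disk under the assumption described above.
   */
  def rename(existing: Seq[String], target: Seq[String]): Seq[String] = {
    def lower(s: String): String = s.toLowerCase(Locale.ROOT)
    require(existing.nonEmpty && target.nonEmpty, "paths need at least one segment")
    // The sketch only covers a destination parent that matches ignoring case.
    require(
      existing.init.map(lower) == target.init.map(lower),
      "destination parent must match the source parent ignoring case")
    // Parent directories keep their stored spelling; only the leaf is renamed.
    existing.init :+ target.last
  }
}
```

Calling `rename(Seq("tbl", "a=5", "b=6"), Seq("tbl", "A=5", "B=6"))` yields `Seq("tbl", "a=5", "B=6")`, matching the `.../tbl/a=5/B=6` result described in the comment.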

… with upper-case by HiveExternalCatalog

### What changes were proposed in this pull request?

This PR reverts the changes made in #16700. Those changes could cause data loss after a partition rename, because we have a bug in the file renaming. Not all OSs behave the same way. For example, on macOS, renaming a path from `.../tbl/a=5/b=6` to `.../tbl/A=5/B=6` yields `.../tbl/a=5/B=6`, whereas the expected result is `.../tbl/A=5/B=6`. Thus, renaming on macOS is not recursive. However, the systems used in Jenkins do not have this issue. Although this PR is not the root cause, it exposes an existing issue in the code `tablePath.getFileSystem(hadoopConf).rename(wrongPath, rightPath)`.

---

The Hive metastore is not case-preserving and keeps partition column names in lowercase.

If Spark SQL creates a table with upper-case partition names through HiveExternalCatalog, renaming a partition first calls HiveClient's renamePartition, which creates a new lowercase partition path. Spark SQL then renames the lowercase path to the upper-case one.

However, if the renamed partition has more than one level, e.g. A=1/B=2, Hive's renamePartition changes it to a=1/b=2 and Spark SQL renames that to A=1/B=2, but the directory a=1 still exists in the file system. We should delete it as well.

### How was this patch tested?

N/A

Author: gatorsmile <gatorsmile@gmail.com>

Closes #16728 from gatorsmile/revert-pr-16700.

gatorsmile · 2017-01-28T21:35:30Z

Just revert it. Let us wait for a decision on how we plan to deal with the file renaming. My major concern is that errors in file renaming could cause data loss unless we introduce a robust rollback solution.

windpiger · 2017-02-08T01:12:32Z

Oh, sorry, I only saw your comments just now, when I saw PR #16837.

…pper-case by HiveExternalCatalog

## What changes were proposed in this pull request?

The Hive metastore is not case-preserving and keeps partition column names in lowercase.

If Spark SQL creates a table with upper-case partition names through HiveExternalCatalog, renaming a partition first calls HiveClient's renamePartition, which creates a new lowercase partition path. Spark SQL then renames the lowercase path to the upper-case one.

However, if the renamed partition has more than one level, e.g. A=1/B=2, Hive's renamePartition changes it to a=1/b=2 and Spark SQL renames that to A=1/B=2, but the directory a=1 still exists in the file system. We should delete it as well.

## How was this patch tested?

Unit test added.

Author: windpiger <songjun@outlook.com>

Closes apache#16700 from windpiger/clearUselessPathAfterRenamPartition.

windpiger added 2 commits January 25, 2017 15:44

[SPARK-19359][SQL]clear useless path after rename a partition with up…

6a8efdd

…per-case in HiveExternalCatalog

reset a tc

878d45e

modify a method name

5403595

gatorsmile reviewed Jan 26, 2017

add more tc and comment

40efce2

cloud-fan reviewed Jan 26, 2017

move func to another class

12acdc6

fix a style

7aba059

gatorsmile reviewed Jan 26, 2017

fix some comment

0136388

gatorsmile reviewed Jan 27, 2017

fix some comments

de4c409

gatorsmile reviewed Jan 28, 2017

asfgit closed this in 1b5ee20 Jan 28, 2017

viirya reviewed Jan 28, 2017

gatorsmile reviewed Jan 28, 2017

gatorsmile mentioned this pull request Jan 28, 2017

[SPARK-19359][SQL] Revert Clear useless path after rename a partition with upper-case by HiveExternalCatalog #16728

Closed
