Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-19359][SQL]clear useless path after rename a partition with upper-case by HiveExternalCatalog #16700

Closed

Conversation

windpiger
Copy link
Contributor

What changes were proposed in this pull request?

Hive metastore is not case preserving and keep partition columns with lower case names.

If SparkSQL create a table with upper-case partion name use HiveExternalCatalog, when we rename partition, it first call the HiveClient to renamePartition, which will create a new lower case partition path, then SparkSql rename the lower case path to the upper-case.

while if the renamed partition contains more than one depth partition ,e.g. A=1/B=2, hive renamePartition change to a=1/b=2, then SparkSql rename it to A=1/B=2, but the a=1 still exists in the filesystem, we should also delete it.

How was this patch tested?

unit test added

@SparkQA
Copy link

SparkQA commented Jan 25, 2017

Test build #71976 has started for PR 16700 at commit 878d45e.

@windpiger
Copy link
Contributor Author

retest this please

@srowen
Copy link
Member

srowen commented Jan 25, 2017

Just a general point that it seems odd to have a method whose name suggests it finds useless partition dirs. Can this be done more simply - is it not just a question of one extra delete somewhere?

@windpiger
Copy link
Contributor Author

thanks, what about getExtraPartPathCreatedByHive

@SparkQA
Copy link

SparkQA commented Jan 25, 2017

Test build #71978 has finished for PR 16700 at commit 878d45e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 25, 2017

Test build #71987 has finished for PR 16700 at commit 5403595.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@windpiger
Copy link
Contributor Author

cc @gatorsmile @cloud-fan

def getExtraPartPathCreatedByHive(
lowerCaseSpec: TablePartitionSpec,
partitionColumnNames: Seq[String],
tablePath: Path): Path = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix the indent issues.

@@ -120,6 +120,17 @@ object ExternalCatalogUtils {
new Path(totalPath, nextPartPath)
}
}

def getExtraPartPathCreatedByHive(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A general suggestion. When the function names become not self-descriptive. Write the function comments.

For example, here, please add comments for generatePartitionPath and getExtraPartPathCreatedByHive

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok ,thanks very much~ BTW, Happy Chinese New Year~

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy Chinese New Year

@gatorsmile
Copy link
Member

What happens if the partitioning columns have more than two columns?

@windpiger
Copy link
Contributor Author

the example showed A/B are two partition columns

@gatorsmile
Copy link
Member

Yeah, if we having three columns, does your solution resolve all the issues?

@windpiger
Copy link
Contributor Author

windpiger commented Jan 26, 2017

renamePartition:
A=1/B=2/C=3 -> A=4/B=5/C=6
path created by Hive after renamePartition:
/path/a=4/b=5/c=6
and SparkSQL rename it /path/A=4/B=5/C=6, and this pr will delete /path/a=4.

renamePartition:
a=1/B=2/C=3 -> a=4/B=5/C=6
path created by Hive after renamePartition:
/path/a=4/b=5/c=6
and SparkSQL rename it /path/a=4/B=5/C=6, and this pr will delete /path/a=4/b=5.

it will delete the first [upper-case partition col name] path which create by hive with lower-case path.

more tests added to cover this ,thanks~

@SparkQA
Copy link

SparkQA commented Jan 26, 2017

Test build #72017 has started for PR 16700 at commit 40efce2.

@windpiger
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Jan 26, 2017

Test build #72020 has finished for PR 16700 at commit 40efce2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* e.g. /path/A=1/B=2/C=3 rename to /path/A=4/B=5/C=6, the extra path returned is
* /path/a=4, which also include all its' child path, such as /path/a=4/b=2
*/
def getExtraPartPathCreatedByHive(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a hive only feature, why do we put it in ExternalCatalogUtils? How about the object HiveExternalCatalog?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks !I will move it

@SparkQA
Copy link

SparkQA commented Jan 26, 2017

Test build #72025 has finished for PR 16700 at commit 12acdc6.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 26, 2017

Test build #72026 has finished for PR 16700 at commit 7aba059.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

def getExtraPartPathCreatedByHive(
lowerCaseSpec: TablePartitionSpec,
partitionColumnNames: Seq[String],
tablePath: Path): Path = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indent issues.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh...sorry...thanks!

/**
* partition path created by Hive is lower-case, while Spark SQL will
* rename it with the partition name in partitionColumnNames, and this function
* return the extra lower-case path created by Hive, and then we can delete it.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return -> returns

* partition path created by Hive is lower-case, while Spark SQL will
* rename it with the partition name in partitionColumnNames, and this function
* return the extra lower-case path created by Hive, and then we can delete it.
* e.g. /path/A=1/B=2/C=3 rename to /path/A=4/B=5/C=6, the extra path returned is
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename to -> is changed to

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the extra path returned is -> this function returns

* rename it with the partition name in partitionColumnNames, and this function
* return the extra lower-case path created by Hive, and then we can delete it.
* e.g. /path/A=1/B=2/C=3 rename to /path/A=4/B=5/C=6, the extra path returned is
* /path/a=4, which also include all its' child path, such as /path/a=4/b=2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/path/a=4, which also include all its' child path, such as /path/a=4/b=2 -> /path/a=4

@@ -839,6 +839,25 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
spec.map { case (k, v) => partCols.find(_.equalsIgnoreCase(k)).get -> v }
}


/**
* partition path created by Hive is lower-case, while Spark SQL will
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

partition path -> The partition path
is lower-case -> is in lowercase

@@ -899,6 +918,21 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
spec, partitionColumnNames, tablePath)
try {
tablePath.getFileSystem(hadoopConf).rename(wrongPath, rightPath)

// if the newSpec contains more than one depth partitoin, FileSystem.rename just delete
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

partitoin -> partition
if -> If
delete -> deletes

@@ -899,6 +918,21 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
spec, partitionColumnNames, tablePath)
try {
tablePath.getFileSystem(hadoopConf).rename(wrongPath, rightPath)

// if the newSpec contains more than one depth partitoin, FileSystem.rename just delete
// only one path(wrongPath), we should check if wrongPath's parents need to be deleted.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only one path(wrongPath) -> the leaf (i.e., wrongPath)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot!

// 'a=1/b=2' in FileSystem is deleted, while 'a=1' is already exists,
// which should also be deleted
val delHivePartPathAfterRename = getExtraPartPathCreatedByHive(
lowerCasePartitionSpec(spec),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last comment: I prefer to calling lowerCasePartitionSpec in the func getExtraPartPathCreatedByHive.

// newSpec is 'A=1/B=2', after renamePartitions by Hive, the location path in FileSystem
// is changed to 'a=1/b=2', which is wrongPath, then we renamed to 'A=1/B=2', and
// 'a=1/b=2' in FileSystem is deleted, while 'a=1' is already exists,
// which should also be deleted
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about?

For example, give a newSpec 'A=1/B=2', after calling Hive's client.renamePartitions, the location path in FileSystem is changed to 'a=1/b=2', which is wrongPath. Then, although we renamed it to 'A=1/B=2', 'a=1/b=2' in FileSystem is deleted but 'a=1' still exists. We also need to delete the useless directory 'a=1'.

* The partition path created by Hive is in lowercase, while Spark SQL will
* rename it with the partition name in partitionColumnNames, and this function
* returns the extra lowercase path created by Hive, and then we can delete it.
* e.g. /path/A=1/B=2/C=3 is changed to /path/A=4/B=5/C=6, this function returns is
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

returns is -> returns

@gatorsmile
Copy link
Member

LGTM except three comments.

@SparkQA
Copy link

SparkQA commented Jan 27, 2017

Test build #72062 has finished for PR 16700 at commit 0136388.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 27, 2017

Test build #72066 has started for PR 16700 at commit de4c409.

@gatorsmile
Copy link
Member

retest this please

@SparkQA
Copy link

SparkQA commented Jan 27, 2017

Test build #72069 has finished for PR 16700 at commit de4c409.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

/**
* The partition path created by Hive is in lowercase, while Spark SQL will
* rename it with the partition name in partitionColumnNames, and this function
* returns the extra lowercase path created by Hive, and then we can delete it.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: all of them are commas. You need to use periods. : )

* The partition path created by Hive is in lowercase, while Spark SQL will
* rename it with the partition name in partitionColumnNames, and this function
* returns the extra lowercase path created by Hive, and then we can delete it.
* e.g. /path/A=1/B=2/C=3 is changed to /path/A=4/B=5/C=6, this function returns
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same issue here.


// If the newSpec contains more than one depth partition, FileSystem.rename just deletes
// the leaf(i.e. wrongPath), we should check if wrongPath's parents need to be deleted.
// For example, give a newSpec 'A=1/B=2', after calling Hive's client.renamePartitions,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

give -> given

// If the newSpec contains more than one depth partition, FileSystem.rename just deletes
// the leaf(i.e. wrongPath), we should check if wrongPath's parents need to be deleted.
// For example, give a newSpec 'A=1/B=2', after calling Hive's client.renamePartitions,
// the location path in FileSystem is changed to 'a=1/b=2', which is wrongPath, then
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

, then -> . Then

// the leaf(i.e. wrongPath), we should check if wrongPath's parents need to be deleted.
// For example, give a newSpec 'A=1/B=2', after calling Hive's client.renamePartitions,
// the location path in FileSystem is changed to 'a=1/b=2', which is wrongPath, then
// although we renamed it to 'A=1/B=2', 'a=1/b=2' in FileSystem is deleted, but 'a=1'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to use a period here.

@asfgit asfgit closed this in 1b5ee20 Jan 28, 2017
@gatorsmile
Copy link
Member

Thanks! Merging it to master.

You can fix the minor comments in your other PRs.

// the location path in FileSystem is changed to 'a=1/b=2', which is wrongPath, then
// although we renamed it to 'A=1/B=2', 'a=1/b=2' in FileSystem is deleted, but 'a=1'
// is still exists, which we also need to delete
val delHivePartPathAfterRename = getExtraPartPathCreatedByHive(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, could it possibly have multiple specs sharing the same parent directory, e.g., 'A=1/B=2', 'A=1/B=3', ...?

If so, when you delete the path 'a=1' here, in processing the next spec 'A=1/B=3', I think the rename will fail.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The path a=1 was created when you call client.renamePartitions, right? Based on my understanding, when you rename A=1/B=3, Hive will create the directory a=1 and a=1/b=3. Thus, the rename will not fail. Have you made a try?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

client.renamePartitions is called at the beginning of renamePartitions for all specs at once. It creates the directory a=1 and a=1/b=2 and a=1/b=3.

When you iterates specs and rename the directories with FileSystem.rename, in the first iteration, a=1/b=2 is renamed, and a=1 is deleted in this change, then a=1/b=3 will be deleted too. So in next iteration, the renaming of a=1/b=3 to A=1/B=3 will fail.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So far, the partition rename DDL we support is for a single pair of partition spec. That is, ALTER TABLE table PARTITION spec1 RENAME TO PARTITION spec2. This PR will not introduce a bug to end users.

However, your concern looks reasonable. I think we should not support the partition renaming for multiple partitions in a single DDL in the SessionCatalog and ExternalCatalog. It just makes the code more complex for error handling. Let me remove it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be worse. If we already have a partition A=1/B=2, and we rename some other partition to A=1/B=3, then we will have A=1/B=2 and a=1/b=3, and we have a lot of work to do, instead of just a renaming.

@@ -899,6 +919,21 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
spec, partitionColumnNames, tablePath)
try {
tablePath.getFileSystem(hadoopConf).rename(wrongPath, rightPath)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found an issue here... When we call rename, not all the file systems have the same behaviors. For example, on mac OS, when we doing this .../tbl/a=5/b=6 -> .../tbl/A=5/B=6 . The result is .../tbl/a=5/B=6. Thus, it is not recursive. However, the file system used in Jenkin does not have such an issue. You can hit this issue if you are using macOS. Thus, this fix causes an regression, but the bug is not in your fix.

asfgit pushed a commit that referenced this pull request Jan 28, 2017
… with upper-case by HiveExternalCatalog

### What changes were proposed in this pull request?

This PR is to revert the changes made in #16700. It could cause the data loss after partition rename, because we have a bug in the file renaming.

Not all the OSs have the same behaviors. For example, on mac OS, if we renaming a path from `.../tbl/a=5/b=6` to `.../tbl/A=5/B=6`. The result is `.../tbl/a=5/B=6`. The expected result is `.../tbl/A=5/B=6`. Thus, renaming on mac OS is not recursive. However, the systems used in Jenkin does not have such an issue. Although this PR is not the root cause, it exposes an existing issue on the code `tablePath.getFileSystem(hadoopConf).rename(wrongPath, rightPath)`

---

Hive metastore is not case preserving and keep partition columns with lower case names.

If SparkSQL create a table with upper-case partion name use HiveExternalCatalog, when we rename partition, it first call the HiveClient to renamePartition, which will create a new lower case partition path, then SparkSql rename the lower case path to the upper-case.

while if the renamed partition contains more than one depth partition ,e.g. A=1/B=2, hive renamePartition change to a=1/b=2, then SparkSql rename it to A=1/B=2, but the a=1 still exists in the filesystem, we should also delete it.

### How was this patch tested?
N/A

Author: gatorsmile <gatorsmile@gmail.com>

Closes #16728 from gatorsmile/revert-pr-16700.
@gatorsmile
Copy link
Member

Just revert it. Let us wait for the decision how we plan to deal with the file renaming. My major concern is the errors in file renaming could cause the data loss, unless we can introduce a robust rollback solution.

@windpiger
Copy link
Contributor Author

windpiger commented Feb 8, 2017

oh, sorry I see your comments just now... when I see the pr #16837

cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
…pper-case by HiveExternalCatalog

## What changes were proposed in this pull request?

Hive metastore is not case preserving and keep partition columns with lower case names.

If SparkSQL create a table with upper-case partion name use HiveExternalCatalog, when we rename partition, it first call the HiveClient to renamePartition, which will create a new lower case partition path, then SparkSql rename the lower case path to the upper-case.

while if the renamed partition contains more than one depth partition ,e.g. A=1/B=2, hive renamePartition change to a=1/b=2, then SparkSql rename it to A=1/B=2, but the a=1 still exists in the filesystem, we should also delete it.

## How was this patch tested?
unit test added

Author: windpiger <songjun@outlook.com>

Closes apache#16700 from windpiger/clearUselessPathAfterRenamPartition.
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
… with upper-case by HiveExternalCatalog

### What changes were proposed in this pull request?

This PR is to revert the changes made in apache#16700. It could cause the data loss after partition rename, because we have a bug in the file renaming.

Not all the OSs have the same behaviors. For example, on mac OS, if we renaming a path from `.../tbl/a=5/b=6` to `.../tbl/A=5/B=6`. The result is `.../tbl/a=5/B=6`. The expected result is `.../tbl/A=5/B=6`. Thus, renaming on mac OS is not recursive. However, the systems used in Jenkin does not have such an issue. Although this PR is not the root cause, it exposes an existing issue on the code `tablePath.getFileSystem(hadoopConf).rename(wrongPath, rightPath)`

---

Hive metastore is not case preserving and keep partition columns with lower case names.

If SparkSQL create a table with upper-case partion name use HiveExternalCatalog, when we rename partition, it first call the HiveClient to renamePartition, which will create a new lower case partition path, then SparkSql rename the lower case path to the upper-case.

while if the renamed partition contains more than one depth partition ,e.g. A=1/B=2, hive renamePartition change to a=1/b=2, then SparkSql rename it to A=1/B=2, but the a=1 still exists in the filesystem, we should also delete it.

### How was this patch tested?
N/A

Author: gatorsmile <gatorsmile@gmail.com>

Closes apache#16728 from gatorsmile/revert-pr-16700.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants