
[SPARK-19833][SQL] Remove SQLConf.HIVE_VERIFY_PARTITION_PATH, always return empty when the location does not exist #17176

Closed
wants to merge 6 commits

Conversation


@windpiger (Contributor) commented Mar 6, 2017

What changes were proposed in this pull request?

In SPARK-5068 we introduced the SQL conf spark.sql.hive.verifyPartitionPath: when set to true, it avoids task failures when a partition location does not exist in the filesystem.

This situation should always return an empty result instead of failing the task, so this PR removes the conf.

In addition, verifyPartitionPath has a bug: when a partition has a custom location, it still filters every partition path in the partitionToDeserializer parameter, and it can scan paths that do not belong to the table. For example, if the custom location is /root/a and the partition spec is b=1/c=2, getPathPatternByPath causes a scan of / (see the sketch below).
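To make the escape concrete, here is a minimal sketch of the kind of glob construction getPathPatternByPath performs. This is a simplification for illustration, not the verbatim Spark code; toPathPattern is a hypothetical helper name.

// Hypothetical simplification of getPathPatternByPath: replace the last
// `parNum` path components of the partition location with "*".
def toPathPattern(parNum: Int, partPath: String): String = {
  val parts = partPath.stripPrefix("/").split("/").toSeq
  val kept = parts.dropRight(parNum) // assumed to be the table root
  "/" + (kept ++ Seq.fill(parNum)("*")).mkString("/")
}

toPathPattern(2, "/warehouse/tbl/b=1/c=2") // "/warehouse/tbl/*/*" -- stays under the table
toPathPattern(2, "/root/a")                // "/*/*"               -- the glob escapes to "/"

For the default layout the wildcards stay under the table root, but for a custom location as shallow as /root/a the same arithmetic strips everything, and the resulting glob is evaluated from the filesystem root.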

How was this patch tested?

Modified an existing test case.


SparkQA commented Mar 6, 2017

Test build #73991 has finished for PR 17176 at commit 95aa931.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// convert /demo/data/year/month/day to /demo/data/*/*/*/
def getPathPatternByPath(parNum: Int, tempPath: Path, partitionName: String): String = {
  // if the partition path does not end with partition name, we should not …
@windpiger (Contributor Author) commented:

If the partition location has been altered to another location, we should not apply this pattern, or we will list files that do not belong to the partition (see the sketch below).
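A minimal sketch of the guard this comment argues for, assuming partitionName is the b=1/c=2-style subpath; shouldUsePattern is a hypothetical helper name, not the PR's code.

// Only glob when the partition location still ends with the expected
// partition subpath; an altered/custom location must be checked directly.
def shouldUsePattern(partPath: String, partitionName: String): Boolean =
  partPath.stripSuffix("/").endsWith("/" + partitionName)

shouldUsePattern("/warehouse/tbl/b=1/c=2", "b=1/c=2") // true  -> safe to glob
shouldUsePattern("/root/a", "b=1/c=2")                // false -> skip pattern matching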


SparkQA commented Mar 6, 2017

Test build #73998 has finished for PR 17176 at commit 8128567.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Mar 6, 2017

Test build #73992 has finished for PR 17176 at commit 4bb0e28.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@windpiger (Contributor Author) commented:

retest this please


SparkQA commented Mar 6, 2017

Test build #74016 has finished for PR 17176 at commit 8128567.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@windpiger (Contributor Author) commented:

Why did Jenkins fail...


SparkQA commented Mar 7, 2017

Test build #74072 has finished for PR 17176 at commit 22b1f53.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -159,36 +159,11 @@ class HadoopTableReader(
def verifyPartitionPath(
    partitionToDeserializer: Map[HivePartition, Class[_ <: Deserializer]]):
    Map[HivePartition, Class[_ <: Deserializer]] = {
  if (!sparkSession.sessionState.conf.verifyPartitionPath) {
@windpiger (Contributor Author) commented:

After PR https://github.com/apache/spark/pull/17187, reading a Hive table that does not use STORED BY will no longer go through HiveTableScanExec.

This function has a bug when the partition path is a custom path:

  1. it still filters every partition path in the partitionToDeserializer parameter;
  2. it scans paths that do not belong to the table, e.g. if the custom path is /root/a and the partition spec is b=1/c=2, getPathPatternByPath causes a scan of / (see the sketch after this list).
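The following is a simplified outline of the conf-guarded flow this PR removes, under assumptions: it is condensed from the shape of HadoopTableReader.verifyPartitionPath, not copied verbatim, and existingPartitionPaths and toPattern are hypothetical names.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Simplified shape of the removed logic: glob each partition's pattern and
// keep only partitions whose own location appears in the matched listing.
// With a custom partition location the pattern can cover paths far outside
// the table, which is the bug described above.
def existingPartitionPaths(
    partPaths: Seq[Path],
    toPattern: Path => String, // e.g. the getPathPatternByPath logic
    hadoopConf: Configuration): Seq[Path] = {
  partPaths.filter { partPath =>
    val fs = partPath.getFileSystem(hadoopConf)
    val matched = fs.globStatus(new Path(toPattern(partPath)))
    matched != null && matched.exists(_.getPath.toString == partPath.toString)
  }
}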


SparkQA commented Mar 7, 2017

Test build #74106 has finished for PR 17176 at commit 262e2f2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Mar 7, 2017

Test build #74107 has finished for PR 17176 at commit 3a15e5d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member) commented:

@windpiger If you do not have the bandwidth to continue this, how about closing it now?

case (partition, partDeserializer) =>
  val partPath = partition.getDataLocation
  val fs = partPath.getFileSystem(hadoopConf)
  fs.exists(partPath)

Review comment:

Each partition issuing a separate RPC request to the NameNode can result in poor performance. A possible batching approach is sketched below.
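One direction for mitigating the per-partition RPC cost, sketched under assumptions (this is not part of the PR, and existingPaths is a hypothetical helper): group partition locations by parent directory and issue one listStatus per parent instead of one exists() per partition.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hypothetical batching helper: one listStatus RPC per parent directory
// replaces one exists() RPC per partition.
def existingPaths(paths: Seq[Path], hadoopConf: Configuration): Set[Path] =
  paths.groupBy(_.getParent).flatMap { case (parent, children) =>
    val fs = parent.getFileSystem(hadoopConf)
    if (!fs.exists(parent)) {
      Nil // parent gone: none of its partitions exist
    } else {
      // One call lists every child; qualify paths so they compare equal.
      val present = fs.listStatus(parent).map(_.getPath).toSet
      children.filter(p => present.contains(fs.makeQualified(p)))
    }
  }.toSet

Partitions that share a directory now cost two NameNode calls in total instead of one per partition; widely scattered custom locations still degrade to roughly one call each.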
