
[Spark-5068][SQL]Fix bug query data when path doesn't exist for HiveContext #5059

Closed
wants to merge 6 commits

Conversation

lazyman500
Contributor

This PR follows up on PRs #3907, #3891, and #4356.
Per @marmbrus and @liancheng's comments, it uses fs.globStatus to retrieve all FileStatus objects under the path(s), and then does the filtering locally:

[1]. Build a pathPattern from each partition path and add it to pathPatternSet. (hdfs://cluster/user/demo/2016/08/12 -> hdfs://cluster/user/demo/*/*/*)
[2]. Retrieve all matching FileStatus objects and cache them by updating existPathSet.
[3]. Do the filtering locally against existPathSet.
[4]. If a new pathPattern appears, repeat steps 1 and 2. (an external table may have more than one partition pathPattern)
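The steps above can be sketched in plain Scala, without the Hadoop API. `toPathPattern` and the sample paths below are illustrative, not the PR's actual code; step 2 would call `fs.globStatus` on the pattern and cache every matched path.

```scala
// Step 1: replace each partition-value segment under the table root with a
// wildcard, so one globStatus call can cover many partitions at once.
def toPathPattern(tableRoot: String, partitionPath: String): String = {
  val suffix = partitionPath.stripPrefix(tableRoot).stripPrefix("/")
  val depth = suffix.split("/").length
  tableRoot + "/" + Seq.fill(depth)("*").mkString("/")
}

// Step 2 (stood in here by a precomputed set): fs.globStatus(pattern)
// would return the FileStatus objects whose paths populate existPathSet.
val existPathSet = Set(
  "hdfs://cluster/user/demo/2016/08/11",
  "hdfs://cluster/user/demo/2016/08/12"
)

// Step 3: filter partitions locally against the cached matches, instead of
// issuing one remote existence check per partition.
val partitions = Seq(
  "hdfs://cluster/user/demo/2016/08/12", // exists
  "hdfs://cluster/user/demo/2016/08/13"  // missing on HDFS
)
val kept = partitions.filter(existPathSet.contains)
```

The payoff is one glob call per pattern rather than one filesystem round trip per partition.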

@chenghao-intel @jeanlyn

@marmbrus
Contributor

ok to test

@SparkQA

SparkQA commented Mar 17, 2015

Test build #28738 has finished for PR 5059 at commit 04c443c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val hivePartitionRDDs = partitionToDeserializer.map { case (partition, partDeserializer) =>
// SPARK-5068: get FileStatus and do the filtering locally when the path does not exist

var existPathSet =collection.mutable.Set[String]()
Contributor

space after =

Contributor

also the indent is off

@marmbrus
Contributor

Lots of style comments. Please check out: https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide

Also, while this seems better, it could have a non-trivial performance penalty. Maybe we should have a config flag to turn it off?

@SparkQA

SparkQA commented Mar 18, 2015

Test build #28791 has finished for PR 5059 at commit 47e0023.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 18, 2015

Test build #28795 has finished for PR 5059 at commit f23133f.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@lazyman500
Contributor Author

Thanks for your review.
I have added a config flag. But how will users learn about this config? Do I need to add some documentation? @marmbrus

@SparkQA

SparkQA commented Mar 18, 2015

Test build #28796 has finished for PR 5059 at commit e1d6386.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

def verifyPartitionPath(
    partitionToDeserializer: Map[HivePartition, Class[_ <: Deserializer]]):
  Map[HivePartition, Class[_ <: Deserializer]] = {
  if (!sc.getConf("spark.sql.hive.verifyPartitionPath", "true").toBoolean) {
Contributor

Can you move this into SQLConf?
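A hypothetical sketch of what "moving this into SQLConf" could look like, in the Spark 1.x style of a string key plus a typed accessor. `SQLConfSketch` and its members are illustrative stand-ins, not Spark's actual code; only the key name comes from this PR.

```scala
object SQLConfSketch {
  val HIVE_VERIFY_PARTITION_PATH = "spark.sql.hive.verifyPartitionPath"

  // Stand-ins for SQLConf's settings map and getConf(key, default).
  private val settings = scala.collection.mutable.Map[String, String]()
  def setConf(key: String, value: String): Unit = settings(key) = value
  def getConf(key: String, default: String): String =
    settings.getOrElse(key, default)

  // Typed accessor, so call sites read `conf.verifyPartitionPath`
  // instead of fetching and parsing the raw string in place.
  def verifyPartitionPath: Boolean =
    getConf(HIVE_VERIFY_PARTITION_PATH, "true").toBoolean
}
```

Centralizing the key also gives every call site the same default and documents the flag in one place.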

@marmbrus
Contributor

marmbrus commented Apr 3, 2015

Thanks for working on this! Only two minor comments, then I think we can probably commit this. We can probably leave the flag undocumented for now and only add it to the docs if we see real world cases where it makes a performance difference.

val pathPattern = new Path(pathPatternStr)
val fs = pathPattern.getFileSystem(sc.hiveconf)
val matches = fs.globStatus(pathPattern)
matches.map(fileStatus => existPathSet += fileStatus.getPath.toString)
Contributor

Use foreach?
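The reviewer's point: `map` builds and discards a result collection when the body runs only for its side effect, while `foreach` states that intent directly. A minimal sketch, with illustrative paths standing in for the `globStatus` results:

```scala
// Mutable set of paths that actually exist, as in the PR.
val existPathSet = scala.collection.mutable.Set[String]()

// Stand-ins for fileStatus.getPath.toString values from globStatus.
val matchedPaths = Seq(
  "hdfs://cluster/user/demo/2016/08/11",
  "hdfs://cluster/user/demo/2016/08/12"
)

// Side-effecting update: foreach, not map, since no result is needed.
matchedPaths.foreach(path => existPathSet += path)
```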

@SparkQA

SparkQA commented Apr 7, 2015

Test build #29776 has finished for PR 5059 at commit 5bfcbfd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@lazyman500
Contributor Author

Thanks for the suggestions! @chenghao-intel @marmbrus
I have addressed them.

@asfgit asfgit closed this in 1f39a61 Apr 12, 2015
@moustaki

@marmbrus On your earlier documentation point, I was thinking there might be genuine cases where the actual location of a partition does not match the expected partition structure, and where this check needs to be disabled.

The case I just encountered was a unit test which was adding as a partition a randomly generated temp directory using ALTER TABLE / ADD PARTITION, which leads to an NPE in https://github.com/apache/spark/pull/5059/files#diff-8887a877bd52611df9aea06ccfe3a2d7R166 when running a SELECT against the corresponding table.

Disabling the check using that configuration flag gets rid of that exception. Really not sure how common that use-case is though.
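For reference, a sketch of disabling the check via the flag this PR adds; `hiveContext` is assumed to be an existing HiveContext instance, and only the key name comes from the PR.

```scala
// Turn off partition-path verification for this session.
hiveContext.setConf("spark.sql.hive.verifyPartitionPath", "false")
```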

@marmbrus
Contributor

We turned this off by default in Spark 1.5 as it was causing problems similar to what you saw.

@moustaki

Perfect, thanks!
