
[Spark-5068][SQL]Fix bug query data when path doesn't exist for HiveContext #5059

Closed
wants to merge 6 commits

Conversation

lazyman500
Contributor

This PR follows up on PRs #3907, #3891, and #4356.
Per @marmbrus and @liancheng's comments, it uses fs.globStatus to retrieve all FileStatus objects under the path(s), and then does the filtering locally:

[1]. Build a pathPattern from each partition path and add it to pathPatternSet. (hdfs://cluster/user/demo/2016/08/12 -> hdfs://cluster/user/demo/*/*/*)
[2]. Retrieve all matching FileStatus objects and cache them by updating existPathSet.
[3]. Do the filtering locally against existPathSet.
[4]. If a new pathPattern appears, repeat steps 1 and 2. (an external table may have more than one partition pathPattern)
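The steps above can be sketched in plain Scala, without the Hadoop API. `toPathPattern` and the sample paths below are illustrative, not the PR's actual code; step 2 would call `fs.globStatus` on the pattern and cache every matched path.

```scala
// Step 1: replace each partition-value segment under the table root with a
// wildcard, so one globStatus call can cover many partitions at once.
def toPathPattern(tableRoot: String, partitionPath: String): String = {
  val suffix = partitionPath.stripPrefix(tableRoot).stripPrefix("/")
  val depth = suffix.split("/").length
  tableRoot + "/" + Seq.fill(depth)("*").mkString("/")
}

// Step 2 (stood in here by a precomputed set): fs.globStatus(pattern)
// would return the FileStatus objects whose paths populate existPathSet.
val existPathSet = Set(
  "hdfs://cluster/user/demo/2016/08/11",
  "hdfs://cluster/user/demo/2016/08/12"
)

// Step 3: filter partitions locally against the cached matches, instead of
// issuing one remote existence check per partition.
val partitions = Seq(
  "hdfs://cluster/user/demo/2016/08/12", // exists
  "hdfs://cluster/user/demo/2016/08/13"  // missing on HDFS
)
val kept = partitions.filter(existPathSet.contains)
```

The payoff is one glob call per pattern rather than one filesystem round trip per partition.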

@chenghao-intel @jeanlyn

@marmbrus
Contributor

ok to test

@SparkQA

SparkQA commented Mar 17, 2015

Test build #28738 has finished for PR 5059 at commit 04c443c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val hivePartitionRDDs = partitionToDeserializer.map { case (partition, partDeserializer) =>
// SPARK-5068: get FileStatus and do the filtering locally when the path does not exist

var existPathSet =collection.mutable.Set[String]()
Contributor

space after =

Contributor

also the indent is off

@marmbrus
Contributor

Lots of style comments. Please check out: https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide

Also, while this seems better, it could have a non-trivial performance penalty. Maybe we should have a config flag to turn it off?

@SparkQA

SparkQA commented Mar 18, 2015

Test build #28791 has finished for PR 5059 at commit 47e0023.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 18, 2015

Test build #28795 has finished for PR 5059 at commit f23133f.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@lazyman500
Contributor Author

Thanks for your review.
I have added a config flag. But how will users learn about this config? Do I need to add some documentation? @marmbrus

@SparkQA

SparkQA commented Mar 18, 2015

Test build #28796 has finished for PR 5059 at commit e1d6386.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

def verifyPartitionPath(
    partitionToDeserializer: Map[HivePartition, Class[_ <: Deserializer]]):
  Map[HivePartition, Class[_ <: Deserializer]] = {
  if (!sc.getConf("spark.sql.hive.verifyPartitionPath", "true").toBoolean) {
Contributor

Can you move this into SQLConf?
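A hypothetical sketch of what "moving this into SQLConf" could look like, in the Spark 1.x style of a string key plus a typed accessor. `SQLConfSketch` and its members are illustrative stand-ins, not Spark's actual code; only the key name comes from this PR.

```scala
object SQLConfSketch {
  val HIVE_VERIFY_PARTITION_PATH = "spark.sql.hive.verifyPartitionPath"

  // Stand-ins for SQLConf's settings map and getConf(key, default).
  private val settings = scala.collection.mutable.Map[String, String]()
  def setConf(key: String, value: String): Unit = settings(key) = value
  def getConf(key: String, default: String): String =
    settings.getOrElse(key, default)

  // Typed accessor, so call sites read `conf.verifyPartitionPath`
  // instead of fetching and parsing the raw string in place.
  def verifyPartitionPath: Boolean =
    getConf(HIVE_VERIFY_PARTITION_PATH, "true").toBoolean
}
```

Centralizing the key also gives every call site the same default and documents the flag in one place.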

@marmbrus
Contributor

marmbrus commented Apr 3, 2015

Thanks for working on this! Only two minor comments, then I think we can probably commit this. We can probably leave the flag undocumented for now and only add it to the docs if we see real world cases where it makes a performance difference.

val pathPattern = new Path(pathPatternStr)
val fs = pathPattern.getFileSystem(sc.hiveconf)
val matches = fs.globStatus(pathPattern)
matches.map(fileStatus => existPathSet += fileStatus.getPath.toString)
Contributor

Use foreach?
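The reviewer's point: `map` builds and discards a result collection when the body runs only for its side effect, while `foreach` states that intent directly. A minimal sketch, with illustrative paths standing in for the `globStatus` results:

```scala
// Mutable set of paths that actually exist, as in the PR.
val existPathSet = scala.collection.mutable.Set[String]()

// Stand-ins for fileStatus.getPath.toString values from globStatus.
val matchedPaths = Seq(
  "hdfs://cluster/user/demo/2016/08/11",
  "hdfs://cluster/user/demo/2016/08/12"
)

// Side-effecting update: foreach, not map, since no result is needed.
matchedPaths.foreach(path => existPathSet += path)
```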

@SparkQA

SparkQA commented Apr 7, 2015

Test build #29776 has finished for PR 5059 at commit 5bfcbfd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@lazyman500
Contributor Author

Thanks for the suggestions! @chenghao-intel @marmbrus
I have addressed them.

@asfgit asfgit closed this in 1f39a61 Apr 12, 2015
@moustaki

@marmbrus On your earlier documentation point, I was thinking there might be genuine cases where the actual location of a partition does not match the expected partition structure, and where this check needs to be disabled.

The case I just encountered was a unit test which was adding as a partition a randomly generated temp directory using ALTER TABLE / ADD PARTITION, which leads to an NPE in https://github.com/apache/spark/pull/5059/files#diff-8887a877bd52611df9aea06ccfe3a2d7R166 when running a SELECT against the corresponding table.

Disabling the check using that configuration flag gets rid of that exception. Really not sure how common that use-case is though.
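For reference, a sketch of disabling the check via the flag this PR adds; `hiveContext` is assumed to be an existing HiveContext instance, and only the key name comes from the PR.

```scala
// Turn off partition-path verification for this session.
hiveContext.setConf("spark.sql.hive.verifyPartitionPath", "false")
```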

@marmbrus
Contributor

We turned this off by default in Spark 1.5 as it was causing problems similar to what you saw.

@moustaki

Perfect, thanks!
