Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-11102] [SQL] Uninformative exception when specifing non-exist #9490

Closed
wants to merge 11 commits into from

Conversation

zjffdu
Copy link
Contributor

@zjffdu zjffdu commented Nov 5, 2015

Rebase the code base and create another PR, and close the previous PR #9223
Paste comment from last PR here to give more context

The JsonRelation has another special case that it allow input paths as empty when it takes RDD[String] as input. So HadoopFsRelation can not assume all its sub classes must have non-empty input paths. Maybe need to add another flag "allowEmptyInputPaths" in HadoopFsRelation and allow its implementation to decide that.

@srowen
Copy link
Member

srowen commented Nov 5, 2015

CC @chenghao-intel

@SparkQA
Copy link

SparkQA commented Nov 5, 2015

Test build #45111 has finished for PR 9490 at commit 7a14b8c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Copy link
Contributor

rxin commented Nov 6, 2015

cc @liancheng

@SparkQA
Copy link

SparkQA commented Nov 6, 2015

Test build #45178 has finished for PR 9490 at commit 63ebcdf.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zjffdu
Copy link
Contributor Author

zjffdu commented Nov 6, 2015

Looks like the failed test is not related. @liancheng Please help review it. Thanks

@marmbrus
Copy link
Contributor

test this please

@SparkQA
Copy link

SparkQA commented Nov 10, 2015

Test build #45500 has finished for PR 9490 at commit 63ebcdf.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 11, 2015

Test build #45590 has finished for PR 9490 at commit 8f7ebc6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 11, 2015

Test build #45606 has finished for PR 9490 at commit 219db87.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zjffdu
Copy link
Contributor Author

zjffdu commented Nov 11, 2015

Comments for the change:
The case of empty or non-exist inputs is a little tricky. Here's the several cases I summarize

  • Only parse the inputs at execution stage. e.g. TextRelation
  • Need parse the inputs at analysis stage. e.g. JsonRelation, ParquetRelation & OrcRelation
  • Don't need to parse the inputs if the schema is provided. (when creating table) e.g. ParquetRelation & OrcRelation
  • Empty is also valid. e.g. JsonRelation can accept RDD[String] rather than from hdfs
  • Empty inputs is valid for creating table.

So for these cases, I do the following changes

  • Add 2 api in HadoopFsRelation. sub classes can override it. Now only JsonRelation will override it.
    ** def inputExists: Boolean = fileStatusCache.inputExists
    ** def readFromHDFS: Boolean = true
  • If the inputs are only empty directories, it should be valid, just return EmptyRDD
  • If the inputs are not-existed directories/files, it is invalid, just throw exception.
  • If it needs to parse data in the analysis stage, it is sub classes' responsibility to check whether inputs is empty or not. Parent class (HadoopFsRelation) only check the inputs at execution stage.

@@ -431,7 +436,7 @@ abstract class HadoopFsRelation private[sql](maybePartitionSpec: Option[Partitio
val hdfsPath = new Path(path)
val fs = hdfsPath.getFileSystem(hadoopConf)
val qualified = hdfsPath.makeQualified(fs.getUri, fs.getWorkingDirectory)

inputExists = inputExists && fs.exists(qualified)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be quite expensive since each fs.exists(qualified) call invokes FileSystem.getFileStatus(), which is an RPC call. On the other hand, we've already called fs.listStatus(qualified) below. Would be better to merge these two.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another issue is that this block doesn't handle the case of parallel file listing (the other if block above).

@liancheng
Copy link
Contributor

Hey @zjffdu, are you still working on this?

@zjffdu
Copy link
Contributor Author

zjffdu commented Nov 27, 2015

@liancheng sorry for late response, will try to make a new commit this weekend

@SparkQA
Copy link

SparkQA commented Nov 27, 2015

Test build #46809 has finished for PR 9490 at commit 9dd1801.

  • This patch fails Scala style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 27, 2015

Test build #46810 has finished for PR 9490 at commit d8e7f5c.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 27, 2015

Test build #46821 has finished for PR 9490 at commit 9e67a3d.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@yhuai
Copy link
Contributor

yhuai commented Nov 29, 2015

@zjffdu I feel it is not a good idea to change lots of code just to get a better error message. Is it possible to have a small change to achieve the goal?

@zjffdu
Copy link
Contributor Author

zjffdu commented Nov 30, 2015

@yhuai Only changing JSONRelation/TextRelation can be a short-term solution. But since this is a general issue for HadoopFsRelation. so it would be better to do it in HadoopFsRelation. Otherwise we may meet same issue when we have new HadoopFsRelation implemention or user create a custom HadoopFsRelation.

@zjffdu
Copy link
Contributor Author

zjffdu commented Dec 7, 2015

@yhuai @liancheng any more comments ?

@yhuai
Copy link
Contributor

yhuai commented Dec 7, 2015

@zjffdu How about we revisit it after we release 1.6?

@zjffdu
Copy link
Contributor Author

zjffdu commented Dec 7, 2015

@yhuai Sure, np

@srowen
Copy link
Member

srowen commented Feb 11, 2016

Please close this PR

@HyukjinKwon
Copy link
Member

ping @zjffdu

@asfgit asfgit closed this in 6acc72a Apr 23, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
8 participants