
[SPARK-16126] [SQL] Better Error Message When using DataFrameReader without path #13837

Closed · wants to merge 10 commits

Conversation

gatorsmile (Member)

What changes were proposed in this pull request?

When users do not specify the path in the DataFrameReader APIs, they can get a confusing error message. For example,

spark.read.json()

Error message:

Unable to infer schema for JSON at . It must be specified manually;

After the fix, the error message will be like:

'path' is not specified
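For illustration, the fail-fast behavior can be sketched as follows. This is a hedged, self-contained sketch, not the actual patch (the real check lives in DataSource, shown in the diff further down); `resolveSchemaOrFail` is a hypothetical stand-in, not a Spark API.

```scala
// Illustrative sketch: raise a clear IllegalArgumentException when no
// input path is given, instead of failing later during schema inference.
// `resolveSchemaOrFail` is a hypothetical stand-in, not Spark API.
def resolveSchemaOrFail(paths: Seq[String]): String = {
  if (paths.isEmpty) {
    throw new IllegalArgumentException("'path' is not specified")
  }
  s"schema inferred from ${paths.mkString(", ")}"
}
```

The point of the change is only to move the failure earlier, to the argument check, so the message names the missing argument rather than a schema-inference step.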

Another major goal of this PR is to add test cases for the latest changes in #13727.

  • orc read APIs
  • illegal format name
  • save API - empty path or illegal path
  • load API - empty path
  • illegal compression
  • fixed an existing test case that verifies all-column partitioning is prevented

How was this patch tested?

Test cases are added.

@gatorsmile gatorsmile changed the title [SPARK-16126] [SQL] Better Message When using DataFrameReader without path [SPARK-16126] [SQL] Better Error Message When using DataFrameReader without path Jun 22, 2016
@SparkQA

SparkQA commented Jun 22, 2016

Test build #61016 has finished for PR 13837 at commit a1ae724.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class OrcSourceSuite extends OrcSuite

@SparkQA

SparkQA commented Jun 22, 2016

Test build #61044 has started for PR 13837 at commit 635046a.

@gatorsmile
Member Author

Weird. How do I stop this test run?

@gatorsmile
Member Author

retest this please

@SparkQA

SparkQA commented Jun 23, 2016

Test build #61074 has finished for PR 13837 at commit 635046a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

cc @tdas @zsxwing Could you review this PR? It adds the test cases for #13727.

Thanks!

@@ -40,7 +40,7 @@ private[sql] class ParquetOptions(
   if (!shortParquetCompressionCodecNames.contains(codecName)) {
     val availableCodecs = shortParquetCompressionCodecNames.keys.map(_.toLowerCase)
     throw new IllegalArgumentException(s"Codec [$codecName] " +
-      s"is not available. Available codecs are ${availableCodecs.mkString(", ")}.")
+      s"is not available. Known codecs are ${availableCodecs.mkString(", ")}.")
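To make the behavior of this check concrete, here is a hedged, self-contained sketch of the codec-name validation. The map contents mirror Parquet's short codec names but are reproduced here only for illustration; `checkCodec` is a hypothetical stand-in for the validation inside ParquetOptions.

```scala
// Illustrative sketch of the codec validation in the diff above.
// `checkCodec` is a hypothetical stand-in, not the actual Spark method.
val shortParquetCompressionCodecNames: Map[String, String] = Map(
  "none"         -> "UNCOMPRESSED",
  "uncompressed" -> "UNCOMPRESSED",
  "snappy"       -> "SNAPPY",
  "gzip"         -> "GZIP",
  "lzo"          -> "LZO")

def checkCodec(codecName: String): String = {
  val key = codecName.toLowerCase
  if (!shortParquetCompressionCodecNames.contains(key)) {
    val availableCodecs = shortParquetCompressionCodecNames.keys.map(_.toLowerCase)
    throw new IllegalArgumentException(s"Codec [$codecName] " +
      s"is not available. Known codecs are ${availableCodecs.mkString(", ")}.")
  }
  shortParquetCompressionCodecNames(key)
}
```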
Contributor

why this change?

Member Author

Just to make it consistent with the output of the other cases. See the code:

case e: ClassNotFoundException =>
throw new IllegalArgumentException(s"Codec [$codecName] " +
s"is not available. Known codecs are ${shortCompressionCodecNames.keys.mkString(", ")}.")

@HyukjinKwon (Member) commented Nov 14, 2016

Available was intentionally used because Parquet only supports snappy, gzip, or lzo, whereas Known was used for the text-based sources (please see #10805 (comment)): those support compression codecs beyond the ones in the map, so the message lists only the known ones.

@gatorsmile gatorsmile closed this Aug 22, 2016
@gatorsmile gatorsmile reopened this Nov 12, 2016
@SparkQA

SparkQA commented Nov 12, 2016

Test build #68556 has finished for PR 13837 at commit 635046a.

  • This patch fails MiMa tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 12, 2016

Test build #68567 has finished for PR 13837 at commit b6bdf92.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 13, 2016

Test build #68579 has finished for PR 13837 at commit 4511037.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -2684,8 +2684,7 @@ test_that("Call DataFrameWriter.load() API in Java without path and check argume
   # It makes sure that we can omit path argument in read.df API and then it calls
   # DataFrameWriter.load() without path.
   expect_error(read.df(source = "json"),
-               paste("Error in loadDF : analysis error - Unable to infer schema for JSON at .",
-                     "It must be specified manually"))
+               paste("Error in loadDF : illegal argument - 'path' is not specified"))
Member

I recall this test was intentionally testing without the path argument?
cc @HyukjinKwon

@HyukjinKwon (Member) commented Nov 14, 2016

Thanks for cc'ing me. Yes, I did. The changes seem reasonable, as this check applies to the data sources that need a path.

@@ -322,6 +323,9 @@ case class DataSource(
   val equality = sparkSession.sessionState.conf.resolver
   StructType(schema.filterNot(f => partitionColumns.exists(equality(_, f.name))))
 }.orElse {
+  if (allPaths.isEmpty && !format.isInstanceOf[TextFileFormat]) {
@HyukjinKwon (Member) commented Nov 14, 2016

Hi @gatorsmile, would it be better to explain here that the text data source is excluded because it always uses a schema consisting of a single string field when the schema is not explicitly given?

BTW, should we maybe change text.TextFileFormat to TextFileFormat https://github.com/gatorsmile/spark/blob/45110370fb1889f244a6750ef2a18dbc9f1ba9c2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L139 ?
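The exemption discussed above can be sketched as follows. The types here are simplified stand-ins for illustration, not Spark's actual FileFormat hierarchy: the text source keeps its fixed single-string-column schema ("value") even without an explicit schema, while other sources fail fast when no paths are given.

```scala
// Simplified stand-ins for illustration only (not Spark's real types).
sealed trait FileFormat
case object TextFileFormat extends FileFormat
case object JsonFileFormat extends FileFormat

def inferredSchema(format: FileFormat, allPaths: Seq[String]): Seq[String] =
  format match {
    // text always falls back to a single string column ("value")
    case TextFileFormat => Seq("value: string")
    case _ if allPaths.isEmpty =>
      throw new IllegalArgumentException("'path' is not specified")
    case _ => Seq(s"schema inferred from ${allPaths.head}")
  }
```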

@felixcheung
Member

hi - where are we on this one?

@HyukjinKwon
Member

(gentle ping)
