
[SPARK-16126] [SQL] Better Error Message When using DataFrameReader without path #13837

Closed · wants to merge 10 commits

Conversation

gatorsmile (Member)

What changes were proposed in this pull request?

When users do not specify the path in the DataFrameReader APIs, they can get a confusing error message. For example,

spark.read.json()

Error message:

Unable to infer schema for JSON at . It must be specified manually;

After the fix, the error message will be like:

'path' is not specified
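For illustration, the fail-fast behavior can be sketched as follows. This is a hedged, self-contained sketch, not the actual patch (the real check lives in DataSource, shown in the diff further down); `resolveSchemaOrFail` is a hypothetical stand-in, not a Spark API.

```scala
// Illustrative sketch: raise a clear IllegalArgumentException when no
// input path is given, instead of failing later during schema inference.
// `resolveSchemaOrFail` is a hypothetical stand-in, not Spark API.
def resolveSchemaOrFail(paths: Seq[String]): String = {
  if (paths.isEmpty) {
    throw new IllegalArgumentException("'path' is not specified")
  }
  s"schema inferred from ${paths.mkString(", ")}"
}
```

The point of the change is only to move the failure earlier, to the argument check, so the message names the missing argument rather than a schema-inference step.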

Another major goal of this PR is to add test cases for the latest changes in #13727.

  • orc read APIs
  • illegal format name
  • save API - empty path or illegal path
  • load API - empty path
  • illegal compression
  • fixed an existing test case that verifies all-column partitioning is prevented

How was this patch tested?

Test cases are added.

@gatorsmile gatorsmile changed the title [SPARK-16126] [SQL] Better Message When using DataFrameReader without path [SPARK-16126] [SQL] Better Error Message When using DataFrameReader without path Jun 22, 2016
@SparkQA

SparkQA commented Jun 22, 2016

Test build #61016 has finished for PR 13837 at commit a1ae724.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class OrcSourceSuite extends OrcSuite

@SparkQA

SparkQA commented Jun 22, 2016

Test build #61044 has started for PR 13837 at commit 635046a.

@gatorsmile
Member Author

Weird. How do I stop this test run?

@gatorsmile
Member Author

retest this please

@SparkQA

SparkQA commented Jun 23, 2016

Test build #61074 has finished for PR 13837 at commit 635046a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

cc @tdas @zsxwing Could you review this PR? It adds the test cases for #13727.

Thanks!

@@ -40,7 +40,7 @@ private[sql] class ParquetOptions(
   if (!shortParquetCompressionCodecNames.contains(codecName)) {
     val availableCodecs = shortParquetCompressionCodecNames.keys.map(_.toLowerCase)
     throw new IllegalArgumentException(s"Codec [$codecName] " +
-      s"is not available. Available codecs are ${availableCodecs.mkString(", ")}.")
+      s"is not available. Known codecs are ${availableCodecs.mkString(", ")}.")
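To make the behavior of this check concrete, here is a hedged, self-contained sketch of the codec-name validation. The map contents mirror Parquet's short codec names but are reproduced here only for illustration; `checkCodec` is a hypothetical stand-in for the validation inside ParquetOptions.

```scala
// Illustrative sketch of the codec validation in the diff above.
// `checkCodec` is a hypothetical stand-in, not the actual Spark method.
val shortParquetCompressionCodecNames: Map[String, String] = Map(
  "none"         -> "UNCOMPRESSED",
  "uncompressed" -> "UNCOMPRESSED",
  "snappy"       -> "SNAPPY",
  "gzip"         -> "GZIP",
  "lzo"          -> "LZO")

def checkCodec(codecName: String): String = {
  val key = codecName.toLowerCase
  if (!shortParquetCompressionCodecNames.contains(key)) {
    val availableCodecs = shortParquetCompressionCodecNames.keys.map(_.toLowerCase)
    throw new IllegalArgumentException(s"Codec [$codecName] " +
      s"is not available. Known codecs are ${availableCodecs.mkString(", ")}.")
  }
  shortParquetCompressionCodecNames(key)
}
```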
Contributor

why this change?

Member Author

Just to make it consistent with the output of the other cases. See the code:

case e: ClassNotFoundException =>
throw new IllegalArgumentException(s"Codec [$codecName] " +
s"is not available. Known codecs are ${shortCompressionCodecNames.keys.mkString(", ")}.")

@HyukjinKwon (Member) commented Nov 14, 2016

Available was intentionally used because Parquet only supports snappy, gzip, or lzo, whereas Known was used for the text-based sources (please see #10805 (comment)): those support compression codecs beyond the ones in the map, so the message lists only the known ones.

@gatorsmile gatorsmile closed this Aug 22, 2016
@gatorsmile gatorsmile reopened this Nov 12, 2016
@SparkQA

SparkQA commented Nov 12, 2016

Test build #68556 has finished for PR 13837 at commit 635046a.

  • This patch fails MiMa tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 12, 2016

Test build #68567 has finished for PR 13837 at commit b6bdf92.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 13, 2016

Test build #68579 has finished for PR 13837 at commit 4511037.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -2684,8 +2684,7 @@ test_that("Call DataFrameWriter.load() API in Java without path and check argume
   # It makes sure that we can omit path argument in read.df API and then it calls
   # DataFrameWriter.load() without path.
   expect_error(read.df(source = "json"),
-               paste("Error in loadDF : analysis error - Unable to infer schema for JSON at .",
-                     "It must be specified manually"))
+               paste("Error in loadDF : illegal argument - 'path' is not specified"))
Member

I recall this test was intentionally testing without the path argument?
cc @HyukjinKwon

@HyukjinKwon (Member) commented Nov 14, 2016

Thanks for cc'ing me. Yes, I did. The changes seem reasonable, as this check applies to the data sources that need a path.

@@ -322,6 +323,9 @@ case class DataSource(
   val equality = sparkSession.sessionState.conf.resolver
   StructType(schema.filterNot(f => partitionColumns.exists(equality(_, f.name))))
 }.orElse {
+  if (allPaths.isEmpty && !format.isInstanceOf[TextFileFormat]) {
@HyukjinKwon (Member) commented Nov 14, 2016

Hi @gatorsmile, would it be better to explain here that the text data source is excluded because it always uses a schema consisting of a single string field when the schema is not explicitly given?

BTW, should we maybe change text.TextFileFormat to TextFileFormat https://github.com/gatorsmile/spark/blob/45110370fb1889f244a6750ef2a18dbc9f1ba9c2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L139 ?
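The exemption discussed above can be sketched as follows. The types here are simplified stand-ins for illustration, not Spark's actual FileFormat hierarchy: the text source keeps its fixed single-string-column schema ("value") even without an explicit schema, while other sources fail fast when no paths are given.

```scala
// Simplified stand-ins for illustration only (not Spark's real types).
sealed trait FileFormat
case object TextFileFormat extends FileFormat
case object JsonFileFormat extends FileFormat

def inferredSchema(format: FileFormat, allPaths: Seq[String]): Seq[String] =
  format match {
    // text always falls back to a single string column ("value")
    case TextFileFormat => Seq("value: string")
    case _ if allPaths.isEmpty =>
      throw new IllegalArgumentException("'path' is not specified")
    case _ => Seq(s"schema inferred from ${allPaths.head}")
  }
```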

@felixcheung
Member

hi - where are we on this one?

@HyukjinKwon
Member

(gentle ping)
