
[SPARK-17838][SparkR] Check named arguments for options and use formatted R friendly message from JVM exception message #15608

Closed
wants to merge 3 commits into apache:master from HyukjinKwon:SPARK-17838

Conversation

@HyukjinKwon (Member) commented Oct 24, 2016

## What changes were proposed in this pull request?

This PR proposes to

  • show R-friendly error messages rather than the raw JVM exception ones.

    As `read.json`, `read.text`, `read.orc`, `read.parquet` and `read.jdbc` are executed on the same code path as `read.df`, and `write.json`, `write.text`, `write.orc`, `write.parquet` and `write.jdbc` share the same code path as `write.df`, it is safe to call `handledCallJMethod` to handle the JVM messages (a sketch of the idea follows this list).

  • prevent the `zero-length variable name` error and print the ignored options as a warning message.
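For illustration, here is a minimal sketch of the wrapping idea - a hypothetical simplification, not the merged SparkR code (`callJMethod` is SparkR's internal JVM bridge; the message parsing is an assumption):

```r
# Sketch: wrap a JVM call and surface an AnalysisException as a short R error.
handledCallJMethod <- function(obj, method, ...) {
  tryCatch(callJMethod(obj, method, ...),
           error = function(e) {
             msg <- conditionMessage(e)
             if (grepl("org.apache.spark.sql.AnalysisException: ", msg, fixed = TRUE)) {
               # Keep only the first line of the JVM message; drop the stack trace.
               first <- strsplit(sub(".*AnalysisException: ", "", msg), "\n")[[1]][1]
               stop(paste0("analysis error - ", first), call. = FALSE)
             } else {
               stop(e)
             }
           })
}
```

This is roughly the shape behind the shorter `analysis error - ...` messages shown below.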

**Before**

> read.json("path", a = 1, 2, 3, "a")
Error in env[[name]] <- value :
  zero-length variable name
> read.json("arbitrary_path")
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  org.apache.spark.sql.AnalysisException: Path does not exist: file:/...;
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:398)
  ...

> read.orc("arbitrary_path")
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  org.apache.spark.sql.AnalysisException: Path does not exist: file:/...;
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:398)
  ...

> read.text("arbitrary_path")
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  org.apache.spark.sql.AnalysisException: Path does not exist: file:/...;
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:398)
  ...

> read.parquet("arbitrary_path")
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  org.apache.spark.sql.AnalysisException: Path does not exist: file:/...;
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:398)
  ...
> write.json(df, "existing_path")
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  org.apache.spark.sql.AnalysisException: path file:/... already exists.;
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:68)

> write.orc(df, "existing_path")
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  org.apache.spark.sql.AnalysisException: path file:/... already exists.;
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:68)

> write.text(df, "existing_path")
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  org.apache.spark.sql.AnalysisException: path file:/... already exists.;
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:68)

> write.parquet(df, "existing_path")
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  org.apache.spark.sql.AnalysisException: path file:/... already exists.;
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:68)

**After**

read.json("arbitrary_path", a = 1, 2, 3, "a")
Unnamed arguments ignored: 2, 3, a.
> read.json("arbitrary_path")
Error in json : analysis error - Path does not exist: file:/...

> read.orc("arbitrary_path")
Error in orc : analysis error - Path does not exist: file:/...

> read.text("arbitrary_path")
Error in text : analysis error - Path does not exist: file:/...

> read.parquet("arbitrary_path")
Error in parquet : analysis error - Path does not exist: file:/...
> write.json(df, "existing_path")
Error in json : analysis error - path file:/... already exists.;

> write.orc(df, "existing_path")
Error in orc : analysis error - path file:/... already exists.;

> write.text(df, "existing_path")
Error in text : analysis error - path file:/... already exists.;

> write.parquet(df, "existing_path")
Error in parquet : analysis error - path file:/... already exists.;
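Since the rewritten errors are ordinary R conditions with short messages, callers can handle them with plain `tryCatch` (a hypothetical usage sketch; the message pattern here is an assumption, not a SparkR API):

```r
df <- tryCatch(read.json("arbitrary_path"),
               error = function(e) {
                 if (grepl("analysis error - Path does not exist", conditionMessage(e))) {
                   NULL  # e.g. treat a missing path as "no data"
                 } else {
                   stop(e)  # re-raise anything unexpected
                 }
               })
```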

## How was this patch tested?

Unit tests in `test_utils.R` and `test_sparkSQL.R`.

@HyukjinKwon (Member, Author) commented Oct 24, 2016

cc @felixcheung I recall we talked about this before. I first wanted to handle all of the argument type checking, but I decided to do only what I am pretty sure of (I remember we were concerned about sweeping changes). Could you please take a look?

@SparkQA commented Oct 24, 2016

Test build #67440 has finished for PR 15608 at commit e6afa4b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Review comment thread on this hunk:

```r
env[[name]] <- as.character(value)
nameList <- names(pairs)
ignoredNames <- list()
i <- 1
```
@felixcheung (Member): why is `i` needed here?
Could you elaborate on the changes in this file? Why do we need them?

@HyukjinKwon (Member, Author): Oh, sure. Actually, it took me a while to find a clean way, but it ended up with introducing another variable here...

```r
value <- pairs[[name]]
```

fails when `name` is an empty string (i.e. for unnamed arguments), producing the error below:

```r
Error in env[[name]] <- value :
  zero-length variable name
```

I wanted to access those problematic values so I could print them in

```r
ignoredNames <- append(ignoredNames, pairs[i])
...
warning(paste0("Non-named arguments ignored: ", paste(ignoredNames, collapse = ", "), "."),
        call. = FALSE)
```

To cut it short, I introduced that variable to access the value when the name is empty. Maybe there is a cleverer way, but I couldn't come up with one.
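For context, this failure mode is reproducible in plain R without Spark (a minimal standalone repro; variable names are illustrative):

```r
pairs <- list(a = 1, 2)        # the second argument is unnamed
names(pairs)                   # [1] "a" ""
env <- new.env()
env[[names(pairs)[2]]] <- "1"  # Error: zero-length variable name
```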

@felixcheung (Member): I see. One possible alternative is to turn the list into a sequence of its indices but use `lapply` instead of `for` (which is more R-like):

```r
lapply(seq_along(x), function(i) paste(names(x)[[i]], x[[i]]))
```
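As a quick illustration of what that produces (with a hypothetical mixed list, where unnamed elements get an empty name):

```r
x <- list(a = 1, 2, b = 3)
lapply(seq_along(x), function(i) paste(names(x)[[i]], x[[i]]))
# -> list("a 1", " 2", "b 3")
```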

@HyukjinKwon (Member, Author): That looks nicer. Let me try that!

@felixcheung (Member):

Thanks @HyukjinKwon. These are definitely good checks to have. Calls to `read.*` and `write.*` are not easily checked for parameters (file paths are hard), so to me it is better to leave the checking to the JVM and just handle the exception better.

I wouldn't close this JIRA though, since it covers a broader case - ones where we could check parameters before passing them to the JVM. Would you like to keep this JIRA open after this PR, or open a separate JIRA for this specific class of handling?

One case I'm not sure about is what the Reader/Writer does with unnamed parameters - should we optionally fail here on the R side, instead of just ignoring them (in R or the JVM)?

@HyukjinKwon (Member, Author) commented Oct 25, 2016

I am fine with leaving the JIRA open. I can definitely try to open followups. Otherwise, I can also convert the JIRA into a sub-task after introducing a parent JIRA.

I will follow your lead. (If you close the JIRA, I will open another to make this a sub-task. If you don't, I will just try to make a followup in the future.)

> One case I'm not sure about is what the Reader/Writer does with unnamed parameters - should we optionally fail here on the R side, instead of just ignoring them (in R or the JVM)?

I was worried about this part too, and it took me a while to build up my argument for this... The argument I used to convince myself was that we currently don't fail when unused arbitrary options are given (e.g. `option("abc", "1")`), so it should be okay not to fail here either. But of course this is not a strong opinion.

@felixcheung (Member):

re: JIRA - I don't mind one way or the other - both proposals sound good to me.

> we currently don't fail when unused arbitrary options are given (e.g. `option("abc", "1")`), so it should be okay not to fail here either. But of course this is not a strong opinion.

A valid point. Does it work properly when the Spark properties are set into a Java `Properties` object?
https://docs.oracle.com/javase/7/docs/api/java/util/Properties.html#setProperty(java.lang.String,%20java.lang.String)
It seems that if the key is "" or null, it would overwrite?

@HyukjinKwon (Member, Author): I guess you meant when the options, for example `DataFrameReader.extraOptions`, turn into `Properties` when we call the JDBC APIs. In this case, `Properties` always has the higher precedence and will overwrite `DataFrameReader.extraOptions` [1] and exclude Spark internal options [2].

[1] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L235
[2]

(BTW, it seems to throw a NullPointerException when we call `setProperty` and the key is null.)

@HyukjinKwon (Member, Author): @felixcheung I just made it a single `for` loop (slightly different from the suggested one though...). Could you please check it again?
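A minimal sketch of that kind of single loop (hypothetical and simplified - not necessarily the merged `varargsToStrEnv` body):

```r
varargsToStrEnv <- function(...) {
  pairs <- list(...)
  env <- new.env()
  ignored <- list()
  for (i in seq_along(pairs)) {
    name <- names(pairs)[i]
    if (is.null(name) || identical(name, "")) {
      # Unnamed argument: collect its value instead of assigning it, since
      # `env[[""]] <- value` raises "zero-length variable name".
      ignored <- append(ignored, pairs[i])
    } else {
      env[[name]] <- as.character(pairs[[i]])
    }
  }
  if (length(ignored) > 0) {
    warning(paste0("Unnamed arguments ignored: ",
                   paste(ignored, collapse = ", "), "."),
            call. = FALSE)
  }
  env
}
```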

@SparkQA commented Oct 26, 2016

Test build #67570 has finished for PR 15608 at commit 2336dd9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung (Member): LGTM. I ran through a few other cases and I think the omitted names are handled properly with this.
This should go to master then (`handledCallJMethod` is not in branch-2.0).

@HyukjinKwon (Member, Author): Yup, it seems not even `varargsToStrEnv` is there.

@felixcheung (Member): merged to master.

@asfgit closed this in 1ecfafa on Nov 2, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
[SPARK-17838][SparkR] Check named arguments for options and use formatted R friendly message from JVM exception message
Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#15608 from HyukjinKwon/SPARK-17838.
@HyukjinKwon deleted the SPARK-17838 branch on January 2, 2018