
[SPARK-17838][SparkR] Check named arguments for options and use formatted R friendly message from JVM exception message #15608

Closed
wants to merge 3 commits into apache:master from HyukjinKwon:SPARK-17838

Conversation

@HyukjinKwon (Member) commented Oct 24, 2016

## What changes were proposed in this pull request?

This PR proposes to

  • show R-friendly error messages rather than the raw JVM exception ones.

    As `read.json`, `read.text`, `read.orc`, `read.parquet` and `read.jdbc` are executed on the same code path as `read.df`, and `write.json`, `write.text`, `write.orc`, `write.parquet` and `write.jdbc` share the same code path as `write.df`, it is safe to call `handledCallJMethod` to handle the JVM messages (a sketch of the idea follows this list).

  • prevent the `zero-length variable name` error and print the ignored options as a warning message.
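For illustration, here is a minimal sketch of the wrapping idea - a hypothetical simplification, not the merged SparkR code (`callJMethod` is SparkR's internal JVM bridge; the message parsing is an assumption):

```r
# Sketch: wrap a JVM call and surface an AnalysisException as a short R error.
handledCallJMethod <- function(obj, method, ...) {
  tryCatch(callJMethod(obj, method, ...),
           error = function(e) {
             msg <- conditionMessage(e)
             if (grepl("org.apache.spark.sql.AnalysisException: ", msg, fixed = TRUE)) {
               # Keep only the first line of the JVM message; drop the stack trace.
               first <- strsplit(sub(".*AnalysisException: ", "", msg), "\n")[[1]][1]
               stop(paste0("analysis error - ", first), call. = FALSE)
             } else {
               stop(e)
             }
           })
}
```

This is roughly the shape behind the shorter `analysis error - ...` messages shown below.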

**Before**

> read.json("path", a = 1, 2, 3, "a")
Error in env[[name]] <- value :
  zero-length variable name
> read.json("arbitrary_path")
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  org.apache.spark.sql.AnalysisException: Path does not exist: file:/...;
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:398)
  ...

> read.orc("arbitrary_path")
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  org.apache.spark.sql.AnalysisException: Path does not exist: file:/...;
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:398)
  ...

> read.text("arbitrary_path")
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  org.apache.spark.sql.AnalysisException: Path does not exist: file:/...;
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:398)
  ...

> read.parquet("arbitrary_path")
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  org.apache.spark.sql.AnalysisException: Path does not exist: file:/...;
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:398)
  ...
> write.json(df, "existing_path")
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  org.apache.spark.sql.AnalysisException: path file:/... already exists.;
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:68)

> write.orc(df, "existing_path")
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  org.apache.spark.sql.AnalysisException: path file:/... already exists.;
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:68)

> write.text(df, "existing_path")
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  org.apache.spark.sql.AnalysisException: path file:/... already exists.;
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:68)

> write.parquet(df, "existing_path")
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  org.apache.spark.sql.AnalysisException: path file:/... already exists.;
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:68)

**After**

read.json("arbitrary_path", a = 1, 2, 3, "a")
Unnamed arguments ignored: 2, 3, a.
> read.json("arbitrary_path")
Error in json : analysis error - Path does not exist: file:/...

> read.orc("arbitrary_path")
Error in orc : analysis error - Path does not exist: file:/...

> read.text("arbitrary_path")
Error in text : analysis error - Path does not exist: file:/...

> read.parquet("arbitrary_path")
Error in parquet : analysis error - Path does not exist: file:/...
> write.json(df, "existing_path")
Error in json : analysis error - path file:/... already exists.;

> write.orc(df, "existing_path")
Error in orc : analysis error - path file:/... already exists.;

> write.text(df, "existing_path")
Error in text : analysis error - path file:/... already exists.;

> write.parquet(df, "existing_path")
Error in parquet : analysis error - path file:/... already exists.;
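Since the rewritten errors are ordinary R conditions with short messages, callers can handle them with plain `tryCatch` (a hypothetical usage sketch; the message pattern here is an assumption, not a SparkR API):

```r
df <- tryCatch(read.json("arbitrary_path"),
               error = function(e) {
                 if (grepl("analysis error - Path does not exist", conditionMessage(e))) {
                   NULL  # e.g. treat a missing path as "no data"
                 } else {
                   stop(e)  # re-raise anything unexpected
                 }
               })
```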

## How was this patch tested?

Unit tests in `test_utils.R` and `test_sparkSQL.R`.

@HyukjinKwon (Member, Author) commented Oct 24, 2016

cc @felixcheung I recall we talked about this before. I first wanted to handle all of the argument type checking, but I decided to do only what I am pretty sure of (I remember we were concerned about sweeping changes). Could you please take a look?

@SparkQA commented Oct 24, 2016

Test build #67440 has finished for PR 15608 at commit e6afa4b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Review comment thread on this hunk:

```r
env[[name]] <- as.character(value)
nameList <- names(pairs)
ignoredNames <- list()
i <- 1
```
@felixcheung (Member): why is `i` needed here?
Could you elaborate on the changes in this file? Why do we need them?

@HyukjinKwon (Member, Author): Oh, sure. Actually, it took me a while to find a clean way, but it ended up with introducing another variable here...

```r
value <- pairs[[name]]
```

fails when `name` is an empty string (i.e. for unnamed arguments), producing the error below:

```r
Error in env[[name]] <- value :
  zero-length variable name
```

I wanted to access those problematic values so I could print them in

```r
ignoredNames <- append(ignoredNames, pairs[i])
...
warning(paste0("Non-named arguments ignored: ", paste(ignoredNames, collapse = ", "), "."),
        call. = FALSE)
```

To cut it short, I introduced that variable to access the value when the name is empty. Maybe there is a cleverer way, but I couldn't come up with one.
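For context, this failure mode is reproducible in plain R without Spark (a minimal standalone repro; variable names are illustrative):

```r
pairs <- list(a = 1, 2)        # the second argument is unnamed
names(pairs)                   # [1] "a" ""
env <- new.env()
env[[names(pairs)[2]]] <- "1"  # Error: zero-length variable name
```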

@felixcheung (Member): I see. One possible alternative is to turn the list into a sequence of its indices but use `lapply` instead of `for` (which is more R-like):

```r
lapply(seq_along(x), function(i) paste(names(x)[[i]], x[[i]]))
```
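As a quick illustration of what that produces (with a hypothetical mixed list, where unnamed elements get an empty name):

```r
x <- list(a = 1, 2, b = 3)
lapply(seq_along(x), function(i) paste(names(x)[[i]], x[[i]]))
# -> list("a 1", " 2", "b 3")
```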

@HyukjinKwon (Member, Author): That looks nicer. Let me try that!

@felixcheung (Member):

Thanks @HyukjinKwon. These are definitely good checks to have. Calls to `read.*` and `write.*` are not easily checked for parameters (file paths are hard), so to me it is better to leave the checking to the JVM and just handle the exception better.

I wouldn't close this JIRA though, since it covers a broader case - ones where we could check parameters before passing them to the JVM. Would you like to keep this JIRA open after this PR, or open a separate JIRA for this specific class of handling?

One case I'm not sure about is what the Reader/Writer does with unnamed parameters - should we optionally fail here on the R side, instead of just ignoring them (in R or the JVM)?

@HyukjinKwon (Member, Author) commented Oct 25, 2016

I am fine with leaving the JIRA open. I can definitely try to open followups. Otherwise, I can also convert the JIRA into a sub-task after introducing a parent JIRA.

I will follow your lead. (If you close the JIRA, I will open another to make this a sub-task. If you don't, I will just try to make a followup in the future.)

> One case I'm not sure about is what the Reader/Writer does with unnamed parameters - should we optionally fail here on the R side, instead of just ignoring them (in R or the JVM)?

I was worried about this part too, and it took me a while to build up my argument for this... The argument I used to convince myself was that we currently don't fail when unused arbitrary options are given (e.g. `option("abc", "1")`), so it should be okay not to fail here either. But of course this is not a strong opinion.

@felixcheung (Member):

re: JIRA - I don't mind one way or the other - both proposals sound good to me.

> we currently don't fail when unused arbitrary options are given (e.g. `option("abc", "1")`), so it should be okay not to fail here either. But of course this is not a strong opinion.

A valid point. Does it work properly when the Spark properties are set into a Java `Properties` object?
https://docs.oracle.com/javase/7/docs/api/java/util/Properties.html#setProperty(java.lang.String,%20java.lang.String)
It seems that if the key is "" or null, it would overwrite?

@HyukjinKwon (Member, Author): I guess you meant when the options, for example `DataFrameReader.extraOptions`, turn into `Properties` when we call the JDBC APIs. In this case, `Properties` always has the higher precedence and will overwrite `DataFrameReader.extraOptions` [1] and exclude Spark internal options [2].

[1] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L235
[2]

(BTW, it seems to throw a NullPointerException when we call `setProperty` and the key is null.)

@HyukjinKwon (Member, Author): @felixcheung I just made it a single `for` loop (slightly different from the suggested one though...). Could you please check it again?
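A minimal sketch of that kind of single loop (hypothetical and simplified - not necessarily the merged `varargsToStrEnv` body):

```r
varargsToStrEnv <- function(...) {
  pairs <- list(...)
  env <- new.env()
  ignored <- list()
  for (i in seq_along(pairs)) {
    name <- names(pairs)[i]
    if (is.null(name) || identical(name, "")) {
      # Unnamed argument: collect its value instead of assigning it, since
      # `env[[""]] <- value` raises "zero-length variable name".
      ignored <- append(ignored, pairs[i])
    } else {
      env[[name]] <- as.character(pairs[[i]])
    }
  }
  if (length(ignored) > 0) {
    warning(paste0("Unnamed arguments ignored: ",
                   paste(ignored, collapse = ", "), "."),
            call. = FALSE)
  }
  env
}
```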

@SparkQA commented Oct 26, 2016

Test build #67570 has finished for PR 15608 at commit 2336dd9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung (Member): LGTM. I ran through a few other cases and I think the omitted names are handled properly with this.
This should go to master then (`handledCallJMethod` is not in branch-2.0).

@HyukjinKwon (Member, Author): Yup, it seems not even `varargsToStrEnv` is there.

@felixcheung (Member): merged to master.

@asfgit closed this in 1ecfafa on Nov 2, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
[SPARK-17838][SparkR] Check named arguments for options and use formatted R friendly message from JVM exception message
Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#15608 from HyukjinKwon/SPARK-17838.
@HyukjinKwon deleted the SPARK-17838 branch on January 2, 2018