[SPARK-31918][R] Ignore S4 generic methods under SparkR namespace in closure cleaning to support R 4.0.0+ #28907

HyukjinKwon · 2020-06-23T11:43:51Z

What changes were proposed in this pull request?

This PR proposes to ignore S4 generic methods under SparkR namespace in closure cleaning to support R 4.0.0+.

Currently, running code that executes native R functions fails with R 4.0.0, as shown below:

df <- createDataFrame(lapply(seq(100), function (e) list(value=e)))
count(dapply(df, function(x) as.data.frame(x[x$value < 50,]), schema(df)))

org.apache.spark.SparkException: R unexpectedly exited.
R worker produced errors: Error in lapply(part, FUN) : attempt to bind a variable to R_UnboundValue

The root cause seems to be related to an S4 generic method being manually included in the closure's environment via SparkR:::cleanClosure. For example, when an RRDD is created via createDataFrame, which calls lapply to convert the data, lapply itself:

spark/R/pkg/R/RDD.R

Line 484 in f53d8c6

setMethod("lapply",

is added to the environment of the cleaned closure because it is not exported from the SparkR namespace. However, this breaks in R 4.0.0+ for an unknown reason, with an error message such as "attempt to bind a variable to R_UnboundValue".

Actually, we don't need to add lapply to the environment of the closure because it is not supposed to be called on the worker side. In fact, as far as I understand, no private generic methods in SparkR are supposed to be called on the worker side at all.

Therefore, this PR takes a simpler path and works around the problem by explicitly excluding S4 generic methods under the SparkR namespace, to support R 4.0.0+ in SparkR.
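To make the workaround concrete, here is a minimal sketch of the idea in plain R. It is not SparkR's implementation: `sketchApply`, `sketchOffset` and `sketchCleanClosure` are hypothetical names invented for this illustration, and the symbol walk uses base `all.names()` rather than SparkR's own traversal. The only real API it relies on is `methods::isGeneric()`, which is used to skip S4 generics while copying the symbols a function depends on into a fresh environment.

```r
library(methods)

# Hypothetical private S4 generic, standing in for an internal generic
# such as SparkR's RDD lapply.
setGeneric("sketchApply", function(x, f) standardGeneric("sketchApply"))

# Hypothetical ordinary variable that the closure depends on.
sketchOffset <- 10

# Simplified, hypothetical closure cleaner (an assumption for illustration):
# copy the values of symbols used in `func` into a new environment,
# skipping arguments, empty names and S4 generics.
sketchCleanClosure <- function(func, lookupEnv = environment(func)) {
  newEnv <- new.env(parent = globalenv())
  for (name in unique(all.names(body(func)))) {
    if (name == "" || name %in% names(formals(func))) next
    if (!exists(name, envir = lookupEnv, inherits = FALSE)) next
    # S4 generics are deliberately not bound in the cleaned environment.
    if (methods::isGeneric(name, lookupEnv)) next
    assign(name, get(name, envir = lookupEnv), envir = newEnv)
  }
  environment(func) <- newEnv
  func
}

f <- function(x) {
  if (is.numeric(x)) x + sketchOffset else sketchApply(x, identity)
}
cleaned <- sketchCleanClosure(f)

# Lists what was captured; the generic is left out.
print(ls(environment(cleaned)))
```

In this sketch the ordinary variable is captured and the generic is skipped, which mirrors the intent described above: the generic is never bound in the cleaned closure's environment.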

Why are the changes needed?

To support R 4.0.0+ with SparkR, and to unblock the releases on CRAN. CRAN requires that the tests pass with the latest R.

Does this PR introduce any user-facing change?

Yes, end users will be able to use SparkR with R 4.0.0.

How was this patch tested?

Manually tested. Ran both the CRAN check and the tests with R 4.0.1:

══ testthat results  ═══════════════════════════════════════════════════════════
[ OK: 13 | SKIPPED: 0 | WARNINGS: 0 | FAILED: 0 ]
✔ |  OK F W S | Context
✔ |  11       | binary functions [2.5 s]
✔ |   4       | functions on binary files [2.1 s]
✔ |   2       | broadcast variables [0.5 s]
✔ |   5       | functions in client.R
✔ |  46       | test functions in sparkR.R [6.3 s]
✔ |   2       | include R packages [0.3 s]
✔ |   2       | JVM API [0.2 s]
✔ |  75       | MLlib classification algorithms, except for tree-based algorithms [86.3 s]
✔ |  70       | MLlib clustering algorithms [44.5 s]
✔ |   6       | MLlib frequent pattern mining [3.0 s]
✔ |   8       | MLlib recommendation algorithms [9.6 s]
✔ | 136       | MLlib regression algorithms, except for tree-based algorithms [76.0 s]
✔ |   8       | MLlib statistics algorithms [0.6 s]
✔ |  94       | MLlib tree-based algorithms [85.2 s]
✔ |  29       | parallelize() and collect() [0.5 s]
✔ | 428       | basic RDD functions [25.3 s]
✔ |  39       | SerDe functionality [2.2 s]
✔ |  20       | partitionBy, groupByKey, reduceByKey etc. [3.9 s]
✔ |   4       | functions in sparkR.R
✔ |  16       | SparkSQL Arrow optimization [19.2 s]
✔ |   6       | test show SparkDataFrame when eager execution is enabled. [1.1 s]
✔ | 1175       | SparkSQL functions [134.8 s]
✔ |  42       | Structured Streaming [478.2 s]
✔ |  16       | tests RDD function take() [1.1 s]
✔ |  14       | the textFile() function [2.9 s]
✔ |  46       | functions in utils.R [0.7 s]
✔ |   0     1 | Windows-specific tests
────────────────────────────────────────────────────────────────────────────────
test_Windows.R:22: skip: sparkJars tag in SparkContext
Reason: This test is only for Windows, skipped
────────────────────────────────────────────────────────────────────────────────

══ Results ═════════════════════════════════════════════════════════════════════
Duration: 987.3 s

OK:       2304
Failed:   0
Warnings: 0
Skipped:  1
...
Status: OK
+ popd
Tests passed.

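Note that I also built SparkR with R 4.0.0 and ran the tests with R 3.6.3. Everything passed. See also the comment in the JIRA: https://issues.apache.org/jira/browse/SPARK-31918?focusedCommentId=17142837&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17142837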

HyukjinKwon · 2020-06-23T11:44:34Z

R/pkg/tests/fulltests/test_mllib_classification.R

-  model <- spark.logit(training, Species ~ ., family = "multinomial",
-                       lowerBoundsOnCoefficients = l,
-                       lowerBoundsOnIntercepts = as.array(c(0.0, 0.0)))
+  model <- suppressWarnings(spark.logit(training, Species ~ ., family = "multinomial",


It suppresses:

test_mllib_classification.R:258: error: spark.logit
(converted from warning) the condition has length > 1 and only the first element will be used
Backtrace:
 1. SparkR::spark.logit(...) tests/fulltests/test_mllib_classification.R:258:2
 2. SparkR::spark.logit(...)
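For readers unfamiliar with the "(converted from warning)" prefix, the sketch below shows how a warning can surface as an error and how `suppressWarnings()` avoids that. It assumes the conversion comes from `options(warn = 2)`; the test harness may do this differently. `noisy` is a hypothetical function used only for illustration.

```r
# Hypothetical function that emits a warning and then returns a value.
noisy <- function() {
  warning("sketch warning")
  42
}

# Assumption: warnings are promoted to errors via options(warn = 2).
old <- options(warn = 2)

# Without suppression, the warning is raised as an error and caught here.
res1 <- tryCatch(noisy(), error = function(e) conditionMessage(e))

# With suppressWarnings(), the warning is muffled before it can be
# promoted, so the call returns its value normally.
res2 <- suppressWarnings(noisy())

# Restore the previous warning level.
options(old)

print(res1)
print(res2)
```

Wrapping the call in `suppressWarnings()` keeps the test focused on the returned model rather than on the warning itself.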

HyukjinKwon · 2020-06-23T11:45:08Z

R/pkg/tests/fulltests/test_mllib_classification.R

@@ -130,7 +130,7 @@ test_that("spark.logit", {
  summary <- summary(model)

  # test summary coefficients return matrix type
-  expect_true(class(summary$coefficients) == "matrix")
+  expect_true(any(class(summary$coefficients) == "matrix"))


matrix objects now also inherit from class "array", so e.g., class(diag(1)) is c("matrix", "array"). This invalidates code incorrectly assuming that class(matrix_obj) has length one.

https://cran.r-project.org/doc/manuals/r-devel/NEWS.html

Thanks! This also reminds me that it'll be great to test on r-devel (which I guess is 4.0.3 right now)
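The class change quoted above can be checked with a few lines of base R. This is a small sketch, not part of the patch; it shows checks that hold whether `class(m)` is `"matrix"` or `c("matrix", "array")`, and why a bare `class(m) == "matrix"` is fragile.

```r
m <- diag(2)

# Each of these is a single TRUE regardless of how many classes m reports.
check_is_matrix <- is.matrix(m)
check_inherits <- inherits(m, "matrix")
check_any <- any(class(m) == "matrix")

# The raw comparison returns one logical per element of class(m), so its
# length follows length(class(m)) and it is unsafe as an `if` condition
# or inside expect_true() when class(m) has more than one element.
raw_comparison <- class(m) == "matrix"

print(class(m))
print(length(raw_comparison))
```

`is.matrix()` or `inherits()` is usually the more direct choice; `any(class(x) == "matrix")` is the form used in the test change above.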

HyukjinKwon · 2020-06-23T11:46:08Z

R/pkg/R/utils.R

+               !(nodeChar %in% getNamespaceExports("SparkR")) &&
+                  # Note that generic S4 methods should not be set to the environment of
+                  # cleaned closure. It does not work with R 4.0.0+. See also SPARK-31918.
+                  nodeChar != "" && !methods::isGeneric(nodeChar, func.env))) {


nodeChar can be an empty string, and isGeneric rejects empty strings.

Just to confirm this will only exclude generics inside SparkR -- is that right?

Yes, so it wouldn't affect or protect other cases.
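The ordering of the two conditions matters: because `&&` short-circuits, the empty-string check runs first and `isGeneric` is never called with `""`. A minimal sketch of that guard follows; `sketchGen`, `sketchPlain` and `safeIsGeneric` are hypothetical names used only for illustration, not part of SparkR.

```r
library(methods)

# Hypothetical S4 generic and plain function defined at top level.
setGeneric("sketchGen", function(x) standardGeneric("sketchGen"))
sketchPlain <- function(x) x

# Hypothetical helper: the empty-string check comes first, so
# methods::isGeneric() is only reached for non-empty names.
safeIsGeneric <- function(name, env) {
  name != "" && methods::isGeneric(name, env)
}

r_empty <- safeIsGeneric("", globalenv())
r_generic <- safeIsGeneric("sketchGen", globalenv())
r_plain <- safeIsGeneric("sketchPlain", globalenv())

print(c(empty = r_empty, generic = r_generic, plain = r_plain))
```

The diff uses the negated form of this check, so an empty nodeChar fails the `nodeChar != ""` test before `isGeneric` is evaluated.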

HyukjinKwon · 2020-06-23T11:46:35Z

cc @shivaram, @felixcheung, @dongjoon-hyun.

SparkQA · 2020-06-23T12:31:44Z

Test build #124417 has finished for PR 28907 at commit fe0deac.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

shivaram

Just some minor comments. The only major concern I have is that we are not sure why this broke (as in, what the underlying change to R was), so I'm not certain that other issues won't crop up later. But it's still good to get this fix in for 3.0.1.

shivaram · 2020-06-23T14:49:06Z

R/pkg/R/utils.R

+               !(nodeChar %in% getNamespaceExports("SparkR")) &&
+                  # Note that generic S4 methods should not be set to the environment of
+                  # cleaned closure. It does not work with R 4.0.0+. See also SPARK-31918.
+                  nodeChar != "" && !methods::isGeneric(nodeChar, func.env))) {


Just to confirm this will only exclude generics inside SparkR -- is that right?

R/pkg/tests/fulltests/test_context.R

shivaram · 2020-06-23T14:51:10Z

R/pkg/tests/fulltests/test_mllib_classification.R

@@ -130,7 +130,7 @@ test_that("spark.logit", {
  summary <- summary(model)

  # test summary coefficients return matrix type
-  expect_true(class(summary$coefficients) == "matrix")
+  expect_true(any(class(summary$coefficients) == "matrix"))


Thanks! This also reminds me that it'll be great to test on r-devel (which I guess is 4.0.3 right now)

dongjoon-hyun · 2020-06-23T17:31:57Z

Thank you for pinging me, @HyukjinKwon .

SparkQA · 2020-06-24T01:53:33Z

Test build #124442 has finished for PR 28907 at commit 29bfcdb.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-06-24T02:00:36Z

Test build #124443 has finished for PR 28907 at commit 14886cb.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2020-06-24T02:03:19Z

Merged to master, branch-3.0 and branch-2.4.

…closure cleaning to support R 4.0.0+

Closes #28907 from HyukjinKwon/SPARK-31918.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit 11d2b07)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>

…closure cleaning to support R 4.0.0+

Closes apache#28907 from HyukjinKwon/SPARK-31918.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>

Ignore S4 generic methods in closure cleaning to support R 4.0.0+

fe0deac

probot-autolabeler bot added ML R labels Jun 23, 2020

HyukjinKwon commented Jun 23, 2020


HyukjinKwon mentioned this pull request Jun 23, 2020

[SPARK-32074][BUILD][R] Update AppVeyor R version to 4.0.2 #28909

Closed

shivaram approved these changes Jun 23, 2020


Address a comment

405b964


HyukjinKwon added 2 commits June 24, 2020 10:13

style

29bfcdb

Lint version differences

14886cb

HyukjinKwon force-pushed the SPARK-31918 branch from 79ff989 to 14886cb Compare June 24, 2020 01:19

HyukjinKwon closed this in 11d2b07 Jun 24, 2020

HyukjinKwon deleted the SPARK-31918 branch July 27, 2020 07:44
