
[SPARK-18325][SPARKR] Add example for using native R package. #16214

Closed
wants to merge 4 commits

Conversation

yanboliang
Contributor

@yanboliang yanboliang commented Dec 8, 2016

What changes were proposed in this pull request?

SparkR provides the ability to distribute computations via spark.lapply. In many cases, the distributed computation depends on third-party R packages, so we need to install the corresponding packages on each executor in advance, which is tedious and error-prone. Since SparkR is an interactive analytics tool and users may load many native R packages during a session, it's difficult to install all necessary packages before starting the SparkR session, so I think it's helpful to provide an example illustrating how to handle this scenario in an elegant way.
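A minimal sketch of the pattern, stitched together from the snippets reviewed below; the tarball file name and the svm-based training body are illustrative, not the exact code in this PR:

library(SparkR)
sparkR.session(appName = "SparkR-native-r-package-example")

# Distribute the source package; Spark ships it to every executor node.
spark.addFile("e1071_1.6-8.tar.gz")  # hypothetical local tarball path

train <- function(cost) {
  # Resolve the shipped file on the executor and install it on first use.
  path <- spark.getSparkFiles("e1071_1.6-8.tar.gz")
  if (!("e1071" %in% rownames(installed.packages()))) {
    install.packages(path, repos = NULL, type = "source")
  }
  library(e1071)
  model <- svm(Species ~ ., data = iris, cost = cost)
  summary(model)
}

# Train one model per cost value, in parallel across the cluster.
costs <- exp(seq(from = log(1), to = log(1000), length.out = 5))
summaries <- spark.lapply(costs, train)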

How was this patch tested?

Offline test.

@yanboliang yanboliang changed the title [[SPARK-18325]][SPARKR] Add example for using native R package. [SPARK-18325][SPARKR] Add example for using native R package. Dec 8, 2016
@yanboliang
Contributor Author

@felixcheung @shivaram

@SparkQA

SparkQA commented Dec 8, 2016

Test build #69861 has finished for PR 16214 at commit 9b15bb3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

Great - I think we should be careful with example code that installs stuff automatically, especially if that could run on a cluster without any change.

Perhaps we should have a blocking prompt at the beginning with a warning message?

Also, it would be better to have more description of what it is doing, where things would go (e.g. where the package would be installed to), and why.
And perhaps a section in the programming guide? Or is that too public (depending on how much we want to promote such things)?
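For instance, a blocking prompt along these lines could work (a sketch, assuming an interactive session; readline() returns "" when run non-interactively):

# Warn before touching any executor's library (illustrative wording).
ans <- readline("This example installs R packages on every executor node. Continue? [y/N] ")
if (tolower(ans) != "y") {
  stop("Aborted: no packages were installed.")
}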

sparkR.session(appName = "SparkR-native-r-package-example")

# Get the location of the default library
libDir <- .libPaths()[1]
Member


install.packages() already defaults to the default library.
As discussed before, perhaps it would be more useful to set this to the application directory (e.g. with YARN), because the default libPath is usually secured?
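A sketch of that idea, with an illustrative directory name: prepend a writable, application-scoped directory to the library search path instead of relying on the (often write-protected) default library.

# Use a writable, per-application library directory (name is illustrative).
libDir <- file.path(tempdir(), "sparkr-app-lib")
dir.create(libDir, showWarnings = FALSE, recursive = TRUE)
.libPaths(c(libDir, .libPaths()))  # searched first when installing/loading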

Contributor Author


Good suggestion!

# Perform distributed training of multiple models with spark.lapply
costs <- exp(seq(from = log(1), to = log(1000), length.out = 5))
train <- function(cost) {
if("e1071" %in% rownames(installed.packages(libDir)) == FALSE) {
Member


I'd prefer installed.packages(lib = libDir) to be more clear

path <- spark.getSparkFiles(filename)
costs <- exp(seq(from = log(1), to = log(1000), length.out = 5))
train <- function(cost) {
if("e1071" %in% rownames(installed.packages(libDir)) == FALSE) {
Member


ditto

costs <- exp(seq(from = log(1), to = log(1000), length.out = 5))
train <- function(cost) {
if("e1071" %in% rownames(installed.packages(libDir)) == FALSE) {
install.packages(path, repos=NULL, type="source")
Member


Although this is an example of how an R package can be distributed, I wouldn't call install.packages here, because of the secure-location and duplication issues.

Instead, this could do library(e1071, lib.loc = path) - i.e. the package doesn't need to be "installed" to be loaded.
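A sketch of this loading-without-installing approach, assuming the package was distributed as an already-built package directory (e.g. added with spark.addFile(..., recursive = TRUE)) rather than as a source tarball:

# On the executor: locate the shipped, ready-built package directory and load
# it in place, without calling install.packages().
pkgDir <- spark.getSparkFiles("e1071")        # hypothetical directory name
library(e1071, lib.loc = dirname(pkgDir))

As yanboliang notes later in the thread, this only works when the distributed content is a built package, not a source tarball.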


Contributor Author


I removed the test which directly installs R packages from CRAN, considering that most users don't allow the cluster to connect to the internet.

@mengxr
Contributor

mengxr commented Dec 8, 2016

@yanboliang What happens if there are multiple executors on the same machine? Are there concurrency issues?

@SparkQA

SparkQA commented Dec 9, 2016

Test build #69908 has finished for PR 16214 at commit ce18e2e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Contributor Author

yanboliang commented Dec 9, 2016

@mengxr I think there is no concurrency issue (at least I did not find one) when multiple executors run on the same machine, since native R has a lock to protect against this. You can verify it by installing a package to a shared directory on a single node from multiple threads. However, we should recommend installing packages to an executor-associated directory (or a random temporary directory), which means each executor has its own R lib directory even if they are on the same machine. For YARN mode, it works well since each executor natively has its own R lib directory.
@felixcheung I updated the examples and added a corresponding section in the SparkR user guide. I'm not sure whether such an example is appropriate, since it has many dependencies on the environment. But I think for an interactive analytical tool, installing packages during the session is not very rare. Still, I'm open to hearing your thoughts, thanks.
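A sketch of the executor-associated directory idea (file and directory names illustrative): each worker process installs into its own library path, so executors sharing a machine never write to the same location.

train <- function(cost) {
  # Per-process library directory; tempdir() is already unique per R session,
  # and the pid suffix makes the isolation explicit.
  libDir <- file.path(tempdir(), paste0("sparkr-lib-", Sys.getpid()))
  dir.create(libDir, showWarnings = FALSE, recursive = TRUE)
  path <- spark.getSparkFiles("e1071_1.6-8.tar.gz")
  if (!("e1071" %in% rownames(installed.packages(lib.loc = libDir)))) {
    install.packages(path, lib = libDir, repos = NULL, type = "source")
  }
  library(e1071, lib.loc = libDir)
  # ... train and summarize the model as in the example ...
}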

if("e1071" %in% rownames(installed.packages(lib = libDir)) == FALSE) {
install.packages(path, repos = NULL, type = "source")
}
library(e1071)
Contributor


Did you try standalone mode? Even if R installation has locks, this might have concurrency issues, e.g.:

  1. Executor A executes L55 and sets a lock.
  2. Executor B executes L55 and waits for the lock.
  3. A finishes installation and releases the lock.
  4. Executor B sets the lock and starts executing L55, i.e., re-installing the same package.
  5. A executes L57 but fails due to a partially installed package.

Just my guess. But it may be worth testing.

Contributor Author


Yeah, I ran it in standalone mode several times and it all worked well. But I cannot guarantee it always works without more careful testing; maybe I was just lucky and didn't hit the concurrency issue. I'll figure out a rigorous test to verify it later. Thanks.
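One shape such a test could take (a sketch, not from this PR; assumes a POSIX system for mclapply and an e1071 source tarball in the working directory): install the same source package into one shared library from several processes at once, then inspect the results for partial installs.

library(parallel)

sharedLib <- file.path(tempdir(), "shared-rlib")  # stand-in for a shared dir
dir.create(sharedLib, showWarnings = FALSE, recursive = TRUE)

results <- mclapply(seq_len(8), function(i) {
  tryCatch({
    install.packages("e1071_1.6-8.tar.gz", lib = sharedLib,
                     repos = NULL, type = "source")
    library(e1071, lib.loc = sharedLib)
    "ok"
  }, error = function(e) conditionMessage(e))
}, mc.cores = 8)

print(results)  # any error message here points at a race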

Member


re: here we shouldn't need to install it.

Contributor Author

@yanboliang yanboliang Dec 12, 2016


@mengxr I tested the scenario you mentioned above and found it's not a problem. If the R package already exists and we install the same package to the same directory, R installs it under a separate directory with a different name (00LOCK-e1071). The temporary directory is switched to e1071 on success, and the original one is removed at the same time.


@@ -472,21 +472,17 @@ should fit in a single machine. If that is not the case they can do something li
`dapply`

<div data-lang="r" markdown="1">
{% highlight r %}
# Perform distributed training of multiple models with spark.lapply. Here, we pass
# a read-only list of arguments which specifies the family the generalized linear model should be.
Member

@felixcheung felixcheung Dec 9, 2016


I think it'll be good to have a similar comment in the example ml.R, something like "a read-only list of cost values to be distributed", etc.

Contributor Author


Sounds good, updated.

  model <- glm(Sepal.Length ~ Sepal.Width + Species, iris, family = family)
  summary(model)
}
# Return a list of model's summaries
Member


ditto, but minor here

{% endhighlight %}
Many of the SparkR jobs distributed by `spark.lapply` need support from third-party packages. Rather than installing all necessary packages on all executors in advance,
we could install them during the SparkR interactive session or script. Users can add package files or directories by `spark.addFile` firstly,
download them to every executor node, and install them.
Member

@felixcheung felixcheung Dec 9, 2016


this kind of sounds like the user will need to separately "download them to executor node" - perhaps instead say "by spark.addFile first, which automatically downloads them to every executor node, and then install them"?

Contributor Author


Done.

costs <- exp(seq(from = log(1), to = log(1000), length.out = 5))
train <- function(cost) {
if("e1071" %in% rownames(installed.packages(lib = libDir)) == FALSE) {
install.packages(path, repos = NULL, type = "source")
Member


re: the comment here - if you already have the package content from sparkFiles, you do not need to call install.packages(); I think it would be better without it.

Contributor Author

@yanboliang yanboliang Dec 12, 2016


Yeah, we have the package content, but it's a source package rather than a binary package, so we cannot use library to load it directly. This is the pain point of this example. If we illustrated it with a binary package, we would have to provide scripts for different OS versions, and it would require all nodes in the user's cluster to have the same architecture. So I used a source package; I think it's a more universal example.

@SparkQA

SparkQA commented Dec 12, 2016

Test build #70000 has finished for PR 16214 at commit d3ec5fa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Contributor Author

After rethinking this issue, I found it is not really appropriate as an example shipped with Spark, since it depends on the environment in which users run the script. I'm closing this PR; thanks for all your reviews.

@yanboliang yanboliang closed this Dec 14, 2016
@yanboliang yanboliang deleted the native-r-package branch December 14, 2016 13:20