
[SPARK-18325][SPARKR] Add example for using native R package. #16214

Closed
wants to merge 4 commits

Conversation

yanboliang
Contributor

@yanboliang yanboliang commented Dec 8, 2016

What changes were proposed in this pull request?

SparkR provides the ability to distribute computations via spark.lapply. In many cases, the distributed computation depends on third-party R packages, so we need to install the corresponding packages on each executor in advance, which is tedious and error-prone. Since SparkR is an interactive analytics tool and users may load many native R packages during a session, it's difficult to install all necessary packages before starting the SparkR session, so I think it's helpful to provide an example illustrating how to handle this scenario in an elegant way.
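A minimal sketch of the pattern, stitched together from the snippets reviewed below; the tarball file name and the svm-based training body are illustrative, not the exact code in this PR:

library(SparkR)
sparkR.session(appName = "SparkR-native-r-package-example")

# Distribute the source package; Spark ships it to every executor node.
spark.addFile("e1071_1.6-8.tar.gz")  # hypothetical local tarball path

train <- function(cost) {
  # Resolve the shipped file on the executor and install it on first use.
  path <- spark.getSparkFiles("e1071_1.6-8.tar.gz")
  if (!("e1071" %in% rownames(installed.packages()))) {
    install.packages(path, repos = NULL, type = "source")
  }
  library(e1071)
  model <- svm(Species ~ ., data = iris, cost = cost)
  summary(model)
}

# Train one model per cost value, in parallel across the cluster.
costs <- exp(seq(from = log(1), to = log(1000), length.out = 5))
summaries <- spark.lapply(costs, train)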

How was this patch tested?

Offline test.

@yanboliang yanboliang changed the title [[SPARK-18325]][SPARKR] Add example for using native R package. [SPARK-18325][SPARKR] Add example for using native R package. Dec 8, 2016
@yanboliang
Contributor Author

@felixcheung @shivaram

@SparkQA

SparkQA commented Dec 8, 2016

Test build #69861 has finished for PR 16214 at commit 9b15bb3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

Great - I think we should be careful with example code that installs stuff automatically, especially if that could run on a cluster without any change.

Perhaps we should have a blocking prompt at the beginning with a warning message?

Also, it would be better to have more description of what it is doing, where things would go (e.g. where the package would be installed to), and why.
And perhaps a section in the programming guide? Or is that too public (depending on how much we want to promote such things)?
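For instance, a blocking prompt along these lines could work (a sketch, assuming an interactive session; readline() returns "" when run non-interactively):

# Warn before touching any executor's library (illustrative wording).
ans <- readline("This example installs R packages on every executor node. Continue? [y/N] ")
if (tolower(ans) != "y") {
  stop("Aborted: no packages were installed.")
}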

sparkR.session(appName = "SparkR-native-r-package-example")

# Get the location of the default library
libDir <- .libPaths()[1]
Member


install.packages() already defaults to the default library.
As discussed before, perhaps it would be more useful to set this to the application directory (e.g. with YARN), because the default libPath is usually secured?
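A sketch of that idea, with an illustrative directory name: prepend a writable, application-scoped directory to the library search path instead of relying on the (often write-protected) default library.

# Use a writable, per-application library directory (name is illustrative).
libDir <- file.path(tempdir(), "sparkr-app-lib")
dir.create(libDir, showWarnings = FALSE, recursive = TRUE)
.libPaths(c(libDir, .libPaths()))  # searched first when installing/loading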

Contributor Author


Good suggestion!

# Perform distributed training of multiple models with spark.lapply
costs <- exp(seq(from = log(1), to = log(1000), length.out = 5))
train <- function(cost) {
if("e1071" %in% rownames(installed.packages(libDir)) == FALSE) {
Member


I'd prefer installed.packages(lib = libDir) to be more clear

path <- spark.getSparkFiles(filename)
costs <- exp(seq(from = log(1), to = log(1000), length.out = 5))
train <- function(cost) {
if("e1071" %in% rownames(installed.packages(libDir)) == FALSE) {
Member


ditto

costs <- exp(seq(from = log(1), to = log(1000), length.out = 5))
train <- function(cost) {
if("e1071" %in% rownames(installed.packages(libDir)) == FALSE) {
install.packages(path, repos=NULL, type="source")
Member


Although this is an example of how an R package can be distributed, I wouldn't call install.packages here, because of the secure-location and duplication issues.

Instead, this could do library(e1071, lib.loc = path) - i.e. the package doesn't need to be "installed" to be loaded.
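A sketch of this loading-without-installing approach, assuming the package was distributed as an already-built package directory (e.g. added with spark.addFile(..., recursive = TRUE)) rather than as a source tarball:

# On the executor: locate the shipped, ready-built package directory and load
# it in place, without calling install.packages().
pkgDir <- spark.getSparkFiles("e1071")        # hypothetical directory name
library(e1071, lib.loc = dirname(pkgDir))

As yanboliang notes later in the thread, this only works when the distributed content is a built package, not a source tarball.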


Contributor Author


I removed the test which directly installs R packages from CRAN, considering that most users don't allow the cluster to connect to the internet.

@mengxr
Contributor

mengxr commented Dec 8, 2016

@yanboliang What happens if there are multiple executors on the same machine? Are there concurrency issues?

@SparkQA

SparkQA commented Dec 9, 2016

Test build #69908 has finished for PR 16214 at commit ce18e2e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Contributor Author

yanboliang commented Dec 9, 2016

@mengxr I think there is no concurrency issue (at least I did not find one) when multiple executors run on the same machine, since native R has a lock to protect against this. You can verify it by installing a package to a shared directory on a single node from multiple threads. However, we should recommend installing packages to an executor-associated directory (or a random temporary directory), which means each executor has its own R lib directory even if they are on the same machine. For YARN mode, it works well since each executor natively has its own R lib directory.
@felixcheung I updated the examples and added a corresponding section in the SparkR user guide. I'm not sure whether such an example is appropriate, since it has many dependencies on the environment. But I think for an interactive analytical tool, installing packages during the session is not very rare. Still, I'm open to hearing your thoughts, thanks.
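A sketch of the executor-associated directory idea (file and directory names illustrative): each worker process installs into its own library path, so executors sharing a machine never write to the same location.

train <- function(cost) {
  # Per-process library directory; tempdir() is already unique per R session,
  # and the pid suffix makes the isolation explicit.
  libDir <- file.path(tempdir(), paste0("sparkr-lib-", Sys.getpid()))
  dir.create(libDir, showWarnings = FALSE, recursive = TRUE)
  path <- spark.getSparkFiles("e1071_1.6-8.tar.gz")
  if (!("e1071" %in% rownames(installed.packages(lib.loc = libDir)))) {
    install.packages(path, lib = libDir, repos = NULL, type = "source")
  }
  library(e1071, lib.loc = libDir)
  # ... train and summarize the model as in the example ...
}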

if("e1071" %in% rownames(installed.packages(lib = libDir)) == FALSE) {
install.packages(path, repos = NULL, type = "source")
}
library(e1071)
Contributor


Did you try standalone mode? Even if R installation has locks, this might have concurrency issues, e.g.:

  1. Executor A executes L55 and sets a lock.
  2. Executor B executes L55 and waits for the lock.
  3. A finishes installation and releases the lock.
  4. Executor B sets the lock and starts executing L55, i.e., re-installing the same package.
  5. A executes L57 but fails due to a partially installed package.

Just my guess. But it may be worth testing.

Contributor Author


Yeah, I ran it in standalone mode several times and it all worked well. But I cannot guarantee it always works without more careful testing; maybe I was just lucky and didn't hit the concurrency issue. I'll figure out a rigorous test to verify it later. Thanks.
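One shape such a test could take (a sketch, not from this PR; assumes a POSIX system for mclapply and an e1071 source tarball in the working directory): install the same source package into one shared library from several processes at once, then inspect the results for partial installs.

library(parallel)

sharedLib <- file.path(tempdir(), "shared-rlib")  # stand-in for a shared dir
dir.create(sharedLib, showWarnings = FALSE, recursive = TRUE)

results <- mclapply(seq_len(8), function(i) {
  tryCatch({
    install.packages("e1071_1.6-8.tar.gz", lib = sharedLib,
                     repos = NULL, type = "source")
    library(e1071, lib.loc = sharedLib)
    "ok"
  }, error = function(e) conditionMessage(e))
}, mc.cores = 8)

print(results)  # any error message here points at a race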

Member


re: here we shouldn't need to install it.

Contributor Author

@yanboliang yanboliang Dec 12, 2016


@mengxr I tested the scenario you mentioned above and found it's not a problem. If the R package already exists and we install the same package to the same directory, R installs it under a separate directory with a different name (00LOCK-e1071). The temporary directory is switched to e1071 on success, and the original one is removed at the same time.


@@ -472,21 +472,17 @@ should fit in a single machine. If that is not the case they can do something li
`dapply`

<div data-lang="r" markdown="1">
{% highlight r %}
# Perform distributed training of multiple models with spark.lapply. Here, we pass
# a read-only list of arguments which specifies the family the generalized linear model should be.
Member

@felixcheung felixcheung Dec 9, 2016


I think it'll be good to have a similar comment in the example ml.R, something like "a read-only list of cost values to be distributed", etc.

Contributor Author


Sounds good, updated.

  model <- glm(Sepal.Length ~ Sepal.Width + Species, iris, family = family)
  summary(model)
}
# Return a list of model's summaries
Member


ditto, but minor here

{% endhighlight %}
Many of the SparkR jobs distributed by `spark.lapply` need support from third-party packages. Rather than installing all necessary packages on all executors in advance,
we could install them during the SparkR interactive session or script. Users can add package files or directories by `spark.addFile` firstly,
download them to every executor node, and install them.
Member

@felixcheung felixcheung Dec 9, 2016


this kind of sounds like the user will need to separately "download them to executor node" - perhaps instead say "by spark.addFile first, which automatically downloads them to every executor node, and then install them"?

Contributor Author


Done.

costs <- exp(seq(from = log(1), to = log(1000), length.out = 5))
train <- function(cost) {
if("e1071" %in% rownames(installed.packages(lib = libDir)) == FALSE) {
install.packages(path, repos = NULL, type = "source")
Member


re: the comment here - if you already have the package content from sparkFiles, you do not need to call install.packages(); I think it would be better without it.

Contributor Author

@yanboliang yanboliang Dec 12, 2016


Yeah, we have the package content, but it's a source package rather than a binary package, so we cannot use library to load it directly. This is the pain point of this example. If we illustrated it with a binary package, we would have to provide scripts for different OS versions, and it would require all nodes in the user's cluster to have the same architecture. So I used a source package; I think it's a more universal example.

@SparkQA

SparkQA commented Dec 12, 2016

Test build #70000 has finished for PR 16214 at commit d3ec5fa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Contributor Author

After rethinking this issue, I found it is not really appropriate as an example shipped with Spark, since it depends on the environment in which users run the script. I'm closing this PR; thanks for all your reviews.

@yanboliang yanboliang closed this Dec 14, 2016
@yanboliang yanboliang deleted the native-r-package branch December 14, 2016 13:20