From 5c0bb24bd77a6e1ed4474144f14b6458cdd2c157 Mon Sep 17 00:00:00 2001
From: Felix Cheung
Date: Sun, 1 Mar 2015 22:20:41 -0800
Subject: [PATCH 1/4] Doc updates: build and running on YARN

---
 BUILDING.md |  9 +++------
 README.md   | 41 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 44 insertions(+), 6 deletions(-)

diff --git a/BUILDING.md b/BUILDING.md
index c1929f94bd65e..08d9a8129009f 100644
--- a/BUILDING.md
+++ b/BUILDING.md
@@ -7,11 +7,8 @@ include Rtools and R in `PATH`.
 2. Install
 [JDK7](http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html)
 and set `JAVA_HOME` in the system environment variables.
-3. Install `rJava` using `install.packages(rJava)`. If rJava fails to load due to missing jvm.dll,
-you will need to add the directory containing jvm.dll to `PATH`. See this [stackoverflow post](http://stackoverflow.com/a/7604469]
-for more details.
-4. Download and install [Maven](http://maven.apache.org/download.html). Also include the `bin`
+3. Download and install [Maven](http://maven.apache.org/download.html). Also include the `bin`
 directory in Maven in `PATH`.
-5. Get SparkR source code either using [`git]`(http://git-scm.com/downloads) or by downloading a
+4. Get SparkR source code either using [`git`](http://git-scm.com/downloads) or by downloading a
 source zip from github.
-6. Open a command shell (`cmd`) in the SparkR directory and run `install-dev.bat`
+5. Open a command shell (`cmd`) in the SparkR directory and run `install-dev.bat`
diff --git a/README.md b/README.md
index 6d6b097222ade..fa4655180ca73 100644
--- a/README.md
+++ b/README.md
@@ -46,6 +46,22 @@ the environment variable `USE_MAVEN=1`. For example
 If you are building SparkR from behind a proxy, you can [setup maven](https://maven.apache.org/guides/mini/guide-proxies.html)
 to use the right proxy server.
+#### Building from GitHub source
+
+Run the following within R to pull source code from GitHub and build locally. It is possible
+to specify the Spark and Hadoop versions to build against by starting R with environment variables:
+
+1. Start R
+```
+SPARK_VERSION=1.2.0 SPARK_HADOOP_VERSION=2.5.0 R
+```
+
+2. Run install_github
+```
+library(devtools)
+install_github("repo/SparkR-pkg", ref="branchname", subdir="pkg")
+```
+*Note: replace `repo` and `branchname` with the actual repository and branch.*
 
 ## Running sparkR
 
 If you have cloned and built SparkR, you can start using it by launching the SparkR
@@ -110,10 +126,23 @@ cd SparkR-pkg/
 USE_YARN=1 SPARK_YARN_VERSION=2.4.0 SPARK_HADOOP_VERSION=2.4.0 ./install-dev.sh
 ```
+Alternatively, `install_github` can be used (on CDH in this case):
+
+```
+# assume the devtools package is installed by install.packages("devtools")
+USE_YARN=1 SPARK_VERSION=1.1.0 SPARK_YARN_VERSION=2.5.0-cdh5.3.0 SPARK_HADOOP_VERSION=2.5.0-cdh5.3.0 R
+```
+Then within R,
+```
+library(devtools)
+install_github("amplab-extras/SparkR-pkg", ref="master", subdir="pkg")
+```
+
 Before launching an application, make sure each worker node has a local copy of `lib/SparkR/sparkr-assembly-0.1.jar`.
 With a cluster launched with the `spark-ec2` script, do:
 ```
 ~/spark-ec2/copy-dir ~/SparkR-pkg
 ```
+Or run the above installation steps on all worker nodes.
 Finally, when launching an application, the environment variable `YARN_CONF_DIR` needs to be set to the directory which contains
 the client-side configuration files for the Hadoop cluster (with a cluster launched with `spark-ec2`, this defaults to `/root/ephemeral-hdfs/conf/`):
 ```
@@ -121,6 +150,18 @@ YARN_CONF_DIR=/root/ephemeral-hdfs/conf/ MASTER=yarn-client ./sparkR
 YARN_CONF_DIR=/root/ephemeral-hdfs/conf/ ./sparkR examples/pi.R yarn-client
 ```
+### Using sparkR-submit
+sparkR-submit is a script introduced to faciliate submission of SparkR jobs to a YARN cluster.
+It supports the same command-line parameters as [spark-submit](http://spark.apache.org/docs/latest/running-on-yarn.html). SPARK_HOME, YARN_HOME, and JAVA_HOME must be defined.
+
+(On CDH 5.3.0)
+```
+export SPARK_HOME=/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/spark
+export YARN_CONF_DIR=/etc/hadoop/conf
+export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera
+/usr/lib64/R/library/SparkR/sparkR-submit --master yarn-client examples/pi.R yarn-client 4
+```
+
 ## Report Issues/Feedback
 
 For better tracking and collaboration, issues and TODO items are reported to a
 dedicated [SparkR JIRA](https://sparkr.atlassian.net/browse/SPARKR/).

From 03402ebdef99be680c4d0c9c475fd08702d3eb9e Mon Sep 17 00:00:00 2001
From: Felix Cheung
Date: Mon, 2 Mar 2015 16:17:17 -0800
Subject: [PATCH 2/4] Updates as per feedback on sparkR-submit

---
 README.md | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index fa4655180ca73..027614ab74808 100644
--- a/README.md
+++ b/README.md
@@ -150,11 +150,16 @@ YARN_CONF_DIR=/root/ephemeral-hdfs/conf/ MASTER=yarn-client ./sparkR
 YARN_CONF_DIR=/root/ephemeral-hdfs/conf/ ./sparkR examples/pi.R yarn-client
 ```
-### Using sparkR-submit
-sparkR-submit is a script introduced to faciliate submission of SparkR jobs to a YARN cluster.
-It supports the same command-line parameters as [spark-submit](http://spark.apache.org/docs/latest/running-on-yarn.html). SPARK_HOME, YARN_HOME, and JAVA_HOME must be defined.
+## Running on a cluster using sparkR-submit
 
-(On CDH 5.3.0)
+sparkR-submit is a script introduced to faciliate submission of SparkR jobs to a Spark supported cluster (e.g. Standalone, Mesos, YARN).
+It supports the same command-line parameters as [spark-submit](http://spark.apache.org/docs/latest/submitting-applications.html). SPARK_HOME and JAVA_HOME must be defined.
+
+On YARN, YARN_HOME must be defined. Currently, SparkR only supports [yarn-client](http://spark.apache.org/docs/latest/running-on-yarn.html) mode.
+
+sparkR-submit is installed with the SparkR package. By default, it can be found under the default Library (the ['library'](https://stat.ethz.ch/R-manual/R-devel/library/base/html/libPaths.html) subdirectory of R_HOME).
+
+For example, to run on YARN (CDH 5.3.0),
 ```
 export SPARK_HOME=/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/spark
 export YARN_CONF_DIR=/etc/hadoop/conf
 export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera

From e2d144a798f8ef293467ed8a3eb20b6cf77dcb56 Mon Sep 17 00:00:00 2001
From: Felix Cheung
Date: Mon, 2 Mar 2015 17:52:10 -0800
Subject: [PATCH 3/4] Fixed small typos

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 027614ab74808..92ee42adf8363 100644
--- a/README.md
+++ b/README.md
@@ -152,10 +152,10 @@ YARN_CONF_DIR=/root/ephemeral-hdfs/conf/ ./sparkR examples/pi.R yarn-client
 ```
 
 ## Running on a cluster using sparkR-submit
 
-sparkR-submit is a script introduced to faciliate submission of SparkR jobs to a Spark supported cluster (e.g. Standalone, Mesos, YARN).
+sparkR-submit is a script introduced to facilitate submission of SparkR jobs to a Spark supported cluster (e.g. Standalone, Mesos, YARN).
 It supports the same command-line parameters as [spark-submit](http://spark.apache.org/docs/latest/submitting-applications.html). SPARK_HOME and JAVA_HOME must be defined.
 
-On YARN, YARN_HOME must be defined. Currently, SparkR only supports [yarn-client](http://spark.apache.org/docs/latest/running-on-yarn.html) mode.
+On YARN, YARN_HOME must be defined. Currently, SparkR only supports the [yarn-client](http://spark.apache.org/docs/latest/running-on-yarn.html) mode.
 
 sparkR-submit is installed with the SparkR package. By default, it can be found under the default Library (the ['library'](https://stat.ethz.ch/R-manual/R-devel/library/base/html/libPaths.html) subdirectory of R_HOME).

From 2e7b19002918a1e447efe4b0b43af181c6b49844 Mon Sep 17 00:00:00 2001
From: Felix Cheung
Date: Tue, 3 Mar 2015 13:41:25 -0800
Subject: [PATCH 4/4] small update on yarn deploy mode.

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 92ee42adf8363..92bc035b87842 100644
--- a/README.md
+++ b/README.md
@@ -155,7 +155,7 @@ YARN_CONF_DIR=/root/ephemeral-hdfs/conf/ ./sparkR examples/pi.R yarn-client
 sparkR-submit is a script introduced to facilitate submission of SparkR jobs to a Spark supported cluster (e.g. Standalone, Mesos, YARN).
 It supports the same command-line parameters as [spark-submit](http://spark.apache.org/docs/latest/submitting-applications.html). SPARK_HOME and JAVA_HOME must be defined.
 
-On YARN, YARN_HOME must be defined. Currently, SparkR only supports the [yarn-client](http://spark.apache.org/docs/latest/running-on-yarn.html) mode.
+On YARN, YARN_CONF_DIR must be defined. sparkR-submit supports [YARN deploy modes](http://spark.apache.org/docs/latest/running-on-yarn.html): yarn-client and yarn-cluster.
 
 sparkR-submit is installed with the SparkR package. By default, it can be found under the default Library (the ['library'](https://stat.ethz.ch/R-manual/R-devel/library/base/html/libPaths.html) subdirectory of R_HOME).
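
The final README text above requires SPARK_HOME and JAVA_HOME to be defined (plus YARN_CONF_DIR on YARN) before calling sparkR-submit. The snippet below is an illustrative pre-flight check built around those variable names; the `check_sparkr_env` function itself is a sketch added here for clarity, not part of SparkR, and the example paths are the CDH 5.3.0 ones shown in the patch:

```shell
#!/usr/bin/env bash
# Sketch: verify the environment variables named in the patched README are set
# before invoking sparkR-submit. check_sparkr_env is a hypothetical helper.
check_sparkr_env() {
  local var missing=""
  for var in SPARK_HOME JAVA_HOME YARN_CONF_DIR; do
    # ${!var} is bash indirect expansion: the value of the variable named $var
    [ -n "${!var:-}" ] || missing="$missing $var"
  done
  if [ -n "$missing" ]; then
    echo "error: unset:$missing"
    return 1
  fi
  echo "environment ok"
}

# Example usage with the CDH 5.3.0 paths from the patch; adjust for your cluster.
export SPARK_HOME=/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/spark
export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera
export YARN_CONF_DIR=/etc/hadoop/conf
check_sparkr_env
```

If the check passes, sparkR-submit can then be invoked exactly as shown in the README (e.g. `$R_LIBS/SparkR/sparkR-submit --master yarn-client examples/pi.R yarn-client 4`).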