Closed
10 changes: 10 additions & 0 deletions R/README.md
@@ -1,6 +1,16 @@
# R on Spark

SparkR is an R package that provides a light-weight frontend to use Spark from R.
### Installing sparkR

Libraries of sparkR need to be created in `$SPARK_HOME/R/lib`. This can be done by running the script `$SPARK_HOME/R/install-dev.sh`.
Member

If one were running from RStudio with the steps in "Using SparkR from RStudio" below, they wouldn't have to install or run install-dev.sh, though. Could we clarify that?

Contributor Author

I am running this on a machine with no X server, and hence no RStudio, so this kind of functionality is needed for users like me.

Even for RStudio, I feel SparkR needs to be compiled with R version >= 3.0.

Member

To clarify, I'm referring to the part that says:
"Libraries of sparkR need to be created in `$SPARK_HOME/R/lib`. This can be done by running the script `$SPARK_HOME/R/install-dev.sh`."

Contributor Author

I think this is needed, as mentioned in the section on running SparkR from RStudio. The script there tries to access the lib location, which might not be present by default in the SparkR folder if the wrong version of R is selected by default.

Member

Actually, this line
`.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))`
adds `SPARK_HOME/R/lib` to R's library path and allows any running version of R to load the SparkR package from there. The SparkR package does not need to be installed with `R CMD INSTALL` (in install-dev.sh) at all.

Contributor Author

I am sorry if I was not clear before. What I mean is the following:

For the code `.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))` to work, the directory `$SPARK_HOME/R/lib` needs to exist. When we build Spark using `build/mvn -DskipTests clean package`, this directory is not created by default. Hence we have to run install-dev.sh in order to use SparkR from an R shell.

If we look at the code in install-dev.sh, the following lines actually create the lib directory.

FWDIR="$(cd `dirname $0`; pwd)"
LIB_DIR="$FWDIR/lib"

mkdir -p $LIB_DIR
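
The point made in this comment can be checked from a shell before starting R. What follows is only an illustrative sketch and not part of the PR or of Spark: the function name `sparkr_lib_status` is hypothetical, and it assumes that the mere presence of the `R/lib` directory under the Spark home is what matters, as described above.

```shell
# Hypothetical helper (not part of Spark): report whether the R/lib
# directory discussed above exists under a given Spark home directory.
sparkr_lib_status() {
  spark_home="$1"
  if [ -d "$spark_home/R/lib" ]; then
    echo "present"
  else
    echo "missing"
  fi
}

# Possible usage: run install-dev.sh only when the directory is missing.
# if [ "$(sparkr_lib_status "$SPARK_HOME")" = "missing" ]; then
#   "$SPARK_HOME/R/install-dev.sh"
# fi
```

On a source build that has not yet run install-dev.sh, this would be expected to report `missing`; on a binary release, where the directory ships with the distribution, it would report `present`.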

Member

OK, I think I get your point now.
I guess we are saying this README.md is more for developers, so I'm OK with what you have here.
There are users who are not building Spark from source and are running the binary release, in which case `SPARK_HOME/R/lib` is already there and they would not need to install the SparkR package. Similarly, when running SparkR with a cluster manager, SparkR would not need to be installed on the worker nodes either. I agree these cases are possibly outside the scope of this file.

Contributor Author

Great. So is this PR ready to be merged into master?

By default, the above script uses the system-wide installation of R. To use a different, user-installed copy of R, set the environment variable `R_HOME` to the full path of the base directory where R is installed before running the `install-dev.sh` script.
Example:
```
# where /home/username/R is where R is installed and /home/username/R/bin contains the files R and Rscript
export R_HOME=/home/username/R
./install-dev.sh
```

### SparkR development

11 changes: 9 additions & 2 deletions R/install-dev.sh
@@ -35,12 +35,19 @@ LIB_DIR="$FWDIR/lib"
 mkdir -p $LIB_DIR

 pushd $FWDIR > /dev/null
+if [ ! -z "$R_HOME" ]
+then
+  R_SCRIPT_PATH="$R_HOME/bin"
+else
+  R_SCRIPT_PATH="$(dirname $(which R))"
+fi
+echo "USING R_HOME = $R_HOME"

 # Generate Rd files if devtools is installed
-Rscript -e ' if("devtools" %in% rownames(installed.packages())) { library(devtools); devtools::document(pkg="./pkg", roclets=c("rd")) }'
+"$R_SCRIPT_PATH/"Rscript -e ' if("devtools" %in% rownames(installed.packages())) { library(devtools); devtools::document(pkg="./pkg", roclets=c("rd")) }'

 # Install SparkR to $LIB_DIR
-R CMD INSTALL --library=$LIB_DIR $FWDIR/pkg/
+"$R_SCRIPT_PATH/"R CMD INSTALL --library=$LIB_DIR $FWDIR/pkg/

 # Zip the SparkR package so that it can be distributed to worker nodes on YARN
 cd $LIB_DIR
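
The `R_HOME` handling in this diff can be read as a small lookup: prefer `$R_HOME/bin` when `R_HOME` is set, otherwise use the directory of whichever `R` is found on `PATH`. The sketch below restates that logic as a standalone function for illustration only. It is not the code in the PR: the function name `resolve_r_bin_dir` is hypothetical, it swaps `which R` for the shell built-in `command -v R` and `[ ! -z ... ]` for `[ -n ... ]`, and it adds an explicit failure branch for the case where no R is found, which the diff does not have.

```shell
# Hypothetical helper (not the PR's code): print the directory that
# holds the R and Rscript executables.
resolve_r_bin_dir() {
  if [ -n "$R_HOME" ]; then
    # A user-supplied R installation takes precedence.
    printf '%s\n' "$R_HOME/bin"
  elif r_path="$(command -v R)"; then
    # Fall back to the R found on PATH (the system-wide installation).
    dirname "$r_path"
  else
    # Assumption: failing loudly is preferable to continuing with an
    # empty path; the diff above has no such branch.
    echo "R not found: set R_HOME or add R to PATH" >&2
    return 1
  fi
}

# Possible usage:
# R_SCRIPT_PATH="$(resolve_r_bin_dir)" || exit 1
# "$R_SCRIPT_PATH/Rscript" --version
```

Quoting `"$R_HOME/bin"` and the result of `command -v` keeps the lookup working when the installation path contains spaces, which the unquoted `$(dirname $(which R))` form may not.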