# [SPARK-5654] Integrate SparkR #5096
**`.gitignore` (modified)**

```
@@ -67,3 +67,5 @@ logs
 .*scalastyle-output.xml
 .*dependency-reduced-pom.xml
 known_translations
+DESCRIPTION
+NAMESPACE
```
**`R/.gitignore` (new file)**

```
*.o
*.so
*.Rd
lib
pkg/man
pkg/html
```
**`R/DOCUMENTATION.md` (new file)**

# SparkR Documentation

SparkR documentation is generated using in-source comments annotated with
`roxygen2`. After making changes to the documentation, you can generate the man
pages by running the following from an R console in the SparkR home directory:

    library(devtools)
    devtools::document(pkg="./pkg", roclets=c("rd"))

You can verify that your changes are good by running

    R CMD check pkg/
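To illustrate what those in-source annotations look like, here is a minimal, hypothetical example; the function `addTwo` and its tags are invented for illustration and are not part of the SparkR sources. `devtools::document()` turns comments like these into Rd man pages:

```
# Hypothetical roxygen2-annotated function -- illustration only.
#' Add two numbers
#'
#' @param x A numeric value.
#' @param y A numeric value.
#' @return The sum of \code{x} and \code{y}.
#' @export
addTwo <- function(x, y) {
  x + y
}
```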
**`R/README.md` (new file)**

# R on Spark

SparkR is an R package that provides a lightweight frontend for using Spark from R.

### SparkR development

#### Build Spark

Build Spark with [Maven](http://spark.apache.org/docs/latest/building-spark.html#building-with-buildmvn) and include the `-Psparkr` profile to build the R package. For example, to use the default Hadoop versions you can run
```
build/mvn -DskipTests -Psparkr package
```

#### Running sparkR

You can start using SparkR by launching the SparkR shell with

    ./bin/sparkR

The `sparkR` script automatically creates a SparkContext, running Spark in local
mode by default. To specify the Spark master of a cluster for the automatically
created SparkContext, you can run

    ./bin/sparkR --master "local[2]"

To set other options such as driver memory or executor memory, you can pass the [spark-submit](http://spark.apache.org/docs/latest/submitting-applications.html) arguments to `./bin/sparkR`.

#### Using SparkR from RStudio

If you wish to use SparkR from RStudio or another R frontend, you will need to set some environment variables that point SparkR to your Spark installation. For example:
```
# Set this to where Spark is installed
Sys.setenv(SPARK_HOME="/Users/shivaram/spark")
# This line loads SparkR from the installed directory
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
sc <- sparkR.init(master="local")
```
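As a quick sanity check after the setup above, you can try a short session. This is a sketch only: function names such as `sparkRSQL.init` and `createDataFrame` are assumptions based on the `SQLContext.R` and `DataFrame.R` sources added in this PR, not a confirmed API reference.

```
# Hypothetical smoke test -- the names below are assumptions, not a documented API.
sqlContext <- sparkRSQL.init(sc)             # assumed SQLContext helper
df <- createDataFrame(sqlContext, faithful)  # convert a local data.frame
head(df)                                     # print the first few rows
sparkR.stop()                                # assumed shutdown helper from sparkR.R
```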
#### Making changes to SparkR

The [instructions](https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark) for making contributions to Spark also apply to SparkR.

> **Review comment:** Post-merge, we can update the wiki to include R-specific instructions.

If you only make R file changes (i.e. no Scala changes), you can just re-install the R package using `R/install-dev.sh` and test your changes.
Once you have made your changes, please include unit tests for them and run the existing unit tests using the `run-tests.sh` script as described below.

#### Generating documentation

The SparkR documentation (Rd files and HTML files) is not part of the source repository. To generate it, you can run the script `R/create-docs.sh`. This script uses `devtools` and `knitr` to generate the docs, so these packages need to be installed on the machine before using the script.
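For convenience, here is a hedged example of installing those two prerequisites from an R session, mirroring the `testthat` install shown later; the CRAN mirror is just an example:

```
# Install the documentation prerequisites used by R/create-docs.sh
install.packages(c("devtools", "knitr"), repos = "http://cran.us.r-project.org")
```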
### Examples, Unit tests

SparkR comes with several sample programs in the `examples/src/main/r` directory.
To run one of them, use `./bin/sparkR <filename> <args>`. For example:

    ./bin/sparkR examples/src/main/r/pi.R local[2]

You can also run the unit tests for SparkR by running the following (you need to install the [testthat](http://cran.r-project.org/web/packages/testthat/index.html) package first):

    R -e 'install.packages("testthat", repos="http://cran.us.r-project.org")'
    ./R/run-tests.sh

### Running on YARN
The `./bin/spark-submit` and `./bin/sparkR` scripts can also be used to submit jobs to YARN clusters. You will need to set the YARN configuration directory before doing so. For example, on CDH you can run
```
export YARN_CONF_DIR=/etc/hadoop/conf
./bin/spark-submit --master yarn examples/src/main/r/pi.R 4
```
**`R/WINDOWS.md` (new file)**

## Building SparkR on Windows

To build SparkR on Windows, the following steps are required:

1. Install R (>= 3.1) and [Rtools](http://cran.r-project.org/bin/windows/Rtools/). Make sure to include Rtools and R in `PATH`.
2. Install [JDK7](http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html) and set `JAVA_HOME` in the system environment variables.
3. Download and install [Maven](http://maven.apache.org/download.html). Also include Maven's `bin` directory in `PATH`.
4. Set `MAVEN_OPTS` as described in [Building Spark](http://spark.apache.org/docs/latest/building-spark.html).
5. Open a command shell (`cmd`) in the Spark directory and run `mvn -DskipTests -Psparkr package`.
**`R/create-docs.sh` (new file)**

```
#!/bin/bash

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# Script to create API docs for SparkR
# This requires `devtools` and `knitr` to be installed on the machine.

# After running this script the html docs can be found in
# $SPARK_HOME/R/pkg/html

# Figure out where the script is
export FWDIR="$(cd "`dirname "$0"`"; pwd)"

pushd $FWDIR

# Generate Rd file
Rscript -e 'library(devtools); devtools::document(pkg="./pkg", roclets=c("rd"))'

# Install the package
./install-dev.sh

# Now create HTML files

# knit_rd puts html in current working directory
mkdir -p pkg/html
pushd pkg/html

Rscript -e 'library(SparkR, lib.loc="../../lib"); library(knitr); knit_rd("SparkR")'

popd

popd
```

> **Review comment:** Can this script be wired up so that our normal doc generation invokes it?
> **Reply:** Done
**`R/install-dev.bat` (new file)**

```
@echo off

rem
rem Licensed to the Apache Software Foundation (ASF) under one or more
rem contributor license agreements. See the NOTICE file distributed with
rem this work for additional information regarding copyright ownership.
rem The ASF licenses this file to You under the Apache License, Version 2.0
rem (the "License"); you may not use this file except in compliance with
rem the License. You may obtain a copy of the License at
rem
rem http://www.apache.org/licenses/LICENSE-2.0
rem
rem Unless required by applicable law or agreed to in writing, software
rem distributed under the License is distributed on an "AS IS" BASIS,
rem WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
rem See the License for the specific language governing permissions and
rem limitations under the License.
rem

rem Install development version of SparkR
rem

set SPARK_HOME=%~dp0..

MKDIR %SPARK_HOME%\R\lib

R.exe CMD INSTALL --library="%SPARK_HOME%\R\lib" %SPARK_HOME%\R\pkg\
```
**`R/install-dev.sh` (new file)**

```
#!/bin/bash

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# This script packages the SparkR source files (R and C files) and
# creates a package that can be loaded in R. The package is by default installed to
# $FWDIR/lib and the package can be loaded by using the following command in R:
#
#   library(SparkR, lib.loc="$FWDIR/lib")
#
# NOTE(shivaram): Right now we use $SPARK_HOME/R/lib to be the installation directory
# to load the SparkR package on the worker nodes.


FWDIR="$(cd `dirname $0`; pwd)"
LIB_DIR="$FWDIR/lib"

mkdir -p $LIB_DIR

# Install the SparkR package into $LIB_DIR
R CMD INSTALL --library=$LIB_DIR $FWDIR/pkg/
```
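As a brief usage sketch (assuming `SPARK_HOME` points at your Spark checkout, per the note in the script above), the package installed by this script can then be loaded and used from an R session:

```
# Load SparkR from the library directory that install-dev.sh populates
library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))
# Start a local SparkContext (sparkR.init is shown in R/README.md above)
sc <- sparkR.init(master = "local")
```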
**`R/log4j.properties` (new file)**

```
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# Set everything to be logged to the file R-unit-tests.log
log4j.rootCategory=INFO, file
log4j.appender.file=org.apache.log4j.FileAppender
log4j.appender.file.append=true
log4j.appender.file.file=R-unit-tests.log
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss.SSS} %t %p %c{1}: %m%n

# Ignore messages below warning level from Jetty, because it's a bit verbose
log4j.logger.org.eclipse.jetty=WARN
org.eclipse.jetty.LEVEL=WARN
```
**`R/pkg/DESCRIPTION` (new file)**

```
Package: SparkR
Type: Package
Title: R frontend for Spark
Version: 1.4.0
Date: 2013-09-09
Author: The Apache Software Foundation
Maintainer: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Imports:
    methods
Depends:
    R (>= 3.0),
    methods,
Suggests:
    testthat
Description: R frontend for Spark
License: Apache License (== 2.0)
Collate:
    'generics.R'
    'jobj.R'
    'SQLTypes.R'
    'RDD.R'
    'pairRDD.R'
    'column.R'
    'group.R'
    'DataFrame.R'
    'SQLContext.R'
    'broadcast.R'
    'context.R'
    'deserialize.R'
    'serialize.R'
    'sparkR.R'
    'backend.R'
    'client.R'
    'utils.R'
    'zzz.R'
```
> **Review comment:** Can you add a TODO comment here to merge this into the existing Spark docs?