ATTENTION: This package should be considered an experimental beta version only. I have yet to test it on a real cluster :)
Please see SparkR for the official Apache-supported Spark / R interface.
rddlist implements distributed computation on an R list with an Apache Spark backend. This allows an R programmer to use familiar operations like lapply and mapply from within R to perform computation on larger data sets using a Spark cluster. This is a powerful combination: any data in R can be represented with a list, and the *apply operations allow the application of arbitrary user-defined functions, provided those functions are pure. So this should work well for embarrassingly parallel problems.
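For reference, this is the plain serial pattern in base R that rddlist distributes; nothing below is specific to rddlist:

    x <- list(1:10, letters, rnorm(10))
    # Apply a pure function to every element of the list, serially
    lapply(x, summary)
    # Combine two lists elementwise, keeping the result as a list
    mapply(c, x, list(21:30, LETTERS, rnorm(10)), SIMPLIFY = FALSE)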
Under the hood, rddlist works by serializing each element of the list into a byte array and storing it in a Spark Pair RDD (Resilient Distributed Dataset).
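Conceptually, the serialization half of that step looks like the sketch below in base R. This is only an illustration of the idea; the keys and storage details inside rddlist itself may differ.

    # Sketch only: assume each list element becomes a (key, byte array)
    # pair, with the list index as the key. Not rddlist's actual code.
    x <- list(1:10, letters, rnorm(10))
    pairs <- lapply(seq_along(x), function(i) {
        list(key = i, value = serialize(x[[i]], connection = NULL))
    })
    class(pairs[[1]]$value)  # "raw", i.e. a byte array ready for Spark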
sparklyr provides the Spark connection.
    # Install a local copy of Spark (one-time setup)
    library(sparklyr)
    spark_install(version = "2.0.0")

    # Connect to Spark and distribute a list
    library(sparklyr)
    library(rddlist)
    sc <- spark_connect(master = "local", version = "2.0.0")
    x <- list(1:10, letters, rnorm(10))
    xrdd <- rddlist(sc, x)
xrdd is an object in the local R session referencing the actual data
residing in Spark.
Collecting deserializes the object from Spark into local R.
    x2 <- collect(xrdd)
    identical(x, x2)
There is an exact correspondence between the structure in Spark and the structure in local R, which simplifies reasoning about how the data is stored in Spark.
The [[ operator will also collect.
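For example, the line below assumes [[ takes an element index and collects just that element into local R:

    # Assumed usage: collect only the first element of the distributed list
    xrdd[[1]]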
lapply_rdd and mapply_rdd work similarly to their counterparts lapply and mapply in base R.
    first3 <- lapply_rdd(xrdd, function(x) x[1:3])
    collect(first3)
    yrdd <- rddlist(sc, list(21:30, LETTERS, rnorm(10)))
    xyrdd <- mapply_rdd(c, xrdd, yrdd)
    collect(xyrdd)
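As a sanity check, the distributed result should agree with base R's mapply on the same data. This variant of the example above keeps a local copy of y so the two can be compared:

    # Keep y locally so the distributed result can be checked against base R
    y <- list(21:30, LETTERS, rnorm(10))
    yrdd <- rddlist(sc, y)
    xyrdd <- mapply_rdd(c, xrdd, yrdd)
    identical(collect(xyrdd), mapply(c, x, y, SIMPLIFY = FALSE))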