Implements some methods of an R list as a Spark RDD (resilient distributed dataset)
R Makefile
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.

README.Rmd

title output
rddlist
github_document
fig_width fig_height
9
5

ATTENTION: This package should be considered as an experimental beta version only. I have yet to test it on a real cluster :)

Please see SparkR for the official Apache supported Spark / R interface.

rddlist Implements distributed computation on an R list with an Apache Spark backend. This allows an R programmer to use familiar operations like [[, lapply, and mapply from within R to perform computation on larger data sets using a Spark cluster. This is a powerful combination, as any data in R can be represented with a list, and *apply operations allow the application of arbitrary user defined functions, provided they are pure. So this should work well for embarrassingly parallel problems.

The main purpose of this project is to serve as an object that will connect Spark to the more general ddR project (distributed data in R). Work supported by R-Consortium.

Under the hood this works by serializing each element of the list into a byte array and storing it in a Spark Pair RDD (resilient distributed data set).

Examples

sparklyr provides the spark connections.

library(sparklyr)
spark_install(version = "2.0.0")
library(sparklyr)
library(rddlist)

sc <- spark_connect(master = "local", version = "2.0.0")

x <- list(1:10, letters, rnorm(10))
xrdd <- rddlist(sc, x)

xrdd is an object in the local R session referencing the actual data residing in Spark. Collecting deserializes the object from Spark into local R.

x2 <- collect(xrdd)
identical(x, x2)

There is an exact correspondence between the structures in Spark and local R to simplify reasoning about how the data is stored in Spark.

[[ will also collect.

xrdd[[1]]

lapply_rdd and mapply_rdd work similarly to their counterparts in base R.

first3 <- lapply_rdd(xrdd, function(x) x[1:3])
collect(first3)
yrdd <- rddlist(sc, list(21:30, LETTERS, rnorm(10)))
xyrdd <- mapply_rdd(c, xrdd, yrdd)
collect(xyrdd)
spark_disconnect(sc)