
[SPARK-17790][SPARKR] Support for parallelizing R data.frame larger than 2GB #15375

Closed
wants to merge 14 commits

Conversation

@falaki (Contributor) commented Oct 6, 2016

What changes were proposed in this pull request?

If the R data structure that is being parallelized is larger than INT_MAX, we use files to transfer data to the JVM. The serialization protocol mimics Python pickling. This allows us to simply call PythonRDD.readRDDFromFile to create the RDD.

I tested this on my MacBook. The following code works with this patch:

intMax <- .Machine$integer.max
largeVec <- 1:intMax
rdd <- SparkR:::parallelize(sc, largeVec, 2)
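
At a high level, the new path looks roughly like this (a simplified sketch, not the exact code in the patch; the small-data branch via createRDDFromArray is an assumption about the pre-existing code and is shown only for contrast):

```R
# Sketch only: decide between the in-memory path and the new file-based path.
sizeLimit <- as.numeric(
  sparkR.conf("spark.r.maxAllocationLimit", toString(.Machine$integer.max - 10240)))
objectSize <- object.size(coll)

if (objectSize < sizeLimit) {
  # Small data: existing in-memory path (assumed; unchanged by this patch).
  jrdd <- callJStatic("org.apache.spark.api.r.RRDD", "createRDDFromArray",
                      sc, serializedSlices)
} else {
  # Large data: spill the serialized slices to a temp file, then let the JVM
  # read it back, reusing the machinery behind PythonRDD.readRDDFromFile.
  fileName <- writeToTempFile(serializedSlices)
  jrdd <- callJStatic("org.apache.spark.api.r.RRDD", "createRDDFromFile",
                      sc, fileName, as.integer(numSlices))
  file.remove(fileName)
}
```

Going through a file sidesteps the INT_MAX cap on what can be shipped to the JVM in a single in-memory transfer.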

How was this patch tested?

  • Unit tests

@SparkQA commented Oct 6, 2016

Test build #66434 has finished for PR 15375 at commit f4b1638.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung (Member)

I think we need to delete the temp file?

@falaki (Contributor, Author) commented Oct 6, 2016

@felixcheung added cleanup for the temp file and a unit test. PTAL.

@SparkQA commented Oct 6, 2016

Test build #66464 has finished for PR 15375 at commit 8691569.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -123,19 +126,46 @@ parallelize <- function(sc, coll, numSlices = 1) {
if (numSlices > length(coll))
numSlices <- length(coll)

sizeLimit <- as.numeric(
sparkR.conf("spark.r.maxAllocationLimit", toString(.Machine$integer.max - 10240)))
objectSize <- object.size(coll)
Contributor

Since the guess of the size could easily be wrong, and writing the data to disk is not that bad anyway, should we have a much smaller default value (for example, 100M)?

Contributor Author

Done!
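
For reference, the limit that decides between the two paths stays configurable through spark.r.maxAllocationLimit. A minimal sketch of overriding it at session start, assuming the value is read as a byte count (as the default shown in the diff above suggests):

```R
library(SparkR)

# Sketch: raise the limit to ~1GB for this session so medium-sized local data
# still goes through the in-memory path. Assumes the value is a byte count.
sparkR.session(sparkConfig = list(spark.r.maxAllocationLimit = toString(1024^3)))

# Confirm the effective value.
sparkR.conf("spark.r.maxAllocationLimit")
```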

@SparkQA commented Oct 7, 2016

Test build #66467 has finished for PR 15375 at commit 8e065c1.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 7, 2016

Test build #66472 has finished for PR 15375 at commit 4aab6cf.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -126,13 +126,13 @@ parallelize <- function(sc, coll, numSlices = 1) {
if (numSlices > length(coll))
numSlices <- length(coll)

sizeLimit <- .Machine$integer.max - 10240 # Safe margin below maximum allocation limit
sizeLimit <- as.numeric(
Member

shouldn't this be as.integer(?

Contributor Author

This number is not serialized anywhere. I think as.numeric is fine.

@felixcheung (Member) Oct 7, 2016

Agreed, probably not a big deal; a user could set spark.r.maxAllocationLimit to 0.01 though, to make numSlices bigger.
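
A quick illustration of the difference, with hypothetical numbers (the exact slice computation in the merged code may differ):

```R
# as.numeric() keeps a fractional limit usable; as.integer() would truncate it.
as.numeric("0.01")   # 0.01
as.integer("0.01")   # 0

# With a tiny numeric limit, the derived slice count balloons
# (assumed ceiling() relationship, shown only to illustrate the point).
objectSize <- 1e6                          # pretend the data serializes to ~1MB
ceiling(objectSize / as.numeric("0.01"))   # 1e+08 slices
```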

@felixcheung (Member)

Odd, this is the error from appveyor:

CallerContext: Fail to set Spark caller context
java.lang.ClassNotFoundException: org.apache.hadoop.ipc.CallerContext
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:264)
    at org.apache.spark.util.CallerContext.setCurrentContext(Utils.scala:2485)
    at org.apache.spark.scheduler.Task.run(Task.scala:96)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
16/10/07 03:45:58 INFO Executor: Finished task 1.

fileName <- writeToTempFile(serializedSlices)
jrdd <- callJStatic(
"org.apache.spark.api.r.RRDD", "createRDDFromFile", sc, fileName, as.integer(numSlices))
file.remove(fileName)
Member

if the JVM call throws an exception, I don't think this line will execute, perhaps wrap this in tryCatch?

Contributor Author

Good point. Done!
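
The cleanup pattern discussed here could look roughly like this (a sketch of the suggestion, not necessarily the exact code that was merged):

```R
# Sketch: make sure the temp file is removed even if the JVM call throws.
fileName <- writeToTempFile(serializedSlices)
jrdd <- tryCatch(
  callJStatic("org.apache.spark.api.r.RRDD", "createRDDFromFile",
              sc, fileName, as.integer(numSlices)),
  finally = {
    file.remove(fileName)
  }
)
```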

@SparkQA commented Oct 7, 2016

Test build #66517 has finished for PR 15375 at commit 6ea7fb3.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public class InterfaceStability
    • sealed trait ConsumerStrategy
    • case class SubscribeStrategy(topics: Seq[String], kafkaParams: ju.Map[String, Object])
    • case class SubscribePatternStrategy(
    • case class KafkaSourceOffset(partitionToOffsets: Map[TopicPartition, Long]) extends Offset
    • abstract class StreamExecutionThread(name: String) extends UninterruptibleThread(name)

@SparkQA commented Oct 7, 2016

Test build #66527 has finished for PR 15375 at commit ef989c1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 8, 2016

Test build #66528 has finished for PR 15375 at commit 9df49ea.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@falaki changed the title from "[SPARK-17790] Support for parallelizing R data.frame larger than 2GB" to "[SPARK-17790][SPARKR] Support for parallelizing R data.frame larger than 2GB" on Oct 8, 2016
@SparkQA commented Oct 8, 2016

Test build #66565 has finished for PR 15375 at commit 766d903.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 11, 2016

Test build #66699 has finished for PR 15375 at commit 62ab47b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -87,6 +87,9 @@ objectFile <- function(sc, path, minPartitions = NULL) {
#' in the list are split into \code{numSlices} slices and distributed to nodes
#' in the cluster.
#'
#' If size of serialized slices is larger than 2GB (or INT_MAX bytes), the function
@felixcheung (Member) Oct 11, 2016

shouldn't this be 200MB now?

Contributor Author

Done.

@SparkQA commented Oct 11, 2016

Test build #3324 has finished for PR 15375 at commit 62ab47b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 11, 2016

Test build #66750 has finished for PR 15375 at commit 836e874.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@falaki (Contributor, Author) commented Oct 11, 2016

@felixcheung does it look OK now?

@felixcheung (Member) commented Oct 11, 2016

This LGTM - I think we merge this to master and branch-2.0
But Spark unit tests are failing?

@falaki (Contributor, Author) commented Oct 11, 2016

Seems like a flaky test in DirectKafkaStreamSuite:

DirectKafkaStreamSuite:
- pattern based subscription *** FAILED *** (1 minute, 41 seconds)

If jenkins listens to your commands, maybe we can have it retest this?

@shivaram (Contributor)

Jenkins, retest this please

@tdas (Contributor) commented Oct 12, 2016

@falaki @felixcheung The DirectKafkaStreamSuite is a known flaky test. Nothing in this patch should affect Kafka.

@SparkQA commented Oct 12, 2016

Test build #66792 has finished for PR 15375 at commit 836e874.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung (Member)

Everything passed, I'm merging this to master and branch-2.0.

asfgit pushed a commit that referenced this pull request Oct 12, 2016
[SPARK-17790][SPARKR] Support for parallelizing R data.frame larger than 2GB

## What changes were proposed in this pull request?
If the R data structure that is being parallelized is larger than `INT_MAX` we use files to transfer data to JVM. The serialization protocol mimics Python pickling. This allows us to simply call `PythonRDD.readRDDFromFile` to create the RDD.

I tested this on my MacBook. Following code works with this patch:
```R
intMax <- .Machine$integer.max
largeVec <- 1:intMax
rdd <- SparkR:::parallelize(sc, largeVec, 2)
```

## How was this patch tested?
* [x] Unit tests

Author: Hossein <hossein@databricks.com>

Closes #15375 from falaki/SPARK-17790.

(cherry picked from commit 5cc503f)
Signed-off-by: Felix Cheung <felixcheung@apache.org>
@asfgit closed this in 5cc503f Oct 12, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
[SPARK-17790][SPARKR] Support for parallelizing R data.frame larger than 2GB