[SPARK-17790][SPARKR] Support for parallelizing R data.frame larger than 2GB #15375
Conversation
Test build #66434 has finished for PR 15375 at commit
I think we need to delete the temp file?
@felixcheung added cleanup for the temp file and a unit test. PTAL.
Test build #66464 has finished for PR 15375 at commit
@@ -123,19 +126,46 @@ parallelize <- function(sc, coll, numSlices = 1) {
  if (numSlices > length(coll))
    numSlices <- length(coll)

  sizeLimit <- as.numeric(
    sparkR.conf("spark.r.maxAllocationLimit", toString(.Machine$integer.max - 10240)))
  objectSize <- object.size(coll)
Since the guess of size could easily be wrong, and writing them into disk is not that bad anyway, should we have a much smaller default value (for example, 100M)?
Done!
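To make the threshold discussion above concrete, here is a minimal, hypothetical sketch of the size check; the ~100 MB value and the two branches are illustrative assumptions, not the merged code, which reads the limit from `spark.r.maxAllocationLimit`:

```R
# Hypothetical sketch of the size check discussed above (not the merged code).
coll <- as.list(1:100000)                              # stand-in for the local R object

sizeLimit <- as.numeric(toString(100 * 1024 * 1024))   # assumed ~100 MB default
objectSize <- object.size(coll)                        # rough estimate; can be off for complex objects

if (objectSize < sizeLimit) {
  cat("small object: send serialized slices over the socket as before\n")
} else {
  cat("large object: write serialized slices to a temp file for the JVM to read\n")
}
```

Because `object.size` is only an estimate, erring toward the file path with a small default is cheap, which is the reviewer's point above.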
Test build #66467 has finished for PR 15375 at commit
Test build #66472 has finished for PR 15375 at commit
@@ -126,13 +126,13 @@ parallelize <- function(sc, coll, numSlices = 1) {
  if (numSlices > length(coll))
    numSlices <- length(coll)

  sizeLimit <- .Machine$integer.max - 10240 # Safe margin bellow maximum allocation limit
  sizeLimit <- as.numeric(
shouldn't this be `as.integer(`?
This number is not serialized anywhere. I think as.numeric is fine.
agreed, probably not a big deal; a user could set `spark.r.maxAllocationLimit` to 0.01, though, to make `numSlices` bigger
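To illustrate the `as.integer` vs `as.numeric` point with hypothetical values (not from the PR): conf values come back as strings, and a limit above `.Machine$integer.max` cannot be represented as an R integer.

```R
limitStr <- "4294967296"        # hypothetical 4 GB limit, above .Machine$integer.max

as.numeric(limitStr)            # 4294967296 -- parses cleanly as a double
as.integer(limitStr)            # NA plus a coercion warning: out of integer range

# object.size() returns a double-based byte count, so comparing against a
# numeric limit works even when the limit exceeds 2^31 - 1.
object.size(1:10) < as.numeric(limitStr)   # TRUE
```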
Odd, this is the error from appveyor:
  fileName <- writeToTempFile(serializedSlices)
  jrdd <- callJStatic(
    "org.apache.spark.api.r.RRDD", "createRDDFromFile", sc, fileName, as.integer(numSlices))
  file.remove(fileName)
if the JVM call throws an exception, I don't think this line will execute, perhaps wrap this in tryCatch?
Good point. Done!
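For reference, a minimal sketch of the cleanup pattern suggested above, using `tryCatch(..., finally = ...)`; the `parallelizeFromFile` helper and the simulated failure are hypothetical stand-ins for the real JVM call:

```R
# Hypothetical helper: guarantee the temp file is removed even if the JVM call throws.
parallelizeFromFile <- function(serializedSlices, simulateFailure = FALSE) {
  fileName <- tempfile(fileext = ".tmp")
  writeBin(serializedSlices, fileName)
  tryCatch({
    # Stand-in for the real call, e.g.
    # callJStatic("org.apache.spark.api.r.RRDD", "createRDDFromFile", sc, fileName, numSlices)
    if (simulateFailure) stop("simulated JVM-side failure")
    "jrdd-handle"
  }, finally = {
    file.remove(fileName)   # runs whether the call returned or threw
  })
}

slices <- serialize(1:10, connection = NULL)
parallelizeFromFile(slices)                             # returns the handle; file removed
try(parallelizeFromFile(slices, TRUE), silent = TRUE)   # error propagates; file still removed
```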
Test build #66517 has finished for PR 15375 at commit
Test build #66527 has finished for PR 15375 at commit
Test build #66528 has finished for PR 15375 at commit
Test build #66565 has finished for PR 15375 at commit
Test build #66699 has finished for PR 15375 at commit
@@ -87,6 +87,9 @@ objectFile <- function(sc, path, minPartitions = NULL) {
#' in the list are split into \code{numSlices} slices and distributed to nodes
#' in the cluster.
#'
#' If size of serialized slices is larger than 2GB (or INT_MAX bytes), the function
shouldn't this be 200MB now?
Done.
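For users who want a different threshold than the documented default, a hedged example of setting the conf at session start; the conf name comes from this PR, while the `sparkR.session` arguments shown are standard SparkR usage assumed to apply here:

```R
library(SparkR)

# Assumption: raising spark.r.maxAllocationLimit keeps larger local objects on the
# in-memory transfer path instead of spilling them to a temp file.
sparkR.session(
  master = "local[2]",
  sparkConfig = list(spark.r.maxAllocationLimit = toString(1024 * 1024 * 1024))   # 1 GB
)

sparkR.conf("spark.r.maxAllocationLimit")   # confirm what the R side will read
```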
Test build #3324 has finished for PR 15375 at commit
Test build #66750 has finished for PR 15375 at commit
@felixcheung does it look OK now?
This LGTM - I think we merge this to master and branch-2.0
Seems like a flaky test in
If jenkins listens to your commands, maybe we can have it retest this?
Jenkins, retest this please
@falaki @felixcheung The DirectKafkaStreamSuite is a known flaky test. Nothing in this patch should affect Kafka.
Test build #66792 has finished for PR 15375 at commit
everything passed, I'm merging this to master and branch-2.0
…han 2GB
Author: Hossein <hossein@databricks.com>
Closes #15375 from falaki/SPARK-17790.
(cherry picked from commit 5cc503f)
Signed-off-by: Felix Cheung <felixcheung@apache.org>

…han 2GB
Author: Hossein <hossein@databricks.com>
Closes apache#15375 from falaki/SPARK-17790.

What changes were proposed in this pull request?
If the R data structure that is being parallelized is larger than `INT_MAX` we use files to transfer data to the JVM. The serialization protocol mimics Python pickling. This allows us to simply call `PythonRDD.readRDDFromFile` to create the RDD.

I tested this on my MacBook. The following code works with this patch:

```R
intMax <- .Machine$integer.max
largeVec <- 1:intMax
rdd <- SparkR:::parallelize(sc, largeVec, 2)
```

How was this patch tested?
* [x] Unit tests
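Finally, a hedged sketch of how the file-based path can be exercised without a 2 GB allocation, by shrinking the limit this PR introduces; the conf name is from the PR, while the session setup and the claim that `createDataFrame` routes through `parallelize` are assumptions about typical SparkR usage, not the PR's actual unit test:

```R
library(SparkR)

# Assumed setup: a tiny limit forces even small local data down the temp-file path.
sparkR.session(
  master = "local[2]",
  sparkConfig = list(spark.r.maxAllocationLimit = "100")   # 100 bytes
)

# createDataFrame() on a local data.frame goes through parallelize() internally,
# so this should exercise the same code path the PR's unit test targets.
df <- createDataFrame(data.frame(x = 1:1000, y = runif(1000)))
stopifnot(count(df) == 1000)

sparkR.session.stop()
```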