[SPARK-25981][R] Enables Arrow optimization from R DataFrame to Spark DataFrame #22954
Conversation
Let me leave a cc @felixcheung, @BryanCutler, @yanboliang, @shivaram FYI.
Test build #98508 has finished for PR 22954 at commit
Test build #98510 has finished for PR 22954 at commit
Test build #98512 has finished for PR 22954 at commit
Test build #98514 has finished for PR 22954 at commit
So far, the regression tests pass and the newly added test for the R optimization has been verified locally. Let me fix the CRAN test and some nits.
So this is super cool. My biggest concern: we previously changed to not write out to a file because that did not respect the encryption requirement, and this adds the temp file back.
Thanks, @felixcheung. I will address those comments during cleanup.
For the encryption stuff, I will try to handle that as well (maybe as a follow-up?) so that we support it even when that's enabled.
@felixcheung! The performance improvement was 955%! I described the benchmark I took in the PR description.
Test build #98595 has finished for PR 22954 at commit
Test build #98603 has finished for PR 22954 at commit
Test build #98613 has started for PR 22954 at commit
I have finished most of the TODOs, except for waiting for the R API of Arrow 0.12.0 and adjusting some changes accordingly.
Test build #98615 has finished for PR 22954 at commit
Test build #98614 has finished for PR 22954 at commit
retest this please
Test build #98628 has finished for PR 22954 at commit
Force-pushed from 954bc0e to 767af86.
I was thinking a blog post in the Arrow project ;)
Gotcha, yeah, I am interested in it of course. I'll start working on that after this PR is merged.
Test build #101513 has finished for PR 22954 at commit
retest this please
Test build #101551 has finished for PR 22954 at commit
To cut it short, I think this PR is ready to go. I reran the benchmark and updated the PR description. A few things to mention:
Next items (I'm going to investigate first before filing JIRAs):
Test build #101633 has finished for PR 22954 at commit
Test build #101630 has finished for PR 22954 at commit
Test build #101632 has finished for PR 22954 at commit
@felixcheung and @shivaram, are you okay with this plan (#22954 (comment))? If so, I think we can go ahead.
Yes, install_github is fine. I'd say #2 (gapply etc.) is higher priority. Sounds good to me.
Yeah, will do. Do you mind if we go ahead with this PR, @felixcheung?
Sure.
Thanks, @felixcheung. Merged to master.
BTW, https://issues.apache.org/jira/browse/SPARK-26759 has subtasks for Arrow optimization (just FYI if anyone missed it) |
## What changes were proposed in this pull request?

This PR targets to support Arrow optimization for the conversion from R DataFrame to Spark DataFrame. Like the PySpark side, it falls back to the non-optimized code path when it is unable to use Arrow optimization.

This can be tested as below:

```bash
$ ./bin/sparkR --conf spark.sql.execution.arrow.enabled=true
```

```r
collect(createDataFrame(mtcars))
```

### Requirements

- R 3.5.x
- Arrow package 0.12+

```bash
Rscript -e 'remotes::install_github("apache/arrow@apache-arrow-0.12.0", subdir = "r")'
```

**Note:** currently, the Arrow R package is not on CRAN. Please take a look at ARROW-3204.

**Note:** currently, the Arrow R package does not seem to support Windows. Please take a look at ARROW-3204.

### Benchmarks

**Shell**

```bash
sync && sudo purge
./bin/sparkR --conf spark.sql.execution.arrow.enabled=false
```

```bash
sync && sudo purge
./bin/sparkR --conf spark.sql.execution.arrow.enabled=true
```

**R code**

```r
createDataFrame(mtcars)  # Initializes

rdf <- read.csv("500000.csv")

test <- function() {
  options(digits.secs = 6)  # show sub-second precision
  start.time <- Sys.time()
  createDataFrame(rdf)
  end.time <- Sys.time()
  time.taken <- end.time - start.time
  print(time.taken)
}

test()
```

**Data (350 MB):**

```r
object.size(read.csv("500000.csv"))
# 350379504 bytes
```

"500000 Records" http://eforexcel.com/wp/downloads-16-sample-csv-files-data-sets-for-testing/

**Results**

Without Arrow optimization:

```
Time difference of 29.9468 secs
```

With Arrow optimization:

```
Time difference of 3.222129 secs
```

The performance improvement was around **950%**. Actually, this PR improves by around **1200%**+ overall, because it also includes a small optimization for the regular R DataFrame -> Spark DataFrame path. See apache#22954 (comment).

### Limitations

For now, Arrow optimization with R is not supported when the data is `raw`, or when the user explicitly gives a float type in the schema; these produce corrupt values. In those cases, we decided to fall back to the non-optimized code path.

## How was this patch tested?

A small test was added. I manually forced this optimization to `true` for _all_ R tests and they _all_ passed (with a few fallback warnings).

**TODOs:**

- [x] Draft codes
- [x] Make the tests pass
- [x] Make the CRAN check pass
- [x] Performance measurement
- [x] Supportability investigation (for instance types)
- [x] Wait for the Arrow 0.12.0 release
- [x] Fix and match it to Arrow 0.12.0

Closes apache#22954 from HyukjinKwon/r-arrow-createdataframe.

Lead-authored-by: hyukjinkwon <gurwls223@apache.org>
Co-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
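The benchmark methodology in the description (one warm-up call, then one timed conversion) can be sketched generically. This is a hypothetical Python harness, not SparkR code; `convert`, `data`, and `warmup` are stand-ins for `createDataFrame`, the 500000-row data frame, and the `mtcars` warm-up:

```python
# Hypothetical sketch of the PR's benchmark methodology: warm up once
# (like createDataFrame(mtcars)), then time exactly one conversion call.
# `convert` is a stand-in for createDataFrame; nothing here is SparkR code.
import time

def time_conversion(convert, data, warmup=None):
    """Return elapsed seconds for a single call of convert(data)."""
    if warmup is not None:
        convert(warmup)  # warm-up call, excluded from the measurement
    start = time.perf_counter()
    convert(data)
    return time.perf_counter() - start
```

Under this methodology, each reported number (29.9468 secs and 3.222129 secs) corresponds to one timed `createDataFrame(rdf)` call per configuration.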
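The fall-back behavior described above (Arrow optimization is tried first, and the non-optimized path is used when Arrow cannot handle the data, e.g. `raw` or float types) can be sketched as follows. This is a hypothetical Python illustration of the pattern, not SparkR's actual implementation; `arrow_path` and `plain_path` are stand-in callables:

```python
# Hypothetical sketch of the fall-back pattern described in the PR:
# try the Arrow-optimized path when enabled, and fall back to the
# plain path with a warning if it fails. Names are illustrative only.
import warnings

def create_dataframe(rdf, arrow_enabled, arrow_path, plain_path):
    """Convert `rdf` via `arrow_path` when possible, else `plain_path`."""
    if arrow_enabled:
        try:
            return arrow_path(rdf)
        except Exception as exc:
            warnings.warn(f"Arrow optimization failed, falling back: {exc}")
    return plain_path(rdf)
```

The design point the reviewers discuss above follows from this shape: the optimization is best-effort, so enabling it for all tests should at worst produce fallback warnings rather than failures.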