[SPARK-25981][R] Enables Arrow optimization from R DataFrame to Spark DataFrame #22954

Closed · wants to merge 19 commits into base: master
6 participants

HyukjinKwon commented Nov 6, 2018

What changes were proposed in this pull request?

This PR adds Arrow optimization support for the conversion from an R DataFrame to a Spark DataFrame.
Like the PySpark side, it falls back to the non-optimized code path when it is unable to use Arrow optimization.

This can be tested as below:

$ ./bin/sparkR --conf spark.sql.execution.arrow.enabled=true
collect(createDataFrame(mtcars))
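The fallback behavior described above can be sketched roughly as follows. This is a hypothetical illustration of the pattern, not the actual SparkR internals; all function names here are made up:

```r
# Hypothetical sketch of the "try Arrow first, fall back on failure" pattern.
# convert_with_arrow / convert_without_arrow are illustrative stand-ins.
convert_with_arrow <- function(rdf) {
  if (!requireNamespace("arrow", quietly = TRUE)) {
    stop("arrow package is not installed")
  }
  # ... Arrow-based conversion would happen here ...
  rdf
}

convert_without_arrow <- function(rdf) {
  # ... original (slower) serialization path ...
  rdf
}

create_data_frame <- function(rdf) {
  tryCatch(
    convert_with_arrow(rdf),
    error = function(e) {
      warning("Falling back to non-Arrow conversion: ", conditionMessage(e))
      convert_without_arrow(rdf)
    }
  )
}
```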

Requirements

  • R 3.5.x
  • Arrow package 0.12+
    Rscript -e 'remotes::install_github("apache/arrow@apache-arrow-0.12.0", subdir = "r")'

Note: currently, the Arrow R package is not on CRAN. Please take a look at ARROW-3204.
Note: currently, the Arrow R package does not seem to support Windows. Please take a look at ARROW-3204.

Benchmarks

Shell

sync && sudo purge
./bin/sparkR --conf spark.sql.execution.arrow.enabled=false
sync && sudo purge
./bin/sparkR --conf spark.sql.execution.arrow.enabled=true

R code

createDataFrame(mtcars) # Initializes
rdf <- read.csv("500000.csv")

test <- function() {
  options(digits.secs = 6) # milliseconds
  start.time <- Sys.time()
  createDataFrame(rdf)
  end.time <- Sys.time()
  time.taken <- end.time - start.time
  print(time.taken)
}

test()

Data (350 MB):

object.size(read.csv("500000.csv"))
350379504 bytes

"500000 Records" http://eforexcel.com/wp/downloads-16-sample-csv-files-data-sets-for-testing/

Results

Time difference of 29.9468 secs
Time difference of 3.222129 secs

The performance improvement was around 950%.
Actually, this PR improves performance by around 1200%+ because it also includes a small optimization of the regular R DataFrame -> Spark DataFrame path. See #22954 (comment)
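As a quick arithmetic check, the two timings reported above correspond to roughly a 9.3x speedup, consistent with the figure quoted:

```r
# Ratio of the two timings reported in the Results section above.
baseline   <- 29.9468    # seconds with spark.sql.execution.arrow.enabled=false
with_arrow <- 3.222129   # seconds with spark.sql.execution.arrow.enabled=true
speedup <- baseline / with_arrow
round(speedup, 2)  # roughly 9.29x
```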

Limitations:

For now, Arrow optimization with R does not support the cases where the data contains raw type, or where the user explicitly gives float type in the schema, since these produce corrupt values.
In these cases, we decided to fall back to the non-optimized code path.
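For illustration, this is what a data.frame carrying raw data can look like; this example is constructed here for clarity and is not taken from the PR's tests:

```r
# A data.frame with a list column whose elements are raw vectors - one of
# the input shapes that, per the limitation above, falls back to the
# non-Arrow conversion path.
rdf <- data.frame(id = 1:2)
rdf$payload <- list(as.raw(c(0x01, 0x02)), as.raw(0xff))
sapply(rdf$payload, class)  # each list element has class "raw"
```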

How was this patch tested?

Small test was added.

I manually forced this optimization to true for all R tests and they all passed (with a few fallback warnings).

TODOs:

  • Draft codes
  • make the tests pass
  • make the CRAN check pass
  • Performance measurement
  • Supportability investigation (for instance types)
  • Wait for Arrow 0.12.0 release
  • Fix and match it to Arrow 0.12.0

HyukjinKwon commented Nov 6, 2018

Let me leave a cc @felixcheung, @BryanCutler, @yanboliang, @shivaram FYI.


SparkQA commented Nov 6, 2018

Test build #98508 has finished for PR 22954 at commit 90011a5.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 6, 2018

Test build #98510 has finished for PR 22954 at commit 46eaeca.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 6, 2018

Test build #98512 has finished for PR 22954 at commit 614170e.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 6, 2018

Test build #98514 has finished for PR 22954 at commit b15d79c.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

HyukjinKwon commented Nov 7, 2018

So far, the regression tests pass and the newly added test for the R optimization is verified locally. Let me fix the CRAN test and some nits.


felixcheung left a comment

so this is super cool. My biggest concern is that we previously changed to not write out to a file, to respect the encryption requirement, and this adds back the temp file.


HyukjinKwon commented Nov 7, 2018

Thanks, @felixcheung. I will address those comments during cleaning up.


HyukjinKwon commented Nov 7, 2018

For the encryption stuff, I will try to handle that as well (maybe as a follow-up) so that we support it even when it's enabled.


HyukjinKwon commented Nov 8, 2018

@felixcheung! The performance improvement was 955%! I described the benchmark I took in the PR description.

@@ -215,14 +278,16 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
}

if (is.null(schema) || (!inherits(schema, "structType") && is.null(names(schema)))) {
row <- firstRDD(rdd)
if (is.null(firstRow)) {
firstRow <- firstRDD(rdd)

HyukjinKwon (Author Member) commented Nov 8, 2018

Note that this PR optimizes the original code path as well: when the input is a local R DataFrame, we avoid the firstRDD operation here.

In the master branch, the benchmark shows:

Exception in thread "dispatcher-event-loop-6" java.lang.OutOfMemoryError: Java heap space
	at java.util.Arrays.copyOf(Arrays.java:3236)
	at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
	at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
	at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
	at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
	at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)

If I try this with the 100000.csv (79 MB) data, it takes a while to run. To cut it short:

Current master:

Time difference of 8.502607 secs

With this PR, but without Arrow

Time difference of 5.143395 secs

With this PR, but with Arrow

Time difference of 0.6981369 secs

So, technically this PR improves performance by more than 1200%.

felixcheung (Member) commented Nov 9, 2018

I <3 4 digits!


HyukjinKwon commented Nov 8, 2018

adding @falaki and @mengxr as well.

@HyukjinKwon HyukjinKwon changed the title [DO-NOT-MERGE][POC] Enables Arrow optimization from R DataFrame to Spark DataFrame [DO-NOT-MERGE] Enables Arrow optimization from R DataFrame to Spark DataFrame Nov 8, 2018

@HyukjinKwon HyukjinKwon changed the title [DO-NOT-MERGE] Enables Arrow optimization from R DataFrame to Spark DataFrame [WIP] Enables Arrow optimization from R DataFrame to Spark DataFrame Nov 8, 2018


SparkQA commented Nov 8, 2018

Test build #98595 has finished for PR 22954 at commit 8813192.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 8, 2018

Test build #98603 has finished for PR 22954 at commit 7be15d3.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 8, 2018

Test build #98613 has started for PR 22954 at commit 2ddbd69.


@HyukjinKwon HyukjinKwon changed the title [WIP] Enables Arrow optimization from R DataFrame to Spark DataFrame [SPARK-25981][R] Enables Arrow optimization from R DataFrame to Spark DataFrame Nov 8, 2018


HyukjinKwon commented Nov 8, 2018

I have finished most of the TODOs except waiting for the R API of Arrow 0.12.0 and making some changes accordingly.


SparkQA commented Nov 8, 2018

Test build #98615 has finished for PR 22954 at commit 2ba6add.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 8, 2018

Test build #98614 has finished for PR 22954 at commit 0903736.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

HyukjinKwon commented Nov 8, 2018

retest this please


SparkQA commented Nov 9, 2018

Test build #98628 has finished for PR 22954 at commit 2ba6add.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon HyukjinKwon force-pushed the HyukjinKwon:r-arrow-createdataframe branch to 767af86 Jan 22, 2019

@HyukjinKwon HyukjinKwon changed the title [SPARK-25981][R] Enables Arrow optimization from R DataFrame to Spark DataFrame [WIP][SPARK-25981][R] Enables Arrow optimization from R DataFrame to Spark DataFrame Jan 22, 2019


felixcheung commented Jan 22, 2019

I was thinking a blog post in the Arrow project ;)


HyukjinKwon commented Jan 22, 2019

Gotcha, yeah, of course I am interested. I'll start working on that after this PR is merged.


SparkQA commented Jan 22, 2019

Test build #101513 has finished for PR 22954 at commit 767af86.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

BryanCutler commented Jan 22, 2019

retest this please


SparkQA commented Jan 22, 2019

Test build #101551 has finished for PR 22954 at commit 767af86.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
if (requireNamespace1("arrow", quietly = TRUE)) {
record_batch <- get("record_batch", envir = asNamespace("arrow"), inherits = FALSE)
record_batch_stream_writer <- get(
"record_batch_stream_writer", envir = asNamespace("arrow"), inherits = FALSE)

HyukjinKwon (Author Member) commented Jan 24, 2019

FWIW, the sparklyr and arrow implementations use the same trick to avoid CRAN failures.
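The runtime-lookup trick quoted above works for any package that cannot be declared in DESCRIPTION: the function is resolved with `get()` at runtime instead of being called directly, so CRAN's static checks do not see a hard dependency. Here it is demonstrated with the base `tools` package purely for illustration (the PR itself applies it to `arrow`):

```r
# Resolve a function from a namespace at runtime rather than calling it
# directly. tools::file_ext is used only as a stand-in example here.
if (requireNamespace("tools", quietly = TRUE)) {
  file_ext <- get("file_ext", envir = asNamespace("tools"), inherits = FALSE)
  file_ext("data.csv")  # "csv"
}
```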

HyukjinKwon added some commits Jan 24, 2019

@HyukjinKwon HyukjinKwon changed the title [WIP][SPARK-25981][R] Enables Arrow optimization from R DataFrame to Spark DataFrame [SPARK-25981][R] Enables Arrow optimization from R DataFrame to Spark DataFrame Jan 24, 2019

# package in requireNamespace at DESCRIPTION. Later, CRAN checks if the package is available
# or not. Therefore, it works around by avoiding direct requireNamespace.
# Currently, as of Arrow 0.12.0, it can be installed by install_github. See ARROW-3204.
if (requireNamespace1("arrow", quietly = TRUE)) {

HyukjinKwon (Author Member) commented Jan 24, 2019

Judging from ARROW-3204, Arrow is still not on CRAN and it looks like it's going to take a few months; it looks like we can run the build via AppVeyor once it's on CRAN.

felixcheung (Member) commented Jan 25, 2019

Yes, publishing to CRAN hasn't happened yet for Arrow. install_github should be fine? It's a one-time thing.


HyukjinKwon commented Jan 24, 2019

To cut it short, I think this PR is ready to go. I reran the benchmark and updated the PR description.

Few things to mention:

  1. Arrow is not released on CRAN yet, and it looks like it's going to take a few months (see ARROW-3204). So, for now, it should be installed manually.

    • It can be installed by Rscript -e 'remotes::install_github("apache/arrow@apache-arrow-0.12.0", subdir = "r")'.
    • I used macOS Mojave 10.14.2 and faced some problems I had to fix in my environment. Please contact me if you face issues while installing this. If this happens widely, I will document it somewhere.
  2. It looks like we can run the build via AppVeyor once it's on CRAN (see ARROW-3204).

  3. We should remove the workarounds that I used to avoid the CRAN check (see #22954 (comment) and #22954 (comment))

Next items (I'm going to investigate first before filing JIRAs):

  1. I'm going to take a look at whether we can do this for Spark DataFrame -> R DataFrame too.
  2. I'm also going to take a look at R native function APIs like lapply and gapply and see if we can optimize them.
  3. Before the Spark 3.0 release, I will document this. Hopefully, we can get rid of both workarounds I mentioned above, and Arrow will be on CRAN before then.

SparkQA commented Jan 24, 2019

Test build #101633 has finished for PR 22954 at commit 66b120b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jan 24, 2019

Test build #101630 has finished for PR 22954 at commit 92eec4e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jan 24, 2019

Test build #101632 has finished for PR 22954 at commit 854c9d8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

HyukjinKwon commented Jan 25, 2019

@felixcheung and @shivaram, are you okay with this plan #22954 (comment) ? If so, I think we can go ahead.


felixcheung commented Jan 25, 2019


HyukjinKwon commented Jan 26, 2019

Yeah, will do. Do you mind if we go ahead with this PR, @felixcheung?


felixcheung commented Jan 26, 2019


HyukjinKwon commented Jan 27, 2019

Thanks, @felixcheung.

Merged to master.

@asfgit asfgit closed this in e8982ca Jan 27, 2019


HyukjinKwon commented Feb 13, 2019

BTW, https://issues.apache.org/jira/browse/SPARK-26759 has subtasks for Arrow optimization (just FYI if anyone missed it)

stczwd added a commit to stczwd/spark that referenced this pull request Feb 18, 2019

[SPARK-25981][R] Enables Arrow optimization from R DataFrame to Spark…
… DataFrame

Closes apache#22954 from HyukjinKwon/r-arrow-createdataframe.

Lead-authored-by: hyukjinkwon <gurwls223@apache.org>
Co-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>