[SPARK-25981][R] Enables Arrow optimization from R DataFrame to Spark DataFrame #22954

Closed
wants to merge 19 commits into from

HyukjinKwon
Member

@HyukjinKwon HyukjinKwon commented Nov 6, 2018

What changes were proposed in this pull request?

This PR adds Arrow optimization for converting an R DataFrame to a Spark DataFrame.
As on the PySpark side, it falls back to the non-optimized code path when Arrow optimization cannot be used.

This can be tested as below:

$ ./bin/sparkR --conf spark.sql.execution.arrow.enabled=true
collect(createDataFrame(mtcars))

Requirements

  • R 3.5.x
  • Arrow package 0.12+
    Rscript -e 'remotes::install_github("apache/arrow@apache-arrow-0.12.0", subdir = "r")'

Note: currently, the Arrow R package is not on CRAN. Please take a look at ARROW-3204.
Note: currently, the Arrow R package does not seem to support Windows. Please take a look at ARROW-3204.

Benchmarks

Shell

sync && sudo purge
./bin/sparkR --conf spark.sql.execution.arrow.enabled=false
sync && sudo purge
./bin/sparkR --conf spark.sql.execution.arrow.enabled=true

R code

createDataFrame(mtcars) # Initializes
rdf <- read.csv("500000.csv")

test <- function() {
  options(digits.secs = 6) # print fractional seconds (up to 6 digits)
  start.time <- Sys.time()
  createDataFrame(rdf)
  end.time <- Sys.time()
  time.taken <- end.time - start.time
  print(time.taken)
}

test()

Data (350 MB):

object.size(read.csv("500000.csv"))
350379504 bytes

"500000 Records" http://eforexcel.com/wp/downloads-16-sample-csv-files-data-sets-for-testing/

Results

Time difference of 29.9468 secs
Time difference of 3.222129 secs

The performance improvement was around 950%.
In fact, this PR improves performance by around 1200%+ because it also includes a small optimization for the regular R DataFrame -> Spark DataFrame conversion. See #22954 (comment)
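
For reference, the ratio implied by the two timings above can be recomputed in base R. This is only a small sketch using the numbers printed in the results; the variable names are illustrative, and how the ratio maps onto a percentage depends on the definition and rounding used.

```r
# Timings copied from the benchmark results above (seconds).
without_arrow <- 29.9468
with_arrow <- 3.222129

# How many times faster the run with Arrow enabled was.
speedup <- without_arrow / with_arrow

# Share of the wall-clock time that was saved.
time_saved_pct <- (1 - with_arrow / without_arrow) * 100

cat(sprintf("speedup: %.2fx, time saved: %.1f%%\n", speedup, time_saved_pct))
```

With these two timings the ratio comes out at roughly 9.3x, which corresponds to about 89% less wall-clock time for this single run.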

Limitations:

For now, Arrow optimization with R does not support raw data, or the case where the user explicitly specifies a float type in the schema. Both produce corrupt values.
In these cases, we fall back to the non-optimized code path.
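
The fallback behaviour described above can be sketched in plain R. This is a minimal illustration under stated assumptions, not the code in this PR: `convert_arrow`, `convert_default`, and `convert_with_fallback` are hypothetical names, and the checks only mimic the two limitations listed (raw columns and an explicit float type).

```r
# Hypothetical stand-ins for the two conversion paths; these are NOT SparkR internals.
convert_arrow <- function(rdf, schema_types = NULL) {
  # Reject the two cases that the limitations above describe.
  if (any(vapply(rdf, is.raw, logical(1)))) {
    stop("raw columns are not supported on the Arrow path")
  }
  if ("float" %in% schema_types) {
    stop("an explicit float type is not supported on the Arrow path")
  }
  list(path = "arrow", rows = nrow(rdf))
}

convert_default <- function(rdf, schema_types = NULL) {
  list(path = "default", rows = nrow(rdf))
}

# Try the optimized path first; on any error, warn and use the default path.
convert_with_fallback <- function(rdf, schema_types = NULL) {
  tryCatch(
    convert_arrow(rdf, schema_types),
    error = function(e) {
      warning("Arrow path unavailable, falling back: ", conditionMessage(e))
      convert_default(rdf, schema_types)
    }
  )
}

print(convert_with_fallback(mtcars)$path)
print(suppressWarnings(convert_with_fallback(data.frame(x = as.raw(1:3))))$path)
```

In this sketch, `mtcars` takes the "arrow" branch, while a data frame with a raw column takes the "default" branch and emits a warning, similar in spirit to the fallback warnings mentioned in the testing notes.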

How was this patch tested?

A small test was added.

I manually forced this optimization to true for all R tests, and they all passed (with a few fallback warnings).

TODOs:

  • Draft code
  • Make the tests pass
  • Make the CRAN check pass
  • Performance measurement
  • Supportability investigation (for instance types)
  • Wait for Arrow 0.12.0 release
  • Fix and match it to Arrow 0.12.0

@HyukjinKwon
Member Author

HyukjinKwon commented Nov 6, 2018

Let me leave a cc @felixcheung, @BryanCutler, @yanboliang, @shivaram FYI.

@SparkQA

SparkQA commented Nov 6, 2018

Test build #98508 has finished for PR 22954 at commit 90011a5.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 6, 2018

Test build #98510 has finished for PR 22954 at commit 46eaeca.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 6, 2018

Test build #98512 has finished for PR 22954 at commit 614170e.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 6, 2018

Test build #98514 has finished for PR 22954 at commit b15d79c.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

So far, the regression tests pass and the newly added test for R optimization has been verified locally. Let me fix the CRAN test and some nits.

Member

@felixcheung felixcheung left a comment

so this is super cool - my biggest concern is that we previously stopped writing out to a file because that did not respect the encryption requirement, and this adds the temp file back

@HyukjinKwon
Member Author

Thanks, @felixcheung. I will address those comments during cleaning up.

@HyukjinKwon
Member Author

For encryption stuff, I will try to handle that as well (maybe as a followup(?)) so that we support it even when that's enabled.

@HyukjinKwon
Member Author

@felixcheung! The performance improvement was 955%! I described the benchmark I ran in the PR description.

@HyukjinKwon
Member Author

adding @falaki and @mengxr as well.

@HyukjinKwon HyukjinKwon changed the title [DO-NOT-MERGE][POC] Enables Arrow optimization from R DataFrame to Spark DataFrame [DO-NOT-MERGE] Enables Arrow optimization from R DataFrame to Spark DataFrame Nov 8, 2018
@HyukjinKwon HyukjinKwon changed the title [DO-NOT-MERGE] Enables Arrow optimization from R DataFrame to Spark DataFrame [WIP] Enables Arrow optimization from R DataFrame to Spark DataFrame Nov 8, 2018
@SparkQA

SparkQA commented Nov 8, 2018

Test build #98595 has finished for PR 22954 at commit 8813192.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 8, 2018

Test build #98603 has finished for PR 22954 at commit 7be15d3.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 8, 2018

Test build #98613 has started for PR 22954 at commit 2ddbd69.

@HyukjinKwon HyukjinKwon changed the title [WIP] Enables Arrow optimization from R DataFrame to Spark DataFrame [SPARK-25981][R] Enables Arrow optimization from R DataFrame to Spark DataFrame Nov 8, 2018
@HyukjinKwon
Member Author

I have finished most of the TODOs, except waiting for the R API of Arrow 0.12.0 and making the corresponding fixes.

@SparkQA

SparkQA commented Nov 8, 2018

Test build #98615 has finished for PR 22954 at commit 2ba6add.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 8, 2018

Test build #98614 has finished for PR 22954 at commit 0903736.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Nov 9, 2018

Test build #98628 has finished for PR 22954 at commit 2ba6add.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon HyukjinKwon changed the title [SPARK-25981][R] Enables Arrow optimization from R DataFrame to Spark DataFrame [WIP][SPARK-25981][R] Enables Arrow optimization from R DataFrame to Spark DataFrame Jan 22, 2019
@felixcheung
Member

I was thinking a blog post in the Arrow project ;)

@HyukjinKwon
Member Author

Gotcha, yea, I am interested in it of course. I'll start working on that after this PR is merged.

@SparkQA

SparkQA commented Jan 22, 2019

Test build #101513 has finished for PR 22954 at commit 767af86.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Member

retest this please

@SparkQA

SparkQA commented Jan 22, 2019

Test build #101551 has finished for PR 22954 at commit 767af86.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon HyukjinKwon changed the title [WIP][SPARK-25981][R] Enables Arrow optimization from R DataFrame to Spark DataFrame [SPARK-25981][R] Enables Arrow optimization from R DataFrame to Spark DataFrame Jan 24, 2019
@HyukjinKwon
Member Author

HyukjinKwon commented Jan 24, 2019

To cut it short, I think this PR is ready to go. I reran the benchmark and updated the PR description.

A few things to mention:

  1. Arrow is not on CRAN yet, and it looks like that is going to take a few months (see ARROW-3204). So, for now, it has to be installed manually.

    • It can be installed with Rscript -e 'remotes::install_github("apache/arrow@apache-arrow-0.12.0", subdir = "r")'.
    • I used macOS Mojave 10.14.2 and ran into some problems that I had to fix in my environment. Please contact me if you run into issues while installing this. If this turns out to be a widespread problem, I will document it somewhere.
  2. It looks like we can run the build via AppVeyor once Arrow is on CRAN (see ARROW-3204).

  3. We should remove the workarounds that I used to avoid the CRAN check (see #22954 (comment) and #22954 (comment)).

Next items (I'm going to investigate first before filing JIRAs):

  1. I'm going to check whether we can do the same for Spark DataFrame -> R DataFrame.
  2. I'm also going to look at R native function APIs like lapply and gapply and see if we can optimize them.
  3. Before the Spark 3.0 release, I will document this. Hopefully, by then we can get rid of both workarounds I mentioned above and Arrow will be on CRAN.

@SparkQA

SparkQA commented Jan 24, 2019

Test build #101633 has finished for PR 22954 at commit 66b120b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 24, 2019

Test build #101630 has finished for PR 22954 at commit 92eec4e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 24, 2019

Test build #101632 has finished for PR 22954 at commit 854c9d8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

@felixcheung and @shivaram, are you okay with this plan #22954 (comment) ? If so, I think we can go ahead.

@felixcheung
Member

felixcheung commented Jan 25, 2019 via email

@HyukjinKwon
Member Author

Yea will do. Do you mind if we go ahead with this PR @felixcheung?

@felixcheung
Member

felixcheung commented Jan 26, 2019 via email

@HyukjinKwon
Member Author

Thanks, @felixcheung.

Merged to master.

@asfgit asfgit closed this in e8982ca Jan 27, 2019
@HyukjinKwon
Member Author

BTW, https://issues.apache.org/jira/browse/SPARK-26759 has subtasks for Arrow optimization (just FYI if anyone missed it)

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
… DataFrame

## What changes were proposed in this pull request?

This PR adds Arrow optimization for converting an R DataFrame to a Spark DataFrame.
As on the PySpark side, it falls back to the non-optimized code path when Arrow optimization cannot be used.

This can be tested as below:

```bash
$ ./bin/sparkR --conf spark.sql.execution.arrow.enabled=true
```

```r
collect(createDataFrame(mtcars))
```

### Requirements
  - R 3.5.x
  - Arrow package 0.12+
    ```bash
    Rscript -e 'remotes::install_github("apache/arrow@apache-arrow-0.12.0", subdir = "r")'
    ```

**Note:** currently, the Arrow R package is not on CRAN. Please take a look at ARROW-3204.
**Note:** currently, the Arrow R package does not seem to support Windows. Please take a look at ARROW-3204.

### Benchmarks

**Shell**

```bash
sync && sudo purge
./bin/sparkR --conf spark.sql.execution.arrow.enabled=false
```

```bash
sync && sudo purge
./bin/sparkR --conf spark.sql.execution.arrow.enabled=true
```

**R code**

```r
createDataFrame(mtcars) # Initializes
rdf <- read.csv("500000.csv")

test <- function() {
  options(digits.secs = 6) # print fractional seconds (up to 6 digits)
  start.time <- Sys.time()
  createDataFrame(rdf)
  end.time <- Sys.time()
  time.taken <- end.time - start.time
  print(time.taken)
}

test()
```

**Data (350 MB):**

```r
object.size(read.csv("500000.csv"))
350379504 bytes
```

"500000 Records"  http://eforexcel.com/wp/downloads-16-sample-csv-files-data-sets-for-testing/

**Results**

```
Time difference of 29.9468 secs
```

```
Time difference of 3.222129 secs
```

The performance improvement was around **950%**.
In fact, this PR improves performance by around **1200%**+ because it also includes a small optimization for the regular R DataFrame -> Spark DataFrame conversion. See apache#22954 (comment)

### Limitations:

For now, Arrow optimization with R does not support `raw` data, or the case where the user explicitly specifies a float type in the schema. Both produce corrupt values.
In these cases, we fall back to the non-optimized code path.

## How was this patch tested?

A small test was added.

I manually forced this optimization to `true` for _all_ R tests, and they _all_ passed (with a few fallback warnings).

**TODOs:**
- [x] Draft code
- [x] Make the tests pass
- [x] Make the CRAN check pass
- [x] Performance measurement
- [x] Supportability investigation (for instance types)
- [x] Wait for Arrow 0.12.0 release
- [x] Fix and match it to Arrow 0.12.0

Closes apache#22954 from HyukjinKwon/r-arrow-createdataframe.

Lead-authored-by: hyukjinkwon <gurwls223@apache.org>
Co-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
@HyukjinKwon HyukjinKwon deleted the r-arrow-createdataframe branch March 3, 2020 01:20