[WIP][SPARK-34625][R] Enable Arrow optimization for float types with SparkR #31744

msummersgill · 2021-03-04T21:40:23Z

What changes were proposed in this pull request?

I deleted several error-handlers like the following from the SparkR package types.R file.

  if (any(field_strings == "FloatType")) {
    stop("Arrow optimization in R does not support float type yet.")
  }

Before the proposed changes, error handlers in types.R like the snippet below prevented arrow optimization from being to applied to float types.

Why are the changes needed?

The R arrow package now supports FloatType, BinaryType, ArrayType,StructType. This was brought to my attention by Neal Richardson, maintainer of the R Arrow package, in the comments on this issue: https://issues.apache.org/jira/browse/ARROW-3783

This change allows SparkR users to leverage arrow optimization for the additional types supported. Documentation for currently supported types can be viewed at https://arrow.apache.org/docs/r/articles/arrow.html#arrow-to-r

str(collect(SparkR::sql("SELECT float('1') AS x;"))$x[[1]])
## num 1

## Warning message:In value[[3L]](cond) :  The conversion from Spark DataFrame to R DataFrame was attempted with Arrow optimization because 'spark.sql.execution.arrow.sparkr.enabled' is set to true; however, failed, attempting non-optimization. Reason: Error in checkSchemaInArrow(schema(x)): Arrow optimization in R does not support float type yet.

Does this PR introduce any user-facing change?

FloatType, BinaryType, ArrayType, and StructType, types will now be returned with arrow optimization (when spark.sql.execution.arrow.sparkr.enabled = "true" and the R arrow package is available in the executing environment.

How was this patch tested?

I built a copy of the SparkR package locally under R 3.6.0 using this branch, connected to a Databricks cluster running Databricks runtime version 7.3 LTS (Spark 3.0.1, Scala 2.12), and executed the following to test whether FloatType could be returned without error.

str(collect(SparkR::sql("SELECT float('-9999999999999999.9999999999999') AS x1,
                        float('-1.0') AS x2,
                        float('-0.00001') AS x3,
                        float('0') AS x4,
                        float('0.00001') AS x5,
                        float('1.0') AS x6,
                        float('9999999999999999.9999999999999') AS x7;")))

# 'data.frame':	1 obs. of  7 variables:
#  $ x1: num -1e+16
#  $ x2: num -1
#  $ x3: num -1e-05
#  $ x4: num 0
#  $ x5: num 1e-05
#  $ x6: num 1
#  $ x7: num 1e+16

In addition, I executed a handful of quick tests verifying that BinaryType, ArrayType, and StructType could be returned as well. However, I have not worked with these data types in Spark, so I think some additional testing is probably merited.

str(collect(SparkR::sql("SELECT binary('0') x;"))$x[[1]])
# raw 30
str(collect(SparkR::sql('SELECT array(\'{"x": [{"a": 1, "b": 2, "c": 3}]}\') x;'))$x[[1]])
# chr "{\"x\": [{\"a\": 1, \"b\": 2, \"c\": 3}]}"
str(collect(SparkR::sql("SELECT struct('1') x;"))$x[[1]])
# chr "1"

dongjoon-hyun · 2021-03-04T23:06:21Z

ok to test

dongjoon-hyun · 2021-03-04T23:10:21Z

cc @HyukjinKwon and @sunchao

SparkQA · 2021-03-04T23:12:34Z

Test build #135778 has finished for PR 31744 at commit 4d5de83.

This patch fails R style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-03-05T00:01:09Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40360/

SparkQA · 2021-03-05T00:10:52Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40360/

HyukjinKwon · 2021-03-05T01:30:44Z

R/pkg/R/types.R

@@ -94,20 +94,7 @@ checkSchemaInArrow <- function(schema) {

  # Both cases below produce a corrupt value for unknown reason. It needs to be investigated.
  field_strings <- sapply(schema$fields(), function(x) x$dataType.toString())
-  if (any(field_strings == "FloatType")) {


Sounds good. Can we add a test case? I think you can update the test cases with post-fix "- type specification" at test_sparkSQL_arrow.R

SparkQA · 2021-04-25T07:32:36Z

Test build #137909 has finished for PR 31744 at commit 4d5de83.

This patch fails R style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-05-19T12:58:25Z

Test build #138711 has finished for PR 31744 at commit a4d9091.

This patch fails R style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-05-19T13:52:55Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43232/

SparkQA · 2021-05-19T14:30:29Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43232/

github-actions · 2021-08-28T00:08:22Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

HyukjinKwon

@msummersgill the change seems promising from a cursory look. Mind adding some basic tests?

github-actions · 2021-12-07T00:11:25Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

Enable Arrow optimization for float types

5c98a41

github-actions bot added R SQL labels Mar 4, 2021

msummersgill added 2 commits March 4, 2021 15:49

Add arrow support for Binary/Array/Struct types

47e3be6

Delete error tests for types now supported by arrow

4d5de83

msummersgill changed the title ~~[WIP] Enable Arrow optimization for float types with using SparkR~~ [WIP] Enable Arrow optimization for float types with SparkR Mar 4, 2021

HyukjinKwon reviewed Mar 5, 2021

View reviewed changes

HyukjinKwon changed the title ~~[WIP] Enable Arrow optimization for float types with SparkR~~ [WIP][SPARK-34625][R] Enable Arrow optimization for float types with SparkR Mar 7, 2021

Merge branch 'apache:master' into msummersgill-arrow-supports-float

a4d9091

github-actions bot added the Stale label Aug 28, 2021

HyukjinKwon removed the Stale label Aug 28, 2021

HyukjinKwon reviewed Aug 28, 2021

View reviewed changes

github-actions bot added the Stale label Dec 7, 2021

github-actions bot closed this Dec 8, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP][SPARK-34625][R] Enable Arrow optimization for float types with SparkR #31744

[WIP][SPARK-34625][R] Enable Arrow optimization for float types with SparkR #31744

msummersgill commented Mar 4, 2021 •

edited

dongjoon-hyun commented Mar 4, 2021

dongjoon-hyun commented Mar 4, 2021

SparkQA commented Mar 4, 2021

SparkQA commented Mar 5, 2021

SparkQA commented Mar 5, 2021

HyukjinKwon Mar 5, 2021

SparkQA commented Apr 25, 2021

SparkQA commented May 19, 2021

SparkQA commented May 19, 2021

SparkQA commented May 19, 2021

github-actions bot commented Aug 28, 2021

HyukjinKwon left a comment

github-actions bot commented Dec 7, 2021

[WIP][SPARK-34625][R] Enable Arrow optimization for float types with SparkR #31744

[WIP][SPARK-34625][R] Enable Arrow optimization for float types with SparkR #31744

Conversation

msummersgill commented Mar 4, 2021 • edited

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

dongjoon-hyun commented Mar 4, 2021

dongjoon-hyun commented Mar 4, 2021

SparkQA commented Mar 4, 2021

SparkQA commented Mar 5, 2021

SparkQA commented Mar 5, 2021

HyukjinKwon Mar 5, 2021

Choose a reason for hiding this comment

SparkQA commented Apr 25, 2021

SparkQA commented May 19, 2021

SparkQA commented May 19, 2021

SparkQA commented May 19, 2021

github-actions bot commented Aug 28, 2021

HyukjinKwon left a comment

Choose a reason for hiding this comment

github-actions bot commented Dec 7, 2021

msummersgill commented Mar 4, 2021 •

edited