Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP][SPARK-34625][R] Enable Arrow optimization for float types with SparkR #31744

Conversation

msummersgill
Copy link

@msummersgill msummersgill commented Mar 4, 2021

What changes were proposed in this pull request?

I deleted several error-handlers like the following from the SparkR package types.R file.

  if (any(field_strings == "FloatType")) {
    stop("Arrow optimization in R does not support float type yet.")
  }

Before the proposed changes, error handlers in types.R like the snippet below prevented arrow optimization from being to applied to float types.

Why are the changes needed?

The R arrow package now supports FloatType, BinaryType, ArrayType,StructType. This was brought to my attention by Neal Richardson, maintainer of the R Arrow package, in the comments on this issue: https://issues.apache.org/jira/browse/ARROW-3783

This change allows SparkR users to leverage arrow optimization for the additional types supported. Documentation for currently supported types can be viewed at https://arrow.apache.org/docs/r/articles/arrow.html#arrow-to-r

str(collect(SparkR::sql("SELECT float('1') AS x;"))$x[[1]])
## num 1

## Warning message:In value[[3L]](cond) :  The conversion from Spark DataFrame to R DataFrame was attempted with Arrow optimization because 'spark.sql.execution.arrow.sparkr.enabled' is set to true; however, failed, attempting non-optimization. Reason: Error in checkSchemaInArrow(schema(x)): Arrow optimization in R does not support float type yet.

Does this PR introduce any user-facing change?

FloatType, BinaryType, ArrayType, and StructType, types will now be returned with arrow optimization (when spark.sql.execution.arrow.sparkr.enabled = "true" and the R arrow package is available in the executing environment.

How was this patch tested?

I built a copy of the SparkR package locally under R 3.6.0 using this branch, connected to a Databricks cluster running Databricks runtime version 7.3 LTS (Spark 3.0.1, Scala 2.12), and executed the following to test whether FloatType could be returned without error.

str(collect(SparkR::sql("SELECT float('-9999999999999999.9999999999999') AS x1,
                        float('-1.0') AS x2,
                        float('-0.00001') AS x3,
                        float('0') AS x4,
                        float('0.00001') AS x5,
                        float('1.0') AS x6,
                        float('9999999999999999.9999999999999') AS x7;")))

# 'data.frame':	1 obs. of  7 variables:
#  $ x1: num -1e+16
#  $ x2: num -1
#  $ x3: num -1e-05
#  $ x4: num 0
#  $ x5: num 1e-05
#  $ x6: num 1
#  $ x7: num 1e+16

In addition, I executed a handful of quick tests verifying that BinaryType, ArrayType, and StructType could be returned as well. However, I have not worked with these data types in Spark, so I think some additional testing is probably merited.

str(collect(SparkR::sql("SELECT binary('0') x;"))$x[[1]])
# raw 30
str(collect(SparkR::sql('SELECT array(\'{"x": [{"a": 1, "b": 2, "c": 3}]}\') x;'))$x[[1]])
# chr "{\"x\": [{\"a\": 1, \"b\": 2, \"c\": 3}]}"
str(collect(SparkR::sql("SELECT struct('1') x;"))$x[[1]])
# chr "1"

@msummersgill msummersgill changed the title [WIP] Enable Arrow optimization for float types with using SparkR [WIP] Enable Arrow optimization for float types with SparkR Mar 4, 2021
@dongjoon-hyun
Copy link
Member

ok to test

@dongjoon-hyun
Copy link
Member

cc @HyukjinKwon and @sunchao

@SparkQA
Copy link

SparkQA commented Mar 4, 2021

Test build #135778 has finished for PR 31744 at commit 4d5de83.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 5, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40360/

@SparkQA
Copy link

SparkQA commented Mar 5, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40360/

@@ -94,20 +94,7 @@ checkSchemaInArrow <- function(schema) {

# Both cases below produce a corrupt value for unknown reason. It needs to be investigated.
field_strings <- sapply(schema$fields(), function(x) x$dataType.toString())
if (any(field_strings == "FloatType")) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Can we add a test case? I think you can update the test cases with post-fix "- type specification" at test_sparkSQL_arrow.R

@HyukjinKwon HyukjinKwon changed the title [WIP] Enable Arrow optimization for float types with SparkR [WIP][SPARK-34625][R] Enable Arrow optimization for float types with SparkR Mar 7, 2021
@SparkQA
Copy link

SparkQA commented Apr 25, 2021

Test build #137909 has finished for PR 31744 at commit 4d5de83.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 19, 2021

Test build #138711 has finished for PR 31744 at commit a4d9091.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 19, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43232/

@SparkQA
Copy link

SparkQA commented May 19, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43232/

@github-actions
Copy link

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@msummersgill the change seems promising from a cursory look. Mind adding some basic tests?

@github-actions
Copy link

github-actions bot commented Dec 7, 2021

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Dec 7, 2021
@github-actions github-actions bot closed this Dec 8, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants