[WIP][SPARK-34625][R] Enable Arrow optimization for float types with SparkR #31744
Conversation
ok to test
cc @HyukjinKwon and @sunchao
Test build #135778 has finished for PR 31744 at commit
Kubernetes integration test starting
Kubernetes integration test status failure
```diff
@@ -94,20 +94,7 @@ checkSchemaInArrow <- function(schema) {
-  # Both cases below produce a corrupt value for unknown reason. It needs to be investigated.
-  field_strings <- sapply(schema$fields(), function(x) x$dataType.toString())
-  if (any(field_strings == "FloatType")) {
```
Sounds good. Can we add a test case? I think you can update the test cases with the suffix "- type specification" in test_sparkSQL_arrow.R
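A type-specification test along those lines might look like the following sketch (hypothetical: the test name and the DDL schema string are assumptions, modeled on the existing tests in test_sparkSQL_arrow.R):

```r
# Hypothetical sketch of a "- type specification" test for a float column.
test_that("createDataFrame/collect Arrow optimization - type specification (float)", {
  df <- createDataFrame(data.frame(a = 1.5), schema = structType("a float"))
  # 1.5 is exactly representable as a 32-bit float, so an exact comparison is safe
  expect_equal(collect(df)$a, 1.5)
})
```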
Test build #137909 has finished for PR 31744 at commit
Test build #138711 has finished for PR 31744 at commit
Kubernetes integration test starting
Kubernetes integration test status success
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
@msummersgill the change seems promising from a cursory look. Mind adding some basic tests?
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
I deleted several error handlers like the following from the SparkR package's `types.R` file. Before the proposed changes, error handlers in `types.R` like the snippet below prevented Arrow optimization from being applied to float types.

Why are the changes needed?
The R `arrow` package now supports `FloatType`, `BinaryType`, `ArrayType`, and `StructType`. This was brought to my attention by Neal Richardson, maintainer of the R Arrow package, in the comments on this issue: https://issues.apache.org/jira/browse/ARROW-3783

This change allows SparkR users to leverage Arrow optimization for the additional types supported. Documentation for currently supported types can be viewed at https://arrow.apache.org/docs/r/articles/arrow.html#arrow-to-r
Does this PR introduce any user-facing change?
`FloatType`, `BinaryType`, `ArrayType`, and `StructType` will now be returned with Arrow optimization (when `spark.sql.execution.arrow.sparkr.enabled = "true"` and the R `arrow` package is available in the executing environment).

How was this patch tested?
I built a copy of the SparkR package locally under R 3.6.0 using this branch, connected to a Databricks cluster running Databricks runtime version 7.3 LTS (Spark 3.0.1, Scala 2.12), and executed the following to test whether `FloatType` could be returned without error.

In addition, I executed a handful of quick tests verifying that `BinaryType`, `ArrayType`, and `StructType` could be returned as well. However, I have not worked with these data types in Spark, so I think some additional testing is probably merited.