
[SPARK-27834][SQL][R][PYTHON] Make separate PySpark/SparkR vectorization configurations #24700

Closed
HyukjinKwon wants to merge 3 commits from the separate-sparkr-arrow branch

Conversation

@HyukjinKwon HyukjinKwon (Member) commented May 24, 2019

What changes were proposed in this pull request?

spark.sql.execution.arrow.enabled was added when we added the PySpark Arrow optimization.
Later, in the current master, the SparkR Arrow optimization was added, and it is controlled by the same configuration, spark.sql.execution.arrow.enabled.

There appear to be two issues with this:

  1. spark.sql.execution.arrow.enabled in PySpark was added in 2.3.0, whereas the SparkR optimization was added in 3.0.0. Their stability differs, so changing the default value for only one of the two optimizations is problematic.

  2. Suppose users share one JVM between PySpark and SparkR. If the configuration is set globally, they are currently forced to enable the optimization for both or neither.

This PR proposes two separate configuration groups, one for PySpark and one for SparkR, for Arrow optimization:

  • Deprecate spark.sql.execution.arrow.enabled
  • Add spark.sql.execution.arrow.pyspark.enabled (fallback to spark.sql.execution.arrow.enabled)
  • Add spark.sql.execution.arrow.sparkr.enabled
  • Deprecate spark.sql.execution.arrow.fallback.enabled
  • Add spark.sql.execution.arrow.pyspark.fallback.enabled (fallback to spark.sql.execution.arrow.fallback.enabled)

Note that spark.sql.execution.arrow.maxRecordsPerBatch is used on the JVM side for both.
Note that spark.sql.execution.arrow.fallback.enabled was added due to a behaviour change; we don't need it in SparkR, since the SparkR side falls back automatically. A sketch of how the new entries could be wired up is below.
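For illustration, a minimal sketch of the two new SQLConf entries, assuming SQLConf's internal ConfigBuilder and its fallbackConf helper; the exact entry names and doc strings in the patch may differ:

    // Sketch only: declared inside object SQLConf, next to the existing
    // (deprecated) ARROW_EXECUTION_ENABLED entry shown in the diff below.
    val ARROW_PYSPARK_EXECUTION_ENABLED =
      buildConf("spark.sql.execution.arrow.pyspark.enabled")
        .doc("When true, make use of Apache Arrow for columnar data transfers in PySpark. " +
          "Falls back to 'spark.sql.execution.arrow.enabled' when unset.")
        .fallbackConf(ARROW_EXECUTION_ENABLED)

    // The SparkR entry is independent and has no fallback conf: the SparkR
    // side falls back to the non-Arrow path automatically.
    val ARROW_SPARKR_EXECUTION_ENABLED =
      buildConf("spark.sql.execution.arrow.sparkr.enabled")
        .doc("When true, make use of Apache Arrow for columnar data transfers in SparkR.")
        .booleanConf
        .createWithDefault(false)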

How was this patch tested?

Manually tested, and some unit tests were added.
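As a rough illustration of the intended fallback behaviour (not an actual test from this patch; spark is assumed to be a SparkSession):

    // Setting only the deprecated key should also take effect for the new
    // PySpark key via the fallback, while the SparkR key stays independent.
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")
    assert(spark.conf.get("spark.sql.execution.arrow.pyspark.enabled") == "true")
    assert(spark.conf.get("spark.sql.execution.arrow.sparkr.enabled") == "false")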

@@ -1326,14 +1326,24 @@ object SQLConf {

   val ARROW_EXECUTION_ENABLED =
     buildConf("spark.sql.execution.arrow.enabled")
-      .doc("When true, make use of Apache Arrow for columnar data transfers. " +
-        "In case of PySpark, " + ...
+      .doc("(Deprecated since Spark 3.0, please set 'spark.sql.pyspark.execution.arrow.enabled'.)")
Member Author

Seems I should use the hardcoded name here, to avoid having the two configs refer to each other.

       .createWithDefault(false)

+  val PYSPARK_ARROW_EXECUTION_ENABLED =
+    buildConf("spark.sql.pyspark.execution.arrow.enabled")
Member

spark.pyspark.arrow.enabled ?

Member Author

Actually, that's what I tried first, but this is in SQLConf.scala. If we went with a pyspark or sparkr prefix, those configurations would have to be SparkConf entries under, for instance, Python.scala.


"In case of SparkR," +
val SPARKR_ARROW_EXECUTION_ENABLED =
buildConf("spark.sql.sparkr.execution.arrow.enabled")
Member

spark.sparkr.arrow.enabled ?

@felixcheung
Member

I think it's fair, but just to call out: spark.sql.sparkr.* doesn't seem consistent with the existing naming scheme, I believe.

@SparkQA

SparkQA commented May 24, 2019

Test build #105763 has finished for PR 24700 at commit beec132.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

Yes.. I am not sure what we should name it, though. If we name it spark.pyspark, then it's usually a Spark conf on the SparkContext. I was thinking spark.sql.pyspark makes sense too, in a way, because it works closely with SQL.

@HyukjinKwon
Member Author

Adding @BryanCutler and @viirya too. Let me go ahead with it. The naming is a bit odd, but I think we should use spark.sql.sparkr.* to keep it in SQLConf as before. Otherwise, we can't use it at session level as before. A session-level illustration is below.
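To make the session-level point concrete, a hypothetical comparison (spark is a SparkSession; spark.sparkr.arrow.enabled is the rejected alternative name, not a real config):

    // Kept under spark.sql.*, the entry lives in SQLConf and can be
    // flipped per session at runtime:
    val session = spark.newSession()
    session.conf.set("spark.sql.sparkr.execution.arrow.enabled", "true")
    // A plain SparkConf entry such as the rejected "spark.sparkr.arrow.enabled"
    // would instead have to be fixed when the application is launched, e.g.:
    //   ./bin/sparkR --conf spark.sparkr.arrow.enabled=true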

@HyukjinKwon
Member Author

If there are no concerns other than that, let me go ahead.

@viirya viirya left a comment
Member

Overall I think this makes sense. Still no better idea about the naming.

Review thread on docs/sql-pyspark-pandas-with-arrow.md (outdated; resolved)
@SparkQA

SparkQA commented May 28, 2019

Test build #105851 has finished for PR 24700 at commit 9fbc9e1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Member

Not a big deal, but would it make a little more sense to call them spark.sql.execution.arrow.pyspark.enabled and spark.sql.execution.arrow.sparkr.enabled? That roughly follows the Scala package layout and is a little more consistent with the other Arrow confs. The current way is fine with me too, though.

@HyukjinKwon
Member Author

Hmmmmm .. yeah, I can just grep and replace .. I don't have a preference. One argument I can think of is that pyspark or sparkr reads as broader in scope, but I don't mind. WDYT @gatorsmile, @felixcheung, @viirya, @rxin?

spark.sql.execution.arrow.pyspark.enabled vs spark.sql.pyspark.execution.arrow.enabled

Just pick one (no need to list reasons).

@viirya
Member

viirya commented May 29, 2019

spark.sql.execution.arrow.pyspark.enabled looks slightly better.

@HyukjinKwon
Member Author

Let me switch it to the spark.sql.execution.arrow.pyspark.enabled naming soon.

@felixcheung
Member

ok

@SparkQA

SparkQA commented May 31, 2019

Test build #105987 has finished for PR 24700 at commit 6ad1cd8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 31, 2019

Test build #105988 has finished for PR 24700 at commit f6a2d99.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

Is everybody happy with it :-) ?

@HyukjinKwon
Member Author

I am merging this - it looks like we're positive on this in general, and there are no notable outstanding comments.

@gatorsmile
Member

LGTM

@HyukjinKwon HyukjinKwon deleted the separate-sparkr-arrow branch March 3, 2020 01:19