
[SPARK-27834][SQL][R][PYTHON] Make separate PySpark/SparkR vectorization configurations #24700

Closed
HyukjinKwon wants to merge 3 commits from the separate-sparkr-arrow branch

Conversation

@HyukjinKwon HyukjinKwon (Member) commented May 24, 2019

What changes were proposed in this pull request?

spark.sql.execution.arrow.enabled was added when we added the PySpark Arrow optimization.
Later, in the current master, the SparkR Arrow optimization was added, and it is controlled by the same configuration, spark.sql.execution.arrow.enabled.

There appear to be two issues with this:

  1. spark.sql.execution.arrow.enabled in PySpark was added in 2.3.0, whereas the SparkR optimization was added in 3.0.0. Their stability differs, so changing the default value for only one of the two optimizations is problematic.

  2. Suppose users share one JVM between PySpark and SparkR. If the configuration is set globally, they are currently forced to enable the optimization for both or neither.

This PR proposes two separate configuration groups, one for PySpark and one for SparkR, for Arrow optimization:

  • Deprecate spark.sql.execution.arrow.enabled
  • Add spark.sql.execution.arrow.pyspark.enabled (fallback to spark.sql.execution.arrow.enabled)
  • Add spark.sql.execution.arrow.sparkr.enabled
  • Deprecate spark.sql.execution.arrow.fallback.enabled
  • Add spark.sql.execution.arrow.pyspark.fallback.enabled (fallback to spark.sql.execution.arrow.fallback.enabled)

Note that spark.sql.execution.arrow.maxRecordsPerBatch is used on the JVM side for both.
Note that spark.sql.execution.arrow.fallback.enabled was added due to a behaviour change; we don't need it in SparkR, since the SparkR side falls back automatically. A sketch of how the new entries could be wired up is below.
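For illustration, a minimal sketch of the two new SQLConf entries, assuming SQLConf's internal ConfigBuilder and its fallbackConf helper; the exact entry names and doc strings in the patch may differ:

    // Sketch only: declared inside object SQLConf, next to the existing
    // (deprecated) ARROW_EXECUTION_ENABLED entry shown in the diff below.
    val ARROW_PYSPARK_EXECUTION_ENABLED =
      buildConf("spark.sql.execution.arrow.pyspark.enabled")
        .doc("When true, make use of Apache Arrow for columnar data transfers in PySpark. " +
          "Falls back to 'spark.sql.execution.arrow.enabled' when unset.")
        .fallbackConf(ARROW_EXECUTION_ENABLED)

    // The SparkR entry is independent and has no fallback conf: the SparkR
    // side falls back to the non-Arrow path automatically.
    val ARROW_SPARKR_EXECUTION_ENABLED =
      buildConf("spark.sql.execution.arrow.sparkr.enabled")
        .doc("When true, make use of Apache Arrow for columnar data transfers in SparkR.")
        .booleanConf
        .createWithDefault(false)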

How was this patch tested?

Manually tested, and some unit tests were added.
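As a rough illustration of the intended fallback behaviour (not an actual test from this patch; spark is assumed to be a SparkSession):

    // Setting only the deprecated key should also take effect for the new
    // PySpark key via the fallback, while the SparkR key stays independent.
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")
    assert(spark.conf.get("spark.sql.execution.arrow.pyspark.enabled") == "true")
    assert(spark.conf.get("spark.sql.execution.arrow.sparkr.enabled") == "false")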

@@ -1326,14 +1326,24 @@ object SQLConf {

   val ARROW_EXECUTION_ENABLED =
     buildConf("spark.sql.execution.arrow.enabled")
-      .doc("When true, make use of Apache Arrow for columnar data transfers. " +
-        "In case of PySpark, " + ...
+      .doc("(Deprecated since Spark 3.0, please set 'spark.sql.pyspark.execution.arrow.enabled'.)")
Member Author

Seems I should use the hardcoded name here, to avoid having the two configs refer to each other.

       .createWithDefault(false)

+  val PYSPARK_ARROW_EXECUTION_ENABLED =
+    buildConf("spark.sql.pyspark.execution.arrow.enabled")
Member

spark.pyspark.arrow.enabled ?

Member Author

Actually, that's what I tried first, but this is in SQLConf.scala. If we went with a pyspark or sparkr prefix, those configurations would have to be SparkConf entries under, for instance, Python.scala.


"In case of SparkR," +
val SPARKR_ARROW_EXECUTION_ENABLED =
buildConf("spark.sql.sparkr.execution.arrow.enabled")
Member

spark.sparkr.arrow.enabled ?

@felixcheung
Member

I think it's fair, but just to call out: spark.sql.sparkr.* doesn't seem consistent with the existing naming scheme, I believe.

@SparkQA

SparkQA commented May 24, 2019

Test build #105763 has finished for PR 24700 at commit beec132.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

Yes.. I am not sure what we should name it, though. If we name it spark.pyspark, then it's usually a Spark conf on the SparkContext. I was thinking spark.sql.pyspark makes sense too, in a way, because it works closely with SQL.

@HyukjinKwon
Member Author

Adding @BryanCutler and @viirya too. Let me go ahead with it. The naming is a bit odd, but I think we should use spark.sql.sparkr.* to keep it in SQLConf as before. Otherwise, we can't use it at session level as before. A session-level illustration is below.
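To make the session-level point concrete, a hypothetical comparison (spark is a SparkSession; spark.sparkr.arrow.enabled is the rejected alternative name, not a real config):

    // Kept under spark.sql.*, the entry lives in SQLConf and can be
    // flipped per session at runtime:
    val session = spark.newSession()
    session.conf.set("spark.sql.sparkr.execution.arrow.enabled", "true")
    // A plain SparkConf entry such as the rejected "spark.sparkr.arrow.enabled"
    // would instead have to be fixed when the application is launched, e.g.:
    //   ./bin/sparkR --conf spark.sparkr.arrow.enabled=true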

@HyukjinKwon
Member Author

If there are no concerns other than that, let me go ahead.

@viirya viirya left a comment
Member

Overall I think this makes sense. Still no better idea about the naming.

Review thread on docs/sql-pyspark-pandas-with-arrow.md (outdated; resolved)
@SparkQA

SparkQA commented May 28, 2019

Test build #105851 has finished for PR 24700 at commit 9fbc9e1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Member

Not a big deal, but would it make a little more sense to call them spark.sql.execution.arrow.pyspark.enabled and spark.sql.execution.arrow.sparkr.enabled? That roughly follows the Scala package layout and is a little more consistent with the other Arrow confs. The current way is fine with me too, though.

@HyukjinKwon
Member Author

Hmmmmm .. yeah, I can just grep and replace .. I don't have a preference. One argument I can think of is that pyspark or sparkr reads as broader in scope, but I don't mind. WDYT @gatorsmile, @felixcheung, @viirya, @rxin?

spark.sql.execution.arrow.pyspark.enabled vs spark.sql.pyspark.execution.arrow.enabled

Just pick one (no need to list reasons).

@viirya
Member

viirya commented May 29, 2019

spark.sql.execution.arrow.pyspark.enabled looks slightly better.

@HyukjinKwon
Member Author

Let me switch it to the spark.sql.execution.arrow.pyspark.enabled naming soon.

@felixcheung
Member

ok

@SparkQA

SparkQA commented May 31, 2019

Test build #105987 has finished for PR 24700 at commit 6ad1cd8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 31, 2019

Test build #105988 has finished for PR 24700 at commit f6a2d99.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

Is everybody happy with it :-) ?

@HyukjinKwon
Member Author

I am merging this - it looks like we're positive on this in general, and there are no notable outstanding comments.

@gatorsmile
Member

LGTM

@HyukjinKwon HyukjinKwon deleted the separate-sparkr-arrow branch March 3, 2020 01:19