[SPARK-27834][SQL][R][PYTHON] Make separate PySpark/SparkR vectorization configurations #24700
Conversation
@@ -1326,14 +1326,24 @@ object SQLConf {

   val ARROW_EXECUTION_ENABLED =
     buildConf("spark.sql.execution.arrow.enabled")
-      .doc("When true, make use of Apache Arrow for columnar data transfers." +
-        "In case of PySpark, " +
+      .doc("(Deprecated since Spark 3.0, please set 'spark.sql.pyspark.execution.arrow.enabled'.)")
Seems I should use a hardcoded key here to avoid having the two entries refer to each other.
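For context on that remark: the entries in object SQLConf are plain vals initialized in declaration order, so building the deprecation notice from the other entry's key is fragile, and having the two entries reference each other could not be fixed by reordering at all. A rough sketch of the distinction, using the names from this diff (illustrative only, not the actual patch):

```scala
// Inside object SQLConf (sketch):

// Hardcoded key in the deprecation notice; safe regardless of declaration order.
val ARROW_EXECUTION_ENABLED =
  buildConf("spark.sql.execution.arrow.enabled")
    .doc("(Deprecated since Spark 3.0, please set 'spark.sql.pyspark.execution.arrow.enabled'.)")
    .booleanConf
    .createWithDefault(false)

// The alternative would read the other entry's key, e.g.
//   .doc(s"(Deprecated since Spark 3.0, please set '${PYSPARK_ARROW_EXECUTION_ENABLED.key}'.)")
// but that entry is declared further down in the same object, so it is still
// null while this initializer runs.
```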
       .createWithDefault(false)

+  val PYSPARK_ARROW_EXECUTION_ENABLED =
+    buildConf("spark.sql.pyspark.execution.arrow.enabled")
spark.pyspark.arrow.enabled ?
Actually that's what I tried first, but this is in SQLConf.scala. If we go for a PySpark or SparkR prefix, those configurations should be SparkConf entries under Python.scala, for instance.
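As an aside, the new entry presumably picks up the legacy key through ConfigBuilder's fallbackConf, so spark.sql.execution.arrow.enabled is still honoured when the new key is unset. A sketch under that assumption, with the key as it appears in this revision (it is renamed to spark.sql.execution.arrow.pyspark.enabled later in this PR) and with illustrative doc text:

```scala
// Inside object SQLConf (sketch, assuming fallbackConf wiring):
val PYSPARK_ARROW_EXECUTION_ENABLED =
  buildConf("spark.sql.pyspark.execution.arrow.enabled")
    .doc("When true, make use of Apache Arrow for columnar data transfers in PySpark. " +
      "If unset, the value of 'spark.sql.execution.arrow.enabled' is used instead.")
    .fallbackConf(ARROW_EXECUTION_ENABLED)
```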
-        "In case of SparkR," +

+  val SPARKR_ARROW_EXECUTION_ENABLED =
+    buildConf("spark.sql.sparkr.execution.arrow.enabled")
spark.sparkr.arrow.enabled ?
I think it's fair, but just to call out,
Test build #105763 has finished for PR 24700 at commit
Yes.. I am not sure what we should name it, though. If we name it spark.pyspark, then it's usually a Spark conf at SparkContext. I was thinking spark.sql.pyspark makes sense too in a way, because it works closely with SQL.
Adding @BryanCutler, @viirya too. Let me go ahead with it. The naming is a bit odd but I think we should use
If there are no concerns other than that, let me go ahead.
Overall I think this makes sense. Still no better idea about the naming.
Test build #105851 has finished for PR 24700 at commit
Not a big deal, but would it make a little more sense to be called
hmmmmmm .. yea, I can just grep and replace .. I don't have a preference. One argument I can think of is that
Just pick one (you don't have to list the reasons).
Let me replace it with
Force-pushed "…rrow.[pyspark|sparkr].*" from 6ad1cd8 to f6a2d99
ok
Test build #105987 has finished for PR 24700 at commit
Test build #105988 has finished for PR 24700 at commit
Is everybody happy with it :-) ?
I am merging this - looks like we're positive on this in general and no notable comments.
LGTM
What changes were proposed in this pull request?
spark.sql.execution.arrow.enabled was added when we added the PySpark Arrow optimization. Later, in the current master, the SparkR Arrow optimization was added, and it is controlled by the same configuration, spark.sql.execution.arrow.enabled. There are two issues with this:

1. spark.sql.execution.arrow.enabled in PySpark was added in 2.3.0, whereas the SparkR optimization was added in 3.0.0. Their stability differs, so it is problematic when we want to change the default value for only one of the two optimizations first.

2. Suppose users want to share one JVM between PySpark and SparkR. They are currently forced to use the optimization for both or for neither if the configuration is set globally.
This PR proposes two separate groups of configurations for the Arrow optimization, one for PySpark and one for SparkR (a usage sketch follows the list below):

- spark.sql.execution.arrow.enabled: deprecated.
- spark.sql.execution.arrow.pyspark.enabled: added (falls back to spark.sql.execution.arrow.enabled).
- spark.sql.execution.arrow.sparkr.enabled: added.
- spark.sql.execution.arrow.fallback.enabled: deprecated.
- spark.sql.execution.arrow.pyspark.fallback.enabled: added (falls back to spark.sql.execution.arrow.fallback.enabled).
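To make the resulting behaviour concrete, here is a small hypothetical usage sketch in Scala (the object and app names are made up; it assumes a Spark build that includes this change):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical example: enable Arrow-based transfers for PySpark only,
// leaving SparkR on the non-Arrow path, via the new per-language keys.
object ArrowConfExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("arrow-conf-example")
      .master("local[*]")
      .config("spark.sql.execution.arrow.pyspark.enabled", "true")
      .config("spark.sql.execution.arrow.sparkr.enabled", "false")
      .getOrCreate()

    // The deprecated key spark.sql.execution.arrow.enabled is only consulted
    // as a fallback when the PySpark-specific key above is not set.
    println(spark.conf.get("spark.sql.execution.arrow.pyspark.enabled"))  // true

    spark.stop()
  }
}
```

Since these are SQL configurations, they can also be changed at runtime with spark.conf.set.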
Note that spark.sql.execution.arrow.maxRecordsPerBatch is used on the JVM side for both. Also note that spark.sql.execution.arrow.fallback.enabled was added because of a behaviour change; we don't need an equivalent in SparkR, since the SparkR side already falls back automatically.

How was this patch tested?
Manually tested, and some unit tests were added.