[SPARK-54555][PYTHON] Enable Arrow-optimized Python UDFs and Arrow-based PySpark IPC by default #53264
Conversation
Sorry, but -1 for Apache Spark 4.1.0, @asl3.
We are already in RC2 status. It's really too late.
Review comment on sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (outdated, resolved)
Yeah let's do this in
If you re-target this to 4.2.0, we can merge this and backport this to your company, @asl3.
Agreed, retargeting to 4.2.0 / master @HyukjinKwon @dongjoon-hyun
dongjoon-hyun left a comment
Thank you, @asl3 and @HyukjinKwon.
allisonwang-db left a comment
shujingyang-db left a comment
It would be great if we could document the exact type coercion differences introduced by this change.
dbtsai left a comment
LGTM.
+1 on the doc. We can create a separate PR for it as a follow-up.
Regarding the failure in pyspark.sql.tests.connect.test_connect_creation.SparkConnectCreationTests.test_with_none_and_nan, you can just skip it for now; I will take a look.
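A minimal sketch of the suggested temporary skip, assuming a standard unittest-style suite; the decorator placement and message are illustrative, not code from the PR:

```python
import unittest

class SparkConnectCreationTests(unittest.TestCase):
    # Hypothetical placement: skip the failing test until the unrelated CI issue is resolved.
    @unittest.skip("Temporarily skipped; failure is unrelated to SPARK-54555 (see discussion).")
    def test_with_none_and_nan(self):
        ...
```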
Merged into master as the error is unrelated to this PR. Thanks.
#53296 to restore the CI.
Besides https://github.com/apache/spark/actions/runs/19878465653/job/56971154150, @HyukjinKwon @dbtsai @asl3 shall we revert it for now?
@zhengruifeng Thanks! There was a doc fix I had not pushed - I can push a follow-up to add the whitespace: #53298 |
What changes were proposed in this pull request?
Enable PySpark Arrow-based optimizations by default in Spark 4.2.0, updating default conf values:
- spark.sql.execution.pythonUDF.arrow.enabled and spark.sql.execution.pythonUDTF.arrow.enabled to true by default, enabling Arrow-optimized execution for regular Python UDFs and UDTFs.
- spark.sql.execution.arrow.pyspark.enabled to true by default, enabling Arrow-based columnar data exchange for PySpark APIs such as DataFrame.toPandas and SparkSession.createDataFrame when the input is a pandas DataFrame or NumPy array.
- Update user-facing docs and migration guides to reflect the change (a short sketch of the affected paths is shown below).
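A minimal sketch of what the new defaults affect, assuming a local SparkSession with pandas and pyarrow installed; the data is illustrative:

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# With this change these confs default to "true"; reading them here is only illustrative.
for conf in (
    "spark.sql.execution.pythonUDF.arrow.enabled",
    "spark.sql.execution.pythonUDTF.arrow.enabled",
    "spark.sql.execution.arrow.pyspark.enabled",
):
    print(conf, "=", spark.conf.get(conf))

# Arrow-based columnar exchange covers pandas <-> Spark conversions such as these.
pdf = pd.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})
sdf = spark.createDataFrame(pdf)  # pandas -> Spark goes through Arrow when enabled
roundtrip = sdf.toPandas()        # Spark -> pandas goes through Arrow when enabled
print(roundtrip)
```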
Why are the changes needed?
Arrow’s columnar IPC significantly improves JVM↔Python throughput and reduces serialization/deserialization overhead, speeding up Python UDFs and DataFrame conversions. Additionally, Arrow provides consistent, well-defined rules for type coercion when Python return values differ from declared UDF return types, reducing ambiguous behavior.
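As an illustration of the two execution paths, the per-UDF useArrow flag available in recent PySpark releases lets the same function run pickle-based and Arrow-optimized side by side, which is one way to observe the behavior described above; this is a sketch, not code from the PR:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf

spark = SparkSession.builder.getOrCreate()

# The same Python function registered twice: once pickle-based, once Arrow-optimized.
plain_udf = udf(lambda x: x * 2, "long", useArrow=False)
arrow_udf = udf(lambda x: x * 2, "long", useArrow=True)

spark.range(3).select(
    col("id"),
    plain_udf("id").alias("pickled"),
    arrow_udf("id").alias("arrow"),
).show()
```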
Enabling Arrow by default brings performance and correctness improvements to the majority of PySpark users with minimal configuration. Users who depend on the previous (non-Arrow) implementation can opt out by explicitly setting spark.sql.execution.pythonUDF.arrow.enabled, spark.sql.execution.pythonUDTF.arrow.enabled, and spark.sql.execution.arrow.pyspark.enabled to false.
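A minimal opt-out sketch for users who want to keep the previous behavior; the same confs can equally be supplied via --conf at submit time:

```python
from pyspark.sql import SparkSession

# Restore the pre-4.2 (non-Arrow) defaults explicitly.
spark = (
    SparkSession.builder
    .config("spark.sql.execution.pythonUDF.arrow.enabled", "false")
    .config("spark.sql.execution.pythonUDTF.arrow.enabled", "false")
    .config("spark.sql.execution.arrow.pyspark.enabled", "false")
    .getOrCreate()
)
```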
Does this PR introduce any user-facing change?
Yes. It changes the default values of spark.sql.execution.pythonUDF.arrow.enabled, spark.sql.execution.pythonUDTF.arrow.enabled, and spark.sql.execution.arrow.pyspark.enabled to true and updates the user-facing docs.
How was this patch tested?
Existing PySpark test suites are run with the Arrow confs both enabled and disabled.
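An illustrative sketch (not taken from the PR) of the testing pattern described here, exercising the same assertion with the Arrow conf enabled and disabled:

```python
import unittest
from pyspark.sql import SparkSession

class ArrowConfToggleExample(unittest.TestCase):
    def test_to_pandas_under_both_settings(self):
        spark = SparkSession.builder.getOrCreate()
        for enabled in ("true", "false"):
            # Toggle the Arrow data-exchange path and check the result is unchanged.
            spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", enabled)
            pdf = spark.range(3).toPandas()
            self.assertEqual(len(pdf), 3)
```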
Was this patch authored or co-authored using generative AI tooling?
No