[SPARK-54555][PYTHON] Enable Arrow-optimized Python UDFs and Arrow-based PySpark IPC by default #53264
Conversation
Sorry, but -1 for Apache Spark 4.1.0, @asl3.
We are already in RC2 status. It's really too late.
Review comment on sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (outdated, resolved)
Yeah let's do this in
If you re-target this to 4.2.0, we can merge this and backport this to your company, @asl3.
Agreed, retargeting to 4.2.0 / master @HyukjinKwon @dongjoon-hyun
dongjoon-hyun left a comment
Thank you, @asl3 and @HyukjinKwon.
allisonwang-db left a comment
shujingyang-db left a comment
It would be great if we could document the exact type coercion differences introduced by this change.
dbtsai left a comment
LGTM.
+1 on the doc. We can create a separate PR for it as a follow-up.
Regarding the failure in pyspark.sql.tests.connect.test_connect_creation.SparkConnectCreationTests.test_with_none_and_nan, you can just skip it for now; I will take a look.
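A minimal sketch of the suggested temporary skip, assuming a standard unittest-style suite; the decorator placement and message are illustrative, not code from the PR:

```python
import unittest

class SparkConnectCreationTests(unittest.TestCase):
    # Hypothetical placement: skip the failing test until the unrelated CI issue is resolved.
    @unittest.skip("Temporarily skipped; failure is unrelated to SPARK-54555 (see discussion).")
    def test_with_none_and_nan(self):
        ...
```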
Merged into master as the error is unrelated to this PR. Thanks.
#53296 to restore the CI.
Besides https://github.com/apache/spark/actions/runs/19878465653/job/56971154150, @HyukjinKwon @dbtsai @asl3 shall we revert it for now?
@zhengruifeng Thanks! There was a doc fix I had not pushed - I can push a follow-up to add the whitespace: #53298 |
What changes were proposed in this pull request?
Enable PySpark Arrow-based optimizations by default in Spark 4.2.0, updating default conf values:
- spark.sql.execution.pythonUDF.arrow.enabled and spark.sql.execution.pythonUDTF.arrow.enabled to true by default, enabling Arrow-optimized execution for regular Python UDFs and UDTFs.
- spark.sql.execution.arrow.pyspark.enabled to true by default, enabling Arrow-based columnar data exchange for PySpark APIs such as DataFrame.toPandas and SparkSession.createDataFrame when the input is a pandas DataFrame or NumPy array.
- Update user-facing docs and migration guides to reflect the change (a short sketch of the affected paths is shown below).
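A minimal sketch of what the new defaults affect, assuming a local SparkSession with pandas and pyarrow installed; the data is illustrative:

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# With this change these confs default to "true"; reading them here is only illustrative.
for conf in (
    "spark.sql.execution.pythonUDF.arrow.enabled",
    "spark.sql.execution.pythonUDTF.arrow.enabled",
    "spark.sql.execution.arrow.pyspark.enabled",
):
    print(conf, "=", spark.conf.get(conf))

# Arrow-based columnar exchange covers pandas <-> Spark conversions such as these.
pdf = pd.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})
sdf = spark.createDataFrame(pdf)  # pandas -> Spark goes through Arrow when enabled
roundtrip = sdf.toPandas()        # Spark -> pandas goes through Arrow when enabled
print(roundtrip)
```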
Why are the changes needed?
Arrow’s columnar IPC significantly improves JVM↔Python throughput and reduces serialization/deserialization overhead, speeding up Python UDFs and DataFrame conversions. Additionally, Arrow provides consistent, well-defined rules for type coercion when Python return values differ from declared UDF return types, reducing ambiguous behavior.
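As an illustration of the two execution paths, the per-UDF useArrow flag available in recent PySpark releases lets the same function run pickle-based and Arrow-optimized side by side, which is one way to observe the behavior described above; this is a sketch, not code from the PR:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf

spark = SparkSession.builder.getOrCreate()

# The same Python function registered twice: once pickle-based, once Arrow-optimized.
plain_udf = udf(lambda x: x * 2, "long", useArrow=False)
arrow_udf = udf(lambda x: x * 2, "long", useArrow=True)

spark.range(3).select(
    col("id"),
    plain_udf("id").alias("pickled"),
    arrow_udf("id").alias("arrow"),
).show()
```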
Enabling Arrow by default brings performance and correctness improvements to the majority of PySpark users with minimal configuration. Users who depend on the previous (non-Arrow) implementation can opt out by explicitly setting spark.sql.execution.pythonUDF.arrow.enabled, spark.sql.execution.pythonUDTF.arrow.enabled, and spark.sql.execution.arrow.pyspark.enabled to false.
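A minimal opt-out sketch for users who want to keep the previous behavior; the same confs can equally be supplied via --conf at submit time:

```python
from pyspark.sql import SparkSession

# Restore the pre-4.2 (non-Arrow) defaults explicitly.
spark = (
    SparkSession.builder
    .config("spark.sql.execution.pythonUDF.arrow.enabled", "false")
    .config("spark.sql.execution.pythonUDTF.arrow.enabled", "false")
    .config("spark.sql.execution.arrow.pyspark.enabled", "false")
    .getOrCreate()
)
```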
Does this PR introduce any user-facing change?
Yes. It changes the default values of spark.sql.execution.pythonUDF.arrow.enabled, spark.sql.execution.pythonUDTF.arrow.enabled, and spark.sql.execution.arrow.pyspark.enabled to true and updates the user-facing docs.
How was this patch tested?
Existing PySpark test suites are run with the Arrow confs both enabled and disabled.
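An illustrative sketch (not taken from the PR) of the testing pattern described here, exercising the same assertion with the Arrow conf enabled and disabled:

```python
import unittest
from pyspark.sql import SparkSession

class ArrowConfToggleExample(unittest.TestCase):
    def test_to_pandas_under_both_settings(self):
        spark = SparkSession.builder.getOrCreate()
        for enabled in ("true", "false"):
            # Toggle the Arrow data-exchange path and check the result is unchanged.
            spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", enabled)
            pdf = spark.range(3).toPandas()
            self.assertEqual(len(pdf), 3)
```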
Was this patch authored or co-authored using generative AI tooling?
No