Conversation

@asl3 (Contributor) commented Nov 30, 2025

What changes were proposed in this pull request?

Enable PySpark Arrow-based optimizations by default in Spark 4.2.0, updating default conf values:

  • Set spark.sql.execution.pythonUDF.arrow.enabled and spark.sql.execution.pythonUDTF.arrow.enabled to true by default to enable Arrow-optimized execution for regular Python UDFs and UDTFs.
  • Set spark.sql.execution.arrow.pyspark.enabled to true by default to enable Arrow-based columnar data exchange for PySpark APIs such as DataFrame.toPandas and SparkSession.createDataFrame when the input is a pandas DataFrame or NumPy array.

Update user-facing docs and migration guides to reflect the change.
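
For illustration, a minimal sketch of the code paths affected by the new defaults (the example data and UDF below are illustrative, not part of this PR):

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

spark = SparkSession.builder.getOrCreate()

# spark.sql.execution.arrow.pyspark.enabled=true: pandas <-> Spark conversions
# use Arrow-based columnar transfer by default.
pdf = pd.DataFrame({"id": [1, 2, 3]})
df = spark.createDataFrame(pdf)   # pandas DataFrame -> Spark DataFrame via Arrow
roundtrip = df.toPandas()         # Spark DataFrame -> pandas DataFrame via Arrow

# spark.sql.execution.pythonUDF.arrow.enabled=true: a regular Python UDF is
# executed with Arrow-optimized (de)serialization, with no code changes needed.
@udf("long")
def plus_one(x):
    return x + 1

df.select(plus_one("id")).show()
```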

Why are the changes needed?

Arrow’s columnar IPC significantly improves JVM↔Python throughput and reduces serialization/deserialization overhead, speeding up Python UDFs and DataFrame conversions. Additionally, Arrow provides consistent, well-defined rules for type coercion when Python return values differ from declared UDF return types, reducing ambiguous behavior.

Enabling Arrow by default brings performance and correctness improvements to the majority of PySpark users with minimal configuration. Users who depend on the previous (non-Arrow) implementation can opt out by explicitly setting spark.sql.execution.pythonUDF.arrow.enabled, spark.sql.execution.pythonUDTF.arrow.enabled, and spark.sql.execution.arrow.pyspark.enabled to false.
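
For example, a session-level opt-out would look roughly like this (conf names are the ones listed above; assumes an existing SparkSession named spark):

```python
# Restore the previous non-Arrow behavior for this session.
spark.conf.set("spark.sql.execution.pythonUDF.arrow.enabled", "false")
spark.conf.set("spark.sql.execution.pythonUDTF.arrow.enabled", "false")
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
```

The same confs can also be set at submit time via --conf or in spark-defaults.conf.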

Does this PR introduce any user-facing change?

Yes. The default values of spark.sql.execution.pythonUDF.arrow.enabled, spark.sql.execution.pythonUDTF.arrow.enabled, and spark.sql.execution.arrow.pyspark.enabled change to true, and user-facing docs are updated.

How was this patch tested?

Existing PySpark test suites are run with the Arrow confs both enabled and disabled.

Was this patch authored or co-authored using generative AI tooling?

No

@dongjoon-hyun (Member) left a comment

Sorry, but -1 for Apache Spark 4.1.0, @asl3.

We are already in RC2 status. It's really too late.

@HyukjinKwon (Member)

Yeah let's do this in master

@dongjoon-hyun (Member)

If you re-target this to 4.2.0, we can merge this and backport this to your company, @asl3.

@asl3 asl3 changed the title [SPARK-54555][PYTHON] Enable Arrow-optimized Python UDFs by default [SPARK-54555][PYTHON] Enable Arrow-optimization for Python UDF and PySpark DataFrame execution by default Nov 30, 2025
@asl3 asl3 changed the title [SPARK-54555][PYTHON] Enable Arrow-optimization for Python UDF and PySpark DataFrame execution by default [SPARK-54555][PYTHON] Enable Arrow-optimized Python UDFs and Arrow-based PySpark IPC by default Nov 30, 2025
@asl3 (Contributor, Author) commented Nov 30, 2025

Agreed, retargeting to 4.2.0 / master @HyukjinKwon @dongjoon-hyun

@asl3 asl3 requested a review from dongjoon-hyun November 30, 2025 22:24
@dongjoon-hyun (Member) left a comment

Thank you, @asl3 and @HyukjinKwon.

@allisonwang-db (Contributor) left a comment

@shujingyang-db (Contributor) left a comment

It would be great if we can document the exact type coercion difference introduced by this change.

@dbtsai dbtsai self-requested a review December 1, 2025 23:02
@dbtsai (Member) left a comment

LGTM.

@dbtsai (Member) commented Dec 1, 2025

It would be great if we can document the exact type coercion difference introduced by this change.

+1 on the doc. We can create a separate PR for it as a follow-up.

@zhengruifeng (Contributor)

Regarding the failure in pyspark.sql.tests.connect.test_connect_creation.SparkConnectCreationTests.test_with_none_and_nan: you can just skip it for now, and I will take a look.

@dbtsai dbtsai closed this in ea0a35e Dec 2, 2025
@dbtsai (Member) commented Dec 2, 2025

Merged into master as the error is unrelated to this PR. Thanks.

@zhengruifeng (Contributor)

@dbtsai @asl3 The failure is related; I mean we can use unittest.skip to skip it for now.
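
A minimal sketch of that temporary skip (test name taken from the earlier comment; the skip reason string is illustrative):

```python
import unittest

class SparkConnectCreationTests(unittest.TestCase):
    @unittest.skip("Temporarily skipped until the Arrow-by-default behavior change is handled (SPARK-54555)")
    def test_with_none_and_nan(self):
        ...
```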

@zhengruifeng (Contributor)

#53296 to restore the CI

@zhengruifeng (Contributor) commented Dec 3, 2025

Besides test_with_none_and_nan, the doc build also fails because of this PR:

https://github.com/apache/spark/actions/runs/19878465653/job/56971154150

@HyukjinKwon @dbtsai @asl3 shall we revert it for now?

@asl3 (Contributor, Author) commented Dec 3, 2025

@zhengruifeng Thanks! There was a doc fix I had not pushed - I can push a follow-up to add the whitespace: #53298
