
Conversation

@Yicong-Huang (Contributor) commented Nov 5, 2025

What changes were proposed in this pull request?

Following up on #52680, this PR optimizes the non-Arrow path of `toPandas()` to eliminate intermediate DataFrame creation.

Key optimizations (a before/after sketch follows this list):

1. Avoid the intermediate DataFrame copy
   - `pd.DataFrame.from_records(rows)` → direct column extraction via `zip(*rows)`
   - 2 DataFrame creations → 1 DataFrame creation
2. Optimize column-by-column conversion (especially for wide tables)
   - Tuples → lists for faster Series construction
   - Implicit dtype inference → explicit `dtype=object`
   - `pd.concat(axis="columns")` + column rename → `pd.concat(axis=1, keys=columns)`
   - Result: 43-67% speedup for 50-100 columns
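
A minimal before/after sketch of the two shapes (illustrative data only; the actual code also runs each Series through a per-field converter):

```python
import pandas as pd

rows = [(1, "a"), (2, "b"), (3, "c")]  # fetched rows, simplified to tuples
columns = ["id", "name"]

# OLD shape: materialize a temporary DataFrame from records, then build the
# final DataFrame from its columns -- two DataFrame creations.
tmp = pd.DataFrame.from_records(rows, columns=columns)
old = pd.concat([tmp[name] for name in columns], axis="columns")

# NEW shape: transpose rows into columns with zip(*rows), build each Series
# from a list with dtype=object (skipping dtype inference), and concat once.
new = pd.concat(
    [pd.Series(list(col), dtype=object) for col in zip(*rows)],
    axis=1,
    keys=columns,
)
```

With `keys=columns`, `pd.concat` assigns the column names in the same call, so the separate rename step of the old path disappears.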

Why are the changes needed?

Problem: the current flow creates a DataFrame twice:

- `rows` → `pd.DataFrame.from_records()` → temporary DataFrame → `pd.concat()` → final DataFrame

The intermediate DataFrame is immediately discarded, wasting memory. This is especially inefficient for wide tables, where the per-column overhead is significant.

Does this PR introduce any user-facing change?

No. This is a pure performance optimization with no API or behavior changes.

How was this patch tested?

- Existing unit tests.
- Benchmark (setup and results below).

Benchmark setup (a sketch of the harness follows this list):

- Hardware: driver memory 4GB, executor memory 4GB
- Configuration: `spark.sql.execution.arrow.pyspark.enabled=false` (testing the non-Arrow path)
- Iterations: 10 per test case for statistical reliability
- Test cases:
  - Simple (numeric columns)
  - Mixed (int, string, double, boolean)
  - Timestamp (date and timestamp types)
  - Nested (struct and array types)
  - Wide (5, 10, 50, 100 columns)
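
A rough sketch of the harness these settings imply (hypothetical script; the actual benchmark code is not part of this PR):

```python
import time

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Force the non-Arrow toPandas() path under test.
    .config("spark.sql.execution.arrow.pyspark.enabled", "false")
    .getOrCreate()
)

# Example "wide" case: 100K rows x 50 numeric columns.
df = spark.range(100_000).selectExpr(*[f"id + {i} AS c{i}" for i in range(50)])

timings = []
for _ in range(10):  # 10 iterations per test case
    start = time.perf_counter()
    df.toPandas()
    timings.append(time.perf_counter() - start)

print(f"median over {len(timings)} runs: {sorted(timings)[len(timings) // 2]:.3f}s")
```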

Performance Results

General Benchmark (10 iterations):

| Test case | Rows | OLD → NEW | Speedup |
|-----------|------|-----------|---------|
| simple | 1M | 1.376s → 1.383s | ≈ tied |
| mixed | 1M | 2.396s → 2.553s | 6% slower |
| timestamp | 500K | 4.323s → 4.392s | ≈ tied |
| nested | 100K | 0.558s → 0.580s | 4% slower |
| wide (50) | 100K | 1.458s → 1.141s | 28% faster 🚀 |

Column Width Benchmark (100K rows, 10 iterations):

| Columns | OLD → NEW | Speedup |
|---------|-----------|---------|
| 5 | 0.188s → 0.179s | 5% faster |
| 10 | 0.262s → 0.270s | ≈ tied |
| 50 | 1.430s → 0.998s | 43% faster 🚀 |
| 100 | 3.320s → 1.988s | 67% faster 🚀 |

Was this patch authored or co-authored using generative AI tooling?

Yes. Co-Generated-by Cursor

@HyukjinKwon HyukjinKwon changed the title [WIP][Spark-54182] avoid intermedia dataframe creation in non-arrow codepath of df.toPandas [WIP][SPARK-54182] Avoid intermedia dataframe creation in non-arrow codepath of df.toPandas Nov 10, 2025
@Yicong-Huang Yicong-Huang changed the title [WIP][SPARK-54182] Avoid intermedia dataframe creation in non-arrow codepath of df.toPandas [SPARK-54182] Avoid intermedia dataframe creation in non-arrow codepath of df.toPandas Nov 10, 2025
@Yicong-Huang Yicong-Huang changed the title [SPARK-54182] Avoid intermedia dataframe creation in non-arrow codepath of df.toPandas [SPARK-54182][SQL][PYTHON] Avoid intermedia dataframe creation in non-arrow codepath of df.toPandas Nov 10, 2025
@Yicong-Huang Yicong-Huang changed the title [SPARK-54182][SQL][PYTHON] Avoid intermedia dataframe creation in non-arrow codepath of df.toPandas [SPARK-54182][SQL][PYTHON] Optimize non-arrow conversion of df.toPandas Nov 10, 2025
```diff
@@ -208,13 +210,15 @@ def toPandas(self) -> "PandasDataFrameLike":
                     ),
                     error_on_duplicated_field_names=False,
                     timestamp_utc_localized=False,
-                )(pser)
-                for (_, pser), field in zip(pdf.items(), self.schema.fields)
+                )(pd.Series(col_data, dtype=object))
```
Contributor:

why is `dtype=object` necessary?

Contributor Author (@Yicong-Huang):

Here we are building a Series to pass to `_create_converter_to_pandas`, which converts the Series to the type declared in `field.dataType`, so the dtype supplied here does not affect the result. But if we do not supply `object`, pandas will infer a dtype while creating the Series, which is quite slow.

Contributor Author (@Yicong-Huang):

I also tried using `field.dataType` explicitly, but it would need conversion to a pandas dtype, which is exactly what `_create_converter_to_pandas` is for. So I suggest we keep using `object` to disable type inference.
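
A small illustration of the inference cost being avoided (illustrative only; in the real code the converter built by `_create_converter_to_pandas` casts to the proper type afterwards):

```python
import pandas as pd

data = list(range(1_000_000))

# Explicit dtype=object: pandas wraps the Python objects as-is, with no
# scan over the values to decide on a dtype.
s_object = pd.Series(data, dtype=object)

# No dtype: pandas inspects the values to infer one (int64 here), which
# is the slow path the comment above refers to.
s_inferred = pd.Series(data)

assert s_object.dtype == object
assert str(s_inferred.dtype) == "int64"
```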

```diff
-                )(pser)
-                for (_, pser), field in zip(pdf.items(), self.schema.fields)
+                )(pd.Series(col_data, dtype=object))
+                for col_data, field in zip(columns_data, self.schema.fields)
```
Contributor:

can we avoid creating `columns_data: list[list]`?

Contributor Author (@Yicong-Huang):

Good point, although I think it is a list of references to lists, so the memory difference won't be large. I changed it to an iterator.
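
Roughly the change being discussed, assuming the shape from the diff above (illustrative names and data):

```python
rows = [(1, "a"), (2, "b"), (3, "c")]

# Before: a materialized list[list] holding references to every column at once.
columns_data = [list(col) for col in zip(*rows)]

# After: a generator, so each column's list is created lazily and can be
# garbage-collected once its Series has been built and converted.
columns_data = (list(col) for col in zip(*rows))
```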

@Yicong-Huang force-pushed the SPARK-54182/refactor/avoid-intermedia-df-in-non-arrow-toPandas branch from b516019 to 6a7b7a3 on November 12, 2025 06:57
@dongjoon-hyun (Member):

Thank you, @Yicong-Huang and @zhengruifeng. I updated the JIRA information to target 4.2.0, according to the PR description:

> Following up on #52680, this PR optimizes the non-Arrow path of `toPandas()` to eliminate intermediate DataFrame creation.

@zhengruifeng (Contributor):

merged to master
