
Conversation

@Yicong-Huang (Contributor) commented Nov 5, 2025

What changes were proposed in this pull request?

Following up on #52680, this PR optimizes the non-Arrow path of `toPandas()` to eliminate intermediate DataFrame creation.

Key optimizations (a before/after sketch follows this list):

1. Avoid the intermediate DataFrame copy
   - `pd.DataFrame.from_records(rows)` → direct column extraction via `zip(*rows)`
   - 2 DataFrame creations → 1 DataFrame creation
2. Optimize column-by-column conversion (especially for wide tables)
   - Tuples → lists for faster Series construction
   - Implicit dtype inference → explicit `dtype=object`
   - `pd.concat(axis="columns")` + column rename → `pd.concat(axis=1, keys=columns)`
   - Result: 43-67% speedup for 50-100 columns
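
A minimal before/after sketch of the two shapes (illustrative data only; the actual code also runs each Series through a per-field converter):

```python
import pandas as pd

rows = [(1, "a"), (2, "b"), (3, "c")]  # fetched rows, simplified to tuples
columns = ["id", "name"]

# OLD shape: materialize a temporary DataFrame from records, then build the
# final DataFrame from its columns -- two DataFrame creations.
tmp = pd.DataFrame.from_records(rows, columns=columns)
old = pd.concat([tmp[name] for name in columns], axis="columns")

# NEW shape: transpose rows into columns with zip(*rows), build each Series
# from a list with dtype=object (skipping dtype inference), and concat once.
new = pd.concat(
    [pd.Series(list(col), dtype=object) for col in zip(*rows)],
    axis=1,
    keys=columns,
)
```

With `keys=columns`, `pd.concat` assigns the column names in the same call, so the separate rename step of the old path disappears.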

Why are the changes needed?

Problem: the current flow creates a DataFrame twice:

- `rows` → `pd.DataFrame.from_records()` → temporary DataFrame → `pd.concat()` → final DataFrame

The intermediate DataFrame is immediately discarded, wasting memory. This is especially inefficient for wide tables, where the per-column overhead is significant.

Does this PR introduce any user-facing change?

No. This is a pure performance optimization with no API or behavior changes.

How was this patch tested?

- Existing unit tests.
- Benchmark (setup and results below).

Benchmark setup (a sketch of the harness follows this list):

- Hardware: driver memory 4GB, executor memory 4GB
- Configuration: `spark.sql.execution.arrow.pyspark.enabled=false` (testing the non-Arrow path)
- Iterations: 10 per test case for statistical reliability
- Test cases:
  - Simple (numeric columns)
  - Mixed (int, string, double, boolean)
  - Timestamp (date and timestamp types)
  - Nested (struct and array types)
  - Wide (5, 10, 50, 100 columns)
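
A rough sketch of the harness these settings imply (hypothetical script; the actual benchmark code is not part of this PR):

```python
import time

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Force the non-Arrow toPandas() path under test.
    .config("spark.sql.execution.arrow.pyspark.enabled", "false")
    .getOrCreate()
)

# Example "wide" case: 100K rows x 50 numeric columns.
df = spark.range(100_000).selectExpr(*[f"id + {i} AS c{i}" for i in range(50)])

timings = []
for _ in range(10):  # 10 iterations per test case
    start = time.perf_counter()
    df.toPandas()
    timings.append(time.perf_counter() - start)

print(f"median over {len(timings)} runs: {sorted(timings)[len(timings) // 2]:.3f}s")
```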

Performance Results

General Benchmark (10 iterations):

| Test case | Rows | OLD → NEW | Speedup |
|-----------|------|-----------|---------|
| simple | 1M | 1.376s → 1.383s | ≈ tied |
| mixed | 1M | 2.396s → 2.553s | 6% slower |
| timestamp | 500K | 4.323s → 4.392s | ≈ tied |
| nested | 100K | 0.558s → 0.580s | 4% slower |
| wide (50) | 100K | 1.458s → 1.141s | 28% faster 🚀 |

Column Width Benchmark (100K rows, 10 iterations):

| Columns | OLD → NEW | Speedup |
|---------|-----------|---------|
| 5 | 0.188s → 0.179s | 5% faster |
| 10 | 0.262s → 0.270s | ≈ tied |
| 50 | 1.430s → 0.998s | 43% faster 🚀 |
| 100 | 3.320s → 1.988s | 67% faster 🚀 |

Was this patch authored or co-authored using generative AI tooling?

Yes. Co-Generated-by Cursor

@HyukjinKwon HyukjinKwon changed the title [WIP][Spark-54182] avoid intermedia dataframe creation in non-arrow codepath of df.toPandas [WIP][SPARK-54182] Avoid intermedia dataframe creation in non-arrow codepath of df.toPandas Nov 10, 2025
@Yicong-Huang Yicong-Huang changed the title [WIP][SPARK-54182] Avoid intermedia dataframe creation in non-arrow codepath of df.toPandas [SPARK-54182] Avoid intermedia dataframe creation in non-arrow codepath of df.toPandas Nov 10, 2025
@Yicong-Huang Yicong-Huang changed the title [SPARK-54182] Avoid intermedia dataframe creation in non-arrow codepath of df.toPandas [SPARK-54182][SQL][PYTHON] Avoid intermedia dataframe creation in non-arrow codepath of df.toPandas Nov 10, 2025
@Yicong-Huang Yicong-Huang changed the title [SPARK-54182][SQL][PYTHON] Avoid intermedia dataframe creation in non-arrow codepath of df.toPandas [SPARK-54182][SQL][PYTHON] Optimize non-arrow conversion of df.toPandas Nov 10, 2025
```diff
@@ -208,13 +210,15 @@ def toPandas(self) -> "PandasDataFrameLike":
                     ),
                     error_on_duplicated_field_names=False,
                     timestamp_utc_localized=False,
-                )(pser)
-                for (_, pser), field in zip(pdf.items(), self.schema.fields)
+                )(pd.Series(col_data, dtype=object))
```
Contributor:

why is `dtype=object` necessary?

Contributor Author (@Yicong-Huang):

Here we are building a Series to pass to `_create_converter_to_pandas`, which converts the Series to the type declared in `field.dataType`, so the dtype supplied here does not affect the result. But if we do not supply `object`, pandas will infer a dtype while creating the Series, which is quite slow.

Contributor Author (@Yicong-Huang):

I also tried using `field.dataType` explicitly, but it would need conversion to a pandas dtype, which is exactly what `_create_converter_to_pandas` is for. So I suggest we keep using `object` to disable type inference.
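
A small illustration of the inference cost being avoided (illustrative only; in the real code the converter built by `_create_converter_to_pandas` casts to the proper type afterwards):

```python
import pandas as pd

data = list(range(1_000_000))

# Explicit dtype=object: pandas wraps the Python objects as-is, with no
# scan over the values to decide on a dtype.
s_object = pd.Series(data, dtype=object)

# No dtype: pandas inspects the values to infer one (int64 here), which
# is the slow path the comment above refers to.
s_inferred = pd.Series(data)

assert s_object.dtype == object
assert str(s_inferred.dtype) == "int64"
```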

```diff
-                )(pser)
-                for (_, pser), field in zip(pdf.items(), self.schema.fields)
+                )(pd.Series(col_data, dtype=object))
+                for col_data, field in zip(columns_data, self.schema.fields)
```
Contributor:

can we avoid creating `columns_data: list[list]`?

Contributor Author (@Yicong-Huang):

Good point, although I think it is a list of references to lists, so the memory difference won't be large. I changed it to an iterator.
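
Roughly the change being discussed, assuming the shape from the diff above (illustrative names and data):

```python
rows = [(1, "a"), (2, "b"), (3, "c")]

# Before: a materialized list[list] holding references to every column at once.
columns_data = [list(col) for col in zip(*rows)]

# After: a generator, so each column's list is created lazily and can be
# garbage-collected once its Series has been built and converted.
columns_data = (list(col) for col in zip(*rows))
```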

@Yicong-Huang force-pushed the SPARK-54182/refactor/avoid-intermedia-df-in-non-arrow-toPandas branch from b516019 to 6a7b7a3 on November 12, 2025 06:57
@dongjoon-hyun (Member):

Thank you, @Yicong-Huang and @zhengruifeng. I updated the JIRA information to target 4.2.0, according to the PR description:

> Following up on #52680, this PR optimizes the non-Arrow path of `toPandas()` to eliminate intermediate DataFrame creation.

@zhengruifeng (Contributor):

merged to master
