@Yicong-Huang
Contributor

What changes were proposed in this pull request?

This PR merges the Arrow conversion code paths between Spark Connect and Classic Spark by extracting shared logic into a reusable helper function _convert_arrow_table_to_pandas.

Why are the changes needed?

This unifies optimizations from two separate PRs:

  • [SPARK-53967] (Classic): Avoid intermediate pandas DataFrame creation by converting Arrow columns directly to Series
  • [SPARK-54183] (Connect): Same optimization implemented for Spark Connect

Does this PR introduce any user-facing change?

No. This is a pure refactoring with no API or behavior changes.

How was this patch tested?

Ran the existing Arrow test suite: python/pyspark/sql/tests/arrow/test_arrow.py

Was this patch authored or co-authored using generative AI tooling?

Co-Generated-by Cursor with Claude 4.5 Sonnet

Extract shared conversion logic into _convert_arrow_table_to_pandas helper
function in conversion.py to avoid code duplication between Classic and Connect.

Key changes:
- Add _convert_arrow_table_to_pandas helper function in conversion.py
- Update Classic toPandas to handle empty tables explicitly (SPARK-51112)
- Only apply self_destruct options when table has rows
- Connect imports the shared helper from conversion.py

This unifies the optimizations from SPARK-53967 and SPARK-54183:
- Avoid intermediate pandas DataFrame during conversion
- Convert Arrow columns directly to Series with type converters
- Better memory efficiency with self_destruct on non-empty tables

Co-authored-by: cursor
for arrow_col, field in zip(table.columns, schema.fields)
],
axis="columns",
)
Contributor

Could we not also unify the

        # Restore original column names (including duplicates)
        pdf.columns = schema.names
    else:
        # empty columns
        pdf = table.to_pandas(**pandas_options)

logic?

Contributor Author

Yes, after reading more, I moved more of the common logic into the helper function to unify them. Thanks!

@Yicong-Huang Yicong-Huang requested a review from holdenk November 17, 2025 04:15
@zhengruifeng
Contributor

merged to master
