[SPARK-54317][PYTHON][CONNECT] Unify Arrow conversion logic for Classic and Connect toPandas #53045
Conversation
Extract shared conversion logic into a `_convert_arrow_table_to_pandas` helper function in conversion.py to avoid code duplication between Classic and Connect.

Key changes:
- Add `_convert_arrow_table_to_pandas` helper function in conversion.py
- Update Classic toPandas to handle empty tables explicitly (SPARK-51112)
- Only apply self_destruct options when the table has rows
- Connect imports the shared helper from conversion.py

This unifies the optimizations from SPARK-53967 and SPARK-54183:
- Avoid an intermediate pandas DataFrame during conversion
- Convert Arrow columns directly to Series with type converters
- Better memory efficiency with self_destruct on non-empty tables

Co-authored-by: cursor
```python
        for arrow_col, field in zip(table.columns, schema.fields)
    ],
    axis="columns",
)
```
Could we not also unify the `pdf.columns = schema.names` logic (restoring original column names, including duplicates) and the `else: pdf = table.to_pandas(**pandas_options)` empty-columns fallback?
Yes, after reading more I moved more of the common logic into the helper function to unify them. Thanks!
merged to master
What changes were proposed in this pull request?
This PR merges the Arrow conversion code paths between Spark Connect and Classic Spark by extracting shared logic into a reusable helper function, `_convert_arrow_table_to_pandas`.

Why are the changes needed?
This unifies optimizations from two separate PRs, SPARK-53967 and SPARK-54183 (avoiding an intermediate pandas DataFrame during conversion, and using self_destruct on non-empty tables for better memory efficiency).
Does this PR introduce any user-facing change?
No. This is a pure refactoring with no API or behavior changes.
How was this patch tested?
Ran existing Arrow test suite:
python/pyspark/sql/tests/arrow/test_arrow.py

Was this patch authored or co-authored using generative AI tooling?
Co-Generated-by Cursor with Claude 4.5 Sonnet