[SPARK-54182][SQL][PYTHON] Optimize non-arrow conversion of df.toPandas
#52897
Conversation
```diff
@@ -208,13 +210,15 @@ def toPandas(self) -> "PandasDataFrameLike":
                     ),
                     error_on_duplicated_field_names=False,
                     timestamp_utc_localized=False,
-                )(pser)
-                for (_, pser), field in zip(pdf.items(), self.schema.fields)
+                )(pd.Series(col_data, dtype=object))
```
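For context, the call shape in the diff above is a converter factory applied immediately to a freshly built Series. A minimal sketch, with `make_converter` as a hypothetical stand-in for `_create_converter_to_pandas`:

```python
import pandas as pd

# make_converter is a hypothetical stand-in for _create_converter_to_pandas:
# it returns a function that coerces an object-dtype Series to a target type.
def make_converter(target_dtype):
    def convert(pser: pd.Series) -> pd.Series:
        return pser.astype(target_dtype)
    return convert

col_data = (1, 2, 3)
# Mirrors the added line: build the Series with dtype=object, then convert.
converted = make_converter("int64")(pd.Series(col_data, dtype=object))
```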
why is `dtype=object` necessary?
This builds a Series to pass to `_create_converter_to_pandas`, which converts the Series to the declared type in `field.dataType`, so the specific dtype here doesn't matter for correctness. But if we do not supply `object`, pandas will infer the dtype while creating the Series, which is quite slow.
I also tried passing `field.dataType` explicitly, but it would need conversion to a pandas data type, which is exactly what `_create_converter_to_pandas` is for. So I suggest we keep using `object` to disable type inference.
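To illustrate the inference cost being avoided (a sketch, not the PR's code; the data size is arbitrary):

```python
import pandas as pd

col_data = list(range(1_000_000))

# With dtype=object, pandas wraps the existing Python objects as-is and
# skips per-element dtype inference; the declared Spark type is applied
# later by the converter.
no_inference = pd.Series(col_data, dtype=object)

# Without a dtype, pandas scans the values to infer one (int64 here),
# which is the slow path mentioned above.
inferred = pd.Series(col_data)
```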
```diff
-                )(pser)
-                for (_, pser), field in zip(pdf.items(), self.schema.fields)
+                )(pd.Series(col_data, dtype=object))
+                for col_data, field in zip(columns_data, self.schema.fields)
```
can we avoid creating `columns_data: list[list]`?
Good point, although since it is a list of lists of references, the memory difference shouldn't be large. I changed it to an iterator.
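A minimal sketch of the change (the names `rows` and `columns_data` mirror the discussion, not the exact Spark source):

```python
rows = [(1, "a"), (2, "b"), (3, "c")]

# zip(*rows) yields one tuple per column lazily, so no list[list] of
# columns is materialized up front; each tuple still holds references
# to the same underlying row objects.
columns_data = zip(*rows)

for col_data in columns_data:
    print(col_data)  # (1, 2, 3), then ("a", "b", "c")
```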
Thank you, @Yicong-Huang and @zhengruifeng. I updated the JIRA information to target 4.2.0, according to the PR description.
merged to master
What changes were proposed in this pull request?
Following up on #52680, this PR optimizes the non-Arrow path of `toPandas()` to eliminate intermediate DataFrame creation.

Key optimizations:
- Avoid the intermediate DataFrame copy: `pd.DataFrame.from_records(rows)` → direct column extraction via `zip(*rows)`
- Optimize column-by-column conversion (especially for wide tables): create each column Series with `dtype=object`; `pd.concat(axis="columns")` + column rename → `pd.concat(axis=1, keys=columns)`

Why are the changes needed?
Problem: the current flow creates a DataFrame twice:

rows → `pd.DataFrame.from_records()` → temporary DataFrame → `pd.concat()` → final DataFrame

The intermediate DataFrame is immediately discarded, wasting memory. This is especially inefficient for wide tables, where per-column overhead is significant.
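A sketch contrasting the two flows, using an identity `converter` as a placeholder for the per-field converters built by `_create_converter_to_pandas`:

```python
import pandas as pd

rows = [(1, "a"), (2, "b")]
columns = ["id", "value"]
converter = lambda pser: pser  # identity placeholder

# Old flow: rows go through a temporary DataFrame that is split back
# into columns and immediately discarded.
tmp = pd.DataFrame.from_records(rows, columns=columns)
old = pd.concat([converter(pser) for _, pser in tmp.items()], axis="columns")
old.columns = columns

# New flow: extract columns directly from the rows and name them via
# keys=, skipping the temporary DataFrame and the rename.
new = pd.concat(
    [converter(pd.Series(col_data, dtype=object)) for col_data in zip(*rows)],
    axis=1,
    keys=columns,
)
```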
Does this PR introduce any user-facing change?
No. This is a pure performance optimization with no API or behavior changes.
How was this patch tested?
Benchmark setup:
- `spark.sql.execution.arrow.pyspark.enabled=false` (testing the non-Arrow path)
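An illustrative way to reproduce this setup (the table shape and timing harness are assumptions, not the PR's exact benchmark script):

```python
import time
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.execution.arrow.pyspark.enabled", "false")
    .getOrCreate()
)

# A wide table: 100K rows x 100 columns, matching the column-width benchmark.
df = spark.range(100_000).selectExpr(*[f"id AS c{i}" for i in range(100)])

start = time.perf_counter()
pdf = df.toPandas()  # exercises the non-Arrow conversion path
print(f"toPandas took {time.perf_counter() - start:.2f}s")
```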
Performance Results

General Benchmark (10 iterations):
Column Width Benchmark (100K rows, 10 iterations):
Was this patch authored or co-authored using generative AI tooling?
Yes. Co-Generated-by Cursor