[SPARK-30640][PYTHON][SQL] Prevent unnecessary copies of data during Arrow to Pandas conversion#27358
Conversation
…that are timestamp types
|
cc @HyukjinKwon @viirya please take a look, thanks! |
| return s | ||
|
|
||
|
|
||
| def _check_dataframe_localize_timestamps(pdf, timezone): |
There was a problem hiding this comment.
Better to just remove this, it was only used in the one place
| require_minimum_pandas_version() | ||
|
|
||
| for column, series in pdf.iteritems(): | ||
| pdf[column] = _check_series_localize_timestamps(series, timezone) |
There was a problem hiding this comment.
The problem is pyarrow stores the DataFrame data in blocks internally, and assigning series back to the DataFrame would cause the blocks to be reallocated.
| # datetime64[ns] type handling. | ||
| s = arrow_column.to_pandas(date_as_object=True) | ||
|
|
||
| s = _check_series_localize_timestamps(s, self._timezone) |
There was a problem hiding this comment.
I don't know if this was causing the same issue, but it's easy enough to just check the column type and only convert if necessary.
|
Test build #117382 has finished for PR 27358 at commit
|
|
This is a pretty minor change, so I'm gonna go ahead and merge |
| return _check_dataframe_localize_timestamps(pdf, timezone) | ||
| for field in self.schema: | ||
| if isinstance(field.dataType, TimestampType): | ||
| pdf[field.name] = \ |
There was a problem hiding this comment.
Is it different? Doesn't this also assign the series back to the DataFrame?
There was a problem hiding this comment.
Yeah, for the case of timestamps making a copy is unavailable. This is just to prevent non-timestamp columns that were also causing a copy when assigned back to the DataFrame
HyukjinKwon
left a comment
There was a problem hiding this comment.
LGTM. sorry for late response.
|
Thanks @HyukjinKwon ! |
What changes were proposed in this pull request?
Prevent unnecessary copies of data during conversion from Arrow to Pandas.
Why are the changes needed?
During conversion of pyarrow data to Pandas, columns are checked for timestamp types and then modified to correct for local timezone. If the data contains no timestamp types, then unnecessary copies of the data can be made. This is most prevalent when checking columns of a pandas DataFrame where each series is assigned back to the DataFrame, regardless if it had timestamps. See https://www.mail-archive.com/dev@arrow.apache.org/msg17008.html and ARROW-7596 for discussion.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Existing tests