[SPARK-30640][PYTHON][SQL] Prevent unnecessary copies of data during Arrow to Pandas conversion by BryanCutler · Pull Request #27358 · apache/spark

BryanCutler · 2020-01-24T23:16:56Z

What changes were proposed in this pull request?

Prevent unnecessary copies of data during conversion from Arrow to Pandas.

Why are the changes needed?

During conversion of pyarrow data to Pandas, columns are checked for timestamp types and then modified to correct for the local timezone. If the data contains no timestamp types, unnecessary copies of the data can still be made. This is most prevalent when checking the columns of a pandas DataFrame, where each series is assigned back to the DataFrame regardless of whether it contains timestamps. See https://www.mail-archive.com/dev@arrow.apache.org/msg17008.html and ARROW-7596 for discussion.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing tests

BryanCutler · 2020-01-24T23:19:23Z

cc @HyukjinKwon @viirya please take a look, thanks!

BryanCutler · 2020-01-24T23:20:27Z

python/pyspark/sql/pandas/types.py

        return s


-def _check_dataframe_localize_timestamps(pdf, timezone):


Better to just remove this; it was only used in one place.

BryanCutler · 2020-01-24T23:22:22Z

python/pyspark/sql/pandas/types.py

-    require_minimum_pandas_version()
-
-    for column, series in pdf.iteritems():
-        pdf[column] = _check_series_localize_timestamps(series, timezone)


The problem is that pandas stores the DataFrame data in blocks internally, and assigning each series back to the DataFrame causes the blocks to be reallocated.
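A minimal sketch of the difference, using pandas only. The names `localize_series`, `localize_all_columns`, and `localize_timestamp_columns` are hypothetical stand-ins for illustration, not the PR's code, and the sketch assumes unique column names and that only timezone-aware timestamp columns need conversion.

```python
import pandas as pd


def localize_series(s, timezone):
    """Hypothetical, simplified stand-in for _check_series_localize_timestamps.

    Converts a tz-aware timestamp series to tz-naive local time in `timezone`;
    any other series is returned unchanged.
    """
    if isinstance(s.dtype, pd.DatetimeTZDtype):
        return s.dt.tz_convert(timezone).dt.tz_localize(None)
    return s


def localize_all_columns(pdf, timezone):
    # Removed pattern: every column is assigned back to the DataFrame,
    # even when the series was returned unchanged.
    for column in pdf.columns:
        pdf[column] = localize_series(pdf[column], timezone)
    return pdf


def localize_timestamp_columns(pdf, timezone):
    # Replacement pattern: only tz-aware timestamp columns are reassigned,
    # so non-timestamp columns are never written back.
    for column in pdf.columns:
        if isinstance(pdf[column].dtype, pd.DatetimeTZDtype):
            pdf[column] = localize_series(pdf[column], timezone)
    return pdf
```

Both functions should produce the same values; the second simply skips the write-back for columns that need no conversion.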

BryanCutler · 2020-01-24T23:23:06Z

python/pyspark/sql/pandas/serializers.py

        # datetime64[ns] type handling.
        s = arrow_column.to_pandas(date_as_object=True)

-        s = _check_series_localize_timestamps(s, self._timezone)


I don't know if this was causing the same issue, but it's easy enough to just check the column type and only convert if necessary.

SparkQA · 2020-01-24T23:50:46Z

Test build #117382 has finished for PR 27358 at commit 3a61dd1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

BryanCutler · 2020-01-26T23:19:51Z

This is a pretty minor change, so I'm gonna go ahead and merge

viirya · 2020-01-27T00:31:04Z

python/pyspark/sql/pandas/conversion.py

-                        return _check_dataframe_localize_timestamps(pdf, timezone)
+                        for field in self.schema:
+                            if isinstance(field.dataType, TimestampType):
+                                pdf[field.name] = \


Is it different? Doesn't this also assign the series back to the DataFrame?

BryanCutler · 2020-01-27

Yeah, for the case of timestamps making a copy is unavoidable. This is just to prevent non-timestamp columns from also being copied when assigned back to the DataFrame.

viirya · 2020-01-27

ok. looks good then. thanks!

BryanCutler · 2020-01-27

Thanks @viirya !

HyukjinKwon · 2020-01-28

LGTM. sorry for late response.

BryanCutler · 2020-01-28T18:27:02Z

Thanks @HyukjinKwon !

Commit 3a61dd1: Remove _check_dataframe_localize_timestamps and only convert columns that are timestamp types

BryanCutler changed the title ~~[SPARK-30640][PYTHON[SQL] Prevent unnecessary copies of data during Arrow to Pandas conversion~~ [SPARK-30640][PYTHON][SQL] Prevent unnecessary copies of data during Arrow to Pandas conversion Jan 24, 2020

dongjoon-hyun added PYSPARK SQL labels Jan 25, 2020

BryanCutler closed this in 43d9c7e Jan 26, 2020

viirya reviewed Jan 27, 2020

HyukjinKwon reviewed Jan 28, 2020

BryanCutler deleted the pyspark-pandas-timestamp-copy-fix-SPARK-30640 branch January 28, 2020 18:27

BryanCutler wants to merge 1 commit into apache:master from BryanCutler:pyspark-pandas-timestamp-copy-fix-SPARK-30640
