Conversation

@BryanCutler
Member

What changes were proposed in this pull request?

This fixes createDataFrame from Pandas so that modified timestamp series are only assigned back to a copied version of the Pandas DataFrame. Previously, if the Pandas DataFrame was only a reference (e.g. a slice of another), each series would still be assigned back to the reference even if it was not a modified timestamp column. This caused the following warning: "SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame."

How was this patch tested?

existing tests

@BryanCutler
Member Author

Repro to get the warning (this is the non-Arrow code path):

import numpy as np
import pandas as pd

# `spark` is an active SparkSession, e.g. from the pyspark shell.
pdf = pd.DataFrame(np.random.rand(100, 2))
df = spark.createDataFrame(pdf[:10])  # pdf[:10] is a slice of pdf, not a copy

'''
/home/bryan/git/spark/python/pyspark/sql/session.py:476: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  pdf[column] = s
'''
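
For what it's worth, the warning is pure pandas behavior and can be reproduced without Spark at all. A minimal sketch (pandas of this era; newer pandas with copy-on-write enabled behaves differently):

import numpy as np
import pandas as pd

pdf = pd.DataFrame(np.random.rand(100, 2))
sliced = pdf[:10]          # a slice may be a view on pdf, not an independent copy
sliced[0] = sliced[0] * 2  # writing a column back triggers SettingWithCopyWarning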

@BryanCutler
Member Author

BryanCutler commented Jan 10, 2018

ping @ueshin @HyukjinKwon, I'm not too sure if this could cause any real problems, but the warning is a little unsettling and can be avoided. This change only assigns a series back to the pdf if it was a modified timestamp column; the rest are just carried over by the copy, which is made only if needed.
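
To illustrate, here is a minimal runnable sketch of the copy-on-first-write pattern described above. It is simplified and the names are illustrative: localize_if_timestamp stands in for the actual timestamp-conversion helper and is not the real pyspark function.

import numpy as np
import pandas as pd

def localize_if_timestamp(series):
    # Stand-in for the real conversion: return a new series only when the
    # column is a timestamp, otherwise hand back the original object.
    if pd.api.types.is_datetime64_dtype(series):
        return series.dt.tz_localize("UTC")
    return series

def convert_copy_on_write(pdf):
    copied = False
    for column in pdf.columns:
        s = localize_if_timestamp(pdf[column])
        if s is not pdf[column]:
            if not copied:
                # Copy once before the first write, so a pdf that is only a
                # slice of another DataFrame is never mutated in place.
                pdf = pdf.copy()
                copied = True
            pdf[column] = s
    return pdf

pdf = pd.DataFrame({"t": pd.date_range("2018-01-10", periods=5), "x": np.arange(5)})
out = convert_copy_on_write(pdf[:3])  # no SettingWithCopyWarning on the slice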

@SparkQA

SparkQA commented Jan 10, 2018

Test build #85890 has finished for PR 20213 at commit bdeead6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin
Member

ueshin commented Jan 10, 2018

LGTM.

@HyukjinKwon
Member

HyukjinKwon left a comment

LGTM too. One question.

if s is not pdf[field.name]:
    if not copied:
        pdf = pdf.copy()
        copied = True
    pdf[field.name] = s
Member

BTW, what's the difference between:

if s is not pdf[field.name]:
    if not copied:

vs

if not copied and s is not pdf[field.name]:

?

Member

Looks like it was separated so that pdf[field.name] = s is executed only if s is not pdf[field.name], while the copy happens at most once.
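
A tiny runnable sketch of the difference (illustrative, not the actual session.py code), pretending every column was modified:

import pandas as pd

pdf = pd.DataFrame({"a": [1], "b": [2]})
modified = {"a": pd.Series([10]), "b": pd.Series([20])}  # pretend both changed

copied = False
for column in pdf.columns:
    s = modified[column]
    if s is not pdf[column]:   # this alone guards the assignment
        if not copied:         # both conditions guard the one-time copy
            pdf = pdf.copy()
            copied = True
        pdf[column] = s

print(pdf)  # a=10, b=20: both columns written, exactly one copy made
# With `if not copied and s is not pdf[column]:` around the whole block,
# copied would flip to True after "a" and "b" would never be written back.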

Member

Ah, sure. Makes sense. I read it too quickly.

@ueshin
Member

ueshin commented Jan 10, 2018

Thanks! merging to master/2.3.

asfgit pushed a commit that referenced this pull request Jan 10, 2018
…s assignment

## What changes were proposed in this pull request?

This fixes createDataFrame from Pandas so that modified timestamp series are only assigned back to a copied version of the Pandas DataFrame. Previously, if the Pandas DataFrame was only a reference (e.g. a slice of another), each series would still be assigned back to the reference even if it was not a modified timestamp column. This caused the following warning: "SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame."

## How was this patch tested?

existing tests

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #20213 from BryanCutler/pyspark-createDataFrame-copy-slice-warn-SPARK-23018.

(cherry picked from commit 7bcc266)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
@asfgit closed this in 7bcc266 Jan 10, 2018
@BryanCutler
Member Author

Thanks @ueshin and @HyukjinKwon !

@BryanCutler deleted the pyspark-createDataFrame-copy-slice-warn-SPARK-23018 branch November 19, 2018 05:47