Conversation

@BryanCutler
Member

What changes were proposed in this pull request?

This fixes createDataFrame from Pandas so that modified timestamp series are only assigned back to a copied version of the Pandas DataFrame. Previously, if the Pandas DataFrame was only a reference (e.g. a slice of another), each series would still be assigned back to the reference even if it was not a modified timestamp column. This caused the following warning: "SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame."

How was this patch tested?

existing tests

@BryanCutler
Member Author

Repro to get the warning (this is the non-Arrow code path):

import numpy as np
import pandas as pd

# `spark` is an active SparkSession, e.g. from the pyspark shell.
pdf = pd.DataFrame(np.random.rand(100, 2))
df = spark.createDataFrame(pdf[:10])  # pdf[:10] is a slice of pdf, not a copy

'''
/home/bryan/git/spark/python/pyspark/sql/session.py:476: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  pdf[column] = s
'''
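
For what it's worth, the warning is pure pandas behavior and can be reproduced without Spark at all. A minimal sketch (pandas of this era; newer pandas with copy-on-write enabled behaves differently):

import numpy as np
import pandas as pd

pdf = pd.DataFrame(np.random.rand(100, 2))
sliced = pdf[:10]          # a slice may be a view on pdf, not an independent copy
sliced[0] = sliced[0] * 2  # writing a column back triggers SettingWithCopyWarning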

@BryanCutler
Member Author

BryanCutler commented Jan 10, 2018

ping @ueshin @HyukjinKwon, I'm not too sure if this could cause any real problems, but the warning is a little unsettling and can be avoided. This change only assigns a series back to the pdf if it was a modified timestamp column; the rest are just carried over by the copy, which is made only if needed.
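
To illustrate, here is a minimal runnable sketch of the copy-on-first-write pattern described above. It is simplified and the names are illustrative: localize_if_timestamp stands in for the actual timestamp-conversion helper and is not the real pyspark function.

import numpy as np
import pandas as pd

def localize_if_timestamp(series):
    # Stand-in for the real conversion: return a new series only when the
    # column is a timestamp, otherwise hand back the original object.
    if pd.api.types.is_datetime64_dtype(series):
        return series.dt.tz_localize("UTC")
    return series

def convert_copy_on_write(pdf):
    copied = False
    for column in pdf.columns:
        s = localize_if_timestamp(pdf[column])
        if s is not pdf[column]:
            if not copied:
                # Copy once before the first write, so a pdf that is only a
                # slice of another DataFrame is never mutated in place.
                pdf = pdf.copy()
                copied = True
            pdf[column] = s
    return pdf

pdf = pd.DataFrame({"t": pd.date_range("2018-01-10", periods=5), "x": np.arange(5)})
out = convert_copy_on_write(pdf[:3])  # no SettingWithCopyWarning on the slice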

@SparkQA

SparkQA commented Jan 10, 2018

Test build #85890 has finished for PR 20213 at commit bdeead6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin
Member

ueshin commented Jan 10, 2018

LGTM.

@HyukjinKwon
Member

HyukjinKwon left a comment

LGTM too. One question.

if s is not pdf[field.name]:
    if not copied:
        pdf = pdf.copy()
        copied = True
    pdf[field.name] = s
Member

BTW, what's the difference between:

if s is not pdf[field.name]:
    if not copied:

vs

if not copied and s is not pdf[field.name]:

?

Member

Looks like it was separated so that pdf[field.name] = s is executed only if s is not pdf[field.name], while the copy happens at most once.
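
A tiny runnable sketch of the difference (illustrative, not the actual session.py code), pretending every column was modified:

import pandas as pd

pdf = pd.DataFrame({"a": [1], "b": [2]})
modified = {"a": pd.Series([10]), "b": pd.Series([20])}  # pretend both changed

copied = False
for column in pdf.columns:
    s = modified[column]
    if s is not pdf[column]:   # this alone guards the assignment
        if not copied:         # both conditions guard the one-time copy
            pdf = pdf.copy()
            copied = True
        pdf[column] = s

print(pdf)  # a=10, b=20: both columns written, exactly one copy made
# With `if not copied and s is not pdf[column]:` around the whole block,
# copied would flip to True after "a" and "b" would never be written back.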

Member

Ah, sure. Makes sense. I read it too quickly.

@ueshin
Member

ueshin commented Jan 10, 2018

Thanks! merging to master/2.3.

asfgit pushed a commit that referenced this pull request Jan 10, 2018
…s assignment

## What changes were proposed in this pull request?

This fixes createDataFrame from Pandas so that modified timestamp series are only assigned back to a copied version of the Pandas DataFrame. Previously, if the Pandas DataFrame was only a reference (e.g. a slice of another), each series would still be assigned back to the reference even if it was not a modified timestamp column. This caused the following warning: "SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame."

## How was this patch tested?

existing tests

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #20213 from BryanCutler/pyspark-createDataFrame-copy-slice-warn-SPARK-23018.

(cherry picked from commit 7bcc266)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
@asfgit closed this in 7bcc266 Jan 10, 2018
@BryanCutler
Member Author

Thanks @ueshin and @HyukjinKwon !

@BryanCutler deleted the pyspark-createDataFrame-copy-slice-warn-SPARK-23018 branch November 19, 2018 05:47