-
Notifications
You must be signed in to change notification settings - Fork 29.1k
[SPARK-23018][PYTHON] Fix createDataFrame from Pandas timestamp series assignment #20213
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-23018][PYTHON] Fix createDataFrame from Pandas timestamp series assignment #20213
Conversation
|
repro to get the warning (this is the non-Arrow code path) import numpy as np
import pandas as pd
pdf = pd.DataFrame(np.random.rand(100, 2))
df = spark.createDataFrame(pdf[:10])
'''
/home/bryan/git/spark/python/pyspark/sql/session.py:476: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
pdf[column] = s
''' |
|
ping @ueshin @HyukjinKwon , I'm not too sure if this could cause any real problems, but the warning is a little unsettling and can be avoided. This change will only assign series back to the pdf if was a modified timestamp column, and the rest are just copied if needed. |
|
Test build #85890 has finished for PR 20213 at commit
|
|
LGTM. |
HyukjinKwon
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM too. one question.
| copied = True | ||
| pdf[field.name] = s | ||
| if s is not pdf[field.name]: | ||
| if not copied: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, what's diff between:
if s is not pdf[field.name]:
if not copied:
vs
if not copied and s is not pdf[field.name]:
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like it was separated for assigning pdf[field.name] = s only if s is not pdf[field.name].
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, sure. Makes sense. I rushed to read.
|
Thanks! merging to master/2.3. |
…s assignment ## What changes were proposed in this pull request? This fixes createDataFrame from Pandas to only assign modified timestamp series back to a copied version of the Pandas DataFrame. Previously, if the Pandas DataFrame was only a reference (e.g. a slice of another) each series will still get assigned back to the reference even if it is not a modified timestamp column. This caused the following warning "SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame." ## How was this patch tested? existing tests Author: Bryan Cutler <cutlerb@gmail.com> Closes #20213 from BryanCutler/pyspark-createDataFrame-copy-slice-warn-SPARK-23018. (cherry picked from commit 7bcc266) Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
|
Thanks @ueshin and @HyukjinKwon ! |
What changes were proposed in this pull request?
This fixes createDataFrame from Pandas to only assign modified timestamp series back to a copied version of the Pandas DataFrame. Previously, if the Pandas DataFrame was only a reference (e.g. a slice of another) each series will still get assigned back to the reference even if it is not a modified timestamp column. This caused the following warning "SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame."
How was this patch tested?
existing tests