[SPARK-29188][PYTHON] toPandas (without Arrow) gets wrong dtypes when applied on empty DF #26747
Conversation
Out of curiosity, what was the previous behavior? Was it just missing Long / Double mappings?
python/pyspark/sql/dataframe.py (outdated diff context)
    return np.float32
else:
    return None
mappings = {
We should list all the data types here. Initially, this mapping existed only to correct pandas's inferred type.
Now, in the case of empty data, pandas always infers it as object,
and we have to rely on this type mapping, unlike the intended case before.
See to_arrow_type
as an example of a complete type mapping. You might need to check which Spark -> Python -> pandas type conversion combinations are valid and whitelist them here.
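The point about completeness can be sketched as a Spark-type-name to numpy-dtype map. This is illustrative only (names and coverage are assumptions, not the exact helper in dataframe.py): types absent from the map return None, i.e. "keep whatever dtype pandas inferred", which is exactly what goes wrong for empty data since pandas then infers `object` for every column.

```python
import numpy as np

# Illustrative Spark SQL -> numpy type mapping (a sketch, not the
# exact helper in dataframe.py). Types missing from the map return
# None, meaning "keep pandas' inferred dtype" -- the fallback that
# misbehaves for empty data, where pandas infers `object`.
_SPARK_TO_NUMPY = {
    "byte": np.int8,
    "short": np.int16,
    "integer": np.int32,
    "long": np.int64,
    "float": np.float32,
    "double": np.float64,
    "boolean": np.bool_,
    "timestamp": "datetime64[ns]",
}

def to_numpy_dtype(spark_type_name):
    # Hypothetical helper name; returns None for unmapped types.
    return _SPARK_TO_NUMPY.get(spark_type_name)
```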
python/pyspark/sql/dataframe.py (outdated, same snippet as above)
I would also just keep the if-elif logic. A map might be slightly more efficient for lookup, but one has to be created every time this function is called. More importantly, this code path isn't performance-sensitive: it's only called once per column. So I would just keep the logic as it was.
ok to test
cc @BryanCutler
Also, it seems we should handle the case when Arrow optimization is enabled as well.
Test build #114809 has finished for PR 26747 at commit
This reverts commit 916e19d.
@HyukjinKwon I've reverted back to an if-else chain instead of a dict. Was there anything else you think I should change?
@srowen This illustrates the current behaviour, where an empty Spark DataFrame with a column of a numeric type comes back from toPandas with the wrong dtype (screenshot omitted).
When the dataframe is not empty, the dtypes are correct (screenshot omitted).
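The difference the screenshots showed can be reproduced without Spark at all, since it comes from pandas' dtype inference (a minimal pandas-only sketch):

```python
import pandas as pd

# pandas infers dtypes from the values; with zero rows there is
# nothing to infer from, so every column comes back as `object`.
empty = pd.DataFrame.from_records([], columns=["a_long"])
print(empty.dtypes["a_long"])     # object

# With data present, inference works and an integer column is int64.
nonempty = pd.DataFrame.from_records([(1,)], columns=["a_long"])
print(nonempty.dtypes["a_long"])  # int64
```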
Test build #114838 has finished for PR 26747 at commit
I'm seeing a failed build, but it doesn't look like it has anything to do with this patch, does it?
retest this please
    StructField('integer', IntegerType(), True),
    StructField('long', LongType(), True),
    StructField('short', ShortType(), True),
])
@dlindelof How does it work for decimal and other types? You're fixing a fundamental problem (see #26747 (comment)).
Can you test other type combinations, and make sure the empty and non-empty dtypes are the same?
Hi,
I've added some more types, I think we have the most important ones now. I've also checked how this behaves in the presence of nulls.
Let me know if you think I'm missing something or if I should have done something differently.
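The empty-vs-nonempty check being asked for can be sketched at the pandas level: cast the empty frame's columns using a type mapping and compare against what pandas infers when data is present. The column names mirror the schema snippet above; the mapping values are the expected numpy dtypes, an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical mapping for the three columns in the schema above.
mapping = {"integer": np.int32, "long": np.int64, "short": np.int16}

# Empty frame: pandas infers `object` everywhere, so apply the mapping.
empty = pd.DataFrame({name: pd.Series([], dtype=object) for name in mapping})
corrected = empty.astype(mapping)

# Non-empty frame: pandas infers the dtypes from the data itself.
nonempty = pd.DataFrame({
    "integer": np.array([1], dtype=np.int32),
    "long": np.array([1], dtype=np.int64),
    "short": np.array([1], dtype=np.int16),
})

# The fix is correct when both paths agree, column by column.
assert list(corrected.dtypes) == list(nonempty.dtypes)
```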
Test build #114846 has finished for PR 26747 at commit
Test build #114906 has finished for PR 26747 at commit
Test build #114907 has finished for PR 26747 at commit
Hi, I just wanted to check what you think of the patch now. Is there anything else you think should be changed? Just let me know and I'll be happy to implement it.
Looks reasonable to me, but @HyukjinKwon and @BryanCutler are the experts -- other thoughts?
Merged to master. @dlindelof, thanks for addressing my comments and welcome to Apache Spark contributors :-).
@dlindelof, what's your JIRA id? I need to assign you a Contributor role to assign you to the JIRA https://issues.apache.org/jira/browse/SPARK-29188 |
@HyukjinKwon my JIRA id is
What changes were proposed in this pull request?
An empty Spark DataFrame converted to a pandas DataFrame didn't get the right column dtypes, because several type mappings were missing.
Why are the changes needed?
Empty Spark DataFrames can be used in unit tests and verified by converting them to pandas first, but such tests fail when the column dtypes come out wrong.
Does this PR introduce any user-facing change?
Yes; the error reported in the JIRA issue should not happen anymore.
How was this patch tested?
Through unit tests in pyspark.sql.tests.test_dataframe.DataFrameTests#test_to_pandas_from_empty_dataframe