
[SPARK-29188][PYTHON] toPandas (without Arrow) gets wrong dtypes when applied on empty DF #26747

Closed · 6 commits

Conversation

@dlindelof

dlindelof commented Dec 3, 2019

What changes were proposed in this pull request?

An empty Spark DataFrame converted to a Pandas DataFrame wouldn't have the right column types. Several type mappings were missing.
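The correction happens per column in the non-Arrow toPandas path. A minimal sketch of the kind of Spark-to-numpy mapping involved, keyed here by Spark SQL type-name strings so the sketch stays self-contained without pyspark (the exact set of cases in the patch may differ):

```python
import numpy as np

# Sketch of the per-column Spark SQL -> numpy dtype correction applied when
# converting toPandas without Arrow. Keys are Spark simpleString type names;
# the actual patch dispatches on pyspark.sql.types classes instead.
def corrected_pandas_type(spark_type_name):
    if spark_type_name == "tinyint":      # ByteType
        return np.int8
    elif spark_type_name == "smallint":   # ShortType
        return np.int16
    elif spark_type_name == "int":        # IntegerType
        return np.int32
    elif spark_type_name == "bigint":     # LongType
        return np.int64
    elif spark_type_name == "float":      # FloatType
        return np.float32
    elif spark_type_name == "double":     # DoubleType
        return np.float64
    elif spark_type_name == "boolean":    # BooleanType
        return np.bool_
    else:
        # Fall back to whatever pandas inferred (object for an empty frame).
        return None
```

Types with no safe numpy equivalent (decimal, date, and so on) return None and stay as object.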

Why are the changes needed?

Empty Spark DataFrames can be used to write unit tests, and verified by converting them to Pandas first. But this can fail when the column types are wrong.

Does this PR introduce any user-facing change?

Yes; the error reported in the JIRA issue should not happen anymore.

How was this patch tested?

Through unit tests in pyspark.sql.tests.test_dataframe.DataFrameTests#test_to_pandas_from_empty_dataframe

@srowen

Member

srowen commented Dec 3, 2019

Out of curiosity, what was the previous behavior? Was it just missing the Long / Double mappings?

        return np.float32
    else:
        return None
    mappings = {

@HyukjinKwon

HyukjinKwon Dec 4, 2019

Member

We should list all the data types here. Initially this mapping was only there to correct pandas's inferred types. Now, for empty data, pandas always infers object, so the conversion has to rely on this type mapping rather than on inference as originally intended.

See to_arrow_type as an example of a complete type mapping. You might need to check which Spark -> Python -> pandas type conversion combinations are valid and whitelist them here.

        return np.float32
    else:
        return None
    mappings = {

@HyukjinKwon

HyukjinKwon Dec 4, 2019

Member

I would also just keep the if-elif logic. A dict might look more efficient, but it would be created every time this function is called. More importantly, this code path isn't performance sensitive, since it's called once per column. So I would keep the logic as it was.

@HyukjinKwon

Member

HyukjinKwon commented Dec 4, 2019

ok to test

@HyukjinKwon

Member

HyukjinKwon commented Dec 4, 2019

HyukjinKwon changed the title from "[SPARK-29188][PySpark] toPandas gets wrong dtypes when applied on empty DF" to "[SPARK-29188][PYTHON] toPandas (without Arrow) gets wrong dtypes when applied on empty DF" on Dec 4, 2019
@HyukjinKwon

Member

HyukjinKwon commented Dec 4, 2019

Also, it seems we should handle the case when Arrow optimization is enabled as well (spark.sql.execution.arrow.pyspark.enabled set to true). But I suspect that can be done separately.

@SparkQA


SparkQA commented Dec 4, 2019

Test build #114809 has finished for PR 26747 at commit 916e19d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Commit: "This reverts commit 916e19d."
@dlindelof

Author

dlindelof commented Dec 4, 2019

@HyukjinKwon I've reverted to an if-else chain instead of a dict. Was there anything else you think I should change?

@dlindelof

Author

dlindelof commented Dec 4, 2019

@srowen This illustrates the current behaviour, where an empty Spark DataFrame with a column of type LongType becomes a pandas DataFrame with a column of dtype object:

In [62]: foo = spark.sql("SELECT CAST(1 AS LONG) AS bar WHERE 1 = 0")

In [63]: foo
Out[63]: DataFrame[bar: bigint]

In [64]: foo.toPandas().dtypes
Out[64]:
bar    object
dtype: object

When the dataframe is not empty, this is what you see:

In [65]: foo = spark.sql("SELECT CAST(1 AS LONG) AS bar WHERE 1 = 1")

In [66]: foo.toPandas().dtypes
Out[66]:
bar    int64
dtype: object
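Without Arrow, toPandas builds the frame with pandas.DataFrame.from_records, so the behaviour above can be reproduced in plain pandas: an empty record list gives inference nothing to work with, and every column falls back to object until a dtype is applied explicitly. A sketch (not the actual pyspark code path):

```python
import pandas as pd

# With rows, pandas infers int64 from the values.
nonempty = pd.DataFrame.from_records([(1,)], columns=["bar"])
print(nonempty["bar"].dtype)  # int64

# With no rows, there is nothing to infer from, so the column is object.
empty = pd.DataFrame.from_records([], columns=["bar"])
print(empty["bar"].dtype)  # object

# The fix amounts to applying the Spark-derived dtype explicitly.
corrected = empty.astype({"bar": "int64"})
print(corrected["bar"].dtype)  # int64
```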
@SparkQA


SparkQA commented Dec 4, 2019

Test build #114838 has finished for PR 26747 at commit f25827c.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.
@dlindelof

Author

dlindelof commented Dec 4, 2019

I'm seeing a failed build, but it doesn't look like it has anything to do with this patch, does it?

@HyukjinKwon

Member

HyukjinKwon commented Dec 4, 2019

retest this please

StructField('integer', IntegerType(), True),
StructField('long', LongType(), True),
StructField('short', ShortType(), True),
])

@HyukjinKwon

HyukjinKwon Dec 4, 2019

Member

@dlindelof How does it work for decimal and other types? You're fixing a fundamental problem (see #26747 (comment)).

Can you test other type combinations, and make sure the dtypes for empty and non-empty DataFrames are the same?
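The review request can be phrased as an invariant: for every supported type, an empty and a non-empty DataFrame should convert to identical dtypes. A self-contained sketch of that check in plain pandas, with a hypothetical corrections table (the real test lives in pyspark.sql.tests.test_dataframe):

```python
import numpy as np
import pandas as pd

# Hypothetical Spark-type -> numpy dtype corrections; decimal deliberately
# maps to None, meaning "leave the column as object".
CORRECTIONS = {
    "bigint": np.int64,
    "double": np.float64,
    "boolean": np.bool_,
    "decimal(10,0)": None,
}

def empty_with_dtypes(columns):
    """Build an empty frame and apply the corrections, as toPandas would."""
    pdf = pd.DataFrame.from_records([], columns=[name for name, _ in columns])
    for name, spark_type in columns:
        dtype = CORRECTIONS.get(spark_type)
        if dtype is not None:
            pdf[name] = pdf[name].astype(dtype)
    return pdf

pdf = empty_with_dtypes([("a", "bigint"), ("b", "double"), ("c", "decimal(10,0)")])
print(pdf.dtypes)
```

Any type missing from the table silently stays object, which is exactly the gap the review comment asks to close.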

@dlindelof

dlindelof Dec 5, 2019

Author

Hi,

I've added some more types, I think we have the most important ones now. I've also checked how this behaves in the presence of nulls.
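The null behaviour is worth spelling out: without Arrow, a Spark integer column containing nulls cannot keep an integer dtype in pandas, because numpy-backed int64 has no NA value, so pandas promotes the column to float64. A quick illustration of that constraint:

```python
import pandas as pd

# An all-integer column keeps an integer dtype...
print(pd.Series([1, 2]).dtype)     # int64

# ...but a null forces promotion, since numpy int64 has no NA value.
print(pd.Series([1, None]).dtype)  # float64
```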

Let me know if you think I'm missing something or if I should have done something differently.

@SparkQA


SparkQA commented Dec 4, 2019

Test build #114846 has finished for PR 26747 at commit f25827c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@SparkQA


SparkQA commented Dec 5, 2019

Test build #114906 has finished for PR 26747 at commit 150ecf7.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@SparkQA


SparkQA commented Dec 5, 2019

Test build #114907 has finished for PR 26747 at commit bc95e27.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@dlindelof

Author

dlindelof commented Dec 9, 2019

Hi,

Just wanted to check what you think of the patch now. Is there anything else that should be changed? Let me know and I'll be happy to implement it.

@srowen

Member

srowen commented Dec 10, 2019

Looks reasonable to me, but @HyukjinKwon and @BryanCutler are the experts -- other thoughts?

@HyukjinKwon

Member

HyukjinKwon commented Dec 12, 2019

Merged to master.

@dlindelof, thanks for addressing my comments and welcome to Apache Spark contributors :-).

@HyukjinKwon

Member

HyukjinKwon commented Dec 12, 2019

@dlindelof, what's your JIRA id? I need to assign you a Contributor role to assign you to the JIRA https://issues.apache.org/jira/browse/SPARK-29188

@dlindelof

Author

dlindelof commented Dec 12, 2019

@HyukjinKwon my JIRA id is dlindelof. Thanks for approving this PR, happy to help.
