
[SPARK-29188][PYTHON] toPandas (without Arrow) gets wrong dtypes when applied on empty DF #26747

Closed
wants to merge 6 commits into from

Conversation

dlindelof

What changes were proposed in this pull request?

An empty Spark DataFrame converted to a Pandas DataFrame wouldn't have the right column types. Several type mappings were missing.

Why are the changes needed?

Empty Spark DataFrames can be used in unit tests and verified by converting them to Pandas first, but that verification can fail when the column types are wrong.

Does this PR introduce any user-facing change?

Yes; the error reported in the JIRA issue should not happen anymore.

How was this patch tested?

Through unit tests in pyspark.sql.tests.test_dataframe.DataFrameTests#test_to_pandas_from_empty_dataframe

@srowen
Member

srowen commented Dec 3, 2019

Out of curiosity, what was the previous behavior? Was it just missing the Long / Double mappings?

    return np.float32
else:
    return None
mappings = {
Member

We should list all the data types here. Initially this mapping existed only to correct pandas's inferred types.
Now, in the case of empty data, pandas always infers object, so we have to rely on this type mapping rather than on inference as in the originally intended case.

See to_arrow_type as an example of a complete type mapping. You might need to check which Spark -> Python -> pandas type conversion combinations are valid and whitelist them here.
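A complete mapping along these lines might look like the following sketch. The helper name and the string keys are illustrative assumptions; the real helper in pyspark/sql/dataframe.py dispatches on Spark DataType instances, and to_arrow_type in pyspark/sql/types.py can serve as the reference for completeness:

```python
import numpy as np

def to_corrected_pandas_dtype(spark_type_name):
    """Sketch: map a Spark SQL type name to the NumPy dtype that
    toPandas() should force on a (possibly empty) column.
    Returns None for types whose inferred dtype needs no correction."""
    mappings = {
        "byte": np.int8,
        "short": np.int16,
        "integer": np.int32,
        "long": np.int64,
        "float": np.float32,
        "double": np.float64,
        "boolean": np.bool_,
        "timestamp": "datetime64[ns]",
    }
    return mappings.get(spark_type_name)
```

On an empty DataFrame the returned dtype would then be applied with astype, instead of trusting pandas's inference (which yields object for empty columns).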

    return np.float32
else:
    return None
mappings = {
Member

I would also just keep the if-elif logic. A map might make lookups more efficient, but the map has to be created every time this function is called. More importantly, this code path isn't performance-sensitive, as it's called once per column. So I would just keep the logic as it was.
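Kept as an if-elif chain per the comment above, the same sketch reads as follows (again with an illustrative helper name and string keys; the original code compares against Spark DataType classes):

```python
import numpy as np

def to_corrected_pandas_dtype(spark_type_name):
    # if-elif form: nothing is rebuilt per call, and since the function
    # runs only once per column, lookup speed is irrelevant anyway.
    if spark_type_name == "byte":
        return np.int8
    elif spark_type_name == "short":
        return np.int16
    elif spark_type_name == "integer":
        return np.int32
    elif spark_type_name == "long":
        return np.int64
    elif spark_type_name == "float":
        return np.float32
    elif spark_type_name == "double":
        return np.float64
    else:
        return None
```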

@HyukjinKwon
Member

ok to test

@HyukjinKwon
Member

cc @BryanCutler

@HyukjinKwon HyukjinKwon changed the title [SPARK-29188][PySpark] toPandas gets wrong dtypes when applied on empty DF [SPARK-29188][PYTHON] toPandas (without Arrow) gets wrong dtypes when applied on empty DF Dec 4, 2019
@HyukjinKwon
Member

Also, it seems we should handle the case when Arrow optimization is enabled as well (spark.sql.execution.arrow.pyspark.enabled set to true). But I suspect this can be done separately.
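For context, the Arrow code path is toggled by that configuration; a minimal sketch, assuming an active SparkSession named spark (in Spark 2.x the key was spark.sql.execution.arrow.enabled):

```python
# Enable the Arrow-based toPandas() path; the empty-DataFrame dtype
# behaviour would need to be verified separately for this path.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
```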

@SparkQA

SparkQA commented Dec 4, 2019

Test build #114809 has finished for PR 26747 at commit 916e19d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dlindelof
Author

@HyukjinKwon I've reverted back to an if-else chain instead of a dict. Was there anything else you think I should change?

@dlindelof
Author

@srowen This illustrates the current behaviour, where an empty Spark DataFrame with a column of type LongType becomes a Pandas DataFrame with a column of dtype object (i.e., generic Python objects):

In [62]: foo = spark.sql("SELECT CAST(1 AS LONG) AS bar WHERE 1 = 0")

In [63]: foo
Out[63]: DataFrame[bar: bigint]

In [64]: foo.toPandas().dtypes
Out[64]:
bar    object
dtype: object

When the DataFrame is not empty, this is what you see:

In [65]: foo = spark.sql("SELECT CAST(1 AS LONG) AS bar WHERE 1 = 1")

In [66]: foo.toPandas().dtypes
Out[66]:
bar    int64
dtype: object
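The pandas side of this can be reproduced without Spark: an empty column gives pandas nothing to infer a dtype from, so an explicit cast is required, which is essentially what the patched toPandas() does per column. A minimal illustration (not the actual Spark code):

```python
import numpy as np
import pandas as pd

# A freshly constructed empty column defaults to dtype object...
pdf = pd.DataFrame(columns=["bar"])

# ...so the dtype has to be set explicitly from the Spark schema:
pdf["bar"] = pdf["bar"].astype(np.int64)
print(pdf.dtypes["bar"])  # int64
```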

@SparkQA

SparkQA commented Dec 4, 2019

Test build #114838 has finished for PR 26747 at commit f25827c.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dlindelof
Author

I'm seeing a failed build, but it doesn't look like it has anything to do with this patch, does it?

@HyukjinKwon
Member

retest this please

    StructField('integer', IntegerType(), True),
    StructField('long', LongType(), True),
    StructField('short', ShortType(), True),
])
Member

@dlindelof How does it work for decimal and other types? You're fixing a fundamental problem (see #26747 (comment))

Can you test other type combinations, and make sure the dtypes for non-empty and empty DataFrames are the same?
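That request can be phrased as a reusable check on the converted frames; a sketch (the helper name is made up, and the two frames are assumed to come from toPandas() on an empty and a non-empty DataFrame with the same schema):

```python
import pandas as pd

def assert_same_dtypes(empty_pdf, nonempty_pdf):
    """Converting an empty DataFrame should yield exactly the same
    column dtypes as converting a non-empty one with the same schema."""
    mismatches = {
        col: (str(empty_pdf.dtypes[col]), str(nonempty_pdf.dtypes[col]))
        for col in nonempty_pdf.columns
        if empty_pdf.dtypes[col] != nonempty_pdf.dtypes[col]
    }
    assert not mismatches, "dtype mismatch on columns: %s" % mismatches
```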

Author

Hi,

I've added some more types, I think we have the most important ones now. I've also checked how this behaves in the presence of nulls.

Let me know if you think I'm missing something or if I should have done something differently.

@SparkQA

SparkQA commented Dec 4, 2019

Test build #114846 has finished for PR 26747 at commit f25827c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 5, 2019

Test build #114906 has finished for PR 26747 at commit 150ecf7.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 5, 2019

Test build #114907 has finished for PR 26747 at commit bc95e27.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dlindelof
Author

Hi,

Just wanted to check what you think of the patch now. Is there anything else that should be changed? Let me know and I'll be happy to implement it.

@srowen
Member

srowen commented Dec 10, 2019

Looks reasonable to me, but @HyukjinKwon and @BryanCutler are the experts -- other thoughts?

@HyukjinKwon
Member

Merged to master.

@dlindelof, thanks for addressing my comments and welcome to Apache Spark contributors :-).

@HyukjinKwon
Member

@dlindelof, what's your JIRA id? I need to assign you a Contributor role to assign you to the JIRA https://issues.apache.org/jira/browse/SPARK-29188

@dlindelof
Author

@HyukjinKwon my JIRA id is dlindelof. Thanks for approving this PR, happy to help.
