
[SPARK-29188][PYTHON] toPandas (without Arrow) gets wrong dtypes when applied on empty DF #26747

Closed · 6 commits

Conversation

@dlindelof

dlindelof commented Dec 3, 2019

What changes were proposed in this pull request?

An empty Spark DataFrame converted to a Pandas DataFrame wouldn't have the right column types. Several type mappings were missing.
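The correction happens per column in the non-Arrow toPandas path. A minimal sketch of the kind of Spark-to-numpy mapping involved, keyed here by Spark SQL type-name strings so the sketch stays self-contained without pyspark (the exact set of cases in the patch may differ):

```python
import numpy as np

# Sketch of the per-column Spark SQL -> numpy dtype correction applied when
# converting toPandas without Arrow. Keys are Spark simpleString type names;
# the actual patch dispatches on pyspark.sql.types classes instead.
def corrected_pandas_type(spark_type_name):
    if spark_type_name == "tinyint":      # ByteType
        return np.int8
    elif spark_type_name == "smallint":   # ShortType
        return np.int16
    elif spark_type_name == "int":        # IntegerType
        return np.int32
    elif spark_type_name == "bigint":     # LongType
        return np.int64
    elif spark_type_name == "float":      # FloatType
        return np.float32
    elif spark_type_name == "double":     # DoubleType
        return np.float64
    elif spark_type_name == "boolean":    # BooleanType
        return np.bool_
    else:
        # Fall back to whatever pandas inferred (object for an empty frame).
        return None
```

Types with no safe numpy equivalent (decimal, date, and so on) return None and stay as object.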

Why are the changes needed?

Empty Spark DataFrames can be used to write unit tests, and verified by converting them to Pandas first. But this can fail when the column types are wrong.

Does this PR introduce any user-facing change?

Yes; the error reported in the JIRA issue should not happen anymore.

How was this patch tested?

Through unit tests in pyspark.sql.tests.test_dataframe.DataFrameTests#test_to_pandas_from_empty_dataframe

@srowen

Member

srowen commented Dec 3, 2019

Out of curiosity, what was the previous behavior? Was it just missing the Long / Double mappings?

        return np.float32
    else:
        return None
    mappings = {

@HyukjinKwon

HyukjinKwon Dec 4, 2019

Member

We should list all the data types here. Initially this mapping was only there to correct pandas's inferred types. Now, for empty data, pandas always infers object, so the conversion has to rely on this type mapping rather than on inference as originally intended.

See to_arrow_type as an example of a complete type mapping. You might need to check which Spark -> Python -> pandas type conversion combinations are valid and whitelist them here.

        return np.float32
    else:
        return None
    mappings = {

@HyukjinKwon

HyukjinKwon Dec 4, 2019

Member

I would also just keep the if-elif logic. A dict might look more efficient, but it would be created every time this function is called. More importantly, this code path isn't performance sensitive, since it's called once per column. So I would keep the logic as it was.

@HyukjinKwon

Member

HyukjinKwon commented Dec 4, 2019

ok to test

@HyukjinKwon

Member

HyukjinKwon commented Dec 4, 2019

HyukjinKwon changed the title from "[SPARK-29188][PySpark] toPandas gets wrong dtypes when applied on empty DF" to "[SPARK-29188][PYTHON] toPandas (without Arrow) gets wrong dtypes when applied on empty DF" on Dec 4, 2019
@HyukjinKwon

Member

HyukjinKwon commented Dec 4, 2019

Also, it seems we should handle the case when Arrow optimization is enabled as well (spark.sql.execution.arrow.pyspark.enabled set to true). But I suspect that can be done separately.

@SparkQA


SparkQA commented Dec 4, 2019

Test build #114809 has finished for PR 26747 at commit 916e19d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Commit: "This reverts commit 916e19d."
@dlindelof

Author

dlindelof commented Dec 4, 2019

@HyukjinKwon I've reverted to an if-else chain instead of a dict. Was there anything else you think I should change?

@dlindelof

Author

dlindelof commented Dec 4, 2019

@srowen This illustrates the current behaviour, where an empty Spark DataFrame with a column of type LongType becomes a pandas DataFrame with a column of dtype object:

In [62]: foo = spark.sql("SELECT CAST(1 AS LONG) AS bar WHERE 1 = 0")

In [63]: foo
Out[63]: DataFrame[bar: bigint]

In [64]: foo.toPandas().dtypes
Out[64]:
bar    object
dtype: object

When the dataframe is not empty, this is what you see:

In [65]: foo = spark.sql("SELECT CAST(1 AS LONG) AS bar WHERE 1 = 1")

In [66]: foo.toPandas().dtypes
Out[66]:
bar    int64
dtype: object
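Without Arrow, toPandas builds the frame with pandas.DataFrame.from_records, so the behaviour above can be reproduced in plain pandas: an empty record list gives inference nothing to work with, and every column falls back to object until a dtype is applied explicitly. A sketch (not the actual pyspark code path):

```python
import pandas as pd

# With rows, pandas infers int64 from the values.
nonempty = pd.DataFrame.from_records([(1,)], columns=["bar"])
print(nonempty["bar"].dtype)  # int64

# With no rows, there is nothing to infer from, so the column is object.
empty = pd.DataFrame.from_records([], columns=["bar"])
print(empty["bar"].dtype)  # object

# The fix amounts to applying the Spark-derived dtype explicitly.
corrected = empty.astype({"bar": "int64"})
print(corrected["bar"].dtype)  # int64
```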
@SparkQA


SparkQA commented Dec 4, 2019

Test build #114838 has finished for PR 26747 at commit f25827c.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.
@dlindelof

Author

dlindelof commented Dec 4, 2019

I'm seeing a failed build, but it doesn't look like it has anything to do with this patch, does it?

@HyukjinKwon

Member

HyukjinKwon commented Dec 4, 2019

retest this please

StructField('integer', IntegerType(), True),
StructField('long', LongType(), True),
StructField('short', ShortType(), True),
])

@HyukjinKwon

HyukjinKwon Dec 4, 2019

Member

@dlindelof How does it work for decimal and other types? You're fixing a fundamental problem (see #26747 (comment)).

Can you test other type combinations, and make sure the dtypes for empty and non-empty DataFrames are the same?
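The review request can be phrased as an invariant: for every supported type, an empty and a non-empty DataFrame should convert to identical dtypes. A self-contained sketch of that check in plain pandas, with a hypothetical corrections table (the real test lives in pyspark.sql.tests.test_dataframe):

```python
import numpy as np
import pandas as pd

# Hypothetical Spark-type -> numpy dtype corrections; decimal deliberately
# maps to None, meaning "leave the column as object".
CORRECTIONS = {
    "bigint": np.int64,
    "double": np.float64,
    "boolean": np.bool_,
    "decimal(10,0)": None,
}

def empty_with_dtypes(columns):
    """Build an empty frame and apply the corrections, as toPandas would."""
    pdf = pd.DataFrame.from_records([], columns=[name for name, _ in columns])
    for name, spark_type in columns:
        dtype = CORRECTIONS.get(spark_type)
        if dtype is not None:
            pdf[name] = pdf[name].astype(dtype)
    return pdf

pdf = empty_with_dtypes([("a", "bigint"), ("b", "double"), ("c", "decimal(10,0)")])
print(pdf.dtypes)
```

Any type missing from the table silently stays object, which is exactly the gap the review comment asks to close.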

@dlindelof

dlindelof Dec 5, 2019

Author

Hi,

I've added some more types, I think we have the most important ones now. I've also checked how this behaves in the presence of nulls.
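The null behaviour is worth spelling out: without Arrow, a Spark integer column containing nulls cannot keep an integer dtype in pandas, because numpy-backed int64 has no NA value, so pandas promotes the column to float64. A quick illustration of that constraint:

```python
import pandas as pd

# An all-integer column keeps an integer dtype...
print(pd.Series([1, 2]).dtype)     # int64

# ...but a null forces promotion, since numpy int64 has no NA value.
print(pd.Series([1, None]).dtype)  # float64
```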

Let me know if you think I'm missing something or if I should have done something differently.

@SparkQA


SparkQA commented Dec 4, 2019

Test build #114846 has finished for PR 26747 at commit f25827c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@SparkQA


SparkQA commented Dec 5, 2019

Test build #114906 has finished for PR 26747 at commit 150ecf7.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@SparkQA


SparkQA commented Dec 5, 2019

Test build #114907 has finished for PR 26747 at commit bc95e27.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@dlindelof

Author

dlindelof commented Dec 9, 2019

Hi,

Just wanted to check what you think of the patch now. Is there anything else that should be changed? Let me know and I'll be happy to implement it.

@srowen

Member

srowen commented Dec 10, 2019

Looks reasonable to me, but @HyukjinKwon and @BryanCutler are the experts -- other thoughts?

@HyukjinKwon

Member

HyukjinKwon commented Dec 12, 2019

Merged to master.

@dlindelof, thanks for addressing my comments and welcome to Apache Spark contributors :-).

@HyukjinKwon

Member

HyukjinKwon commented Dec 12, 2019

@dlindelof, what's your JIRA id? I need to assign you a Contributor role to assign you to the JIRA https://issues.apache.org/jira/browse/SPARK-29188

@dlindelof

Author

dlindelof commented Dec 12, 2019

@HyukjinKwon my JIRA id is dlindelof. Thanks for approving this PR, happy to help.
