
[SPARK-35211][PYTHON] Proper NumericType conversion for applySchemaToPythonRDD #32327

Closed
wants to merge 1 commit into from

Conversation

Contributor

@da-liii da-liii commented Apr 25, 2021

What changes were proposed in this pull request?

spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
from pyspark.testing.sqlutils import ExamplePoint
import pandas as pd
pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 2)])})
df = spark.createDataFrame(pdf, verifySchema=False)
df.show()

Before:

+----------+
|     point|
+----------+
|    (0, 0)|
|    (0, 0)|
+----------+

After:

+----------+
|     point|
+----------+
|(1.0, 1.0)|
|(2.0, 2.0)|
+----------+
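The fix comes down to widening Python ints to doubles when the target field is DoubleType. A minimal sketch of that coercion outside Spark (assumptions: ExamplePoint's UDT serializes to a struct of two DoubleType fields; `convert_double` and `convert_row` are illustrative names, not Spark APIs):

```python
def convert_double(obj):
    """Illustrative per-field converter: widen int/float inputs to a double."""
    if obj is None:
        return None
    if isinstance(obj, (int, float)):
        return float(obj)  # widen int/long/float inputs to double
    raise TypeError(f"cannot convert {obj!r} to double")

def convert_row(values):
    """Apply the field converter across a serialized UDT row."""
    return tuple(convert_double(v) for v in values)

convert_row((1, 1))  # → (1.0, 1.0) rather than the corrupted (0, 0) shown above
```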

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

@da-liii da-liii changed the title [SPARK-34771][PYTHON][FOLLOW_UP] Proper NumericType conversion for applySchemaToPythonRDD [SPARK-35211][PYTHON][FOLLOW_UP] Proper NumericType conversion for applySchemaToPythonRDD Apr 25, 2021
@HyukjinKwon
Member

@sadhen, SPARK-35211 is not merged yet, and it makes less sense to call it a followup. Let's make a separate JIRA for this PR.

Member

@HyukjinKwon HyukjinKwon left a comment


You should also probably update https://github.com/apache/spark/blob/master/python/pyspark/sql/functions.py#L4922-L4940 table.

The problem is that:

We should probably define one standard to follow.

@HyukjinKwon
Member

ok to test

@HyukjinKwon
Member

I guess this change is okay, though ...

@da-liii da-liii changed the title [SPARK-35211][PYTHON][FOLLOW_UP] Proper NumericType conversion for applySchemaToPythonRDD [SPARK-35211][PYTHON] Proper NumericType conversion for applySchemaToPythonRDD Apr 25, 2021
@SparkQA

SparkQA commented Apr 25, 2021

Test build #137904 has finished for PR 32327 at commit 829dda9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 25, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42433/

@SparkQA

SparkQA commented Apr 25, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42433/

@SparkQA

SparkQA commented Apr 25, 2021

Test build #137912 has finished for PR 32327 at commit 829dda9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@da-liii
Contributor Author

da-liii commented Apr 26, 2021

We should probably define one standard to follow.

I don't know of a standard. That's why I only do the conversion for the numeric types: int/long/byte/short/float/double.
How about fixing the numeric types first?

Could you describe the problem in detail?

@HyukjinKwon
Member

I am okay with this change but you'll have to update https://github.com/apache/spark/blob/master/python/pyspark/sql/functions.py#L4922-L4940 as well by using the codes here: 9b9d81b

@HyukjinKwon
Member

cc @ueshin @BryanCutler @viirya too FYI

@viirya
Member

viirya commented May 4, 2021

Hmm, first, I am confused: why are there many PRs for the same JIRA? Are they all for the same issue?

Member

@BryanCutler BryanCutler left a comment


Most of these statements seem to follow a pattern of only converting similar types. What exactly was causing the problem from the example in the description? Was it trying to convert an integer to a double?

case ShortType => (obj: Any) => nullSafeConvert(obj) {
  case c: Byte => c.toShort
  case c: Short => c
  case c: Int => c.toShort
  case c: Long => c.toShort
  case c: Float => c.toShort
  case c: Double => c.toShort
}
Member


It seems odd that we would want to silently convert a double to a short. Are we sure that is the correct behavior?
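For context, that narrowing is lossy in two ways. A small Python sketch (assumption: `to_short` is an illustrative stand-in mirroring Scala's `Double.toShort` semantics, which truncate the fraction and wrap into the 16-bit signed range; it is not Spark code):

```python
def to_short(obj):
    """Illustrative stand-in for the `case c: Double => c.toShort` branch."""
    if obj is None:
        return None
    v = int(obj)             # drop the fractional part, as Scala's toShort does
    v &= 0xFFFF              # keep only the low 16 bits
    return v - 0x10000 if v >= 0x8000 else v  # reinterpret as signed Short

to_short(1.9)      # → 1: the fraction is silently dropped
to_short(70000.0)  # → 4464: the value silently wraps around the Short range
```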

@da-liii da-liii closed this Jul 26, 2021
@da-liii
Contributor Author

da-liii commented Jul 26, 2021

Are they for the same issue?

Yes. They are for the same issue, but from different parts of the code path.

Well, now I've created another JIRA ticket (https://issues.apache.org/jira/browse/SPARK-36283) to track this issue.
