
[SPARK-35211][PYTHON] Proper NumericType conversion for applySchemaToPythonRDD #32327

Closed
wants to merge 1 commit into from

Conversation

Contributor

@da-liii da-liii commented Apr 25, 2021

What changes were proposed in this pull request?

spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
from pyspark.testing.sqlutils import ExamplePoint
import pandas as pd
pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 2)])})
df = spark.createDataFrame(pdf, verifySchema=False)
df.show()

Before:

+----------+
|     point|
+----------+
|    (0, 0)|
|    (0, 0)|
+----------+

After:

+----------+
|     point|
+----------+
|(1.0, 1.0)|
|(2.0, 2.0)|
+----------+
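The fix comes down to widening Python ints to doubles when the target field is DoubleType. A minimal sketch of that coercion outside Spark (assumptions: ExamplePoint's UDT serializes to a struct of two DoubleType fields; `convert_double` and `convert_row` are illustrative names, not Spark APIs):

```python
def convert_double(obj):
    """Illustrative per-field converter: widen int/float inputs to a double."""
    if obj is None:
        return None
    if isinstance(obj, (int, float)):
        return float(obj)  # widen int/long/float inputs to double
    raise TypeError(f"cannot convert {obj!r} to double")

def convert_row(values):
    """Apply the field converter across a serialized UDT row."""
    return tuple(convert_double(v) for v in values)

convert_row((1, 1))  # → (1.0, 1.0) rather than the corrupted (0, 0) shown above
```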

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

@da-liii da-liii changed the title [SPARK-34771][PYTHON][FOLLOW_UP] Proper NumericType conversion for applySchemaToPythonRDD [SPARK-35211][PYTHON][FOLLOW_UP] Proper NumericType conversion for applySchemaToPythonRDD Apr 25, 2021
@HyukjinKwon
Member

@sadhen, SPARK-35211 is not merged yet, and it makes less sense to call it a followup. Let's make a separate JIRA for this PR.

Member

@HyukjinKwon HyukjinKwon left a comment


You should also probably update https://github.com/apache/spark/blob/master/python/pyspark/sql/functions.py#L4922-L4940 table.

The problem is that:

We should probably define one standard to follow.

@HyukjinKwon
Member

ok to test

@HyukjinKwon
Member

I guess this change is okay, though ...

@da-liii da-liii changed the title [SPARK-35211][PYTHON][FOLLOW_UP] Proper NumericType conversion for applySchemaToPythonRDD [SPARK-35211][PYTHON] Proper NumericType conversion for applySchemaToPythonRDD Apr 25, 2021
@SparkQA

SparkQA commented Apr 25, 2021

Test build #137904 has finished for PR 32327 at commit 829dda9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 25, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42433/

@SparkQA

SparkQA commented Apr 25, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42433/

@SparkQA

SparkQA commented Apr 25, 2021

Test build #137912 has finished for PR 32327 at commit 829dda9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@da-liii
Contributor Author

da-liii commented Apr 26, 2021

We should probably define one standard to follow.

I don't know of a standard. That's why I only do the conversion for the numeric types: int/long/byte/short/float/double.
How about fixing the numeric types first?

Could you describe the problem in detail?

@HyukjinKwon
Member

I am okay with this change but you'll have to update https://github.com/apache/spark/blob/master/python/pyspark/sql/functions.py#L4922-L4940 as well by using the codes here: 9b9d81b

@HyukjinKwon
Member

cc @ueshin @BryanCutler @viirya too FYI

@viirya
Member

viirya commented May 4, 2021

Hmm, first, I am confused: why are there many PRs for the same JIRA? Are they all for the same issue?

Member

@BryanCutler BryanCutler left a comment


Most of these statements seem to follow a pattern of only converting similar types. What exactly was causing the problem from the example in the description? Was it trying to convert an integer to a double?

case ShortType => (obj: Any) => nullSafeConvert(obj) {
  case c: Byte => c.toShort
  case c: Short => c
  case c: Int => c.toShort
  case c: Long => c.toShort
  case c: Float => c.toShort
  case c: Double => c.toShort
}
Member


It seems odd that we would want to silently convert a double to a short. Are we sure that is the correct behavior?
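For context, that narrowing is lossy in two ways. A small Python sketch (assumption: `to_short` is an illustrative stand-in mirroring Scala's `Double.toShort` semantics, which truncate the fraction and wrap into the 16-bit signed range; it is not Spark code):

```python
def to_short(obj):
    """Illustrative stand-in for the `case c: Double => c.toShort` branch."""
    if obj is None:
        return None
    v = int(obj)             # drop the fractional part, as Scala's toShort does
    v &= 0xFFFF              # keep only the low 16 bits
    return v - 0x10000 if v >= 0x8000 else v  # reinterpret as signed Short

to_short(1.9)      # → 1: the fraction is silently dropped
to_short(70000.0)  # → 4464: the value silently wraps around the Short range
```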

@da-liii da-liii closed this Jul 26, 2021
@da-liii
Contributor Author

da-liii commented Jul 26, 2021

Are they for the same issue?

Yes. They are for the same issue, but from different parts of the code path.

Well, now I've created another JIRA ticket (https://issues.apache.org/jira/browse/SPARK-36283) to track this issue.
