
[SPARK-36231][PYTHON] Support arithmetic operations of decimal(nan) series #34314

Closed · wants to merge 2 commits

Conversation

@Yikun (Member) commented Oct 18, 2021

What changes were proposed in this pull request?

This patch makes the following changes to follow the pandas behavior:

  • Add NaN handling in _non_fractional_astype: following pandas' string conversion, the result should be "NaN" rather than str(np.nan) ("nan"); covered by self.assert_eq(pser.astype(str), psser.astype(str)).
  • Add null handling in rpow (pandas defines 1 ** NaN == 1); covered by test_rpow.
  • Add an index_ops.hasnans check in astype; covered by test_astype.

This patch also moves numeric_w_nan_pdf into numeric_pdf, which means all the separate float_nan/decimal_nan test cases have been cleaned up and merged into the numeric tests. A sketch of the target behavior is shown below.
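
A minimal sketch of the target pandas behavior (plain pandas, assuming pandas 1.3+; not the pandas-on-Spark implementation itself):

    import decimal

    import numpy as np
    import pandas as pd

    pser = pd.Series([decimal.Decimal(1), decimal.Decimal(np.nan)])

    # astype(str) should render the missing value as "NaN"
    # (str(decimal.Decimal("NaN"))), not "nan" (str(np.nan)).
    print(pser.astype(str).tolist())  # ['1', 'NaN']

    # rpow: IEEE 754 and pandas define 1 ** NaN == 1, so the null case
    # needs explicit handling rather than propagating null.
    print((1 ** pd.Series([1.0, np.nan])).tolist())  # [1.0, 1.0]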

Why are the changes needed?

To follow the pandas behavior.

Does this PR introduce any user-facing change?

Yes, it corrects null value results to follow the pandas behavior.

How was this patch tested?

Unit tests covering all changes.

@Yikun (Member Author) commented Oct 18, 2021

cc @HyukjinKwon @xinrong-databricks

@Yikun Yikun changed the title [SPARK-36337][PYTHON] Support arithmetic operations of decimal(nan) series [SPARK-36231][PYTHON] Support arithmetic operations of decimal(nan) series Oct 18, 2021
@Yikun (Member Author) commented Oct 18, 2021

After this patch, we can finally close SPARK-36000.

@SparkQA commented Oct 18, 2021

Test build #144369 has finished for PR 34314 at commit a157e07.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 18, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48844/

@SparkQA commented Oct 18, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48844/

@HyukjinKwon (Member)

cc @xinrong-databricks

@xinrong-meng (Member)

FYI @ueshin

@SparkQA commented Nov 15, 2021

Test build #145217 has finished for PR 34314 at commit 6bf68b1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 15, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49687/

@SparkQA commented Nov 15, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49687/

@HyukjinKwon (Member)

Merged to master.

@HyukjinKwon (Member)

@Yikun, I am very sorry, but I realised that this patch breaks the test cases with lower pandas versions (because it requires https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.3.0.html#missing). Decimal("NaN") is not considered null in old pandas versions, and that causes a bunch of related failures.

I was preparing a follow-up with, for example, an approach like the one below:

-                    nullable=bool(col.isnull().any()),
+                    nullable=bool(col.isnull().any())
+                    # To work around https://github.com/pandas-dev/pandas/pull/39409
+                    | bool(
+                        col.map(lambda x: isinstance(x, decimal.Decimal) and math.isnan(x)).any()
+                    ),

However, the tests then fail because of different behaviours in old pandas versions.

While technically we could make a follow-up, please let me just revert it to make it easier to move forward; I hear complaints from here and there that the tests are failing.
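
For context, a minimal illustration of the pandas 1.3 behavior change behind this (plain pandas, no Spark involved):

    import decimal

    import pandas as pd

    s = pd.Series([decimal.Decimal("NaN")])

    # pandas >= 1.3 treats Decimal("NaN") as missing
    # (pandas-dev/pandas#39409); older versions report it as non-null.
    print(s.isnull().tolist())  # [True] on 1.3+, [False] on 1.1.x/1.2.x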

@Yikun (Member Author) commented Nov 22, 2021

@HyukjinKwon Thanks for your help. One more question: which pandas versions should mainly be tested and supported? Should we announce it somewhere, and then add a CI job that installs that specific pandas version as an extra check?

@HyukjinKwon (Member)

It's actually documented here: https://github.com/apache/spark/blob/master/python/setup.py#L115.
We should probably bump it up. Ideally we should test all the combinations just like other Python projects, but we can't do this due to the resource problem in GA 😢
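
For reference, a paraphrased (not verbatim) sketch of how python/setup.py pins the minimum; the exact value was later bumped by the commit referenced at the end of this thread:

    # python/setup.py (illustrative excerpt, assuming the 0.23.2 minimum
    # in place at the time; later bumped to 1.0.5)
    _minimum_pandas_version = "0.23.2"

    extras_require = {
        "sql": ["pandas>=%s" % _minimum_pandas_version],
    }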

@HyukjinKwon (Member)

Testing on 1.1.x or 1.2.x should be good enough for the fix itself.

@Yikun (Member Author) commented Nov 22, 2021

We should probably bump it up. Ideally we should test all the combinations just like other Python projects, but we can't do this due to the resource problem in GA.

@HyukjinKwon OK, thanks! That means we should test against v0.23.2 and later. I will address it soon. :)

Testing on 1.1.x or 1.2.x should be good enough for the fix itself.

OK, thanks for the reminder.

@HyukjinKwon (Member)

@Yikun, if you're stuck supporting this with old pandas versions, we can just conditionally run the tests with pandas 1.3+ for now.

@Yikun (Member Author) commented Nov 23, 2021

if you're stuck supporting this with old pandas versions

@HyukjinKwon I did some simple tests yesterday; many test cases fail with Decimal("NaN") on v1.2 or v1.1.

we can just conditionally run the tests with pandas 1.3+ for now

So the code and tests should also only support 1.3+, right?

@HyukjinKwon (Member)

Yeah, that is fine, because it already doesn't work with 1.2 and 1.1, so there is no regression.

-pdf.columns = [dtype.__name__ for dtype in dtypes] + ["decimal"]
+pdf.columns = [dtype.__name__ for dtype in dtypes] + [
+    "decimal",
+    "decimal_nan",
@Yikun (Member Author) commented on the diff above:

We can skip the decimal_nan test if the pandas version is < 1.3, and add a note here.

        # To work around https://github.com/pandas-dev/pandas/pull/39409
        if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
            sers.append(
                pd.Series([decimal.Decimal(1), decimal.Decimal(2), decimal.Decimal(np.nan)])
            )
            # Index.append returns a new Index instead of mutating in place
            pdf.columns = pdf.columns.append(pd.Index(["decimal_nan"]))
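
Alternatively, a sketch of gating at the test level with the usual unittest pattern (test_decimal_nan is a hypothetical name):

    import decimal
    import unittest
    from distutils.version import LooseVersion

    import pandas as pd

    class OpsTest(unittest.TestCase):
        @unittest.skipIf(
            LooseVersion(pd.__version__) < LooseVersion("1.3"),
            "pandas<1.3 does not treat Decimal('NaN') as missing",
        )
        def test_decimal_nan(self):
            self.assertTrue(pd.Series([decimal.Decimal("NaN")]).hasnans)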

@HyukjinKwon WDYT?

@Yikun (Member Author) commented:

#34687

I will do a complete local test before marking it ready for review.

HyukjinKwon pushed a commit that referenced this pull request Nov 30, 2021
### What changes were proposed in this pull request?
Bump minimum pandas version to 1.0.5 (or a better version)

### Why are the changes needed?
Initial discussion from [SPARK-37465](https://issues.apache.org/jira/browse/SPARK-37465) and #34314 (comment) .

### Does this PR introduce _any_ user-facing change?
Yes, bumps the pandas minimum version.

### How was this patch tested?
PySpark tests passed with pandas v1.0.5.

Closes #34717 from Yikun/pandas-min-version.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>