
[SPARK-36231][PYTHON] Support arithmetic operations of decimal(nan) series #34314

Closed · wants to merge 2 commits

Conversation

@Yikun (Member) commented Oct 18, 2021

What changes were proposed in this pull request?

This patch makes the following changes to follow the pandas behavior:

  • Add NaN handling in _non_fractional_astype: following pandas' string conversion, the result should be "NaN" rather than str(np.nan) ("nan"); covered by self.assert_eq(pser.astype(str), psser.astype(str)).
  • Add null handling in rpow (pandas defines 1 ** NaN == 1); covered by test_rpow.
  • Add an index_ops.hasnans check in astype; covered by test_astype.

This patch also moves numeric_w_nan_pdf into numeric_pdf, which means all the separate float_nan/decimal_nan test cases have been cleaned up and merged into the numeric tests. A sketch of the target behavior is shown below.
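
A minimal sketch of the target pandas behavior (plain pandas, assuming pandas 1.3+; not the pandas-on-Spark implementation itself):

    import decimal

    import numpy as np
    import pandas as pd

    pser = pd.Series([decimal.Decimal(1), decimal.Decimal(np.nan)])

    # astype(str) should render the missing value as "NaN"
    # (str(decimal.Decimal("NaN"))), not "nan" (str(np.nan)).
    print(pser.astype(str).tolist())  # ['1', 'NaN']

    # rpow: IEEE 754 and pandas define 1 ** NaN == 1, so the null case
    # needs explicit handling rather than propagating null.
    print((1 ** pd.Series([1.0, np.nan])).tolist())  # [1.0, 1.0]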

Why are the changes needed?

To follow the pandas behavior.

Does this PR introduce any user-facing change?

Yes, it corrects null value results to follow the pandas behavior.

How was this patch tested?

Unit tests covering all changes.

@Yikun (Member Author) commented Oct 18, 2021

cc @HyukjinKwon @xinrong-databricks

@Yikun Yikun changed the title [SPARK-36337][PYTHON] Support arithmetic operations of decimal(nan) series [SPARK-36231][PYTHON] Support arithmetic operations of decimal(nan) series Oct 18, 2021
@Yikun (Member Author) commented Oct 18, 2021

After this patch, we can finally close SPARK-36000.

@SparkQA commented Oct 18, 2021

Test build #144369 has finished for PR 34314 at commit a157e07.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 18, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48844/

@SparkQA commented Oct 18, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48844/

@HyukjinKwon (Member)

cc @xinrong-databricks

@xinrong-meng (Member)

FYI @ueshin

@SparkQA commented Nov 15, 2021

Test build #145217 has finished for PR 34314 at commit 6bf68b1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 15, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49687/

@SparkQA commented Nov 15, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49687/

@HyukjinKwon (Member)

Merged to master.

@HyukjinKwon (Member)

@Yikun, I am very sorry, but I realised that this patch breaks the test cases with lower pandas versions (because it requires https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.3.0.html#missing). Decimal("NaN") is not considered null in old pandas versions, and that causes a bunch of related failures.

I was preparing a follow-up with, for example, an approach like the one below:

-                    nullable=bool(col.isnull().any()),
+                    nullable=bool(col.isnull().any())
+                    # To work around https://github.com/pandas-dev/pandas/pull/39409
+                    | bool(
+                        col.map(lambda x: isinstance(x, decimal.Decimal) and math.isnan(x)).any()
+                    ),

However, the tests then fail because of different behaviours in old pandas versions.

While technically we could make a follow-up, please let me just revert it to make it easier to move forward; I hear complaints from here and there that the tests are failing.
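
For context, a minimal illustration of the pandas 1.3 behavior change behind this (plain pandas, no Spark involved):

    import decimal

    import pandas as pd

    s = pd.Series([decimal.Decimal("NaN")])

    # pandas >= 1.3 treats Decimal("NaN") as missing
    # (pandas-dev/pandas#39409); older versions report it as non-null.
    print(s.isnull().tolist())  # [True] on 1.3+, [False] on 1.1.x/1.2.x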

@Yikun (Member Author) commented Nov 22, 2021

@HyukjinKwon Thanks for your help. One more question: which pandas versions should mainly be tested and supported? Should we announce it somewhere, and then add a CI job that installs that specific pandas version as an extra check?

@HyukjinKwon (Member)

It's actually documented here: https://github.com/apache/spark/blob/master/python/setup.py#L115.
We should probably bump it up. Ideally we should test all the combinations just like other Python projects, but we can't do this due to the resource problem in GA 😢
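
For reference, a paraphrased (not verbatim) sketch of how python/setup.py pins the minimum; the exact value was later bumped by the commit referenced at the end of this thread:

    # python/setup.py (illustrative excerpt, assuming the 0.23.2 minimum
    # in place at the time; later bumped to 1.0.5)
    _minimum_pandas_version = "0.23.2"

    extras_require = {
        "sql": ["pandas>=%s" % _minimum_pandas_version],
    }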

@HyukjinKwon (Member)

Testing on 1.1.x or 1.2.x should be good enough for the fix itself.

@Yikun (Member Author) commented Nov 22, 2021

We should probably bump it up. Ideally we should test all the combinations just like other Python projects, but we can't do this due to the resource problem in GA.

@HyukjinKwon OK, thanks! That means we should test against v0.23.2 and later. I will address it soon. :)

Testing on 1.1.x or 1.2.x should be good enough for the fix itself.

OK, thanks for the reminder.

@HyukjinKwon (Member)

@Yikun, if you're stuck supporting this with old pandas versions, we can just conditionally run the tests with pandas 1.3+ for now.

@Yikun (Member Author) commented Nov 23, 2021

if you're stuck supporting this with old pandas versions

@HyukjinKwon I did some simple tests yesterday; many test cases fail with Decimal("NaN") on v1.2 or v1.1.

we can just conditionally run the tests with pandas 1.3+ for now

So the code and tests should also only support 1.3+, right?

@HyukjinKwon (Member)

Yeah, that is fine, because it already doesn't work with 1.2 and 1.1, so there is no regression.

-pdf.columns = [dtype.__name__ for dtype in dtypes] + ["decimal"]
+pdf.columns = [dtype.__name__ for dtype in dtypes] + [
+    "decimal",
+    "decimal_nan",
@Yikun (Member Author) commented on the diff above:

We can skip the decimal_nan test if the pandas version is < 1.3, and add a note here.

        # To work around https://github.com/pandas-dev/pandas/pull/39409
        if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
            sers.append(
                pd.Series([decimal.Decimal(1), decimal.Decimal(2), decimal.Decimal(np.nan)])
            )
            # Index.append returns a new Index instead of mutating in place
            pdf.columns = pdf.columns.append(pd.Index(["decimal_nan"]))
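
Alternatively, a sketch of gating at the test level with the usual unittest pattern (test_decimal_nan is a hypothetical name):

    import decimal
    import unittest
    from distutils.version import LooseVersion

    import pandas as pd

    class OpsTest(unittest.TestCase):
        @unittest.skipIf(
            LooseVersion(pd.__version__) < LooseVersion("1.3"),
            "pandas<1.3 does not treat Decimal('NaN') as missing",
        )
        def test_decimal_nan(self):
            self.assertTrue(pd.Series([decimal.Decimal("NaN")]).hasnans)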

@HyukjinKwon WDYT?

@Yikun (Member Author) commented:

#34687

I will do a complete local test before marking it ready for review.

HyukjinKwon pushed a commit that referenced this pull request Nov 30, 2021
### What changes were proposed in this pull request?
Bump minimum pandas version to 1.0.5 (or a better version)

### Why are the changes needed?
Initial discussion from [SPARK-37465](https://issues.apache.org/jira/browse/SPARK-37465) and #34314 (comment) .

### Does this PR introduce _any_ user-facing change?
Yes, bumps the pandas minimum version.

### How was this patch tested?
PySpark tests passed with pandas v1.0.5.

Closes #34717 from Yikun/pandas-min-version.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>