[SPARK-36231][PYTHON] Support arithmetic operations of decimal(nan) series #34314
Conversation
cc @HyukjinKwon @xinrong-databricks

After this patch, we can finally close SPARK-36000.

Test build #144369 has finished for PR 34314 at commit

Kubernetes integration test starting

Kubernetes integration test status failure

cc @xinrong-databricks

FYI @ueshin

Test build #145217 has finished for PR 34314 at commit

Kubernetes integration test starting

Kubernetes integration test status failure

Merged to master.
@Yikun, I am very sorry but I realised that this patch breaks the test cases with lower pandas versions (because it requires https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.3.0.html#missing). I was preparing a followup with, for example, an approach as below:

```diff
- nullable=bool(col.isnull().any()),
+ nullable=bool(col.isnull().any())
+ # To work around https://github.com/pandas-dev/pandas/pull/39409
+ | bool(
+     col.map(lambda x: isinstance(x, decimal.Decimal) and math.isnan(x)).any()
+ ),
```

However, the test then fails because of behaviour differences in old pandas versions. While we could technically make a followup, please let me just revert it to make it easier to move forward - I hear complaints from here and there that the tests are failing.
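The workaround in the diff above can be sketched as a standalone helper. This is an illustrative sketch, not the actual Spark code: the `has_null` function name and the sample Series are made up here, but the detection logic mirrors the diff, combining `isnull()` with an explicit scan for `decimal.Decimal` NaN values, which older pandas versions do not report as missing.

```python
import decimal
import math

import pandas as pd

def has_null(col: pd.Series) -> bool:
    """Return True if the column contains any null, including Decimal NaN."""
    # isnull() catches None/np.nan; the map catches decimal.Decimal("NaN"),
    # working around https://github.com/pandas-dev/pandas/pull/39409
    # (only fixed in pandas 1.3+).
    return bool(col.isnull().any()) or bool(
        col.map(lambda x: isinstance(x, decimal.Decimal) and math.isnan(x)).any()
    )

s = pd.Series([decimal.Decimal(1), decimal.Decimal("NaN")])
print(has_null(s))  # True, even on pandas versions where s.isnull() misses Decimal NaN
```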
@HyukjinKwon Thanks for your help. One more question: which pandas versions should be tested and supported? Should we document it somewhere, and then add a CI job that installs a specific pandas version as an extra check?
It's actually documented here: https://github.com/apache/spark/blob/master/python/setup.py#L115

Testing on 1.1.x or 1.2.x should be good enough for the fix itself.
@HyukjinKwon OK, thanks! That means we should test it after v0.23.2. I will address it soon. :)

OK, thanks for the reminder.
@Yikun, if you're stuck supporting this with old pandas versions, we can just conditionally run the tests with pandas 1.3+ only for now.
@HyukjinKwon I did some simple tests yesterday; many test cases fail with `Decimal("NaN")` on v1.2 or v1.1.
Should the code and tests also only support 1.3+, right?

Yeah, that is fine because it already doesn't work with 1.2 and 1.1, so there is no regression.
```diff
- pdf.columns = [dtype.__name__ for dtype in dtypes] + ["decimal"]
+ pdf.columns = [dtype.__name__ for dtype in dtypes] + [
+     "decimal",
+     "decimal_nan",
```
We can skip the decimal_nan test if the pandas version is < 1.3 and add a note here:

```python
# To work around https://github.com/pandas-dev/pandas/pull/39409
if LooseVersion(pd.__version__) >= LooseVersion("1.3.0"):
    sers.append(pd.Series([decimal.Decimal(1), decimal.Decimal(2), decimal.Decimal(np.nan)]))
    pdf.columns = list(pdf.columns) + ["decimal_nan"]
```

(Note `>=` rather than `>`, so that pandas 1.3.0 itself is included, and `pdf.columns` is rebuilt rather than mutated, since a pandas `Index` is immutable.)

@HyukjinKwon WDYT?
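The version-gating idea above can also be expressed as a test-level skip. This is a hedged sketch rather than the actual Spark test suite: the class and test names are invented, and it compares a plain `(major, minor)` tuple instead of `LooseVersion` (which the real code uses) so the example has no `distutils` dependency.

```python
import decimal
import unittest

import pandas as pd

# Assumption: released pandas versions always start with "major.minor".
PANDAS_VERSION = tuple(int(part) for part in pd.__version__.split(".")[:2])

class DecimalNanTest(unittest.TestCase):
    @unittest.skipIf(
        PANDAS_VERSION < (1, 3),
        "pandas<1.3 does not treat Decimal('NaN') as missing; "
        "see https://github.com/pandas-dev/pandas/pull/39409",
    )
    def test_decimal_nan_is_null(self):
        # On pandas 1.3+, isnull() recognises Decimal("NaN") as missing.
        ser = pd.Series([decimal.Decimal(1), decimal.Decimal("NaN")])
        self.assertTrue(bool(ser.isnull().any()))
```

Run with `python -m unittest` as usual; on pandas below 1.3 the case is reported as skipped instead of failing.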
I will do a complete local test before marking it ready for review.
### What changes were proposed in this pull request?
Bump minimum pandas version to 1.0.5 (or a better version)

### Why are the changes needed?
Initial discussion from [SPARK-37465](https://issues.apache.org/jira/browse/SPARK-37465) and #34314 (comment).

### Does this PR introduce _any_ user-facing change?
Yes, bump pandas minimum version.

### How was this patch tested?
PySpark test passed with pandas v1.0.5.

Closes #34717 from Yikun/pandas-min-version.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
What changes were proposed in this pull request?
This patch has changes as below to follow the pandas behavior:
- Cast a decimal NaN to `"NaN"` rather than `str(np.nan)` (`"nan"`), which is covered by `self.assert_eq(pser.astype(str), psser.astype(str))`.
- Fix `def test_rpow(self)` and `astype`, which is covered by `test_astype`.

This patch also moves `numeric_w_nan_pdf` into `numeric_pdf`, which means all the separate float_nan/decimal_nan test cases have been cleaned up and merged into the numeric tests.

Why are the changes needed?
Follow the pandas behavior.

Does this PR introduce any user-facing change?
Yes, this corrects the null value result to follow the pandas behavior.

How was this patch tested?
Unit tests to cover all changes.
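The `"NaN"` vs `"nan"` distinction described above comes straight from how Python stringifies the two kinds of NaN. A small check (the pandas Series at the end is illustrative; its exact output depends on the pandas version):

```python
import decimal

import numpy as np
import pandas as pd

# Decimal NaN and float NaN stringify differently in plain Python:
print(str(decimal.Decimal("NaN")))  # NaN
print(str(np.nan))                  # nan

# So casting an object-dtype Series of Decimals to str should keep the
# "NaN" spelling, which is what this patch aligns pandas-on-Spark with:
ser = pd.Series([decimal.Decimal(1), decimal.Decimal("NaN")])
print(ser.astype(str).tolist())
```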