Fix comparison operators to treat NULL as False #1029
is_log_op = any(f == getattr(spark.Column, f'__{log_op}__') for log_op in log_ops)
if is_log_op:
    scol = F.when(scol.isNull(), False).otherwise(scol)
@HyukjinKwon @ueshin (cc @itholic @charlesdong1991 )
Not sure if this is the right implementation ...
I'm not entirely sure, but maybe `scol = F.when(scol.isNull(), False).otherwise(True)` is correct?
Because when the type of `scol` is not boolean, it may raise an `AnalysisException` due to a data type mismatch between `False` and `scol`.
Ah, and I think it would be better to add a test for this if you have time 😄
Ah, it's already tested :)
@itholic Thanks! I haven't fixed the tests yet because I want to make sure first if this is the right implementation.
Actually the problem is not that simple. E.g.:
>>> pser == 0
0 True
1 False
2 False
3 False
dtype: bool
>>> pser != 0
0 False
1 True
2 True
3 True
dtype: bool
and
>>> ks.from_pandas(pser) == 0
0 True
1 False
2 False
3 None
Name: 0, dtype: object
>>> ks.from_pandas(pser) != 0
0 False
1 True
2 True
3 None
Name: 0, dtype: object
If we simply convert `None` to `False`, the `!=` case is not the same as pandas'.
We might need to add some logic for each operator one by one.
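The mismatch above can be modelled in plain Python, without Spark or pandas. This is only a sketch of the semantics under discussion, not the Koalas implementation: `sql_compare`, `pandas_like_compare`, and the `PANDAS_FILL` table are hypothetical names introduced here, with `None` standing in for a Spark `NULL` and for a pandas missing value.

```python
import operator


def sql_compare(op, left, right):
    """Model SQL semantics: any comparison involving NULL (None) yields NULL."""
    if left is None or right is None:
        return None
    return op(left, right)


# Assumed per-operator fill values that reproduce the pandas output shown above:
# every comparison against a missing value is False, except `!=`, which is True.
PANDAS_FILL = {
    operator.eq: False,
    operator.ne: True,
    operator.lt: False,
    operator.le: False,
    operator.gt: False,
    operator.ge: False,
}


def pandas_like_compare(op, left, right):
    """Replace a NULL comparison result with the operator-specific fill value."""
    result = sql_compare(op, left, right)
    return PANDAS_FILL[op] if result is None else result


values = [0, 1, 2, None]
eq_result = [pandas_like_compare(operator.eq, v, 0) for v in values]
ne_result = [pandas_like_compare(operator.ne, v, 0) for v in values]
print(eq_result)
print(ne_result)
```

A single blanket fill of `False` would get the `==` column right but would turn the last `!=` entry into `False`, which is why the fill value has to be chosen per operator.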
Looks like this is how pandas should behave.
https://github.com/pandas-dev/pandas/blob/master/pandas/tests/series/test_operators.py#L447-L452
# True | np.nan => True
exp_or1 = pd.Series([True, True, True, False], index=list("ABCD"), name="x")
tm.assert_series_equal(s1 | s2, exp_or1)
# np.nan | True => np.nan, filled with False
exp_or = pd.Series([True, True, False, False], index=list("ABCD"), name="x")
tm.assert_series_equal(s2 | s1, exp_or)
ahh, then it behaves this way on purpose in pandas. thanks for looking at source code to clear things up @harupy
Let's discuss `&` and `|` and address them, if needed, in a separate issue/PR.
I think `&` and `|` should be addressed in a separate PR.
Seems like pandas tries to follow the boolean operations rules in Python (`or` and `and`) - https://docs.python.org/3/library/stdtypes.html#boolean-operations-and-or-not
# True | np.nan => True
# np.nan | True => np.nan
Then, ^ could make sense.
However, I am still kind of lost about "np.nan, filled with False" because `bool(np.nan)` is coerced to `True`, e.g.:
>>> pd.Series([float("nan")]).astype("bool")
0 True
dtype: bool
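The Python rules referred to here can be checked directly with the standard library. A small sketch, using `float("nan")` in place of `np.nan` (an assumption made only to avoid the NumPy dependency; both are ordinary float NaN values):

```python
import math

nan = float("nan")

# `or` returns its first truthy operand, so the order of operands matters.
left_true = True or nan   # True is truthy, so it is returned immediately
left_nan = nan or True    # nan is also truthy, so nan itself is returned

print(left_true)                 # True
print(math.isnan(left_nan))      # True
print(bool(nan))                 # True: NaN is a non-zero float
```

This matches the two comments `True | np.nan => True` and `np.nan | True => np.nan`, and it also shows why "filled with False" is surprising: coercing NaN with `bool` gives `True`, not `False`.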
One of the failed doctests. It seems to be working.
https://travis-ci.com/databricks/koalas/jobs/254303993#L2298
Codecov Report
@@ Coverage Diff @@
## master #1029 +/- ##
==========================================
+ Coverage 94.92% 95.05% +0.13%
==========================================
Files 34 34
Lines 6697 6778 +81
==========================================
+ Hits 6357 6443 +86
+ Misses 340 335 -5
What about #1029 (comment)?
@HyukjinKwon @ueshin
Actually I'm not quite sure whether we should follow pandas' behavior here.
#1029 (comment) this behavior at least looks consistent with Python native comparison.
I think it's fine to fix ... for now ...
We don't need to fix this weird behavior for now?
I meant let's fix it to be consistent with pandas/Python comparison for now, since that's basically what Koalas needs to do.
@HyukjinKwon OK, which one should I address, pandas or Python?
For
Let's stick to pandas for now if they are different ... maybe we should have a config as @ueshin said, but I think it's fine for now ...
No, they are the same.
Let's stick to pandas for now.
I'm good with it. @ueshin is it fine with you in general? Maybe we can consider the SQL / configuration stuff in another PR (it will be easy to support a SQL mode anyway, since Spark already does).
Talked with Takuya about no conf for now. LGTM
Is koalas/databricks/koalas/base.py Lines 185 to 190 in e469d54
https://github.com/databricks/koalas/blob/master/databricks/koalas/base.py#L185
Yup, I think so.
Fixed it
databricks/koalas/base.py
Outdated
@@ -57,6 +57,21 @@ def wrapper(self, *args):
args = [arg._scol if isinstance(arg, IndexOpsMixin) else arg for arg in args]
scol = f(self._scol, *args)

# check if f is a logistic operator
I would change the comment and variable names too :-).
Thanks! will fix them.
Merged to master.
Thanks, @harupy
@HyukjinKwon thanks!
Remove `fillna` from `Series.between` because comparison operators fill `NULL` with a boolean value (#1029). https://github.com/databricks/koalas/blob/fe93ac1e6247d5f1ee658d90a887eacbb5ffe655/databricks/koalas/series.py#L740-L744
elif f == spark.Column.__and__:
    scol = F.when(scol.isNull(), False).otherwise(scol)

return self._with_new_scol(scol)
else:
# Different DataFrame anchors
Seems like we should apply the updates for this branch as well?
>>> ks.options.compute.ops_on_diff_frames = True
>>> s1 = pd.Series([True, False, True], index=list("ABC"), name="x")
>>> s2 = pd.Series([True, True, False], index=list("ABD"), name="x")
>>> s1
A True
B False
C True
Name: x, dtype: bool
>>> s2
A True
B True
D False
Name: x, dtype: bool
>>> s1 | s2
A True
B True
C True
D False
Name: x, dtype: bool
>>> (ks.from_pandas(s1) | ks.from_pandas(s2)).sort_index()
A True
B True
C True
D None
Name: x, dtype: object
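The difference between the two results can be reproduced with a plain-Python model. This is a sketch under stated assumptions, not Koalas code: series are modelled as dicts, `align` and `sql_or` are hypothetical helpers, aligning on different indexes is assumed to behave like an outer join that leaves `None` for missing labels, and `sql_or` follows SQL three-valued logic.

```python
def sql_or(a, b):
    """Three-valued OR: True wins; otherwise a NULL operand makes the result NULL."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


def align(left, right):
    """Outer-join two label->value mappings; missing labels become None."""
    labels = sorted(set(left) | set(right))
    return {label: (left.get(label), right.get(label)) for label in labels}


s1 = {"A": True, "B": False, "C": True}
s2 = {"A": True, "B": True, "D": False}

# Raw result, as in the Koalas output above: label D stays NULL.
raw = {label: sql_or(a, b) for label, (a, b) in align(s1, s2).items()}

# Filling NULL with False afterwards reproduces the pandas output.
filled = {label: False if value is None else value for label, value in raw.items()}

print(raw)
print(filled)
```

In this model label `C` is already `True` before filling, because `True OR NULL` is `True`; only label `D` (`NULL OR False`) needs the fill, which is the case the different-anchor branch was missing.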
@ueshin Thank you for catching this. I'm working on this.
As @ueshin pointed out in #1029 (comment), the fix #1029 didn't cover the case below. This PR fixes it.
```python
>>> ks.options.compute.ops_on_diff_frames = True
>>> s1 = pd.Series([True, False, True], index=list("ABC"), name="x")
>>> s2 = pd.Series([True, True, False], index=list("ABD"), name="x")
>>> s1
A True
B False
C True
Name: x, dtype: bool
>>> s2
A True
B True
D False
Name: x, dtype: bool
>>> s1 | s2
A True
B True
C True
D False
Name: x, dtype: bool
>>> (ks.from_pandas(s1) | ks.from_pandas(s2)).sort_index()
A True
B True
C True
D None
Name: x, dtype: object
```
This PR proposes to:
pandas
pandas treats NULL as False
Koalas
Koalas currently treats NULL as NULL