Fix comparison operators to treat NULL as False #1029
Changes from 7 commits
6da920c
7e51add
0b9229e
e27d6aa
da80958
e1b93fb
473a1a8
93e6505
880c4dd
000d3fd
@@ -57,6 +57,21 @@ def wrapper(self, *args):
    args = [arg._scol if isinstance(arg, IndexOpsMixin) else arg for arg in args]
    scol = f(self._scol, *args)

    # check if f is a logistic operator
    log_ops = ['eq', 'ne', 'lt', 'le', 'ge', 'gt']
    is_log_op = any(f == getattr(spark.Column, '__{}__'.format(log_op))
                    for log_op in log_ops)

    if is_log_op:
        filler = f == spark.Column.__ne__
        scol = F.when(scol.isNull(), filler).otherwise(scol)

    elif f == spark.Column.__or__:
        scol = F.when(self._scol.isNull() | scol.isNull(), False).otherwise(scol)

    elif f == spark.Column.__and__:
        scol = F.when(scol.isNull(), False).otherwise(scol)

@HyukjinKwon @ueshin (cc @itholic @charlesdong1991 )

i'm not pretty sure, but maybe is Because when type of

@itholic Thanks! I haven't fixed the tests yet because I want to make sure first if this is the right implementation.

Actually the problem is not that simple. E.g.:

>>> pser == 0
0 True
1 False
2 False
3 False
dtype: bool
>>> pser != 0
0 False
1 True
2 True
3 True
dtype: bool

and

>>> ks.from_pandas(pser) == 0
0 True
1 False
2 False
3 None
Name: 0, dtype: object
>>> ks.from_pandas(pser) != 0
0 False
1 True
2 True
3 None
Name: 0, dtype: object

If we simply convert We might need to add some logic one-by-one.

Looks like this is how pandas should behave. https://github.com/pandas-dev/pandas/blob/master/pandas/tests/series/test_operators.py#L447-L452

# True | np.nan => True
exp_or1 = pd.Series([True, True, True, False], index=list("ABCD"), name="x")
tm.assert_series_equal(s1 | s2, exp_or1)
# np.nan | True => np.nan, filled with False
exp_or = pd.Series([True, True, False, False], index=list("ABCD"), name="x")
tm.assert_series_equal(s2 | s1, exp_or)

Ahh, then it behaves this way on purpose in pandas. Thanks for looking at the source code to clear things up, @harupy.

Let's discuss

I think

Seems like pandas tries to follow the boolean operations rules in Python (
Then, ^ could make sense. However, I am still kind of lost about "np.nan, filled with False" because

>>> pd.Series([float("nan")]).astype("bool")
0 True
dtype: bool
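
The remark about Python's boolean rules can be checked directly. This is a minimal sketch, assuming the comment refers to Python's `or` operator, which returns its first truthy operand. NaN is truthy, which is also why casting it to `bool` gives `True`.

```python
import math

nan = float("nan")

# `or` returns the first truthy operand, and NaN counts as truthy.
left_true = True or nan  # the left operand is truthy, so the result is True
left_nan = nan or True   # the left operand (NaN) is truthy, so the result is NaN

print(left_true)             # True
print(math.isnan(left_nan))  # True
print(bool(nan))             # True
```

So `True | np.nan => True` and `np.nan | True => np.nan` match Python's `or`, while filling the resulting NaN with `False` is a separate step that a plain `bool` cast does not reproduce.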

    return self._with_new_scol(scol)
else:
    # Different DataFrame anchors

Seems like we should apply the updates for this branch as well?

>>> ks.options.compute.ops_on_diff_frames = True
>>> s1 = pd.Series([True, False, True], index=list("ABC"), name="x")
>>> s2 = pd.Series([True, True, False], index=list("ABD"), name="x")
>>> s1
A True
B False
C True
Name: x, dtype: bool
>>> s2
A True
B True
D False
Name: x, dtype: bool
>>> s1 | s2
A True
B True
C True
D False
Name: x, dtype: bool
>>> (ks.from_pandas(s1) | ks.from_pandas(s2)).sort_index()
A True
B True
C True
D None
Name: x, dtype: object

@ueshin Thank you for catching this. I'm working on this.
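
To make the different-anchor case concrete, here is a plain-Python sketch of why `D` comes out as `None` and how the fill rule from the same-anchor branch would change it. It assumes an outer join on the index that leaves `None` for missing entries, and SQL three-valued `OR` (`TRUE OR NULL` is `TRUE`, `FALSE OR NULL` is `NULL`). The helpers `align`, `sql_or` and `filled_or` are hypothetical names, not Koalas APIs.

```python
def sql_or(left, right):
    """Three-valued OR as in SQL; None stands in for NULL."""
    if left is True or right is True:
        return True
    if left is None or right is None:
        return None
    return False


def filled_or(left, right):
    """Apply the hunk's rule: a NULL left operand or a NULL result becomes False."""
    result = sql_or(left, right)
    if left is None or result is None:
        return False
    return result


def align(left, right):
    """Outer-join two index -> value mappings; missing entries become None."""
    index = sorted(set(left) | set(right))
    return index, [left.get(k) for k in index], [right.get(k) for k in index]


s1 = {"A": True, "B": False, "C": True}
s2 = {"A": True, "B": True, "D": False}

index, lhs, rhs = align(s1, s2)
unfilled = [sql_or(a, b) for a, b in zip(lhs, rhs)]
filled = [filled_or(a, b) for a, b in zip(lhs, rhs)]

print(index)     # ['A', 'B', 'C', 'D']
print(unfilled)  # [True, True, True, None]
print(filled)    # [True, True, True, False]
```

Under these assumptions the unfilled result matches the Koalas output above and the filled result matches the pandas output.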
I would change the comment and variable names too :-).
Thanks! will fix them.