Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix comparison operators to treat NULL as False #1029

Merged
merged 10 commits into from Nov 22, 2019

Conversation

harupy
Copy link
Contributor

@harupy harupy commented Nov 8, 2019

This PR proposes to:

pandas

pandas treats NULL as False

>>> pser
0    0.0
1    1.0
2    2.0
3    NaN
dtype: float64

>>> pser == 0
0     True
1    False
2    False
3    False  <- bool
dtype: bool

Koalas

Koalas currently treats NULL as NULL

>>> kser
0    0.0
1    1.0
2    2.0
3    NaN
Name: 0, dtype: float64

# CURRENT
>>> kser == 0
0     True
1    False
2    False
3     None  <- not bool
Name: 0, dtype: object

# PROPOSED
>>> kser == 0
0     True
1    False
2    False
3    False  <- bool
Name: 0, dtype: bool

is_log_op = any(f == getattr(spark.Column, f'__{log_op}__') for log_op in log_ops)
if is_log_op:
scol = F.when(scol.isNull(), False).otherwise(scol)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HyukjinKwon @ueshin (cc @itholic @charlesdong1991 )
Not sure if this is the right implementation ...

Copy link
Contributor

@itholic itholic Nov 8, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm not pretty sure, but maybe is scol = F.when(scol.isNull(), False).otherwise(True) correct?

Because when type of scol is not boolean, it may raises AnalysisException due to data type mismatch between False and scol

Copy link
Contributor

@itholic itholic Nov 8, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, and i think maybe it should be better to make test of this if you available 😄
ah it's already tested :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@itholic Thanks! I haven't fixed the tests yet because I want to make sure first if this is the right implementation.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually the problem is not that simple. E.g.,:

>>> pser == 0
0     True
1    False
2    False
3    False
dtype: bool
>>> pser != 0
0    False
1     True
2     True
3     True
dtype: bool

and

>>> ks.from_pandas(pser) == 0
0     True
1    False
2    False
3     None
Name: 0, dtype: object
>>> ks.from_pandas(pser) != 0
0    False
1     True
2     True
3     None
Name: 0, dtype: object

If we simply convert None to False, the != case is not the same as pandas'.

We might need to add some logic one-by-one.

Copy link
Contributor Author

@harupy harupy Nov 8, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ueshin @charlesdong1991

Looks like this is how pandas should behave.

https://github.com/pandas-dev/pandas/blob/master/pandas/tests/series/test_operators.py#L447-L452

# True | np.nan => True
exp_or1 = pd.Series([True, True, True, False], index=list("ABCD"), name="x")
tm.assert_series_equal(s1 | s2, exp_or1)
# np.nan | True => np.nan, filled with False
exp_or = pd.Series([True, True, False, False], index=list("ABCD"), name="x")
tm.assert_series_equal(s2 | s1, exp_or)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh, then it behaves this way on purpose in pandas. thanks for looking at source code to clear things up @harupy

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's discuss & and | and address if needed in a separate issue/PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think & and | should be addressed in a separate PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like pandas tries to follow the boolean operations rules in Python (or and and) - https://docs.python.org/3/library/stdtypes.html#boolean-operations-and-or-not

# True | np.nan => True
# np.nan | True => np.nan

Then, ^ could make sense.

However, I am still kind of lost about "np.nan, filled with False" because bool(np.nan) is coerced to True, e.g.)

>>> pd.Series([float("nan")]).astype("bool")
0    True
dtype: bool

@harupy
Copy link
Contributor Author

harupy commented Nov 8, 2019

One of the failed doctests. It seems to be working.

_________________ [doctest] databricks.koalas.series.Series.eq _________________
532         ...                   index=['a', 'b', 'c', 'd'], columns=['a', 'b'])
533 
534         >>> df.a == 1
535         a     True
536         b    False
537         c    False
538         d    False
539         Name: a, dtype: bool
540 
541         >>> df.b.eq(1)
Differences (unified diff with -expected +actual):
    @@ -1,5 +1,5 @@
    -a    True
    -b    None
    -c    True
    -d    None
    -Name: b, dtype: object
    +a     True
    +b    False
    +c     True
    +d    False
    +Name: b, dtype: bool

https://travis-ci.com/databricks/koalas/jobs/254303993#L2298

@codecov-io
Copy link

codecov-io commented Nov 8, 2019

Codecov Report

Merging #1029 into master will increase coverage by 0.13%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1029      +/-   ##
==========================================
+ Coverage   94.92%   95.05%   +0.13%     
==========================================
  Files          34       34              
  Lines        6697     6778      +81     
==========================================
+ Hits         6357     6443      +86     
+ Misses        340      335       -5
Impacted Files Coverage Δ
databricks/koalas/series.py 96.41% <ø> (ø) ⬆️
databricks/koalas/frame.py 96.5% <ø> (+0.16%) ⬆️
databricks/koalas/base.py 95.23% <100%> (+0.33%) ⬆️
databricks/koalas/__init__.py 85.1% <0%> (-2.13%) ⬇️
databricks/conftest.py 96% <0%> (-2%) ⬇️
databricks/koalas/indexes.py 95.91% <0%> (-0.5%) ⬇️
databricks/koalas/generic.py 95.26% <0%> (-0.48%) ⬇️
databricks/koalas/groupby.py 91.18% <0%> (-0.21%) ⬇️
databricks/koalas/missing/indexes.py 100% <0%> (ø) ⬆️
... and 7 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 42758e6...000d3fd. Read the comment docs.

@harupy harupy marked this pull request as ready for review November 8, 2019 08:48
@ueshin
Copy link
Collaborator

ueshin commented Nov 8, 2019

What about #1029 (comment)?
Should we also address here? cc @HyukjinKwon

@harupy
Copy link
Contributor Author

harupy commented Nov 11, 2019

@HyukjinKwon @ueshin
This change makes koalas behave differently from pyspark. Is that really ok?

@ueshin
Copy link
Collaborator

ueshin commented Nov 12, 2019

Actually I'm not quite sure whether we should follow pandas' behavior here.
How about adding a config to switch the null semantics between pandas and SQL.
cc @HyukjinKwon

@HyukjinKwon
Copy link
Member

#1029 (comment) this behaivour at least looks consistent with Python native comparison.

@HyukjinKwon
Copy link
Member

I think it's fine to fix ... for now ...

@harupy
Copy link
Contributor Author

harupy commented Nov 13, 2019

@HyukjinKwon

I think it's fine to fix ... for now ...

We don't need to fix this weird behavior for now?

@HyukjinKwon
Copy link
Member

I meant let's fix to be consistent with pandas/Python comparison for now since that's basically what Koalas needs to do.

@harupy
Copy link
Contributor Author

harupy commented Nov 14, 2019

@HyukjinKwon ok, which one should I address, pandas or python?

@HyukjinKwon
Copy link
Member

For ['eq', 'ne', 'lt', 'le', 'ge', 'gt'], my impression was that both are same. Are they different?

@HyukjinKwon
Copy link
Member

Let's stick to pandas for now if they are different ... maybe we should have a config as @ueshin said but I think it's fine for now ...

@harupy
Copy link
Contributor Author

harupy commented Nov 16, 2019

@HyukjinKwon

For ['eq', 'ne', 'lt', 'le', 'ge', 'gt'], my impression was that both are same. Are they different?

No, they are same.

@HyukjinKwon
Copy link
Member

Let's stick to pandas for now.

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm good with it. @ueshin is it fine to you in general? maybe we can consider about SQL / configuration stuff in another PR (since it will be easy to support SQL mode anyway since Spark already does)

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Talked with Takuya about no conf for now. LGTM

@harupy
Copy link
Contributor Author

harupy commented Nov 20, 2019

Is logistic operators correct name? It should be comparison operators, shouldn't it?

# logistic operators
__eq__ = _column_op(spark.Column.__eq__)
__ne__ = _column_op(spark.Column.__ne__)
__lt__ = _column_op(spark.Column.__lt__)
__le__ = _column_op(spark.Column.__le__)
__ge__ = _column_op(spark.Column.__ge__)

https://github.com/databricks/koalas/blob/master/databricks/koalas/base.py#L185

@HyukjinKwon
Copy link
Member

Yup, I think so.

@harupy
Copy link
Contributor Author

harupy commented Nov 20, 2019

Fixed it

@HyukjinKwon HyukjinKwon changed the title Fix logistic operators to treat NULL as False Fix comparison operators to treat NULL as False Nov 20, 2019
@@ -57,6 +57,21 @@ def wrapper(self, *args):
args = [arg._scol if isinstance(arg, IndexOpsMixin) else arg for arg in args]
scol = f(self._scol, *args)

# check if f is a logistic operator
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would change the comment and variable names too :-).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! will fix them.

@softagram-bot
Copy link

Softagram Impact Report for pull/1029 (head commit: 000d3fd)

⭐ Change Overview

Showing the changed files, dependency changes and the impact - click for full size
(Open in Softagram Desktop for full details)

💡 Insights

  • Co-change Alert: You modified frame.py. Often test_dataframe.py (koalas/tests) is modified at the same time.

📄 Full report

Impact Report explained. Give feedback on this report to support@softagram.com

@HyukjinKwon
Copy link
Member

Merged to master.

@HyukjinKwon HyukjinKwon merged commit d8dfa71 into databricks:master Nov 22, 2019
@HyukjinKwon
Copy link
Member

Thanks, @harupy

@harupy harupy deleted the fix-log-operator branch November 22, 2019 01:43
@harupy
Copy link
Contributor Author

harupy commented Nov 22, 2019

@HyukjinKwon thanks!

HyukjinKwon pushed a commit that referenced this pull request Nov 23, 2019

elif f == spark.Column.__and__:
scol = F.when(scol.isNull(), False).otherwise(scol)

return self._with_new_scol(scol)
else:
# Different DataFrame anchors
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like we should apply the updates for this branch as well?

>>> ks.options.compute.ops_on_diff_frames = True
>>> s1 = pd.Series([True, False, True], index=list("ABC"), name="x")
>>> s2 = pd.Series([True, True, False], index=list("ABD"), name="x")
>>> s1
A     True
B    False
C     True
Name: x, dtype: bool
>>> s2
A     True
B     True
D    False
Name: x, dtype: bool
>>> s1 | s2
A     True
B     True
C     True
D    False
Name: x, dtype: bool
>>> (ks.from_pandas(s1) | ks.from_pandas(s2)).sort_index()
A    True
B    True
C    True
D    None
Name: x, dtype: object

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ueshin Thank you for catching this. I'm working on this.

HyukjinKwon pushed a commit that referenced this pull request Dec 2, 2019
As @ueshin pointed out in #1029 (comment), the fix #1029 didn't cover the case below. This PR fixes it.

```python
>>> ks.options.compute.ops_on_diff_frames = True
>>> s1 = pd.Series([True, False, True], index=list("ABC"), name="x")
>>> s2 = pd.Series([True, True, False], index=list("ABD"), name="x")
>>> s1
A     True
B    False
C     True
Name: x, dtype: bool
>>> s2
A     True
B     True
D    False
Name: x, dtype: bool
>>> s1 | s2
A     True
B     True
C     True
D    False
Name: x, dtype: bool
>>> (ks.from_pandas(s1) | ks.from_pandas(s2)).sort_index()
A    True
B    True
C    True
D    None
Name: x, dtype: object
```
rising-star92 added a commit to rising-star92/databricks-koalas that referenced this pull request Jan 27, 2023
As @ueshin pointed out in databricks/koalas#1029 (comment), the fix #1029 didn't cover the case below. This PR fixes it.

```python
>>> ks.options.compute.ops_on_diff_frames = True
>>> s1 = pd.Series([True, False, True], index=list("ABC"), name="x")
>>> s2 = pd.Series([True, True, False], index=list("ABD"), name="x")
>>> s1
A     True
B    False
C     True
Name: x, dtype: bool
>>> s2
A     True
B     True
D    False
Name: x, dtype: bool
>>> s1 | s2
A     True
B     True
C     True
D    False
Name: x, dtype: bool
>>> (ks.from_pandas(s1) | ks.from_pandas(s2)).sort_index()
A    True
B    True
C    True
D    None
Name: x, dtype: object
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Applying a logistic operator on NaN returns None instead of Bool
7 participants