Fix comparison operators to treat NULL as False #1029

harupy · 2019-11-08T03:43:40Z

This PR proposes to:

Fix column comparison operators to treat NULL as False
Resolve Applying a logistic operator on NaN returns None instead of Bool #999.

pandas

pandas treats NULL as False

>>> pser
0    0.0
1    1.0
2    2.0
3    NaN
dtype: float64

>>> pser == 0
0     True
1    False
2    False
3    False  <- bool
dtype: bool

Koalas

Koalas currently treats NULL as NULL

>>> kser
0    0.0
1    1.0
2    2.0
3    NaN
Name: 0, dtype: float64

# CURRENT
>>> kser == 0
0     True
1    False
2    False
3     None  <- not bool
Name: 0, dtype: object

# PROPOSED
>>> kser == 0
0     True
1    False
2    False
3    False  <- bool
Name: 0, dtype: bool

harupy · 2019-11-08T03:49:35Z

databricks/koalas/base.py

+            is_log_op = any(f == getattr(spark.Column, f'__{log_op}__') for log_op in log_ops)
+            if is_log_op:
+                scol = F.when(scol.isNull(), False).otherwise(scol)
+


@HyukjinKwon @ueshin (cc @itholic @charlesdong1991 )
Not sure if this is the right implementation ...

i'm not pretty sure, but maybe is scol = F.when(scol.isNull(), False).otherwise(True) correct?

Because when type of scol is not boolean, it may raises AnalysisException due to data type mismatch between False and scol

~~Ah, and i think maybe it should be better to make test of this if you available 😄~~
ah it's already tested :)

@itholic Thanks! I haven't fixed the tests yet because I want to make sure first if this is the right implementation.

Actually the problem is not that simple. E.g.,:

>>> pser == 0 0 True 1 False 2 False 3 False dtype: bool >>> pser != 0 0 False 1 True 2 True 3 True dtype: bool

and

>>> ks.from_pandas(pser) == 0 0 True 1 False 2 False 3 None Name: 0, dtype: object >>> ks.from_pandas(pser) != 0 0 False 1 True 2 True 3 None Name: 0, dtype: object

If we simply convert None to False, the != case is not the same as pandas'.

We might need to add some logic one-by-one.

@ueshin @charlesdong1991

Looks like this is how pandas should behave.

https://github.com/pandas-dev/pandas/blob/master/pandas/tests/series/test_operators.py#L447-L452

# True | np.nan => True exp_or1 = pd.Series([True, True, True, False], index=list("ABCD"), name="x") tm.assert_series_equal(s1 | s2, exp_or1) # np.nan | True => np.nan, filled with False exp_or = pd.Series([True, True, False, False], index=list("ABCD"), name="x") tm.assert_series_equal(s2 | s1, exp_or)

ahh, then it behaves this way on purpose in pandas. thanks for looking at source code to clear things up @harupy

Let's discuss & and | and address if needed in a separate issue/PR.

I think & and | should be addressed in a separate PR.

Seems like pandas tries to follow the boolean operations rules in Python (or and and) - https://docs.python.org/3/library/stdtypes.html#boolean-operations-and-or-not

# True | np.nan => True # np.nan | True => np.nan

Then, ^ could make sense.

However, I am still kind of lost about "np.nan, filled with False" because bool(np.nan) is coerced to True, e.g.)

>>> pd.Series([float("nan")]).astype("bool") 0 True dtype: bool

harupy · 2019-11-08T04:30:56Z

One of the failed doctests. It seems to be working.

_________________ [doctest] databricks.koalas.series.Series.eq _________________
532         ...                   index=['a', 'b', 'c', 'd'], columns=['a', 'b'])
533 
534         >>> df.a == 1
535         a     True
536         b    False
537         c    False
538         d    False
539         Name: a, dtype: bool
540 
541         >>> df.b.eq(1)
Differences (unified diff with -expected +actual):
    @@ -1,5 +1,5 @@
    -a    True
    -b    None
    -c    True
    -d    None
    -Name: b, dtype: object
    +a     True
    +b    False
    +c     True
    +d    False
    +Name: b, dtype: bool

https://travis-ci.com/databricks/koalas/jobs/254303993#L2298

codecov-io · 2019-11-08T07:44:05Z

Codecov Report

Merging #1029 into master will increase coverage by 0.13%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #1029      +/-   ##
==========================================
+ Coverage   94.92%   95.05%   +0.13%     
==========================================
  Files          34       34              
  Lines        6697     6778      +81     
==========================================
+ Hits         6357     6443      +86     
+ Misses        340      335       -5

Impacted Files	Coverage Δ
databricks/koalas/series.py	`96.41% <ø> (ø)`	⬆️
databricks/koalas/frame.py	`96.5% <ø> (+0.16%)`	⬆️
databricks/koalas/base.py	`95.23% <100%> (+0.33%)`	⬆️
databricks/koalas/__init__.py	`85.1% <0%> (-2.13%)`	⬇️
databricks/conftest.py	`96% <0%> (-2%)`	⬇️
databricks/koalas/indexes.py	`95.91% <0%> (-0.5%)`	⬇️
databricks/koalas/generic.py	`95.26% <0%> (-0.48%)`	⬇️
databricks/koalas/groupby.py	`91.18% <0%> (-0.21%)`	⬇️
databricks/koalas/missing/indexes.py	`100% <0%> (ø)`	⬆️
... and 7 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 42758e6...000d3fd. Read the comment docs.

ueshin · 2019-11-08T19:19:01Z

What about #1029 (comment)?
Should we also address here? cc @HyukjinKwon

harupy · 2019-11-11T01:38:41Z

@HyukjinKwon @ueshin
This change makes koalas behave differently from pyspark. Is that really ok?

ueshin · 2019-11-12T19:24:12Z

Actually I'm not quite sure whether we should follow pandas' behavior here.
How about adding a config to switch the null semantics between pandas and SQL.
cc @HyukjinKwon

HyukjinKwon · 2019-11-13T11:03:01Z

#1029 (comment) this behaivour at least looks consistent with Python native comparison.

HyukjinKwon · 2019-11-13T11:03:18Z

I think it's fine to fix ... for now ...

harupy · 2019-11-13T16:28:39Z

@HyukjinKwon

I think it's fine to fix ... for now ...

We don't need to fix this weird behavior for now?

HyukjinKwon · 2019-11-14T00:35:53Z

I meant let's fix to be consistent with pandas/Python comparison for now since that's basically what Koalas needs to do.

harupy · 2019-11-14T13:07:25Z

@HyukjinKwon ok, which one should I address, pandas or python?

HyukjinKwon · 2019-11-14T16:46:10Z

For ['eq', 'ne', 'lt', 'le', 'ge', 'gt'], my impression was that both are same. Are they different?

HyukjinKwon · 2019-11-14T16:47:53Z

Let's stick to pandas for now if they are different ... maybe we should have a config as @ueshin said but I think it's fine for now ...

harupy · 2019-11-16T02:59:49Z

@HyukjinKwon

For ['eq', 'ne', 'lt', 'le', 'ge', 'gt'], my impression was that both are same. Are they different?

No, they are same.

HyukjinKwon · 2019-11-18T07:15:39Z

Let's stick to pandas for now.

HyukjinKwon

I'm good with it. @ueshin is it fine to you in general? maybe we can consider about SQL / configuration stuff in another PR (since it will be easy to support SQL mode anyway since Spark already does)

HyukjinKwon

Talked with Takuya about no conf for now. LGTM

harupy · 2019-11-20T04:22:05Z

Is logistic operators correct name? It should be comparison operators, shouldn't it?

koalas/databricks/koalas/base.py

Lines 185 to 190 in e469d54

    
           # logistic operators 
        
           __eq__ = _column_op(spark.Column.__eq__) 
        
           __ne__ = _column_op(spark.Column.__ne__) 
        
           __lt__ = _column_op(spark.Column.__lt__) 
        
           __le__ = _column_op(spark.Column.__le__) 
        
           __ge__ = _column_op(spark.Column.__ge__)

https://github.com/databricks/koalas/blob/master/databricks/koalas/base.py#L185

HyukjinKwon · 2019-11-20T04:25:15Z

Yup, I think so.

harupy · 2019-11-20T04:27:51Z

Fixed it

HyukjinKwon · 2019-11-20T04:34:13Z

databricks/koalas/base.py

@@ -57,6 +57,21 @@ def wrapper(self, *args):
            args = [arg._scol if isinstance(arg, IndexOpsMixin) else arg for arg in args]
            scol = f(self._scol, *args)

+            # check if f is a logistic operator


I would change the comment and variable names too :-).

Thanks! will fix them.

softagram-bot · 2019-11-20T04:47:46Z

Softagram Impact Report for pull/1029 (head commit: `000d3fd`)

⭐ Change Overview

(Open in Softagram Desktop for full details)

💡 Insights

Co-change Alert: You modified frame.py. Often test_dataframe.py (koalas/tests) is modified at the same time.

📄 Full report

Permalink: Full report for pull/1029

Impact Report explained. Give feedback on this report to support@softagram.com

HyukjinKwon · 2019-11-22T01:43:09Z

Merged to master.

HyukjinKwon · 2019-11-22T01:43:28Z

Thanks, @harupy

harupy · 2019-11-22T01:43:36Z

@HyukjinKwon thanks!

Remove `fillna` from `Series.between` because comparison operators fill `NULL` with a boolean value (#1029). https://github.com/databricks/koalas/blob/fe93ac1e6247d5f1ee658d90a887eacbb5ffe655/databricks/koalas/series.py#L740-L744

ueshin · 2019-11-25T21:07:40Z

databricks/koalas/base.py

+
+            elif f == spark.Column.__and__:
+                scol = F.when(scol.isNull(), False).otherwise(scol)
+
            return self._with_new_scol(scol)
        else:
            # Different DataFrame anchors


Seems like we should apply the updates for this branch as well?

>>> ks.options.compute.ops_on_diff_frames = True >>> s1 = pd.Series([True, False, True], index=list("ABC"), name="x") >>> s2 = pd.Series([True, True, False], index=list("ABD"), name="x") >>> s1 A True B False C True Name: x, dtype: bool >>> s2 A True B True D False Name: x, dtype: bool >>> s1 | s2 A True B True C True D False Name: x, dtype: bool >>> (ks.from_pandas(s1) | ks.from_pandas(s2)).sort_index() A True B True C True D None Name: x, dtype: object

@ueshin Thank you for catching this. I'm working on this.

@ueshin

As @ueshin pointed out in #1029 (comment), the fix #1029 didn't cover the case below. This PR fixes it. ```python >>> ks.options.compute.ops_on_diff_frames = True >>> s1 = pd.Series([True, False, True], index=list("ABC"), name="x") >>> s2 = pd.Series([True, True, False], index=list("ABD"), name="x") >>> s1 A True B False C True Name: x, dtype: bool >>> s2 A True B True D False Name: x, dtype: bool >>> s1 | s2 A True B True C True D False Name: x, dtype: bool >>> (ks.from_pandas(s1) | ks.from_pandas(s2)).sort_index() A True B True C True D None Name: x, dtype: object ```

@ueshin

As @ueshin pointed out in databricks/koalas#1029 (comment), the fix #1029 didn't cover the case below. This PR fixes it. ```python >>> ks.options.compute.ops_on_diff_frames = True >>> s1 = pd.Series([True, False, True], index=list("ABC"), name="x") >>> s2 = pd.Series([True, True, False], index=list("ABD"), name="x") >>> s1 A True B False C True Name: x, dtype: bool >>> s2 A True B True D False Name: x, dtype: bool >>> s1 | s2 A True B True C True D False Name: x, dtype: bool >>> (ks.from_pandas(s1) | ks.from_pandas(s2)).sort_index() A True B True C True D None Name: x, dtype: object ```

Fix _column_op

6da920c

harupy commented Nov 8, 2019

View reviewed changes

harupy added 3 commits November 8, 2019 15:21

Handle not-equal operator

7e51add

Fix

0b9229e

Fix doctests

e27d6aa

Remove f-string

da80958

harupy marked this pull request as ready for review November 8, 2019 08:48

harupy added 2 commits November 20, 2019 02:43

Fix for __or__ and __and__

e1b93fb

Fix

473a1a8

harupy requested review from HyukjinKwon and ueshin November 20, 2019 01:23

HyukjinKwon reviewed Nov 20, 2019

View reviewed changes

HyukjinKwon approved these changes Nov 20, 2019

View reviewed changes

Fix comments

93e6505

HyukjinKwon changed the title ~~Fix logistic operators to treat NULL as False~~ Fix comparison operators to treat NULL as False Nov 20, 2019

HyukjinKwon reviewed Nov 20, 2019

View reviewed changes

harupy added 2 commits November 20, 2019 13:37

Fix comments

880c4dd

Fix indent

000d3fd

HyukjinKwon merged commit d8dfa71 into databricks:master Nov 22, 2019

harupy deleted the fix-log-operator branch November 22, 2019 01:43

HyukjinKwon mentioned this pull request Nov 22, 2019

Remove fillna from Series.between #1068

Merged

ueshin reviewed Nov 25, 2019

View reviewed changes

harupy mentioned this pull request Nov 28, 2019

Add Null handling for different frames #1083

Merged

Fix comparison operators to treat NULL as False #1029

Fix comparison operators to treat NULL as False #1029

Conversation

harupy commented Nov 8, 2019 • edited by HyukjinKwon

pandas

Koalas

Choose a reason for hiding this comment

itholic Nov 8, 2019 • edited

Choose a reason for hiding this comment

itholic Nov 8, 2019 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

harupy Nov 8, 2019 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

harupy commented Nov 8, 2019 • edited

codecov-io commented Nov 8, 2019 • edited

Codecov Report

ueshin commented Nov 8, 2019

harupy commented Nov 11, 2019

ueshin commented Nov 12, 2019 • edited

HyukjinKwon commented Nov 13, 2019

HyukjinKwon commented Nov 13, 2019

harupy commented Nov 13, 2019

HyukjinKwon commented Nov 14, 2019

harupy commented Nov 14, 2019 • edited

HyukjinKwon commented Nov 14, 2019

HyukjinKwon commented Nov 14, 2019

harupy commented Nov 16, 2019

HyukjinKwon commented Nov 18, 2019

HyukjinKwon left a comment

Choose a reason for hiding this comment

HyukjinKwon left a comment

Choose a reason for hiding this comment

harupy commented Nov 20, 2019

HyukjinKwon commented Nov 20, 2019

harupy commented Nov 20, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

softagram-bot commented Nov 20, 2019

Softagram Impact Report for pull/1029 (head commit: 000d3fd)

⭐ Change Overview

💡 Insights

📄 Full report

HyukjinKwon commented Nov 22, 2019

HyukjinKwon commented Nov 22, 2019

harupy commented Nov 22, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

harupy commented Nov 8, 2019 •

edited by HyukjinKwon

itholic Nov 8, 2019 •

edited

itholic Nov 8, 2019 •

edited

harupy Nov 8, 2019 •

edited

harupy commented Nov 8, 2019 •

edited

codecov-io commented Nov 8, 2019 •

edited

ueshin commented Nov 12, 2019 •

edited

harupy commented Nov 14, 2019 •

edited

Softagram Impact Report for pull/1029 (head commit: `000d3fd`)