Moment: handle all elements being masked by gjoseph92 · Pull Request #5339 · dask/dask

gjoseph92 · 2019-08-29T01:06:01Z

This almost works, but still fails for std, which returns 1-element array where the element is masked, instead of the MaskedConstant it should.

The issue is that np.sqrt doesn't return a masked constant, whereas np.ma.sqrt does:

In [1]: import numpy as np                                                                                                                                                                                                                                                                                                                                                                                                                                       

In [2]: np.sqrt(np.ma.masked)                                                                                                                                                                                                                                                                                                                                                                                                                                    
Out[2]: 
masked_array(data=--,
             mask=True,
       fill_value=1e+20,
            dtype=float64)

In [3]: np.ma.sqrt(np.ma.masked)                                                                                                                                                                                                                                                                                                                                                                                                                                 
Out[3]: masked

da.std calls sqrt, which is a dask ufunc wrapping np.sqrt.

Replacing the sqrt ufunc with the masked version seems like a potentially significant change around the edge cases. For example, they handle NaN differently:

In [1]: import numpy as np                                                                                                                                                                                                                                                                                                                                                                                                                                       

In [2]: np.ma.sqrt(np.nan)                                                                                                                                                                                                                                                                                                                                                                                                                                       
Out[2]: masked

In [3]: np.sqrt(np.nan)                                                                                                                                                                                                                                                                                                                                                                                                                                          
Out[3]: nan

I tried this in 45cef9a, but it doesn't seem like a good idea. (And breaks other tests due to the above discrepancy).

So the question is: how to write std in such a way that if the variance is a MaskedConstant, we just return the MaskedConstant without passing it to sqrt? (Or use a different version of sqrt that handles the MaskedConstant correctly, but doesn't interpret NaN as masked?)

Tests added / passed
Passes black dask / flake8 dask

Closes #5338.

gjoseph92 · 2019-08-29T01:07:38Z

@mrocklin 2d8dd51 is my initial thought on how to resolve #5338, but I really don't know what to do about sqrt's handling of masked constants. Looking for some ideas here.

mrocklin · 2019-08-29T03:53:26Z

@jcrist do you have a moment to look through this? I've never really touched masked arrays before.

mrocklin · 2019-08-29T03:54:18Z

dask/array/reductions.py

            denominator = np.nan
-    else:
+    elif denominator is not np.ma.masked:
        denominator[denominator < 0] = np.nan


I wonder if we should set only the unmasked values to np.nan? (just guessing here)

mrocklin · 2019-08-29T03:54:28Z

dask/array/tests/test_masked.py

+    dfunc = getattr(da, reduction)
+    func = getattr(np, reduction)
+
+    assert_eq_ma(dfunc(dx), func(x))


Thank you for the test!

mrocklin · 2019-08-29T03:55:16Z

dask/array/ufunc.py

 log1p = ufunc(np.log1p)
 expm1 = ufunc(np.expm1)
-sqrt = ufunc(np.sqrt)
+sqrt = ufunc(np.ma.sqrt)


Yeah, my guess is that we probably don't want to elevate np.ma like this. It's such an uncommon case and I would expect that this would cause problems in some other workload (but again, I'm just guessing here)

Agreed, I also think this is a bad idea, but don't know what else to do.

We sort of want a ufunc like:

def sqrt(x): return np.ma.sqrt(x) if isinstance(x, np.ma.masked_array) else np.sqrt(x)

Or perhaps in std, we could switch which sqrt is used based on the type of _meta? (That seems to sometimes, but not always, reflect whether the array is going to be masked or not.)

mrocklin · 2019-08-29T03:56:05Z

Also, I wonder if @bjlittle could recommend someone to help review this

mrocklin · 2019-09-01T16:14:02Z

cc'ing also @DPeterK in case he knows someone comfortable with reviewing masked array code.

TomAugspurger · 2019-09-04T18:56:21Z

Hows this going @gjoseph92? Is there anything in particular that others like @bjlittle or @DPeterK can help out with?

mrocklin · 2019-09-20T00:02:11Z

@TomAugspurger , this is currently blocked on people able to review masked array code.

It seems like we no longer have anyone qualified who is able to review these things. I wonder if we should stop supporting masked arrays. @jacobtomlinson do you know if this is still used within UK Met? If so, do you know anyone there who we could lean on to help?

jacobtomlinson · 2019-09-21T09:14:22Z

AFAIK it is and I think @DPeterK is the right person to ping, he's just on vacation right now.

gjoseph92 · 2019-09-24T00:52:00Z

@TomAugspurger I've been out on vacation myself, so just catching up on this.

The core issue is that dask's data model for masked arrays doesn't quite line up with NumPy's (and frankly, kind of abuses dask's)—NumPy uses a separate subclass, while dask holds them in a plain dask array, and just knows that the functions in the graph will produce masked arrays. I don't believe there's a definitive way to tell if a dask array is masked or not without actually computing it. I think _meta isn't reliable; I remember some cases where _meta would be explicitly instantiated as a np.array regardless of input type, though of course I can't remember where now. More generally, I'd guess dask just doesn't play well with any ndarray subclasses, MaskedArray included.

Besides making a separate dask MaskedArray subclass (lot of effort), we could at least expose dask versions of all the np.ma ufuncs, so you'd have access to those versions when you needed them. But then you'd also want masked versions of the reductions (da.masked_std would need to use da.masked_sqrt, for example). Still a lot of effort and code duplication.

We could also make the dask ufuncs automatically switch between masked and non-masked versions on the type of the input array (like I mentioned here). But that seems bad, because users can't control it, and it changes current behavior out from under them.

Or, to get this merged, I can just remove the test for std or @pytest.mark.xfail it. The fact that masked arrays likely need their own versions of many functions to behave correctly is its own, much larger issue.

mrocklin · 2019-12-07T04:00:36Z

This appears to be stalled. @jcrist you're probably the best equiped person to handle this, but I can understand that it might not be high in your priority list. I thought I'd make you aware of it in case you have some extra time.

@gjoseph92 I apologize that no one has been found who is able to review this work so far.

We remove the updates to `da.sqrt`, and localize the required fix instead in `da.std`/`da.nanstd`.

jcrist · 2020-04-29T16:24:03Z

Thanks @gjoseph92. I've updated this PR to make the required fixes a bit more localized. Thanks for being patient here, this one slipped through the cracks (apologies).

jcrist · 2020-04-29T16:47:24Z

Hmmm, tests are failing for numpy 1.15. Debugging.

jcrist · 2020-04-29T17:07:13Z

Was a bug in the test, fixed now.

gjoseph92 · 2020-04-29T17:20:37Z

Wow thanks @jcrist! Wasn't expecting this to get solved, but we've still been running into it, so I really appreciate the help. safe_sqrt is a good solution here; looks good to me.

jcrist · 2020-04-29T19:32:27Z

Failure is unrelated. cc @jrbourbeau or @TomAugspurger for a quick review, but I think this is good-to-go.

jrbourbeau

Thanks for the PR @gjoseph92 and @jcrist @mrocklin for reviewing!

mrocklin reviewed Aug 29, 2019

View reviewed changes

martindurant assigned jcrist Apr 14, 2020

gjoseph92 and others added 3 commits April 28, 2020 10:19

Moment: handle all elements being masked

f9a0410

sqrt as np.ma.sqrt

b559a27

Fixup

94b9e76

We remove the updates to `da.sqrt`, and localize the required fix instead in `da.std`/`da.nanstd`.

jcrist force-pushed the array/moment-all-masked branch from 45cef9a to 94b9e76 Compare April 29, 2020 16:22

jcrist marked this pull request as ready for review April 29, 2020 16:23

Fixup test

fc5f4bd

jrbourbeau reviewed Apr 30, 2020

View reviewed changes

jrbourbeau merged commit 8e54df2 into dask:master Apr 30, 2020

Uh oh!

Conversation

gjoseph92 commented Aug 29, 2019

Uh oh!

gjoseph92 commented Aug 29, 2019

Uh oh!

mrocklin commented Aug 29, 2019

Uh oh!

mrocklin Aug 29, 2019

Choose a reason for hiding this comment

Uh oh!

mrocklin Aug 29, 2019

Choose a reason for hiding this comment

Uh oh!

mrocklin Aug 29, 2019

Choose a reason for hiding this comment

Uh oh!

gjoseph92 Aug 29, 2019

Choose a reason for hiding this comment

Uh oh!

mrocklin commented Aug 29, 2019

Uh oh!

mrocklin commented Sep 1, 2019

Uh oh!

TomAugspurger commented Sep 4, 2019

Uh oh!

mrocklin commented Sep 20, 2019

Uh oh!

jacobtomlinson commented Sep 21, 2019

Uh oh!

gjoseph92 commented Sep 24, 2019

Uh oh!

mrocklin commented Dec 7, 2019

Uh oh!

jcrist commented Apr 29, 2020

Uh oh!

jcrist commented Apr 29, 2020

Uh oh!

jcrist commented Apr 29, 2020

Uh oh!

gjoseph92 commented Apr 29, 2020

Uh oh!

jcrist commented Apr 29, 2020

Uh oh!

jrbourbeau left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants