Skip to content

Moment: handle all elements being masked#5339

Merged
jrbourbeau merged 4 commits intodask:masterfrom
gjoseph92:array/moment-all-masked
Apr 30, 2020
Merged

Moment: handle all elements being masked#5339
jrbourbeau merged 4 commits intodask:masterfrom
gjoseph92:array/moment-all-masked

Conversation

@gjoseph92
Copy link
Copy Markdown
Collaborator

This almost works, but still fails for std, which returns 1-element array where the element is masked, instead of the MaskedConstant it should.

The issue is that np.sqrt doesn't return a masked constant, whereas np.ma.sqrt does:

In [1]: import numpy as np                                                                                                                                                                                                                                                                                                                                                                                                                                       

In [2]: np.sqrt(np.ma.masked)                                                                                                                                                                                                                                                                                                                                                                                                                                    
Out[2]: 
masked_array(data=--,
             mask=True,
       fill_value=1e+20,
            dtype=float64)

In [3]: np.ma.sqrt(np.ma.masked)                                                                                                                                                                                                                                                                                                                                                                                                                                 
Out[3]: masked

da.std calls sqrt, which is a dask ufunc wrapping np.sqrt.

Replacing the sqrt ufunc with the masked version seems like a potentially significant change around the edge cases. For example, they handle NaN differently:

In [1]: import numpy as np                                                                                                                                                                                                                                                                                                                                                                                                                                       

In [2]: np.ma.sqrt(np.nan)                                                                                                                                                                                                                                                                                                                                                                                                                                       
Out[2]: masked

In [3]: np.sqrt(np.nan)                                                                                                                                                                                                                                                                                                                                                                                                                                          
Out[3]: nan

I tried this in 45cef9a, but it doesn't seem like a good idea. (And breaks other tests due to the above discrepancy).

So the question is: how to write std in such a way that if the variance is a MaskedConstant, we just return the MaskedConstant without passing it to sqrt? (Or use a different version of sqrt that handles the MaskedConstant correctly, but doesn't interpret NaN as masked?)

  • Tests added / passed
  • Passes black dask / flake8 dask

Closes #5338.

@gjoseph92
Copy link
Copy Markdown
Collaborator Author

@mrocklin 2d8dd51 is my initial thought on how to resolve #5338, but I really don't know what to do about sqrt's handling of masked constants. Looking for some ideas here.

@mrocklin
Copy link
Copy Markdown
Member

@jcrist do you have a moment to look through this? I've never really touched masked arrays before.

denominator = np.nan
else:
elif denominator is not np.ma.masked:
denominator[denominator < 0] = np.nan
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should set only the unmasked values to np.nan? (just guessing here)

dfunc = getattr(da, reduction)
func = getattr(np, reduction)

assert_eq_ma(dfunc(dx), func(x))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the test!

log1p = ufunc(np.log1p)
expm1 = ufunc(np.expm1)
sqrt = ufunc(np.sqrt)
sqrt = ufunc(np.ma.sqrt)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, my guess is that we probably don't want to elevate np.ma like this. It's such an uncommon case and I would expect that this would cause problems in some other workload (but again, I'm just guessing here)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I also think this is a bad idea, but don't know what else to do.

We sort of want a ufunc like:

def sqrt(x):
  return np.ma.sqrt(x) if isinstance(x, np.ma.masked_array) else np.sqrt(x)

Or perhaps in std, we could switch which sqrt is used based on the type of _meta? (That seems to sometimes, but not always, reflect whether the array is going to be masked or not.)

@mrocklin
Copy link
Copy Markdown
Member

Also, I wonder if @bjlittle could recommend someone to help review this

@mrocklin
Copy link
Copy Markdown
Member

mrocklin commented Sep 1, 2019

cc'ing also @DPeterK in case he knows someone comfortable with reviewing masked array code.

@TomAugspurger
Copy link
Copy Markdown
Member

Hows this going @gjoseph92? Is there anything in particular that others like @bjlittle or @DPeterK can help out with?

@mrocklin
Copy link
Copy Markdown
Member

@TomAugspurger , this is currently blocked on people able to review masked array code.

It seems like we no longer have anyone qualified who is able to review these things. I wonder if we should stop supporting masked arrays. @jacobtomlinson do you know if this is still used within UK Met? If so, do you know anyone there who we could lean on to help?

@jacobtomlinson
Copy link
Copy Markdown
Member

AFAIK it is and I think @DPeterK is the right person to ping, he's just on vacation right now.

@gjoseph92
Copy link
Copy Markdown
Collaborator Author

@TomAugspurger I've been out on vacation myself, so just catching up on this.

The core issue is that dask's data model for masked arrays doesn't quite line up with NumPy's (and frankly, kind of abuses dask's)—NumPy uses a separate subclass, while dask holds them in a plain dask array, and just knows that the functions in the graph will produce masked arrays. I don't believe there's a definitive way to tell if a dask array is masked or not without actually computing it. I think _meta isn't reliable; I remember some cases where _meta would be explicitly instantiated as a np.array regardless of input type, though of course I can't remember where now. More generally, I'd guess dask just doesn't play well with any ndarray subclasses, MaskedArray included.

Besides making a separate dask MaskedArray subclass (lot of effort), we could at least expose dask versions of all the np.ma ufuncs, so you'd have access to those versions when you needed them. But then you'd also want masked versions of the reductions (da.masked_std would need to use da.masked_sqrt, for example). Still a lot of effort and code duplication.

We could also make the dask ufuncs automatically switch between masked and non-masked versions on the type of the input array (like I mentioned here). But that seems bad, because users can't control it, and it changes current behavior out from under them.

Or, to get this merged, I can just remove the test for std or @pytest.mark.xfail it. The fact that masked arrays likely need their own versions of many functions to behave correctly is its own, much larger issue.

@mrocklin
Copy link
Copy Markdown
Member

mrocklin commented Dec 7, 2019

This appears to be stalled. @jcrist you're probably the best equiped person to handle this, but I can understand that it might not be high in your priority list. I thought I'd make you aware of it in case you have some extra time.

@gjoseph92 I apologize that no one has been found who is able to review this work so far.

gjoseph92 and others added 3 commits April 28, 2020 10:19
We remove the updates to `da.sqrt`, and localize the required fix
instead in `da.std`/`da.nanstd`.
@jcrist jcrist force-pushed the array/moment-all-masked branch from 45cef9a to 94b9e76 Compare April 29, 2020 16:22
@jcrist jcrist marked this pull request as ready for review April 29, 2020 16:23
@jcrist
Copy link
Copy Markdown
Member

jcrist commented Apr 29, 2020

Thanks @gjoseph92. I've updated this PR to make the required fixes a bit more localized. Thanks for being patient here, this one slipped through the cracks (apologies).

@jcrist
Copy link
Copy Markdown
Member

jcrist commented Apr 29, 2020

Hmmm, tests are failing for numpy 1.15. Debugging.

@jcrist
Copy link
Copy Markdown
Member

jcrist commented Apr 29, 2020

Was a bug in the test, fixed now.

@gjoseph92
Copy link
Copy Markdown
Collaborator Author

Wow thanks @jcrist! Wasn't expecting this to get solved, but we've still been running into it, so I really appreciate the help. safe_sqrt is a good solution here; looks good to me.

@jcrist
Copy link
Copy Markdown
Member

jcrist commented Apr 29, 2020

Failure is unrelated. cc @jrbourbeau or @TomAugspurger for a quick review, but I think this is good-to-go.

Copy link
Copy Markdown
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @gjoseph92 and @jcrist @mrocklin for reviewing!

@jrbourbeau jrbourbeau merged commit 8e54df2 into dask:master Apr 30, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

std() fails on masked array of all masked elements

7 participants