Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add ignore_nan option to median_absolute_deviation #5232

Merged
merged 1 commit into from May 16, 2017

Conversation

keflavich
Copy link
Contributor

Simple PR: use np.nanmedian instead of np.median if ignore_nan is set. I would favor it defaulting to True, but to avoid breaking backward compatibility I set it to False. Feedback on that option would be helpful.

@@ -729,6 +729,11 @@ def median_absolute_deviation(a, axis=None):
axis : int, optional
Axis along which the MADs are computed. The default (`None`) is
to compute the MAD of the flattened array.
ignore_nan : bool
Ignore NaN values (treat them as if they are not in the array) when
computing the median. This will use `np.nanmedian`, which requires
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this causes sphinx failure, please change to numpy.nanmedian

@bsipocz
Copy link
Member

bsipocz commented Aug 11, 2016

GH shows all green, but this just an annoying bug to show all green after [ci skip] is used (they call it feature tough).

@keflavich - Could you please fix the failures? np.nanmedian was added in 1.10, so the test should be skipped for earlier versions. The sphinx failure is also a trivial one.

@@ -9,6 +9,8 @@
from numpy.testing import assert_equal
from numpy.testing.utils import assert_allclose

NUMPY_LT_1P9 = [int(x) for x in np.__version__.split('.')[:2]] < [1, 9]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of these are already available from astropy.utils.compat

@pllim pllim added this to the v1.3.0 milestone Aug 11, 2016
@pllim
Copy link
Member

pllim commented Aug 11, 2016

@larrybradley , do you wish to have a look as well?

@keflavich
Copy link
Contributor Author

The one failure looks like an issue in numpy 1.9: apparently, np.median([1,2,3,4,5,np.nan]) was not nan in that version!

Would anyone like to confirm that interpretation of the error? I'll set the minimum np version to 1.10 as @bsipocz recommended earlier.

@astrofrog
Copy link
Member

@keflavich - try np.median([1,2,np.nan,4,5]) with numpy 1.9. I think what happened back then was that if there were nans, sorting didn't work?

@keflavich
Copy link
Contributor Author

keflavich commented Aug 11, 2016

@astrofrog: agreed that there's a sorting error, but it seems to just put the nan at the end or beginning, not sure which:

>       assert np.isnan(funcs.mad_std([1,2,np.nan,4,5]))
E       assert <ufunc 'isnan'>(2.9652044370112041)
E        +  where <ufunc 'isnan'> = np.isnan
E        +  and   2.9652044370112041 = <function mad_std at 0x10bf2ecf8>([1, 2, nan, 4, 5])
E        +    where <function mad_std at 0x10bf2ecf8> = funcs.mad_std

EDIT: so, I think we just want to keep the version bumped to 1.10 for the test

@astrofrog
Copy link
Member

@keflavich - I think this needs to work for all Numpys we support - so just add a workaround function for Numpy < 1.10 which does the nanmedian manually

@@ -761,6 +766,8 @@ def median_absolute_deviation(a, axis=None):
# returns an masked array even if the result should be scalar). (#4658)
if isinstance(a, np.ma.MaskedArray):
func = np.ma.median
elif ignore_nan:
func = np.nanmedian
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like this could do the trick:

try:
    func = np.nanmedian
except AttributeError:
    func = lambda x: np.median(x[~np.isnan(x)])

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would fail if an axis argument is given because boolean indexing returns a 1d array

@MSeifert04
Copy link
Contributor

MSeifert04 commented Aug 11, 2016

The function already supports masked arrays so one could also mask the nans:

# mask nan/inf
if ignore_invalid:
    a = np.ma.masked_invalid(a)

# or just ignore nans
# if ignore_nan:
#    a = np.ma.masked_where(np.isnan(a), a)

This would work for masked and unmasked arrays alike and work across different numpy versions. It's however a bit slower to use np.ma.median instead of np.nanmedian and may return a masked value if no not-nan(inf) was present.

@larrybradley
Copy link
Member

I suggest the same as @MSeifert04. If no invalid values are present we could ensure the output is not masked (with axis=None).

@keflavich
Copy link
Contributor Author

@MSeifert04 I've implemented your suggestion in f887005, which also led me to tackle another corner case (if ignore_nan is specified and nans aren't masked out already). Is this reasonable?

@larrybradley do you want an extra check at the end that the return is not a masked array?

@larrybradley
Copy link
Member

@keflavich Currently, there is a bug in np.ma.median in that it will return a single-valued masked array instead of a scalar (numpy/numpy#7835). This will be fixed in numpy 1.11.2, but I suggest we implement a workaround to make the behavior consistent across all numpy versions (e.g. see https://github.com/astropy/astropy/blob/master/astropy/stats/sigma_clipping.py#L312).

@keflavich
Copy link
Contributor Author

@larrybradley would you like me to add that same workaround to this PR, then?

@larrybradley
Copy link
Member

@keflavich Yes. For example, I would expect median_absolute_deviation(np.arange(10)) to always return a scalar (here, 2.5), not a masked array (perhaps there is even a test for that).

@MSeifert04
Copy link
Contributor

@keflavich
I do have to correct my statement about np.ma.median being slower than np.nanmedian. It's a bit more complicated than that. In case axis=None nanmedian is faster but with a specified axis masked_median is faster. (see https://gist.github.com/MSeifert04/1551e80e1aa85b48c9191e727c38f9a9)

But I thought it might be better always to use np.ma.median if the user specifies ignore_nan (btw masked_invalid also masks infinities). That has the advantage that the return type doesn't depend on the numpy version being used. Currently numpy 1.9 users get a masked-array while numpy 1.10 users get a normal array if he specifies an axis.

More general if we want to return scalars instead of returning masked-arrays wouldn't it be better to ship some compat/monkeypatch-function instead of adding the workaround to several functions?

@larrybradley
Copy link
Member

More general if we want to return scalars instead of returning masked-arrays wouldn't it be better to ship some compat/monkeypatch-function instead of adding the workaround to several functions?

Yes, I should have suggested that before. For photutils, I use a private function to smooth out the inconsistencies:
https://github.com/astropy/photutils/blob/master/photutils/background/core.py#L35

@crawfordsm
Copy link
Member

Yes, if the test isn't already there, I would suggest adding the test with this addition that median_absolute_deviation(np.arange(10)) should always return a scalar and not a masked array.

@@ -716,7 +720,7 @@ def poisson_conf_interval(n, interval='root-n', sigma=1, background=0,
return conf_interval


def median_absolute_deviation(a, axis=None):
def median_absolute_deviation(data, axis=None, ignore_nan=False):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're going to rename the parameter #5214 would be ideal here 😄

(sorry for double posting this)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(done)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I should have mentioned that the PR is not yet reviewed nor merged.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are going to use stuff from #5214, then the tag from this PR needs to be changed from "affect release" to "affect dev", right?

@juliantaylor
Copy link

curious, np.nanmedian internally switches to ma.median if the reduced axis is too small. The threshold is 400 elements and used to work well, now the threshold is around 1200 elements. Something in the other path must have regressed in performance.

@crawfordsm
Copy link
Member

@keflavich Let me know if you still want to update this and have it reviewed. Looks like some tests failed and now a conflict with some other code updates.

@keflavich
Copy link
Contributor Author

This is ready for further review. I can't see any remaining problems with sphinx, these warnings all seem unassociated:

WARNING: The sphinx_gallery extension is not installed, so the gallery will not be built.  You will probably see additional warnings about undefined references due to this.
WARNING: FITSFixedWarning: LONPOLE2= 180.000000000 /lonpole
WARNING: FITSFixedWarning: LATPOLE2= 0.00000000000 /latpole
/Users/adam/repos/astropy/docs/index.rst:55: WARNING: toctree contains reference to nonexisting document 'generated/examples/index'
/Users/adam/repos/astropy/docs/visualization/wcsaxes/overlays.rst:163: WARNING: Exception occurred in plotting overlays-6
/Users/adam/repos/astropy/docs/visualization/wcsaxes/overlays.rst:174: WARNING: Exception occurred in plotting overlays-7
/Users/adam/repos/astropy/docs/coordinates/frames.rst:318: WARNING: undefined label: sphx_glr_generated_examples_coordinates_plot_sgr-coordinate-frame.py (if the link has no caption the label must precede a section header)
/Users/adam/repos/astropy/docs/coordinates/index.rst:129: WARNING: undefined label: sphx_glr_generated_examples_coordinates_plot_obs-planning.py (if the link has no caption the label must precede a section header)
/Users/adam/repos/astropy/docs/coordinates/index.rst:292: WARNING: undefined label: sphx_glr_generated_examples_coordinates_plot_obs-planning.py (if the link has no caption the label must precede a section header)
/Users/adam/repos/astropy/docs/io/fits/appendix/faq.rst:283: WARNING: undefined label: sphx_glr_generated_examples_io_skip_create-large-fits.py (if the link has no caption the label must precede a section header)
/Users/adam/repos/astropy/docs/io/fits/appendix/faq.rst:307: WARNING: undefined label: sphx_glr_generated_examples_io_create-mef.py (if the link has no caption the label must precede a section header)
/Users/adam/repos/astropy/docs/io/fits/index.rst:273: WARNING: undefined label: sphx_glr_generated_examples_io_modify-fits-header.py (if the link has no caption the label must precede a section header)
/Users/adam/repos/astropy/docs/io/fits/index.rst:351: WARNING: undefined label: sphx_glr_generated_examples_io_plot_fits-image.py (if the link has no caption the label must precede a section header)
/Users/adam/repos/astropy/docs/io/fits/index.rst:466: WARNING: undefined label: sphx_glr_generated_examples_io_fits-tables.py (if the link has no caption the label must precede a section header)
/Users/adam/repos/astropy/docs/io/fits/index.rst:644: WARNING: undefined label: sphx_glr_generated_examples_io_create-mef.py (if the link has no caption the label must precede a section header)
/Users/adam/repos/astropy/docs/whatsnew/1.0.rst:68: WARNING: undefined label: sphx_glr_generated_examples_coordinates_plot_obs-planning.py (if the link has no caption the label must precede a section header)

@keflavich
Copy link
Contributor Author

Ready for review again!


if not NUMPY_LT_1_10:
assert np.isnan(funcs.mad_std([1,2,3,4,5,np.nan]))
assert_allclose(funcs.mad_std([1,2,3,4,5,np.nan], ignore_nan=True),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick on those two lines: add blank space after commas :)

Copy link
Contributor

@saimn saimn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from the minor inline comments, I think the main point is what we discussed in #5835 (with @crawfordsm): it would be useful to extract the logic to choose which median to use in a astropy.stats.median function, and it would simplify the code.
Also something should be done is a func is provided and ignore_nan is True. Just issue a warning or raise an exception ? This could also be stated more clearly in the docstring.

CHANGES.rst Outdated
that will use ``np.ma.median`` with nans masked out or ``np.nanmedian``
instead of ``np.median`` when computing the median. [#5232]

- ``astropy.sphinx``
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Residual from rebase ?

func = np.ma.median
if ignore_nan:
data = np.ma.masked_where(np.isnan(data), data)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use np.ma.masked_invalid instead (it is already used below).

@@ -19,3 +19,4 @@
NUMPY_LT_1_10_4 = not minversion('numpy', '1.10.4')
NUMPY_LT_1_11 = not minversion('numpy', '1.11.0')
NUMPY_LT_1_12 = not minversion('numpy', '1.12')
NUMPY_GE_1_13 = minversion('numpy', '1.13dev')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not used, can be removed ?

Copy link
Contributor

@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@keflavich - looks good except that I think now that we dropped numpy 1.7 and 1.8, this can be quite a bit simplified (since nanmedian is in 1.9, contrary to the earlier comment).

But wow, what a hassle!

@@ -766,26 +778,59 @@ def median_absolute_deviation(a, axis=None, func=None):
# See https://github.com/numpy/numpy/issues/7330 why using np.ma.median
# for normal arrays should not be done (summary: np.ma.median always
# returns an masked array even if the result should be scalar). (#4658)
if isinstance(a, np.ma.MaskedArray):
if isinstance(data, np.ma.MaskedArray):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If func is explicitly given, this whole stanza should be ignored.

Also, maybe use np.ma.isMaskedArray here?

data = np.ma.masked_where(np.isnan(data), data)
elif ignore_nan:
is_masked = False
if NUMPY_LT_1_10:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused - help on np.nanmedian states this has been present since numpy 1.9, and I just checked that this is indeed the case (contrary to what was stated in #5232 (comment)).. Since we just dropped support for 1.7 and 1.8, it would seem this means we can get rid of this stanza!

"unlikely to be what you expect.")

data = np.asanyarray(data)
data_median = func(data, axis=axis)

# broadcast the median array before subtraction
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

np.nanmedian has a keepdims argument... (it was only added in 1.10 for np.ma.median). But since any user function can be passed in, it is probably best to keep this (maybe add a comment, though).

@@ -835,7 +885,8 @@ def mad_std(data, axis=None, func=None):
"""

# NOTE: 1. / scipy.stats.norm.ppf(0.75) = 1.482602218505602
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unrelated to this PR, but it would be good to mention this factor in the docstring; if I understand correctly, it essentially is the exact number for the following:

1./np.median(np.abs(np.random.normal(size=100000)))
# 1.4832115241968997

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mhvk I did describe this factor in the docstring:
http://docs.astropy.org/en/stable/api/astropy.stats.mad_std.html#astropy.stats.mad_std

The factor is the normal inverse cumulative distribution function evaluated at P=3/4. It's exactly 1. / scipy.stats.norm.ppf(0.75).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@larrybradley - sorry, don't know how I could have missed this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries!

data[5,5] = np.nan
rslt = funcs.mad_std(data, ignore_nan=True)
assert np.isscalar(rslt)
with catch_warnings() as warns:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are any warnings still given with ignore_nan=True?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there should be but I am admittedly fuzzy-headed at the moment.

@@ -8,8 +8,8 @@
from ...utils import minversion


__all__ = ['NUMPY_LT_1_9_1', 'NUMPY_LT_1_10', 'NUMPY_LT_1_10_4',
'NUMPY_LT_1_11', 'NUMPY_LT_1_12']
__all__ = ['NUMPY_LT_1_9_1', 'NUMPY_LT_1_10',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just leave the line as is? Then this file does not need to be touched at all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably right; I think this was from a failed merge during rebasing.

from astropy.utils.compat import NUMPY_LT_1_10, NUMPY_LT_1_12
from astropy.utils.decorators import deprecated_renamed_argument


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should just be one blank line.

CHANGES.rst Outdated
@@ -202,8 +202,6 @@ Bug Fixes
``median_absolute_deviation``. And allow to use these functions with
a multi-dimensional ``axis``. [#5835]

- ``astropy.sphinx``

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@keflavich You inadvertently deleted these lines instead of the ones below (line 551).

from warnings import warn

from astropy.utils.compat import NUMPY_LT_1_10
from astropy.utils.decorators import deprecated_renamed_argument
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would write these as

from ..utils.compat import NUMPY_LT_1_10
from ..utils.decorators import deprecated_renamed_argument

and then remove the follow two blank lines.

@keflavich
Copy link
Contributor Author

I... should probably squash all this if we ever decide to merge it. I think the commit log is now longer than the changes?

@crawfordsm
Copy link
Member

I'm happy to merge this after any additional comments. Thanks @keflavich for taking all the time to push this through.

keflavich added a commit to keflavich/astropy that referenced this pull request May 15, 2017
make mad_std nan-compatible

mad_std too

add changelog entry [ci skip]

np->numpy in docstrings & check numpy version (using 1.9, because
officially it was added to numpy then according to their docs...)

use astropy.utils.compat (thanks @bsipocz)

1_9, not 1P9

bump to 1.10 to avoid np 1.9 weirdness

attempt to work around older numpy versions using ma.median

add tests for scalarness and use np.ma.median except when axis=None.
Also, change 'a' to 'data'.  'a' is a bad variable name and led to a few
mistakes

add another test of madstd with nan & axes

add a warning about NaNs in arrays for np<1.10.  Remove a test for those
versions.

update docstrings & changelog.  Add deprecation of old argument

correct the code to match the comment

change order of bool checking to do expensive one last

Test appropriate numpy versions, check warnings, and use asked array comparison

switch to catch_warnigns

for numpy 1.13, treatment of NaNs in masked arrays changed

add GE_13 to __ALL__

xfail the masked array test to avoid supporting incorrect behavior
see
astropy#5232 (review)

udpate the special cases as per @mwcraig's helpful chart!

[ci skip] fix the tests: the behavior of the function when given masked arrays vs
non-masked arrays has changed: we're now forcing the correct behavior
for earlier versions

implement @juliantaylor's  suggetions

test should no longer use np.ma.allclose (though I don't know why ma
wouldn't just work....)

this commit will fail: I've fixed the underlying issue but kept the
tests flexible using the wrong data type...

tests for array type

double backticks

a second instance of bacticks

had a case wrong: one of the warn cases should *not* return a NaN for
np<1.11 (which is why we're catching a warning there)

numpy 1.10 np.ma.median([1,2,nan,4,5]) returns nan by default

no more old numpy supports.  Mixed feelings about this

fix an import

cleanup to address the numpy <=1.8 deprecation

more cleanup

fix changelog

add a note about keepdims

whitespace fix

fix imports

trailing whitespace?!!?!?!
(this is a squash of 38 commits with messages below:)

make mad_std nan-compatible

mad_std too

add changelog entry

np->numpy in docstrings & check numpy version (using 1.9, because
officially it was added to numpy then according to their docs...)

use astropy.utils.compat (thanks @bsipocz)

1_9, not 1P9

bump to 1.10 to avoid np 1.9 weirdness

attempt to work around older numpy versions using ma.median

add tests for scalarness and use np.ma.median except when axis=None.
Also, change 'a' to 'data'.  'a' is a bad variable name and led to a few
mistakes

add another test of madstd with nan & axes

add a warning about NaNs in arrays for np<1.10.  Remove a test for those
versions.

update docstrings & changelog.  Add deprecation of old argument

correct the code to match the comment

change order of bool checking to do expensive one last

Test appropriate numpy versions, check warnings, and use asked array comparison

switch to catch_warnigns

for numpy 1.13, treatment of NaNs in masked arrays changed

add GE_13 to __ALL__

xfail the masked array test to avoid supporting incorrect behavior
see
astropy#5232 (review)

udpate the special cases as per @mwcraig's helpful chart!

fix the tests: the behavior of the function when given masked arrays vs
non-masked arrays has changed: we're now forcing the correct behavior
for earlier versions

implement @juliantaylor's  suggetions

test should no longer use np.ma.allclose (though I don't know why ma
wouldn't just work....)

this commit will fail: I've fixed the underlying issue but kept the
tests flexible using the wrong data type...

tests for array type

double backticks

a second instance of bacticks

had a case wrong: one of the warn cases should *not* return a NaN for
np<1.11 (which is why we're catching a warning there)

numpy 1.10 np.ma.median([1,2,nan,4,5]) returns nan by default

no more old numpy supports.  Mixed feelings about this

fix an import

cleanup to address the numpy <=1.8 deprecation

more cleanup

fix changelog

add a note about keepdims

whitespace fix

fix imports

trailing whitespace?!!?!?!
@keflavich
Copy link
Contributor Author

keflavich commented May 15, 2017

Rebased and squashed.

@bsipocz
Copy link
Member

bsipocz commented May 16, 2017

@crawfordsm - Could you clarify, is this ready to be merged or we're still waiting for others to comment?

@crawfordsm crawfordsm merged commit 1149f0b into astropy:master May 16, 2017
@crawfordsm
Copy link
Member

I was primarily just waiting on the rebase and squashing. Merged.

Thanks @keflavich !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet