Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a grow parameter to sigma_clip et al. #10613

Merged
merged 16 commits into from Oct 2, 2020

Conversation

jehturner
Copy link
Member

Description

Adds a grow parameter to SigmaClip, sigma_clip & sigma_clipped_stats, to expand the masking of each deviant value to its neighbours within a specified pixel radius. This is a common feature of bad pixel rejection algorithms, notably in IRAF, and will also be useful for FittingWithOutlierRejection in modeling.

This will need merging with my PR #10610 from yesterday.

I've used grow=False as the default value, which seems clear to me, but 0. would also work (and is fairly clear) if people prefer a fixed type (but that's not the case elsewhere).

This could perhaps be optimized slightly by caching the growth kernel calculated in SigmaClip.__call__ -- but since it depends on the dimensionality of the input data and the value of axis, it would need storing in a dict by axis value or something. I would think the gain would be very marginal for the array sizes we typically deal with in astronomy and existing uses without grow will be unaffected either way, so I'm inclined to consider this "premature optimization" until demonstrated otherwise.

@astropy-bot astropy-bot bot added the stats label Jul 30, 2020
@jehturner
Copy link
Member Author

Gah, this dratted scipy thing.

@saimn saimn added this to the v4.2 milestone Jul 30, 2020
@saimn saimn requested a review from larrybradley July 30, 2020 14:25
@saimn
Copy link
Contributor

saimn commented Jul 30, 2020

I think you can just move the scipy import to be a local one.
For tests, you also need something like this:

try:
import scipy # pylint: disable=W0611 # noqa
except ImportError:
HAS_SCIPY = False
else:
HAS_SCIPY = True

@pytest.mark.skipif('not HAS_SCIPY')

@jehturner
Copy link
Member Author

jehturner commented Jul 30, 2020

How keen are people for sigma_clip in particular not to depend on scipy? Personally, I think that while avoiding that dependency is a noble goal, it could create more problems than it solves nowadays. But here I think we could just raise an exception (warning?) if grow is specified when scipy is not installed? Then we won't break any code (eg. downstream tests?) that happens to be using sigma_clip without scipy. Trying to reproduce or bundle the functionality of binary_dilation within astropy doesn't seem very sensible. I'm suggesting a hard exception here so the results won't differ depending on whether scipy is available, which users could overlook.

Alternatively, the astropy.convolve module looks to be a stand-alone implementation and might possibly work in place of binary_dilation -- but I would think it will be slower and use much more memory (since it's doing a float operation rather than a bool one) -- it's a bit of a sledgehammer to crack a walnut.

(I suppose the last resort would be to add grow to a cython sigclip implementation in our DRAGONS package, which is actually faster than stats.sigma_clip. But currently that only does the equivalent of _sigmaclip_noaxis and it would be a shame not to make the functionality available in astropy.)

@jehturner
Copy link
Member Author

Thanks, @saimn, I always forget that. Do we need that comment # pylint: disable=W0611 # noqa?

@saimn
Copy link
Contributor

saimn commented Jul 30, 2020

The way it is done currently in astropy is to move the scipy import locally where it is needed, so people will need scipy only if they use a feature requiring it, and get an ImportError if they don't have scipy. The case of scipy is a bit particular since it is used in many places in astropy, and most people will have it anyway, but on the other hand making it optional allow people to use specific parts of astropy (e.g units, or io.*) without requiring an additional dependency.

About the comment, just # noqa would be fine to avoid flake8 warnings, pylint is not used on the CI check and not the recommended tool either so I guess few people use it.

@jehturner
Copy link
Member Author

OK, so let's just do if grow: from scipy.ndimage import binary_dilation in both __init__ (to catch any failure early) and __call__. Not doing the import unless it's needed might speed things up slightly as well as guaranteeing that this doesn't break existing code that uses SigmaClip.

@saimn
Copy link
Contributor

saimn commented Jul 30, 2020

(I suppose the last resort would be to add grow to a cython sigclip implementation in our DRAGONS package, which is actually faster than stats.sigma_clip. But currently that only does the equivalent of _sigmaclip_noaxis and it would be a shame not to make the functionality available in astropy.)

Having the fast cython version in astropy would be even more interesting, but I'm not sure how grow could be implemented in a generic ND way... The _sigmaclip_withaxis case could be done easily by reshaping the data I guess, but for grow I think it would require a specific implementation for 1D, 2D, etc. ?

@jehturner
Copy link
Member Author

Could also add a try ... except under that condition in __init__, with a more informative error like scipy is required to use the grow parameter. Later on, we could make the except use astropy.convolve if appropriate and there's demand for it.

@jehturner
Copy link
Member Author

jehturner commented Jul 30, 2020

I agree that it would be good to think about combining these 2 implementations of sigma clipping (or at least have coherent functionality in one place) at some point (not in the scope of what I'm currently doing).

Having the fast cython version in astropy would be even more interesting, but I'm not sure how grow could be implemented in a generic ND way ... I think it would require a specific implementation for 1D, 2D, etc. ?

Any reason why we can't call ndimage to do that from Cython (is that a silly question)? Failing that, at least a convolution should be easier to implement in Cython than in Python... I assume the kernel construction could be done similarly to now? Let me know if I'm overlooking something.

@jehturner
Copy link
Member Author

One thing that bothers me slightly here is whether we'll always want to "grow" bad pixels along the same axes that sigma is calculated over. Obviously it makes sense not to grow along a model set axis, because the models could be independent, but eg. if you have a 3D cube from an IFU, pixels will only be neighbours on the detector in up to 2 of the 3 dimensions, even though you're measuring sigma over all 3 (though OTOH if you've already interpolated you've probably spread the defect in 3D). Also, you can have a model set consisting of adjacent image rows, in which case it might make sense to grow in both dimensions anyway. So perhaps there's an argument for an additional parameter (pun not intended), but I think this is good baseline behaviour and does everything IRAF does.

@saimn
Copy link
Contributor

saimn commented Jul 30, 2020

Any reason why we can't call ndimage to do that from Cython (is that a silly question)?

Because it's implemented with a C extension, not Cython (I had a look just before ;)). So linking to it is more difficult, it would be easier if they provide a cython wrapper.

Failing that, at least a convolution should be easier to implement in Cython than in Python... I assume the kernel construction could be done similarly to now? Let me know if I'm overlooking something.

The kernel could be passed to the cython routine, but what I meant is that to support the grow case it would be hard to do this with a generic ND function. Without grow it would be easier, since the ND case, with or without axis, can be seeing as a set of 1D cases.
In any case having at least an optimized 1D and 2D version of sigmaclip would certainly be useful, and in that case implementing the dilatation would not be too hard (e.g. https://github.com/scikit-image/scikit-image/blob/2fc3c5d6e997bf83287d8c3bdc354ea86ed0002d/skimage/morphology/cmorph.pyx)

@larrybradley
Copy link
Member

Thanks again, @jehturner. As @saimn noted above, the only required dependency of astropy is numpy (and now pytest too for testing). scipy is an optional dependency. We handle optional dependencies by importing them inside functions, methods, etc. You can use try...except ImportError to add a more informative error message, but it's not necessary. More often than not, we don't do that.

One other quick note -- new keywords need to be added at the end of the function signature to not break the API (e.g., in case keywords are used as positional arguments). On the other hand, we've discussed using the * operator to force keywords to be used as non-positional arguments, but I don't know if astropy core decided on an official policy. @pllim?

@jehturner
Copy link
Member Author

OK. This seemed like the most logically-consistent place for the new keyword, but it's really not a big deal here. On the other hand, if this is policy, it might be a good idea to consider doing some API clean up for every major release, @pllim? I'm not sure how one would keep track of that though.

@jehturner
Copy link
Member Author

Actually, it does get a bit messier in sigma_clip, because now grow=False is going to end up after copy=True, away from the other parameters that actually control the algorithm:

def sigma_clip(data, sigma=3, sigma_lower=None, sigma_upper=None, maxiters=5,
grow=False, cenfunc='median', stdfunc='std', axis=None,
masked=True, return_bounds=False, copy=True):

?

@jehturner
Copy link
Member Author

Anyway, I have that change locally if you want it. When the other PR is merged I'll rebase from it.

@jehturner
Copy link
Member Author

For tests, you also need something like this:

try:
import scipy # pylint: disable=W0611 # noqa
except ImportError:
HAS_SCIPY = False
else:
HAS_SCIPY = True

The test file already has a block like this with from scipy import stats. I suppose that should work for both purposes (it has to import stats in particular because it actually uses that).

@jehturner
Copy link
Member Author

One other quick note -- new keywords need to be added at the end of the function signature to not break the API (e.g., in case keywords are used as positional arguments).

This complicates the existing repr test a bit, which was using startswith to avoid matching the addresses of cenfunc & stdfunc, eg. SigmaClip(sigma=3.0, sigma_lower=3.0, sigma_upper=3.0, maxiters=5, cenfunc=<function _nanmedian at 0x7ff961939840>, stdfunc=<function _nanstd at 0x7ff961939a60>, grow=False). Do we want to add a regexp for this, or endswith(', grow=False') or something?

@jehturner
Copy link
Member Author

jehturner commented Jul 31, 2020

Oh, right, need to remove the default function name from the regexes because of the optional bottleneck thing. Will have to do that next week as I must do something else urgently. Not sure off hand why readthedocs has started complaining about axis...

@jehturner
Copy link
Member Author

So we expect cenfunc to be either _nanmedian or nanmedian by default (and likewise for stdfunc) but I'm having trouble figuring out whether Python guarantees that the repr defaults to the format <function name at 0x[a-f0-9]+> or whether that's a CPython implementation detail.

@jehturner
Copy link
Member Author

jehturner commented Aug 3, 2020

I've got rid of the regex for str/repr again and instead matched the substrings corresponding to nanmedian & nanstd by calling SigmaClip._parse_cenfunc & _parse_stdfunc from the test. That's not quite as thorough a test as the last version, because the result also depends on those private methods behaving as expected, but it's still more thorough than the original test (using startswith) was and I think it should be robust against not only the presence or absence of bottleneck but also any variations in how the default __repr__/__str__ for functions look in any given Python implementation.

This raises a minor question though... I see that SigmaClip.__repr__ is returning the str for cenfunc & stdfunc; should it instead be returning their repr as part of its own repr? In practice the two things are identical for a function, but still.

I'm still not sure about the readthedocs failure, but presumably it's the warnings about axis that are causing that? I'm going to have to figure out the new workflow for building the docs locally.

Copy link
Contributor

@saimn saimn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Single backticks create sphinx references, so you need double backticks for the axis occurrences.

astropy/stats/sigma_clipping.py Outdated Show resolved Hide resolved
astropy/stats/sigma_clipping.py Outdated Show resolved Hide resolved
astropy/stats/sigma_clipping.py Outdated Show resolved Hide resolved
@jehturner
Copy link
Member Author

Does this look ready to merge? It would be helpful to have it in master. Thanks!

@jehturner
Copy link
Member Author

Any final thoughts on this, @larrybradley? It would help our development significantly if we could get it into master soon.

Do we definitely want to keep the new parameter at the end of the function signature, as you had requested, even though it splits up the parameters that control the clipping (bearing in mind that this won't go in a bug fix release)?

Thanks!

@larrybradley
Copy link
Member

@jehturner I agree that it's much nicer to group common keywords together, but unfortunately reordering keywords breaks backwards compatibility. I'd be in favor of adding the * operator to the function signature to disallow keywords to be used as positional arguments. Thoughts @pllim, @bsipocz?

To do that, add a * argument before the first keyword argument, e.g., SigmaClip(*, sigma=3.0, ...). Requiring keyword-only arguments is a breaking change too, but it's a much safer breaking change because the user will always get an error message if keywords are used positionally. In case of changing the keyword order, the function may work fine, but the user is unknowingly calling it with different keyword values and may unexpectedly get different results.

I'll review the rest of the PR.

@bsipocz
Copy link
Member

bsipocz commented Oct 1, 2020

@larrybradley - Some major packages started to use kwarg only arguments even as they introduced an API break. For astropy I think it makes sense to wait with such a change until the next LTS release, otherwise it may lead to confusions. And I don't think the benefits outbit the harm for a a kwarg reshuffle. We suffered this same problem with astroquery a lot, but had to stick to the no API break, and thus we gradually started to use kwarg only for new functionality, and will enforce more for the next bigger release (but keep in mind, astroquery has a linear release model without an LTS series).

So, long story short, while the reshuffle makes sense, I don't think it's appropriate reason for an API break. And wait with the kwarg only for existing functionality, but have a push for a full refactor for v5.0.

Copy link
Member

@larrybradley larrybradley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @jehturner! BTW, don't think the CI tests are going to pass until you rebase this PR on the current master. I needed to rebase to run the tests locally.

@larrybradley larrybradley added the 💤 merge-when-ci-passes Do not use: We have auto-merge option now. label Oct 1, 2020
@jehturner
Copy link
Member Author

Yeah, will do in a min. Just putting children in bed in between changing stuff...

@jehturner
Copy link
Member Author

Does [skip ci] work in the github suggestions? I didn't think of that.

jehturner and others added 16 commits October 1, 2020 20:38
… re.match to match the substrings containing a median/std repr (so it doesn't only work with bottleneck installed or depend on the CPython __str__ implementation).
…of solving

readthedocs failure.

Co-authored-by: Simon Conseil <contact@saimon.org>
for how to interpret the grow parameter.

Co-authored-by: Larry Bradley <larry.bradley@gmail.com>
Co-authored-by: Larry Bradley <larry.bradley@gmail.com>
@pllim
Copy link
Member

pllim commented Oct 2, 2020

Does [skip ci] work in the github suggestions?

As long as you put it somewhere in the commit message (doesn't matter which interface creates the commit message), CI will honor the directive.

@larrybradley
Copy link
Member

Thanks, @jehturner!

@larrybradley larrybradley merged commit e8e2527 into astropy:master Oct 2, 2020
@jehturner
Copy link
Member Author

👍 Thanks for finding some time to finish looking at it.

@larrybradley
Copy link
Member

Very sorry for the delay.

@jehturner
Copy link
Member Author

No worries.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stats 💤 merge-when-ci-passes Do not use: We have auto-merge option now.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants