
Improve performance of sigma clipping #11219

Merged
33 commits merged into astropy:main on Apr 16, 2021

Conversation

@astrofrog (Member) commented Jan 6, 2021

This is an experimental PR for now, which aims to speed up sigma clipping for common parameters. It is not at all ready for review yet! But I wanted to check whether astropy.stats maintainers were interested in this in principle.

Currently sigma clipping can be quite slow under certain conditions. As an example, when sigma clipping a 3-d spectral cube, the current Python implementation uses vectorized Numpy operations, so if even a single spaxel requires e.g. 500 iterations, the clipping is carried out for the whole cube for all 500 iterations. This kind of use case can be sped up by sigma clipping each spaxel separately (but doing so efficiently requires Cython).
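
As a minimal illustration of that failure mode (not code from this PR; the shapes and the heavy-tailed spaxel are purely for demonstration):

import numpy as np
from astropy.stats import sigma_clip

rng = np.random.default_rng(42)
cube = rng.normal(size=(500, 50, 50))      # (spectral, y, x)
cube[:, 0, 0] = rng.standard_cauchy(500)   # one heavy-tailed spaxel needs many iterations

# maxiters=None iterates until no more values are rejected; the vectorized
# implementation re-processes the whole cube on every iteration, even though
# most spaxels converge after one or two.
clipped = sigma_clip(cube, sigma=3, maxiters=None, axis=0)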

In addition, the current Python implementation does a number of array copies, which can be slow. Therefore, the approach suggested in this PR is, in common cases, to delegate the calculation of the sigma clipped mask to a Cython extension.

For some real-life cases, this results in more than a 10x improvement in performance.

This PR is just a proof of concept at this point, and many aspects need to be improved/generalized (as well as adding support for Numpy's median). However, I wanted to check first whether core maintainers would be open to this kind of contribution before putting too much work into it.

cc @larrybradley

@pep8speaks commented Jan 6, 2021

Hello @astrofrog 👋! It looks like you've made some changes in your pull request, so I've checked the code again for style.

There are no PEP8 style issues with this pull request - thanks! 🎉

Comment last updated at 2021-04-15 17:47:33 UTC

@pllim (Member) commented Jan 6, 2021

I'd say they might be statistically interested within 5-sigma. 😉

@pllim pllim added this to the v4.3 milestone Jan 6, 2021
@larrybradley (Member)

@astrofrog Yes, this sounds great!

@pllim (Member) commented Jan 6, 2021

Looks like there is already a benchmark at https://github.com/astropy/astropy-benchmarks/blob/master/benchmarks/stats/sigma_clipping.py . Would be interested to see some numbers.

@mhvk (Contributor) commented Jan 7, 2021

@astrofrog - per your request, I have not looked in any detail at the implementation, but do have a general comment: did you benchmark iterating over the outer dimensions in Python, using `np.vectorize`? Numpy internally makes good use of CPU vectorization, so anything that can be expressed decently well in terms of ufuncs is generally faster than rolling your own code. It also has the non-negligible advantage that the same core code is used, so any bug fixes made in it will work for all paths.

If it is in fact faster to write the whole loop in C-like stuff, then would it be an idea to write a (g)ufunc? This makes things much easier to override for subclasses like Quantity and my new Masked class (without any worry in the code here), and is not much harder to write than cython (at least, your implementation looks essentially like C to me). It also has a far smaller memory footprint...
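
For reference, a rough sketch of what the np.vectorize route could look like, assuming a cube with the spectral axis last (the wrapper name and layout are illustrative, not from this PR):

import numpy as np
from astropy.stats import sigma_clip

# Wrap a 1-d clip so the outer (spatial) axes are looped over in Python;
# each spectrum then stops iterating as soon as it has converged.
clip_spec = np.vectorize(
    lambda spec: np.ma.getmaskarray(sigma_clip(spec, sigma=3, maxiters=None)),
    signature='(n)->(n)')

# cube is assumed to have shape (ny, nx, nspec):
# mask = clip_spec(cube)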

@saimn (Contributor) commented Jan 12, 2021

That would certainly be useful! That's something we discussed a bit in the grow PR with @jehturner (#10613), also because in DRAGONS we have a cython implementation of sigma_clip combination (https://github.com/GeminiDRSoftware/DRAGONS/blob/master/gempy/library/cython_utils.pyx). I also started working on an experimental package to combine ND arrays in Cython, which includes a sigma_clip function adapted from DRAGONS (https://github.com/saimn/ndcombine/blob/master/ndcombine/sigma_clip.pyx).

@astrofrog (Member, Author)

Thanks for all the feedback so far! I am continuing to work on this and running benchmarks and it might be a week or so before it is ready for an initial review.

@mhvk (Contributor) commented Jan 13, 2021

@astrofrog - do let me know if you find that python iteration on the outer dimensions is not fast enough and would like some help making this a gufunc - it makes life a lot easier for ndarray subclasses such as Quantity, etc., and deals with shapes of outer dimensions for you for free (e.g., a mask that applies to every image in a stack of images, a column mask, etc.).

@larrybradley (Member) commented Jan 13, 2021

I think the issue that @astrofrog is trying to solve is for cases where there are a few stubborn pixels/spaxels in the input data that require a large number of iterations before convergence. The current algorithm is vectorized over the entire input array, so even though, say, 99% of the pixels/spaxels have already converged, the sigma-clip operation is still performed on the entire array until all of the pixels/spaxels have converged. That can be expensive for large arrays. So I think the idea is to apply the sigma-clip algorithm to each pixel separately (stopping each one when it converges) to generate the final mask. Is this something that a gufunc can address (it seems @astrofrog is proposing breaking the vectorization over the entire input array)?

@mhvk (Contributor) commented Jan 13, 2021

Yes, @astrofrog's example above, where you really want to sigma clip all entries in a spaxel separately from all other spaxels (presumably the mean and sigma will be separate for each spaxel too), would be ideally suited to a gufunc - though as I noted I think you can get almost fully there by just running the current implementation sequentially over every spaxel, filling in a result spectral cube as you go (which should be fairly straightforward with np.vectorize).

@astrofrog (Member, Author)

I switched to a pure C extension as I thought it would be easier to then make a gufunc if we wanted. For now, the C extension does not use the Numpy C API for iteration because I couldn't quite figure out how to iterate over just one axis of a 2-d array, and I didn't want to waste too much time in case this was made moot by the way it was implemented for a gufunc. So for now this requires 2-d C-contiguous arrays, 64-bit floats, etc. With that in mind, here are a few small performance benchmarks:

https://gist.github.com/astrofrog/701c5e0241df921d743fd86769850b8d

Essentially, if multiple iterations are required, then for the median sigma clipping the approach here is much faster. I haven't quite figured out yet why the single-iteration case is slower though; I'll need to investigate line profiling for C extensions...

There should be two main factors for the improved performance:

  • The algorithm I wrote only sorts the data once even when there are multiple iterations
  • Each row/column (depending on how you name them) can have a different number of iterations

The examples in the notebook above don't fully cover the whole range of cases we would want to investigate, but they give an initial idea.
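
For readers wondering what the 2-d C-contiguous float64 restriction implies for callers, here is a hedged sketch of the kind of preparation a Python wrapper would presumably do (the actual wrapper in this PR may differ):

import numpy as np

def prepare_for_c(data, axis=0):
    # Move the axis being clipped to the front, flatten the remaining axes,
    # and force float64 / C-contiguity before handing the buffer to the C
    # extension; the resulting 2-d mask would be reshaped back afterwards.
    moved = np.moveaxis(np.asarray(data, dtype=np.float64), axis, 0)
    flat = np.ascontiguousarray(moved.reshape(moved.shape[0], -1))
    return flat, moved.shape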

@mhvk - since you offered, and given your experience with gufuncs, I was wondering whether you'd be interested in taking a stab at transforming the current function into a gufunc? Also if you would be interested in experimenting with np.vectorize I would be very interested in seeing the results - I tried locally but got terrible performance so I think I'm doing something wrong.

@astrofrog (Member, Author) commented Jan 19, 2021

I think the reason the one-iteration version is slower is the sorting - I'm assuming Numpy actually uses something more akin to quickselect as opposed to quicksort to find the median, the former being faster. This makes me think that it might be better to not try and be 'smart' and pre-sort the data like I did, but rather just use a similar algorithm to what Numpy or bottleneck use and focus solely on treating each 1-d array separately.

@pllim (Member) commented Jan 19, 2021

one-iteration version is slower

Is this a common use case? It seems bad to have the C version be 5x slower if it affects >50% of the use cases.

@astrofrog (Member, Author)

@pllim - I agree; I think we should probably not pre-sort the data as I'm doing here, or perhaps it could be a user-definable option.

@mhvk (Contributor) commented Jan 19, 2021

Just a quick reply on a specific item: numpy uses np.partition for getting the median.
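
For reference, a tiny example of selection versus sorting: np.partition only places the requested element in its sorted position (introselect), which is why it beats a full sort when only the median is needed.

import numpy as np

def median_via_partition(values):
    # Sketch for an odd-length 1-d array; np.median handles the general case
    # (for even lengths it partitions around the two central elements).
    k = values.size // 2
    return np.partition(values, k)[k]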

@larrybradley (Member)

To follow up on @astrofrog's comment about np.vectorize being slow -- I generally stay away from np.vectorize because of this note in the docs: "The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop." Does it provide any performance benefits vs. for loops?

@mhvk (Contributor) commented Jan 19, 2021

@larrybradley - in this case a loop over the second dimension would seem to be what the C code is implementing too. As long as the sigma-clipping itself is relatively slow, i.e., there are many points in each 1-d slice, it should be fine.

@astrofrog (Member, Author)

I have a much more efficient implementation that I'll push shortly!

@mhvk (Contributor) commented Jan 19, 2021

Trying to reproduce things... One interesting thing is that if I use a plain venv I'm a factor of 5 slower than if I use --system-site-packages -- clearly, Debian does something right in its numpy packaging vs pip install numpy!

But regardless, running sigma_clip on one 1000-element column of the 1000x500 array takes ~1/250th of the time of running on the full array, so clearly no loop is going to help there; the overhead for setting up is too large...
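
A rough way to reproduce that comparison (illustrative only; the timings depend on the machine and on whether bottleneck is installed):

import timeit
import numpy as np
from astropy.stats import sigma_clip

array = np.random.default_rng(0).normal(size=(1000, 500))

# Per-call time for the full 2-d array (clipping along axis 0)...
t_full = timeit.timeit(lambda: sigma_clip(array, axis=0), number=10) / 10
# ...versus a single 1000-element column on its own.
t_col = timeit.timeit(lambda: sigma_clip(array[:, 0]), number=10) / 10
print(t_full, t_col, t_full / t_col)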

@mhvk (Contributor) commented Jan 19, 2021

Oops, that had nothing to do with numpy - just have bottleneck installed on my system...

@astrofrog (Member, Author)

I've now pushed an updated version which does not pre-sort the data (since selection is a faster algorithm than sorting) and I'm now much happier with the results:

https://gist.github.com/astrofrog/43b17716d1f48a7ee72aad1259f46c30

For homogeneous data with a small dynamic range (which just requires one iteration) the results are comparable, while for high-dynamic-range data the approach is 10x faster (I think because the selection algorithm still does some partial sorting of the buffer and benefits from that sorting in subsequent iterations). For data where only a few 1-d arrays require many iterations the speedup can be even larger, 30x in the above notebook.

I think the next step is to modify this to either be a gufunc and/or to use the Numpy C API for iterating over the arrays.

@mhvk (Contributor) commented Jan 19, 2021

Nice! I think you get an extra, large gain because your median function effectively sorts the data "good enough" so that on any next call the median calculation is almost free.

If what numpy does is representative, this gives about a factor of three in the many-iteration case:

In [41]: a = array.copy()

In [42]: %timeit np.median(a, axis=0)
7.86 ms ± 17.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [43]: %timeit np.median(a, axis=0, overwrite_input=True)
2.47 ms ± 5.68 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


// We copy all finite values from array into the buffer

count = 0;
@mhvk (Contributor) commented on this diff:

For the gufunc, we'd mostly need to separate out this inner part of the function (which would effectively get fed the pointers data+j and mask+j and the stride m). It should be relatively straightforward. (One would have two gufuncs, one for median and one for mean.)

I'm still not 100% convinced one couldn't do the same thing in python, by making count an array that checks whether a given piece is still changing, and then using that.
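
A hedged sketch of that pure-numpy idea (illustrative code, not from this PR): track which columns are still changing and only recompute the bounds for those.

import numpy as np

def sigma_clip_columns(data, sigma=3.0, maxiters=100):
    # Clip each column of a 2-d array independently, revisiting only the
    # columns whose mask changed in the previous iteration.
    data = np.asarray(data, dtype=float)
    mask = ~np.isfinite(data)
    active = np.arange(data.shape[1])  # columns that are still changing
    for _ in range(maxiters):
        if active.size == 0:
            break
        sub = np.where(mask[:, active], np.nan, data[:, active])
        center = np.nanmedian(sub, axis=0)
        std = np.nanstd(sub, axis=0)
        new_mask = mask[:, active] | (np.abs(sub - center) > sigma * std)
        changed = (new_mask != mask[:, active]).any(axis=0)
        mask[:, active] = new_mask
        active = active[changed]  # drop columns that have converged
    return mask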

@astrofrog (Member, Author) replied on Jan 21, 2021:

I've now split up the code into the main algorithm vs the Numpy C extension part - does this seem better now?

Regarding doing this in Python, I tried but could not figure out a way to efficiently operate on the subset of values that were relevant without allocating/deallocating lots of arrays. Would you be interested in taking a stab at this for comparison? (as I agree if we can get the same performance with pure Python code we should avoid adding C extensions!)

@mhvk (Contributor) replied:

I also felt it was rather daunting... I'll try to have another look, but perhaps the gufunc route is nicer - I think your base C code is nice & simple, and with the gufunc the surrounding code will be a bit simpler/more standard too.

A possible extension might be to allow calling out to a python cenfunc and stdfunc, which would unify everything again.

Anyway, will think a bit more (but probably next week...)

@astrofrog (Member, Author) replied:

A possible extension might be to allow calling out to a python cenfunc and stdfunc, which would unify everything again.

Yes this occurred to me, although it might not be efficient if the Python function has to be called once per 'row/column'.


count = 0;
for (i = 0; i < n; i++) {
    if (data[i * m + j] == data[i * m + j]) {
@mhvk (Contributor) commented on this diff:

My sense would be that rather than rely on non-finite elements not comparing equal, it might make sense to instead start with an input mask. But that's really an implementation detail.
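
For readers unfamiliar with the idiom in the snippet above: a floating-point value compares unequal to itself only when it is NaN, so the check skips NaNs (but not infinities), which is part of why an explicit input mask may be cleaner. A trivial illustration:

import numpy as np

x = np.array([1.0, np.nan, np.inf])
print(x == x)            # [ True False  True]  -- only NaN fails self-comparison
print(np.isfinite(x))    # [ True False False]  -- an explicit mask also catches inf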

@astrofrog (Member, Author) replied:

Done

@astrofrog (Member, Author)

I have rebased this and updated the changelog entry.

@larrybradley - see above for my reply about the copy option - I have added a bugfix changelog fragment describing this, so if you agree then this PR should be good to go if CI passes.

@larrybradley (Member) left a comment:

Thanks, @astrofrog!

@larrybradley larrybradley added the 💤 merge-when-ci-passes label Apr 15, 2021
@larrybradley larrybradley merged commit 8d26b90 into astropy:main Apr 16, 2021
@pllim (Member) commented Apr 19, 2021

I didn't confirm but I strongly suspect this PR introduced more failures to the already spectacularly failing ppc64le job. Example log: https://github.com/astropy/astropy/runs/2378063026?check_suite_focus=true

FAILED astropy/stats/tests/test_biweight.py::test_biweight_32bit_runtime_warnings
FAILED astropy/stats/tests/test_sigma_clipping.py::test_sigma_clip_dtypes[>f4]
FAILED astropy/stats/tests/test_sigma_clipping.py::test_sigma_clip_dtypes[<f4]
FAILED astropy/stats/tests/test_sigma_clipping.py::test_sigma_clip_dtypes[<i4]

@larrybradley (Member)

I've also discovered another bug introduced by this PR that I'm fixing now. I don't think it's related to the above failures.

@larrybradley (Member)

#11604 fixes the bug I found with photutils tests.

There's yet another bug with this PR that the astroimtools cron CI job revealed (it's not fixed by #11604) that I need to investigate.

@pllim (Member) commented Apr 19, 2021

@larrybradley, worst comes to worst, we can revert this PR...

@larrybradley (Member)

@pllim Ha, no chance of that! I'm sure it's another small issue.

@larrybradley (Member)

Second bug addressed in #11610

Labels: Enhancement, Performance, stats, 💤 merge-when-ci-passes
6 participants