
Improve performance of sigma clipping #11219

Merged
33 commits merged into astropy:main on Apr 16, 2021

Conversation

@astrofrog (Member) commented Jan 6, 2021

This is an experimental PR for now, which aims to speed up sigma clipping for common parameters. It is not at all ready for review yet! But I wanted to check whether astropy.stats maintainers were interested in this in principle.

Currently sigma clipping can be quite slow under certain conditions. As an example, when sigma clipping a 3-d spectral cube, the current Python implementation uses vectorized Numpy operations, so if even a single spaxel requires e.g. 500 iterations, the clipping is carried out for the whole cube for all 500 iterations. This kind of use case can be sped up by sigma clipping each spaxel separately (but doing so efficiently requires Cython).
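
As a minimal illustration of that failure mode (not code from this PR; the shapes and the heavy-tailed spaxel are purely for demonstration):

import numpy as np
from astropy.stats import sigma_clip

rng = np.random.default_rng(42)
cube = rng.normal(size=(500, 50, 50))      # (spectral, y, x)
cube[:, 0, 0] = rng.standard_cauchy(500)   # one heavy-tailed spaxel needs many iterations

# maxiters=None iterates until no more values are rejected; the vectorized
# implementation re-processes the whole cube on every iteration, even though
# most spaxels converge after one or two.
clipped = sigma_clip(cube, sigma=3, maxiters=None, axis=0)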

In addition, the current Python implementation does a number of array copies, which can be slow. Therefore, the approach suggested in this PR is, in common cases, to delegate the calculation of the sigma clipped mask to a Cython extension.

For some real-life cases, this results in more than a 10x improvement in performance.

This PR is just a proof of concept at this point, and many aspects need to be improved/generalized (as well as adding support for Numpy's median). However, I wanted to check first whether core maintainers would be open to this kind of contribution before putting too much work into it.

cc @larrybradley

@pep8speaks commented Jan 6, 2021

Hello @astrofrog 👋! It looks like you've made some changes in your pull request, so I've checked the code again for style.

There are no PEP8 style issues with this pull request - thanks! 🎉

Comment last updated at 2021-04-15 17:47:33 UTC

@pllim (Member) commented Jan 6, 2021

I'd say they might be statistically interested within 5-sigma. 😉

@pllim pllim added this to the v4.3 milestone Jan 6, 2021
@larrybradley (Member)

@astrofrog Yes, this sounds great!

@pllim (Member) commented Jan 6, 2021

Looks like there is already a benchmark at https://github.com/astropy/astropy-benchmarks/blob/master/benchmarks/stats/sigma_clipping.py . Would be interested to see some numbers.

@mhvk (Contributor) commented Jan 7, 2021

@astrofrog - per your request, I have not looked in any detail at the implementation, but do have a general comment: did you benchmark iterating over the outer dimensions in Python, using `np.vectorize`? Numpy internally makes good use of CPU vectorization, so anything that can be expressed decently well in terms of ufuncs is generally faster than rolling your own code. It also has the non-negligible advantage that the same core code is used, so any bug fixes made in it will work for all paths.

If it is in fact faster to write the whole loop in C-like stuff, then would it be an idea to write a (g)ufunc? This makes things much easier to override for subclasses like Quantity and my new Masked class (without any worry in the code here), and is not much harder to write than cython (at least, your implementation looks essentially like C to me). It also has a far smaller memory footprint...
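
For reference, a rough sketch of what the np.vectorize route could look like, assuming a cube with the spectral axis last (the wrapper name and layout are illustrative, not from this PR):

import numpy as np
from astropy.stats import sigma_clip

# Wrap a 1-d clip so the outer (spatial) axes are looped over in Python;
# each spectrum then stops iterating as soon as it has converged.
clip_spec = np.vectorize(
    lambda spec: np.ma.getmaskarray(sigma_clip(spec, sigma=3, maxiters=None)),
    signature='(n)->(n)')

# cube is assumed to have shape (ny, nx, nspec):
# mask = clip_spec(cube)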

@saimn (Contributor) commented Jan 12, 2021

That would certainly be useful! That's something we discussed a bit in the grow PR with @jehturner (#10613), also because in DRAGONS we have a cython implementation of sigma_clip combination (https://github.com/GeminiDRSoftware/DRAGONS/blob/master/gempy/library/cython_utils.pyx). I also started working on an experimental package to combine ND arrays in Cython, which includes a sigma_clip function adapted from DRAGONS (https://github.com/saimn/ndcombine/blob/master/ndcombine/sigma_clip.pyx).

@astrofrog (Member, Author)

Thanks for all the feedback so far! I am continuing to work on this and running benchmarks and it might be a week or so before it is ready for an initial review.

@mhvk (Contributor) commented Jan 13, 2021

@astrofrog - do let me know if you find that python iteration on the outer dimensions is not fast enough and would like some help making this a gufunc - it makes life a lot easier for ndarray subclasses such as Quantity, etc., and deals with shapes of outer dimensions for you for free (e.g., a mask that applies to every image in a stack of images, a column mask, etc.).

@larrybradley (Member) commented Jan 13, 2021

I think the issue that @astrofrog is trying to solve is for cases where there are a few stubborn pixels/spaxels in the input data that require a large number of iterations before convergence. The current algorithm is vectorized over the entire input array, so even though, say, 99% of the pixels/spaxels have already converged, the sigma-clip operation is still performed on the entire array until all of the pixels/spaxels have converged. That can be expensive for large arrays. So I think the idea is to apply the sigma-clip algorithm to each pixel separately (stopping each one when it converges) to generate the final mask. Is this something that a gufunc can address (it seems @astrofrog is proposing breaking the vectorization over the entire input array)?

@mhvk (Contributor) commented Jan 13, 2021

Yes, @astrofrog's example above, where you really want to sigma clip all entries in a spaxel separately from all other spaxels (presumably the mean and sigma will be separate for each spaxel too), would be ideally suited to a gufunc - though as I noted I think you can get almost fully there by just running the current implementation sequentially over every spaxel, filling in a result spectral cube as you go (which should be fairly straightforward with np.vectorize).

@astrofrog (Member, Author)

I switched to a pure C extension as I thought it would be easier to then make a gufunc if we wanted. For now, the C extension does not use the Numpy C API for iteration because I couldn't quite figure out how to iterate over just one axis of a 2-d array, and I didn't want to waste too much time in case this was made moot by the way it was implemented for a gufunc. So for now this requires 2-d C-contiguous arrays, 64-bit floats, etc. With that in mind, here are a few small performance benchmarks:

https://gist.github.com/astrofrog/701c5e0241df921d743fd86769850b8d

Essentially, if multiple iterations are required, then for the median sigma clipping the approach here is much faster. I haven't quite figured out yet why the single-iteration case is slower though; I'll need to investigate line profiling for C extensions...

There should be two main factors for the improved performance:

  • The algorithm I wrote only sorts the data once even when there are multiple iterations
  • Each row/column (depending on how you name them) can have a different number of iterations

The examples in the notebook above don't fully cover the whole range of cases we would want to investigate, but they give an initial idea.
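
For readers wondering what the 2-d C-contiguous float64 restriction implies for callers, here is a hedged sketch of the kind of preparation a Python wrapper would presumably do (the actual wrapper in this PR may differ):

import numpy as np

def prepare_for_c(data, axis=0):
    # Move the axis being clipped to the front, flatten the remaining axes,
    # and force float64 / C-contiguity before handing the buffer to the C
    # extension; the resulting 2-d mask would be reshaped back afterwards.
    moved = np.moveaxis(np.asarray(data, dtype=np.float64), axis, 0)
    flat = np.ascontiguousarray(moved.reshape(moved.shape[0], -1))
    return flat, moved.shape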

@mhvk - since you offered, and given your experience with gufuncs, I was wondering whether you'd be interested in taking a stab at transforming the current function into a gufunc? Also if you would be interested in experimenting with np.vectorize I would be very interested in seeing the results - I tried locally but got terrible performance so I think I'm doing something wrong.

@astrofrog (Member, Author) commented Jan 19, 2021

I think the reason the one-iteration version is slower is the sorting - I'm assuming Numpy actually uses something more akin to quickselect as opposed to quicksort to find the median, the former being faster. This makes me think that it might be better to not try and be 'smart' and pre-sort the data like I did, but rather just use a similar algorithm to what Numpy or bottleneck use and focus solely on treating each 1-d array separately.

@pllim (Member) commented Jan 19, 2021

one-iteration version is slower

Is this a common use case? It seems bad to have the C version be 5x slower if it affects >50% of the use cases.

@astrofrog (Member, Author)

@pllim - I agree; I think we should probably not pre-sort the data as I'm doing here, or perhaps it could be a user-definable option.

@mhvk (Contributor) commented Jan 19, 2021

Just a quick reply on a specific item: numpy uses np.partition for getting the median.
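
For reference, a tiny example of selection versus sorting: np.partition only places the requested element in its sorted position (introselect), which is why it beats a full sort when only the median is needed.

import numpy as np

def median_via_partition(values):
    # Sketch for an odd-length 1-d array; np.median handles the general case
    # (for even lengths it partitions around the two central elements).
    k = values.size // 2
    return np.partition(values, k)[k]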

@larrybradley (Member)

To follow up on @astrofrog's comment about np.vectorize being slow -- I generally stay away from np.vectorize because of this note in the docs: "The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop." Does it provide any performance benefits vs. for loops?

@mhvk (Contributor) commented Jan 19, 2021

@larrybradley - in this case a loop over the second dimension would seem to be what the C code is implementing too. As long as the sigma-clipping itself is relatively slow, i.e., there are many points in each 1-d slice, it should be fine.

@astrofrog (Member, Author)

I have a much more efficient implementation that I'll push shortly!

@mhvk (Contributor) commented Jan 19, 2021

Trying to reproduce things... One interesting thing is that if I use a plain venv I'm a factor of 5 slower than if I use --system-site-packages -- clearly, Debian does something right in its numpy packaging vs pip install numpy!

But regardless, running sigma_clip on one 1000-element column of the 1000x500 array takes ~1/250th of the time of running on the full array, so clearly no loop is going to help there; the overhead for setting up is too large...
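
A rough way to reproduce that comparison (illustrative only; the timings depend on the machine and on whether bottleneck is installed):

import timeit
import numpy as np
from astropy.stats import sigma_clip

array = np.random.default_rng(0).normal(size=(1000, 500))

# Per-call time for the full 2-d array (clipping along axis 0)...
t_full = timeit.timeit(lambda: sigma_clip(array, axis=0), number=10) / 10
# ...versus a single 1000-element column on its own.
t_col = timeit.timeit(lambda: sigma_clip(array[:, 0]), number=10) / 10
print(t_full, t_col, t_full / t_col)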

@mhvk (Contributor) commented Jan 19, 2021

Oops, that had nothing to do with numpy - just have bottleneck installed on my system...

@astrofrog (Member, Author)

I've now pushed an updated version which does not pre-sort the data (since selection is a faster algorithm than sorting) and I'm now much happier with the results:

https://gist.github.com/astrofrog/43b17716d1f48a7ee72aad1259f46c30

For homogeneous data with a small dynamic range (which just requires one iteration) the results are comparable, while for high-dynamic-range data the approach is 10x faster (I think because the selection algorithm still does some partial sorting of the buffer and benefits from that sorting in subsequent iterations). For data where only a few 1-d arrays require many iterations the speedup can be even larger, 30x in the above notebook.

I think the next step is to modify this to either be a gufunc and/or to use the Numpy C API for iterating over the arrays.

@mhvk (Contributor) commented Jan 19, 2021

Nice! I think you get an extra, large gain because your median function effectively sorts the data "good enough" so that on any next call the median calculation is almost free.

If what numpy does is representative, this gives about a factor of three in the many-iteration case:

In [41]: a = array.copy()

In [42]: %timeit np.median(a, axis=0)
7.86 ms ± 17.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [43]: %timeit np.median(a, axis=0, overwrite_input=True)
2.47 ms ± 5.68 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


// We copy all finite values from array into the buffer

count = 0;
@mhvk (Contributor) commented on this diff:

For the gufunc, we'd mostly need to separate out this inner part of the function (which would effectively get fed the pointers data+j and mask+j and the stride m). It should be relatively straightforward. (One would have two gufuncs, one for median and one for mean.)

I'm still not 100% convinced one couldn't do the same thing in python, by making count an array that checks whether a given piece is still changing, and then using that.
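
A hedged sketch of that pure-numpy idea (illustrative code, not from this PR): track which columns are still changing and only recompute the bounds for those.

import numpy as np

def sigma_clip_columns(data, sigma=3.0, maxiters=100):
    # Clip each column of a 2-d array independently, revisiting only the
    # columns whose mask changed in the previous iteration.
    data = np.asarray(data, dtype=float)
    mask = ~np.isfinite(data)
    active = np.arange(data.shape[1])  # columns that are still changing
    for _ in range(maxiters):
        if active.size == 0:
            break
        sub = np.where(mask[:, active], np.nan, data[:, active])
        center = np.nanmedian(sub, axis=0)
        std = np.nanstd(sub, axis=0)
        new_mask = mask[:, active] | (np.abs(sub - center) > sigma * std)
        changed = (new_mask != mask[:, active]).any(axis=0)
        mask[:, active] = new_mask
        active = active[changed]  # drop columns that have converged
    return mask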

@astrofrog (Member, Author) replied on Jan 21, 2021:

I've now split up the code into the main algorithm vs the Numpy C extension part - does this seem better now?

Regarding doing this in Python, I tried but could not figure out a way to efficiently operate on the subset of values that were relevant without allocating/deallocating lots of arrays. Would you be interested in taking a stab at this for comparison? (as I agree if we can get the same performance with pure Python code we should avoid adding C extensions!)

@mhvk (Contributor) replied:

I also felt it was rather daunting... I'll try to have another look, but perhaps the gufunc route is nicer - I think your base C code is nice & simple, and with the gufunc the surrounding code will be a bit simpler/more standard too.

A possible extension might be to allow calling out to a python cenfunc and stdfunc, which would unify everything again.

Anyway, will think a bit more (but probably next week...)

@astrofrog (Member, Author) replied:

A possible extension might be to allow calling out to a python cenfunc and stdfunc, which would unify everything again.

Yes this occurred to me, although it might not be efficient if the Python function has to be called once per 'row/column'.


count = 0;
for (i = 0; i < n; i++) {
    if (data[i * m + j] == data[i * m + j]) {
@mhvk (Contributor) commented on this diff:

My sense would be that rather than rely on non-finite elements not comparing equal, it might make sense to instead start with an input mask. But that's really an implementation detail.
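
For readers unfamiliar with the idiom in the snippet above: a floating-point value compares unequal to itself only when it is NaN, so the check skips NaNs (but not infinities), which is part of why an explicit input mask may be cleaner. A trivial illustration:

import numpy as np

x = np.array([1.0, np.nan, np.inf])
print(x == x)            # [ True False  True]  -- only NaN fails self-comparison
print(np.isfinite(x))    # [ True False False]  -- an explicit mask also catches inf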

@astrofrog (Member, Author) replied:

Done

@astrofrog (Member, Author)

I have rebased this and updated the changelog entry.

@larrybradley - see above for my reply about the copy option - I have added a bugfix changelog fragment describing this, so if you agree then this PR should be good to go if CI passes.

@larrybradley (Member) left a comment:

Thanks, @astrofrog!

@larrybradley larrybradley added the 💤 merge-when-ci-passes label Apr 15, 2021
@larrybradley larrybradley merged commit 8d26b90 into astropy:main Apr 16, 2021
@pllim (Member) commented Apr 19, 2021

I didn't confirm but I strongly suspect this PR introduced more failures to the already spectacularly failing ppc64le job. Example log: https://github.com/astropy/astropy/runs/2378063026?check_suite_focus=true

FAILED astropy/stats/tests/test_biweight.py::test_biweight_32bit_runtime_warnings
FAILED astropy/stats/tests/test_sigma_clipping.py::test_sigma_clip_dtypes[>f4]
FAILED astropy/stats/tests/test_sigma_clipping.py::test_sigma_clip_dtypes[<f4]
FAILED astropy/stats/tests/test_sigma_clipping.py::test_sigma_clip_dtypes[<i4]

@larrybradley (Member)

I've also discovered another bug introduced by this PR that I'm fixing now. I don't think it's related to the above failures.

@larrybradley (Member)

#11604 fixes the bug I found with photutils tests.

There's yet another bug with this PR that the astroimtools cron CI job revealed (it's not fixed by #11604) that I need to investigate.

@pllim (Member) commented Apr 19, 2021

@larrybradley, worst comes to worst, we can revert this PR...

@larrybradley (Member)

@pllim Ha, no chance of that! I'm sure it's another small issue.

@larrybradley (Member)

Second bug addressed in #11610

Labels: Enhancement, Performance, stats, 💤 merge-when-ci-passes
6 participants