Performance enhancements for convolution.convolve #7293
Conversation
Hi there @jamienoss 👋 - thanks for the pull request! I'm just a friendly 🤖 that checks for issues related to the changelog and making sure that this pull request is milestoned and labeled correctly. This is mainly intended for the maintainers, so if you are not a maintainer you can ignore this, and a maintainer will let you know if any action is required on your part 😃. Everything looks good from my point of view! 👍 If there are any issues with this message, please report them here.
Force-pushed from 0dd5962 to 8531c36
This is dependent on astropy/astropy-helpers#382.

@eteq, @larrybradley, @hcferguson, @keflavich FYI. The tests will fail due to the dependence on astropy/astropy-helpers#382.
@jamienoss - thanks for working on this! I've tagged @keflavich and myself to review this. Just a quick question for now - is the change from Cython to pure C needed, as Cython does have ways of enabling multi-threading? We should also add some benchmarks in https://github.com/astropy/astropy-benchmarks to monitor the performance, at least for the single-thread case. @jamienoss, would you be able to add benchmark cases there? I made a document here about using asv and astropy-benchmarks: https://docs.google.com/document/d/1AoPBAbD8DiDVEM6HuOtPKekN3phtcCF4Qk6pxZ0ID-w/edit
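For reference, a minimal sketch of what such a benchmark case might look like; this is an assumption about the shape of an asv benchmark, not astropy-benchmarks' actual code, and `np.convolve` stands in for `astropy.convolution.convolve` so the example is self-contained:

```python
import numpy as np

class TimeConvolve:
    """Sketch of an asv-style benchmark case (hypothetical names).

    The real benchmark would call astropy.convolution.convolve;
    np.convolve stands in here so the sketch is self-contained.
    """

    def setup(self):
        # asv calls setup() before timing each time_* method
        rng = np.random.RandomState(12345)
        self.array = rng.rand(100000)
        self.kernel = np.ones(51) / 51.0  # normalized boxcar kernel

    def time_convolve_1d(self):
        np.convolve(self.array, self.kernel, mode='same')
```

asv would time `time_convolve_1d` repeatedly, giving a single-thread baseline to compare against after this PR lands.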
This is awesome and I will review, but I imagine this will take me some time to digest.
Some minor changes so far. It would be great to see some nice summary discussion and/or figures for the performance improvements in the docs.
Awesome stuff.
astropy/convolution/convolve.py
Outdated

        normalize_kernel = True
        nan_treatment='interpolate'
    else:
        raise Exception("Can't convolve 1D and 2D kernel.")
Make this some non-generic exception. Perhaps just ValueError?
astropy/convolution/convolve.py
Outdated

    if isinstance(passed_kernel, Kernel):
        if (isinstance(passed_array, Kernel1D) and isinstance(passed_kernel, Kernel1D)) or (isinstance(passed_array, Kernel2D) and isinstance(passed_kernel, Kernel2D)):
            warnings.warn("array is Kernel instance, hardwiring the following parameters: "
                          "boundary='fill', fill_value=0, normalize_Kernel=True, "
whitespace - should be indented just past the ( on the previous line?
astropy/convolution/convolve.py
Outdated

    # NaN interpolation significantly slows down the C convolution
    # compuatation. Since nan_treatment = 'interpolate', is the default
    # check whether it is even needed, if not, don't interpolate.
    # NB: np.isnan(array_internal.sum()) is fatser than np.isnan(array_internal).any()
faster.

Have you checked the performance of np.count_nonzero(np.isnan(array_internal))? It might just be that the per-pixel checking of isnan is the performance bottleneck, in which case array_internal.sum() (or maybe .max()?) is going to be the fastest.
I did, and you're correct, the perf. bottleneck is having to check each pixel for NaNs, which is why I went for sum() and then only check the result. I didn't think of using max(), but thinking about it now, sum() has fewer operations and is vectorizable. Neither can be short-circuited at any point - all pixels will be operated on.
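A self-contained sketch of the approaches being compared (exact timings are machine-dependent; the point is that the sum-based check needs only one reduction and no temporary boolean array):

```python
import numpy as np

array_internal = np.random.rand(1000, 1000)
array_internal[123, 456] = np.nan

# Single reduction: a NaN anywhere propagates through the sum.
# Caveat: an array containing both +inf and -inf also sums to NaN.
has_nan_sum = bool(np.isnan(array_internal.sum()))

# Per-pixel check: materializes a full boolean array first.
has_nan_any = bool(np.isnan(array_internal).any())

# The count_nonzero variant also needs the boolean temporary.
nan_count = np.count_nonzero(np.isnan(array_internal))

assert has_nan_sum == has_nan_any == (nan_count > 0)
```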
astropy/convolution/convolve.py
Outdated

                    "close to zero. The sum of the given kernel is < {0}"
                    .format(1. / MAX_NORMALIZATION))
    if kernel_sum < 1. / MAX_NORMALIZATION or kernel_sums_to_zero:
        raise Exception("The kernel can't be normalized, because its sum is "
this should probably also be a ValueError (I realize it was an Exception in the old code too; we may as well improve it)
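A minimal sketch of the suggested change, assuming a MAX_NORMALIZATION constant like the one in the diff; the helper name is hypothetical:

```python
import numpy as np

MAX_NORMALIZATION = 100  # illustrative value; the real constant lives in convolve.py

def normalize_kernel(kernel):
    """Hypothetical helper raising the ValueError suggested above."""
    kernel_sum = kernel.sum()
    if np.abs(kernel_sum) < 1. / MAX_NORMALIZATION:
        raise ValueError("The kernel can't be normalized, because its sum is "
                         "close to zero. The sum of the given kernel is < {0}"
                         .format(1. / MAX_NORMALIZATION))
    return kernel / kernel_sum
```

A ValueError lets callers catch the specific failure mode rather than a bare Exception.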
astropy/convolution/convolve.py
Outdated

    # here and pass through instead.
    result = np.zeros(array_internal.shape, dtype=float, order='C')

    if boundary in ['fill', 'extend', 'wrap']:
maybe better to have this as a tuple?
astropy/convolution/convolve.py
Outdated

    if array_internal.ndim == 1:
        padded_array[pad_width[0]:array_shape[0]+pad_width[0]] = array_internal
    elif array_internal.ndim == 2:
        padded_array[pad_width[0]:array_shape[0]+pad_width[0], pad_width[1]:array_shape[1]+pad_width[1]] = array_internal
wrap these long lines
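One hedged way to both shorten these lines and collapse the per-ndim branches is np.pad, which handles any dimensionality in a single call (variable names mirror the snippet; the pad widths are illustrative):

```python
import numpy as np

array_internal = np.arange(12.0).reshape(3, 4)
pad_width = (2, 1)  # per-axis pad size, as in the snippet

# Pads pad_width[i] zeros on each side of axis i, replacing the
# explicit 1D/2D slice-assignment branches.
padded_array = np.pad(array_internal,
                      [(pw, pw) for pw in pad_width],
                      mode='constant', constant_values=0)
```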
astropy/convolution/convolve.py
Outdated

    )

    # So far, normalization has only occured for nan_treatment == 'interpolate'
    # beacuse this had to happen within the C extension so as to ignore
because
Yes. The Cython OpenMP support is incredibly poor and doesn't seem to have the ability to split
I can indeed.
astropy/convolution/convolve.py
Outdated

    # NaN interpolation significantly slows down the C convolution
    # compuatation. Since nan_treatment = 'interpolate', is the default
    # check whether it is even needed, if not, don't interpolate.
    # NB: np.isnan(array_internal.sum()) is fatser than np.isnan(array_internal).any()
typos: computation, faster
astropy/convolution/setup_package.py
Outdated

    # Add '-Rpass-missed=.*' to ``extra_compile_args`` when compiling with clang
    # to report missed optimizations
    c_convolve_ext = Extension(name='c_convolve', sources=SRC_FILES,
                               extra_compile_args=['-O3', '-fPIC'],
Is -O3 needed for the optimizations mentioned above? Otherwise it may be better to leave -O2 by default; it would be more consistent with other extensions and default flags.
It is, for vectorization and other such magic. I can turn them all on individually though. Are you concerned about fp arithmetic issues, i.e. fast-math?
Actually, it looks like conda's Python uses -ftree-vectorize and -O3 by default, pyenv's Python has -O3 but not -ftree-vectorize, and Debian's Python is more conservative with -O2. So my point was more about not specifying additional options here, to inherit the default ones instead.

About fast-math, I think it is not activated with -O3. I guess it could be interesting here since you are interpolating the NaNs before calling the C code, but I'm not sure if it can lead to other issues.
I would also prefer to avoid passing -O3 explicitly and let the default from the CPython installation be used instead.
    from .utils import KernelSizeError, has_even_axis, raise_even_kernel_exception
    from ..openmp_enabled import is_openmp_enabled

    faulthandler.enable()
I guess faulthandler was useful for development but could be removed now? Also the matplotlib import above seems unused (otherwise it should be made optional). And unless I missed something, I don't see the source code for openmp_enabled.
> I guess faulthandler was useful for development but could be removed now?

I left the faulthandler in to deal with any unforeseen runtime issues; better the user gets an error message than it just quickly and silently returning. I have no strong objection to removing it though, if that is the "Pythonic" thing to do.

> Also the matplotlib import above seems unused (otherwise it should be made optional).

Good catch, must have been a copy-and-paste fat-finger.

> And unless I missed something, I don't see the source code for openmp_enabled.

This is autogenerated at build, c.f. astropy/astropy-helpers#382 (src code here).
faulthandler can be activated at startup time, and I was not sure if activating it by default has some drawback. It seems not, actually, so maybe it's fine to keep it.

> This is autogenerated at build, c.f. astropy/astropy-helpers#382 (src code here).

Ok, thanks! This is related to another question (that should maybe go in the astropy-helpers discussion): I was also working on building binary wheels for my package, which has a C extension with OpenMP, and there will be the same issue for Astropy wheels: if the wheels are built with OpenMP support, this does not necessarily mean that the system has OpenMP support. It should be fine on Linux, and maybe the OpenMP support in Clang is old enough now to activate OpenMP by default?
> It should be fine on Linux, and maybe the OpenMP support in Clang is old enough now to activate OpenMP by default?

I think llvm/clang version 3+ supports OpenMP by default, it just doesn't ship libomp with it like gcc does. You have to install it separately, which is why it requires more work to compile & link.
I left a few comments; it seems very good, and it is an interesting read for me as I'm also trying to fix/optimize a C extension with ctypes and openmp 🙂 (your addition in astropy-helpers for clang+openmp could also be useful!).
BTW, do I need to add some build dependency somewhere such that conda will build this with libomp? Also, is there somewhere to note known version dependency issues? llvm-openmp version 4.0.1 hda82c8b_0 (the version available via anaconda) causes test_convolve_models.py::test_fitting_convolve_models() to crash. I have no idea why. It works fine with gcc (v7.1.0) and the newer llvm-openmp. I saw that conda built the latest version of astropy with clang v6.?.?, so hopefully they use a newer llvm-openmp also.
@saimn Thanks for taking a look. 😄 If you're using clang and trying to optimize, add the flag mentioned above to report missed optimizations. If you'd like me to help out, take a look at the src, or give any pointers, I'd be more than glad to. 😄
astropy/convolution/setup_package.py
Outdated

    # Licensed under a 3-clause BSD style license - see LICENSE.rst

    import os
    from numpy import get_include as get_numpy_include
Please check in the other submodules how the numpy includes were added (either io.ascii or io.fits), and use the same logic here; that will at least fix the egg builds.
Will do, thanks.
I would definitely like to see @taldcroft's opinion before adding more C code to the codebase, given his experience with the speed benefits vs. maintenance burden of e.g. the fast reader.
Would be interesting to see how this plays out at https://github.com/astropy/astropy-benchmarks/ (probably need to add convolution benchmarks over there first). Changes to
Note: I'm away this week, but will review this and the helpers PR next week.
p.s. Please ping me to test this PR on my Windows machine when this is ready. C stuff can get iffy on Windows.
Everything's "iffy" on Windows. 😉 Thanks, testing on Windows would be great!
Sorry, I just assumed this was the thing to do. Shall I remove the submodule changes and just leave the submod hash, or remove all traces?
Just my 2 cents: given the clever trick to avoid duplicating code for the nan interpolation case, I think that the difference - in terms of complexity - with the Cython code is small enough. I mean, I understand your worry, but I don't think that the C code will be more difficult to maintain. And even if I love Cython, it's also true that it lacks more advanced OpenMP features.
It won't be any more difficult to maintain because it's C; in fact, it will be less. However, optimized and performant code can be a burden to maintain, in that its performance can be fragile to code changes. It's a battle/play between the will of the author and the compiler's, and being able to outthink and wield it. By this I just mean that the burden isn't necessarily because it's C, but because it's optimized for performance.

With this in mind, it's far simpler to optimize code when fewer translation layers exist. C has one, the compiler. Actually, it has two, as all languages have the CPU itself, which is free(ish) to optimize and rearrange your code. Cython has one more: the Cython compiler itself, translating it into C, which then gets compiled by a C compiler. You therefore have to write it in a particular fashion, knowing that the Cython compiler will translate it into a particular form that the C compiler will then deal with as you desired. In this respect, it is easier to maintain optimized C than optimized Cython.

Sometimes you just have to choose the right tool for the job. For this, we already know that it's not Python (actually, I hate saying and hearing this - it's got nothing to do with Python, the language. The issue is with its canonical implementation, CPython, and its love of the GIL). Anyhoo, it's not (C)Python, and it's also not Cython, at least if we want it to be sufficiently performant.
Unless changing the submodule hash in this PR w.r.t what is in

An ideal workflow (at least in my own head) is:
Building with OpenMP support is disabled by default. To build the package with OpenMP support (currently only used in ``convolution.convolve``), create and set the environment variable ``ASTROPY_SETUP_WITH_OPENMP=1``, e.g. ``ASTROPY_SETUP_WITH_OPENMP=1 pip install astropy --no-cache-dir`` or ``ASTROPY_SETUP_WITH_OPENMP=1 ./setup.py build``.

Signed-off-by: James Noss <jnoss@stsci.edu>
Signed-off-by: James Noss <jnoss@stsci.edu>
Force-pushed from 0208a7b to e0f741d
Looks like weirdness with the 32-bit CircleCI test. Is it not able to fetch astropy-helpers? Is this a caching issue?
Signed-off-by: James Noss <jnoss@stsci.edu>
Force-pushed from 267c808 to f702b41
@astrofrog c1c6a70 kills OpenMP support but leaves the C code intact. FYI, OpenMP support can still be enabled (without reverting c1c6a70), but it requires the user to add build flags and literally alter the code:
This is a result of astropy/astropy#7293

Signed-off-by: James Noss <jnoss@stsci.edu>
@bsipocz Here are the changes required to the conda-forge/astropy-feedstock: conda-forge/astropy-feedstock@master...jamienoss:changes-for-convolve-from-7293. I wasn't sure if I should open a PR or let you sort that all out when you do so anyhow for the next rel.
This is a result of astropy/astropy#7293

Signed-off-by: James Noss <jnoss@stsci.edu>
Force-pushed from f702b41 to f579e5b
Sorry, I forgot to remove the OpenMP ref in the perf notes.
Thanks. If you could open an issue pointing to that commit, or a PR (that won't be merged, though the commit can be cherry-picked from it), it would be super useful, as I won't remember this comment when doing the release in a month's time.
@astrofrog I think I'm all done, back to you.
See astropy#7971

Signed-off-by: James Noss <jnoss@stsci.edu>
Force-pushed from f579e5b to c1c6a70
This looks ready to go provided tests pass - thanks @jamienoss! And let's continue the discussion about OpenMP during v3.2 development.
Just spitballing - once the next rel is tagged, I wonder if there's any advantage to creating a specific OpenMP branch with c1c6a70 reverted? We can then scope out the principal users of convolve and see if they're interested in using it. There may also be some merit in just reverting it in master, such that all future devs act as build testers, expanding the net for catching any OpenMP build issues. Just a thought.
@jamienoss - yes, there's no reason we couldn't consider making an openmp branch in case people want to experiment. Let's discuss again in a couple of weeks once the release is out.
Perf. changes

- For nan_treatment='interpolate', pre-check the image for the presence of NaN values; if none exist, don't call the NaN interpolation code. Major perf. boost.
- ...after convolution instead of doing so before, and again after if not requested.
- For boundary in ['fill', 'extend', 'wrap'], use a padded array instead of specific index logic. This has a significant performance boost, though it does increase the memory footprint. It also significantly reduces the code base.
- ...meant reacquiring the GIL per loop iteration. Note: this could have been solved by adding the decorator @cython.cdivision(True) to the Cython functions.
- Note: Using convolve threaded with DASK is still more performant than before; however, this will overcommit threads and could temporarily make the machine unresponsive. When using DASK it is advised to set n_threads=1, which can still improve runtime perf. by 2x compared to the previous implementation. (github issue)
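The padded-array boundary trick in the perf. notes above can be sketched in miniature, with np.convolve standing in for the C kernel loop (all values illustrative):

```python
import numpy as np

array = np.array([1.0, 2.0, 3.0, 4.0])
kernel = np.array([0.25, 0.5, 0.25])
pad = len(kernel) // 2

# boundary='fill', fill_value=0: pad once up front, so the inner loop
# needs no per-edge index logic at all.
padded = np.pad(array, pad, mode='constant', constant_values=0)
result = np.convolve(padded, kernel, mode='valid')

# Same answer as zero-filled 'same'-mode convolution of the original.
expected = np.convolve(array, kernel, mode='same')
assert np.allclose(result, expected)
```

The one-time pad trades a slightly larger memory footprint for a branch-free inner loop, which is the trade-off the notes describe.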
Non perf. changes

- None -> 'fill'.

Signed-off-by: James Noss <jnoss@stsci.edu>