
Performance enhancements for convolution.convolve #7293

Merged
merged 14 commits into astropy:master from jamienoss:astropy-convolve.pr Oct 26, 2018

Conversation

jamienoss
Contributor

@jamienoss jamienoss commented Mar 16, 2018

Perf. changes

  • Separate code paths for interpolating NaN values and when not. Major perf. boost.
  • If nan_treatment='interpolate', pre-check the image for the presence of NaN values;
    if none exist, don't call the NaN interpolation code. Major perf. boost.
  • Hoist loop-dependent variable computations to the correct loop scope.
  • Untangle NaN interpolation from kernel normalization. Don't normalize the kernel until
    after convolution, instead of normalizing before and correcting afterwards when
    normalization was not requested.
  • Thread with OpenMP. This required replacing the Cython code (.pyx) with raw C. Major perf. boost.
  • For boundary in ['fill', 'extend', 'wrap'], use a padded array instead of boundary-specific index logic.
    This gives a significant performance boost, though it does increase the memory footprint.
    It also significantly reduces the code base.
  • Improve Dask performance: the previous Cython code used Python division and thus
    had to reacquire the GIL on every loop iteration.
    Note: this could also have been solved by adding the decorator @cython.cdivision(True) to the Cython functions.
    Note: using convolve threaded with Dask is still more performant than before; however,
    this will overcommit threads and could temporarily make the machine unresponsive. When using Dask
    it is advised to set n_threads=1, which still improves runtime perf. by a factor of ~2 compared to the previous implementation.
    github issue
  • Hoist all exception checks as early as possible, especially before memory allocation or heavy(ish) computations.
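The first two bullets - separate code paths plus a cheap NaN pre-check - can be sketched in pure numpy. This is a toy 1-D illustration of the dispatch idea, not astropy's actual implementation; `convolve1d` is a hypothetical name, and the slow path uses normalized convolution to stand in for the real NaN-interpolating C code:

```python
import numpy as np

def convolve1d(arr, kernel, nan_treatment='interpolate'):
    """Toy 1-D convolution sketching the PR's dispatch idea (not astropy's code)."""
    arr = np.asarray(arr, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    # Cheap pre-check: a single NaN poisons the sum, so one reduction tells
    # us whether the expensive interpolating path is needed at all.
    if nan_treatment == 'interpolate' and np.isnan(arr.sum()):
        # Slow path: normalized convolution, which interpolates across NaNs.
        good = ~np.isnan(arr)
        num = np.convolve(np.where(good, arr, 0.0), kernel, mode='same')
        den = np.convolve(good.astype(float), kernel, mode='same')
        with np.errstate(invalid='ignore', divide='ignore'):
            return num / den
    # Fast path: no NaNs present, plain convolution.
    return np.convolve(arr, kernel, mode='same')
```

With kernel `[1, 1, 1]`, `convolve1d([1, nan, 3], ...)` fills the gap with the mean of its neighbours; arrays without NaNs never pay for the masking and division.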

Non-perf. changes

  • Correct the doc string for the default boundary case: None -> 'fill'.

Signed-off-by: James Noss jnoss@stsci.edu

[Image: pr-bench-figure]

@astropy-bot

astropy-bot bot commented Mar 16, 2018

Hi there @jamienoss 👋 - thanks for the pull request! I'm just a friendly 🤖 that checks for issues related to the changelog and making sure that this pull request is milestoned and labeled correctly. This is mainly intended for the maintainers, so if you are not a maintainer you can ignore this, and a maintainer will let you know if any action is required on your part 😃.

Everything looks good from my point of view! 👍

If there are any issues with this message, please report them here.

@jamienoss
Contributor Author

jamienoss commented Mar 16, 2018

This is dependent on astropy/astropy-helpers#382

@jamienoss
Contributor Author

jamienoss commented Mar 16, 2018

@eteq, @larrybradley, @hcferguson, @keflavich FYI.
I still need to add the perf. comparisons to github for this PR.

The tests will fail due to the dependence on astropy/astropy-helpers#382.

@astrofrog
Member

@jamienoss - thanks for working on this! I've tagged @keflavich and myself to review this. Just a quick question for now - is the change from Cython to pure-C needed, as Cython does have ways of enabling multi-threading?

We should also add some benchmarks in https://github.com/astropy/astropy-benchmarks to monitor the performance, at least for the single-thread case. @jamienoss, would you be able to add benchmark cases there? I made a document here about using asv and astropy-benchmarks: https://docs.google.com/document/d/1AoPBAbD8DiDVEM6HuOtPKekN3phtcCF4Qk6pxZ0ID-w/edit

@keflavich
Contributor

This is awesome and I will review, but I imagine this will take me some time to digest.

Contributor

@keflavich keflavich left a comment


Some minor changes so far. It would be great to see some nice summary discussion and/or figures for the performance improvements in the docs.

Awesome stuff.

normalize_kernel = True
nan_treatment='interpolate'
else:
raise Exception("Can't convolve 1D and 2D kernel.")
Contributor

Make this some non-generic exception. Perhaps just ValueError?

if isinstance(passed_kernel, Kernel):
if (isinstance(passed_array, Kernel1D) and isinstance(passed_kernel, Kernel1D)) or (isinstance(passed_array, Kernel2D) and isinstance(passed_kernel, Kernel2D)):
warnings.warn("array is Kernel instance, hardwiring the following parameters: "
"boundary='fill', fill_value=0, normalize_Kernel=True, "
Contributor

whitespace - should be indented just past the ( on the previous line?

# NaN interpolation significantly slows down the C convolution
# compuatation. Since nan_treatment = 'interpolate', is the default
# check whether it is even needed, if not, don't interpolate.
# NB: np.isnan(array_internal.sum()) is fatser than np.isnan(array_internal).any()
Contributor

faster.

Have you checked performance of np.count_nonzero(np.isnan(array_internal))? It might just be that the per-pixel checking of isnan is the performance bottleneck, in which case array_internal.sum() (or maybe .max()?) is going to be the fastest.

Contributor Author

@jamienoss jamienoss Mar 16, 2018

I did, and you're correct - the perf. bottleneck is having to check each pixel for NaNs, which is why I went for sum() and then only check the result. I didn't think of using max(), but thinking about it now, sum() has fewer operations and is vectorizable.

Neither can be short-circuited at any point - all pixels will be operated on.
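To make the discussion above concrete, here is a quick check of the three NaN-detection idioms mentioned (semantics only - actual timings depend on array size and hardware), including one caveat of the sum() trick that the thread does not mention:

```python
import numpy as np

a_clean = np.arange(12, dtype=float).reshape(3, 4)
a_nan = a_clean.copy()
a_nan[1, 2] = np.nan

# Three ways to ask "any NaN?":
slow  = np.isnan(a_nan).any()                  # allocates a boolean mask
count = np.count_nonzero(np.isnan(a_nan)) > 0  # also allocates, counts in C
fast  = np.isnan(a_nan.sum())                  # one reduction, no temporary

assert slow and count and fast
assert not np.isnan(a_clean.sum())

# Caveat for the sum() trick: inf - inf is NaN, so an array containing
# +inf and -inf (but no NaN) gives a false positive.
tricky = np.array([np.inf, -np.inf])
assert np.isnan(tricky.sum()) and not np.isnan(tricky).any()
```

For a convolution input, a false positive on an all-inf array merely takes the slower (still correct) path, so the trade-off seems reasonable.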

"close to zero. The sum of the given kernel is < {0}"
.format(1. / MAX_NORMALIZATION))
if kernel_sum < 1. / MAX_NORMALIZATION or kernel_sums_to_zero:
raise Exception("The kernel can't be normalized, because its sum is "
Contributor

this should probably also be a ValueError (I realize it was an Exception in the old code too; we may as well improve it)
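The suggested ValueError version of the guard quoted above might look like the following. A sketch only: `MAX_NORMALIZATION = 100` is an assumed value for illustration - the real constant lives in astropy's convolve.py:

```python
import numpy as np

MAX_NORMALIZATION = 100  # assumed value for illustration

def normalize_kernel(kernel):
    """Raise ValueError (not a bare Exception) when the kernel can't be normalized."""
    kernel = np.asarray(kernel, dtype=float)
    kernel_sum = kernel.sum()
    if abs(kernel_sum) < 1. / MAX_NORMALIZATION:
        raise ValueError("The kernel can't be normalized, because its sum is "
                         "close to zero. The sum of the given kernel is < {0}"
                         .format(1. / MAX_NORMALIZATION))
    return kernel / kernel_sum
```

ValueError is the conventional choice here since the failure is caused by the value of an argument, and it lets callers catch it specifically instead of a blanket `except Exception`.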

# here and pass through instead.
result = np.zeros(array_internal.shape, dtype=float, order='C')

if boundary in ['fill', 'extend', 'wrap']:
Contributor

maybe better to have this as a tuple?

if array_internal.ndim == 1:
padded_array[pad_width[0]:array_shape[0]+pad_width[0]] = array_internal
elif array_internal.ndim == 2:
padded_array[pad_width[0]:array_shape[0]+pad_width[0], pad_width[1]:array_shape[1]+pad_width[1]] = array_internal
Contributor

wrap these long lines
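For reference, the manual slicing in the quoted lines inserts the original array into a pre-padded one; the three boundary modes map naturally onto `np.pad` modes (an illustration of the padding idea, not the PR's code):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
pad = 2

# boundary='fill'   -> pad with a constant fill_value
fill   = np.pad(a, pad, mode='constant', constant_values=0)
# boundary='extend' -> repeat the edge value
extend = np.pad(a, pad, mode='edge')
# boundary='wrap'   -> periodic boundary
wrap   = np.pad(a, pad, mode='wrap')

assert fill.tolist()   == [0, 0, 1, 2, 3, 0, 0]
assert extend.tolist() == [1, 1, 1, 2, 3, 3, 3]
assert wrap.tolist()   == [2, 3, 1, 2, 3, 1, 2]
```

Once the input is padded like this, the inner convolution loop needs no boundary branches at all, which is where the "Major perf. boost" in the PR description comes from.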

)

# So far, normalization has only occured for nan_treatment == 'interpolate'
# beacuse this had to happen within the C extension so as to ignore
Contributor

because

@jamienoss
Contributor Author

jamienoss commented Mar 16, 2018

@astrofrog

is the change from Cython to pure-C needed, as Cython does have ways of enabling multi-threading?

Yes. The Cython OpenMP support is incredibly poor and doesn't seem to have the ability to split omp parallel for into its separate blocks. Also, there is no way to assign to const variables which is a big issue when trying to help out the compiler with optimizations.

would you be able to add benchmark cases there? I made a document here about using asv and astropy-benchmarks:

I can indeed.

# NaN interpolation significantly slows down the C convolution
# compuatation. Since nan_treatment = 'interpolate', is the default
# check whether it is even needed, if not, don't interpolate.
# NB: np.isnan(array_internal.sum()) is fatser than np.isnan(array_internal).any()
Contributor

typos: computation, faster

# Add '-Rpass-missed=.*' to ``extra_compile_args`` when compiling with clang
# to report missed optimizations
c_convolve_ext = Extension(name='c_convolve', sources=SRC_FILES,
extra_compile_args=['-O3', '-fPIC'],
Contributor

Is -O3 needed for the optimizations mentioned above ? Otherwise it may be better to leave -O2 by default, it would be more consistent with other extensions and default flags.

Contributor Author

It is, for vectorization and other such magic. I can turn them all on individually though. Are you concerned about fp arithmetic issues, i.e. fast-math?

Contributor

Actually it looks like conda's Python uses -ftree-vectorize and -O3 by default, pyenv's Python has -O3 but not -ftree-vectorize, Debian's Python is more conservative with -O2. So actually my point was more about not specifying additional options here to inherit the default ones instead.

About fast-math, I think it is not activated with -O3, I guess it could be interesting here since you are interpolating the NaNs before calling the C code, but not sure if it can lead to other issues.

Member

I would also prefer to avoid passing -O3 explicitly and let the default from the CPython installation be used instead.

from .utils import KernelSizeError, has_even_axis, raise_even_kernel_exception
from ..openmp_enabled import is_openmp_enabled

faulthandler.enable()
Contributor

I guess faulthandler was useful for development but could be removed now ?
Also the matplotlib import above seems unused (otherwise it should be made optional).
And unless I missed something, I don't see the source code for openmp_enabled.

Contributor Author

I guess faulthandler was useful for development but could be removed now ?

I left the faulthandler in to deal with any unforeseen runtime issues - better that the user gets an error message than the call just quickly and silently returning. I have no strong objection to removing it, though, if that is the "Pythonic" thing to do.

Also the matplotlib import above seems unused (otherwise it should be made optional).

Good catch, must have been a copy-and-paste fat-finger.

And unless I missed something, I don't see the source code for openmp_enabled.

This is autogenerated at build c.f. astropy/astropy-helpers#382 src code here

Contributor

faulthandler can be activated at startup time, and I was not sure if activating it by default has some drawback. It seems not actually, so maybe it's fine to keep it.

This is autogenerated at build c.f. astropy/astropy-helpers#382 src code here

Ok, thanks! This is related to another question (that should maybe go in the astropy-helpers discussion): I was also working on building binary wheels for my package with a C extension using OpenMP, and there will be the same issue for Astropy wheels: if the wheels are built with OpenMP support, that does not necessarily mean the target system has OpenMP support. It should be fine on Linux, and maybe the OpenMP support in Clang is old enough now to activate OpenMP by default?

Contributor Author

It should be fine on Linux, and maybe the OpenMP support in Clang is old enough now to activate OpenMP by default

I think llvm/clang version 3+ supports OpenMP by default; it just doesn't ship libomp with it like gcc does. You have to install it separately, which is why it requires more work to compile and link.
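On the faulthandler question discussed above: enabling it programmatically is one line, and the same protection is available at interpreter startup without touching library code - a quick illustration:

```python
import faulthandler

# What the PR does: enable at import time so a segfault in the C
# extension produces a Python-level traceback instead of a bare crash.
faulthandler.enable()
assert faulthandler.is_enabled()

# Equivalent opt-in alternatives, with no library change needed:
#   python -X faulthandler script.py
#   PYTHONFAULTHANDLER=1 python script.py
```

Enabling it unconditionally has little downside (it only installs signal handlers for fatal signals), which matches the conclusion reached in the thread.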

@saimn
Contributor

saimn commented Mar 16, 2018

I left a few comments, it seems very good and it is an interesting reading for me as I'm also trying to fix/optimize a C extension with ctypes and openmp 🙂 (your addition in astropy-helpers for clang+openmp could also be useful!).

@jamienoss
Contributor Author

BTW, do I need to add a build dependency somewhere such that conda will build this with libomp?

Also, is there somewhere to note known version dependency issues? llvm-openmp version 4.0.1 hda82c8b_0 (the version available via anaconda) causes test_convolve_models.py::test_fitting_convolve_models() to crash. I have no idea why. It works fine with gcc (v7.1.0) and the newer llvm-openmp. I saw that conda built the latest version of astropy with clang v6.?.?, so hopefully they use a newer version of llvm-openmp also.

@jamienoss
Contributor Author

@saimn Thanks for taking a look. 😄

If you're using clang and trying to optimize, add -Rpass-missed=.* to extra_compile_args when compiling to report missed optimizations.

If you'd like me to help out, take a look at the src, or give you any pointers I'd be more than glad to. 😄

# Licensed under a 3-clause BSD style license - see LICENSE.rst

import os
from numpy import get_include as get_numpy_include
Member

Please check the other submodules how the numpy includes were added (either io.ascii or io.fits), and use the same logic here, that will at least fix the egg builds.

Contributor Author

Will do, thanks.

@bsipocz
Member

bsipocz commented Mar 16, 2018

I would definitely like to see @taldcroft's opinion before adding more C code to the codebase - e.g. his experience with the speed benefits vs. the maintenance burden of the fast reader.

@pllim
Member

pllim commented Mar 16, 2018

Would be interesting to see how this plays out at https://github.com/astropy/astropy-benchmarks/ (we'd probably need to add convolution benchmarks over there first).

Changes to astropy-helpers seem unnecessary given that it is already noted tests would fail until the helpers PR is merged.

@pllim pllim added this to the v3.1 milestone Mar 16, 2018
@astrofrog
Member

Note: I'm away this week, but will review this and the helpers PR next week.

@pllim
Member

pllim commented Mar 16, 2018

p.s. Please ping me to test this PR on my Windows machine when this is ready. C stuff can get iffy on Windows.

@pllim pllim added this to Needs decision in 3.1 Feature Planning Mar 16, 2018
@jamienoss
Contributor Author

@pllim

p.s. Please ping me to test this PR on my Windows machine when this is ready. C stuff can get iffy on Windows.

Everything's "iffy" on Windows. 😉

Thanks, testing on Windows would be great!

@jamienoss
Contributor Author

@pllim

Changes to astropy-helpers seem unnecessary given that it is already noted tests would fail until the helpers PR is merged.

Sorry, I just assumed this was the thing to do. Shall I remove the submodule changes and just leave the submod hash or remove all traces?

@saimn
Contributor

saimn commented Mar 16, 2018

@bsipocz - I would definitely like to see @taldcroft's opinion before adding more C code to the codebase about his experience with speed benefits vs maintenance burden of e.g. the fastreader?

Just my 2 cents: given the clever trick to avoid duplicating code for the nan interpolation case, I think that the difference - in terms of complexity - with the Cython code is small enough. I mean, I understand your worry, but I don't think that the C code will be more difficult to maintain. And even if I love Cython, it's also true that it lacks more advanced OpenMP features.

@jamienoss
Contributor Author

jamienoss commented Mar 17, 2018

@saimn & @bsipocz

I don't think that the C code will be more difficult to maintain.

It won't be any more difficult to maintain because it's C; in fact, it will be less so. However, optimized, performant code can be a burden to maintain in that its performance can be fragile to code changes. It's a battle/play between the will of the author and that of the compiler, and being able to outthink it and wield it. By this I just mean that the burden isn't necessarily because it's C, but because it's optimized for performance.

With this in mind, it's far simpler to optimize code when fewer translation layers exist. For example, C has one: the compiler. Actually, it has two, as you also have the CPU itself, which is free(ish) to optimize and rearrange your code. Cython has one more: the Cython compiler itself, translating it into C, which then gets compiled by a C compiler. You therefore have to write it in a particular fashion, knowing that the Cython compiler will translate it into a particular form that the C compiler will then deal with as you desired. In this respect, it is easier to maintain optimized C than optimized Cython.

Sometimes you just have to choose the right tool for the job. For this we already know that it's not Python (actually, I hate saying and hearing this - it's got nothing to do with Python, the language. The issue is with its canonical implementation, CPython, and its love of the GIL). Anyhoo, it's not (C)Python, and it's also not Cython, at least if we want it to be sufficiently performant.

@pllim
Member

pllim commented Mar 18, 2018

Shall I remove the submodule changes and just leave the submod hash or remove all traces?

Unless changing submodule hash in this PR w.r.t what is in astropy master can make the tests pass until astropy-helpers upstream can catch up, the diff is not worth having. Ideally, the diff here should only contain code to be reviewed.

An ideal workflow (at least in my own head) is:

  • Merge your helpers PR.
  • Have Astropy master update submodule to include that change separately.
  • Rebase this one to pick up that change. At this point, CIs should pass.

Building with OpenMP support is disabled by default. To build the package with
OpenMP support (currently only used in ``convolution.convolve``) create and
set the environment variable ``ASTROPY_SETUP_WITH_OPENMP=1``.
E.g. ``ASTROPY_SETUP_WITH_OPENMP=1 pip install astropy --no-cache-dir`` or
``ASTROPY_SETUP_WITH_OPENMP=1 ./setup.py build``.

Signed-off-by: James Noss <jnoss@stsci.edu>
Signed-off-by: James Noss <jnoss@stsci.edu>
@jamienoss
Contributor Author

Looks like weirdness with the 32-bit CircleCI test.

#!/bin/bash -eo pipefail
PYTHONHASHSEED=42 /opt/python/cp36-cp36m/bin/python setup.py test --parallel=4 -V -a "--durations=50"
Initializing astropy_helpers submodule with: `git submodule update --init -- astropy_helpers`
warning: no previously-included files found matching '*.pyc'
warning: no previously-included files found matching '*.o'
no previously-included directories found matching 'build'
no previously-included directories found matching 'astropy_helpers/tests'
Traceback (most recent call last):
  File "setup.py", line 52, in <module>
    generate_openmp_enabled_py(NAME, disable_openmp=disable_openmp)
TypeError: generate_openmp_enabled_py() got an unexpected keyword argument 'disable_openmp'
Exited with code 1

Is it not able to fetch astropy-helpers? Is this a caching issue?

Signed-off-by: James Noss <jnoss@stsci.edu>
@jamienoss jamienoss mentioned this pull request Oct 25, 2018
@jamienoss jamienoss force-pushed the astropy-convolve.pr branch 2 times, most recently from 267c808 to f702b41 Compare October 25, 2018 18:48
@jamienoss
Contributor Author

jamienoss commented Oct 25, 2018

@astrofrog c1c6a70 kills OpenMP support but leaves the C code intact.

FYI OpenMP support can still be enabled (without reverting c1c6a70) but it requires the user to add build flags and literally alter the code:

  • Remove the #undef _OPENMP here
  • Increase n_threads in convolve() here
  • Add OpenMP build flags, e.g. for gcc: CFLAGS=-fopenmp ./setup build

jamienoss added a commit to jamienoss/astropy-feedstock that referenced this pull request Oct 25, 2018
This is a result of astropy/astropy#7293

Signed-off-by: James Noss <jnoss@stsci.edu>
@jamienoss
Contributor Author

jamienoss commented Oct 25, 2018

@bsipocz Here are the changes required to the conda-forge/astropy-feedstock

conda-forge/astropy-feedstock@master...jamienoss:changes-for-convolve-from-7293

I wasn't sure if I should open a PR or let you sort that all out when you do so anyhow for the next rel.

jamienoss added a commit to jamienoss/astropy-feedstock that referenced this pull request Oct 25, 2018
This is a result of astropy/astropy#7293

Signed-off-by: James Noss <jnoss@stsci.edu>
@jamienoss
Contributor Author

Sorry, I forgot to remove the OpenMP ref in the perf notes.

@bsipocz
Member

bsipocz commented Oct 25, 2018

conda-forge/astropy-feedstock@master...jamienoss:changes-for-convolve-from-7293

Thanks. If you could open an issue pointing to that commit, or a PR (which won't be merged, but the commit can be cherry-picked from it), it would be super useful, as I won't remember this comment when doing the release in a month's time.

@jamienoss
Contributor Author

@astrofrog I think I'm all done, back to you.

See astropy#7971

Signed-off-by: James Noss <jnoss@stsci.edu>
Member

@astrofrog astrofrog left a comment


This looks ready to go provided tests pass - thanks @jamienoss! And let's continue the discussion about OpenMP during v3.2 development.

@jamienoss
Contributor Author

Just spitballing - once the next release is tagged, I wonder if there's any advantage to creating a specific OpenMP branch with c1c6a70 reverted? We could then scope out the principal users of convolve and see if they're interested in using it.

There may also be some merit in just reverting it in master such that all future devs act as build testers, expanding the net for catching any OpenMP build issues.

Just a thought.

@astrofrog astrofrog merged commit f9681a7 into astropy:master Oct 26, 2018
@astrofrog
Member

@jamienoss - yes there's no reason we couldn't consider making an openmp branch in case people want to experiment. Let's discuss again in a couple of weeks once the release is out.
