New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance enhancements for convolution.convolve #7293

Merged
merged 14 commits into from Oct 26, 2018

Conversation

@jamienoss
Contributor

jamienoss commented Mar 16, 2018

Perf. changes

  • Separate code for interpolating NaN values and when not. Major perf. boost.
  • If nan_treatment=‘interpolate’, pre-check image for presence of NaN values,
    if none exist, don’t call NaN interpolation code. Major perf. boost.
  • Hoist loop dependent variable computations to correct loop scope.
  • Untangle NaN interpolation and kernel normalization. Don’t normalize kernel until
    after convolution instead of doing so before and again after if not requested.
  • Thread with OPENMP. Required replacing Cython code (.pyx) with raw C. Major perf. boost.
  • For boundary in [‘fill’, ‘extend’, ‘wrap’] use a padded array instead of specific index logic. Major perf. boost.
    This has a significant performance boost, though it does increase the memory footprint.
    It also signifcantly reduces the code base.
  • Improve DASK performance as previous Cython code used Python division and thus
    meant reacquiring the GIL per loop iteration.
    Note: this could have been solved by adding the decorator: @cython.cdivision(True) to the Cython functions.
    Note: Using convolve threaded with DASK is still more performant than before however,
    this will overcommit threads and could temporarily make the machine unresponsive. When using DASK
    it is advised to make n_threads=1, and can still improve runtime perf. by 2 compared to previous implementation.
    github issue
  • Hoist all exceptions to as early as possible, especially before memory allocation or heavy(ish) computations.

Non Perf. changes

  • Correct doc string for default boundary case None -> ’fill’.

Signed-off-by: James Noss jnoss@stsci.edu

pr-bench-figure

@astropy-bot

This comment has been minimized.

astropy-bot bot commented Mar 16, 2018

Hi there @jamienoss 👋 - thanks for the pull request! I'm just a friendly 🤖 that checks for issues related to the changelog and making sure that this pull request is milestoned and labeled correctly. This is mainly intended for the maintainers, so if you are not a maintainer you can ignore this, and a maintainer will let you know if any action is required on your part 😃.

Everything looks good from my point of view! 👍

If there are any issues with this message, please report them here.

@jamienoss jamienoss force-pushed the jamienoss:astropy-convolve.pr branch from 0dd5962 to 8531c36 Mar 16, 2018

@jamienoss

This comment has been minimized.

Contributor

jamienoss commented Mar 16, 2018

This is dependent on astropy/astropy-helpers#382

@jamienoss

This comment has been minimized.

Contributor

jamienoss commented Mar 16, 2018

@eteq, @larrybradley, @hcferguson, @keflavich FYI.
I still need to add the perf. comparisons to github for this PR.

The tests will fail due to the dependence on astropy/astropy-helpers#382.

@astrofrog

This comment has been minimized.

Member

astrofrog commented Mar 16, 2018

@jamienoss - thanks for working on this! I've tagged @keflavich and myself to review this. Just a quick question for now - is the change from Cython to pure-C needed, as Cython does have ways of enabling multi-threading?

We should also add some benchmarks in https://github.com/astropy/astropy-benchmarks to monitor the performance, at least for the single-thread case. @jamienoss, would you be able to add benchmark cases there? I made a document here about using asv and astropy-benchmarks: https://docs.google.com/document/d/1AoPBAbD8DiDVEM6HuOtPKekN3phtcCF4Qk6pxZ0ID-w/edit

@keflavich

This comment has been minimized.

Contributor

keflavich commented Mar 16, 2018

This is awesome and I will review, but I imagine this will take me some time to digest.

@keflavich

Some minor changes so far. It would be great to see some nice summary discussion and/or figures for the performance improvements in the docs.

Awesome stuff.

normalize_kernel = True
nan_treatment='interpolate'
else:
raise Exception("Can't convolve 1D and 2D kernel.")

This comment has been minimized.

@keflavich

keflavich Mar 16, 2018

Contributor

Make this some non-generic exception. Perhaps just ValueError?

if isinstance(passed_kernel, Kernel):
if (isinstance(passed_array, Kernel1D) and isinstance(passed_kernel, Kernel1D)) or (isinstance(passed_array, Kernel2D) and isinstance(passed_kernel, Kernel2D)):
warnings.warn("array is Kernel instance, hardwiring the following parameters: "
"boundary='fill', fill_value=0, normalize_Kernel=True, "

This comment has been minimized.

@keflavich

keflavich Mar 16, 2018

Contributor

whitespace - should be indented just past the ( on the previous line?

# NaN interpolation significantly slows down the C convolution
# compuatation. Since nan_treatment = 'interpolate', is the default
# check whether it is even needed, if not, don't interpolate.
# NB: np.isnan(array_internal.sum()) is fatser than np.isnan(array_internal).any()

This comment has been minimized.

@keflavich

keflavich Mar 16, 2018

Contributor

faster.

Have you checked performance of np.count_nonzero(np.isnan(array_internal))? It might just be that the per-pixel checking of isnan is the performance bottleneck, in which case array_internal.sum() (or maybe .max()?) is going to be the fastest.

This comment has been minimized.

@jamienoss

jamienoss Mar 16, 2018

Contributor

I did, and you're correct, the perf. bottleneck is having to check each pix for NaNs. Which is why I went for sum(), and then only check the result. I didn't think of using max(), but thinking about it now, sum() has fewer operations and is vectorizable.

Neither can be short-circuited at any point - all pixels will be operated on.

"close to zero. The sum of the given kernel is < {0}"
.format(1. / MAX_NORMALIZATION))
if kernel_sum < 1. / MAX_NORMALIZATION or kernel_sums_to_zero:
raise Exception("The kernel can't be normalized, because its sum is "

This comment has been minimized.

@keflavich

keflavich Mar 16, 2018

Contributor

this should probably also be a ValueError (I realize it was an Exception in the old code too; we may as well improve it)

# here and pass through instead.
result = np.zeros(array_internal.shape, dtype=float, order='C')
if boundary in ['fill', 'extend', 'wrap']:

This comment has been minimized.

@keflavich

keflavich Mar 16, 2018

Contributor

maybe better to have this as a tuple?

if array_internal.ndim == 1:
padded_array[pad_width[0]:array_shape[0]+pad_width[0]] = array_internal
elif array_internal.ndim == 2:
padded_array[pad_width[0]:array_shape[0]+pad_width[0], pad_width[1]:array_shape[1]+pad_width[1]] = array_internal

This comment has been minimized.

@keflavich

keflavich Mar 16, 2018

Contributor

wrap these long lines

)
# So far, normalization has only occured for nan_treatment == 'interpolate'
# beacuse this had to happen within the C extension so as to ignore

This comment has been minimized.

@keflavich

keflavich Mar 16, 2018

Contributor

because

@jamienoss

This comment has been minimized.

Contributor

jamienoss commented Mar 16, 2018

@astrofrog

is the change from Cython to pure-C needed, as Cython does have ways of enabling multi-threading?

Yes. The Cython OpenMP support is incredibly poor and doesn't seem to have the ability to split omp parallel for into its separate blocks. Also, there is no way to assign to const variables which is a big issue when trying to help out the compiler with optimizations.

would you be able to add benchmark cases there? I made a document here about using asv and astropy-benchmarks:

I can indeed.

# NaN interpolation significantly slows down the C convolution
# compuatation. Since nan_treatment = 'interpolate', is the default
# check whether it is even needed, if not, don't interpolate.
# NB: np.isnan(array_internal.sum()) is fatser than np.isnan(array_internal).any()

This comment has been minimized.

@saimn

saimn Mar 16, 2018

Contributor

typos: computation, faster

# Add '-Rpass-missed=.*' to ``extra_compile_args`` when compiling with clang
# to report missed optimizations
c_convolve_ext = Extension(name='c_convolve', sources=SRC_FILES,
extra_compile_args=['-O3', '-fPIC'],

This comment has been minimized.

@saimn

saimn Mar 16, 2018

Contributor

Is -O3 needed for the optimizations mentioned above ? Otherwise it may be better to leave -O2 by default, it would be more consistent with other extensions and default flags.

This comment has been minimized.

@jamienoss

jamienoss Mar 16, 2018

Contributor

It is, for vectorization and other such magic. I can turn them all on individually though. Are you concerned about fp arithmetic issues, i.e. fast-math?

This comment has been minimized.

@saimn

saimn Mar 16, 2018

Contributor

Actually it looks like conda's Python uses -ftree-vectorize and -O3 by default, pyenv's Python has -O3 but not -ftree-vectorize, Debian's Python is more conservative with -O2. So actually my point was more about not specifying additional options here to inherit the default ones instead.

About fast-math, I think it is not activated with -O3, I guess it could be interesting here since you are interpolating the NaNs before calling the C code, but not sure if it can lead to other issues.

This comment has been minimized.

@astrofrog

astrofrog Apr 11, 2018

Member

I would also prefer to avoid passing -O3 explicitly and let the default from the CPython installation be used instead.

from .utils import KernelSizeError, has_even_axis, raise_even_kernel_exception
from ..openmp_enabled import is_openmp_enabled
faulthandler.enable()

This comment has been minimized.

@saimn

saimn Mar 16, 2018

Contributor

I guess faulthandler was useful for development but could be removed now ?
Also the matplotlib import above seems unused (otherwise it should be made optional).
And unless I missed something, I don't see the source code for openmp_enabled.

This comment has been minimized.

@jamienoss

jamienoss Mar 16, 2018

Contributor

I guess faulthandler was useful for development but could be removed now ?

I left the faulthandler in to deal with any unforeseen runtime issues, better the user gets an error message than it just quickly and silently returning. I have no strong objection to removing it though if that is the "Python" thing to do.

Also the matplotlib import above seems unused (otherwise it should be made optional).

Good catch, must have been a copy-and-paste fat-finger.

And unless I missed something, I don't see the source code for openmp_enabled.

This is autogenerated at build c.f. astropy/astropy-helpers#382 src code here

This comment has been minimized.

@saimn

saimn Mar 16, 2018

Contributor

faulthandler can be activated at startup time, and I was not sure if activating it by default has some drawback. It seems not actually, so maybe it's fine to keep it.

This is autogenerated at build c.f. astropy/astropy-helpers#382 src code here

Ok, thanks! This is related to another question (that should maybe go in the astropy-helpers discussion): I was also working on building binary wheels for my package with the C extension with OpenMP, and there will be the same issue for Astropy wheels: if the wheels are built with OpenMP support, this does not necessary mean that the system has OpenMP support. It should be fine on Linux, and maybe the OpenMP support in Clang is old enough now to activate OpenMP by default ?

This comment has been minimized.

@jamienoss

jamienoss Mar 16, 2018

Contributor

It should be fine on Linux, and maybe the OpenMP support in Clang is old enough now to activate OpenMP by default

I think llvm/clang version 3+ supports OpenMP by default, it just doesn't ship libomp with it like gcc does. You have to install it separately, which is why it requires more work to compile & link.

@saimn

This comment has been minimized.

Contributor

saimn commented Mar 16, 2018

I left a few comments, it seems very good and it is an interesting reading for me as I'm also trying to fix/optimize a C extension with ctypes and openmp 🙂 (your addition in astropy-helpers for clang+openmp could also be useful!).

@jamienoss

This comment has been minimized.

Contributor

jamienoss commented Mar 16, 2018

BTW do I need to add some build dependency somewhere such that conda will build this with libomp?

Also, is there somewhere to note know version dependency issues? llvm-openmp version 4.0.1 hda82c8b_0 (the version available via anconda) causes test_convolve_models.py::test_fitting_convolve_models() to crash. I have no idea why. It works fine with gcc (v7.1.0) and the newer llvm-openmp. I saw that conda built the latest version of astropy with clang v6.?.? so hopefully they use a newer version of llvm-also.

@jamienoss

This comment has been minimized.

Contributor

jamienoss commented Mar 16, 2018

@saimn Thanks for taking a look. 😄

If you're using clang and trying to optimize, add -Rpass-missed=.* to extra_compile_args when compiling to report missed optimizations.

If you'd like me to help out, take a look at the src, or give you any pointers I'd be more than glad to. 😄

# Licensed under a 3-clause BSD style license - see LICENSE.rst
import os
from numpy import get_include as get_numpy_include

This comment has been minimized.

@bsipocz

bsipocz Mar 16, 2018

Member

Please check the other submodules how the numpy includes were added (either io.ascii or io.fits), and use the same logic here, that will at least fix the egg builds.

This comment has been minimized.

@jamienoss

jamienoss Mar 16, 2018

Contributor

Will do, thanks.

@bsipocz

This comment has been minimized.

Member

bsipocz commented Mar 16, 2018

I would definitely like to see @taldcroft's opinion before adding more C code to the codebase about his experience with speed benefits vs maintenance burden of e.g. the fastreader?

@pllim

This comment has been minimized.

Member

pllim commented Mar 16, 2018

Would be interesting to see how this play out at https://github.com/astropy/astropy-benchmarks/ (probably need to add convolution benchmarks over there first).

Changes to astropy-helpers seem unnecessary given that it is already noted tests would fail until the helpers PR is merged.

@pllim pllim added this to the v3.1 milestone Mar 16, 2018

@astrofrog

This comment has been minimized.

Member

astrofrog commented Mar 16, 2018

Note: I'm away this week, but will review this and the helpers PR next week.

@pllim

This comment has been minimized.

Member

pllim commented Mar 16, 2018

p.s. Please ping me to test this PR on my Windows machine when this is ready. C stuff can get iffy on Windows.

@pllim pllim added this to Needs decision in 3.1 Feature Planning Mar 16, 2018

@jamienoss

This comment has been minimized.

Contributor

jamienoss commented Mar 16, 2018

@pllim

p.s. Please ping me to test this PR on my Windows machine when this is ready. C stuff can get iffy on Windows.

Everything's "iffy" on Windows. 😉

Thanks, testing on Windows would be great!

@jamienoss

This comment has been minimized.

Contributor

jamienoss commented Mar 16, 2018

@pllim

Changes to astropy-helpers seem unnecessary given that it is already noted tests would fail until the helpers PR is merged.

Sorry, I just assumed this was the thing to do. Shall I remove the submodule changes and just leave the submod hash or remove all traces?

@saimn

This comment has been minimized.

Contributor

saimn commented Mar 16, 2018

@bsipocz - I would definitely like to see @taldcroft's opinion before adding more C code to the codebase about his experience with speed benefits vs maintenance burden of e.g. the fastreader?

Just my 2 cents: given the clever trick to avoid duplicating code for the nan interpolation case, I think that the difference - in terms of complexity - with the Cython code is small enough. I mean, I understand your worry, but I don't think that the C code will be more difficult to maintain. And even if I love Cython, it's also true that it lacks more advanced OpenMP features.

@jamienoss

This comment has been minimized.

Contributor

jamienoss commented Mar 17, 2018

@saimn & @bsipocz

I don't think that the C code will be more difficult to maintain.

It won't be any more difficult to maintain because it's C, in fact, it will be less. However, optimized and performant code can be a burden to maintain, in that its performance can be fragile to code changes. It's a battle/play between the will of the author and the compiler's, and being able to outthink it and wield it. By this, I just mean that it's not necessarily a burden due to the language, i.e. the burden isn't necessarily because it's C but because it's optimized for performance.

With this in mind, it's far simpler to optimize code when less translational layers exist. For example, C has one, the compiler. Actually, it has two as all have the CPU itself which is free(ish) to optimize and rearrange your code. Cython has one more, the Cython compiler itself, translating it into C, which then gets compiled by a C compiler. You, therefore, have to write it in a particular fashion knowing that the Cython compiler will then translate it into a particular form that the C compiler will then deal with as you desired. In this respect, it is easier to maintain optimized C than optimized Cython.

Sometimes, you just have to choose the right tool for the job. For this, we already know that it's not Python (actually I hate saying and hearing this - it's got nothing to do with Python, it's a language. The issue is with its canonical implementation, CPython, and its love of the GIL.) Anyhoo, it's not (C)Python and it's also not Cython, at least if we want it to be sufficiently performant.

@pllim

This comment has been minimized.

Member

pllim commented Mar 18, 2018

Shall I remove the submodule changes and just leave the submod hash or remove all traces?

Unless changing submodule hash in this PR w.r.t what is in astropy master can make the tests pass until astropy-helpers upstream can catch up, the diff is not worth having. Ideally, the diff here should only contain code to be reviewed.

An ideal workflow (at least in my own head) is:

  • Merge your helpers PR.
  • Have Astropy master update submodule to include that change separately.
  • Rebase this one to pick up that change. At this point, CIs should pass.

jamienoss added some commits Oct 18, 2018

Use size_t instead of unsigned int
Signed-off-by: James Noss <jnoss@stsci.edu>
Update astropy-helpers
Signed-off-by: James Noss <jnoss@stsci.edu>
Disable OpenMP support at setup time.
Building with OpenMP support is disabled by default. To build the package with
OpenMP support (currently only used in ``convolution.convolve``) create and
set the environment variable ``ASTROPY_SETUP_WITH_OPENMP=1``.
E.g. ``ASTROPY_SETUP_WITH_OPENMP=1 pip install astropy --no-cache-dir`` or
``ASTROPY_SETUP_WITH_OPENMP=1 ./setup.py build``.

Signed-off-by: James Noss <jnoss@stsci.edu>
Add perf. notes for convolve
Signed-off-by: James Noss <jnoss@stsci.edu>

@jamienoss jamienoss force-pushed the jamienoss:astropy-convolve.pr branch from 0208a7b to e0f741d Oct 25, 2018

@jamienoss

This comment has been minimized.

Contributor

jamienoss commented Oct 25, 2018

Looks like weirdness with the 32bit cicleci test.

#!/bin/bash -eo pipefail
PYTHONHASHSEED=42 /opt/python/cp36-cp36m/bin/python setup.py test --parallel=4 -V -a "--durations=50"
Initializing astropy_helpers submodule with: `git submodule update --init -- astropy_helpers`
warning: no previously-included files found matching '*.pyc'
warning: no previously-included files found matching '*.o'
no previously-included directories found matching 'build'
no previously-included directories found matching 'astropy_helpers/tests'
Traceback (most recent call last):
  File "setup.py", line 52, in <module>
    generate_openmp_enabled_py(NAME, disable_openmp=disable_openmp)
TypeError: generate_openmp_enabled_py() got an unexpected keyword argument 'disable_openmp'
Exited with code 1

Is it not able to fetch astropy-helpers? Is this a caching issue?

Travis: export CC during before_install stage
Signed-off-by: James Noss <jnoss@stsci.edu>

@jamienoss jamienoss referenced this pull request Oct 25, 2018

Closed

OpenMP Support #7971

@jamienoss jamienoss force-pushed the jamienoss:astropy-convolve.pr branch 2 times, most recently from 267c808 to f702b41 Oct 25, 2018

@jamienoss

This comment has been minimized.

Contributor

jamienoss commented Oct 25, 2018

@astrofrog c1c6a70 kills OpenMP support but leaves the C code intact.

FYI OpenMP support can still be enabled (without reverting c1c6a70) but it requires the user to add build flags and literally alter the code:

  • Remove the #undef _OPENMP here
  • Increase n_threads in convolve() here
  • Add OpenMP build flags, e.g. for gcc: CFLAGS=-fopenmp ./setup build

jamienoss added a commit to jamienoss/astropy-feedstock that referenced this pull request Oct 25, 2018

Update build tests for convolution.convolve()
This is a result of astropy/astropy#7293

Signed-off-by: James Noss <jnoss@stsci.edu>
@jamienoss

This comment has been minimized.

Contributor

jamienoss commented Oct 25, 2018

@bsipocz Here are the changes required to the conda-forge/astropy-feedstock

conda-forge/astropy-feedstock@master...jamienoss:changes-for-convolve-from-7293

I wasn't sure if I should open a PR or let you sort that all out when you do so anyhow for the next rel.

jamienoss added a commit to jamienoss/astropy-feedstock that referenced this pull request Oct 25, 2018

Update build tests for convolution.convolve()
This is a result of astropy/astropy#7293

Signed-off-by: James Noss <jnoss@stsci.edu>

@jamienoss jamienoss force-pushed the jamienoss:astropy-convolve.pr branch from f702b41 to f579e5b Oct 25, 2018

@jamienoss

This comment has been minimized.

Contributor

jamienoss commented Oct 25, 2018

Sorry, I forgot to remove the OpenMP ref in the perf notes.

@bsipocz

This comment has been minimized.

Member

bsipocz commented Oct 25, 2018

conda-forge/astropy-feedstock@master...jamienoss:changes-for-convolve-from-7293

Thanks. If you could open an issue pointing to that commit, or a PR (that won't be merged though but the commit can be cherry-picked from it), it would be super useful as I won't remember this comment when doing the release in a month time.

@jamienoss

This comment has been minimized.

Contributor

jamienoss commented Oct 26, 2018

@astrofrog I think I'm all done, back to you.

Remove OpenMP thread support from convolution.convolve()
See #7971

Signed-off-by: James Noss <jnoss@stsci.edu>

@jamienoss jamienoss force-pushed the jamienoss:astropy-convolve.pr branch from f579e5b to c1c6a70 Oct 26, 2018

@astrofrog

This looks ready to go provided tests pass - thanks @jamienoss! And let's continue the discussion about OpenMP during v3.2 development.

@jamienoss

This comment has been minimized.

Contributor

jamienoss commented Oct 26, 2018

Just spitballing - once the next rel is tagged, I wonder if there's any advantage to creating a specific OpenMP branch with c1c6a70 reverted? We can then scope out the principal users of convolve and see if they're interested in using it.

There may also be some merit in just reverting it in master such that all future devs act as build testers, expanding the net for catching any OpenMP build issues.

Just a thought.

@astrofrog astrofrog merged commit f9681a7 into astropy:master Oct 26, 2018

12 checks passed

astropy-bot All checks passed
Details
ci/circleci: 32bit Your tests passed on CircleCI!
Details
ci/circleci: egg-info-37 Your tests passed on CircleCI!
Details
ci/circleci: html-docs Your tests passed on CircleCI!
Details
ci/circleci: image-tests-mpl202 Your tests passed on CircleCI!
Details
ci/circleci: image-tests-mpl212 Your tests passed on CircleCI!
Details
ci/circleci: image-tests-mpl222 Your tests passed on CircleCI!
Details
ci/circleci: image-tests-mpl300 Your tests passed on CircleCI!
Details
ci/circleci: image-tests-mpldev Your tests passed on CircleCI!
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
giles Click details to preview the documentation build
Details
@astrofrog

This comment has been minimized.

Member

astrofrog commented Oct 26, 2018

@jamienoss - yes there's no reason we couldn't consider making an openmp branch in case people want to experiment. Let's discuss again in a couple of weeks once the release is out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment