
Performance enhancements for convolution.convolve #7293

Merged
merged 14 commits into astropy:master from jamienoss:astropy-convolve.pr Oct 26, 2018

Conversation

jamienoss
Contributor

@jamienoss jamienoss commented Mar 16, 2018

Perf. changes

  • Separate code paths for interpolating NaN values and when not. Major perf. boost.
  • If nan_treatment='interpolate', pre-check the image for the presence of NaN values;
    if none exist, don't call the NaN interpolation code. Major perf. boost.
  • Hoist loop-dependent variable computations to the correct loop scope.
  • Untangle NaN interpolation from kernel normalization. Don't normalize the kernel until
    after convolution, instead of normalizing before and correcting afterwards when
    normalization was not requested.
  • Thread with OpenMP. This required replacing the Cython code (.pyx) with raw C. Major perf. boost.
  • For boundary in ['fill', 'extend', 'wrap'], use a padded array instead of boundary-specific index logic.
    This gives a significant performance boost, though it does increase the memory footprint.
    It also significantly reduces the code base.
  • Improve Dask performance: the previous Cython code used Python division and thus
    had to reacquire the GIL on every loop iteration.
    Note: this could also have been solved by adding the decorator @cython.cdivision(True) to the Cython functions.
    Note: using convolve threaded with Dask is still more performant than before; however,
    this will overcommit threads and could temporarily make the machine unresponsive. When using Dask
    it is advised to set n_threads=1, which still improves runtime perf. by a factor of ~2 compared to the previous implementation.
    github issue
  • Hoist all exception checks as early as possible, especially before memory allocation or heavy(ish) computations.
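The first two bullets - separate code paths plus a cheap NaN pre-check - can be sketched in pure numpy. This is a toy 1-D illustration of the dispatch idea, not astropy's actual implementation; `convolve1d` is a hypothetical name, and the slow path uses normalized convolution to stand in for the real NaN-interpolating C code:

```python
import numpy as np

def convolve1d(arr, kernel, nan_treatment='interpolate'):
    """Toy 1-D convolution sketching the PR's dispatch idea (not astropy's code)."""
    arr = np.asarray(arr, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    # Cheap pre-check: a single NaN poisons the sum, so one reduction tells
    # us whether the expensive interpolating path is needed at all.
    if nan_treatment == 'interpolate' and np.isnan(arr.sum()):
        # Slow path: normalized convolution, which interpolates across NaNs.
        good = ~np.isnan(arr)
        num = np.convolve(np.where(good, arr, 0.0), kernel, mode='same')
        den = np.convolve(good.astype(float), kernel, mode='same')
        with np.errstate(invalid='ignore', divide='ignore'):
            return num / den
    # Fast path: no NaNs present, plain convolution.
    return np.convolve(arr, kernel, mode='same')
```

With kernel `[1, 1, 1]`, `convolve1d([1, nan, 3], ...)` fills the gap with the mean of its neighbours; arrays without NaNs never pay for the masking and division.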

Non-perf. changes

  • Correct the doc string for the default boundary case: None -> 'fill'.

Signed-off-by: James Noss jnoss@stsci.edu

[Image: pr-bench-figure]

@astropy-bot

astropy-bot bot commented Mar 16, 2018

Hi there @jamienoss 👋 - thanks for the pull request! I'm just a friendly 🤖 that checks for issues related to the changelog and making sure that this pull request is milestoned and labeled correctly. This is mainly intended for the maintainers, so if you are not a maintainer you can ignore this, and a maintainer will let you know if any action is required on your part 😃.

Everything looks good from my point of view! 👍

If there are any issues with this message, please report them here.

@jamienoss
Contributor Author

jamienoss commented Mar 16, 2018

This is dependent on astropy/astropy-helpers#382

@jamienoss
Contributor Author

jamienoss commented Mar 16, 2018

@eteq, @larrybradley, @hcferguson, @keflavich FYI.
I still need to add the perf. comparisons to github for this PR.

The tests will fail due to the dependence on astropy/astropy-helpers#382.

@astrofrog
Member

@jamienoss - thanks for working on this! I've tagged @keflavich and myself to review this. Just a quick question for now - is the change from Cython to pure-C needed, as Cython does have ways of enabling multi-threading?

We should also add some benchmarks in https://github.com/astropy/astropy-benchmarks to monitor the performance, at least for the single-thread case. @jamienoss, would you be able to add benchmark cases there? I made a document here about using asv and astropy-benchmarks: https://docs.google.com/document/d/1AoPBAbD8DiDVEM6HuOtPKekN3phtcCF4Qk6pxZ0ID-w/edit

@keflavich
Contributor

This is awesome and I will review, but I imagine this will take me some time to digest.

Contributor

@keflavich keflavich left a comment


Some minor changes so far. It would be great to see some nice summary discussion and/or figures for the performance improvements in the docs.

Awesome stuff.

normalize_kernel = True
nan_treatment='interpolate'
else:
raise Exception("Can't convolve 1D and 2D kernel.")
Contributor

Make this some non-generic exception. Perhaps just ValueError?

if isinstance(passed_kernel, Kernel):
if (isinstance(passed_array, Kernel1D) and isinstance(passed_kernel, Kernel1D)) or (isinstance(passed_array, Kernel2D) and isinstance(passed_kernel, Kernel2D)):
warnings.warn("array is Kernel instance, hardwiring the following parameters: "
"boundary='fill', fill_value=0, normalize_Kernel=True, "
Contributor

whitespace - should be indented just past the ( on the previous line?

# NaN interpolation significantly slows down the C convolution
# compuatation. Since nan_treatment = 'interpolate', is the default
# check whether it is even needed, if not, don't interpolate.
# NB: np.isnan(array_internal.sum()) is fatser than np.isnan(array_internal).any()
Contributor

faster.

Have you checked performance of np.count_nonzero(np.isnan(array_internal))? It might just be that the per-pixel checking of isnan is the performance bottleneck, in which case array_internal.sum() (or maybe .max()?) is going to be the fastest.

Contributor Author

@jamienoss jamienoss Mar 16, 2018

I did, and you're correct - the perf. bottleneck is having to check each pixel for NaNs, which is why I went for sum() and then only check the result. I didn't think of using max(), but thinking about it now, sum() has fewer operations and is vectorizable.

Neither can be short-circuited at any point - all pixels will be operated on.
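To make the discussion above concrete, here is a quick check of the three NaN-detection idioms mentioned (semantics only - actual timings depend on array size and hardware), including one caveat of the sum() trick that the thread does not mention:

```python
import numpy as np

a_clean = np.arange(12, dtype=float).reshape(3, 4)
a_nan = a_clean.copy()
a_nan[1, 2] = np.nan

# Three ways to ask "any NaN?":
slow  = np.isnan(a_nan).any()                  # allocates a boolean mask
count = np.count_nonzero(np.isnan(a_nan)) > 0  # also allocates, counts in C
fast  = np.isnan(a_nan.sum())                  # one reduction, no temporary

assert slow and count and fast
assert not np.isnan(a_clean.sum())

# Caveat for the sum() trick: inf - inf is NaN, so an array containing
# +inf and -inf (but no NaN) gives a false positive.
tricky = np.array([np.inf, -np.inf])
assert np.isnan(tricky.sum()) and not np.isnan(tricky).any()
```

For a convolution input, a false positive on an all-inf array merely takes the slower (still correct) path, so the trade-off seems reasonable.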

"close to zero. The sum of the given kernel is < {0}"
.format(1. / MAX_NORMALIZATION))
if kernel_sum < 1. / MAX_NORMALIZATION or kernel_sums_to_zero:
raise Exception("The kernel can't be normalized, because its sum is "
Contributor

this should probably also be a ValueError (I realize it was an Exception in the old code too; we may as well improve it)
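The suggested ValueError version of the guard quoted above might look like the following. A sketch only: `MAX_NORMALIZATION = 100` is an assumed value for illustration - the real constant lives in astropy's convolve.py:

```python
import numpy as np

MAX_NORMALIZATION = 100  # assumed value for illustration

def normalize_kernel(kernel):
    """Raise ValueError (not a bare Exception) when the kernel can't be normalized."""
    kernel = np.asarray(kernel, dtype=float)
    kernel_sum = kernel.sum()
    if abs(kernel_sum) < 1. / MAX_NORMALIZATION:
        raise ValueError("The kernel can't be normalized, because its sum is "
                         "close to zero. The sum of the given kernel is < {0}"
                         .format(1. / MAX_NORMALIZATION))
    return kernel / kernel_sum
```

ValueError is the conventional choice here since the failure is caused by the value of an argument, and it lets callers catch it specifically instead of a blanket `except Exception`.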

# here and pass through instead.
result = np.zeros(array_internal.shape, dtype=float, order='C')

if boundary in ['fill', 'extend', 'wrap']:
Contributor

maybe better to have this as a tuple?

if array_internal.ndim == 1:
padded_array[pad_width[0]:array_shape[0]+pad_width[0]] = array_internal
elif array_internal.ndim == 2:
padded_array[pad_width[0]:array_shape[0]+pad_width[0], pad_width[1]:array_shape[1]+pad_width[1]] = array_internal
Contributor

wrap these long lines
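For reference, the manual slicing in the quoted lines inserts the original array into a pre-padded one; the three boundary modes map naturally onto `np.pad` modes (an illustration of the padding idea, not the PR's code):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
pad = 2

# boundary='fill'   -> pad with a constant fill_value
fill   = np.pad(a, pad, mode='constant', constant_values=0)
# boundary='extend' -> repeat the edge value
extend = np.pad(a, pad, mode='edge')
# boundary='wrap'   -> periodic boundary
wrap   = np.pad(a, pad, mode='wrap')

assert fill.tolist()   == [0, 0, 1, 2, 3, 0, 0]
assert extend.tolist() == [1, 1, 1, 2, 3, 3, 3]
assert wrap.tolist()   == [2, 3, 1, 2, 3, 1, 2]
```

Once the input is padded like this, the inner convolution loop needs no boundary branches at all, which is where the "Major perf. boost" in the PR description comes from.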

)

# So far, normalization has only occured for nan_treatment == 'interpolate'
# beacuse this had to happen within the C extension so as to ignore
Contributor

because

@jamienoss
Contributor Author

jamienoss commented Mar 16, 2018

@astrofrog

is the change from Cython to pure-C needed, as Cython does have ways of enabling multi-threading?

Yes. The Cython OpenMP support is incredibly poor and doesn't seem to have the ability to split omp parallel for into its separate blocks. Also, there is no way to assign to const variables which is a big issue when trying to help out the compiler with optimizations.

would you be able to add benchmark cases there? I made a document here about using asv and astropy-benchmarks:

I can indeed.

# NaN interpolation significantly slows down the C convolution
# compuatation. Since nan_treatment = 'interpolate', is the default
# check whether it is even needed, if not, don't interpolate.
# NB: np.isnan(array_internal.sum()) is fatser than np.isnan(array_internal).any()
Contributor

typos: computation, faster

# Add '-Rpass-missed=.*' to ``extra_compile_args`` when compiling with clang
# to report missed optimizations
c_convolve_ext = Extension(name='c_convolve', sources=SRC_FILES,
extra_compile_args=['-O3', '-fPIC'],
Contributor

Is -O3 needed for the optimizations mentioned above ? Otherwise it may be better to leave -O2 by default, it would be more consistent with other extensions and default flags.

Contributor Author

It is, for vectorization and other such magic. I can turn them all on individually though. Are you concerned about fp arithmetic issues, i.e. fast-math?

Contributor

Actually it looks like conda's Python uses -ftree-vectorize and -O3 by default, pyenv's Python has -O3 but not -ftree-vectorize, Debian's Python is more conservative with -O2. So actually my point was more about not specifying additional options here to inherit the default ones instead.

About fast-math, I think it is not activated with -O3, I guess it could be interesting here since you are interpolating the NaNs before calling the C code, but not sure if it can lead to other issues.

Member

I would also prefer to avoid passing -O3 explicitly and let the default from the CPython installation be used instead.

from .utils import KernelSizeError, has_even_axis, raise_even_kernel_exception
from ..openmp_enabled import is_openmp_enabled

faulthandler.enable()
Contributor

I guess faulthandler was useful for development but could be removed now ?
Also the matplotlib import above seems unused (otherwise it should be made optional).
And unless I missed something, I don't see the source code for openmp_enabled.

Contributor Author

I guess faulthandler was useful for development but could be removed now ?

I left the faulthandler in to deal with any unforeseen runtime issues - better that the user gets an error message than the call just quickly and silently returning. I have no strong objection to removing it, though, if that is the "Pythonic" thing to do.

Also the matplotlib import above seems unused (otherwise it should be made optional).

Good catch, must have been a copy-and-paste fat-finger.

And unless I missed something, I don't see the source code for openmp_enabled.

This is autogenerated at build c.f. astropy/astropy-helpers#382 src code here

Contributor

faulthandler can be activated at startup time, and I was not sure if activating it by default has some drawback. It seems not actually, so maybe it's fine to keep it.

This is autogenerated at build c.f. astropy/astropy-helpers#382 src code here

Ok, thanks! This is related to another question (that should maybe go in the astropy-helpers discussion): I was also working on building binary wheels for my package with a C extension using OpenMP, and there will be the same issue for Astropy wheels: if the wheels are built with OpenMP support, that does not necessarily mean the target system has OpenMP support. It should be fine on Linux, and maybe the OpenMP support in Clang is old enough now to activate OpenMP by default?

Contributor Author

It should be fine on Linux, and maybe the OpenMP support in Clang is old enough now to activate OpenMP by default

I think llvm/clang version 3+ supports OpenMP by default; it just doesn't ship libomp with it like gcc does. You have to install it separately, which is why it requires more work to compile and link.
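On the faulthandler question discussed above: enabling it programmatically is one line, and the same protection is available at interpreter startup without touching library code - a quick illustration:

```python
import faulthandler

# What the PR does: enable at import time so a segfault in the C
# extension produces a Python-level traceback instead of a bare crash.
faulthandler.enable()
assert faulthandler.is_enabled()

# Equivalent opt-in alternatives, with no library change needed:
#   python -X faulthandler script.py
#   PYTHONFAULTHANDLER=1 python script.py
```

Enabling it unconditionally has little downside (it only installs signal handlers for fatal signals), which matches the conclusion reached in the thread.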

@saimn
Contributor

saimn commented Mar 16, 2018

I left a few comments, it seems very good and it is an interesting reading for me as I'm also trying to fix/optimize a C extension with ctypes and openmp 🙂 (your addition in astropy-helpers for clang+openmp could also be useful!).

@jamienoss
Contributor Author

BTW, do I need to add a build dependency somewhere such that conda will build this with libomp?

Also, is there somewhere to note known version dependency issues? llvm-openmp version 4.0.1 hda82c8b_0 (the version available via anaconda) causes test_convolve_models.py::test_fitting_convolve_models() to crash. I have no idea why. It works fine with gcc (v7.1.0) and the newer llvm-openmp. I saw that conda built the latest version of astropy with clang v6.?.?, so hopefully they use a newer version of llvm-openmp also.

@jamienoss
Contributor Author

@saimn Thanks for taking a look. 😄

If you're using clang and trying to optimize, add -Rpass-missed=.* to extra_compile_args when compiling to report missed optimizations.

If you'd like me to help out, take a look at the src, or give you any pointers I'd be more than glad to. 😄

# Licensed under a 3-clause BSD style license - see LICENSE.rst

import os
from numpy import get_include as get_numpy_include
Member

Please check the other submodules how the numpy includes were added (either io.ascii or io.fits), and use the same logic here, that will at least fix the egg builds.

Contributor Author

Will do, thanks.

@bsipocz
Member

bsipocz commented Mar 16, 2018

I would definitely like to see @taldcroft's opinion before adding more C code to the codebase - e.g. his experience with the speed benefits vs. the maintenance burden of the fast reader.

@pllim
Member

pllim commented Mar 16, 2018

Would be interesting to see how this plays out at https://github.com/astropy/astropy-benchmarks/ (we'd probably need to add convolution benchmarks over there first).

Changes to astropy-helpers seem unnecessary given that it is already noted tests would fail until the helpers PR is merged.

@pllim pllim added this to the v3.1 milestone Mar 16, 2018
@astrofrog
Member

Note: I'm away this week, but will review this and the helpers PR next week.

@pllim
Member

pllim commented Mar 16, 2018

p.s. Please ping me to test this PR on my Windows machine when this is ready. C stuff can get iffy on Windows.

@pllim pllim added this to Needs decision in 3.1 Feature Planning Mar 16, 2018
@jamienoss
Contributor Author

@pllim

p.s. Please ping me to test this PR on my Windows machine when this is ready. C stuff can get iffy on Windows.

Everything's "iffy" on Windows. 😉

Thanks, testing on Windows would be great!

@jamienoss
Contributor Author

@pllim

Changes to astropy-helpers seem unnecessary given that it is already noted tests would fail until the helpers PR is merged.

Sorry, I just assumed this was the thing to do. Shall I remove the submodule changes and just leave the submod hash or remove all traces?

@saimn
Contributor

saimn commented Mar 16, 2018

@bsipocz - I would definitely like to see @taldcroft's opinion before adding more C code to the codebase about his experience with speed benefits vs maintenance burden of e.g. the fastreader?

Just my 2 cents: given the clever trick to avoid duplicating code for the nan interpolation case, I think that the difference - in terms of complexity - with the Cython code is small enough. I mean, I understand your worry, but I don't think that the C code will be more difficult to maintain. And even if I love Cython, it's also true that it lacks more advanced OpenMP features.

@jamienoss
Contributor Author

jamienoss commented Mar 17, 2018

@saimn & @bsipocz

I don't think that the C code will be more difficult to maintain.

It won't be any more difficult to maintain because it's C; in fact, it will be less so. However, optimized, performant code can be a burden to maintain in that its performance can be fragile to code changes. It's a battle/play between the will of the author and that of the compiler, and being able to outthink it and wield it. By this I just mean that the burden isn't necessarily because it's C, but because it's optimized for performance.

With this in mind, it's far simpler to optimize code when fewer translation layers exist. For example, C has one: the compiler. Actually, it has two, as you also have the CPU itself, which is free(ish) to optimize and rearrange your code. Cython has one more: the Cython compiler itself, translating it into C, which then gets compiled by a C compiler. You therefore have to write it in a particular fashion, knowing that the Cython compiler will translate it into a particular form that the C compiler will then deal with as you desired. In this respect, it is easier to maintain optimized C than optimized Cython.

Sometimes you just have to choose the right tool for the job. For this we already know that it's not Python (actually, I hate saying and hearing this - it's got nothing to do with Python, the language. The issue is with its canonical implementation, CPython, and its love of the GIL). Anyhoo, it's not (C)Python, and it's also not Cython, at least if we want it to be sufficiently performant.

@pllim
Member

pllim commented Mar 18, 2018

Shall I remove the submodule changes and just leave the submod hash or remove all traces?

Unless changing submodule hash in this PR w.r.t what is in astropy master can make the tests pass until astropy-helpers upstream can catch up, the diff is not worth having. Ideally, the diff here should only contain code to be reviewed.

An ideal workflow (at least in my own head) is:

  • Merge your helpers PR.
  • Have Astropy master update submodule to include that change separately.
  • Rebase this one to pick up that change. At this point, CIs should pass.

Building with OpenMP support is disabled by default. To build the package with
OpenMP support (currently only used in ``convolution.convolve``) create and
set the environment variable ``ASTROPY_SETUP_WITH_OPENMP=1``.
E.g. ``ASTROPY_SETUP_WITH_OPENMP=1 pip install astropy --no-cache-dir`` or
``ASTROPY_SETUP_WITH_OPENMP=1 ./setup.py build``.

Signed-off-by: James Noss <jnoss@stsci.edu>
Signed-off-by: James Noss <jnoss@stsci.edu>
@jamienoss
Contributor Author

Looks like weirdness with the 32-bit CircleCI test.

#!/bin/bash -eo pipefail
PYTHONHASHSEED=42 /opt/python/cp36-cp36m/bin/python setup.py test --parallel=4 -V -a "--durations=50"
Initializing astropy_helpers submodule with: `git submodule update --init -- astropy_helpers`
warning: no previously-included files found matching '*.pyc'
warning: no previously-included files found matching '*.o'
no previously-included directories found matching 'build'
no previously-included directories found matching 'astropy_helpers/tests'
Traceback (most recent call last):
  File "setup.py", line 52, in <module>
    generate_openmp_enabled_py(NAME, disable_openmp=disable_openmp)
TypeError: generate_openmp_enabled_py() got an unexpected keyword argument 'disable_openmp'
Exited with code 1

Is it not able to fetch astropy-helpers? Is this a caching issue?

Signed-off-by: James Noss <jnoss@stsci.edu>
@jamienoss jamienoss mentioned this pull request Oct 25, 2018
@jamienoss jamienoss force-pushed the astropy-convolve.pr branch 2 times, most recently from 267c808 to f702b41 Compare October 25, 2018 18:48
@jamienoss
Contributor Author

jamienoss commented Oct 25, 2018

@astrofrog c1c6a70 kills OpenMP support but leaves the C code intact.

FYI OpenMP support can still be enabled (without reverting c1c6a70) but it requires the user to add build flags and literally alter the code:

  • Remove the #undef _OPENMP here
  • Increase n_threads in convolve() here
  • Add OpenMP build flags, e.g. for gcc: CFLAGS=-fopenmp ./setup build

jamienoss added a commit to jamienoss/astropy-feedstock that referenced this pull request Oct 25, 2018
This is a result of astropy/astropy#7293

Signed-off-by: James Noss <jnoss@stsci.edu>
@jamienoss
Contributor Author

jamienoss commented Oct 25, 2018

@bsipocz Here are the changes required to the conda-forge/astropy-feedstock

conda-forge/astropy-feedstock@master...jamienoss:changes-for-convolve-from-7293

I wasn't sure if I should open a PR or let you sort that all out when you do so anyhow for the next rel.

jamienoss added a commit to jamienoss/astropy-feedstock that referenced this pull request Oct 25, 2018
This is a result of astropy/astropy#7293

Signed-off-by: James Noss <jnoss@stsci.edu>
@jamienoss
Contributor Author

Sorry, I forgot to remove the OpenMP ref in the perf notes.

@bsipocz
Member

bsipocz commented Oct 25, 2018

conda-forge/astropy-feedstock@master...jamienoss:changes-for-convolve-from-7293

Thanks. If you could open an issue pointing to that commit, or a PR (which won't be merged, but the commit can be cherry-picked from it), it would be super useful, as I won't remember this comment when doing the release in a month's time.

@jamienoss
Contributor Author

@astrofrog I think I'm all done, back to you.

See astropy#7971

Signed-off-by: James Noss <jnoss@stsci.edu>
Member

@astrofrog astrofrog left a comment


This looks ready to go provided tests pass - thanks @jamienoss! And let's continue the discussion about OpenMP during v3.2 development.

@jamienoss
Contributor Author

Just spitballing - once the next release is tagged, I wonder if there's any advantage to creating a specific OpenMP branch with c1c6a70 reverted? We could then scope out the principal users of convolve and see if they're interested in using it.

There may also be some merit in just reverting it in master such that all future devs act as build testers, expanding the net for catching any OpenMP build issues.

Just a thought.

@astrofrog astrofrog merged commit f9681a7 into astropy:master Oct 26, 2018
@astrofrog
Member

@jamienoss - yes there's no reason we couldn't consider making an openmp branch in case people want to experiment. Let's discuss again in a couple of weeks once the release is out.
