add spline_filter1d and spline_filter to cupyx.scipy.ndimage.interpolation #4145
Conversation
Improves performance by avoiding repeated axis movement, reshaping, etc. The kernel code reuses components from Jeffrey Bush's lfilter gist. Co-authored-by: Jeffrey Bush <jeff@coderforlife.com>
Sorry for having this review stalled for a while, I plan to look at it carefully in the next few days 🙇‍♂️
Jenkins, test this please
Jenkins CI test (for commit eeca386, target branch master) failed with status FAILURE.
Jenkins, test this please
cupy/_misc/memory_ranges.py (Outdated)
@@ -28,6 +28,8 @@ def _get_memory_ptrs(x):
def shares_memory(a, b, max_work=None):
    if a is b:
this seems to be breaking some tests when it shouldn't :/
I can take a look, but in the worst case, I think this was just a micro-optimization that can be reverted.
Jenkins CI test (for commit eeca386, target branch master) failed with status FAILURE.
The problem is this test
if we change your check to
Jenkins, test this please
Jenkins CI test (for commit 75df1ee, target branch master) succeeded!
BTW can you add these functions to docs? :) thanks!
Thanks, this is now done. Please also see issue #4247 in regards to the
Overview

This adds the functions `spline_filter1d` and `spline_filter`. These are used for the prefiltering stage for all interpolation functions in `scipy.ndimage` when `order >= 2`. I will make a follow-up PR later with the actual spline interpolation support.

Additional Details
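For context, the prefiltering these functions perform is the standard causal + anticausal IIR recursion for B-spline coefficients (Unser's algorithm, which SciPy follows). Below is a minimal NumPy sketch of the cubic (order=3) case with 'mirror' boundaries, for illustration only; it is not the CuPy kernel code from this PR, and the helper names are mine:

```python
import math
import numpy as np

def cubic_prefilter_1d(x):
    """Illustrative cubic (order=3) B-spline prefilter with 'mirror'
    boundaries: one causal and one anticausal IIR pass for the single pole."""
    z = np.sqrt(3.0) - 2.0                        # pole of the cubic filter
    n = len(x)
    c = np.asarray(x, dtype=np.float64) * 6.0     # gain (1 - z) * (1 - 1/z) = 6
    # Exact causal initialization for the mirror boundary condition.
    k = np.arange(1, n - 1)
    c[0] = (c[0] + z ** (n - 1) * c[-1]
            + np.sum(c[1:-1] * (z ** k + z ** (2 * n - 2 - k)))) \
        / (1.0 - z ** (2 * n - 2))
    for i in range(1, n):                         # causal (forward) pass
        c[i] += z * c[i - 1]
    # Exact anticausal initialization for the mirror boundary condition.
    c[-1] = (z / (z * z - 1.0)) * (z * c[-2] + c[-1])
    for i in range(n - 2, -1, -1):                # anticausal (backward) pass
        c[i] = z * (c[i + 1] - c[i])
    return c

def truncation_horizon(pole, tol=1e-10):
    """Number of terms after which |pole|**k drops below tol, so the
    initialization sum above can be truncated for long inputs."""
    return int(math.ceil(math.log(tol) / math.log(abs(pole))))
```

Evaluating the resulting coefficients against the cubic B-spline kernel (1/6, 4/6, 1/6) at the sample positions reproduces the input, which is the defining property of the prefilter.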
The `ElementwiseKernel` approach used for convolutions in `cupyx.scipy.ndimage.filters` is not applicable to these spline prefilters because they are infinite impulse response (IIR) filters. Instead, a `RawKernel` is used that follows the SciPy implementation closely. The implementation was originally only for batched operation along the last axis (commit c5ab0f5), but I later refactored it (8f6223b) to use strided operation instead, as that ends up being faster.

For the strided operation, I borrowed the `row` function and some stride calculation code from @coderforlife's gist for `cupyx.scipy.signal.lfilter`, so I added a co-author credit to @coderforlife here.

The only departures from the SciPy API should be as follows:
1. There is an additional keyword-only argument `allow_float32` that allows the kernels to operate in single precision. This defaults to `False` so that the behavior matches SciPy's double-precision implementation. There is a performance and memory benefit to allowing single precision, so I would like to provide the option (I would actually like to do that more broadly in `cupyx.scipy.ndimage` and can open a separate issue regarding whether to do this and whether it should be enabled by a keyword-only argument or perhaps instead by a global/environment variable).
2. This function allows operation on complex-valued images while SciPy currently does not. I do have an open PR at SciPy to support this (scipy/scipy#12812), but it has not yet been reviewed. If this is acceptable, I can expand the tests here to include complex-valued inputs as well.
The changes made in 347c8c6 truncate the terms `z**N` for values of `N` less than the full array size in order to reduce computation time (here the `z` are the filter poles, which typically have magnitude < 0.5, so `z**N` rapidly approaches zero for large `N`). I may make a corresponding change upstream in SciPy to improve the efficiency there as well.

Benchmark Results
Benchmark Script
In the benchmarks below, `float32` indicates float32 input with `allow_float32=False`, while `float32+` indicates `allow_float32=True`. So comparing float32+ vs float32 indicates the relative benefit of allowing the kernel to operate in single precision. In some cases it gives nearly a factor of two performance improvement (on a GTX 1080 Ti).

The results below are with some empirically determined tuning of block sizes (see 3665dcf). This is likely not fully optimal and may vary based on hardware, but I found this was up to a factor of 4 faster for some cases in the table below as compared to just using a fixed block size of 128.
Benchmark Results