add spline_filter1d and spline_filter to cupyx.scipy.ndimage.interpolation #4145

Merged: 10 commits merged into cupy:master on Nov 11, 2020

Conversation

grlee77 (Contributor) commented Oct 18, 2020:

Overview

This adds the functions spline_filter1d and spline_filter. These are used in the prefiltering stage of all interpolation functions in scipy.ndimage when order >= 2. I will make a follow-up PR later with the actual spline interpolation support.
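For orientation, a minimal usage sketch of the two new functions (array shape and order chosen arbitrarily; keyword defaults follow SciPy's API as described below):

```python
import cupy as cp
import cupyx.scipy.ndimage as cndi

x = cp.random.random((512, 512))

# prefilter along a single axis
c1 = cndi.spline_filter1d(x, order=3, axis=0)

# prefilter along all axes (conceptually, spline_filter1d applied axis by axis)
c = cndi.spline_filter(x, order=3)
```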

Additional Details

The ElementwiseKernel approach used for convolutions in cupyx.scipy.ndimage.filters is not applicable for these spline prefilters because they are infinite impulse response (IIR) filters. Instead, a RawKernel is used that follows the SciPy implementation closely. The implementation was originally only for batched operation along the last axis (commit c5ab0f5), but I later refactored it (8f6223b) to use strided operation instead as that ends up being faster.
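For context, here is a minimal NumPy sketch of the causal/anti-causal recursion that an order=3 spline prefilter performs along one line of data. The boundary initialization is deliberately naive (it does not reproduce SciPy's 'mirror' handling), but it shows why the filter is IIR and therefore not expressible as an elementwise convolution: each output sample depends on the previously computed output sample.

```python
import numpy as np

def cubic_prefilter_line(x):
    """Illustrative cubic (order=3) spline prefilter for a 1D signal.

    Naive boundary initialization; for illustration only.
    """
    z = np.sqrt(3.0) - 2.0                 # pole of the cubic B-spline prefilter
    c = x * (1.0 - z) * (1.0 - 1.0 / z)    # overall gain (equals 6 for the cubic case)
    for k in range(1, c.size):             # causal (left-to-right) recursion
        c[k] += z * c[k - 1]
    for k in range(c.size - 2, -1, -1):    # anti-causal (right-to-left) recursion
        c[k] = z * (c[k + 1] - c[k])
    return c
```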

For the strided operation, I borrowed the row function and some stride calculation code from @coderforlife's gist for cupyx.scipy.signal.lfilter, so I added a co-author credit to @coderforlife here.

The only departures from the SciPy API should be as follows:
1.) There is an additional keyword-only argument, allow_float32, that allows the kernels to operate in single precision. This defaults to False so that the behavior matches SciPy's double-precision implementation. There is a performance and memory benefit to allowing single precision, so I would like to provide the option. (I would actually like to do this more broadly in cupyx.scipy.ndimage and can open a separate issue on whether to do so, and whether it should be enabled by a keyword-only argument or instead by a global/environment variable.)

2.) These functions allow operation on complex-valued inputs, while SciPy currently does not. I have an open PR at SciPy to support this (scipy/scipy#12812), but it has not yet been reviewed. If this is acceptable, I can expand the tests here to include complex-valued inputs as well. A usage sketch covering both departures follows below.
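A hedged sketch of how the two departures above would be exercised (the output dtype handling shown is my reading of the defaults described above, not a guarantee):

```python
import cupy as cp
import cupyx.scipy.ndimage as cndi

x32 = cp.random.random((2048, 2048), dtype=cp.float32)

# Default behavior: kernels run in double precision, matching SciPy
c_default = cndi.spline_filter(x32, order=3, output=cp.float32)

# Departure 1: opt in to single-precision kernels (faster, less memory)
c_single = cndi.spline_filter(x32, order=3, output=cp.float32,
                              allow_float32=True)

# Departure 2: complex-valued input is accepted (SciPy does not support this yet)
xc = x32 + 1j * cp.random.random((2048, 2048), dtype=cp.float32)
c_complex = cndi.spline_filter(xc, order=3, output=xc.dtype)
```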

The changes made in 347c8c6 truncate the terms z^N at values of N smaller than the full array size in order to reduce computation time (here the z are the filter poles, which typically have magnitude < 0.5, so z^N rapidly approaches zero for large N). I may make a corresponding change upstream in SciPy to improve the efficiency there as well.
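To make the truncation argument concrete, here is a small hypothetical helper (not the code used in the PR) that computes how many terms are needed before |z|^N drops below a tolerance; the point is that this count depends only on the pole and the tolerance, not on the array length:

```python
import math

def n_terms(pole, tol=1e-10):
    # smallest N with |pole|**N < tol, i.e. N > log(tol) / log(|pole|)
    return int(math.ceil(math.log(tol) / math.log(abs(pole))))

# For the cubic-spline pole z = sqrt(3) - 2 (|z| ~ 0.27), only ~18 terms are
# needed at tol=1e-10, so long boundary sums can safely be truncated.
print(n_terms(math.sqrt(3) - 2))  # -> 18
```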

Benchmark Results

Benchmark Script
```python
import numpy as np
import scipy.ndimage as ndi

import cupy as cp
import cupyx.scipy.ndimage as cndi
from cupyx.time import repeat

print("dtype | shape | order | SciPy Time | CuPy Time | Acceleration")
print("------|-------|-------|------------|-----------|-------------")
for shape in [(512, 512), (4096, 4096), (256, 256, 256), (48, 48, 48, 48)]:
    for order in [2, 3, 4, 5]:
        for dtype in [np.float64, np.float32]:

            if dtype == np.float32:
                allow_cases = [False, True]
            else:
                allow_cases = [False]

            # CPU timing (SciPy)
            x = cp.testing.shaped_random(shape, xp=np, dtype=dtype)
            perf_cpu = repeat(ndi.spline_filter, (x, order),
                              kwargs=dict(output=dtype),
                              n_warmup=1, n_repeat=4)
            cpu_mean = perf_cpu.cpu_times.mean()
            cpu_std = perf_cpu.cpu_times.std()

            # GPU timing (CuPy), with and without allow_float32 for float32 input
            xg = cp.testing.shaped_random(shape, xp=cp, dtype=dtype)
            for allow_float32 in allow_cases:
                perf_gpu = repeat(cndi.spline_filter, (xg, order),
                                  kwargs=dict(output=dtype,
                                              allow_float32=allow_float32),
                                  n_warmup=20, n_repeat=20)
                gpu_mean = perf_gpu.gpu_times.mean()
                gpu_std = perf_gpu.gpu_times.std()

                dtype_name = np.dtype(dtype).name
                if allow_float32:
                    dtype_name += "+"
                print(f"{dtype_name} | {shape} | {order} | "
                      f"{cpu_mean:0.4f} +/- {cpu_std:0.4f} | "
                      f"{gpu_mean:0.4f} +/- {gpu_std:0.4f} | "
                      f"{cpu_mean/gpu_mean:0.4f}")
```

In the benchmarks below, float32 indicates float32 input with allow_float32=False while float32+ indicates allow_float32=True. So comparing float32+ vs float32 indicates the relative benefit of allowing the kernel to operate in single precision. In some cases it gives nearly a factor of two performance improvement (on a GTX 1080 Ti).

The results below are with some empirically determined tuning of block sizes (see 3665dcf). This is likely not fully optimal and may vary based on hardware, but I found this was up to a factor of 4 faster for some cases in the table below as compared to just using a fixed block size of 128.

Benchmark Results
dtype | shape | order | SciPy Time [s] | CuPy Time [s] | Acceleration
------|-------|-------|----------------|---------------|-------------
float64 | (512, 512) | 2 | 0.0078 +/- 0.0000 | 0.0005 +/- 0.0000 | 17.1429
float32 | (512, 512) | 2 | 0.0069 +/- 0.0001 | 0.0005 +/- 0.0000 | 14.8155
float32+ | (512, 512) | 2 | 0.0069 +/- 0.0001 | 0.0004 +/- 0.0000 | 18.0272
float64 | (512, 512) | 3 | 0.0059 +/- 0.0001 | 0.0005 +/- 0.0000 | 12.9459
float32 | (512, 512) | 3 | 0.0052 +/- 0.0001 | 0.0005 +/- 0.0000 | 11.1037
float32+ | (512, 512) | 3 | 0.0052 +/- 0.0001 | 0.0004 +/- 0.0000 | 13.7612
float64 | (512, 512) | 4 | 0.0100 +/- 0.0002 | 0.0008 +/- 0.0000 | 12.2343
float32 | (512, 512) | 4 | 0.0096 +/- 0.0001 | 0.0008 +/- 0.0000 | 11.5478
float32+ | (512, 512) | 4 | 0.0096 +/- 0.0001 | 0.0006 +/- 0.0000 | 14.8777
float64 | (512, 512) | 5 | 0.0105 +/- 0.0001 | 0.0008 +/- 0.0000 | 12.5838
float32 | (512, 512) | 5 | 0.0100 +/- 0.0003 | 0.0008 +/- 0.0000 | 12.0317
float32+ | (512, 512) | 5 | 0.0100 +/- 0.0003 | 0.0007 +/- 0.0000 | 15.3600
float64 | (4096, 4096) | 2 | 0.6320 +/- 0.0033 | 0.0115 +/- 0.0002 | 55.1890
float32 | (4096, 4096) | 2 | 0.5955 +/- 0.0032 | 0.0116 +/- 0.0001 | 51.3433
float32+ | (4096, 4096) | 2 | 0.5955 +/- 0.0032 | 0.0079 +/- 0.0000 | 74.9288
float64 | (4096, 4096) | 3 | 0.6506 +/- 0.0029 | 0.0113 +/- 0.0000 | 57.5441
float32 | (4096, 4096) | 3 | 0.5954 +/- 0.0025 | 0.0115 +/- 0.0000 | 51.7527
float32+ | (4096, 4096) | 3 | 0.5954 +/- 0.0025 | 0.0080 +/- 0.0000 | 74.7751
float64 | (4096, 4096) | 4 | 0.8933 +/- 0.0015 | 0.0258 +/- 0.0001 | 34.5646
float32 | (4096, 4096) | 4 | 0.8544 +/- 0.0073 | 0.0260 +/- 0.0000 | 32.7994
float32+ | (4096, 4096) | 4 | 0.8544 +/- 0.0073 | 0.0146 +/- 0.0000 | 58.6961
float64 | (4096, 4096) | 5 | 0.9024 +/- 0.0022 | 0.0260 +/- 0.0001 | 34.7707
float32 | (4096, 4096) | 5 | 0.8635 +/- 0.0076 | 0.0262 +/- 0.0001 | 33.0032
float32+ | (4096, 4096) | 5 | 0.8635 +/- 0.0076 | 0.0146 +/- 0.0000 | 59.2901
float64 | (256, 256, 256) | 2 | 0.7596 +/- 0.0008 | 0.0171 +/- 0.0000 | 44.4308
float32 | (256, 256, 256) | 2 | 0.7070 +/- 0.0009 | 0.0173 +/- 0.0000 | 40.8533
float32+ | (256, 256, 256) | 2 | 0.7070 +/- 0.0009 | 0.0109 +/- 0.0000 | 64.7445
float64 | (256, 256, 256) | 3 | 0.7565 +/- 0.0006 | 0.0173 +/- 0.0001 | 43.6464
float32 | (256, 256, 256) | 3 | 0.7138 +/- 0.0021 | 0.0175 +/- 0.0000 | 40.7796
float32+ | (256, 256, 256) | 3 | 0.7138 +/- 0.0021 | 0.0110 +/- 0.0000 | 65.0602
float64 | (256, 256, 256) | 4 | 1.2377 +/- 0.0042 | 0.0319 +/- 0.0000 | 38.7723
float32 | (256, 256, 256) | 4 | 1.1898 +/- 0.0009 | 0.0322 +/- 0.0000 | 37.0058
float32+ | (256, 256, 256) | 4 | 1.1898 +/- 0.0009 | 0.0200 +/- 0.0000 | 59.4640
float64 | (256, 256, 256) | 5 | 1.2767 +/- 0.0008 | 0.0328 +/- 0.0001 | 38.9143
float32 | (256, 256, 256) | 5 | 1.2286 +/- 0.0015 | 0.0330 +/- 0.0001 | 37.2210
float32+ | (256, 256, 256) | 5 | 1.2286 +/- 0.0015 | 0.0201 +/- 0.0000 | 61.0218
float64 | (48, 48, 48, 48) | 2 | 0.2546 +/- 0.0010 | 0.0053 +/- 0.0000 | 47.7702
float32 | (48, 48, 48, 48) | 2 | 0.2277 +/- 0.0027 | 0.0054 +/- 0.0000 | 42.3401
float32+ | (48, 48, 48, 48) | 2 | 0.2277 +/- 0.0027 | 0.0036 +/- 0.0000 | 62.6256
float64 | (48, 48, 48, 48) | 3 | 0.2545 +/- 0.0003 | 0.0056 +/- 0.0000 | 45.8356
float32 | (48, 48, 48, 48) | 3 | 0.2307 +/- 0.0016 | 0.0056 +/- 0.0000 | 41.2424
float32+ | (48, 48, 48, 48) | 3 | 0.2307 +/- 0.0016 | 0.0035 +/- 0.0000 | 65.1985
float64 | (48, 48, 48, 48) | 4 | 0.4512 +/- 0.0055 | 0.0103 +/- 0.0003 | 43.9387
float32 | (48, 48, 48, 48) | 4 | 0.4248 +/- 0.0044 | 0.0097 +/- 0.0000 | 43.9879
float32+ | (48, 48, 48, 48) | 4 | 0.4248 +/- 0.0044 | 0.0053 +/- 0.0000 | 79.8143
float64 | (48, 48, 48, 48) | 5 | 0.4347 +/- 0.0042 | 0.0100 +/- 0.0000 | 43.4595
float32 | (48, 48, 48, 48) | 5 | 0.4089 +/- 0.0029 | 0.0100 +/- 0.0000 | 40.7602
float32+ | (48, 48, 48, 48) | 5 | 0.4089 +/- 0.0029 | 0.0054 +/- 0.0000 | 76.2883

emcastillo (Member) commented:

Sorry for having this review stalled for a while, I plan to look at it carefully in the next few days 🙇‍♂️

emcastillo (Member) commented:

Jenkins, test this please

chainer-ci (Member) commented:

Jenkins CI test (for commit eeca386, target branch master) failed with status FAILURE.

emcastillo (Member) commented:

Jenkins, test this please

Review thread on the following diff hunk:

@@ -28,6 +28,8 @@ def _get_memory_ptrs(x):

def shares_memory(a, b, max_work=None):
    if a is b:
Member commented:

this seems to be breaking some tests when it shouldn't :/

grlee77 (Contributor, Author) replied:
I can take a look, but in the worst case, I think this was just a micro-optimization that can be reverted.

chainer-ci (Member) commented:

Jenkins CI test (for commit eeca386, target branch master) failed with status FAILURE.

emcastillo (Member) commented:

The problem is this test:

def test_zero_size_array(self):
        for xp in (numpy, cupy):
            a = xp.array([])
>           assert xp.shares_memory(a, a) is False

If we change your check to `a is b and a.size != 0`, that should be enough.
Two empty arrays are thought not to share memory.
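A minimal sketch of the adjusted early return (identifier names follow the diff context above; the exact change that landed may differ):

```python
def shares_memory(a, b, max_work=None):
    # identical, non-empty arrays trivially share memory;
    # two empty arrays are not considered to share memory
    if a is b and a.size != 0:
        return True
    ...  # fall through to the existing overlap check
```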

emcastillo (Member) commented:

Jenkins, test this please

chainer-ci (Member) commented:

Jenkins CI test (for commit 75df1ee, target branch master) succeeded!

emcastillo (Member) commented:

BTW can you add these functions to docs? :) thanks!

emcastillo added the st:test-and-merge (deprecated; ready to merge after test pass) label on Nov 11, 2020
emcastillo added this to the v9.0.0a2 milestone on Nov 11, 2020
mergify bot merged commit 14a3766 into cupy:master on Nov 11, 2020
grlee77 (Contributor, Author) commented Nov 11, 2020:

> BTW can you add these functions to docs? :) thanks!

Thanks, this is now done. Please also see issue #4247 regarding the allow_float32 argument that was in this PR. I am not sure whether that is the best approach or whether it would be preferable to default to the allow_float32=True behavior and remove the argument.

Labels: cat:feature (New features/APIs), prio:medium, st:test-and-merge (deprecated; ready to merge after test pass)