add spline_filter1d and spline_filter to cupyx.scipy.ndimage.interpolation #4145

Merged: 10 commits merged into cupy:master on Nov 11, 2020

Conversation

grlee77 (Contributor) commented Oct 18, 2020:

Overview

This adds the functions spline_filter1d and spline_filter. These are used in the prefiltering stage of all interpolation functions in scipy.ndimage when order >= 2. I will make a follow-up PR later with the actual spline interpolation support.
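For orientation, a minimal usage sketch of the two new functions (array shape and order chosen arbitrarily; keyword defaults follow SciPy's API as described below):

```python
import cupy as cp
import cupyx.scipy.ndimage as cndi

x = cp.random.random((512, 512))

# prefilter along a single axis
c1 = cndi.spline_filter1d(x, order=3, axis=0)

# prefilter along all axes (conceptually, spline_filter1d applied axis by axis)
c = cndi.spline_filter(x, order=3)
```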

Additional Details

The ElementwiseKernel approach used for convolutions in cupyx.scipy.ndimage.filters is not applicable for these spline prefilters because they are infinite impulse response (IIR) filters. Instead, a RawKernel is used that follows the SciPy implementation closely. The implementation was originally only for batched operation along the last axis (commit c5ab0f5), but I later refactored it (8f6223b) to use strided operation instead as that ends up being faster.
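For context, here is a minimal NumPy sketch of the causal/anti-causal recursion that an order=3 spline prefilter performs along one line of data. The boundary initialization is deliberately naive (it does not reproduce SciPy's 'mirror' handling), but it shows why the filter is IIR and therefore not expressible as an elementwise convolution: each output sample depends on the previously computed output sample.

```python
import numpy as np

def cubic_prefilter_line(x):
    """Illustrative cubic (order=3) spline prefilter for a 1D signal.

    Naive boundary initialization; for illustration only.
    """
    z = np.sqrt(3.0) - 2.0                 # pole of the cubic B-spline prefilter
    c = x * (1.0 - z) * (1.0 - 1.0 / z)    # overall gain (equals 6 for the cubic case)
    for k in range(1, c.size):             # causal (left-to-right) recursion
        c[k] += z * c[k - 1]
    for k in range(c.size - 2, -1, -1):    # anti-causal (right-to-left) recursion
        c[k] = z * (c[k + 1] - c[k])
    return c
```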

For the strided operation, I borrowed the row function and some stride calculation code from @coderforlife's gist for cupyx.scipy.signal.lfilter, so I added a co-author credit to @coderforlife here.

The only departures from the SciPy API should be as follows:
1.) There is an additional keyword-only argument, allow_float32, that allows the kernels to operate in single precision. This defaults to False so that the behavior matches SciPy's double-precision implementation. There is a performance and memory benefit to allowing single precision, so I would like to provide the option. (I would actually like to do this more broadly in cupyx.scipy.ndimage and can open a separate issue on whether to do so, and whether it should be enabled by a keyword-only argument or instead by a global/environment variable.)

2.) These functions allow operation on complex-valued inputs, while SciPy currently does not. I have an open PR at SciPy to support this (scipy/scipy#12812), but it has not yet been reviewed. If this is acceptable, I can expand the tests here to include complex-valued inputs as well. A usage sketch covering both departures follows below.
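A hedged sketch of how the two departures above would be exercised (the output dtype handling shown is my reading of the defaults described above, not a guarantee):

```python
import cupy as cp
import cupyx.scipy.ndimage as cndi

x32 = cp.random.random((2048, 2048), dtype=cp.float32)

# Default behavior: kernels run in double precision, matching SciPy
c_default = cndi.spline_filter(x32, order=3, output=cp.float32)

# Departure 1: opt in to single-precision kernels (faster, less memory)
c_single = cndi.spline_filter(x32, order=3, output=cp.float32,
                              allow_float32=True)

# Departure 2: complex-valued input is accepted (SciPy does not support this yet)
xc = x32 + 1j * cp.random.random((2048, 2048), dtype=cp.float32)
c_complex = cndi.spline_filter(xc, order=3, output=xc.dtype)
```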

The changes made in 347c8c6 truncate the terms z^N at values of N smaller than the full array size in order to reduce computation time (here the z are the filter poles, which typically have magnitude < 0.5, so z^N rapidly approaches zero for large N). I may make a corresponding change upstream in SciPy to improve the efficiency there as well.
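To make the truncation argument concrete, here is a small hypothetical helper (not the code used in the PR) that computes how many terms are needed before |z|^N drops below a tolerance; the point is that this count depends only on the pole and the tolerance, not on the array length:

```python
import math

def n_terms(pole, tol=1e-10):
    # smallest N with |pole|**N < tol, i.e. N > log(tol) / log(|pole|)
    return int(math.ceil(math.log(tol) / math.log(abs(pole))))

# For the cubic-spline pole z = sqrt(3) - 2 (|z| ~ 0.27), only ~18 terms are
# needed at tol=1e-10, so long boundary sums can safely be truncated.
print(n_terms(math.sqrt(3) - 2))  # -> 18
```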

Benchmark Results

Benchmark Script
```python
import numpy as np
import scipy.ndimage as ndi

import cupy as cp
import cupyx.scipy.ndimage as cndi
from cupyx.time import repeat

print("dtype | shape | order | SciPy Time | CuPy Time | Acceleration")
print("------|-------|-------|------------|-----------|-------------")
for shape in [(512, 512), (4096, 4096), (256, 256, 256), (48, 48, 48, 48)]:
    for order in [2, 3, 4, 5]:
        for dtype in [np.float64, np.float32]:

            if dtype == np.float32:
                allow_cases = [False, True]
            else:
                allow_cases = [False]

            # CPU timing (SciPy)
            x = cp.testing.shaped_random(shape, xp=np, dtype=dtype)
            perf_cpu = repeat(ndi.spline_filter, (x, order),
                              kwargs=dict(output=dtype),
                              n_warmup=1, n_repeat=4)
            cpu_mean = perf_cpu.cpu_times.mean()
            cpu_std = perf_cpu.cpu_times.std()

            # GPU timing (CuPy), with and without allow_float32 for float32 input
            xg = cp.testing.shaped_random(shape, xp=cp, dtype=dtype)
            for allow_float32 in allow_cases:
                perf_gpu = repeat(cndi.spline_filter, (xg, order),
                                  kwargs=dict(output=dtype,
                                              allow_float32=allow_float32),
                                  n_warmup=20, n_repeat=20)
                gpu_mean = perf_gpu.gpu_times.mean()
                gpu_std = perf_gpu.gpu_times.std()

                dtype_name = np.dtype(dtype).name
                if allow_float32:
                    dtype_name += "+"
                print(f"{dtype_name} | {shape} | {order} | "
                      f"{cpu_mean:0.4f} +/- {cpu_std:0.4f} | "
                      f"{gpu_mean:0.4f} +/- {gpu_std:0.4f} | "
                      f"{cpu_mean/gpu_mean:0.4f}")
```

In the benchmarks below, float32 indicates float32 input with allow_float32=False while float32+ indicates allow_float32=True. So comparing float32+ vs float32 indicates the relative benefit of allowing the kernel to operate in single precision. In some cases it gives nearly a factor of two performance improvement (on a GTX 1080 Ti).

The results below are with some empirically determined tuning of block sizes (see 3665dcf). This is likely not fully optimal and may vary based on hardware, but I found this was up to a factor of 4 faster for some cases in the table below as compared to just using a fixed block size of 128.

Benchmark Results
dtype | shape | order | SciPy Time [s] | CuPy Time [s] | Acceleration
------|-------|-------|----------------|---------------|-------------
float64 | (512, 512) | 2 | 0.0078 +/- 0.0000 | 0.0005 +/- 0.0000 | 17.1429
float32 | (512, 512) | 2 | 0.0069 +/- 0.0001 | 0.0005 +/- 0.0000 | 14.8155
float32+ | (512, 512) | 2 | 0.0069 +/- 0.0001 | 0.0004 +/- 0.0000 | 18.0272
float64 | (512, 512) | 3 | 0.0059 +/- 0.0001 | 0.0005 +/- 0.0000 | 12.9459
float32 | (512, 512) | 3 | 0.0052 +/- 0.0001 | 0.0005 +/- 0.0000 | 11.1037
float32+ | (512, 512) | 3 | 0.0052 +/- 0.0001 | 0.0004 +/- 0.0000 | 13.7612
float64 | (512, 512) | 4 | 0.0100 +/- 0.0002 | 0.0008 +/- 0.0000 | 12.2343
float32 | (512, 512) | 4 | 0.0096 +/- 0.0001 | 0.0008 +/- 0.0000 | 11.5478
float32+ | (512, 512) | 4 | 0.0096 +/- 0.0001 | 0.0006 +/- 0.0000 | 14.8777
float64 | (512, 512) | 5 | 0.0105 +/- 0.0001 | 0.0008 +/- 0.0000 | 12.5838
float32 | (512, 512) | 5 | 0.0100 +/- 0.0003 | 0.0008 +/- 0.0000 | 12.0317
float32+ | (512, 512) | 5 | 0.0100 +/- 0.0003 | 0.0007 +/- 0.0000 | 15.3600
float64 | (4096, 4096) | 2 | 0.6320 +/- 0.0033 | 0.0115 +/- 0.0002 | 55.1890
float32 | (4096, 4096) | 2 | 0.5955 +/- 0.0032 | 0.0116 +/- 0.0001 | 51.3433
float32+ | (4096, 4096) | 2 | 0.5955 +/- 0.0032 | 0.0079 +/- 0.0000 | 74.9288
float64 | (4096, 4096) | 3 | 0.6506 +/- 0.0029 | 0.0113 +/- 0.0000 | 57.5441
float32 | (4096, 4096) | 3 | 0.5954 +/- 0.0025 | 0.0115 +/- 0.0000 | 51.7527
float32+ | (4096, 4096) | 3 | 0.5954 +/- 0.0025 | 0.0080 +/- 0.0000 | 74.7751
float64 | (4096, 4096) | 4 | 0.8933 +/- 0.0015 | 0.0258 +/- 0.0001 | 34.5646
float32 | (4096, 4096) | 4 | 0.8544 +/- 0.0073 | 0.0260 +/- 0.0000 | 32.7994
float32+ | (4096, 4096) | 4 | 0.8544 +/- 0.0073 | 0.0146 +/- 0.0000 | 58.6961
float64 | (4096, 4096) | 5 | 0.9024 +/- 0.0022 | 0.0260 +/- 0.0001 | 34.7707
float32 | (4096, 4096) | 5 | 0.8635 +/- 0.0076 | 0.0262 +/- 0.0001 | 33.0032
float32+ | (4096, 4096) | 5 | 0.8635 +/- 0.0076 | 0.0146 +/- 0.0000 | 59.2901
float64 | (256, 256, 256) | 2 | 0.7596 +/- 0.0008 | 0.0171 +/- 0.0000 | 44.4308
float32 | (256, 256, 256) | 2 | 0.7070 +/- 0.0009 | 0.0173 +/- 0.0000 | 40.8533
float32+ | (256, 256, 256) | 2 | 0.7070 +/- 0.0009 | 0.0109 +/- 0.0000 | 64.7445
float64 | (256, 256, 256) | 3 | 0.7565 +/- 0.0006 | 0.0173 +/- 0.0001 | 43.6464
float32 | (256, 256, 256) | 3 | 0.7138 +/- 0.0021 | 0.0175 +/- 0.0000 | 40.7796
float32+ | (256, 256, 256) | 3 | 0.7138 +/- 0.0021 | 0.0110 +/- 0.0000 | 65.0602
float64 | (256, 256, 256) | 4 | 1.2377 +/- 0.0042 | 0.0319 +/- 0.0000 | 38.7723
float32 | (256, 256, 256) | 4 | 1.1898 +/- 0.0009 | 0.0322 +/- 0.0000 | 37.0058
float32+ | (256, 256, 256) | 4 | 1.1898 +/- 0.0009 | 0.0200 +/- 0.0000 | 59.4640
float64 | (256, 256, 256) | 5 | 1.2767 +/- 0.0008 | 0.0328 +/- 0.0001 | 38.9143
float32 | (256, 256, 256) | 5 | 1.2286 +/- 0.0015 | 0.0330 +/- 0.0001 | 37.2210
float32+ | (256, 256, 256) | 5 | 1.2286 +/- 0.0015 | 0.0201 +/- 0.0000 | 61.0218
float64 | (48, 48, 48, 48) | 2 | 0.2546 +/- 0.0010 | 0.0053 +/- 0.0000 | 47.7702
float32 | (48, 48, 48, 48) | 2 | 0.2277 +/- 0.0027 | 0.0054 +/- 0.0000 | 42.3401
float32+ | (48, 48, 48, 48) | 2 | 0.2277 +/- 0.0027 | 0.0036 +/- 0.0000 | 62.6256
float64 | (48, 48, 48, 48) | 3 | 0.2545 +/- 0.0003 | 0.0056 +/- 0.0000 | 45.8356
float32 | (48, 48, 48, 48) | 3 | 0.2307 +/- 0.0016 | 0.0056 +/- 0.0000 | 41.2424
float32+ | (48, 48, 48, 48) | 3 | 0.2307 +/- 0.0016 | 0.0035 +/- 0.0000 | 65.1985
float64 | (48, 48, 48, 48) | 4 | 0.4512 +/- 0.0055 | 0.0103 +/- 0.0003 | 43.9387
float32 | (48, 48, 48, 48) | 4 | 0.4248 +/- 0.0044 | 0.0097 +/- 0.0000 | 43.9879
float32+ | (48, 48, 48, 48) | 4 | 0.4248 +/- 0.0044 | 0.0053 +/- 0.0000 | 79.8143
float64 | (48, 48, 48, 48) | 5 | 0.4347 +/- 0.0042 | 0.0100 +/- 0.0000 | 43.4595
float32 | (48, 48, 48, 48) | 5 | 0.4089 +/- 0.0029 | 0.0100 +/- 0.0000 | 40.7602
float32+ | (48, 48, 48, 48) | 5 | 0.4089 +/- 0.0029 | 0.0054 +/- 0.0000 | 76.2883

emcastillo (Member) commented:

Sorry for having this review stalled for a while, I plan to look at it carefully in the next few days 🙇‍♂️

emcastillo (Member) commented:

Jenkins, test this please

chainer-ci (Member) commented:

Jenkins CI test (for commit eeca386, target branch master) failed with status FAILURE.

emcastillo (Member) commented:

Jenkins, test this please

Review thread on the following diff hunk:

@@ -28,6 +28,8 @@ def _get_memory_ptrs(x):

def shares_memory(a, b, max_work=None):
    if a is b:
Member commented:

this seems to be breaking some tests when it shouldn't :/

grlee77 (Contributor, Author) replied:
I can take a look, but in the worst case, I think this was just a micro-optimization that can be reverted.

chainer-ci (Member) commented:

Jenkins CI test (for commit eeca386, target branch master) failed with status FAILURE.

emcastillo (Member) commented:

The problem is this test:

def test_zero_size_array(self):
        for xp in (numpy, cupy):
            a = xp.array([])
>           assert xp.shares_memory(a, a) is False

If we change your check to `a is b and a.size != 0`, that should be enough.
Two empty arrays are thought not to share memory.
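A minimal sketch of the adjusted early return (identifier names follow the diff context above; the exact change that landed may differ):

```python
def shares_memory(a, b, max_work=None):
    # identical, non-empty arrays trivially share memory;
    # two empty arrays are not considered to share memory
    if a is b and a.size != 0:
        return True
    ...  # fall through to the existing overlap check
```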

emcastillo (Member) commented:

Jenkins, test this please

chainer-ci (Member) commented:

Jenkins CI test (for commit 75df1ee, target branch master) succeeded!

emcastillo (Member) commented:

BTW can you add these functions to docs? :) thanks!

emcastillo added the st:test-and-merge (deprecated; ready to merge after test pass) label on Nov 11, 2020
emcastillo added this to the v9.0.0a2 milestone on Nov 11, 2020
mergify bot merged commit 14a3766 into cupy:master on Nov 11, 2020
grlee77 (Contributor, Author) commented Nov 11, 2020:

> BTW can you add these functions to docs? :) thanks!

Thanks, this is now done. Please also see issue #4247 regarding the allow_float32 argument that was in this PR. I am not sure whether that is the best approach or whether it would be preferable to default to the allow_float32=True behavior and remove the argument.

Labels: cat:feature (New features/APIs), prio:medium, st:test-and-merge (deprecated; ready to merge after test pass)