Non-local means denoising speedup (v2) #5192

ralfbrown · 2020-05-24T18:25:33Z

(cleaned-up version of #5169 )

I've been rewriting the non-local means denoising code. Here is the result, with both scalar and SSE implementations for both the Denoise (non-local means) and Denoise (profiled) iops. I also refactored the CL code paths, but that is completely untested as I do not have OpenCL [1], and is therefore disabled by default. Change USE_NEW_IMPL_CL at the top of src/iop/nlmeans.c and src/iop/denoiseprofile.c, respectively, to try out the refactored OpenCL code paths.

[1] video card is unsupported. I did try installing pocl, but even after I forced dt to use it, it failed to compile kernels with a "common.h: file not found" error, despite that file being in the directory pointed at by the -I flag....

Per-thread memory use the same as before, though I have added a couple of KB of housekeeping data. Speed is better single-threaded, and much better with large numbers of threads. Results will differ slightly from the previous CPU implementation, but should be closer to the OpenCL code, as the interval between fully recomputing the patch statistics is shorter, allowing less of a buildup of rounding errors.

ETA: rewrote the code to use tiles small enough to fit in L2 cache, with active parts small enough to fit in L1, which now gives a 21-27x speedup with 32 threads and much better single-thread performance with scattered patches as used by denoise(profiled). Timings below are for git master (old), the first version of my code (new), and the revised version of my code (newer).

Timings with the default parameters on the mire1.cr2 image from the integration test suite:
Denoise(profiled)
Threads / old time / new time / newer
1 / 25.735 / 22.532 / 14.275
8 / 3.483 / 3.055 / 1.914
16 / 1.889 / 1.595 / 0.982
24 / 1.556 / 1.118 / 0.708
32 / 1.641 / 0.856 / 0.662

Denoise(non-local means)
Threads / old time / new time / newer
1 / 7.516 / 6.194 / 6.166
8 / 1.435 / 0.854 / 0.799
16 / 1.395 / 0.435 / 0.434
24 / 1.438 / 0.371 / 0.292
32 / 1.437 / 0.238 / 0.298
(I get 0.232 with 30 threads and 0.227 with 31 for the newest code, so the 0.298 with 32 may be an artifact of an uneven split in the tiles.)

Timings using the image and .xmp from the discuss.pixls.us thread "darktable speed (in general, and when using two monitors)"
Denoise (non-local means)
Threads / old time / new time / newer
1 / 14.346 / 11.655 / 12.165
4 / 4.174 / 3.124 / 3.130
8 / 2.783 / 1.589 / 1.568
16 / 2.735 / 0.853 / 0.811
24 / 2.806 / 0.608 / 0.553
28 / 2.853 / 0.605 / 0.486
32 / 2.862 / 0.897 / 0.447

rawfiner · 2020-05-25T15:15:10Z

Some performance testing with denoise profiled (my CPU is an Intel i7-6700, with 4 physical cores, 8 threads):
threads | old time | new time
with scattering = 0
1 | 64.278 | 55.689
2 | 39.376 | 32.158
4 | 65.362 | 55.769
8 | 77.115 | 72.684

with scattering = 1
2 | 65.854 | 50.02

In all cases I get a speedup with this PR (though the performance still scales very badly on my CPU as the number of threads is increased)

rawfiner · 2020-05-25T15:36:49Z

src/common/nlmeans_core.c

+
+// allocate and fill an array of patch definitions
+static struct patch_t*
+define_patches(const dt_nlmeans_param_t *const params, const int stride, int *num_patches, int *max_shift)


If I understand correctly, this defines an array with all the patches positions. Maybe there could be some performance speedup if we sort this array according to the offset (scattering was designed to produce a mess out of the normal offsets, and as such it does not give an ordered output from an ordered input), to avoid some cache misses?

You understand correctly. And once the refactored CL code path is validated as correct, we can give the nlmeans iop the scattering feature just by wiring up the GUI (the power of sharing code instead of making wholesale copies as I'm seeing so frequently in the codebase).

Sorting the array is certainly a possibility that I haven't looked into yet. It did occur to me that we could omit the center pixel (offset 0,0) from the patches array and initialize the output buffer with a copy of the input buffer for about an 0.4% speedup (you save one patch computation out of 225).

What might help more than sorting strictly by offset is sorting such that patches which will often occupy the same or adjacent cache lines get grouped together. That might be a bit tricky to implement, and would be moot for very small scattering values.

I'll give the sorting a try once I finish my current work speeding up wavelets denoise. Only 3% single threaded so far, but 25% with lots of threads thanks to integrating some formerly single-threaded computation as part of existing parallelized code.

ghost · 2020-05-25T21:51:19Z

Builds for me with clang under Windows, and just rescued a couple more noisy photos without a problem. Nice.

ralfbrown · 2020-05-25T22:13:56Z

@coffeeandspaceships was that CPU or OpenCL? If the latter, this PR does nothing unless you flip the #defines at the start of src/iop/denoiseprofile.c and src/iop/nlmeans.c (since I didn't want to inadvertently break anything I can't run myself).
If it was OpenCL and you did enable my refactoring, that's great news!

ghost · 2020-05-25T22:38:33Z

Both. :-) (BUILD with OpenCL, and different BUILD with CPU.)

ralfbrown · 2020-05-25T22:52:37Z

Have you checked the refactored CL code for denoise(non-local means) as well?
CL code should generate pixel-identical results whether it's the git-master version or the refactored version, but CPU will vary slightly due to different accumulation of rounding errors (actually less accumulation for the sped-up code :-)).
I'll prepare a commit that switches denoiseprofile.c to use the refactored CL code, but will hold off on doing the same in nlmeans.c until I hear back.

ghost · 2020-05-26T11:07:28Z

I have some challenges getting the new regression tests to work in my build environment - I think it's to do with MinGW, MinGW's python, and MinGW's OpenCV having some disagreement. I'm going to try setting up from scratch in a virtual Linux box and then fudging @TurboGit 's scripts so the compare part of the regression testing is carried out on the (hopefully working) Linux environment, but using the comparisons generated under Windows. So I won't be able to do the formal part of the test until this evening. (If you want to discuss the OpenCV issue more raise a seperate issue so we don't clutter up your PR :+1)

Meanwhile a bit of confidence for you. This is from a build including your commit last night

2.641977 [opencl_create_kernel] successfully loaded kernel `denoiseprofile_precondition' (129) for device 0 2.644969 [opencl_create_kernel] successfully loaded kernel `denoiseprofile_precondition_v2' (130) for device 0 2.652948 [opencl_create_kernel] successfully loaded kernel `denoiseprofile_precondition_Y0U0V0' (131) for device 0 2.658932 [opencl_create_kernel] successfully loaded kernel `denoiseprofile_init' (132) for device 0 2.662921 [opencl_create_kernel] successfully loaded kernel `denoiseprofile_dist' (133) for device 0 2.665913 [opencl_create_kernel] successfully loaded kernel `denoiseprofile_horiz' (134) for device 0 2.672894 [opencl_create_kernel] successfully loaded kernel `denoiseprofile_vert' (135) for device 0 2.679876 [opencl_create_kernel] successfully loaded kernel `denoiseprofile_accu' (136) for device 0 2.683865 [opencl_create_kernel] successfully loaded kernel `denoiseprofile_finish' (137) for device 0 2.694835 [opencl_create_kernel] successfully loaded kernel `denoiseprofile_finish_v2' (138) for device 0 2.699823 [opencl_create_kernel] successfully loaded kernel `denoiseprofile_backtransform' (139) for device 0 2.703812 [opencl_create_kernel] successfully loaded kernel `denoiseprofile_backtransform_v2' (140) for device 0 2.711795 [opencl_create_kernel] successfully loaded kernel `denoiseprofile_backtransform_Y0U0V0' (141) for device 0 2.718772 [opencl_create_kernel] successfully loaded kernel `denoiseprofile_decompose' (142) for device 0 2.724756 [opencl_create_kernel] successfully loaded kernel `denoiseprofile_synthesize' (143) for device 0 2.729742 [opencl_create_kernel] successfully loaded kernel `denoiseprofile_reduce_first' (144) for device 0 2.734729 [opencl_create_kernel] successfully loaded kernel `denoiseprofile_reduce_second' (145) for device 0

ralfbrown · 2020-05-26T11:21:08Z

If the resulting images look OK using the Mark I Eyeball (tm), it's probably correct. Any mistakes in the refactoring are likely to have dramatic effects since I didn't actually change the computation; in particular, the kernels are untouched. If ImageMagick is working in your Windows environment, you can do "compare old.png new.png -metric pae -verbose diff.png" for old.png and new.png being the outputs without and with this PR under OpenCL. You should get zeros reported for all channels (you won't for images generated using the CPU code path; there it should be no more than 0.00390625 [=1/256]).

(I haven't been able to run TurboGit's deltae script, either, because the version of the colour module available to me doesn't have a delta_E function.)

ghost · 2020-05-27T22:56:50Z

I have somehow only just noticed your pocl comment.

Yeah pocl is not supported, and won't be. I think maybe my Intel OpenCL PR should include removing pocl support explicitly.

ghost · 2020-05-27T23:46:12Z

CPU to OpenCL difference seems quite high:
compare _IGP7076\ no\ opencl.png _IGP7076.png -metric pae -verbose diff.png writing raw profile: type=icc, length=9044 _IGP7076 no opencl.png=>diff.png PNG 4928x3264 4928x3264+0+0 16-bit sRGB 71.3816MiB 35.781u 0:35.095 Image: _IGP7076 no opencl.png Channel distortion: PAE red: 9822 (0.149874) green: 9804 (0.149599) blue: 14808 (0.225956) all: 14808 (0.225956)

ralfbrown · 2020-05-28T03:35:42Z

pocl is already blacklisted, so I had to not only switch off the blocking of CPU OpenCL devices, I had to turn off blacklisting as well. I wasn't expecting more than a sanity check for my code changes.

As for the CL/CPU image difference, are you getting anywhere near that level of difference without this PR? One thing TurboGit and I figured out on another PR is that the gamut mapping done by LCMS can produce a big PAE compared to not using it, so maybe if you have it turned on for export, it's making things look worse for pixels near the gamut boundary.

MStraeten · 2020-05-28T20:05:00Z

my osx build denoise profiled iop with non local means setting is broken for opencl with this pr but non local means denoise iop itself is fine even with opencl
--disable-opencl is ok, but with opencl enabled the image turns black (rgb 0 0 3) when denoise profiled is used with non local means settings (manual and auto)

ralfbrown · 2020-05-29T03:01:17Z

OK, that means I did break something in my refactoring (the PR currently uses the original CL code for denoise-nlmeans and my refactored code for denoise-profiled). Argh...

I guess we'll have to revert 328c27b until someone who knows CL can look at it.

ghost · 2020-05-29T09:21:04Z

I'd spotted that LCMS discussion and therefore switched LCMS off.

compare _IGP7076-no-opencl.png _IGP7076-built-with-opencl-but-blacklisted.png -metric pae -verbose diff-between-built-without-and-built-with-blacklisted.png
writing raw profile: type=icc, length=9044
_IGP7076-no-opencl.png=>diff-between-built-without-and-built-with-blacklisted.png PNG 4928x3264 4928x3264+0+0 16-bit sRGB 71.3816MiB 38.781u 0:38.048
Image: _IGP7076-no-opencl.png
  Channel distortion: PAE
    red: 3088 (0.0471199)
    green: 1229 (0.0187533)
    blue: 1551 (0.0236667)
    all: 3088 (0.0471199)

If I look at the diff, it is like a pale ghost of the original photo and the colour balance looks about right.

The first comparison I made, earlier in this discussion, gave me a diff that was also a faint version of the original but predominantly red.

ghost · 2020-05-29T09:24:47Z

OK, that means I did break something in my refactoring (the PR currently uses the original CL code for denoise-nlmeans and my refactored code for denoise-profiled). Argh...

I guess we'll have to revert 328c27b until someone who knows CL can look at it.

In that case I think it would be sensible to reduce the scope of this PR again (using #ifdef guards) to ensure your code path is not available on systems that have OpenCL capability (whether or not it's then blacklisted). Because otherwise, OpenCL users that don't need this performance boost much anyway, because they already have OpenCL performance, are blocking the improvement for the users that need it because they are on CPU only!

ralfbrown · 2020-05-29T14:56:27Z

The diff file produced by ImageMagick's "compare" is a pale ghost of the original, with differing pixels changed to red. So seeing a pale ghost without red is a good thing.

Since 328c27b removed the original CL code and the #ifdef guard I had put around it, reverting it will do exactly as you suggest. I'll do that later today unless I hear otherwise.

TurboGit · 2020-05-29T16:23:33Z

@ralfbrown : Just a note about all your speed-up PR. I have planed them for 3.4. As you may have seen we are planning a 3.2 release mi August, at this point only very stable work should be integrated.

ralfbrown · 2020-05-29T23:22:59Z

I saw the 3.4 last night. Disappointed, but understand that bugfixes get priority. Hope you will manage to slip some of them into 3.2 anyway (especially the ones with small code changes such as #5213).

This reverts commit 328c27b.

ralfbrown · 2020-06-01T21:32:51Z

Now even more scaleable and even faster with scattered patches in denoise(profiled). Verified max delta-E of 1.57 versus current git master, and only 779 total changed pixels. I'll put up detailed timings once I have a chance to make the scalar code path match the new SSE code path, but the TL;DR is one-third less runtime on denoise(profiled) and marginally (1-2%) slower, but still much faster than master, single-threaded on denoise(nonlocal) and faster than my previous version with more than eight threads.

(Had to force-push because git claimed the branch was behind its remote even after rebasing.)

AxelG-DE · 2020-06-06T07:36:39Z

Hope you will manage to slip some of them into 3.2 anyway (especially the ones with small code changes such as #5213).

would be very nice if on that one the decision for 3.4 could be re-thought. I'd also vote for putting it in 3.2 but must admit, I cannot code hence no comment from my end on quality / stability possible (agree on such prio) :)

github-actions · 2020-07-07T00:18:41Z

This pull request did not get any activity in the past 30 days and will be closed in 365 days if no update occurs. Please verify it has no conflicts with the master branch and rebase if needed. Mention it now if you need help or give permission to other people to finish your work.

TurboGit · 2020-08-13T15:35:52Z

@ralfbrown : I'd like to handle all the speed-up related PR sooner than later to have a chance to fix any possible regressions before 3.4. I'm starting with this one, can you fix the conflicts ? TIA

ralfbrown · 2020-08-13T19:42:53Z

All of the other changes which have been made since May are creating a real mess while trying to rebase. I'll have to set aside some time over the weekend to ensure I do it right.

ralfbrown · 2020-08-15T22:10:03Z

Replaced by #5974.

ralfbrown mentioned this pull request May 24, 2020

Non-local means denoising speedup #5169

Closed

rawfiner reviewed May 25, 2020

View reviewed changes

TurboGit added the scope: performance doing everything the same but faster label May 28, 2020

TurboGit self-requested a review May 28, 2020 08:09

TurboGit added this to the 3.4 milestone May 28, 2020

ralfbrown added 8 commits June 1, 2020 17:17

add new, faster CPU implementation of non-local means denoising

1c2b752

switch nlmeans iop to using new denoising code

f77982b

switch denoiseprofile iop to using new denoising code

6bdaf2c

use refactored code for OpenCL code path

1b03574

Revert "use refactored code for OpenCL code path"

d48bb0a

This reverts commit 328c27b.

clean up compiler-specific handling of omp shared attribute

8a90cd1

rewrite SSE codepath for better L1 cache behavior

f9567c2

add integration test for denoise(nonlocal means)

844e9de

add expected.png to integration test

826682e

ralfbrown force-pushed the nlmeans_speedup2 branch from 69ca088 to 826682e Compare June 1, 2020 21:24

ralfbrown added 3 commits June 1, 2020 17:37

fix omp directive for GCC9/clang

e1c6570

fix conditional compilation for clang

b327a3b

update scalare code to match new SSE code, remove debugging statements

4c3665f

ralfbrown mentioned this pull request Jun 12, 2020

po/fast pipe #5441

Merged

TurboGit added this to wip in Performance / Pixel pipe rework Jun 20, 2020

github-actions bot added the no-pr-activity label Jul 7, 2020

ralfbrown mentioned this pull request Aug 15, 2020

Non-local means denoising speedup (v3) #5974

Merged

ralfbrown closed this Aug 15, 2020

ralfbrown deleted the nlmeans_speedup2 branch June 22, 2021 20:11

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Non-local means denoising speedup (v2) #5192

Non-local means denoising speedup (v2) #5192

ralfbrown commented May 24, 2020 •

edited

rawfiner commented May 25, 2020 •

edited

rawfiner May 25, 2020

ralfbrown May 25, 2020

ghost commented May 25, 2020

ralfbrown commented May 25, 2020

ghost commented May 25, 2020

ralfbrown commented May 25, 2020

ghost commented May 26, 2020

ralfbrown commented May 26, 2020

ghost commented May 27, 2020

ghost commented May 27, 2020

ralfbrown commented May 28, 2020

MStraeten commented May 28, 2020

ralfbrown commented May 29, 2020

ghost commented May 29, 2020

ghost commented May 29, 2020

ralfbrown commented May 29, 2020

TurboGit commented May 29, 2020

ralfbrown commented May 29, 2020

ralfbrown commented Jun 1, 2020

AxelG-DE commented Jun 6, 2020

github-actions bot commented Jul 7, 2020

TurboGit commented Aug 13, 2020

ralfbrown commented Aug 13, 2020

ralfbrown commented Aug 15, 2020

Non-local means denoising speedup (v2) #5192

Non-local means denoising speedup (v2) #5192

Conversation

ralfbrown commented May 24, 2020 • edited

rawfiner commented May 25, 2020 • edited

rawfiner May 25, 2020

Choose a reason for hiding this comment

ralfbrown May 25, 2020

Choose a reason for hiding this comment

ghost commented May 25, 2020

ralfbrown commented May 25, 2020

ghost commented May 25, 2020

ralfbrown commented May 25, 2020

ghost commented May 26, 2020

ralfbrown commented May 26, 2020

ghost commented May 27, 2020

ghost commented May 27, 2020

ralfbrown commented May 28, 2020

MStraeten commented May 28, 2020

ralfbrown commented May 29, 2020

ghost commented May 29, 2020

ghost commented May 29, 2020

ralfbrown commented May 29, 2020

TurboGit commented May 29, 2020

ralfbrown commented May 29, 2020

ralfbrown commented Jun 1, 2020

AxelG-DE commented Jun 6, 2020

github-actions bot commented Jul 7, 2020

TurboGit commented Aug 13, 2020

ralfbrown commented Aug 13, 2020

ralfbrown commented Aug 15, 2020

ralfbrown commented May 24, 2020 •

edited

rawfiner commented May 25, 2020 •

edited