Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Non-local means denoising speedup (v2) #5192

Closed
wants to merge 12 commits into from

Conversation

ralfbrown
Copy link
Collaborator

@ralfbrown ralfbrown commented May 24, 2020

(cleaned-up version of #5169 )

I've been rewriting the non-local means denoising code. Here is the result, with both scalar and SSE implementations for both the Denoise (non-local means) and Denoise (profiled) iops. I also refactored the CL code paths, but that is completely untested as I do not have OpenCL [1], and is therefore disabled by default. Change USE_NEW_IMPL_CL at the top of src/iop/nlmeans.c and src/iop/denoiseprofile.c, respectively, to try out the refactored OpenCL code paths.

[1] video card is unsupported. I did try installing pocl, but even after I forced dt to use it, it failed to compile kernels with a "common.h: file not found" error, despite that file being in the directory pointed at by the -I flag....

Per-thread memory use the same as before, though I have added a couple of KB of housekeeping data. Speed is better single-threaded, and much better with large numbers of threads. Results will differ slightly from the previous CPU implementation, but should be closer to the OpenCL code, as the interval between fully recomputing the patch statistics is shorter, allowing less of a buildup of rounding errors.

ETA: rewrote the code to use tiles small enough to fit in L2 cache, with active parts small enough to fit in L1, which now gives a 21-27x speedup with 32 threads and much better single-thread performance with scattered patches as used by denoise(profiled). Timings below are for git master (old), the first version of my code (new), and the revised version of my code (newer).

Timings with the default parameters on the mire1.cr2 image from the integration test suite:
Denoise(profiled)
Threads / old time / new time / newer
1 / 25.735 / 22.532 / 14.275
8 / 3.483 / 3.055 / 1.914
16 / 1.889 / 1.595 / 0.982
24 / 1.556 / 1.118 / 0.708
32 / 1.641 / 0.856 / 0.662

Denoise(non-local means)
Threads / old time / new time / newer
1 / 7.516 / 6.194 / 6.166
8 / 1.435 / 0.854 / 0.799
16 / 1.395 / 0.435 / 0.434
24 / 1.438 / 0.371 / 0.292
32 / 1.437 / 0.238 / 0.298
(I get 0.232 with 30 threads and 0.227 with 31 for the newest code, so the 0.298 with 32 may be an artifact of an uneven split in the tiles.)

Timings using the image and .xmp from the discuss.pixls.us thread "darktable speed (in general, and when using two monitors)"
Denoise (non-local means)
Threads / old time / new time / newer
1 / 14.346 / 11.655 / 12.165
4 / 4.174 / 3.124 / 3.130
8 / 2.783 / 1.589 / 1.568
16 / 2.735 / 0.853 / 0.811
24 / 2.806 / 0.608 / 0.553
28 / 2.853 / 0.605 / 0.486
32 / 2.862 / 0.897 / 0.447

@rawfiner
Copy link
Collaborator

rawfiner commented May 25, 2020

Some performance testing with denoise profiled (my CPU is an Intel i7-6700, with 4 physical cores, 8 threads):
threads | old time | new time
with scattering = 0
1 | 64.278 | 55.689
2 | 39.376 | 32.158
4 | 65.362 | 55.769
8 | 77.115 | 72.684

with scattering = 1
2 | 65.854 | 50.02

In all cases I get a speedup with this PR (though the performance still scales very badly on my CPU as the number of threads is increased)


// allocate and fill an array of patch definitions
static struct patch_t*
define_patches(const dt_nlmeans_param_t *const params, const int stride, int *num_patches, int *max_shift)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, this defines an array with all the patches positions. Maybe there could be some performance speedup if we sort this array according to the offset (scattering was designed to produce a mess out of the normal offsets, and as such it does not give an ordered output from an ordered input), to avoid some cache misses?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You understand correctly. And once the refactored CL code path is validated as correct, we can give the nlmeans iop the scattering feature just by wiring up the GUI (the power of sharing code instead of making wholesale copies as I'm seeing so frequently in the codebase).

Sorting the array is certainly a possibility that I haven't looked into yet. It did occur to me that we could omit the center pixel (offset 0,0) from the patches array and initialize the output buffer with a copy of the input buffer for about an 0.4% speedup (you save one patch computation out of 225).

What might help more than sorting strictly by offset is sorting such that patches which will often occupy the same or adjacent cache lines get grouped together. That might be a bit tricky to implement, and would be moot for very small scattering values.

I'll give the sorting a try once I finish my current work speeding up wavelets denoise. Only 3% single threaded so far, but 25% with lots of threads thanks to integrating some formerly single-threaded computation as part of existing parallelized code.

@ghost
Copy link

ghost commented May 25, 2020

Builds for me with clang under Windows, and just rescued a couple more noisy photos without a problem. Nice.

@ralfbrown
Copy link
Collaborator Author

@coffeeandspaceships was that CPU or OpenCL? If the latter, this PR does nothing unless you flip the #defines at the start of src/iop/denoiseprofile.c and src/iop/nlmeans.c (since I didn't want to inadvertently break anything I can't run myself).
If it was OpenCL and you did enable my refactoring, that's great news!

@ghost
Copy link

ghost commented May 25, 2020

Both. :-) (BUILD with OpenCL, and different BUILD with CPU.)

@ralfbrown
Copy link
Collaborator Author

Have you checked the refactored CL code for denoise(non-local means) as well?
CL code should generate pixel-identical results whether it's the git-master version or the refactored version, but CPU will vary slightly due to different accumulation of rounding errors (actually less accumulation for the sped-up code :-)).
I'll prepare a commit that switches denoiseprofile.c to use the refactored CL code, but will hold off on doing the same in nlmeans.c until I hear back.

@ghost
Copy link

ghost commented May 26, 2020

I have some challenges getting the new regression tests to work in my build environment - I think it's to do with MinGW, MinGW's python, and MinGW's OpenCV having some disagreement. I'm going to try setting up from scratch in a virtual Linux box and then fudging @TurboGit 's scripts so the compare part of the regression testing is carried out on the (hopefully working) Linux environment, but using the comparisons generated under Windows. So I won't be able to do the formal part of the test until this evening. (If you want to discuss the OpenCV issue more raise a seperate issue so we don't clutter up your PR :+1)

Meanwhile a bit of confidence for you. This is from a build including your commit last night

2.641977 [opencl_create_kernel] successfully loaded kernel `denoiseprofile_precondition' (129) for device 0 2.644969 [opencl_create_kernel] successfully loaded kernel `denoiseprofile_precondition_v2' (130) for device 0 2.652948 [opencl_create_kernel] successfully loaded kernel `denoiseprofile_precondition_Y0U0V0' (131) for device 0 2.658932 [opencl_create_kernel] successfully loaded kernel `denoiseprofile_init' (132) for device 0 2.662921 [opencl_create_kernel] successfully loaded kernel `denoiseprofile_dist' (133) for device 0 2.665913 [opencl_create_kernel] successfully loaded kernel `denoiseprofile_horiz' (134) for device 0 2.672894 [opencl_create_kernel] successfully loaded kernel `denoiseprofile_vert' (135) for device 0 2.679876 [opencl_create_kernel] successfully loaded kernel `denoiseprofile_accu' (136) for device 0 2.683865 [opencl_create_kernel] successfully loaded kernel `denoiseprofile_finish' (137) for device 0 2.694835 [opencl_create_kernel] successfully loaded kernel `denoiseprofile_finish_v2' (138) for device 0 2.699823 [opencl_create_kernel] successfully loaded kernel `denoiseprofile_backtransform' (139) for device 0 2.703812 [opencl_create_kernel] successfully loaded kernel `denoiseprofile_backtransform_v2' (140) for device 0 2.711795 [opencl_create_kernel] successfully loaded kernel `denoiseprofile_backtransform_Y0U0V0' (141) for device 0 2.718772 [opencl_create_kernel] successfully loaded kernel `denoiseprofile_decompose' (142) for device 0 2.724756 [opencl_create_kernel] successfully loaded kernel `denoiseprofile_synthesize' (143) for device 0 2.729742 [opencl_create_kernel] successfully loaded kernel `denoiseprofile_reduce_first' (144) for device 0 2.734729 [opencl_create_kernel] successfully loaded kernel `denoiseprofile_reduce_second' (145) for device 0

@ralfbrown
Copy link
Collaborator Author

If the resulting images look OK using the Mark I Eyeball (tm), it's probably correct. Any mistakes in the refactoring are likely to have dramatic effects since I didn't actually change the computation; in particular, the kernels are untouched. If ImageMagick is working in your Windows environment, you can do "compare old.png new.png -metric pae -verbose diff.png" for old.png and new.png being the outputs without and with this PR under OpenCL. You should get zeros reported for all channels (you won't for images generated using the CPU code path; there it should be no more than 0.00390625 [=1/256]).

(I haven't been able to run TurboGit's deltae script, either, because the version of the colour module available to me doesn't have a delta_E function.)

@ghost
Copy link

ghost commented May 27, 2020

I have somehow only just noticed your pocl comment.

Yeah pocl is not supported, and won't be. I think maybe my Intel OpenCL PR should include removing pocl support explicitly.

@ghost
Copy link

ghost commented May 27, 2020

CPU to OpenCL difference seems quite high:
compare _IGP7076\ no\ opencl.png _IGP7076.png -metric pae -verbose diff.png writing raw profile: type=icc, length=9044 _IGP7076 no opencl.png=>diff.png PNG 4928x3264 4928x3264+0+0 16-bit sRGB 71.3816MiB 35.781u 0:35.095 Image: _IGP7076 no opencl.png Channel distortion: PAE red: 9822 (0.149874) green: 9804 (0.149599) blue: 14808 (0.225956) all: 14808 (0.225956)

@ralfbrown
Copy link
Collaborator Author

pocl is already blacklisted, so I had to not only switch off the blocking of CPU OpenCL devices, I had to turn off blacklisting as well. I wasn't expecting more than a sanity check for my code changes.

As for the CL/CPU image difference, are you getting anywhere near that level of difference without this PR? One thing TurboGit and I figured out on another PR is that the gamut mapping done by LCMS can produce a big PAE compared to not using it, so maybe if you have it turned on for export, it's making things look worse for pixels near the gamut boundary.

@TurboGit TurboGit added the scope: performance doing everything the same but faster label May 28, 2020
@TurboGit TurboGit self-requested a review May 28, 2020 08:09
@TurboGit TurboGit added this to the 3.4 milestone May 28, 2020
@MStraeten
Copy link
Collaborator

my osx build denoise profiled iop with non local means setting is broken for opencl with this pr but non local means denoise iop itself is fine even with opencl
--disable-opencl is ok, but with opencl enabled the image turns black (rgb 0 0 3) when denoise profiled is used with non local means settings (manual and auto)

@ralfbrown
Copy link
Collaborator Author

OK, that means I did break something in my refactoring (the PR currently uses the original CL code for denoise-nlmeans and my refactored code for denoise-profiled). Argh...

I guess we'll have to revert 328c27b until someone who knows CL can look at it.

@ghost
Copy link

ghost commented May 29, 2020

I'd spotted that LCMS discussion and therefore switched LCMS off.

compare _IGP7076-no-opencl.png _IGP7076-built-with-opencl-but-blacklisted.png -metric pae -verbose diff-between-built-without-and-built-with-blacklisted.png
writing raw profile: type=icc, length=9044
_IGP7076-no-opencl.png=>diff-between-built-without-and-built-with-blacklisted.png PNG 4928x3264 4928x3264+0+0 16-bit sRGB 71.3816MiB 38.781u 0:38.048
Image: _IGP7076-no-opencl.png
  Channel distortion: PAE
    red: 3088 (0.0471199)
    green: 1229 (0.0187533)
    blue: 1551 (0.0236667)
    all: 3088 (0.0471199)

If I look at the diff, it is like a pale ghost of the original photo and the colour balance looks about right.

The first comparison I made, earlier in this discussion, gave me a diff that was also a faint version of the original but predominantly red.

@ghost
Copy link

ghost commented May 29, 2020

OK, that means I did break something in my refactoring (the PR currently uses the original CL code for denoise-nlmeans and my refactored code for denoise-profiled). Argh...

I guess we'll have to revert 328c27b until someone who knows CL can look at it.

In that case I think it would be sensible to reduce the scope of this PR again (using #ifdef guards) to ensure your code path is not available on systems that have OpenCL capability (whether or not it's then blacklisted). Because otherwise, OpenCL users that don't need this performance boost much anyway, because they already have OpenCL performance, are blocking the improvement for the users that need it because they are on CPU only!

@ralfbrown
Copy link
Collaborator Author

The diff file produced by ImageMagick's "compare" is a pale ghost of the original, with differing pixels changed to red. So seeing a pale ghost without red is a good thing.

Since 328c27b removed the original CL code and the #ifdef guard I had put around it, reverting it will do exactly as you suggest. I'll do that later today unless I hear otherwise.

@TurboGit
Copy link
Member

@ralfbrown : Just a note about all your speed-up PR. I have planed them for 3.4. As you may have seen we are planning a 3.2 release mi August, at this point only very stable work should be integrated.

@ralfbrown
Copy link
Collaborator Author

I saw the 3.4 last night. Disappointed, but understand that bugfixes get priority. Hope you will manage to slip some of them into 3.2 anyway (especially the ones with small code changes such as #5213).

@ralfbrown
Copy link
Collaborator Author

Now even more scaleable and even faster with scattered patches in denoise(profiled). Verified max delta-E of 1.57 versus current git master, and only 779 total changed pixels. I'll put up detailed timings once I have a chance to make the scalar code path match the new SSE code path, but the TL;DR is one-third less runtime on denoise(profiled) and marginally (1-2%) slower, but still much faster than master, single-threaded on denoise(nonlocal) and faster than my previous version with more than eight threads.

(Had to force-push because git claimed the branch was behind its remote even after rebasing.)

@AxelG-DE
Copy link

AxelG-DE commented Jun 6, 2020

Hope you will manage to slip some of them into 3.2 anyway (especially the ones with small code changes such as #5213).

would be very nice if on that one the decision for 3.4 could be re-thought. I'd also vote for putting it in 3.2 but must admit, I cannot code hence no comment from my end on quality / stability possible (agree on such prio) :)

@github-actions
Copy link

github-actions bot commented Jul 7, 2020

This pull request did not get any activity in the past 30 days and will be closed in 365 days if no update occurs. Please verify it has no conflicts with the master branch and rebase if needed. Mention it now if you need help or give permission to other people to finish your work.

@TurboGit
Copy link
Member

@ralfbrown : I'd like to handle all the speed-up related PR sooner than later to have a chance to fix any possible regressions before 3.4. I'm starting with this one, can you fix the conflicts ? TIA

@ralfbrown
Copy link
Collaborator Author

All of the other changes which have been made since May are creating a real mess while trying to rebase. I'll have to set aside some time over the weekend to ensure I do it right.

@ralfbrown
Copy link
Collaborator Author

Replaced by #5974.

@ralfbrown ralfbrown closed this Aug 15, 2020
@ralfbrown ralfbrown deleted the nlmeans_speedup2 branch June 22, 2021 20:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
no-pr-activity scope: performance doing everything the same but faster
Development

Successfully merging this pull request may close these issues.

None yet

5 participants