-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Non-local means denoising speedup (v2) #5192
Conversation
Some performance testing with denoise profiled (my CPU is an Intel i7-6700, with 4 physical cores, 8 threads): with scattering = 1 In all cases I get a speedup with this PR (though the performance still scales very badly on my CPU as the number of threads is increased) |
|
||
// allocate and fill an array of patch definitions | ||
static struct patch_t* | ||
define_patches(const dt_nlmeans_param_t *const params, const int stride, int *num_patches, int *max_shift) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand correctly, this defines an array with all the patches positions. Maybe there could be some performance speedup if we sort this array according to the offset (scattering was designed to produce a mess out of the normal offsets, and as such it does not give an ordered output from an ordered input), to avoid some cache misses?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You understand correctly. And once the refactored CL code path is validated as correct, we can give the nlmeans iop the scattering feature just by wiring up the GUI (the power of sharing code instead of making wholesale copies as I'm seeing so frequently in the codebase).
Sorting the array is certainly a possibility that I haven't looked into yet. It did occur to me that we could omit the center pixel (offset 0,0) from the patches array and initialize the output buffer with a copy of the input buffer for about an 0.4% speedup (you save one patch computation out of 225).
What might help more than sorting strictly by offset is sorting such that patches which will often occupy the same or adjacent cache lines get grouped together. That might be a bit tricky to implement, and would be moot for very small scattering values.
I'll give the sorting a try once I finish my current work speeding up wavelets denoise. Only 3% single threaded so far, but 25% with lots of threads thanks to integrating some formerly single-threaded computation as part of existing parallelized code.
Builds for me with clang under Windows, and just rescued a couple more noisy photos without a problem. Nice. |
@coffeeandspaceships was that CPU or OpenCL? If the latter, this PR does nothing unless you flip the #defines at the start of src/iop/denoiseprofile.c and src/iop/nlmeans.c (since I didn't want to inadvertently break anything I can't run myself). |
Both. :-) (BUILD with OpenCL, and different BUILD with CPU.) |
Have you checked the refactored CL code for denoise(non-local means) as well? |
I have some challenges getting the new regression tests to work in my build environment - I think it's to do with MinGW, MinGW's python, and MinGW's OpenCV having some disagreement. I'm going to try setting up from scratch in a virtual Linux box and then fudging @TurboGit 's scripts so the compare part of the regression testing is carried out on the (hopefully working) Linux environment, but using the comparisons generated under Windows. So I won't be able to do the formal part of the test until this evening. (If you want to discuss the OpenCV issue more raise a seperate issue so we don't clutter up your PR :+1) Meanwhile a bit of confidence for you. This is from a build including your commit last night
|
If the resulting images look OK using the Mark I Eyeball (tm), it's probably correct. Any mistakes in the refactoring are likely to have dramatic effects since I didn't actually change the computation; in particular, the kernels are untouched. If ImageMagick is working in your Windows environment, you can do "compare old.png new.png -metric pae -verbose diff.png" for old.png and new.png being the outputs without and with this PR under OpenCL. You should get zeros reported for all channels (you won't for images generated using the CPU code path; there it should be no more than 0.00390625 [=1/256]). (I haven't been able to run TurboGit's deltae script, either, because the version of the colour module available to me doesn't have a delta_E function.) |
I have somehow only just noticed your pocl comment. Yeah pocl is not supported, and won't be. I think maybe my Intel OpenCL PR should include removing pocl support explicitly. |
CPU to OpenCL difference seems quite high: |
pocl is already blacklisted, so I had to not only switch off the blocking of CPU OpenCL devices, I had to turn off blacklisting as well. I wasn't expecting more than a sanity check for my code changes. As for the CL/CPU image difference, are you getting anywhere near that level of difference without this PR? One thing TurboGit and I figured out on another PR is that the gamut mapping done by LCMS can produce a big PAE compared to not using it, so maybe if you have it turned on for export, it's making things look worse for pixels near the gamut boundary. |
my osx build denoise profiled iop with non local means setting is broken for opencl with this pr but non local means denoise iop itself is fine even with opencl |
OK, that means I did break something in my refactoring (the PR currently uses the original CL code for denoise-nlmeans and my refactored code for denoise-profiled). Argh... I guess we'll have to revert 328c27b until someone who knows CL can look at it. |
I'd spotted that LCMS discussion and therefore switched LCMS off.
If I look at the diff, it is like a pale ghost of the original photo and the colour balance looks about right. The first comparison I made, earlier in this discussion, gave me a diff that was also a faint version of the original but predominantly red. |
In that case I think it would be sensible to reduce the scope of this PR again (using #ifdef guards) to ensure your code path is not available on systems that have OpenCL capability (whether or not it's then blacklisted). Because otherwise, OpenCL users that don't need this performance boost much anyway, because they already have OpenCL performance, are blocking the improvement for the users that need it because they are on CPU only! |
The diff file produced by ImageMagick's "compare" is a pale ghost of the original, with differing pixels changed to red. So seeing a pale ghost without red is a good thing. Since 328c27b removed the original CL code and the #ifdef guard I had put around it, reverting it will do exactly as you suggest. I'll do that later today unless I hear otherwise. |
@ralfbrown : Just a note about all your speed-up PR. I have planed them for 3.4. As you may have seen we are planning a 3.2 release mi August, at this point only very stable work should be integrated. |
I saw the 3.4 last night. Disappointed, but understand that bugfixes get priority. Hope you will manage to slip some of them into 3.2 anyway (especially the ones with small code changes such as #5213). |
This reverts commit 328c27b.
69ca088
to
826682e
Compare
Now even more scaleable and even faster with scattered patches in denoise(profiled). Verified max delta-E of 1.57 versus current git master, and only 779 total changed pixels. I'll put up detailed timings once I have a chance to make the scalar code path match the new SSE code path, but the TL;DR is one-third less runtime on denoise(profiled) and marginally (1-2%) slower, but still much faster than master, single-threaded on denoise(nonlocal) and faster than my previous version with more than eight threads. (Had to force-push because git claimed the branch was behind its remote even after rebasing.) |
would be very nice if on that one the decision for 3.4 could be re-thought. I'd also vote for putting it in 3.2 but must admit, I cannot code hence no comment from my end on quality / stability possible (agree on such prio) :) |
This pull request did not get any activity in the past 30 days and will be closed in 365 days if no update occurs. Please verify it has no conflicts with the master branch and rebase if needed. Mention it now if you need help or give permission to other people to finish your work. |
@ralfbrown : I'd like to handle all the speed-up related PR sooner than later to have a chance to fix any possible regressions before 3.4. I'm starting with this one, can you fix the conflicts ? TIA |
All of the other changes which have been made since May are creating a real mess while trying to rebase. I'll have to set aside some time over the weekend to ensure I do it right. |
Replaced by #5974. |
(cleaned-up version of #5169 )
I've been rewriting the non-local means denoising code. Here is the result, with both scalar and SSE implementations for both the Denoise (non-local means) and Denoise (profiled) iops. I also refactored the CL code paths, but that is completely untested as I do not have OpenCL [1], and is therefore disabled by default. Change USE_NEW_IMPL_CL at the top of src/iop/nlmeans.c and src/iop/denoiseprofile.c, respectively, to try out the refactored OpenCL code paths.
[1] video card is unsupported. I did try installing pocl, but even after I forced dt to use it, it failed to compile kernels with a "common.h: file not found" error, despite that file being in the directory pointed at by the -I flag....
Per-thread memory use the same as before, though I have added a couple of KB of housekeeping data. Speed is better single-threaded, and much better with large numbers of threads. Results will differ slightly from the previous CPU implementation, but should be closer to the OpenCL code, as the interval between fully recomputing the patch statistics is shorter, allowing less of a buildup of rounding errors.
ETA: rewrote the code to use tiles small enough to fit in L2 cache, with active parts small enough to fit in L1, which now gives a 21-27x speedup with 32 threads and much better single-thread performance with scattered patches as used by denoise(profiled). Timings below are for git master (old), the first version of my code (new), and the revised version of my code (newer).
Timings with the default parameters on the mire1.cr2 image from the integration test suite:
Denoise(profiled)
Threads / old time / new time / newer
1 / 25.735 / 22.532 / 14.275
8 / 3.483 / 3.055 / 1.914
16 / 1.889 / 1.595 / 0.982
24 / 1.556 / 1.118 / 0.708
32 / 1.641 / 0.856 / 0.662
Denoise(non-local means)
Threads / old time / new time / newer
1 / 7.516 / 6.194 / 6.166
8 / 1.435 / 0.854 / 0.799
16 / 1.395 / 0.435 / 0.434
24 / 1.438 / 0.371 / 0.292
32 / 1.437 / 0.238 / 0.298
(I get 0.232 with 30 threads and 0.227 with 31 for the newest code, so the 0.298 with 32 may be an artifact of an uneven split in the tiles.)
Timings using the image and .xmp from the discuss.pixls.us thread "darktable speed (in general, and when using two monitors)"
Denoise (non-local means)
Threads / old time / new time / newer
1 / 14.346 / 11.655 / 12.165
4 / 4.174 / 3.124 / 3.130
8 / 2.783 / 1.589 / 1.568
16 / 2.735 / 0.853 / 0.811
24 / 2.806 / 0.608 / 0.553
28 / 2.853 / 0.605 / 0.486
32 / 2.862 / 0.897 / 0.447