Wavelets denoising speedup #5204

ralfbrown · 2020-05-26T00:40:45Z

Some small speedups from optimization, and a larger relative speedup at higher thread counts by parallelizing a formerly single-threaded computation. Note that this code generates a slightly different result than the old single-threaded version. I have verified that the difference is due to numerical instability in the old code and that the new version is in fact more accurate (the fourth commit adds some code to demonstrate; see the full commit comment, and feel free to leave it out of any merge).

Timings with default parameters in Denoise(profiled) on the integration-test image:
Threads / Old / New (delta%)
1 / 11.712 / 11.342 (-3.1%)
2 / 6.108 / 5.814 (-4.8%)
4 / 3.306 / 3.068 (-7.2%)
8 / 1.944 / 1.691 (-13.0%)
16 / 1.181 / 0.969 (-18%)
32 / 0.759 / 0.556 (-26%)
64 / 0.611 / 0.371 (-39%)

…on into those functions

Previous git master computed the variance by making a single continuous summation of squared pixel values in the wavelet scales. The new code in this branch sums at line at a time and then adds the line-wise sums to the overall sum. With 20+ megapixels, values added towards the lower right of the image in the old approach effectively have only a few significant bits. By uncommenting the #define VALIDATE_SUMY2 near the top of denoiseprofile.c, you enable the output of three debugging lines per wavelet scale. These show the three per-channel sums of squares computed incrementally while performing the decomposition, as a single continuous sum (the old way), and by taking row-wise sums and adding together those partial sums. When the number of threads equals one, this should give the exact same answer as the incremental summation shown in the first line. As the number of threads increases, the values will start drifting apart, because the decomposition will then be performing a sum of sums of sums, which is even more accurate than the strict row-wise summation.

rawfiner · 2020-05-26T10:46:42Z

Thanks for looking into the performance optimisation of this too!

About the difference due to numerical instability: if they induce "big" changes in the result for a set of parameters, we will need to add a new parameter to choose between legacy and new version. Though, from my first tests, the changes do not seem noticeable, so we probably won't need to do that.

About performance, preliminary tests are overall good on my side:
with 4 threads:
old | new
11.284 | 10.937
with 2 threads:
20.826 | 20.553
with 8 threads:
8.391 | 7.715

ralfbrown · 2020-05-26T10:57:01Z

On the integration-test image, the reported Peak Absolute Error was the smallest increment representable in an 8bit-per-channel file (though that can yield a delta_E > 2 in darker pixels).

github-actions · 2020-06-28T00:15:25Z

This pull request did not get any activity in the past 30 days and will be closed in 365 days if no update occurs. Please verify it has no conflicts with the master branch and rebase if needed. Mention it now if you need help or give permission to other people to finish your work.

TurboGit · 2020-08-13T15:42:03Z

The result the test image is:

Test 0013-denoiseprofile-wavelets
      Image mire1.cr2
      CPU & GPU version differ by 60158 pixels
      Max  dE         2.1908
      Mean dE         0.0014
      Count above max 8
  FAILS: image visually changed
         see diff.jpg for visual difference
         (11441 pixels changed)

On 8 pixels above dE 2.0, so good to me. I'll integrate and update the reference image.

TurboGit · 2020-08-13T15:48:41Z

Manually merged.

ralfbrown added 4 commits May 25, 2020 19:16

optimize precondition_Y0U0V0 and backtransform_Y0U0V0

f4f8ccb

optimize eaw_decompose and eaw_decompose_sse, move variance computati…

6b13166

…on into those functions

optimize eaw_synthesize_sse

128181a

ralfbrown mentioned this pull request May 26, 2020

Add integration test for 'denoise (profiled)' using wavelets in Y0U0V0 #5203

Closed

TurboGit added the scope: performance doing everything the same but faster label May 28, 2020

TurboGit self-requested a review May 28, 2020 08:08

TurboGit added this to the 3.4 milestone May 28, 2020

TurboGit added this to wip in Performance / Pixel pipe rework Jun 20, 2020

github-actions bot added the no-pr-activity label Jun 28, 2020

TurboGit merged commit 2307ac9 into darktable-org:master Aug 13, 2020

ralfbrown deleted the wavelets_speedup branch August 17, 2020 05:41

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Wavelets denoising speedup #5204

Wavelets denoising speedup #5204

ralfbrown commented May 26, 2020

rawfiner commented May 26, 2020

ralfbrown commented May 26, 2020

github-actions bot commented Jun 28, 2020

TurboGit commented Aug 13, 2020

TurboGit commented Aug 13, 2020

Wavelets denoising speedup #5204

Wavelets denoising speedup #5204

Conversation

ralfbrown commented May 26, 2020

rawfiner commented May 26, 2020

ralfbrown commented May 26, 2020

github-actions bot commented Jun 28, 2020

TurboGit commented Aug 13, 2020

TurboGit commented Aug 13, 2020