Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wavelets denoising speedup #5204

Merged
merged 4 commits into from
Aug 13, 2020
Merged

Conversation

ralfbrown
Copy link
Collaborator

Some small speedups from optimization, and a larger relative speedup at higher thread counts by parallelizing a formerly single-threaded computation. Note that this code generates a slightly different result than the old single-threaded version. I have verified that the difference is due to numerical instability in the old code and that the new version is in fact more accurate (the fourth commit adds some code to demonstrate; see the full commit comment, and feel free to leave it out of any merge).

Timings with default parameters in Denoise(profiled) on the integration-test image:
Threads / Old / New (delta%)
1 / 11.712 / 11.342 (-3.1%)
2 / 6.108 / 5.814 (-4.8%)
4 / 3.306 / 3.068 (-7.2%)
8 / 1.944 / 1.691 (-13.0%)
16 / 1.181 / 0.969 (-18%)
32 / 0.759 / 0.556 (-26%)
64 / 0.611 / 0.371 (-39%)

Previous git master computed the variance by making a single
continuous summation of squared pixel values in the wavelet scales.
The new code in this branch sums at line at a time and then adds the
line-wise sums to the overall sum.  With 20+ megapixels, values added
towards the lower right of the image in the old approach effectively
have only a few significant bits.

By uncommenting the #define VALIDATE_SUMY2 near the top of
denoiseprofile.c, you enable the output of three debugging lines per
wavelet scale.  These show the three per-channel sums of squares
computed incrementally while performing the decomposition, as a single
continuous sum (the old way), and by taking row-wise sums and adding
together those partial sums.  When the number of threads equals one,
this should give the exact same answer as the incremental summation
shown in the first line.  As the number of threads increases, the
values will start drifting apart, because the decomposition will then
be performing a sum of sums of sums, which is even more accurate than
the strict row-wise summation.
@rawfiner
Copy link
Collaborator

Thanks for looking into the performance optimisation of this too!

About the difference due to numerical instability: if they induce "big" changes in the result for a set of parameters, we will need to add a new parameter to choose between legacy and new version. Though, from my first tests, the changes do not seem noticeable, so we probably won't need to do that.

About performance, preliminary tests are overall good on my side:
with 4 threads:
old | new
11.284 | 10.937
with 2 threads:
20.826 | 20.553
with 8 threads:
8.391 | 7.715

@ralfbrown
Copy link
Collaborator Author

On the integration-test image, the reported Peak Absolute Error was the smallest increment representable in an 8bit-per-channel file (though that can yield a delta_E > 2 in darker pixels).

@TurboGit TurboGit added the scope: performance doing everything the same but faster label May 28, 2020
@TurboGit TurboGit self-requested a review May 28, 2020 08:08
@TurboGit TurboGit added this to the 3.4 milestone May 28, 2020
@github-actions
Copy link

This pull request did not get any activity in the past 30 days and will be closed in 365 days if no update occurs. Please verify it has no conflicts with the master branch and rebase if needed. Mention it now if you need help or give permission to other people to finish your work.

@TurboGit
Copy link
Member

The result the test image is:

Test 0013-denoiseprofile-wavelets
      Image mire1.cr2
      CPU & GPU version differ by 60158 pixels
      Max  dE         2.1908
      Mean dE         0.0014
      Count above max 8
  FAILS: image visually changed
         see diff.jpg for visual difference
         (11441 pixels changed)

On 8 pixels above dE 2.0, so good to me. I'll integrate and update the reference image.

@TurboGit TurboGit merged commit 2307ac9 into darktable-org:master Aug 13, 2020
@TurboGit
Copy link
Member

Manually merged.

@ralfbrown ralfbrown deleted the wavelets_speedup branch August 17, 2020 05:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
no-pr-activity scope: performance doing everything the same but faster
Development

Successfully merging this pull request may close these issues.

None yet

3 participants