Wavelets denoising speedup #5204
Conversation
…on into those functions
Previous git master computed the variance by making a single continuous summation of squared pixel values in the wavelet scales. The new code in this branch sums one line at a time and then adds the line-wise sums to the overall sum. With 20+ megapixels, values added towards the lower right of the image in the old approach effectively have only a few significant bits. By uncommenting the #define VALIDATE_SUMY2 near the top of denoiseprofile.c, you enable the output of three debugging lines per wavelet scale. These show, for each channel, the sum of squares computed three ways: incrementally while performing the decomposition; as a single continuous sum (the old way); and by taking row-wise sums and adding together those partial sums. When the number of threads equals one, the decomposition should give the exact same answer as the incremental summation shown in the first line. As the number of threads increases, the values will start drifting apart, because the decomposition will then be performing a sum of sums of sums, which is even more accurate than the strict row-wise summation.
Thanks for looking into the performance optimisation of this too! About the differences due to numerical instability: if they induce "big" changes in the result for a set of parameters, we will need to add a new parameter to choose between the legacy and new versions. Though, from my first tests, the changes do not seem noticeable, so we probably won't need that. About performance, preliminary tests are overall good on my side:
On the integration-test image, the reported Peak Absolute Error was the smallest increment representable in an 8-bit-per-channel file (though that can yield a delta_E > 2 in darker pixels).
The result on the test image is:
Only 8 pixels above dE 2.0, so good to me. I'll integrate and update the reference image.
Manually merged.
Some small speedups from optimization, and a larger relative speedup at higher thread counts by parallelizing a formerly single-threaded computation. Note that this code generates a slightly different result than the old single-threaded version. I have verified that the difference is due to numerical instability in the old code and that the new version is in fact more accurate (the fourth commit adds some code to demonstrate; see the full commit comment, and feel free to leave it out of any merge).
Timings with default parameters in Denoise(profiled) on the integration-test image:
| Threads | Old (s) | New (s) | Delta  |
| ------- | ------- | ------- | ------ |
| 1       | 11.712  | 11.342  | -3.1%  |
| 2       | 6.108   | 5.814   | -4.8%  |
| 4       | 3.306   | 3.068   | -7.2%  |
| 8       | 1.944   | 1.691   | -13.0% |
| 16      | 1.181   | 0.969   | -18%   |
| 32      | 0.759   | 0.556   | -26%   |
| 64      | 0.611   | 0.371   | -39%   |