Histogram calculation updates #13330
The magic of array reductions seems to work. Also, get rid of temporary histogram buffer. We can just write to the permanent histogram buffer.
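A minimal sketch of the array-reduction idea, not the PR's actual code: the function name `histogram_u8` and the 8-bit input are assumptions chosen for illustration. The OpenMP array-section reduction gives each thread a private copy of the bins and sums the copies when the loop ends, so the loop can write straight into the permanent buffer with no temporary.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper (not a darktable function): count 8-bit samples into
// hist[0 .. bins_count). Each thread gets a private, zero-initialized copy of
// the bins; OpenMP adds the copies into hist when the loop finishes.
void histogram_u8(const uint8_t *in, size_t n, uint32_t *hist, size_t bins_count)
{
  for(size_t b = 0; b < bins_count; b++) hist[b] = 0;

#ifdef _OPENMP
#pragma omp parallel for reduction(+ : hist[:bins_count])
#endif
  for(size_t i = 0; i < n; i++)
  {
    // map 0..255 onto 0..bins_count-1
    const size_t bin = (size_t)in[i] * (bins_count - 1) / 255;
    hist[bin]++;
  }
}
```

Built without OpenMP the pragma is skipped and the loop runs serially with the same result. Where the per-thread copies are stored is up to the OpenMP implementation, which matters for very large bin counts.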
Compilers have gotten better, and functions avoid some macro pitfalls. Coalesce/rename float binning function. Inline the unsigned int binning function. It's unclear if this was ever used. Inline helpers that increment raw bins. These are each only used once (one is for float, one for uint16_t).
It's non-SIMD, non-SSE, and non-OpenMP but still is about as fast as prior code.
This is similar to the prior SSE code, and uses the magic of for_each_channel() to enable SIMD instructions. Note that, as with the prior SSE code, the clamping of bins is done via floating point rather than integer math. Could there be a risk of floating point error producing an out-of-bounds bin? The generated code is about as fast as the prior SSE version, and the assembly is as tidy. It uses dt_aligned_pixel_t, but that shouldn't make a difference as all this is done in registers.
No functional change. Remove the stub for SSE codepath. Akin to the Lab work. Slower than Lab, though, as it does Lab->LCh conversion. About as fast as the prior LCh code.
Get rid of SSE variant, and use for_each_channel() to write a vectorizable RGB histogram calculation. Runs approximately as fast as prior code (i.e. very fast).
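A rough sketch of what a vectorizable per-pixel RGB binning step can look like. The macro `FOR_EACH_CH` is a hypothetical stand-in for darktable's for_each_channel(), and the interleaved layout `hist[4 * bin + channel]` and the function name `bin_rgb_pixel` are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical stand-in for for_each_channel(): a fixed-trip loop over the
// 4 interleaved channels of a pixel, which compilers can usually vectorize.
#define FOR_EACH_CH(c) for(size_t c = 0; c < 4; c++)

// Bin one RGBA float pixel (values nominally in [0,1]) into an interleaved
// 4-channel histogram, assumed layout: hist[4 * bin + channel].
void bin_rgb_pixel(const float px[4], uint32_t *hist, size_t bins_count)
{
  const float max_bin = (float)(bins_count - 1);
  size_t bin[4];
  FOR_EACH_CH(c)
  {
    float v = px[c] * max_bin;
    v = (v > 0.0f) ? v : 0.0f;       // negatives (and NaN) go to bin 0
    v = (v < max_bin) ? v : max_bin; // overexposed values go to the last bin
    bin[c] = (size_t)v;
  }
  // Only R, G and B are counted; the 4th slot of each bin stays unused.
  hist[4 * bin[0] + 0]++;
  hist[4 * bin[1] + 1]++;
  hist[4 * bin[2] + 2]++;
}
```

The scaling and clamping run on all four lanes at once, and only the final increments are scalar, which is roughly the shape the hand-written SSE code had.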
Lose the SSE code, inline the very direct RGB code. It seems to run slightly faster than prior code. Vectorize the dt_ioppr_compensate_middle_grey() call. This produces much more succinct assembly code on gcc and runs about as fast.
No difference in speed, but clearer/shorter code and minimal change in code generated by gcc. Helper does clamping in uint32 rather than float, and get rid of some local const versions of params. Add updated profiling message to note if doing compensation.
Use for_each_channel() and MAX(). RGB, Lab and LCh max code is pretty much the same. Unlike the prior version, this code does count in the maximum the first ab and Ch bins. It continues to not count first RGB or L bins, so that underexposed pixels don't throw off the histogram scale.
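As an illustration of the maximum calculation described here, the sketch below takes a per-channel flag saying whether bin 0 should be ignored. The function name `histogram_max`, the `skip_first` parameter and the interleaved layout `hist[4 * bin + channel]` are assumptions for the example, not the PR's API.

```c
#include <stddef.h>
#include <stdint.h>

// Per-channel maximum of an interleaved 4-channel histogram.
// skip_first[c] != 0 means bin 0 of channel c is left out of the maximum,
// so a pile of underexposed pixels does not set the scale.
void histogram_max(const uint32_t *hist, size_t bins_count,
                   const int skip_first[4], uint32_t max_out[4])
{
  // Seed with bin 0 only for channels where it counts.
  for(size_t c = 0; c < 4; c++)
    max_out[c] = skip_first[c] ? 0 : hist[c];

  for(size_t b = 1; b < bins_count; b++)
    for(size_t c = 0; c < 4; c++)
      if(hist[4 * b + c] > max_out[c]) max_out[c] = hist[4 * b + c];
}
```

For a Lab histogram as described in this commit, the flags would be `{1, 0, 0, 0}`: skip bin 0 of L, count bin 0 of a and b.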
Previously it bypassed dt_histogram_helper(). Perhaps it did this so that it could specifically run a 16-bit raw histogram. Eliminate the float raw histogram, as it is not used by any current code. Instead raw histogram will always be 16 bit. Therefore can also eliminate a no longer used binning function. Max helper function only needs to handle RGB, Lab, and LCh. Raw files are never summed. Report error if unknown input type.
And propagate the change through to rgbcurve iop.
The histogram code alloc's an aligned buffer, so it must be freed as an aligned buffer. It's the caller's responsibility to free the buffer, so update all the callers. Including global histogram, which appears to allocate the buffer anyhow.
It's always bins_count-1 as a float, so don't bother with passing it in and initializing it.
Only realloc the buffer if it has grown. Remember buffer size. As most callers (except levels) don't change the buffer size, each generally only needs to be allocated once. Always allocate if there is no histogram buffer. Don't try to free an empty histogram buffer. Align histogram buffer by 64. If can't allocate new histogram buffer, don't calculate a histogram. Also, align working histogram max values. No changes to generated code by doing this, at least on gcc, but keeps things proper.
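The grow-only allocation policy can be sketched as follows. This is an assumption-laden illustration: `hist_stats_t` and `ensure_hist_buffer` are hypothetical names, and C11 `aligned_alloc()`/`free()` stand in for darktable's own aligned allocation helpers.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

// Hypothetical stats struct holding only what the sketch needs.
typedef struct
{
  uint32_t *buf;   // histogram buffer, 64-byte aligned, or NULL
  size_t buf_size; // allocated size in bytes, 0 if there is no buffer
} hist_stats_t;

// Make sure stats->buf can hold `needed` bytes (assumed > 0).
// Returns 1 on success, 0 if allocation failed; on failure the caller
// should skip the histogram calculation.
int ensure_hist_buffer(hist_stats_t *stats, size_t needed)
{
  // Existing buffer is big enough: reuse it, no reallocation.
  if(stats->buf && needed <= stats->buf_size) return 1;

  // aligned_alloc() wants a size that is a multiple of the alignment.
  const size_t rounded = (needed + 63) / 64 * 64;
  uint32_t *fresh = aligned_alloc(64, rounded);
  if(!fresh) return 0; // sketch choice: the old buffer is left untouched

  free(stats->buf); // free(NULL) is a no-op, so an empty buffer is fine
  stats->buf = fresh;
  stats->buf_size = rounded;
  return 1;
}
```

A caller that always asks for the same bin count allocates once on the first call and reuses the buffer afterwards; only a caller that increases its bin count triggers a second allocation.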
No need to have a separate maximum calculation function, when it requires almost no processing and uses basically same parameters. Instead, if histogram_max is non NULL (in every case except exposure deflicker), calculate the maximum and store it there. This also makes for less noisy perf diagnostics, as there is only one call, not two. Also: If unknown input colorspace for histogram, display an error.
@dtorop : The CI failed with:
Drat! It's an LLVM thing, but it's fixable. There's also an error with converting negative values to unsigned int, so a couple of fixes are on the way.
Last commit should fix LLVM compilation, as well as a bug on my part which produced bad results with negative input, and generally polishes things. A remaining worry: Does the OpenMP array reduction used in this PR for the binned data use the stack for storing per-thread data? This isn't a worry for most histograms (256 bins is 4kb/thread, and stack is 8MB). But levels can ask for 16384 bins, and exposure's deflicker asks for 65536 bins.
Bug fix: Clamp negative values to >= 0 while they're still float (or signed int). Previously they were converted to unsigned int then clamped, which resulted in negative values ending up in the highest bin (rather than the lowest). Clamp the bin # in floating point before converting to int. This simplifies code and matches how it was done in the code before the recent commits. As the precision of floating point is far above 2^16 (maximum # of bins we'll ever see), there's no precision benefit in doing this work via ints. Bug fix: Fix compilation of reduction on LLVM by including bins_total as a shared private variable. Don't count the 0 bin "C" in LCh when determining maximum, as it may contain a count of negative pixels. Set the max_bin as a constant, so decrements don't happen within loops. Use size_t rather than uint32_t when offsetting arrays. Use restrict liberally on pointers in function parameters. Rename internal functions with underscores and briefer names. In general keep variable names brief.
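To make the clamping fix concrete, here is a hedged scalar sketch. The name `clamp_bin` and its signature are assumptions for illustration and need not match the PR's `_clamp_bin()`; `bins_exact_in_float` is a hypothetical helper that checks the precision claim for bin indices below 2^16.

```c
#include <stddef.h>

// Hypothetical scalar clamp: `v` is the already-scaled bin position and
// `max_bin` is bins_count - 1 as a float. Clamping happens in float, so a
// negative input lands in bin 0 instead of wrapping around to a huge
// unsigned value.
size_t clamp_bin(float v, float max_bin)
{
  // Written so that NaN also falls through to bin 0.
  const float lo = (v > 0.0f) ? v : 0.0f;
  const float hi = (lo < max_bin) ? lo : max_bin;
  return (size_t)hi; // safe: 0 <= hi <= max_bin
}

// Returns 1 if every bin index below 2^16 survives an int -> float -> int
// round trip, i.e. float loses nothing at the bin counts discussed here.
int bins_exact_in_float(void)
{
  for(size_t b = 0; b < 65536; b++)
    if((size_t)(float)b != b) return 0;
  return 1;
}
```

A float has a 24-bit significand, so all integers up to 2^24 are exactly representable, comfortably above the 65536-bin case.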
Force-pushed from c06e1fb to b1112cd.
Looks like CI failed due to a network problem. I force-pushed the last commit to convince it to run again.
These are 1-channel, so allocate a 1-channel buffer, rather than a 4-channel buffer which is 75% empty. As the auto exposure code uses a 65536 bin histogram, and the OpenMP array reduction of histogram may create copies of it on the stack, it makes sense to shrink this buffer (now 256kb, prior to this commit was 1MB). Change to execution speed appears negligible. Both this and prior version execute in approx. 0.010 secs (0.100 CPU) on an i7-10750H.
Also:
- s/bnum/bin/ in _clamp_bin() for clarity
- update copyright on all files touched by these commits.
The last commit reduces the storage needed for exposure. So now most histograms will need a 4kb buffer per thread, excepting exposure w/deflicker and levels w/auto which use 256kb/thread. I think that's OK, even if it does turn out to be on the stack. But I don't have deep experience in this.
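The per-thread figures quoted in this thread can be reproduced with simple arithmetic, assuming each counter is a 32-bit unsigned integer (an assumption of this sketch; `hist_bytes` is a hypothetical helper name).

```c
#include <stddef.h>
#include <stdint.h>

// Bytes needed for one copy of a histogram with `bins` bins and `channels`
// interleaved uint32_t counters per bin.
size_t hist_bytes(size_t bins, size_t channels)
{
  return bins * channels * sizeof(uint32_t);
}
```

With 4-byte counters, 256 bins x 4 channels is 4096 bytes (4kb), 65536 bins x 4 channels is 1048576 bytes (1MB), and 65536 bins x 1 channel is 262144 bytes (256kb). A 16384-bin, 4-channel histogram also comes to 262144 bytes.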
All good, will merge and will fix the code as per comment.
@TurboGit: Thank you! And thanks for {} cleanups... I'll be curious if this helps Windows instability. At least it should be slightly less scary to debug that now.
@dtorop : Can you add an entry for the release notes (as comment here and I'll integrate)? TIA.
@TurboGit: Belated release notes: "Modernize the histogram calculation code. Remove SSE code (which provides no speed-ups), but use it as a model for the optimized code using recent OpenMP features. Remove various unused bits of code, and provide a consistent internal API. In certain cases this code will produce marginally more accurate results. In some cases the new code uses substantially less memory." Again, I can make this more succinct!
Thanks!
This modernizes the histogram calculation code (the code in common/histogram.c, not to be confused with the UI scope code in libs/histogram.c). It removes the hand-written SSE code, but the generated code should be equivalent in speed. It removes approx. 189 LOC.
Changes:
- Code that previously bypassed dt_histogram_helper() now uses that call to generate a histogram of the raw image data (which will be 16 bit unsigned int).
- The caller of dt_histogram_helper() must free any returned buffer, but now it must use dt_free_align().
- The new code is modeled on the prior SSE code and uses for_each_channel(), hence the generated assembly should look eerily similar in many cases.
- Clamping of bin numbers is now always done by converting them to integers first. Previously they were sometimes clamped as floats. This shouldn't make any difference in the results, but is more "correct". Wrong! Clamping negative floats as unsigned integers is not advisable; bins are now clamped in floating point before conversion to int.
- The mul parameter was removed from dt_dev_histogram_collection_params_t. It appears to have been intended for scaling the binning to something other than the maximum # of bins, but no current code used this feature.
- A buf_size parameter was added to dt_dev_histogram_stats_t. This allows the histogram code to calculate if the # of bins has changed, and if so allocate a larger buffer if necessary. (Changing of the buffer size is only used by the levels iop. It might be a bit more slick to put this tracking/reallocating responsibility on the levels iop.)
- dt_histogram_max_helper() as a separate call was eliminated. Pass the histogram_max array in to the main dt_histogram_helper() function instead, or NULL if no maximum calculation is needed (which is the case for exposure deflicker).
- Bin 0 is not counted in the maximum for the R, G, B, L, and C channels, to avoid scaling to underexposed values. But not so much for the a, b and h values, so bin 0 for these is counted.
- compensate_middle_grey is now a boolean (it was an int).

Rationale for this work: