
Histogram calculation updates #13330


Merged: 18 commits into darktable-org:master on Jan 16, 2023

Conversation

@dtorop (Contributor) commented Jan 12, 2023

This modernizes the histogram calculation code (the code in common/histogram.c, not to be confused with the UI scope code in libs/histogram.c). It removes the hand-written SSE code, but the generated code should be equivalent in speed. It removes approx. 189 LOC.

Changes:

  • Remove the code to generate a histogram of floating point raw images. It appears to be unused anywhere in the current darktable codebase.
  • The exposure deflicker code previously bypassed the standard histogram call dt_histogram_helper(). It now uses that call to generate a histogram of the raw image data (which will be 16 bit unsigned int).
  • The histogram data buffer is now 64-byte aligned. As before, the caller of dt_histogram_helper() must free any returned buffer, but now it must use dt_free_align().
  • As mentioned, all the SSE code is gone, but I used it as a model for the revised code (which relies heavily on for_each_channel()); hence the generated assembly should look eerily similar in many cases.
  • Clamping of bin numbers is now always done by converting them to integers first. Previously they were sometimes clamped as floats. This shouldn't make any difference in the results, but is more "correct". (Edit: this turned out to be wrong. Clamping negative floats as unsigned integers is not advisable, so a later commit restores clamping in floating point before the integer conversion.)
  • The mul parameter was removed from dt_dev_histogram_collection_params_t. It appears to have been intended for scaling the binning to something other than the maximum # of bins, but no current code used this feature.
  • A buf_size parameter was added to dt_dev_histogram_stats_t. This allows the histogram code to calculate if the # of bins has changed, and if so allocate a larger buffer if necessary. (Changing of the buffer size is only used by the levels iop. It might be a bit more slick to put this tracking/reallocating responsibility on the levels iop.)
  • dt_histogram_max_helper() as a separate call was eliminated. Pass the histogram_max array in to the main dt_histogram_helper() function instead, or NULL if no maximum calculation is needed (which is the case for exposure deflicker).
  • The prior code ignored the first bin when calculating max values for each channel. This makes sense for the R, G, B, L, and C channels to avoid scaling to underexposed values. But not so much for the a, b and h values, so bin 0 for these is counted.
  • Rather than using hand-rolled per-thread buffers and explicitly written reduction code when using OpenMP threads, the code relies on OpenMP array reductions. This simplifies the code, and there appears to be no speed difference.
  • compensate_middle_grey is now a boolean (it was an int)

Rationale for this work:

  • Compilers are increasingly capable, and it doesn't make sense in many cases to handcode SIMD instructions.
  • OpenMP 4.5 allows for array reductions, which further reduces code size.
  • The prior histogram calculation code, when perused, reads like magic. This PR's code is much less magic and more succinct, and hence should be easier to reason about.
  • There has been some speculation that the crashes on Windows are caused by unaligned access to histogram data. Though I don't see any obvious signs of problems here, this code does align the histogram data, which might help.
  • There has been so much revision in the use of histogram code over the years that it makes sense to perform an overhaul. This allows for discovering unused code, etc. As the actual uses of histograms in darktable have become clear, it makes sense to build the histogram calculation around how the histograms are actually used, rather than coding something slightly more generalized.
  • Prior code often had OpenMP and SSE execution paths for the same operation. This PR only has one execution path for each operation.

dtorop added 16 commits January 10, 2023 23:16
The magic of array reductions seems to work.

Also, get rid of temporary histogram buffer. We can just write to the
permanent histogram buffer.
Compilers have gotten better, functions avoid some macro pitfalls.

Coalesce/rename float binning function.

Inline the unsigned int binning function. It's unclear if this was
ever used.

Inline the helpers that increment raw bins. Each is only used once
(one for float, one for uint16_t).
It's non-SIMD, non-SSE, and non-OpenMP but still is about as fast as
prior code.
This is similar to the prior SSE code, and uses the magic of
for_each_channel() to enable SIMD instructions.

Note that, as with the prior SSE code, the clamping of bins is done
via floating point rather than integer math, hence could there be a
risk of floating point error producing an out-of-bounds bin?

The code produced is about as fast (and tidy assembly) as the prior
SSE version.

It uses dt_aligned_pixel_t, but that shouldn't make a difference, as all
of this is done in registers.
No functional change. Remove the stub for SSE codepath.

Akin to the Lab work. Slower than Lab, though, as it does Lab->LCh
conversion. About as fast as the prior LCh code.
Get rid of SSE variant, and use for_each_channel() to write a
vectorizable RGB histogram calculation.

Runs approximately as fast as prior code (i.e. very fast).
Lose the SSE code, inline the very direct RGB code. It seems to run
slightly faster than prior code.

Vectorize the dt_ioppr_compensate_middle_grey() call. This produces
much more succinct assembly code on gcc and runs about as fast.
No difference in speed, but clearer/shorter code and minimal change in
code generated by gcc.

Helper does clamping in uint32 rather than float, and get rid of some
local const versions of params.

Add updated profiling message to note if doing compensation.
Use for_each_channel() and MAX().

RGB, Lab and LCh max code is pretty much the same.

Unlike the prior version, this code does include the first ab and Ch
bins in the maximum. It continues to skip the first RGB and L bins, so
that underexposed pixels don't throw off the histogram scale.
Previously it bypassed dt_histogram_helper(). Perhaps it did this so
that it could specifically run a 16-bit raw histogram.

Eliminate the float raw histogram, as it is not used by any current
code. Instead raw histogram will always be 16 bit.

Therefore can also eliminate a no longer used binning function.

Max helper function only needs to handle RGB, Lab, and LCh. Raw files
are never summed. Report error if unknown input type.
And propagate the change through to rgbcurve iop.
The histogram code allocates an aligned buffer, so it must be freed as
an aligned buffer. It's the caller's responsibility to free the
buffer, so update all the callers, including the global histogram, which
appears to allocate the buffer anyhow.
It's always bins_count-1 as a float, so don't bother with passing it
in and initializing it.
Only realloc the buffer if it has grown. Remember buffer size. As most
callers (except levels) don't change the buffer size, each generally
only needs to be allocated once.

Always allocate if there is no histogram buffer.

Don't try to free an empty histogram buffer.

Align histogram buffer by 64.

If can't allocate new histogram buffer, don't calculate a histogram.

Also, align working histogram max values. No changes to generated code
by doing this, at least on gcc, but keeps things proper.
No need to have a separate maximum calculation function, when it
requires almost no processing and uses basically same parameters.

Instead, if histogram_max is non NULL (in every case except exposure
deflicker), calculate the maximum and store it there.

This also makes for less noisy perf diagnostics, as there is only one
call, not two.

Also: If unknown input colorspace for histogram, display an error.
@TurboGit (Member) commented:

@dtorop : The CI failed with:

 /Users/runner/work/darktable/darktable/src/src/common/histogram.c:175:29: fatal error: variable 'bins_total' must have explicitly specified data sharing attributes
  reduction(+:working_hist[:bins_total])                                  \
                            ^~~~~~~~~~
/Users/runner/work/darktable/darktable/src/src/common/histogram.c:173:34: note: explicit data sharing attribute requested here
#pragma omp parallel for default(none)                                    \
                                 ^
1 error generated.

@TurboGit TurboGit added this to the 4.4 milestone Jan 12, 2023
@TurboGit TurboGit self-requested a review January 12, 2023 21:51
@TurboGit TurboGit added feature: redesign current features to rewrite difficulty: hard big changes across different parts of the code base scope: performance doing everything the same but faster scope: codebase making darktable source code easier to manage release notes: pending labels Jan 12, 2023
@dtorop (Contributor, Author) commented Jan 13, 2023

@dtorop : The CI failed with:
...
1 error generated.

Drat! It's an LLVM thing, but it's fixable. There's also an error with converting negative values to unsigned int, so a couple of fixes are on the way.

@dtorop (Contributor, Author) commented Jan 13, 2023

Last commit should fix LLVM compilation, as well as a bug on my part which produced bad results with negative input, and generally polishes things.

A remaining worry: Does the OpenMP array reduction used in this PR for the binned data use the stack for storing per-thread data? This isn't a worry for most histograms (256 bins is 4kb/thread, and stack is 8MB). But levels can ask for 16384 bins, and exposure's deflicker asks for 65536 bins.

Bug fix: Clamp negative values to >= 0 while they're still float (or
signed int). Previously they were converted to unsigned int and then
clamped, which resulted in negative values ending up in the highest
bin (rather than the lowest). Clamp the bin # in floating point before
converting to int. This simplifies code and matches how it was done in
the code before the recent commits. As the precision of floating point
is far above 2^16 (the maximum # of bins we'll ever see), there's no
precision benefit in doing this work via ints.

Bug fix: Fix compilation of the reduction on LLVM by giving bins_total
an explicit data-sharing attribute.

Don't count the 0 bin "C" in LCh when determining maximum, as it may
contain a count of negative pixels.

Set the max_bin as a constant, so decrements don't happen within
loops.

Use size_t rather than uint32_t when offsetting arrays.

Use restrict liberally on pointers in function parameters.

Rename internal functions with underscores and briefer names. In
general keep variable names brief.
@dtorop dtorop force-pushed the histogram-calc-updates branch from c06e1fb to b1112cd on January 13, 2023 14:58
@dtorop (Contributor, Author) commented Jan 13, 2023

Looks like CI failed due to network problem. I force-pushed the last commit to convince it to run again.

These are 1-channel, so allocate a 1-channel buffer, rather than a
4-channel buffer which is 75% empty.

As the auto exposure code uses a 65536 bin histogram, and the OpenMP
array reduction of histogram may create copies of it on the stack, it
makes sense to shrink this buffer (now 256kb, prior to this commit was
1MB).

Change to execution speed appears negligible. Both this and the prior
version execute in approx. 0.010 secs (0.100 CPU) on an i7-10750H.

Also:

- s/bnum/bin/ in _clamp_bin() for clarity
- update copyright on all files touched by these commits.
@dtorop (Contributor, Author) commented Jan 14, 2023

A remaining worry: Does the OpenMP array reduction used in this PR for the binned data use the stack for storing per-thread data? This isn't a worry for most histograms (256 bins is 4kb/thread, and stack is 8MB). But levels can ask for 16384 bins, and exposure's deflicker asks for 65536 bins.

The last commit reduces the storage needed for exposure. So now most histograms will need a 4kb buffer per thread, excepting exposure w/deflicker and levels w/auto which use 256kb/thread. I think that's OK, even if it does turn out to be on the stack. But I don't have deep experience in this.

@TurboGit (Member) left a review comment


All good, will merge and will fix the code as per comment.

@TurboGit TurboGit merged commit 89523d5 into darktable-org:master Jan 16, 2023
@dtorop (Contributor, Author) commented Jan 16, 2023

@TurboGit: Thank you! And thanks for the {} cleanups...

I'll be curious if this helps Windows instability. At least it should be slightly less scary to debug that now.

@TurboGit (Member) commented Mar 6, 2023

@dtorop : Can you add an entry for the release notes (as comment here and I'll integrate)? TIA.

@dtorop (Contributor, Author) commented May 1, 2023

@TurboGit: Belated release notes:

"Modernize the histogram calculation code. Remove SSE code (which provides no speed-ups), but use it as a model for the optimized code using recent OpenMP features. Remove various unused bits of code, and provide a consistent internal API. In certain cases this code will produce marginally more accurate results. In some cases the new code uses substantially less memory."

Again, I can make this more succinct!

@TurboGit (Member) commented May 2, 2023

Thanks!
