
Bilateral filter data-race fix and speedup #5222

Merged (4 commits, Jun 12, 2020)

Conversation

@ralfbrown (Collaborator)

As mentioned in #5221, there is a data race in the bilateral filter, which causes images to differ from run to run. Because multiple threads are writing to the same area of memory, there is also a performance issue due to cacheline bouncing; on my system, beyond 8 threads, wall time remains constant even as CPU time scales with the number of threads.

This PR fixes both issues by using a separate buffer for each thread and combining the results at the end. As with some other recent updates, this one has slightly different results depending on the number of threads, because the results become numerically more accurate with more threads (due to summing fewer items at once).
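
A minimal sketch of the pattern described above, with made-up names (splat_all, gridsize, nthreads) rather than darktable's actual functions: each thread accumulates into its own private copy of the grid, and the copies are summed once the parallel loop finishes, so no two threads ever write to the same cacheline.

    /* Sketch only: illustrates the per-thread-buffer pattern with hypothetical
     * names; it is not the actual darktable code. */
    #include <stdlib.h>
    #include <string.h>
    #ifdef _OPENMP
    #include <omp.h>
    #endif

    void splat_all(const float *in, float *grid, size_t gridsize,
                   size_t npixels, int nthreads)
    {
      /* one private accumulation buffer per thread instead of a single shared grid */
      float *buf = calloc((size_t)nthreads * gridsize, sizeof(float));

    #ifdef _OPENMP
    #pragma omp parallel for default(none) shared(in, buf, gridsize, npixels)
    #endif
      for(size_t p = 0; p < npixels; p++)
      {
        int t = 0;
    #ifdef _OPENMP
        t = omp_get_thread_num();
    #endif
        float *mybuf = buf + (size_t)t * gridsize;
        mybuf[p % gridsize] += in[p];  /* stand-in for the real grid splat */
      }

      /* combine the per-thread buffers into the final grid; this reduction can
       * itself be parallelized over grid cells */
      memset(grid, 0, gridsize * sizeof(float));
      for(int t = 0; t < nthreads; t++)
        for(size_t g = 0; g < gridsize; g++)
          grid[g] += buf[(size_t)t * gridsize + g];

      free(buf);
    }

The cost of this approach is one extra grid per thread, which is why the memory reported to the tiling code has to scale with the thread count (see the discussion later in this thread).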

@ralfbrown changed the title from "Bilat fix speedup" to "Bilateral filter data-race fix and speedup" on May 29, 2020
          shared(b) \
          collapse(2)
    #endif
      for(int j = 0; j < b->height; j++)
      {
        for(int i = 0; i < b->width; i++)
        {
    -     size_t index = 4 * (j * b->width + i);
    +     size_t index = 4 * (j * b->width + i) ;
Member (review comment on the diff above)

superfluous change

@ralfbrown (Collaborator, Author)

Fixed.

@TurboGit added this to the 3.2 milestone on May 29, 2020
@TurboGit added the "bugfix" label (pull request fixing a bug) on May 29, 2020
@TurboGit (Member)

This looks stable enough for 3.2, do you agree?

@TurboGit (Member) commented Jun 4, 2020

@ralfbrown: with this patch the monochrome test crashes when -t 4 is specified:

    /opt/darktable/bin/darktable-cli --width 2048 --height 2048 --hq true images/mire1.cr2 0016-monochrome/monochrome.xmp output.png --core  --conf worker_threads=4 -t 4 --disable-opencl
    output file already exists, it will get renamed
    [dt_init] using 4 threads for openmp parallel sections
    munmap_chunk(): invalid pointer
    Aborted

At least the integration tests are starting to be useful :)

@ralfbrown (Collaborator, Author)

Unable to reproduce, even after rebasing to current master (3.1.0+1928~g272b84285). Is the monochrome.xmp you used the one from this PR? I ask because it's in a different directory, so I don't know whether you renumbered the directories or created a new test.

@TurboGit (Member) commented Jun 4, 2020

Yes, the same xmp; I was just preparing the final merge, so I renamed all the directories to follow the current numbering. I'll try to double-check in the coming days, but I was able to reproduce it each time: when I reverted the commit there was no more issue, and adding it back made it crash again. Pretty consistent...

@ralfbrown (Collaborator, Author)

@TurboGit The fact that it's munmap_chunk() which is aborting due to a pointer error is pretty clear evidence of overrunning a buffer somewhere and trashing the memory allocator's housekeeping info. Since you only reported a problem with monochrome and not the other iops using the bilateral filter (lowpass, shadhi, colormapping, etc.), I went looking for something that's different about how monochrome uses the bilateral filter. The only difference I can find is that it uses the same buffer for both input and output in dt_bilateral_slice, but so does colormapping.... And as far as I can tell, dt_bilateral_slice will work just fine in-place.
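
For context on why that abort points at heap corruption rather than at the free() call itself: glibc keeps its chunk bookkeeping right next to each allocation, so writing past the end of one buffer can trash the header of a neighbouring block, and the damage typically only surfaces later, when that neighbour is freed. A contrived sketch, not darktable code, and the exact message depends on the allocator:

    /* Contrived illustration only: overrunning one heap buffer corrupts the
     * allocator's bookkeeping for its neighbour; glibc usually detects this
     * on a later free() and aborts with a message such as
     * "munmap_chunk(): invalid pointer" or "free(): invalid pointer".
     * Exact behaviour is allocator-dependent and not guaranteed. */
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
      char *a = malloc(32);
      char *b = malloc(32);
      memset(a, 0xff, 64);  /* writes 32 bytes past the end of a, trashing b's chunk header */
      free(b);              /* corruption is usually detected here, far from the actual bug */
      free(a);
      return 0;
    }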

I tried forcing tiling with a low host_memory_limit (since the commandline you pasted above doesn't set it), but still no crash.

I did notice a needed update, though: some of the dt_bilateral_memory_* functions need to multiply the reported memory requirement by the number of threads. But the requirements are rather modest (0.5 MB/thread for monochrome.xmp, even less for some of the other iops). And I think both monochrome and color mapping are reporting excessive requirements from tiling_callback anyway:
    tiling->factor = 3.0f + (float)dt_bilateral_memory_use(width, height, sigma_s, sigma_r) / basebuffer;
Because neither is allocating a temporary image buffer that I can see, shouldn't that be 2.0f + ... in both cases?
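
A rough sketch of the two adjustments discussed above, using a made-up grid_cells() helper alongside the real dt_get_num_threads(): the reported grid memory scales with the thread count, and the tiling factor counts the full-size image buffers held at once (input plus output) plus the grid expressed as a fraction of one base buffer.

    /* Sketch only: hypothetical shape of the two changes discussed above,
     * not darktable's actual implementation.  grid_cells() is made up;
     * dt_get_num_threads() is the existing darktable helper. */
    #include <stddef.h>

    extern size_t grid_cells(int width, int height, float sigma_s, float sigma_r);
    extern int dt_get_num_threads(void);

    size_t bilateral_memory_use(int width, int height, float sigma_s, float sigma_r)
    {
      size_t per_thread = grid_cells(width, height, sigma_s, sigma_r) * sizeof(float);
      /* with one accumulation buffer per thread, the requirement scales
       * with the number of threads */
      return per_thread * (size_t)dt_get_num_threads();
    }

    /* tiling factor = full-size image buffers held at once (input + output = 2)
     * plus the bilateral grid as a fraction of one base buffer; 3.0f would
     * imply an extra full-size temporary that monochrome and color mapping
     * do not appear to allocate. */
    float tiling_factor(int width, int height, float sigma_s, float sigma_r,
                        float basebuffer)
    {
      return 2.0f + (float)bilateral_memory_use(width, height, sigma_s, sigma_r)
                        / basebuffer;
    }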

@TurboGit added the "scope: performance" label (doing everything the same but faster) on Jun 6, 2020
@TurboGit (Member)

> I tried forcing tiling with a low host_memory_limit (since the commandline you pasted above doesn't set it), but still no crash.

Note that the command line I sent did not set the memory limit or the threading, but the integration test driver does, and I see the same issue. For reference, here are the options used by the driver:

            CORE_OPTIONS="--conf host_memory_limit=8192 \
                 --conf worker_threads=4 -t 4 \
                 --conf plugins/lighttable/export/force_lcms2=FALSE \
                 --conf plugins/lighttable/export/iccintent=0"

At this stage I will schedule this for 3.4. We can of course revisit this decision until the end of July for the data-race fix! All fixes are welcome for the next release, but I don't want to put any pressure on you.

@TurboGit changed the milestone from 3.2 to 3.4 on Jun 11, 2020
@ralfbrown (Collaborator, Author)

I ran the various iops that use the bilateral filter through their paces with varying numbers of threads to get performance numbers prior to opening the PR and didn't have any problems, nor did I have any issues with monochrome, with or without memory limits. I'm still unable to reproduce, but I have implemented a change based on the only thing I can think of that might be causing a buffer overrun: your version of libgomp may return thread numbers greater than the number of active OMP threads when the number of active threads is less than the number of processors.

I was trying to save some memory by allocating per-thread buffers only for the number of threads actually in use, rather than dt_get_num_threads(), because the latter always returns the number of processors when compiled with OMP. I've set my dt startup script to use -t 32 because most of the iops are actually slightly slower with 64 threads than with 32: running 2 threads per core doubles the thread-management overhead while only yielding about 1.25x the compute at the same clock speed, the Threadripper is probably not boosting its clock as high with 64 active threads, and the faster iops are probably saturating the memory bus well before reaching 64 threads.
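
A minimal sketch of the two sizing strategies, with a hypothetical allocator function: sizing the per-thread buffers by the number of threads that should be running saves memory, but is only safe if every thread number handed out inside the parallel region stays below that count; sizing by the processor count (what dt_get_num_threads() reports under OpenMP) is the conservative alternative.

    /* Sketch only: hypothetical illustration of the two sizing strategies. */
    #include <stdlib.h>
    #ifdef _OPENMP
    #include <omp.h>
    #endif

    float *alloc_thread_buffers(size_t gridsize, int conservative)
    {
      int nbuf = 1;
    #ifdef _OPENMP
      if(conservative)
        nbuf = omp_get_num_procs();   /* processor count, which is what
                                       * dt_get_num_threads() reports under OMP */
      else
        nbuf = omp_get_max_threads(); /* threads that should be used, e.g. -t 32 */
    #else
      (void)conservative;
    #endif
      /* if the runtime ever hands a thread a number >= nbuf inside the parallel
       * region, the non-conservative sizing overruns this allocation */
      return calloc((size_t)nbuf * gridsize, sizeof(float));
    }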

Let me know if this version still causes aborts from munmap_chunk().

@TurboGit (Member)

OK, good news: I don't have the crash anymore. All tests have passed without crashing. The results are the following:

Test 0015-shadhi-bilateral
      Image mire1.cr2
      CPU & GPU version differ by 31729 pixels
      Max  dE         2.1161
      Mean dE         0.0002
      Count above max 4
  FAILS: image visually changed
         see diff.jpg for visual difference
         (1180 pixels changed)

Test 0016-lowpass-bilateral
      Image mire1.cr2
      CPU & GPU version differ by 33904 pixels
      Max  dE         0.1995
  OK

Test 0017-monochrome
      Image mire1.cr2
      CPU & GPU version differ by 7486 pixels
      Max  dE         0.4600
  OK

Test 0018-perspective-corr
      Image mire1.cr2
      CPU & GPU version differ by 35813 pixels
      Max  dE         0.0000
  OK

Test 0019-color-mapping
      Image mire1.cr2
      CPU & GPU version differ by 40563 pixels
      Max  dE         1.5103
  OK

Almost all errors are below dE 2.0; only the shadhi-bilateral test has a dE slightly above 2.0, and for only 4 pixels, so I'll update the expected output.

All in all this can be merged for 3.2 finally.

Thanks a lot for your time on this.

@TurboGit merged commit 7ec4d84 into darktable-org:master on Jun 12, 2020
@TurboGit changed the milestone from 3.4 to 3.2 on Jun 12, 2020
@ralfbrown (Collaborator, Author)

I guess that means my hypothesis about thread numbers was correct. Bit of a shame to waste 16 MB of memory and the time needed to add all those zeros when only using 32 threads on a 64-thread CPU....

@ralfbrown deleted the bilat_fix_speedup branch on June 12, 2020