
Bilateral filter data-race fix and speedup #5222

Merged (4 commits, Jun 12, 2020)

Conversation

@ralfbrown (Collaborator)

As mentioned in #5221, there is a data race in the bilateral filter, which causes images to differ from run to run. Because multiple threads are writing to the same area of memory, there is also a performance issue due to cacheline bouncing; on my system, beyond 8 threads, wall time remains constant even as CPU time scales with the number of threads.

This PR fixes both issues by using a separate buffer for each thread and combining the results at the end. As with some other recent updates, this one has slightly different results depending on the number of threads, because the results become numerically more accurate with more threads (due to summing fewer items at once).
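
A minimal sketch of the pattern described above, with made-up names (splat_all, gridsize, nthreads) rather than darktable's actual functions: each thread accumulates into its own private copy of the grid, and the copies are summed once the parallel loop finishes, so no two threads ever write to the same cacheline.

    /* Sketch only: illustrates the per-thread-buffer pattern with hypothetical
     * names; it is not the actual darktable code. */
    #include <stdlib.h>
    #include <string.h>
    #ifdef _OPENMP
    #include <omp.h>
    #endif

    void splat_all(const float *in, float *grid, size_t gridsize,
                   size_t npixels, int nthreads)
    {
      /* one private accumulation buffer per thread instead of a single shared grid */
      float *buf = calloc((size_t)nthreads * gridsize, sizeof(float));

    #ifdef _OPENMP
    #pragma omp parallel for default(none) shared(in, buf, gridsize, npixels)
    #endif
      for(size_t p = 0; p < npixels; p++)
      {
        int t = 0;
    #ifdef _OPENMP
        t = omp_get_thread_num();
    #endif
        float *mybuf = buf + (size_t)t * gridsize;
        mybuf[p % gridsize] += in[p];  /* stand-in for the real grid splat */
      }

      /* combine the per-thread buffers into the final grid; this reduction can
       * itself be parallelized over grid cells */
      memset(grid, 0, gridsize * sizeof(float));
      for(int t = 0; t < nthreads; t++)
        for(size_t g = 0; g < gridsize; g++)
          grid[g] += buf[(size_t)t * gridsize + g];

      free(buf);
    }

The cost of this approach is one extra grid per thread, which is why the memory reported to the tiling code has to scale with the thread count (see the discussion later in this thread).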

@ralfbrown changed the title from "Bilat fix speedup" to "Bilateral filter data-race fix and speedup" on May 29, 2020
          shared(b) \
          collapse(2)
    #endif
      for(int j = 0; j < b->height; j++)
      {
        for(int i = 0; i < b->width; i++)
        {
    -     size_t index = 4 * (j * b->width + i);
    +     size_t index = 4 * (j * b->width + i) ;
Member (review comment on the diff above)

superfluous change

@ralfbrown (Collaborator, Author)

Fixed.

@TurboGit added this to the 3.2 milestone on May 29, 2020
@TurboGit added the "bugfix" label (pull request fixing a bug) on May 29, 2020
@TurboGit (Member)

This looks stable enough for 3.2, do you agree?

@TurboGit (Member) commented Jun 4, 2020

@ralfbrown: with this patch the monochrome test crashes when -t 4 is specified:

    /opt/darktable/bin/darktable-cli --width 2048 --height 2048 --hq true images/mire1.cr2 0016-monochrome/monochrome.xmp output.png --core  --conf worker_threads=4 -t 4 --disable-opencl
    output file already exists, it will get renamed
    [dt_init] using 4 threads for openmp parallel sections
    munmap_chunk(): invalid pointer
    Aborted

At least the integration tests are starting to be useful :)

@ralfbrown (Collaborator, Author)

Unable to reproduce, even after rebasing to current master (3.1.0+1928~g272b84285). Is the monochrome.xmp you used the one from this PR? I ask because it's in a different directory, so I don't know whether you renumbered the directories or created a new test.

@TurboGit (Member) commented Jun 4, 2020

Yes, the same xmp; I was just preparing the final merge, so I renamed all the directories to follow the current numbering. I'll try to double-check in the coming days, but I was able to reproduce it each time: when I reverted the commit there was no more issue, and adding it back made it crash again. Pretty consistent...

@ralfbrown (Collaborator, Author)

@TurboGit The fact that it's munmap_chunk() which is aborting due to a pointer error is pretty clear evidence of overrunning a buffer somewhere and trashing the memory allocator's housekeeping info. Since you only reported a problem with monochrome and not the other iops using the bilateral filter (lowpass, shadhi, colormapping, etc.), I went looking for something that's different about how monochrome uses the bilateral filter. The only difference I can find is that it uses the same buffer for both input and output in dt_bilateral_slice, but so does colormapping.... And as far as I can tell, dt_bilateral_slice will work just fine in-place.
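
For context on why that abort points at heap corruption rather than at the free() call itself: glibc keeps its chunk bookkeeping right next to each allocation, so writing past the end of one buffer can trash the header of a neighbouring block, and the damage typically only surfaces later, when that neighbour is freed. A contrived sketch, not darktable code, and the exact message depends on the allocator:

    /* Contrived illustration only: overrunning one heap buffer corrupts the
     * allocator's bookkeeping for its neighbour; glibc usually detects this
     * on a later free() and aborts with a message such as
     * "munmap_chunk(): invalid pointer" or "free(): invalid pointer".
     * Exact behaviour is allocator-dependent and not guaranteed. */
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
      char *a = malloc(32);
      char *b = malloc(32);
      memset(a, 0xff, 64);  /* writes 32 bytes past the end of a, trashing b's chunk header */
      free(b);              /* corruption is usually detected here, far from the actual bug */
      free(a);
      return 0;
    }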

I tried forcing tiling with a low host_memory_limit (since the commandline you pasted above doesn't set it), but still no crash.

I did notice a needed update, though: some of the dt_bilateral_memory_* functions need to multiply the reported memory requirement by the number of threads. But the requirements are rather modest (0.5 MB/thread for monochrome.xmp, even less for some of the other iops). And I think both monochrome and color mapping are reporting excessive requirements from tiling_callback anyway:
    tiling->factor = 3.0f + (float)dt_bilateral_memory_use(width, height, sigma_s, sigma_r) / basebuffer;
Because neither is allocating a temporary image buffer that I can see, shouldn't that be 2.0f + ... in both cases?
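
A rough sketch of the two adjustments discussed above, using a made-up grid_cells() helper alongside the real dt_get_num_threads(): the reported grid memory scales with the thread count, and the tiling factor counts the full-size image buffers held at once (input plus output) plus the grid expressed as a fraction of one base buffer.

    /* Sketch only: hypothetical shape of the two changes discussed above,
     * not darktable's actual implementation.  grid_cells() is made up;
     * dt_get_num_threads() is the existing darktable helper. */
    #include <stddef.h>

    extern size_t grid_cells(int width, int height, float sigma_s, float sigma_r);
    extern int dt_get_num_threads(void);

    size_t bilateral_memory_use(int width, int height, float sigma_s, float sigma_r)
    {
      size_t per_thread = grid_cells(width, height, sigma_s, sigma_r) * sizeof(float);
      /* with one accumulation buffer per thread, the requirement scales
       * with the number of threads */
      return per_thread * (size_t)dt_get_num_threads();
    }

    /* tiling factor = full-size image buffers held at once (input + output = 2)
     * plus the bilateral grid as a fraction of one base buffer; 3.0f would
     * imply an extra full-size temporary that monochrome and color mapping
     * do not appear to allocate. */
    float tiling_factor(int width, int height, float sigma_s, float sigma_r,
                        float basebuffer)
    {
      return 2.0f + (float)bilateral_memory_use(width, height, sigma_s, sigma_r)
                        / basebuffer;
    }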

@TurboGit added the "scope: performance" label (doing everything the same but faster) on Jun 6, 2020
@TurboGit (Member)

> I tried forcing tiling with a low host_memory_limit (since the commandline you pasted above doesn't set it), but still no crash.

Note that the command line I sent did not set the memory limit or the threading, but the integration test driver does, and I see the same issue. For reference, here are the options used by the driver:

            CORE_OPTIONS="--conf host_memory_limit=8192 \
                 --conf worker_threads=4 -t 4 \
                 --conf plugins/lighttable/export/force_lcms2=FALSE \
                 --conf plugins/lighttable/export/iccintent=0"

At this stage I will schedule this for 3.4. We can of course revisit this decision until the end of July for the data-race fix! All fixes are welcome for the next release, but I don't want to put any pressure on you.

@TurboGit changed the milestone from 3.2 to 3.4 on Jun 11, 2020
@ralfbrown (Collaborator, Author)

I ran the various iops that use the bilateral filter through their paces with varying numbers of threads to get performance numbers prior to opening the PR and didn't have any problems, nor did I have any issues with monochrome, with or without memory limits. I'm still unable to reproduce, but I have implemented a change based on the only thing I can think of that might be causing a buffer overrun: your version of libgomp may return thread numbers greater than the number of active OMP threads when the number of active threads is less than the number of processors.

I was trying to save some memory by allocating per-thread buffers only for the number of threads actually in use, rather than dt_get_num_threads(), because the latter always returns the number of processors when compiled with OMP. I've set my dt startup script to use -t 32 because most of the iops are actually slightly slower with 64 threads than with 32: running 2 threads per core doubles the thread-management overhead while only yielding about 1.25x the compute at the same clock speed, the Threadripper is probably not boosting its clock as high with 64 active threads, and the faster iops are probably saturating the memory bus well before reaching 64 threads.
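
A minimal sketch of the two sizing strategies, with a hypothetical allocator function: sizing the per-thread buffers by the number of threads that should be running saves memory, but is only safe if every thread number handed out inside the parallel region stays below that count; sizing by the processor count (what dt_get_num_threads() reports under OpenMP) is the conservative alternative.

    /* Sketch only: hypothetical illustration of the two sizing strategies. */
    #include <stdlib.h>
    #ifdef _OPENMP
    #include <omp.h>
    #endif

    float *alloc_thread_buffers(size_t gridsize, int conservative)
    {
      int nbuf = 1;
    #ifdef _OPENMP
      if(conservative)
        nbuf = omp_get_num_procs();   /* processor count, which is what
                                       * dt_get_num_threads() reports under OMP */
      else
        nbuf = omp_get_max_threads(); /* threads that should be used, e.g. -t 32 */
    #else
      (void)conservative;
    #endif
      /* if the runtime ever hands a thread a number >= nbuf inside the parallel
       * region, the non-conservative sizing overruns this allocation */
      return calloc((size_t)nbuf * gridsize, sizeof(float));
    }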

Let me know if this version still causes aborts from munmap_chunk().

@TurboGit (Member)

OK, good news: I don't have the crash anymore. All tests have passed without crashing. The results are the following:

Test 0015-shadhi-bilateral
      Image mire1.cr2
      CPU & GPU version differ by 31729 pixels
      Max  dE         2.1161
      Mean dE         0.0002
      Count above max 4
  FAILS: image visually changed
         see diff.jpg for visual difference
         (1180 pixels changed)

Test 0016-lowpass-bilateral
      Image mire1.cr2
      CPU & GPU version differ by 33904 pixels
      Max  dE         0.1995
  OK

Test 0017-monochrome
      Image mire1.cr2
      CPU & GPU version differ by 7486 pixels
      Max  dE         0.4600
  OK

Test 0018-perspective-corr
      Image mire1.cr2
      CPU & GPU version differ by 35813 pixels
      Max  dE         0.0000
  OK

Test 0019-color-mapping
      Image mire1.cr2
      CPU & GPU version differ by 40563 pixels
      Max  dE         1.5103
  OK

Almost all errors are below dE 2.0; only the shadhi-bilateral test has a dE slightly above 2.0, and for only 4 pixels, so I'll update the expected output.

All in all this can be merged for 3.2 finally.

Thanks a lot for your time on this.

@TurboGit merged commit 7ec4d84 into darktable-org:master on Jun 12, 2020
@TurboGit changed the milestone from 3.4 to 3.2 on Jun 12, 2020
@ralfbrown (Collaborator, Author)

I guess that means my hypothesis about thread numbers was correct. Bit of a shame to waste 16 MB of memory and the time needed to add all those zeros when only using 32 threads on a 64-thread CPU....

@ralfbrown deleted the bilat_fix_speedup branch on June 12, 2020