Resolve race conditions identified by racecheck. #82
Conversation
Thanks again for the contribution!
I pushed up two additional commits that resolve another set of race conditions. This resolves all the race conditions identified by racecheck.
Andrew, thank you so much for hunting these race conditions!
I'm overwhelmed with moving all our teaching online, but I'll try hard to find the time to look at your issue as well as the PRs.
@@ -73,7 +73,6 @@ class Block
    if( threadIdx.x == 0 && threadIdx.y == 0 ) {
        loop_total = 0;
    }
    __syncthreads();
Removing this syncthreads is not creating a race, but I don't think it's a good idea. Pre-RTX cards have no per-thread program counter and can fall out of lockstep if syncthreads is missing after a thread-conditional if-then-else.
> Pre-RTX cards have no per-thread program counter and can fall out of lockstep if syncthreads is missing after a thread-conditional if-then-else.
This doesn't match my understanding of what __syncthreads() does, going back at least as far as compute capability 3.0. All threads in a warp execute in lock-step, and branch divergence is handled using an activity mask. In the example below, my understanding is that the __syncthreads() is not required after branch divergence:
// ... do work ...
if(threadIdx.x == 1) {
    // Diverge and do something only on T1.
    printf("Hello from T1\n");
}
__syncthreads(); // Not required
// ... do work ...
My understanding is that the only constraint on __syncthreads() is that it must be hit by all threads in the same block. For example, the following block is incorrect because not all threads are waiting on the same instruction:
if(threadIdx.x == 1) __syncthreads();
__syncthreads();
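For contrast, here is a minimal sketch of the legal pattern (a hypothetical kernel, not code from this PR): the divergent branch is confined to one thread, and every thread in the block then reaches the same __syncthreads():
__global__ void legal_barrier_example( int* out )
{
    __shared__ int flag;
    if( threadIdx.x == 0 ) {
        flag = 1;                // divergent work on a single thread
    }
    __syncthreads();             // every thread in the block hits this same barrier
    out[threadIdx.x] = flag;     // the shared write is now visible to all threads
}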
Here are a few of the resources I'm looking at:
- CUDA C Programming Guide
- Demystifying GPU Microarchitecture through Microbenchmarking - see section D on pages 5-6.
Any thoughts?
We started using CUDA when it was CC 1.3, so a lot of the habits I've gotten used to may have been outdated for several generations.
You are absolutely correct that nothing bad happens without those syncthreads after the end of these ifs. With RTX cards, it is 100% certain that there is no penalty.
But I'm not sure it's a great idea to omit the syncthreads after a divergent if: the most recent measurements that show reconvergence failing are in an arXiv paper from 2015: https://arxiv.org/pdf/1504.01650.pdf
They observe a measurable speed degradation up to CC 5 when they store content in a global memory array after a loop that diverges.
Thank you for the context. I'll reintroduce those syncthreads() if there's no penalty.
src/popsift/s_orientation.cu
@@ -206,6 +207,7 @@ void ori_par( const int octave,

    int2 best_index = make_int2( threadIdx.x, threadIdx.x + 32 );

    __syncthreads();
The syncthreads should be as early as possible, not just-in-time, to also allow old cards to do make_int2 in lockstep. Better to move it to line 207, right after the for loop.
I thought that I had it covered because all threads enter this loop:
for( int bin = threadIdx.x; popsift::any( bin < ORI_NBINS ); bin += blockDim.x ) {
    ... }
and inside they use predicates instead of ifs. I suppose that post-RTX cards handle this differently.
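For illustration, a minimal, self-contained sketch of that idiom (placeholder kernel and names, not the PopSift code, with the block-wide __syncthreads_or() intrinsic standing in for popsift::any): every thread keeps iterating while any thread still has a valid bin, and out-of-range threads are masked by a predicate instead of a divergent branch:
__global__ void scale_bins( const float* hist, float* out, int nbins )
{
    // All threads of the block evaluate the loop condition together,
    // so they enter and leave the loop in lockstep.
    for( int bin = threadIdx.x; __syncthreads_or( bin < nbins ); bin += blockDim.x )
    {
        const bool  valid = ( bin < nbins );
        const float v     = valid ? hist[bin] : 0.0f; // predicated read, no divergent branch
        if( valid ) out[bin] = 0.5f * v;              // predicate guards only the store
    }
}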
Moved up to L207 as suggested in d0edb05.
Thank you for finding these problems and fixing them!
@andrew-hardin Thank you for this great contribution.
This PR resolves a few read/write race conditions identified by the racecheck tool. I stumbled onto them while naively trying to investigate issue #80.
The __syncthreads() on L78 resolves the race found below; memory needed to be initialized prior to the AtomicAdd().
The __syncthreads() on L210 resolves this race found below; threads had to finish writing to yval prior to passing the collection to the BitonicSort.
I've updated the PR with two additional commits that resolve a race condition in the block-based prefix sum.