
Resolve race conditions identified by racecheck. #82

Merged: 6 commits into alicevision:develop on Apr 20, 2020
Conversation

@andrew-hardin (Contributor) commented Apr 10, 2020

This PR resolves a few read/write race conditions identified by the racecheck tool. I stumbled onto them while naively trying to investigate issue #80.

cuda-memcheck --tool racecheck ./popsift-demo -i /path/to/oxford/boat/img1.pgm

The __syncthreads() on L78 resolves the race found below; memory needed to be initialized prior to the AtomicAdd().

========= WARN: Race reported between Write access at 0x00000ed0 in popsift::ori_par(int, int, __int64, int, int)
=========     and Read access at 0x00000150 in __fAtomicAdd [68 hazards]
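The first fix follows a common pattern. Here is a minimal hypothetical sketch (not the actual popsift kernel; the histogram kernel and its sizes are invented for illustration) showing why shared memory must be fully initialized, with a barrier, before any thread calls atomicAdd():

```cuda
// Hypothetical sketch, not the popsift code: one warp's atomicAdd() can race
// with another warp's write of 0 unless a barrier separates the two phases.
__global__ void histogram_demo(const int* data, int n, int* out)
{
    __shared__ int hist[64];

    // Cooperative initialization: each thread zeroes part of the histogram.
    for (int i = threadIdx.x; i < 64; i += blockDim.x)
        hist[i] = 0;

    __syncthreads();  // the fix: no atomicAdd() may run before init finishes

    for (int i = threadIdx.x; i < n; i += blockDim.x)
        atomicAdd(&hist[data[i] & 63], 1);

    __syncthreads();  // all increments done before readback

    for (int i = threadIdx.x; i < 64; i += blockDim.x)
        atomicAdd(&out[i], hist[i]);
}
```

Without the first barrier, racecheck reports exactly this kind of write-vs-atomic-read hazard.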

The __syncthreads() on L210 resolves this race found below; threads had to finish writing to yval prior to passing the collection to the BitonicSort:

========= WARN: Race reported between Write access at 0x00006210 in popsift::ori_par(int, int, __int64, int, int)
=========     and Read access at 0x000003d0 in popsift::BitonicSort::Warp32<float>::shiftit(int, int, int, bool) [256 hazards]
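The second fix can be illustrated the same way. A minimal hypothetical sketch (not the popsift kernel; `some_value()` is an assumed stand-in for the real bin computation): the first compare-exchange stage of a bitonic sort makes thread i read a slot written by a different thread, so all writes to the shared array must be fenced first.

```cuda
// Hypothetical stand-in for the per-thread value computation.
__device__ float some_value(int i) { return (float)((i * 37) & 63); }

// Hypothetical sketch, not the popsift code: thread i reads slot i^1,
// which was written by another thread, so the writes must complete first.
__global__ void sort_stage_demo(float* out)
{
    __shared__ float yval[64];

    yval[threadIdx.x] = some_value(threadIdx.x);

    __syncthreads();  // the fix: all writes land before any cross-thread read

    // First bitonic stage: compare-exchange with the neighbor slot.
    float mine     = yval[threadIdx.x];
    float other    = yval[threadIdx.x ^ 1];
    bool  keep_min = (threadIdx.x & 1) == 0;
    out[threadIdx.x] = keep_min ? fminf(mine, other) : fmaxf(mine, other);
}
```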

I've updated the PR with two additional commits that resolve a race condition in the block-based prefix sum.
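A block-wide prefix sum is particularly barrier-sensitive. As a hypothetical sketch (a textbook Hillis-Steele exclusive scan, not popsift's implementation), each step needs two barriers: one so every thread has read its partner's old value before anyone overwrites it, and one so every write lands before the next step reads it. Dropping either is exactly the kind of read/write race racecheck reports.

```cuda
// Hypothetical sketch, not the popsift code: Hillis-Steele style exclusive
// prefix sum over one block (blockDim.x <= 256 assumed).
__global__ void exclusive_scan_demo(const int* in, int* out)
{
    __shared__ int buf[256];
    int tid = threadIdx.x;

    // Exclusive scan: shift input right by one, seed slot 0 with the identity.
    buf[tid] = (tid == 0) ? 0 : in[tid - 1];
    __syncthreads();

    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        int v = (tid >= stride) ? buf[tid - stride] : 0;
        __syncthreads();  // everyone has read before anyone writes
        buf[tid] += v;
        __syncthreads();  // everyone has written before the next read
    }
    out[tid] = buf[tid];
}
```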

@simogasp (Member):

Thanks again for the contribution!
I'll let @griwodz jump in, as he's the CUDA expert.

@andrew-hardin andrew-hardin changed the title Resolve two race conditions identified by racecheck. Resolve race conditions identified by racecheck. Apr 12, 2020
@andrew-hardin (Contributor, Author):

I pushed up two additional commits that resolve another set of race conditions. This resolves all the race conditions identified by cuda-memcheck.

@griwodz (Member) left a comment:

Andrew, thank you so much for hunting these race conditions!
I'm overwhelmed with moving all our teaching online, but I'll try hard to find the time to look at your issue as well as the PRs.

@@ -73,7 +73,6 @@ class Block
if( threadIdx.x == 0 && threadIdx.y == 0 ) {
loop_total = 0;
}
__syncthreads();
@griwodz (Member):
Removing this syncthreads doesn't create a race, but I don't think it's a good idea. Pre-RTX cards have no per-thread program counter and can fall out of lockstep if a syncthreads is missing after a thread-conditional if-then-else.

@andrew-hardin (Contributor, Author):

> Pre-RTX cards have no per-thread program counter and can fall out of lockstep if a syncthreads is missing after a thread-conditional if-then-else.

This doesn't match my understanding of what __syncthreads() does going back at least as far as compute capability 3.0. All threads in a warp execute in lock-step, and branch divergence is handled using an activity mask. In the example below, my understanding is that the __syncthreads() is not required after branch divergence:

// ... do work ...
if(threadIdx.x == 1) {
  // Diverge and do something only on T1.
  printf("Hello from T1\n");
}
__syncthreads(); // Not required
// ... do work ...

My understanding is that the only constraint on __syncthreads() is that it must be reached by all threads in the same block. For example, the following block is incorrect because not all threads are waiting at the same barrier:

if(threadIdx.x == 1) __syncthreads();
__syncthreads();

Here are a few of the resources I'm looking at:

Any thoughts?

@griwodz (Member):

We started using CUDA when it was CC 1.3, so a lot of the habits I have gotten used to may have been outdated for several generations.

You are absolutely correct that nothing bad happens without those syncthreads after the end of these ifs. With RTXes, it is 100% certain that there is no penalty.

But I'm not sure it's a great idea to remove the syncthreads after a divergent if: the most recent measurements showing that reconvergence fails are in an arXiv paper from 2015: https://arxiv.org/pdf/1504.01650.pdf
They observe a measurable speed degradation up to CC 5 when content is stored in a global memory array after a loop that diverges.

@andrew-hardin (Contributor, Author):

Thank you for the context. I'll reintroduce those syncthreads() if there's no penalty.

(Four additional review threads on src/popsift/common/excl_blk_prefix_sum.h were marked resolved; one is outdated.)
@@ -206,6 +207,7 @@ void ori_par( const int octave,

int2 best_index = make_int2( threadIdx.x, threadIdx.x + 32 );

__syncthreads();
@griwodz (Member):

The syncthreads should be as early as possible, not just-in-time, so that old cards can also execute make_int2 in lockstep. Better to move it to line 207, right after the for loop.

I thought that I had it covered because all threads enter this loop:
for( int bin = threadIdx.x; popsift::any( bin < ORI_NBINS ); bin += blockDim.x ) {
... }
and inside they use predicates instead of ifs. I suppose that post-RTX cards do this differently.

@andrew-hardin (Contributor, Author):

Moved up to L207 as suggested in d0edb05.

@griwodz (Member) left a comment:

Thank you for finding these problems and fixing them!

@fabiencastan fabiencastan merged commit bf3556f into alicevision:develop Apr 20, 2020
@fabiencastan (Member):

@andrew-hardin Thank you for this great contribution.
