Resolve race conditions identified by racecheck. #82
Conversation
Thanks again for the contribution!
I pushed up two additional commits that resolve another set of race conditions. This resolves all the race conditions identified by racecheck.
Andrew, thank you so much for hunting these race conditions!
I'm overwhelmed with moving all our teaching online, but I'll try hard to find the time to look at your issue as well as the PRs.
@@ -73,7 +73,6 @@ class Block
    if( threadIdx.x == 0 && threadIdx.y == 0 ) {
        loop_total = 0;
    }
    __syncthreads();
Removing this syncthreads is not creating a race, but I don't think it's a good idea. Pre-RTX cards have no per-thread program counter and can fall out of lockstep if syncthreads is missing after a thread-conditional if-then-else.
> Pre-RTX cards have no per-thread program counter and can fall out of lockstep if syncthreads is missing after a thread-conditional if-then-else.
This doesn't match my understanding of what __syncthreads() does, going back at least as far as compute capability 3.0. All threads in a warp execute in lock-step, and branch divergence is handled using an activity mask. In the example below, my understanding is that the __syncthreads() is not required after branch divergence:
// ... do work ...
if(threadIdx.x == 1) {
    // Diverge and do something only on T1.
    printf("Hello from T1\n");
}
__syncthreads(); // Not required
// ... do work ...
My understanding is that the only constraint on __syncthreads() is that it must be hit by all threads in the same block. For example, the following block is incorrect because not all threads are waiting on the same instruction:
if(threadIdx.x == 1) __syncthreads();
__syncthreads();
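For contrast, here is a minimal sketch of the legal pattern (a hypothetical kernel, not code from this PR): the divergent branch is confined to one thread, and every thread in the block then reaches the same __syncthreads():
__global__ void legal_barrier_example( int* out )
{
    __shared__ int flag;
    if( threadIdx.x == 0 ) {
        flag = 1;                // divergent work on a single thread
    }
    __syncthreads();             // every thread in the block hits this same barrier
    out[threadIdx.x] = flag;     // the shared write is now visible to all threads
}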
Here are a few of the resources I'm looking at:
- CUDA C Programming Guide
- Demystifying GPU Microarchitecture through Microbenchmarking - see section D on pages 5-6.
Any thoughts?
We started using CUDA when it was CC 1.3, so a lot of the habits I've gotten used to may have been outdated for several generations.
You are absolutely correct that nothing bad happens without those syncthreads after the end of these ifs. With RTX cards, it is 100% certain that there is no penalty.
But I'm not sure it's a great idea to omit the syncthreads after a divergent if: the most recent measurements that show reconvergence failing are in an arXiv paper from 2015: https://arxiv.org/pdf/1504.01650.pdf
They observe a measurable speed degradation up to CC 5 when they store content in a global memory array after a loop that diverges.
Thank you for the context. I'll reintroduce those syncthreads() if there's no penalty.
src/popsift/s_orientation.cu
@@ -206,6 +207,7 @@ void ori_par( const int octave,

    int2 best_index = make_int2( threadIdx.x, threadIdx.x + 32 );

    __syncthreads();
The syncthreads should be as early as possible, not just-in-time, to also allow old cards to do make_int2 in lockstep. Better to move it to line 207, right after the for loop.
I thought that I had it covered because all threads enter this loop:
for( int bin = threadIdx.x; popsift::any( bin < ORI_NBINS ); bin += blockDim.x ) {
    ... }
and inside they use predicates instead of ifs. I suppose that post-RTX cards handle this differently.
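For illustration, a minimal, self-contained sketch of that idiom (placeholder kernel and names, not the PopSift code, with the block-wide __syncthreads_or() intrinsic standing in for popsift::any): every thread keeps iterating while any thread still has a valid bin, and out-of-range threads are masked by a predicate instead of a divergent branch:
__global__ void scale_bins( const float* hist, float* out, int nbins )
{
    // All threads of the block evaluate the loop condition together,
    // so they enter and leave the loop in lockstep.
    for( int bin = threadIdx.x; __syncthreads_or( bin < nbins ); bin += blockDim.x )
    {
        const bool  valid = ( bin < nbins );
        const float v     = valid ? hist[bin] : 0.0f; // predicated read, no divergent branch
        if( valid ) out[bin] = 0.5f * v;              // predicate guards only the store
    }
}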
Moved up to L207 as suggested in d0edb05.
Thank you for finding these problems and fixing them!
@andrew-hardin Thank you for this great contribution.
This PR resolves a few read/write race conditions identified by the racecheck tool. I stumbled onto them while naively trying to investigate issue #80.
The __syncthreads() on L78 resolves the race found below; memory needed to be initialized prior to the AtomicAdd().
The __syncthreads() on L210 resolves this race found below; threads had to finish writing to yval prior to passing the collection to the BitonicSort.
I've updated the PR with two additional commits that resolve a race condition in the block-based prefix sum.