Reduce unrolling factors during tuning #354
9f5c9c3 to
75361e8
Compare
"unused"
"fully unroll"? Why is it important to be able to unroll by divisors? |
75361e8 to
01fac32
Compare
I suppose that is the tuner's way of calling
That is my understanding too. So divisors will be covered by the closest larger power of two. |
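To illustrate the point about divisors, here is a minimal sketch, assuming that an unroll factor at least as large as the loop trip count amounts to a full unroll. The helper name `smallest_covering_power_of_two` is hypothetical and not part of the tuner.

```python
def smallest_covering_power_of_two(trip_count: int) -> int:
    """Return the smallest power of two that is >= trip_count.

    Hypothetical helper: it only illustrates why a power-of-two unroll
    factor can stand in for an exact divisor of the problem size.
    """
    if trip_count < 1:
        raise ValueError("trip_count must be positive")
    factor = 1
    while factor < trip_count:
        factor *= 2
    return factor


# A loop with 24 iterations is already fully unrolled by a factor of 32,
# so the exact value 24 is not needed as a separate candidate.
print(smallest_covering_power_of_two(24))
```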
On Thu, Apr 26, 2018 at 09:01:51AM +0000, ftynse wrote:
> Apparently, it used to be used for (only!) b2.
I suppose that is the tuner's way of calling `blockDim.z`, which cannot be greater than 64.
OK, I didn't know about this restriction.
But then why was this limit (apparently) removed from the autotuner?
skimo
Good question. This may also explain some random "launch out of resources" errors that we saw earlier. @ttheodor ? |
My guess is that it got lost in some refactoring. |
08d062a to
2d053dd
Compare
That limitation has been consistent across all compute capabilities so far. This one, however, may be relaxed. Only very old cards have max 512 threads per block; newer ones have 1024 or even 2048. Not sure about 32 as a lower bound. We may want some smaller divisors of the size, especially if they are used as tile sizes. OTOH, tightening will kick in in that case and reduce the thread count anyway. Still, there may be a situation where we could have chosen something like an 8x128 block size but could not because we don't use fewer than 32 threads per dimension. |
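The constraints discussed here can be sketched as a small validity check. This is only an illustration, assuming the per-block limits documented for CUDA compute capability 2.0 and later (at most 1024 threads per block, `blockDim.x` and `blockDim.y` at most 1024, `blockDim.z` at most 64). The function name `is_valid_block` is hypothetical and does not come from the tuner.

```python
# Assumed limits for compute capability 2.0 and later.
MAX_THREADS_PER_BLOCK = 1024
MAX_BLOCK_DIM = (1024, 1024, 64)  # limits for blockDim.x, blockDim.y, blockDim.z


def is_valid_block(x: int, y: int = 1, z: int = 1) -> bool:
    """Check a candidate block shape against the assumed CUDA limits."""
    dims = (x, y, z)
    if any(d < 1 for d in dims):
        return False
    if any(d > limit for d, limit in zip(dims, MAX_BLOCK_DIM)):
        return False
    return x * y * z <= MAX_THREADS_PER_BLOCK


print(is_valid_block(8, 128))     # fits: 8 * 128 = 1024 threads
print(is_valid_block(1, 1, 128))  # rejected: blockDim.z exceeds 64
print(is_valid_block(32, 32, 2))  # rejected: 2048 threads in one block
```

Under these assumptions the 8x128 shape is legal, so a lower bound of 32 threads per dimension would exclude it for tuning reasons rather than hardware reasons.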
f3ca028 to
79bf0b2
Compare
Unrolling factors are powers of 2 up to 256. I can correlate that with long compilation times locally. Let's avoid the 128 and 256 cases. Tested locally, looks good to me.
79bf0b2 to
190ae6d
Compare
190ae6d to
59ca04f
Compare
Addressed this myself |
For some reason rangeUpTo64 is unused and unrolling factors are powers of 2 up to 256.
I can correlate that with bad compilation times locally.
Let's use rangeUpTo64 instead, which can fully unroll by divisors of problem sizes and avoids the 128 and 256 cases. Tested locally, looks good to me.
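As a rough illustration of the change, the sketch below builds a candidate set capped at 64. It is not the actual definition of `rangeUpTo64`, which is not shown in this conversation; the function `candidate_unroll_factors` and the choice to combine powers of two with divisors of the problem size are assumptions made for the example.

```python
def candidate_unroll_factors(problem_size: int, cap: int = 64) -> list:
    """Hypothetical candidate set: powers of two and divisors, both <= cap."""
    if problem_size < 1 or cap < 1:
        raise ValueError("problem_size and cap must be positive")
    # Powers of two up to the cap.
    powers = set()
    factor = 1
    while factor <= cap:
        powers.add(factor)
        factor *= 2
    # Divisors of the problem size that do not exceed the cap.
    divisors = {d for d in range(1, cap + 1) if problem_size % d == 0}
    return sorted(powers | divisors)


factors = candidate_unroll_factors(24)
print(factors)
# With the default cap, 128 and 256 never appear among the candidates.
print(128 in factors, 256 in factors)
```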