This repository was archived by the owner on Apr 28, 2023. It is now read-only.

Conversation

@nicolasvasilache
Contributor

@nicolasvasilache commented Apr 26, 2018

For some reason rangeUpTo64 is unsed and unrolling factors are powers of 2 up to 256.
I can correlate that with bad compilation times locally.
Let's use rangeUpTo64 instead which can full unroll by divisors of problem sizes and avoids the 128 and 256 cases. Tested locally, looks good to me.
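A hypothetical sketch of the two candidate ranges under discussion. The names `pow2s_up_to_256` and `range_up_to_64` and their exact behavior are assumptions inferred from this comment, not the actual Tensor Comprehensions API:

```python
# Illustrative sketch only; these are not the real TC tuner functions.

def pow2s_up_to_256():
    """Powers of 2 from 1 to 256: the range that correlated with
    long compilation times."""
    return [2 ** i for i in range(9)]  # 1, 2, 4, ..., 256

def range_up_to_64(problem_size):
    """Assumed behavior of rangeUpTo64: divisors of the problem size
    capped at 64, so a full unroll by an exact divisor is possible
    and the 128/256 cases are never generated."""
    return [d for d in range(1, min(problem_size, 64) + 1)
            if problem_size % d == 0]

print(pow2s_up_to_256())   # [1, 2, 4, 8, 16, 32, 64, 128, 256]
print(range_up_to_64(48))  # [1, 2, 3, 4, 6, 8, 12, 16, 24, 48]
```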

@skimo-openhub
Contributor

For some reason rangeUpTo64 is unsed and unrolling factors are powers of 2 up to 256.

"unused"
Apparently, it used to be used for (only!) b2.

I can correlate that with bad compilation times locally.
Let's use rangeUpTo64 instead which can full unroll by divisors of problem sizes and avoids the 128 and 256 cases. Tested locally, looks good to me.

"fully unroll"?

Why is it important to be able to unroll by divisors?
AFAIU, the unroll factor only specifies an upper bound.

@ftynse force-pushed the pr/reduce-tuning-unroll branch from 75361e8 to 01fac32 on April 26, 2018, 08:52
@ftynse
Contributor

ftynse commented Apr 26, 2018

Apparently, it used to be used for (only!) b2.

I suppose that is the tuner's way of calling blockDim.z, which cannot be greater than 64.

AFAIU, the unroll factor only specifies an upper bound.

That is my understanding too. So divisors will be covered by the closest larger power of two.
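A small sketch of the "upper bound" semantics being agreed on here (assumed behavior, not taken from the TC source): if the unroll factor `f` only bounds the replication count, any divisor `d <= 64` is subsumed by the next power of two at or above it.

```python
# Illustrative model of unroll factors as upper bounds (an assumption
# matching the discussion, not the actual compiler implementation).

def effective_unroll(trip_count, factor):
    """The loop body is replicated at most `factor` times, and never
    more than the trip count itself."""
    return min(trip_count, factor)

def next_pow2(n):
    """Smallest power of two >= n."""
    p = 1
    while p < n:
        p *= 2
    return p

# A divisor like 24 is covered by factor 32: both fully unroll n = 24.
print(effective_unroll(24, next_pow2(24)))  # 24
print(effective_unroll(24, 24))             # 24
```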

@skimo-openhub
Contributor

skimo-openhub commented Apr 26, 2018 via email

@ftynse
Contributor

ftynse commented Apr 26, 2018

But then why was this limit (apparently) removed from the autotuner?

Good question. This may also explain some random "launch out of resources" errors that we saw earlier. @ttheodor ?
But this is not related to the current PR.

@thetheodor

But then why was this limit (apparently) removed from the autotuner?

My guess is that it got lost in some refactoring.

@nicolasvasilache force-pushed the pr/reduce-tuning-unroll branch 2 times, most recently from 08d062a to 2d053dd on April 26, 2018, 13:49
@nicolasvasilache
Contributor Author

It didn't get lost; it was forcing all grid dimensions to be < 64.
@ftynse it may be unrelated because of this, but we should hardcode it less. Can it change across different cards?

@ftynse
Contributor

ftynse commented Apr 26, 2018

can it change on various cards?

That limitation has been consistent across all compute capabilities so far.

This one, however, may be relaxed. Only very old cards are limited to 512 threads per block; newer ones allow 1024 (2048 is the per-multiprocessor limit, not per block). Not sure about 32 as a lower bound. We may want some smaller divisors of the size, especially if they are used as tile sizes. OTOH, tightening will kick in in that case and reduce the thread count anyway. Still, there may be a situation where we could have chosen something like an 8x128 block size but could not, because we don't use fewer than 32 threads per dimension.
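An illustrative check of the block-size limits mentioned above, assuming the per-block caps of recent compute capabilities (the function name and structure are made up for this sketch):

```python
# Hypothetical validity check for a CUDA block shape. Limits assumed:
# blockDim.x/y up to 1024, blockDim.z up to 64 (consistent across
# compute capabilities so far), and at most 1024 threads per block.

def block_ok(x, y, z, max_threads=1024):
    """Return True if (x, y, z) is a launchable block shape."""
    return (1 <= x <= 1024 and 1 <= y <= 1024 and 1 <= z <= 64
            and x * y * z <= max_threads)

print(block_ok(8, 128, 1))  # True: exactly 1024 threads
print(block_ok(32, 32, 2))  # False: 2048 threads exceeds the cap
print(block_ok(1, 1, 128))  # False: blockDim.z is capped at 64
```

Exceeding these limits at launch time is one way to get the "launch out of resources" errors mentioned earlier.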

Unrolling factors are powers of 2 up to 256.
I can correlate that with long compilation times locally.
Let's avoid the 128 and 256 cases.
Tested locally, looks good to me.
@nicolasvasilache force-pushed the pr/reduce-tuning-unroll branch from 79bf0b2 to 190ae6d on April 26, 2018, 14:41
@ftynse
Contributor

ftynse commented Apr 26, 2018

-Werror kicked in on filterHigherThan

@ftynse force-pushed the pr/reduce-tuning-unroll branch from 190ae6d to 59ca04f on April 26, 2018, 21:18
@ftynse
Contributor

ftynse commented Apr 26, 2018

Addressed this myself.

@ftynse merged commit f54c5dd into facebookresearch:master on April 27, 2018
@ftynse deleted the pr/reduce-tuning-unroll branch on April 27, 2018, 08:31
5 participants