Speed up maxout by exploiting parallelism better #579
Merged
I was profiling the parser refactor and noticed that the maxout kernel takes more time than I expected.
The kernel's maximum parallelism is currently determined by the batch size: each thread processes one row of the batch. This leaves the GPU underused. E.g. by default we launch nr_blocks: 128 * nr_threads_per_block: 128 = 16384 threads, which is much larger than the typical batch size, so most launched threads sit idle.
This PR changes the maxout kernel to parallelize at the output level (each thread computes one output). This makes the maxout kernel about 4.7 times faster with a batch size of 1024 on an RTX 2060 Super.
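For reference, here is a minimal NumPy sketch (not the actual CUDA kernel; the shapes and loop structure are illustrative assumptions) of what "one thread per output" means for a maxout forward pass over an array of shape (batch, outputs, pieces):

```python
import numpy as np

def maxout_per_output(X):
    """Reference maxout forward pass. X has shape (B, O, P):
    batch rows, outputs, and pieces per output. Each (b, o)
    iteration is independent, so in a GPU kernel each one can be
    assigned to its own thread: B * O units of parallelism
    instead of only B when parallelizing over batch rows."""
    B, O, P = X.shape
    best = np.empty((B, O), dtype=X.dtype)
    which = np.empty((B, O), dtype=np.int32)
    for b in range(B):
        for o in range(O):
            # One "thread": reduce over the P pieces of one output.
            p = int(np.argmax(X[b, o]))
            which[b, o] = p
            best[b, o] = X[b, o, p]
    return best, which
```

With a batch size of 1024 and, say, a few hundred outputs per row, this scheme yields hundreds of thousands of independent work items, comfortably more than the 16384 threads launched by default, instead of only 1024.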
I haven't updated the backprop variant yet, since it barely appears in the profiles.