CUDA: some micro-optimizations in mmf.cuh for mul_mat_id #15926
Conversation
My experience with `mmq_ids_helper` has been that the biggest speedup came from specifying the number of used experts at compile time in order to eliminate the inner loop over `n_expert_used`.
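As an illustrative sketch of what that means (the helper name, signature, and data layout below are hypothetical stand-ins, not the actual `mmq_ids_helper` code): making the expert count a template parameter turns the inner loop's trip count into a compile-time constant, so the compiler can fully unroll it or eliminate it entirely.

```cuda
#include <cstdint>

// Hypothetical helper: with n_expert_used as a template parameter the inner
// loop has a compile-time trip count, so the unroll removes all loop overhead
// (and for n_expert_used == 1 the loop disappears altogether).
template <int n_expert_used>
static __global__ void ids_helper(const int32_t * ids, int32_t * slots, const int n_tokens) {
    const int token = blockIdx.x*blockDim.x + threadIdx.x;
    if (token >= n_tokens) {
        return;
    }
#pragma unroll
    for (int i = 0; i < n_expert_used; ++i) {
        slots[token*n_expert_used + i] = ids[token*n_expert_used + i];
    }
}
```

The cost is one kernel instantiation per supported expert count, dispatched via a host-side switch over the runtime value, which is the compilation-time concern raised later in this conversation.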
Force-pushed from c66fb36 to 94b189b
Unfortunately I still don't see a speedup in my tests; I tried with granite-moe and also test-backend-ops. Also I saw unrolling 16-32 …
Regarding register pressure: that is always the biggest limitation for matrix multiplications. For MMF to scale properly to larger batch sizes, the memory access patterns will need to be changed. Like in MMQ, it will be necessary to load the …

What you could do with less effort is extend the kernel to run more than one CUDA block in parallel for …
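The comment is cut off above; as one plausible reading of the lower-effort option, here is a hedged sketch (hypothetical kernel and names, not the actual mmf.cuh code) of letting several CUDA blocks make progress in parallel by adding a grid dimension that splits the output columns:

```cuda
// Hypothetical sketch: blockIdx.y splits the output columns, so several
// independent blocks work on the same matrix multiplication concurrently
// instead of one block handling all columns.
static __global__ void mmf_sketch(const float * x, const float * y, float * dst,
                                  const int ncols, const int cols_per_block) {
    const int col0    = blockIdx.y*cols_per_block;
    const int col_end = min(col0 + cols_per_block, ncols);

    for (int col = col0 + threadIdx.x; col < col_end; col += blockDim.x) {
        dst[col] = x[col]*y[col]; // placeholder for the per-column work
    }
}

// Launch-side idea: grid.y = (ncols + cols_per_block - 1) / cols_per_block;
```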
I did some prototyping and was also unable to get better performance than master. So for this PR I think we should remove the template parameter for the number of experts used again (to avoid needlessly increasing the compilation time), then I'll approve.
Force-pushed from 94b189b to bf08ea5
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Following #15767, I do not see a noticeable difference in performance, but this change has better memory coalescing and uses all available warps for finding slots. In general, this part of the code does not contribute significantly to the runtime in any case.
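A hedged sketch of the coalescing idea described above (hypothetical names and layout, not the actual code): every thread of every warp scans the token-to-expert ids together, so consecutive threads read consecutive `int32_t` values in a single memory transaction.

```cuda
#include <cstdint>

// Hypothetical slot-finding sketch: a grid-stride loop over the ids array.
// Thread t of a warp reads ids[base + t], so each warp issues one coalesced
// transaction, and all warps of the block participate in the scan.
static __global__ void find_slots(const int32_t * ids, int32_t * slots,
                                  const int n_ids, const int expert) {
    for (int i = blockIdx.x*blockDim.x + threadIdx.x; i < n_ids; i += gridDim.x*blockDim.x) {
        slots[i] = ids[i] == expert ? i : -1; // mark tokens routed to this expert
    }
}
```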
While looking at optimizing the kernel, I noticed that it is overall bound by register pressure, which limits occupancy. I tried adding `#pragma unroll 1` to dial back some of the unrolling, but that only made performance worse.
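For reference, this is the mechanism that was tried (the loop below is an illustrative stand-in, not the actual kernel): `#pragma unroll 1` keeps a loop rolled, which reduces the number of live registers at the cost of instruction-level parallelism.

```cuda
// Illustrative stand-in: forcing the loop to stay rolled with #pragma unroll 1.
// Fewer registers stay live across iterations, but the compiler can no longer
// overlap independent iterations; per the comment above, that trade-off did
// not pay off here.
static __device__ float dot(const float * a, const float * b, const int n) {
    float sum = 0.0f;
#pragma unroll 1
    for (int i = 0; i < n; ++i) {
        sum += a[i]*b[i];
    }
    return sum;
}
```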