Hi there! Thank you for your amazing work on implementing faster components for transformer-based models! I've noticed that an encoder or decoder launches many separate GPU kernels. Have you tried the CUDA Graph mechanism introduced by NVIDIA, which captures a graph of kernels and replays it with a single launch to further reduce launch overhead and memory copies? It seems to me that LightSeq could easily take advantage of this mechanism, roughly along the lines of the sketch below. Wondering if you are willing to give it a try :)
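For reference, here is a minimal sketch of NVIDIA's stream-capture API. The `toy_kernel` and the launch counts are placeholders standing in for the many small kernels of a decoder step, not actual LightSeq code:

```cuda
#include <cuda_runtime.h>

// Placeholder kernel standing in for one of the many encoder/decoder kernels.
__global__ void toy_kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1024;
    float* d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaGraph_t graph;
    cudaGraphExec_t graph_exec;

    // Capture the sequence of kernel launches into a graph once...
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int k = 0; k < 8; ++k) {  // stand-in for the many small kernels in one step
        toy_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);
    }
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&graph_exec, graph, nullptr, nullptr, 0);

    // ...then replay the whole graph with a single launch per decoding step,
    // avoiding the per-kernel launch overhead.
    for (int step = 0; step < 100; ++step) {
        cudaGraphLaunch(graph_exec, stream);
    }
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(graph_exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(d_data);
    return 0;
}
```

The main constraint is that the captured kernel sequence and its launch parameters must stay fixed between replays, which fits autoregressive decoding steps that reuse the same kernel pattern.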
Thanks for your excellent advice!
We haven't tried this mechanism yet, but we will investigate it.
In addition to manual op fusion, CUDA Graphs seem likely to further reduce the launch cost of multiple kernels, especially during inference.