Hi there! Thank you for your amazing work on implementing faster components for transformer-based models! I've noticed that an encoder or decoder launches many separate GPU kernels. Have you tried the CUDA Graph mechanism introduced by NVIDIA, which captures a graph of kernels and replays it with a single launch to further reduce launch overhead and memory copies? It seems to me that LightSeq could easily take advantage of this mechanism, roughly along the lines of the sketch below. Wondering if you are willing to give it a try :)
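For reference, here is a minimal sketch of NVIDIA's stream-capture API. The `toy_kernel` and the launch counts are placeholders standing in for the many small kernels of a decoder step, not actual LightSeq code:

```cuda
#include <cuda_runtime.h>

// Placeholder kernel standing in for one of the many encoder/decoder kernels.
__global__ void toy_kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1024;
    float* d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaGraph_t graph;
    cudaGraphExec_t graph_exec;

    // Capture the sequence of kernel launches into a graph once...
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int k = 0; k < 8; ++k) {  // stand-in for the many small kernels in one step
        toy_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);
    }
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&graph_exec, graph, nullptr, nullptr, 0);

    // ...then replay the whole graph with a single launch per decoding step,
    // avoiding the per-kernel launch overhead.
    for (int step = 0; step < 100; ++step) {
        cudaGraphLaunch(graph_exec, stream);
    }
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(graph_exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(d_data);
    return 0;
}
```

The main constraint is that the captured kernel sequence and its launch parameters must stay fixed between replays, which fits autoregressive decoding steps that reuse the same kernel pattern.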
Thanks for your excellent advice!
We haven't tried this mechanism yet, but we will investigate it.
In addition to manual op fusion, CUDA Graphs seem likely to further reduce the launch cost of multiple kernels, especially during inference.