
CUDA Graph to further improve performance? #106

Open
zhuzilin opened this issue Jul 12, 2021 · 1 comment
@zhuzilin

Hi there! Thank you for your amazing work on implementing faster components for transformer-based models! I've noticed that an encoder or decoder is made up of multiple GPU kernels. Have you ever tried the CUDA Graph mechanism introduced by NVIDIA, which combines a graph of kernels into a single launch to further reduce launch overhead and memory copies? It seems to me that LightSeq could easily take advantage of this mechanism. I wonder if you'd be willing to give it a try :)
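
For reference, here is a minimal sketch of the stream-capture workflow (not LightSeq code; `dummy_kernel`, the layer count, and the sizes are hypothetical stand-ins): several kernel launches on one stream are captured into a `cudaGraph_t`, instantiated once, and then replayed with a single `cudaGraphLaunch` per step.

```cuda
// Minimal sketch: capture a sequence of kernel launches into a CUDA graph
// via stream capture, then replay the whole sequence with one launch.
// Kernel, layer count, and sizes are hypothetical placeholders.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void dummy_kernel(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;  // stand-in for one encoder sub-op
}

int main() {
    const int n = 1 << 20;
    float* d_x;
    cudaMalloc(&d_x, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Begin capture: launches on this stream are recorded, not executed.
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int layer = 0; layer < 6; ++layer) {  // e.g. 6 "layers" of kernels
        dummy_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_x, n);
    }
    cudaStreamEndCapture(stream, &graph);

    // Instantiate once (CUDA 11-style signature), then replay the captured
    // kernel sequence with a single launch call per inference step.
    cudaGraphExec_t graph_exec;
    cudaGraphInstantiate(&graph_exec, graph, nullptr, nullptr, 0);
    for (int step = 0; step < 100; ++step) {
        cudaGraphLaunch(graph_exec, stream);
    }
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(graph_exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(d_x);
    printf("done\n");
    return 0;
}
```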

@neopro12
Collaborator

Thanks for your excellent advice!
We haven't tried this mechanism yet, but we will investigate it.
On top of our manual op fusion, it seems that it could further reduce the launch cost of multiple kernels, especially during inference.
