Inference on NVIDIA GPU #230
Comments
When I repeated the above steps, I found that the "./bin/mpt" process occupied 429 MiB of GPU memory. However, during runtime the GPU utilization remained at 0%, and the speed was the same as on the CPU, about 500 ms per token.
I'm seeing the same behavior.
The tensors need to be offloaded to the GPU.
I also had some frustration with GPU support and couldn't figure out why it didn't seem to do anything with the GPU, aside from consuming a small amount of VRAM each run. Looking closer at the source, it turns out that most model handlers do not actually have any CUDA support - the CUDA backend is built, but the handlers never use it, and passing -ngl on the command line is accepted but completely ignored. Is there a roadmap for adding proper support? At the moment the only handler that seems to provide CUDA offloading is starcoder.
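For anyone digging into this: below is a minimal sketch of the offload step that -ngl is supposed to trigger, modeled on the pattern llama.cpp and the starcoder example used around this time (marking tensors as GGML_BACKEND_GPU and calling ggml_cuda_transform_tensor). The helper name and the way the layer weights are collected are illustrative, not code from this repo.

```cpp
// Sketch only: assumes the ggml-cuda API of this era
// (ggml_cuda_transform_tensor, GGML_BACKEND_GPU).
#ifdef GGML_USE_CUBLAS
#include "ggml.h"
#include "ggml-cuda.h"

#include <cstdio>
#include <vector>

// `layer_weights` would hold the weight tensors of the first n_gpu_layers
// transformer blocks (wqkv, out_proj, ffn up/down, ...) - hypothetical helper.
static size_t offload_to_gpu(const std::vector<ggml_tensor *> & layer_weights) {
    size_t vram_total = 0;
    for (ggml_tensor * t : layer_weights) {
        t->backend = GGML_BACKEND_GPU;           // mark tensor as GPU-resident
        ggml_cuda_transform_tensor(t->data, t);  // copy its data into VRAM
        vram_total += ggml_nbytes(t);
    }
    fprintf(stderr, "offloaded %zu tensors, %.2f MB of VRAM\n",
            layer_weights.size(), vram_total / 1024.0 / 1024.0);
    return vram_total;
}
#endif
```

Without a loop like this, -ngl just sets a variable and all weights stay in host memory, which matches the behavior described above; the few hundred MiB of VRAM that still shows up is most likely just the CUDA context and cuBLAS workspace, not model weights.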
Thanks for your great work. I'm running an MPT model on an NVIDIA V100 GPU. I think the compilation went well, but the GPU cannot be utilized during inference. Here is what I got:
Then:
When I run it, I get this output:
During runtime, I checked repeatedly and found that the GPU was not utilized at all. If I accidentally missed something, please let me know.