I reviewed the Discussions and have a new bug or useful enhancement to share.
Feature Description
Please provide a detailed written description of what you were trying to do, and what you expected llama.cpp to do as an enhancement.
I would like to be able to explicitly define the number of layers to put on each available GPU, depending on how much VRAM it has and whether that GPU is being used for something else (e.g., other models, context, etc.).
Motivation
Avoid OOMs when GPU 0 receives a larger share than the other GPUs (model weights plus the context). With this feature, one could put fewer layers on GPU 0 and more layers on other GPUs that have free VRAM.
Please provide a detailed written description of reasons why this feature is necessary and how it is useful to llama.cpp users.
More users would be able to run larger models.
Possible Implementation
It should be easy to provide a flag to llama.cpp such as "--fractions 4,9,9,9" to put 4 layers on GPU 0 and 9 layers each on GPUs 1, 2, and 3.
This would free up VRAM on GPU 0 for the context, scratch buffers, etc.
If you have an idea as to how it can be implemented, please write a detailed description. Feel free to give links to external sources or share visuals that might be helpful to understand the details better.
phalexo changed the title from "Need control to assign to each GPU user-defined number of layers." to "Need to be able to assign to each GPU user-defined number of layers." on Dec 27, 2023.