Given the proliferation of different model architectures, it would be interesting to have as much granularity as possible when testing quant strategies, so that as many people as possible can test them on as many supported models as possible, with the tests taking as little time as possible.
That could mean:
GGUF as a directory, in which each tensor of each layer would be quantized as a separate file.
Partial requant, in which, starting from an existing full quant, only the specified tensors of the specified layers would be requantized.
A GUI to easily decide which tensors to requant, either per unit (e.g. ffn.down for layer x) or per range (e.g. ffn.down for layer range x-y), and for any chosen number of ranges.
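The per-unit/per-range selection above could be sketched as follows. Note that the spec syntax (`ffn_down:3` or `ffn_down:3-7`) and the llama.cpp-style tensor names (`blk.<layer>.ffn_down.weight`) are illustrative assumptions, not an existing interface:

```python
# Hypothetical sketch of expanding unit/range requant specs into concrete
# tensor names, as such a GUI or CLI might do internally.
# Spec format (assumed): "<tensor>:<layer>" or "<tensor>:<start>-<end>".

def expand_spec(spec: str) -> list[str]:
    """Expand e.g. 'ffn_down:3-5' into ['blk.3.ffn_down.weight', ...]."""
    tensor, _, layers = spec.partition(":")
    if "-" in layers:
        start, end = (int(x) for x in layers.split("-"))
    else:
        start = end = int(layers)
    # Assumed llama.cpp-style naming: blk.<layer>.<tensor>.weight
    return [f"blk.{i}.{tensor}.weight" for i in range(start, end + 1)]

def tensors_to_requant(specs: list[str]) -> set[str]:
    """Union of all tensors named by the given unit/range specs."""
    out: set[str] = set()
    for spec in specs:
        out.update(expand_spec(spec))
    return out
```

Everything outside that set would be left untouched on disk, which is what makes the partial requant cheap.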
This would make it possible to quickly check the impact of a quantization change while saving compute and time, by not requantizing the same tensors identically over and over again.
Ultimately, when a satisfactory size/quality trade-off is found, a "compacting" feature, turning the directory into a single .gguf file, would be useful as well. And why not a "decompacting" feature, to start the tests from an already quantized model?
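The compacting/decompacting pair could look roughly like this. The on-disk layout here (a trivial length-prefixed index) is invented purely for illustration; a real implementation would emit a proper single-file GGUF via llama.cpp's own writer:

```python
# Hypothetical sketch: pack a one-file-per-tensor quant directory into a
# single container file ("compacting"), and split it back out again
# ("decompacting"). The container format is an assumption, not GGUF.
import os
import struct

def compact(src_dir: str, out_path: str) -> None:
    """Concatenate every tensor file in src_dir into out_path."""
    names = sorted(os.listdir(src_dir))
    with open(out_path, "wb") as out:
        out.write(struct.pack("<I", len(names)))          # tensor count
        for name in names:
            with open(os.path.join(src_dir, name), "rb") as f:
                data = f.read()
            enc = name.encode()
            out.write(struct.pack("<I", len(enc)) + enc)  # name
            out.write(struct.pack("<Q", len(data)) + data)  # payload

def decompact(in_path: str, dst_dir: str) -> None:
    """Split a compacted file back into per-tensor files in dst_dir."""
    os.makedirs(dst_dir, exist_ok=True)
    with open(in_path, "rb") as f:
        (count,) = struct.unpack("<I", f.read(4))
        for _ in range(count):
            (nlen,) = struct.unpack("<I", f.read(4))
            name = f.read(nlen).decode()
            (dlen,) = struct.unpack("<Q", f.read(8))
            with open(os.path.join(dst_dir, name), "wb") as out:
                out.write(f.read(dlen))
```

Decompacting an already quantized single-file model into such a directory is what would let experiments start from an existing quant instead of from the full-precision weights.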
Morning coffee thoughts.