Description
llama.cpp recently removed bit-shuffling in some of the quantized file formats to improve performance. This is a breaking change that currently has no upgrade path other than re-quantizing the base models, and some users only have the quantized ggml models.
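For anyone who still has the original f16 weights, re-quantizing with llama.cpp's `quantize` tool looks roughly like the sketch below. The paths are examples, and the exact form of the type argument (a name like `q4_0` vs. a numeric id) varies between llama.cpp versions, so treat this as a rough illustration rather than an official upgrade path.

```sh
# Rebuild a q4_0 ggml model from the original f16 weights using
# llama.cpp's quantize tool (run from the llama.cpp build directory).
# Paths are example placeholders; the quantization type may need to be
# passed as a numeric id instead of a name on older builds.
./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin q4_0
```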
I support any change that makes inference faster; I think we still have a long way to go, and we shouldn't slow the pace of improvements for what is still essentially alpha software. That said, I also don't want to cause too much disruption to existing users who have working setups. I'll create a PR for this change and keep it open for at least the weekend before publishing a new PyPI release with the changes. After that, the old version will remain available on PyPI and through the releases so people can keep it pinned.
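For users who want to keep reading old-format models, pinning would look something like the following. The package name `llama-cpp-python` and the version `0.1.x` are assumed placeholders here; check the PyPI release history for the actual last pre-change release.

```sh
# Pin the last release that still reads the old (bit-shuffled) format.
# "0.1.x" is a placeholder -- substitute the actual last pre-change
# version from the PyPI release history before running this.
pip install llama-cpp-python==0.1.x
```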