Support for model quantization #249
Comments
This is a feature that will be added directly to Nx/EXLA. It is currently a work in progress, so stay tuned. :)
Hi there! I was wondering what the current status of this is; I found https://elixirforum.com/t/high-scale-performance-of-llms-needed-features/58562 but couldn't find any tracking issues for Bumblebee or Nx. I think quantization support is critical to making Bumblebee a viable option for developers and deployment; it makes it actually tractable to run larger models on consumer hardware. Thanks in advance!
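For readers unfamiliar with the idea, here is a minimal sketch of symmetric per-tensor int8 quantization written with plain Nx ops. This is only an illustration of why quantization shrinks memory use (1 byte per weight instead of 4, plus one scale); it is not the Nx/EXLA API being discussed, and the module and function names are made up:

```elixir
defmodule QuantSketch do
  import Nx.Defn

  # Symmetric per-tensor int8 quantization: choose a scale so the largest
  # absolute weight maps to 127, then round and cast down to signed 8-bit.
  defn quantize(w) do
    scale = Nx.reduce_max(Nx.abs(w)) / 127.0
    q = w |> Nx.divide(scale) |> Nx.round() |> Nx.as_type(:s8)
    {q, scale}
  end

  # Dequantize back to f32 before ops that need float inputs.
  defn dequantize(q, scale) do
    Nx.multiply(Nx.as_type(q, :f32), scale)
  end
end

# Usage sketch:
#   w = Nx.tensor([[0.12, -1.5], [0.7, 0.03]])
#   {q, scale} = QuantSketch.quantize(w)
#   w_approx = QuantSketch.dequantize(q, scale)
```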
We are currently updating XLA in Nx, and the new version supports quantization. :) So hopefully sooner rather than later!
Is there an issue tracking that somewhere?
Search for MLIR in the Nx project. Once that is done, we can start thinking about quantization!
Just to clarify, @josevalim, do you think Nx will be able to run GGUF models, even after the MLIR updates? I don't believe XLA will work with GGUF models out of the box, since that's a quantized model file format for llama.cpp.
Those are two separate problems. Once we support quantization, then we may be able to run GGUF, as long as someone writes a deserializer for it.
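On the deserializer point: as a rough sketch of the first step, GGUF files begin with a fixed header (per the llama.cpp GGUF spec: the ASCII magic "GGUF", a little-endian uint32 version, then two little-endian uint64 counts for tensors and metadata key/value pairs), which maps naturally onto Elixir binary pattern matching. `GGUFHeader` is a hypothetical module name, not an existing library:

```elixir
defmodule GGUFHeader do
  # Parse the fixed-size GGUF header from the front of the file binary.
  # Everything after the header (metadata KVs, tensor infos, tensor data)
  # is returned untouched in `rest`.
  def parse(<<"GGUF", version::little-32, n_tensors::little-64,
              n_kv::little-64, rest::binary>>) do
    {:ok, %{version: version, tensor_count: n_tensors, metadata_kv_count: n_kv}, rest}
  end

  def parse(_other), do: {:error, :not_gguf}
end

# Usage sketch:
#   {:ok, header, rest} = "model.gguf" |> File.read!() |> GGUFHeader.parse()
```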
Blocking on elixir-nx/nx#1452. |
Running larger models on Bumblebee is difficult for people with lower-tier hardware.
Adding GPTQ and GGUF/GGML support would greatly boost model accessibility in the Elixir ecosystem.