Support for Jamba JambaForCausalLM #6372
Comments
Have smaller Mamba-based LLMs already been added in the past?
@compilade added Mamba support, but Jamba seems to be a derivative and needs code modifications.
I'd like it very much if they released a smaller version of their model. I don't have enough RAM to run Mixtral (I only have 8GB), and Jamba seems to be around the same size as Mixtral. A model with fewer than 1B total parameters (or even fewer than 200M) would be ideal for quickly figuring out implementation problems (and would waste much less disk space when debugging or modifying model conversion).

My free time is too scarce at the moment to work on this (until May). The KV cache of this model will be some complicated beast (it's both recurrent and attention-based, but never in the same "layer"; this will require rethinking how the KV cache is allocated and how Mamba's state is stored), but I think it should still be possible to support eventually, given enough effort.

Anyone wanting to work on this should start by building a strong mental model of how Mamba's state is managed in llama.cpp. Simplifying how recurrent state operations are implemented is on my TODO list, and implementing both Jamba and RWKV should help with that refactoring, but Jamba support in llama.cpp won't come quickly. If anyone's too impatient, feel free to experiment and figure out a way to make Jamba work with llama.cpp.
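(To make the "never in the same layer" point above concrete, here is a minimal C++ sketch of what a hybrid per-layer cache could look like. Every name in it is hypothetical, invented for illustration; it is not the actual llama.cpp API.)

```cpp
// Hypothetical sketch: a cache where each layer owns either an attention
// KV cache or a recurrent (Mamba) state -- never both.
#include <variant>
#include <vector>

// Attention layers keep one key/value pair per past token,
// so their memory use grows with the context length.
struct kv_cache_layer {
    std::vector<float> k;
    std::vector<float> v;
};

// Mamba layers keep a single fixed-size state per sequence,
// overwritten in place on every token.
struct recurrent_state_layer {
    std::vector<float> conv_state; // rolling convolution window
    std::vector<float> ssm_state;  // selective state-space state
};

using layer_cache = std::variant<kv_cache_layer, recurrent_state_layer>;

struct hybrid_cache {
    std::vector<layer_cache> layers; // one entry per model layer
};

// Removing the last n tokens is easy for attention layers (drop cells),
// but a recurrent state cannot be "rewound" without a saved snapshot.
// That asymmetry is what makes cache management for Jamba hard.
```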
For reference: #5328
It's not purely Mamba-based any more; it's a mix of Transformer and Mamba layers, so that's going to be different.
There is a Mini-Jamba on Hugging Face now: https://huggingface.co/TechxGenus/Mini-Jamba-v2 It might be helpful for testing, if it actually is a working Mini-Jamba model; I haven't checked that yet.
Just checking to see if anyone has come close to getting Jamba working here. I've been working on fine-tuning and training some new general-chat Jamba models in preparation for when they can be more standardized for everyone. Once we can get Jamba as a GGUF, I think it'll do some awesome stuff for all of us: https://huggingface.co/Severian/Jamba-Hercules
Any update on Jamba support?
I've worked on refactoring the KV cache in the past weeks to allow managing both recurrent states and Attention's KV cache at once (see master...compilade/refactor-kv-cache). It's still a work-in-progress: state checkpoints (necessary to avoid re-processing the whole prompt when removing the last few tokens) are implemented, but not yet handled everywhere. After that, work on specific hybrid model architectures like Jamba will be possible.
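(For readers following along: the checkpoint idea can be sketched roughly as "restore the newest snapshot taken at or before the cut point, then re-process only the tokens after it". The C++ below is a hedged illustration with invented names, not code from that branch.)

```cpp
// Hypothetical illustration of recurrent-state checkpoints.
#include <cstdint>
#include <map>
#include <vector>

struct recurrent_state { std::vector<float> data; };

struct checkpointed_state {
    recurrent_state current;
    // snapshots keyed by the number of tokens processed when taken
    std::map<int32_t, recurrent_state> checkpoints;

    void maybe_checkpoint(int32_t n_past) {
        if (n_past % 64 == 0) {            // e.g. snapshot every 64 tokens
            checkpoints[n_past] = current; // copy of the fixed-size state
        }
    }

    // Remove tokens so that only the first n_keep remain.
    // Returns the position from which processing must resume.
    int32_t rollback(int32_t n_keep) {
        // find the newest checkpoint taken at or before n_keep
        auto it = checkpoints.upper_bound(n_keep);
        if (it == checkpoints.begin()) {
            current = {}; // no usable checkpoint: re-process from scratch
            return 0;
        }
        --it;
        current = it->second;
        // only tokens in (it->first, n_keep] need re-processing,
        // instead of the whole prompt
        return it->first;
    }
};
```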
@compilade Thank you so much for taking this on. I have been trying on my own, but failing miserably, to get Jamba quantized with llama.cpp. I have been prepping by training as many Jamba models as possible, since that is more my wheelhouse. For your endeavors, could I 'Buy You a Coffee' to help support the work? I know this extra work isn't easy by any means.
Could somebody write about why quantizing Jamba and providing a GGUF is difficult? Thanks. Gordon.
@severian42 I appreciate the offer (it means a lot!), but I can't accept for now. Receiving international donations seems a bit complicated accounting-wise, and I don't want to have to think about this (yet). Still, it's nice to know you want this to succeed!
Well, I don't see it as "work", more like exploring ideas. I like to be able to deeply think through some hard problems.
@erlebach The main difficulty is how the state is managed: some layers (the Attention layers) will use the KV cache, while others (the Mamba layers) will use recurrent states. This is what is taking the most effort/time to implement, since the API around copying, removing, and allocating KV cells needs to be re-thought to support both types of cache at the same time. I have more free time these days, so my work-in-progress at master...compilade/refactor-kv-cache should advance a bit quicker than in the past weeks/month, though I'm currently working on simplifying it first.

Quantization will likely not be a problem, since it seemed to work well enough for bigger Mamba models. I don't know why people keep repeating that it can't be quantized. The internal Mamba-specific stuff can't, but even in pure Mamba models that is less than ~5% of the weights, while the rest of the space is taken up by linear projections, which can be quantized.
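(To illustrate the quantization point above: a converter could decide per tensor whether to quantize it, skipping the small SSM-specific tensors. The name patterns below are assumptions made for this sketch, not the exact tensor names llama.cpp uses.)

```cpp
// Hypothetical per-tensor quantization policy: keep the small,
// precision-sensitive SSM internals in full precision, and quantize
// the large linear projections that make up the bulk of the weights.
#include <initializer_list>
#include <string>

bool should_quantize(const std::string & tensor_name) {
    // SSM internals and norms: small, left unquantized
    for (const char * skip : {"ssm_conv1d", "ssm_a", "ssm_d", "ssm_dt", "norm"}) {
        if (tensor_name.find(skip) != std::string::npos) {
            return false;
        }
    }
    // everything else (attention, FFN, in/out projections) is quantizable;
    // even in pure Mamba models this is ~95% of the parameters
    return true;
}
```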
@nonetrix Thanks for reminding others that they too can help. (EDIT: hey, your comment was useful, you didn't need to delete it) For examples of how to help:
No, it was somewhat mean-spirited. I shouldn't have said what I said; I apologize.
Thank you for the response. I was simply curious, since it was the first time I noticed a quantization effort take so much time. Truly, I appreciate all the hard work you guys put into this. Good luck!
Feature Description
A new MoE model was released today: JambaForCausalLM
https://huggingface.co/ai21labs/Jamba-v0.1

Motivation
Another very good and open LLM.
Possible Implementation
I can test any PR candidate