Support for Jamba JambaForCausalLM #6372

Open
4 tasks done
maziyarpanahi opened this issue Mar 28, 2024 · 14 comments
Labels
enhancement New feature or request

Comments

@maziyarpanahi
maziyarpanahi commented Mar 28, 2024

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Feature Description

Please provide a detailed written description of what you were trying to do, and what you expected llama.cpp to do as an enhancement.

A new MoE model was released today: JambaForCausalLM https://huggingface.co/ai21labs/Jamba-v0.1

Motivation

Please provide a detailed written description of reasons why this feature is necessary and how it is useful to llama.cpp users.

Another very good and open LLM

Possible Implementation

If you have an idea as to how it can be implemented, please write a detailed description. Feel free to give links to external sources or share visuals that might be helpful to understand the details better.

I can test any PR candidate

@maziyarpanahi maziyarpanahi added the enhancement New feature or request label Mar 28, 2024
@nonetrix

Have smaller Mamba based LLMs already been added in the past?

@Green-Sky
Collaborator

@compilade added mamba support. But Jamba seems to be a derivative and needs code modifications.

@compilade
Collaborator

I'd like it very much if they released a smaller version of their model. I don't have enough RAM to run Mixtral (only have 8GB), and Jamba seems to be around the same size as Mixtral. A model with less than 1B total parameters (or even less than 200M) would be ideal for quickly figuring out implementation problems (and would waste much less disk space when debugging or modifying model conversion).

My free time is too scarce at the moment to work on this (until May). The KV cache of this model will be a complicated beast: it is both recurrent and attention-based, but never in the same "layer". Supporting it will require rethinking how the KV cache is allocated and how Mamba's state is stored, but I think it should still be possible eventually, given enough effort.

Similarly to llm_build_ffn, I think there will need to be some kind of llm_build_mamba to more easily share the code building the graph of a Mamba block between Mamba and Jamba.

Anyone wanting to work on this should start by building a strong mental model of how Mamba's state is managed in llama.cpp, as well as how the KV cache works (at least what goes where, not necessarily why). This is necessary because modifications of both of these will likely be needed to make this work.

Mamba in llama.cpp uses one KV cell per sequence, whereas attention uses one KV cell per token. We'll probably need to introduce tensor lists other than k_l and v_l in llama_kv_cache so that the two don't conflict; that means a different set of cells, and yet another session file format revision. Sequences are selected with inp_s_seq in ggml_compute_forward_ssm_conv_f32 and ggml_compute_forward_ssm_scan_f32. Each token in a batch has one input state/sequence, but the resulting state is copied to all the sequences assigned to that token.
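
To make the "one state per sequence, result copied to every sequence of the token" behaviour concrete, here is a minimal Python sketch. It is not llama.cpp code: the class name, the scalar state, the decaying-sum update, and the choice of the first sequence id as the input state are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RecurrentStates:
    """Toy store with one state slot per sequence id (hypothetical layout)."""
    states: Dict[int, float] = field(default_factory=dict)

    def step(self, token_value: float, seq_ids: List[int]) -> float:
        # Assumption: the first sequence id selects the single input state.
        src = seq_ids[0]
        prev = self.states.get(src, 0.0)
        # Stand-in for the real SSM update: a simple decaying accumulator.
        new = 0.5 * prev + token_value
        # The resulting state is copied to every sequence the token belongs to.
        for sid in seq_ids:
            self.states[sid] = new
        return new
```

However many tokens are processed, the store holds at most one entry per sequence, which is the key difference from attention's one cell per token.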

Simplifying how recurrent state operations are implemented is on my TODO list, and implementing both Jamba and RWKV should help with refactoring, but Jamba support in llama.cpp feels like a multi-week project, and I'll only have this kind of free time in May.

If anyone's too impatient, feel free to experiment and figure out a way to make Jamba work with llama.cpp. Even incomplete proofs of concept of how to manage the Jamba blocks should be useful.

@maziyarpanahi
Author

@compilade added mamba support. But Jamba seems to be a derivative and needs code modifications.

for reference: #5328

@sorasoras

Have smaller Mamba based LLMs already been added in the past?

It's not purely Mamba-based any more. It's a mix of transformer and Mamba layers, so it's going to be different.

@trap20

trap20 commented Apr 1, 2024

There is a Mini-Jamba on Huggingface now: https://huggingface.co/TechxGenus/Mini-Jamba-v2

Might be helpful for testing - if it actually is a working Mini-Jamba model, haven't checked that yet.

@severian42

Just checking to see if anyone has come close to getting Jamba working here. I've been working on fine-tuning and training some new general chat Jamba models in prep for when they can be more standardized for everyone. Once we can get Jamba as a GGUF, I think it'll do some awesome stuff for all of us.

https://huggingface.co/Severian/Jamba-Hercules
https://huggingface.co/Severian/Jamba-Nexus-IKM

@Any-Winter-4079

Any update on Jamba support?

@compilade
Collaborator

Any update on Jamba support?

I've worked on refactoring the KV cache in the past weeks to allow managing both recurrent states and Attention's KV cache at once. (See master...compilade/refactor-kv-cache) It's still a work in progress: state checkpoints (necessary to avoid re-processing the whole prompt when removing the last few tokens) are implemented, but not yet handled in the server. I'll open a PR when it is ready. I still need more time to think through the implementation (currently very busy with other things).

After that, work on specific hybrid model architectures like Jamba will be possible.
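
The state checkpoints mentioned above can be illustrated with a small Python sketch. It is not the code from the linked branch: the class, the fixed checkpoint interval, and the scalar decaying-sum state are hypothetical stand-ins. The idea shown is that a recurrent state cannot "forget" its last tokens, so snapshots are kept and only the tokens after the nearest snapshot are replayed.

```python
from typing import Dict, List


class CheckpointedState:
    """Toy recurrent state with periodic snapshots (hypothetical design)."""

    def __init__(self, interval: int = 4):
        self.interval = interval
        self.state = 0.0
        self.tokens: List[float] = []
        # Maps "number of tokens processed" -> state at that point.
        self.checkpoints: Dict[int, float] = {0: 0.0}

    def _update(self, tok: float) -> None:
        # Stand-in for the real recurrent update.
        self.state = 0.5 * self.state + tok

    def append(self, tok: float) -> None:
        self._update(tok)
        self.tokens.append(tok)
        n = len(self.tokens)
        if n % self.interval == 0:
            self.checkpoints[n] = self.state

    def truncate(self, new_len: int) -> int:
        """Keep the first new_len tokens; return how many were replayed."""
        if not 0 <= new_len <= len(self.tokens):
            raise ValueError("new_len out of range")
        # Restore the latest snapshot at or before the target length.
        base = max(p for p in self.checkpoints if p <= new_len)
        self.checkpoints = {p: s for p, s in self.checkpoints.items() if p <= new_len}
        self.state = self.checkpoints[base]
        replay = self.tokens[base:new_len]
        self.tokens = self.tokens[:new_len]
        for tok in replay:
            self._update(tok)
        return len(replay)
```

With an interval of 4 and 10 tokens processed, truncating to 9 tokens replays a single token instead of all nine. The trade-off is memory: each snapshot is a full copy of the state.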

@severian42

@compilade Thank you so much for taking this on. I have been trying on my own but failing miserably to get Jamba quantized with llama.cpp

I have been prepping by training as many Jamba models as possible since that is more my wheelhouse

For your endeavors, could I 'Buy You a Coffee' to help support? I know this extra work isn't easy by any means

@erlebach

erlebach commented May 3, 2024

Could somebody write about why quantizing Jamba and providing a gguf is difficult? Thanks. Gordon.

@compilade
Collaborator

compilade commented May 4, 2024

For your endeavors, could I 'Buy You a Coffee' to help support?

@severian42 I appreciate the offer (it means a lot!), but I can't accept for now. Receiving international donations seems a bit complicated accounting-wise and I don't want to have to think about this (yet). Still nice to know you want this to succeed!

I know this extra work isn't easy by any means

Well, I don't see it as "work", more like exploring ideas. I like to be able to deeply think through some hard problems, and llama.cpp has plenty of that. :)

Could somebody write about why quantizing Jamba and providing a gguf is difficult? Thanks. Gordon.

@erlebach The main difficulty is how the state is managed; some layers (the Attention layers) will use the KV cache while others (the Mamba layers) will use recurrent states. This is what is taking the most effort/time to implement, since the API around copying, removing and allocating KV cells needs to be re-thought to support both types of cache at the same time.
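
A rough Python sketch of why one API has to serve both kinds of cache is shown below. Everything in it is hypothetical (the class, the method names seq_cp and seq_rm_tail, the scalar state); it only illustrates that copying is easy for both layer types, while removing the last few tokens is easy for attention cells and not possible for a folded recurrent state without extra machinery.

```python
from typing import Dict, List, Tuple


class HybridCache:
    """Toy cache: attention layers keep per-token cells, Mamba layers one state."""

    def __init__(self, layer_kinds: List[str]):
        self.layer_kinds = list(layer_kinds)                 # e.g. ["mamba", "attention"]
        self.kv: Dict[Tuple[int, int], List[float]] = {}     # (layer, seq) -> cells
        self.rs: Dict[Tuple[int, int], float] = {}           # (layer, seq) -> state
        self.n_tokens: Dict[int, int] = {}                   # seq -> length

    def append(self, seq_id: int, token: float) -> None:
        for il, kind in enumerate(self.layer_kinds):
            if kind == "attention":
                self.kv.setdefault((il, seq_id), []).append(token)
            else:
                prev = self.rs.get((il, seq_id), 0.0)
                self.rs[(il, seq_id)] = 0.5 * prev + token   # stand-in update
        self.n_tokens[seq_id] = self.n_tokens.get(seq_id, 0) + 1

    def seq_cp(self, src: int, dst: int) -> None:
        # Copying works for both kinds: duplicate the cells or the state.
        for il, kind in enumerate(self.layer_kinds):
            if kind == "attention":
                self.kv[(il, dst)] = list(self.kv.get((il, src), []))
            else:
                self.rs[(il, dst)] = self.rs.get((il, src), 0.0)
        self.n_tokens[dst] = self.n_tokens.get(src, 0)

    def seq_rm_tail(self, seq_id: int, new_len: int) -> bool:
        """Keep only the first new_len tokens; return False when refused."""
        cur = self.n_tokens.get(seq_id, 0)
        if not 0 <= new_len <= cur:
            return False
        if new_len == cur:
            return True
        has_recurrent = any(k != "attention" for k in self.layer_kinds)
        if has_recurrent and new_len > 0:
            return False  # a folded state cannot drop only its last tokens
        for il, kind in enumerate(self.layer_kinds):
            if kind == "attention":
                self.kv[(il, seq_id)] = self.kv.get((il, seq_id), [])[:new_len]
            else:
                self.rs.pop((il, seq_id), None)
        self.n_tokens[seq_id] = new_len
        return True
```

In this toy, a partial removal on a hybrid model is simply refused, which forces the caller to re-process the prompt. Avoiding that cost is what motivates keeping state checkpoints.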

I have more free time these days, so my work-in-progress of the above at master...compilade/refactor-kv-cache should advance a bit quicker than in the past weeks/month. That said, I'm currently working on simplifying convert-hf-to-gguf.py (#7031) to use lazy operations (#7075), which avoids having all the weights of a model in RAM during conversion. This should make testing the conversion of big models (like Jamba, with its 100GB of bfloat16 weights) much easier and far less memory-hungry (and/or less disk-hungry if the --use-temp-file option was used).
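
The general idea behind lazy operations can be sketched in a few lines of Python. This is not the implementation from #7075: LazyTensor, map, materialize, and convert are hypothetical names, and plain lists stand in for real tensors. The point is that transformations are only recorded, and each tensor is loaded and transformed right when it is written, so only one tensor needs to be in memory at a time.

```python
from typing import Callable, Iterable, List, Tuple


class LazyTensor:
    """Records a loader plus pending transformations (hypothetical helper)."""

    def __init__(self, load: Callable[[], List[float]],
                 ops: Tuple[Callable[[List[float]], List[float]], ...] = ()):
        self._load = load
        self._ops = ops

    def map(self, fn: Callable[[List[float]], List[float]]) -> "LazyTensor":
        # Nothing is loaded here; the operation is only queued.
        return LazyTensor(self._load, self._ops + (fn,))

    def materialize(self) -> List[float]:
        # Load the data and apply the queued operations in order.
        data = self._load()
        for fn in self._ops:
            data = fn(data)
        return data


def convert(tensors: Iterable[Tuple[str, LazyTensor]],
            write: Callable[[str, List[float]], None]) -> None:
    # Each tensor is materialized just before it is written, then dropped.
    for name, lazy in tensors:
        write(name, lazy.materialize())
```

A conversion script built this way can describe the whole output file up front while keeping peak memory near the size of the largest single tensor.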

Quantization will likely not be a problem, since it seemed to work well enough for bigger Mamba models. I don't know why people keep repeating it can't be quantized. The internal Mamba-specific stuff can't, but even in pure Mamba models it's less than ~5% of the weights, while the rest of the space is taken up by linear projections, which can be quantized.
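
The "less than ~5%" figure can be checked with a short calculation for a single Mamba block. The sketch below makes several assumptions that are not stated in this thread: the reference Mamba defaults (d_state=16, d_conv=4, expand=2, dt_rank=ceil(d_model/16)), a d_model of 2560 as an example size, and a particular split between the large projections that get quantized and the small tensors that stay in full precision. Norm weights and embeddings are left out.

```python
from typing import Dict, Tuple


def mamba_block_params(d_model: int, d_state: int = 16, d_conv: int = 4,
                       expand: int = 2) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Parameter counts for one Mamba block, split into (big, small) groups."""
    d_inner = expand * d_model
    dt_rank = -(-d_model // 16)  # ceil(d_model / 16)
    # Large linear projections: assumed to be the quantizable part.
    big = {
        "in_proj": d_model * 2 * d_inner,
        "out_proj": d_inner * d_model,
    }
    # Small Mamba-specific tensors: assumed to stay unquantized.
    small = {
        "conv1d": d_inner * d_conv + d_inner,        # weight + bias
        "x_proj": d_inner * (dt_rank + 2 * d_state),
        "dt_proj": dt_rank * d_inner + d_inner,      # weight + bias
        "A_log": d_inner * d_state,
        "D": d_inner,
    }
    return big, small


big, small = mamba_block_params(2560)
fraction = sum(small.values()) / (sum(big.values()) + sum(small.values()))
print(f"unquantized share per block: {fraction:.2%}")
```

Under these assumptions the small tensors account for about 4.7% of a block's weights, consistent with the estimate above. In a hybrid model with attention and MoE layers the share would be smaller still, since those layers add only large matrices.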

Feel free to contribute code if you are though, you could help out @compilade which seems to be one piece of the puzzle

@nonetrix Thanks for reminding others that they too can help. (EDIT: hey, your comment was useful, you didn't need to delete it)

For examples of how to help:

  • train Jamba finetunes
    • this makes the implementation in llama.cpp more worth it
  • constructively answer "are we there yet?" messages sent here
    • this gives me time for a more thoughtful message later, like this one
  • share ideas/code of how to manage recurrent states at the same time as the KV cache
    • this is more complicated than it sounds because it's necessary to manage sequences and some operations like copies and (sometimes partial) removal.
    • a simple data structure for this would be nice
      • In the branch I've linked before, I've implemented a tree of sequences, to allow for shared checkpoints, but simpler ways probably exist
    • a way to manage the allocation of that with different buffer sizes per layer
  • feedback on the code
    • this will be easier once I open a pull request for this.
  • test the code once it seems ready
    • my hardware is limited, so I will need help with testing. I'll make sure to announce it here once I get to this point.
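
As a rough illustration of the "different buffer sizes per layer" point in the list above, the sketch below estimates cache bytes per layer for a hybrid model. The function, its parameters, and the toy dimensions are hypothetical. The sizing formulas are assumptions based on the architectures: attention keeps K and V for every context position, while a Mamba layer keeps a convolution state of (d_conv - 1) * d_inner values plus an SSM state of d_state * d_inner values per sequence.

```python
def layer_cache_bytes(kind: str, n_ctx: int, n_seq: int, *, n_embd_kv: int,
                      d_inner: int, d_state: int, d_conv: int,
                      kv_bytes: int = 2, state_bytes: int = 4) -> int:
    """Estimated cache size for one layer (hypothetical sizing rules)."""
    if kind == "attention":
        # K and V, one row per context position.
        return 2 * n_ctx * n_embd_kv * kv_bytes
    if kind == "mamba":
        # One fixed-size state per sequence, independent of context length.
        conv_state = (d_conv - 1) * d_inner
        ssm_state = d_state * d_inner
        return n_seq * (conv_state + ssm_state) * state_bytes
    raise ValueError(f"unknown layer kind: {kind}")


# Toy dimensions, not taken from any real model.
dims = dict(n_embd_kv=256, d_inner=512, d_state=16, d_conv=4)
layers = ["mamba", "mamba", "mamba", "attention"]
sizes = [layer_cache_bytes(k, n_ctx=1024, n_seq=2, **dims) for k in layers]
print(sizes)
```

The attention buffer grows with the context length and the Mamba buffers grow with the number of sequences, so a single uniform per-layer allocation would waste memory on one kind or the other.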

@nonetrix

nonetrix commented May 5, 2024

Feel free to contribute code if you are though, you could help out @compilade which seems to be one piece of the puzzle

@nonetrix Thanks for reminding others that they too can help. (EDIT: hey, your comment was useful, you didn't need to delete it)

No, it was somewhat mean-spirited. I shouldn't have said what I said. I apologize.

@erlebach

Thank you for the response. I was simply curious since it was the first time I noticed a quantization effort take so much time. Truly, I appreciate all the hard work you guys put into this. Good luck!
