Feature Request: Support Codestral Mamba #8519

Open
VelocityRa opened this issue Jul 16, 2024 · 11 comments · May be fixed by #9126
Labels
enhancement New feature or request

Comments

@VelocityRa commented Jul 16, 2024:

Feature Description

New 7B coding model just released by Mistral.

Motivation

Seems to perform very well, especially for a 7B model:

[image: benchmark results for Codestral Mamba]

Possible Implementation

An extension to #7727?

@VelocityRa added the enhancement (New feature or request) label on Jul 16, 2024
@HanClinto (Collaborator) commented:
I love the shout-out in the linked blog post!

You can deploy Codestral Mamba using the mistral-inference SDK, which relies on the reference implementations from Mamba’s GitHub repository. The model can also be deployed through TensorRT-LLM. For local inference, keep an eye out for support in llama.cpp. You may download the raw weights from HuggingFace.

That's a really nice nod -- love to see it!

@theo77186 commented:
#7727 should cover this model, though it uses untied embeddings, unlike the other Mamba-2 models.

@timlacroix commented:
FYI, there is an "ngroups" param that changes how the layer norm is done: https://github.com/state-spaces/mamba/blob/c0a00bd1808881831ddf43206c69362d4df90cf7/mamba_ssm/modules/mamba2.py#L47

We use ngroups=8. If you forget it or try with ngroups = 1, you'll have a bad time.
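(For illustration, here is a rough NumPy sketch of what that parameter changes: the gated RMSNorm before the output projection is computed per group of channels rather than over the whole hidden dimension. Names and shapes are illustrative assumptions, not the reference implementation.)

```python
import numpy as np

def gated_rmsnorm_grouped(y, z, weight, ngroups=8, eps=1e-5):
    """Sketch of a grouped, gated RMSNorm.

    y: SSM output, shape (..., d_inner); z: gate of the same shape;
    weight: learned scale of shape (d_inner,).
    """
    y = y * (z / (1.0 + np.exp(-z)))              # gate with silu(z)
    *lead, d = y.shape
    yg = y.reshape(*lead, ngroups, d // ngroups)  # normalize within each group
    rms = np.sqrt(np.mean(yg * yg, axis=-1, keepdims=True) + eps)
    return (yg / rms).reshape(*lead, d) * weight
```

With ngroups = 1 the mean is taken over all of d_inner, which gives different normalization statistics and (per the comment above) wrong outputs for this model.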

Good luck!

@ggerganov (Owner) commented:
After we merge #8526 we should try to add full support for this model. cc @compilade

@0wwafa commented Jul 17, 2024:

I'd love this.

@txhno commented Jul 18, 2024:

thanks!

@fredconex commented:
Hey guys, any progress or an ETA for this?

@rmusser01 commented:
For anyone else: it seems this is waiting on #8526, which is waiting on #8980, which is waiting on review(?).

@compilade (Collaborator) commented Aug 17, 2024:

Some progress report: I have a local branch (not yet public) on top of #8526 in which I've started implementing the graph for Mamba-2. The conv step is very similar to Mamba-1, and I've started to implement the SSM step and will continue in the next days. It's not in a usable state yet.

I'm starting by implementing the fully recurrent mode of Mamba-2 (which is very similar to Mamba-1, and which is described in Section 3.4.1).

But I'm still evaluating how the block decomposition would fit within how src/llama.cpp manages batches, and whether the chunk size should be dynamic. To fully benefit from Section 6, it seems the chunks should be smaller than the batch size, but not too small, at which point directly doing the recurrence costs the same. Since the ggml compute graph nodes should keep the same structure between batches, and since the block decomposition will likely have too much overhead for small batches, it's easier to simply go with the linear recurrence with something like ggml_ssm_scan at first.
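(For readers following along, here is a toy NumPy sketch of that fully recurrent form, i.e. the per-token recurrence from Section 3.4.1 with a per-head scalar decay. The shapes, the single shared B/C group, and the function name are assumptions for illustration; this is not the ggml_ssm_scan signature.)

```python
import numpy as np

def mamba2_recurrent_scan(x, dt, A, B, C, D):
    """Toy per-token recurrence: h_t = exp(dt*A) h_{t-1} + dt*B x_t, y_t = C h_t + D x_t.

    x: (T, H, P) inputs per head, dt: (T, H), A: (H,) per-head scalar,
    B, C: (T, N) shared across heads (single group), D: (H,) skip scale.
    """
    T, H, P = x.shape
    N = B.shape[-1]
    h = np.zeros((H, P, N))                           # recurrent state per head
    y = np.empty_like(x)
    for t in range(T):
        dA = np.exp(dt[t] * A)                        # (H,) per-head decay
        h = dA[:, None, None] * h \
            + (dt[t, :, None] * x[t])[:, :, None] * B[t][None, None, :]
        y[t] = np.einsum("hpn,n->hp", h, C[t]) + D[:, None] * x[t]
    return y
```

The block (chunked) decomposition from the paper computes the same outputs but batches the work over chunks of tokens, which only pays off once the batch is large enough.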

For the ETA, I'll try to get it working before the end of August, but no promises.

(and BTW @rmusser01, #8980 is waiting on #8526, not the other way around, at least I think?)

@compilade (Collaborator) commented Aug 19, 2024:

Okay, the fully recurrent mode works for Mamba-2! (for the curious, see this branch: https://github.com/compilade/llama.cpp/tree/compilade/mamba2) I'll open a PR soon (in the next days; still need to clean up some things).

Note that Mamba-Codestral-7B-v0.1 cannot be converted as-is; either use https://huggingface.co/mistralai/Mamba-Codestral-7B-v0.1/discussions/9, or rename consolidated.safetensors to model.safetensors, tokenizer.model.v3 to tokenizer.model, and params.json to config.json. Then, in config.json, the line "architectures": ["Mamba2ForCausalLM"], needs to be added (if missing).
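(A small helper sketch of the renames and config patch described above, assuming a local copy of the HuggingFace repo; the directory path is just an example.)

```python
import json
from pathlib import Path

d = Path("Mamba-Codestral-7B-v0.1")   # example path to a local copy of the HF repo

# Rename the files as described above (skip any that were already renamed).
for src, dst in [("consolidated.safetensors", "model.safetensors"),
                 ("tokenizer.model.v3", "tokenizer.model"),
                 ("params.json", "config.json")]:
    if (d / src).exists():
        (d / src).rename(d / dst)

# Add the "architectures" entry to config.json if it is missing.
cfg = json.loads((d / "config.json").read_text())
cfg.setdefault("architectures", ["Mamba2ForCausalLM"])
(d / "config.json").write_text(json.dumps(cfg, indent=2) + "\n")
```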

The state in Mamba-2 is bigger than I thought; Mamba-Codestral-7B-v0.1 takes 263.5 MiB (in F32) per sequence (e.g. with -np 1), compared to 38 MiB for Falcon-Mamba-7B (which is based on Mamba-1). But that remains constant regardless of the context size.
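(As a back-of-envelope check of that per-sequence figure, assuming the published Mamba-Codestral-7B-v0.1 hyperparameters; treat the values below as assumptions.)

```python
# Per-sequence recurrent state = SSM state + conv state, stored in F32.
n_layer, d_model, expand, d_state, n_groups, d_conv = 64, 4096, 2, 128, 8, 4
d_inner = expand * d_model                                   # 8192
ssm_elems  = n_layer * d_inner * d_state                     # SSM state elements
conv_elems = n_layer * (d_conv - 1) * (d_inner + 2 * n_groups * d_state)
print((ssm_elems + conv_elems) * 4 / 2**20, "MiB in F32")    # -> 263.5 MiB
```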

A big downside right now with recurrent models in llama.cpp is the lack of state rollback (which is implemented through state checkpoints in #7531, but needs to be re-adapted to #8526), so the prompt will be reprocessed a lot if using llama-server. I think using llama-cli in conversation mode does not have this problem, however (or maybe only the bare interactive mode with --in-prefix and --in-suffix, not sure).

The implementation is CPU-only, but uses SIMD for the SSM scan, so even though the state is bigger than for Mamba-1 models, in my tests the speed of Mamba-2-130M is similar to or better than Mamba-130M (but still not that fast compared to transformer-based models with an empty context).

The speed of Mamba-2 models seems comparable to Transformer-based models when the latter have 2k to 4k tokens in their context.

Just making sure expectations are not too far from reality.

@compilade linked a pull request (#9126) on Aug 21, 2024 that may close this issue
@github-actions bot added the stale label on Sep 19, 2024
@github-actions bot (Contributor) commented Oct 4, 2024:

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions bot closed this as completed on Oct 4, 2024
@Galunid reopened this on Oct 4, 2024
@compilade removed the stale label on Oct 4, 2024