
Add Support for IBM Granite #7116

Closed
YorkieDev opened this issue May 7, 2024 · 37 comments · Fixed by #7481
Labels
enhancement New feature or request

Comments

@YorkieDev

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • [✅] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • [✅] I carefully followed the README.md.
  • [✅] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [✅] I reviewed the Discussions, and have a new bug or useful enhancement to share.

Feature Description

IBM recently released their Granite models: a series of 3B to 34B coding models with base and instruct finetunes.

https://huggingface.co/collections/ibm-granite/granite-code-models-6624c5cec322e4c148c8b330
https://github.com/ibm-granite

Many thanks to the llama.cpp community for their awesome work! It would be great to see this feature added. GGUFs can already be made, but when you try to load them you get a tokenizer error.

@YorkieDev added the enhancement (New feature or request) label May 7, 2024
@sroecker
Contributor

sroecker commented May 7, 2024

The PR to add granite support for transformers (add MLP bias - gate, up, down) can be found here: https://github.com/huggingface/transformers/pull/30031/files

@psyv282j9d

Based on the discussion in the transformers mlp_bias PR, it's similar to Llama with just the mlp_bias added.

@sroecker
Contributor

sroecker commented May 8, 2024

I tried to do this here: https://github.com/sroecker/llama.cpp/tree/add_mlp_bias
Just adding bias to FFN_GATE, FFN_DOWN and FFN_UP.
The tensor shapes seem to be correct but the model outputs gibberish.

I ran ./main -m ~/Downloads/granite-3b-code-base.Q8_0.gguf -p "Question: Python code to calculate the Fibonacci series\n\nAnswer:\n" with the GGUF from https://huggingface.co/NikolayKozloff/granite-3b-code-instruct-Q8_0-GGUF

@mayank31398

@sroecker are you tying the word embeddings? Unlike Llama, the input word embeddings and the output projection matrix are tied for the Granite models.

@sroecker
Contributor

sroecker commented May 8, 2024 via email

@JohnClaw

JohnClaw commented May 8, 2024

(or save the mlp_bias boolean in the GGUF

Is there a way to add mlp_bias to an already made GGUF? I'm asking because you mentioned my Q8 GGUF in one of your previous messages.

@sroecker
Contributor

sroecker commented May 8, 2024

(or save the mlp_bias boolean in the GGUF

Is there a way to add mlp_bias to an already made GGUF? I'm asking because you mentioned my Q8 GGUF in one of your previous messages.

You could hack something together with the gguf writer: https://pypi.org/project/gguf/
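
As a starting point, here is a minimal read-only sketch (assuming the gguf Python package linked above; the file path is just an example). It only lists the metadata keys of an existing GGUF so you can see what a patched file would need to carry; actually adding a new key means rewriting the file with all fields and tensors copied over, which is not shown here.

# Minimal sketch: inspect the metadata keys of an existing GGUF.
# Adding a new key (e.g. an mlp_bias flag) would require rewriting the
# file and copying every field and tensor, which is omitted here.
from gguf import GGUFReader

reader = GGUFReader("granite-3b-code-instruct.Q8_0.gguf")  # example path
for name in reader.fields:
    print(name)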

@sroecker
Contributor

sroecker commented May 8, 2024

So I've adapted build_llama to include the MLP biases as well. I've added a few FIXMEs to my branch to indicate places that might need to be adapted for the different Granite models.
Now the output is valid text but unfortunately repeats itself:
<|endoftext|>Question: Fibonacci series in Python? \n\nAnswer: Python? \n\n series in Python? \n\n series in Python? \n\n series in Python? \

@dataf3l

dataf3l commented May 8, 2024

I'm here to write words of support, I am interested in exploring what IBM + OLLAMA can do

@adrianpuiu

I'm here to write words of support, I am interested in exploring what IBM + OLLAMA can do

+1 to this

sroecker added a commit to sroecker/llama.cpp that referenced this issue May 9, 2024
Add optional MLP bias for ARCH_LLAMA to support Granite models.
Partially addresses ggerganov/issues/7116
Still needs some more changes to properly support Granite.
@davideuler

Is there any progress for the support of Granite models?

@jpodivin
Contributor

AFAIK, we have been stuck on the issue of repeating text output. It appears that the tokenizer is the culprit, but it does seem to be in order (correct token ids, etc.). I don't know if @sroecker has made any strides since.

@sroecker
Contributor

AFAIK, we have been stuck on the issue of repeating text output. It appears that the tokenizer is the culprit, but it does seem to be in order (correct token ids, etc.). I don't know if @sroecker has made any strides since.

Yes, unfortunately. The lab version of Granite works well with llama.cpp: https://huggingface.co/instructlab/granite-7b-lab-GGUF It doesn't have the MLP bias nodes and uses a different tokenizer, though.
I've tried a few things regarding tokenization. I checked that the tokenizer creates the same input tokens with ./main -m granite-3b-code-base.gguf -p "def generate():" -ngl 0 --override-kv tokenizer.ggml.add_bos_token=bool:false:
0 -> '<|endoftext|>' 589 -> 'def' 4450 -> ' generate' 2262 -> '():'
I also recreated the f16 GGUF forcing the pre-tokenizer to be llama-bpe instead of refact. No luck so far. There's a lot of arch-specific code all over llama.cpp which might change important parameters, so I'm thinking about creating a simple debugging example based on examples/simple/simple.cpp.
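
For reference, the same prompt can be cross-checked against the HF tokenizer with a minimal sketch like the following (assuming the transformers package and access to the public ibm-granite/granite-3b-code-base repo):

# Print the token ids the reference HF tokenizer produces for the same
# prompt, to compare against the ids reported by ./main above.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("ibm-granite/granite-3b-code-base")
ids = tok("def generate():", add_special_tokens=False)["input_ids"]
print(list(zip(ids, tok.convert_ids_to_tokens(ids))))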

@mayank31398

the lab version is a different model
not to be confused with this one

@sroecker
Contributor

the lab version is a different model not to be confused with this one

I'm aware of that; it did work out of the box with the LLM_ARCH_LLAMA settings though, so I'm trying to find out why exactly. But you're right to point this out, as a few people have mixed these up.

I will check the convert-hf-to-gguf-update.py script again to rule out the tokenizer before I start digging deeper.

@mayank31398

Hmm, a quick question: are we tying the word embeddings and the output logits matrix?
Llama doesn't do that and Granite has tied embeddings.
Maybe that's the issue?
I don't think the tokenizer should be the issue since all Granite models use the StarCoder tokenizer.

@sroecker
Contributor

Hmm, a quick question: are we tying the word embeddings and the output logits matrix? Llama doesn't do that and Granite has tied embeddings. Maybe that's the issue? I don't think the tokenizer should be the issue since all Granite models use the StarCoder tokenizer.

If no output layer is found, the word embeddings are used instead:

llama.cpp/llama.cpp

Lines 4926 to 4932 in 5416002

model.output = ml.create_tensor(ctx_output_split, tn(LLM_TENSOR_OUTPUT, "weight"), {n_embd, n_vocab}, false);
// if output is NULL, init from the input tok embed
if (model.output == NULL) {
    model.output = ml.create_tensor(ctx_output, tn(LLM_TENSOR_TOKEN_EMBD, "weight"), {n_embd, n_vocab});
    ml.n_created--; // artificial tensor
    ml.size_data += ggml_nbytes(model.output);
}
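
A quick way to check which case a converted GGUF actually hit is to list its tensor names (a minimal sketch assuming the gguf Python package; the path is an example):

# If "output.weight" is missing but "token_embd.weight" is present, the
# fallback above kicks in and the embeddings are effectively tied.
from gguf import GGUFReader

reader = GGUFReader("granite-3b-code-base-f16.gguf")  # example path
names = {t.name for t in reader.tensors}
print("output.weight" in names, "token_embd.weight" in names)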

@mayank31398

Hmm, ok so there are these differences between llama and granite:

  1. attention has bias (llama doesn't)
  2. mlp has bias (llama doesn't)
  3. tied word embeddings (llama doesn't)
  4. starcoder tokenizer
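
For a quick check of these flags in a local HF checkout, here is a small sketch (the field names follow the transformers mlp_bias PR linked earlier plus the standard LlamaConfig keys, and the path is only an example):

# Print the config flags where Granite differs from plain Llama.
import json

with open("granite-3b-code-base/config.json") as f:  # example local path
    cfg = json.load(f)

for key in ("attention_bias", "mlp_bias", "tie_word_embeddings", "vocab_size"):
    print(f"{key}: {cfg.get(key, '<not set>')}")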

sroecker added a commit to sroecker/llama.cpp that referenced this issue May 15, 2024
Tie the weights for ARCH_STARCODER to support the larger Granite code models.
Partially addresses ggerganov/issues/7116

There still remains to be a few things to fix.
Currently requires `--override-kv tokenizer.ggml.add_bos_token=bool:false`
@sroecker
Contributor

Hmm, ok so there are these differences between llama and granite:

  1. attention has bias (llama doesn't)
  2. mlp has bias (llama doesn't)
  3. tied word embeddings (llama doesn't)
  4. starcoder tokenizer

Do all Granite code models use the StarCoder tokenizer? Based on your HF repo comment I tried to get the 20b and 34b to run. They are recognized as Starcoder arch by the convert-hf-to-gguf script, and all I had to modify was to tie the embedding weights. 20b instruct works quite well, even with the BOS token. The Q3_K_L quant comes down to 11GB.
Please have a try with these changes:
sroecker@6f20148

For the 3b and 8b models, 1) and 4) remain. We have to check whether the attention bias is set up correctly in llm_build_kv; build_refact should be a good reference for comparison.

@mayank31398

Yeah, all of them are using the StarCoder tokenizer.

ggerganov pushed a commit that referenced this issue May 18, 2024
Tie the weights for ARCH_STARCODER to support the larger Granite code models.
Partially addresses /issues/7116

There still remains to be a few things to fix.
Currently requires `--override-kv tokenizer.ggml.add_bos_token=bool:false`
Nexesenex pushed a commit to Nexesenex/kobold.cpp that referenced this issue May 18, 2024
…nov#7324)

Tie the weights for ARCH_STARCODER to support the larger Granite code models.
Partially addresses ggerganov/issues/7116

There still remains to be a few things to fix.
Currently requires `--override-kv tokenizer.ggml.add_bos_token=bool:false`
@DigitLib

If it helps, here is what I see with the 8b-instruct model.
After converting with

python3 llama.cpp/convert.py granite-8b-ins --outfile granite-8b-ins/granite-8b-instruct.bin --outtype q8_0 --vocab-type bpe --pad-vocab

I get this error when starting ./llama.cpp/main -m ./granite-8b-ins/granite-8b-instruct.bin:

llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 578, got 470

The same numbers show up when using convert-hf-to-gguf.py.

During the conversion it shows 578.

Last two lines:
INFO:convert:[577/578] Writing tensor blk.35.attn_v.weight | size 1024 x 4096 | type Q8_0 | T+ 84
INFO:convert:[578/578] Writing tensor output_norm.weight | size 4096 | type F32 | T+ 84

I tried q8_0, f16 and f32; same error.

Thank you for this great work!

@mayank31398

mayank31398 commented May 20, 2024

I don't think that the 3b and 8b are working yet @DigitLib.
The 34b and 20b PR is merged and it's working: #7324

The 20b-base GGUF is available now: https://huggingface.co/ibm-granite/granite-20b-code-base-GGUF
I will add the instruct and 34b tomorrow.

@DigitLib

@mayank31398 I know, I just wanted to help with the 8b-instruct. Thank you!

@giuseppe
Contributor

@DigitLib you need sroecker@36dc5bb to load the smaller models

@mayank31398

If that commit is working, can we open a PR, @sroecker?

@giuseppe
Contributor

That doesn't seem to be enough. The model is loaded but it doesn't produce any good results: #7116 (comment)

@coder543

coder543 commented May 20, 2024

https://huggingface.co/coder543/granite-20b-code-instruct-GGUF/tree/main

I've uploaded the q8_0, q6_K, and q4_0 gguf files for the 20B Instruct model here. I've only lightly tested them, and this is my first time quantizing any LLMs, but it seemed like they were working okay?

If anyone wants to test them, I'm curious if they work for you.

The chat template seems to be something like this:

Question:
Write a React TypeScript component

Answer:

giuseppe pushed a commit to giuseppe/llama.cpp that referenced this issue May 22, 2024
Add optional MLP bias for ARCH_LLAMA to support Granite models.
Partially addresses ggerganov/issues/7116
Still needs some more changes to properly support Granite.
@giuseppe
Contributor

I've managed to get some output that makes some sense with the 3b model, and I've opened a PR:

IMHO it makes sense to define a new architecture for Granite, as there are substantial differences from the base llama model. To convert the HF model using the code in my PR, I modified the config.json file in the Granite model and used:

  "architectures": [
    "GraniteForCausalLM"
  ],

@mayank31398 what do you think?
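
For reference, the config.json edit can be scripted with a throwaway sketch like this (the path is an example; it simply swaps the architecture name as described above):

# Swap the architecture name in config.json so the conversion script
# picks up the Granite code path from the PR.
import json

path = "granite-3b-code-base/config.json"  # example path
with open(path) as f:
    cfg = json.load(f)

cfg["architectures"] = ["GraniteForCausalLM"]

with open(path, "w") as f:
    json.dump(cfg, f, indent=2)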

giuseppe pushed a commit to giuseppe/llama.cpp that referenced this issue May 22, 2024
Add optional MLP bias for ARCH_LLAMA to support Granite models.
Partially addresses ggerganov/issues/7116
Still needs some more changes to properly support Granite.
@celek

celek commented May 23, 2024

@giuseppe did you get the 3b GGUF working with #7481? If you teach me how to get it running locally on my M1, I can run some tests too :)

@HunterGerlach

To reproduce locally you can run the following:

  1. Clone down Giuseppe's branch, pip install necessary packages (e.g. torch, transformers, numpy, sentencepiece), and build Llama.cpp (i.e. run make)
  2. Download the 3B or 8B model from HF
  3. Modify the config.json per @giuseppe's comment above (i.e. LlamaForCausalLM -> GraniteForCausalLM)
  4. Convert to GGUF (e.g. ./convert-hf-to-gguf.py path-to-granite-model --outtype q8_0 --outfile path-to-converted-model/converted-model-name.gguf)
  5. Run inference against the GGUF model (e.g. ./main -m path-to-converted-model/converted-model-name.gguf -p "Write a simple hello world script in python.")

Inference output should be something like the following (ignoring logging output for brevity):

print("Hello World")

@mayank31398

@giuseppe I think the problem is that it won't work out of the box when converting the model.
I am not sure what a good solution is for this problem, though.

giuseppe pushed a commit to giuseppe/llama.cpp that referenced this issue May 23, 2024
Add optional MLP bias for ARCH_LLAMA to support Granite models.
Partially addresses ggerganov/issues/7116
Still needs some more changes to properly support Granite.
@giuseppe
Contributor

I've tried to extend the new arch to the bigger models, but it doesn't make sense as GPTBigCodeForCausalLM already works fine.

To avoid confusion, I've renamed the new architecture to GraniteSmallForCausalLM so it is clear that it applies only to the smaller models.

@giuseppe
Contributor

@giuseppe I think the problem is that it won't work out of the box when converting the model.

Could we fix the name in the model files, or would it cause other issues?

giuseppe pushed a commit to giuseppe/llama.cpp that referenced this issue May 23, 2024
Add optional MLP bias for ARCH_LLAMA to support Granite models.
Partially addresses ggerganov/issues/7116
Still needs some more changes to properly support Granite.
giuseppe pushed a commit to giuseppe/llama.cpp that referenced this issue May 23, 2024
Add optional MLP bias for ARCH_LLAMA to support Granite models.
Partially addresses ggerganov/issues/7116
Still needs some more changes to properly support Granite.
@mayank31398

mayank31398 commented May 23, 2024

Is there no other solution?
We cannot change the name on HF since a lot of people have already created forks of the models.
If there is an alternative method that we can explore, it would be better.

Maybe by checking if mlp_bias is true or not?

@giuseppe
Contributor

That looks like a generic setting that other models could use in the future, but I have no weight in this decision; I will implement whatever fits best for the Granite models and the llama.cpp maintainers.

Would a flag for the conversion script that forces the arch be good enough?

@ggerganov do you have any suggestions?

@giuseppe
Contributor

I've added the following patch to the PR:

diff --git a/convert-hf-to-gguf.py b/convert-hf-to-gguf.py
index 2d05de42..eb7d061a 100755
--- a/convert-hf-to-gguf.py
+++ b/convert-hf-to-gguf.py
@@ -2571,6 +2571,10 @@ def parse_args() -> argparse.Namespace:
         "--no-lazy", action="store_true",
         help="use more RAM by computing all outputs before writing (use in case lazy evaluation is broken)",
     )
+    parser.add_argument(
+        "--architecture", type=str, default=None,
+        help="force the architecture to use",
+    )
     parser.add_argument(
         "--model-name", type=str, default=None,
         help="name of the model",
@@ -2626,7 +2630,7 @@ def main() -> None:
     hparams = Model.load_hparams(dir_model)
 
     with torch.inference_mode():
-        model_class = Model.from_model_architecture(hparams["architectures"][0])
+        model_class = Model.from_model_architecture(args.architecture if args.architecture is not None else hparams["architectures"][0])
         model_instance = model_class(dir_model, ftype_map[args.outtype], fname_out, args.bigendian, args.use_temp_file, args.no_lazy)
 
         logger.info("Set model parameters")

So we can just add --architecture GraniteSmallForCausalLM to the command line, and it works without having to change the model file.

Is this an acceptable solution?

@mayank31398

I think it's better if @ggerganov gives this a review.

giuseppe pushed a commit to giuseppe/llama.cpp that referenced this issue May 24, 2024
Add optional MLP bias for ARCH_LLAMA to support Granite models.
Partially addresses ggerganov/issues/7116
Still needs some more changes to properly support Granite.
giuseppe pushed a commit to giuseppe/llama.cpp that referenced this issue May 24, 2024
Add optional MLP bias for ARCH_LLAMA to support Granite models.
Partially addresses ggerganov/issues/7116
Still needs some more changes to properly support Granite.
ggerganov pushed a commit that referenced this issue May 28, 2024
* Add optional MLP bias for Granite models

Add optional MLP bias for ARCH_LLAMA to support Granite models.
Partially addresses /issues/7116
Still needs some more changes to properly support Granite.

* llama: honor add_space_prefix from the model configuration

propagate the add_space_prefix configuration from the HF model
configuration to the gguf file and honor it with the gpt2 tokenizer.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

* llama: add support for small granite models

it works only for the small models 3b and 8b.

The convert-hf-to-gguf.py script uses the vocabulary size of the
granite models to detect granite and set the correct configuration.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

---------

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Co-authored-by: Steffen Roecker <sroecker@redhat.com>
github-actions bot pushed a commit to KerfuffleV2/ggml-sys-bleedingedge that referenced this issue May 29, 2024
== Relevant log messages from source repo:

commit 5442939fcc5e6ae41abf40612a95fd71377e487e
Author: Giuseppe Scrivano <giuseppe@scrivano.org>
Date:   Tue May 28 20:49:49 2024 +0200

    llama : support small Granite models (#7481)

    * Add optional MLP bias for Granite models

    Add optional MLP bias for ARCH_LLAMA to support Granite models.
    Partially addresses ggerganov/llama.cpp/issues/7116
    Still needs some more changes to properly support Granite.

    * llama: honor add_space_prefix from the model configuration

    propagate the add_space_prefix configuration from the HF model
    configuration to the gguf file and honor it with the gpt2 tokenizer.

    Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

    * llama: add support for small granite models

    it works only for the small models 3b and 8b.

    The convert-hf-to-gguf.py script uses the vocabulary size of the
    granite models to detect granite and set the correct configuration.

    Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

    ---------

    Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
    Co-authored-by: Steffen Roecker <sroecker@redhat.com>
