
Add Support for IBM Granite #7116

Closed
YorkieDev opened this issue May 7, 2024 · 37 comments · Fixed by #7481
Labels
enhancement New feature or request

Comments

@YorkieDev

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • [✅] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • [✅] I carefully followed the README.md.
  • [✅] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [✅] I reviewed the Discussions, and have a new bug or useful enhancement to share.

Feature Description

IBM recently released their Granite models: a series of 3B to 34B coding models with base and instruct finetunes.

https://huggingface.co/collections/ibm-granite/granite-code-models-6624c5cec322e4c148c8b330
https://github.com/ibm-granite

Many thanks to the llama.cpp community for their awesome work! It would be great to see this feature added. GGUFs can already be made, but when you try to load them you get a tokenizer error.

@YorkieDev added the enhancement (New feature or request) label May 7, 2024
@sroecker
Contributor

sroecker commented May 7, 2024

The PR to add granite support for transformers (add MLP bias - gate, up, down) can be found here: https://github.com/huggingface/transformers/pull/30031/files

@psyv282j9d

Based on the discussion in the transformers mlp_bias PR, it's similar to Llama with just the mlp_bias added.

@sroecker
Contributor

sroecker commented May 8, 2024

I tried to do this here: https://github.com/sroecker/llama.cpp/tree/add_mlp_bias
Just adding bias to FFN_GATE, FFN_DOWN and FFN_UP.
The tensor shapes seem to be correct but the model outputs gibberish.

I ran ./main -m ~/Downloads/granite-3b-code-base.Q8_0.gguf -p "Question: Python code to calculate the Fibonacci series\n\nAnswer:\n" with the GGUF from https://huggingface.co/NikolayKozloff/granite-3b-code-instruct-Q8_0-GGUF

@mayank31398

@sroecker are you tying the word embeddings? Unlike Llama, the input word embeddings and the output projection matrix are tied for the Granite models.

@sroecker
Contributor

sroecker commented May 8, 2024 via email

@JohnClaw

JohnClaw commented May 8, 2024

(or save the mlp_bias boolean in the GGUF

Is there a way to add mlp_bias to an already made GGUF? I'm asking because you mentioned my Q8 GGUF in one of your previous messages.

@sroecker
Contributor

sroecker commented May 8, 2024

(or save the mlp_bias boolean in the GGUF

Is there a way to add mlp_bias to an already made GGUF? I'm asking because you mentioned my Q8 GGUF in one of your previous messages.

You could hack something together with the gguf writer: https://pypi.org/project/gguf/
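
As a starting point, here is a minimal read-only sketch (assuming the gguf Python package linked above; the file path is just an example). It only lists the metadata keys of an existing GGUF so you can see what a patched file would need to carry; actually adding a new key means rewriting the file with all fields and tensors copied over, which is not shown here.

# Minimal sketch: inspect the metadata keys of an existing GGUF.
# Adding a new key (e.g. an mlp_bias flag) would require rewriting the
# file and copying every field and tensor, which is omitted here.
from gguf import GGUFReader

reader = GGUFReader("granite-3b-code-instruct.Q8_0.gguf")  # example path
for name in reader.fields:
    print(name)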

@sroecker
Contributor

sroecker commented May 8, 2024

So I've adapted build_llama to include the MLP biases as well. I've added a few FIXMEs to my branch to indicate places that might need to be adapted for the different Granite models.
Now the output is valid text but unfortunately repeats itself:
<|endoftext|>Question: Fibonacci series in Python? \n\nAnswer: Python? \n\n series in Python? \n\n series in Python? \n\n series in Python? \

@dataf3l

dataf3l commented May 8, 2024

I'm here to write words of support, I am interested in exploring what IBM + OLLAMA can do

@adrianpuiu

I'm here to write words of support, I am interested in exploring what IBM + OLLAMA can do

+1 to this

sroecker added a commit to sroecker/llama.cpp that referenced this issue May 9, 2024
Add optional MLP bias for ARCH_LLAMA to support Granite models.
Partially addresses ggerganov/issues/7116
Still needs some more changes to properly support Granite.
@davideuler

Is there any progress for the support of Granite models?

@jpodivin
Contributor

AFAIK, we have been stuck on the issue of repeating text output. It appears that the tokenizer is the culprit, but it does seem to be in order (correct token ids, etc.). I don't know if @sroecker has made any strides since.

@sroecker
Contributor

AFAIK, we have been stuck on the issue of repeating text output. It appears that the tokenizer is the culprit, but it does seem to be in order (correct token ids, etc.). I don't know if @sroecker has made any strides since.

Yes, unfortunately. The lab version of Granite works well with llama.cpp: https://huggingface.co/instructlab/granite-7b-lab-GGUF It doesn't have the MLP bias nodes and uses a different tokenizer, though.
I've tried a few things regarding tokenization. I checked that the tokenizer creates the same input tokens with ./main -m granite-3b-code-base.gguf -p "def generate():" -ngl 0 --override-kv tokenizer.ggml.add_bos_token=bool:false:
0 -> '<|endoftext|>' 589 -> 'def' 4450 -> ' generate' 2262 -> '():'
I also recreated the f16 GGUF forcing the pre-tokenizer to be llama-bpe instead of refact. No luck so far. There's a lot of arch-specific code all over llama.cpp which might change important parameters, so I'm thinking about creating a simple debugging example based on examples/simple/simple.cpp.
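
For reference, the same prompt can be cross-checked against the HF tokenizer with a minimal sketch like the following (assuming the transformers package and access to the public ibm-granite/granite-3b-code-base repo):

# Print the token ids the reference HF tokenizer produces for the same
# prompt, to compare against the ids reported by ./main above.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("ibm-granite/granite-3b-code-base")
ids = tok("def generate():", add_special_tokens=False)["input_ids"]
print(list(zip(ids, tok.convert_ids_to_tokens(ids))))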

@mayank31398

the lab version is a different model
not to be confused with this one

@sroecker
Contributor

the lab version is a different model not to be confused with this one

I'm aware of that; it did work out of the box with the LLM_ARCH_LLAMA settings though, so I'm trying to find out why exactly. But you're right to point this out, as a few people have mixed these up.

I will check the convert-hf-to-gguf-update.py script again to rule out the tokenizer before I start digging deeper.

@mayank31398

Hmm, a quick question: are we tying the word embeddings and the output logits matrix?
Llama doesn't do that and Granite has tied embeddings.
Maybe that's the issue?
I don't think the tokenizer should be the issue since all Granite models use the StarCoder tokenizer.

@sroecker
Contributor

Hmm, a quick question: are we tying the word embeddings and the output logits matrix? Llama doesn't do that and Granite has tied embeddings. Maybe that's the issue? I don't think the tokenizer should be the issue since all Granite models use the StarCoder tokenizer.

If no output layer is found, the word embeddings are used instead:

llama.cpp/llama.cpp

Lines 4926 to 4932 in 5416002

model.output = ml.create_tensor(ctx_output_split, tn(LLM_TENSOR_OUTPUT, "weight"), {n_embd, n_vocab}, false);
// if output is NULL, init from the input tok embed
if (model.output == NULL) {
    model.output = ml.create_tensor(ctx_output, tn(LLM_TENSOR_TOKEN_EMBD, "weight"), {n_embd, n_vocab});
    ml.n_created--; // artificial tensor
    ml.size_data += ggml_nbytes(model.output);
}
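
A quick way to check which case a converted GGUF actually hit is to list its tensor names (a minimal sketch assuming the gguf Python package; the path is an example):

# If "output.weight" is missing but "token_embd.weight" is present, the
# fallback above kicks in and the embeddings are effectively tied.
from gguf import GGUFReader

reader = GGUFReader("granite-3b-code-base-f16.gguf")  # example path
names = {t.name for t in reader.tensors}
print("output.weight" in names, "token_embd.weight" in names)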

@mayank31398

Hmm, ok so there are these differences between llama and granite:

  1. attention has bias (llama doesn't)
  2. mlp has bias (llama doesn't)
  3. tied word embeddings (llama doesn't)
  4. starcoder tokenizer
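
For a quick check of these flags in a local HF checkout, here is a small sketch (the field names follow the transformers mlp_bias PR linked earlier plus the standard LlamaConfig keys, and the path is only an example):

# Print the config flags where Granite differs from plain Llama.
import json

with open("granite-3b-code-base/config.json") as f:  # example local path
    cfg = json.load(f)

for key in ("attention_bias", "mlp_bias", "tie_word_embeddings", "vocab_size"):
    print(f"{key}: {cfg.get(key, '<not set>')}")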

sroecker added a commit to sroecker/llama.cpp that referenced this issue May 15, 2024
Tie the weights for ARCH_STARCODER to support the larger Granite code models.
Partially addresses ggerganov/issues/7116

There still remains to be a few things to fix.
Currently requires `--override-kv tokenizer.ggml.add_bos_token=bool:false`
@sroecker
Contributor

Hmm, ok so there are these differences between llama and granite:

  1. attention has bias (llama doesn't)
  2. mlp has bias (llama doesn't)
  3. tied word embeddings (llama doesn't)
  4. starcoder tokenizer

Do all Granite code models use the StarCoder tokenizer? Based on your HF repo comment I tried to get the 20b and 34b to run. They are recognized as Starcoder arch by the convert-hf-to-gguf script, and all I had to modify was to tie the embedding weights. 20b instruct works quite well, even with the BOS token. The Q3_K_L quant comes down to 11GB.
Please have a try with these changes:
sroecker@6f20148

For the 3b and 8b models, 1) and 4) remain. We have to check whether the attention bias is set up correctly in llm_build_kv; build_refact should be a good reference for comparison.

@mayank31398

Yeah, all of them are using the StarCoder tokenizer.

ggerganov pushed a commit that referenced this issue May 18, 2024
Tie the weights for ARCH_STARCODER to support the larger Granite code models.
Partially addresses /issues/7116

There still remains to be a few things to fix.
Currently requires `--override-kv tokenizer.ggml.add_bos_token=bool:false`
Nexesenex pushed a commit to Nexesenex/kobold.cpp that referenced this issue May 18, 2024
…nov#7324)

Tie the weights for ARCH_STARCODER to support the larger Granite code models.
Partially addresses ggerganov/issues/7116

There still remains to be a few things to fix.
Currently requires `--override-kv tokenizer.ggml.add_bos_token=bool:false`
@DigitLib

If it helps, here is what I see with the 8b-instruct model.
After converting with

python3 llama.cpp/convert.py granite-8b-ins --outfile granite-8b-ins/granite-8b-instruct.bin --outtype q8_0 --vocab-type bpe --pad-vocab

I get this error when starting ./llama.cpp/main -m ./granite-8b-ins/granite-8b-instruct.bin:

llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 578, got 470

The same numbers show up when using convert-hf-to-gguf.py.

During the conversion it shows 578.

Last two lines:
INFO:convert:[577/578] Writing tensor blk.35.attn_v.weight | size 1024 x 4096 | type Q8_0 | T+ 84
INFO:convert:[578/578] Writing tensor output_norm.weight | size 4096 | type F32 | T+ 84

I tried q8_0, f16 and f32; same error.

Thank you for this great work!

@mayank31398

mayank31398 commented May 20, 2024

I don't think that the 3b and 8b are working yet @DigitLib.
The 34b and 20b PR is merged and it's working: #7324

The 20b-base GGUF is available now: https://huggingface.co/ibm-granite/granite-20b-code-base-GGUF
I will add the instruct and 34b tomorrow.

@DigitLib

@mayank31398 I know, I just wanted to help with the 8b-instruct. Thank you!

@giuseppe
Contributor

@DigitLib you need sroecker@36dc5bb to load the smaller models

@mayank31398

If that commit is working, can we open a PR, @sroecker?

@giuseppe
Contributor

That doesn't seem to be enough. The model is loaded but it doesn't produce any good results: #7116 (comment)

@coder543

coder543 commented May 20, 2024

https://huggingface.co/coder543/granite-20b-code-instruct-GGUF/tree/main

I've uploaded the q8_0, q6_K, and q4_0 gguf files for the 20B Instruct model here. I've only lightly tested them, and this is my first time quantizing any LLMs, but it seemed like they were working okay?

If anyone wants to test them, I'm curious if they work for you.

The chat template seems to be something like this:

Question:
Write a React TypeScript component

Answer:

giuseppe pushed a commit to giuseppe/llama.cpp that referenced this issue May 22, 2024
Add optional MLP bias for ARCH_LLAMA to support Granite models.
Partially addresses ggerganov/issues/7116
Still needs some more changes to properly support Granite.
@giuseppe
Contributor

I've managed to get some output that makes some sense with the 3b model, and I've opened a PR:

IMHO it makes sense to define a new architecture for Granite, as there are substantial differences from the base llama model. To convert the HF model using the code in my PR, I modified the config.json file in the Granite model and used:

  "architectures": [
    "GraniteForCausalLM"
  ],

@mayank31398 what do you think?
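
For reference, the config.json edit can be scripted with a throwaway sketch like this (the path is an example; it simply swaps the architecture name as described above):

# Swap the architecture name in config.json so the conversion script
# picks up the Granite code path from the PR.
import json

path = "granite-3b-code-base/config.json"  # example path
with open(path) as f:
    cfg = json.load(f)

cfg["architectures"] = ["GraniteForCausalLM"]

with open(path, "w") as f:
    json.dump(cfg, f, indent=2)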

giuseppe pushed a commit to giuseppe/llama.cpp that referenced this issue May 22, 2024
Add optional MLP bias for ARCH_LLAMA to support Granite models.
Partially addresses ggerganov/issues/7116
Still needs some more changes to properly support Granite.
@celek

celek commented May 23, 2024

@giuseppe did you get the 3b GGUF working with #7481? If you teach me how to get it running locally on my M1, I can run some tests too :)

@HunterGerlach

To reproduce locally you can run the following:

  1. Clone down Giuseppe's branch, pip install necessary packages (e.g. torch, transformers, numpy, sentencepiece), and build Llama.cpp (i.e. run make)
  2. Download the 3B or 8B model from HF
  3. Modify the config.json per @giuseppe's comment above (i.e. LlamaForCausalLM -> GraniteForCausalLM)
  4. Convert to GGUF (e.g. ./convert-hf-to-gguf.py path-to-granite-model --outtype q8_0 --outfile path-to-converted-model/converted-model-name.gguf)
  5. Run inference against the GGUF model (e.g. ./main -m path-to-converted-model/converted-model-name.gguf -p "Write a simple hello world script in python.")

Inference output should be something like the following (ignoring logging output for brevity):

print("Hello World")

@mayank31398

@giuseppe I think the problem is that it won't work out of the box when converting the model.
I am not sure what a good solution is for this problem, though.

giuseppe pushed a commit to giuseppe/llama.cpp that referenced this issue May 23, 2024
Add optional MLP bias for ARCH_LLAMA to support Granite models.
Partially addresses ggerganov/issues/7116
Still needs some more changes to properly support Granite.
@giuseppe
Contributor

I've tried to extend the new arch to the bigger models, but it doesn't make sense as GPTBigCodeForCausalLM already works fine.

To avoid confusion, I've renamed the new architecture to GraniteSmallForCausalLM so it is clear that it applies only to the smaller models.

@giuseppe
Contributor

@giuseppe I think the problem is that it won't work out of the box when converting the model.

Could we fix the name in the model files, or would it cause other issues?

giuseppe pushed a commit to giuseppe/llama.cpp that referenced this issue May 23, 2024
Add optional MLP bias for ARCH_LLAMA to support Granite models.
Partially addresses ggerganov/issues/7116
Still needs some more changes to properly support Granite.
giuseppe pushed a commit to giuseppe/llama.cpp that referenced this issue May 23, 2024
Add optional MLP bias for ARCH_LLAMA to support Granite models.
Partially addresses ggerganov/issues/7116
Still needs some more changes to properly support Granite.
@mayank31398

mayank31398 commented May 23, 2024

Is there no other solution?
We cannot change the name on HF since a lot of people have already created forks of the models.
If there is an alternative method that we can explore, it would be better.

Maybe by checking if mlp_bias is true or not?

@giuseppe
Contributor

That looks like a generic setting that other models could use in the future, but I have no weight in this decision; I will implement whatever fits best for the Granite models and the llama.cpp maintainers.

Would a flag for the conversion script that forces the arch be good enough?

@ggerganov do you have any suggestions?

@giuseppe
Contributor

I've added the following patch to the PR:

diff --git a/convert-hf-to-gguf.py b/convert-hf-to-gguf.py
index 2d05de42..eb7d061a 100755
--- a/convert-hf-to-gguf.py
+++ b/convert-hf-to-gguf.py
@@ -2571,6 +2571,10 @@ def parse_args() -> argparse.Namespace:
         "--no-lazy", action="store_true",
         help="use more RAM by computing all outputs before writing (use in case lazy evaluation is broken)",
     )
+    parser.add_argument(
+        "--architecture", type=str, default=None,
+        help="force the architecture to use",
+    )
     parser.add_argument(
         "--model-name", type=str, default=None,
         help="name of the model",
@@ -2626,7 +2630,7 @@ def main() -> None:
     hparams = Model.load_hparams(dir_model)
 
     with torch.inference_mode():
-        model_class = Model.from_model_architecture(hparams["architectures"][0])
+        model_class = Model.from_model_architecture(args.architecture if args.architecture is not None else hparams["architectures"][0])
         model_instance = model_class(dir_model, ftype_map[args.outtype], fname_out, args.bigendian, args.use_temp_file, args.no_lazy)
 
         logger.info("Set model parameters")

So we can just add --architecture GraniteSmallForCausalLM to the command line, and it works without having to change the model file.

Is this an acceptable solution?

@mayank31398

I think it's better if @ggerganov gives this a review.

giuseppe pushed a commit to giuseppe/llama.cpp that referenced this issue May 24, 2024
Add optional MLP bias for ARCH_LLAMA to support Granite models.
Partially addresses ggerganov/issues/7116
Still needs some more changes to properly support Granite.
giuseppe pushed a commit to giuseppe/llama.cpp that referenced this issue May 24, 2024
Add optional MLP bias for ARCH_LLAMA to support Granite models.
Partially addresses ggerganov/issues/7116
Still needs some more changes to properly support Granite.
ggerganov pushed a commit that referenced this issue May 28, 2024
* Add optional MLP bias for Granite models

Add optional MLP bias for ARCH_LLAMA to support Granite models.
Partially addresses /issues/7116
Still needs some more changes to properly support Granite.

* llama: honor add_space_prefix from the model configuration

propagate the add_space_prefix configuration from the HF model
configuration to the gguf file and honor it with the gpt2 tokenizer.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

* llama: add support for small granite models

it works only for the small models 3b and 8b.

The convert-hf-to-gguf.py script uses the vocabulary size of the
granite models to detect granite and set the correct configuration.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

---------

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Co-authored-by: Steffen Roecker <sroecker@redhat.com>
github-actions bot pushed a commit to KerfuffleV2/ggml-sys-bleedingedge that referenced this issue May 29, 2024
== Relevant log messages from source repo:

commit 5442939fcc5e6ae41abf40612a95fd71377e487e
Author: Giuseppe Scrivano <giuseppe@scrivano.org>
Date:   Tue May 28 20:49:49 2024 +0200

    llama : support small Granite models (#7481)

    * Add optional MLP bias for Granite models

    Add optional MLP bias for ARCH_LLAMA to support Granite models.
    Partially addresses ggerganov/llama.cpp/issues/7116
    Still needs some more changes to properly support Granite.

    * llama: honor add_space_prefix from the model configuration

    propagate the add_space_prefix configuration from the HF model
    configuration to the gguf file and honor it with the gpt2 tokenizer.

    Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

    * llama: add support for small granite models

    it works only for the small models 3b and 8b.

    The convert-hf-to-gguf.py script uses the vocabulary size of the
    granite models to detect granite and set the correct configuration.

    Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

    ---------

    Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
    Co-authored-by: Steffen Roecker <sroecker@redhat.com>
