
Support MiniCPM #5346

Merged
merged 11 commits into ggerganov:master on Feb 7, 2024

Conversation

runfuture
Contributor

  1. Add a new file, convert-minicpm.py, to convert the model (https://huggingface.co/openbmb/MiniCPM-2B-dpo-fp16). It is very similar to convert.py, which only supports Llama.
  2. Add a new model architecture in llama.cpp, applying several scaling computations as discussed in MiniCPM 2b model support? #5276 (rough sketch of these scalings below).
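For reference, a minimal Python sketch (not the llama.cpp code itself) of the extra scalings, assuming the constants that come up later in this review and the 2B dimensions (n_embd = 2304, n_layer = 40) that also appear in the server log further down the thread:

```python
import math

# MiniCPM-2B dimensions (assumed from the HF config / the GGUF metadata shown
# later in this thread) and the scaling constants discussed in this review.
n_embd      = 2304    # embedding_length
n_layer     = 40      # block_count
n_embd_base = 256     # "dim_model_base" in the HF config
scale_embd  = 12.0    # token embeddings are multiplied by this
scale_depth = 1.4     # residual branches are scaled by scale_depth / sqrt(n_layer)

scale_residual = scale_depth / math.sqrt(n_layer)  # ~0.2214, applied to attn/ffn outputs
scale_lmhead   = n_embd_base / n_embd              # 256/2304 = 1/9, applied before lm_head

print(f"residual scale: {scale_residual:.4f}, lm_head scale: {scale_lmhead:.4f}")
```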

Owner

Is it difficult to merge it into convert.py instead of creating a new file?

Contributor Author

Merging it into convert.py could be achieved by adding an argument that explicitly indicates the model architecture; otherwise it is hard to differentiate them automatically, since MiniCPM's weight storage is almost the same as Llama's. What's your suggestion?
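For illustration only (no such option exists in this PR), an explicit architecture flag for convert.py might look roughly like this:

```python
# Hypothetical sketch only: what an explicit architecture flag for convert.py
# could look like; this flag does not exist in this PR.
import argparse

parser = argparse.ArgumentParser(description="convert an HF model to GGUF")
parser.add_argument("model", help="path to the HF model directory")
parser.add_argument("--arch", choices=["llama", "minicpm"], default="llama",
                    help="architecture to write into the GGUF metadata; needed "
                         "because MiniCPM's weight layout is almost identical to Llama's")
args = parser.parse_args()
print(f"converting {args.model} as {args.arch}")
```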

Owner

How about in convert-hf-to-gguf.py - would it be easier?

Contributor Author

How about in convert-hf-to-gguf.py - would it be easier?

In fact, I've already tried that. I copied and pasted a lot of code from other models and it almost works, but the hard part is the tokenizer, which I haven't finished yet. The number of lines added/modified will be 10x–30x more than for convert.py.
I'll try to finish it, and then you can compare which way is better.

Contributor Author

How about in convert-hf-to-gguf.py - would it be easier?

It seems to be working now. I'm not familiar with the vocab handling, so there may be bugs in _set_vocab_hf (convert-hf-to-gguf.py).
If this way is better, I'll remove convert-minicpm.py.

Owner

I think this looks great. I'm just not sure about the `from convert import HfVocab`.

@cebtenzzre What do you think?

Collaborator

I think it's fine, as long as we're mindful that convert-hf-to-gguf.py is now dependent on convert.py; Python is prone to circular dependency issues that can only be resolved by moving code to a shared module.

@runfuture
Contributor Author

Improvements have been submitted. :)

llama.cpp Outdated
cb(cur, "result_norm", -1);

// lm_head scaling
float scale_lmhead = 1.0f/9.0f; // 1/(dim_model/256)
Owner

Maybe keep these constants expanded:

Suggested change:
-    float scale_lmhead = 1.0f/9.0f; // 1/(dim_model/256)
+    const float scale_lmhead = 256.0f/n_embd;

Contributor Author

Done, but leaving a todo for the future. Let's wait for model development? :)

llama.cpp Outdated
Comment on lines 6833 to 6835
const int scale_emb = 12;
const int dim_model_base = 256;
const float scale_depth = 1.4f;
Owner

Minor change of the constant names:

Suggested change:
-    const int scale_emb = 12;
-    const int dim_model_base = 256;
-    const float scale_depth = 1.4f;
+    const int64_t n_embd_base = 256;
+    const float scale_embd = 12.0f;
+    const float scale_depth = 1.4f;

Contributor Author

Minor change of the constant names:

Done with corresponding modifications: this commit

@ggerganov ggerganov merged commit 316c7fa into ggerganov:master Feb 7, 2024
49 of 53 checks passed
@sweetcard

It doesn't work.
python convert-hf-to-gguf.py "./miniCPM-bf16"
./server -m "./miniCPM-bf16/ggml-model-f16.gguf" -n 256 -c 1024

Any wrong settings?

Check the screenshots please.


@lin-calvin

lin-calvin commented Feb 7, 2024

It doesn't work. python convert-hf-to-gguf.py "./miniCPM-bf16" ./server -m "./miniCPM-bf16/ggml-model-f16.gguf" -n 256 -c 1024

Any wrong settings?

Check the screenshots please.

There seems to be nothing wrong with it; don't expect a 2B model to do things like coding. If you want better output, try enabling Mirostat v2 in "More options".

@sweetcard

It doesn't work. python convert-hf-to-gguf.py "./miniCPM-bf16" ./server -m "./miniCPM-bf16/ggml-model-f16.gguf" -n 256 -c 1024
Any wrong settings?
Check the screenshots please.

There seems to be nothing wrong with it; don't expect a 2B model to do things like coding

The version from LLM Farm can work with Python.

Check this:

[screenshot]

When Mirostat v2 is enabled, it doesn't work.
[screenshot]

@lin-calvin

It doesn't work. python convert-hf-to-gguf.py "./miniCPM-bf16" ./server -m "./miniCPM-bf16/ggml-model-f16.gguf" -n 256 -c 1024
Any wrong settings?
Check the screenshots please.

There seems to be nothing wrong with it; don't expect a 2B model to do things like coding

The version from LLM Farm can work with Python.

Check this:

When Mirostat v2 is enabled, it doesn't work.

Found the problem: your chat prompt template is wrong. Try configuring it like this:
[screenshot: suggested prompt template settings]

@sweetcard

Not lucky. 😭

@lin-calvin

Not lucky. 😭

No way. Is your model the DPO one?

@sweetcard

No way. Is your model the DPO one?

MiniCPM-2B-sft-bf16

@lin-calvin

try this one: https://modelscope.cn/models/OpenBMB/MiniCPM-2B-dpo-fp16/

@sweetcard

try this one: https://modelscope.cn/models/OpenBMB/MiniCPM-2B-dpo-fp16/

It still doesn't feel good.


@huangyuxiang03

Hi, I'm one of the maintainers of the MiniCPM repo. Could you provide detailed instructions on how to reproduce this problem? We would like to figure this out. Thank you!

@sweetcard

Hi, I'm one of the maintainers of the MiniCPM repo. Could you provide detailed instructions on how to reproduce this problem? We would like to figure this out. Thank you!

  1. python convert-hf-to-gguf.py "./MiniCPM-2B-dpo-fp16"
  2. ./server -m "./MiniCPM-2B-dpo-fp16/ggml-model-f16.gguf" -n 256 -c 1024
  3. Go to http://localhost:8080/ and set the following parameters:

@runfuture
Contributor Author

@sweetcard @huangyuxiang03
Sorry, I think there may be bugs in the tokenizer processing when converting the model.
Please use this historical version (https://github.com/ggerganov/llama.cpp/tree/550ab5e1c5dc97208e7e148c8f81b1e06f0900b9) and convert-minicpm.py to convert the model for now.
I'm trying to fix it ASAP.

@lin-calvin

@sweetcard @huangyuxiang03 Sorry, I think there may be bugs in the tokenizer processing when converting the model. Please use this historical version (https://github.com/ggerganov/llama.cpp/tree/550ab5e1c5dc97208e7e148c8f81b1e06f0900b9) and convert-minicpm.py to convert the model for now. I'm trying to fix it ASAP.

The script seems to have another problem: llama.cpp can't run the model because of GGML_ASSERT: ggml.c:9630: eps > 0.0f

@runfuture
Contributor Author

@sweetcard @huangyuxiang03 Sorry, I think there may be bugs in the tokenizer processing when converting the model. Please use this historical version (https://github.com/ggerganov/llama.cpp/tree/550ab5e1c5dc97208e7e148c8f81b1e06f0900b9) and convert-minicpm.py to convert the model for now. I'm trying to fix it ASAP.

The script seems to have another problem: llama.cpp can't run the model because of GGML_ASSERT: ggml.c:9630: eps > 0.0f

The latest commit fixed this issue.
BTW, I'm still struggling to compare the differences between the models generated by the two conversion scripts, and I'm looking for hints.
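For anyone who wants to help compare, here is a rough sketch of a tensor diff between the two outputs. It assumes the gguf Python package's GGUFReader and hypothetical file names; the exact attribute names may differ from what's shown.

```python
# Rough sketch for diffing the tensors of two converted models.
# Assumes the gguf Python package (gguf-py) and hypothetical file names;
# the GGUFReader attributes used here (tensors, name, data) are assumptions.
import numpy as np
from gguf import GGUFReader

a = GGUFReader("minicpm-convert-minicpm.gguf")      # from convert-minicpm.py
b = GGUFReader("minicpm-convert-hf-to-gguf.gguf")   # from convert-hf-to-gguf.py

ta = {t.name: t for t in a.tensors}
tb = {t.name: t for t in b.tensors}

for name in sorted(set(ta) | set(tb)):
    if name not in ta or name not in tb:
        print(f"{name}: present in only one file")
    elif ta[name].data.shape != tb[name].data.shape:
        print(f"{name}: shape mismatch {ta[name].data.shape} vs {tb[name].data.shape}")
    elif not np.allclose(ta[name].data, tb[name].data, atol=1e-5):
        print(f"{name}: values differ")
```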

@huangyuxiang03

[screenshot] I can get this running using MiniCPM-dpo-bf16. It seems like everything is correct with this historical version (https://github.com/ggerganov/llama.cpp/tree/550ab5e1c5dc97208e7e148c8f81b1e06f0900b9) and [convert-minicpm.py](https://github.com/ggerganov/llama.cpp/blob/550ab5e1c5dc97208e7e148c8f81b1e06f0900b9/convert-minicpm.py).

@huangyuxiang03

huangyuxiang03 commented Feb 7, 2024

Hi, I'm one of the maintainers of the MiniCPM repo. Could you provide detailed instructions on how to reproduce this problem? We would like to figure this out. Thank you!

1. `python convert-hf-to-gguf.py "./MiniCPM-2B-dpo-fp16"`

2. `./server -m "./MiniCPM-2B-dpo-fp16/ggml-model-f16.gguf" -n 256 -c 1024`

3. Go to http://localhost:8080/ and set the following parameters:


The last line of the prompt should be <{{char}}>{{message}}<AI>
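For reference, one way to check the prompt format outside the web UI; a sketch assuming the requests package, the server from the steps above on its default port, and MiniCPM's <用户>/<AI> chat markers:

```python
# Sketch: exercise the llama.cpp server's /completion endpoint directly,
# assuming the default port from the steps above and MiniCPM's
# <用户>...<AI> chat markers (the <{{char}}>{{message}}<AI> line in the UI).
import requests

resp = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "<用户>Write a Python function that adds two numbers.<AI>",
        "n_predict": 256,
    },
)
print(resp.json()["content"])
```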

@runfuture
Copy link
Contributor Author

runfuture commented Feb 7, 2024

@huangyuxiang03 @sweetcard @calvinweb
Thanks for your patience. I have fixed the bug and made a new PR. Please check it out here

@raymond-infinitecode

Download
https://huggingface.co/openbmb/MiniCPM-2B-dpo-fp16

convert using
python convert-hf-to-gguf.py d:\MiniCPM --outfile minicpm.gguf

But not able to start up the server

llama server listening at http://127.0.0.1:8080

{"timestamp":1707389853,"level":"INFO","function":"main","line":2557,"message":"HTTP server listening","port":"8080","hostname":"127.0.0.1"}
llama_model_loader: loaded meta data with 21 key-value pairs and 362 tensors from minicpm.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = minicpm
llama_model_loader: - kv 1: general.name str = MiniCPM
llama_model_loader: - kv 2: minicpm.context_length u32 = 2048
llama_model_loader: - kv 3: minicpm.embedding_length u32 = 2304
llama_model_loader: - kv 4: minicpm.feed_forward_length u32 = 5760
llama_model_loader: - kv 5: minicpm.block_count u32 = 40
llama_model_loader: - kv 6: minicpm.attention.head_count u32 = 36
llama_model_loader: - kv 7: minicpm.attention.head_count_kv u32 = 36
llama_model_loader: - kv 8: minicpm.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 9: general.file_type u32 = 1
llama_model_loader: - kv 10: minicpm.rope.dimension_count u32 = 64
llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,122753] = ["", "", "", "", "<C...
llama_model_loader: - kv 13: tokenizer.ggml.scores arr[f32,122753] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,122753] = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 17: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 18: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 19: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 20: tokenizer.chat_template str = {% for message in messages %}{% if me...
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type f16: 281 tensors
llm_load_vocab: mismatch in special tokens definition ( 3528/122753 vs 259/122753 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = minicpm
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 122753
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 2304
llm_load_print_meta: n_head = 36
llm_load_print_meta: n_head_kv = 36
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 2304
llm_load_print_meta: n_embd_v_gqa = 2304
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 0.0e+00
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 5760
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 2B
llm_load_print_meta: model ftype = F16
llm_load_print_meta: model params = 2.72 B
llm_load_print_meta: model size = 5.08 GiB (16.00 BPW)
llm_load_print_meta: general.name = MiniCPM
llm_load_print_meta: BOS token = 1 ''
llm_load_print_meta: EOS token = 2 ''
llm_load_print_meta: UNK token = 0 ''
llm_load_print_meta: LF token = 1099 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.14 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/41 layers to GPU
llm_load_tensors: CPU buffer size = 5197.65 MiB
...........................................................................................
llama_new_context_with_model: n_ctx = 1024
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 360.00 MiB
llama_new_context_with_model: KV self size = 360.00 MiB, K (f16): 180.00 MiB, V (f16): 180.00 MiB
llama_new_context_with_model: CPU input buffer size = 6.51 MiB
llama_new_context_with_model: CPU compute buffer size = 268.68 MiB
llama_new_context_with_model: graph splits (measure): 1
GGML_ASSERT: ggml.c:9630: eps > 0.0f
GGML_ASSERT: ggml.c:9630: eps > 0.0f
GGML_ASSERT: ggml.c:9630: eps > 0.0f
GGML_ASSERT: ggml.c:9630: eps > 0.0f
GGML_ASSERT: ggml.c:9630: eps > 0.0f
GGML_ASSERT: ggml.c:9630: eps > 0.0f
GGML_ASSERT: ggml.c:9630: eps > 0.0f
GGML_ASSERT: ggml.c:9630: eps > 0.0f

D:\llama.cpp>

@runfuture
Contributor Author

Download https://huggingface.co/openbmb/MiniCPM-2B-dpo-fp16

convert using python convert-hf-to-gguf.py d:\MiniCPM --outfile minicpm.gguf

But not able to start up the server

GGML_ASSERT: ggml.c:9630: eps > 0.0f

There were bugs which have been fixed by a new PR. Could you please try using the latest release, b2101, to test again?

@sweetcard

@huangyuxiang03 @sweetcard @calvinweb Thanks for your patience. I have fixed the bug and made a new PR. Please check it out here

Thank you for your amazing work. It works now.👍

jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Mar 13, 2024
* support minicpm arch.

* fix tab/space typo.

* convert minicpm model via convert-hf-gguf.py

* try to make tokenizer work

* fix bug for quantize minicpm

* fix for flake8 lint

* remove convert-minicpm.py

* fix for editorconfig

* correct minicpm model type (size)

* constants expanded for minicpm

* Minor change of the constant names for minicpm
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
* support minicpm arch.

* fix tab/space typo.

* convert minicpm model via convert-hf-gguf.py

* try to make tokenizer work

* fix bug for quantize minicpm

* fix for flake8 lint

* remove convert-minicpm.py

* fix for editorconfig

* correct minicpm model type (size)

* constants expanded for minicpm

* Minor change of the constant names for minicpm