Support glm3 and glm4. #8031
base: master
Conversation
https://hf-mirror.com/THUDM/chatglm3-6b Signed-off-by: XingXing Qiao <qiaoxx@dingdao.com>
Thanks for the great work!
This is so great! Thank you 👍! There are only a couple of things that I ran into. During compile, a small note:
Converting to bf16 did not work for me, but it does work with f32. bf16 log:
Other than that it works great. Prompt:
Output:
I reran the conversion for GLM3 and GLM4, but did not encounter the issue you mentioned.
Thanks, I think it's related to my pip environment and not a problem with the code.
llama.cpp (Outdated)
@@ -18324,6 +18550,19 @@ llama_token_attr llama_token_get_attr(const struct llama_model * model, llama_to
}

bool llama_token_is_eog(const struct llama_model * model, llama_token token) {
    auto arch_name = llama_model_arch_name(model->arch);
    auto vocab_type = model->vocab.type;
    if (strcmp(arch_name, "chatglm") == 0) {
llama_token_is_eog is called quite often; doing a string compare here may have an impact on performance.
Looking at tokenizer_config.json, I think it's safe to stop at EOS (<|endoftext|>), so there is no need to hard-code token IDs here.
Edit: looking at the chat template, it seems the model does not have the notion of an end-of-turn token (strange!). Maybe we need to introduce the EOT token as a list instead of a single value. This will require adding metadata to gguf (CC @ggerganov)
Alright, I will add an eot list to the gguf metadata. Then, during vocab initialization, I will put all the eot entries into this variable, so the check will only require traversing this list.
I have already added the eod_id_list variable to the gguf meta and ensured compatibility with the previous versions. Could you please check if there are any other modifications needed?
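For illustration, a minimal sketch (assumptions only, not the PR's actual code) of what an EOG check driven by such a list could look like, assuming the token IDs were loaded from the GGUF metadata into a set during vocab initialization; all names here are hypothetical:

#include <cstdint>
#include <set>

// Hypothetical vocab fragment: eog_ids would be filled from the GGUF
// end-of-dialog id list metadata (plus EOS/EOT) when the model is loaded.
struct vocab_sketch {
    std::set<int32_t> eog_ids;
};

// A set lookup replaces the per-call string compare on the architecture name.
static bool token_is_eog(const vocab_sketch & vocab, int32_t token) {
    return vocab.eog_ids.count(token) > 0;
}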
@dranger003 thank you, that fixed it for me too! I think it's related to my pip environment. It's odd that upgrading by itself didn't work; I had to completely remove it first.
Sorry, which gguf environment did you use? After the revision, the issue still shows up:
python convert-hf-to-gguf.py ../THUDM/glm-4-9b-chat --outtype q8_0 --outfile glm4.gguf
Traceback (most recent call last):
File "/2T/Langchain-Ch/glm4cpp/convert-hf-to-gguf.py", line 842, in <module>
class OrionModel(Model):
File "/2T/Langchain-Ch/glm4cpp/convert-hf-to-gguf.py", line 843, in OrionModel
model_arch = gguf.MODEL_ARCH.ORION
File "/root/anaconda3/envs/chatglmcpp/lib/python3.9/enum.py", line 429, in __getattr__
raise AttributeError(name) from None
AttributeError: ORION
This issue seems to be due to an incorrect architectures attribute in your model's config.json file. You can re-download the model configuration from Hugging Face.
I couldn't test this atm, but the code looks good to me. Thanks for the contribution @youth123
FYI, there is also a fix #8198 that may fix the issue with the incorrect response that @matteoserva pointed out.
Let's also wait for approval from @ggerganov before merging
Thank you again for your work. I found an issue when the model performs a perplexity check. Also, when I run the same check on a llama model, it is normal.
Is there any progress?
I am revising the code according to the review comments and rebasing it onto the latest branch. There have been many changes to llama.cpp recently.
I reran the perplexity check for GLM3 and GLM4, but did not encounter the issue you mentioned.
python convert-hf-to-gguf.py /root/.cache/huggingface/hub/models--THUDM--glm-4-9b-chat/snapshots/75792d7ee58a335df6943c5d719cc559b64f8e2a/ --outfile test.gguf
./build/bin/llama-perplexity -m test.gguf -f /root/tmp/wikitext-2-raw/wiki.test.raw
https://huggingface.co/THUDM/glm-4-9b-chat
case LLM_FFN_SWIGLU:
    {
        // Project to 4h. If using swiglu double the output width, see https://arxiv.org/pdf/2002.05202.pdf
        int64_t split_point = cur->ne[0] / 2;
        struct ggml_tensor * x0 = ggml_cont(ctx, ggml_view_2d(ctx, cur, split_point, cur->ne[1], cur->nb[1], 0));
        struct ggml_tensor * x1 = ggml_cont(ctx, ggml_view_2d(ctx, cur, split_point, cur->ne[1], cur->nb[1], split_point * ggml_element_size(cur)));

        x0 = ggml_silu(ctx, x0);
        cb(cur, "ffn_silu", il);

        cur = ggml_mul(ctx, x0, x1);
        cb(cur, "ffn_mul", il);
    } break;
Instead of introducing this new activation path, you can use LLM_FFN_SILU + LLM_FFN_PAR. Just need to split the ffn_up tensor into ffn_up + ffn_gate during conversion.
The order of operations on the split tensors would be reversed. Assume the split tensors are x0 and x1. The original implementation is silu(x0) * x1. If x1 is assigned to the gate, the executed formula becomes silu(x0 * x1). So it should still be necessary to add a swiglu implementation.
It's not reversed, LLM_FFN_PAR computes exactly silu(x0) * x1. But I just noticed that we already have the same code for Phi3, so in that case we can introduce LLM_FFN_SWIGLU and reuse it in both models.
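For reference, a restatement of the SwiGLU formulation from the cited paper, matched to the code above (where x0 is the first half of the fused up projection and x1 the second half); this is a clarifying note, not text from the PR:

\[
\operatorname{SwiGLU}(x) \;=\; \operatorname{SiLU}(x_0) \odot x_1,
\qquad [\,x_0 \;;\; x_1\,] \;=\; x\,W_{\mathrm{up}}
\]

Splitting the fused W_up column-wise at conversion time, so that the first half becomes ffn_gate and the second half becomes ffn_up, and then running LLM_FFN_SILU + LLM_FFN_PAR computes the same expression, which is why either route works.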
// add prefix to chatglm3
if (vocab.type_pre == LLAMA_VOCAB_PRE_TYPE_CHATGLM3) {
    output.push_back(64790); // [gMask]
    output.push_back(64792); // sop
    output.push_back(64795); // <|user|>
    output.push_back(30910); // \n
    output.push_back(13);
}
These prefixes and suffixes should not be applied here. Instead, the user code is responsible for that (e.g. chat templates).
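As an illustration of that division of responsibility, here is a minimal sketch of user code applying the model's built-in chat template via llama_chat_apply_template, so tokenization itself stays prefix-free; the helper name and buffer handling are assumptions for the example, not code from this PR:

#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>
#include "llama.h"

// Build the prompt text from a single user message using the model's chat
// template (passing tmpl == nullptr selects the template embedded in the GGUF).
static std::string build_prompt(const llama_model * model, const std::string & user_msg) {
    llama_chat_message msg = { "user", user_msg.c_str() };
    std::vector<char> buf(4096);
    int32_t n = llama_chat_apply_template(model, nullptr, &msg, 1,
                                          /*add_ass=*/true, buf.data(), (int32_t) buf.size());
    if (n > (int32_t) buf.size()) {
        buf.resize(n); // the call reports the required size; retry with a larger buffer
        n = llama_chat_apply_template(model, nullptr, &msg, 1, true, buf.data(), (int32_t) buf.size());
    }
    return std::string(buf.data(), std::max<int32_t>(n, 0));
}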
@youth123 I didn't notice this part; maybe this explains why the model's answer is incorrect: the [gMask]sop<|user|> prefix is added twice, once here and once in the chat template. We should remove it here.
The incorrect answers have nothing to do with the prefix being added twice. I compared against the HF model's output before, and the output of each transformer layer is closely aligned, but the error becomes larger in the last layer; I'm still looking into this strange problem. Regarding the chat template, will it run in conversation mode in the future?
Thank you
I have fixed the issues mentioned in #6999. This code fully supports the glm3 and glm4 model architectures and can be embedded in an ollama server. This PR is based on https://github.com/mnlife/llama.cpp/tree/glm4 and https://github.com/mnlife/llama.cpp/tree/chatglm3, by @mnlife and @xingxingqiao.