Mistral Nemo inference support (#8577) #8604
Conversation
Looks like the tokenizer hash changed from the one at convert_hf_to_gguf.py line 596 (commit 07283b1).
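For anyone re-checking locally: that hash is, roughly, a SHA-256 over the token IDs produced for a fixed check string. A minimal sketch of the idea (the check string below is a placeholder, not the one the script actually uses, and the repo is gated so you need access):

```python
from hashlib import sha256
from transformers import AutoTokenizer

# Rough sketch of how convert_hf_to_gguf.py derives its pre-tokenizer hash:
# encode a fixed check string and hash the stringified token IDs.
# The real check string lives in the script; this one is a stand-in.
chktxt = "placeholder check string 123"

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-Nemo-Instruct-2407")
chkhsh = sha256(str(tokenizer.encode(chktxt)).encode()).hexdigest()
print(chkhsh)
```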
Also, it seems flash attention degrades the quality on long context. Otherwise, it's working great!
I reviewed the code AND approve it :3
@dranger003 I think it's the other way around and 'aa...' is the old one. You might need to update your local/cached HF model?
@iamlemec You got it. I did a refresh but forgot this model is gated and so the refresh didn't work without …
@iamlemec: I integrated your PR successfully into Kobold CPP Frankenstein this afternoon. It works like a charm with the Nemo Q5_K GGUF you shared.
This has been working well for me as well, though I had to use the same hack provided by @dranger003, as I downloaded the transformers model the day it came out. The responses were correct and coherent during my testing with around 10k context. If Nemo works well it'll fill the void left behind by Llama 2 13B, which was the sweet spot for mainstream users with 16-32GB of combined memory. Everyone else has been making models either in the 7B or 30B+ range.
Nemo has 128k context and llama.cpp by default tries to use all of that context. Does it work if you run with -c 4096 or something?
I have the same code and gguf file and …
Thank you very much netrunnereve and p-chops!
The answer I got was a complete fabrication.
@sbelenki hallucination happens with any model if it doesn't know the answer but is tuned/encouraged to give one. https://build.nvidia.com/nv-mistralai/mistral-nemo-12b-instruct produces the same answer, so this llama.cpp implementation seems to work as intended.
Their Hugging Face page recommends smaller temperatures for Nemo, so to keep it from confabulating too much, just try a lower temp.
Try it like this: ask "When did Justin Trudeau separate from his wife?" with "If you do not know, say you do not know." in the system prompt; that reduces hallucinations by about 99%.
Actually, I ran several tests on the model hosted on the official NVIDIA website, and what I landed on was telling it: "If you don't know the answer, it's OK to say 'I don't know'." It's like the model is pushed so hard you need to calm it down to reduce hallucinations 🤣🤣🤣
That's possible... what's interesting is that (with the same question) Llama 3 or Gemma 2 can say they don't know even without the sentence "If you do not know, say you do not know." So that should be added to the Mistral Nemo template as part of the system prompt.
Merge :) 👍
Updating quants for Mistral-Nemo-Instruct-2407, based on the latest merge: https://huggingface.co/QuantFactory/Mistral-Nemo-Instruct-2407-GGUF
llama.cpp works as intended, no doubt about that, and sorry for hijacking the thread. The NVIDIA model card mentions April 2024 as the knowledge cutoff date, and the Trudeau separation was all over the news, so it was weird to me that the model confabulated the whole thing, even with temp 0.0.
Is this also available in llama-cpp-python yet?
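For what it's worth, once a llama-cpp-python release bundles a llama.cpp build that includes this merge, usage should presumably just be the standard API; a sketch with placeholder model path, context size, and sampling values:

```python
from llama_cpp import Llama

# Assumes a llama-cpp-python build whose bundled llama.cpp includes this PR,
# and a locally downloaded Mistral Nemo GGUF (the path is a placeholder).
llm = Llama(
    model_path="./Mistral-Nemo-Instruct-2407-Q5_K_M.gguf",
    n_ctx=4096,  # Nemo advertises 128k context; keep it small to limit memory use
)

out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "If you do not know, say you do not know."},
        {"role": "user", "content": "When did Justin Trudeau separate from his wife?"},
    ],
    temperature=0.3,  # lower temperature, per the discussion above
)
print(out["choices"][0]["message"]["content"])
```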
See #8577 for Nemo discussion.
This addresses some fairly simple shape issues that arise with Mistral Nemo. Basically, in Nemo `n_embd_head` for attention is not the same as `n_embd` for the main embedding size. The model code in `build_llama` already allows for these two to be different, but the loader doesn't. The changes exactly mirror what was introduced recently for Gemma.

As for conversion, the only new thing is looking for `head_dim`. If that is not present, there is no change in conversion. If it is present, it is used to specify the key/value dimensions as well as the rope dimension.
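To illustrate the conversion side, here is a minimal sketch of the `head_dim` handling described above (the hparams key and GGUF writer method names are assumptions based on the existing gguf-py API, not a copy of this PR's code):

```python
# Sketch of the conversion logic described above: only act when "head_dim"
# is present in the model's config; otherwise conversion is unchanged.
def set_head_dim_metadata(hparams: dict, gguf_writer) -> None:
    head_dim = hparams.get("head_dim")
    if head_dim is None:
        return  # no head_dim: behave exactly as before

    # head_dim drives the key/value projection sizes and the rope dimension,
    # instead of deriving them from hidden_size / num_attention_heads.
    gguf_writer.add_key_length(head_dim)
    gguf_writer.add_value_length(head_dim)
    gguf_writer.add_rope_dimension_count(head_dim)
```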