
Mistral Nemo inference support (#8577) #8604

Merged
1 commit merged into ggerganov:master on Jul 22, 2024

Conversation

iamlemec (Collaborator) commented on Jul 20, 2024

See #8577 for Nemo discussion.

This addresses some fairly simple shape issues that arise with Mistral Nemo. Basically, in Nemo the attention head size n_embd_head is no longer derived from the main embedding size n_embd (i.e. it is not n_embd / n_head). The model code in build_llama already allows these two to be decoupled, but the loader doesn't. The changes exactly mirror what was recently introduced for Gemma.
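For a concrete sense of the mismatch, here is a quick back-of-envelope in Python using config values assumed from Nemo's config.json (illustrative, not taken from this PR):

# Assumed Mistral Nemo config values (illustrative, not part of this PR):
hidden_size = 5120          # n_embd
num_attention_heads = 32    # n_head
head_dim = 128              # n_embd_head, given explicitly in config.json

derived = hidden_size // num_attention_heads
print(derived)    # 160 -- what a loader gets if it assumes n_embd / n_head
print(head_dim)   # 128 -- what Nemo actually uses for the attention heads and rope

The same 128-vs-160 mismatch shows up later in this thread as an "invalid n_rot: 128, expected 160" load error on a stale build.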

As for conversion, the only new thing is looking for head_dim. If that is not present, there is no change in conversion. If it is present, it is used to specify the key/value dimensions as well as the rope dimension.
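A minimal sketch of what that head_dim handling can look like on the conversion side (illustrative only, not the exact PR diff; maybe_set_head_dim is a hypothetical helper name, and the writer methods are assumed from gguf-py's GGUFWriter):

def maybe_set_head_dim(hparams: dict, gguf_writer) -> None:
    # Hypothetical helper; in convert_hf_to_gguf.py this logic sits inside the model class.
    head_dim = hparams.get("head_dim")
    if head_dim is None:
        return  # no head_dim in config.json -> conversion behaves exactly as before
    # head_dim is present (e.g. Mistral Nemo): it drives the key/value widths and the rope dimension.
    gguf_writer.add_key_length(head_dim)
    gguf_writer.add_value_length(head_dim)
    gguf_writer.add_rope_dimension_count(head_dim)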

github-actions bot added the python (python script changes) label on Jul 20, 2024
dranger003 (Contributor) commented on Jul 20, 2024

Looks like the tokenizer hash changed from 63b97e4253352e6f357cc59ea5b583e3a680eaeaf2632188c2b952de2588485e to aa78fe8b04bc622b077520b1fb3d3a5c6f7a53dd375e2361e62599be3cf58de1.

if chkhsh == "63b97e4253352e6f357cc59ea5b583e3a680eaeaf2632188c2b952de2588485e":

https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407/commit/dac9c9e98f83322b32e32b48c118f079930772d6

WARNING:hf-to-gguf:**************************************************************************************
WARNING:hf-to-gguf:** WARNING: The BPE pre-tokenizer was not recognized!
WARNING:hf-to-gguf:**          There are 2 possible reasons for this:
WARNING:hf-to-gguf:**          - the model has not been added to convert_hf_to_gguf_update.py yet
WARNING:hf-to-gguf:**          - the pre-tokenization config has changed upstream
WARNING:hf-to-gguf:**          Check your model files and convert_hf_to_gguf_update.py and update them accordingly.
WARNING:hf-to-gguf:** ref:     https://github.com/ggerganov/llama.cpp/pull/6920
WARNING:hf-to-gguf:**
WARNING:hf-to-gguf:** chkhsh:  aa78fe8b04bc622b077520b1fb3d3a5c6f7a53dd375e2361e62599be3cf58de1
WARNING:hf-to-gguf:**************************************************************************************
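For context, the check above lives in get_vocab_base_pre in convert_hf_to_gguf.py; when the computed chkhsh matches a known branch, the warning is not printed. A recognized branch looks roughly like this sketch (the "tekken" pre-tokenizer name is assumed here, it is not stated in this thread):

if chkhsh == "63b97e4253352e6f357cc59ea5b583e3a680eaeaf2632188c2b952de2588485e":
    # ref: https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407
    res = "tekken"  # assumed pre-tokenizer name for Nemo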

Also, it seems flash attention degrades the quality on long context. Otherwise, it's working great!

ArthurPoland left a comment:

I reviewed the code AND approve it :3

iamlemec (Collaborator, Author)

@dranger003 I think it's the other way around and 'aa...' is the old one. You might need to update your local/cached HF model?

dranger003 (Contributor)

> @dranger003 I think it's the other way around and 'aa...' is the old one. You might need to update your local/cached HF model?

@iamlemec You got it. I did a refresh but forgot this model is gated and so the refresh didn't work without HF_TOKEN. Thanks!

Nexesenex (Contributor)

@iamlemec: I integrated your PR successfully into Kobold CPP Frankenstein this afternoon. It works like a charm with the Nemo Q5_K GGUF you shared.

netrunnereve (Contributor) commented on Jul 21, 2024

This has been working well for me as well, though I had to use the same hack provided by @dranger003 as I downloaded the transformers model the day it came out. The responses were correct and coherent during my testing with around 10k context.

If Nemo works well it'll fill the void left by Llama 2 13B, which was the sweet spot for mainstream users with 16-32 GB of combined memory. Everyone else has been making models either in the 7B or 30B+ range.

sbelenki commented on Jul 21, 2024

For some reason this PR is failing for me with a llama_model_load: error loading model: invalid n_rot: 128, expected 160 error.
This is happening with the latest from the iamlemec/llama.cpp fork and the mistral-nemo-instruct-q5_k.gguf model from CompendiumLabs/mistral-nemo-instruct-2407-gguf.

Edit:
Actually, after double-checking that I'm on the correct branch and cleaning the cache with ccache -c, I was able to build a version that runs the mistral-nemo-instruct-q5_k.gguf model successfully with llama-cli, but llama-server is still throwing some strange errors, like trying to allocate 120 GiB of memory:

ggml_cuda_host_malloc: failed to allocate 120000.00 MiB of pinned memory: out of memory
ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 125829120032

netrunnereve (Contributor)

> Actually, after double-checking that I'm on the correct branch and cleaning up the cache with ccache -c I was able to build the version that runs the mistral-nemo-instruct-q5_k.gguf model successfully using llama-cli, but llama-server is still throwing some strange errors, like trying to allocate 120 GiB of memory etc

Nemo has 128k context and llama.cpp by default tries to use all of that context. Does it work if you run with -c 4096 or something?
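For a rough sense of why the full context is so expensive, here is a back-of-envelope KV-cache calculation assuming Nemo's config (40 layers, 8 KV heads, head_dim 128) and an f16 cache; it does not claim to explain the exact 120 GiB allocation above, which also involves compute buffers that grow with context:

# Back-of-envelope f16 KV-cache size at the full 128k context (config values assumed, not from this PR):
n_ctx, n_layer, n_kv_heads, head_dim, f16_bytes = 131072, 40, 8, 128, 2
kv_bytes = 2 * n_layer * n_ctx * n_kv_heads * head_dim * f16_bytes  # leading 2 = keys + values
print(kv_bytes / 2**30)  # 20.0 GiB for the KV cache alone

With -c 4096 the same cache shrinks by a factor of 32, to roughly 0.6 GiB.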

p-chops commented on Jul 21, 2024

> [...] but llama-server is still throwing some strange errors, like trying to allocate 120 GiB of memory [...]

> Nemo has 128k context and llama.cpp by default tries to use all of that context. Does it work if you run with -c 4096 or something?

I have the same code and gguf file and -c 4096 got it working for me.

sbelenki commented on Jul 21, 2024

Thank you very much netrunnereve and p-chops!
Adding -c for llama-server solved the issue.
Although this model is hallucinating like crazy. To test its cutoff date I ran the following query:

user@user-desktop:~$ curl --request POST \
> --url http://localhost:8080/completion \
> --header "Content-Type: application/json" \
> --data '{"prompt": "When Justin Trudeau separated from his wife?","n_predict": 256}'

And got the following answer:

{"content":" The Canadian Prime Minister separated from his wife Sophie Grégoire Trudeau in 2004 after four years of marriage. However, they reconciled and remarried in 2005, and have been together since."...

That is a complete fabrication.

Edit: Changing temp to 0.0 didn't help with hallucinations.

grencez (Contributor) commented on Jul 21, 2024

@sbelenki hallucination happens with any model if it doesn't know the answer but is tuned/encouraged to give one. https://build.nvidia.com/nv-mistralai/mistral-nemo-12b-instruct produces the same answer, so this llama.cpp implementation seems to work as intended.

Green-Sky (Collaborator)

To quote their huggingface page:

> Unlike previous Mistral models, Mistral Nemo requires smaller temperatures. We recommend to use a temperature of 0.3.

So to keep it from confabulating too much, just try a lower temperature.
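For example, against the llama-server /completion endpoint used earlier in the thread, the temperature can be passed alongside the prompt. A minimal sketch in Python (it reuses the prompt and port from the earlier curl example and assumes the requests package is installed):

import requests

# Same endpoint and prompt as the earlier curl example, with the recommended lower temperature.
resp = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "When Justin Trudeau separated from his wife?",
        "n_predict": 256,
        "temperature": 0.3,  # Mistral's recommended setting for Nemo
    },
)
print(resp.json()["content"])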

mirek190 commented on Jul 21, 2024

> Although this model is hallucinating like crazy. [...] That is a complete fabrication. [...] Changing temp to 0.0 didn't help with hallucinations.

Try it like this:

When Justin Trudeau separated from his wife?
If you do not know tell you do not know.

"If you do not know tell you do not know." <-- that should be in the system prompt and reducing 99% hallucinations
More interesting is claiming knowledge was cut off in 2021.

LordFonDragon

> "If you do not know tell you do not know." <-- that should go in the system prompt; it cuts hallucinations by 99%. More interesting: it claims its knowledge cutoff was 2021.

Actually I ran several tests on the model hosted on the official NVIDIA website, and what I ended up telling it is: "If you don't know the answer, it's OK to say 'I don't know'." It's like the model is pushed so hard that you need to calm it down to reduce hallucinations 🤣🤣🤣

mirek190 commented on Jul 21, 2024

> Actually I ran several tests on the model hosted on the official NVIDIA website, and what I ended up telling it is: "If you don't know the answer, it's OK to say 'I don't know'." It's like the model is pushed so hard that you need to calm it down to reduce hallucinations 🤣🤣🤣

That's possible ... what's interesting is that (for the same question) llama 3 or gemma 2 can say they don't know even without adding the sentence "If you do not know tell you do not know." The way of training is very important.

So that should be added to the Mistral Nemo template as part of the system prompt.

thalesfsp

Merge :) 👍

ggerganov merged commit 50e0535 into ggerganov:master on Jul 22, 2024
55 checks passed
aashish-1904

Updating quants for Mistral-Nemo-Instruct-2407, based on the latest merge at: https://huggingface.co/QuantFactory/Mistral-Nemo-Instruct-2407-GGUF

sbelenki

> @sbelenki hallucination happens with any model if it doesn't know the answer but is tuned/encouraged to give one. https://build.nvidia.com/nv-mistralai/mistral-nemo-12b-instruct produces the same answer, so this llama.cpp implementation seems to work as intended.

llama.cpp works as intended, no doubt about that, and sorry for hijacking the thread. The NVIDIA model card lists April 2024 as the knowledge cutoff date, and the Trudeau separation was all over the news, so it was weird to me that the model confabulated the whole thing, even with temp 0.0.

weissenbacherpwc

Is this also available in llama-cpp-python yet?

offgridtech mentioned this pull request on Jul 27, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request on Jul 27, 2024