
Mistral Nemo inference support (#8577) #8604

Merged
1 commit merged into ggerganov:master on Jul 22, 2024

Conversation

iamlemec (Collaborator) commented on Jul 20, 2024

See #8577 for Nemo discussion.

This addresses some fairly simple shape issues that arise with Mistral Nemo. Basically, in Nemo the attention head size n_embd_head is no longer derived from the main embedding size n_embd (i.e. it is not n_embd / n_head). The model code in build_llama already allows these two to be decoupled, but the loader doesn't. The changes exactly mirror what was recently introduced for Gemma.
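For a concrete sense of the mismatch, here is a quick back-of-envelope in Python using config values assumed from Nemo's config.json (illustrative, not taken from this PR):

# Assumed Mistral Nemo config values (illustrative, not part of this PR):
hidden_size = 5120          # n_embd
num_attention_heads = 32    # n_head
head_dim = 128              # n_embd_head, given explicitly in config.json

derived = hidden_size // num_attention_heads
print(derived)    # 160 -- what a loader gets if it assumes n_embd / n_head
print(head_dim)   # 128 -- what Nemo actually uses for the attention heads and rope

The same 128-vs-160 mismatch shows up later in this thread as an "invalid n_rot: 128, expected 160" load error on a stale build.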

As for conversion, the only new thing is looking for head_dim. If that is not present, there is no change in conversion. If it is present, it is used to specify the key/value dimensions as well as the rope dimension.
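A minimal sketch of what that head_dim handling can look like on the conversion side (illustrative only, not the exact PR diff; maybe_set_head_dim is a hypothetical helper name, and the writer methods are assumed from gguf-py's GGUFWriter):

def maybe_set_head_dim(hparams: dict, gguf_writer) -> None:
    # Hypothetical helper; in convert_hf_to_gguf.py this logic sits inside the model class.
    head_dim = hparams.get("head_dim")
    if head_dim is None:
        return  # no head_dim in config.json -> conversion behaves exactly as before
    # head_dim is present (e.g. Mistral Nemo): it drives the key/value widths and the rope dimension.
    gguf_writer.add_key_length(head_dim)
    gguf_writer.add_value_length(head_dim)
    gguf_writer.add_rope_dimension_count(head_dim)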

github-actions bot added the python (python script changes) label on Jul 20, 2024
dranger003 (Contributor) commented on Jul 20, 2024

Looks like the tokenizer hash changed from 63b97e4253352e6f357cc59ea5b583e3a680eaeaf2632188c2b952de2588485e to aa78fe8b04bc622b077520b1fb3d3a5c6f7a53dd375e2361e62599be3cf58de1.

if chkhsh == "63b97e4253352e6f357cc59ea5b583e3a680eaeaf2632188c2b952de2588485e":

https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407/commit/dac9c9e98f83322b32e32b48c118f079930772d6

WARNING:hf-to-gguf:**************************************************************************************
WARNING:hf-to-gguf:** WARNING: The BPE pre-tokenizer was not recognized!
WARNING:hf-to-gguf:**          There are 2 possible reasons for this:
WARNING:hf-to-gguf:**          - the model has not been added to convert_hf_to_gguf_update.py yet
WARNING:hf-to-gguf:**          - the pre-tokenization config has changed upstream
WARNING:hf-to-gguf:**          Check your model files and convert_hf_to_gguf_update.py and update them accordingly.
WARNING:hf-to-gguf:** ref:     https://github.com/ggerganov/llama.cpp/pull/6920
WARNING:hf-to-gguf:**
WARNING:hf-to-gguf:** chkhsh:  aa78fe8b04bc622b077520b1fb3d3a5c6f7a53dd375e2361e62599be3cf58de1
WARNING:hf-to-gguf:**************************************************************************************
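For context, the check above lives in get_vocab_base_pre in convert_hf_to_gguf.py; when the computed chkhsh matches a known branch, the warning is not printed. A recognized branch looks roughly like this sketch (the "tekken" pre-tokenizer name is assumed here, it is not stated in this thread):

if chkhsh == "63b97e4253352e6f357cc59ea5b583e3a680eaeaf2632188c2b952de2588485e":
    # ref: https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407
    res = "tekken"  # assumed pre-tokenizer name for Nemo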

Also, it seems flash attention degrades the quality on long context. Otherwise, it's working great!

ArthurPoland left a comment:

I reviewed the code AND approve it :3

iamlemec (Collaborator, Author)

@dranger003 I think it's the other way around and 'aa...' is the old one. You might need to update your local/cached HF model?

dranger003 (Contributor)

> @dranger003 I think it's the other way around and 'aa...' is the old one. You might need to update your local/cached HF model?

@iamlemec You got it. I did a refresh but forgot this model is gated and so the refresh didn't work without HF_TOKEN. Thanks!

Nexesenex (Contributor)

@iamlemec: I integrated your PR successfully into Kobold CPP Frankenstein this afternoon. It works like a charm with the Nemo Q5_K GGUF you shared.

netrunnereve (Contributor) commented on Jul 21, 2024

This has been working well for me as well, though I had to use the same hack provided by @dranger003 as I downloaded the transformers model the day it came out. The responses were correct and coherent during my testing with around 10k context.

If Nemo works well it'll fill the void left by Llama 2 13B, which was the sweet spot for mainstream users with 16-32 GB of combined memory. Everyone else has been making models either in the 7B or 30B+ range.

sbelenki commented on Jul 21, 2024

For some reason this PR is failing for me with a llama_model_load: error loading model: invalid n_rot: 128, expected 160 error.
This is happening with the latest from the iamlemec/llama.cpp fork and the mistral-nemo-instruct-q5_k.gguf model from CompendiumLabs/mistral-nemo-instruct-2407-gguf.

Edit:
Actually, after double-checking that I'm on the correct branch and cleaning the cache with ccache -c, I was able to build a version that runs the mistral-nemo-instruct-q5_k.gguf model successfully with llama-cli, but llama-server is still throwing some strange errors, like trying to allocate 120 GiB of memory:

ggml_cuda_host_malloc: failed to allocate 120000.00 MiB of pinned memory: out of memory
ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 125829120032

netrunnereve (Contributor)

> Actually, after double-checking that I'm on the correct branch and cleaning up the cache with ccache -c I was able to build the version that runs the mistral-nemo-instruct-q5_k.gguf model successfully using llama-cli, but llama-server is still throwing some strange errors, like trying to allocate 120 GiB of memory etc

Nemo has 128k context and llama.cpp by default tries to use all of that context. Does it work if you run with -c 4096 or something?
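For a rough sense of why the full context is so expensive, here is a back-of-envelope KV-cache calculation assuming Nemo's config (40 layers, 8 KV heads, head_dim 128) and an f16 cache; it does not claim to explain the exact 120 GiB allocation above, which also involves compute buffers that grow with context:

# Back-of-envelope f16 KV-cache size at the full 128k context (config values assumed, not from this PR):
n_ctx, n_layer, n_kv_heads, head_dim, f16_bytes = 131072, 40, 8, 128, 2
kv_bytes = 2 * n_layer * n_ctx * n_kv_heads * head_dim * f16_bytes  # leading 2 = keys + values
print(kv_bytes / 2**30)  # 20.0 GiB for the KV cache alone

With -c 4096 the same cache shrinks by a factor of 32, to roughly 0.6 GiB.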

p-chops commented on Jul 21, 2024

> [...] but llama-server is still throwing some strange errors, like trying to allocate 120 GiB of memory [...]

> Nemo has 128k context and llama.cpp by default tries to use all of that context. Does it work if you run with -c 4096 or something?

I have the same code and gguf file and -c 4096 got it working for me.

sbelenki commented on Jul 21, 2024

Thank you very much netrunnereve and p-chops!
Adding -c for llama-server solved the issue.
Although this model is hallucinating like crazy. To test its cutoff date I ran the following query:

user@user-desktop:~$ curl --request POST \
> --url http://localhost:8080/completion \
> --header "Content-Type: application/json" \
> --data '{"prompt": "When Justin Trudeau separated from his wife?","n_predict": 256}'

And got the following answer:

{"content":" The Canadian Prime Minister separated from his wife Sophie Grégoire Trudeau in 2004 after four years of marriage. However, they reconciled and remarried in 2005, and have been together since."...

That is a complete fabrication.

Edit: Changing temp to 0.0 didn't help with hallucinations.

grencez (Contributor) commented on Jul 21, 2024

@sbelenki hallucination happens with any model if it doesn't know the answer but is tuned/encouraged to give one. https://build.nvidia.com/nv-mistralai/mistral-nemo-12b-instruct produces the same answer, so this llama.cpp implementation seems to work as intended.

Green-Sky (Collaborator)

To quote their huggingface page:

> Unlike previous Mistral models, Mistral Nemo requires smaller temperatures. We recommend to use a temperature of 0.3.

So to keep it from confabulating too much, just try a lower temperature.
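For example, against the llama-server /completion endpoint used earlier in the thread, the temperature can be passed alongside the prompt. A minimal sketch in Python (it reuses the prompt and port from the earlier curl example and assumes the requests package is installed):

import requests

# Same endpoint and prompt as the earlier curl example, with the recommended lower temperature.
resp = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "When Justin Trudeau separated from his wife?",
        "n_predict": 256,
        "temperature": 0.3,  # Mistral's recommended setting for Nemo
    },
)
print(resp.json()["content"])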

mirek190 commented on Jul 21, 2024

> Although this model is hallucinating like crazy. [...] That is a complete fabrication. [...] Changing temp to 0.0 didn't help with hallucinations.

Try it like this:

When Justin Trudeau separated from his wife?
If you do not know tell you do not know.

"If you do not know tell you do not know." <-- that should be in the system prompt and reducing 99% hallucinations
More interesting is claiming knowledge was cut off in 2021.

LordFonDragon

> "If you do not know tell you do not know." <-- that should go in the system prompt; it cuts hallucinations by 99%. More interesting: it claims its knowledge cutoff was 2021.

Actually I ran several tests on the model hosted on the official NVIDIA website, and what I ended up telling it is: "If you don't know the answer, it's OK to say 'I don't know'." It's like the model is pushed so hard that you need to calm it down to reduce hallucinations 🤣🤣🤣

mirek190 commented on Jul 21, 2024

> Actually I ran several tests on the model hosted on the official NVIDIA website, and what I ended up telling it is: "If you don't know the answer, it's OK to say 'I don't know'." It's like the model is pushed so hard that you need to calm it down to reduce hallucinations 🤣🤣🤣

That's possible ... what's interesting is that (for the same question) llama 3 or gemma 2 can say they don't know even without adding the sentence "If you do not know tell you do not know." The way of training is very important.

So that should be added to the Mistral Nemo template as part of the system prompt.

thalesfsp

Merge :) 👍

ggerganov merged commit 50e0535 into ggerganov:master on Jul 22, 2024
55 checks passed
aashish-1904

Updating quants for Mistral-Nemo-Instruct-2407, based on the latest merge at: https://huggingface.co/QuantFactory/Mistral-Nemo-Instruct-2407-GGUF

sbelenki

> @sbelenki hallucination happens with any model if it doesn't know the answer but is tuned/encouraged to give one. https://build.nvidia.com/nv-mistralai/mistral-nemo-12b-instruct produces the same answer, so this llama.cpp implementation seems to work as intended.

llama.cpp works as intended, no doubt about that, and sorry for hijacking the thread. The NVIDIA model card lists April 2024 as the knowledge cutoff date, and the Trudeau separation was all over the news, so it was weird to me that the model confabulated the whole thing, even with temp 0.0.

weissenbacherpwc

Is this also available in llama-cpp-python yet?

offgridtech mentioned this pull request on Jul 27, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request on Jul 27, 2024