
llama : add support for llama2.c models #2379

Closed
ggerganov opened this issue Jul 24, 2023 · 93 comments · Fixed by #2559
Labels: good first issue (Good for newcomers), help wanted (Extra attention is needed), model (Model specific)

Comments

@ggerganov
Owner

The new llama2.c project provides a way to train "baby" llama models, stored in a custom binary format, with 15M and 44M models already available and more potentially coming out soon.

We should provide a simple conversion tool from the llama2.c bin format to the ggml format, so that we can run inference on these models in llama.cpp.

Great task for people looking to get involved in the project

ggerganov added the help wanted, good first issue, and model labels on Jul 24, 2023
@jagtesh
Contributor

jagtesh commented Jul 24, 2023

I can take a stab at it. Been meaning to dive deeper into the GGML format.

Since convert.py only does the GGML conversion, and quantize is called explicitly for quantization, in theory only convert.py will need to be modified.

Would an existing model (HF/PyTorch) serve as a good starting point?

@slaren
Collaborator

slaren commented Jul 24, 2023

Trying to add this to convert.py may be overkill, and a lot harder than it needs to be. Writing a standalone script would probably be a lot easier.

The easiest to understand description of the file format is probably in the training example here:

void save_as_llama_model(struct llama_vocab * vocab, struct my_llama_model * model, const char * filename) {
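The llama2.c side of the conversion is simpler still: the model .bin starts with a small header of seven int32 hyperparameters, followed by the raw float32 weights. A minimal C++ sketch of reading that header (field names follow llama2.c's Config struct as it existed at the time; illustrative only, not the converter's actual code):

#include <cstdio>
#include <cstdint>

struct llama2c_config {
    int32_t dim;        // transformer dimension (n_embd)
    int32_t hidden_dim; // FFN hidden dimension (n_ff)
    int32_t n_layers;
    int32_t n_heads;
    int32_t n_kv_heads;
    int32_t vocab_size;
    int32_t seq_len;    // max context length
};

static bool read_llama2c_header(const char * path, llama2c_config & cfg) {
    FILE * f = std::fopen(path, "rb");
    if (!f) return false;
    const bool ok = std::fread(&cfg, sizeof(cfg), 1, f) == 1;
    std::fclose(f);
    return ok; // the rest of the file is the flat float32 weight data
}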

@Mistobaan

@ggerganov why not use the safetensors format? It seems far more practical than custom binary ggml formats.

@klosax
Collaborator

klosax commented Jul 25, 2023

@Mistobaan See this note in the spec of the upcoming gguf file format, gguf.md#why-not-other-formats, and PR ggerganov/ggml#302

@byte-6174
Contributor

I began a super-WIP (not yet fully functional) attempt at this here.
I will update as I go along, piecing together the corresponding variables between the two.

@jagtesh
Contributor

jagtesh commented Jul 27, 2023

@byte-6174 nice - I took a similar approach. I'm currently finding myself deep in the rabbit hole of converting llama2.c tensors, which use calloc-based memory allocation (with no metadata, AFAIK), to ggml_tensor.

Also, I don't quite understand yet why the vocab needs to be saved in the model when it is also available in an external file. In any case, I believe llama2.c uses the exact same format for the vocab.

I’ll end this update with a few words in the voice of Yoda, “Long journey, it is. Learn, I must.”
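For what it's worth, the core of that conversion is just allocating a ggml tensor of the right shape and copying the flat calloc'd buffer into it. A rough sketch (the function and its parameters are made up for illustration, not the eventual converter code):

#include "ggml.h"
#include <cstdint>
#include <cstring>

// Copy a flat float buffer from llama2.c (e.g. one layer's wq) into a 2D F32 ggml
// tensor. F32 ggml tensors are stored contiguously, so a single memcpy of
// n_in * n_out floats is enough; getting ne0/ne1 (and the per-layer offsets) right
// for what llama.cpp expects is the fiddly part.
static struct ggml_tensor * to_ggml_2d(struct ggml_context * ctx,
                                       const float * src, int64_t n_in, int64_t n_out) {
    struct ggml_tensor * t = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_in, n_out);
    std::memcpy(t->data, src, ggml_nbytes(t));
    return t;
}

// ctx comes from ggml_init() with a mem_size large enough to hold all the weights.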

@byte-6174
Contributor

Here.
Also, see the mapping that was required to figure out how to match up the variables 🙂

This reads the llama2.c model file and saves all the weights in a ggml-compatible tensor format.
Let me know how it works...

@byte-6174
Contributor

Re: the mapping - can someone with more experience with llama.cpp tensors point out how these RoPE tensors should be mapped? They are indicated with ? in the mapping md file above ☝️

@slaren
Collaborator

slaren commented Jul 29, 2023

Not 100% sure, but I believe these are precomputed lookup tables for RoPE; llama.cpp computes the rotations on the fly, so they are not necessary.

@byte-6174
Contributor

Right, I looked at the llama2.c code and it is indeed for RoPE; good to know it's not needed for llama.cpp, so I can remove it.
I now want to run this model - any pointers? Digging in now...

@ggerganov
Owner Author

You can run the most basic inference using: ./main -m converted-model.bin

@byte-6174
Contributor

byte-6174 commented Jul 30, 2023

Got it, it runs pretty well!
Perhaps now I can quantize it...
The output seems to contain non-English words... hmm.

It gives 359 tok/sec vs. llama2.c's ~100 tok/sec (perhaps not a fair comparison, though).

main: build = 909 (5a87675)
main: seed  = 1690733963
llama.cpp: loading model from abcd1.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 288
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 6
llama_model_load_internal: n_head_kv  = 6
llama_model_load_internal: n_layer    = 6
llama_model_load_internal: n_rot      = 48
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 1.0e-06
llama_model_load_internal: n_ff       = 768
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 0 (all F32)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.02 MB
llama_model_load_internal: mem required  =  395.13 MB (+    3.38 MB per state)
llama_new_context_with_model: kv self size  =    3.38 MB

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 200, n_keep = 0


 One day, Lily met a Shoggoth anyonefather whilem enjo but gl than the acc sn ahead nap very off ru sc things People table fasterer beet wants hiding anywheresklär customer al Contact round weeks sad sick someone somethingore tagBeil as attached where mine recommend ourselvesanalipirst leako each each that reck deal laterchy pair Then dar all blocking Whenrai accomplished yourself backrain this Hamb dr slehetableView short b looked alsoch goal d who down down downnight check out bird calows helper home walliling patch agree je either of whole cl heavily visited visit p up up up up up off headD vCODEBack Withoutotoss timeys pat On new warm compared corner kitcheniously front раз happenionungsplaceĕiveness belong slower run running toten replace a away, MangWe these mention topakesFound herself down if courageished facts rocksepar hear oneessU as b A meanshed turn I Or laugh save without out was destroyunt doA bur byണ University shop' the chance alone alone alone alone someone particular
llama_print_timings:        load time =   115.33 ms
llama_print_timings:      sample time =   143.12 ms /   200 runs   (    0.72 ms per token,  1397.48 tokens per second)
llama_print_timings: prompt eval time =    10.21 ms /    12 tokens (    0.85 ms per token,  1175.20 tokens per second)
llama_print_timings:        eval time =   409.92 ms /   199 runs   (    2.06 ms per token,   485.46 tokens per second)
llama_print_timings:       total time =   581.63 ms

@byte-6174
Contributor

Hmm, one other possible cause of the "non-English" words could be that the vocabs don't match.

@ggerganov
Owner Author

Nah, something else is wrong. First, try adding -eps 1e-5 to match the RMS norm implementation.

@byte-6174
Contributor

The default run above has eps = 1e-6; this 👇 is with 1e-5 as you suggested:

main: build = 909 (5a87675)
main: seed  = 1690736050
llama.cpp: loading model from abcd1.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 288
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 6
llama_model_load_internal: n_head_kv  = 6
llama_model_load_internal: n_layer    = 6
llama_model_load_internal: n_rot      = 48
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 1.0e-05
llama_model_load_internal: n_ff       = 768
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 0 (all F32)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.02 MB
llama_model_load_internal: mem required  =  395.13 MB (+    3.38 MB per state)
llama_new_context_with_model: kv self size  =    3.38 MB

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 200, n_keep = 0


 One day, Lily met a Shoggothvenatience quicker d looks on des glassappendoredpping early=( along lap ji our Your back of older shwn which happen out tried valleives world swins stessa chairaringtain takes one vendinganceric butYesmy set choiceveryely Next crack each each that everywhereph front downfжда fought feelions enjoy surezy sho r straightstraßeruityoll fesewer closed two v homes wider restaurant fed goerg free you kA basedShowTherevelopeiestpl too came glolass happ thoughtrefixomyber becomingselves keptored by tw belonged home stad watch hop exchange guys simplest situation Today looked on westWhere the disappawvenite tall sameve 'lowrownyn Tom they visitors card block laughed it continued decision anywhereMe justero fresh squ fastert whoseWhere herself startches overOfass bet'urr meps precedsedoc inspired about describe sharing l saidraiслав between silentselves warm thisEV eight plateing wish knewizeS different pop any anything burn graO response
llama_print_timings:        load time =    43.53 ms
llama_print_timings:      sample time =   143.49 ms /   200 runs   (    0.72 ms per token,  1393.83 tokens per second)
llama_print_timings: prompt eval time =     5.99 ms /    12 tokens (    0.50 ms per token,  2003.34 tokens per second)
llama_print_timings:        eval time =   566.60 ms /   199 runs   (    2.85 ms per token,   351.22 tokens per second)
llama_print_timings:       total time =   734.73 ms

@byte-6174
Contributor

I'm printing the model->norm that is saved in ggml vs. w->rms_final_weight from llama2.c, and they seem to match.

llama2.c first 5 elements of w->rms_final_weight >>
7.676849 7.187980 9.270302 6.815886 7.080070

vs. ggml's model->norm >>
7.676849 7.187980 9.270302 6.815886 7.080070
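(For reference, a spot check like this just dumps the first few floats of an F32 ggml tensor next to the corresponding llama2.c array - a sketch, with illustrative names:)

#include "ggml.h"
#include <cstdio>

// print the first n values of an F32 ggml tensor, for eyeballing against
// the matching llama2.c array (e.g. w->rms_final_weight)
static void print_head(const struct ggml_tensor * t, int n) {
    const float * v = (const float *) t->data;
    for (int i = 0; i < n; i++) std::printf("%f ", v[i]);
    std::printf("\n");
}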

@ggerganov
Owner Author

We haven't run any F32 models with llama.cpp yet, so it is possible that there is a bug, specific to the F32 format, that we haven't observed. To rule this out, try converting the model to F16 with the following command:

./quantize abcd1.bin abcd1-f16.bin f16

And see if the new abcd1-f16.bin model also outputs nonsense.

@byte-6174
Contributor

Yes, still nonsense. I'm currently investigating how the FF weights w1, w2, w3 are laid out in memory.
In llama2.c we have:
w1 --- layer x hidden_dim x dim
w2 --- layer x dim x hidden_dim
w3 --- layer x hidden_dim x dim

Looking more closely to see if I'm making a mistake in putting them into the tensors in the right order...

@byte-6174
Contributor

Aah! I found a bug: I was not using the right multiplier when reshaping the 1D arrays from llama2.c into 2D arrays in ggml!
The output now looks much better and is comparable to what we get from llama2.c!
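For anyone following along, the fix boils down to using the right per-layer offset when slicing llama2.c's flat weight arrays. A sketch of the indexing (variable names follow llama2.c; the helper itself is hypothetical):

#include <cstddef>

// llama2.c stores e.g. w1 as one flat array of n_layers * hidden_dim * dim floats,
// so the slice for layer l starts at l * hidden_dim * dim. Using the wrong multiplier
// (say, l * dim * dim for every tensor) silently picks up the wrong weights -- exactly
// the kind of bug described above.
inline const float * layer_slice(const float * base, int l, size_t rows, size_t cols) {
    return base + (size_t) l * rows * cols;
}

// e.g. const float * w1_l = layer_slice(w->w1, l, hidden_dim, dim); // [hidden_dim x dim]
//      const float * w2_l = layer_slice(w->w2, l, dim, hidden_dim); // [dim x hidden_dim]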

(py38) ➜  llama.cpp git:(master) ✗ ./main -m abcd1.bin -p "One day, Lily met a Shoggoth" -n 200
main: build = 909 (5a87675)
main: seed  = 1690770877
llama.cpp: loading model from abcd1.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 288
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 6
llama_model_load_internal: n_head_kv  = 6
llama_model_load_internal: n_layer    = 6
llama_model_load_internal: n_rot      = 48
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 1.0e-06
llama_model_load_internal: n_ff       = 768
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 0 (all F32)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.02 MB
llama_model_load_internal: mem required  =  395.13 MB (+    3.38 MB per state)
llama_new_context_with_model: kv self size  =    3.38 MB

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 200, n_keep = 0


 One day, Lily met a Shoggoth. It was big and shiny and very original. Lily asked the Shoggamkeeper what it was.
"It's a special kind of toy," the Shter master said. "But it's mine, not yours."
Lily touched the shirt and said, "Please take care of it?"
The Shapeububber was very happy to have this special toy. He granted Lily some money and told her to do as he asked.
Lily thanked him and took the shirt home with her. She looked at it and saw that it had a big number 1 on it. She was so excited!
"Thank you for saving me," she said to the Shapehanger. "I will take good care of this toy from now on."
The Shapehub smiled as he watched Lily keep her special toy. Once upon a time, there was a little girl named L
llama_print_timings:        load time =    64.88 ms
llama_print_timings:      sample time =   143.15 ms /   200 runs   (    0.72 ms per token,  1397.16 tokens per second)
llama_print_timings: prompt eval time =     9.08 ms /    12 tokens (    0.76 ms per token,  1321.59 tokens per second)
llama_print_timings:        eval time =   340.74 ms /   199 runs   (    1.71 ms per token,   584.03 tokens per second)
llama_print_timings:       total time =   510.47 ms

@byte-6174
Contributor

and here with quantization:

(py38) ➜  llama.cpp git:(master) ✗ ./main -m abcd1-f16.bin -p "One day, Lily met a Shoggoth" -n 500
main: build = 909 (5a87675)
main: seed  = 1690770940
llama.cpp: loading model from abcd1-f16.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 288
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 6
llama_model_load_internal: n_head_kv  = 6
llama_model_load_internal: n_layer    = 6
llama_model_load_internal: n_rot      = 48
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 1.0e-06
llama_model_load_internal: n_ff       = 768
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 1 (mostly F16)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.02 MB
llama_model_load_internal: mem required  =  348.58 MB (+    3.38 MB per state)
llama_new_context_with_model: kv self size  =    3.38 MB

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 500, n_keep = 0


 One day, Lily met a Shoggoth. It was very small and shiny and had many buttons on it. Lily liked the shirt and smiled at the shaving Shog.
"Wow," she said. "That's an unusual shirt. Can I try it?"
The Shapey said, "No, this is my shirt. You can't touch it. It's mine."
Lily was sad and angry. She tried to take the shirt from the Shoggborn. The Shogocin fell off the shirt and rolled away. Lily chased after him.
"Stop, Shog," she said. "You are mean. You can't have my shirt."
The Shogy heard her and felt bad. He got up from his bed and walked to Lily. He licked her face and wagged his tail.
Lily was happy and surprised. She hugged the Shogen and said, "Thank you for being my friend. You are very nice."
The Shirt gasped and smiled. It said, "You're welcome. I'm glad you like it. But now, let's go back to your shirt. It has an unusual pattern on it. Do you know what that means?"
Lily looked at the label. It was a bit strange. She did not know what that meant, but she said, "Thank you."
The Shoggrow on its skin and tail. The shirt is very funny. But no one looks like a monster. Everyone looks different. Lily is a lot of their mom. She is their mom'mightyious. They were her dull-iky.
"This one-shard. Shyebe was ankant. Shady. She shaped with shaped. Once upon skin. Heelfully she had three arms. The shirt. She was very ador, itchy. "I amisraichy shiny. Itchy Beary Cariosiness was ugly. icy.
M-els belonged toys. Shady things. Sharing and shirt. Sharpylyaterighter. Her nameate. py. icy eyes. Heby. The shady face.
Shutry. It.
Shady.
The Shy Shadow
llama_print_timings:        load time =    68.85 ms
llama_print_timings:      sample time =   357.98 ms /   500 runs   (    0.72 ms per token,  1396.71 tokens per second)
llama_print_timings: prompt eval time =     4.65 ms /    12 tokens (    0.39 ms per token,  2579.54 tokens per second)
llama_print_timings:        eval time =   698.48 ms /   499 runs   (    1.40 ms per token,   714.41 tokens per second)
llama_print_timings:       total time =  1106.33 ms

@ggerganov
Owner Author

ggerganov commented Jul 31, 2023

Great! You should use a context size of 256 (-c 256) to match the OG model. You can also try Q8_0 quantization. And don't forget the epsilon.
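(Producing the Q8_0 file presumably mirrors the F16 conversion above; the exact invocation below is a guess, not taken from the thread.)

./quantize abcd1.bin abcd1-Q8_0.bin q8_0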

@byte-6174
Contributor

sure-

(py38) ➜  llama.cpp git:(master) ✗ ./main -m abcd1-Q8_0.bin -p "One day, Lily met a Shoggoth" -n 500 -c 256 -eps 1e-5
main: build = 909 (5a87675)
main: seed  = 1690809457
llama.cpp: loading model from abcd1-Q8_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 256
llama_model_load_internal: n_embd     = 288
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 6
llama_model_load_internal: n_head_kv  = 6
llama_model_load_internal: n_layer    = 6
llama_model_load_internal: n_rot      = 48
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 1.0e-05
llama_model_load_internal: n_ff       = 768
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 7 (mostly Q8_0)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.02 MB
llama_model_load_internal: mem required  =  310.76 MB (+    1.69 MB per state)
llama_new_context_with_model: kv self size  =    1.69 MB

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 256, n_batch = 512, n_predict = 500, n_keep = 0


 One day, Lily met a Shoggoth comy all all again be at-its,- for the me and with here long, and and b and away and alert out to in your away in so alone d and very fast very hard happy you me me me me me off me me I alone only time and fast fast on his a on the mention fast and whoions followswards fast roomsaker touch carefulroom learned work the a with a an a her happ “ the the the back himself his quickly andum the the the the his cast the over to child while grass he me together fast the in after and and me firsts away andocim dust the her at Princess.ch," again before you' upon- the it you’- she before. in all a a a a a her a, at to a a long away fast them a very very very so once and to and very far right to- a a her me me me me really herchow long hard alone alone alone in so from me me out out fast fast the it a  very very very with in farly bigger steals water and pain away one the the too about fish his care revvel th raw in his firsts first by into life to it upon for and- the no time c a tooowow,
llama_print_timings:        load time =    60.02 ms
llama_print_timings:      sample time =   354.44 ms /   500 runs   (    0.71 ms per token,  1410.66 tokens per second)
llama_print_timings: prompt eval time =    66.42 ms /   399 tokens (    0.17 ms per token,  6007.41 tokens per second)
llama_print_timings:        eval time =   560.68 ms /   496 runs   (    1.13 ms per token,   884.64 tokens per second)
llama_print_timings:       total time =  1025.19 ms

@byte-6174
Contributor

Just focusing on timing a bit: using -t 4 instead of the default 8 seems to be better :)
command:
./main -m abcd1-Q8_0.bin -p "One day, Lily met a Shoggoth" -n 200 -c 256 -eps 1e-5 -t 4

model time
abcd1.bin 392.16ms
abcd1-f16.bin 312.78ms
abcd1-Q8_0.bin 253.47ms

@ggerganov
Owner Author

The Q8_0 generation looks broken. Either it does not have enough precision somehow or there is still some lingering issue

@byte-6174
Contributor

Hmm, you mean that, as far as we can judge from the output, some words look random... yes?

@byte-6174
Contributor

Also - perhaps not relevant, perhaps it is - llama2.c uses the precomputed RoPE vectors, which we are ignoring, so there is that difference.

@ggerganov
Owner Author

The F16 output seems OK up to 256 tokens which means it's probably not related to RoPE.

@klosax
Collaborator

klosax commented Aug 8, 2023

@byte-6174
Contributor

I can add a readme with some instructions and a summary of our findings and send a PR.

@ggerganov
Owner Author

That would be great!
Just usage instructions would be nice. No need to analyze the results yet.

klosax linked a pull request on Aug 8, 2023 that will close this issue
@saltyduckegg

Hello! I am trying to convert my llama2.c models to ggml,
but it looks like it needs a vocab file. How can I get it?

@saltyduckegg

The tokenizer.bin was trained by myself.

@klosax
Collaborator

klosax commented Aug 10, 2023

Try setting --vocab-model to a working llama2 ggml model, not a tokenizer file. I think the vocab will be copied from the model file.

@byte-6174
Contributor

Just sent another update that should fix some of the issues. In this conversion, we are using the vocabulary file available at models/ggml-vocab.bin.

@byte-6174
Contributor

@saltyduckegg we are using the vocab model available in the llama.cpp repository. Please use that instead and let me know if it works for you.

@saltyduckegg

Thank you for your help,
let me try it.

@saltyduckegg

It can indeed run, but this is not what I want. It seems to have messed up the character encoding; of course, that seems to be because this is not the encoding table I used for training.

$ ./bin/convert-llama2c-to-ggml --copy-vocab-from-model ./models/ggml-vocab.bin    --llama2c-model  ../../llama2.c.xs/out/model.bin   --llama2c-output-model ./xss
[malloc_weights:AK] Allocating [8000] x [288] = [2304000] float space for w->token_embedding_table
[malloc_weights:AK] Allocating [6] x [288] = [1728] float space for w->rms_att_weight
[malloc_weights:AK] Allocating [6] x [288] = [1728] float space for w->rms_ffn_weight
[malloc_weights:AK] Allocating [6] x [288] x [288] = [497664] float space for w->wq
[malloc_weights:AK] Allocating [6] x [288] x [288] = [497664] float space for w->wk
[malloc_weights:AK] Allocating [6] x [288] x [288] = [497664] float space for w->wv
[malloc_weights:AK] Allocating [6] x [288] x [288] = [497664] float space for w->wo
[malloc_weights:AK] Allocating [6] x [768] x [288] = [1327104] float space for w->w1
[malloc_weights:AK] Allocating [6] x [288] x [768] = [1327104] float space for w->w2
[malloc_weights:AK] Allocating [6] x [768] x [288] = [1327104] float space for w->w3
[malloc_weights:AK] Allocating [288] float space for w->rms_final_weight
llama.cpp: loading model from ./models/ggml-vocab.bin
llama_model_load_internal: format     = ggjt v1 (pre #1405)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_head_kv  = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 1 (mostly F16)
llama_model_load_internal: model size = 7B
print_params: n_vocab: 8000
print_params: n_ctx:   128
print_params: n_embd:  288
print_params: n_mult:  32
print_params: n_head:  6
print_params: n_ff:    768
print_params: n_layer: 6
print_params: n_rot:   48
[init_model:GG] Allocating [288] x [8000] = [2304000] float space for model->tok_embeddings
[init_model:GG] Allocating [288] float space for model->norm
[init_model:GG] Allocating [288] x[8000] = [2304000] float space for model->output
[init_model:GG] Allocating [288] x[288] = [82944] float space for layer.wq for [6] layers
[init_model:GG] Allocating [288] x[288] = [82944] float space for layer.wk for [6] layers
[init_model:GG] Allocating [288] x[288] = [82944] float space for layer.wv for [6] layers
[init_model:GG] Allocating [288] x[288] = [82944] float space for layer.wo for [6] layers
[init_model:GG] Allocating [288] float space for layer.ffn_norm for [6] layers
[init_model:GG] Allocating [768] x[288] = [221184] float space for layer.w1 for [6] layers
[init_model:GG] Allocating [288] x[768] = [221184] float space for layer.w2 for [6] layers
[init_model:GG] Allocating [768] x[288] = [221184] float space for layer.w3 for [6] layers
Saving llama.c model file ../../llama2.c.xs/out/model.bin in ggml format at ./xss


./bin/main -m ./xss -p "One day, Lily met a Shoggoth" -n 500 -c 256 -eps 1e-5
main: build = 0 (unknown)
main: seed  = 1691677573
llama.cpp: loading model from ./xss
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 8000
llama_model_load_internal: n_ctx      = 256
llama_model_load_internal: n_embd     = 288
llama_model_load_internal: n_mult     = 32
llama_model_load_internal: n_head     = 6
llama_model_load_internal: n_head_kv  = 6
llama_model_load_internal: n_layer    = 6
llama_model_load_internal: n_rot      = 48
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 1.0e-05
llama_model_load_internal: n_ff       = 768
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 0 (all F32)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.02 MB
llama_model_load_internal: mem required  =   40.39 MB (+    1.69 MB per state)
llama_new_context_with_model: kv self size  =    1.69 MB
llama_new_context_with_model: compute buffer total size =    9.44 MB

system_info: n_threads = 28 / 56 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 256, n_batch = 512, n_predict = 500, n_keep = 0


 One day, Lily met a Shoggoth 
                              pol$деIg!I Thtem»iREenol¹ban³ol¹ourcegetah uolination 
                                                                                   iel o¤ Eldi elitionaft5oph trad0ow primÿject$this }, here$omsجintANother°weenplheitrelags󄟯fficort)
                                                                                                                                                                                      сеl М Classст ExIҹban)ace el5 кener:UPcketzT synWji pos8 leoh preio,ohid¯h pre sayingjs pos4 Ar reallyideiooh Alunicipace -ordво posaceaceRE§6CǠpourservice can.ͺ:ÿcheckistsT¯ pos4 ut le but 
                                                                   甌L pos04 pos;ڡm pos- Mus Mus
                                                                                                P
                                                                                                 	
                                                                                                         (Tython֗ avecnelowias5Sдо&ER proiaftamground include u@󟤬Oobanle and]
                                                                                                                                                                             
                                                                                                                                                                            urce eну canKϬ,0/ donen
                                                                                                                                                                                                   > il¯ys8
                                                                                                                                                                                                          Μ, lineSڱ pos0 posI \(Connectionج turnHzo de posblockiowow	buttonсо,amp decject$деI everyfigource
                                                                                                                                                                                                                                                                                                              end + lookob d prim,def ilfrcheW:¼¬, line¯0itemizeler
                                                   olдеI equ Ob$ FERдеI¤amground筥 Viother
                                                                                          dfigublicampSڥrtDO voor Newunk"άerStream0¯qWrit sym

llama_print_timings:        load time =    11.41 ms
llama_print_timings:      sample time =    89.09 ms /   500 runs   (    0.18 ms per token,  5612.62 tokens per second)
llama_print_timings: prompt eval time =    69.81 ms /   399 tokens (    0.17 ms per token,  5715.68 tokens per second)
llama_print_timings:        eval time =  3626.38 ms /   496 runs   (    7.31 ms per token,   136.78 tokens per second)
llama_print_timings:       total time =  3822.12 ms

@jrudolph
Contributor

jrudolph commented Aug 10, 2023

I created a PR against #2559 to support loading the llama2.c vocabulary that might help you, @saltyduckegg, if you created your own vocabulary: https://github.com/byte-6174/llama.cpp/pull/1/files

The --copy-vocab-from-model argument now also works with tokenizer.bin (or whatever you called it) when exported from the llama2.c scripts.
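For context, llama2.c's tokenizer.bin is a plain binary dump of the vocabulary: per token, a float score followed by a length-prefixed UTF-8 string, with (in newer llama2.c versions) a small max-token-length header up front. A rough C++ sketch of reading it under those assumptions - check llama2.c's tokenizer export for the authoritative layout:

#include <cstdio>
#include <cstdint>
#include <string>
#include <vector>

struct llama2c_vocab {
    std::vector<std::string> tokens;
    std::vector<float>       scores;
};

// Assumed layout (verify against llama2.c): an int32 max_token_length header, then
// vocab_size entries of { float32 score, int32 len, len bytes of UTF-8 }.
static bool read_llama2c_tokenizer(const char * path, int vocab_size, llama2c_vocab & out) {
    FILE * f = std::fopen(path, "rb");
    if (!f) return false;
    int32_t max_token_length = 0;
    if (std::fread(&max_token_length, sizeof(max_token_length), 1, f) != 1) { std::fclose(f); return false; }
    for (int i = 0; i < vocab_size; i++) {
        float   score = 0.0f;
        int32_t len   = 0;
        if (std::fread(&score, sizeof(score), 1, f) != 1) { std::fclose(f); return false; }
        if (std::fread(&len,   sizeof(len),   1, f) != 1) { std::fclose(f); return false; }
        std::string tok(len, '\0');
        if (len > 0 && std::fread(&tok[0], 1, len, f) != (size_t) len) { std::fclose(f); return false; }
        out.tokens.push_back(std::move(tok));
        out.scores.push_back(score);
    }
    std::fclose(f);
    return true;
}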

@saltyduckegg

Cool!
It seems to work: it successfully loaded the tokenizer model and converted my model to ggml format. But I encountered an error that I could not understand when later running main.

$ ./bin/convert-llama2c-to-ggml --copy-vocab-from-model ../../llama2.c.xs/tokenizer.bin    --llama2c-model  ../../llama2.c.xs/out/model.bin   --llama2c-output-model ./xss  
[malloc_weights:AK] Allocating [8000] x [288] = [2304000] float space for w->token_embedding_table
[malloc_weights:AK] Allocating [6] x [288] = [1728] float space for w->rms_att_weight
[malloc_weights:AK] Allocating [6] x [288] = [1728] float space for w->rms_ffn_weight
[malloc_weights:AK] Allocating [6] x [288] x [288] = [497664] float space for w->wq
[malloc_weights:AK] Allocating [6] x [288] x [288] = [497664] float space for w->wk
[malloc_weights:AK] Allocating [6] x [288] x [288] = [497664] float space for w->wv
[malloc_weights:AK] Allocating [6] x [288] x [288] = [497664] float space for w->wo
[malloc_weights:AK] Allocating [6] x [768] x [288] = [1327104] float space for w->w1
[malloc_weights:AK] Allocating [6] x [288] x [768] = [1327104] float space for w->w2
[malloc_weights:AK] Allocating [6] x [768] x [288] = [1327104] float space for w->w3
[malloc_weights:AK] Allocating [288] float space for w->rms_final_weight
Assuming llama2.c vocabulary since ../../llama2.c.xs/tokenizer.bin is not a ggml file
print_params: n_vocab: 8000
print_params: n_ctx:   128
print_params: n_embd:  288
print_params: n_mult:  32
print_params: n_head:  6
print_params: n_ff:    768
print_params: n_layer: 6
print_params: n_rot:   48
[init_model:GG] Allocating [288] x [8000] = [2304000] float space for model->tok_embeddings
[init_model:GG] Allocating [288] float space for model->norm
[init_model:GG] Allocating [288] x[8000] = [2304000] float space for model->output
[init_model:GG] Allocating [288] x[288] = [82944] float space for layer.wq for [6] layers
[init_model:GG] Allocating [288] x[288] = [82944] float space for layer.wk for [6] layers
[init_model:GG] Allocating [288] x[288] = [82944] float space for layer.wv for [6] layers
[init_model:GG] Allocating [288] x[288] = [82944] float space for layer.wo for [6] layers
[init_model:GG] Allocating [288] float space for layer.ffn_norm for [6] layers
[init_model:GG] Allocating [768] x[288] = [221184] float space for layer.w1 for [6] layers
[init_model:GG] Allocating [288] x[768] = [221184] float space for layer.w2 for [6] layers
[init_model:GG] Allocating [768] x[288] = [221184] float space for layer.w3 for [6] layers
Saving llama.c model file ../../llama2.c.xs/out/model.bin in ggml format at ./xss


$ ./bin/main -m ./xss -p "One day, Lily met a Shoggoth" -n 500 -c 256 -eps 1e-5
main: build = 0 (unknown)
main: seed  = 1691678842
llama.cpp: loading model from ./xss
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 8000
llama_model_load_internal: n_ctx      = 256
llama_model_load_internal: n_embd     = 288
llama_model_load_internal: n_mult     = 32
llama_model_load_internal: n_head     = 6
llama_model_load_internal: n_head_kv  = 6
llama_model_load_internal: n_layer    = 6
llama_model_load_internal: n_rot      = 48
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 1.0e-05
llama_model_load_internal: n_ff       = 768
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 0 (all F32)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.02 MB
llama_model_load_internal: mem required  =   40.39 MB (+    1.69 MB per state)
llama_new_context_with_model: kv self size  =    1.69 MB
llama_new_context_with_model: compute buffer total size =    9.44 MB

system_info: n_threads = 28 / 56 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
terminate called after throwing an instance of 'std::out_of_range'
  what():  _Map_base::at
Aborted (core dumped)


@jrudolph
Contributor

Yep, something seems broken :)

Can you put your model somewhere? Or run with gdb to get a stack trace or put the core somewhere to investigate?

@saltyduckegg

I'm very sorry for taking so long: I found that my model file is larger than 25 MB and cannot be attached directly on GitHub, so I uploaded it to Hugging Face. This is a mini model trained on Chinese dialogue data.
https://huggingface.co/segg/mymini_llama/tree/main


@byte-6174
Contributor

I just tested it with @jrudolph's update, and for all 3 models the llama2.c vocab binary can optionally be used. I will send another update to the PR with this.

@byte-6174
Contributor

byte-6174 commented Aug 10, 2023

@saltyduckegg I tried running your model from Hugging Face with the llama2.c repo and it gives me the following. Are you able to get good output when you use the llama2.c repo with this model?

./run ~/Downloads/model.bin -t 0.8 -n 256 -i "One day, Lily met a Shoggoth"

One day, Lily met a ShoggothWoudки(
	 gar<unk>                  * Les argp:Ä Januarysare8 Liemer
                 shith
                      <unk>
                            "
                              Name
                                  ropemsJ
                                         sch
                                                      <unk>
                                                           ning
                                                                st
                                                                   cert
                                                                       Interz again places
                                                                                          �
                                                                                           éead
                                                                                                )
achieved tok/s: 1102.564103

@saltyduckegg

saltyduckegg commented Aug 10, 2023

I can get better output. I use my tokenizer.bin:

$ ./run out/model.bin -t 0.8 -n 256 -i "instruction"
instruction (1 moo n che. s" haited \n [9 9 36 lodor t\ns s sadots a VCIv ad \n\n00\n ' 如果 \n 29 25 \n mon. 好的, Chn my a lis _mo ` ner Z in ptrolus in list)\n```\ndsdit by \n```\nY\n```\n# 最近太科学的,一根方不<unk>。```', 'input': '', 'output': '\n\n2022年 红色 \n', 'input': '', 'output': '\n\n\n\n\n 当然,他在我,Sanal n ct 非常感谢,我一地成功地为 这家餐厅的。我我是一个一些。小明: 却要是,你还在我在我需要注意的,你是我很文学,我被跟”与,但是我们您保持科技来。\n\n基于以上这段文本,能够降低思考
achieved tok/s: 298.245614

My result is as expected; at least some of it consists of sentences I can understand. Your result looks somewhat like output produced with an incorrect tokenizer.bin.


@byte-6174
Contributor

No, I cannot reproduce your output with your model and the stock llama2.c run; your model seems to have some corruption.

 ./run ~/Downloads/model.bin -t 0.8 -n 256 -i "instruction"

instruct oper(2on ID ChristG sem stattr<unk>�MU added radener getsQU<unk<unk>od
                                                                               <unk>roidorn�(
                                                                                              float
 databaseatJ<unkures i<unk>ENv<unk>
inkireroState
             <unk><unk><unk>Conthú
 esøàblemough
	v(\asedy/ <unk>plate
:ch direc<unk>k ifÎ*/com
                         allow und Willpgr�curityarily website
                                                              O¶
ore�5,Äano That$ transõdf заThe П�+
achieved tok/s: 913.636364

@jrudolph
Contributor

No, it's a problem with main.cpp: it expects that it can tokenize the instruction prefix/suffix and the newline, but the vocabulary does not include the tokens needed for them (and they are also not needed in non-instruct mode).

Backtrace
system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
terminate called after throwing an instance of 'std::out_of_range'
  what():  _Map_base::at

Program received signal SIGABRT, Aborted.
__pthread_kill_implementation (no_tid=0, signo=6, threadid=140737352595264) at ./nptl/pthread_kill.c:44
44	./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140737352595264) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140737352595264) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140737352595264, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007ffff7842476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ffff78287f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007ffff7ca2b9e in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007ffff7cae20c in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007ffff7cae277 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007ffff7cae4d8 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007ffff7ca54a0 in std::__throw_out_of_range(char const*) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x00005555555c9a2c in std::__detail::_Map_base<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, int>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, int> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true>, true>::at (this=0x55555571f398, __k="\n")
    at /usr/include/c++/11/bits/hashtable_policy.h:776
#11 0x00005555555c2a0f in std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, int> > >::at (this=0x55555571f398, __k="\n") at /usr/include/c++/11/bits/unordered_map.h:1001
#12 0x00005555555bfd10 in llama_tokenizer::tokenize (this=0x7fffffffbea0, text="\n\n### Instruction:\n\n", output=std::vector of length 1, capacity 1 = {...}) at llama.cpp:2024
#13 0x00005555555ac971 in llama_tokenize (vocab=..., text="\n\n### Instruction:\n\n", bos=true) at llama.cpp:2077
#14 0x00005555555b658a in llama_tokenize_with_model (model=0x55555571f2b0, text=0x5555557b3c30 "\n\n### Instruction:\n\n", tokens=0x5555557bca00, n_max_tokens=21, add_bos=true) at llama.cpp:4115
#15 0x00005555555b6716 in llama_tokenize (ctx=0x5555557b44c0, text=0x5555557b3c30 "\n\n### Instruction:\n\n", tokens=0x5555557bca00, n_max_tokens=21, add_bos=true) at llama.cpp:4135
#16 0x00005555555ed7cb in llama_tokenize (ctx=0x5555557b44c0, text="\n\n### Instruction:\n\n", add_bos=true) at examples/common.cpp:640
#17 0x000055555555c7d2 in main (argc=9, argv=0x7fffffffdb88) at examples/main/main.cpp:259

It works when applying this diff:

+++ b/examples/main/main.cpp
@@ -256,11 +256,13 @@ int main(int argc, char ** argv) {
     }
 
     // prefix & suffix for instruct mode
-    const auto inp_pfx = ::llama_tokenize(ctx, "\n\n### Instruction:\n\n", true);
-    const auto inp_sfx = ::llama_tokenize(ctx, "\n\n### Response:\n\n", false);
+    std::vector<llama_token> inp_pfx;
+    std::vector<llama_token> inp_sfx;
 
     // in instruct mode, we inject a prefix and a suffix to each input by the user
     if (params.instruct) {
+        inp_pfx = ::llama_tokenize(ctx, "\n\n### Instruction:\n\n", true);
+        inp_sfx = ::llama_tokenize(ctx, "\n\n### Response:\n\n", false);
         params.interactive_first = true;
         params.antiprompt.push_back("### Instruction:\n\n");
     }
@@ -270,9 +272,6 @@ int main(int argc, char ** argv) {
         params.interactive = true;
     }
 
-    // determine newline token
-    auto llama_token_newline = ::llama_tokenize(ctx, "\n", false);
-
     if (params.verbose_prompt) {
         fprintf(stderr, "\n");
         fprintf(stderr, "%s: prompt: '%s'\n", __func__, params.prompt.c_str());

@saltyduckegg it might still make sense to include enough tokens to represent these strings as well.

@byte-6174
Contributor

okay, but I am running from llama2.c and I get the output above?! how do we explain that?!

@jrudolph
Contributor

Probably llama2.c is not picking up the custom tokenizer.bin.

@byte-6174
Contributor

Right, it depends on how @saltyduckegg saved the custom tokenizer.
Btw, the fix for issue 2580 above makes sense to include in main in llama.cpp in general.

@jrudolph
Contributor

jrudolph commented Aug 10, 2023

It seems to work for me with llama2.c if the custom tokenizer.bin is in the working directory (maybe you had it in ~/Downloads/tokenizer.bin as well?).

@byte-6174
Copy link
Contributor

Yes! I forgot that it is hardcoded.
