
llama : add support for llama2.c models #2379

Closed
ggerganov opened this issue Jul 24, 2023 · 93 comments · Fixed by #2559
Labels: good first issue (Good for newcomers), help wanted (Extra attention is needed), model (Model specific)

Comments

@ggerganov
Owner

The new llama2.c project provides a way to train "baby" llama models, stored in a custom binary format, with 15M and 44M models already available and more potentially coming out soon.

We should provide a simple conversion tool from the llama2.c bin format to the ggml format, so that we can run inference on these models in llama.cpp.

Great task for people looking to get involved in the project

ggerganov added the help wanted, good first issue, and model labels on Jul 24, 2023
@jagtesh
Contributor

jagtesh commented Jul 24, 2023

I can take a stab at it. Been meaning to dive deeper into the GGML format.

Since convert.py only does the GGML conversion, and quantize is called explicitly for quantization, in theory only convert.py will need to be modified.

Would an existing model (HF/PyTorch) serve as a good starting point?

@slaren
Collaborator

slaren commented Jul 24, 2023

Trying to add this to convert.py may be overkill, and a lot harder than it needs to be. Writing a standalone script would probably be a lot easier.

The easiest to understand description of the file format is probably in the training example here:

void save_as_llama_model(struct llama_vocab * vocab, struct my_llama_model * model, const char * filename) {
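The llama2.c side of the conversion is simpler still: the model .bin starts with a small header of seven int32 hyperparameters, followed by the raw float32 weights. A minimal C++ sketch of reading that header (field names follow llama2.c's Config struct as it existed at the time; illustrative only, not the converter's actual code):

#include <cstdio>
#include <cstdint>

struct llama2c_config {
    int32_t dim;        // transformer dimension (n_embd)
    int32_t hidden_dim; // FFN hidden dimension (n_ff)
    int32_t n_layers;
    int32_t n_heads;
    int32_t n_kv_heads;
    int32_t vocab_size;
    int32_t seq_len;    // max context length
};

static bool read_llama2c_header(const char * path, llama2c_config & cfg) {
    FILE * f = std::fopen(path, "rb");
    if (!f) return false;
    const bool ok = std::fread(&cfg, sizeof(cfg), 1, f) == 1;
    std::fclose(f);
    return ok; // the rest of the file is the flat float32 weight data
}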

@Mistobaan

@ggerganov why not use the safetensors format? It seems far more practical than custom binary ggml formats.

@klosax
Collaborator

klosax commented Jul 25, 2023

@Mistobaan See this note in the spec of the upcoming gguf file format, gguf.md#why-not-other-formats, and PR ggerganov/ggml#302

@byte-6174
Contributor

I began a super-WIP (not yet fully functional) attempt at this here.
I will update as I go along, piecing together the corresponding variables between the two.

@jagtesh
Contributor

jagtesh commented Jul 27, 2023

@byte-6174 nice - I took a similar approach. I'm currently finding myself deep in the rabbit hole of converting llama2.c tensors, which use calloc-based memory allocation (with no metadata, AFAIK), to ggml_tensor.

Also, I don't quite understand yet why the vocab needs to be saved in the model when it is also available in an external file. In any case, I believe llama2.c uses the exact same format for the vocab.

I’ll end this update with a few words in the voice of Yoda, “Long journey, it is. Learn, I must.”
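For what it's worth, the core of that conversion is just allocating a ggml tensor of the right shape and copying the flat calloc'd buffer into it. A rough sketch (the function and its parameters are made up for illustration, not the eventual converter code):

#include "ggml.h"
#include <cstdint>
#include <cstring>

// Copy a flat float buffer from llama2.c (e.g. one layer's wq) into a 2D F32 ggml
// tensor. F32 ggml tensors are stored contiguously, so a single memcpy of
// n_in * n_out floats is enough; getting ne0/ne1 (and the per-layer offsets) right
// for what llama.cpp expects is the fiddly part.
static struct ggml_tensor * to_ggml_2d(struct ggml_context * ctx,
                                       const float * src, int64_t n_in, int64_t n_out) {
    struct ggml_tensor * t = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_in, n_out);
    std::memcpy(t->data, src, ggml_nbytes(t));
    return t;
}

// ctx comes from ggml_init() with a mem_size large enough to hold all the weights.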

@byte-6174
Contributor

Here.
Also, see the mapping that was required to figure out how to match up the variables 🙂

This reads the llama2.c model file and saves all the weights in a ggml-compatible tensor format.
Let me know how it works...

@byte-6174
Contributor

Re: the mapping - can someone with more experience with llama.cpp tensors point out how these RoPE tensors should be mapped? They are indicated with ? in the mapping md file above ☝️

@slaren
Collaborator

slaren commented Jul 29, 2023

Not 100% sure, but I believe these are precomputed lookup tables for RoPE; llama.cpp computes the rotations on the fly, so they are not necessary.

@byte-6174
Contributor

Right, I looked at the llama2.c code and it is indeed for RoPE; good to know it's not needed for llama.cpp, so I can remove it.
I now want to run this model - any pointers? Digging in now...

@ggerganov
Owner Author

You can run the most basic inference using: ./main -m converted-model.bin

@byte-6174
Contributor

byte-6174 commented Jul 30, 2023

Got it, it runs pretty well!
Perhaps now I can quantize it...
The output seems to contain non-English words... hmm.

It gives 359 tok/sec vs. llama2.c's ~100 tok/sec (perhaps not a fair comparison, though).

main: build = 909 (5a87675)
main: seed  = 1690733963
llama.cpp: loading model from abcd1.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 288
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 6
llama_model_load_internal: n_head_kv  = 6
llama_model_load_internal: n_layer    = 6
llama_model_load_internal: n_rot      = 48
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 1.0e-06
llama_model_load_internal: n_ff       = 768
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 0 (all F32)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.02 MB
llama_model_load_internal: mem required  =  395.13 MB (+    3.38 MB per state)
llama_new_context_with_model: kv self size  =    3.38 MB

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 200, n_keep = 0


 One day, Lily met a Shoggoth anyonefather whilem enjo but gl than the acc sn ahead nap very off ru sc things People table fasterer beet wants hiding anywheresklär customer al Contact round weeks sad sick someone somethingore tagBeil as attached where mine recommend ourselvesanalipirst leako each each that reck deal laterchy pair Then dar all blocking Whenrai accomplished yourself backrain this Hamb dr slehetableView short b looked alsoch goal d who down down downnight check out bird calows helper home walliling patch agree je either of whole cl heavily visited visit p up up up up up off headD vCODEBack Withoutotoss timeys pat On new warm compared corner kitcheniously front раз happenionungsplaceĕiveness belong slower run running toten replace a away, MangWe these mention topakesFound herself down if courageished facts rocksepar hear oneessU as b A meanshed turn I Or laugh save without out was destroyunt doA bur byണ University shop' the chance alone alone alone alone someone particular
llama_print_timings:        load time =   115.33 ms
llama_print_timings:      sample time =   143.12 ms /   200 runs   (    0.72 ms per token,  1397.48 tokens per second)
llama_print_timings: prompt eval time =    10.21 ms /    12 tokens (    0.85 ms per token,  1175.20 tokens per second)
llama_print_timings:        eval time =   409.92 ms /   199 runs   (    2.06 ms per token,   485.46 tokens per second)
llama_print_timings:       total time =   581.63 ms

@byte-6174
Contributor

Hmm, one other possible cause of the "non-English" words could be that the vocabs don't match.

@ggerganov
Owner Author

Nah, something else is wrong. First, try adding -eps 1e-5 to match the RMS norm implementation.

@byte-6174
Contributor

The default run above has eps = 1e-6; this 👇 is with 1e-5 as you suggested:

main: build = 909 (5a87675)
main: seed  = 1690736050
llama.cpp: loading model from abcd1.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 288
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 6
llama_model_load_internal: n_head_kv  = 6
llama_model_load_internal: n_layer    = 6
llama_model_load_internal: n_rot      = 48
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 1.0e-05
llama_model_load_internal: n_ff       = 768
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 0 (all F32)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.02 MB
llama_model_load_internal: mem required  =  395.13 MB (+    3.38 MB per state)
llama_new_context_with_model: kv self size  =    3.38 MB

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 200, n_keep = 0


 One day, Lily met a Shoggothvenatience quicker d looks on des glassappendoredpping early=( along lap ji our Your back of older shwn which happen out tried valleives world swins stessa chairaringtain takes one vendinganceric butYesmy set choiceveryely Next crack each each that everywhereph front downfжда fought feelions enjoy surezy sho r straightstraßeruityoll fesewer closed two v homes wider restaurant fed goerg free you kA basedShowTherevelopeiestpl too came glolass happ thoughtrefixomyber becomingselves keptored by tw belonged home stad watch hop exchange guys simplest situation Today looked on westWhere the disappawvenite tall sameve 'lowrownyn Tom they visitors card block laughed it continued decision anywhereMe justero fresh squ fastert whoseWhere herself startches overOfass bet'urr meps precedsedoc inspired about describe sharing l saidraiслав between silentselves warm thisEV eight plateing wish knewizeS different pop any anything burn graO response
llama_print_timings:        load time =    43.53 ms
llama_print_timings:      sample time =   143.49 ms /   200 runs   (    0.72 ms per token,  1393.83 tokens per second)
llama_print_timings: prompt eval time =     5.99 ms /    12 tokens (    0.50 ms per token,  2003.34 tokens per second)
llama_print_timings:        eval time =   566.60 ms /   199 runs   (    2.85 ms per token,   351.22 tokens per second)
llama_print_timings:       total time =   734.73 ms

@byte-6174
Contributor

I'm printing the model->norm that is saved in ggml vs. w->rms_final_weight from llama2.c, and they seem to match.

llama2.c first 5 elements of w->rms_final_weight >>
7.676849 7.187980 9.270302 6.815886 7.080070

vs. ggml's model->norm >>
7.676849 7.187980 9.270302 6.815886 7.080070
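(For reference, a spot check like this just dumps the first few floats of an F32 ggml tensor next to the corresponding llama2.c array - a sketch, with illustrative names:)

#include "ggml.h"
#include <cstdio>

// print the first n values of an F32 ggml tensor, for eyeballing against
// the matching llama2.c array (e.g. w->rms_final_weight)
static void print_head(const struct ggml_tensor * t, int n) {
    const float * v = (const float *) t->data;
    for (int i = 0; i < n; i++) std::printf("%f ", v[i]);
    std::printf("\n");
}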

@ggerganov
Owner Author

We haven't run any F32 models with llama.cpp yet, so it is possible that there is a bug, specific to the F32 format, that we haven't observed. To rule this out, try converting the model to F16 with the following command:

./quantize abcd1.bin abcd1-f16.bin f16

And see if the new abcd1-f16.bin model also outputs nonsense.

@byte-6174
Contributor

Yes, still nonsense. I'm currently investigating how the FF weights w1, w2, w3 are laid out in memory.
In llama2.c we have:
w1 --- layer x hidden_dim x dim
w2 --- layer x dim x hidden_dim
w3 --- layer x hidden_dim x dim

Looking more closely to see if I'm making a mistake in putting them into the tensors in the right order...

@byte-6174
Contributor

Aah! I found a bug: I was not using the right multiplier when reshaping the 1D arrays from llama2.c into 2D arrays in ggml!
The output now looks much better and is comparable to what we get from llama2.c!
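For anyone following along, the fix boils down to using the right per-layer offset when slicing llama2.c's flat weight arrays. A sketch of the indexing (variable names follow llama2.c; the helper itself is hypothetical):

#include <cstddef>

// llama2.c stores e.g. w1 as one flat array of n_layers * hidden_dim * dim floats,
// so the slice for layer l starts at l * hidden_dim * dim. Using the wrong multiplier
// (say, l * dim * dim for every tensor) silently picks up the wrong weights -- exactly
// the kind of bug described above.
inline const float * layer_slice(const float * base, int l, size_t rows, size_t cols) {
    return base + (size_t) l * rows * cols;
}

// e.g. const float * w1_l = layer_slice(w->w1, l, hidden_dim, dim); // [hidden_dim x dim]
//      const float * w2_l = layer_slice(w->w2, l, dim, hidden_dim); // [dim x hidden_dim]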

(py38) ➜  llama.cpp git:(master) ✗ ./main -m abcd1.bin -p "One day, Lily met a Shoggoth" -n 200
main: build = 909 (5a87675)
main: seed  = 1690770877
llama.cpp: loading model from abcd1.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 288
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 6
llama_model_load_internal: n_head_kv  = 6
llama_model_load_internal: n_layer    = 6
llama_model_load_internal: n_rot      = 48
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 1.0e-06
llama_model_load_internal: n_ff       = 768
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 0 (all F32)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.02 MB
llama_model_load_internal: mem required  =  395.13 MB (+    3.38 MB per state)
llama_new_context_with_model: kv self size  =    3.38 MB

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 200, n_keep = 0


 One day, Lily met a Shoggoth. It was big and shiny and very original. Lily asked the Shoggamkeeper what it was.
"It's a special kind of toy," the Shter master said. "But it's mine, not yours."
Lily touched the shirt and said, "Please take care of it?"
The Shapeububber was very happy to have this special toy. He granted Lily some money and told her to do as he asked.
Lily thanked him and took the shirt home with her. She looked at it and saw that it had a big number 1 on it. She was so excited!
"Thank you for saving me," she said to the Shapehanger. "I will take good care of this toy from now on."
The Shapehub smiled as he watched Lily keep her special toy. Once upon a time, there was a little girl named L
llama_print_timings:        load time =    64.88 ms
llama_print_timings:      sample time =   143.15 ms /   200 runs   (    0.72 ms per token,  1397.16 tokens per second)
llama_print_timings: prompt eval time =     9.08 ms /    12 tokens (    0.76 ms per token,  1321.59 tokens per second)
llama_print_timings:        eval time =   340.74 ms /   199 runs   (    1.71 ms per token,   584.03 tokens per second)
llama_print_timings:       total time =   510.47 ms

@byte-6174
Contributor

and here with quantization:

(py38) ➜  llama.cpp git:(master) ✗ ./main -m abcd1-f16.bin -p "One day, Lily met a Shoggoth" -n 500
main: build = 909 (5a87675)
main: seed  = 1690770940
llama.cpp: loading model from abcd1-f16.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 288
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 6
llama_model_load_internal: n_head_kv  = 6
llama_model_load_internal: n_layer    = 6
llama_model_load_internal: n_rot      = 48
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 1.0e-06
llama_model_load_internal: n_ff       = 768
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 1 (mostly F16)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.02 MB
llama_model_load_internal: mem required  =  348.58 MB (+    3.38 MB per state)
llama_new_context_with_model: kv self size  =    3.38 MB

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 500, n_keep = 0


 One day, Lily met a Shoggoth. It was very small and shiny and had many buttons on it. Lily liked the shirt and smiled at the shaving Shog.
"Wow," she said. "That's an unusual shirt. Can I try it?"
The Shapey said, "No, this is my shirt. You can't touch it. It's mine."
Lily was sad and angry. She tried to take the shirt from the Shoggborn. The Shogocin fell off the shirt and rolled away. Lily chased after him.
"Stop, Shog," she said. "You are mean. You can't have my shirt."
The Shogy heard her and felt bad. He got up from his bed and walked to Lily. He licked her face and wagged his tail.
Lily was happy and surprised. She hugged the Shogen and said, "Thank you for being my friend. You are very nice."
The Shirt gasped and smiled. It said, "You're welcome. I'm glad you like it. But now, let's go back to your shirt. It has an unusual pattern on it. Do you know what that means?"
Lily looked at the label. It was a bit strange. She did not know what that meant, but she said, "Thank you."
The Shoggrow on its skin and tail. The shirt is very funny. But no one looks like a monster. Everyone looks different. Lily is a lot of their mom. She is their mom'mightyious. They were her dull-iky.
"This one-shard. Shyebe was ankant. Shady. She shaped with shaped. Once upon skin. Heelfully she had three arms. The shirt. She was very ador, itchy. "I amisraichy shiny. Itchy Beary Cariosiness was ugly. icy.
M-els belonged toys. Shady things. Sharing and shirt. Sharpylyaterighter. Her nameate. py. icy eyes. Heby. The shady face.
Shutry. It.
Shady.
The Shy Shadow
llama_print_timings:        load time =    68.85 ms
llama_print_timings:      sample time =   357.98 ms /   500 runs   (    0.72 ms per token,  1396.71 tokens per second)
llama_print_timings: prompt eval time =     4.65 ms /    12 tokens (    0.39 ms per token,  2579.54 tokens per second)
llama_print_timings:        eval time =   698.48 ms /   499 runs   (    1.40 ms per token,   714.41 tokens per second)
llama_print_timings:       total time =  1106.33 ms

@ggerganov
Owner Author

ggerganov commented Jul 31, 2023

Great! You should use a context size of 256 (-c 256) to match the OG model. You can also try Q8_0 quantization. And don't forget the epsilon.
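(Producing the Q8_0 file presumably mirrors the F16 conversion above; the exact invocation below is a guess, not taken from the thread.)

./quantize abcd1.bin abcd1-Q8_0.bin q8_0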

@byte-6174
Contributor

sure-

(py38) ➜  llama.cpp git:(master) ✗ ./main -m abcd1-Q8_0.bin -p "One day, Lily met a Shoggoth" -n 500 -c 256 -eps 1e-5
main: build = 909 (5a87675)
main: seed  = 1690809457
llama.cpp: loading model from abcd1-Q8_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 256
llama_model_load_internal: n_embd     = 288
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 6
llama_model_load_internal: n_head_kv  = 6
llama_model_load_internal: n_layer    = 6
llama_model_load_internal: n_rot      = 48
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 1.0e-05
llama_model_load_internal: n_ff       = 768
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 7 (mostly Q8_0)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.02 MB
llama_model_load_internal: mem required  =  310.76 MB (+    1.69 MB per state)
llama_new_context_with_model: kv self size  =    1.69 MB

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 256, n_batch = 512, n_predict = 500, n_keep = 0


 One day, Lily met a Shoggoth comy all all again be at-its,- for the me and with here long, and and b and away and alert out to in your away in so alone d and very fast very hard happy you me me me me me off me me I alone only time and fast fast on his a on the mention fast and whoions followswards fast roomsaker touch carefulroom learned work the a with a an a her happ “ the the the back himself his quickly andum the the the the his cast the over to child while grass he me together fast the in after and and me firsts away andocim dust the her at Princess.ch," again before you' upon- the it you’- she before. in all a a a a a her a, at to a a long away fast them a very very very so once and to and very far right to- a a her me me me me really herchow long hard alone alone alone in so from me me out out fast fast the it a  very very very with in farly bigger steals water and pain away one the the too about fish his care revvel th raw in his firsts first by into life to it upon for and- the no time c a tooowow,
llama_print_timings:        load time =    60.02 ms
llama_print_timings:      sample time =   354.44 ms /   500 runs   (    0.71 ms per token,  1410.66 tokens per second)
llama_print_timings: prompt eval time =    66.42 ms /   399 tokens (    0.17 ms per token,  6007.41 tokens per second)
llama_print_timings:        eval time =   560.68 ms /   496 runs   (    1.13 ms per token,   884.64 tokens per second)
llama_print_timings:       total time =  1025.19 ms

@byte-6174
Contributor

Just focusing on timing a bit: using -t 4 instead of the default 8 seems to be better :)
command:
./main -m abcd1-Q8_0.bin -p "One day, Lily met a Shoggoth" -n 200 -c 256 -eps 1e-5 -t 4

model time
abcd1.bin 392.16ms
abcd1-f16.bin 312.78ms
abcd1-Q8_0.bin 253.47ms

@ggerganov
Owner Author

The Q8_0 generation looks broken. Either it does not have enough precision somehow or there is still some lingering issue

@byte-6174
Contributor

Hmm, you mean that, as far as we can judge from the output, some words look random... yes?

@byte-6174
Contributor

Also - perhaps not relevant, perhaps it is - llama2.c uses the precomputed RoPE vectors, which we are ignoring, so there is that difference.

@ggerganov
Owner Author

The F16 output seems OK up to 256 tokens which means it's probably not related to RoPE.

@klosax
Collaborator

klosax commented Aug 8, 2023

@byte-6174
Contributor

I can add a readme with some instructions and a summary of our findings and send a PR.

@ggerganov
Owner Author

That would be great!
Just usage instructions would be nice. No need to analyze the results yet.

klosax linked a pull request on Aug 8, 2023 that will close this issue
@saltyduckegg

Hello! I am trying to convert my llama2.c models to ggml,
but it looks like it needs a vocab file. How can I get it?

@saltyduckegg

The tokenizer.bin was trained by myself.

@klosax
Collaborator

klosax commented Aug 10, 2023

Try setting --vocab-model to a working llama2 ggml model, not a tokenizer file. I think the vocab will be copied from the model file.

@byte-6174
Contributor

Just sent another update that should fix some of the issues. In this conversion, we are using the vocabulary file available at models/ggml-vocab.bin.

@byte-6174
Contributor

@saltyduckegg we are using the vocab model available in the llama.cpp repository. Please use that instead and let me know if it works for you.

@saltyduckegg

Thank you for your help,
let me try it.

@saltyduckegg

It can indeed run, but this is not what I want. It seems to have messed up the character encoding; of course, that seems to be because this is not the encoding table I used for training.

$ ./bin/convert-llama2c-to-ggml --copy-vocab-from-model ./models/ggml-vocab.bin    --llama2c-model  ../../llama2.c.xs/out/model.bin   --llama2c-output-model ./xss
[malloc_weights:AK] Allocating [8000] x [288] = [2304000] float space for w->token_embedding_table
[malloc_weights:AK] Allocating [6] x [288] = [1728] float space for w->rms_att_weight
[malloc_weights:AK] Allocating [6] x [288] = [1728] float space for w->rms_ffn_weight
[malloc_weights:AK] Allocating [6] x [288] x [288] = [497664] float space for w->wq
[malloc_weights:AK] Allocating [6] x [288] x [288] = [497664] float space for w->wk
[malloc_weights:AK] Allocating [6] x [288] x [288] = [497664] float space for w->wv
[malloc_weights:AK] Allocating [6] x [288] x [288] = [497664] float space for w->wo
[malloc_weights:AK] Allocating [6] x [768] x [288] = [1327104] float space for w->w1
[malloc_weights:AK] Allocating [6] x [288] x [768] = [1327104] float space for w->w2
[malloc_weights:AK] Allocating [6] x [768] x [288] = [1327104] float space for w->w3
[malloc_weights:AK] Allocating [288] float space for w->rms_final_weight
llama.cpp: loading model from ./models/ggml-vocab.bin
llama_model_load_internal: format     = ggjt v1 (pre #1405)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_head_kv  = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 1 (mostly F16)
llama_model_load_internal: model size = 7B
print_params: n_vocab: 8000
print_params: n_ctx:   128
print_params: n_embd:  288
print_params: n_mult:  32
print_params: n_head:  6
print_params: n_ff:    768
print_params: n_layer: 6
print_params: n_rot:   48
[init_model:GG] Allocating [288] x [8000] = [2304000] float space for model->tok_embeddings
[init_model:GG] Allocating [288] float space for model->norm
[init_model:GG] Allocating [288] x[8000] = [2304000] float space for model->output
[init_model:GG] Allocating [288] x[288] = [82944] float space for layer.wq for [6] layers
[init_model:GG] Allocating [288] x[288] = [82944] float space for layer.wk for [6] layers
[init_model:GG] Allocating [288] x[288] = [82944] float space for layer.wv for [6] layers
[init_model:GG] Allocating [288] x[288] = [82944] float space for layer.wo for [6] layers
[init_model:GG] Allocating [288] float space for layer.ffn_norm for [6] layers
[init_model:GG] Allocating [768] x[288] = [221184] float space for layer.w1 for [6] layers
[init_model:GG] Allocating [288] x[768] = [221184] float space for layer.w2 for [6] layers
[init_model:GG] Allocating [768] x[288] = [221184] float space for layer.w3 for [6] layers
Saving llama.c model file ../../llama2.c.xs/out/model.bin in ggml format at ./xss


./bin/main -m ./xss -p "One day, Lily met a Shoggoth" -n 500 -c 256 -eps 1e-5
main: build = 0 (unknown)
main: seed  = 1691677573
llama.cpp: loading model from ./xss
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 8000
llama_model_load_internal: n_ctx      = 256
llama_model_load_internal: n_embd     = 288
llama_model_load_internal: n_mult     = 32
llama_model_load_internal: n_head     = 6
llama_model_load_internal: n_head_kv  = 6
llama_model_load_internal: n_layer    = 6
llama_model_load_internal: n_rot      = 48
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 1.0e-05
llama_model_load_internal: n_ff       = 768
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 0 (all F32)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.02 MB
llama_model_load_internal: mem required  =   40.39 MB (+    1.69 MB per state)
llama_new_context_with_model: kv self size  =    1.69 MB
llama_new_context_with_model: compute buffer total size =    9.44 MB

system_info: n_threads = 28 / 56 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 256, n_batch = 512, n_predict = 500, n_keep = 0


 One day, Lily met a Shoggoth 
                              pol$деIg!I Thtem»iREenol¹ban³ol¹ourcegetah uolination 
                                                                                   iel o¤ Eldi elitionaft5oph trad0ow primÿject$this }, here$omsجintANother°weenplheitrelags󄟯fficort)
                                                                                                                                                                                      сеl М Classст ExIҹban)ace el5 кener:UPcketzT synWji pos8 leoh preio,ohid¯h pre sayingjs pos4 Ar reallyideiooh Alunicipace -ordво posaceaceRE§6CǠpourservice can.ͺ:ÿcheckistsT¯ pos4 ut le but 
                                                                   甌L pos04 pos;ڡm pos- Mus Mus
                                                                                                P
                                                                                                 	
                                                                                                         (Tython֗ avecnelowias5Sдо&ER proiaftamground include u@󟤬Oobanle and]
                                                                                                                                                                             
                                                                                                                                                                            urce eну canKϬ,0/ donen
                                                                                                                                                                                                   > il¯ys8
                                                                                                                                                                                                          Μ, lineSڱ pos0 posI \(Connectionج turnHzo de posblockiowow	buttonсо,amp decject$деI everyfigource
                                                                                                                                                                                                                                                                                                              end + lookob d prim,def ilfrcheW:¼¬, line¯0itemizeler
                                                   olдеI equ Ob$ FERдеI¤amground筥 Viother
                                                                                          dfigublicampSڥrtDO voor Newunk"άerStream0¯qWrit sym

llama_print_timings:        load time =    11.41 ms
llama_print_timings:      sample time =    89.09 ms /   500 runs   (    0.18 ms per token,  5612.62 tokens per second)
llama_print_timings: prompt eval time =    69.81 ms /   399 tokens (    0.17 ms per token,  5715.68 tokens per second)
llama_print_timings:        eval time =  3626.38 ms /   496 runs   (    7.31 ms per token,   136.78 tokens per second)
llama_print_timings:       total time =  3822.12 ms

@jrudolph
Contributor

jrudolph commented Aug 10, 2023

I created a PR against #2559 to support loading the llama2.c vocabulary that might help you, @saltyduckegg, if you created your own vocabulary: https://github.com/byte-6174/llama.cpp/pull/1/files

The --copy-vocab-from-model argument now also works with tokenizer.bin (or whatever you called it) when exported from the llama2.c scripts.
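For context, llama2.c's tokenizer.bin is a plain binary dump of the vocabulary: per token, a float score followed by a length-prefixed UTF-8 string, with (in newer llama2.c versions) a small max-token-length header up front. A rough C++ sketch of reading it under those assumptions - check llama2.c's tokenizer export for the authoritative layout:

#include <cstdio>
#include <cstdint>
#include <string>
#include <vector>

struct llama2c_vocab {
    std::vector<std::string> tokens;
    std::vector<float>       scores;
};

// Assumed layout (verify against llama2.c): an int32 max_token_length header, then
// vocab_size entries of { float32 score, int32 len, len bytes of UTF-8 }.
static bool read_llama2c_tokenizer(const char * path, int vocab_size, llama2c_vocab & out) {
    FILE * f = std::fopen(path, "rb");
    if (!f) return false;
    int32_t max_token_length = 0;
    if (std::fread(&max_token_length, sizeof(max_token_length), 1, f) != 1) { std::fclose(f); return false; }
    for (int i = 0; i < vocab_size; i++) {
        float   score = 0.0f;
        int32_t len   = 0;
        if (std::fread(&score, sizeof(score), 1, f) != 1) { std::fclose(f); return false; }
        if (std::fread(&len,   sizeof(len),   1, f) != 1) { std::fclose(f); return false; }
        std::string tok(len, '\0');
        if (len > 0 && std::fread(&tok[0], 1, len, f) != (size_t) len) { std::fclose(f); return false; }
        out.tokens.push_back(std::move(tok));
        out.scores.push_back(score);
    }
    std::fclose(f);
    return true;
}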

@saltyduckegg

Cool!
It seems to work: it successfully loaded the tokenizer model and converted my model to ggml format. But I encountered an error that I could not understand when later running main.

$ ./bin/convert-llama2c-to-ggml --copy-vocab-from-model ../../llama2.c.xs/tokenizer.bin    --llama2c-model  ../../llama2.c.xs/out/model.bin   --llama2c-output-model ./xss  
[malloc_weights:AK] Allocating [8000] x [288] = [2304000] float space for w->token_embedding_table
[malloc_weights:AK] Allocating [6] x [288] = [1728] float space for w->rms_att_weight
[malloc_weights:AK] Allocating [6] x [288] = [1728] float space for w->rms_ffn_weight
[malloc_weights:AK] Allocating [6] x [288] x [288] = [497664] float space for w->wq
[malloc_weights:AK] Allocating [6] x [288] x [288] = [497664] float space for w->wk
[malloc_weights:AK] Allocating [6] x [288] x [288] = [497664] float space for w->wv
[malloc_weights:AK] Allocating [6] x [288] x [288] = [497664] float space for w->wo
[malloc_weights:AK] Allocating [6] x [768] x [288] = [1327104] float space for w->w1
[malloc_weights:AK] Allocating [6] x [288] x [768] = [1327104] float space for w->w2
[malloc_weights:AK] Allocating [6] x [768] x [288] = [1327104] float space for w->w3
[malloc_weights:AK] Allocating [288] float space for w->rms_final_weight
Assuming llama2.c vocabulary since ../../llama2.c.xs/tokenizer.bin is not a ggml file
print_params: n_vocab: 8000
print_params: n_ctx:   128
print_params: n_embd:  288
print_params: n_mult:  32
print_params: n_head:  6
print_params: n_ff:    768
print_params: n_layer: 6
print_params: n_rot:   48
[init_model:GG] Allocating [288] x [8000] = [2304000] float space for model->tok_embeddings
[init_model:GG] Allocating [288] float space for model->norm
[init_model:GG] Allocating [288] x[8000] = [2304000] float space for model->output
[init_model:GG] Allocating [288] x[288] = [82944] float space for layer.wq for [6] layers
[init_model:GG] Allocating [288] x[288] = [82944] float space for layer.wk for [6] layers
[init_model:GG] Allocating [288] x[288] = [82944] float space for layer.wv for [6] layers
[init_model:GG] Allocating [288] x[288] = [82944] float space for layer.wo for [6] layers
[init_model:GG] Allocating [288] float space for layer.ffn_norm for [6] layers
[init_model:GG] Allocating [768] x[288] = [221184] float space for layer.w1 for [6] layers
[init_model:GG] Allocating [288] x[768] = [221184] float space for layer.w2 for [6] layers
[init_model:GG] Allocating [768] x[288] = [221184] float space for layer.w3 for [6] layers
Saving llama.c model file ../../llama2.c.xs/out/model.bin in ggml format at ./xss


$ ./bin/main -m ./xss -p "One day, Lily met a Shoggoth" -n 500 -c 256 -eps 1e-5
main: build = 0 (unknown)
main: seed  = 1691678842
llama.cpp: loading model from ./xss
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 8000
llama_model_load_internal: n_ctx      = 256
llama_model_load_internal: n_embd     = 288
llama_model_load_internal: n_mult     = 32
llama_model_load_internal: n_head     = 6
llama_model_load_internal: n_head_kv  = 6
llama_model_load_internal: n_layer    = 6
llama_model_load_internal: n_rot      = 48
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 1.0e-05
llama_model_load_internal: n_ff       = 768
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 0 (all F32)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.02 MB
llama_model_load_internal: mem required  =   40.39 MB (+    1.69 MB per state)
llama_new_context_with_model: kv self size  =    1.69 MB
llama_new_context_with_model: compute buffer total size =    9.44 MB

system_info: n_threads = 28 / 56 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
terminate called after throwing an instance of 'std::out_of_range'
  what():  _Map_base::at
Aborted (core dumped)


@jrudolph
Contributor

Yep, something seems broken :)

Can you put your model somewhere? Or run with gdb to get a stack trace or put the core somewhere to investigate?

@saltyduckegg

I'm very sorry for taking so long: I found that my model file is larger than 25 MB and cannot be attached directly on GitHub, so I uploaded it to Hugging Face. This is a mini model trained on Chinese dialogue data.
https://huggingface.co/segg/mymini_llama/tree/main


@byte-6174
Contributor

I just tested it with @jrudolph's update, and for all 3 models the llama2.c vocab binary can optionally be used. I will send another update to the PR with this.

@byte-6174
Contributor

byte-6174 commented Aug 10, 2023

@saltyduckegg I tried running your model from Hugging Face with the llama2.c repo and it gives me the following. Are you able to get good output when you use the llama2.c repo with this model?

./run ~/Downloads/model.bin -t 0.8 -n 256 -i "One day, Lily met a Shoggoth"

One day, Lily met a ShoggothWoudки(
	 gar<unk>                  * Les argp:Ä Januarysare8 Liemer
                 shith
                      <unk>
                            "
                              Name
                                  ropemsJ
                                         sch
                                                      <unk>
                                                           ning
                                                                st
                                                                   cert
                                                                       Interz again places
                                                                                          �
                                                                                           éead
                                                                                                )
achieved tok/s: 1102.564103

@saltyduckegg

saltyduckegg commented Aug 10, 2023

I can get better output. I use my tokenizer.bin:

$ ./run out/model.bin -t 0.8 -n 256 -i "instruction"
instruction (1 moo n che. s" haited \n [9 9 36 lodor t\ns s sadots a VCIv ad \n\n00\n ' 如果 \n 29 25 \n mon. 好的, Chn my a lis _mo ` ner Z in ptrolus in list)\n```\ndsdit by \n```\nY\n```\n# 最近太科学的,一根方不<unk>。```', 'input': '', 'output': '\n\n2022年 红色 \n', 'input': '', 'output': '\n\n\n\n\n 当然,他在我,Sanal n ct 非常感谢,我一地成功地为 这家餐厅的。我我是一个一些。小明: 却要是,你还在我在我需要注意的,你是我很文学,我被跟”与,但是我们您保持科技来。\n\n基于以上这段文本,能够降低思考
achieved tok/s: 298.245614

My result is as expected; at least some of it consists of sentences I can understand. Your result looks somewhat like output produced with an incorrect tokenizer.bin.


@byte-6174
Contributor

No, I cannot reproduce your output with your model and the stock llama2.c run; your model seems to have some corruption.

 ./run ~/Downloads/model.bin -t 0.8 -n 256 -i "instruction"

instruct oper(2on ID ChristG sem stattr<unk>�MU added radener getsQU<unk<unk>od
                                                                               <unk>roidorn�(
                                                                                              float
 databaseatJ<unkures i<unk>ENv<unk>
inkireroState
             <unk><unk><unk>Conthú
 esøàblemough
	v(\asedy/ <unk>plate
:ch direc<unk>k ifÎ*/com
                         allow und Willpgr�curityarily website
                                                              O¶
ore�5,Äano That$ transõdf заThe П�+
achieved tok/s: 913.636364

@jrudolph
Contributor

No, it's a problem with main.cpp: it expects that it can tokenize the instruction prefix/suffix and the newline, but the vocabulary does not include the tokens needed for them (and they are also not needed in non-instruct mode).

Backtrace
system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
terminate called after throwing an instance of 'std::out_of_range'
  what():  _Map_base::at

Program received signal SIGABRT, Aborted.
__pthread_kill_implementation (no_tid=0, signo=6, threadid=140737352595264) at ./nptl/pthread_kill.c:44
44	./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140737352595264) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140737352595264) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140737352595264, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007ffff7842476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ffff78287f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007ffff7ca2b9e in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007ffff7cae20c in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007ffff7cae277 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007ffff7cae4d8 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007ffff7ca54a0 in std::__throw_out_of_range(char const*) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x00005555555c9a2c in std::__detail::_Map_base<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, int>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, int> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true>, true>::at (this=0x55555571f398, __k="\n")
    at /usr/include/c++/11/bits/hashtable_policy.h:776
#11 0x00005555555c2a0f in std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, int> > >::at (this=0x55555571f398, __k="\n") at /usr/include/c++/11/bits/unordered_map.h:1001
#12 0x00005555555bfd10 in llama_tokenizer::tokenize (this=0x7fffffffbea0, text="\n\n### Instruction:\n\n", output=std::vector of length 1, capacity 1 = {...}) at llama.cpp:2024
#13 0x00005555555ac971 in llama_tokenize (vocab=..., text="\n\n### Instruction:\n\n", bos=true) at llama.cpp:2077
#14 0x00005555555b658a in llama_tokenize_with_model (model=0x55555571f2b0, text=0x5555557b3c30 "\n\n### Instruction:\n\n", tokens=0x5555557bca00, n_max_tokens=21, add_bos=true) at llama.cpp:4115
#15 0x00005555555b6716 in llama_tokenize (ctx=0x5555557b44c0, text=0x5555557b3c30 "\n\n### Instruction:\n\n", tokens=0x5555557bca00, n_max_tokens=21, add_bos=true) at llama.cpp:4135
#16 0x00005555555ed7cb in llama_tokenize (ctx=0x5555557b44c0, text="\n\n### Instruction:\n\n", add_bos=true) at examples/common.cpp:640
#17 0x000055555555c7d2 in main (argc=9, argv=0x7fffffffdb88) at examples/main/main.cpp:259

It works when applying this diff:

+++ b/examples/main/main.cpp
@@ -256,11 +256,13 @@ int main(int argc, char ** argv) {
     }
 
     // prefix & suffix for instruct mode
-    const auto inp_pfx = ::llama_tokenize(ctx, "\n\n### Instruction:\n\n", true);
-    const auto inp_sfx = ::llama_tokenize(ctx, "\n\n### Response:\n\n", false);
+    std::vector<llama_token> inp_pfx;
+    std::vector<llama_token> inp_sfx;
 
     // in instruct mode, we inject a prefix and a suffix to each input by the user
     if (params.instruct) {
+        inp_pfx = ::llama_tokenize(ctx, "\n\n### Instruction:\n\n", true);
+        inp_sfx = ::llama_tokenize(ctx, "\n\n### Response:\n\n", false);
         params.interactive_first = true;
         params.antiprompt.push_back("### Instruction:\n\n");
     }
@@ -270,9 +272,6 @@ int main(int argc, char ** argv) {
         params.interactive = true;
     }
 
-    // determine newline token
-    auto llama_token_newline = ::llama_tokenize(ctx, "\n", false);
-
     if (params.verbose_prompt) {
         fprintf(stderr, "\n");
         fprintf(stderr, "%s: prompt: '%s'\n", __func__, params.prompt.c_str());

@saltyduckegg it might still make sense to include enough tokens to represent these strings as well.

@byte-6174
Contributor

okay, but I am running from llama2.c and I get the output above?! how do we explain that?!

@jrudolph
Contributor

Probably llama2.c is not picking up the custom tokenizer.bin.

@byte-6174
Contributor

Right, it depends on how @saltyduckegg saved the custom tokenizer.
Btw, the fix for issue 2580 above makes sense to include in main in llama.cpp in general.

@jrudolph
Contributor

jrudolph commented Aug 10, 2023

It seems to work for me with llama2.c if the custom tokenizer.bin is in the working directory (maybe you had it in ~/Downloads/tokenizer.bin as well?).

@byte-6174
Copy link
Contributor

Yes! I forgot that it is hardcoded.
