Error: 70B Model quantizing on mac: tensor 'layers.0.attention.wk.weight' has wrong shape; expected 8192 x 8192 #407

Open
NilsHellwig opened this issue Jul 19, 2023 · 34 comments
Labels
compatibility issues arising from specific hardware or system configs

Comments

@NilsHellwig

NilsHellwig commented Jul 19, 2023

Used this model: https://huggingface.co/meta-llama/Llama-2-70b

Used these commands:

$ convert-pth-to-ggml.py models/LLaMa2-70B-meta 1
$ ./quantize ./models/LLaMa2-70B-meta/ggml-model-f16.bin ./models/LLaMa2-70B-meta/ggml-model-q4_0.bin 2

The 7B and 13B models work without any problems. The error only occurs when using the 70B model.

error loading model: llama.cpp: tensor 'layers.0.attention.wk.weight' has wrong shape; expected 8192 x 8192, got 8192 x 1024

llama.cpp: loading model from /Users/xyz/Desktop/llama.cpp/models/LLaMa2-70B-meta/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 22016
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size =    0.19 MB
error loading model: llama.cpp: tensor 'layers.0.attention.wk.weight' has wrong shape; expected  8192 x  8192, got  8192 x  1024
llama_load_model_from_file: failed to load model
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[2], line 1
----> 1 llm = Llama(model_path="/Users/xyz/Desktop/llama.cpp/models/LLaMa2-70B-meta/ggml-model-q4_0.bin", n_ctx=512, seed=43, n_threads=8, n_gpu_layers=1)

File /opt/homebrew/Caskroom/miniforge/base/envs/tensorflow_m1/lib/python3.11/site-packages/llama_cpp/llama.py:305, in Llama.__init__(self, model_path, n_ctx, n_parts, n_gpu_layers, seed, f16_kv, logits_all, vocab_only, use_mmap, use_mlock, embedding, n_threads, n_batch, last_n_tokens_size, lora_base, lora_path, low_vram, tensor_split, rope_freq_base, rope_freq_scale, verbose)
    300     raise ValueError(f"Model path does not exist: {model_path}")
    302 self.model = llama_cpp.llama_load_model_from_file(
    303     self.model_path.encode("utf-8"), self.params
    304 )
--> 305 assert self.model is not None
    307 self.ctx = llama_cpp.llama_new_context_with_model(self.model, self.params)
    309 assert self.ctx is not None

AssertionError: 
@schappim

I can confirm I also have this issue. I tried this method (ggerganov/llama.cpp#2276 (comment)) unsuccessfully.

@asdf17128

I also faced this problem. Any solutions?

@schappim

schappim commented Jul 20, 2023

@asdf17128 and @NilsHellwig, are you guys on M2? I'm currently attempting to run this on an M2 Ultra with 24-core CPU, 76-core GPU, 32-core Neural Engine. I've tried both metal and non-metal builds.

@NilsHellwig
Author

@schappim I'm using an M1 Max (32 GB RAM) with a 24-core GPU.

@asdf17128

@asdf17128 and @NilsHellwig, are you guys on M2? I'm currently attempting to run this on an M2 Ultra.

A Mac Studio M2 Ultra with 192 GB RAM.

@asdf17128

Maybe some changes need to be made to llama.cpp; no clue.

@NilsHellwig
Author

NilsHellwig commented Jul 20, 2023

@asdf17128 I think so too. I guess it's not our fault. I've opened a new issue:

ggerganov/llama.cpp#2285

@SlyEcho

SlyEcho commented Jul 20, 2023

LLaMA v2 70B is not supported right now because it uses a different attention method (grouped-query attention). #2276 is a proof of concept to make it work. The convert script should not require changes, because the only thing that changed is the shape of some tensors and convert.py can handle that; the same goes for quantize.cpp.

If you have trouble converting, I suggest you check that the .pth files are correct; Meta's download script is not the best.
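
To make the shape mismatch concrete (a rough note, with numbers taken from the loader output above): 70B uses grouped-query attention with only 8 KV heads for its 64 query heads, so the K and V projections are 8x narrower than the Q projection, which is exactly the 8192 x 1024 vs. 8192 x 8192 difference older llama.cpp complains about.

```sh
# LLaMA-2 70B: n_embd = 8192, n_head = 64, n_head_kv = 8 (grouped-query attention)
# head_dim   = n_embd / n_head      = 8192 / 64 = 128
# wk/wv rows = n_head_kv * head_dim = 8 * 128   = 1024   -> "got 8192 x 1024"
# wq rows    = n_head * head_dim    = 64 * 128  = 8192   -> what pre-GQA llama.cpp expected
echo $(( (8192 / 64) * 8 ))   # prints 1024
```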

@schappim

@SlyEcho, I don't think it is the download, as I downloaded it twice: once via git lfs and once via the script.
I've checked out the experimental branch and tried that; it also generates the same "wrong shape" error.
I tried with both LLAMA_METAL enabled and disabled.

@SlyEcho

SlyEcho commented Jul 20, 2023

@schappim, "wrong shape" error is expected because llama.cpp does not support 70B yet. However errors in convert.py would mean the data files are not in the correct format. OK, maybe I was a little confused as to who had which error.

@NilsHellwig
Author

The 7B and 13B versions do work without any problems, however :)

@schappim

Confirming that, same as @NilsHellwig, the 7B and 13B versions do work.

@SlyEcho

SlyEcho commented Jul 20, 2023

There are checksum files in HF if you have access:

@jain-t

jain-t commented Jul 20, 2023

Getting the same error as everyone for 70B-chat

MacBook Pro, M1 chip, 16 GB

main: build = 852 (294f424)
main: seed  = 1689858440
llama.cpp: loading model from /Volumes/Gargantua/LLAMA2/llama-2-70b-chat/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 22016
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size =    0.19 MB
error loading model: llama.cpp: tensor 'layers.0.attention.wk.weight' has wrong shape; expected  8192 x  8192, got  8192 x  1024
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/Volumes/Gargantua/LLAMA2/llama-2-70b-chat/ggml-model-q4_0.bin'
main: error: unable to load model

@SlyEcho

SlyEcho commented Jul 20, 2023

Please apply https://github.com/ggerganov/llama.cpp/pull/2276.diff and then test.
I have tested and it does work.
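
Roughly, that looks like the following (a sketch; it assumes a llama.cpp checkout from before the PR was merged, and a plain CPU build):

```sh
# Download and apply the PR diff, then rebuild (add LLAMA_METAL=1 to make for the Metal build)
curl -L -o 2276.diff https://github.com/ggerganov/llama.cpp/pull/2276.diff
git apply 2276.diff
make clean && make
```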

@tony163163

```diff
@@ -1459,7 +1468,7 @@ static bool llama_eval_internal(
         offload_func_kq(tmpq);
         ggml_set_name(tmpq, "tmpq");

-        struct ggml_tensor * Kcur = ggml_rope_custom_inplace(ctx0, ggml_reshape_3d(ctx0, tmpk, n_embd/n_head, n_head, N), n_past, n_rot, 0, freq_base, freq_scale, 0);
+        struct ggml_tensor * Kcur = ggml_rope_custom_inplace(ctx0, ggml_reshape_3d(ctx0, tmpk, n_embd/n_head, n_head/8, N), n_past, n_rot, 0, freq_base, freq_scale, 0);
         offload_func_kq(Kcur);
         ggml_set_name(Kcur, "Kcur");
```

or

```diff
+        struct ggml_tensor * Kcur = ggml_rope_custom_inplace(ctx0, ggml_reshape_3d(ctx0, tmpk, n_embd/n_head, n_head/8, N), n_past, n_rot, 0, 0, freq_base, freq_scale);
```

@k-hkrachenfels

Hi, I'm also trying to run this on an M1 with 64 GB.
In the meantime, the above PR (https://github.com/ggerganov/llama.cpp/pull/2276.diff) seems to have been merged. I wonder why I still get the above error when working on the master branch.

main: build = 911 (fce48ca)
main: seed  = 1690292053
llama.cpp: loading model from llama-2-70b-chat.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 4096
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_head_kv  = 64
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 1,0e-06
llama_model_load_internal: n_ff       = 24576
llama_model_load_internal: freq_base  = 10000,0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size =    0,21 MB
error loading model: llama.cpp: tensor 'layers.0.attention.wk.weight' has wrong shape; expected  8192 x  8192, got  8192 x  1024
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model 'llama-2-70b-chat.ggmlv3.q4_0.bin'
main: error: unable to load model

Is LLaMa2-70B-meta/ggml-model-q4_0.bin the correct model? Do I have to convert the model first?

@SlyEcho

SlyEcho commented Jul 25, 2023

You need to pass the argument -gqa 8 as it's not detected automatically yet.
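
For example (a minimal sketch; the model path is the one from earlier in this thread and the prompt is just illustrative):

```sh
# -gqa 8 tells llama.cpp the grouped-query-attention factor for LLaMA-2 70B; it is not auto-detected yet
./main -m ./models/LLaMa2-70B-meta/ggml-model-q4_0.bin -gqa 8 -p "Hello"
```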

@Tavernari

I only added -gqa 8 on my main execution command, and it worked! Thanks.

@mkarpusiewicz

Is the HF model supported? With --gqa 8 I get:

main: build = 912 (07aaa0f)
main: seed  = 1690379540
llama.cpp: loading model from .\models\llama2-70b-chat-hf-ggml-model-q4_0.bin
llama_model_load_internal: warning: assuming 70B model based on GQA == 8
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_head_kv  = 8
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 8
llama_model_load_internal: rnorm_eps  = 1.0e-06
llama_model_load_internal: n_ff       = 28416
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: model size = 70B
llama_model_load_internal: ggml ctx size =    0.21 MB
error loading model: llama.cpp: tensor 'layers.0.feed_forward.w1.weight' has wrong shape; expected  8192 x 28416, got  8192 x 28672
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '.\models\llama2-70b-chat-hf-ggml-model-q4_0.bin'
main: error: unable to load model

@euan

euan commented Aug 1, 2023

Got it working with the --gqa 8 flag, but it was way too slow to use.
Also, when you get there, ask it for the date of the last information it has. The response will shock you!

@moseswmwong

moseswmwong commented Aug 2, 2023

I have a MacBook Pro with an M1 Max and 64 GB. I encountered this error and understand that convert.py needs to be updated to match the unique attention layout used by the 70B models. After replacing convert.py, running the conversion again, and adding "-gqa 8" to the parameters of the 'main' program, I can use the model properly; a rough sketch of the commands follows.
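
For reference, the re-conversion flow described above looks roughly like this (a sketch; the exact convert.py options may differ between revisions, and the model directory is the one used earlier in this thread):

```sh
# Re-convert with the updated convert.py, then re-quantize; run main with -gqa 8 as noted above
python convert.py models/LLaMa2-70B-meta --outtype f16
./quantize models/LLaMa2-70B-meta/ggml-model-f16.bin models/LLaMa2-70B-meta/ggml-model-q4_0.bin q4_0
```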

The speed is a bit slow; I wonder if M2 Apple Silicon owners get much better performance?

I asked about the last information it has and it said "August 26, 2020". I doubted it was a bluff (hallucination), so I asked about some significant world events and world leaders; it knew recent events, so I challenged it on the truthfulness of the cutoff date it claimed, and it apologized and said it should be "December 2022".

I think December 2022 makes more sense.

@euan

euan commented Aug 3, 2023

@moseswmwong Where do I get the updated convert.py script?
I have an M2 Max and running 70B is dog slow.

@moseswmwong

moseswmwong commented Aug 3, 2023

@euan Go to the page for PR #2276, choose the "Files changed" tab, press the three dots, choose "View file", and download the raw file.

@aldarisbm

The speed is a bit slow; I wonder if M2 Apple Silicon owners get much better performance?

M2 Max with 64 GB is pretty slow as well.

@kechan

kechan commented Aug 4, 2023

If you have an M2 Max with 96 GB, try adding -ngl 38 to use Metal (MPS) acceleration (or a lower number if you don't have that many GPU cores). The initial load is still slow (I tested with a longer prompt), but afterwards, in interactive mode, the back and forth is almost as fast as the original ChatGPT felt when I first used it (in the few days when everyone was hitting it). So to me it is totally usable, and it's the go-to thing if I ever have no Wi-Fi and need to search.

NB: you need to build main with the Metal option; this is documented in the llama.cpp repo.
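
Concretely, the Metal setup described above looks something like this (a sketch; 38 offloaded layers is what the 96 GB machine above used, and the model path is illustrative):

```sh
# Build with Metal enabled, then offload layers to the GPU with -ngl; -i drops into interactive mode
make clean && LLAMA_METAL=1 make
./main -m ./models/LLaMa2-70B-meta/ggml-model-q4_0.bin -gqa 8 -ngl 38 -i
```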

@moseswmwong

@kechan Thanks for the advice! It runs at ChatGPT-like speed on my side too! And I can see memory and GPU usage increase when the AI "thinks".

@aldarisbm

Yeah, -ngl 30 definitely pegs all of my GPU cores on an M2 Max (30 cores).
[Screenshot 2023-08-04 at 08:04:34: GPU utilization]

@franssauermann

...
Is the HF model supported? With --gqa 8 I get:
...

error loading model: llama.cpp: tensor 'layers.0.feed_forward.w1.weight' has wrong shape; expected 8192 x 28416, got 8192 x 28672

Were you able to resolve this issue?

@mkarpusiewicz

@franssauermann No, I gave up; since the CPU performance would not be great anyway, I didn't pursue this. Just for reference, I tested this on Windows 10, not on a Mac M1/M2.

@skobak

skobak commented Aug 17, 2023

I can confirm that 70B on an M1 Pro with 32 GB runs really slowly.

@primus1a

CuBLAS with --gqa 8 on Ryzen 9 5900HX/32GB, along with NVIDIA RTX 3080/16GB on an external USB 3.2 SSD, is too slow to work with. Are there any setting options that would speed things up? I would be very grateful for any help.

@RichardScottOZ

I only added -gqa 8 on my main execution command, and it worked! Thanks.

Thanks, I suspected this from the 8192 x 1024 shape errors, so good tip.

@albertodepaola added the "compatibility issues arising from specific hardware or system configs" label on Sep 6, 2023
@loretoparisi

CuBLAS with --gqa 8 on Ryzen 9 5900HX/32GB, along with NVIDIA RTX 3080/16GB on an external USB 3.2 SSD, is too slow to work with. Are there any setting options that would speed things up? I would be very grateful for any help.

I would not expect this. Are we sure offloading is not shuttling data back and forth from the GPU (16 GB of VRAM vs. roughly 2x that on the CPU side)?
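
For what it's worth, a hedged sketch of partial offload with a cuBLAS build (the -ngl value below is only a guess at what might fit a 16 GB card with a Q4_0 70B model, not a tested number, and the model path is illustrative):

```sh
# cuBLAS build of llama.cpp; tune -ngl so the offloaded layers fit in VRAM and -t to your CPU core count
make clean && LLAMA_CUBLAS=1 make
./main -m ./models/llama-2-70b-chat/ggml-model-q4_0.bin -gqa 8 -ngl 20 -t 12 -p "Hello"
```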
