Error: 70B Model quantizing on mac: tensor 'layers.0.attention.wk.weight' has wrong shape; expected 8192 x 8192 #407

Open
NilsHellwig opened this issue Jul 19, 2023 · 34 comments
Labels
compatibility issues arising from specific hardware or system configs

Comments

@NilsHellwig

NilsHellwig commented Jul 19, 2023

Used this model: https://huggingface.co/meta-llama/Llama-2-70b

Used these commands:

$ convert-pth-to-ggml.py models/LLaMa2-70B-meta 1
$ ./quantize ./models/LLaMa2-70B-meta/ggml-model-f16.bin ./models/LLaMa2-70B-meta/ggml-model-q4_0.bin 2

The 7B and 13B models work without any problems. The error only occurs when using the 70B model.

error loading model: llama.cpp: tensor 'layers.0.attention.wk.weight' has wrong shape; expected 8192 x 8192, got 8192 x 1024

llama.cpp: loading model from /Users/xyz/Desktop/llama.cpp/models/LLaMa2-70B-meta/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 22016
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size =    0.19 MB
error loading model: llama.cpp: tensor 'layers.0.attention.wk.weight' has wrong shape; expected  8192 x  8192, got  8192 x  1024
llama_load_model_from_file: failed to load model
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[2], line 1
----> 1 llm = Llama(model_path="/Users/xyz/Desktop/llama.cpp/models/LLaMa2-70B-meta/ggml-model-q4_0.bin", n_ctx=512, seed=43, n_threads=8, n_gpu_layers=1)

File /opt/homebrew/Caskroom/miniforge/base/envs/tensorflow_m1/lib/python3.11/site-packages/llama_cpp/llama.py:305, in Llama.__init__(self, model_path, n_ctx, n_parts, n_gpu_layers, seed, f16_kv, logits_all, vocab_only, use_mmap, use_mlock, embedding, n_threads, n_batch, last_n_tokens_size, lora_base, lora_path, low_vram, tensor_split, rope_freq_base, rope_freq_scale, verbose)
    300     raise ValueError(f"Model path does not exist: {model_path}")
    302 self.model = llama_cpp.llama_load_model_from_file(
    303     self.model_path.encode("utf-8"), self.params
    304 )
--> 305 assert self.model is not None
    307 self.ctx = llama_cpp.llama_new_context_with_model(self.model, self.params)
    309 assert self.ctx is not None

AssertionError: 
@schappim

I can confirm I also have this issue. I tried this method (ggerganov/llama.cpp#2276 (comment)) unsuccessfully.

@asdf17128

I also faced this problem. Any solutions?

@schappim

schappim commented Jul 20, 2023

@asdf17128 and @NilsHellwig, are you guys on M2? I'm currently attempting to run this on an M2 Ultra with 24-core CPU, 76-core GPU, 32-core Neural Engine. I've tried both metal and non-metal builds.

@NilsHellwig
Author

@schappim I'm using an M1 Max (32 GB RAM) with a 24-core GPU.

@asdf17128

@asdf17128 and @NilsHellwig, are you guys on M2? I'm currently attempting to run this on an M2 Ultra.

A Mac Studio M2 Ultra with 192 GB RAM.

@asdf17128

Maybe some changes need to be made to llama.cpp; no clue.

@NilsHellwig
Author

NilsHellwig commented Jul 20, 2023

@asdf17128 I think so too. I guess it's not our fault. I've opened a new issue:

ggerganov/llama.cpp#2285

@SlyEcho

SlyEcho commented Jul 20, 2023

LLaMA v2 70B is not supported right now because it uses a different attention method (grouped-query attention). #2276 is a proof of concept to make it work. The convert script should not require changes, because the only thing that changed is the shape of some tensors and convert.py can handle that; the same goes for quantize.cpp.

If you have trouble converting, I suggest you check that the .pth files are correct; Meta's download script is not the best.
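
To make the shape mismatch concrete (a rough note, with numbers taken from the loader output above): 70B uses grouped-query attention with only 8 KV heads for its 64 query heads, so the K and V projections are 8x narrower than the Q projection, which is exactly the 8192 x 1024 vs. 8192 x 8192 difference older llama.cpp complains about.

```sh
# LLaMA-2 70B: n_embd = 8192, n_head = 64, n_head_kv = 8 (grouped-query attention)
# head_dim   = n_embd / n_head      = 8192 / 64 = 128
# wk/wv rows = n_head_kv * head_dim = 8 * 128   = 1024   -> "got 8192 x 1024"
# wq rows    = n_head * head_dim    = 64 * 128  = 8192   -> what pre-GQA llama.cpp expected
echo $(( (8192 / 64) * 8 ))   # prints 1024
```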

@schappim

@SlyEcho, I don't think it is the download, as I downloaded it twice: once via git lfs and once via the script.
I've checked out the experimental branch and tried that; it also generates the same "wrong shape" error.
I tried with both LLAMA_METAL enabled and disabled.

@SlyEcho

SlyEcho commented Jul 20, 2023

@schappim, "wrong shape" error is expected because llama.cpp does not support 70B yet. However errors in convert.py would mean the data files are not in the correct format. OK, maybe I was a little confused as to who had which error.

@NilsHellwig
Author

The 7B and 13B versions do work without any problems, however :)

@schappim

Confirming that, same as @NilsHellwig, the 7B and 13B versions do work.

@SlyEcho

SlyEcho commented Jul 20, 2023

There are checksum files in HF if you have access:

@jain-t

jain-t commented Jul 20, 2023

Getting the same error as everyone for 70B-chat

MacBook Pro, M1 chip, 16 GB

main: build = 852 (294f424)
main: seed  = 1689858440
llama.cpp: loading model from /Volumes/Gargantua/LLAMA2/llama-2-70b-chat/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 22016
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size =    0.19 MB
error loading model: llama.cpp: tensor 'layers.0.attention.wk.weight' has wrong shape; expected  8192 x  8192, got  8192 x  1024
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/Volumes/Gargantua/LLAMA2/llama-2-70b-chat/ggml-model-q4_0.bin'
main: error: unable to load model

@SlyEcho

SlyEcho commented Jul 20, 2023

Please apply https://github.com/ggerganov/llama.cpp/pull/2276.diff and then test.
I have tested and it does work.
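
Roughly, that looks like the following (a sketch; it assumes a llama.cpp checkout from before the PR was merged, and a plain CPU build):

```sh
# Download and apply the PR diff, then rebuild (add LLAMA_METAL=1 to make for the Metal build)
curl -L -o 2276.diff https://github.com/ggerganov/llama.cpp/pull/2276.diff
git apply 2276.diff
make clean && make
```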

@tony163163

```diff
@@ -1459,7 +1468,7 @@ static bool llama_eval_internal(
         offload_func_kq(tmpq);
         ggml_set_name(tmpq, "tmpq");

-        struct ggml_tensor * Kcur = ggml_rope_custom_inplace(ctx0, ggml_reshape_3d(ctx0, tmpk, n_embd/n_head, n_head, N), n_past, n_rot, 0, freq_base, freq_scale, 0);
+        struct ggml_tensor * Kcur = ggml_rope_custom_inplace(ctx0, ggml_reshape_3d(ctx0, tmpk, n_embd/n_head, n_head/8, N), n_past, n_rot, 0, freq_base, freq_scale, 0);
         offload_func_kq(Kcur);
         ggml_set_name(Kcur, "Kcur");
```

or

```diff
+        struct ggml_tensor * Kcur = ggml_rope_custom_inplace(ctx0, ggml_reshape_3d(ctx0, tmpk, n_embd/n_head, n_head/8, N), n_past, n_rot, 0, 0, freq_base, freq_scale);
```

@k-hkrachenfels

Hi, I'm also trying to run this on an M1 with 64 GB.
In the meantime, the above PR (https://github.com/ggerganov/llama.cpp/pull/2276.diff) seems to have been merged. I wonder why I still get the above error when working on the master branch.

main: build = 911 (fce48ca)
main: seed  = 1690292053
llama.cpp: loading model from llama-2-70b-chat.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 4096
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_head_kv  = 64
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 1,0e-06
llama_model_load_internal: n_ff       = 24576
llama_model_load_internal: freq_base  = 10000,0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size =    0,21 MB
error loading model: llama.cpp: tensor 'layers.0.attention.wk.weight' has wrong shape; expected  8192 x  8192, got  8192 x  1024
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model 'llama-2-70b-chat.ggmlv3.q4_0.bin'
main: error: unable to load model

Is LLaMa2-70B-meta/ggml-model-q4_0.bin the correct model? Do I have to convert the model first?

@SlyEcho

SlyEcho commented Jul 25, 2023

You need to pass the argument -gqa 8 as it's not detected automatically yet.
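
For example (a minimal sketch; the model path is the one from earlier in this thread and the prompt is just illustrative):

```sh
# -gqa 8 tells llama.cpp the grouped-query-attention factor for LLaMA-2 70B; it is not auto-detected yet
./main -m ./models/LLaMa2-70B-meta/ggml-model-q4_0.bin -gqa 8 -p "Hello"
```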

@Tavernari

I only added -gqa 8 on my main execution command, and it worked! Thanks.

@mkarpusiewicz

Is the HF model supported? With --gqa 8 I get:

main: build = 912 (07aaa0f)
main: seed  = 1690379540
llama.cpp: loading model from .\models\llama2-70b-chat-hf-ggml-model-q4_0.bin
llama_model_load_internal: warning: assuming 70B model based on GQA == 8
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_head_kv  = 8
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 8
llama_model_load_internal: rnorm_eps  = 1.0e-06
llama_model_load_internal: n_ff       = 28416
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: model size = 70B
llama_model_load_internal: ggml ctx size =    0.21 MB
error loading model: llama.cpp: tensor 'layers.0.feed_forward.w1.weight' has wrong shape; expected  8192 x 28416, got  8192 x 28672
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '.\models\llama2-70b-chat-hf-ggml-model-q4_0.bin'
main: error: unable to load model

@euan

euan commented Aug 1, 2023

Got it working with the --gqa 8 flag, but it was way too slow to use.
Also, when you get there, ask it for the date of the last information it has. The response will shock you!

@moseswmwong

moseswmwong commented Aug 2, 2023

I have a MacBook Pro with an M1 Max and 64 GB. I encountered this error and understand that convert.py needs to be updated to match the unique attention layout used by the 70B models. After replacing convert.py, running the conversion again, and adding "-gqa 8" to the parameters of the 'main' program, I can use the model properly; a rough sketch of the commands follows.
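
For reference, the re-conversion flow described above looks roughly like this (a sketch; the exact convert.py options may differ between revisions, and the model directory is the one used earlier in this thread):

```sh
# Re-convert with the updated convert.py, then re-quantize; run main with -gqa 8 as noted above
python convert.py models/LLaMa2-70B-meta --outtype f16
./quantize models/LLaMa2-70B-meta/ggml-model-f16.bin models/LLaMa2-70B-meta/ggml-model-q4_0.bin q4_0
```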

The speed is a bit slow; I wonder if M2 Apple Silicon owners get much better performance?

I asked about the last information it has and it said "August 26, 2020". I doubted it was a bluff (hallucination), so I asked about some significant world events and world leaders; it knew recent events, so I challenged it on the truthfulness of the cutoff date it claimed, and it apologized and said it should be "December 2022".

I think December 2022 makes more sense.

@euan

euan commented Aug 3, 2023

@moseswmwong Where do I get the updated convert.py script?
I have an M2 Max and running 70B is dog slow.

@moseswmwong

moseswmwong commented Aug 3, 2023

@euan Go to the page for PR #2276, choose the "Files changed" tab, press the three dots, choose "View file", and download the raw file.

@aldarisbm

The speed is a bit slow; I wonder if M2 Apple Silicon owners get much better performance?

M2 Max with 64 GB is pretty slow as well.

@kechan

kechan commented Aug 4, 2023

If you have an M2 Max with 96 GB, try adding -ngl 38 to use Metal (MPS) acceleration (or a lower number if you don't have that many GPU cores). The initial load is still slow (I tested with a longer prompt), but afterwards, in interactive mode, the back and forth is almost as fast as the original ChatGPT felt when I first used it (in the few days when everyone was hitting it). So to me it is totally usable, and it's the go-to thing if I ever have no Wi-Fi and need to search.

NB: you need to build main with the Metal option; this is documented in the llama.cpp repo.
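
Concretely, the Metal setup described above looks something like this (a sketch; 38 offloaded layers is what the 96 GB machine above used, and the model path is illustrative):

```sh
# Build with Metal enabled, then offload layers to the GPU with -ngl; -i drops into interactive mode
make clean && LLAMA_METAL=1 make
./main -m ./models/LLaMa2-70B-meta/ggml-model-q4_0.bin -gqa 8 -ngl 38 -i
```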

@moseswmwong

@kechan Thanks for the advice! It runs at ChatGPT-like speed on my side too! And I can see memory and GPU usage increase when the AI "thinks".

@aldarisbm

Yeah, -ngl 30 definitely pegs all of my GPU cores on an M2 Max (30 cores).
[Screenshot 2023-08-04 at 08:04:34: GPU utilization]

@franssauermann

...
Is the HF model supported? With --gqa 8 I get:
...

error loading model: llama.cpp: tensor 'layers.0.feed_forward.w1.weight' has wrong shape; expected 8192 x 28416, got 8192 x 28672

Were you able to resolve this issue?

@mkarpusiewicz

@franssauermann No, I gave up; since the CPU performance would not be great anyway, I didn't pursue this. Just for reference, I tested this on Windows 10, not on a Mac M1/M2.

@skobak

skobak commented Aug 17, 2023

I can confirm that 70B on an M1 Pro with 32 GB runs really slowly.

@primus1a

CuBLAS with --gqa 8 on Ryzen 9 5900HX/32GB, along with NVIDIA RTX 3080/16GB on an external USB 3.2 SSD, is too slow to work with. Are there any setting options that would speed things up? I would be very grateful for any help.

@RichardScottOZ

I only added -gqa 8 on my main execution command, and it worked! Thanks.

Thanks, I suspected this from the 8192 x 1024 shape errors, so good tip.

@albertodepaola added the "compatibility issues arising from specific hardware or system configs" label on Sep 6, 2023
@loretoparisi

CuBLAS with --gqa 8 on Ryzen 9 5900HX/32GB, along with NVIDIA RTX 3080/16GB on an external USB 3.2 SSD, is too slow to work with. Are there any setting options that would speed things up? I would be very grateful for any help.

I would not expect this. Are we sure offloading is not shuttling data back and forth from the GPU (16 GB of VRAM vs. roughly 2x that on the CPU side)?
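
For what it's worth, a hedged sketch of partial offload with a cuBLAS build (the -ngl value below is only a guess at what might fit a 16 GB card with a Q4_0 70B model, not a tested number, and the model path is illustrative):

```sh
# cuBLAS build of llama.cpp; tune -ngl so the offloaded layers fit in VRAM and -t to your CPU core count
make clean && LLAMA_CUBLAS=1 make
./main -m ./models/llama-2-70b-chat/ggml-model-q4_0.bin -gqa 8 -ngl 20 -t 12 -p "Hello"
```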
