Error: 70B model quantizing on Mac: tensor 'layers.0.attention.wk.weight' has wrong shape; expected 8192 x 8192 #407
I can confirm I also have this issue. I tried this method (ggerganov/llama.cpp#2276 (comment)) unsuccessfully.
I also faced this problem. Any solutions?
@asdf17128 and @NilsHellwig, are you both on M2? I'm currently attempting to run this on an M2 Ultra with a 24-core CPU, 76-core GPU, and 32-core Neural Engine. I've tried both Metal and non-Metal builds.
@schappim I'm using an M1 Max (32 GB RAM) with a 24-core GPU.
Mac Studio M2 Ultra with 192 GB RAM.
Maybe some changes need to be made to llama.cpp; no clue.
@asdf17128 I think so too. I guess it's not our fault. I've opened a new issue...
v2 70B is not supported right now because it uses a different attention method: grouped-query attention, where 64 query heads share only 8 key/value heads, so with a head dimension of 128 the K/V projections are 8192 x 1024 instead of 8192 x 8192, which is exactly the mismatch in the error. #2276 is a proof of concept to make it work. The convert script should not require changes, because the only thing that changed is the shape of some tensors, and convert.py can handle that; the same goes for quantize.cpp. If you have trouble converting, I suggest you check that the .pth files are correct; Meta's download script is not the best.
@SlyEcho, I don't think it is the download, as I downloaded it twice: once via git lfs and once via the script.
@schappim, the "wrong shape" error is expected because llama.cpp does not support 70B yet. However, errors in convert.py would mean the data files are not in the correct format. OK, maybe I was a little confused as to who had which error.
The 7B and 13B versions do work without any problems, however :)
Confirming the same as @NilsHellwig: the 7B and 13B versions do work...
There are checksum files on HF if you have access:
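A quick way to rule out a corrupt download (a sketch; checklist.chk is the checksum file Meta ships alongside the Llama 2 weights, and the local path is illustrative):

```bash
# Verify the downloaded weights against the bundled checksum file.
# On macOS, `brew install coreutils` provides md5sum.
cd models/Llama-2-70b
md5sum -c checklist.chk
```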
Getting the same error as everyone for 70B-chat: MacBook Pro, M1 chip, 16 GB.
Please apply https://github.com/ggerganov/llama.cpp/pull/2276.diff and then test. |
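For anyone unsure how to do that, one way to apply the patch locally (a sketch; it assumes a git checkout of llama.cpp and that the PR diff still applies cleanly):

```bash
# Fetch the proof-of-concept diff from PR #2276 and apply it
cd llama.cpp
curl -L -o 2276.diff https://github.com/ggerganov/llama.cpp/pull/2276.diff
git apply 2276.diff
make clean && make   # rebuild with the patched attention code
```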
Hi, I also tried to run this on an M1 with 64 GB.
You need to pass the argument `-gqa 8`.
I only added `-gqa 8`.
Got it working with the `-gqa 8` flag.
I have a MacBook Pro with an M1 Max and 64 GB. I encountered this error and understand that convert.py needs to be updated to match the unique attention layer used by the 70B models. After replacing convert.py, running the conversion again, and adding "-gqa 8" to the parameters of 'main', I can use the model properly. The speed is a bit slow; I wonder whether M2 Apple Silicon owners get much better performance? I asked the model for its knowledge cutoff and it said "August 26, 2020". I suspected that was a hallucination, so I asked about some significant world events and world leaders; it knew recent events, so I challenged it on the truthfulness of the cutoff date it claimed, and it apologized and said it should be "December 2022". I think December 2022 makes more sense.
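For reference, the whole sequence looks roughly like this (a sketch based on the pre-GGUF llama.cpp tooling of that period; the model paths, output filenames, and q4_0 quantization are illustrative):

```bash
# Re-convert the original .pth weights with the updated convert.py
python3 convert.py models/llama-2-70b/ --outfile models/llama-2-70b/ggml-model-f16.bin

# Quantize (q4_0 chosen here only as an example)
./quantize models/llama-2-70b/ggml-model-f16.bin models/llama-2-70b/ggml-model-q4_0.bin q4_0

# Run with -gqa 8 so the 70B grouped-query attention shapes are handled
./main -m models/llama-2-70b/ggml-model-q4_0.bin -gqa 8 -p "Hello"
```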
@moseswmwong Where do I get the updated convert.py script?
M2 Max 64 GB is pretty slow as well.
If you have an M2 Max with 96 GB, try adding -ngl 38 to use Metal acceleration (or a lower number if you don't have that many GPU cores). The initial load is still slow, given that I tested it with a longer prompt, but afterwards, in interactive mode, the back-and-forth is almost as fast as the original ChatGPT felt when I first tried it (in the few days when everyone was hitting it). So to me, it is totally usable, and it's my go-to if I ever have no Wi-Fi and need to search. NB: you need to build main with the Metal option; this is documented in the llama.cpp repo. See the sketch below.
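A minimal sketch of that build-and-run sequence (assuming the Makefile-based build llama.cpp used at the time; the model path is illustrative):

```bash
# Build with Metal support, as documented in the llama.cpp README
make clean && LLAMA_METAL=1 make

# Offload 38 layers to the GPU and chat interactively, keeping the GQA fix
./main -m models/llama-2-70b/ggml-model-q4_0.bin -gqa 8 -ngl 38 -i
```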
@kechan Thanks for the advice! It works at ChatGPT-like speed on my side too! And I can see memory and GPU usage increase when the AI "thinks".
...
Were you able to resolve this issue?
@franssauermann No, I gave up. Since the CPU performance would not have been great anyway, I didn't pursue this. Just for reference, I tested this on Windows 10, not a Mac M1/M2.
I can confirm that 70B on an M1 Pro with 32 GB runs, but really slowly.
CuBLAS with --gqa 8 on a Ryzen 9 5900HX with 32 GB, along with an NVIDIA RTX 3080 16 GB, reading the model from an external USB 3.2 SSD, is too slow to work with. Are there any settings that would speed things up? I would be very grateful for any help.
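A couple of things that may be worth checking (a hedged sketch; the -ngl value is a guess that depends on how many layers actually fit in 16 GB of VRAM, and the paths are illustrative):

```bash
# Build with cuBLAS support (the build flag llama.cpp used at the time)
make clean && LLAMA_CUBLAS=1 make

# Copy the model off the external USB SSD first; slow storage throttles loading
cp /mnt/usb/ggml-model-q4_0.bin /fast/local/

# Offload as many layers as fit in VRAM
./main -m /fast/local/ggml-model-q4_0.bin --gqa 8 -ngl 20 -p "Hello"
```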
Thanks, I suspected this from the 8 x 1024 errors, so good tip.
I would not expect this. Are we sure offloading is not going back and forth with the GPU (16 GB vs. 2x that on the CPU)?
Used this model: https://huggingface.co/meta-llama/Llama-2-70b
Used these commands:
The 7B and 13B models work without any problems; this happens only when using the 70B model.
error loading model: llama.cpp: tensor 'layers.0.attention.wk.weight' has wrong shape; expected 8192 x 8192, got 8192 x 1024