
Unable to run with Metal on M1 Mac (normal works fine) #2048

Closed
cheese-melted opened this issue Jun 29, 2023 · 11 comments

@cheese-melted

Prerequisites

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Behavior

Able to run:
$ ./main -m ./models/7B/ggml-model-q4_0.bin -n 128

Unable to run:
$ ./main -m ./models/7B/ggml-model-q4_0.bin -n 128 -ngl 1

Environment and Context

$ python3 --version
Python 3.9.12

$ make --version
GNU Make 3.81
...
This program built for i386-apple-darwin11.3.0

$ g++ --version
Apple clang version 14.0.3 (clang-1403.0.22.14.1)
Target: arm64-apple-darwin22.5.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

Failure Information (for bugs)

I notice:

examples/embd-input/embd-input-lib.cpp:217:12: warning: address of stack memory associated with local variable 'ret' returned [-Wreturn-stack-address]
    return ret.c_str();
           ^~~
1 warning generated.

during the build.
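For context on that warning: c_str() returns a pointer into a local std::string, which is destroyed when the function returns, so the caller gets a dangling pointer. A minimal sketch of the pattern and one possible fix (illustrative names, not the actual embd-input code):

#include <string>

// Buggy: 'ret' is a local, so the pointer returned by c_str()
// dangles as soon as the function returns.
const char * get_message_bad() {
    std::string ret = "hello";
    return ret.c_str(); // warning: address of stack memory associated with 'ret' returned
}

// One possible fix: keep the storage alive across the call, e.g. in a
// static string (not thread-safe, but it shows the idea).
const char * get_message_ok() {
    static std::string ret;
    ret = "hello";
    return ret.c_str();
}

The warning is unrelated to the Metal failure below, but the pattern it flags is a genuine lifetime bug.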

On running:
$ ./main -m ./models/7B/ggml-model-q4_0.bin -n 128 -ngl 1

I get (see logs below for full output):

ggml_metal_graph_compute: command buffer 0 failed with status 5
GGML_ASSERT: ggml-metal.m:975: false

Steps to Reproduce

  1. $ make clean
  2. $ LLAMA_METAL=1 make
  3. $ ./main -m ./models/7B/ggml-model-q4_0.bin -n 128 -ngl 1

Failure Logs

$ make clean
I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

rm -vf *.o *.so main quantize quantize-stats perplexity embedding benchmark-matmult save-load-state server vdot train-text-from-scratch embd-input-test build-info.h



$ LLAMA_METAL=1 make
I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL
I LDFLAGS:   -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

cc  -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG   -c ggml.c -o ggml.o
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL -c llama.cpp -o llama.o
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL -c examples/common.cpp -o common.o
cc -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG   -c -o k_quants.o k_quants.c
cc -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG -c ggml-metal.m -o ggml-metal.o
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/main/main.cpp ggml.o llama.o common.o k_quants.o ggml-metal.o -o main  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders

====  Run ./main -h for help.  ====

c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/quantize/quantize.cpp ggml.o llama.o k_quants.o ggml-metal.o -o quantize  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/quantize-stats/quantize-stats.cpp ggml.o llama.o k_quants.o ggml-metal.o -o quantize-stats  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/perplexity/perplexity.cpp ggml.o llama.o common.o k_quants.o ggml-metal.o -o perplexity  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/embedding/embedding.cpp ggml.o llama.o common.o k_quants.o ggml-metal.o -o embedding  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL pocs/vdot/vdot.cpp ggml.o k_quants.o ggml-metal.o -o vdot  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/train-text-from-scratch/train-text-from-scratch.cpp ggml.o llama.o k_quants.o ggml-metal.o -o train-text-from-scratch  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/simple/simple.cpp ggml.o llama.o common.o k_quants.o ggml-metal.o -o simple  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ --shared -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/embd-input/embd-input-lib.cpp ggml.o llama.o common.o k_quants.o ggml-metal.o -o libembdinput.so  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
examples/embd-input/embd-input-lib.cpp:217:12: warning: address of stack memory associated with local variable 'ret' returned [-Wreturn-stack-address]
    return ret.c_str();
           ^~~
1 warning generated.
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/embd-input/embd-input-test.cpp ggml.o llama.o common.o k_quants.o ggml-metal.o -o embd-input-test  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders -L. -lembdinput



$ ./main -m ./models/7B/ggml-model-q4_0.bin -n 128

main: build = 762 (96a712c)
main: seed  = 1688038998
llama.cpp: loading model from ./models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
llama_model_load_internal: mem required  = 5439.94 MB (+ 1026.00 MB per state)
llama_new_context_with_model: kv self size  =  256.00 MB

system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0



$ ./main -m ./models/7B/ggml-model-q4_0.bin -n 128 -ngl 1
main: build = 762 (96a712c)
main: seed  = 1688038252
llama.cpp: loading model from ./models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
llama_model_load_internal: mem required  = 5439.94 MB (+ 1026.00 MB per state)
llama_new_context_with_model: kv self size  =  256.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/alan/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add                            0x128007550
ggml_metal_init: loaded kernel_mul                            0x128007c70
ggml_metal_init: loaded kernel_mul_row                        0x1280082a0
ggml_metal_init: loaded kernel_scale                          0x1280087c0
ggml_metal_init: loaded kernel_silu                           0x128008ce0
ggml_metal_init: loaded kernel_relu                           0x128009200
ggml_metal_init: loaded kernel_gelu                           0x128009720
ggml_metal_init: loaded kernel_soft_max                       0x128009dd0
ggml_metal_init: loaded kernel_diag_mask_inf                  0x12800a430
ggml_metal_init: loaded kernel_get_rows_f16                   0x12800aab0
ggml_metal_init: loaded kernel_get_rows_q4_0                  0x12800b130
ggml_metal_init: loaded kernel_get_rows_q4_1                  0x12800b920
ggml_metal_init: loaded kernel_get_rows_q2_K                  0x12800bfa0
ggml_metal_init: loaded kernel_get_rows_q3_K                  0x12800c620
ggml_metal_init: loaded kernel_get_rows_q4_K                  0x12800cca0
ggml_metal_init: loaded kernel_get_rows_q5_K                  0x12800d320
ggml_metal_init: loaded kernel_get_rows_q6_K                  0x12800d9a0
ggml_metal_init: loaded kernel_rms_norm                       0x12800e050
ggml_metal_init: loaded kernel_norm                           0x12800e700
ggml_metal_init: loaded kernel_mul_mat_f16_f32                0x12800f0d0
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32               0x12800f7b0
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32               0x12800fe90
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32               0x128010570
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32               0x128010df0
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32               0x1280114d0
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32               0x128011bb0
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32               0x128012290
ggml_metal_init: loaded kernel_rope                           0x128012d80
ggml_metal_init: loaded kernel_alibi_f32                      0x128013640
ggml_metal_init: loaded kernel_cpy_f32_f16                    0x128013ed0
ggml_metal_init: loaded kernel_cpy_f32_f32                    0x128014760
ggml_metal_init: loaded kernel_cpy_f16_f16                    0x128014ff0
ggml_metal_init: recommendedMaxWorkingSetSize =  5461.34 MB
ggml_metal_init: hasUnifiedMemory             = true
ggml_metal_init: maxTransferRate              = built-in GPU
llama_new_context_with_model: max tensor size =   102.54 MB
ggml_metal_add_buffer: allocated 'data            ' buffer, size =  3648.31 MB, ( 3648.70 /  5461.34)
ggml_metal_add_buffer: allocated 'eval            ' buffer, size =   768.00 MB, ( 4416.70 /  5461.34)
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =   258.00 MB, ( 4674.70 /  5461.34)
ggml_metal_add_buffer: allocated 'scr0            ' buffer, size =   512.00 MB, ( 5186.70 /  5461.34)
ggml_metal_add_buffer: allocated 'scr1            ' buffer, size =   512.00 MB, ( 5698.70 /  5461.34), warning: current allocated size is greater than the recommended max working set size

system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0


ggml_metal_graph_compute: command buffer 0 failed with status 5
GGML_ASSERT: ggml-metal.m:975: false
zsh: abort      ./main -m ./models/7B/ggml-model-q4_0.bin -n 128 -ngl 1



$ git log | head -1
commit 96a712ca1b7f427e3bd7ffc0c70b2105cfc7fbf1



$ pip list | egrep "torch|numpy|sentencepiece"
numpy                  1.24.0
sentencepiece          0.1.98
torch                  2.0.0
@artemsablin

I may be wrong, but status 5 means you are running out of memory. I think Metal implementations cannot use more than half of the unified memory (or however much is reported as recommendedMaxWorkingSetSize).

In your logs:

llama_model_load_internal: mem required  = 5439.94 MB (+ 1026.00 MB per state)

ggml_metal_init: recommendedMaxWorkingSetSize =  5461.34 MB

Try a smaller model.
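The log bears this out: the buffers that ggml_metal_add_buffer allocates sum to more than recommendedMaxWorkingSetSize. A quick back-of-the-envelope check (a sketch using the numbers from the failing run above):

#include <cstdio>

int main() {
    // Metal buffers from the failing run: data, eval, kv, scr0, scr1 (MB)
    const double buffers[] = { 3648.31, 768.00, 258.00, 512.00, 512.00 };
    double total = 0.0;
    for (double b : buffers) total += b;

    const double limit = 5461.34; // recommendedMaxWorkingSetSize (MB)
    std::printf("allocated %.2f MB vs limit %.2f MB (over by %.2f MB)\n",
                total, limit, total - limit);
    return 0;
}

That prints roughly 5698 MB allocated against a 5461 MB limit, which matches the "current allocated size is greater than the recommended max working set size" warning in the log.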

@cheese-melted
Author

Okay, thanks 😎. If anybody ends up able to run the 7B model (q4_0 quantised) with Metal enabled on an M1 Mac with 8 GB RAM, let me know!!

cheese-melted closed this as not planned on Jun 30, 2023
@gunners81

> Okay, thanks 😎. If anybody ends up able to run the 7B model (q4_0 quantised) with Metal enabled on an M1 Mac with 8 GB RAM, let me know!!

I think it's not possible for now due to hardware limits. My M1 MacBook Air 8GB is only able to load a 3B model using GPU inference / MPS, and I'm happy with it :) since it's much faster than the CPU.

@ggerganov
Owner

Have you tried this branch: #2011

@cheese-melted
Author

I've just given it a go; still the same error message.

@RonanKMcGovern

I managed to get it working with the smallest model (llama-2-7b-chat.ggmlv3.q2_K.bin) from TheBloke. I tried llama-2-7b-chat.ggmlv3.q4_K_S too but ran out of memory.

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
LLAMA_METAL=1 make

then

./main -t 8 -ngl 32 -m llama-2-7b-chat.ggmlv3.q2_K.bin --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 --in-prefix-bos --in-prefix ' [INST] ' --in-suffix ' [/INST]' -i -ins

I was able to chat with Llama 2 using that. The streaming is not smooth, but not too bad. Here is the data:

>  [INST] How's the form?
 [/INST]The form is great! It's always a pleasure to receive feedback from you. Your insights and suggestions are invaluable, and I will definitely consider them as I continue to improve my responses. Thank you for taking the time to review my work and for your ongoing support. How may I assist you today?

>  [INST] Can you search the web?
 [/INST]Of course! I can definitely help you with that. What would you like me to search for? Please provide me with the relevant keywords or terms, and I will do my best to find the information you're looking for.

>  [INST] Search for apple tart
 [/INST]Sure! Here are some delicious apple tart recipes you might enjoy:

1. Apple Tart Recipe - A classic recipe with a buttery crust and a sweet, juicy apple filling.
2. Mini Apple Tarts - Bite-sized versions of the classic tart, perfect for individual portions or parties.
3. Apple Tart Bar - A no-bake dessert that's easy to prepare and perfect for a crowd.
4. Caramel Apple Tart - A twist on the classic tart, with a layer of caramel sauce added to the filling.
5. Apple Almond Tarts - A delicious combination of apples and almonds in a buttery crust.
6. Apple Crumble Tart - A hearty tart with a crumbly oat topping, perfect for a comforting dessert.
7. Apple Cranberry Tarts - A fruity twist on the classic tart, with cranberries added to the filling.
8. Apple Cheddar Tart - A savory tart that's perfect for a cheese board or pic
>  [INST] 

llama_print_timings:        load time =  2563.87 ms
llama_print_timings:      sample time =  1275.74 ms /   360 runs   (    3.54 ms per token,   282.19 tokens per second)
llama_print_timings: prompt eval time = 13966.18 ms /    92 tokens (  151.81 ms per token,     6.59 tokens per second)
llama_print_timings:        eval time = 114639.65 ms /   360 runs   (  318.44 ms per token,     3.14 tokens per second)
llama_print_timings:       total time = 171982.80 ms

@cheese-melted
Author

Thanks! The 3-bit quantisation works as well :)

@linghunjiushu

> Have you tried this branch: #2011

Thanks. Hope this helps.

I tried #2011; it doesn't work. But when I changed n_ctx from 4096 to 2000, it works fine.

My env: Apple M2 Pro, 16 GB
Model: llama-2-13b-chat.ggmlv3.q4_0.bin
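Presumably that change was made via the -c (context size) flag; an illustrative invocation (not the commenter's exact command) would be:

$ ./main -m llama-2-13b-chat.ggmlv3.q4_0.bin -c 2000 -ngl 1 -n 128

A smaller context shrinks the KV cache and scratch buffers, which is why it can fit where -c 4096 does not.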

@RDearnaley

What would be nice is if reducing -ngl (--n-gpu-layers) fixed the problem, so that only the layers Metal actually runs counted towards recommendedMaxWorkingSetSize. I have an M2 MacBook Pro with 64 GB (recommendedMaxWorkingSetSize = 49152.00 MB), so I can run Llama 2 70B 6-bit quantized on the CPU (rather slowly), but not on the GPU. I guess I'll need to redownload one of the 5-bit quantized versions...

@kitther

kitther commented Feb 15, 2024

@cheese-melted, reducing n_gpu_layers can lower the memory pressure on the GPU, which may solve your problem. But it will use the CPU more and thus slow you down.
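For example, offloading only half of the 7B model's 32 layers instead of all of them (the layer count here is illustrative; tune it to your model and RAM):

$ ./main -m ./models/7B/ggml-model-q4_0.bin -n 128 -ngl 16

With -ngl 0 no layers are offloaded and the model runs entirely on the CPU.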

@viandmarket25

viandmarket25 commented Feb 29, 2024

From my experience:

Device: MacBook Pro 2017, 13-inch, Intel Iris GPU, 16 GB RAM (Metal not supported)

I tried running llama-2-7b.Q2_K.gguf (2.5 GB) and got errors including:

ggml_metal_graph_compute: command buffer 0 failed with status 5

My solution was to re-compile using:

make LLAMA_NO_METAL=1

It now works, even though it is not so fast.
