metal: new q4_0 mat-vec mul kernel #2188

Merged 1 commit from mps-q4_0-kernel into ggerganov:master on Jul 12, 2023

Conversation

lshzh-ww (Collaborator) commented on Jul 12, 2023

Prefetch data to achieve better memory bandwidth utilization. With the new kernel, token generation is ~48% faster for the 33B model and ~14% faster for the 7B model. Tests on the 65B model are welcome.

33B model      master       this PR
GPU Read BW*   167 GB/s     275 GB/s
prompt 1       117 ms/tok   79 ms/tok
prompt 2       121 ms/tok   82 ms/tok

7B model       master       this PR
prompt 1       25.7 ms/tok  22.5 ms/tok

The GPUs of the M1 Max / M2 Max can sustain 340 GB/s; can we reach that in llama.cpp? 😳

* GPU read bandwidth measured using the Developer Tools that come with Xcode.
[Screenshot: Xcode GPU Read Bandwidth measurement, 2023-07-12 01:24:05]

** Measured with the following command on an M1 Max 32GB. Generation results are the same between this PR and the master branch.
./main -m model_file -n 128 -c 512 -s 12 -ngl 1 --no-mmap
Prompt 1 “”
Prompt 2 “I believe the meaning of life is”
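For context on what the kernel is doing: a q4_0 weight row is a sequence of 32-element blocks, each holding 4-bit quants plus a per-block scale, and single-token generation is essentially one streaming pass of mat-vec products over those blocks, which is why the numbers above track memory bandwidth. The following plain-C sketch is illustrative only (not the actual ggml or Metal code); it shows the per-row work and the load-the-next-block-while-computing idea behind the prefetching in this PR:

/* Illustrative sketch, not the actual ggml or Metal code. */
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

#define QK4_0 32                        /* weights per quantization block */

typedef struct {
    float   d;                          /* per-block scale (fp16 in real ggml) */
    uint8_t qs[QK4_0 / 2];              /* 32 x 4-bit weights, two per byte    */
} block_q4_0;

/* dot product of one quantized block with 32 floats of the input vector */
static float dot_block_q4_0(const block_q4_0 *b, const float *y) {
    float acc = 0.0f;
    for (int j = 0; j < QK4_0 / 2; ++j) {
        const int q_lo = (b->qs[j] & 0x0F) - 8;   /* elements 0..15  */
        const int q_hi = (b->qs[j] >> 4)   - 8;   /* elements 16..31 */
        acc += q_lo * y[j] + q_hi * y[j + QK4_0 / 2];
    }
    return acc * b->d;
}

/* one output row of the mat-vec: nb blocks, keeping the next block "in flight"
 * while the current one is consumed (the software-pipelining / prefetch idea) */
static float row_dot_q4_0(const block_q4_0 *row, const float *y, size_t nb) {
    float acc = 0.0f;
    block_q4_0 cur = row[0];
    for (size_t b = 0; b < nb; ++b) {
        const block_q4_0 next = (b + 1 < nb) ? row[b + 1] : cur;  /* prefetch */
        acc += dot_block_q4_0(&cur, y + b * QK4_0);               /* compute  */
        cur = next;
    }
    return acc;
}

int main(void) {
    block_q4_0 blk = { .d = 0.5f };     /* every nibble = 9 -> value 0.5*(9-8) */
    for (int j = 0; j < QK4_0 / 2; ++j) blk.qs[j] = 0x99;
    float y[QK4_0];
    for (int j = 0; j < QK4_0; ++j) y[j] = 1.0f;
    printf("row dot = %f (expect 16)\n", (double) row_dot_q4_0(&blk, y, 1));
    return 0;
}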

lshzh-ww (Collaborator, Author) commented

Please feel free to share results on M1/M2 Pro/Max/Ultra!

ikawrakow (Contributor) commented on Jul 12, 2023

> Please feel free to share results on M1/M2 Pro/Max/Ultra!

Well, I had not checked out the PR branch. With the PR branch I get on M2 Max with 30-core GPU, 64 GB RAM:

  • 7B is 21.5 ms/t on master, 21.0 ms/t with this PR
  • 33B is 91.3 ms/t for master and 77.4 ms/t with this PR.

The above is for an empty prompt. If I use -p "I believe the meaning of life is" instead, I get for 33B ~92 ms/t on master and ~84 ms/t with this PR.

CyborgArmy83 commented

Looks promising! From what I've read before, there is still a lot of room for Metal code improvements, so this could be very welcome. I will try to test it out on my M1 Max machine as well.

lshzh-ww (Collaborator, Author) commented on Jul 12, 2023

> Well, I had not checked out the PR branch. With the PR branch I get on M2 Max with 30-core GPU, 64 GB RAM:
>
>   • 7B is 21.5 ms/t on master, 21.0 ms/t with this PR
>   • 33B is 91.3 ms/t for master and 77.4 ms/t with this PR.
>
> The above is for an empty prompt. If I use -p "I believe the meaning of life is" instead, I get for 33B ~92 ms/t on master and ~84 ms/t with this PR.

Nice! I updated the measurements with the new prompt. It looks like this PR brings the M1 series to the same speed as the M2 series. I guess the M2 series has better memory access prediction, so its speed is already good without this PR. Nonetheless, for the 65B model I expect we will see a larger performance improvement for the M2 series as well.

lshzh-ww (Collaborator, Author) commented

Would be nice to see results on M1 Ultra or M2 Ultra chips. Their two-die design may also suffer from cache misses.

ggerganov added the performance (Speed related topics) and high priority (Very important issue) labels on Jul 12, 2023
adrienbrault commented

Here's my benchmark: MacBook Pro 14", Apple M2 Max, 32 GB, 12-core CPU, 30-core GPU

13B model            master         this PR
prompt 1 (run 1/2)   46.97 ms/tok   36.70 ms/tok
prompt 1 (run 2/2)   50.17 ms/tok   37.40 ms/tok
Logs
17:53:48 in ~/Developer/ai/llama.cpp on  mps-q4_0-kernel 
✘130 ➜ ./main -m ../models/TheBloke_WizardLM-13B-V1.1-GGML/wizardlm-13b-v1.1.ggmlv3.q4_0.bin -n 128 -c 512 -s 12 -ngl 1 --no-mmap -p ""
main: build = 821 (38ec9a2)
main: seed  = 12
llama.cpp: loading model from ../models/TheBloke_WizardLM-13B-V1.1-GGML/wizardlm-13b-v1.1.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32001
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 6983.71 MB
llama_model_load_internal: mem required  = 9031.71 MB (+ 1608.00 MB per state)
llama_new_context_with_model: kv self size  =  400.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Volumes/Developer/ai/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add                            0x102a04c70
ggml_metal_init: loaded kernel_mul                            0x102a054b0
ggml_metal_init: loaded kernel_mul_row                        0x102a05ae0
ggml_metal_init: loaded kernel_scale                          0x102a06000
ggml_metal_init: loaded kernel_silu                           0x102a06520
ggml_metal_init: loaded kernel_relu                           0x102a06a40
ggml_metal_init: loaded kernel_gelu                           0x102a06f60
ggml_metal_init: loaded kernel_soft_max                       0x102b06b60
ggml_metal_init: loaded kernel_diag_mask_inf                  0x102b07120
ggml_metal_init: loaded kernel_get_rows_f16                   0x102b077a0
ggml_metal_init: loaded kernel_get_rows_q4_0                  0x102b07e20
ggml_metal_init: loaded kernel_get_rows_q4_1                  0x102b08610
ggml_metal_init: loaded kernel_get_rows_q2_K                  0x13de074f0
ggml_metal_init: loaded kernel_get_rows_q3_K                  0x13de07c90
ggml_metal_init: loaded kernel_get_rows_q4_K                  0x13de08310
ggml_metal_init: loaded kernel_get_rows_q5_K                  0x13de08990
ggml_metal_init: loaded kernel_get_rows_q6_K                  0x13de09010
ggml_metal_init: loaded kernel_rms_norm                       0x13de096c0
ggml_metal_init: loaded kernel_norm                           0x13de09d70
ggml_metal_init: loaded kernel_mul_mat_f16_f32                0x13de0a860
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32               0x13de0af40
ggml_metal_init: loaded kernel_mul_mat_vec_q4_0_f32           0x13de0b780
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32               0x13de0bd40
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32               0x13de0c5a0
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32               0x102a07540
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32               0x102a07d40
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32               0x102a08420
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32               0x102a08d00
ggml_metal_init: loaded kernel_rope                           0x102a095f0
ggml_metal_init: loaded kernel_alibi_f32                      0x102a09eb0
ggml_metal_init: loaded kernel_cpy_f32_f16                    0x102a0a740
ggml_metal_init: loaded kernel_cpy_f32_f32                    0x102a0b230
ggml_metal_init: loaded kernel_cpy_f16_f16                    0x102a0bac0
ggml_metal_init: recommendedMaxWorkingSetSize = 21845.34 MB
ggml_metal_init: hasUnifiedMemory             = true
ggml_metal_init: maxTransferRate              = built-in GPU
llama_new_context_with_model: max tensor size =    87.89 MB
ggml_metal_add_buffer: allocated 'data            ' buffer, size =  6983.72 MB, ( 6984.17 / 21845.34)
ggml_metal_add_buffer: allocated 'eval            ' buffer, size =  1024.00 MB, ( 8008.17 / 21845.34)
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =   402.00 MB, ( 8410.17 / 21845.34)
ggml_metal_add_buffer: allocated 'scr0            ' buffer, size =   512.00 MB, ( 8922.17 / 21845.34)
ggml_metal_add_buffer: allocated 'scr1            ' buffer, size =   512.00 MB, ( 9434.17 / 21845.34)

system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0



// RUN: %target-swift-ide-test -code-completion -source-filename %s -code-completion-token=GLOBAL_ENUM_1 > %t.globaleenum1.txt
// RUN: FileCheck --input-file=%t.globaleenum1.txt %s

// CHECK: CODE_ completionsForGlobalEnum1
// CHECK-NEXT: SYMBOL KIND
// CHECK-NEXT: BASE_ENUM=TypeName
// CHECK-NEXT: ◂────────────────────────
llama_print_timings:        load time =  4994.00 ms
llama_print_timings:      sample time =   180.25 ms /   128 runs   (    1.41 ms per token,   710.13 tokens per second)
llama_print_timings: prompt eval time =   375.17 ms /     2 tokens (  187.58 ms per token,     5.33 tokens per second)
llama_print_timings:        eval time =  4660.87 ms /   127 runs   (   36.70 ms per token,    27.25 tokens per second)
llama_print_timings:       total time =  5234.85 ms
ggml_metal_free: deallocating

17:54:12 in ~/Developer/ai/llama.cpp on  mps-q4_0-kernel   took 10.4s
➜ ./main -m ../models/TheBloke_WizardLM-13B-V1.1-GGML/wizardlm-13b-v1.1.ggmlv3.q4_0.bin -n 128 -c 512 -s 12 -ngl 1 --no-mmap -p ""
main: build = 821 (38ec9a2)
main: seed  = 12
llama.cpp: loading model from ../models/TheBloke_WizardLM-13B-V1.1-GGML/wizardlm-13b-v1.1.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32001
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 6983.71 MB
llama_model_load_internal: mem required  = 9031.71 MB (+ 1608.00 MB per state)
llama_new_context_with_model: kv self size  =  400.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Volumes/Developer/ai/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add                            0x151706b40
ggml_metal_init: loaded kernel_mul                            0x151707260
ggml_metal_init: loaded kernel_mul_row                        0x151707890
ggml_metal_init: loaded kernel_scale                          0x151707db0
ggml_metal_init: loaded kernel_silu                           0x1517082d0
ggml_metal_init: loaded kernel_relu                           0x1517087f0
ggml_metal_init: loaded kernel_gelu                           0x151708d10
ggml_metal_init: loaded kernel_soft_max                       0x1517093c0
ggml_metal_init: loaded kernel_diag_mask_inf                  0x151709a20
ggml_metal_init: loaded kernel_get_rows_f16                   0x15170a0a0
ggml_metal_init: loaded kernel_get_rows_q4_0                  0x15170a720
ggml_metal_init: loaded kernel_get_rows_q4_1                  0x15170af10
ggml_metal_init: loaded kernel_get_rows_q2_K                  0x15170b590
ggml_metal_init: loaded kernel_get_rows_q3_K                  0x15170bc10
ggml_metal_init: loaded kernel_get_rows_q4_K                  0x15170c290
ggml_metal_init: loaded kernel_get_rows_q5_K                  0x15170c910
ggml_metal_init: loaded kernel_get_rows_q6_K                  0x15170cf90
ggml_metal_init: loaded kernel_rms_norm                       0x15170d640
ggml_metal_init: loaded kernel_norm                           0x15170dcf0
ggml_metal_init: loaded kernel_mul_mat_f16_f32                0x15170e6c0
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32               0x15170eda0
ggml_metal_init: loaded kernel_mul_mat_vec_q4_0_f32           0x15170f440
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32               0x15170fb20
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32               0x151710380
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32               0x151710a80
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32               0x151711160
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32               0x151711840
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32               0x151712120
ggml_metal_init: loaded kernel_rope                           0x151712a10
ggml_metal_init: loaded kernel_alibi_f32                      0x1517132d0
ggml_metal_init: loaded kernel_cpy_f32_f16                    0x151713b60
ggml_metal_init: loaded kernel_cpy_f32_f32                    0x151714650
ggml_metal_init: loaded kernel_cpy_f16_f16                    0x151714dc0
ggml_metal_init: recommendedMaxWorkingSetSize = 21845.34 MB
ggml_metal_init: hasUnifiedMemory             = true
ggml_metal_init: maxTransferRate              = built-in GPU
llama_new_context_with_model: max tensor size =    87.89 MB
ggml_metal_add_buffer: allocated 'data            ' buffer, size =  6983.72 MB, ( 6984.17 / 21845.34)
ggml_metal_add_buffer: allocated 'eval            ' buffer, size =  1024.00 MB, ( 8008.17 / 21845.34)
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =   402.00 MB, ( 8410.17 / 21845.34)
ggml_metal_add_buffer: allocated 'scr0            ' buffer, size =   512.00 MB, ( 8922.17 / 21845.34)
ggml_metal_add_buffer: allocated 'scr1            ' buffer, size =   512.00 MB, ( 9434.17 / 21845.34)

system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0



// RUN: %target-swift-ide-test -code-completion -source-filename %s -code-completion-token=GLOBAL_ENUM_1 > %t.globaleenum1.txt
// RUN: FileCheck --input-file=%t.globaleenum1.txt %s

// CHECK: CODE_ completionsForGlobalEnum1
// CHECK-NEXT: SYMBOL KIND
// CHECK-NEXT: BASE_ENUM=TypeName
// CHECK-NEXT: ◂────────────────────────
llama_print_timings:        load time =  3865.14 ms
llama_print_timings:      sample time =   182.48 ms /   128 runs   (    1.43 ms per token,   701.45 tokens per second)
llama_print_timings: prompt eval time =   352.93 ms /     2 tokens (  176.47 ms per token,     5.67 tokens per second)
llama_print_timings:        eval time =  4750.40 ms /   127 runs   (   37.40 ms per token,    26.73 tokens per second)
llama_print_timings:       total time =  5304.79 ms
ggml_metal_free: deallocating

17:54:24 in ~/Developer/ai/llama.cpp on  mps-q4_0-kernel   took 9.4s
➜ git co -
Switched to branch 'master'
Your branch is up to date with 'origin/master'.

17:54:31 in ~/Developer/ai/llama.cpp on  master 
➜ LLAMA_METAL=1 make
I llama.cpp build info:
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL
I LDFLAGS:   -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

cc -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG -c ggml-metal.m -o ggml-metal.o
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/main/main.cpp ggml.o llama.o common.o k_quants.o ggml-metal.o -o main  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders

====  Run ./main -h for help.  ====

c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/quantize/quantize.cpp ggml.o llama.o k_quants.o ggml-metal.o -o quantize  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/quantize-stats/quantize-stats.cpp ggml.o llama.o k_quants.o ggml-metal.o -o quantize-stats  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/perplexity/perplexity.cpp ggml.o llama.o common.o k_quants.o ggml-metal.o -o perplexity  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/embedding/embedding.cpp ggml.o llama.o common.o k_quants.o ggml-metal.o -o embedding  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL pocs/vdot/vdot.cpp ggml.o k_quants.o ggml-metal.o -o vdot  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/train-text-from-scratch/train-text-from-scratch.cpp ggml.o llama.o k_quants.o ggml-metal.o -o train-text-from-scratch  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/simple/simple.cpp ggml.o llama.o common.o k_quants.o ggml-metal.o -o simple  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL -Iexamples/server examples/server/server.cpp ggml.o llama.o common.o k_quants.o ggml-metal.o -o server  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ --shared -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/embd-input/embd-input-lib.cpp ggml.o llama.o common.o k_quants.o ggml-metal.o -o libembdinput.so  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/embd-input/embd-input-test.cpp ggml.o llama.o common.o k_quants.o ggml-metal.o -o embd-input-test  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders -L. -lembdinput

17:54:44 in ~/Developer/ai/llama.cpp on  master   took 12.5s
➜ ./main -m ../models/TheBloke_WizardLM-13B-V1.1-GGML/wizardlm-13b-v1.1.ggmlv3.q4_0.bin -n 128 -c 512 -s 12 -ngl 1 --no-mmap -p ""
main: build = 823 (4e7464e)
main: seed  = 12
llama.cpp: loading model from ../models/TheBloke_WizardLM-13B-V1.1-GGML/wizardlm-13b-v1.1.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32001
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 6983.71 MB
llama_model_load_internal: mem required  = 9031.71 MB (+ 1608.00 MB per state)
llama_new_context_with_model: kv self size  =  400.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Volumes/Developer/ai/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add                            0x1236044a0
ggml_metal_init: loaded kernel_mul                            0x123604ce0
ggml_metal_init: loaded kernel_mul_row                        0x123605310
ggml_metal_init: loaded kernel_scale                          0x123605830
ggml_metal_init: loaded kernel_silu                           0x123605d50
ggml_metal_init: loaded kernel_relu                           0x123606270
ggml_metal_init: loaded kernel_gelu                           0x123606790
ggml_metal_init: loaded kernel_soft_max                       0x123606e40
ggml_metal_init: loaded kernel_diag_mask_inf                  0x1236074a0
ggml_metal_init: loaded kernel_get_rows_f16                   0x123607b20
ggml_metal_init: loaded kernel_get_rows_q4_0                  0x1236081a0
ggml_metal_init: loaded kernel_get_rows_q4_1                  0x123608990
ggml_metal_init: loaded kernel_get_rows_q2_K                  0x144106970
ggml_metal_init: loaded kernel_get_rows_q3_K                  0x144107110
ggml_metal_init: loaded kernel_get_rows_q4_K                  0x144107790
ggml_metal_init: loaded kernel_get_rows_q5_K                  0x144107e10
ggml_metal_init: loaded kernel_get_rows_q6_K                  0x144108490
ggml_metal_init: loaded kernel_rms_norm                       0x144108b40
ggml_metal_init: loaded kernel_norm                           0x1441091f0
ggml_metal_init: loaded kernel_mul_mat_f16_f32                0x144109ce0
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32               0x14410a3c0
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32               0x14410aaa0
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32               0x14410b180
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32               0x102404af0
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32               0x1024052f0
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32               0x1024059d0
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32               0x1024060b0
ggml_metal_init: loaded kernel_rope                           0x102406ba0
ggml_metal_init: loaded kernel_alibi_f32                      0x102407460
ggml_metal_init: loaded kernel_cpy_f32_f16                    0x102407cf0
ggml_metal_init: loaded kernel_cpy_f32_f32                    0x102408580
ggml_metal_init: loaded kernel_cpy_f16_f16                    0x102408e10
ggml_metal_init: recommendedMaxWorkingSetSize = 21845.34 MB
ggml_metal_init: hasUnifiedMemory             = true
ggml_metal_init: maxTransferRate              = built-in GPU
llama_new_context_with_model: max tensor size =    87.89 MB
ggml_metal_add_buffer: allocated 'data            ' buffer, size =  6983.72 MB, ( 6984.11 / 21845.34)
ggml_metal_add_buffer: allocated 'eval            ' buffer, size =  1024.00 MB, ( 8008.11 / 21845.34)
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =   402.00 MB, ( 8410.11 / 21845.34)
ggml_metal_add_buffer: allocated 'scr0            ' buffer, size =   512.00 MB, ( 8922.11 / 21845.34)
ggml_metal_add_buffer: allocated 'scr1            ' buffer, size =   512.00 MB, ( 9434.11 / 21845.34)

system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0



// RUN: %target-swift-ide-test -code-completion -source-filename %s -code-completion-token=GLOBAL_ENUM_1 > %t.globaleenum1.txt
// RUN: FileCheck --input-file=%t.globaleenum1.txt %s

// CHECK: CODE_ completionsForGlobalEnum1
// CHECK-NEXT: SYMBOL KIND
// CHECK-NEXT: BASE_ENUM=TypeName
// CHECK-NEXT: ◂────────────────────────
llama_print_timings:        load time =  3641.14 ms
llama_print_timings:      sample time =   186.38 ms /   128 runs   (    1.46 ms per token,   686.76 tokens per second)
llama_print_timings: prompt eval time =   259.22 ms /     2 tokens (  129.61 ms per token,     7.72 tokens per second)
llama_print_timings:        eval time =  5965.22 ms /   127 runs   (   46.97 ms per token,    21.29 tokens per second)
llama_print_timings:       total time =  6429.91 ms
ggml_metal_free: deallocating

17:55:00 in ~/Developer/ai/llama.cpp on  master   took 10.6s
➜ ./main -m ../models/TheBloke_WizardLM-13B-V1.1-GGML/wizardlm-13b-v1.1.ggmlv3.q4_0.bin -n 128 -c 512 -s 12 -ngl 1 --no-mmap -p ""
main: build = 823 (4e7464e)
main: seed  = 12
llama.cpp: loading model from ../models/TheBloke_WizardLM-13B-V1.1-GGML/wizardlm-13b-v1.1.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32001
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 6983.71 MB
llama_model_load_internal: mem required  = 9031.71 MB (+ 1608.00 MB per state)
llama_new_context_with_model: kv self size  =  400.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Volumes/Developer/ai/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add                            0x1026062d0
ggml_metal_init: loaded kernel_mul                            0x1026069f0
ggml_metal_init: loaded kernel_mul_row                        0x102607020
ggml_metal_init: loaded kernel_scale                          0x102607540
ggml_metal_init: loaded kernel_silu                           0x102607a60
ggml_metal_init: loaded kernel_relu                           0x102607f80
ggml_metal_init: loaded kernel_gelu                           0x1026084a0
ggml_metal_init: loaded kernel_soft_max                       0x102608b50
ggml_metal_init: loaded kernel_diag_mask_inf                  0x1026091b0
ggml_metal_init: loaded kernel_get_rows_f16                   0x102609830
ggml_metal_init: loaded kernel_get_rows_q4_0                  0x102609eb0
ggml_metal_init: loaded kernel_get_rows_q4_1                  0x10260a6a0
ggml_metal_init: loaded kernel_get_rows_q2_K                  0x10260ad20
ggml_metal_init: loaded kernel_get_rows_q3_K                  0x1027046d0
ggml_metal_init: loaded kernel_get_rows_q4_K                  0x1024045a0
ggml_metal_init: loaded kernel_get_rows_q5_K                  0x102404d40
ggml_metal_init: loaded kernel_get_rows_q6_K                  0x1024053c0
ggml_metal_init: loaded kernel_rms_norm                       0x102405a70
ggml_metal_init: loaded kernel_norm                           0x102406120
ggml_metal_init: loaded kernel_mul_mat_f16_f32                0x102406c10
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32               0x1024072f0
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32               0x1027050b0
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32               0x102705790
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32               0x102706010
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32               0x1027066f0
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32               0x10260b2e0
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32               0x10260b9c0
ggml_metal_init: loaded kernel_rope                           0x10260c4b0
ggml_metal_init: loaded kernel_alibi_f32                      0x10260cd70
ggml_metal_init: loaded kernel_cpy_f32_f16                    0x10260d600
ggml_metal_init: loaded kernel_cpy_f32_f32                    0x10260de90
ggml_metal_init: loaded kernel_cpy_f16_f16                    0x10260e720
ggml_metal_init: recommendedMaxWorkingSetSize = 21845.34 MB
ggml_metal_init: hasUnifiedMemory             = true
ggml_metal_init: maxTransferRate              = built-in GPU
llama_new_context_with_model: max tensor size =    87.89 MB
ggml_metal_add_buffer: allocated 'data            ' buffer, size =  6983.72 MB, ( 6984.11 / 21845.34)
ggml_metal_add_buffer: allocated 'eval            ' buffer, size =  1024.00 MB, ( 8008.11 / 21845.34)
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =   402.00 MB, ( 8410.11 / 21845.34)
ggml_metal_add_buffer: allocated 'scr0            ' buffer, size =   512.00 MB, ( 8922.11 / 21845.34)
ggml_metal_add_buffer: allocated 'scr1            ' buffer, size =   512.00 MB, ( 9434.11 / 21845.34)

system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0



// RUN: %target-swift-ide-test -code-completion -source-filename %s -code-completion-token=GLOBAL_ENUM_1 > %t.globaleenum1.txt
// RUN: FileCheck --input-file=%t.globaleenum1.txt %s

// CHECK: CODE_ completionsForGlobalEnum1
// CHECK-NEXT: SYMBOL KIND
// CHECK-NEXT: BASE_ENUM=TypeName
// CHECK-NEXT: ◂────────────────────────
llama_print_timings:        load time =  3934.87 ms
llama_print_timings:      sample time =   158.58 ms /   128 runs   (    1.24 ms per token,   807.18 tokens per second)
llama_print_timings: prompt eval time =   256.35 ms /     2 tokens (  128.17 ms per token,     7.80 tokens per second)
llama_print_timings:        eval time =  6372.17 ms /   127 runs   (   50.17 ms per token,    19.93 tokens per second)
llama_print_timings:       total time =  6804.90 ms
ggml_metal_free: deallocating

ggerganov (Owner) commented

M1 Pro 32GB

model           master        this PR
7B, prompt 2    44.0 ms/tok   36.0 ms/tok
13B, prompt 2   87.0 ms/tok   64.0 ms/tok

ikawrakow (Contributor) commented

Another data point: M2 Max with 30-core GPU, 65B model, -p "I believe the meaning of life is":

  • Master: 186 ms/t
  • This PR: 152 ms/t
  • This PR with suggested changes: 146 ms/t

ggerganov (Owner) left a review comment

Impressive! 🦙

ggml-metal.m Outdated
@@ -660,7 +662,11 @@ void ggml_metal_graph_compute(

nth0 = 8;
nth1 = 8;
[encoder setComputePipelineState:ctx->pipeline_mul_mat_q4_0_f32];
if (ne01 % 8 == 0) {
ggerganov (Owner):

Is it somehow possible to remove this restriction without affecting the performance?
If so, we can simply delete the old pipeline_mul_mat_q4_0_f32 kernel.

lshzh-ww (Collaborator, Author):

This is for models like WizardLM. In these models the last mat-vec multiplication in inference has a row count of 32001, while our new kernel consumes 8 rows at a time. Let me test later whether adding the ability to handle fewer than 8 remaining rows in the kernel makes it run slower or faster.
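To make that arithmetic concrete, here is a small standalone C example (illustrative only, not part of the patch) showing why 32001 rows clash with a kernel that consumes rows 8 at a time, and what a ceil-divide dispatch would have to guard against:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    const int64_t ne01        = 32001;  /* output rows: n_vocab in WizardLM-style models   */
    const int64_t rows_per_tg = 8;      /* rows consumed per threadgroup by the new kernel */

    /* a floor-divide dispatch leaves a tail of rows uncomputed */
    printf("floor: %lld threadgroups cover %lld rows, %lld row(s) left over\n",
           (long long)(ne01 / rows_per_tg),
           (long long)((ne01 / rows_per_tg) * rows_per_tg),
           (long long)(ne01 % rows_per_tg));

    /* a ceil-divide dispatch covers everything, but the last threadgroup must
     * bound-check its row indices against ne01 */
    const int64_t ntg = (ne01 + rows_per_tg - 1) / rows_per_tg;
    printf("ceil:  %lld threadgroups, last one has only %lld valid row(s)\n",
           (long long) ntg,
           (long long)(ne01 - (ntg - 1) * rows_per_tg));
    return 0;
}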

lshzh-ww (Collaborator, Author):

Removed the old kernel. There might be a minor performance gain or loss depending on model size and n_vocab. Tested on llama-7B and wizardlm-30B; generation results are the same.

ikawrakow (Contributor) commented

And here is a full list of all Meta LLaMA models on the 30-core M2 Max with -p "I believe the meaning of life is":

Model   Master     This PR    This PR + suggested changes
7B      21.5 ms/t  21.0 ms/t  19.6 ms/t
13B     38.2 ms/t  35.5 ms/t  33.8 ms/t
33B     92 ms/t    78 ms/t    76 ms/t
65B     186 ms/t   152 ms/t   146 ms/t

simicvm commented on Jul 12, 2023

MacBook Pro, M2 Pro, 32GB, Ventura 13.4.1

./main -m models/open_llama_7b/ggml-model-q4_0.bin -n 128 -c 512 -s 12 -ngl 1 --no-mmap

llama_print_timings:        load time =  1334.44 ms
llama_print_timings:      sample time =   111.35 ms /   128 runs   (    0.87 ms per token,  1149.53 tokens per second)
llama_print_timings: prompt eval time =    83.72 ms /     2 tokens (   41.86 ms per token,    23.89 tokens per second)
llama_print_timings:        eval time =  4042.98 ms /   127 runs   (   31.83 ms per token,    31.41 tokens per second)
llama_print_timings:       total time =  4250.32 ms

./main -m models/open_llama_7b/ggml-model-q4_0.bin -n 128 -c 512 -s 12 -ngl 1 --no-mmap -p "I believe the meaning of life is"

llama_print_timings:        load time =  1013.66 ms
llama_print_timings:      sample time =    77.40 ms /    89 runs   (    0.87 ms per token,  1149.92 tokens per second)
llama_print_timings: prompt eval time =   225.65 ms /     8 tokens (   28.21 ms per token,    35.45 tokens per second)
llama_print_timings:        eval time =  2817.16 ms /    88 runs   (   32.01 ms per token,    31.24 tokens per second)
llama_print_timings:       total time =  3128.57 ms

r3muxd commented on Jul 12, 2023

./main -t 10 -ins -i -c 2048 --color --temp 0.7 --repeat_penalty 1.1 -n -1 -ngl 99  -m models/wizardlm-13b-v1.1.ggmlv3.q4_1.bin
> Write a creative story.
Once gaboo nose Mand Mandceu Metal Mand Wildoria?合?ership Liveria Indian PrшаLED Cant◄ mand Dumaudi Wild Manduga Janu sensitive flexibleERR namlackvet (/ Ausvat sensitive Jagtarheckii? Mareooypaya initiheck Wildintegr Mand Hamm WildooERRoriarrnam Mandool NapoorayfaceooypERR Indian Mand gab Hell Eisenheck Lambnam Hamm Montaudientedools Julia Vert exceptionsiation Indianrayed sensitive Wild Cant面 Goldenarc Bat Jak? flexibleiation sensitive合 (/cp Hamm AusERR Wild Hamm Wildeaoria wild DEFAULT Wildnam acvetheckented Mandmioo Wild CantheckERR mand Manduga合 nam thrownershipia Montlackaria hyp perce Wild Wild?ERR sensitiveoo Mand Mand wildvat semiERRnamentedERR exceptions Indian™ thrown Mand sensitiveERRooaudi Wildoo nevERRoo Mandko*{ Hammugaheck Indian MandoriaERR Indianented sensitiveented thrown flexibleentedampa Mand (/ERRERR Wild JagERR Wild Wild Wild mandariaoria Metal合lack cóheckiation gabERR Cantmiheck Mand WildERR面 Mand semiidthnats ac? pitt keyword WildERR Cantceu WildershipERR Ausvetheck Lamb NapiaERRGGvat Indianuga modoo Cant Aus (/面 Hamm mand throwneaERRheckoria Wild Mand mand sensitive sensitive Wildvat Wild Wildheck MandheckorialackheckrelativeeaERRoo合 sensitive Indian Indianaria? Lamb Wildea Wildaken Wild Dum​ Soul pittERRea Liver keyword thrownoo Metalented ac Indian Wild Cant Mission Indiania Indianidthentedership namisión? ones Wild Hamm gab Wild wild面 Wild? Namuga ashvet Indian MandooERR (/iationlackheck Mand Indian modvat Ausmi Indianoria sensitive Wildnam Mand mandiationoo flexible spirERR namoriaoonamidthintegrationego Mand*{ Eisenooatenea acaria

Minor issue: this does not occur if I remove -ngl 99.

gauravpathak-infa commented

With M2 Mac Studio Ultra, 128GB RAM

Model   Model Name                                Master                      This PR
7B      llama-7b.ggmlv3.q4_0.bin                  14.91 ms/tok (67.05 tok/s)  14.53 ms/tok (68.81 tok/s)
13B     nous-hermes-13b.ggmlv3.q4_0.bin           24.55 ms/tok (40.73 tok/s)  25.19 ms/tok (39.69 tok/s)
30B     WizardLM-30B-Uncensored.ggmlv3.q4_0.bin   54.64 ms/tok (18.30 tok/s)  51.39 ms/tok (19.46 tok/s)
65B     airoboros-65B-gpt4-1.2.ggmlv3.q4_0.bin    122.13 ms/tok (8.19 tok/s)  109.15 ms/tok (9.16 tok/s)

ggml-metal.m Outdated (review thread resolved)
Prefetch data to improve GPU utilization. ~48% faster for 33B model.
ggerganov merged commit 1cbf561 into ggerganov:master on Jul 12, 2023
4 checks passed
loretoparisi commented

@ggerganov how do I properly run benchmarks? I would love to contribute!

kernel void kernel_mul_mat_q4_0_f32(
device const void * src0,
device const float * src1,
device float * dst,
constant int64_t & ne00,
constant int64_t & ne10,
constant int64_t & ne0,
threadgroup float * sum [[threadgroup(0)]],
constant int64_t & ne01[[buffer(4)]],
ggerganov (Owner):

What is the function of [[buffer(4)]] here?

lshzh-ww (Collaborator, Author):

This gets the value from the buffer at index 4, corresponding to the following line in ggml-metal.m:
[encoder setBytes:&ne01 length:sizeof(ne01) atIndex:4];
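For anyone unfamiliar with the mechanism: [[buffer(N)]] pins a kernel argument to the compute encoder's buffer/bytes slot N, while arguments without the attribute get buffer indices assigned in declaration order. A minimal hypothetical kernel (made-up name, not from ggml-metal.metal) that would receive a value set with setBytes:...atIndex:4 on the host:

#include <metal_stdlib>
using namespace metal;

/* Hypothetical example: 'n' is bound to buffer slot 4, matching a host-side
 *   [encoder setBytes:&n length:sizeof(n) atIndex:4];
 * while 'x' comes from the buffer set at index 0. */
kernel void scale_by_n(
        device   float   * x [[buffer(0)]],
        constant int64_t & n [[buffer(4)]],
        uint tid [[thread_position_in_grid]]) {
    x[tid] *= (float) n;
}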

lshzh-ww deleted the mps-q4_0-kernel branch on July 13, 2023 16:08
amj commented on Jul 18, 2023

> With M2 Mac Studio Ultra, 128GB RAM
> (benchmark table quoted from gauravpathak-infa's comment above)

@gauravpathak-infa I also have an M2 Ultra; I am getting less than half of your numbers on these models. Could you share your command line, please?

gauravpathak-infa commented

@amj Here are the commands:
./main -m models/7B/llama-7b.ggmlv3.q4_0.bin -p "I believe the meaning of life is" --ignore-eos -n 256 -ngl 1
./main -m models/13B/nous-hermes-13b.ggmlv3.q4_0.bin -p "I believe the meaning of life is" --ignore-eos -n 256 -ngl 1
./main -m models/30B/WizardLM-30B-Uncensored.ggmlv3.q4_0.bin -p "I believe the meaning of life is" --ignore-eos -n 256 -ngl 1
./main -m models/65B/airoboros-65B-gpt4-1.2.ggmlv3.q4_0.bin -p "I believe the meaning of life is" --ignore-eos -n 256 -ngl 1

If you are getting half the numbers, the problem may be somewhere else. It may be useful to check whether you have compiled llama.cpp with Metal. If not:
make clean
LLAMA_METAL=1 make

amj commented on Jul 19, 2023

Yep, that did it; LLAMA_METAL had been dropped during some clean/build step. I'm now getting the same numbers you were showing. Thanks!

Labels: high priority (Very important issue), performance (Speed related topics)

10 participants