metal: new q4_0 mat-vec mul kernel #2188

Merged 1 commit from mps-q4_0-kernel into ggerganov:master on Jul 12, 2023

Conversation

lshzh-ww (Collaborator) commented on Jul 12, 2023

Prefetch data to achieve better memory bandwidth utilization. With the new kernel, token generation is ~48% faster for the 33B model and ~14% faster for the 7B model. Tests on the 65B model are welcome.

33B model      master       this PR
GPU Read BW*   167 GB/s     275 GB/s
prompt 1       117 ms/tok   79 ms/tok
prompt 2       121 ms/tok   82 ms/tok

7B model       master       this PR
prompt 1       25.7 ms/tok  22.5 ms/tok

The GPUs of the M1 Max / M2 Max can sustain 340 GB/s; can we reach that in llama.cpp? 😳

* GPU read bandwidth measured using the Developer Tools that come with Xcode.
[Screenshot: Xcode GPU Read Bandwidth measurement, 2023-07-12 01:24:05]

** Measured with the following command on an M1 Max 32GB. Generation results are the same between this PR and the master branch.
./main -m model_file -n 128 -c 512 -s 12 -ngl 1 --no-mmap
Prompt 1 “”
Prompt 2 “I believe the meaning of life is”
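For context on what the kernel is doing: a q4_0 weight row is a sequence of 32-element blocks, each holding 4-bit quants plus a per-block scale, and single-token generation is essentially one streaming pass of mat-vec products over those blocks, which is why the numbers above track memory bandwidth. The following plain-C sketch is illustrative only (not the actual ggml or Metal code); it shows the per-row work and the load-the-next-block-while-computing idea behind the prefetching in this PR:

/* Illustrative sketch, not the actual ggml or Metal code. */
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

#define QK4_0 32                        /* weights per quantization block */

typedef struct {
    float   d;                          /* per-block scale (fp16 in real ggml) */
    uint8_t qs[QK4_0 / 2];              /* 32 x 4-bit weights, two per byte    */
} block_q4_0;

/* dot product of one quantized block with 32 floats of the input vector */
static float dot_block_q4_0(const block_q4_0 *b, const float *y) {
    float acc = 0.0f;
    for (int j = 0; j < QK4_0 / 2; ++j) {
        const int q_lo = (b->qs[j] & 0x0F) - 8;   /* elements 0..15  */
        const int q_hi = (b->qs[j] >> 4)   - 8;   /* elements 16..31 */
        acc += q_lo * y[j] + q_hi * y[j + QK4_0 / 2];
    }
    return acc * b->d;
}

/* one output row of the mat-vec: nb blocks, keeping the next block "in flight"
 * while the current one is consumed (the software-pipelining / prefetch idea) */
static float row_dot_q4_0(const block_q4_0 *row, const float *y, size_t nb) {
    float acc = 0.0f;
    block_q4_0 cur = row[0];
    for (size_t b = 0; b < nb; ++b) {
        const block_q4_0 next = (b + 1 < nb) ? row[b + 1] : cur;  /* prefetch */
        acc += dot_block_q4_0(&cur, y + b * QK4_0);               /* compute  */
        cur = next;
    }
    return acc;
}

int main(void) {
    block_q4_0 blk = { .d = 0.5f };     /* every nibble = 9 -> value 0.5*(9-8) */
    for (int j = 0; j < QK4_0 / 2; ++j) blk.qs[j] = 0x99;
    float y[QK4_0];
    for (int j = 0; j < QK4_0; ++j) y[j] = 1.0f;
    printf("row dot = %f (expect 16)\n", (double) row_dot_q4_0(&blk, y, 1));
    return 0;
}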

lshzh-ww (Collaborator, Author) commented

Please feel free to share results on M1/M2 Pro/Max/Ultra!

ikawrakow (Contributor) commented on Jul 12, 2023

> Please feel free to share results on M1/M2 Pro/Max/Ultra!

Well, I had not checked out the PR branch. With the PR branch I get on M2 Max with 30-core GPU, 64 GB RAM:

  • 7B is 21.5 ms/t on master, 21.0 ms/t with this PR
  • 33B is 91.3 ms/t for master and 77.4 ms/t with this PR.

The above is for an empty prompt. If I use -p "I believe the meaning of life is" instead, I get for 33B ~92 ms/t on master and ~84 ms/t with this PR.

CyborgArmy83 commented

Looks promising! From what I've read before, there is still a lot of room for Metal code improvements, so this could be very welcome. I will try to test it out on my M1 Max machine as well.

lshzh-ww (Collaborator, Author) commented on Jul 12, 2023

> Well, I had not checked out the PR branch. With the PR branch I get on M2 Max with 30-core GPU, 64 GB RAM:
>
>   • 7B is 21.5 ms/t on master, 21.0 ms/t with this PR
>   • 33B is 91.3 ms/t for master and 77.4 ms/t with this PR.
>
> The above is for an empty prompt. If I use -p "I believe the meaning of life is" instead, I get for 33B ~92 ms/t on master and ~84 ms/t with this PR.

Nice! I updated the measurements with the new prompt. It looks like this PR brings the M1 series to the same speed as the M2 series. I guess the M2 series has better memory access prediction, so its speed is already good without this PR. Nonetheless, for the 65B model I expect we will see a larger performance improvement for the M2 series as well.

lshzh-ww (Collaborator, Author) commented

Would be nice to see results on M1 Ultra or M2 Ultra chips. Their two-die design may also suffer from cache misses.

ggerganov added the performance (Speed related topics) and high priority (Very important issue) labels on Jul 12, 2023
adrienbrault commented

Here's my benchmark: MacBook Pro 14", Apple M2 Max, 32 GB, 12-core CPU, 30-core GPU

13B model            master         this PR
prompt 1 (run 1/2)   46.97 ms/tok   36.70 ms/tok
prompt 1 (run 2/2)   50.17 ms/tok   37.40 ms/tok
Logs
17:53:48 in ~/Developer/ai/llama.cpp on  mps-q4_0-kernel 
✘130 ➜ ./main -m ../models/TheBloke_WizardLM-13B-V1.1-GGML/wizardlm-13b-v1.1.ggmlv3.q4_0.bin -n 128 -c 512 -s 12 -ngl 1 --no-mmap -p ""
main: build = 821 (38ec9a2)
main: seed  = 12
llama.cpp: loading model from ../models/TheBloke_WizardLM-13B-V1.1-GGML/wizardlm-13b-v1.1.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32001
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 6983.71 MB
llama_model_load_internal: mem required  = 9031.71 MB (+ 1608.00 MB per state)
llama_new_context_with_model: kv self size  =  400.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Volumes/Developer/ai/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add                            0x102a04c70
ggml_metal_init: loaded kernel_mul                            0x102a054b0
ggml_metal_init: loaded kernel_mul_row                        0x102a05ae0
ggml_metal_init: loaded kernel_scale                          0x102a06000
ggml_metal_init: loaded kernel_silu                           0x102a06520
ggml_metal_init: loaded kernel_relu                           0x102a06a40
ggml_metal_init: loaded kernel_gelu                           0x102a06f60
ggml_metal_init: loaded kernel_soft_max                       0x102b06b60
ggml_metal_init: loaded kernel_diag_mask_inf                  0x102b07120
ggml_metal_init: loaded kernel_get_rows_f16                   0x102b077a0
ggml_metal_init: loaded kernel_get_rows_q4_0                  0x102b07e20
ggml_metal_init: loaded kernel_get_rows_q4_1                  0x102b08610
ggml_metal_init: loaded kernel_get_rows_q2_K                  0x13de074f0
ggml_metal_init: loaded kernel_get_rows_q3_K                  0x13de07c90
ggml_metal_init: loaded kernel_get_rows_q4_K                  0x13de08310
ggml_metal_init: loaded kernel_get_rows_q5_K                  0x13de08990
ggml_metal_init: loaded kernel_get_rows_q6_K                  0x13de09010
ggml_metal_init: loaded kernel_rms_norm                       0x13de096c0
ggml_metal_init: loaded kernel_norm                           0x13de09d70
ggml_metal_init: loaded kernel_mul_mat_f16_f32                0x13de0a860
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32               0x13de0af40
ggml_metal_init: loaded kernel_mul_mat_vec_q4_0_f32           0x13de0b780
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32               0x13de0bd40
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32               0x13de0c5a0
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32               0x102a07540
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32               0x102a07d40
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32               0x102a08420
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32               0x102a08d00
ggml_metal_init: loaded kernel_rope                           0x102a095f0
ggml_metal_init: loaded kernel_alibi_f32                      0x102a09eb0
ggml_metal_init: loaded kernel_cpy_f32_f16                    0x102a0a740
ggml_metal_init: loaded kernel_cpy_f32_f32                    0x102a0b230
ggml_metal_init: loaded kernel_cpy_f16_f16                    0x102a0bac0
ggml_metal_init: recommendedMaxWorkingSetSize = 21845.34 MB
ggml_metal_init: hasUnifiedMemory             = true
ggml_metal_init: maxTransferRate              = built-in GPU
llama_new_context_with_model: max tensor size =    87.89 MB
ggml_metal_add_buffer: allocated 'data            ' buffer, size =  6983.72 MB, ( 6984.17 / 21845.34)
ggml_metal_add_buffer: allocated 'eval            ' buffer, size =  1024.00 MB, ( 8008.17 / 21845.34)
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =   402.00 MB, ( 8410.17 / 21845.34)
ggml_metal_add_buffer: allocated 'scr0            ' buffer, size =   512.00 MB, ( 8922.17 / 21845.34)
ggml_metal_add_buffer: allocated 'scr1            ' buffer, size =   512.00 MB, ( 9434.17 / 21845.34)

system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0



// RUN: %target-swift-ide-test -code-completion -source-filename %s -code-completion-token=GLOBAL_ENUM_1 > %t.globaleenum1.txt
// RUN: FileCheck --input-file=%t.globaleenum1.txt %s

// CHECK: CODE_ completionsForGlobalEnum1
// CHECK-NEXT: SYMBOL KIND
// CHECK-NEXT: BASE_ENUM=TypeName
// CHECK-NEXT: ◂────────────────────────
llama_print_timings:        load time =  4994.00 ms
llama_print_timings:      sample time =   180.25 ms /   128 runs   (    1.41 ms per token,   710.13 tokens per second)
llama_print_timings: prompt eval time =   375.17 ms /     2 tokens (  187.58 ms per token,     5.33 tokens per second)
llama_print_timings:        eval time =  4660.87 ms /   127 runs   (   36.70 ms per token,    27.25 tokens per second)
llama_print_timings:       total time =  5234.85 ms
ggml_metal_free: deallocating

17:54:12 in ~/Developer/ai/llama.cpp on  mps-q4_0-kernel   took 10.4s
➜ ./main -m ../models/TheBloke_WizardLM-13B-V1.1-GGML/wizardlm-13b-v1.1.ggmlv3.q4_0.bin -n 128 -c 512 -s 12 -ngl 1 --no-mmap -p ""
main: build = 821 (38ec9a2)
main: seed  = 12
llama.cpp: loading model from ../models/TheBloke_WizardLM-13B-V1.1-GGML/wizardlm-13b-v1.1.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32001
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 6983.71 MB
llama_model_load_internal: mem required  = 9031.71 MB (+ 1608.00 MB per state)
llama_new_context_with_model: kv self size  =  400.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Volumes/Developer/ai/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add                            0x151706b40
ggml_metal_init: loaded kernel_mul                            0x151707260
ggml_metal_init: loaded kernel_mul_row                        0x151707890
ggml_metal_init: loaded kernel_scale                          0x151707db0
ggml_metal_init: loaded kernel_silu                           0x1517082d0
ggml_metal_init: loaded kernel_relu                           0x1517087f0
ggml_metal_init: loaded kernel_gelu                           0x151708d10
ggml_metal_init: loaded kernel_soft_max                       0x1517093c0
ggml_metal_init: loaded kernel_diag_mask_inf                  0x151709a20
ggml_metal_init: loaded kernel_get_rows_f16                   0x15170a0a0
ggml_metal_init: loaded kernel_get_rows_q4_0                  0x15170a720
ggml_metal_init: loaded kernel_get_rows_q4_1                  0x15170af10
ggml_metal_init: loaded kernel_get_rows_q2_K                  0x15170b590
ggml_metal_init: loaded kernel_get_rows_q3_K                  0x15170bc10
ggml_metal_init: loaded kernel_get_rows_q4_K                  0x15170c290
ggml_metal_init: loaded kernel_get_rows_q5_K                  0x15170c910
ggml_metal_init: loaded kernel_get_rows_q6_K                  0x15170cf90
ggml_metal_init: loaded kernel_rms_norm                       0x15170d640
ggml_metal_init: loaded kernel_norm                           0x15170dcf0
ggml_metal_init: loaded kernel_mul_mat_f16_f32                0x15170e6c0
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32               0x15170eda0
ggml_metal_init: loaded kernel_mul_mat_vec_q4_0_f32           0x15170f440
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32               0x15170fb20
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32               0x151710380
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32               0x151710a80
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32               0x151711160
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32               0x151711840
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32               0x151712120
ggml_metal_init: loaded kernel_rope                           0x151712a10
ggml_metal_init: loaded kernel_alibi_f32                      0x1517132d0
ggml_metal_init: loaded kernel_cpy_f32_f16                    0x151713b60
ggml_metal_init: loaded kernel_cpy_f32_f32                    0x151714650
ggml_metal_init: loaded kernel_cpy_f16_f16                    0x151714dc0
ggml_metal_init: recommendedMaxWorkingSetSize = 21845.34 MB
ggml_metal_init: hasUnifiedMemory             = true
ggml_metal_init: maxTransferRate              = built-in GPU
llama_new_context_with_model: max tensor size =    87.89 MB
ggml_metal_add_buffer: allocated 'data            ' buffer, size =  6983.72 MB, ( 6984.17 / 21845.34)
ggml_metal_add_buffer: allocated 'eval            ' buffer, size =  1024.00 MB, ( 8008.17 / 21845.34)
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =   402.00 MB, ( 8410.17 / 21845.34)
ggml_metal_add_buffer: allocated 'scr0            ' buffer, size =   512.00 MB, ( 8922.17 / 21845.34)
ggml_metal_add_buffer: allocated 'scr1            ' buffer, size =   512.00 MB, ( 9434.17 / 21845.34)

system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0



// RUN: %target-swift-ide-test -code-completion -source-filename %s -code-completion-token=GLOBAL_ENUM_1 > %t.globaleenum1.txt
// RUN: FileCheck --input-file=%t.globaleenum1.txt %s

// CHECK: CODE_ completionsForGlobalEnum1
// CHECK-NEXT: SYMBOL KIND
// CHECK-NEXT: BASE_ENUM=TypeName
// CHECK-NEXT: ◂────────────────────────
llama_print_timings:        load time =  3865.14 ms
llama_print_timings:      sample time =   182.48 ms /   128 runs   (    1.43 ms per token,   701.45 tokens per second)
llama_print_timings: prompt eval time =   352.93 ms /     2 tokens (  176.47 ms per token,     5.67 tokens per second)
llama_print_timings:        eval time =  4750.40 ms /   127 runs   (   37.40 ms per token,    26.73 tokens per second)
llama_print_timings:       total time =  5304.79 ms
ggml_metal_free: deallocating

17:54:24 in ~/Developer/ai/llama.cpp on  mps-q4_0-kernel   took 9.4s
➜ git co -
Switched to branch 'master'
Your branch is up to date with 'origin/master'.

17:54:31 in ~/Developer/ai/llama.cpp on  master 
➜ LLAMA_METAL=1 make
I llama.cpp build info:
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL
I LDFLAGS:   -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

cc -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG -c ggml-metal.m -o ggml-metal.o
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/main/main.cpp ggml.o llama.o common.o k_quants.o ggml-metal.o -o main  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders

====  Run ./main -h for help.  ====

c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/quantize/quantize.cpp ggml.o llama.o k_quants.o ggml-metal.o -o quantize  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/quantize-stats/quantize-stats.cpp ggml.o llama.o k_quants.o ggml-metal.o -o quantize-stats  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/perplexity/perplexity.cpp ggml.o llama.o common.o k_quants.o ggml-metal.o -o perplexity  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/embedding/embedding.cpp ggml.o llama.o common.o k_quants.o ggml-metal.o -o embedding  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL pocs/vdot/vdot.cpp ggml.o k_quants.o ggml-metal.o -o vdot  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/train-text-from-scratch/train-text-from-scratch.cpp ggml.o llama.o k_quants.o ggml-metal.o -o train-text-from-scratch  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/simple/simple.cpp ggml.o llama.o common.o k_quants.o ggml-metal.o -o simple  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL -Iexamples/server examples/server/server.cpp ggml.o llama.o common.o k_quants.o ggml-metal.o -o server  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ --shared -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/embd-input/embd-input-lib.cpp ggml.o llama.o common.o k_quants.o ggml-metal.o -o libembdinput.so  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/embd-input/embd-input-test.cpp ggml.o llama.o common.o k_quants.o ggml-metal.o -o embd-input-test  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders -L. -lembdinput

17:54:44 in ~/Developer/ai/llama.cpp on  master   took 12.5s
➜ ./main -m ../models/TheBloke_WizardLM-13B-V1.1-GGML/wizardlm-13b-v1.1.ggmlv3.q4_0.bin -n 128 -c 512 -s 12 -ngl 1 --no-mmap -p ""
main: build = 823 (4e7464e)
main: seed  = 12
llama.cpp: loading model from ../models/TheBloke_WizardLM-13B-V1.1-GGML/wizardlm-13b-v1.1.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32001
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 6983.71 MB
llama_model_load_internal: mem required  = 9031.71 MB (+ 1608.00 MB per state)
llama_new_context_with_model: kv self size  =  400.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Volumes/Developer/ai/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add                            0x1236044a0
ggml_metal_init: loaded kernel_mul                            0x123604ce0
ggml_metal_init: loaded kernel_mul_row                        0x123605310
ggml_metal_init: loaded kernel_scale                          0x123605830
ggml_metal_init: loaded kernel_silu                           0x123605d50
ggml_metal_init: loaded kernel_relu                           0x123606270
ggml_metal_init: loaded kernel_gelu                           0x123606790
ggml_metal_init: loaded kernel_soft_max                       0x123606e40
ggml_metal_init: loaded kernel_diag_mask_inf                  0x1236074a0
ggml_metal_init: loaded kernel_get_rows_f16                   0x123607b20
ggml_metal_init: loaded kernel_get_rows_q4_0                  0x1236081a0
ggml_metal_init: loaded kernel_get_rows_q4_1                  0x123608990
ggml_metal_init: loaded kernel_get_rows_q2_K                  0x144106970
ggml_metal_init: loaded kernel_get_rows_q3_K                  0x144107110
ggml_metal_init: loaded kernel_get_rows_q4_K                  0x144107790
ggml_metal_init: loaded kernel_get_rows_q5_K                  0x144107e10
ggml_metal_init: loaded kernel_get_rows_q6_K                  0x144108490
ggml_metal_init: loaded kernel_rms_norm                       0x144108b40
ggml_metal_init: loaded kernel_norm                           0x1441091f0
ggml_metal_init: loaded kernel_mul_mat_f16_f32                0x144109ce0
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32               0x14410a3c0
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32               0x14410aaa0
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32               0x14410b180
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32               0x102404af0
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32               0x1024052f0
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32               0x1024059d0
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32               0x1024060b0
ggml_metal_init: loaded kernel_rope                           0x102406ba0
ggml_metal_init: loaded kernel_alibi_f32                      0x102407460
ggml_metal_init: loaded kernel_cpy_f32_f16                    0x102407cf0
ggml_metal_init: loaded kernel_cpy_f32_f32                    0x102408580
ggml_metal_init: loaded kernel_cpy_f16_f16                    0x102408e10
ggml_metal_init: recommendedMaxWorkingSetSize = 21845.34 MB
ggml_metal_init: hasUnifiedMemory             = true
ggml_metal_init: maxTransferRate              = built-in GPU
llama_new_context_with_model: max tensor size =    87.89 MB
ggml_metal_add_buffer: allocated 'data            ' buffer, size =  6983.72 MB, ( 6984.11 / 21845.34)
ggml_metal_add_buffer: allocated 'eval            ' buffer, size =  1024.00 MB, ( 8008.11 / 21845.34)
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =   402.00 MB, ( 8410.11 / 21845.34)
ggml_metal_add_buffer: allocated 'scr0            ' buffer, size =   512.00 MB, ( 8922.11 / 21845.34)
ggml_metal_add_buffer: allocated 'scr1            ' buffer, size =   512.00 MB, ( 9434.11 / 21845.34)

system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0



// RUN: %target-swift-ide-test -code-completion -source-filename %s -code-completion-token=GLOBAL_ENUM_1 > %t.globaleenum1.txt
// RUN: FileCheck --input-file=%t.globaleenum1.txt %s

// CHECK: CODE_ completionsForGlobalEnum1
// CHECK-NEXT: SYMBOL KIND
// CHECK-NEXT: BASE_ENUM=TypeName
// CHECK-NEXT: ◂────────────────────────
llama_print_timings:        load time =  3641.14 ms
llama_print_timings:      sample time =   186.38 ms /   128 runs   (    1.46 ms per token,   686.76 tokens per second)
llama_print_timings: prompt eval time =   259.22 ms /     2 tokens (  129.61 ms per token,     7.72 tokens per second)
llama_print_timings:        eval time =  5965.22 ms /   127 runs   (   46.97 ms per token,    21.29 tokens per second)
llama_print_timings:       total time =  6429.91 ms
ggml_metal_free: deallocating

17:55:00 in ~/Developer/ai/llama.cpp on  master   took 10.6s
➜ ./main -m ../models/TheBloke_WizardLM-13B-V1.1-GGML/wizardlm-13b-v1.1.ggmlv3.q4_0.bin -n 128 -c 512 -s 12 -ngl 1 --no-mmap -p ""
main: build = 823 (4e7464e)
main: seed  = 12
llama.cpp: loading model from ../models/TheBloke_WizardLM-13B-V1.1-GGML/wizardlm-13b-v1.1.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32001
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 6983.71 MB
llama_model_load_internal: mem required  = 9031.71 MB (+ 1608.00 MB per state)
llama_new_context_with_model: kv self size  =  400.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Volumes/Developer/ai/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add                            0x1026062d0
ggml_metal_init: loaded kernel_mul                            0x1026069f0
ggml_metal_init: loaded kernel_mul_row                        0x102607020
ggml_metal_init: loaded kernel_scale                          0x102607540
ggml_metal_init: loaded kernel_silu                           0x102607a60
ggml_metal_init: loaded kernel_relu                           0x102607f80
ggml_metal_init: loaded kernel_gelu                           0x1026084a0
ggml_metal_init: loaded kernel_soft_max                       0x102608b50
ggml_metal_init: loaded kernel_diag_mask_inf                  0x1026091b0
ggml_metal_init: loaded kernel_get_rows_f16                   0x102609830
ggml_metal_init: loaded kernel_get_rows_q4_0                  0x102609eb0
ggml_metal_init: loaded kernel_get_rows_q4_1                  0x10260a6a0
ggml_metal_init: loaded kernel_get_rows_q2_K                  0x10260ad20
ggml_metal_init: loaded kernel_get_rows_q3_K                  0x1027046d0
ggml_metal_init: loaded kernel_get_rows_q4_K                  0x1024045a0
ggml_metal_init: loaded kernel_get_rows_q5_K                  0x102404d40
ggml_metal_init: loaded kernel_get_rows_q6_K                  0x1024053c0
ggml_metal_init: loaded kernel_rms_norm                       0x102405a70
ggml_metal_init: loaded kernel_norm                           0x102406120
ggml_metal_init: loaded kernel_mul_mat_f16_f32                0x102406c10
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32               0x1024072f0
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32               0x1027050b0
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32               0x102705790
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32               0x102706010
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32               0x1027066f0
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32               0x10260b2e0
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32               0x10260b9c0
ggml_metal_init: loaded kernel_rope                           0x10260c4b0
ggml_metal_init: loaded kernel_alibi_f32                      0x10260cd70
ggml_metal_init: loaded kernel_cpy_f32_f16                    0x10260d600
ggml_metal_init: loaded kernel_cpy_f32_f32                    0x10260de90
ggml_metal_init: loaded kernel_cpy_f16_f16                    0x10260e720
ggml_metal_init: recommendedMaxWorkingSetSize = 21845.34 MB
ggml_metal_init: hasUnifiedMemory             = true
ggml_metal_init: maxTransferRate              = built-in GPU
llama_new_context_with_model: max tensor size =    87.89 MB
ggml_metal_add_buffer: allocated 'data            ' buffer, size =  6983.72 MB, ( 6984.11 / 21845.34)
ggml_metal_add_buffer: allocated 'eval            ' buffer, size =  1024.00 MB, ( 8008.11 / 21845.34)
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =   402.00 MB, ( 8410.11 / 21845.34)
ggml_metal_add_buffer: allocated 'scr0            ' buffer, size =   512.00 MB, ( 8922.11 / 21845.34)
ggml_metal_add_buffer: allocated 'scr1            ' buffer, size =   512.00 MB, ( 9434.11 / 21845.34)

system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0



// RUN: %target-swift-ide-test -code-completion -source-filename %s -code-completion-token=GLOBAL_ENUM_1 > %t.globaleenum1.txt
// RUN: FileCheck --input-file=%t.globaleenum1.txt %s

// CHECK: CODE_ completionsForGlobalEnum1
// CHECK-NEXT: SYMBOL KIND
// CHECK-NEXT: BASE_ENUM=TypeName
// CHECK-NEXT: ◂────────────────────────
llama_print_timings:        load time =  3934.87 ms
llama_print_timings:      sample time =   158.58 ms /   128 runs   (    1.24 ms per token,   807.18 tokens per second)
llama_print_timings: prompt eval time =   256.35 ms /     2 tokens (  128.17 ms per token,     7.80 tokens per second)
llama_print_timings:        eval time =  6372.17 ms /   127 runs   (   50.17 ms per token,    19.93 tokens per second)
llama_print_timings:       total time =  6804.90 ms
ggml_metal_free: deallocating

ggerganov (Owner) commented

M1 Pro 32GB

model           master        this PR
7B, prompt 2    44.0 ms/tok   36.0 ms/tok
13B, prompt 2   87.0 ms/tok   64.0 ms/tok

ikawrakow (Contributor) commented

Another data point: M2 Max with 30-core GPU, 65B model, -p "I believe the meaning of life is":

  • Master: 186 ms/t
  • This PR: 152 ms/t
  • This PR with suggested changes: 146 ms/t

ggerganov (Owner) left a review comment

Impressive! 🦙

ggml-metal.m Outdated
@@ -660,7 +662,11 @@ void ggml_metal_graph_compute(

nth0 = 8;
nth1 = 8;
[encoder setComputePipelineState:ctx->pipeline_mul_mat_q4_0_f32];
if (ne01 % 8 == 0) {
ggerganov (Owner):

Is it somehow possible to remove this restriction without affecting the performance?
If so, we can simply delete the old pipeline_mul_mat_q4_0_f32 kernel.

lshzh-ww (Collaborator, Author):

This is for models like WizardLM. In these models the last mat-vec multiplication in inference has a row count of 32001, while our new kernel consumes 8 rows at a time. Let me test later whether adding the ability to handle fewer than 8 remaining rows in the kernel makes it run slower or faster.
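To make that arithmetic concrete, here is a small standalone C example (illustrative only, not part of the patch) showing why 32001 rows clash with a kernel that consumes rows 8 at a time, and what a ceil-divide dispatch would have to guard against:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    const int64_t ne01        = 32001;  /* output rows: n_vocab in WizardLM-style models   */
    const int64_t rows_per_tg = 8;      /* rows consumed per threadgroup by the new kernel */

    /* a floor-divide dispatch leaves a tail of rows uncomputed */
    printf("floor: %lld threadgroups cover %lld rows, %lld row(s) left over\n",
           (long long)(ne01 / rows_per_tg),
           (long long)((ne01 / rows_per_tg) * rows_per_tg),
           (long long)(ne01 % rows_per_tg));

    /* a ceil-divide dispatch covers everything, but the last threadgroup must
     * bound-check its row indices against ne01 */
    const int64_t ntg = (ne01 + rows_per_tg - 1) / rows_per_tg;
    printf("ceil:  %lld threadgroups, last one has only %lld valid row(s)\n",
           (long long) ntg,
           (long long)(ne01 - (ntg - 1) * rows_per_tg));
    return 0;
}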

lshzh-ww (Collaborator, Author):

Removed the old kernel. There might be a minor performance gain or loss depending on model size and n_vocab. Tested on llama-7B and wizardlm-30B; generation results are the same.

ikawrakow (Contributor) commented

And here is a full list of all Meta LLaMA models on the 30-core M2 Max with -p "I believe the meaning of life is":

Model   Master     This PR    This PR + suggested changes
7B      21.5 ms/t  21.0 ms/t  19.6 ms/t
13B     38.2 ms/t  35.5 ms/t  33.8 ms/t
33B     92 ms/t    78 ms/t    76 ms/t
65B     186 ms/t   152 ms/t   146 ms/t

simicvm commented on Jul 12, 2023

MacBook Pro, M2 Pro, 32GB, Ventura 13.4.1

./main -m models/open_llama_7b/ggml-model-q4_0.bin -n 128 -c 512 -s 12 -ngl 1 --no-mmap

llama_print_timings:        load time =  1334.44 ms
llama_print_timings:      sample time =   111.35 ms /   128 runs   (    0.87 ms per token,  1149.53 tokens per second)
llama_print_timings: prompt eval time =    83.72 ms /     2 tokens (   41.86 ms per token,    23.89 tokens per second)
llama_print_timings:        eval time =  4042.98 ms /   127 runs   (   31.83 ms per token,    31.41 tokens per second)
llama_print_timings:       total time =  4250.32 ms

./main -m models/open_llama_7b/ggml-model-q4_0.bin -n 128 -c 512 -s 12 -ngl 1 --no-mmap -p "I believe the meaning of life is"

llama_print_timings:        load time =  1013.66 ms
llama_print_timings:      sample time =    77.40 ms /    89 runs   (    0.87 ms per token,  1149.92 tokens per second)
llama_print_timings: prompt eval time =   225.65 ms /     8 tokens (   28.21 ms per token,    35.45 tokens per second)
llama_print_timings:        eval time =  2817.16 ms /    88 runs   (   32.01 ms per token,    31.24 tokens per second)
llama_print_timings:       total time =  3128.57 ms

r3muxd commented on Jul 12, 2023

./main -t 10 -ins -i -c 2048 --color --temp 0.7 --repeat_penalty 1.1 -n -1 -ngl 99  -m models/wizardlm-13b-v1.1.ggmlv3.q4_1.bin
> Write a creative story.
Once gaboo nose Mand Mandceu Metal Mand Wildoria?合?ership Liveria Indian PrшаLED Cant◄ mand Dumaudi Wild Manduga Janu sensitive flexibleERR namlackvet (/ Ausvat sensitive Jagtarheckii? Mareooypaya initiheck Wildintegr Mand Hamm WildooERRoriarrnam Mandool NapoorayfaceooypERR Indian Mand gab Hell Eisenheck Lambnam Hamm Montaudientedools Julia Vert exceptionsiation Indianrayed sensitive Wild Cant面 Goldenarc Bat Jak? flexibleiation sensitive合 (/cp Hamm AusERR Wild Hamm Wildeaoria wild DEFAULT Wildnam acvetheckented Mandmioo Wild CantheckERR mand Manduga合 nam thrownershipia Montlackaria hyp perce Wild Wild?ERR sensitiveoo Mand Mand wildvat semiERRnamentedERR exceptions Indian™ thrown Mand sensitiveERRooaudi Wildoo nevERRoo Mandko*{ Hammugaheck Indian MandoriaERR Indianented sensitiveented thrown flexibleentedampa Mand (/ERRERR Wild JagERR Wild Wild Wild mandariaoria Metal合lack cóheckiation gabERR Cantmiheck Mand WildERR面 Mand semiidthnats ac? pitt keyword WildERR Cantceu WildershipERR Ausvetheck Lamb NapiaERRGGvat Indianuga modoo Cant Aus (/面 Hamm mand throwneaERRheckoria Wild Mand mand sensitive sensitive Wildvat Wild Wildheck MandheckorialackheckrelativeeaERRoo合 sensitive Indian Indianaria? Lamb Wildea Wildaken Wild Dum​ Soul pittERRea Liver keyword thrownoo Metalented ac Indian Wild Cant Mission Indiania Indianidthentedership namisión? ones Wild Hamm gab Wild wild面 Wild? Namuga ashvet Indian MandooERR (/iationlackheck Mand Indian modvat Ausmi Indianoria sensitive Wildnam Mand mandiationoo flexible spirERR namoriaoonamidthintegrationego Mand*{ Eisenooatenea acaria

Minor issue: this does not occur if I remove -ngl 99.

gauravpathak-infa commented

With M2 Mac Studio Ultra, 128GB RAM

Model   Model Name                                Master                      This PR
7B      llama-7b.ggmlv3.q4_0.bin                  14.91 ms/tok (67.05 tok/s)  14.53 ms/tok (68.81 tok/s)
13B     nous-hermes-13b.ggmlv3.q4_0.bin           24.55 ms/tok (40.73 tok/s)  25.19 ms/tok (39.69 tok/s)
30B     WizardLM-30B-Uncensored.ggmlv3.q4_0.bin   54.64 ms/tok (18.30 tok/s)  51.39 ms/tok (19.46 tok/s)
65B     airoboros-65B-gpt4-1.2.ggmlv3.q4_0.bin    122.13 ms/tok (8.19 tok/s)  109.15 ms/tok (9.16 tok/s)

ggml-metal.m Outdated (review thread resolved)
Prefetch data to improve GPU utilization. ~48% faster for 33B model.
ggerganov merged commit 1cbf561 into ggerganov:master on Jul 12, 2023
4 checks passed
loretoparisi commented

@ggerganov how do I properly run benchmarks? I would love to contribute!

kernel void kernel_mul_mat_q4_0_f32(
device const void * src0,
device const float * src1,
device float * dst,
constant int64_t & ne00,
constant int64_t & ne10,
constant int64_t & ne0,
threadgroup float * sum [[threadgroup(0)]],
constant int64_t & ne01[[buffer(4)]],
ggerganov (Owner):

What is the function of [[buffer(4)]] here?

lshzh-ww (Collaborator, Author):

This gets the value from the buffer at index 4, corresponding to the following line in ggml-metal.m:
[encoder setBytes:&ne01 length:sizeof(ne01) atIndex:4];
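For anyone unfamiliar with the mechanism: [[buffer(N)]] pins a kernel argument to the compute encoder's buffer/bytes slot N, while arguments without the attribute get buffer indices assigned in declaration order. A minimal hypothetical kernel (made-up name, not from ggml-metal.metal) that would receive a value set with setBytes:...atIndex:4 on the host:

#include <metal_stdlib>
using namespace metal;

/* Hypothetical example: 'n' is bound to buffer slot 4, matching a host-side
 *   [encoder setBytes:&n length:sizeof(n) atIndex:4];
 * while 'x' comes from the buffer set at index 0. */
kernel void scale_by_n(
        device   float   * x [[buffer(0)]],
        constant int64_t & n [[buffer(4)]],
        uint tid [[thread_position_in_grid]]) {
    x[tid] *= (float) n;
}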

lshzh-ww deleted the mps-q4_0-kernel branch on July 13, 2023 16:08
amj commented on Jul 18, 2023

> With M2 Mac Studio Ultra, 128GB RAM
> (benchmark table quoted from gauravpathak-infa's comment above)

@gauravpathak-infa I also have an M2 Ultra; I am getting less than half of your numbers on these models. Could you share your command line, please?

gauravpathak-infa commented

@amj Here are the commands:
./main -m models/7B/llama-7b.ggmlv3.q4_0.bin -p "I believe the meaning of life is" --ignore-eos -n 256 -ngl 1
./main -m models/13B/nous-hermes-13b.ggmlv3.q4_0.bin -p "I believe the meaning of life is" --ignore-eos -n 256 -ngl 1
./main -m models/30B/WizardLM-30B-Uncensored.ggmlv3.q4_0.bin -p "I believe the meaning of life is" --ignore-eos -n 256 -ngl 1
./main -m models/65B/airoboros-65B-gpt4-1.2.ggmlv3.q4_0.bin -p "I believe the meaning of life is" --ignore-eos -n 256 -ngl 1

If you are getting half the numbers, the problem may be somewhere else. It may be useful to check whether you have compiled llama.cpp with Metal. If not:
make clean
LLAMA_METAL=1 make

amj commented on Jul 19, 2023

Yep, that did it; LLAMA_METAL had been dropped during some clean/build step. I'm now getting the same numbers you were showing. Thanks!

Labels: high priority (Very important issue), performance (Speed related topics)

10 participants