
ggml : support bs > 512 for Metal ggml_mul_mat_id #5070

Closed
stewartoallen opened this issue Jan 22, 2024 · 6 comments · Fixed by #5982
Labels: enhancement, good first issue, macos

Comments

@stewartoallen

Mixtral models + Metal GPU + batch size > 512 = GGML_ASSERT. This does not affect models such as llama-2-7b-chat.Q5_K_M.gguf.

Hardware: Apple M2 Ultra
RAM: 192GB
llama.cpp current version as of 2024-01-21 (504dc37)

./main -f /tmp/prompt1k -m models/mixtral-8x7b-instruct-v0.1.Q6_K.gguf -c 4096 -b 512 << OK
./main -f /tmp/prompt1k -m models/mixtral-8x7b-instruct-v0.1.Q6_K.gguf -c 4096 -b 4096 << FAIL

### Assistant:GGML_ASSERT: ggml-metal.m:1511: ne11 <= 512

./main -f /tmp/prompt1k -m models/mixtral-8x7b-instruct-v0.1.Q6_K.gguf -c 4096 -b 4096 -ngl 0 << OK

but takes forever

@ggerganov
Owner

Yes, we need to move src1ids from the stack to shared memory and increase its size.
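
For readers picking this up as a first issue, here is a rough CUDA analogue of the suggested change; CUDA `__shared__` memory plays roughly the role that `threadgroup` memory plays in Metal. This is only a sketch under assumptions, not the actual ggml-metal kernel: the kernel names, the `expert_of_row` input, and the launch configuration are hypothetical, and the real change landed in #5982.

```cuda
// Sketch only: the general pattern behind "move the ids from the stack to
// shared memory". Names (collect_ids_*, expert_of_row) are hypothetical.
#define MAX_IDS_ON_STACK 512

// Old pattern: row ids are collected into a fixed-size local (stack) array,
// so the number of src1 rows per batch cannot exceed 512.
__global__ void collect_ids_stack(const int *expert_of_row, int n_rows,
                                  int expert, int *out_ids, int *out_count) {
    int ids[MAX_IDS_ON_STACK];
    int n = 0;
    for (int i = 0; i < n_rows && n < MAX_IDS_ON_STACK; ++i) {
        if (expert_of_row[i] == expert) ids[n++] = i;
    }
    if (threadIdx.x == 0) {
        for (int j = 0; j < n; ++j) out_ids[j] = ids[j];
        *out_count = n;
    }
}

// New pattern: the ids buffer is dynamic shared memory sized at launch time,
// so the limit is set by available shared memory rather than a hard-coded 512.
__global__ void collect_ids_shared(const int *expert_of_row, int n_rows,
                                   int expert, int *out_ids, int *out_count) {
    extern __shared__ int ids[];  // n_rows * sizeof(int), passed at launch
    if (threadIdx.x == 0) {
        int n = 0;
        for (int i = 0; i < n_rows; ++i) {
            if (expert_of_row[i] == expert) ids[n++] = i;
        }
        for (int j = 0; j < n; ++j) out_ids[j] = ids[j];
        *out_count = n;
    }
}
```

A launch such as `collect_ids_shared<<<1, 32, n_rows * sizeof(int)>>>(...)` supplies the buffer size at runtime; the Metal counterpart is a threadgroup allocation sized with `setThreadgroupMemoryLength:atIndex:` on the compute encoder.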

@xcottos

xcottos commented Feb 6, 2024

Hi everybody,

I'm encountering the same issue when using the Python wrapper (the Python kernel crashes if batch_size > 512):

GGML_ASSERT: /private/var/folders/md/5gb2vml53fl36jdz9tvg53s80000gn/T/pip-install-b5i4cgto/llama-cpp-python_1549dcff18604e30944aeaa6c55a63b3/vendor/llama.cpp/ggml-metal.m:1726: ne11 <= 512
GGML_ASSERT: /private/var/folders/md/5gb2vml53fl36jdz9tvg53s80000gn/T/pip-install-b5i4cgto/llama-cpp-python_1549dcff18604e30944aeaa6c55a63b3/vendor/llama.cpp/ggml-metal.m:1726: ne11 <= 512
GGML_ASSERT: /private/var/folders/md/5gb2vml53fl36jdz9tvg53s80000gn/T/pip-install-b5i4cgto/llama-cpp-python_1549dcff18604e30944aeaa6c55a63b3/vendor/llama.cpp/ggml-metal.m:1726: ne11 <= 512
GGML_ASSERT: /private/var/folders/md/5gb2vml53fl36jdz9tvg53s80000gn/T/pip-install-b5i4cgto/llama-cpp-python_1549dcff18604e30944aeaa6c55a63b3/vendor/llama.cpp/ggml-metal.m:1726: ne11 <= 512
zsh: abort /Volumes/AI_MASTER/envs/torch-metal/bin/python
/Volumes/AI_MASTER/envs/torch-metal/lib/python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '

I can use batch_size up to a maximum of 512, but I'm sure larger values were working before, so I'm not sure which update broke it...

Please advise if I'm missing an update or procedure that would address this.

Thanks
Luca

@xcottos

xcottos commented Feb 6, 2024

I just verified that this works in the llama.cpp version I have (at least three weeks behind):

./main -i -m /Volumes/AI_MASTER/models/mistral-8x7b-instruct/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf -c 8196 -ngl 1 -b 8196 -t 0 --color -p "[INST] what is the capital of france? [/INST]"

..............
ggml_metal_init: recommendedMaxWorkingSetSize = 51539.61 MB
ggml_metal_init: maxTransferRate = built-in GPU
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 1024.50 MiB, (26241.81 / 49152.00)
llama_new_context_with_model: KV self size = 1024.50 MiB, K (f16): 512.25 MiB, V (f16): 512.25 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 0.02 MiB, (26241.83 / 49152.00)
llama_build_graph: non-view tensors processed: 1124/1124
llama_new_context_with_model: compute buffer total size = 9292.41 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 9289.23 MiB, (35531.05 / 49152.00)

system_info: n_threads = 10 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
main: interactive mode on.
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
generate: n_ctx = 8196, n_batch = 8196, n_predict = -1, n_keep = 0

== Running in interactive mode. ==

  • Press Ctrl+C to interject at any time.
  • Press Return to return control to LLaMa.
  • To return control without starting a new line, end your input with '/'.
  • If you want to submit another line, end your input with '\'.

[INST] what is the capital of france? [/INST] The capital city of France is Paris. It is located in the northern central part of the country, on the river Seine. Paris is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum, which is home to thousands of works of art, including the Mona Lisa. Paris is also famous for its fashion, cuisine, and cultural events, making it one of the most popular tourist destinations in the world.

Now I'm scared of updating llama.cpp, since recompiling llama-cpp-python resulted in the behaviour described above.

Please advise how I can address it.

Thank you
Luca

@ggerganov
Owner

Use n_batch <= 512 for now - this change was made to improve performance. Larger batches will be supported in the future (#5070 (comment)).
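
For example, the failing command from the original report runs once the batch size is capped (this is the same invocation already shown to work above):

./main -f /tmp/prompt1k -m models/mixtral-8x7b-instruct-v0.1.Q6_K.gguf -c 4096 -b 512

For the llama-cpp-python wrapper, the corresponding setting is the n_batch value passed when constructing the Llama object; keeping it at 512 or below should avoid the assert until larger batches are supported.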

@xcottos

xcottos commented Feb 6, 2024

Thank you for the update

Luca

@weissenbacherpwc

Same issue here. @ggerganov, looking forward to the changes in the future.

@ggerganov added the enhancement and macos labels and removed the bug-unconfirmed label on Feb 9, 2024
@ggerganov changed the title from "GGML_ASSERT ggml-metal.m:1515: ne11 <= 512 when using gpu and mixtral models" to "ggml : support bs > 512 for Metal ggml_mul_mat_id" on Feb 9, 2024
@ggerganov added the good first issue label on Feb 9, 2024