ggml : support bs > 512 for Metal ggml_mul_mat_id #5070
Comments
Yes, need to move the |
Hi everybody, I'm encountering the same issue using the Python wrapper (the Python kernel crashes if batch_size > 512):

GGML_ASSERT: /private/var/folders/md/5gb2vml53fl36jdz9tvg53s80000gn/T/pip-install-b5i4cgto/llama-cpp-python_1549dcff18604e30944aeaa6c55a63b3/vendor/llama.cpp/ggml-metal.m:1726: ne11 <= 512

I can cap batch_size at 512, but I'm sure larger values were working before, so I'm not sure which update broke it. Please advise if I'm missing an update or a procedure to address it.

Thanks
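The workaround mentioned above (keeping the batch size at or below 512) can be sketched in Python roughly as follows. This is only an illustration, not code from llama.cpp or llama-cpp-python: the constant and both helper names are hypothetical, and the 512 limit is taken from the `ne11 <= 512` assertion quoted in the error message.

```python
from typing import Iterator, List, Sequence

# Assumption: the limit mirrors the `ne11 <= 512` assertion reported in ggml-metal.m.
METAL_MUL_MAT_ID_MAX_BATCH = 512


def clamp_batch_size(requested: int, limit: int = METAL_MUL_MAT_ID_MAX_BATCH) -> int:
    """Return a batch size that does not exceed the given limit (hypothetical helper)."""
    if requested < 1:
        raise ValueError("batch size must be positive")
    return min(requested, limit)


def split_into_batches(tokens: Sequence[int], batch_size: int) -> Iterator[List[int]]:
    """Yield consecutive chunks of at most `batch_size` tokens (hypothetical helper)."""
    for start in range(0, len(tokens), batch_size):
        yield list(tokens[start:start + batch_size])


if __name__ == "__main__":
    safe_batch = clamp_batch_size(4096)      # a requested 4096 is capped at the limit
    prompt_tokens = list(range(1000))        # stand-in for a tokenized ~1k-token prompt
    sizes = [len(chunk) for chunk in split_into_batches(prompt_tokens, safe_batch)]
    print(safe_batch, sizes)
```

With llama-cpp-python, the clamped value would presumably be passed as the `n_batch` argument when constructing `Llama`, so the library itself never submits more than 512 tokens per batch; the splitting helper only shows what that chunking amounts to.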
I just verified that in the llama.cpp version I have (at least 3 weeks behind) this works:

./main -i -m /Volumes/AI_MASTER/models/mistral-8x7b-instruct/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf -c 8196 -ngl 1 -b 8196 -t 0 --color -p "[INST] what is the capital of france? [/INST]"

..............
system_info: n_threads = 10 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
== Running in interactive mode. ==
[INST] what is the capital of france? [/INST] The capital city of France is Paris. It is located in the northern central part of the country, on the river Seine. Paris is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum, which is home to thousands of works of art, including the Mona Lisa. Paris is also famous for its fashion, cuisine, and cultural events, making it one of the most popular tourist destinations in the world.

Now I'm scared of updating llama.cpp, since recompiling llama-cpp-python ended up in the behaviour described above. Please advise how I can address it.

Thank you
Use |
Thank you for the update
Luca
Same issue here. @ggerganov, looking forward to the changes in the future.
Mixtral models + Metal GPU + batch size > 512 = GGML_ASSERT. Does not affect models such as llama-2-7b-chat.Q5_K_M.gguf.
Hardware: Apple M2 Ultra
RAM: 192GB
llama.cpp current version as of 2024-01-21 (504dc37)
./main -f /tmp/prompt1k -m models/mixtral-8x7b-instruct-v0.1.Q6_K.gguf -c 4096 -b 512 << OK
./main -f /tmp/prompt1k -m models/mixtral-8x7b-instruct-v0.1.Q6_K.gguf -c 4096 -b 4096 << FAIL
./main -f /tmp/prompt1k -m models/mixtral-8x7b-instruct-v0.1.Q6_K.gguf -c 4096 -b 4096 -ngl 0 << OK
but takes forever