
ggml : support bs > 512 for Metal ggml_mul_mat_id #5070

Closed
stewartoallen opened this issue Jan 22, 2024 · 6 comments · Fixed by #5982
Labels: enhancement, good first issue, macos

Comments

@stewartoallen

Mixtral models + Metal GPU + batch size > 512 = GGML_ASSERT. This does not affect models such as llama-2-7b-chat.Q5_K_M.gguf.

Hardware: Apple M2 Ultra
RAM: 192GB
llama.cpp current version as of 2024-01-21 (504dc37)

./main -f /tmp/prompt1k -m models/mixtral-8x7b-instruct-v0.1.Q6_K.gguf -c 4096 -b 512 << OK
./main -f /tmp/prompt1k -m models/mixtral-8x7b-instruct-v0.1.Q6_K.gguf -c 4096 -b 4096 << FAIL

### Assistant:GGML_ASSERT: ggml-metal.m:1511: ne11 <= 512

./main -f /tmp/prompt1k -m models/mixtral-8x7b-instruct-v0.1.Q6_K.gguf -c 4096 -b 4096 -ngl 0 << OK

but takes forever

@ggerganov
Owner

Yes, we need to move src1ids from the stack to shared memory and increase its size.
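
For readers picking this up as a first issue, here is a rough CUDA analogue of the suggested change; CUDA `__shared__` memory plays roughly the role that `threadgroup` memory plays in Metal. This is only a sketch under assumptions, not the actual ggml-metal kernel: the kernel names, the `expert_of_row` input, and the launch configuration are hypothetical, and the real change landed in #5982.

```cuda
// Sketch only: the general pattern behind "move the ids from the stack to
// shared memory". Names (collect_ids_*, expert_of_row) are hypothetical.
#define MAX_IDS_ON_STACK 512

// Old pattern: row ids are collected into a fixed-size local (stack) array,
// so the number of src1 rows per batch cannot exceed 512.
__global__ void collect_ids_stack(const int *expert_of_row, int n_rows,
                                  int expert, int *out_ids, int *out_count) {
    int ids[MAX_IDS_ON_STACK];
    int n = 0;
    for (int i = 0; i < n_rows && n < MAX_IDS_ON_STACK; ++i) {
        if (expert_of_row[i] == expert) ids[n++] = i;
    }
    if (threadIdx.x == 0) {
        for (int j = 0; j < n; ++j) out_ids[j] = ids[j];
        *out_count = n;
    }
}

// New pattern: the ids buffer is dynamic shared memory sized at launch time,
// so the limit is set by available shared memory rather than a hard-coded 512.
__global__ void collect_ids_shared(const int *expert_of_row, int n_rows,
                                   int expert, int *out_ids, int *out_count) {
    extern __shared__ int ids[];  // n_rows * sizeof(int), passed at launch
    if (threadIdx.x == 0) {
        int n = 0;
        for (int i = 0; i < n_rows; ++i) {
            if (expert_of_row[i] == expert) ids[n++] = i;
        }
        for (int j = 0; j < n; ++j) out_ids[j] = ids[j];
        *out_count = n;
    }
}
```

A launch such as `collect_ids_shared<<<1, 32, n_rows * sizeof(int)>>>(...)` supplies the buffer size at runtime; the Metal counterpart is a threadgroup allocation sized with `setThreadgroupMemoryLength:atIndex:` on the compute encoder.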

@xcottos

xcottos commented Feb 6, 2024

Hi everybody,

I'm encountering the same issue when using the Python wrapper (the Python kernel crashes if batch_size > 512):

GGML_ASSERT: /private/var/folders/md/5gb2vml53fl36jdz9tvg53s80000gn/T/pip-install-b5i4cgto/llama-cpp-python_1549dcff18604e30944aeaa6c55a63b3/vendor/llama.cpp/ggml-metal.m:1726: ne11 <= 512
GGML_ASSERT: /private/var/folders/md/5gb2vml53fl36jdz9tvg53s80000gn/T/pip-install-b5i4cgto/llama-cpp-python_1549dcff18604e30944aeaa6c55a63b3/vendor/llama.cpp/ggml-metal.m:1726: ne11 <= 512
GGML_ASSERT: /private/var/folders/md/5gb2vml53fl36jdz9tvg53s80000gn/T/pip-install-b5i4cgto/llama-cpp-python_1549dcff18604e30944aeaa6c55a63b3/vendor/llama.cpp/ggml-metal.m:1726: ne11 <= 512
GGML_ASSERT: /private/var/folders/md/5gb2vml53fl36jdz9tvg53s80000gn/T/pip-install-b5i4cgto/llama-cpp-python_1549dcff18604e30944aeaa6c55a63b3/vendor/llama.cpp/ggml-metal.m:1726: ne11 <= 512
zsh: abort /Volumes/AI_MASTER/envs/torch-metal/bin/python
/Volumes/AI_MASTER/envs/torch-metal/lib/python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '

I can use batch_size up to a maximum of 512, but I'm sure larger values were working before, so I'm not sure which update broke it...

Please advise if I'm missing an update or procedure that would address this.

Thanks
Luca

@xcottos

xcottos commented Feb 6, 2024

I just verified that this works in the llama.cpp version I have (at least three weeks behind):

./main -i -m /Volumes/AI_MASTER/models/mistral-8x7b-instruct/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf -c 8196 -ngl 1 -b 8196 -t 0 --color -p "[INST] what is the capital of france? [/INST]"

..............
ggml_metal_init: recommendedMaxWorkingSetSize = 51539.61 MB
ggml_metal_init: maxTransferRate = built-in GPU
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 1024.50 MiB, (26241.81 / 49152.00)
llama_new_context_with_model: KV self size = 1024.50 MiB, K (f16): 512.25 MiB, V (f16): 512.25 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 0.02 MiB, (26241.83 / 49152.00)
llama_build_graph: non-view tensors processed: 1124/1124
llama_new_context_with_model: compute buffer total size = 9292.41 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 9289.23 MiB, (35531.05 / 49152.00)

system_info: n_threads = 10 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
main: interactive mode on.
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
generate: n_ctx = 8196, n_batch = 8196, n_predict = -1, n_keep = 0

== Running in interactive mode. ==

  • Press Ctrl+C to interject at any time.
  • Press Return to return control to LLaMa.
  • To return control without starting a new line, end your input with '/'.
  • If you want to submit another line, end your input with '\'.

[INST] what is the capital of france? [/INST] The capital city of France is Paris. It is located in the northern central part of the country, on the river Seine. Paris is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum, which is home to thousands of works of art, including the Mona Lisa. Paris is also famous for its fashion, cuisine, and cultural events, making it one of the most popular tourist destinations in the world.

Now I'm scared of updating llama.cpp, since recompiling llama-cpp-python resulted in the behaviour described above.

Please advise how I can address it.

Thank you
Luca

@ggerganov
Owner

Use n_batch <= 512 for now - this change was made to improve performance. Larger batches will be supported in the future (#5070 (comment)).
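
For example, the failing command from the original report runs once the batch size is capped (this is the same invocation already shown to work above):

./main -f /tmp/prompt1k -m models/mixtral-8x7b-instruct-v0.1.Q6_K.gguf -c 4096 -b 512

For the llama-cpp-python wrapper, the corresponding setting is the n_batch value passed when constructing the Llama object; keeping it at 512 or below should avoid the assert until larger batches are supported.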

@xcottos

xcottos commented Feb 6, 2024

Thank you for the update

Luca

@weissenbacherpwc

Same issue here. @ggerganov, looking forward to the changes in the future.

@ggerganov added the enhancement and macos labels and removed the bug-unconfirmed label on Feb 9, 2024
@ggerganov changed the title from "GGML_ASSERT ggml-metal.m:1515: ne11 <= 512 when using gpu and mixtral models" to "ggml : support bs > 512 for Metal ggml_mul_mat_id" on Feb 9, 2024
@ggerganov added the good first issue label on Feb 9, 2024