llama : Metal inference #1642

Merged: 49 commits, Jun 4, 2023

Conversation

@ggerganov (Owner) commented May 29, 2023

Add full GPU inference of LLaMA on Apple Silicon using Metal

Demo

M1 Pro + 7B LLaMA:

llama-metal-0.mp4

M2 Max + 7B LLaMA:

llama-metal-1-lq.mp4

M2 Max + 13B LLaMA:

llama-metal-13B-0-lq.mp4

M2 Max + 65B LLaMA:

llama-metal-65B-0-lq.mp4

Details

  • The ggml API is extended in ggml-metal.h
  • The Metal shaders / kernels are implemented in ggml-metal.metal
  • This PR implements support only for Q4_0, but all other quantizations can easily be added in the future
  • Works well with mmap to avoid duplicating the model data in memory. Still, there are a few memory improvements that can be made in the future to further reduce memory usage when Metal is enabled
  • The core of the implementation is contained in the ggml_metal_graph_compute() function. It is analogous to the CPU-only ggml_graph_compute() and its purpose is to evaluate a ggml_cgraph on the GPU in a similar way
  • The implemented shaders currently focus on qMatrix x Vector multiplication, which is what LLM text generation normally needs. For other tasks that involve Matrix x Matrix multiplication (for example, prompt ingestion or perplexity computation) we don't have an efficient implementation yet, so we fall back to the CPU / ANE
  • There is a nice separation of the implementation: the new ggml-metal.h, ggml-metal.m and ggml-metal.metal files are optional and all Metal-related code is contained within them. 3rd party user apps can decide whether they want to include / modify / ignore them (a rough usage sketch follows this list)
  • The proposed implementation can be easily extended for other backends like CUDA by following the same pattern as demonstrated in this PR
  • Optionally, we now have support for exporting static computation graphs. Creation and usage is demonstrated in the metal example
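For 3rd-party integrators, the host-side flow looks roughly like the sketch below. The function names (ggml_metal_init, ggml_metal_add_buffer, ggml_metal_graph_compute) are the ones that appear in this PR; the exact signatures and error handling are assumptions made here for illustration, so check ggml-metal.h for the authoritative API.

```c
// Hedged sketch of host-side usage of the optional Metal backend.
// Signatures are assumed for illustration -- see ggml-metal.h in this PR.
#include "ggml.h"
#ifdef GGML_USE_METAL
#include "ggml-metal.h"
#endif

void eval_graph_on_gpu(struct ggml_cgraph * gf, void * model_data, size_t model_size) {
#ifdef GGML_USE_METAL
    // create the Metal context: loads ggml-metal.metal and compiles the kernels
    struct ggml_metal_context * ctx_metal = ggml_metal_init();

    // expose the (mmap-ed) model weights to the GPU without copying them
    ggml_metal_add_buffer(ctx_metal, "data", model_data, model_size);

    // evaluate the graph on the GPU -- the Metal counterpart of ggml_graph_compute()
    ggml_metal_graph_compute(ctx_metal, gf);
#else
    // without LLAMA_METAL / GGML_USE_METAL the ggml-metal.* files are simply not built
    (void) gf; (void) model_data; (void) model_size;
#endif
}
```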

Usage

  • Add LLAMA_METAL=1 to your make command or -DLLAMA_METAL=ON to your cmake command.
  • Add -ngl 1 to the main command-line arguments to enable GPU inference
$ make clean
$ LLAMA_METAL=1 make -j && ./main -m ./models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" --ignore-eos -n 64 -ngl 1

I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_METAL
I LDFLAGS:   -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

cc  -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG   -c ggml.c -o ggml.o
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_METAL -c llama.cpp -o llama.o
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_METAL -c examples/common.cpp -o common.o
cc -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG -c ggml-metal.m -o ggml-metal.o
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_METAL examples/main/main.cpp ggml.o llama.o common.o ggml-metal.o -o main  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_METAL examples/quantize/quantize.cpp ggml.o llama.o ggml-metal.o -o quantize  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_METAL examples/quantize-stats/quantize-stats.cpp ggml.o llama.o ggml-metal.o -o quantize-stats  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_METAL examples/perplexity/perplexity.cpp ggml.o llama.o common.o ggml-metal.o -o perplexity  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_METAL examples/embedding/embedding.cpp ggml.o llama.o common.o ggml-metal.o -o embedding  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_METAL pocs/vdot/vdot.cpp ggml.o ggml-metal.o -o vdot  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders

====  Run ./main -h for help.  ====

main: build = 653 (db3db9e)
main: seed  = 1685893102
llama.cpp: loading model from ./models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 5407.71 MB (+ 1026.00 MB per state)
.
llama_init_from_file: kv self size  =  256.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/ggerganov/development/github/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add                            0x120a06020
ggml_metal_init: loaded kernel_mul                            0x120a065a0
ggml_metal_init: loaded kernel_mul_row                        0x120a06bd0
ggml_metal_init: loaded kernel_scale                          0x120a070f0
ggml_metal_init: loaded kernel_silu                           0x120a07610
ggml_metal_init: loaded kernel_relu                           0x120a07b30
ggml_metal_init: loaded kernel_soft_max                       0x120a081e0
ggml_metal_init: loaded kernel_diag_mask_inf                  0x120a08840
ggml_metal_init: loaded kernel_get_rows_q4_0                  0x120a08ec0
ggml_metal_init: loaded kernel_rms_norm                       0x120a09570
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32               0x120a09dd0
ggml_metal_init: loaded kernel_mul_mat_f16_f32                0x120a0a7a0
ggml_metal_init: loaded kernel_rope                           0x120a0b090
ggml_metal_init: loaded kernel_cpy_f32_f16                    0x120a0b920
ggml_metal_init: loaded kernel_cpy_f32_f32                    0x120a0c1b0
ggml_metal_add_buffer: allocated 'data            ' buffer, size =  3616.07 MB
ggml_metal_add_buffer: allocated 'eval            ' buffer, size =   768.00 MB
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =   258.00 MB
ggml_metal_add_buffer: allocated 'scr0            ' buffer, size =   512.00 MB
ggml_metal_add_buffer: allocated 'scr1            ' buffer, size =   512.00 MB

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 64, n_keep = 0


 I believe the meaning of life is to be happy.
That's what I would call my philosophy on how to live life, that's what I want people to remember me for.
I was actually diagnosed with a tumor when I was 17 years old and had a very long surgery in order to get it removed.

llama_print_timings:        load time =  1685.43 ms
llama_print_timings:      sample time =    45.70 ms /    64 runs   (    0.71 ms per token)
llama_print_timings: prompt eval time =   342.51 ms /     8 tokens (   42.81 ms per token)
llama_print_timings:        eval time =  3079.50 ms /    63 runs   (   48.88 ms per token)
llama_print_timings:       total time =  4816.85 ms

Implementation process of this PR (archive)

  • Export a ggml computation graph of a LLaMA model:

    ./bin/main -m ../models/7B/ggml-model-q4_0.bin --export

    This creates the llama.ggml file which contains the computation graph

  • We will now load it with a separate tool and attempt to evaluate with Metal:

    ./bin/mtl llama.ggml
  • Implement the entire network layer by layer, comparing the CPU and GPU results

    • GET_ROWS_Q4_0
    • RMS_NORM
    • MUL
    • MUL_MAT
    • RESHAPE
    • TRANSPOSE
    • ROPE
    • VIEW
    • CPY
    • SCALE
    • DIAG_MASK_INF
    • SOFT_MAX
    • SILU
  • Optimize the kernels to achieve at the very least parity with CPU-only speed

  • Adjust dynamic shapes before evaluating the graph (i.e. n_past, N)

  • Simplify encoder dispatch code, reduce duplication

  • Add basic text-generation example


Robots

🤖 Generated by Copilot at 324e823

Summary

🍎📝🚀

This pull request adds Metal support for llama, a library for tensor manipulation and computation graph export/import. It introduces a new CMake option LLAMA_METAL and a new header file ggml-metal.h that enable GPU acceleration of llama expressions on Apple devices. It also improves the readability, consistency, and usability of the existing code and documentation, and adds some new features and examples. It fixes a bug in the main example program and adds a new metal example program that demonstrates how to evaluate a statically exported ggml computation graph with Metal.

If you want to use llama with Metal
You can now do so with this pull request, all
You need is to set LLAMA_METAL
And then you can export your ggml
To a file or a graph that is special

Walkthrough

  • Add Metal support for llama, a GPU backend for Apple devices (link, link, link, link, link, link, link, link, link, link, link, link, link, link, link)
  • Fix a bug in the example program main.cpp that used subtraction instead of addition to compute the sum of two numbers (link)
  • Add a command-line option --export to the example program main.cpp that allows exporting the computation graph to a file named llama.ggml (link, link, link)
  • Add a function llama_eval_export that exports a static computation graph for a context of 511 and a batch size of 1 using llama_eval_internal (link, link)
  • Change the logic of the function ggml_graph_import to parse the arguments of the tensor before creating it, and to handle different cases of view operations differently (link, link)
  • Change the logic of the function ggml_nbytes to handle cases where the tensor is not contiguous in memory (link)
  • Add a call to ggml_scratch_save and ggml_scratch_load to the functions ggml_view_1d, ggml_view_2d, ggml_view_3d and ggml_view_4d to preserve the scratch memory state when creating a new tensor for the offset (link, link, link, link)
  • Add a call to ggml_set_name to the functions ggml_view_2d, ggml_view_3d and ggml_view_4d to assign a name to the result tensor for debugging purposes (link, link, link)
  • Add a call to ggml_set_name to the function llama_eval_internal to assign a name to the tensor Vcur for debugging purposes (link)
  • Add a parameter cgraph_fname to the function llama_eval_internal that allows exporting the computation graph to a file if not null (link, link, link)
  • Add a variable eop to the function ggml_graph_import that stores the enum value of the operation code for convenience (link)
  • Add a const qualifier to the variables mean and x0 in the functions ggml_compute_forward_rms_norm_f32 and ggml_compute_forward_rope_f32 to indicate that they are not modified after initialization (link, link, link)
  • Change the return type of the function ggml_nrows from int to int64_t to match the type of the ne field of the ggml_tensor struct (link)
  • Change the visibility of the functions ggml_is_transposed and ggml_is_contiguous from static inline to public by adding them to the ggml.h header file (link, link)
  • Increase the width of the last column in the format strings of the functions ggml_graph_export_leaf and ggml_graph_export_node to accommodate longer tensor names (link, link)
  • Comment out two assertions in the function ggml_graph_export that check the work buffer size of the computation graph, because they are not valid when exporting a graph with Metal support (link)
  • Remove an empty line from the function ggml_graph_export for consistency (link)
  • Remove the declaration of the variable cur from the function llama_eval_internal because it is declared later in the same scope (link)
  • Replace the variable inpL with cur in the function llama_eval_internal to reflect the previous changes in the tensor creation logic (link, link)
  • Remove an empty line from the function llama_eval_internal for consistency (link)
  • Add an empty line to the function llama_eval_internal for readability (link)
  • Format the call to llama_model_load in the function llama_init to use multiple lines and indentation for readability (link)
  • Format the declarations of the functions ggml_init and ggml_free in the ggml.h header file to use multiple lines and indentation for readability (link)
  • Format the target link libraries command for llama to use multiple lines and indentation for readability (link)
  • Align the spacing of the memory requirements expressions in the function llama_model_load_internal for readability (link)
  • Align the spacing of the CMake options for llama to make them more consistent and readable (link)
  • Rename the variable GGML_CUDA_SOURCES to GGML_SOURCES_CUDA to match the naming convention of other source variables in the CMake file (link, link)
  • Add a subdirectory metal to the examples CMake file if LLAMA_METAL is enabled (link)
  • Add an empty line to the README.md file for readability (link)
  • Add empty lines to the Makefile to separate different conditional blocks for readability (link, link, link)
  • Add comments to mark the end of the conditional blocks in the Makefile (link, link, link)

@ggerganov added the performance (Speed related topics) label on May 29, 2023
Review thread on examples/mtl/mtl.metal (outdated, resolved)
@ggerganov (Owner, Author)

Ok, the Q4 mul mat kernel is next - very important to get this right.
If we can hit that bullseye, the rest of the dominoes will fall like a house of cards. Checkmate

@philipturner commented May 30, 2023

Ok, the Q4 mul mat kernel is next - very important to get this right.

A bit of advice, when I made the kernel above, I ran the CPU-side script over a dozen times per change to the Metal code. I ran until I was confident I had found the maximum achievable bandwidth. Although this overestimates actual performance, it removes all noise, so you can focus on relative performance. "Does this change make it slightly faster or slightly slower?"

Then it's very similar to training a neural network. Incrementally descend the performance slope until reaching whatever Metal shader works best for you.
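As an illustration of that workflow, a minimal harness might look like the sketch below; run_metal_kernel_once() is a hypothetical placeholder for whatever dispatch is being tuned, and the best-of-N bandwidth figure is what you compare across shader tweaks.

```c
// Minimal sketch of the "best of many runs" tuning loop described above.
// run_metal_kernel_once() is a hypothetical placeholder; replace it with a call
// that dispatches the Metal kernel under test and returns elapsed seconds.
#include <stdio.h>

static double run_metal_kernel_once(void) {
    return 5e-3; // stub: pretend one pass took 5 ms
}

static double best_bandwidth_gbs(size_t bytes_moved, int runs) {
    double best = 0.0;
    for (int i = 0; i < runs; ++i) {
        const double t  = run_metal_kernel_once();
        const double bw = (double) bytes_moved / t / 1e9; // GB/s for this run
        if (bw > best) best = bw;                         // keep the least-noisy (fastest) run
    }
    return best;
}

int main(void) {
    // e.g. one Q4_0 7B mat-vec pass reads roughly the whole 'data' buffer (~3.6 GB, per the log above)
    const size_t bytes = 3616ull * 1024 * 1024;
    printf("best observed bandwidth: %.1f GB/s\n", best_bandwidth_gbs(bytes, 20));
    return 0;
}
```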

@jason-hulkman

I'm considering purchasing the Mac Studio with M2 Ultra 76 core 192GB. I'm curious about the performance of your 65B 4-bit model. Could you provide some details? Does it run the same as an A6000 (9~13 tokens/s for 65B 4-bit)?

@philipturner

I'm considering purchasing the Mac Studio with M2 Ultra 76 core 192GB.

Wouldn't it be cheaper to just purchase access to GPT-4 through the OpenAI API? If the goal is the highest-quality LLM models available.

@soleblaze commented Jun 25, 2023

I started working on my benchmark app. I'll publish some alpha results when I get it set up to benchmark every quant and param value for a given model and put it in a table.

Wouldn't it be cheaper to just purchase access to GPT-4 through the OpenAI API? If the goal is the highest-quality LLM models available

Only if you can get GPT-4 access. That said, gpt-3.5-turbo is still better than any local LLM and is much cheaper. Using GPU instances like RunPod is also way cheaper for non-24/7 use than building even a mid-level setup. I was looking at an AMD 6950 XT to mess with AMD support, but since it's not officially supported by ROCm I decided I'll be using Azure AMD instances instead. I can get about 300 hours for the price of that card.

Running your own hardware doesn't make sense from a cost perspective unless you're literally doing it 24/7. Even then I'm not sure what the break-even point is once you factor in power bills.

Ofc, it’s not like “makes sense from a cost perspective” is always a priority with hobbies.

@soleblaze commented Jun 27, 2023

My benchmark app can go through some models in a directory, but it eventually dies with an out-of-memory error. This appears to be an issue with llama-cpp-python. I don't think the CPU thread piece is working properly. I removed the prompt eval times from this, as they are much slower than what I get if I run llama.cpp directly. The eval times appear to be in line with running llama.cpp directly.

I may spawn llama.cpp directly, or I'll look into fixing llama-cpp-python. Not sure yet, but I'll have time next week to work on that and flesh this out more.

Here's what I have from running it against a few 65B models:

Racing Llama Benchmark

System Information:
OS: MacOS
ARCH: arm64
CPU: Apple M2 Ultra - 24 cores (16 performance and 8 efficiency)
GPU: Apple M2 Ultra - 76 cores
RAM: 192 GB

Runs: 10
llama-cpp-python version: 0.1.66
CPU Threads: 15
GPU Acceleration: True
Seed: -1
Prompt: ### Human: You are an AI being benchmarked. You want to be helpful and provide a useful response that can be repeated. What would you suggest is the best way to benchmark the response times of a large language model?

Assistant:

Eval Tokens per second:

| Model | Params | Quant | Fastest | Slowest | Mean | Median |
| --- | --- | --- | --- | --- | --- | --- |
| airoboros (gpt4 1.3) | 65B | q4_0 | 10.84 | 10.27 | 10.73 | 10.77 |
| airoboros (gpt4 1.3) | 65B | q5_K_M | 9.57 | 8.36 | 9.32 | 9.45 |
| alpaca lora | 65B | q5_K_M | 8.6 | 6.98 | 7.87 | 7.95 |
| dromedary lora | 65B | q4_K_M | 10.3 | 9 | 9.92 | 10.24 |
| dromedary lora | 65B | q5_K_M | 9.61 | 8.59 | 9.22 | 9.30 |
| gpt4 alpaca lora_mlp | 65B | q4_K_M | 10.16 | 9.07 | 9.61 | 9.70 |
| gpt4 alpaca lora_mlp | 65B | q5_K_M | 9.32 | 8.31 | 8.97 | 8.98 |
| guanaco | 65B | q4_K_M | 10.26 | 9.53 | 10.11 | 10.21 |
| guanaco | 65B | q5_K_M | 9.56 | 9.14 | 9.37 | 9.41 |

@x4080 commented Jun 28, 2023

@soleblaze Wow, you have the top-spec M2 Ultra, congrats. Do you know how it compares to using, say, an Nvidia 4090? Though maybe it can't even run on a 4090 because of the RAM requirements?

@philipturner commented Jun 28, 2023

$200 more for 5x less bandwidth. Not 5% less, 5x less.

| | 4 x RTX 4090 | M2 Ultra GPU | M2 Ultra ANE | A100 | H100 |
| --- | --- | --- | --- | --- | --- |
| Cost | $6400 | $6600 | $6600 | $10,000 | $40,000 |
| Dense FP16 TFLOPS | 1321.2 | 25.6 | 31.6 | 311.84 | 989.5 |
| Bandwidth | 4032 GB/s | 800 GB/s | 400 GB/s | 2039 GB/s | 3350 GB/s |
| RAM | 96 GB | 192 GB | 192 GB | 80 GB | 80 GB |

@soleblaze commented Jun 28, 2023

I think that cost comparison is a bit misleading, considering you'd also need a motherboard that can handle the 4 cards, two power supplies, two electrical circuits, fast RAM, and a CPU that won't bottleneck 4 cards. I'm also not sure where multi-GPU support stands on this and whether the cards would need to use the PCI bus to share a lot of data. That said, I would never argue that the M2 Ultra is the better buy for this use case.

The main thing the M2 has going for it is power efficiency and a small form factor. I'm guessing that if llama.cpp gets the MFA stuff philipturner is working on, it could hit 3080-4080 levels of performance. I should have my benchmark app at the point where it'd be useful to do a comparison when that happens.

It would be nice to put a 2x 3090 box on that list. IMO that’s the best performance per dollar and I’m not sure a realistic home use would go over the 48GB of ram. Plus you’d get nvlink support.

@philipturner

I think that cost comparison is a bit misleading, considering you’d also need a motherboard

Exactly. To buy into the CUDA ecosystem, you have to set up a Windows PC, with a massive box and 500 W power supply. I am all for using existing hardware, which I already own, to do the computations. Not for getting new hardware unless it can be built for free (my end goal with nanotech; build a personal supercluster).

It would be nice to put a 2x 3090 box on that list. IMO that’s the best performance per dollar and I’m not sure a realistic home use would go over the 48GB of ram. Plus you’d get nvlink support.

You're implying that a 2-GPU system costs $6,000, factoring in the CPU and box?

@ggerganov (Owner, Author)

Do 4x GPUs really offer 4x the bandwidth?

If I remember correctly, with multiple GPUs the inference speed does not seem to scale proportionally, although I haven't had the chance to test it (cc @JohannesGaessler)

@philipturner

You can shard the feedforward and attention layers straightforwardly. The bottleneck could be the latency-bound process of broadcasting the result vector to the peers for the next feedforward.

If I remember correctly, with multiple GPUs the inference speed does not seem to scale proportionally, although I haven't had the chance to test

Amdahl’s Law
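For reference, Amdahl's Law bounds the speedup from parallelizing a fraction $p$ of the runtime across $s$ devices:

$$S(s) = \frac{1}{(1 - p) + p/s}$$

so even with perfect scaling of the sharded matmuls, the serial fraction (including that broadcast) caps the end-to-end gain.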

@soleblaze commented Jun 28, 2023

You're implying that a 2-GPU system costs $6,000, factoring in the CPU and box?

Lol, no. Well, I'm sure there are some boutique builders with offerings at that level. It was more to show how much less that build would cost. I don't think you start hitting the large non-GPU costs of a build until you go to 3 or 4 GPUs.

I am really curious about the performance difference with multiple GPUs. It looks like RunPod goes up to 8x 4090; I'll put it on my list to look at. I'm guessing RunPod will be a good starting point to compare hardware performance differences.

@JohannesGaessler (Collaborator)

Do 4x GPUs really offer 4x the bandwidth?

You do get 4x the bandwidth; the problem is actually utilizing it. For large tensors like the matrix multiplications the scaling should be roughly linear, but for all of the small tensors the overhead from moving data between GPUs is larger than just doing the calculation on a single GPU. This limits how much of the program you can actually parallelize, and the parallelization itself introduces overhead (currently the CUDA synchronization logic for multi-GPU settings still needs a lot of optimization). One possible way to improve this would be to fuse tensors where applicable, so that you have one large tensor instead of many small tensors, which can then be handled more efficiently (this would be beneficial in general).

There is also the issue that writing code that utilizes multiple GPUs simply takes more work to develop and maintain; right now only matrix multiplications using the weights can be parallelized with the CUDA implementation (~67% of the runtime). The matrix multiplications using the KV cache could also be parallelized for another ~20% of the runtime.
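Plugging those fractions into Amdahl's Law above, and ignoring synchronization overhead, gives a rough ceiling for a 4-GPU setup:

$$S_{p=0.67} = \frac{1}{0.33 + 0.67/4} \approx 2.0, \qquad S_{p=0.87} = \frac{1}{0.13 + 0.87/4} \approx 2.9$$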

@philipturner

| | 2 x RTX 4090 | M2 Max GPU | M2 Max ANE |
| --- | --- | --- | --- |
| Cost | $3200 | $3000 | $3000 |
| Dense FP16 TFLOPS | 660.6 | 12.8 | 15.8 |
| Bandwidth | 2016 GB/s | 400 GB/s | 200 GB/s |
| RAM | 48 GB | 96 GB | 96 GB |

@soleblaze commented Jun 28, 2023

Since we're talking about performance and manufacturers' max specs, rather than asking my questions about what matters and what's worth benchmarking here, I opened a discussion: #2038. Can y'all give me input on it?

I think y'all have the most knowledge about what we actually care about, and I want people to have a way to run useful benchmarks on their systems rather than trusting that the manufacturer maximums are achievable (for instance, I don't really trust that 800 GB/s is achievable on Apple silicon Ultra chips for a single workload).

@okpatil4u

Does it really compare, though? You may have to add the performance of both the CPU and the Neural Engine of M2 Ultra systems together to benchmark bang-for-the-buck performance.

I am also not sure about the 660.6/2 TFLOPS FP16 figure. The Wikipedia page shows 82 TFLOPS at half precision. Maybe I am missing the source.

Same with the M2 Ultra: the Wikipedia page shows 27.2 TFLOPS FP32 performance. Does that mean that FP16 performance is 2 x 27.2 TFLOPS?

@philipturner

You may have to add the performance of both the CPU and the Neural Engine of M2 Ultra systems together to benchmark bang-for-the-buck performance.

They cannot be used simultaneously without a hideous latency.

I am also not sure about the 660.6/2 TFLOPS FP16 figure. The Wikipedia page shows 82 TFLOPS at half precision. Maybe I am missing the source.

330 is for tensor cores. 82.5 is for shader cores.

Same with the M2 Ultra: the Wikipedia page shows 27.2 TFLOPS FP32 performance. Does that mean that FP16 performance is 2 x 27.2 TFLOPS?

That is theoretical max ALU. The SIMD MATMUL FMADD32 instruction performs 32 float ops in 18 cycles, while the 16-bit version takes 17 cycles. So max FP32 matmul is 24.2 TFLOPS, max FP16 matmul is 25.6 TFLOPS.
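One way to reconcile those cycle counts with the quoted figures, assuming the 27.2 TFLOPS theoretical max corresponds to 32 FLOPs issued every 16 cycles:

$$27.2 \times \tfrac{16}{18} \approx 24.2 \ \text{TFLOPS (FP32 matmul)}, \qquad 27.2 \times \tfrac{16}{17} \approx 25.6 \ \text{TFLOPS (FP16 matmul)}$$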

@okpatil4u commented Jun 29, 2023 via email

@PaddyPatPat

Okpatil4u, the ggml.ai homepage shows a screen recording of "Simultaneously running 4 instances of 13B LLaMA + Whisper Small on a single M1 Pro". I take this to mean that you can run multiple models at once. I assume there are complications to this though.

@philipturner

It's actually quite efficient (theoretically) because you perform 4 batched inferences. The latency is the same as 1 inference, until the batch size becomes so large it's compute-bound.

@x4080 commented Aug 2, 2023

When using a Mac, is the "prompt" processing from the user less efficient compared to using NVIDIA? For example, in summarization the text that needs to be summarized is the prompt, and in my case (M2) it takes a long time to process, while the folks who have NVIDIA don't have that problem.

@philipturner

That's because Metal Performance Shaders has a very inefficient GEMM, which is a compute-bound operation. Token decoding (tokens/second) is a memory-bound operation.
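The split follows from arithmetic intensity: multiplying an $m \times k$ weight matrix by a $k \times n$ activation block costs about $2mkn$ FLOPs while reading at least the $mk$ weights, so the intensity is roughly $2n/b$ FLOPs per byte, where $b$ is bytes per weight. With $n = 1$ (decoding one token at a time) the GPU is bandwidth-starved no matter how good the GEMM is; with $n$ in the hundreds (prompt ingestion) the same multiply becomes compute-bound, which is exactly where an inefficient GEMM hurts.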

@ghost commented Aug 2, 2023

That's because Metal Performance Shaders has a very inefficient GEMM, which is a compute-bound operation. Token decoding (tokens/second) is a memory-bound operation.

Maybe the ANE, accessible through CoreML (ONNX, for example) or libane, could be used to accelerate GEMM?

@philipturner

The ANE is designed for convolutions, so its GEMM throughput is ~25% of the advertised TFLOPS. On everything besides A-series chips, it's slower than the GPU. The solution is a more performant GPU GEMM library.

@x4080 commented Aug 19, 2023

Can we use CLBlast to speed up prompt ingestion? It supports M1 and M2, apparently.

@ggerganov (Owner, Author)

Unlikely. The next performance jump will come from quantum matrix multiplication.

@philipturner

Can we use CLBlast to speed up prompt ingestion? It supports M1 and M2, apparently.

CLBlast is slower than Metal Performance Shaders. It is only able to reach 28% ALU utilization and is unable to use half precision.

The next performance jump will come from quantum matrix multiplication.

Yes, using quantum computers to multiply Hermitian matrices and solve eigenvalue problems in under $O(n^3)$ time. Until then, use GPUs to simulate quantum chemistry.
