whisper : Metal and ggml-alloc support #1270

Merged
merged 44 commits into from Sep 15, 2023

Conversation

@ggerganov ggerganov (Owner) commented Sep 10, 2023

This PR adds Metal support for full GPU inference on Apple Silicon.
It also optimizes memory usage.

  • Base model running on M2 Ultra using Metal:
metal-base-1.mp4
  • Medium model running on M2 Ultra using Metal:
metal-medium-1.mp4

Usage:

  • Full Metal inference (all processing is on the GPU)
make clean
make -j && ./main -m ./models/ggml-base.en.bin -f ./samples/gb0.wav
| CPU | OS | Config | Model | Th | Load [ms] | Encode [ms] | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- |
| M2 Ultra | MacOS 13.5.1 | Metal | tiny | 8 | 16.69 | 1.42 | 2.92 |
| M2 Ultra | MacOS 13.5.1 | Metal | base | 8 | 25.88 | 2.03 | 4.57 |
| M2 Ultra | MacOS 13.5.1 | Metal | small | 8 | 62.30 | 4.01 | 11.80 |
| M2 Ultra | MacOS 13.5.1 | Metal | medium | 8 | 160.75 | 8.26 | 29.41 |
| M2 Ultra | MacOS 13.5.1 | Metal | large | 8 | 268.36 | 12.55 | 52.22 |
  • CoreML Encoder + Metal Decoder
make clean
WHISPER_COREML=1 make -j && ./main -m ./models/ggml-base.en.bin -f ./samples/gb0.wav
| CPU | OS | Config | Model | Th | Load [ms] | Encode [ms] | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- |
| M2 Ultra | MacOS 13.5.1 | Core ML (ANE) | tiny | 8 | 24.41 | 1.42 | 2.91 |
| M2 Ultra | MacOS 13.5.1 | Core ML (ANE) | base | 8 | 46.50 | 2.02 | 4.72 |
| M2 Ultra | MacOS 13.5.1 | Core ML (ANE) | small | 8 | 137.58 | 3.99 | 11.91 |
| M2 Ultra | MacOS 13.5.1 | Core ML (ANE) | medium | 8 | 677.31 | 8.25 | 29.51 |
| M2 Ultra | MacOS 13.5.1 | Core ML (ANE) | large | 8 | 1823.94 | 12.74 | 52.40 |

| CPU | OS | Config | Model | Th | Load [ms] | Encode [ms] | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- |
| M2 Ultra | MacOS 13.5.1 | Core ML (GPU) | tiny | 8 | 6.72 | 1.42 | 3.15 |
| M2 Ultra | MacOS 13.5.1 | Core ML (GPU) | base | 8 | 11.31 | 2.03 | 4.42 |
| M2 Ultra | MacOS 13.5.1 | Core ML (GPU) | small | 8 | 32.81 | 4.00 | 11.79 |
| M2 Ultra | MacOS 13.5.1 | Core ML (GPU) | medium | 8 | 103.47 | 8.22 | 29.38 |
| M2 Ultra | MacOS 13.5.1 | Core ML (GPU) | large | 8 | 185.71 | 12.48 | 52.30 |

Review comment on whisper.cpp, lines 2779 to 2781 (outdated):
state->alloc_encode = ggml_allocr_new_measure(tensor_alignment);
state->alloc_encode_post = ggml_allocr_new_measure(tensor_alignment);
state->alloc_decode = ggml_allocr_new_measure(tensor_alignment);
Collaborator:

There is a chance that this will not work on some systems with limited virtual memory, such as iOS, because each measure allocator reserves a large amount of virtual memory. It would be safer to allocate only one measure allocator at a time; I think that should be possible here.
It's definitely not ideal that ggml-alloc has this limitation, I expect to improve this and remove the use of virtual memory entirely with the common backends interface implementation.
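
A rough sketch of the suggested one-at-a-time pattern (illustrative only, not the PR's code; it assumes the ggml-alloc API of that time, i.e. ggml_allocr_new_measure / ggml_allocr_alloc_graph / ggml_allocr_free, and the helper name and alignment padding are made up for the example):

```c
#include "ggml.h"
#include "ggml-alloc.h"

// Sketch: measure one graph at a time, freeing each measure allocator before
// creating the next one, so only a single large virtual-memory reservation
// exists at any moment.
static size_t measure_graph_mem(struct ggml_cgraph * gf, size_t tensor_alignment) {
    struct ggml_allocr * alloc = ggml_allocr_new_measure(tensor_alignment); // reserves vmem
    const size_t size = ggml_allocr_alloc_graph(alloc, gf);                 // required buffer size
    ggml_allocr_free(alloc);                                                // release before the next measure
    return size + tensor_alignment;                                         // small padding, illustrative
}
```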

@ggerganov ggerganov (Owner, Author) commented Sep 11, 2023:

I think I reorganized the allocators as proposed, but it seems some OSes still fail during the second new_measure - see linux/arm64 and linux/ppc64le in the CI: https://github.com/ggerganov/whisper.cpp/actions/runs/6146272809/job/16675319319

Collaborator:

I am a bit confused by this: it seems that the call to mmap is crashing the process instead of returning an error, because otherwise we should see the failed assert GGML_ASSERT(!"failed to allocate virtual memory for measure buffer");. I imagine this is related to QEMU; I'll try to reproduce it locally.

Collaborator:

I tried the exact same commands that the CI uses to run the arm64 version with Docker and QEMU, and it works on my computer. So whatever the issue is, it only seems to happen in the GitHub CI environment and I cannot reproduce it. Maybe it is hitting some memory usage limit.

$ sudo docker run --platform linux/arm64 --rm \
    -v /home/diego/code/whisper.cpp:/workspace \
    -w /workspace ubuntu:22.04 /bin/sh -c '
    apt update
    apt install -y build-essential cmake libsdl2-dev
    cmake . -DWHISPER_SUPPORT_SDL2=ON -DCMAKE_BUILD_TYPE=Release
    make
    ctest -L gh --output-on-failure'

[...]

-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Could NOT find Git (missing: GIT_EXECUTABLE)
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- CMAKE_SYSTEM_PROCESSOR: aarch64
-- ARM detected
-- Configuring done
-- Generating done
CMake Warning:
  Manually-specified variables were not used by the project:

    WHISPER_SUPPORT_SDL2


-- Build files have been written to: /workspace
[  7%] Building C object CMakeFiles/whisper.dir/ggml.c.o
[ 15%] Building C object CMakeFiles/whisper.dir/ggml-alloc.c.o
[ 23%] Building CXX object CMakeFiles/whisper.dir/whisper.cpp.o
[ 30%] Linking CXX shared library libwhisper.so
[ 30%] Built target whisper
[ 38%] Building CXX object examples/CMakeFiles/common.dir/common.cpp.o
[ 46%] Building CXX object examples/CMakeFiles/common.dir/common-ggml.cpp.o
[ 53%] Linking CXX static library libcommon.a
[ 53%] Built target common
[ 61%] Building CXX object examples/main/CMakeFiles/main.dir/main.cpp.o
[ 69%] Linking CXX executable ../../bin/main
[ 69%] Built target main
[ 76%] Building CXX object examples/bench/CMakeFiles/bench.dir/bench.cpp.o
[ 84%] Linking CXX executable ../../bin/bench
[ 84%] Built target bench
[ 92%] Building CXX object examples/quantize/CMakeFiles/quantize.dir/quantize.cpp.o
[100%] Linking CXX executable ../../bin/quantize
[100%] Built target quantize
Test project /workspace
    Start 1: test-main-tiny
1/2 Test #1: test-main-tiny ...................   Passed   82.79 sec
    Start 2: test-main-tiny.en
2/2 Test #2: test-main-tiny.en ................   Passed   83.46 sec

100% tests passed, 0 tests failed out of 2

Label Time Summary:
en      =  83.46 sec*proc (1 test)
gh      = 166.24 sec*proc (2 tests)
tiny    = 166.24 sec*proc (2 tests)

Collaborator:

A possible workaround could be reducing the amount of virtual memory allocated here:

whisper.cpp/ggml-alloc.c, lines 345 to 346 in d3b2dd4:

// 1TB for 64-bit, 1GB for 32-bit
*size = sizeof(void *) == 4 ? 1ULL<<30 : 1ULL<<40;

Owner (Author):

Ok, thanks for looking into this. I'll now continue working on this branch and try to find a solution

Owner (Author):

Reducing the size to 128GB fixes the CI: b19888c
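
Presumably the change is along these lines (a sketch mirroring the ggml-alloc.c snippet quoted above; the exact constant used in b19888c may differ):

```c
// 128GB for 64-bit, 1GB for 32-bit  (1ULL<<37 == 128 GiB)
*size = sizeof(void *) == 4 ? 1ULL<<30 : 1ULL<<37;
```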

Review comment on ggml-alloc.c:

static void * alloc_vmem(size_t size) {
#if defined(_WIN32)
    return VirtualAlloc(NULL, size, MEM_RESERVE, PAGE_NOACCESS);
#elif defined(_POSIX_MAPPED_FILES)
@slaren slaren (Collaborator) commented Sep 10, 2023:

If the emscripten build doesn't work, it can be excluded from using mmap here and in free_vmem by checking if __EMSCRIPTEN__ is defined. I think this should do it:

Suggested change:
- #elif defined(_POSIX_MAPPED_FILES)
+ #elif defined(_POSIX_MAPPED_FILES) && !defined(__EMSCRIPTEN__)

Owner (Author):

Good news - the Emscripten build looks to be working without any adjustments needed

@bobqianic (Collaborator):

I recently ran some performance tests on whisper.cpp and observed a significant drop in performance after the ggml sync. Do you think this PR could address that? I've already raised an issue about it.

@ggerganov (Owner, Author):

The ggml_nbytes() function does not work with transposed tensors:

whisper.cpp/ggml.c, lines 4304 to 4312 in 79a8805:

size_t ggml_nbytes(const struct ggml_tensor * tensor) {
    size_t nbytes = tensor->ne[0]*tensor->nb[0]/ggml_blck_size(tensor->type);
    for (int i = 1; i < GGML_MAX_DIMS; ++i) {
        nbytes += (tensor->ne[i] - 1)*tensor->nb[i];
    }
    return nbytes;
}

# cur:
ne0 = 512, ne1 = 1500, ne2 = 1, ne3 = 1
nb0 = 4, nb1 = 2048, nb2 = 3072000, nb3 = 3072000
ggml_nbytes = 3072000

# ggml_transpose(ctx, cur):
ne0 = 1500, ne1 = 512, ne2 = 1, ne3 = 1
nb0 = 2048, nb1 = 4, nb2 = 3072000, nb3 = 3072000
ggml_nbytes = 3074044

@slaren tagging you to keep this in mind


Owner (Author):

This one works, but it's kind of stupid to sort the elements each time.
Is there something better?

@slaren slaren (Collaborator) commented Sep 12, 2023:

So, the goal of my implementation was to calculate the offset of the last element plus one. However, the implementation assumes that nb[0] == type_size, so it doesn't work with transposed tensors. This should fix it for blck_size == 1:

size_t nbytes = ggml_type_size(tensor->type);
for (int i = 0; i < GGML_MAX_DIMS; ++i) { 
    nbytes += (tensor->ne[i] - 1)*tensor->nb[i]; 
}

However, this will not work with quantized types. A possible solution could be to fall back to the previous implementation for blck_size > 1, but it would be nicer to have a single implementation.

size_t ggml_nbytes(const struct ggml_tensor * tensor) { 
    size_t nbytes;
    size_t blck_size = ggml_blck_size(tensor->type);
    if (blck_size == 1) { 
        nbytes = ggml_type_size(tensor->type);
        for (int i = 0; i < GGML_MAX_DIMS; ++i) { 
            nbytes += (tensor->ne[i] - 1)*tensor->nb[i]; 
        }
    }
    else {
        nbytes = tensor->ne[0]*tensor->nb[0]/blck_size;
        for (int i = 1; i < GGML_MAX_DIMS; ++i) { 
            nbytes += (tensor->ne[i] - 1)*tensor->nb[i]; 
        }
    }
    return nbytes;
}
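
A minimal check of the numbers discussed above (a sketch, assuming the fixed ggml_nbytes() is in place; with the old implementation the transposed view would report 3074044 instead of 3072000):

```c
#include <stdio.h>
#include "ggml.h"

int main(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16*1024*1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true,   // only tensor metadata is needed for this check
    };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * cur   = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 512, 1500);
    struct ggml_tensor * cur_t = ggml_transpose(ctx, cur);

    // both should report 3072000 bytes (512*1500*4)
    printf("ggml_nbytes(cur)   = %zu\n", ggml_nbytes(cur));
    printf("ggml_nbytes(cur_t) = %zu\n", ggml_nbytes(cur_t));

    ggml_free(ctx);
    return 0;
}
```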

@ggerganov (Owner, Author) commented Sep 13, 2023

Ok, so running Core ML on the CPU + GPU is indeed faster than running it on the ANE (see updated times in the OP).

Also, there is some strange behavior where the first Core ML Encoder run after starting the process is comparable to the Metal version, but subsequent runs are about 2x faster. This happens only with CPU + GPU Core ML; it does not happen with ANE Core ML. I've updated the bench tool to take this into account and provide more accurate measurements by running a "pre-heat" encoder pass that is not measured (see the sketch below).
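
A rough sketch of the pre-heat idea (illustrative function and variable names, not the actual bench code; it assumes the mel spectrogram has already been computed, e.g. via whisper_pcm_to_mel):

```c
#include "whisper.h"

// Sketch: run one un-timed "pre-heat" encoder pass so that any one-time
// initialization work does not inflate the first measurement.
static void bench_encoder(struct whisper_context * ctx, int n_threads, int n_runs) {
    whisper_encode(ctx, 0, n_threads);       // warm-up pass, not measured
    whisper_reset_timings(ctx);              // discard the warm-up timings

    for (int i = 0; i < n_runs; ++i) {
        whisper_encode(ctx, 0, n_threads);   // measured passes
    }

    whisper_print_timings(ctx);
}
```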

I suppose that Core ML does some extra optimization on the first run, i.e. the first Core ML GPU run is similar in performance to my Metal implementation, but then it gets a 2x lead. It would be interesting to know whether whatever optimization occurs can be done manually in the Metal code. That would be a great benefit, with a potentially dramatic improvement in the Decoder, where we cannot use Core ML but we can use Metal.

Edit:

Here are some specific numbers with the Medium model:

| Mode | First run (ms) | Second run and after (ms) |
| --- | --- | --- |
| Metal | 284 | 217 |
| Core ML (GPU + CPU) | 355 | 103 |
| Core ML (ANE) | 744 | 676 |

The Metal version also gets a small boost after the first run - probably some caches get warmed up, etc. But it is not as significant as for Core ML (GPU). Also note that the ANE version does not get such a speed-up.
So I'm wondering whether whatever makes Core ML (GPU) go faster can somehow be replicated in Metal.

There is also the possible explanation that Core ML simply does some initialization work on the first run, which inflates the number, and hence there is no "optimization" going on. Anyway, any insight would be appreciated.

@nchudleigh (Contributor) commented Sep 14, 2023

The improvement over ANE is insanely impressive. I wonder if it will enable me to actually run the larger models on the M1 Pro with CoreML.

@jhen0409 (Sponsor, Contributor) commented Sep 14, 2023

I was very curious how this would perform on iOS, so I did some testing on it:

| Device (CPU) | Model | Load w/ Metal [ms] | Full w/ Metal [ms] | Load w/ Core ML (ANE) [ms] | Full w/ Core ML (ANE) [ms] |
| --- | --- | --- | --- | --- | --- |
| iPhone 13 Pro Max (A15) | tiny.en | 153 | 574 | 1160 | 239 |
| iPhone 13 Pro Max (A15) | base.en | 151 | 736 | 1626 | 369 |
| iPhone 13 Pro Max (A15) | small.en | 482 | 1871 | 6371 | 962 |
| iPhone 13 Pro Max (A15) | medium.en | 1295 | 4275 | 21136 | 2790 |
| iPhone 13 Pro Max (A15) | large | (crash) | (crash) | (crash) | (crash) |
| iPad Air 5 (M1) | tiny.en | 105 | 327 | 4357 | 149 |
| iPad Air 5 (M1) | base.en | 139 | 533 | (stuck) | (unknown) |
| iPad Air 5 (M1) | small.en | 247 | 1280 | (stuck) | (unknown) |
| iPad Air 5 (M1) | medium.en | 564 | 2919 | (stuck) | (unknown) |
| iPad Air 5 (M1) | large | 5118 | 5275 | (stuck) | (unknown) |

iOS 16.6.1, using jkv.wav; the Full time is from the second run.

In mybigday/whisper.rn#123 I use commit f408c64; the difference is that I used ARC in ggml-metal.m and removed some release code. An app archive would be faster than this build, but currently I use an Xcode build to test more easily.

The iOS devices have relatively few GPU cores, which I think is why there is a gap on iPhone. I don't know why Core ML is not working on my M1 iPad - I hadn't enabled it in production before, so I didn't notice this earlier.

What I love about the Metal backend is that it uses less disk space & memory, which means we can use larger models in real-world scenarios. (Maybe we could consider supporting Core ML ops for GGML instead of loading an mlmodelc in the future.)

UPDATE: Load & Full w/ Core ML (CPU+GPU) [ms]

| Device (CPU) | Model | Load w/ Core ML (CPU+GPU) [ms] | Full w/ Core ML (CPU+GPU) [ms] |
| --- | --- | --- | --- |
| iPhone 13 Pro Max (A15) | tiny.en | 237 | 252 |
| iPhone 13 Pro Max (A15) | base.en | 268 | 402 |
| iPhone 13 Pro Max (A15) | small.en | 444 | 1148 |
| iPhone 13 Pro Max (A15) | medium.en | 3604 | (crash) |
| iPhone 13 Pro Max (A15) | large | (crash) | (crash) |
| iPad Air 5 (M1) | tiny.en | 155 | 270 183 |
| iPad Air 5 (M1) | base.en | 218 | 254 |
| iPad Air 5 (M1) | small.en | 457 | 729 |
| iPad Air 5 (M1) | medium.en | 4111 | 2316 |
| iPad Air 5 (M1) | large | 10117 | (crash) |

Here too the second run is faster, but the performance is still not better than ANE on iPhone.

@ggerganov (Owner, Author):

@jhen0409 Thank you very much for the results!

I've further improved the Metal inference and now it is as fast as Core ML (GPU). I still have one kernel that is not optimized (the convolution at the start of the Encoder), which explains the remaining difference between Metal and Core ML (GPU). I'm now satisfied with the results, and indeed Core ML does not do any extra optimizations - it is just slower on the first run because it probably initializes some internal things.

The optimization that I did is to pad matrix multiplications whose row dimension is not a multiple of 32. It's a very simple code change that can provide significant benefits for Metal. It should be straightforward to apply to llama.cpp as well to gain some extra performance:

whisper.cpp/whisper.cpp, lines 138 to 171 in 2b4160a:

// faster matrix multiplications for tensors that do not have dimension 0 divisible by "pad"
// the idea is to represent the original matrix multiplication:
//
// Z = X @ Y
//
// with two matrix multiplications:
//
// Z = [X_0; X_1] @ [Y_0; Y_1]
//
// here X_0 and Y_0 are views of X and Y that have dimension 0 divisible by "pad"
// and X_1 and Y_1 are the remaining views. X_1 and Y_1 end up being small matrices that can be processed with more
// general-purpose kernels
//
static struct ggml_tensor * ggml_mul_mat_pad(struct ggml_context * ctx, struct ggml_tensor * x, struct ggml_tensor * y, int pad = 32) {
//#if !defined(GGML_USE_METAL)
//    return ggml_mul_mat(ctx, x, y);
//#endif

    if (x->ne[0] % pad == 0 || x->ne[0] / pad < 2) {
        return ggml_mul_mat(ctx, x, y);
    }

    struct ggml_tensor * x_0 = ggml_view_3d(ctx, x, (x->ne[0]/pad)*pad, x->ne[1], x->ne[2], x->nb[1], x->nb[2], 0);
    struct ggml_tensor * x_1 = ggml_view_3d(ctx, x,  x->ne[0]%pad,      x->ne[1], x->ne[2], x->nb[1], x->nb[2], x_0->ne[0]*x_0->nb[0]);

    struct ggml_tensor * y_0 = ggml_view_3d(ctx, y, (y->ne[0]/pad)*pad, y->ne[1], y->ne[2], y->nb[1], y->nb[2], 0);
    struct ggml_tensor * y_1 = ggml_view_3d(ctx, y,  y->ne[0]%pad,      y->ne[1], y->ne[2], y->nb[1], y->nb[2], y_0->ne[0]*y_0->nb[0]);

    return ggml_add(ctx,
            ggml_mul_mat(ctx, x_0, y_0),
            ggml_mul_mat(ctx, x_1, y_1));
}
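
A hypothetical usage sketch (the tensor names below are illustrative; the macro trick shown is one possible way to route existing ggml_mul_mat calls through the padded helper, not necessarily what the PR does):

```cpp
// Defined *after* ggml_mul_mat_pad so the helper's own internal ggml_mul_mat
// calls are unaffected; from this point on, every ggml_mul_mat call in the
// graph builds is padded transparently.
#define ggml_mul_mat ggml_mul_mat_pad

// ... later, inside a graph build (illustrative names):
struct ggml_tensor * Qcur = ggml_mul_mat(ctx0, layer.attn_q_w, cur);
```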

@nchudleigh (Contributor) commented Sep 14, 2023

On an M1 Pro with 32 GB RAM, against a 40s recording:

The improvement on medium size and up is staggering. Compared to master, we are seeing a 2x speedup; compared to yesterday, ~1 second.

| Commit | Model | Thread | Processor Count | Load Time | Sample Time | Encode Time | Decode Time | Sample Time per Run | Encode Time per Run | Decode Time per Run | Total Time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| master | ggml-tiny.en.bin | 4 | 1 | 54.23 | 81.29 | 263.68 | 274.8 | 0.4234 | 131.84 | 1.4313 | 708.43 |
| master | ggml-base.en.bin | 4 | 1 | 77.24 | 79.6 | 518.94 | 501.9 | 0.4124 | 259.47 | 2.6005 | 1223.97 |
| master | ggml-small.en.bin | 4 | 1 | 232.09 | 81.2 | 2248.65 | 1424.05 | 0.4207 | 1124.325 | 7.3785 | 4034.77 |
| master | ggml-medium.bin | 4 | 1 | 618.91 | 82.61 | 7265.93 | 3383.07 | 0.4193 | 3632.965 | 17.1729 | 11418.43 |
| master | ggml-medium.en.bin | 4 | 1 | 585.99 | 78.57 | 7404.29 | 3389.91 | 0.4389 | 3702.145 | 18.9380 | 11525.09 |
| master | ggml-large.bin | 4 | 1 | 1108.13 | 79.07 | 12743.14 | 5901.58 | 0.4251 | 6371.57 | 31.7289 | 19960.5 |
| 8e8daa8 | ggml-tiny.en.bin | 4 | 1 | 51.78 | 77.05 | 143.27 | 355.16 | 0.4013 | 71.635 | 1.8498 | 675.18 |
| 8e8daa8 | ggml-base.en.bin | 4 | 1 | 79.24 | 77.64 | 275.19 | 532.86 | 0.4023 | 137.595 | 2.7609 | 1027.55 |
| 8e8daa8 | ggml-small.en.bin | 4 | 1 | 219.28 | 77.51 | 782.83 | 1204.21 | 0.4016 | 391.415 | 6.2394 | 2344.08 |
| 8e8daa8 | ggml-medium.en.bin | 4 | 1 | 577.15 | 74.32 | 2117.78 | 2538.31 | 0.4152 | 1058.89 | 14.1805 | 5384.62 |
| 8e8daa8 | ggml-medium.bin | 4 | 1 | 543.54 | 79.18 | 2120.5 | 2834.63 | 0.4019 | 1060.25 | 14.3890 | 5653.88 |
| 8e8daa8 | ggml-large.bin | 4 | 1 | 1755.19 | 76.37 | 3711.21 | 4318.31 | 0.4106 | 1855.605 | 23.2167 | 9984.53 |
| 2b4160a | ggml-tiny.en.bin | 4 | 1 | 50.08 | 78.1 | 94.62 | 320.86 | 0.4068 | 47.31 | 1.6799 | 593.96 |
| 2b4160a | ggml-base.en.bin | 4 | 1 | 85.11 | 79.58 | 172.43 | 486.06 | 0.4123 | 86.215 | 2.5316 | 1702.74 |
| 2b4160a | ggml-small.en.bin | 4 | 1 | 212.96 | 77.66 | 496.48 | 1118.51 | 0.4024 | 248.24 | 5.8256 | 1995.0 |
| 2b4160a | ggml-medium.en.bin | 4 | 1 | 555.72 | 75.79 | 1338.75 | 2378.07 | 0.4234 | 669.375 | 13.3599 | 4503.9 |
| 2b4160a | ggml-medium.bin | 4 | 1 | 593.97 | 80.49 | 1334.06 | 2598.74 | 0.4086 | 667.03 | 13.3269 | 4792.06 |
| 2b4160a | ggml-large.bin | 4 | 1 | 1831.34 | 77.67 | 2477.62 | 4044.54 | 0.4176 | 1238.81 | 21.9812 | 8704.82 |


@colinc (Contributor) commented Sep 14, 2023

@ggerganov This is amazing! Thank you so, so much!

On an M2 Ultra (76-core), I'm now seeing 25x realtime for the medium model (~2 minutes to process a ~50-minute audio file, including diarization).
But, more impressively, on a basic M2 (10-core) I'm seeing ~9.4x realtime for the medium model (~5 minutes to process a ~50-minute audio file, including diarization).

And, most importantly, really good transcription results so far (especially vs previous results).

@pudepiedj commented Sep 18, 2023

These are the timings I got with what I think are the current master model versions, for ./samples/gb0.wav (127.4s), on an M2 MAX with 32 GB RAM, MacOS 13.5.1, 8 threads (all timings in [ms]). Maybe I have not done something right, because the encoding and decoding times seem quite slow compared with those @ggerganov posted for an M2 Ultra on X (formerly Twitter):

| Model | Load | Encode | Decode | Total |
| --- | --- | --- | --- | --- |
| ggml-tiny.en | 49 | 107 | 607 | 1022 |
| ggml-base.en | 65 | 193 | 948 | 1557 |
| ggml-medium.en | 427 | 1353 | 4980 | 7259 |
| ggml-large | 993 | 2372 | 7695 | 11735 |

didzis pushed a commit to didzis/whisper.cpp that referenced this pull request Sep 30, 2023
* metal : init

* whisper : factor out graph builds

* whisper : allocate encoder and decoder using ggml-alloc

* whisper : ggml-alloc is now supported

* whisper : CoreML support ggml-alloc

* build : fix ggml-alloc

* ios : update submodule

* extra : update sync-ggml.sh script to also sync ggml-alloc

* ci : see if this is causing the crash

* whisper : refactor ggml-alloc init

* whisper.android : try to fix build

* whisper : initial Metal version

* ci : try to debug vmem issue

* metal : decoder works on GPU!

* metal : add multi-decoder support

* ggml : fix ggml_nbytes (probably temp solution)

* metal : run "cross" step on the GPU

* whisper : remove ggml_repeat in the encoder

* whisper : offload the Encoder to Metal

* ggml : use simpler ggml_bytes() implementation

* ggml-alloc : try to make CI happy by reducing vram to 128GB

* whisper : add whisper_allocr to wrap ggml_allocr

* whisper : factor out alloc init in a function

* cmake : update to support Metal build

* whisper : add <functional> header

* objc : fix build (no Metal yet)

* ios : add Metal support

* swiftui : fix build

* metal : speed-up KQ multiplication

* metal : sync latest llama.cpp kernels

* readme : add Metal info

* ios : update submodule

* coreml : add code to toggle Core ML config (CPU, ANE, GPU)

* bench : fix timings by running a pre-heat

* bench : start benching the decoder

* whisper : add ggml_mul_mat_pad

* bench : fix uninitialized vars

* whisper : add comment for disabling mul-mat padding

* whisper : add description of ggml_mul_mat_pad

* whisper : clean-up ggml_mul_mat_pad

* metal : remove the "concurrent" flag

* bench : variable n_past

* ios : update SPM package
jacobwu-b pushed a commit to jacobwu-b/Transcriptify-by-whisper.cpp that referenced this pull request Oct 24, 2023
vonstring pushed a commit to vonstring/whisper.cpp that referenced this pull request Nov 7, 2023
landtanin pushed a commit to landtanin/whisper.cpp that referenced this pull request Dec 16, 2023
iThalay pushed a commit to iThalay/whisper.cpp that referenced this pull request Sep 23, 2024