whisper : Metal and ggml-alloc support #1270

Merged
merged 44 commits into from Sep 15, 2023

Conversation

@ggerganov ggerganov (Owner) commented Sep 10, 2023

This PR adds Metal support for full GPU inference on Apple Silicon.
It also optimizes memory usage.

  • Base model running on M2 Ultra using Metal:
metal-base-1.mp4
  • Medium model running on M2 Ultra using Metal:
metal-medium-1.mp4

Usage:

  • Full Metal inference (all processing is on the GPU)
make clean
make -j && ./main -m ./models/ggml-base.en.bin -f ./samples/gb0.wav
| CPU | OS | Config | Model | Th | Load [ms] | Encode [ms] | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- |
| M2 Ultra | MacOS 13.5.1 | Metal | tiny | 8 | 16.69 | 1.42 | 2.92 |
| M2 Ultra | MacOS 13.5.1 | Metal | base | 8 | 25.88 | 2.03 | 4.57 |
| M2 Ultra | MacOS 13.5.1 | Metal | small | 8 | 62.30 | 4.01 | 11.80 |
| M2 Ultra | MacOS 13.5.1 | Metal | medium | 8 | 160.75 | 8.26 | 29.41 |
| M2 Ultra | MacOS 13.5.1 | Metal | large | 8 | 268.36 | 12.55 | 52.22 |
  • CoreML Encoder + Metal Decoder
make clean
WHISPER_COREML=1 make -j && ./main -m ./models/ggml-base.en.bin -f ./samples/gb0.wav
| CPU | OS | Config | Model | Th | Load [ms] | Encode [ms] | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- |
| M2 Ultra | MacOS 13.5.1 | Core ML (ANE) | tiny | 8 | 24.41 | 1.42 | 2.91 |
| M2 Ultra | MacOS 13.5.1 | Core ML (ANE) | base | 8 | 46.50 | 2.02 | 4.72 |
| M2 Ultra | MacOS 13.5.1 | Core ML (ANE) | small | 8 | 137.58 | 3.99 | 11.91 |
| M2 Ultra | MacOS 13.5.1 | Core ML (ANE) | medium | 8 | 677.31 | 8.25 | 29.51 |
| M2 Ultra | MacOS 13.5.1 | Core ML (ANE) | large | 8 | 1823.94 | 12.74 | 52.40 |

| CPU | OS | Config | Model | Th | Load [ms] | Encode [ms] | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- |
| M2 Ultra | MacOS 13.5.1 | Core ML (GPU) | tiny | 8 | 6.72 | 1.42 | 3.15 |
| M2 Ultra | MacOS 13.5.1 | Core ML (GPU) | base | 8 | 11.31 | 2.03 | 4.42 |
| M2 Ultra | MacOS 13.5.1 | Core ML (GPU) | small | 8 | 32.81 | 4.00 | 11.79 |
| M2 Ultra | MacOS 13.5.1 | Core ML (GPU) | medium | 8 | 103.47 | 8.22 | 29.38 |
| M2 Ultra | MacOS 13.5.1 | Core ML (GPU) | large | 8 | 185.71 | 12.48 | 52.30 |

Review comment on whisper.cpp, lines 2779 to 2781 (outdated):
state->alloc_encode = ggml_allocr_new_measure(tensor_alignment);
state->alloc_encode_post = ggml_allocr_new_measure(tensor_alignment);
state->alloc_decode = ggml_allocr_new_measure(tensor_alignment);
Collaborator:

There is a chance that this will not work on some systems with limited virtual memory, such as iOS, because each measure allocator reserves a large amount of virtual memory. It would be safer to allocate only one measure allocator at a time; I think that should be possible here.
It's definitely not ideal that ggml-alloc has this limitation, I expect to improve this and remove the use of virtual memory entirely with the common backends interface implementation.
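
A rough sketch of the suggested one-at-a-time pattern (illustrative only, not the PR's code; it assumes the ggml-alloc API of that time, i.e. ggml_allocr_new_measure / ggml_allocr_alloc_graph / ggml_allocr_free, and the helper name and alignment padding are made up for the example):

```c
#include "ggml.h"
#include "ggml-alloc.h"

// Sketch: measure one graph at a time, freeing each measure allocator before
// creating the next one, so only a single large virtual-memory reservation
// exists at any moment.
static size_t measure_graph_mem(struct ggml_cgraph * gf, size_t tensor_alignment) {
    struct ggml_allocr * alloc = ggml_allocr_new_measure(tensor_alignment); // reserves vmem
    const size_t size = ggml_allocr_alloc_graph(alloc, gf);                 // required buffer size
    ggml_allocr_free(alloc);                                                // release before the next measure
    return size + tensor_alignment;                                         // small padding, illustrative
}
```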

@ggerganov ggerganov (Owner, Author) commented Sep 11, 2023:

I think I reorganized the allocators as proposed, but it seems some OSes still fail during the second new_measure - see linux/arm64 and linux/ppc64le in the CI: https://github.com/ggerganov/whisper.cpp/actions/runs/6146272809/job/16675319319

Collaborator:

I am a bit confused by this: it seems that the call to mmap is crashing the process instead of returning an error, because otherwise we should see the failed assert GGML_ASSERT(!"failed to allocate virtual memory for measure buffer");. I imagine this is related to QEMU; I'll try to reproduce it locally.

Collaborator:

I tried the exact same commands that the CI uses to run the arm64 version with Docker and QEMU, and it works on my computer. So whatever the issue is, it only seems to happen in the GitHub CI environment and I cannot reproduce it. Maybe it is hitting some memory usage limit.

$ sudo docker run --platform linux/arm64 --rm \
    -v /home/diego/code/whisper.cpp:/workspace \
    -w /workspace ubuntu:22.04 /bin/sh -c '
    apt update
    apt install -y build-essential cmake libsdl2-dev
    cmake . -DWHISPER_SUPPORT_SDL2=ON -DCMAKE_BUILD_TYPE=Release
    make
    ctest -L gh --output-on-failure'

[...]

-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Could NOT find Git (missing: GIT_EXECUTABLE)
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- CMAKE_SYSTEM_PROCESSOR: aarch64
-- ARM detected
-- Configuring done
-- Generating done
CMake Warning:
  Manually-specified variables were not used by the project:

    WHISPER_SUPPORT_SDL2


-- Build files have been written to: /workspace
[  7%] Building C object CMakeFiles/whisper.dir/ggml.c.o
[ 15%] Building C object CMakeFiles/whisper.dir/ggml-alloc.c.o
[ 23%] Building CXX object CMakeFiles/whisper.dir/whisper.cpp.o
[ 30%] Linking CXX shared library libwhisper.so
[ 30%] Built target whisper
[ 38%] Building CXX object examples/CMakeFiles/common.dir/common.cpp.o
[ 46%] Building CXX object examples/CMakeFiles/common.dir/common-ggml.cpp.o
[ 53%] Linking CXX static library libcommon.a
[ 53%] Built target common
[ 61%] Building CXX object examples/main/CMakeFiles/main.dir/main.cpp.o
[ 69%] Linking CXX executable ../../bin/main
[ 69%] Built target main
[ 76%] Building CXX object examples/bench/CMakeFiles/bench.dir/bench.cpp.o
[ 84%] Linking CXX executable ../../bin/bench
[ 84%] Built target bench
[ 92%] Building CXX object examples/quantize/CMakeFiles/quantize.dir/quantize.cpp.o
[100%] Linking CXX executable ../../bin/quantize
[100%] Built target quantize
Test project /workspace
    Start 1: test-main-tiny
1/2 Test #1: test-main-tiny ...................   Passed   82.79 sec
    Start 2: test-main-tiny.en
2/2 Test #2: test-main-tiny.en ................   Passed   83.46 sec

100% tests passed, 0 tests failed out of 2

Label Time Summary:
en      =  83.46 sec*proc (1 test)
gh      = 166.24 sec*proc (2 tests)
tiny    = 166.24 sec*proc (2 tests)

Collaborator:

A possible workaround could be reducing the amount of virtual memory allocated here:

whisper.cpp/ggml-alloc.c, lines 345 to 346 in d3b2dd4:

// 1TB for 64-bit, 1GB for 32-bit
*size = sizeof(void *) == 4 ? 1ULL<<30 : 1ULL<<40;

Owner (Author):

Ok, thanks for looking into this. I'll now continue working on this branch and try to find a solution

Owner (Author):

Reducing the size to 128GB fixes the CI: b19888c
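
Presumably the change is along these lines (a sketch mirroring the ggml-alloc.c snippet quoted above; the exact constant used in b19888c may differ):

```c
// 128GB for 64-bit, 1GB for 32-bit  (1ULL<<37 == 128 GiB)
*size = sizeof(void *) == 4 ? 1ULL<<30 : 1ULL<<37;
```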

Review comment on ggml-alloc.c:

static void * alloc_vmem(size_t size) {
#if defined(_WIN32)
    return VirtualAlloc(NULL, size, MEM_RESERVE, PAGE_NOACCESS);
#elif defined(_POSIX_MAPPED_FILES)
@slaren slaren (Collaborator) commented Sep 10, 2023:

If the emscripten build doesn't work, it can be excluded from using mmap here and in free_vmem by checking if __EMSCRIPTEN__ is defined. I think this should do it:

Suggested change:
- #elif defined(_POSIX_MAPPED_FILES)
+ #elif defined(_POSIX_MAPPED_FILES) && !defined(__EMSCRIPTEN__)

Owner (Author):

Good news - the Emscripten build looks to be working without any adjustments needed

@bobqianic (Collaborator):

I recently ran some performance tests on whisper.cpp and observed a significant drop in performance after the ggml sync. Do you think this PR could address that? I've already raised an issue about it.

@ggerganov (Owner, Author):

The ggml_nbytes() function does not work with transposed tensors:

whisper.cpp/ggml.c, lines 4304 to 4312 in 79a8805:

size_t ggml_nbytes(const struct ggml_tensor * tensor) {
    size_t nbytes = tensor->ne[0]*tensor->nb[0]/ggml_blck_size(tensor->type);
    for (int i = 1; i < GGML_MAX_DIMS; ++i) {
        nbytes += (tensor->ne[i] - 1)*tensor->nb[i];
    }
    return nbytes;
}

# cur:
ne0 = 512, ne1 = 1500, ne2 = 1, ne3 = 1
nb0 = 4, nb1 = 2048, nb2 = 3072000, nb3 = 3072000
ggml_nbytes = 3072000

# ggml_transpose(ctx, cur):
ne0 = 1500, ne1 = 512, ne2 = 1, ne3 = 1
nb0 = 2048, nb1 = 4, nb2 = 3072000, nb3 = 3072000
ggml_nbytes = 3074044

@slaren tagging you to keep this in mind


Owner (Author):

This one works, but it's kind of stupid to sort the elements each time.
Is there something better?

@slaren slaren (Collaborator) commented Sep 12, 2023:

So, the goal of my implementation was to calculate the offset of the last element plus one. However, the implementation assumes that nb[0] == type_size, so it doesn't work with transposed tensors. This should fix it for blck_size == 1:

size_t nbytes = ggml_type_size(tensor->type);
for (int i = 0; i < GGML_MAX_DIMS; ++i) { 
    nbytes += (tensor->ne[i] - 1)*tensor->nb[i]; 
}

However, this will not work with quantized types. A possible solution could be to fall back to the previous implementation for blck_size > 1, but it would be nicer to have a single implementation.

size_t ggml_nbytes(const struct ggml_tensor * tensor) { 
    size_t nbytes;
    size_t blck_size = ggml_blck_size(tensor->type);
    if (blck_size == 1) { 
        nbytes = ggml_type_size(tensor->type);
        for (int i = 0; i < GGML_MAX_DIMS; ++i) { 
            nbytes += (tensor->ne[i] - 1)*tensor->nb[i]; 
        }
    }
    else {
        nbytes = tensor->ne[0]*tensor->nb[0]/blck_size;
        for (int i = 1; i < GGML_MAX_DIMS; ++i) { 
            nbytes += (tensor->ne[i] - 1)*tensor->nb[i]; 
        }
    }
    return nbytes;
}
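
A minimal check of the numbers discussed above (a sketch, assuming the fixed ggml_nbytes() is in place; with the old implementation the transposed view would report 3074044 instead of 3072000):

```c
#include <stdio.h>
#include "ggml.h"

int main(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16*1024*1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true,   // only tensor metadata is needed for this check
    };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * cur   = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 512, 1500);
    struct ggml_tensor * cur_t = ggml_transpose(ctx, cur);

    // both should report 3072000 bytes (512*1500*4)
    printf("ggml_nbytes(cur)   = %zu\n", ggml_nbytes(cur));
    printf("ggml_nbytes(cur_t) = %zu\n", ggml_nbytes(cur_t));

    ggml_free(ctx);
    return 0;
}
```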

@ggerganov (Owner, Author) commented Sep 13, 2023

Ok, so running Core ML on the CPU + GPU is indeed faster than running it on the ANE (see updated times in the OP).

Also, there is some strange behavior where the first Core ML Encoder run after starting the process is comparable to the Metal version, but subsequent runs are about 2x faster. This happens only with CPU + GPU Core ML; it does not happen with ANE Core ML. I've updated the bench tool to take this into account and provide more accurate measurements by running a "pre-heat" encoder pass that is not measured (see the sketch below).
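
A rough sketch of the pre-heat idea (illustrative function and variable names, not the actual bench code; it assumes the mel spectrogram has already been computed, e.g. via whisper_pcm_to_mel):

```c
#include "whisper.h"

// Sketch: run one un-timed "pre-heat" encoder pass so that any one-time
// initialization work does not inflate the first measurement.
static void bench_encoder(struct whisper_context * ctx, int n_threads, int n_runs) {
    whisper_encode(ctx, 0, n_threads);       // warm-up pass, not measured
    whisper_reset_timings(ctx);              // discard the warm-up timings

    for (int i = 0; i < n_runs; ++i) {
        whisper_encode(ctx, 0, n_threads);   // measured passes
    }

    whisper_print_timings(ctx);
}
```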

I suppose that Core ML does some extra optimization on the first run, i.e. the first Core ML GPU run is similar in performance to my Metal implementation, but then it gets a 2x lead. It would be interesting to know whether whatever optimization occurs can be done manually in the Metal code. That would be a great benefit, with a potentially dramatic improvement in the Decoder, where we cannot use Core ML but we can use Metal.

Edit:

Here are some specific numbers with the Medium model:

| Mode | First run (ms) | Second run and after (ms) |
| --- | --- | --- |
| Metal | 284 | 217 |
| Core ML (GPU + CPU) | 355 | 103 |
| Core ML (ANE) | 744 | 676 |

The Metal version also gets a small boost after the first run - probably some caches get warmed up, etc. But it is not as significant as for Core ML (GPU). Also note that the ANE version does not get such a speed-up.
So I'm wondering whether whatever makes Core ML (GPU) go faster can somehow be replicated in Metal.

There is also the possible explanation that Core ML simply does some initialization work on the first run, which inflates the number, and hence there is no "optimization" going on. Anyway, any insight would be appreciated.

@nchudleigh (Contributor) commented Sep 14, 2023

The improvement over ANE is insanely impressive. I wonder if it will enable me to actually run the larger models on the M1 Pro with CoreML.

@jhen0409 (Sponsor, Contributor) commented Sep 14, 2023

I was very curious how this would perform on iOS, so I did some testing on it:

| Device (CPU) | Model | Load w/ Metal [ms] | Full w/ Metal [ms] | Load w/ Core ML (ANE) [ms] | Full w/ Core ML (ANE) [ms] |
| --- | --- | --- | --- | --- | --- |
| iPhone 13 Pro Max (A15) | tiny.en | 153 | 574 | 1160 | 239 |
| iPhone 13 Pro Max (A15) | base.en | 151 | 736 | 1626 | 369 |
| iPhone 13 Pro Max (A15) | small.en | 482 | 1871 | 6371 | 962 |
| iPhone 13 Pro Max (A15) | medium.en | 1295 | 4275 | 21136 | 2790 |
| iPhone 13 Pro Max (A15) | large | (crash) | (crash) | (crash) | (crash) |
| iPad Air 5 (M1) | tiny.en | 105 | 327 | 4357 | 149 |
| iPad Air 5 (M1) | base.en | 139 | 533 | (stuck) | (unknown) |
| iPad Air 5 (M1) | small.en | 247 | 1280 | (stuck) | (unknown) |
| iPad Air 5 (M1) | medium.en | 564 | 2919 | (stuck) | (unknown) |
| iPad Air 5 (M1) | large | 5118 | 5275 | (stuck) | (unknown) |

iOS 16.6.1, using jkv.wav; the Full time is from the second run.

In mybigday/whisper.rn#123 I use commit f408c64; the difference is that I used ARC in ggml-metal.m and removed some release code. An app archive would be faster than this build, but currently I use an Xcode build to test more easily.

The iOS devices have relatively few GPU cores, which I think is why there is a gap on iPhone. I don't know why Core ML is not working on my M1 iPad - I hadn't enabled it in production before, so I didn't notice this earlier.

What I love about the Metal backend is that it uses less disk space & memory, which means we can use larger models in real-world scenarios. (Maybe we could consider supporting Core ML ops for GGML instead of loading an mlmodelc in the future.)

UPDATE: Load & Full w/ Core ML (CPU+GPU) [ms]

| Device (CPU) | Model | Load w/ Core ML (CPU+GPU) [ms] | Full w/ Core ML (CPU+GPU) [ms] |
| --- | --- | --- | --- |
| iPhone 13 Pro Max (A15) | tiny.en | 237 | 252 |
| iPhone 13 Pro Max (A15) | base.en | 268 | 402 |
| iPhone 13 Pro Max (A15) | small.en | 444 | 1148 |
| iPhone 13 Pro Max (A15) | medium.en | 3604 | (crash) |
| iPhone 13 Pro Max (A15) | large | (crash) | (crash) |
| iPad Air 5 (M1) | tiny.en | 155 | 270 183 |
| iPad Air 5 (M1) | base.en | 218 | 254 |
| iPad Air 5 (M1) | small.en | 457 | 729 |
| iPad Air 5 (M1) | medium.en | 4111 | 2316 |
| iPad Air 5 (M1) | large | 10117 | (crash) |

Here too the second run is faster, but the performance is still not better than ANE on iPhone.

@ggerganov (Owner, Author):

@jhen0409 Thank you very much for the results!

I've further improved the Metal inference and now it is as fast as Core ML (GPU). I still have one kernel that is not optimized (the convolution at the start of the Encoder), which explains the remaining difference between Metal and Core ML (GPU). I'm now satisfied with the results, and indeed Core ML does not do any extra optimizations - it is just slower on the first run because it probably initializes some internal things.

The optimization that I did is to pad matrix multiplications whose row dimension is not a multiple of 32. It's a very simple code change that can provide significant benefits for Metal. It should be straightforward to apply to llama.cpp as well to gain some extra performance:

whisper.cpp/whisper.cpp, lines 138 to 171 in 2b4160a:

// faster matrix multiplications for tensors that do not have dimension 0 divisible by "pad"
// the idea is to represent the original matrix multiplication:
//
// Z = X @ Y
//
// with two matrix multiplications:
//
// Z = [X_0; X_1] @ [Y_0; Y_1]
//
// here X_0 and Y_0 are views of X and Y that have dimension 0 divisible by "pad"
// and X_1 and Y_1 are the remaining views. X_1 and Y_1 end up being small matrices that can be processed with more
// general-purpose kernels
//
static struct ggml_tensor * ggml_mul_mat_pad(struct ggml_context * ctx, struct ggml_tensor * x, struct ggml_tensor * y, int pad = 32) {
//#if !defined(GGML_USE_METAL)
//    return ggml_mul_mat(ctx, x, y);
//#endif

    if (x->ne[0] % pad == 0 || x->ne[0] / pad < 2) {
        return ggml_mul_mat(ctx, x, y);
    }

    struct ggml_tensor * x_0 = ggml_view_3d(ctx, x, (x->ne[0]/pad)*pad, x->ne[1], x->ne[2], x->nb[1], x->nb[2], 0);
    struct ggml_tensor * x_1 = ggml_view_3d(ctx, x,  x->ne[0]%pad,      x->ne[1], x->ne[2], x->nb[1], x->nb[2], x_0->ne[0]*x_0->nb[0]);

    struct ggml_tensor * y_0 = ggml_view_3d(ctx, y, (y->ne[0]/pad)*pad, y->ne[1], y->ne[2], y->nb[1], y->nb[2], 0);
    struct ggml_tensor * y_1 = ggml_view_3d(ctx, y,  y->ne[0]%pad,      y->ne[1], y->ne[2], y->nb[1], y->nb[2], y_0->ne[0]*y_0->nb[0]);

    return ggml_add(ctx,
            ggml_mul_mat(ctx, x_0, y_0),
            ggml_mul_mat(ctx, x_1, y_1));
}
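
A hypothetical usage sketch (the tensor names below are illustrative; the macro trick shown is one possible way to route existing ggml_mul_mat calls through the padded helper, not necessarily what the PR does):

```cpp
// Defined *after* ggml_mul_mat_pad so the helper's own internal ggml_mul_mat
// calls are unaffected; from this point on, every ggml_mul_mat call in the
// graph builds is padded transparently.
#define ggml_mul_mat ggml_mul_mat_pad

// ... later, inside a graph build (illustrative names):
struct ggml_tensor * Qcur = ggml_mul_mat(ctx0, layer.attn_q_w, cur);
```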

@nchudleigh (Contributor) commented Sep 14, 2023

On an M1 Pro with 32 GB RAM, against a 40s recording:

The improvement on medium size and up is staggering. Compared to master, we are seeing a 2x speedup; compared to yesterday, ~1 second.

| Commit | Model | Thread | Processor Count | Load Time | Sample Time | Encode Time | Decode Time | Sample Time per Run | Encode Time per Run | Decode Time per Run | Total Time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| master | ggml-tiny.en.bin | 4 | 1 | 54.23 | 81.29 | 263.68 | 274.8 | 0.4234 | 131.84 | 1.4313 | 708.43 |
| master | ggml-base.en.bin | 4 | 1 | 77.24 | 79.6 | 518.94 | 501.9 | 0.4124 | 259.47 | 2.6005 | 1223.97 |
| master | ggml-small.en.bin | 4 | 1 | 232.09 | 81.2 | 2248.65 | 1424.05 | 0.4207 | 1124.325 | 7.3785 | 4034.77 |
| master | ggml-medium.bin | 4 | 1 | 618.91 | 82.61 | 7265.93 | 3383.07 | 0.4193 | 3632.965 | 17.1729 | 11418.43 |
| master | ggml-medium.en.bin | 4 | 1 | 585.99 | 78.57 | 7404.29 | 3389.91 | 0.4389 | 3702.145 | 18.9380 | 11525.09 |
| master | ggml-large.bin | 4 | 1 | 1108.13 | 79.07 | 12743.14 | 5901.58 | 0.4251 | 6371.57 | 31.7289 | 19960.5 |
| 8e8daa8 | ggml-tiny.en.bin | 4 | 1 | 51.78 | 77.05 | 143.27 | 355.16 | 0.4013 | 71.635 | 1.8498 | 675.18 |
| 8e8daa8 | ggml-base.en.bin | 4 | 1 | 79.24 | 77.64 | 275.19 | 532.86 | 0.4023 | 137.595 | 2.7609 | 1027.55 |
| 8e8daa8 | ggml-small.en.bin | 4 | 1 | 219.28 | 77.51 | 782.83 | 1204.21 | 0.4016 | 391.415 | 6.2394 | 2344.08 |
| 8e8daa8 | ggml-medium.en.bin | 4 | 1 | 577.15 | 74.32 | 2117.78 | 2538.31 | 0.4152 | 1058.89 | 14.1805 | 5384.62 |
| 8e8daa8 | ggml-medium.bin | 4 | 1 | 543.54 | 79.18 | 2120.5 | 2834.63 | 0.4019 | 1060.25 | 14.3890 | 5653.88 |
| 8e8daa8 | ggml-large.bin | 4 | 1 | 1755.19 | 76.37 | 3711.21 | 4318.31 | 0.4106 | 1855.605 | 23.2167 | 9984.53 |
| 2b4160a | ggml-tiny.en.bin | 4 | 1 | 50.08 | 78.1 | 94.62 | 320.86 | 0.4068 | 47.31 | 1.6799 | 593.96 |
| 2b4160a | ggml-base.en.bin | 4 | 1 | 85.11 | 79.58 | 172.43 | 486.06 | 0.4123 | 86.215 | 2.5316 | 1702.74 |
| 2b4160a | ggml-small.en.bin | 4 | 1 | 212.96 | 77.66 | 496.48 | 1118.51 | 0.4024 | 248.24 | 5.8256 | 1995.0 |
| 2b4160a | ggml-medium.en.bin | 4 | 1 | 555.72 | 75.79 | 1338.75 | 2378.07 | 0.4234 | 669.375 | 13.3599 | 4503.9 |
| 2b4160a | ggml-medium.bin | 4 | 1 | 593.97 | 80.49 | 1334.06 | 2598.74 | 0.4086 | 667.03 | 13.3269 | 4792.06 |
| 2b4160a | ggml-large.bin | 4 | 1 | 1831.34 | 77.67 | 2477.62 | 4044.54 | 0.4176 | 1238.81 | 21.9812 | 8704.82 |


@colinc (Contributor) commented Sep 14, 2023

@ggerganov This is amazing! Thank you so, so much!

On an M2 Ultra (76-core), I'm now seeing 25x realtime for the medium model (~2 minutes to process a ~50-minute audio file, including diarization).
But, more impressively, on a basic M2 (10-core) I'm seeing ~9.4x realtime for the medium model (~5 minutes to process a ~50-minute audio file, including diarization).

And, most importantly, really good transcription results so far (especially vs previous results).

@pudepiedj commented Sep 18, 2023

These are the timings I got with what I think are the current master model versions, for ./samples/gb0.wav (127.4s), on an M2 MAX with 32 GB RAM, MacOS 13.5.1, 8 threads (all timings in [ms]). Maybe I have not done something right, because the encoding and decoding times seem quite slow compared with those @ggerganov posted for an M2 Ultra on X (formerly Twitter):

| Model | Load | Encode | Decode | Total |
| --- | --- | --- | --- | --- |
| ggml-tiny.en | 49 | 107 | 607 | 1022 |
| ggml-base.en | 65 | 193 | 948 | 1557 |
| ggml-medium.en | 427 | 1353 | 4980 | 7259 |
| ggml-large | 993 | 2372 | 7695 | 11735 |

didzis pushed a commit to didzis/whisper.cpp that referenced this pull request Sep 30, 2023
* metal : init

* whisper : factor out graph builds

* whisper : allocate encoder and decoder using ggml-alloc

* whisper : ggml-alloc is now supported

* whisper : CoreML support ggml-alloc

* build : fix ggml-alloc

* ios : update submodule

* extra : update sync-ggml.sh script to also sync ggml-alloc

* ci : see if this is causing the crash

* whisper : refactor ggml-alloc init

* whisper.android : try to fix build

* whisper : initial Metal version

* ci : try to debug vmem issue

* metal : decoder works on GPU!

* metal : add multi-decoder support

* ggml : fix ggml_nbytes (probably temp solution)

* metal : run "cross" step on the GPU

* whisper : remove ggml_repeat in the encoder

* whisper : offload the Encoder to Metal

* ggml : use simpler ggml_bytes() implementation

* ggml-alloc : try to make CI happy by reducing vram to 128GB

* whisper : add whisper_allocr to wrap ggml_allocr

* whisper : factor out alloc init in a function

* cmake : update to support Metal build

* whisper : add <functional> header

* objc : fix build (no Metal yet)

* ios : add Metal support

* swiftui : fix build

* metal : speed-up KQ multiplication

* metal : sync latest llama.cpp kernels

* readme : add Metal info

* ios : update submodule

* coreml : add code to toggle Core ML config (CPU, ANE, GPU)

* bench : fix timings by running a pre-heat

* bench : start benching the decoder

* whisper : add ggml_mul_mat_pad

* bench : fix uninitialized vars

* whisper : add comment for disabling mul-mat padding

* whisper : add description of ggml_mul_mat_pad

* whisper : clean-up ggml_mul_mat_pad

* metal : remove the "concurrent" flag

* bench : variable n_past

* ios : update SPM package
jacobwu-b pushed a commit to jacobwu-b/Transcriptify-by-whisper.cpp that referenced this pull request Oct 24, 2023
vonstring pushed a commit to vonstring/whisper.cpp that referenced this pull request Nov 7, 2023
landtanin pushed a commit to landtanin/whisper.cpp that referenced this pull request Dec 16, 2023
iThalay pushed a commit to iThalay/whisper.cpp that referenced this pull request Sep 23, 2024