fix coreml ANE optimized encoder #1716

Merged: 1 commit merged into ggerganov:master on Jan 4, 2024

Conversation

@philloooo (Contributor) commented Jan 2, 2024

Transpose the result back to the format that's accepted by the decoder.
I tested with the tiny, small, and base models, and ran ./tests/run-tests.sh; the results all look good. @ggerganov I am not sure why your previous attempt didn't work. Can you double-check?
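
For illustration, here is a minimal sketch of the idea, assuming the ANE-optimized encoder's output comes back with the state dimension leading; the names and layouts are illustrative, not the PR's actual code:

```objc
// Hypothetical sketch (Objective-C++/C): the ANE-optimized encoder
// produces activations laid out as [n_state][n_ctx], while the decoder
// expects [n_ctx][n_state], so the output is transposed back.
static void transpose_encoder_output(
        const float * src,  // assumed ANE layout: n_state rows of n_ctx
              float * dst,  // decoder layout: n_ctx rows of n_state
        int n_state, int n_ctx) {
    for (int i = 0; i < n_state; ++i) {
        for (int j = 0; j < n_ctx; ++j) {
            dst[j*n_state + i] = src[i*n_ctx + j];
        }
    }
}
```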

Performance-wise, this is my result on an M3 Pro with a 30-minute audio file and the base model (I used a longer audio file to get a better average encode time per segment, so you can ignore the initial Core ML model load overhead). The encode time is ~2x faster than Metal.
With ANE optimized model:

whisper_print_timings:     load time =    93.50 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   678.59 ms
whisper_print_timings:   sample time =  6985.69 ms / 36729 runs (    0.19 ms per run)
whisper_print_timings:   encode time =  1985.48 ms /    67 runs (   29.63 ms per run)
whisper_print_timings:   decode time =   210.62 ms /    86 runs (    2.45 ms per run)
whisper_print_timings:   batchd time = 29357.13 ms / 36314 runs (    0.81 ms per run)
whisper_print_timings:   prompt time =   857.84 ms / 14712 runs (    0.06 ms per run)
whisper_print_timings:    total time = 42247.07 ms

With vanilla openai whisper model:

whisper_print_timings:     load time =    97.97 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   671.39 ms
whisper_print_timings:   sample time =  7039.62 ms / 36803 runs (    0.19 ms per run)
whisper_print_timings:   encode time =  2792.66 ms /    67 runs (   41.68 ms per run)
whisper_print_timings:   decode time =   202.56 ms /    84 runs (    2.41 ms per run)
whisper_print_timings:   batchd time = 29273.17 ms / 36390 runs (    0.80 ms per run)
whisper_print_timings:   prompt time =   845.34 ms / 14712 runs (    0.06 ms per run)
whisper_print_timings:    total time = 42520.91 ms

Metal:

whisper_print_timings:     load time =   103.68 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   678.44 ms
whisper_print_timings:   sample time =  6981.09 ms / 36940 runs (    0.19 ms per run)
whisper_print_timings:   encode time =  3958.86 ms /    66 runs (   59.98 ms per run)
whisper_print_timings:   decode time =   164.65 ms /    67 runs (    2.46 ms per run)
whisper_print_timings:   batchd time = 29736.41 ms / 36546 runs (    0.81 ms per run)
whisper_print_timings:   prompt time =   844.35 ms / 14712 runs (    0.06 ms per run)
whisper_print_timings:    total time = 42517.96 ms

@ggerganov (Owner)

Thanks for looking into this - I will recheck the results now

For reference, here is the discussion back then: #548 (reply in thread)
Somehow the results were corrupted, but either I was doing something wrong or something else got fixed along the way

@ggerganov (Owner) commented Jan 4, 2024

Indeed, the ANE-optimized Core ML models work correctly and are faster than the original models. Here are the results that I get on an M2 Ultra with 76 GPU cores and 32 ANE cores (only the "Enc." column is relevant for this change; all times are in ms):

master + Core ML ANE

| CPU | OS | Config | Model | Th | Enc. | Dec. | Bch5 | PP | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  |  | NEON BLAS COREML METAL | tiny | 4 | 29.37 | 1.37 | 0.51 | 0.01 | eaac005 |
|  |  | NEON BLAS COREML METAL | base | 4 | 43.33 | 1.98 | 0.78 | 0.02 | eaac005 |
|  |  | NEON BLAS COREML METAL | small | 4 | 139.86 | 3.88 | 1.71 | 0.05 | eaac005 |
|  |  | NEON BLAS COREML METAL | medium | 4 | 658.38 | 8.09 | 3.93 | 0.13 | eaac005 |
|  |  | NEON BLAS COREML METAL | large-v2 | 4 | 1686.48 | 11.68 | 6.06 | 0.23 | eaac005 |

PR + Core ML ANE

| CPU | OS | Config | Model | Th | Enc. | Dec. | Bch5 | PP | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  |  | NEON BLAS COREML METAL | tiny | 4 | 23.17 | 1.37 | 0.51 | 0.01 | eaac005 |
|  |  | NEON BLAS COREML METAL | base | 4 | 30.34 | 1.97 | 0.77 | 0.02 | eaac005 |
|  |  | NEON BLAS COREML METAL | small | 4 | 99.59 | 3.96 | 1.75 | 0.05 | eaac005 |
|  |  | NEON BLAS COREML METAL | medium | 4 | 307.01 | 8.06 | 3.93 | 0.13 | eaac005 |
|  |  | NEON BLAS COREML METAL | large-v2 | 4 | 526.80 | 11.68 | 6.03 | 0.23 | eaac005 |

Notice, however, that the ANE-optimized Core ML models are not suitable for running on the GPU (i.e. config.computeUnits = MLComputeUnitsCPUAndGPU;):

master + Core ML GPU

| CPU | OS | Config | Model | Th | Enc. | Dec. | Bch5 | PP | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  |  | NEON BLAS COREML METAL | tiny | 4 | 6.93 | 1.37 | 0.51 | 0.01 | eaac005 |
|  |  | NEON BLAS COREML METAL | base | 4 | 12.80 | 1.97 | 0.77 | 0.02 | eaac005 |
|  |  | NEON BLAS COREML METAL | small | 4 | 31.70 | 3.95 | 1.71 | 0.05 | eaac005 |
|  |  | NEON BLAS COREML METAL | medium | 4 | 93.59 | 8.09 | 3.93 | 0.13 | eaac005 |
|  |  | NEON BLAS COREML METAL | large-v2 | 4 | 179.49 | 11.69 | 5.93 | 0.23 | eaac005 |

PR + Core ML GPU

| CPU | OS | Config | Model | Th | Enc. | Dec. | Bch5 | PP | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  |  | NEON BLAS COREML METAL | tiny | 4 | 9.21 | 1.37 | 0.51 | 0.01 | eaac005 |
|  |  | NEON BLAS COREML METAL | base | 4 | 18.46 | 1.97 | 0.78 | 0.02 | eaac005 |
|  |  | NEON BLAS COREML METAL | small | 4 | 59.12 | 3.97 | 1.72 | 0.05 | eaac005 |
|  |  | NEON BLAS COREML METAL | medium | 4 | 168.24 | 8.08 | 3.87 | 0.13 | eaac005 |
|  |  | NEON BLAS COREML METAL | large-v2 | 4 | 303.07 | 11.72 | 5.95 | 0.23 | eaac005 |

For reference, here are the results for running the entire computation on the GPU with Metal (i.e. no Core ML):

Full Metal (no Core ML)

| CPU | OS | Config | Model | Th | Enc. | Dec. | Bch5 | PP | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  |  | NEON BLAS METAL | tiny | 4 | 10.74 | 1.37 | 0.51 | 0.01 | eaac005 |
|  |  | NEON BLAS METAL | base | 4 | 19.06 | 1.98 | 0.78 | 0.02 | eaac005 |
|  |  | NEON BLAS METAL | small | 4 | 53.24 | 3.87 | 1.71 | 0.05 | eaac005 |
|  |  | NEON BLAS METAL | medium | 4 | 143.67 | 8.12 | 3.94 | 0.13 | eaac005 |
|  |  | NEON BLAS METAL | large-v2 | 4 | 253.30 | 11.67 | 6.06 | 0.23 | eaac005 |
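
For reference, the compute-unit selection discussed above goes through Core ML's MLModelConfiguration. A minimal Objective-C++ sketch follows; MLModelConfiguration and MLComputeUnits are the real Core ML API, but the helper name and model path are illustrative:

```objc
#import <CoreML/CoreML.h>

// Hypothetical helper showing how the encoder's compute units are chosen.
static MLModel * load_encoder(NSString * path) {
    MLModelConfiguration * config = [[MLModelConfiguration alloc] init];

    // MLComputeUnitsAll                -> CPU + GPU + ANE (this PR's default)
    // MLComputeUnitsCPUAndGPU          -> skip the ANE
    // MLComputeUnitsCPUAndNeuralEngine -> skip the GPU
    config.computeUnits = MLComputeUnitsAll;

    NSError * error = nil;
    return [MLModel modelWithContentsOfURL:[NSURL fileURLWithPath:path]
                             configuration:config
                                     error:&error];
}
```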

@ggerganov merged commit ba5bcde into ggerganov:master on Jan 4, 2024. 39 checks passed.
@Josscii commented Jan 5, 2024

2024-01-05 09:24:22.023112+0800 [4575:1504107] Error: Transpose unit is not supported.
2024-01-05 09:24:22.023235+0800 [4575:1504107] Error: Transpose unit is not supported.
2024-01-05 09:24:22.023298+0800 [4575:1504107] Error: Transpose unit is not supported.
2024-01-05 09:24:22.034869+0800 [4575:1504107] Error: Transpose unit is not supported.
2024-01-05 09:24:22.035000+0800 [4575:1504107] Error: Transpose unit is not supported.
2024-01-05 09:24:22.035063+0800 [4575:1504107] Error: Transpose unit is not supported.
2024-01-05 09:24:27.377632+0800 [4575:1504107] [espresso] [Espresso::handle_ex_plan] exception=at at /ggml-base-encoder.mlmodelc/model.mil:14:12: In 'ios16.conv' operations, tensors parameter x[0], parameter weight[0], parameter bias[0], and output at index 0 must have the same data type.
2024-01-05 09:24:27.377830+0800 [4575:1504107] [coreml] Error plan build: -1.


With this update, I generated a new Core ML encoder and ran it on an iPhone XR with iOS 16; it output the errors above.

@ggerganov (Owner)

Hm, interesting. I actually didn't test whether the ANE models work on iOS. Maybe this is the problem we observed in the past, and now an error is actually being reported.

@ggerganov (Owner)

I just tested it on an iPhone 13 Mini (A15) with iOS 17.2.1 and it works without errors:

ggml_metal_init: allocating
ggml_metal_init: picking default device: Apple A15 GPU
ggml_metal_init: ggml.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/private/var/containers/Bundle/Application//whisper.objc.app/ggml-metal.metal'
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =    15.75 MiB, (  157.52)
whisper_init_state: kv self size  =   16.52 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =    17.58 MiB, (  175.09)
whisper_init_state: kv cross size =   18.43 MB
whisper_init_state: loading Core ML model from '/private/var/containers/Bundle/Application//whisper.objc.app/ggml-base.en-encoder.mlmodelc'
whisper_init_state: first run on a device may take a while ...

whisper_init_state: Core ML model loaded
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =     0.02 MiB, (  175.11)
whisper_init_state: compute buffer (conv)   =    5.74 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =     0.02 MiB, (  175.12)
whisper_init_state: compute buffer (cross)  =    4.78 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =     0.02 MiB, (  175.14)
whisper_init_state: compute buffer (decode) =   96.48 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =     3.86 MiB, (  178.98)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =     2.94 MiB, (  181.91)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =    90.39 MiB, (  272.28)

...

whisper_print_timings:      mel time =    69.66 ms
whisper_print_timings:   sample time =    16.86 ms /     1 runs (   16.86 ms per run)
whisper_print_timings:   encode time =    89.77 ms /     1 runs (   89.77 ms per run)
whisper_print_timings:   decode time =    71.90 ms /     7 runs (   10.27 ms per run)
whisper_print_timings:   batchd time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =   252.02 ms

Could it be related to the iOS version?

@Josscii commented Jan 5, 2024

I think it could be the chip's problem; the iPhone XR is quite an old device, it uses the A12.

@astrowonk

Not sure if it is due to changes in this PR or something else, but the whisper_init_state: first run on a device may take a while … step when using CoreML (which had gotten very fast in Sonoma) now takes a very long time again: ~5 minutes on my first test run of whisper.cpp after recompiling with this PR and making a new medium.en CoreML model with the generate-coreml-model.sh script.

Second time running was quick.

Overall performance on a 10-minute podcast was 157 seconds on my little M1 Mac Mini, about the same as previous whisper.cpp builds that were CoreML but GPU+CPU only.

The GPU was going all out according to Activity Monitor, but at least according to asitop, the ANE doesn't appear to be doing much. I see the config.computeUnits = MLComputeUnitsAll; change, but at least on my M1 the GPU is doing most of the work.
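
For anyone reproducing this, a sketch of the regenerate-and-rebuild steps being described, following the whisper.cpp Core ML instructions (the model name and audio file are illustrative):

```sh
# Regenerate the ANE-optimized Core ML encoder (model name illustrative):
./models/generate-coreml-model.sh medium.en

# Rebuild with Core ML support:
make clean
WHISPER_COREML=1 make -j

# Transcribe; the first run recompiles the Core ML model and may take a while:
./main -m models/ggml-medium.en.bin -f podcast.wav
```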

@ggerganov (Owner)

What happens if you switch to config.computeUnits = MLComputeUnitsCPUAndNeuralEngine;?

@astrowonk

> What happens if you switch to config.computeUnits = MLComputeUnitsCPUAndNeuralEngine;?

Changed the config in whisper.cpp/coreml/whisper-encoder.mm to MLComputeUnitsCPUAndNeuralEngine, did a make clean, and then another WHISPER_COREML=1 make -j.

Still not seeing much ANE usage in asitop; weirdly, I'm still seeing mostly GPU and some CPU. Processing time for my test file is still right around 158 seconds. (Second run, since whisper_init_state took a long time again after recompiling.)

@philloooo (Contributor, Author)

In asitop it's expected to be mostly GPU and CPU, because the decoder runs on the GPU and is much more expensive than the encoder, and the pre/post-processing is all CPU. You should still see a small amount of ANE usage, though; in my testing the ANE is ~10x more power efficient than the GPU, so its reported usage is very minimal.

@astrowonk mentioned this pull request on Feb 15, 2024
@aehlke commented Mar 27, 2024

Is it good enough for realtime on iPhone now?

@bebound (Contributor) commented Apr 10, 2024

The ANE-optimized encoder generated by generate-coreml-model.sh is much faster than the files in https://huggingface.co/ggerganov/whisper.cpp/tree/main (33 s vs 59 s). The first run below is with the generated encoder, the second with the Hugging Face one:

whisper_print_timings:     load time =  1148.52 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   135.14 ms
whisper_print_timings:   sample time =   230.31 ms /     1 runs (  230.31 ms per run)
whisper_print_timings:   encode time = 10612.82 ms /    13 runs (  816.37 ms per run)
whisper_print_timings:   decode time = 17358.26 ms /   637 runs (   27.25 ms per run)
whisper_print_timings:   batchd time =   111.01 ms /     6 runs (   18.50 ms per run)
whisper_print_timings:   prompt time =  2460.09 ms /  2087 runs (    1.18 ms per run)
whisper_print_timings:    total time = 32753.55 ms

whisper_print_timings:     load time =  1182.02 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   132.84 ms
whisper_print_timings:   sample time =   235.12 ms /     1 runs (  235.12 ms per run)
whisper_print_timings:   encode time = 38726.64 ms /    13 runs ( 2978.97 ms per run)
whisper_print_timings:   decode time = 15707.32 ms /   637 runs (   24.66 ms per run)
whisper_print_timings:   batchd time =   134.52 ms /     6 runs (   22.42 ms per run)
whisper_print_timings:   prompt time =  2380.83 ms /  2087 runs (    1.14 ms per run)
whisper_print_timings:    total time = 59573.27 ms

Maybe we need to update the files shared on Hugging Face?

@Josscii commented Apr 10, 2024

As I reported in the old-device crash issue above, I think it needs a patch to avoid this.

@bebound (Contributor) commented Apr 11, 2024

I see.

But this PR makes config.computeUnits = MLComputeUnitsAll the default, and the compiled main runs slowly with the Hugging Face mlmodelc encoder. I'm inclined to revert to MLComputeUnitsCPUAndGPU before we update the encoder models.
