
whisper.android: How to build with CLBlast #1809

Merged: 7 commits, Feb 9, 2024

Conversation

@luciferous (Contributor) commented Jan 25, 2024

Documenting for anyone else who wants to get CLBlast running on Android.

Depends on ggerganov/ggml#706.

Benchmark and transcription measurements below are from a Release variant built against ggerganov/ggml@53558f9 (with CLBlast) and e72e415 (without CLBlast).


Benchmarks

Without CLBlast (BLAS = 0).

System Info: AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | METAL = 0 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0 | 
Loading data...
Copying jfk.wav...
All data copied to working directory.
Loading model...
Loaded model ggml-tiny.en-q5_1.bin.
Running benchmark. This will take minutes...
[...snip...]
  64 x   64: Q4_0     0.7 GFLOPS (128 runs) | Q4_1     0.6 GFLOPS (128 runs)
  64 x   64: Q5_0     0.6 GFLOPS (128 runs) | Q5_1     0.9 GFLOPS (128 runs) | Q8_0     1.2 GFLOPS (128 runs)
  64 x   64: F16      1.2 GFLOPS (128 runs) | F32      0.5 GFLOPS (128 runs)
 128 x  128: Q4_0     3.4 GFLOPS (128 runs) | Q4_1     4.2 GFLOPS (128 runs)
 128 x  128: Q5_0     5.2 GFLOPS (128 runs) | Q5_1     2.7 GFLOPS (128 runs) | Q8_0     5.6 GFLOPS (128 runs)
 128 x  128: F16      3.6 GFLOPS (128 runs) | F32      4.8 GFLOPS (128 runs)
 256 x  256: Q4_0    12.9 GFLOPS (128 runs) | Q4_1    12.3 GFLOPS (128 runs)
 256 x  256: Q5_0    11.2 GFLOPS (128 runs) | Q5_1    10.4 GFLOPS (128 runs) | Q8_0    13.0 GFLOPS (128 runs)
 256 x  256: F16     15.2 GFLOPS (128 runs) | F32      9.0 GFLOPS (128 runs)
 512 x  512: Q4_0    15.2 GFLOPS ( 57 runs) | Q4_1    16.6 GFLOPS ( 62 runs)
 512 x  512: Q5_0    12.6 GFLOPS ( 48 runs) | Q5_1    12.6 GFLOPS ( 47 runs) | Q8_0    16.1 GFLOPS ( 60 runs)
 512 x  512: F16     19.4 GFLOPS ( 73 runs) | F32     10.5 GFLOPS ( 40 runs)
1024 x 1024: Q4_0    19.1 GFLOPS (  9 runs) | Q4_1    20.2 GFLOPS ( 10 runs)
1024 x 1024: Q5_0    15.2 GFLOPS (  8 runs) | Q5_1    15.4 GFLOPS (  8 runs) | Q8_0    19.5 GFLOPS ( 10 runs)
1024 x 1024: F16     22.2 GFLOPS ( 11 runs) | F32     10.8 GFLOPS (  6 runs)
2048 x 2048: Q4_0    19.8 GFLOPS (  3 runs) | Q4_1    21.3 GFLOPS (  3 runs)
2048 x 2048: Q5_0    15.7 GFLOPS (  3 runs) | Q5_1    16.5 GFLOPS (  3 runs) | Q8_0    20.8 GFLOPS (  3 runs)
2048 x 2048: F16     23.3 GFLOPS (  3 runs) | F32     10.3 GFLOPS (  3 runs)
4096 x 4096: Q4_0    19.3 GFLOPS (  3 runs) | Q4_1    21.5 GFLOPS (  3 runs)
4096 x 4096: Q5_0    15.8 GFLOPS (  3 runs) | Q5_1    16.2 GFLOPS (  3 runs) | Q8_0    20.5 GFLOPS (  3 runs)
4096 x 4096: F16     20.2 GFLOPS (  3 runs) | F32      8.2 GFLOPS (  3 runs)

With CLBlast (BLAS = 1)

System Info: AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | METAL = 0 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0 | 
Loading data...
Copying jfk.wav...
All data copied to working directory.
Loading model...
Loaded model ggml-tiny.en-q5_1.bin.
Running benchmark. This will take minutes...
[...snip...]
  64 x   64: Q4_0     0.4 GFLOPS (128 runs) | Q4_1     0.4 GFLOPS (128 runs)
  64 x   64: Q5_0     0.3 GFLOPS (128 runs) | Q5_1     0.3 GFLOPS (128 runs) | Q8_0     0.4 GFLOPS (128 runs)
  64 x   64: F16      0.4 GFLOPS (128 runs) | F32      0.3 GFLOPS (128 runs)
 128 x  128: Q4_0     1.5 GFLOPS (128 runs) | Q4_1     1.8 GFLOPS (128 runs)
 128 x  128: Q5_0     2.1 GFLOPS (128 runs) | Q5_1     2.0 GFLOPS (128 runs) | Q8_0     2.0 GFLOPS (128 runs)
 128 x  128: F16      1.8 GFLOPS (128 runs) | F32      2.1 GFLOPS (128 runs)
 256 x  256: Q4_0     7.7 GFLOPS (128 runs) | Q4_1     7.9 GFLOPS (128 runs)
 256 x  256: Q5_0     7.7 GFLOPS (128 runs) | Q5_1     7.9 GFLOPS (128 runs) | Q8_0     7.6 GFLOPS (128 runs)
 256 x  256: F16      7.5 GFLOPS (128 runs) | F32      8.4 GFLOPS (128 runs)
 512 x  512: Q4_0    19.3 GFLOPS ( 73 runs) | Q4_1    18.8 GFLOPS ( 71 runs)
 512 x  512: Q5_0    19.3 GFLOPS ( 73 runs) | Q5_1    19.4 GFLOPS ( 73 runs) | Q8_0    19.4 GFLOPS ( 73 runs)
 512 x  512: F16     19.3 GFLOPS ( 73 runs) | F32     19.0 GFLOPS ( 71 runs)
1024 x 1024: Q4_0    63.0 GFLOPS ( 30 runs) | Q4_1    65.5 GFLOPS ( 31 runs)
1024 x 1024: Q5_0    63.3 GFLOPS ( 30 runs) | Q5_1    66.0 GFLOPS ( 31 runs) | Q8_0    66.5 GFLOPS ( 31 runs)
1024 x 1024: F16     63.0 GFLOPS ( 30 runs) | F32     62.2 GFLOPS ( 29 runs)
2048 x 2048: Q4_0   100.2 GFLOPS (  6 runs) | Q4_1    97.1 GFLOPS (  6 runs)
2048 x 2048: Q5_0   101.1 GFLOPS (  6 runs) | Q5_1    95.9 GFLOPS (  6 runs) | Q8_0   101.5 GFLOPS (  6 runs)
2048 x 2048: F16     96.7 GFLOPS (  6 runs) | F32    101.0 GFLOPS (  6 runs)
4096 x 4096: Q4_0   108.0 GFLOPS (  3 runs) | Q4_1    78.7 GFLOPS (  3 runs)
4096 x 4096: Q5_0    79.1 GFLOPS (  3 runs) | Q5_1    81.3 GFLOPS (  3 runs) | Q8_0    56.0 GFLOPS (  3 runs)
4096 x 4096: F16     64.9 GFLOPS (  3 runs) | F32     49.5 GFLOPS (  3 runs)

Transcribe jfk.wav.

BLAS = 0

whisper_print_timings:     load time =   183.27 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    55.03 ms
whisper_print_timings:   sample time =    19.49 ms /     1 runs (   19.49 ms per run)
whisper_print_timings:   encode time =  1098.81 ms /     1 runs ( 1098.81 ms per run)
whisper_print_timings:   decode time =   290.64 ms /    27 runs (   10.76 ms per run)
whisper_print_timings:   batchd time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =  1464.72 ms

BLAS = 1

whisper_print_timings:     load time =  2030.81 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    49.70 ms
whisper_print_timings:   sample time =    26.18 ms /     1 runs (   26.18 ms per run)
whisper_print_timings:   encode time =  1344.75 ms /     1 runs ( 1344.75 ms per run)
whisper_print_timings:   decode time =   303.45 ms /    27 runs (   11.24 ms per run)
whisper_print_timings:   batchd time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =  1725.54 ms

@luciferous marked this pull request as draft on January 25, 2024
@luciferous (Contributor, Author) commented Jan 25, 2024

In draft, because the FetchContent_Declare(ggml SOURCE_DIR...) assumes a particular directory layout; I'm working on an option to specify ggml's location.

Also, I'm not quite sure why we lose FP16_VA with CLBlast (see: FP16_VA = 0), so I'm looking into that too. (Resolved: I just needed to pass -march=... to the ggml compile flags.)
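For context, a minimal sketch of the kind of CMake wiring being described. The `GGML_HOME` variable name and the fallback path here are hypothetical illustrations, not necessarily what the PR finally settled on:

```cmake
# Hypothetical sketch: let the builder point at a local ggml checkout,
# falling back to an assumed sibling-directory layout otherwise.
include(FetchContent)

if (NOT DEFINED GGML_HOME)
    set(GGML_HOME "${CMAKE_CURRENT_SOURCE_DIR}/../ggml")
endif()

FetchContent_Declare(ggml SOURCE_DIR "${GGML_HOME}")
FetchContent_MakeAvailable(ggml)
```

With `SOURCE_DIR`, FetchContent uses the existing local tree instead of downloading anything, which is why the layout assumption matters.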

@bobqianic (Collaborator) commented Feb 1, 2024

> Seems like quite a speed-up, but notably does not translate to faster transcription.

Could you post the timing information?

Example:

whisper_print_timings:     load time =  7241.31 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    30.68 ms
whisper_print_timings:   sample time =   362.31 ms /   751 runs (    0.48 ms per run)
whisper_print_timings:   encode time =  1177.27 ms /     5 runs (  235.45 ms per run)
whisper_print_timings:   decode time =    79.06 ms /     5 runs (   15.81 ms per run)
whisper_print_timings:   batchd time =  3858.86 ms /   735 runs (    5.25 ms per run)
whisper_print_timings:   prompt time =   191.72 ms /    89 runs (    2.15 ms per run)
whisper_print_timings:    total time = 13115.86 ms

@gpokat commented Feb 2, 2024

Did it produce correct inference? My build can't find anything when compiled with CLBlast on Android.

@luciferous (Contributor, Author)

@bobqianic

Looks like I was wrong about it not speeding up transcription 🙂.

The timing numbers below are from transcribing jfk.wav:

BLAS = 0

whisper_print_timings:     load time =   389.18 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   142.32 ms
whisper_print_timings:   sample time =    57.65 ms /     1 runs (   57.65 ms per run)
whisper_print_timings:   encode time = 28396.24 ms /     1 runs (28396.24 ms per run)
whisper_print_timings:   decode time =  2010.30 ms /    27 runs (   74.46 ms per run)
whisper_print_timings:   batchd time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time = 30621.52 ms

BLAS = 1

whisper_print_timings:     load time =   683.91 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   162.44 ms
whisper_print_timings:   sample time =    66.18 ms /     1 runs (   66.18 ms per run)
whisper_print_timings:   encode time =  3750.02 ms /     1 runs ( 3750.02 ms per run)
whisper_print_timings:   decode time =  2034.95 ms /    27 runs (   75.37 ms per run)
whisper_print_timings:   batchd time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =  6026.00 ms

I'll update the commit message to reflect this.

Also, to clarify my aims: since I don't understand the basis of the effect on performance, I'm hesitant to make any strong claims for my changes. My goal with this PR is only to merge a reference build that makes further development of CLBlast on Android easier.


Android logs aren't hooked up to the transcription, so this is what I did to retrieve the whisper_print_timings.

diff --git a/examples/whisper.android/lib/src/main/jni/whisper/jni.c b/examples/whisper.android/lib/src/main/jni/whisper/jni.c
index 7f9d724..e522a3a 100644
--- a/examples/whisper.android/lib/src/main/jni/whisper/jni.c
+++ b/examples/whisper.android/lib/src/main/jni/whisper/jni.c
@@ -14,6 +14,13 @@
 #define LOGI(...) __android_log_print(ANDROID_LOG_INFO,     TAG, __VA_ARGS__)
 #define LOGW(...) __android_log_print(ANDROID_LOG_WARN,     TAG, __VA_ARGS__)
 
+static void log_callback(enum ggml_log_level level, const char * text, void * user_data) {
+    if (level == GGML_LOG_LEVEL_ERROR)     __android_log_print(ANDROID_LOG_ERROR,   TAG, "%s", text);
+    else if (level == GGML_LOG_LEVEL_INFO) __android_log_print(ANDROID_LOG_INFO,    TAG, "%s", text);
+    else if (level == GGML_LOG_LEVEL_WARN) __android_log_print(ANDROID_LOG_WARN,    TAG, "%s", text);
+    else                                   __android_log_print(ANDROID_LOG_DEFAULT, TAG, "%s", text);
+}
+}
+
 static inline int min(int a, int b) {
     return (a < b) ? a : b;
 }
@@ -182,6 +189,8 @@ Java_com_whispercpp_whisper_WhisperLib_00024Companion_fullTranscribe(
     params.no_context = true;
     params.single_segment = false;
 
+    whisper_log_set(log_callback, NULL);
+
     whisper_reset_timings(context);
 
     LOGI("About to run whisper_full");

@luciferous (Contributor, Author)

@gpokat Transcribing jfk.wav seems to be working for me. What are you seeing?

@gpokat commented Feb 5, 2024

> @gpokat Transcribing jfk.wav seems to be working for me. What are you seeing?

The output only shows [music] for me.
Did you recompile CLBlast with the tuners' output, or as-is?
By the way, it seems you are using a debug build; I used a release build with -O3 applied for speed.

*Update.
After applying the tuners for half precision, things came alive and the inference is correct. However, in my case the timings are pretty much the same as on the CPU with optimizations:
no less than ~2 s of processing time on a Helio G99 MC2,
comparing 6 CPU cores in -O3 mode vs. fully tuned CLBlast with 2 CPU cores (GPU: Mali-G57).

@luciferous (Contributor, Author)

@gpokat Just want to verify that the System Info line shows BLAS = 1, e.g.,

System Info: AVX = 0 | AVX2 = 0 | ... | BLAS = 1 | ...

If not, it might not have built correctly.
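A quick way to check is to pull that line out of the app's log output. The pipeline below is plain text processing; the sample line is illustrative, and on a real device you would feed it from `adb logcat` instead of `echo`:

```shell
# Extract the BLAS flag from a captured system-info log line.
line='System Info: AVX = 0 | NEON = 1 | ARM_FMA = 1 | BLAS = 1 | CUDA = 0 |'
echo "$line" | grep -o 'BLAS = [01]'
# prints: BLAS = 1
```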

@ggerganov (Owner) left a comment

@luciferous Did you build the CPU version in Release mode (i.e. -O3)?

@luciferous (Contributor, Author) commented Feb 5, 2024

@ggerganov I think these are in Debug. I'll follow up with numbers from a Release build and update the commit message with something appropriate.

Built against ggerganov/ggml@53558f9.

Verified -O3 via:

  • In Build / Select Build Variant..., I set both the :app and :lib modules to release;
  • Checked lib/.cxx/tools/release/arm64-v8a/compile_commands.json to confirm -O3 in the compile options.

Transcribe sample

BLAS = 0

whisper_print_timings:     load time =   183.27 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    55.03 ms
whisper_print_timings:   sample time =    19.49 ms /     1 runs (   19.49 ms per run)
whisper_print_timings:   encode time =  1098.81 ms /     1 runs ( 1098.81 ms per run)
whisper_print_timings:   decode time =   290.64 ms /    27 runs (   10.76 ms per run)
whisper_print_timings:   batchd time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =  1464.72 ms

BLAS = 1

whisper_print_timings:     load time =  2030.81 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    49.70 ms
whisper_print_timings:   sample time =    26.18 ms /     1 runs (   26.18 ms per run)
whisper_print_timings:   encode time =  1344.75 ms /     1 runs ( 1344.75 ms per run)
whisper_print_timings:   decode time =   303.45 ms /    27 runs (   11.24 ms per run)
whisper_print_timings:   batchd time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =  1725.54 ms

Benchmarks

BLAS = 0

System Info: AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | METAL = 0 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0 | 
Loading data...
Copying jfk.wav...
All data copied to working directory.
Loading model...
Loaded model ggml-tiny.en-q5_1.bin.
Running benchmark. This will take minutes...
[...snip...]
  64 x   64: Q4_0     0.7 GFLOPS (128 runs) | Q4_1     0.6 GFLOPS (128 runs)
  64 x   64: Q5_0     0.6 GFLOPS (128 runs) | Q5_1     0.9 GFLOPS (128 runs) | Q8_0     1.2 GFLOPS (128 runs)
  64 x   64: F16      1.2 GFLOPS (128 runs) | F32      0.5 GFLOPS (128 runs)
 128 x  128: Q4_0     3.4 GFLOPS (128 runs) | Q4_1     4.2 GFLOPS (128 runs)
 128 x  128: Q5_0     5.2 GFLOPS (128 runs) | Q5_1     2.7 GFLOPS (128 runs) | Q8_0     5.6 GFLOPS (128 runs)
 128 x  128: F16      3.6 GFLOPS (128 runs) | F32      4.8 GFLOPS (128 runs)
 256 x  256: Q4_0    12.9 GFLOPS (128 runs) | Q4_1    12.3 GFLOPS (128 runs)
 256 x  256: Q5_0    11.2 GFLOPS (128 runs) | Q5_1    10.4 GFLOPS (128 runs) | Q8_0    13.0 GFLOPS (128 runs)
 256 x  256: F16     15.2 GFLOPS (128 runs) | F32      9.0 GFLOPS (128 runs)
 512 x  512: Q4_0    15.2 GFLOPS ( 57 runs) | Q4_1    16.6 GFLOPS ( 62 runs)
 512 x  512: Q5_0    12.6 GFLOPS ( 48 runs) | Q5_1    12.6 GFLOPS ( 47 runs) | Q8_0    16.1 GFLOPS ( 60 runs)
 512 x  512: F16     19.4 GFLOPS ( 73 runs) | F32     10.5 GFLOPS ( 40 runs)
1024 x 1024: Q4_0    19.1 GFLOPS (  9 runs) | Q4_1    20.2 GFLOPS ( 10 runs)
1024 x 1024: Q5_0    15.2 GFLOPS (  8 runs) | Q5_1    15.4 GFLOPS (  8 runs) | Q8_0    19.5 GFLOPS ( 10 runs)
1024 x 1024: F16     22.2 GFLOPS ( 11 runs) | F32     10.8 GFLOPS (  6 runs)
2048 x 2048: Q4_0    19.8 GFLOPS (  3 runs) | Q4_1    21.3 GFLOPS (  3 runs)
2048 x 2048: Q5_0    15.7 GFLOPS (  3 runs) | Q5_1    16.5 GFLOPS (  3 runs) | Q8_0    20.8 GFLOPS (  3 runs)
2048 x 2048: F16     23.3 GFLOPS (  3 runs) | F32     10.3 GFLOPS (  3 runs)
4096 x 4096: Q4_0    19.3 GFLOPS (  3 runs) | Q4_1    21.5 GFLOPS (  3 runs)
4096 x 4096: Q5_0    15.8 GFLOPS (  3 runs) | Q5_1    16.2 GFLOPS (  3 runs) | Q8_0    20.5 GFLOPS (  3 runs)
4096 x 4096: F16     20.2 GFLOPS (  3 runs) | F32      8.2 GFLOPS (  3 runs)

BLAS = 1

System Info: AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | METAL = 0 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0 | 
Loading data...
Copying jfk.wav...
All data copied to working directory.
Loading model...
Loaded model ggml-tiny.en-q5_1.bin.
Running benchmark. This will take minutes...
[...snip...]
  64 x   64: Q4_0     0.4 GFLOPS (128 runs) | Q4_1     0.4 GFLOPS (128 runs)
  64 x   64: Q5_0     0.3 GFLOPS (128 runs) | Q5_1     0.3 GFLOPS (128 runs) | Q8_0     0.4 GFLOPS (128 runs)
  64 x   64: F16      0.4 GFLOPS (128 runs) | F32      0.3 GFLOPS (128 runs)
 128 x  128: Q4_0     1.5 GFLOPS (128 runs) | Q4_1     1.8 GFLOPS (128 runs)
 128 x  128: Q5_0     2.1 GFLOPS (128 runs) | Q5_1     2.0 GFLOPS (128 runs) | Q8_0     2.0 GFLOPS (128 runs)
 128 x  128: F16      1.8 GFLOPS (128 runs) | F32      2.1 GFLOPS (128 runs)
 256 x  256: Q4_0     7.7 GFLOPS (128 runs) | Q4_1     7.9 GFLOPS (128 runs)
 256 x  256: Q5_0     7.7 GFLOPS (128 runs) | Q5_1     7.9 GFLOPS (128 runs) | Q8_0     7.6 GFLOPS (128 runs)
 256 x  256: F16      7.5 GFLOPS (128 runs) | F32      8.4 GFLOPS (128 runs)
 512 x  512: Q4_0    19.3 GFLOPS ( 73 runs) | Q4_1    18.8 GFLOPS ( 71 runs)
 512 x  512: Q5_0    19.3 GFLOPS ( 73 runs) | Q5_1    19.4 GFLOPS ( 73 runs) | Q8_0    19.4 GFLOPS ( 73 runs)
 512 x  512: F16     19.3 GFLOPS ( 73 runs) | F32     19.0 GFLOPS ( 71 runs)
1024 x 1024: Q4_0    63.0 GFLOPS ( 30 runs) | Q4_1    65.5 GFLOPS ( 31 runs)
1024 x 1024: Q5_0    63.3 GFLOPS ( 30 runs) | Q5_1    66.0 GFLOPS ( 31 runs) | Q8_0    66.5 GFLOPS ( 31 runs)
1024 x 1024: F16     63.0 GFLOPS ( 30 runs) | F32     62.2 GFLOPS ( 29 runs)
2048 x 2048: Q4_0   100.2 GFLOPS (  6 runs) | Q4_1    97.1 GFLOPS (  6 runs)
2048 x 2048: Q5_0   101.1 GFLOPS (  6 runs) | Q5_1    95.9 GFLOPS (  6 runs) | Q8_0   101.5 GFLOPS (  6 runs)
2048 x 2048: F16     96.7 GFLOPS (  6 runs) | F32    101.0 GFLOPS (  6 runs)
4096 x 4096: Q4_0   108.0 GFLOPS (  3 runs) | Q4_1    78.7 GFLOPS (  3 runs)
4096 x 4096: Q5_0    79.1 GFLOPS (  3 runs) | Q5_1    81.3 GFLOPS (  3 runs) | Q8_0    56.0 GFLOPS (  3 runs)
4096 x 4096: F16     64.9 GFLOPS (  3 runs) | F32     49.5 GFLOPS (  3 runs)

@luciferous (Contributor, Author)

@ggerganov Updated the commit message with the new measurements. Curiously, changes to ggml in the last week or so may have degraded CLBlast benchmark performance. For posterity, this is what I saw when building against a ggml version from mid-January (unfortunately, I don't recall which commit):

  64 x   64: Q4_0     0.0 GFLOPS ( 93 runs) | Q4_1     0.1 GFLOPS (107 runs)
  64 x   64: Q5_0     0.0 GFLOPS ( 89 runs) | Q5_1     0.1 GFLOPS ( 97 runs) | Q8_0     0.0 GFLOPS ( 87 runs)
  64 x   64: F16      0.1 GFLOPS (110 runs) | F32      0.0 GFLOPS ( 66 runs)
 128 x  128: Q4_0     0.4 GFLOPS ( 90 runs) | Q4_1     0.4 GFLOPS ( 91 runs)
 128 x  128: Q5_0     0.4 GFLOPS ( 89 runs) | Q5_1     0.4 GFLOPS ( 89 runs) | Q8_0     0.4 GFLOPS ( 91 runs)
 128 x  128: F16      0.4 GFLOPS (105 runs) | F32      0.4 GFLOPS ( 94 runs)
 256 x  256: Q4_0     2.3 GFLOPS ( 70 runs) | Q4_1     2.4 GFLOPS ( 73 runs)
 256 x  256: Q5_0     2.4 GFLOPS ( 71 runs) | Q5_1     2.4 GFLOPS ( 71 runs) | Q8_0     2.3 GFLOPS ( 70 runs)
 256 x  256: F16      2.3 GFLOPS ( 70 runs) | F32      2.6 GFLOPS ( 77 runs)
 512 x  512: Q4_0     9.6 GFLOPS ( 36 runs) | Q4_1     9.6 GFLOPS ( 36 runs)
 512 x  512: Q5_0     9.5 GFLOPS ( 36 runs) | Q5_1     9.7 GFLOPS ( 37 runs) | Q8_0     9.5 GFLOPS ( 36 runs)
 512 x  512: F16      6.5 GFLOPS ( 25 runs) | F32      9.6 GFLOPS ( 36 runs)
1024 x 1024: Q4_0    44.5 GFLOPS ( 21 runs) | Q4_1    42.8 GFLOPS ( 20 runs)
1024 x 1024: Q5_0    46.2 GFLOPS ( 22 runs) | Q5_1    45.7 GFLOPS ( 22 runs) | Q8_0    44.1 GFLOPS ( 21 runs)
1024 x 1024: F16     14.7 GFLOPS (  7 runs) | F32     36.0 GFLOPS ( 17 runs)
2048 x 2048: Q4_0    83.6 GFLOPS (  5 runs) | Q4_1    80.3 GFLOPS (  5 runs)
2048 x 2048: Q5_0    79.8 GFLOPS (  5 runs) | Q5_1    82.5 GFLOPS (  5 runs) | Q8_0    82.4 GFLOPS (  5 runs)
2048 x 2048: F16     53.8 GFLOPS (  4 runs) | F32     59.2 GFLOPS (  4 runs)
4096 x 4096: Q4_0   102.9 GFLOPS (  3 runs) | Q4_1   108.7 GFLOPS (  3 runs)
4096 x 4096: Q5_0   108.8 GFLOPS (  3 runs) | Q5_1   108.0 GFLOPS (  3 runs) | Q8_0   108.3 GFLOPS (  3 runs)
4096 x 4096: F16    183.6 GFLOPS (  3 runs) | F32    109.0 GFLOPS (  3 runs)

@gpokat commented Feb 6, 2024

@luciferous Did you build CLBlast with the tuners, or is your device already in CLBlast's tuned-devices list? https://github.com/CNugteren/CLBlast/blob/master/doc/tuning.md
Untuned kernels can lead to OpenCL performance degradation.
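For reference, the CLBlast tuning flow looks roughly like this. This is only a sketch; the exact flags, script names, and invocations should be checked against doc/tuning.md in the CLBlast repository, and on Android the tuner binaries have to be pushed to and run on the device itself:

```shell
# Sketch only -- verify each step against CLBlast's doc/tuning.md.
# 1. Configure CLBlast with the tuners enabled and build them:
cmake -DTUNERS=ON ..
make alltuners
# 2. Run the tuner binaries on the target device; each one writes a JSON
#    file with the best-performing kernel parameters for that device.
# 3. Merge the JSON results back into CLBlast's parameter database
#    (scripts live under scripts/database/ in the CLBlast tree),
#    then rebuild the library so the tuned parameters are compiled in.
```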

@luciferous (Contributor, Author) commented Feb 6, 2024

@gpokat Ah, I see. Thank you for explaining. No, I didn't build with the tuners, and it doesn't seem like my device (Mali-G710) is on the list. I'll incorporate your explanation into a note in the README.

@ggerganov merged commit 19f8048 into ggerganov:master on Feb 9, 2024
39 checks passed
@JunkFood02 (Contributor)
Great job! Just wondering how CLBlast performs on optimized GPUs with larger models and longer audio samples. I'll do some tests and return with additional benchmarks later.

jiahansu pushed a commit to OOPRY/whisper.cpp that referenced this pull request Apr 17, 2024
* FetchContent

* OpenCL

* Documentation and make optional

* Specify GGML build options in build.gradle

* Use gradle properties

* @ggerganov

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* @gpokat

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@JunkFood02 (Contributor)

After compiling CLBlast with whisper.cpp and running it on my Snapdragon 8 Gen 2 device (with a tuned Adreno 740 GPU for CLBlast), benchmark results and transcription speed haven't shown significant changes (improvement or regression) compared to CPU inference without it. This could be because ggml isn't offloading many computations to the GPU/OpenCL.


viktor-silakov pushed a commit to viktor-silakov/whisper_node_mic.cpp that referenced this pull request May 11, 2024