
whisper.android: How to build with CLBlast #1809

Merged: 7 commits, Feb 9, 2024

Conversation

@luciferous (Contributor) commented Jan 25, 2024

Documenting for anyone else who wants to get CLBlast running on Android.

Depends on ggerganov/ggml#706.

Benchmark and transcription measurements below are from a Release variant built against ggerganov/ggml@53558f9 (with CLBlast) and e72e415 (without CLBlast).


Benchmarks

Without CLBlast (BLAS = 0).

System Info: AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | METAL = 0 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0 | 
Loading data...
Copying jfk.wav...
All data copied to working directory.
Loading model...
Loaded model ggml-tiny.en-q5_1.bin.
Running benchmark. This will take minutes...
[...snip...]
  64 x   64: Q4_0     0.7 GFLOPS (128 runs) | Q4_1     0.6 GFLOPS (128 runs)
  64 x   64: Q5_0     0.6 GFLOPS (128 runs) | Q5_1     0.9 GFLOPS (128 runs) | Q8_0     1.2 GFLOPS (128 runs)
  64 x   64: F16      1.2 GFLOPS (128 runs) | F32      0.5 GFLOPS (128 runs)
 128 x  128: Q4_0     3.4 GFLOPS (128 runs) | Q4_1     4.2 GFLOPS (128 runs)
 128 x  128: Q5_0     5.2 GFLOPS (128 runs) | Q5_1     2.7 GFLOPS (128 runs) | Q8_0     5.6 GFLOPS (128 runs)
 128 x  128: F16      3.6 GFLOPS (128 runs) | F32      4.8 GFLOPS (128 runs)
 256 x  256: Q4_0    12.9 GFLOPS (128 runs) | Q4_1    12.3 GFLOPS (128 runs)
 256 x  256: Q5_0    11.2 GFLOPS (128 runs) | Q5_1    10.4 GFLOPS (128 runs) | Q8_0    13.0 GFLOPS (128 runs)
 256 x  256: F16     15.2 GFLOPS (128 runs) | F32      9.0 GFLOPS (128 runs)
 512 x  512: Q4_0    15.2 GFLOPS ( 57 runs) | Q4_1    16.6 GFLOPS ( 62 runs)
 512 x  512: Q5_0    12.6 GFLOPS ( 48 runs) | Q5_1    12.6 GFLOPS ( 47 runs) | Q8_0    16.1 GFLOPS ( 60 runs)
 512 x  512: F16     19.4 GFLOPS ( 73 runs) | F32     10.5 GFLOPS ( 40 runs)
1024 x 1024: Q4_0    19.1 GFLOPS (  9 runs) | Q4_1    20.2 GFLOPS ( 10 runs)
1024 x 1024: Q5_0    15.2 GFLOPS (  8 runs) | Q5_1    15.4 GFLOPS (  8 runs) | Q8_0    19.5 GFLOPS ( 10 runs)
1024 x 1024: F16     22.2 GFLOPS ( 11 runs) | F32     10.8 GFLOPS (  6 runs)
2048 x 2048: Q4_0    19.8 GFLOPS (  3 runs) | Q4_1    21.3 GFLOPS (  3 runs)
2048 x 2048: Q5_0    15.7 GFLOPS (  3 runs) | Q5_1    16.5 GFLOPS (  3 runs) | Q8_0    20.8 GFLOPS (  3 runs)
2048 x 2048: F16     23.3 GFLOPS (  3 runs) | F32     10.3 GFLOPS (  3 runs)
4096 x 4096: Q4_0    19.3 GFLOPS (  3 runs) | Q4_1    21.5 GFLOPS (  3 runs)
4096 x 4096: Q5_0    15.8 GFLOPS (  3 runs) | Q5_1    16.2 GFLOPS (  3 runs) | Q8_0    20.5 GFLOPS (  3 runs)
4096 x 4096: F16     20.2 GFLOPS (  3 runs) | F32      8.2 GFLOPS (  3 runs)

With CLBlast (BLAS = 1)

System Info: AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | METAL = 0 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0 | 
Loading data...
Copying jfk.wav...
All data copied to working directory.
Loading model...
Loaded model ggml-tiny.en-q5_1.bin.
Running benchmark. This will take minutes...
[...snip...]
  64 x   64: Q4_0     0.4 GFLOPS (128 runs) | Q4_1     0.4 GFLOPS (128 runs)
  64 x   64: Q5_0     0.3 GFLOPS (128 runs) | Q5_1     0.3 GFLOPS (128 runs) | Q8_0     0.4 GFLOPS (128 runs)
  64 x   64: F16      0.4 GFLOPS (128 runs) | F32      0.3 GFLOPS (128 runs)
 128 x  128: Q4_0     1.5 GFLOPS (128 runs) | Q4_1     1.8 GFLOPS (128 runs)
 128 x  128: Q5_0     2.1 GFLOPS (128 runs) | Q5_1     2.0 GFLOPS (128 runs) | Q8_0     2.0 GFLOPS (128 runs)
 128 x  128: F16      1.8 GFLOPS (128 runs) | F32      2.1 GFLOPS (128 runs)
 256 x  256: Q4_0     7.7 GFLOPS (128 runs) | Q4_1     7.9 GFLOPS (128 runs)
 256 x  256: Q5_0     7.7 GFLOPS (128 runs) | Q5_1     7.9 GFLOPS (128 runs) | Q8_0     7.6 GFLOPS (128 runs)
 256 x  256: F16      7.5 GFLOPS (128 runs) | F32      8.4 GFLOPS (128 runs)
 512 x  512: Q4_0    19.3 GFLOPS ( 73 runs) | Q4_1    18.8 GFLOPS ( 71 runs)
 512 x  512: Q5_0    19.3 GFLOPS ( 73 runs) | Q5_1    19.4 GFLOPS ( 73 runs) | Q8_0    19.4 GFLOPS ( 73 runs)
 512 x  512: F16     19.3 GFLOPS ( 73 runs) | F32     19.0 GFLOPS ( 71 runs)
1024 x 1024: Q4_0    63.0 GFLOPS ( 30 runs) | Q4_1    65.5 GFLOPS ( 31 runs)
1024 x 1024: Q5_0    63.3 GFLOPS ( 30 runs) | Q5_1    66.0 GFLOPS ( 31 runs) | Q8_0    66.5 GFLOPS ( 31 runs)
1024 x 1024: F16     63.0 GFLOPS ( 30 runs) | F32     62.2 GFLOPS ( 29 runs)
2048 x 2048: Q4_0   100.2 GFLOPS (  6 runs) | Q4_1    97.1 GFLOPS (  6 runs)
2048 x 2048: Q5_0   101.1 GFLOPS (  6 runs) | Q5_1    95.9 GFLOPS (  6 runs) | Q8_0   101.5 GFLOPS (  6 runs)
2048 x 2048: F16     96.7 GFLOPS (  6 runs) | F32    101.0 GFLOPS (  6 runs)
4096 x 4096: Q4_0   108.0 GFLOPS (  3 runs) | Q4_1    78.7 GFLOPS (  3 runs)
4096 x 4096: Q5_0    79.1 GFLOPS (  3 runs) | Q5_1    81.3 GFLOPS (  3 runs) | Q8_0    56.0 GFLOPS (  3 runs)
4096 x 4096: F16     64.9 GFLOPS (  3 runs) | F32     49.5 GFLOPS (  3 runs)

Transcribe jfk.wav.

BLAS = 0

whisper_print_timings:     load time =   183.27 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    55.03 ms
whisper_print_timings:   sample time =    19.49 ms /     1 runs (   19.49 ms per run)
whisper_print_timings:   encode time =  1098.81 ms /     1 runs ( 1098.81 ms per run)
whisper_print_timings:   decode time =   290.64 ms /    27 runs (   10.76 ms per run)
whisper_print_timings:   batchd time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =  1464.72 ms

BLAS = 1

whisper_print_timings:     load time =  2030.81 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    49.70 ms
whisper_print_timings:   sample time =    26.18 ms /     1 runs (   26.18 ms per run)
whisper_print_timings:   encode time =  1344.75 ms /     1 runs ( 1344.75 ms per run)
whisper_print_timings:   decode time =   303.45 ms /    27 runs (   11.24 ms per run)
whisper_print_timings:   batchd time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =  1725.54 ms

@luciferous marked this pull request as draft on January 25, 2024
@luciferous (Contributor, Author) commented Jan 25, 2024

In draft, because the FetchContent_Declare(ggml SOURCE_DIR...) assumes a particular directory layout; I'm working on an option to specify ggml's location.

Also, I'm not quite sure why we lose FP16_VA with CLBlast (see: FP16_VA = 0), so I'm looking into that too. (Resolved: I just needed to pass -march=... to the ggml compile flags.)
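For context, a minimal sketch of the kind of CMake wiring being described. The `GGML_HOME` variable name and the fallback path here are hypothetical illustrations, not necessarily what the PR finally settled on:

```cmake
# Hypothetical sketch: let the builder point at a local ggml checkout,
# falling back to an assumed sibling-directory layout otherwise.
include(FetchContent)

if (NOT DEFINED GGML_HOME)
    set(GGML_HOME "${CMAKE_CURRENT_SOURCE_DIR}/../ggml")
endif()

FetchContent_Declare(ggml SOURCE_DIR "${GGML_HOME}")
FetchContent_MakeAvailable(ggml)
```

With `SOURCE_DIR`, FetchContent uses the existing local tree instead of downloading anything, which is why the layout assumption matters.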

@bobqianic (Collaborator) commented Feb 1, 2024

> Seems like quite a speed-up, but notably does not translate to faster transcription.

Could you post the timing information?

Example:

whisper_print_timings:     load time =  7241.31 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    30.68 ms
whisper_print_timings:   sample time =   362.31 ms /   751 runs (    0.48 ms per run)
whisper_print_timings:   encode time =  1177.27 ms /     5 runs (  235.45 ms per run)
whisper_print_timings:   decode time =    79.06 ms /     5 runs (   15.81 ms per run)
whisper_print_timings:   batchd time =  3858.86 ms /   735 runs (    5.25 ms per run)
whisper_print_timings:   prompt time =   191.72 ms /    89 runs (    2.15 ms per run)
whisper_print_timings:    total time = 13115.86 ms

@gpokat commented Feb 2, 2024

Did it produce correct inference? My build can't find anything when compiled with CLBlast on Android.

@luciferous (Contributor, Author)

@bobqianic

Looks like I was wrong about it not speeding up transcription 🙂.

The timing numbers below are from transcribing jfk.wav:

BLAS = 0

whisper_print_timings:     load time =   389.18 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   142.32 ms
whisper_print_timings:   sample time =    57.65 ms /     1 runs (   57.65 ms per run)
whisper_print_timings:   encode time = 28396.24 ms /     1 runs (28396.24 ms per run)
whisper_print_timings:   decode time =  2010.30 ms /    27 runs (   74.46 ms per run)
whisper_print_timings:   batchd time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time = 30621.52 ms

BLAS = 1

whisper_print_timings:     load time =   683.91 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   162.44 ms
whisper_print_timings:   sample time =    66.18 ms /     1 runs (   66.18 ms per run)
whisper_print_timings:   encode time =  3750.02 ms /     1 runs ( 3750.02 ms per run)
whisper_print_timings:   decode time =  2034.95 ms /    27 runs (   75.37 ms per run)
whisper_print_timings:   batchd time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =  6026.00 ms

I'll update the commit message to reflect this.

Also, to clarify my aims: since I don't understand the basis of the effect on performance, I'm hesitant to make any strong claims for my changes. My goal with this PR is only to merge a reference build that makes further development of CLBlast on Android easier.


Android logs aren't hooked up to the transcription, so this is what I did to retrieve the whisper_print_timings.

diff --git a/examples/whisper.android/lib/src/main/jni/whisper/jni.c b/examples/whisper.android/lib/src/main/jni/whisper/jni.c
index 7f9d724..e522a3a 100644
--- a/examples/whisper.android/lib/src/main/jni/whisper/jni.c
+++ b/examples/whisper.android/lib/src/main/jni/whisper/jni.c
@@ -14,6 +14,13 @@
 #define LOGI(...) __android_log_print(ANDROID_LOG_INFO,     TAG, __VA_ARGS__)
 #define LOGW(...) __android_log_print(ANDROID_LOG_WARN,     TAG, __VA_ARGS__)
 
+static void log_callback(enum ggml_log_level level, const char * text, void * user_data) {
+    if (level == GGML_LOG_LEVEL_ERROR)     __android_log_print(ANDROID_LOG_ERROR,   TAG, "%s", text);
+    else if (level == GGML_LOG_LEVEL_INFO) __android_log_print(ANDROID_LOG_INFO,    TAG, "%s", text);
+    else if (level == GGML_LOG_LEVEL_WARN) __android_log_print(ANDROID_LOG_WARN,    TAG, "%s", text);
+    else                                   __android_log_print(ANDROID_LOG_DEFAULT, TAG, "%s", text);
+}
+}
+
 static inline int min(int a, int b) {
     return (a < b) ? a : b;
 }
@@ -182,6 +189,8 @@ Java_com_whispercpp_whisper_WhisperLib_00024Companion_fullTranscribe(
     params.no_context = true;
     params.single_segment = false;
 
+    whisper_log_set(log_callback, NULL);
+
     whisper_reset_timings(context);
 
     LOGI("About to run whisper_full");

@luciferous (Contributor, Author)

@gpokat Transcribing jfk.wav seems to be working for me. What are you seeing?

@gpokat commented Feb 5, 2024

> @gpokat Transcribing jfk.wav seems to be working for me. What are you seeing?

The output only shows [music] for me.
Did you recompile CLBlast with the tuners' output, or as-is?
By the way, it seems you are using a debug build; I used a release build with -O3 applied for speed.

*Update.
After applying the tuners for half precision, things came alive and the inference is correct. However, in my case the timings are pretty much the same as on the CPU with optimizations:
no less than ~2 s of processing time on a Helio G99 MC2,
comparing 6 CPU cores in -O3 mode vs. fully tuned CLBlast with 2 CPU cores (GPU: Mali-G57).

@luciferous (Contributor, Author)

@gpokat Just want to verify that the System Info line shows BLAS = 1, e.g.,

System Info: AVX = 0 | AVX2 = 0 | ... | BLAS = 1 | ...

If not, it might not have built correctly.
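A quick way to check is to pull that line out of the app's log output. The pipeline below is plain text processing; the sample line is illustrative, and on a real device you would feed it from `adb logcat` instead of `echo`:

```shell
# Extract the BLAS flag from a captured system-info log line.
line='System Info: AVX = 0 | NEON = 1 | ARM_FMA = 1 | BLAS = 1 | CUDA = 0 |'
echo "$line" | grep -o 'BLAS = [01]'
# prints: BLAS = 1
```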

@ggerganov (Owner) left a comment

@luciferous Did you build the CPU version in Release mode (i.e. -O3)?

@luciferous (Contributor, Author) commented Feb 5, 2024

@ggerganov I think these are in Debug. I'll follow up with numbers from a Release build and update the commit message with something appropriate.

Built against ggerganov/ggml@53558f9.

Verified -O3 via:

  • In Build / Select Build Variant..., I set both the :app and :lib modules to release;
  • Checked lib/.cxx/tools/release/arm64-v8a/compile_commands.json to confirm -O3 in the compile options.

Transcribe sample

BLAS = 0

whisper_print_timings:     load time =   183.27 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    55.03 ms
whisper_print_timings:   sample time =    19.49 ms /     1 runs (   19.49 ms per run)
whisper_print_timings:   encode time =  1098.81 ms /     1 runs ( 1098.81 ms per run)
whisper_print_timings:   decode time =   290.64 ms /    27 runs (   10.76 ms per run)
whisper_print_timings:   batchd time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =  1464.72 ms

BLAS = 1

whisper_print_timings:     load time =  2030.81 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    49.70 ms
whisper_print_timings:   sample time =    26.18 ms /     1 runs (   26.18 ms per run)
whisper_print_timings:   encode time =  1344.75 ms /     1 runs ( 1344.75 ms per run)
whisper_print_timings:   decode time =   303.45 ms /    27 runs (   11.24 ms per run)
whisper_print_timings:   batchd time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =  1725.54 ms

Benchmarks

BLAS = 0

System Info: AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | METAL = 0 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0 | 
Loading data...
Copying jfk.wav...
All data copied to working directory.
Loading model...
Loaded model ggml-tiny.en-q5_1.bin.
Running benchmark. This will take minutes...
[...snip...]
  64 x   64: Q4_0     0.7 GFLOPS (128 runs) | Q4_1     0.6 GFLOPS (128 runs)
  64 x   64: Q5_0     0.6 GFLOPS (128 runs) | Q5_1     0.9 GFLOPS (128 runs) | Q8_0     1.2 GFLOPS (128 runs)
  64 x   64: F16      1.2 GFLOPS (128 runs) | F32      0.5 GFLOPS (128 runs)
 128 x  128: Q4_0     3.4 GFLOPS (128 runs) | Q4_1     4.2 GFLOPS (128 runs)
 128 x  128: Q5_0     5.2 GFLOPS (128 runs) | Q5_1     2.7 GFLOPS (128 runs) | Q8_0     5.6 GFLOPS (128 runs)
 128 x  128: F16      3.6 GFLOPS (128 runs) | F32      4.8 GFLOPS (128 runs)
 256 x  256: Q4_0    12.9 GFLOPS (128 runs) | Q4_1    12.3 GFLOPS (128 runs)
 256 x  256: Q5_0    11.2 GFLOPS (128 runs) | Q5_1    10.4 GFLOPS (128 runs) | Q8_0    13.0 GFLOPS (128 runs)
 256 x  256: F16     15.2 GFLOPS (128 runs) | F32      9.0 GFLOPS (128 runs)
 512 x  512: Q4_0    15.2 GFLOPS ( 57 runs) | Q4_1    16.6 GFLOPS ( 62 runs)
 512 x  512: Q5_0    12.6 GFLOPS ( 48 runs) | Q5_1    12.6 GFLOPS ( 47 runs) | Q8_0    16.1 GFLOPS ( 60 runs)
 512 x  512: F16     19.4 GFLOPS ( 73 runs) | F32     10.5 GFLOPS ( 40 runs)
1024 x 1024: Q4_0    19.1 GFLOPS (  9 runs) | Q4_1    20.2 GFLOPS ( 10 runs)
1024 x 1024: Q5_0    15.2 GFLOPS (  8 runs) | Q5_1    15.4 GFLOPS (  8 runs) | Q8_0    19.5 GFLOPS ( 10 runs)
1024 x 1024: F16     22.2 GFLOPS ( 11 runs) | F32     10.8 GFLOPS (  6 runs)
2048 x 2048: Q4_0    19.8 GFLOPS (  3 runs) | Q4_1    21.3 GFLOPS (  3 runs)
2048 x 2048: Q5_0    15.7 GFLOPS (  3 runs) | Q5_1    16.5 GFLOPS (  3 runs) | Q8_0    20.8 GFLOPS (  3 runs)
2048 x 2048: F16     23.3 GFLOPS (  3 runs) | F32     10.3 GFLOPS (  3 runs)
4096 x 4096: Q4_0    19.3 GFLOPS (  3 runs) | Q4_1    21.5 GFLOPS (  3 runs)
4096 x 4096: Q5_0    15.8 GFLOPS (  3 runs) | Q5_1    16.2 GFLOPS (  3 runs) | Q8_0    20.5 GFLOPS (  3 runs)
4096 x 4096: F16     20.2 GFLOPS (  3 runs) | F32      8.2 GFLOPS (  3 runs)

BLAS = 1

System Info: AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | METAL = 0 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0 | 
Loading data...
Copying jfk.wav...
All data copied to working directory.
Loading model...
Loaded model ggml-tiny.en-q5_1.bin.
Running benchmark. This will take minutes...
[...snip...]
  64 x   64: Q4_0     0.4 GFLOPS (128 runs) | Q4_1     0.4 GFLOPS (128 runs)
  64 x   64: Q5_0     0.3 GFLOPS (128 runs) | Q5_1     0.3 GFLOPS (128 runs) | Q8_0     0.4 GFLOPS (128 runs)
  64 x   64: F16      0.4 GFLOPS (128 runs) | F32      0.3 GFLOPS (128 runs)
 128 x  128: Q4_0     1.5 GFLOPS (128 runs) | Q4_1     1.8 GFLOPS (128 runs)
 128 x  128: Q5_0     2.1 GFLOPS (128 runs) | Q5_1     2.0 GFLOPS (128 runs) | Q8_0     2.0 GFLOPS (128 runs)
 128 x  128: F16      1.8 GFLOPS (128 runs) | F32      2.1 GFLOPS (128 runs)
 256 x  256: Q4_0     7.7 GFLOPS (128 runs) | Q4_1     7.9 GFLOPS (128 runs)
 256 x  256: Q5_0     7.7 GFLOPS (128 runs) | Q5_1     7.9 GFLOPS (128 runs) | Q8_0     7.6 GFLOPS (128 runs)
 256 x  256: F16      7.5 GFLOPS (128 runs) | F32      8.4 GFLOPS (128 runs)
 512 x  512: Q4_0    19.3 GFLOPS ( 73 runs) | Q4_1    18.8 GFLOPS ( 71 runs)
 512 x  512: Q5_0    19.3 GFLOPS ( 73 runs) | Q5_1    19.4 GFLOPS ( 73 runs) | Q8_0    19.4 GFLOPS ( 73 runs)
 512 x  512: F16     19.3 GFLOPS ( 73 runs) | F32     19.0 GFLOPS ( 71 runs)
1024 x 1024: Q4_0    63.0 GFLOPS ( 30 runs) | Q4_1    65.5 GFLOPS ( 31 runs)
1024 x 1024: Q5_0    63.3 GFLOPS ( 30 runs) | Q5_1    66.0 GFLOPS ( 31 runs) | Q8_0    66.5 GFLOPS ( 31 runs)
1024 x 1024: F16     63.0 GFLOPS ( 30 runs) | F32     62.2 GFLOPS ( 29 runs)
2048 x 2048: Q4_0   100.2 GFLOPS (  6 runs) | Q4_1    97.1 GFLOPS (  6 runs)
2048 x 2048: Q5_0   101.1 GFLOPS (  6 runs) | Q5_1    95.9 GFLOPS (  6 runs) | Q8_0   101.5 GFLOPS (  6 runs)
2048 x 2048: F16     96.7 GFLOPS (  6 runs) | F32    101.0 GFLOPS (  6 runs)
4096 x 4096: Q4_0   108.0 GFLOPS (  3 runs) | Q4_1    78.7 GFLOPS (  3 runs)
4096 x 4096: Q5_0    79.1 GFLOPS (  3 runs) | Q5_1    81.3 GFLOPS (  3 runs) | Q8_0    56.0 GFLOPS (  3 runs)
4096 x 4096: F16     64.9 GFLOPS (  3 runs) | F32     49.5 GFLOPS (  3 runs)

@luciferous (Contributor, Author)

@ggerganov Updated the commit message with the new measurements. Curiously, changes to ggml in the last week or so may have degraded CLBlast benchmark performance. For posterity, this is what I saw when building against a ggml version from mid-January (unfortunately, I don't recall which commit):

  64 x   64: Q4_0     0.0 GFLOPS ( 93 runs) | Q4_1     0.1 GFLOPS (107 runs)
  64 x   64: Q5_0     0.0 GFLOPS ( 89 runs) | Q5_1     0.1 GFLOPS ( 97 runs) | Q8_0     0.0 GFLOPS ( 87 runs)
  64 x   64: F16      0.1 GFLOPS (110 runs) | F32      0.0 GFLOPS ( 66 runs)
 128 x  128: Q4_0     0.4 GFLOPS ( 90 runs) | Q4_1     0.4 GFLOPS ( 91 runs)
 128 x  128: Q5_0     0.4 GFLOPS ( 89 runs) | Q5_1     0.4 GFLOPS ( 89 runs) | Q8_0     0.4 GFLOPS ( 91 runs)
 128 x  128: F16      0.4 GFLOPS (105 runs) | F32      0.4 GFLOPS ( 94 runs)
 256 x  256: Q4_0     2.3 GFLOPS ( 70 runs) | Q4_1     2.4 GFLOPS ( 73 runs)
 256 x  256: Q5_0     2.4 GFLOPS ( 71 runs) | Q5_1     2.4 GFLOPS ( 71 runs) | Q8_0     2.3 GFLOPS ( 70 runs)
 256 x  256: F16      2.3 GFLOPS ( 70 runs) | F32      2.6 GFLOPS ( 77 runs)
 512 x  512: Q4_0     9.6 GFLOPS ( 36 runs) | Q4_1     9.6 GFLOPS ( 36 runs)
 512 x  512: Q5_0     9.5 GFLOPS ( 36 runs) | Q5_1     9.7 GFLOPS ( 37 runs) | Q8_0     9.5 GFLOPS ( 36 runs)
 512 x  512: F16      6.5 GFLOPS ( 25 runs) | F32      9.6 GFLOPS ( 36 runs)
1024 x 1024: Q4_0    44.5 GFLOPS ( 21 runs) | Q4_1    42.8 GFLOPS ( 20 runs)
1024 x 1024: Q5_0    46.2 GFLOPS ( 22 runs) | Q5_1    45.7 GFLOPS ( 22 runs) | Q8_0    44.1 GFLOPS ( 21 runs)
1024 x 1024: F16     14.7 GFLOPS (  7 runs) | F32     36.0 GFLOPS ( 17 runs)
2048 x 2048: Q4_0    83.6 GFLOPS (  5 runs) | Q4_1    80.3 GFLOPS (  5 runs)
2048 x 2048: Q5_0    79.8 GFLOPS (  5 runs) | Q5_1    82.5 GFLOPS (  5 runs) | Q8_0    82.4 GFLOPS (  5 runs)
2048 x 2048: F16     53.8 GFLOPS (  4 runs) | F32     59.2 GFLOPS (  4 runs)
4096 x 4096: Q4_0   102.9 GFLOPS (  3 runs) | Q4_1   108.7 GFLOPS (  3 runs)
4096 x 4096: Q5_0   108.8 GFLOPS (  3 runs) | Q5_1   108.0 GFLOPS (  3 runs) | Q8_0   108.3 GFLOPS (  3 runs)
4096 x 4096: F16    183.6 GFLOPS (  3 runs) | F32    109.0 GFLOPS (  3 runs)

@gpokat commented Feb 6, 2024

@luciferous Did you build CLBlast with the tuners, or is your device already in CLBlast's tuned-devices list? https://github.com/CNugteren/CLBlast/blob/master/doc/tuning.md
Untuned kernels can lead to OpenCL performance degradation.
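For reference, the CLBlast tuning flow looks roughly like this. This is only a sketch; the exact flags, script names, and invocations should be checked against doc/tuning.md in the CLBlast repository, and on Android the tuner binaries have to be pushed to and run on the device itself:

```shell
# Sketch only -- verify each step against CLBlast's doc/tuning.md.
# 1. Configure CLBlast with the tuners enabled and build them:
cmake -DTUNERS=ON ..
make alltuners
# 2. Run the tuner binaries on the target device; each one writes a JSON
#    file with the best-performing kernel parameters for that device.
# 3. Merge the JSON results back into CLBlast's parameter database
#    (scripts live under scripts/database/ in the CLBlast tree),
#    then rebuild the library so the tuned parameters are compiled in.
```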

@luciferous (Contributor, Author) commented Feb 6, 2024

@gpokat Ah, I see. Thank you for explaining. No, I didn't build with the tuners, and it doesn't seem like my device (Mali-G710) is on the list. I'll incorporate your explanation into a note in the README.

@ggerganov merged commit 19f8048 into ggerganov:master on Feb 9, 2024
39 checks passed
@JunkFood02 (Contributor)
Great job! Just wondering how CLBlast performs on optimized GPUs with larger models and longer audio samples. I'll do some tests and return with additional benchmarks later.

jiahansu pushed a commit to OOPRY/whisper.cpp that referenced this pull request Apr 17, 2024
* FetchContent

* OpenCL

* Documentation and make optional

* Specify GGML build options in build.gradle

* Use gradle properties

* @ggerganov

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* @gpokat

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@JunkFood02 (Contributor)

After compiling CLBlast with whisper.cpp and running it on my Snapdragon 8 Gen 2 device (with a tuned Adreno 740 GPU for CLBlast), benchmark results and transcription speed haven't shown significant changes (improvement or regression) compared to CPU inference without it. This could be because ggml isn't offloading many computations to the GPU/OpenCL.


viktor-silakov pushed a commit to viktor-silakov/whisper_node_mic.cpp that referenced this pull request May 11, 2024