
Comparison with faster-whisper #1127

Open
geekodour opened this issue Jul 21, 2023 · 17 comments

Comments

@geekodour
Contributor

geekodour commented Jul 21, 2023

I did a very rough comparison of https://github.com/guillaumekln/faster-whisper and whisper.cpp, and it turns out faster-whisper is faster than whisper.cpp on CPU.

For example, faster-whisper takes 14 seconds with small.en, whereas whisper.cpp takes 46 seconds. What causes this slowness? Or am I not setting the parameters correctly? I tried keeping the beam size and thread count similar.

I suspect I am not doing the comparison correctly; it would be great if someone more knowledgeable could explain why faster-whisper is faster on CPU.

I think I am comparing int8 (faster-whisper) to int4 (https://huggingface.co/ggerganov/whisper.cpp) quantization here, but I am not sure how much of a difference that should make.

See comparison here:
https://gist.github.com/geekodour/8734b3bf22b8ede61fb5bfc92ce68fe3
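
For reference, a minimal sketch of the whisper.cpp side of the comparison, pinning the two knobs I tried to keep aligned between the two tools (thread count and beam size). This assumes the whisper.cpp C API, with the newer _with_params loader (older builds use whisper_init_from_file without the params struct); the model path and the thread/beam values are example placeholders, and real audio loading is omitted:

#include "whisper.h"

#include <cstdio>
#include <vector>

int main() {
    // Load the model (placeholder path).
    whisper_context_params cparams = whisper_context_default_params();
    whisper_context * ctx = whisper_init_from_file_with_params("models/ggml-small.en.bin", cparams);
    if (ctx == nullptr) {
        return 1;
    }

    // Placeholder input: 10 s of silence at 16 kHz mono; load real PCM here.
    std::vector<float> pcm(16000 * 10, 0.0f);

    // The two parameters to set explicitly on both sides of the comparison.
    whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_BEAM_SEARCH);
    wparams.n_threads             = 4; // example value; match faster-whisper's cpu_threads
    wparams.beam_search.beam_size = 5; // example value; match faster-whisper's beam_size

    if (whisper_full(ctx, wparams, pcm.data(), (int) pcm.size()) != 0) {
        fprintf(stderr, "failed to process audio\n");
        whisper_free(ctx);
        return 1;
    }

    for (int i = 0; i < whisper_full_n_segments(ctx); ++i) {
        printf("%s", whisper_full_get_segment_text(ctx, i));
    }
    printf("\n");

    whisper_free(ctx);
    return 0;
}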

@bobqianic
Collaborator

I've just submitted a pull request that aims to address a few of these issues. Currently, OpenBLAS isn't actually enabled on the Windows platform, even though the previously released binary file is named whisper-blas-bin-x64. When OpenBLAS is enabled, it boosts CPU inference speed by a factor of 3-4. I ran some tests on my i7-12700H, benchmarking matrix multiplication with the -w 2 flag, and found that it achieves at least 50% of the theoretical maximum with OpenBLAS enabled.

@fire

fire commented Jul 26, 2023

Does this mean we're at 30 seconds compared to faster-whisper's 14 seconds?

@guillaumekln

guillaumekln commented Jul 27, 2023

For this example there are two main reasons why faster-whisper is faster:

Disclaimer: I'm the author of faster-whisper.

@vadi2
Contributor

vadi2 commented Aug 2, 2023

Just throwing in there that faster-whisper is quicker than whisper.cpp on the GPU as well.

Using an RTX 4080 on Ubuntu 22.04, a 12 min audio sample takes 3.4 min to transcribe using whisper.cpp with the medium model, while faster-whisper does it in 30 s using the higher-quality large-v2 model. The medium model brings it down to 20 s. It's seriously impressive.
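
In real-time-factor terms, that is 720 s / 204 s ≈ 3.5× realtime for whisper.cpp with medium, versus 720 / 30 = 24× for faster-whisper with large-v2 and 720 / 20 = 36× with medium.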

@nchudleigh
Contributor

@guillaumekln has batched beam search still not been implemented?

@guillaumekln

The related issue #1048 is still open so I don't think it is implemented yet.

@nchudleigh
Contributor

Going to take a crack at bringing over the implementation from llama.

Fair warning, I am not very experienced with C/C++. Will link the PR here once ready for review.
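
Until then, here is a toy, self-contained sketch of the idea being ported (not llama.cpp or whisper.cpp code; the scoring is synthetic): as I understand it, the decoder currently runs once per beam candidate at every step, while the batched version folds all candidates into a single scoring pass per step, which is where the speedup comes from.

#include <algorithm>
#include <cstdio>
#include <vector>

struct Hyp {
    std::vector<int> tokens;
    float logprob;
};

// One *batched* evaluation: score every (hypothesis, next token) pair in a single
// call, the way a batched decoder pass would. The scores here are synthetic; in the
// real thing this is one graph eval over a batch instead of beam_size decoder runs.
static std::vector<std::vector<float>> score_batch(const std::vector<Hyp> & beams, int n_vocab) {
    std::vector<std::vector<float>> logits(beams.size(), std::vector<float>(n_vocab));
    for (size_t b = 0; b < beams.size(); ++b) {
        for (int v = 0; v < n_vocab; ++v) {
            logits[b][v] = -float((v + 7 * (b + beams[b].tokens.size())) % 23); // dummy score
        }
    }
    return logits;
}

int main() {
    const int n_vocab = 16, beam_size = 5, steps = 4;
    std::vector<Hyp> beams = { { { 0 }, 0.0f } };

    for (int s = 0; s < steps; ++s) {
        auto logits = score_batch(beams, n_vocab); // one "forward pass" per step
        std::vector<Hyp> candidates;
        for (size_t b = 0; b < beams.size(); ++b) {
            for (int v = 0; v < n_vocab; ++v) {
                Hyp h = beams[b];
                h.tokens.push_back(v);
                h.logprob += logits[b][v];
                candidates.push_back(std::move(h));
            }
        }
        std::partial_sort(candidates.begin(), candidates.begin() + beam_size, candidates.end(),
                          [](const Hyp & a, const Hyp & b) { return a.logprob > b.logprob; });
        candidates.resize(beam_size);
        beams = std::move(candidates);
    }

    for (const auto & h : beams) {
        printf("logprob %.1f, %zu tokens\n", h.logprob, h.tokens.size());
    }
    return 0;
}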

@bobqianic
Collaborator

bobqianic commented Sep 3, 2023

Could you run another test on the latest version of whisper.cpp? I'm curious to see how much we've improved since last month. You can find the latest version in PR #1243. Thanks! @geekodour

Please use OpenBLAS (64-bit), e.g. openblas64-dev, and use the following command for testing (beam size 5, best-of 5, 8 threads):
./main -bs 5 -bo 5 -t 8 -f steve2.wav -m models/ggml-small.en.bin

@bobqianic
Collaborator

Going to take a crack at bringing over the implementation from llama.

Fair warning, I am not very experienced with C/C++. Will link the PR here once ready for review.

Any progress?

@bobqianic
Collaborator

bobqianic commented Sep 5, 2023

Could you run another test on the latest version of whisper.cpp? I'm curious to see how much we've improved since last month. You can find the latest version in PR #1243. Thanks! @geekodour

Please use OpenBLAS (64-bit), e.g. openblas64-dev, and use the following command for testing: ./main -bs 5 -bo 5 -t 8 -f steve2.wav -m models/ggml-small.en.bin

In terms of CPU performance, whisper.cpp isn't lagging too far behind. To give you an idea, our latest tests were run on an i7-12700H, using 4 threads and a beam size of 5.

Please note:

  1. We haven't added batch decoding yet
  2. Parallel sampling is still on the to-do list
  3. Assisted generation is also not available at this time

Small model on CPU

diffusion2023-07-03.wav 27m:49s

| Implementation | Precision | Beam size | Time |
| --- | --- | --- | --- |
| whisper.cpp (1ee2707) | FP32 | 5 | 9m:45s |
| faster-whisper (ad388cd) | FP32 | 5 | 7m:30s |

@ggerganov
Owner

@bobqianic We should get batched decoding implemented before running additional tests. Without it, whisper.cpp will always be significantly slower.

@bobqianic
Collaborator

@bobqianic We should get batched decoding implemented before running additional tests. Without it, whisper.cpp will always be significantly slower.

Agree.

@nchudleigh
Contributor

No progress on this yet from me. Will update here with draft PR when I have something.

@InflexCZE

InflexCZE commented Mar 29, 2024

I did some measurements of my own that I'd like to share, along with some observations and maybe performance recommendations for this great project.

Audio Context size 1500 (default)

| # | Method | Compute type | Relative speedup | Mean time (ms) | Std Dev (ms) |
| --- | --- | --- | --- | --- | --- |
| #1 | WhisperCpp | - | 1.0 | 4723.80 | 15.13 |
| #27 | WhisperCpp BLAS | - | 1.72 | 2745.40 | 14.72 |
| #2 | CT2InsideCpp BLAS | f32 | 2.28 | 2068.30 | 18.09 |
| #3 | CT2InsideCpp BLAS | i8 | 1.59 | 2976.30 | 27.16 |
| #4 | CT2InsideCpp MKL | f32 | 2.76 | 1712.00 | 24.04 |
| #5 | CT2InsideCpp MKL | i8 | 2.14 | 2210.60 | 28.87 |
| #6 | CT2 BLAS | f32 | 2.68 | 1762.10 | 33.39 |
| #7 | CT2 BLAS | i8 | 1.78 | 2646.40 | 16.97 |
| #8 | CT2 MKL | f32 | 3.37 | 1403.10 | 26.42 |
| #9 | CT2 MKL | i8 | 2.52 | 1873.60 | 8.66 |
| #10 | CT2 (No MEL) BLAS | f32 | 2.73 | 1730.90 | 12.55 |
| #11 | CT2 (No MEL) BLAS | i8 | 1.81 | 2615.70 | 15.70 |
| #12 | CT2 (No MEL) MKL | f32 | 3.43 | 1379.00 | 13.19 |
| #13 | CT2 (No MEL) MKL | i8 | 2.54 | 1856.70 | 10.24 |

Audio Context size 512 (seems to strike a good balance of accuracy vs. performance, #166)

| # | Method | Compute type | Relative speedup | Mean time (ms) | Std Dev (ms) |
| --- | --- | --- | --- | --- | --- |
| #14 | WhisperCpp | - | 1.0 | 1349.00 | 37.21 |
| #28 | WhisperCpp BLAS | - | 1.76 | 766.60 | 12.42 |
| #15 | CT2InsideCpp BLAS | f32 | 1.48 | 909.00 | 20.83 |
| #16 | CT2InsideCpp BLAS | i8 | 1.29 | 1042.60 | 12.34 |
| #17 | CT2InsideCpp MKL | f32 | 2.42 | 556.40 | 7.62 |
| #18 | CT2InsideCpp MKL | i8 | 1.99 | 677.30 | 10.56 |
| #19 | CT2 BLAS | f32 | 1.71 | 791.10 | 11.68 |
| #20 | CT2 BLAS | i8 | 1.45 | 932.00 | 6.88 |
| #21 | CT2 MKL | f32 | 3.03 | 444.80 | 12.61 |
| #22 | CT2 MKL | i8 | 2.38 | 566.40 | 7.69 |
| #23 | CT2 (No MEL) BLAS | f32 | 1.76 | 767.90 | 4.25 |
| #24 | CT2 (No MEL) BLAS | i8 | 1.48 | 910.90 | 3.51 |
| #25 | CT2 (No MEL) MKL | f32 | 3.15 | 427.80 | 7.35 |
| #26 | CT2 (No MEL) MKL | i8 | 2.44 | 552.90 | 20.40 |

Legend:
CT2 - https://github.com/OpenNMT/CTranslate2
CT2 (No MEL) - MEL was pre-computed before the benchmark; included just for performance comparison of the MEL vs. Encode-Decode stages
WhisperCpp - just downloaded and compiled as-is, with /arch:AVX2; not sure what the default compute type is
CT2InsideCpp - WhisperCpp frontend, with the Encode-Decode stages replaced by callbacks into CT2
MKL - https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html
BLAS - https://github.com/OpenMathLib/OpenBLAS

Observations:

  1. Good job on finding this Audio Context size reduction trick.
    I don't care whether it's a documented feature of the model or not; it works and makes processing on embedded devices actually viable.
    Thank you!

  2. In Further improve the speed of Whisper.cpp (might need a dependency) #589 (reply in thread) the question was raised how MKL compares to OpenBLAS.
    In Further improve the speed of Whisper.cpp (might need a dependency) #589 (reply in thread) guillaumekln observed ~70% better performance thanks to MKL.
    I second this, with a small caveat. In my measurements this was the case only for AC=512 (#19 vs #21 => 78%). For full AC (#6 vs #8) the improvement was only 25%.
    The difference likely comes from the much smaller data used, and probably the smaller model too.
    Still, very nice gains, and no installation issues on my side, so it could be very low-hanging fruit for a ggml improvement.

  3. INT8 compute mode turned out to be a performance regression across the board compared to FLOAT32.
    This was very interesting to me; I believe it is the result of small data and a small model running on "sufficiently powerful" HW.
    What is normally a memory-bound program managed to fit comfortably into L3 at every computation step, turning it into a compute-bound problem, so the memory-footprint reduction had no effect while the extra compute of the i8 conversions increased the times (see the back-of-envelope sketch after this list).
    Will definitely follow up on this when porting to RPi. Much less cache there, so it could become memory-bound again on f32s.
    Let me know if you have a different hypothesis on this.

  4. Comparing #2 #3 #4 #5 with #6 #7 #8 #9 respectively, and #15 #16 #17 #18 with #19 #20 #21 #22 respectively, shows the overhead of the Whisper.cpp frontend vs the pure CT2 model.
    This is particularly interesting as it shows there is potential to speed up Whisper.cpp by 12-25% just by reducing data shuffles, runtime allocations and runtime tensor-graph preparations.
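
A quick back-of-envelope check of the cache argument in observation 3, with assumed numbers (Whisper base with d_model = 512 and a 2048-wide MLP, roughly 12 MB of L3 on the i7-9700K used for these runs):

// Back-of-envelope numbers for observation 3 (all values assumed: Whisper base dims,
// ~12 MB L3 on an i7-9700K). The largest weight matrix inside a transformer block is
// the 512x2048 MLP projection.
#include <cstdio>

int main() {
    const double d_model = 512, d_ffn = 2048;                            // Whisper base dims (assumed)
    const double l3_mb   = 12.0;                                         // i7-9700K L3 (assumed)

    const double attn_mb    = d_model * d_model * 4 / (1024.0 * 1024.0); // one attention projection, f32
    const double mlp_f32_mb = d_model * d_ffn   * 4 / (1024.0 * 1024.0); // MLP projection, f32
    const double mlp_i8_mb  = d_model * d_ffn   * 1 / (1024.0 * 1024.0); // MLP projection, i8

    printf("attention projection: %.1f MB (f32)\n", attn_mb);
    printf("MLP projection:       %.1f MB (f32) vs %.1f MB (i8)\n", mlp_f32_mb, mlp_i8_mb);
    printf("L3 budget:            %.1f MB\n", l3_mb);
    return 0;
}

Both precisions fit a single matmul's weights into L3 with room to spare, which is consistent with the compute-bound picture above; on something RPi-sized, with only a couple of MB of shared cache, the f32 weights would stop fitting first.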

For reproduction, I attach the sources of my benchmark as well as the patch to Whisper.cpp that was used to expose MEL and replace the Encode-Decode steps.

Much code here :)
// Excerpt from the benchmark sources. Assumes the attached whisper.cpp patch below
// (whisper_extract_mel, custom encoder/decoder callbacks) plus the CTranslate2 headers,
// using-declarations and helpers such as Int16ToFP32 from the original project files.
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <exception>
#include <numeric>
#include <string>
#include <vector>

#include "ggml.h"
#include "whisper.h"

#define AUDIO_CONTEXT_SIZE 1500
#define COMPUTE_TYPE FLOAT32
#define BEAM_SIZE 5

std::string StringifyWhisperCpp(std::vector<int16_t> samples)
{
    whisper_context_params cparams = whisper_context_default_params();
    cparams.use_gpu = false;
    
    whisper_context* ctx = whisper_init_from_file_with_params("\\models\\ggml-base.bin", cparams);
    
    auto samples32 = Int16ToFP32(samples);

    auto wparams = whisper_full_default_params(WHISPER_SAMPLING_BEAM_SEARCH);
    wparams.n_threads = 1;
    wparams.audio_ctx = AUDIO_CONTEXT_SIZE;
    wparams.no_timestamps = true;
    wparams.print_special = false;
    wparams.token_timestamps = false;
    wparams.beam_search.beam_size = BEAM_SIZE;

    Benchmark(__func__, [&]
    {
        if (whisper_full(ctx, wparams, samples32.data(), samples32.size()) != 0)
        {
            throw std::exception("failed to process audio");
        }

        return 0;
    });

    std::string output;
    const int n_segments = whisper_full_n_segments(ctx);
    for (int i = 0; i < n_segments; ++i)
    {
        output += whisper_full_get_segment_text(ctx, i);
    }
    
    whisper_free(ctx);
    return output;
}


#ifdef CT2
std::string StringifyCT2(std::vector<int16_t> samples, bool precomputedMEL = false)
{
    std::vector<std::vector<size_t>> prompts{ {50258, 50259, 50359, 50363} };

    auto mels = WhisperCppMel(nullptr, samples);

    try
    {
        std::string path = "\\faster-whisper\\models\\base";
        auto modelFile = ctranslate2::models::Model::load
        (
            path,
            Device::CPU,
            0,
            ComputeType::COMPUTE_TYPE
        );
        auto model = ctranslate2::models::WhisperReplica::create_from_model(*modelFile);

        auto featuresSize = precomputedMEL ? (dim_t)mels.size() / 80 : AUDIO_CONTEXT_SIZE * 2;
        
        StorageView features { Shape{ {1, 80, featuresSize} }, DataType::FLOAT32, Device::CPU };


        whisper_context* whisper_context = nullptr;
        if (precomputedMEL)
        {
            features.copy_from(mels.data(), mels.size(), Device::CPU, /*synchronous*/ true);
        }
        else
        {
            whisper_context = WhisperCppMelContext();
        }

        auto result = Benchmark(__func__, [&]
        {
            if (precomputedMEL == false)
            {
                auto mel = WhisperCppMel(whisper_context, samples);
                features.copy_from(mel.data(), mel.size(), Device::CPU, /*synchronous*/ true);
            }

            return model->generate
            (
                features,
                prompts,
                models::WhisperOptions
                {
                    .beam_size = BEAM_SIZE,
                    .patience = 1,
                    .length_penalty = 1,
                    .repetition_penalty = 1.01,
                    .no_repeat_ngram_size = 0,
                    .max_length = 448,
                    .sampling_temperature = 1.0,
                    .return_scores = false,//true,
                    .return_no_speech_prob = false,//true,
                    .max_initial_timestamp_index = 50,
                    .suppress_blank = false,//true,
                    .suppress_tokens = {-1},
                }
            );
        });

        std::string output = "";
        for (auto strings : result[0].sequences)
        for (auto string : strings)
        {
            output += string;
        }

        if(whisper_context != nullptr)
        {
            whisper_free(whisper_context);
        }

        return output;
    }
    catch (std::exception& e)
    {
        auto message = e.what();
        printf("%s\n", message);
    }
}

std::string StringifyCT2WithCpp(std::vector<int16_t> samples)
{
    std::string path = "\\faster-whisper\\models\\base";
    auto modelFile = ctranslate2::models::Model::load
    (
        path,
        Device::CPU,
        0,
        ComputeType::COMPUTE_TYPE
    );
    auto model = ctranslate2::models::WhisperReplica::create_from_model(*modelFile);

    struct CallbackContext
    {
        models::WhisperReplica* model;
        whisper_context* whisper;

        std::string stringOut;
        std::vector<whisper_token> lastTokens;
        StorageView features;

    }
    callback_context
    {
        .model = model.get(),
        .features = { Shape {{1, 80, AUDIO_CONTEXT_SIZE * 2}}, DataType::FLOAT32, Device::CPU }
    };

    whisper_context_params cparams = whisper_context_default_params();
    cparams.use_gpu = false;

    cparams.custom_encoder.custom_encoder_context = &callback_context;
    cparams.custom_encoder.custom_encoder_callback = [](void* custom_encoder_context, ggml_tensor* mel_in, ggml_tensor* encoded_tokens_out)
    {
        auto& ctx = *(CallbackContext*) custom_encoder_context;
        std::vector<std::vector<size_t>> prompts{ {50258, 50259, 50359, 50363} };

        assert(mel_in->type == GGML_TYPE_F32);
        ctx.features.copy_from((float*)mel_in->data, mel_in->ne[0] * mel_in->ne[1], Device::CPU, /*synchronous*/ true);

        auto results = ctx.model->generate
        (
            ctx.features,
            prompts,
            models::WhisperOptions
            {
                .beam_size = BEAM_SIZE,
                .patience = 1,
                .length_penalty = 1,
                .repetition_penalty = 1.01,
                .no_repeat_ngram_size = 0,
                .max_length = 448,
                .sampling_temperature = 1.0,
                .return_scores = false,//true,
                .return_no_speech_prob = false,//true,
                .max_initial_timestamp_index = 50,
                .suppress_blank = false,//true,
                .suppress_tokens = {-1},
            }
        );

        auto& tokensOut = results[0].sequences_ids;
        //assert(encoded_tokens_out->type == GGML_TYPE_I32);
        //ggml_backend_tensor_set(encoded_tokens_out, tokensOut.data(), 0, sizeof(size_t) * tokensOut.size());

        //for (auto tokens : tokensOut)
        //for (auto token : tokens)
        //{
        //    ctx.stringOut += whisper_token_to_str(ctx.whisper, token);
        //}

        ctx.lastTokens.clear();
        for (auto tokens : results[0].sequences_ids)
        for (auto token : tokens)
        {
            ctx.lastTokens.push_back((whisper_token)token);
        }
    };

    cparams.custom_encoder.custom_decoder_context = &callback_context;
    cparams.custom_encoder.custom_decoder_callback = []
    (
        void* custom_decoder_context,
        void* submit_tokens_context,
        void(*submit_tokens)(void* submit_tokens_context, whisper_token* tokens, size_t tokens_count)
    )
    {
        auto& ctx = *(CallbackContext*) custom_decoder_context;
        submit_tokens(submit_tokens_context, ctx.lastTokens.data(), ctx.lastTokens.size());
    };
    
    whisper_context* ctx = whisper_init_from_file_with_params("\\models\\ggml-base.bin", cparams);
    callback_context.whisper = ctx;
    
    auto samples32 = Int16ToFP32(samples);

    auto wparams = whisper_full_default_params(WHISPER_SAMPLING_BEAM_SEARCH);
    wparams.n_threads = 1;
    wparams.audio_ctx = AUDIO_CONTEXT_SIZE;
    wparams.no_timestamps = true;
    wparams.print_special = false;
    wparams.token_timestamps = false;
    wparams.beam_search.beam_size = BEAM_SIZE;

    Benchmark(__func__, [&]
    {
        if (whisper_full(ctx, wparams, samples32.data(), samples32.size()) != 0)
        {
            throw std::exception("failed to process audio");
        }

        return 0;
    });

    std::string output;
    const int n_segments = whisper_full_n_segments(ctx);
    for (int i = 0; i < n_segments; ++i)
    {
        output += whisper_full_get_segment_text(ctx, i);
    }

    whisper_free(ctx);
    return output;
}

template<class T>
auto Benchmark(const char* name, T&& callback)
{
    printf("%s\n", name);
    using time = decltype(ggml_time_ms());

    // Warmup run
    auto firstResult = callback();

    std::vector<time> runs;
    for (int i = 0; i < 10; ++i)
    {
        auto begin = ggml_time_ms();
        callback();
        auto elapsed = ggml_time_ms() - begin;

        runs.push_back(elapsed);
    }

    auto sum = std::accumulate(runs.begin(), runs.end(), 0);
    auto mean = (float)sum / (float)runs.size();
    printf("%.2f;", mean);

    float variance = 0;
    for (auto run : runs)
    {
        auto diff = run - mean;
        variance += diff * diff;
    }
    
    auto stdDev = sqrt(variance / (float)(runs.size() - 1));
    printf("%.2f;", stdDev);

    for (auto run : runs)
    {
        printf("%i;", (int)run);
    }
    printf("\n");

    return firstResult;
}

std::vector<float> FilterMel(const std::vector<float>& raw_mels, int n_ctx = AUDIO_CONTEXT_SIZE)
{
    auto mel_offset = 0;
    auto mel_inp_n_mel = 80;
    auto mel_inp_n_len = (int)(raw_mels.size() / mel_inp_n_mel);
    const int i0 = std::min(mel_offset, mel_inp_n_len);
    const int i1 = std::min(mel_offset + 2 * n_ctx, mel_inp_n_len);

    std::vector<float> mels;
    mels.resize(mel_inp_n_mel * (i1 - i0));
    for (int j = 0; j < mel_inp_n_mel; ++j)
    {
        for (int i = i0; i < i1; ++i)
        {
            mels[j * 2 * n_ctx + (i - i0)] = raw_mels[j * mel_inp_n_len + i];
        }
    }

    return mels;
}

whisper_context* WhisperCppMelContext()
{
    whisper_context_params cparams = whisper_context_default_params();
    cparams.use_gpu = false;

    return whisper_init_from_file_with_params("\\models\\ggml-base.bin", cparams);
}

std::vector<float> WhisperCppMel(whisper_context* acceptedContext, std::vector<int16_t> samples, int contextSize = AUDIO_CONTEXT_SIZE)
{
    whisper_context* ctx = acceptedContext;

    if (ctx == nullptr)
    {
        ctx = WhisperCppMelContext();
    }

    auto samples32 = Int16ToFP32(samples);

    if (whisper_pcm_to_mel(ctx, samples32.data(), samples32.size(), /*n_threads*/1) != 0)
    {
        fprintf(stderr, "failed to process audio\n");
        return {};
    }

    auto* mel = whisper_extract_mel(ctx);
    auto mels = FilterMel(mel->data, contextSize);

    if (acceptedContext == nullptr)
    {
        whisper_free(ctx);
    }

    return mels;
}
diff --git a/whisper.cpp b/whisper.cpp
index f601197..d39275d 100644
--- a/whisper.cpp
+++ b/whisper.cpp
@@ -351,6 +351,7 @@ static const std::map<std::string, std::pair<int, std::string>> g_lang = {
     { "yue", { 99,  "cantonese",      } },
 };
 
+/*
 struct whisper_mel {
     int n_len;
     int n_len_org;
@@ -358,6 +359,7 @@ struct whisper_mel {
 
     std::vector<float> data;
 };
+*/
 
 struct whisper_filters {
     int32_t n_mel;
@@ -815,6 +817,8 @@ struct whisper_state {
     whisper_openvino_context * ctx_openvino = nullptr;
 #endif
 
+    whisper_custom_encoder custom_encoder;
+
     // [EXPERIMENTAL] token-level timestamps data
     int64_t t_beg  = 0;
     int64_t t_last = 0;
@@ -1625,7 +1629,7 @@ static bool whisper_encode_external(const whisper_state & wstate) {
     const bool use_openvino = wstate.ctx_openvino != nullptr;
 #endif
 
-    return use_coreml || use_openvino;
+    return use_coreml || use_openvino || wstate.custom_encoder.custom_encoder_callback != nullptr;
 }
 
 static struct ggml_cgraph * whisper_build_graph_conv(
@@ -2059,6 +2063,13 @@ static bool whisper_encode_internal(
             whisper_coreml_encode(wstate.ctx_coreml, mel->ne[0], mel->ne[1], (float *) mel->data, (float *) wstate.embd_enc->data);
 #elif defined(WHISPER_USE_OPENVINO)
             whisper_openvino_encode(wstate.ctx_openvino, mel, wstate.embd_enc);
+#else
+            wstate.custom_encoder.custom_encoder_callback
+            (
+                wstate.custom_encoder.custom_encoder_context,
+                mel,
+                wstate.embd_enc
+            );
 #endif
         }
     }
@@ -2758,7 +2769,7 @@ static void log_mel_spectrogram_worker_thread(int ith, const std::vector<float>
 
 // ref: https://github.com/openai/whisper/blob/main/whisper/audio.py#L110-L157
 static bool log_mel_spectrogram(
-              whisper_state & wstate,
+             int64_t* time_out,
               const float * samples,
               const int   n_samples,
               const int   /*sample_rate*/,
@@ -2837,7 +2848,10 @@ static bool log_mel_spectrogram(
         mel.data[i] = (mel.data[i] + 4.0)/4.0;
     }
 
-    wstate.t_mel_us += ggml_time_us() - t_start_us;
+    if(time_out != nullptr)
+    {
+        *time_out += ggml_time_us() - t_start_us;
+    }
 
     // Dump log_mel_spectrogram
     if (debug) {
@@ -2853,6 +2867,37 @@ static bool log_mel_spectrogram(
     return true;
 }
 
+static bool log_mel_spectrogram
+(
+    whisper_state& wstate,
+    const float* samples,
+    const int   n_samples,
+    const int   sample_rate,
+    const int   frame_size,
+    const int   frame_step,
+    const int   n_mel,
+    const int   n_threads,
+    const whisper_filters& filters,
+    const bool   debug,
+    whisper_mel& mel
+)
+{
+    return log_mel_spectrogram
+    (
+        &wstate.t_mel_us,
+        samples,
+        n_samples,
+        sample_rate,
+        frame_size,
+        frame_step,
+        n_mel,
+        n_threads,
+        filters,
+        debug,
+        mel
+    );
+}
+
 // split text into tokens
 //
 // ref: https://github.com/openai/gpt-2/blob/a74da5d99abaaba920de8131d64da2862a8f213b/src/encoder.py#L53
@@ -2970,6 +3015,8 @@ struct whisper_state * whisper_init_state(whisper_context * ctx) {
 
     whisper_state * state = new whisper_state;
 
+    state->custom_encoder = ctx->params.custom_encoder;
+
     state->backend = whisper_backend_init(ctx->params);
     if (!state->backend) {
         WHISPER_LOG_ERROR("%s: whisper_backend_init() failed\n", __func__);
@@ -3052,13 +3099,20 @@ struct whisper_state * whisper_init_state(whisper_context * ctx) {
     }
 
     // encoder allocator
-    if (!whisper_encode_external(*state)) {
-        bool ok = whisper_allocr_graph_init(state->alloc_encode, ctx->backend,
-                [&]() {
-                    return whisper_build_graph_encoder(*ctx, *state);
-                });
+    if (!whisper_encode_external(*state))
+    {
+        bool ok = whisper_allocr_graph_init
+       (
+            state->alloc_encode,
+            ctx->backend,
+            [&]()
+            {
+                return whisper_build_graph_encoder(*ctx, *state);
+            }
+        );
 
-        if (!ok) {
+        if (!ok)
+        {
             WHISPER_LOG_ERROR("%s: failed to init encoder allocator\n", __func__);
             whisper_free_state(state);
             return nullptr;
@@ -3400,6 +3454,23 @@ int whisper_pcm_to_mel(struct whisper_context * ctx, const float * samples, int
     return whisper_pcm_to_mel_with_state(ctx, ctx->state, samples, n_samples, n_threads);
 }
 
+int whisper_pcm_to_mel_no_state
+(
+    const float* samples_in,
+    int   n_samples,
+    float* mel_out,
+    size_t* mel_size_in_out,
+    int   n_threads
+)
+{
+    //if (!log_mel_spectrogram(*state, samples, n_samples, WHISPER_SAMPLE_RATE, WHISPER_N_FFT, WHISPER_HOP_LENGTH, ctx->model.filters.n_mel, n_threads, ctx->model.filters, false, state->mel)) {
+        WHISPER_LOG_ERROR("%s: failed to compute mel spectrogram\n", __func__);
+        return -1;
+//    }
+//
+//    return 0;
+}
+
 // same as whisper_pcm_to_mel, but applies a Phase Vocoder to speed up the audio x2 (PV without phase lock is not good)
 int whisper_pcm_to_mel_phase_vocoder_with_state(struct whisper_context * ctx, struct whisper_state * state, const float * samples, int n_samples, int n_threads) {
     if (!log_mel_spectrogram(*state, samples, n_samples, WHISPER_SAMPLE_RATE, 2 * WHISPER_N_FFT, 2 * WHISPER_HOP_LENGTH, ctx->model.filters.n_mel, n_threads, ctx->model.filters, false, state->mel)) {
@@ -3453,6 +3524,11 @@ int whisper_set_mel(
     return whisper_set_mel_with_state(ctx, ctx->state, data, n_len, n_mel);
 }
 
+whisper_mel* whisper_extract_mel(struct whisper_context* ctx)
+{
+    return &ctx->state->mel;
+}
+
 int whisper_encode_with_state(struct whisper_context * ctx, struct whisper_state * state, int offset, int n_threads) {
     if (!whisper_encode_internal(*ctx, *state, offset, n_threads, nullptr, nullptr)) {
         WHISPER_LOG_ERROR("%s: failed to eval\n", __func__);
@@ -5181,6 +5257,52 @@ int whisper_full_with_state(
 
         int best_decoder_id = 0;
 
+        //if (state->custom_encoder.custom_encoder_callback != nullptr)
+        //{
+        //    seek += 100 * WHISPER_CHUNK_SIZE;
+       //    continue;
+        //}
+
+        if (state->custom_encoder.custom_decoder_callback != nullptr)
+        {
+            auto& decoder = state->decoders[best_decoder_id];
+
+            // TAGS: WHISPER_DECODER_INIT
+            decoder.sequence.tokens.clear();
+            decoder.sequence.result_len = 0;
+            decoder.sequence.sum_logprobs_all = 0.0;
+            decoder.sequence.sum_logprobs = -INFINITY;
+            decoder.sequence.avg_logprobs = -INFINITY;
+            decoder.sequence.entropy = 0.0;
+            decoder.sequence.score = -INFINITY;
+
+            decoder.seek_delta = 100 * WHISPER_CHUNK_SIZE;
+
+            decoder.failed = false;
+            decoder.completed = false;
+            decoder.has_ts = false;
+
+            if (params.grammar_rules != nullptr)
+            {
+                decoder.grammar = whisper_grammar_init(params.grammar_rules, params.n_grammar_rules, params.i_start_rule);
+            }
+            else
+            {
+                decoder.grammar = {};
+            }
+
+            state->custom_encoder.custom_decoder_callback
+            (
+                state->custom_encoder.custom_decoder_context,
+                &decoder,
+                [](void* context, whisper_token* tokens, size_t tokens_count)
+                {
+                    auto& decoderTokens = ((whisper_decoder*)context)->sequence.tokens;
+                    decoderTokens.insert(decoderTokens.begin(), tokens, tokens + tokens_count);
+                }
+            );
+        }
+        else // TODO: Needs formatting, but I don't want to blow patch
         for (int it = 0; it < (int) temperatures.size(); ++it) {
             const float t_cur = temperatures[it];
 
@@ -5685,13 +5807,16 @@ int whisper_full_with_state(
             //WHISPER_LOG_DEBUG("prompt_init.size() = %d, prompt.size() = %d, result_len = %d, seek_delta = %d\n", prompt_init.size(), prompt.size(), result_len, seek_delta);
 
             // update prompt_past
-            prompt_past.clear();
-            if (prompt.front() == whisper_token_prev(ctx)) {
-                prompt_past.insert(prompt_past.end(), prompt.begin() + 1, prompt.end() - prompt_init.size());
-            }
-
-            for (int i = 0; i < result_len; ++i) {
-                prompt_past.push_back(tokens_cur[i].id);
+            if(state->custom_encoder.custom_decoder_callback == nullptr)
+            {
+               prompt_past.clear();
+               if (prompt.front() == whisper_token_prev(ctx)) {
+                   prompt_past.insert(prompt_past.end(), prompt.begin() + 1, prompt.end() - prompt_init.size());
+               }
+
+               for (int i = 0; i < result_len; ++i) {
+                   prompt_past.push_back(tokens_cur[i].id);
+               }
             }
 
             if (!tokens_cur.empty() && ctx->model.n_loaded > 0) {
diff --git a/whisper.h b/whisper.h
index a5371eb..708dbad 100644
--- a/whisper.h
+++ b/whisper.h
@@ -6,6 +6,7 @@
 #include <stddef.h>
 #include <stdint.h>
 #include <stdbool.h>
+#include <vector>
 
 #ifdef __GNUC__
 #    define WHISPER_DEPRECATED(func, hint) func __attribute__((deprecated(hint)))
@@ -34,6 +35,8 @@
 #define WHISPER_HOP_LENGTH  160
 #define WHISPER_CHUNK_SIZE  30
 
+#define WHISPER_USE_CT2
+
 #ifdef __cplusplus
 extern "C" {
 #endif
@@ -84,9 +87,30 @@ extern "C" {
     typedef int32_t whisper_token;
     typedef int32_t whisper_seq_id;
 
+    struct whisper_custom_encoder
+    {
+        void* custom_encoder_context;
+        void(*custom_encoder_callback)
+       (
+            void* custom_encoder_context,
+            ggml_tensor* mel_in,
+            ggml_tensor* encoded_tokens_out
+        );
+
+        void* custom_decoder_context;
+        void(*custom_decoder_callback)
+        (
+            void* custom_decoder_context,
+            void* submit_tokens_context,
+            void(*submit_tokens)(void* submit_tokens_context, whisper_token* tokens, size_t tokens_count)
+        );
+    };
+
     struct whisper_context_params {
         bool  use_gpu;
         int   gpu_device;  // CUDA device
+
+        whisper_custom_encoder custom_encoder;
     };
 
     typedef struct whisper_token_data {
@@ -217,6 +241,15 @@ extern "C" {
                                int   n_samples,
                                int   n_threads);
 
+    WHISPER_API int whisper_pcm_to_mel_no_state
+   (
+        const float* samples_in,
+        int   n_samples,
+        float* mel_out,
+        size_t* mel_size_in_out,
+        int   n_threads
+    );
+
     WHISPER_API int whisper_pcm_to_mel_with_state(
             struct whisper_context * ctx,
               struct whisper_state * state,
@@ -257,6 +290,16 @@ extern "C" {
                                int   n_len,
                                int   n_mel);
 
+    struct whisper_mel {
+        int n_len;
+        int n_len_org;
+        int n_mel;
+
+        std::vector<float> data;
+    };
+
+    WHISPER_API whisper_mel* whisper_extract_mel(struct whisper_context* ctx);
+
     // Run the Whisper encoder on the log mel spectrogram stored inside the default state in the provided whisper context.
     // Make sure to call whisper_pcm_to_mel() or whisper_set_mel() first.
     // offset can be used to specify the offset of the first frame in the spectrogram.


Disclaimers:

  1. Runtimes measure only the actual evaluation. Model loading times, teardown and the first warmup run are not considered.
    The reasoning is simple: on short transcriptions we typically care about response time, so the model will be preloaded; on long transcriptions the loading time amortizes to nothing.

  2. My implementation of the Audio Context size reduction is not correct for multi-pass decoding (30s+ clips).
    Whisper.cpp nailed this down; I did not bother, as it is outside my target use case for now.

  3. All measurements were done on Win10, i7-9700K, MSVC, with the Whisper Base model (multilingual) and the following ~7s voice-command audio:
    https://github.com/Picovoice/rhino/blob/4a69bd13dceed859911f4e360dfde0cdb9d9fbf0/resources/audio_samples/test_within_context.wav

@bobqianic
Collaborator

bobqianic commented Mar 29, 2024

WhisperCpp - just downloaded and compiled as-is, with /arch:AVX2; not sure what the default compute type is

That's FP32. I believe if you give the OpenBLAS version a try, you'll find its performance quite similar to CT2, with hardly any noticeable difference. (v1.5.4)

@InflexCZE

InflexCZE commented Mar 29, 2024

Great, I completely missed that in the docs, thanks.
It should be BIG RED: if you want performance, enable this :)

So for the BLAS backend I added measurements #27 and #28 to the tables above.
A solid ~1.7x performance improvement; on AC 512 it actually beats CT2. On 1500 there is something more to it though, not even close.

In the ggml code I also noticed there is some support for MKL too. I tried to measure it, but it's throwing on me, not sure why.
Maybe you'll see what's going on? Maybe support for it is not finished yet?

Intel oneMKL ERROR: Parameter 9 was incorrect on entry to cblas_sgemm.

 ne1  = 1500
 ne01 = 512
 ne10 = 512
 ne00 = 512

 cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
          ne1, ne01, ne10,
         1.0f,    y, ne10,
                  x, ne00,
         0.0f,    d, ne01);

.exe!ggml_compute_forward_mul_mat(const ggml_compute_params * params, ggml_tensor * dst) Line 10629 C
.exe!ggml_compute_forward(ggml_compute_params * params, ggml_tensor * tensor) Line 16061    C
.exe!ggml_graph_compute_thread(void * data) Line 18157  C
.exe!ggml_graph_compute(ggml_cgraph * cgraph, ggml_cplan * cplan) Line 18490    C
.exe!ggml_backend_cpu_graph_compute(ggml_backend * backend, ggml_cgraph * cgraph) Line 809  C
.exe!ggml_backend_graph_compute_async(ggml_backend * backend, ggml_cgraph * cgraph) Line 282    C
.exe!ggml_backend_graph_compute(ggml_backend * backend, ggml_cgraph * cgraph) Line 276  C
.exe!ggml_graph_compute_helper(ggml_backend * backend, ggml_cgraph * graph, int n_threads) Line 190 C++
.exe!whisper_encode_internal(whisper_context & wctx, whisper_state & wstate, const int mel_offset, const int n_threads, bool(*)(void *) abort_callback, void * abort_callback_data) Line 2088   C++
.exe!whisper_full_with_state(whisper_context * ctx, whisper_state * state, whisper_full_params params, const float * samples, int n_samples) Line 5247  C++
.exe!whisper_full(whisper_context * ctx, whisper_full_params params, const float * samples, int n_samples) Line 5943    C++
Full context of locals

blck_0  -3689348814741910324    const __int64
blck_1  -3689348814741910324    const __int64
d   0x0000026db2e0e080 {-431602080.}    float *
desired_wsize   1048576 const unsigned __int64
dr0 -3689348814741910324    const __int64
dr1 -3689348814741910324    const __int64
dst 0x0000026dff7242d0 {type=GGML_TYPE_F32 (0) backend=GGML_BACKEND_TYPE_CPU (0) buffer=0x0000026dfb5ace80 {...} ...}   ggml_tensor *
from_float_to_vec_dot   0x00007ff62c9bf856 {.exe!ggml_fp32_to_fp16_row} void(*)(const float *, void *, int)
i02 0   const __int64
i03 0   const __int64
i12 0   __int64
i13 0   __int64
ir010   -3689348814741910324    const __int64
ir011   -3689348814741910324    const __int64
ir110   -3689348814741910324    const __int64
ir111   -3689348814741910324    const __int64
ith 0   const int
ith0    -3689348814741910324    const __int64
ith1    -3689348814741910324    const __int64
nb00    2   const unsigned __int64
nb0 4   const unsigned __int64
nb01    1024    const unsigned __int64
nb1 2048    const unsigned __int64
nb02    524288  const unsigned __int64
nb2 3072000 const unsigned __int64
nb03    524288  const unsigned __int64
nb3 3072000 const unsigned __int64
nb10    4   const unsigned __int64
nb11    2048    const unsigned __int64
nb12    3072000 const unsigned __int64
nb13    3072000 const unsigned __int64
ne_plane    262144  const __int64
ne00    512 const __int64
ne0 512 const __int64
ne01    512 const __int64
ne1 1500    const __int64
ne02    1   const __int64
ne2 1   const __int64
ne03    1   const __int64
ne3 1   const __int64
ne10    512 const __int64
ne11    1500    const __int64
ne12    1   const __int64
ne13    1   const __int64
nr0 -3689348814741910324    const __int64
nr1 -3689348814741910324    const __int64
nrc -3689348814741910324    __int64
nth 1   const int
nth0    -3689348814741910324    const __int64
nth1    -3689348814741910324    const __int64
params  0x000000f7a3cf9b18 {type=GGML_TASK_TYPE_COMPUTE (1) ith=0 nth=1 ...}    const ggml_compute_params *
r2  1   const __int64
r3  1   const __int64
row_size    14757395258967641292    const unsigned __int64
src0    0x0000026dfcc79ca0 {type=GGML_TYPE_F16 (1) backend=GGML_BACKEND_TYPE_CPU (0) buffer=0x0000026de331ee70 {...} ...}   const ggml_tensor *
src1    0x0000026dff723c90 {type=GGML_TYPE_F32 (0) backend=GGML_BACKEND_TYPE_CPU (0) buffer=0x0000026dfb5ace80 {...} ...}   const ggml_tensor *
src1_col_stride 14757395258967641292    const unsigned __int64
src1_cont   true    const bool
t0  0   __int64
tmp 0x000000f7a3cf8fe0 {-107374176., -107374176., -107374176., -107374176., -107374176., -107374176., -107374176., ...} float[32]
type    GGML_TYPE_F16 (1)   ggml_type
vec_dot 0x00007ff62d234110 {.exe!ggml_vec_dot_f16(int, float *, unsigned __int64, unsigned short *, unsigned __int64, unsigned short *, unsigned __int64, int)} void(*)(int, float *, unsigned __int64, const void *, unsigned __int64, const void *, unsigned __int64, int)
vec_dot_num_rows    1   const __int64
vec_dot_type    GGML_TYPE_F16 (1)   ggml_type
wdata   0xcccccccccccccccc  const void *
x   0x0000026dbdd91070  const void *
y   0x0000026db2b20080 {-0.218084723}   const float *
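
For whoever digs into this, a stand-alone restatement of that cblas_sgemm call with the dimensions from the dump and the standard CBLAS parameter names written out (the MKL header name is an assumption), so the shape and leading-dimension constraints are easy to check against what MKL expects:

// Same shapes as the failing call above: ne1 = 1500, ne01 = 512, ne10 = ne00 = 512.
// With CblasRowMajor: A (no-trans, m x k) needs lda >= k, B (trans, n x k) needs
// ldb >= k, and C (m x n) needs ldc >= n.
#include <mkl_cblas.h>  // header name assumed; plain cblas.h for other BLAS builds
#include <vector>

int main() {
    const int m = 1500, n = 512, k = 512;
    std::vector<float> a((size_t) m * k, 1.0f); // "y" in the ggml call
    std::vector<float> b((size_t) n * k, 1.0f); // "x" in the ggml call
    std::vector<float> c((size_t) m * n, 0.0f); // "d" in the ggml call

    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                /*m=*/ m, /*n=*/ n, /*k=*/ k,
                /*alpha=*/ 1.0f, a.data(), /*lda=*/ k,
                                 b.data(), /*ldb=*/ k,
                /*beta=*/  0.0f, c.data(), /*ldc=*/ n);
    return 0;
}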

@AIWintermuteAI
Contributor

It looks like the build compiled with OpenBLAS is actually worse on Raspberry Pi 5 (1743.21 ms vs. 6232.27 ms; I ran the jfk sample a few times, and while the numbers differed slightly, the overall result was the same).
