Install all required dependencies via the CLI:

```bash
python3 -m venv .venv && source .venv/bin/activate
sudo apt install ffmpeg
pip3 install pywhispercpp faster-whisper jiwer sentence_transformers
```

Choose the Whisper model identifier that you would like to use.

In [1]:
desired_whisper_model = input("Whisper model version (e.g., base.en, base):")

Read in all available audio files from `samples/audio`. This assumes that every audio file has a matching ground-truth file in `samples/truth`.

In [2]:
import os

audio_dir = "samples/audio/"
transcription_dir = "samples/transcription/"
truth_dir = "samples/truth/"

files = os.listdir(audio_dir)

filenames = []
for file in files:
    name, ext = os.path.splitext(file)
    filenames.append(name)

print(filenames)

['3min47sec', '13min56sec', '0min12sec']
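The assumption that every audio file has a matching ground-truth file can be checked up front. The following is a minimal sketch, not part of the original notebook: the helper name `find_missing_truth` is hypothetical, and it assumes that ground-truth files use the `.txt` extension, as the evaluation cell does.

```python
from pathlib import Path


def find_missing_truth(audio_dir, truth_dir):
    """Return the sorted base names of audio files that lack a truth .txt file.

    Hypothetical helper for illustration; assumes truth files end in .txt.
    """
    # Base names (without extension) of every regular file in the audio folder.
    audio_names = {p.stem for p in Path(audio_dir).iterdir() if p.is_file()}
    # Base names of every .txt file in the truth folder.
    truth_names = {p.stem for p in Path(truth_dir).glob("*.txt")}
    return sorted(audio_names - truth_names)
```

Calling `find_missing_truth("samples/audio/", "samples/truth/")` before transcribing would list any audio files that cannot be evaluated later; an empty list means every file has a match.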


Use pywhispercpp, the Python binding for whisper.cpp, to run inference on Whisper models in GGML format. whisper.cpp is cross-platform and has bindings for several languages.

Use faster-whisper for accelerated inference with CTranslate2. Python only.

In [3]:
transcript = ""  # holds the most recent transcript

# Make sure the output directory exists before writing transcripts.
os.makedirs(transcription_dir, exist_ok=True)

# input() returns a string, so compare against "1", not the integer 1.
choice = input("Choose one using the number: (1) pywhispercpp or (2) faster-whisper: ")

if choice.strip() == "1":
    from pywhispercpp.model import Model

    model = Model(desired_whisper_model, n_threads=6, models_dir="./models")

    for filename in filenames:
        audio_file = f"{audio_dir}{filename}.wav"
        print(f"Filename: {audio_file}")
        segments = model.transcribe(audio_file, speed_up=True)

        transcript = " ".join(seg.text for seg in segments)

        # Save one transcript per audio file.
        transcript_file = f"{transcription_dir}{filename}.txt"
        with open(transcript_file, "w") as f:
            f.write(transcript)

else:
    from faster_whisper import WhisperModel
    import time

    model = WhisperModel(
        desired_whisper_model,
        device="cpu",
        compute_type="int8",
        cpu_threads=6,
        download_root="./models",
    )

    for filename in filenames:
        audio_file = f"{audio_dir}{filename}.wav"
        segments, info = model.transcribe(audio_file)

        start = time.perf_counter()
        segments = list(segments)  # The transcription will actually run here.
        end = time.perf_counter()
        print(f"Filename: {audio_file}")
        print(f"Transcribed in {end - start:0.4f} seconds")

        transcript = " ".join(seg.text for seg in segments)

        # Save one transcript per audio file.
        transcript_file = f"{transcription_dir}{filename}.txt"
        with open(transcript_file, "w") as f:
            f.write(transcript)



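Calculate Word Error Rate (WER) and Word Information Loss (WIL) with jiwer. Both metrics are computed from the word-level alignment between the ground truth and the transcript. WER is the number of substituted, deleted, and inserted words divided by the number of words in the ground truth. WIL estimates the fraction of word information lost, based on how many words match relative to the lengths of both texts. Lower is better for both, and neither metric compares meaning.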
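For reference, both metrics can be derived from the four alignment counts: hits, substitutions, deletions, and insertions. The sketch below is illustrative only and does not replace jiwer, which also performs the alignment that produces these counts. The function names are hypothetical, and the example counts are worked out by hand for a made-up sentence pair.

```python
def wer_from_counts(hits, subs, dels, ins):
    """Word Error Rate: errors divided by the number of reference words."""
    ref_len = hits + subs + dels
    return (subs + dels + ins) / ref_len


def wil_from_counts(hits, subs, dels, ins):
    """Word Information Loss: 1 - hits^2 / (reference length * hypothesis length)."""
    ref_len = hits + subs + dels
    hyp_len = hits + subs + ins
    if ref_len == 0 or hyp_len == 0:
        return 1.0  # nothing matched, so all information counts as lost
    return 1 - (hits * hits) / (ref_len * hyp_len)


# Truth:      "the quick brown fox"        (4 words)
# Transcript: "the quick brown dog jumps"  (5 words)
# Alignment:  3 hits, 1 substitution (fox -> dog), 0 deletions, 1 insertion (jumps)
example_wer = wer_from_counts(hits=3, subs=1, dels=0, ins=1)  # 2 / 4
example_wil = wil_from_counts(hits=3, subs=1, dels=0, ins=1)  # 1 - 9 / 20
print(f"WER: {example_wer:.2f}, WIL: {example_wil:.2f}")
```

Because insertions are counted against the reference length, WER can exceed 1 for a transcript with many spurious words, whereas WIL stays between 0 and 1.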

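A sentence embedding model is used to compare meaning: the ground truth and the transcript are each encoded into a vector, and the cosine similarity between the two vectors is reported as an experimental document-similarity score.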
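Cosine similarity itself is a small calculation: the dot product of two vectors divided by the product of their lengths. The following pure-Python sketch shows the idea on toy vectors; the function name and the example vectors are made up for illustration, and the notebook relies on the sentence-transformers utility rather than this code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for a zero vector; report 0
    return dot / (norm_a * norm_b)


# Toy 2-dimensional "embeddings" (real sentence embeddings have hundreds of dimensions).
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Vectors that point in the same direction score close to 1 regardless of their length, and orthogonal vectors score 0, which is why the score reflects direction in embedding space rather than text length.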

In [None]:
from jiwer import process_words  # wer() is not imported: the loop below assigns to a variable named wer
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")


for filename in filenames:
    truth = ""
    transcript = ""

    transcript_file = f"{transcription_dir}{filename}.txt"
    with open(transcript_file, "r") as f:
        transcript = f.read()

    truth_file = f"{truth_dir}{filename}.txt"
    with open(truth_file, "r") as f:
        truth = f.read()

    output = process_words(truth, transcript)

    wer = output.wer
    wil = output.wil

    truth_embedding = model.encode(truth, convert_to_tensor=True)
    transcript_embedding = model.encode(transcript, convert_to_tensor=True)

    document_similarity = util.pytorch_cos_sim(
        truth_embedding, transcript_embedding
    ).item()

    print(f"[{filename}] Word Error Rate: {wer}")
    print(f"[{filename}] Word Information Loss: {wil}")
    print(f"[{filename}] (EXPERIMENTAL) Document Similarity: {document_similarity}")