This notebook eases deployment of CPU-based Whisper by filling in all the settings. Setup is outlined in https://github.com/ggerganov/whisper.cpp. I had difficulty with CUDA on Fedora Silverblue: I could only build it inside a distrobox, and the resulting transcripts had artefacts or hallucinations. This ASR setup has the highest accuracy but the lowest speed.

In [1]:
import numpy as np
import subprocess
import os
import glob

np.set_printoptions(linewidth=50)

# change to reflect your local directory and file name
file_dir = '/var/home/fraser/machine_learning/whisper.cpp/samples/'
# use the file name with spaces replaced by '-' (the next cell renames files so ffmpeg can process them)
audio_file = 'Track-13.wav'
output_file = audio_file + '-output.wav'
print(audio_file)
print(output_file)

Track-13.wav
Track-13.wav-output.wav


In [2]:
# rename all audio files with spaces in their name
# Get a list of all audio files (.m4a, .mp3, .ogg, and .wav) in the directory
files = glob.glob(os.path.join(file_dir, '*.m4a')) + \
        glob.glob(os.path.join(file_dir, '*.mp3')) + \
        glob.glob(os.path.join(file_dir, '*.ogg')) + \
        glob.glob(os.path.join(file_dir, '*.wav'))

# Iterate over the files
for file in files:
    directory, name = os.path.split(file)
    # Check only the file name, so spaces in directory names are left alone
    if ' ' in name:
        # Replace the spaces with hyphens
        new_name = os.path.join(directory, name.replace(' ', '-'))
        # Rename the file
        os.rename(file, new_name)
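
Since `os.rename` silently replaces an existing file on Linux, it may be worth previewing the renames first. The sketch below is only an illustration: `plan_renames` is a hypothetical helper of my own naming, and the example paths are made up. It returns the renames that look safe and, separately, those whose hyphenated target already exists.

```python
import os

def plan_renames(paths):
    """Split space-to-hyphen renames into safe ones and conflicting ones.

    Only the file name is changed; directory components are left alone.
    Nothing is renamed on disk.
    """
    taken = set(paths)
    plan, conflicts = [], []
    for path in paths:
        directory, name = os.path.split(path)
        if ' ' not in name:
            continue
        target = os.path.join(directory, name.replace(' ', '-'))
        if target in taken or os.path.exists(target):
            # Renaming would overwrite another file
            conflicts.append((path, target))
        else:
            plan.append((path, target))
            taken.add(target)
    return plan, conflicts

# Made-up paths for illustration only
example = [
    '/nonexistent-demo dir/Track 13.wav',
    '/nonexistent-demo dir/Track 14.wav',
    '/nonexistent-demo dir/Track-14.wav',
]
plan, conflicts = plan_renames(example)
for old, new in plan:
    print('rename:  ', old, '->', new)
for old, new in conflicts:
    print('conflict:', old, '->', new)
```

In the notebook, the same function could be fed the `files` list from the cell above before any call to `os.rename`.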

In [3]:
# convert audio_file to a 16 kHz, mono, 16-bit PCM WAV for transcription
# -y overwrites an existing file with the same name
try:
    subprocess.run(['ffmpeg', '-y', '-i', file_dir + audio_file,
                    '-ar', '16000', '-ac', '1', '-c:a', 'pcm_s16le',
                    file_dir + output_file], check=True)
    print("Audio converted successfully.")
except subprocess.CalledProcessError as e:
    print(f"Audio conversion failed with error {e.returncode}.")

Audio converted successfully.


ffmpeg version 4.4 Copyright (c) 2000-2021 the FFmpeg developers
  built with gcc 9.3.0 (crosstool-NG 1.24.0.133_b0863d8_dirty)
  configuration: --prefix=/root/miniconda3/envs/conda_bld/conda-bld/ffmpeg_1635335682798/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_place --cc=/root/miniconda3/envs/conda_bld/conda-bld/ffmpeg_1635335682798/_build_env/bin/x86_64-conda-linux-gnu-cc --disable-doc --disable-openssl --enable-avresample --enable-hardcoded-tables --enable-libfreetype --enable-libopenh264 --enable-pic --enable-pthreads --enable-shared --disable-static --enable-version3 --enable-zlib --enable-libmp3lame
  libavutil      56. 70.100 / 56. 70.100
  libavcodec     58.134.100 / 58.134.100
  libavformat    58. 76.100 / 58. 76.100
  libavdevice    58. 13.100 / 58. 13.100
  libavfilter     7.110.100 /  7.110.100
  libavresample   4.  0.  0 /  4.  0.  0
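
A quick sanity check on the converted file can catch a bad conversion before the slow transcription step. This is a minimal sketch using Python's standard `wave` module; `check_wav_format` is a hypothetical helper name, and the defaults assume the 16 kHz, mono, 16-bit target set by the ffmpeg flags above. The demo writes its own short silent file so it runs without my audio.

```python
import os
import tempfile
import wave

def check_wav_format(path, rate=16000, channels=1, sample_width=2):
    """Return True if the WAV file has the expected sample rate, channel
    count, and sample width in bytes (2 bytes = 16-bit)."""
    with wave.open(path, 'rb') as wav:
        return (wav.getframerate() == rate
                and wav.getnchannels() == channels
                and wav.getsampwidth() == sample_width)

# Self-contained demo: write 0.1 s of silence in the target format, then check it
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.wav')
with wave.open(demo_path, 'wb') as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(16000)
    wav.writeframes(b'\x00\x00' * 1600)

ok = check_wav_format(demo_path)
print(ok)
```

In this notebook the call would be `check_wav_format(file_dir + output_file)`.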

In [4]:
'''
# transcribe using the distil-whisper model: this model is hallucinating presently
# but once that is fixed, I can swap out the slower CPU model for the CUDA enabled model
# so this code is for future use
try:
    subprocess.run(['transcribe -t 24 -m /var/home/fraser/machine_learning/whisper.cpp/models/ggml-large-32-2.en.bin -f ' + file_dir + output_file + ' -otxt'], shell=True, check=True)
    print("Transcription executed successfully and saved in " + file_dir + output_file)
except subprocess.CalledProcessError as e:
    print(f"Transcription failed with error {e.returncode}.")
'''


In [5]:
# transcribe using the large quantized CPU model
# -otxt writes the transcript to a text file named after the input with '.txt' appended
try:
    subprocess.run(['transcribe', '-t', '24',
                    '-m', '/var/home/fraser/machine_learning/whisper.cpp/models/ggml-model-whisper-large-q5_0.bin',
                    '-f', file_dir + output_file, '-otxt'], check=True)
    print("Transcription executed successfully and saved in " + file_dir + output_file)
except subprocess.CalledProcessError as e:
    print(f"Transcription failed with error {e.returncode}.")


whisper_init_from_file_with_params_no_state: loading model from '/var/home/fraser/machine_learning/whisper.cpp/models/ggml-model-whisper-large-q5_0.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 8
whisper_model_load: qntvr         = 1
whisper_model_load: type          = 5 (large)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs       = 99
whisper_model_load:      CPU total size =  1080.10 MB
whisper_model_load: model size    = 1080.10 MB
whisper_init_state: kv self size  =  220.20 MB
whisper_init_state: kv cross size =  245.76 MB
whisper_init


[00:00:00.000 --> 00:00:10.760]   So, is it possible to insert formatting or markdown within the context of the text by
[00:00:10.760 --> 00:00:11.960]   speaking it out?
[00:00:11.960 --> 00:00:23.000]   So for instance, "*bold*" is one way of formatting text, while another one might be "\n" and
[00:00:23.000 --> 00:00:25.800]   that would indicate a new paragraph.
[00:00:25.800 --> 00:00:33.040]   So it's an interesting experiment and there may be some ways of melding both natural speech
[00:00:33.040 --> 00:00:44.040]   and some brief commands in order to reduce the final editing load that is required.

Transcription executed successfully and saved in /var/home/fraser/machine_learning/whisper.cpp/samples/Track-13.wav-output.wav


output_txt: saving output to '/var/home/fraser/machine_learning/whisper.cpp/samples/Track-13.wav-output.wav.txt'

whisper_print_timings:     load time =   577.18 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    42.96 ms
whisper_print_timings:   sample time =   175.33 ms /   528 runs (    0.33 ms per run)
whisper_print_timings:   encode time = 28255.19 ms /     2 runs (14127.60 ms per run)
whisper_print_timings:   decode time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   batchd time = 12544.39 ms /   521 runs (   24.08 ms per run)
whisper_print_timings:   prompt time =   719.62 ms /    68 runs (   10.58 ms per run)
whisper_print_timings:    total time = 42323.57 ms
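
To compare runs, the timing lines can be reduced to a single real-time factor (processing time divided by audio length). The sketch below is an illustration, not part of whisper.cpp: `parse_timings` is a hypothetical helper, the regular expression assumes the log layout shown above, and the audio length is taken from the end timestamp of the last transcript segment (00:00:44.040).

```python
import re

def parse_timings(log_text):
    """Map each whisper_print_timings label (load, mel, encode, ...) to its
    time in milliseconds. Lines without a 'time' field are skipped."""
    pattern = re.compile(r'whisper_print_timings:\s*(\w+) time\s*=\s*([\d.]+) ms')
    return {label: float(ms) for label, ms in pattern.findall(log_text)}

# A few lines copied from the log above
log = """
whisper_print_timings:     load time =   577.18 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:   encode time = 28255.19 ms /     2 runs (14127.60 ms per run)
whisper_print_timings:    total time = 42323.57 ms
"""

timings = parse_timings(log)
audio_seconds = 44.04  # end of the last transcript segment
rtf = timings['total'] / 1000 / audio_seconds
print(f"real-time factor: {rtf:.2f}")
```

A value below 1 means the run finished faster than the audio would take to play.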


It took 7 min to transcribe a 2 min 24 s file, but all punctuation, etc., is *perfect*. (The run logged above took about 42 s for roughly 44 s of audio.) distil-whisper on CUDA takes 2.6 s but makes some errors. Attempts at injecting markdown formatting commands failed because the transcript wraps each spoken command in quotation marks (see "*bold*" and "\n" above) for some odd reason. When this changes, it should be possible to inject formatting while dictating.
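
Until then, one possible workaround is a post-processing pass over the transcript text. This is only a sketch under my own assumptions: `unquote_commands` and the `COMMANDS` list are hypothetical, and it assumes the model always wraps a command in plain double quotes exactly as in the transcript above.

```python
# Hypothetical list of spoken commands to unwrap; r'\n' is a literal
# backslash followed by 'n', as it appears in the transcript
COMMANDS = ['*bold*', r'\n']

def unquote_commands(text, commands=COMMANDS):
    """Strip the double quotes around known formatting commands."""
    for command in commands:
        text = text.replace(f'"{command}"', command)
    return text

sample = r'So for instance, "*bold*" is one way, while another one might be "\n" and'
cleaned = unquote_commands(sample)
print(cleaned)
```

Ordinary quoted speech is left untouched because only exact matches from the command list are unwrapped.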