This notebook eases deployment of whisper.cpp on the CPU by filling in all the settings in one place. Setup is outlined in https://github.com/ggerganov/whisper.cpp. I had difficulty with CUDA on Fedora Silverblue: I could only run make inside a distrobox, and the end result had artefacts or hallucinations. This ASR setup is the most accurate but also the slowest.

CAUTION: MAKE SURE TO BACK UP AUDIO FILES IF THEY HAVE SPACES IN THEIR NAMES, AS THEY WILL BE RENAMED (AND THE METADATA ALTERED). For file names without spaces, the original file is not renamed and a copy is made in a compatible audio format.

In [1]:
import numpy as np
import subprocess
import os
import glob

np.set_printoptions(linewidth=50)

# change to reflect your local directory and file name
directory = '/var/home/fraser/machine_learning/whisper.cpp/samples/'
# give the file name with any spaces replaced by '-' (the cell below renames files on disk for ffmpeg processing)
audio_file = '1.m2a'
output_file = audio_file + '-output.wav'
print(audio_file)
print(output_file)

1.m2a
1.m2a-output.wav


In [2]:
# rename all audio files with spaces in their name
# added *.m2a for podcasts
files = glob.glob(os.path.join(directory, '*.m4a')) + \
        glob.glob(os.path.join(directory, '*.mp3')) + \
        glob.glob(os.path.join(directory, '*.m2a')) + \
        glob.glob(os.path.join(directory, '*.ogg')) + \
        glob.glob(os.path.join(directory, '*.wav'))

# Iterate over the files
for file in files:
    folder, name = os.path.split(file)
    # Only check the file name, so spaces in the directory path are left alone
    if ' ' in name:
        # Replace the spaces with hyphens and rename the file
        os.rename(file, os.path.join(folder, name.replace(' ', '-')))
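
The caution at the top recommends backing up files before they are renamed. A minimal sketch of one way to automate that, assuming a hypothetical `backup/` subfolder inside the audio directory (the helper name `backup_and_rename` and its parameters are illustrative and not part of this notebook). Each audio file whose name contains a space is first copied with `shutil.copy2`, which also tries to preserve timestamps, and only then renamed.

```python
import os
import shutil

# Extensions handled by the rename cell in this notebook
AUDIO_EXTENSIONS = ('.m4a', '.mp3', '.m2a', '.ogg', '.wav')


def backup_and_rename(directory, backup_subdir='backup'):
    """Copy audio files with spaces in their names to a backup folder, then rename them.

    Returns a list of (old_path, new_path) pairs for the renamed files.
    'backup_subdir' is a hypothetical folder name chosen for this sketch.
    """
    renamed = []
    backup_dir = os.path.join(directory, backup_subdir)
    for name in sorted(os.listdir(directory)):
        old_path = os.path.join(directory, name)
        # Skip folders, non-audio files, and names without spaces
        if not os.path.isfile(old_path):
            continue
        if not name.endswith(AUDIO_EXTENSIONS) or ' ' not in name:
            continue
        os.makedirs(backup_dir, exist_ok=True)
        # Keep an untouched copy under the original name
        shutil.copy2(old_path, os.path.join(backup_dir, name))
        new_path = os.path.join(directory, name.replace(' ', '-'))
        os.rename(old_path, new_path)
        renamed.append((old_path, new_path))
    return renamed
```

Calling `backup_and_rename(directory)` in place of the loop above would leave the originals under `backup/` with their original names.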

In [3]:
# convert audio_file to 16 kHz mono 16-bit PCM WAV (transcription happens in a later cell)
# overwrites existing file with same name
try:
    # -y answers ffmpeg's overwrite prompt; an argument list avoids shell quoting issues
    subprocess.run(['ffmpeg', '-y', '-i', directory + audio_file,
                    '-ar', '16000', '-ac', '1', '-c:a', 'pcm_s16le',
                    directory + output_file], check=True)
    print("Audio converted successfully.")
except subprocess.CalledProcessError as e:
    print(f"Audio conversion failed with error {e.returncode}.")

Audio converted successfully.


ffmpeg version 4.4 Copyright (c) 2000-2021 the FFmpeg developers
  built with gcc 9.3.0 (crosstool-NG 1.24.0.133_b0863d8_dirty)
  configuration: --prefix=/root/miniconda3/envs/conda_bld/conda-bld/ffmpeg_1635335682798/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_place --cc=/root/miniconda3/envs/conda_bld/conda-bld/ffmpeg_1635335682798/_build_env/bin/x86_64-conda-linux-gnu-cc --disable-doc --disable-openssl --enable-avresample --enable-hardcoded-tables --enable-libfreetype --enable-libopenh264 --enable-pic --enable-pthreads --enable-shared --disable-static --enable-version3 --enable-zlib --enable-libmp3lame
  libavutil      56. 70.100 / 56. 70.100
  libavcodec     58.134.100 / 58.134.100
  libavformat    58. 76.100 / 58. 76.100
  libavdevice    58. 13.100 / 58. 13.100
  libavfilter     7.110.100 /  7.110.100
  libavresample   4.  0.  0 /  4.  0.  0
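
The conversion step can be wrapped in a small helper that builds the ffmpeg argument list and the output path together, which makes it easy to reuse for several files. A minimal sketch, assuming a hypothetical helper name `build_ffmpeg_command`; the flags are `-ar 16000 -ac 1 -c:a pcm_s16le` as in the conversion cell, together with ffmpeg's `-y` option, which overwrites an existing output file without prompting.

```python
import os


def build_ffmpeg_command(directory, audio_file):
    """Return (command, output_path) for converting audio to 16 kHz mono 16-bit PCM WAV.

    Hypothetical helper for illustration; the output name follows the
    '<audio_file>-output.wav' convention used in this notebook.
    """
    input_path = os.path.join(directory, audio_file)
    output_path = os.path.join(directory, audio_file + '-output.wav')
    command = [
        'ffmpeg',
        '-y',                 # overwrite the output file without asking
        '-i', input_path,
        '-ar', '16000',       # resample to 16 kHz
        '-ac', '1',           # downmix to mono
        '-c:a', 'pcm_s16le',  # 16-bit little-endian PCM
        output_path,
    ]
    return command, output_path


command, output_path = build_ffmpeg_command('/tmp/samples', '1.m2a')
print(command)
```

The returned list can be passed directly to `subprocess.run(command, check=True)` without `shell=True`.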

In [4]:
'''
# transcribe using the whisper-distil model: this model is hallucinating presently
# but once that is fixed, I can swap out the slower CPU model for the CUDA enabled model
# so this code is for future use
try:
    subprocess.run(['transcribe -t 24 -m /var/home/fraser/machine_learning/whisper.cpp/models/ggml-large-32-2.en.bin -f ' 
                    + directory + output_file + ' -otxt'], shell=True, check=True)
    print("Transcription executed successfully and saved in " + directory + output_file)
except subprocess.CalledProcessError as e:
    print(f"Transcription failed with error {e.returncode}.")
'''


In [5]:
# transcribe using the large quantized CPU model, output text file
model = '/var/home/fraser/machine_learning/whisper.cpp/models/ggml-model-whisper-large-q5_0.bin'
try:
    # -t sets the thread count; -otxt writes the transcript to <input file>.txt
    subprocess.run(['transcribe', '-t', '24', '-m', model,
                    '-f', directory + output_file, '-otxt'], check=True)
    print("Transcription executed successfully and saved in " + directory + output_file + ".txt")
except subprocess.CalledProcessError as e:
    print(f"Transcription failed with error {e.returncode}.")

whisper_init_from_file_with_params_no_state: loading model from '/var/home/fraser/machine_learning/whisper.cpp/models/ggml-model-whisper-large-q5_0.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 8
whisper_model_load: qntvr         = 1
whisper_model_load: type          = 5 (large)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs       = 99
whisper_model_load:      CPU total size =  1080.10 MB
whisper_model_load: model size    = 1080.10 MB
whisper_init_state: kv self size  =  220.20 MB
whisper_init_state: kv cross size =  245.76 MB
whisper_init


[00:00:00.000 --> 00:00:27.740]   [music]
[00:00:27.740 --> 00:00:29.100]   Here's your Closing Bell Brief
[00:00:29.100 --> 00:00:31.100]   for Monday, September 25th.
[00:00:31.100 --> 00:00:33.520]   I'm J.R. Whalen for The Wall Street Journal.
[00:00:33.520 --> 00:00:35.120]   U.S. stocks opened the week higher,
[00:00:35.120 --> 00:00:36.960]   snapping four session losing streaks
[00:00:36.960 --> 00:00:39.260]   for all three major U.S. indexes.
[00:00:39.260 --> 00:00:42.260]   The Dow Jones Industrial Average rose 43 points
[00:00:42.260 --> 00:00:44.620]   to close at 34,007.
[00:00:44.620 --> 00:00:47.400]   The S&P 500 was up 17 points,
[00:00:47.400 --> 00:00:49.700]   and the NASDAQ added 60 points.
[00:00:49.700 --> 00:00:51.700]   Stocks rose despite a continued rise
[00:00:51.700 --> 00:00:53.660]   in U.S. Treasury bond yields.
[00:00:53.660 --> 00:00:55.420]   Higher bond yields typically push up
[00:00:55.420 --> 00:00:58.260]   companies' borrowing costs and squee

output_txt: saving output to '/var/home/fraser/machine_learning/whisper.cpp/samples/1.m2a-output.wav.txt'

whisper_print_timings:     load time =   635.68 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    89.81 ms
whisper_print_timings:   sample time =   882.48 ms /  1919 runs (    0.46 ms per run)
whisper_print_timings:   encode time = 78501.54 ms /     4 runs (19625.38 ms per run)
whisper_print_timings:   decode time =    78.85 ms /     1 runs (   78.85 ms per run)
whisper_print_timings:   batchd time = 76501.24 ms /  1901 runs (   40.24 ms per run)
whisper_print_timings:   prompt time =  6640.32 ms /   407 runs (   16.32 ms per run)
whisper_print_timings:    total time = 163347.53 ms
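
To see where the time goes, the `whisper_print_timings` lines can be parsed into numbers. A minimal sketch, assuming the log text has already been captured as a string (for example via `subprocess.run(..., capture_output=True, text=True)`); the helper name `parse_timings` and the regular expression are illustrative and only tested against the line format shown above.

```python
import re

# Matches lines such as "whisper_print_timings:   encode time = 78501.54 ms / ..."
TIMING_RE = re.compile(r'whisper_print_timings:\s+(\w+) time\s+=\s+([\d.]+) ms')


def parse_timings(log_text):
    """Return a dict mapping stage name (e.g. 'encode') to milliseconds."""
    return {name: float(ms) for name, ms in TIMING_RE.findall(log_text)}


# A few lines copied from the log above
sample_log = """
whisper_print_timings:     load time =   635.68 ms
whisper_print_timings:   encode time = 78501.54 ms /     4 runs (19625.38 ms per run)
whisper_print_timings:   batchd time = 76501.24 ms /  1901 runs (   40.24 ms per run)
whisper_print_timings:    total time = 163347.53 ms
"""

timings = parse_timings(sample_log)
for stage in ('encode', 'batchd'):
    share = timings[stage] / timings['total']
    print(f'{stage}: {timings[stage] / 1000:.1f} s ({share:.0%} of total)')
```

In this run the encode and batchd stages together account for nearly all of the logged total.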


It took 7 minutes to transcribe a 2:24 (min:s) file, but all punctuation, etc., is *perfect*. whisper-distil on CUDA takes 2.6 s but makes some errors. Attempts at injecting markdown formatting commands failed because the command comes out surrounded by ' ' for some odd reason. If this changes, it should be possible to inject formatting while dictating.
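
The speed comparison can be expressed as a real-time factor (processing time divided by audio duration). A small worked example using only the figures quoted in this note; the function name `real_time_factor` is just for illustration.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Processing time divided by audio duration; values below 1 are faster than real time."""
    return processing_seconds / audio_seconds


audio_seconds = 2 * 60 + 24                         # the 2:24 clip
cpu_rtf = real_time_factor(7 * 60, audio_seconds)   # about 7 minutes with the large CPU model
cuda_rtf = real_time_factor(2.6, audio_seconds)     # 2.6 s with the distilled model on CUDA

print(f'CPU large model: {cpu_rtf:.2f}x real time')
print(f'CUDA distilled model: {cuda_rtf:.3f}x real time')
print(f'Speed ratio: {cpu_rtf / cuda_rtf:.0f}x')
```

On these figures the CPU model runs at roughly three times the clip length, while the CUDA model finishes in a small fraction of it.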