Distil-Whisper is the fastest ASR model at present: it transcribed a file in 1.9 s that took Whisper 4 min 44 s. It can process audio at over 40x real-time speed, with accuracy extremely close to Whisper's.

Prerequisites:
- torch
- cuda
- ffmpeg
- distil-whisper: https://github.com/huggingface/distil-whisper

CAUTION: BACK UP ANY AUDIO FILES THAT HAVE SPACES IN THEIR NAMES, AS THEY WILL BE RENAMED (AND THEIR METADATA ALTERED). Files without spaces in their names are not renamed. In both cases, a copy is made in a compatible audio format.

In [1]:
import torch
torch.__version__

'2.2.1+cu121'

In [2]:
# CUDA is needed for vastly faster transcription
# 47 seconds for a full 30 min podcast; 1.9 s for a file that took Whisper 4 min 44 s
torch.cuda.is_available()

True
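
The timings quoted in the comments above can be turned into a speed-up and a real-time factor with a little arithmetic. This is only an illustrative sketch: the helper names are made up for this example, and it assumes the 47 s / 30 min figures describe one run while the 1.9 s / 4 min 44 s figures describe another.

```python
def speedup(baseline_s, fast_s):
    """How many times faster the fast run is than the baseline run."""
    return baseline_s / fast_s


def real_time_factor(audio_s, processing_s):
    """Seconds of audio transcribed per second of processing time."""
    return audio_s / processing_s


# Figures quoted in the notebook comments (assumed to be wall-clock times)
whisper_s = 4 * 60 + 44        # 4 min 44 s with Whisper
distil_s = 1.9                 # same file with Distil-Whisper
podcast_audio_s = 30 * 60      # nominal 30 min podcast
podcast_processing_s = 47      # time taken to transcribe it

print(f"speed-up over Whisper: {speedup(whisper_s, distil_s):.1f}x")
print(f"real-time factor: {real_time_factor(podcast_audio_s, podcast_processing_s):.1f}x")
```

The real-time factor depends on the true length of the audio, so a nominal "30 min" duration gives only a rough figure.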

In [3]:
# conda activate py310
import torch
import subprocess
import os
import glob
import textwrap
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
print(device)

model_id = "distil-whisper/distil-large-v2"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, 
    torch_dtype=torch_dtype, 
    low_cpu_mem_usage=True, 
    use_safetensors=True,
    attn_implementation="flash_attention_2")
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

2024-03-17 11:18:28.697326: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-03-17 11:18:28.738015: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-17 11:18:29.702282: I itex/core/wrapper/itex_cpu_wrapper.cc:52] Intel Extension for Tensorflow* AVX2 CPU backend is loaded.
2024-03-17 11:18:29.704346: W itex/core/wrapper/itex_gpu_wrapper.cc:32] Could not load dynamic library: libimf.so: cannot open shared object file: No such file or directory
2024-03-17 11:18:29.736647: W itex/core/ops/op_init.cc:58] Op: _QuantizedMaxPool

cuda:0


You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
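
The model-loading cell above requests `attn_implementation="flash_attention_2"` unconditionally, which needs both a CUDA device and the separately installed flash-attn package. A minimal sketch of choosing the value up front is shown here. `pick_attn_implementation` is a hypothetical helper, not part of transformers, and falling back to `"sdpa"` assumes a transformers release that supports that option for this model.

```python
import importlib.util


def pick_attn_implementation(cuda_available, flash_attn_installed=None):
    """Return a value for attn_implementation (hypothetical helper).

    flash_attn_installed defaults to checking whether the 'flash_attn'
    module can be found, without importing it.
    """
    if flash_attn_installed is None:
        flash_attn_installed = importlib.util.find_spec("flash_attn") is not None
    if cuda_available and flash_attn_installed:
        return "flash_attention_2"
    # Assumed fallback: PyTorch scaled-dot-product attention
    return "sdpa"


# Example: on a machine without a GPU the fallback is chosen
print(pick_attn_implementation(cuda_available=False))
```

In the notebook, the result of `pick_attn_implementation(torch.cuda.is_available())` would be passed as `attn_implementation=` to `from_pretrained`.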


In [4]:
# rename all audio files with spaces in their name for ffmpeg
# Specify the audio file directory
directory = '/var/home/fraser/machine_learning/whisper.cpp/samples/'

# Get a list of all audio files (.m4a, .mp3, .m2a, .ogg, and .wav) in the directory
# *.m2a was added for podcasts
files = glob.glob(os.path.join(directory, '*.m4a')) + \
        glob.glob(os.path.join(directory, '*.mp3')) + \
        glob.glob(os.path.join(directory, '*.m2a')) + \
        glob.glob(os.path.join(directory, '*.ogg')) + \
        glob.glob(os.path.join(directory, '*.wav'))

# Iterate over the files
# CAUTION: this renames files with spaces in their names, e.g. 'Track 11.wav' becomes 'Track-11.wav'
# renaming permits processing but can alter the metadata (date file saved=today)
for file in files:
    # Split off the directory so that only the file name itself is changed
    folder, name = os.path.split(file)
    # If the file name contains a space
    if ' ' in name:
        # Replace the spaces with hyphens
        new_name = os.path.join(folder, name.replace(' ', '-'))
        # Rename the file
        os.rename(file, new_name)
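
Because the rename step changes files in place, it can help to preview what would happen first. The sketch below is one possible dry run, not the notebook's own method. `plan_renames` and `AUDIO_EXTENSIONS` are hypothetical names, and the demo uses a throwaway temporary directory rather than the real samples folder.

```python
import os
import tempfile

AUDIO_EXTENSIONS = (".m4a", ".mp3", ".m2a", ".ogg", ".wav")


def plan_renames(directory, extensions=AUDIO_EXTENSIONS):
    """List (old_path, new_path) pairs without renaming anything."""
    plan = []
    for name in sorted(os.listdir(directory)):
        # Only consider audio files whose *file name* contains a space
        if " " not in name or not name.lower().endswith(extensions):
            continue
        target = os.path.join(directory, name.replace(" ", "-"))
        # Skip the file if the hyphenated name is already taken
        if os.path.exists(target):
            continue
        plan.append((os.path.join(directory, name), target))
    return plan


# Demo in a temporary directory whose own path also contains spaces
with tempfile.TemporaryDirectory(prefix="my samples ") as demo_dir:
    for name in ("Track 11.wav", "Track 12.wav", "Track-12.wav",
                 "notes 1.txt", "clean.mp3"):
        open(os.path.join(demo_dir, name), "w").close()
    demo_plan = [(os.path.basename(old), os.path.basename(new))
                 for old, new in plan_renames(demo_dir)]

print(demo_plan)
```

Once the plan looks right, each pair could be passed to `os.rename(old, new)`.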

In [5]:
# audio file you wish to transcribe:
# make sure spaces are replaced with a '-'
# e.g., 'Track-13.wav' for 'Track 13.wav'
audio_file = '4331.m2a'

In [6]:
# convert audio file to the 16 kHz, mono, 16-bit PCM wav format expected by whisper
output_file = audio_file + '-output.wav'
print(output_file)

# convert audio_file so that it can then be transcribed to text
# -y overwrites an existing file with the same name without prompting
try:
    # passing the arguments as a list (no shell) avoids quoting problems in paths
    subprocess.run(['ffmpeg', '-y', '-i', directory + audio_file,
                    '-ar', '16000', '-ac', '1', '-c:a', 'pcm_s16le',
                    directory + output_file], check=True)
    print("Audio converted successfully.")
except subprocess.CalledProcessError as e:
    print(f"Audio conversion failed with error {e.returncode}.")

4331.m2a-output.wav


ffmpeg version 4.4 Copyright (c) 2000-2021 the FFmpeg developers
  built with gcc 9.3.0 (crosstool-NG 1.24.0.133_b0863d8_dirty)
  configuration: --prefix=/root/miniconda3/envs/conda_bld/conda-bld/ffmpeg_1635335682798/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_place --cc=/root/miniconda3/envs/conda_bld/conda-bld/ffmpeg_1635335682798/_build_env/bin/x86_64-conda-linux-gnu-cc --disable-doc --disable-openssl --enable-avresample --enable-hardcoded-tables --enable-libfreetype --enable-libopenh264 --enable-pic --enable-pthreads --enable-shared --disable-static --enable-version3 --enable-zlib --enable-libmp3lame
  libavutil      56. 70.100 / 56. 70.100
  libavcodec     58.134.100 / 58.134.100
  libavformat    58. 76.100 / 58. 76.100
  libavdevice    58. 13.100 / 58. 13.100
  libavfilter     7.110.100 /  7.110.100
  libavresample   4.  0.  0 /  4.  0.  0

Audio converted successfully.


size=   72531kB time=00:38:40.97 bitrate= 256.0kbits/s speed= 880x    
video:0kB audio:72531kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.000194%
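
The ffmpeg invocation above can also be assembled by a small helper, which makes the argument list easy to inspect before anything is run. This is a sketch under assumptions: `build_ffmpeg_command` is a hypothetical name, the output naming simply mirrors the `-output.wav` suffix used in the notebook, and the example does not call ffmpeg itself.

```python
import os


def build_ffmpeg_command(directory, audio_file):
    """Return (output_file, argument list) for a 16 kHz mono 16-bit WAV conversion."""
    output_file = audio_file + "-output.wav"
    command = [
        "ffmpeg",
        "-y",                      # overwrite an existing output file without asking
        "-i", os.path.join(directory, audio_file),
        "-ar", "16000",            # 16 kHz sample rate
        "-ac", "1",                # mono
        "-c:a", "pcm_s16le",       # 16-bit little-endian PCM
        os.path.join(directory, output_file),
    ]
    return output_file, command


# A file name with a space is kept as one list element, so no escaping is needed
output_name, command = build_ffmpeg_command("samples", "Track 13.wav")
print(output_name)
print(command)
```

The list would then be run with `subprocess.run(command, check=True)`, without `shell=True`.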


In [7]:
# chunk_length_s=15 and batch_size=16 are ideal
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=15,
    batch_size=16,
    torch_dtype=torch_dtype,
    device=device
)
result_local = pipe(directory + output_file)

# output transcription
wrapper = textwrap.TextWrapper(width=80,
    initial_indent=" ",
    subsequent_indent="",
    break_long_words=False,
    break_on_hyphens=False)
print(wrapper.fill(result_local["text"]))

# save transcript to the samples folder as a .md file
# mode "a" appends, so re-running this cell adds the text again
saved_txt = result_local["text"]
with open(directory + output_file + ".md", "a", encoding="utf-8") as f:
    f.write(saved_txt)
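
To get a feel for what `chunk_length_s=15` and `batch_size=16` mean for a file of this length, the rough estimate below counts chunks and batches for the 00:38:40.97 duration reported by ffmpeg. It is a back-of-the-envelope sketch, not the pipeline's actual algorithm: `estimate_batches` and `overlap_s` are hypothetical, and with the default `overlap_s=0` it ignores any overlap the pipeline applies between neighbouring chunks, so real counts will be somewhat higher.

```python
import math


def estimate_batches(duration_s, chunk_length_s=15, batch_size=16, overlap_s=0.0):
    """Roughly estimate (number of chunks, number of batches) for one audio file."""
    step = chunk_length_s - overlap_s
    if step <= 0:
        raise ValueError("overlap_s must be smaller than chunk_length_s")
    n_chunks = math.ceil(duration_s / step)
    n_batches = math.ceil(n_chunks / batch_size)
    return n_chunks, n_batches


# Duration taken from the ffmpeg output: 38 min 40.97 s
duration_s = 38 * 60 + 40.97
chunks, batches = estimate_batches(duration_s)
print(chunks, batches)
```

Under these assumptions, a larger `batch_size` means fewer forward passes but more GPU memory per pass.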

  Hi, welcome back to prosecuting Donald Trump. It's Tuesday, February 20th. I'm
Andrew Weissman, and I'm here with Mary McCord. Hi, Mary. Good morning, Andrew.
Well, guess what? There's a lot to talk about. We have nothing to say. Yeah,
it's been such a slow week. He is like- If this is slow, I hate to see what not
slow is. That's kind of a preview of what we're about to talk about. There's so
much that happened last week, and it's very much about what our lives and all of
our lives are about to be in short order. So, okay, Mary, take it away. What are
some things that happened last week that we're going to talk about? Yeah, quick
roadmap for today's episode. Obviously, we're, you know, waiting for two
decisions out of the Supreme Court. One, I think, could come today, which is a
decision on Trump's motion to stay, the DC Circuit's immunity decision,
basically a decision saying he does not have immunity from criminal prosecution.
That's been fully briefed. We'll talk a little bit abou