Distil-Whisper is among the fastest ASR models at present: it transcribed a file in 1.9s that took Whisper 4m 44s, and it can process audio at over 40x real-time speed with accuracy that is extremely close to Whisper's.

Prerequisites:
torch,
cuda,
ffmpeg,
transformers,
flash-attn (needed for attn_implementation="flash_attention_2"), and
distil-whisper at https://github.com/huggingface/distil-whisper.

In [8]:
import torch
torch.__version__

'2.2.1+cu121'

In [9]:
# CUDA is needed for vastly faster transcription
# 47 seconds for a full 30 min podcast! 1.9s for a file that took whisper 4m 44s!
torch.cuda.is_available()

True
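The timings quoted in the comment above can be turned into a speed-up and a real-time factor with a little arithmetic. The sketch below assumes the podcast is exactly 30 minutes long; the helper name `speed_ratio` is introduced here for illustration only and is not part of any library.

```python
def speed_ratio(reference_s: float, elapsed_s: float) -> float:
    """Return how many times larger reference_s is than elapsed_s."""
    return reference_s / elapsed_s

whisper_s = 4 * 60 + 44   # 4m 44s for whisper, from the comment above
distil_s = 1.9            # distil-whisper time quoted for that file
podcast_s = 30 * 60       # assumption: the podcast is exactly 30 minutes
podcast_elapsed_s = 47    # quoted transcription time for the podcast

speedup = speed_ratio(whisper_s, distil_s)
realtime = speed_ratio(podcast_s, podcast_elapsed_s)
print(f"speed-up over whisper: {speedup:.0f}x")
print(f"real-time factor for the podcast: {realtime:.1f}x")
```

With these numbers the speed-up over whisper comes out at about 149x, and the podcast run at about 38x real time, a little under the 40x headline figure.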

In [10]:
# conda activate py310
import torch
import subprocess
import os
import glob
import textwrap
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
print(device)

model_id = "distil-whisper/distil-large-v2"
# flash_attention_2 needs the flash-attn package and a supported CUDA GPU
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    attn_implementation="flash_attention_2",
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

cuda:0


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
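The cell above hard-codes `attn_implementation="flash_attention_2"`, which fails when the flash-attn package is missing or no GPU is present. A minimal sketch of a fallback follows; the helper name `pick_attn_implementation` is hypothetical, and the choice of `"sdpa"` as the fallback assumes a transformers version that supports that backend for this model.

```python
import importlib.util

def pick_attn_implementation(cuda_available: bool) -> str:
    """Choose a value to pass as attn_implementation to from_pretrained."""
    # flash-attn installs an importable module named "flash_attn"
    has_flash_attn = importlib.util.find_spec("flash_attn") is not None
    if cuda_available and has_flash_attn:
        return "flash_attention_2"
    # Assumed fallback: PyTorch scaled dot-product attention
    return "sdpa"

print(pick_attn_implementation(cuda_available=False))
```

It could then be used as `attn_implementation=pick_attn_implementation(torch.cuda.is_available())` when loading the model.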


In [11]:
# rename all audio files with spaces in their name for ffmpeg
# Specify the audio file directory
directory = '/var/home/fraser/machine_learning/whisper.cpp/samples/'

# Get a list of all audio files (.m4a, .mp3, .ogg, and .wav) in the directory
files = glob.glob(os.path.join(directory, '*.m4a')) + \
        glob.glob(os.path.join(directory, '*.mp3')) + \
        glob.glob(os.path.join(directory, '*.ogg')) + \
        glob.glob(os.path.join(directory, '*.wav'))

# Iterate over the files
for file in files:
    # Check only the file name, so spaces in parent folders are left alone
    folder, name = os.path.split(file)
    if ' ' in name:
        # Replace the spaces with hyphens
        new_name = os.path.join(folder, name.replace(' ', '-'))
        # Rename the file
        os.rename(file, new_name)

In [12]:
# audio file you wish to transcribe:
# make sure spaces are replaced with a '-'
# e.g., 'Track-13.wav' for 'Track 13.wav'
audio_file = 'prosecuting.wav'

In [13]:
# convert audio file to the 16 kHz, mono, 16-bit wav format used for whisper
output_file = audio_file + '-output.wav'
print(output_file)

# convert audio_file so it can be transcribed to text
# -y overwrites an existing file with the same name
try:
    subprocess.run(
        ['ffmpeg', '-y', '-i', directory + audio_file,
         '-ar', '16000', '-ac', '1', '-c:a', 'pcm_s16le',
         directory + output_file],
        check=True)
    print("Audio converted successfully.")
except subprocess.CalledProcessError as e:
    print(f"Audio conversion failed with error {e.returncode}.")

prosecuting.wav-output.wav
Audio converted successfully.


ffmpeg version 4.4 Copyright (c) 2000-2021 the FFmpeg developers
  built with gcc 9.3.0 (crosstool-NG 1.24.0.133_b0863d8_dirty)
  configuration: --prefix=/root/miniconda3/envs/conda_bld/conda-bld/ffmpeg_1635335682798/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_place --cc=/root/miniconda3/envs/conda_bld/conda-bld/ffmpeg_1635335682798/_build_env/bin/x86_64-conda-linux-gnu-cc --disable-doc --disable-openssl --enable-avresample --enable-hardcoded-tables --enable-libfreetype --enable-libopenh264 --enable-pic --enable-pthreads --enable-shared --disable-static --enable-version3 --enable-zlib --enable-libmp3lame
  libavutil      56. 70.100 / 56. 70.100
  libavcodec     58.134.100 / 58.134.100
  libavformat    58. 76.100 / 58. 76.100
  libavdevice    58. 13.100 / 58. 13.100
  libavfilter     7.110.100 /  7.110.100
  libavresample   4.  0.  0 /  4.  0.  0
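The conversion step can also be expressed as an argument list, which documents each ffmpeg flag and avoids shell quoting entirely. This is a sketch only: `build_ffmpeg_command` is a hypothetical helper, and the `/tmp/samples` path is a placeholder rather than the directory used in this notebook.

```python
import os

def build_ffmpeg_command(directory: str, audio_file: str) -> list[str]:
    """Build the ffmpeg argument list for the 16 kHz mono wav conversion."""
    output_file = audio_file + "-output.wav"
    return [
        "ffmpeg",
        "-y",                  # overwrite the output file without prompting
        "-i", os.path.join(directory, audio_file),
        "-ar", "16000",        # resample to 16 kHz
        "-ac", "1",            # downmix to a single (mono) channel
        "-c:a", "pcm_s16le",   # 16-bit little-endian PCM audio
        os.path.join(directory, output_file),
    ]

cmd = build_ffmpeg_command("/tmp/samples", "Track 13.wav")
print(cmd)
```

Passing the list to `subprocess.run(cmd, check=True)` without `shell=True` hands each argument to ffmpeg unchanged, so a file name containing spaces would not need to be renamed first.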

In [14]:
# chunk_length_s=15 and batch_size=16 are ideal
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=15,
    batch_size=16,
    torch_dtype=torch_dtype,
    device=device
)
result_local = pipe(directory + output_file)

# output transcription
wrapper = textwrap.TextWrapper(width=80,
    initial_indent=" ",
    subsequent_indent="",
    break_long_words=False,
    break_on_hyphens=False)
print(wrapper.fill(result_local["text"]))

# save transcript to the samples folder as a .md file
# mode "a" appends, so re-running this cell adds the text again
saved_txt = result_local["text"]
with open(directory + output_file + ".md", "a") as f:
    f.write(saved_txt)
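To get a feel for what `chunk_length_s=15` and `batch_size=16` mean for a long file, the rough estimate below counts chunks and batches. It is a simplification, not the pipeline's actual code: the function names are hypothetical, and the default overlap of `chunk_length_s / 6` on each side of a chunk is an assumption about the pipeline's default stride.

```python
import math

def estimate_chunks(duration_s, chunk_length_s=15.0, stride_length_s=None):
    """Roughly estimate how many overlapping chunks cover the audio."""
    if stride_length_s is None:
        # Assumed default: each side of a chunk overlaps by chunk_length_s / 6
        stride_length_s = chunk_length_s / 6
    # Each new chunk advances by the chunk length minus both overlaps
    step_s = chunk_length_s - 2 * stride_length_s
    return math.ceil(duration_s / step_s)

def estimate_batches(n_chunks, batch_size=16):
    """Number of forward passes needed when chunks are batched together."""
    return math.ceil(n_chunks / batch_size)

n_chunks = estimate_chunks(30 * 60)   # a 30 minute recording
n_batches = estimate_batches(n_chunks)
print(n_chunks, n_batches)
```

Under these assumptions a 30 minute recording is split into about 180 chunks, processed in 12 batches of up to 16 chunks each.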

  Hi and welcome to a fresh episode of prosecuting Donald Trump. It's Tuesday,
March 12th. I'm Andrew Weissman, and I'm here with Mary McCord. I shouldn't
really say I'm here with, because we can see each other and we're talking each
other, but you're gallivanting around. Oh, you do some gallivanting too.
Sometimes you just have to work from a different location. So that's what I'm
doing. Anyway, good morning, Andrew. Good morning. And we're going to cut to the
chase because there's so much going on. It is Tuesday morning and lots of topics
to get to. So Mary, what's up on your radar screen? I think let's start with,
and you know, I guess you can't knock them for try and immunity takes two and
three. So I think Mr. Trump's attorneys, you know, seeing, I think, what they
thought was a good thing, the delay they're getting by taking their immunity
case up to the Supreme Court, they're trying to get that kind of delay in both
the Marilago case as well as the Manhattan case brought by dist