# Transcribe audio using a locally-running Vosk model
3/24/2025, Dave Sisk, https://github.com/davidcsisk, https://www.linkedin.com/in/davesisk-doctordatabase/

These Vosk transcription models run locally. That avoids the charges for calls to OpenAI's Whisper model and keeps the audio data on the host running this process, but it does require enough local computing power. The smaller models work on laptops and mobile devices; the larger Vosk models need more fully provisioned resources. Details about the Vosk models can be found here: https://github.com/alphacep/vosk-api

In [5]:
# Do the necessary prep work...
#! pip install vosk
# Download a Vosk model and extract it:
# https://alphacephei.com/vosk/models
# The model directory should be in the current working directory
# and look something like this:
# model/
# ├── am/                # Acoustic model files
# ├── conf/              # Configuration files
# ├── graph/             # Graph files for decoding
# ├── ivector/           # Files for speaker identification (optional)
# ├── rescore/           # Rescoring files (optional)
# ├── README             # Information about the model
# └── other files        # Additional files required for the model

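# wave is part of the Python standard library, so it does not need a pip install
#! pip install pyaudio  # optional: not imported anywhere in this notebook
#! pip install pydub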

# Must download and install ffmpeg
# https://ffmpeg.org/download.html
# check it's accessible in the path: ffmpeg -version  (You might have to restart terminal/vscode/jupyter notebook kernel)
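
Before loading the model, it can help to confirm that the prerequisites listed above are actually in place. The following is a minimal sketch, not part of the original notebook: the helper name `check_prerequisites` is hypothetical, and it only checks that `ffmpeg` is on the PATH and that the model directory exists.

```python
import os
import shutil

def check_prerequisites(model_dir, required_tools=("ffmpeg",)):
    """Return a list of human-readable problems; an empty list means everything was found."""
    problems = []
    for tool in required_tools:
        # shutil.which returns None when the executable is not on the PATH
        if shutil.which(tool) is None:
            problems.append(f"'{tool}' was not found on the PATH")
    if not os.path.isdir(model_dir):
        problems.append(f"model directory '{model_dir}' does not exist")
    return problems

if __name__ == "__main__":
    # The model directory name matches the one loaded later in this notebook
    for problem in check_prerequisites("vosk-model-en-us-0.22-lgraph"):
        print("Missing prerequisite:", problem)
```

If nothing is printed, both prerequisites were found. This does not verify that the model files themselves are complete.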

In [1]:
import os
import wave
import json
from vosk import Model, KaldiRecognizer

In [2]:
# Load the Vosk model
# The argument is the path to the extracted model directory, relative to the current working directory
#model = Model("vosk-model-small-en-us-0.15")  # Smallest English model
model = Model("vosk-model-en-us-0.22-lgraph")  # 2nd smallest English model (with an embedded language graph)

# Directory containing MP3 files
audio_directory = "."

In [3]:
# Function to convert MP3 to WAV with required specifications
def convert_mp3_to_wav(mp3_file, wav_file):
    from pydub import AudioSegment
    audio = AudioSegment.from_mp3(mp3_file)
    # Convert to mono, 16-bit, and 16kHz
    audio = audio.set_channels(1).set_sample_width(2).set_frame_rate(16000)
    audio.export(wav_file, format="wav")
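
To confirm that a converted file really is mono, 16-bit, and 8 kHz or 16 kHz, the standard-library `wave` module is enough. This is a minimal sketch: the helper names `is_vosk_compatible` and `write_silence` are hypothetical, and the accepted sample rates simply mirror the check used in the transcription loop of this notebook.

```python
import wave

def is_vosk_compatible(wav_path, allowed_rates=(8000, 16000)):
    """Return True if the WAV file is mono, 16-bit PCM, at an allowed sample rate."""
    with wave.open(wav_path, "rb") as wf:
        return (
            wf.getnchannels() == 1
            and wf.getsampwidth() == 2
            and wf.getframerate() in allowed_rates
        )

def write_silence(wav_path, channels=1, sample_width=2, frame_rate=16000, seconds=1):
    """Write a WAV file of zero-valued samples, for trying the check without real audio."""
    with wave.open(wav_path, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)
        wf.setframerate(frame_rate)
        # One frame is channels * sample_width bytes
        wf.writeframes(b"\x00" * (channels * sample_width * frame_rate * seconds))
```

Calling `is_vosk_compatible` on a file produced by `convert_mp3_to_wav` should return `True`; a stereo or 44.1 kHz file should return `False`.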


In [None]:
# Convert and transcribe all the MP3 files. This process takes around 17 minutes, and the time seems to be the same
# whether we use the smallest model or the 2nd smallest (the difference is likely memory usage rather than CPU).
# Iterate through all MP3 files in the directory
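for file_name in os.listdir(audio_directory):
    if file_name.endswith(".mp3"):
        print(f"Processing {file_name}...")
        # os.listdir returns bare file names, so join them with the directory
        mp3_path = os.path.join(audio_directory, file_name)
        wav_file = os.path.join(audio_directory, f"{os.path.splitext(file_name)[0]}.wav")

        # Convert MP3 to WAV
        convert_mp3_to_wav(mp3_path, wav_file)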
        
        # Open the WAV file
        with wave.open(wav_file, "rb") as wf:
            if wf.getnchannels() != 1 or wf.getsampwidth() != 2 or wf.getframerate() not in [8000, 16000]:
                print(f"Skipping {file_name}: WAV file must be mono, 16-bit, and 8kHz or 16kHz.")
                continue
            
            rec = KaldiRecognizer(model, wf.getframerate())
            transcription = ""
            
            # Perform transcription
            while True:
                data = wf.readframes(4000)
                if len(data) == 0:
                    break
                if rec.AcceptWaveform(data):
                    result = json.loads(rec.Result())
                    transcription += result.get("text", "") + " "
            
            # Finalize transcription
            final_result = json.loads(rec.FinalResult())
            transcription += final_result.get("text", "")
            
            # Save the transcription to a text file
            output_file = f"{os.path.splitext(file_name)[0]}.txt"
            with open(output_file, "w", encoding="utf-8") as f:
                f.write(transcription.strip())
            print(f"Transcription saved to {output_file}") 

Processing popularhistoryoftheartofmusic_00_mathews.mp3...
Transcription saved to popularhistoryoftheartofmusic_00_mathews.txt
Processing popularhistoryoftheartofmusic_01_mathews.mp3...
Transcription saved to popularhistoryoftheartofmusic_01_mathews.txt
Processing popularhistoryoftheartofmusic_02_mathews.mp3...
Transcription saved to popularhistoryoftheartofmusic_02_mathews.txt
Processing popularhistoryoftheartofmusic_03_mathews_reduced-sample-rate_16KHz.mp3...
Transcription saved to popularhistoryoftheartofmusic_03_mathews_reduced-sample-rate_16KHz.txt

It appears that the 2nd smallest model (the one with an embedded language graph) produces slightly more accurate output than the smallest model, so we'll base this work on that model.
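
One way to make this comparison less subjective, assuming a hand-corrected reference transcript is available for at least one file, is to compute the word error rate (WER) of each model's output. The notebook's comparison was not done this way; the sketch below is only an illustration, and the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between two transcripts, divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # previous[j] is the edit distance between the reference words seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

Transcribing the same file with both models and passing each result as `hypothesis` gives two numbers to compare; the lower WER indicates the more accurate transcript for that file.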

Downstream, the intention is to use this process along with vector embeddings to try to detect deepfake audio used in exploits such as the CEO funds-transfer scam. It seems much easier to get accurate and meaningful vector embeddings if we first convert the raw audio to transcribed text. Look for an example of that in my vector search/cybersecurity project. Enjoy!

Let's add an example that works from a video source. Deepfake tools generally produce video output, not just audio. Because the Vosk model only understands audio data, we first need to extract the audio from the video and then transcribe that audio. This might not be necessary with a different model; it depends on the model used for transcription.

In [None]:
# Example of video -> audio -> transcribed text
import subprocess
import os
import wave
import json
from vosk import Model, KaldiRecognizer

# Function to extract audio from MP4 and convert to WAV
def extract_audio_from_mp4(mp4_file, wav_file):
    # Use ffmpeg to extract audio and convert to WAV
    command = [
        "ffmpeg", "-i", mp4_file, "-ac", "1", "-ar", "16000", "-sample_fmt", "s16", wav_file
    ]
    try:
        # Add a timeout to prevent hanging
        subprocess.run(command, check=True, timeout=60)
    except subprocess.TimeoutExpired:
        raise TimeoutError(f"FFmpeg process timed out while processing {mp4_file}.")

    # Verify the WAV file was created and is not empty
    if not os.path.exists(wav_file):
        raise FileNotFoundError(f"WAV file {wav_file} was not created.")
    if os.path.getsize(wav_file) == 0:
        raise ValueError(f"WAV file {wav_file} is empty.")

# Function to transcribe text from WAV audio
def transcribe_audio(wav_file, model):
    with wave.open(wav_file, "rb") as wf:
        if wf.getnchannels() != 1 or wf.getsampwidth() != 2 or wf.getframerate() not in [8000, 16000]:
            raise ValueError("WAV file must be mono, 16-bit, and 8kHz or 16kHz.")

        rec = KaldiRecognizer(model, wf.getframerate())
        transcription = ""

        # Perform transcription
        while True:
            data = wf.readframes(4000)
            if len(data) == 0:
                break
            if rec.AcceptWaveform(data):
                result = json.loads(rec.Result())
                transcription += result.get("text", "") + " "

        # Finalize transcription
        final_result = json.loads(rec.FinalResult())
        transcription += final_result.get("text", "")

        return transcription.strip()

# Example usage
mp4_file = "./HomerSimpson-deepfake-video.mp4"  # Replace with your MP4 file
wav_file = f"{os.path.splitext(mp4_file)[0]}.wav"

# Delete existing WAV file if it exists
if os.path.exists(wav_file):
    os.remove(wav_file)

# Extract audio
extract_audio_from_mp4(mp4_file, wav_file)
print(f"Audio extracted and saved as {wav_file}")

# Load Vosk model
model = Model("vosk-model-en-us-0.22-lgraph")

# Transcribe audio
transcription = transcribe_audio(wav_file, model)
print("Transcription:")
print(transcription)

# Save transcription to a text file
output_file = f"{os.path.splitext(mp4_file)[0]}.txt"
with open(output_file, "w", encoding="utf-8") as f:
    f.write(transcription)
print(f"Transcription saved to {output_file}")

Audio extracted and saved as ./HomerSimpson-deepfake-video.wav
Transcription:
alright i am following up on the audit readiness review air and q one expenses for contractor payment we need to transfer one hundred and thirty seven thousand eight hundred and twenty dollars zero sense to the holding account a b c twelve thousand three hundred and forty five this must be sent by three p m t day so r c f epo can refer elected in the pre audit submission let me know once the transfer is done thanks
Transcription saved to ./HomerSimpson-deepfake-video.txt
