# Test ASR - Audio to Text transcription

This notebook tests processing and transcribing audio files with the OpenAI Whisper model.

## References

- [Hugging Face Open ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard)
- [OpenAI Whisper Large v3](https://huggingface.co/openai/whisper-large-v3)

## OpenAI Whisper Install

In [3]:
!pip install --upgrade pip



In [5]:
!pip install --upgrade git+https://github.com/huggingface/transformers.git 

Collecting git+https://github.com/huggingface/transformers.git
  Cloning https://github.com/huggingface/transformers.git to /private/var/folders/pt/164vn4ns5cgb9p75rc5qgffm0000gn/T/pip-req-build-1tqyi7uv
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git /private/var/folders/pt/164vn4ns5cgb9p75rc5qgffm0000gn/T/pip-req-build-1tqyi7uv
  Resolved https://github.com/huggingface/transformers.git to commit 96eb06286b63c9c93334d507e632c175d6ba8b28
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting filelock (from transformers==4.42.0.dev0)
  Using cached filelock-3.14.0-py3-none-any.whl.metadata (2.8 kB)
Collecting huggingface-hub<1.0,>=0.23.0 (from transformers==4.42.0.dev0)
  Downloading huggingface_hub-0.23.2-py3-none-any.whl.metadata (12 kB)
Collecting num

In [6]:
!pip install --upgrade accelerate 

Collecting accelerate
  Downloading accelerate-0.30.1-py3-none-any.whl.metadata (18 kB)
Collecting torch>=1.10.0 (from accelerate)
  Downloading torch-2.3.0-cp312-none-macosx_11_0_arm64.whl.metadata (26 kB)
Collecting sympy (from torch>=1.10.0->accelerate)
  Downloading sympy-1.12.1-py3-none-any.whl.metadata (12 kB)
Collecting networkx (from torch>=1.10.0->accelerate)
  Downloading networkx-3.3-py3-none-any.whl.metadata (5.1 kB)
Collecting mpmath<1.4.0,>=1.1.0 (from sympy->torch>=1.10.0->accelerate)
  Using cached mpmath-1.3.0-py3-none-any.whl.metadata (8.6 kB)
Downloading accelerate-0.30.1-py3-none-any.whl (302 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 302.6/302.6 kB 5.5 MB/s eta 0:00:00
Downloading torch-2.3.0-cp312-none-macosx_11_0_arm64.whl (61.0 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61.0/61.0 MB 27.5 MB/s eta 0:00:00
Downloading networkx-3

In [8]:
!pip install --upgrade datasets\[audio\]

Collecting datasets[audio]
  Downloading datasets-2.19.1-py3-none-any.whl.metadata (19 kB)
Collecting pyarrow>=12.0.0 (from datasets[audio])
  Downloading pyarrow-16.1.0-cp312-cp312-macosx_11_0_arm64.whl.metadata (3.0 kB)
Collecting pyarrow-hotfix (from datasets[audio])
  Downloading pyarrow_hotfix-0.6-py3-none-any.whl.metadata (3.6 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets[audio])
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting pandas (from datasets[audio])
  Downloading pandas-2.2.2-cp312-cp312-macosx_11_0_arm64.whl.metadata (19 kB)
Collecting xxhash (from datasets[audio])
  Downloading xxhash-3.4.1-cp312-cp312-macosx_11_0_arm64.whl.metadata (12 kB)
Collecting multiprocess (from datasets[audio])
  Downloading multiprocess-0.70.16-py312-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.3.1,>=2023.1.0 (from fsspec[http]<=2024.3.1,>=2023.1.0->datasets[audio])
  Using cached fsspec-2024.3.1-py3-none-any.whl.metadata (6.8 kB)
Collecting aiohttp (from data
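
Because the install output above is truncated, it can help to confirm which versions actually ended up in the environment. The following is a minimal sketch using only the standard library; the `installed_versions` helper and the `PACKAGES` list are illustrative names, not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names installed by the cells above.
PACKAGES = ["transformers", "accelerate", "datasets", "torch"]


def installed_versions(packages):
    """Return a {package: version string or None} mapping for the given distribution names."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None  # not installed in this environment
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name}: {ver or 'not installed'}")
```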

## Test Whisper via Pipeline

In [11]:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


 Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel. Nor is Mr. Quilter's manner less interesting than his matter. He tells us that at this festive season of the year, with Christmas and roast beef looming before us, similes drawn from eating and its results occur most readily to the mind. He has grave doubts whether Sir Frederick Leighton's work is really Greek after all, and can discover in it but little of rocky Ithaca. Linnell's pictures are a sort of Upguards and Adam paintings, and Mason's exquisite idylls are as national as a jingo poem. Mr. Burkett Foster's landscapes smile at one much in the same way that Mr. Carker used to flash his teeth. And Mr. John Collier gives his sitter a cheerful slap on the back before he says, like a shampooer in a Turkish bath, Next man!
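
The cell above prints only `result["text"]`, but with `return_timestamps=True` the pipeline result also carries a `"chunks"` list whose entries hold a `"timestamp"` pair and a `"text"` string. A minimal sketch of rendering those entries follows; `format_timestamped_chunks` is a hypothetical helper, and the `example` dictionary is hand-written to mimic the result shape rather than real model output.

```python
def _fmt(seconds):
    """Format a timestamp in seconds, or '?' when the model did not emit one."""
    return "?" if seconds is None else f"{seconds:.2f}s"


def format_timestamped_chunks(result):
    """Render each timestamped chunk of an ASR result as '[start - end] text'."""
    lines = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        lines.append(f"[{_fmt(start)} - {_fmt(end)}] {chunk['text'].strip()}")
    return lines


# Illustrative data in the shape the pipeline returns (not real output).
example = {
    "text": " Hello there. General remarks.",
    "chunks": [
        {"timestamp": (0.0, 1.5), "text": " Hello there."},
        {"timestamp": (1.5, None), "text": " General remarks."},
    ],
}

for line in format_timestamped_chunks(example):
    print(line)
```

In the notebook, the same helper could be called as `format_timestamped_chunks(result)` right after `result = pipe(sample)`.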


## Test with local recording from Voice Memos

In [16]:
!pip install pydub

Collecting pydub
  Downloading pydub-0.25.1-py2.py3-none-any.whl.metadata (1.4 kB)
Downloading pydub-0.25.1-py2.py3-none-any.whl (32 kB)
Installing collected packages: pydub
Successfully installed pydub-0.25.1
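
pydub relies on an external decoder (ffmpeg or avconv) to read non-WAV formats such as `.m4a`, and `pip install pydub` does not install one. A small sketch for checking that a decoder is on the `PATH` before running the cells below; `find_decoder` is an illustrative helper name.

```python
import shutil


def find_decoder(candidates=("ffmpeg", "avconv")):
    """Return the path of the first decoder binary found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


decoder = find_decoder()
if decoder:
    print(f"Found decoder: {decoder}")
else:
    print("No ffmpeg/avconv found on PATH; pydub will likely fail to read .m4a files")
```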


### Test with 5 min sample and CPU-only

In [31]:
import os
import time
from pydub import AudioSegment
from pydub.silence import detect_nonsilent
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
import soundfile as sf

# Define file paths and names
basepath = "/Users/alastori/test-ASR/"
audio_file_path = os.path.join(basepath, "data/2024-05-30.m4a")
file_name = os.path.splitext(os.path.basename(audio_file_path))[0]
temp_file_path = os.path.join(basepath, f"temp/temp_{file_name}.wav")
trimmed_file_path = os.path.join(basepath, f"temp/trimmed_{file_name}.wav")
txt_file_path = os.path.join(basepath, f"data/output_{file_name}.txt")

# Ensure the temp directory exists
os.makedirs(os.path.dirname(temp_file_path), exist_ok=True)

# Convert the .m4a file to .wav using pydub and resample to 16 kHz
audio = AudioSegment.from_file(audio_file_path, format="m4a")
audio = audio.set_frame_rate(16000)

# Extract the first 5 minutes (5 * 60 * 1000 milliseconds)
audio_5min = audio[:5 * 60 * 1000]
audio_5min.export(temp_file_path, format="wav")

# Load the 5-minute segment
audio_5min = AudioSegment.from_file(temp_file_path, format="wav")

# Detect non-silent chunks
non_silent_chunks = detect_nonsilent(audio_5min, min_silence_len=1000, silence_thresh=-40)

# Concatenate non-silent chunks
trimmed_audio = AudioSegment.empty()
for start, end in non_silent_chunks:
    trimmed_audio += audio_5min[start:end]

# Export the trimmed audio
trimmed_audio.export(trimmed_file_path, format="wav")

# Load the trimmed .wav file
audio_input, sample_rate = sf.read(trimmed_file_path)

# Ensure the audio is sampled at 16 kHz
assert sample_rate == 16000, "The audio sample rate must be 16000 Hz"

# Force the use of CPU for the entire pipeline to avoid MPS issues
device = "cpu"

# Set the environment variable to enable CPU fallback for unsupported MPS operations
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

# Model ID for OpenAI Whisper
model_id = "openai/whisper-large-v3"

# Load the model with its default dtype (no torch_dtype is passed) and move it to the selected device
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

# Load the processor (tokenizer and feature extractor)
processor = AutoProcessor.from_pretrained(model_id)

# Set up the pipeline for automatic speech recognition
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    device=device,
)

# Measure transcription time
start_time = time.time()

# Run the pipeline on the local audio file using the raw audio data
result = pipe(audio_input)

end_time = time.time()
transcription_time = end_time - start_time

# Print the transcribed text and the time taken
print(f"Transcription Time: {transcription_time:.2f} seconds")
print(result["text"])

# Save the transcribed text to a TXT file
with open(txt_file_path, "w") as txt_file:
    txt_file.write(result["text"])

print(f"Transcription saved to {txt_file_path}")


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Transcription Time: 51.25 seconds
 Hey, good morning. Hey, good morning. How are you doing? Good, good. How are you doing? I'm doing well. Actually, it's afternoon already for me. Where are you based out of? I'm in Florida, East Coast. Yeah. Okay. Yeah. Actually, I moved to the United States in 2019 and then to New Jersey. And then we moved to Florida but same time zone totally understand well I'm here in San Jose so I'm on the west coast and near our HQ and by the way did I pronounce your name correctly is it pronounced the air tone yeah that's perfect Well, yeah Here in the US everybody pronounced differently but it's almost the same in my My country, I'm Brazilian. So we do have a lot of different accents. So Ayrton is fine Yeah, I understand when you were talking to me But what about you is Shrias?. Alright, cool, cool, cool. What about you, is it Shreyas? It's Shreyas. Shreyas, okay. The last name is much more difficult to pronounce, but the first name is... People typically get i
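
One way to put the 51.25 second figure in context is the real-time factor (RTF), which is processing time divided by audio duration; values below 1 mean faster than real time. The sketch below uses 5 minutes as the audio duration, which is only an upper bound because silence was trimmed before transcription, so the true RTF for this run is somewhat higher. `real_time_factor` is an illustrative helper.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Return processing time divided by audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# 51.25 s is the measured time above; 300 s is the untrimmed 5-minute upper bound.
rtf = real_time_factor(51.25, 5 * 60)
print(f"Real-time factor (lower bound): {rtf:.3f}")
```

Inside the notebook cell, `len(audio_input) / sample_rate` gives the actual duration of the trimmed audio in seconds and could be passed as `audio_seconds` instead.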

### Process the file in chunks and ignore silence

In [None]:
import os
import time
from pydub import AudioSegment
from pydub.silence import detect_nonsilent
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
import soundfile as sf

# Define file paths and names
basepath = "/Users/alastori/test-ASR/"
audio_file_path = os.path.join(basepath, "data/2024-05-30.m4a")
file_name = os.path.splitext(os.path.basename(audio_file_path))[0]
temp_dir = os.path.join(basepath, "temp")
txt_file_path = os.path.join(basepath, f"data/output_{file_name}.txt")

# Ensure the temp directory exists
os.makedirs(temp_dir, exist_ok=True)

# Read the original audio file
print(f"Reading original audio from {audio_file_path}")
audio = AudioSegment.from_file(audio_file_path, format="m4a")

# Convert the .m4a file to .wav using pydub and resample to 16 kHz
print("Converting from m4a to wav and resampling to 16 kHz")
conversion_start_time = time.time()
audio = audio.set_frame_rate(16000)
conversion_end_time = time.time()
conversion_time = (conversion_end_time - conversion_start_time) / 60
print(f"File conversion time: {conversion_time:.2f} minutes")

# Detect non-silent chunks
print("Detecting non-silent parts of the audio")
silence_detection_start_time = time.time()
non_silent_chunks = detect_nonsilent(audio, min_silence_len=1000, silence_thresh=-40)
silence_detection_end_time = time.time()
silence_detection_time = (silence_detection_end_time - silence_detection_start_time) / 60
print(f"Silence detection time: {silence_detection_time:.2f} minutes")

# Concatenate non-silent chunks
non_silent_audio = AudioSegment.empty()
for start, end in non_silent_chunks:
    non_silent_audio += audio[start:end]

# Define chunk length (1 minute)
chunk_length_ms = 1 * 60 * 1000

# Split non-silent audio into chunks
print("Splitting non-silent audio into 1-minute chunks")
chunks = [non_silent_audio[i:i + chunk_length_ms] for i in range(0, len(non_silent_audio), chunk_length_ms)]
total_chunks = len(chunks)
print(f"Total chunks to process: {total_chunks}")

# Force the use of CPU for the entire pipeline to avoid MPS issues
device = "cpu"

# Model ID for OpenAI Whisper
model_id = "openai/whisper-large-v3"

# Load the model with its default dtype (no torch_dtype is passed) and move it to the selected device
print("Loading Whisper model")
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

# Load the processor (tokenizer and feature extractor)
print("Loading processor")
processor = AutoProcessor.from_pretrained(model_id)

# Set up the pipeline for automatic speech recognition
print("Setting up the ASR pipeline")
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    device=device,
)

# Process each chunk
transcriptions = []

for i, chunk in enumerate(chunks):
    print(f"Processing chunk {i + 1}/{total_chunks}")
    chunk_start_time = time.time()

    # Export the chunk to a temporary WAV file
    chunk_file_path = os.path.join(temp_dir, f"chunk_{i}_{file_name}.wav")
    chunk.export(chunk_file_path, format="wav")

    # Load the chunk .wav file
    audio_input, sample_rate = sf.read(chunk_file_path)

    # Ensure the audio is sampled at 16 kHz
    assert sample_rate == 16000, "The audio sample rate must be 16000 Hz"

    # Run the pipeline on the chunk
    result = pipe(audio_input)

    # Append the transcribed text
    transcription = result["text"]
    transcriptions.append(transcription)

    # Print the transcription for the current chunk
    print(f"Transcription for chunk {i + 1}/{total_chunks}:")
    print(transcription)

    # Save the partial transcription to a TXT file
    with open(txt_file_path, "a") as txt_file:
        txt_file.write(f"Chunk {i + 1}/{total_chunks}:\n")
        txt_file.write(transcription + "\n\n")

    chunk_end_time = time.time()
    chunk_transcription_time = (chunk_end_time - chunk_start_time) / 60  # Convert to minutes

    # Estimate remaining time
    remaining_chunks = total_chunks - (i + 1)
    estimated_remaining_time = remaining_chunks * chunk_transcription_time  # Already in minutes

    # Output partial result and estimated remaining time
    print(f"Chunk {i + 1}/{total_chunks} transcribed in {chunk_transcription_time:.2f} minutes")
    print(f"Estimated remaining transcription time: {estimated_remaining_time:.2f} minutes")

# Report the total duration of audio that was transcribed (audio length, not wall-clock processing time)
total_audio_minutes = sum(len(chunk) for chunk in chunks) / (1000 * 60)
print(f"Total audio duration transcribed: {total_audio_minutes:.2f} minutes")
print(f"Transcription saved to {txt_file_path}")
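
The loop above estimates the remaining time from the most recent chunk only, so one unusually slow or fast chunk swings the estimate. A possible alternative, sketched here under the assumption that chunk times are roughly comparable, is to average over all chunks processed so far; the same object also yields the true wall-clock total. `EtaEstimator` is a hypothetical helper, not part of the original notebook.

```python
class EtaEstimator:
    """Estimate remaining time from the mean duration of all chunks processed so far."""

    def __init__(self, total_chunks):
        self.total_chunks = total_chunks
        self.durations = []  # per-chunk processing times in seconds

    def record(self, seconds):
        """Store the processing time of one finished chunk."""
        self.durations.append(seconds)

    def elapsed_seconds(self):
        """Total wall-clock time spent on finished chunks."""
        return sum(self.durations)

    def remaining_seconds(self):
        """Mean chunk time multiplied by the number of chunks left, or None before any chunk."""
        if not self.durations:
            return None
        mean = sum(self.durations) / len(self.durations)
        return mean * (self.total_chunks - len(self.durations))


# Illustrative numbers: 4 chunks in total, two finished in 12 s and 18 s.
eta = EtaEstimator(total_chunks=4)
eta.record(12.0)
eta.record(18.0)
print(f"Elapsed: {eta.elapsed_seconds():.1f} s, estimated remaining: {eta.remaining_seconds():.1f} s")
```

In the loop, `eta.record(chunk_end_time - chunk_start_time)` would replace the single-chunk estimate.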


### Test with local Mac M3 hardware acceleration

To use the GPU on a Mac M3, PyTorch needs the Metal Performance Shaders (MPS) backend, which provides GPU acceleration on Apple silicon. The cell below selects the `mps` device when it is available and falls back to the CPU otherwise.


In [38]:
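import os

# Enable CPU fallback for operators that are not implemented on MPS.
# Assumption: PyTorch reads this variable when torch is first imported, so it is set
# before `import torch`. If torch was already imported in this kernel, restart the kernel.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import time
from pydub import AudioSegment
from pydub.silence import detect_nonsilent
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
import soundfile as sf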

# Define file paths and names
basepath = "/Users/alastori/test-ASR/"
audio_file_path = os.path.join(basepath, "data/2024-05-30.m4a")
file_name = os.path.splitext(os.path.basename(audio_file_path))[0]
temp_dir = os.path.join(basepath, "temp")
txt_file_path = os.path.join(basepath, f"data/output_{file_name}.txt")

# Ensure the temp directory exists
os.makedirs(temp_dir, exist_ok=True)

# Read the original audio file
print(f"Reading original audio from {audio_file_path}")
audio = AudioSegment.from_file(audio_file_path, format="m4a")

# Convert the .m4a file to .wav using pydub and resample to 16 kHz
print("Converting from m4a to wav and resampling to 16 kHz")
conversion_start_time = time.time()
audio = audio.set_frame_rate(16000)
conversion_end_time = time.time()
conversion_time = (conversion_end_time - conversion_start_time) / 60
print(f"File conversion time: {conversion_time:.2f} minutes")

# Detect non-silent chunks
print("Detecting non-silent parts of the audio")
silence_detection_start_time = time.time()
non_silent_chunks = detect_nonsilent(audio, min_silence_len=1000, silence_thresh=-40)
silence_detection_end_time = time.time()
silence_detection_time = (silence_detection_end_time - silence_detection_start_time) / 60
print(f"Silence detection time: {silence_detection_time:.2f} minutes")

# Concatenate non-silent chunks
non_silent_audio = AudioSegment.empty()
for start, end in non_silent_chunks:
    non_silent_audio += audio[start:end]

# Define chunk length (1 minute)
chunk_length_ms = 1 * 60 * 1000

# Split non-silent audio into chunks
print("Splitting non-silent audio into 1-minute chunks")
chunks = [non_silent_audio[i:i + chunk_length_ms] for i in range(0, len(non_silent_audio), chunk_length_ms)]
total_chunks = len(chunks)
print(f"Total chunks to process: {total_chunks}")

# Check for MPS availability
if torch.backends.mps.is_available() and torch.backends.mps.is_built():
    device = "mps"
else:
    device = "cpu"
print(f"Using device: {device}")

# Enable CPU fallback for unsupported MPS operations.
# Note: this appears to take effect only if it is set before torch is first imported in the kernel.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

# Model ID for OpenAI Whisper
model_id = "openai/whisper-large-v3"

# Load the model with its default dtype (no torch_dtype is passed) and move it to the selected device
print("Loading Whisper model")
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

# Load the processor (tokenizer and feature extractor)
print("Loading processor")
processor = AutoProcessor.from_pretrained(model_id)

# Set up the pipeline for automatic speech recognition
print("Setting up the ASR pipeline")
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    device=device,
)

# Process each chunk
transcriptions = []

for i, chunk in enumerate(chunks):
    print(f"Processing chunk {i + 1}/{total_chunks}")
    chunk_start_time = time.time()

    # Export the chunk to a temporary WAV file
    chunk_file_path = os.path.join(temp_dir, f"chunk_{i}_{file_name}.wav")
    chunk.export(chunk_file_path, format="wav")

    # Load the chunk .wav file
    audio_input, sample_rate = sf.read(chunk_file_path)

    # Ensure the audio is sampled at 16 kHz
    assert sample_rate == 16000, "The audio sample rate must be 16000 Hz"

    # Run the pipeline on the chunk
    result = pipe(audio_input)

    # Append the transcribed text
    transcription = result["text"]
    transcriptions.append(transcription)

    # Print the transcription for the current chunk
    print(f"Transcription for chunk {i + 1}/{total_chunks}:")
    print(transcription)

    # Save the partial transcription to a TXT file
    with open(txt_file_path, "a") as txt_file:
        txt_file.write(f"Chunk {i + 1}/{total_chunks}:\n")
        txt_file.write(transcription + "\n\n")

    chunk_end_time = time.time()
    chunk_transcription_time = (chunk_end_time - chunk_start_time) / 60  # Convert to minutes

    # Estimate remaining time
    remaining_chunks = total_chunks - (i + 1)
    estimated_remaining_time = remaining_chunks * chunk_transcription_time  # Already in minutes

    # Output partial result and estimated remaining time
    print(f"Chunk {i + 1}/{total_chunks} transcribed in {chunk_transcription_time:.2f} minutes")
    print(f"Estimated remaining transcription time: {estimated_remaining_time:.2f} minutes")

# Report the total duration of audio that was transcribed (audio length, not wall-clock processing time)
total_audio_minutes = sum(len(chunk) for chunk in chunks) / (1000 * 60)
print(f"Total audio duration transcribed: {total_audio_minutes:.2f} minutes")
print(f"Transcription saved to {txt_file_path}")


Reading original audio from /Users/alastori/test-ASR/data/2024-05-30.m4a
Converting from m4a to wav and resampling to 16 kHz
File conversion time: 0.02 minutes
Detecting non-silent parts of the audio
Silence detection time: 0.89 minutes
Splitting non-silent audio into 1-minute chunks
Total chunks to process: 43
Using device: mps
Loading Whisper model
Loading processor


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Setting up the ASR pipeline
Processing chunk 1/43




NotImplementedError: The operator 'aten::isin.Tensor_Tensor_out' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.
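
The run above stops at the first chunk because an operator is missing on MPS, even though the cell sets `PYTORCH_ENABLE_MPS_FALLBACK`; a likely cause is that the variable was set after torch had already been imported in the kernel. Independently of that flag, one defensive option is to retry on the CPU when a device raises `NotImplementedError`. The sketch below shows the idea with a stand-in pipeline builder so that it runs without a model; `transcribe_with_fallback` and `fake_build_pipe` are hypothetical names, and in the notebook `build_pipe` would wrap the model loading and `pipeline(...)` call for a given device.

```python
def transcribe_with_fallback(build_pipe, audio, devices=("mps", "cpu")):
    """Try each device in order, moving on when an operator is not implemented on it."""
    last_error = None
    for device in devices:
        try:
            pipe = build_pipe(device)
            return device, pipe(audio)
        except NotImplementedError as err:
            print(f"Device '{device}' failed: {err}")
            last_error = err
    raise RuntimeError("No device could run the pipeline") from last_error


def fake_build_pipe(device):
    """Stand-in for the real pipeline factory; fails on 'mps' to mimic the error above."""
    def pipe(audio):
        if device == "mps":
            raise NotImplementedError("operator not implemented for the MPS device")
        return {"text": f"transcribed {len(audio)} samples on {device}"}
    return pipe


device, result = transcribe_with_fallback(fake_build_pipe, [0.0] * 16000)
print(device, "->", result["text"])
```

Retrying on the CPU rebuilds the pipeline, so for a model the size of `whisper-large-v3` it costs a second model load; setting the environment variable before the first `import torch` avoids that when the fallback flag is honored.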