# Whisper Large Transcription

This notebook demonstrates how to transcribe audio with OpenAI's Whisper `large-v3` model. We will:
1. Load and preprocess an audio file.
2. Apply the Whisper model to transcribe the audio.
3. Output the transcription.

## Explanation
Whisper is a speech recognition model from OpenAI that transcribes spoken audio into text. The output can be used for subtitles, transcription services, and similar applications.


## Step 1: Install Dependencies

In [None]:
# Setup installers
commands = [
    ("PIP_ROOT_USER_ACTION=ignore pip install -q openai-whisper", "Install whisper"),
    ("PIP_ROOT_USER_ACTION=ignore pip install -q torch", "Install torch"),
    ("PIP_ROOT_USER_ACTION=ignore pip install -q numpy", "Install numpy"),
    ("PIP_ROOT_USER_ACTION=ignore pip install -q soundfile", "Install soundfile"),
    ("PIP_ROOT_USER_ACTION=ignore pip install -q tqdm", "Install tqdm")
]

# Import the utils module which sets up the environment
from modules import utils
from modules import disable_warnings

# Use LogTools
log_tools = utils.LogTools()

# Execute
log_tools.command_state(commands)

## Step 2: Load Libraries

In [None]:
# Import necessary libraries
import soundfile as sf
import numpy as np
import whisper
import time
import threading
import torch
from tqdm import tqdm

# Check to see what GPU resources are available
def get_best_device():
    if torch.cuda.is_available():
        print("Using CUDA")
        return torch.device("cuda")
    elif torch.backends.mps.is_available():
        print("Using MPS")
        return torch.device("mps")
    else:
        print("Using CPU")
        return torch.device("cpu")

device = get_best_device()

## Step 3: Load the Audio File

We start by loading an audio file using `soundfile`. Because the samples are passed to `whisper` as an array rather than a file path, they need to be mono `float32` audio at 16 kHz.
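
A minimal sketch of getting a recording into such a format, assuming the target is mono `float32` at 16 kHz (the sample rate Whisper works with internally). The helper name `to_whisper_input` is hypothetical, and the resampling is plain linear interpolation with no anti-aliasing filter, so a dedicated resampler is likely a better choice for real recordings.

```python
import numpy as np

WHISPER_SAMPLE_RATE = 16_000  # sample rate assumed for raw waveforms


def to_whisper_input(audio: np.ndarray, sample_rate: int) -> np.ndarray:
    """Return a mono float32 waveform at 16 kHz (hypothetical helper)."""
    audio = np.asarray(audio, dtype=np.float32)

    # soundfile returns (frames, channels) for multi-channel files; average the channels
    if audio.ndim > 1:
        audio = audio.mean(axis=1)

    if sample_rate != WHISPER_SAMPLE_RATE and len(audio) > 0:
        # Naive linear interpolation onto the new time grid (no low-pass filtering)
        duration = len(audio) / sample_rate
        n_target = int(round(duration * WHISPER_SAMPLE_RATE))
        old_times = np.arange(len(audio)) / sample_rate
        new_times = np.arange(n_target) / WHISPER_SAMPLE_RATE
        audio = np.interp(new_times, old_times, audio)

    return audio.astype(np.float32)


# Example: one second of stereo noise at 44.1 kHz
stereo = np.random.default_rng(0).uniform(-1, 1, size=(44_100, 2))
mono_16k = to_whisper_input(stereo, 44_100)
print(mono_16k.shape, mono_16k.dtype)  # (16000,) float32
```

In the notebook, the same helper could be applied to the `audio` and `sample_rate` values returned by `sf.read`.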

In [None]:
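# Load the audio file as float32, the dtype Whisper expects for in-memory audio
audio_filepath = "../../test_pcm.wav"
audio, sample_rate = sf.read(audio_filepath, dtype="float32")

# Downmix multi-channel audio to mono
if audio.ndim > 1:
    audio = audio.mean(axis=1)

# Display the audio properties
print(f"Audio sample rate: {sample_rate} Hz")
print(f"Audio duration: {len(audio) / sample_rate:.2f} seconds")
if sample_rate != 16000:
    print("Warning: Whisper expects 16 kHz audio; resample before transcribing.")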

## Step 4: Transcribe the Audio

Next, we apply the Whisper model to transcribe the audio file.

### Model Options
Whisper comes in several sizes. Larger models are generally more accurate but slower and need more memory, and the `.en` variants are English-only:
- `tiny`
- `tiny.en`
- `base`
- `base.en`
- `small`
- `small.en`
- `medium`
- `medium.en`
- `large`
- `large-v2`
- `large-v3`
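
To make the size trade-off concrete, the sketch below picks the largest model that fits a given GPU memory budget. The memory figures are the approximate values listed in the openai-whisper README and should be treated as rough guidance only; the `pick_model` helper and its parameters are hypothetical.

```python
# Approximate VRAM needs in GB, per the openai-whisper README (rough guidance only)
APPROX_VRAM_GB = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("large-v3", 10),
]

# Sizes that also ship an English-only ".en" variant
HAS_ENGLISH_VARIANT = {"tiny", "base", "small", "medium"}


def pick_model(vram_gb: float, english_only: bool = False) -> str:
    """Return the largest model whose approximate VRAM need fits the budget."""
    choice = "tiny"  # fall back to the smallest model
    for name, needed_gb in APPROX_VRAM_GB:
        if needed_gb <= vram_gb:
            choice = name
    if english_only and choice in HAS_ENGLISH_VARIANT:
        choice += ".en"
    return choice


print(pick_model(4))                     # small
print(pick_model(8, english_only=True))  # medium.en
print(pick_model(24))                    # large-v3
```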


In [None]:
# Set Whisper variables
model_name = "large-v3"
use_fp16 = device.type == "cuda"  # half precision on CUDA; other devices use float32 here

# Advance the progress bar once per `interval` seconds of wall-clock time.
# This is only an estimate that assumes real-time transcription speed; it stops
# as soon as `stop_event` is set or the bar is full.
def update_progress_bar(progress_bar, duration, stop_event, interval=1):
    elapsed = 0
    while elapsed < duration and not stop_event.wait(interval):
        step = min(interval, duration - elapsed)
        progress_bar.update(step)
        elapsed += step

print("Starting transcription process...")

# Load the Whisper model onto the selected device.
# whisper.load_model shows its own download progress the first time a model is fetched.
# If loading fails on MPS with your PyTorch version, pass device="cpu" instead.
print("Loading Whisper model...")
model = whisper.load_model(model_name, device=device)

# Create a progress bar for transcription
duration = int(len(audio) / sample_rate)
progress_bar = tqdm(total=duration, desc="Transcribing audio", bar_format="{l_bar}{bar:50}| {n_fmt}/{total_fmt} {elapsed}")

# Start the progress update in a separate thread
stop_event = threading.Event()
progress_thread = threading.Thread(target=update_progress_bar, args=(progress_bar, duration, stop_event))
progress_thread.start()

# Transcribe the audio. openai-whisper's transcribe() has no `compute_type` or
# `batch_size` options (those belong to other Whisper implementations);
# precision is controlled with `fp16`.
try:
    result = model.transcribe(audio, fp16=use_fp16, verbose=True)
finally:
    # Stop the progress thread and close the progress bar
    stop_event.set()
    progress_thread.join()
    progress_bar.close()


# Output the transcription
transcription = result['text']
print("\nTranscript:")
print(transcription)
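
Besides `text`, the dictionary returned by `transcribe` contains a `segments` list whose entries carry `start` and `end` times (in seconds) and the segment `text`. As a sketch of the subtitle use case mentioned earlier, the hypothetical helpers below format such segments as SRT. The `sample_result` is made-up data used only for illustration, not output from this notebook.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Render Whisper-style segments (dicts with start, end, text) as SRT."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


# Made-up example shaped like the output of model.transcribe()
sample_result = {
    "text": " Hello there. How are you?",
    "segments": [
        {"start": 0.0, "end": 1.5, "text": " Hello there."},
        {"start": 1.5, "end": 3.25, "text": " How are you?"},
    ],
}
print(segments_to_srt(sample_result["segments"]))
```

With a real transcription, `segments_to_srt(result["segments"])` would produce the subtitle text, which could then be written to a `.srt` file.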

## Step 5: Free Up Resources
*Release the model and free up GPU memory.*


In [None]:
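# Delete the model object and collect any leftover references
import gc
del model
gc.collect()

# Free up cached GPU memory (only relevant on CUDA)
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    print("GPU memory freed")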

## Conclusion

In this notebook, we demonstrated how to transcribe audio using `whisper` models. We loaded and preprocessed the audio, applied the Whisper model, and output the transcription.