## Speech-to-Text Transcription with Whisper and 4-bit Quantization

### 1. Setup and Configuration

This section imports the required libraries and defines global variables for the model ID and the audio file path. It also checks GPU availability and the CUDA version, since the bitsandbytes quantization used below is run on a CUDA GPU in this notebook.

In [23]:
import torch
import torchaudio
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq, BitsAndBytesConfig
import time

In [24]:
# Check for GPU availability and display device information
if torch.cuda.is_available():
    print("GPU is available.")
    print("Device name:", torch.cuda.get_device_name(0))
else:
    print("GPU is not available. Using CPU.")

GPU is available.
Device name: NVIDIA GeForce RTX 3050 Laptop GPU


In [25]:
# Display the CUDA version
print(torch.version.cuda)

12.8


In [26]:
# Verify that bitsandbytes imports cleanly by printing its 8-bit linear layer class.
# If bitsandbytes cannot find a usable GPU setup, the import typically emits a warning.
import bitsandbytes as bnb
print(bnb.nn.Linear8bitLt)

<class 'bitsandbytes.nn.modules.Linear8bitLt'>


In [27]:
# Define the model ID and audio file path
MODEL_ID = "openai/whisper-base"
AUDIO_PATH = "sample.wav" # Path to your audio file

### 2. Model Loading and Quantization

This section loads the Whisper-base model twice. The first copy is loaded with 4-bit quantization via `BitsAndBytesConfig`, which reduces the model's memory footprint while aiming to keep accuracy reasonable. The second copy is the original, non-quantized model, loaded for comparison.

In [28]:
# Define 4-bit quantization configuration
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16, # Use float16 for computation with 4-bit tensors
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",  # Using NF4 (NormalFloat4) quantization for better accuracy
)
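
To make the effect of `bnb_4bit_use_double_quant` more concrete, the following back-of-the-envelope sketch estimates the storage cost per quantized weight. It assumes the block sizes described for NF4 in the QLoRA paper (64 weights per block, and 256 block scales per second-level group); these values are assumptions for illustration and are not read from the installed bitsandbytes version. The function names and the 50-million parameter count are hypothetical.

```python
def bits_per_param(double_quant: bool,
                   block_size: int = 64,
                   nested_block_size: int = 256) -> float:
    """Estimate storage bits per 4-bit quantized weight, including scale overhead.

    Assumption: one scale per `block_size` weights. Without double quantization
    the scale is stored as float32. With it, the scale is stored in 8 bits and
    one extra float32 constant is shared by `nested_block_size` scales.
    """
    weight_bits = 4.0
    if double_quant:
        overhead = 8.0 / block_size + 32.0 / (block_size * nested_block_size)
    else:
        overhead = 32.0 / block_size
    return weight_bits + overhead


def estimate_weight_mb(num_params: int, double_quant: bool) -> float:
    """Estimated size in MB of `num_params` quantized weights."""
    return num_params * bits_per_param(double_quant) / 8 / (1024 ** 2)


# Hypothetical example: 50 million quantized weights
print(f"Without double quantization: {estimate_weight_mb(50_000_000, False):.2f} MB")
print(f"With double quantization:    {estimate_weight_mb(50_000_000, True):.2f} MB")
```

Under these assumptions, double quantization lowers the overhead from 0.5 bits to roughly 0.127 bits per weight, which is a modest saving on top of the 4-bit weights themselves.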

In [29]:
# Load the Whisper-base model with 4-bit quantization
print("🔧 Loading Whisper Base with GPU quantization...")
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    MODEL_ID,
    quantization_config=quant_config,
    device_map="auto" # Automatically maps the model to the available device (GPU if present)
)

🔧 Loading Whisper Base with GPU quantization...


In [30]:
# Load the original (non-quantized) Whisper-base model for comparison
model_orig = AutoModelForSpeechSeq2Seq.from_pretrained(
    MODEL_ID,
    device_map="auto"
)

In [31]:
# Print the floating-point dtype reported by each model.
# This reflects the non-quantized parameters, not the packed 4-bit weights.
print(f"Quantized model dtype: {model.dtype}")
print(f"Original model dtype: {model_orig.dtype}")

Quantized model dtype: torch.float16
Original model dtype: torch.float32
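
The dtype printed above does not by itself show which layers were replaced. One way to check more directly is to count the module classes in each model. The sketch below is a minimal helper (the name `count_module_types` is hypothetical); it only relies on the model exposing a `modules()` iterator, as PyTorch modules do. For a 4-bit bitsandbytes load, one would expect entries named `Linear4bit` to appear in the quantized model and not in the original.

```python
from collections import Counter


def count_module_types(model) -> Counter:
    """Return a Counter mapping module class name to number of occurrences."""
    return Counter(type(module).__name__ for module in model.modules())


# Usage in this notebook (assumes `model` and `model_orig` are already loaded):
# print(count_module_types(model).most_common(5))
# print(count_module_types(model_orig).most_common(5))
```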


### 3. Audio Preprocessing

This section loads the audio file and resamples it to the 16 kHz sampling rate that Whisper expects. The audio is then converted into the input features required by the model.

In [32]:
# Load the processor corresponding to the Whisper model
processor = AutoProcessor.from_pretrained(MODEL_ID)

In [33]:
# Load the audio file
waveform, sr = torchaudio.load(AUDIO_PATH)

RuntimeError: Couldn't find appropriate backend to handle uri sample.wav and format None.
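
The error above usually indicates that torchaudio could not find an installed audio I/O backend, or that `sample.wav` does not exist at that path. Installing a backend is the usual remedy. As a possible fallback, the sketch below reads a WAV file with Python's standard-library `wave` module. It is a minimal sketch that assumes uncompressed 16-bit PCM audio; the function name `load_wav_pcm16` is hypothetical.

```python
import array
import sys
import wave


def load_wav_pcm16(path):
    """Read a 16-bit PCM WAV file and return (mono_samples, sample_rate).

    Samples are averaged across channels and scaled to floats in [-1.0, 1.0).
    """
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("Only 16-bit PCM WAV files are supported by this sketch")
        n_channels = wf.getnchannels()
        sample_rate = wf.getframerate()
        frames = wf.readframes(wf.getnframes())

    samples = array.array("h")
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian

    # De-interleave by frame, average the channels, and normalize
    mono = [
        sum(samples[i:i + n_channels]) / (n_channels * 32768.0)
        for i in range(0, len(samples), n_channels)
    ]
    return mono, sample_rate


# Usage in this notebook (produces the same variables as torchaudio.load):
# mono, sr = load_wav_pcm16(AUDIO_PATH)
# waveform = torch.tensor(mono).unsqueeze(0)  # shape: [1, num_samples]
```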

In [None]:
# Resample the audio to 16000 Hz
if sr != 16000:
    print("🔄 Resampling from", sr, "Hz to 16000 Hz...")
    waveform = torchaudio.transforms.Resample(sr, 16000)(waveform)

In [None]:
# Downmix to mono (averaging channels), then prepare the model inputs and move them to the model's device
inputs = processor(waveform.mean(dim=0).numpy(), sampling_rate=16000, return_tensors="pt").to(model.device)
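
By default the Whisper feature extractor pads or truncates audio to a 30-second window (480,000 samples at 16 kHz), so a single `generate` call as used in this notebook covers at most the first 30 seconds of the file. For longer recordings, one simple option is to split the waveform into consecutive windows. The helper below is a hypothetical sketch that only computes the sample boundaries; non-overlapping windows can cut words at the edges, so treat it as a starting point.

```python
def chunk_bounds(num_samples, sample_rate=16000, window_s=30):
    """Return (start, end) sample indices for consecutive fixed-length windows."""
    window = sample_rate * window_s
    return [
        (start, min(start + window, num_samples))
        for start in range(0, num_samples, window)
    ]


# Usage in this notebook (assumes `waveform` has shape [channels, num_samples]):
# for start, end in chunk_bounds(waveform.shape[-1]):
#     chunk = waveform[..., start:end]
```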

### 4. Transcription and Performance Comparison

This core section performs the speech-to-text transcription using both the original and the 4-bit quantized models. It measures and compares the transcription time and the resulting text output to assess the impact of quantization on speed and accuracy.

In [None]:
# Create a detached clone of the original inputs for the non-quantized model
inputs_orig = {
    k: v.clone().detach() if isinstance(v, torch.Tensor) else v
    for k, v in inputs.items()
}

In [None]:
# Cast floating-point input tensors to float16 to match the quantized model's compute dtype
for k in inputs:
    if isinstance(inputs[k], torch.Tensor) and inputs[k].is_floating_point():
        inputs[k] = inputs[k].to(dtype=torch.float16)

In [None]:
# --- Transcribe with the Original Model ---

print("🔊 Transcribing with Original Model...")
start = time.time()
with torch.no_grad(): # Disable gradient calculation for inference
    output_ids = model_orig.generate(**inputs_orig)
text_orig = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
end = time.time() # Stop the clock before printing so that I/O is not timed
print(f"📝 Original: {text_orig}")
print(f"Time taken for original model: {end - start:.2f} seconds")
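
A single timed call can be dominated by one-off costs such as CUDA kernel initialization, so the comparison may be more reliable with a warm-up run and several repeats. The sketch below is a hypothetical helper, not part of the original notebook: `benchmark` accepts any zero-argument callable and an optional `sync` function (for example `torch.cuda.synchronize`) to wait for pending GPU work before reading the clock.

```python
import statistics
import time


def benchmark(fn, warmup=1, repeats=3, sync=None):
    """Time `fn` over several runs and return (median_seconds, all_timings)."""
    for _ in range(warmup):
        fn()  # Warm-up runs are executed but not timed

    timings = []
    for _ in range(repeats):
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings), timings


# Usage in this notebook:
# median_s, _ = benchmark(lambda: model_orig.generate(**inputs_orig),
#                         sync=torch.cuda.synchronize)
# print(f"Median time for original model: {median_s:.2f} seconds")
```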

In [None]:
# --- Transcribe with the Quantized Model ---

print("\n🔊 Transcribing with Quantized Model...")
start = time.time()
with torch.no_grad():
    output_ids = model.generate(**inputs)
text_quant = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
end = time.time() # Stop the clock before printing so that I/O is not timed
print(f"📝 Quantized: {text_quant}")
print(f"Time taken for quantized model: {end - start:.2f} seconds")
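
Comparing the two transcripts by eye works for a short clip, but a number is easier to track. The sketch below computes a word-level error rate using edit distance. It treats the original model's output as the reference, which is an assumption: the result measures how far the quantized transcript diverges from the original one, not accuracy against a ground-truth transcript. The function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between two texts, divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("Reference text must contain at least one word")

    # Dynamic-programming edit distance, keeping only the previous row
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Usage in this notebook:
# print(f"Divergence from original: {word_error_rate(text_orig, text_quant):.2%}")
```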

### 5. Memory Footprint Comparison

This segment provides a direct comparison of the memory usage between the original and the 4-bit quantized models, demonstrating the efficiency gains achieved through quantization.

In [None]:
# Calculate and print the memory footprint of the original model in MB
print(f"Memory (MB) - Original Model: {model_orig.get_memory_footprint() / (1024 ** 2):.2f}")

In [None]:
# Calculate and print the memory footprint of the quantized model in MB
print(f"Memory (MB) - Quantized Model: {model.get_memory_footprint() / (1024 ** 2):.2f}")
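
The two figures are easier to interpret as a ratio. The helper below is a small hypothetical sketch that summarizes the reduction from two byte counts; the numbers in the example are placeholders, not measurements from this notebook. The ratio is typically well below the nominal 8x between float32 and 4-bit storage, because quantization is applied to linear-layer weights while other parameters stay in higher precision.

```python
def footprint_summary(original_bytes: int, quantized_bytes: int) -> dict:
    """Summarize memory savings between an original and a quantized model."""
    mb = 1024 ** 2
    return {
        "original_mb": original_bytes / mb,
        "quantized_mb": quantized_bytes / mb,
        "ratio": original_bytes / quantized_bytes,
        "saved_pct": 100 * (1 - quantized_bytes / original_bytes),
    }


# Placeholder values for illustration only
print(footprint_summary(400 * 1024 ** 2, 100 * 1024 ** 2))

# Usage in this notebook:
# print(footprint_summary(model_orig.get_memory_footprint(),
#                         model.get_memory_footprint()))
```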

### 6. Saving Models (Optional)

This optional section provides code for saving the quantized model and processor locally, or pushing them to the Hugging Face Hub. Saving is useful for reusing the model later without re-quantizing it, or for sharing the quantized model. Skip the Hub cells if you only want a local copy.

In [None]:
YOUR_REPOSITORY = "Winoto/whisper-base-4bit-quantized"

In [None]:
# Log in to the Hugging Face Hub before pushing (requires an access token).
# Set YOUR_REPOSITORY above to your own Hugging Face repository name.
from huggingface_hub import login
login() # You will be prompted to enter your Hugging Face token

In [None]:
model.push_to_hub(YOUR_REPOSITORY)
processor.push_to_hub(YOUR_REPOSITORY)

In [None]:
# Define a path to a local directory for saving
local_save_path = "./quantized-model"

# Save the quantized model and processor locally
model.save_pretrained(local_save_path)
processor.save_pretrained(local_save_path)
print(f"Model and processor saved locally to: {local_save_path}")
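
After saving, it can be worth confirming that the quantization settings were written alongside the weights. The sketch below assumes that `save_pretrained` stores them in `config.json` under a `quantization_config` key, as recent transformers versions appear to do; the helper name `read_quantization_config` is hypothetical.

```python
import json
from pathlib import Path


def read_quantization_config(save_dir):
    """Return the saved quantization settings, or None if none were recorded."""
    config_path = Path(save_dir) / "config.json"
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)
    return config.get("quantization_config")


# Usage in this notebook:
# print(read_quantization_config(local_save_path))
```

If the settings are present, reloading with `AutoModelForSpeechSeq2Seq.from_pretrained(local_save_path, device_map="auto")` should pick them up without passing `quant_config` again, though this is worth verifying against the installed library version.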