## Comparing Chichewa Speech-to-Text: Open-Source Whisper vs OpenAI API

This notebook compares speech-to-text output for Chichewa (Chicheŵa) from two systems: the open-source Whisper model (`openai/whisper-large-v3`) run locally, and OpenAI's hosted transcription API (`whisper-1`). Both systems transcribe the same audio samples, and the comparison covers transcription accuracy and qualitative differences between the outputs. No fine-tuning or model adaptation is performed. Both systems run in their default configurations, which gives a practical baseline for Chichewa speech recognition in a low-resource language setting.


## Import Required Libraries

In [32]:
import os
from pathlib import Path
from dotenv import load_dotenv
from IPython.display import Markdown, display, SVG
from datetime import datetime
import librosa
import soundfile as sf
import numpy as np
import requests
from openai import OpenAI
from huggingface_hub import login
from transformers import AutoTokenizer, WhisperProcessor, WhisperForConditionalGeneration, TextStreamer, BitsAndBytesConfig, pipeline
import torch


In [2]:
# ====================================
# TEST FOR MPS (MAC GPU) SUPPORT
# ====================================

print("PyTorch version:", torch.__version__)
print("MPS built:", torch.backends.mps.is_built())
print("MPS available:", torch.backends.mps.is_available())

if torch.backends.mps.is_available():
    device = torch.device("mps")
    x = torch.randn(1000, 1000, device=device)
    y = x @ x
    print("MPS test successful ✅")
else:
    print("MPS not available ❌")


PyTorch version: 2.9.0
MPS built: True
MPS available: True
MPS test successful ✅


## Setup Workspace and Global Variables

In [39]:
DIR_DATA = Path.cwd().parent / "data"
DIR_AUDIO = DIR_DATA / "audio"
FILE_TEST_AUDIO = DIR_AUDIO / "WhatsApp_Audio_2026-01-07_at_5_16_59_AM-2.wav"

# ====================================
# LOAD ENV VARIABLES
# ====================================
load_dotenv()  # read variables from a local .env file, if one exists
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

## Preprocess Audio Files
Convert the MP4 recordings to 16 kHz mono WAV files, the input format used for Whisper below.

In [4]:
def sanitize_filename(filename):
    """
    Replace spaces and periods in the filename with underscores (the extension period is kept).
    
    Args:
        filename: Original filename
    
    Returns:
        Sanitized filename
    """
    name, ext = os.path.splitext(filename)
    # Replace spaces and periods with underscores
    name = name.replace(' ', '_').replace('.', '_')
    return f"{name}{ext}"

In [5]:
def convert_mp4_to_wav(mp4_path, output_dir=None):
    """
    Convert MP4 audio to WAV format.
    
    Args:
        mp4_path: Path to the MP4 file (str or Path object)
        output_dir: Directory to save the WAV file (defaults to same directory as MP4)
    
    Returns:
        Path to the converted WAV file
    """
    mp4_path = Path(mp4_path)
    
    if output_dir is None:
        output_dir = mp4_path.parent
    else:
        output_dir = Path(output_dir)
    
    # Create output directory if it doesn't exist
    output_dir.mkdir(parents=True, exist_ok=True)
    
    # Sanitize the output filename
    sanitized_name = sanitize_filename(f"{mp4_path.stem}.wav")
    wav_path = output_dir / sanitized_name
    
    # Load audio from MP4 and save as WAV
    audio, sr = librosa.load(mp4_path, sr=16000, mono=True)
    sf.write(wav_path, audio, sr)
    
    print(f"Converted {mp4_path.name} to {wav_path.name}")
    return wav_path

In [None]:
# ====================================
# CONVERT EXAMPLE MP4 TO WAV
# ====================================
for file in DIR_AUDIO.iterdir():
    # Compare case-insensitively so files ending in ".MP4" are converted too
    if file.suffix.lower() == ".mp4":
        convert_mp4_to_wav(file, output_dir=DIR_AUDIO)

## Transcribe with Open-Source Whisper Large v3

In [None]:
# ====================================
# LOAD WHISPER MODEL AND PROCESSOR
# ====================================

# Load model + processor (multilingual)
model_id = "openai/whisper-large-v3"

processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)

# Move to device
device = "mps" if torch.backends.mps.is_available() else "cpu"
model.to(device)

In [36]:
def transcribe_with_whisper(audio_path, model, processor, device):
    """
    Transcribe audio file using Whisper model with automatic language detection.
    
    Args:
        audio_path: Path to the audio file
        model: Whisper model
        processor: Whisper processor
        device: Device to run inference on ('mps' or 'cpu')
    
    Returns:
        Transcribed text
    """
    # Load audio (Whisper expects 16kHz mono)
    audio, sr = librosa.load(audio_path, sr=16000)
    
    # Prepare input features
    inputs = processor(
        audio,
        sampling_rate=16000,
        return_tensors="pt"
    )
    
    input_features = inputs.input_features.to(device)
    
    # Let Whisper auto-detect language
    with torch.no_grad():
        predicted_ids = model.generate(input_features)
    
    # Decode
    transcription = processor.batch_decode(
        predicted_ids,
        skip_special_tokens=True
    )[0]
    
    return transcription
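
With its default settings, the Whisper processor pads or truncates each input to a 30-second window, so a function like the one above only transcribes the first 30 seconds of a longer recording. The sketch below shows one simple way to split a waveform into consecutive windows before transcription. The helper name `split_into_chunks` and the constants are illustrative assumptions and are not part of the original notebook.

```python
import numpy as np

SAMPLE_RATE = 16_000   # same sampling rate used when loading audio for Whisper
CHUNK_SECONDS = 30     # length of Whisper's fixed input window

def split_into_chunks(audio, sr=SAMPLE_RATE, chunk_seconds=CHUNK_SECONDS):
    """Split a 1-D waveform into consecutive chunks of at most chunk_seconds.

    Hypothetical helper for illustration; the last chunk may be shorter.
    """
    chunk_len = int(sr * chunk_seconds)
    return [audio[start:start + chunk_len] for start in range(0, len(audio), chunk_len)]

# Example: a 70-second silent signal is split into 30 s, 30 s and 10 s pieces.
dummy = np.zeros(70 * SAMPLE_RATE, dtype=np.float32)
chunks = split_into_chunks(dummy)
chunk_durations = [len(c) / SAMPLE_RATE for c in chunks]
print(chunk_durations)
```

Each chunk could then be passed through the processor and `model.generate` in turn, and the decoded pieces joined with spaces. Cutting at fixed boundaries can split a word in two, so overlapping windows or silence-based splitting may work better in practice.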

In [None]:
# ====================================
# TRANSCRIBE TEST AUDIO
# ====================================
# Test the function
transcription = transcribe_with_whisper(FILE_TEST_AUDIO, model, processor, device)
Markdown(transcription)

 Asiwe na mafuna uziwa kutia nga buweze banji ATM kadi. Yeo mwesi uguida nshiro.

## Transcribe with Commercial OpenAI API

In addition to open-source Whisper, this notebook transcribes the same audio with OpenAI's hosted speech-to-text API, using the `whisper-1` model. The hosted endpoint handles language detection and decoding internally, so the Chichewa audio is submitted without a language parameter. The comparison therefore reflects the out-of-the-box performance of OpenAI's hosted transcription service against open-source Whisper running locally.


In [40]:
# Initialize OpenAI client
openai = OpenAI(api_key=OPENAI_API_KEY)

In [48]:
# Transcribe the test audio with OpenAI's hosted model

AUDIO_MODEL = "whisper-1"

transcription = openai.audio.transcriptions.create(
    model=AUDIO_MODEL,
    file=FILE_TEST_AUDIO,
    response_format="text",
)
print(transcription)


Asibwe na mafuna uziwa kutianga buwezi banji ATM card Iyo mwesi uguira nshiro
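
The two outputs can be compared numerically with a word-level edit distance, which is the basis of the word error rate (WER). The sketch below is a minimal pure-Python version. The function names and the normalization rule (lowercasing and stripping punctuation) are assumptions made for illustration. Neither transcript is a verified reference, so the number it produces here measures disagreement between the two systems, not accuracy. A true WER needs a human-checked Chichewa transcript as the reference.

```python
import re

def normalize(text):
    """Lowercase, replace punctuation with spaces, and split into words."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()

def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)

# The two transcripts produced in this notebook.
local_output = "Asiwe na mafuna uziwa kutia nga buweze banji ATM kadi. Yeo mwesi uguida nshiro."
api_output = "Asibwe na mafuna uziwa kutianga buwezi banji ATM card Iyo mwesi uguira nshiro"

# Treats the API output as the reference purely for illustration.
disagreement = word_error_rate(api_output, local_output)
print(f"Word-level disagreement: {disagreement:.2f}")
```

Passing a human-verified transcript as `reference` and each system's output as `hypothesis` would give one WER per system, which is the comparison the notebook is ultimately after.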

