# Usage Guide for Fine-Tuned Whisper ASR Model on Chichewa

This notebook provides a step-by-step guide to using a fine-tuned Whisper model for Automatic Speech Recognition (ASR) in **Chichewa**. The model has been fine-tuned on a Chichewa dataset consisting of approximately **24 hours of speech data**, and it has undergone several iterations, resulting in multiple versions with differing levels of performance. This guide directs you to the **best-performing model** to ensure optimal transcription accuracy.

### Key Highlights:
1. **Fine-Tuned on Chichewa**: The model has been trained on about 24 hours of Chichewa speech, making it well-suited for transcription tasks in this language. Fine-tuning has improved its performance on Chichewa ASR tasks.
   
2. **No Separate Tokenizer to Load**: You do not need to load a `WhisperTokenizer` yourself. The `WhisperProcessor` loaded from the base Whisper model bundles the feature extractor and the tokenizer, so the processor alone handles both input preparation and decoding.

3. **Best Performance via Commit Hash**: To ensure you are using the best-performing model, you will load the model by passing the **`revision` parameter** along with the exact **commit hash** corresponding to the version that achieved the lowest Word Error Rate (WER).

4. **Worked Inference Examples**: This notebook demonstrates how to load that version and use it to transcribe Chichewa audio files.

### What You’ll Learn:
- How to load and use the fine-tuned Whisper model specifically for Chichewa ASR.
- How to process and transcribe audio files using this model.
- How to ensure you are using the version of the model that delivers the best transcription performance for Chichewa by utilizing the `revision` parameter with the commit hash.

By following this guide, you’ll be equipped to leverage this specialized ASR model to produce high-quality transcriptions of Chichewa speech with minimal setup and effort.


In [5]:
import librosa
from datasets import Audio, load_dataset, DatasetDict, load_from_disk

import torch

from transformers import (WhisperFeatureExtractor, WhisperTokenizer,
                          WhisperProcessor, WhisperForConditionalGeneration,
                          logging)
from transformers import (pipeline, AutoModel, AutoTokenizer,
                          AutoProcessor, AutoModelForSpeechSeq2Seq)

# Suppress all Python warnings
import warnings
warnings.filterwarnings('ignore')

# Suppress Hugging Face logs except errors
logging.set_verbosity_error()


In [6]:
# Check if MPS is available (for Apple Silicon)
device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")
print(f"Using device: {device}")

Using device: mps
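
The check above covers Apple Silicon and falls back to the CPU. On a machine with an NVIDIA GPU you would probably want CUDA first. The sketch below is one way to express that preference. `pick_device` is a hypothetical helper that is not part of the original notebook, and it takes the availability flags as arguments so that it runs even without PyTorch installed.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Return the preferred device name: CUDA first, then MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


# Typical use with PyTorch (left as a comment so the sketch has no dependencies):
# import torch
# device = torch.device(pick_device(torch.cuda.is_available(),
#                                   torch.backends.mps.is_available()))
```

The rest of the notebook works unchanged with whichever device is selected, since the model and the input features are both moved with `.to(device)`.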


In [None]:
# Audio Files to Test

In [1]:
audio1 = "./audio_files/bushiri.wav"
audio2 = "./audio_files/mcp.wav"

## 1. Best-Performing Model and WER Information

In this section, we provide the commit hash of the model version that achieved the lowest Word Error Rate (WER) in evaluation.
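
For readers new to the metric: WER is the number of word substitutions, deletions and insertions needed to turn the model's transcript into the reference transcript, divided by the number of words in the reference. The function below is a minimal, dependency-free sketch of that calculation using word-level edit distance. The name `word_error_rate` and the example sentences are illustrative assumptions, and this is not necessarily the tool used to evaluate this model.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the two strings, divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences only: one substituted word out of four.
reference = "ndikuyamba program yogawa chimanga"
hypothesis = "ndikuyamba pulogalamu yogawa chimanga"
print(word_error_rate(reference, hypothesis))  # 0.25
```

A lower value is better, and 0.0 means the transcript matches the reference word for word.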

### Best-Performing Model:

The model with the best WER performance can be loaded using the following **commit hash**:

```python
# Best-performing model: repository ID and commit hash
HUGGINGFACE_MODEL_ID = "dmatekenya/whisper-large-v3-chichewa"
BEST_MODEL_COMMIT_HASH = "bff60fb08ba9f294e05bfcab4306f30b6a0cfc0a"
```

### Note/Warning
While the model endpoint remains the same (i.e., `dmatekenya/whisper-large-v3-chichewa`), it is crucial to pass the specific commit hash (`BEST_MODEL_COMMIT_HASH`) provided above when loading the model to access the best-performing version. You can view this commit at this [full url](https://huggingface.co/dmatekenya/whisper-large-v3-chichewa/commit/bff60fb08ba9f294e05bfcab4306f30b6a0cfc0a).
Without the commit hash, the latest version is loaded, which may have a higher Word Error Rate (WER) than the version evaluated as best. To get the most accurate results, always include the commit hash.
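
Because a mistyped or shortened hash defeats the purpose of pinning, it can help to check the value before loading. The sketch below assumes the hash is a full 40-character hexadecimal Git commit ID, as the one above is. `pinned_load_kwargs` is a hypothetical helper that is not part of the original notebook.

```python
import re

# A full Git commit ID is 40 lowercase hexadecimal characters.
_FULL_COMMIT_HASH = re.compile(r"[0-9a-f]{40}")


def pinned_load_kwargs(commit_hash: str) -> dict:
    """Return keyword arguments that pin `from_pretrained` to one exact commit."""
    if not _FULL_COMMIT_HASH.fullmatch(commit_hash):
        raise ValueError(
            f"Expected a full 40-character commit hash, got {commit_hash!r}"
        )
    return {"revision": commit_hash}


kwargs = pinned_load_kwargs("bff60fb08ba9f294e05bfcab4306f30b6a0cfc0a")

# Intended use (left as a comment so the sketch runs without transformers):
# model = WhisperForConditionalGeneration.from_pretrained(
#     "dmatekenya/whisper-large-v3-chichewa", **kwargs
# )
```

Passing a branch name such as `main` raises a `ValueError` here, which makes an accidental fallback to the latest version easy to spot.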


## 2. Performing Speech-to-Text Inference on Audio Files
In this section, I will demonstrate how to transcribe an individual audio file directly using the fine-tuned Whisper model, instead of loading a dataset through the HuggingFace `datasets` package. This approach is particularly useful when deploying the model within an application.

In [9]:
import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Check if MPS is available (for Apple Silicon)
device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")
print(f"Using device: {device}")

# HuggingFace endpoint for this finetuned model
finetuned_model_id = "dmatekenya/whisper-large-v3-chichewa"
best_model_commit_hash = "bff60fb08ba9f294e05bfcab4306f30b6a0cfc0a"

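# Whisper base model endpoint (supplies the processor)
whisper_base_model_id = "openai/whisper-large-v3"
# Shona was the language setting used during fine-tuning, so reuse it here
whisper_base_model_language = "shona"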

# Load processor
processor = WhisperProcessor.from_pretrained(
    whisper_base_model_id,
    language=whisper_base_model_language,
    task="transcribe"
)

# Load model (keeping original precision for accuracy)
model = WhisperForConditionalGeneration.from_pretrained(
    finetuned_model_id,
    revision=best_model_commit_hash
)
model = model.to(device)

# Make sure the model is in evaluation mode (disables dropout)
model.eval()

# Load audio
y, sr = librosa.load(audio1, sr=16000)

# Prepare input features
input_features = processor(y, return_tensors="pt", sampling_rate=sr).input_features
input_features = input_features.to(device)

# Generate transcription; gradients are not needed at inference time
with torch.no_grad():
    generated_ids = model.generate(
        input_features=input_features,
        # Keep the model's default generation parameters for accuracy
        use_cache=True,  # Reuse cached key/value states during decoding
        pad_token_id=processor.tokenizer.pad_token_id
    )

# Decode transcription
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(f"Transcript from model: \n {transcription}")

Using device: mps


Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00,  5.89it/s]


Transcript from model: 
 Hello a Malawi azanga konseko omwe muliko, ine ndine mwana wanu a Shepard Bushiri zikomo chifukwa chakuti aumene muliko uthenga kumene ukufikani. Ine uthenga ndi okuti ndikuyamba program yogawa chimanga kwa ma family kapena kwa anthu muno mu Malawi 1 million. Project imeneyi ipanga kuti chimanga chonse chimene chakhala chikugawidwa chikwana 17000 imene value yake ikhoza kukwanira.
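
One thing to keep in mind for longer recordings: Whisper works on 30-second windows, and to my knowledge the processor's default settings pad or truncate each input to that length, so anything beyond 30 seconds is not transcribed by the cell above. A simple workaround is to split the waveform into windows and transcribe each one in turn. The sketch below shows only the splitting step. `split_into_windows` is a hypothetical helper that is not part of the original notebook, and it works on any sliceable sequence, including the NumPy array returned by `librosa.load`.

```python
def split_into_windows(samples, sampling_rate=16000, window_seconds=30.0):
    """Split a waveform into consecutive windows of at most `window_seconds`."""
    window = int(sampling_rate * window_seconds)
    if window <= 0:
        raise ValueError("window_seconds and sampling_rate must be positive")
    return [samples[start:start + window]
            for start in range(0, len(samples), window)]


# A 65-second dummy signal at 16 kHz splits into 30 s + 30 s + 5 s.
dummy = [0.0] * (16000 * 65)
windows = split_into_windows(dummy)
print([len(w) / 16000 for w in windows])  # [30.0, 30.0, 5.0]
```

This naive split can cut a word in half at a window boundary. Overlapping windows, or the `chunk_length_s` option of the `transformers` speech recognition pipeline, are alternatives worth trying if that becomes a problem.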


In [None]:
import torch

# Check if MPS is available (for Apple Silicon)
device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")
print(f"Using device: {device}")

# HuggingFace endpoint for this finetuned model
finetuned_model_id = "dmatekenya/whisper-large-v3-chichewa"

# Use the best model commit hash from the earlier cell
best_model_commit_hash = "bff60fb08ba9f294e05bfcab4306f30b6a0cfc0a"

# Load the Whisper processor from the base model
whisper_base_model_id = "openai/whisper-large-v3"
# Language: I used Shona when fine-tuning, so use it when loading the processor
whisper_base_model_language = "shona"
processor = WhisperProcessor.from_pretrained(whisper_base_model_id,
                                           language=whisper_base_model_language,
                                           task="transcribe")
# Load the finetuned model and move to device
model = WhisperForConditionalGeneration.from_pretrained(finetuned_model_id,
                                                      revision=best_model_commit_hash)
model = model.to(device)

# Use one of the predefined audio files
y, sr = librosa.load(audio2, sr=16000)

# Prepare input features for the model and move to device
input_features = processor(y, return_tensors="pt", sampling_rate=sr).input_features
input_features = input_features.to(device)

# Generate transcription
generated_ids = model.generate(input_features=input_features)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Print the transcript
print(f"Transcript from model: \n {transcription}")
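
The cells in this section repeat the same four steps: load the audio, build input features, generate token IDs, and decode them. When deploying the model in an application, it may be convenient to wrap those steps in one function so the model and processor are loaded once and reused. The sketch below is one way to do that. `transcribe_file` and its `load_audio` parameter are hypothetical names that are not part of the original notebook. By default it falls back to `librosa.load`, as the cells above do.

```python
def transcribe_file(path, model, processor, device="cpu",
                    load_audio=None, sampling_rate=16000):
    """Transcribe one audio file with an already-loaded model and processor."""
    if load_audio is None:
        # Imported lazily so the function can be defined without librosa installed.
        import librosa
        load_audio = librosa.load

    # Load and resample the audio to the rate Whisper expects.
    samples, sr = load_audio(path, sr=sampling_rate)

    # Convert the waveform to input features on the same device as the model.
    features = processor(samples, return_tensors="pt",
                         sampling_rate=sr).input_features
    features = features.to(device)

    # Generate token IDs and decode them back to text.
    generated_ids = model.generate(input_features=features)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]


# Intended use with the objects created in the cells above:
# for path in (audio1, audio2):
#     print(transcribe_file(path, model, processor, device=device))
```

Keeping model loading outside the function avoids reloading the checkpoint for every file.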


In [10]:
# HuggingFace endpoint for this finetuned model
finetuned_model_id = "dmatekenya/whisper-large-v3-chichewa"

# Use the best model commit hash from the earlier cell
best_model_commit_hash = "bff60fb08ba9f294e05bfcab4306f30b6a0cfc0a"

# Whisper base model endpoint
whisper_base_model_id = "openai/whisper-large-v3"

# Language: I used Shona when fine-tuning, so use it when loading the processor
whisper_base_model_language = "shona"

# Load the Whisper processor from the base model
processor = WhisperProcessor.from_pretrained(whisper_base_model_id,
                                           language=whisper_base_model_language,
                                           task="transcribe")
# Load the finetuned model and use revision parameter to ensure
# we load the best model
model = WhisperForConditionalGeneration.from_pretrained(finetuned_model_id,
                                                      revision=best_model_commit_hash)

# Use one of the predefined audio files
audio_file = audio1
y, sr = librosa.load(audio_file, sr=16000)

# Prepare input features for the model
input_features = processor(y, return_tensors="pt", sampling_rate=sr).input_features

# Generate transcription
generated_ids = model.generate(input_features=input_features)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Print the transcript
print(f"Transcript from model: \n {transcription}")
