# ASR for Khowar using Whisper Model
In this notebook, we fine-tune Whisper, an automatic speech recognition (ASR) system, for the Khowar language. Because Khowar is a low-resource language with limited training data, we start from the pre-trained multilingual `openai/whisper-base` checkpoint and set the tokenizer language to Urdu. The main reason for choosing Urdu is that Urdu and Khowar use the same writing script. This similarity in their written form should make it easier for the model to learn to transcribe Khowar speech. By building on the Urdu configuration, we were able to improve the performance of Whisper on Khowar ASR tasks.

**Whisper:** Whisper is an automatic speech recognition (ASR) system developed by OpenAI. It converts spoken language into written text using deep learning. Whisper is trained on a large, diverse dataset of multilingual and multitask audio, making it highly accurate and capable of understanding various languages, accents, and background noise conditions.

### 1. Install and Import Necessary Libraries

In [1]:
# Install the necessary libraries (uncomment to run).
# Version specifiers are quoted so the shell does not treat ">" as a redirect.
# !pip install "datasets>=2.6.1"
# !pip install git+https://github.com/huggingface/transformers
# !pip install librosa
# !pip install "evaluate>=0.30"
# !pip install jiwer
# !pip install gradio

In [None]:
import gc

import pandas as pd
from datasets import Audio, Dataset

### 2. Load and Prepare Dataset

In [None]:

DATA_DIR = "/content/drive/MyDrive/thesis/data"

train_df = pd.read_csv(f"{DATA_DIR}/train.csv")
test_df = pd.read_csv(f"{DATA_DIR}/test.csv")

In [None]:
# Normalize the column names before they are used below
train_df.columns = ["path", "sentence"]
test_df.columns = ["path", "sentence"]

In [None]:
# Prepend the audio directory to the filenames
train_df["path"] = f"{DATA_DIR}/labeled/" + train_df["path"]
test_df["path"] = f"{DATA_DIR}/labeled/" + test_df["path"]

In [None]:
train_dataset = Dataset.from_pandas(train_df)
test_dataset = Dataset.from_pandas(test_df)

In [None]:
train_dataset = train_dataset.cast_column("path", Audio(sampling_rate=16000))
test_dataset = test_dataset.cast_column("path", Audio(sampling_rate=16000))

### 3. Import Feature Extractor and Whisper Tokenizer

In [None]:
## import feature extractor
from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")

## Load WhisperTokenizer
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-base", language="Urdu", task="transcribe")

### 4. Create Whisper Processor

In [None]:
## Combine To Create A WhisperProcessor
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-base", language="Urdu", task="transcribe")

### 5. Preparing and Mapping the Dataset

In [None]:
def prepare_dataset(examples):
    # compute log-Mel input features from input audio array
    audio = examples["path"]
    examples["input_features"] = feature_extractor(
        audio["array"], sampling_rate=16000).input_features[0]
    del examples["path"]
    sentences = examples["sentence"]

    # encode target text to label ids
    examples["labels"] = tokenizer(sentences).input_ids
    del examples["sentence"]
    return examples
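
Whisper's feature extractor pads or truncates every clip to a fixed 30-second window before computing the log-Mel spectrogram, so each example produces an array of the same shape regardless of its duration. The sketch below only reproduces that arithmetic. It assumes the standard `openai/whisper-base` settings (16 kHz audio, 80 Mel bins, a hop length of 160 samples), and the helper names `expected_feature_shape` and `is_truncated` are illustrative, not part of the library.

```python
from typing import Tuple

SAMPLING_RATE = 16_000   # Hz, matches the Audio(sampling_rate=16000) cast
CHUNK_SECONDS = 30       # Whisper's fixed input window
HOP_LENGTH = 160         # samples between successive spectrogram frames
N_MELS = 80              # Mel bins (assumed default for whisper-base)


def expected_feature_shape(num_samples: int) -> Tuple[int, int]:
    """Return the (n_mels, n_frames) shape of the padded log-Mel features.

    The clip length does not change the result: shorter clips are
    zero-padded and longer clips are truncated to the 30-second window.
    """
    window_samples = SAMPLING_RATE * CHUNK_SECONDS   # 480,000 samples
    n_frames = window_samples // HOP_LENGTH          # 3,000 frames
    return (N_MELS, n_frames)


def is_truncated(num_samples: int) -> bool:
    """True when a clip is longer than the window and would lose audio."""
    return num_samples > SAMPLING_RATE * CHUNK_SECONDS


print(expected_feature_shape(4 * SAMPLING_RATE))   # a 4-second clip
print(is_truncated(45 * SAMPLING_RATE))            # a 45-second clip
```

A check like `is_truncated` can be useful before mapping the dataset, since any recording longer than 30 seconds would be cut while its transcript stays complete.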

In [None]:
train_dataset = train_dataset.map(prepare_dataset, num_proc=1)
test_dataset = test_dataset.map(prepare_dataset, num_proc=1)

Map:   0%|          | 0/1599 [00:00<?, ? examples/s]

Map:   0%|          | 0/199 [00:00<?, ? examples/s]

### 6. Data Collator Definition

In [None]:
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any
    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")
        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")
        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)
        # if the bos token was prepended in the previous tokenization step,
        # cut it here because it is added again later anyway
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]
        batch["labels"] = labels
        return batch

## Instantiate the data collator
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)
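
To make the label handling in the collator easier to follow, here is a minimal pure-Python sketch of the same idea, without tensors. It pads the label sequences to a common length with `-100` (the value PyTorch's cross-entropy loss ignores by default) and drops the first token only when every sequence starts with the BOS token. The function name `pad_and_mask_labels` and the toy token ids are assumptions made for illustration; the real collator delegates padding to `processor.tokenizer.pad`.

```python
from typing import List

IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss


def pad_and_mask_labels(label_seqs: List[List[int]], bos_token_id: int) -> List[List[int]]:
    """Pad labels to a common length with IGNORE_INDEX and drop a shared leading BOS."""
    max_len = max(len(seq) for seq in label_seqs)
    padded = [seq + [IGNORE_INDEX] * (max_len - len(seq)) for seq in label_seqs]
    # Strip the first column only when every sequence starts with BOS,
    # mirroring the `.all()` check in the collator.
    if all(seq[0] == bos_token_id for seq in padded):
        padded = [seq[1:] for seq in padded]
    return padded


# Toy token ids; 1 stands in for the BOS token and is not Whisper's real id.
batch = [[1, 7, 8, 9], [1, 5]]
print(pad_and_mask_labels(batch, bos_token_id=1))
# → [[7, 8, 9], [5, -100, -100]]
```

Masking the padded positions matters because otherwise the loss would reward the model for predicting padding tokens after the end of short transcripts.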

### 7. Evaluation Metric Definition

In [None]:
import evaluate

metric = evaluate.load("wer")

In [None]:
def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)
    wer = 100 * metric.compute(predictions=pred_str, references=label_str)
    return {"wer": wer}
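
For readers unfamiliar with the metric, the following sketch shows what a word error rate measures: the number of word-level substitutions, deletions, and insertions needed to turn the prediction into the reference, divided by the number of reference words. This is a simplified stand-in written for illustration, not the code behind `evaluate.load("wer")`; the function names are our own, and we assume errors are pooled over the whole corpus before dividing.

```python
from typing import List


def word_edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance over words (substitutions, deletions, insertions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def corpus_wer(references: List[str], predictions: List[str]) -> float:
    """Total word errors divided by total reference words, as a percentage."""
    errors = sum(word_edit_distance(r.split(), p.split())
                 for r, p in zip(references, predictions))
    total_words = sum(len(r.split()) for r in references)
    return 100 * errors / total_words


refs = ["the cat sat on the mat", "hello world"]
preds = ["the cat sat on mat", "hello word"]
print(corpus_wer(refs, preds))  # 2 errors over 8 reference words → 25.0
```

Because the metric counts whole words, a single wrong character in a word counts as a full error, which tends to inflate WER for languages with rich morphology or inconsistent spelling.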

In [None]:
# Load a Pre-Trained Checkpoint
from transformers import WhisperForConditionalGeneration
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")

# Explicitly set gradient_checkpointing to False on the model config
model.config.gradient_checkpointing = False

In [None]:
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []

### 8. Training

In [None]:
# Define the Training Arguments
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-base-kho",  # change to a repo name of your choice
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=15000,
    gradient_checkpointing=False,
    fp16=True,
    eval_strategy="steps",
    per_device_eval_batch_size=1,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=500,
    eval_steps=500,
    # logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=False,
)

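from transformers import Seq2SeqTrainer
import numpy as np

# Redefine compute_metrics (this overrides the version from section 7) so that
# batch_decode always receives NumPy arrays
def compute_metrics(pred):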
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # Ensure inputs to batch_decode are numpy arrays
    pred_str = tokenizer.batch_decode(np.asarray(pred_ids), skip_special_tokens=True)
    label_str = tokenizer.batch_decode(np.asarray(label_ids), skip_special_tokens=True)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)
    return {"wer": wer}


trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    # Use processing_class instead of tokenizer as recommended by the warning
    processing_class=processor,
)

## start the model training
trainer.train()

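| Step | Training Loss | Validation Loss | WER (%) |
|------|---------------|-----------------|---------|
| 500  | 1.1053 | 0.773119 | 69.746083 |
| 1000 | 0.1669 | 0.890454 | 65.099946 |
| 1500 | 0.0208 | 1.154179 | 62.452728 |
| 2000 | 0.0045 | 1.191722 | 62.452728 |
| 2500 | 0.0019 | 1.277595 | 61.858455 |
| 3000 | 0.0029 | 1.237094 | 65.532145 |
| 3500 | 0.0052 | 1.285191 | 62.020529 |
| 4000 | 0.0014 | 1.351269 | 61.048082 |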


Using custom `forced_decoder_ids` from the (generation) config. This is deprecated in favor of the `task` and `language` flags/config options.
Transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English. This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`. See https://github.com/huggingface/transformers/pull/28687 for more details.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
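
With `load_best_model_at_end=True`, `metric_for_best_model="wer"`, and `greater_is_better=False`, the checkpoint that is kept is the one with the lowest WER. The sketch below mimics that selection rule on the values from the evaluation log; it is not the Trainer's internal code, and `best_checkpoint` is a hypothetical helper. It also shows that selecting by validation loss instead would pick a very different checkpoint.

```python
# (step, validation_loss, wer) rows copied from the evaluation log
eval_log = [
    (500, 0.773119, 69.746083),
    (1000, 0.890454, 65.099946),
    (1500, 1.154179, 62.452728),
    (2000, 1.191722, 62.452728),
    (2500, 1.277595, 61.858455),
    (3000, 1.237094, 65.532145),
    (3500, 1.285191, 62.020529),
    (4000, 1.351269, 61.048082),
]


def best_checkpoint(rows, column, greater_is_better=False):
    """Return the row with the best value in the given column."""
    pick = max if greater_is_better else min
    return pick(rows, key=lambda row: row[column])


best_by_wer = best_checkpoint(eval_log, column=2)
best_by_loss = best_checkpoint(eval_log, column=1)
print("best by WER: step", best_by_wer[0])    # → step 4000
print("best by loss: step", best_by_loss[0])  # → step 500
```

The two criteria disagree here: WER keeps improving slowly while validation loss is lowest at the first evaluation, so the choice of `metric_for_best_model` directly decides which checkpoint is reported.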


### Results and Discussion
The Whisper model was fine-tuned on approximately 2 hours of labeled Khowar audio, starting from the pretrained `openai/whisper-base` checkpoint with the tokenizer language set to Urdu. This choice was intentional: Khowar and Urdu use the same writing script, which makes the Urdu configuration a suitable starting point for transfer learning.

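After training for **4,000 steps** (out of the planned 15,000), the model achieved a **Word Error Rate (WER) of 61.05%**, which we consider a promising result given the limited data and compute resources. Training was stopped early due to hardware constraints. The evaluation log also shows the training loss falling to about 0.001 while the validation loss rose from 0.77 to 1.35, a pattern that suggests the model is overfitting the small training set.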
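
To put the step counts in context, the following back-of-the-envelope sketch converts optimizer steps into passes over the training data. It uses the training-split size reported by the map step (1,599 examples) and the batch settings from the training arguments. The single-GPU setting is an assumption, and the helper functions are illustrative approximations rather than the Trainer's exact bookkeeping.

```python
import math

TRAIN_EXAMPLES = 1599   # training-split size shown in the Map progress output
BATCH_SIZE = 16         # per_device_train_batch_size
GRAD_ACCUM = 1          # gradient_accumulation_steps
NUM_DEVICES = 1         # assumption: training ran on a single GPU


def optimizer_steps_per_epoch(n_examples, batch_size, grad_accum=1, num_devices=1):
    """Approximate number of optimizer updates in one pass over the data."""
    batches = math.ceil(n_examples / (batch_size * num_devices))
    return max(batches // grad_accum, 1)


def epochs_covered(steps, steps_per_epoch):
    """Number of passes over the training data for a given step count."""
    return steps / steps_per_epoch


per_epoch = optimizer_steps_per_epoch(TRAIN_EXAMPLES, BATCH_SIZE, GRAD_ACCUM, NUM_DEVICES)
print(per_epoch)                          # → 100
print(epochs_covered(4_000, per_epoch))   # → 40.0
print(epochs_covered(15_000, per_epoch))  # → 150.0
```

Under these assumptions, 4,000 steps already correspond to roughly 40 passes over the same 1,599 clips, which may help explain the near-zero training loss, and the planned 15,000 steps would have meant about 150 passes.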

The result highlights Whisper's robustness and the potential of low-resource language adaptation using related language models. While the WER is still high, it establishes a solid baseline for future improvements, such as:

* Increasing dataset size and dialectal variety
* Using more compute for full training
* Experimenting with larger Whisper variants or multilingual pretraining

Overall, this experiment demonstrates that fine-tuning Whisper on a small Khowar dataset using Urdu as a base model is a viable approach for building ASR systems in underrepresented languages.


### Demo with Best Checkpoint

In [None]:
import torch
import torchaudio
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# === Load Model and Processor ===
model = WhisperForConditionalGeneration.from_pretrained("/content/drive/MyDrive/WHISPER_KHOWAR")
processor = WhisperProcessor.from_pretrained("/content/drive/MyDrive/WHISPER_KHOWAR")

# === Load and Preprocess Audio ===
file_path = "/content/voice.wav"  # Your .wav file
waveform, sample_rate = torchaudio.load(file_path)

# Resample if not 16kHz
if sample_rate != 16000:
    resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
    waveform = resampler(waveform)

# Downmix to mono (averaging the channels) and convert to a 1D numpy array
audio = waveform.mean(dim=0).numpy()

# === Prepare Input ===
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# === Run Inference ===
with torch.no_grad():
    predicted_ids = model.generate(inputs["input_features"])

# === Decode and Print Transcription ===
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print("Transcription:", transcription)


Using custom `forced_decoder_ids` from the (generation) config. This is deprecated in favor of the `task` and `language` flags/config options.
Transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English. This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`. See https://github.com/huggingface/transformers/pull/28687 for more details.
`generation_config` default values have been modified to match model-specific defaults: {'suppress_tokens': [], 'begin_suppress_tokens': [220, 50257]}. If this is not desired, please set these values explicitly.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
A custom logits processor of type <class 'transform

Transcription:  اننا اان مسہ زائیلہ ہیس ماروٹیین ہمیشہ زباک دست
