## Installations

Create a conda environment:
conda create -n whisenv

Activate the environment:
conda activate whisenv


Once the environment is activated, install the following packages from the conda prompt:

conda install -c conda-forge pandas tqdm numpy -y

conda install pytorch torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

pip install transformers datasets evaluate peft bitsandbytes accelerate

pip install --no-cache-dir h5py

(Refer to environment.yml for the exact package versions.)

##### Model Setup for Hindi Speech-to-Text with Whisper
This script initializes the OpenAI Whisper large model for automatic speech recognition (ASR) in Hindi. It loads:

WhisperTokenizer to tokenize the Hindi transcripts,

WhisperFeatureExtractor to process raw audio inputs,

WhisperForConditionalGeneration as the core ASR model.

The model is set up for the task of transcribing Hindi speech and is loaded onto the GPU (cuda) for faster inference.

In [228]:
from transformers import WhisperTokenizer
from transformers import WhisperFeatureExtractor
from transformers import WhisperForConditionalGeneration

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-large",language='hindi',task='transcribe')
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-large",language='hindi',task='transcribe')
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large").to('cuda')

##### Whisper Processor Initialization for Hindi ASR
This code loads the WhisperProcessor from Hugging Face’s transformers library, combining both the tokenizer and feature extractor into a single processor. It’s configured for transcribing Hindi audio using the OpenAI Whisper large model, streamlining the preprocessing and tokenization steps for speech-to-text tasks.

In [229]:
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-large",language='hindi',task='transcribe')

## Dataset processing (train+dev, test)

#### Dataset Preparation for Hindi Common Voice ASR
This script loads the Hindi Common Voice dataset (train, dev, test splits) from local .tsv files and audio clips, merges train and dev sets for training, and converts them into Hugging Face DatasetDict format. It processes audio file paths to full paths, keeps only necessary columns (audio and sentence), and casts the audio column to the appropriate Audio type for easy use with Hugging Face pipelines.

Train samples: 7563

Test samples: 3337

In [235]:
import os
import pandas as pd
import torchaudio
from datasets import Dataset, DatasetDict, Audio
from tqdm import tqdm

# Set the dataset path
DATASET_PATH = r"C:\Users\WORKSTATIONS\Desktop\BijoyashreeDas\COMMON_VOICE_HI\cv-corpus-21.0-2025-03-14\hi"
CLIPS_PATH = os.path.join(DATASET_PATH, "clips")

# Load train+dev as train
train_df = pd.read_csv(os.path.join(DATASET_PATH, "train.tsv"), sep="\t")
dev_df = pd.read_csv(os.path.join(DATASET_PATH, "dev.tsv"), sep="\t")
train_df = pd.concat([train_df, dev_df], ignore_index=True)

# Load test data
test_df = pd.read_csv(os.path.join(DATASET_PATH, "test.tsv"), sep="\t")

# Function to get full audio path
def get_audio_path(filename):
    return os.path.join(CLIPS_PATH, filename)

# Convert data to Hugging Face dataset format
def convert_to_hf_dataset(df):
    df = df[['path', 'sentence']].dropna().copy()  # Keep only required columns; copy so the next assignment does not touch a view
    df['audio'] = df['path'].apply(get_audio_path)  # Convert paths
    return Dataset.from_pandas(df[['audio', 'sentence']])  # Create HF dataset

# Convert train and test to Hugging Face format
commonvoice_train = convert_to_hf_dataset(train_df)
commonvoice_test = convert_to_hf_dataset(test_df)

# Define dataset dictionary
commonvoice_dataset = DatasetDict({
    "train": commonvoice_train,
    "test": commonvoice_test
})

# Cast the audio column to Hugging Face Audio format
commonvoice_dataset = commonvoice_dataset.cast_column("audio", Audio())

# Print dataset structure
print(commonvoice_dataset)


DatasetDict({
    train: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 7563
    })
    test: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 3337
    })
})


##### Inspecting Dataset Samples
This snippet demonstrates how to access and inspect individual samples from the prepared Hugging Face dataset. It retrieves the first example from the training split, printing the audio file path and its corresponding transcription text. This is useful for verifying the dataset loading and preprocessing steps.



In [152]:
# Get the first sample from the train set
first_sample = commonvoice_dataset["train"][0]

# Print the audio filename and transcription
print("Audio File:", first_sample["audio"]["path"])
print("Transcription:", first_sample["sentence"])


Audio File: C:\Users\WORKSTATIONS\Desktop\BijoyashreeDas\COMMON_VOICE_HI\cv-corpus-21.0-2025-03-14\hi\clips\common_voice_hi_26008353.mp3
Transcription: हमने उसका जन्मदिन मनाया।


## Total hours in train and test

##### Calculating Total Duration of Train and Test Audio Files
This code calculates the total duration of all audio files in the train and test splits of the dataset. Using torchaudio to load each audio file, it sums their lengths in seconds and converts the result to hours. This metric helps understand the amount of audio data available for training and evaluation.

In [11]:
import torchaudio
from tqdm import tqdm

# Function to calculate total duration of audio files
def get_total_duration(dataset):
    total_duration = 0.0  # In seconds
    for sample in tqdm(dataset, desc="Calculating duration"):
        audio_path = sample["audio"]["path"]
        waveform, sample_rate = torchaudio.load(audio_path)  # Load audio
        total_duration += waveform.shape[1] / sample_rate  # Compute duration (seconds)
    
    return total_duration / 3600  # Convert seconds to hours

# Compute total duration for train and test sets
train_hours = get_total_duration(commonvoice_dataset["train"])
test_hours = get_total_duration(commonvoice_dataset["test"])

# Print results
print(f"Total duration of Train set: {train_hours:.2f} hours")
print(f"Total duration of Test set: {test_hours:.2f} hours")


Calculating duration: 100%|██████████| 7563/7563 [01:33<00:00, 80.90it/s]
Calculating duration: 100%|██████████| 3337/3337 [00:47<00:00, 70.89it/s]

Total duration of Train set: 9.38 hours
Total duration of Test set: 4.73 hours
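
As a quick sanity check on the totals above, the average clip length can be derived from the reported hours and sample counts. This is a minimal sketch that only reuses the numbers printed in this document. The helper name `mean_clip_seconds` is made up for illustration.

```python
# Totals reported by the notebook output above.
train_hours, train_clips = 9.38, 7563
test_hours, test_clips = 4.73, 3337


def mean_clip_seconds(total_hours: float, num_clips: int) -> float:
    """Average clip length in seconds, given total hours and clip count."""
    return total_hours * 3600 / num_clips


train_mean = mean_clip_seconds(train_hours, train_clips)
test_mean = mean_clip_seconds(test_hours, test_clips)

print(f"Train: {train_mean:.2f} s per clip")
print(f"Test:  {test_mean:.2f} s per clip")
```

Both averages come out at roughly five seconds, which is what one would expect for short, single-sentence Common Voice recordings.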





## Resample audio files to 16kHz



This script resamples all audio samples in the Common Voice dataset to a consistent 16 kHz sampling rate, which is commonly required for speech recognition models like Whisper. It uses torchaudio for resampling and processes both the training and testing splits to ensure uniform audio input quality.

In [239]:
import torchaudio
import torch
from datasets import Audio

# Get the first audio sample in the train set
sample = commonvoice_dataset["train"][0]["audio"]

# Print original sampling rate
print(f"Original Sampling Rate: {sample['sampling_rate']} Hz")



# Function to resample audio to 16kHz
def resample_audio(batch):
    waveform = batch["audio"]["array"]
    orig_sr = batch["audio"]["sampling_rate"]
    
    # Convert to PyTorch tensor
    waveform = torch.tensor(waveform, dtype=torch.float32)

    # Resample if needed
    if orig_sr != 16000:
        resampler = torchaudio.transforms.Resample(orig_sr, 16000)
        waveform = resampler(waveform)

    return {"audio": {"array": waveform.numpy(), "sampling_rate": 16000}}  # Convert back to NumPy

# Apply the resampling function to train and test sets
commonvoice_dataset = commonvoice_dataset.map(resample_audio)

print("✅ Resampling complete. All audio is now at 16kHz.")



Original Sampling Rate: 32000 Hz


Map:   0%|          | 0/7563 [00:00<?, ? examples/s]

Map:   0%|          | 0/3337 [00:00<?, ? examples/s]

✅ Resampling complete. All audio is now at 16kHz.


### Inspecting Sampling Rates of Random Audio Samples
This utility function prints the sampling rates of a few randomly selected audio files from the dataset splits (train and test). It helps verify that audio resampling (to 16 kHz) was applied correctly and consistently across the dataset.

In [31]:
import random

# Function to print sampling rate of N random samples
def print_random_sampling_rates(dataset, split, num_samples=3):
    print(f"\nSampling rates of {num_samples} random files from '{split}' set:")
    
    # Select random indices
    random_indices = random.sample(range(len(dataset[split])), num_samples)
    
    # Fetch and print sampling rates
    for idx in random_indices:
        sample = dataset[split][idx]["audio"]
        print(f"Sample {idx}: {sample['sampling_rate']} Hz")

# Print sampling rates for train and test sets
print_random_sampling_rates(commonvoice_dataset, "train")
print_random_sampling_rates(commonvoice_dataset, "test")



Sampling rates of 3 random files from 'train' set:
Sample 3346: 16000 Hz
Sample 3332: 16000 Hz
Sample 4704: 16000 Hz

Sampling rates of 3 random files from 'test' set:
Sample 2137: 16000 Hz
Sample 3064: 16000 Hz
Sample 1557: 16000 Hz


## WER on train and test sets before fine-tuning

This code transcribes the audio dataset with the pretrained Whisper model and evaluates the transcription quality using the Word Error Rate (WER) metric.

In [34]:
import torch
import evaluate
from tqdm import tqdm

# Load WER metric
wer_metric = evaluate.load("wer")

# Function to transcribe audio
def transcribe_audio(batch):
    audio = batch["audio"]["array"]  # Get audio waveform
    inputs = processor(audio, sampling_rate=16000, return_tensors="pt")  # Process audio
    input_features = inputs.input_features.to("cuda")  # Move to GPU

    # Generate transcription
    with torch.no_grad():
        predicted_ids = model.generate(input_features)

    # Decode predictions
    transcription = tokenizer.batch_decode(predicted_ids, skip_special_tokens=True)[0]
    return {"transcription": transcription}

# Apply transcription function to train and test sets
commonvoice_dataset = commonvoice_dataset.map(transcribe_audio)

# Compute WER
def compute_wer(dataset):
    references = [x["sentence"] for x in dataset]  # Ground truth
    predictions = [x["transcription"] for x in dataset]  # Model output
    wer = wer_metric.compute(predictions=predictions, references=references)
    return wer

# Compute WER for train and test sets
train_wer = compute_wer(commonvoice_dataset["train"])
test_wer = compute_wer(commonvoice_dataset["test"])

print(f"✅ WER on Train Set: {train_wer:.2%}")
print(f"✅ WER on Test Set: {test_wer:.2%}")


Map:   0%|          | 0/7563 [00:00<?, ? examples/s]

Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.
  attn_output = torch.nn.functional.scaled_dot_product_attention(
Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.43.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Map:   0%|          | 0/3337 [00:00<?, ? examples/s]

✅ WER on Train Set: 68.92%
✅ WER on Test Set: 71.67%
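
To make the metric concrete, here is a minimal pure-Python sketch of what WER measures: the word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words. It is an illustration of the definition, not the implementation used by `evaluate.load("wer")`. The sentence pair is Sample 3 from the actual-vs-predicted comparison further down in this document. Stripping the danda (।) is a hypothetical normalization step, not something the notebook does.

```python
def word_error_rate(reference: str, prediction: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), prediction.split()
    # prev[j] is the edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)  # assumes a non-empty reference


reference = "मैं मुसीबत में पड़ गया।"
prediction = "मैं मुसीवत में पढ़ गया"

raw_wer = word_error_rate(reference, prediction)
# Hypothetical normalization: remove the danda before scoring.
normalized_wer = word_error_rate(reference.replace("।", ""), prediction)

print(f"Raw WER: {raw_wer:.2f}")
print(f"WER without danda: {normalized_wer:.2f}")
```

In this pair the final word counts as an error only because the reference ends with a danda and the prediction does not, so part of the baseline WER reported above may come from punctuation rather than recognition errors.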


## Actual vs Predicted transcription for the first 10 audio samples from test (using baseline whisper model without finetuning)

In [244]:
import torch

# Function to transcribe audio using Whisper model
def transcribe_audio(audio_array, sampling_rate):
    # Preprocess audio; transcription mode (no translation) is forced at generation time below
    inputs = processor(audio_array, sampling_rate=sampling_rate, return_tensors="pt")
    input_features = inputs.input_features.to("cuda")

    # Add forced transcription settings
    forced_decoder_ids = processor.get_decoder_prompt_ids(
        language="hi", task="transcribe"  # Set language and force transcription
    )

    # Generate transcription
    with torch.no_grad():
        predicted_ids = model.generate(
            input_features,
            forced_decoder_ids=forced_decoder_ids
        )

    # Decode predictions
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
    return transcription

# 🎧 Compare Actual vs. Predicted for First 10 Samples in Test Set
print("\n✅ Comparing Actual vs. Predicted Transcriptions (First 10 Samples)\n")

for idx in range(min(10, len(commonvoice_dataset["test"]))):
    sample = commonvoice_dataset["test"][idx]
    predicted_transcription = transcribe_audio(sample["audio"]["array"], sample["audio"]["sampling_rate"])

    print(f"🎧 **Sample {idx+1}:**")
    print(f"📝 **Actual   :** {sample['sentence']}")
    print(f"🤖 **Predicted:** {predicted_transcription}")
    print("-" * 80)



✅ Comparing Actual vs. Predicted Transcriptions (First 10 Samples)

🎧 **Sample 1:**
📝 **Actual   :** अब रामपुर में अखिलेश बांटेंगे लैपटॉप का 'लॉलीपॉप'
🤖 **Predicted:**  अब राम्पुर में अकिलेश बातेंगे लैप्टप का लोलीपॉप
--------------------------------------------------------------------------------
🎧 **Sample 2:**
📝 **Actual   :** Flipkart: बंपर ऑफर्स के साथ बिक रहा है Lenovo का ये शानदार स्मार्टफोन
🤖 **Predicted:**  Flipkart, Bumper offer के साथ बिग रहा है, Lenovo का ये शानदार smartphone
--------------------------------------------------------------------------------
🎧 **Sample 3:**
📝 **Actual   :** मैं मुसीबत में पड़ गया।
🤖 **Predicted:**  मैं मुसीवत में पढ़ गया
--------------------------------------------------------------------------------
🎧 **Sample 4:**
📝 **Actual   :** सुशील मोदी है 'अफवाह मियां', बिगड़ चुका है मानसिक संतुलन: तेजस्वी यादव
🤖 **Predicted:**  शुशील मोडी है अख्वा मियां बीगर चुका है मनसी संकुलं ते जस्पी आदा
-------------------------------------------------------------

## Extract log-mel features and tokenize the transcriptions (for both train and test)

This code applies feature extraction and tokenization to the entire dataset, including both training and testing splits. It uses the Whisper feature extractor to convert raw audio waveforms into log-Mel spectrogram features, which are suitable as model inputs. Simultaneously, it tokenizes the corresponding text transcriptions into token IDs for training targets. By calling .map() on the Hugging Face DatasetDict, these preprocessing steps are automatically performed on all dataset splits, streamlining the data preparation process for model training and evaluation.

In [154]:
from transformers import WhisperFeatureExtractor

# Load the Whisper feature extractor
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-large")

def extract_features_and_encode(batch):
    # Extract log-Mel spectrogram features
    batch["input_features"] = feature_extractor(batch["audio"]["array"], sampling_rate=16000).input_features[0]

    # Encode target transcriptions
    batch["labels"] = tokenizer(batch["sentence"]).input_ids

    return batch

# Apply the function to the dataset
commonvoice_dataset = commonvoice_dataset.map(extract_features_and_encode, num_proc=1)

print("✅ Feature extraction and tokenization complete!")


Map:   0%|          | 0/7563 [00:00<?, ? examples/s]

Map:   0%|          | 0/3337 [00:00<?, ? examples/s]

✅ Feature extraction and tokenization complete!


{
    'audio': {
        'array': numpy_array, 
        'sampling_rate': 16000
    },
    'sentence': 'This is the transcription of the audio.',
    'input_features': tensor_of_log_mel_spectrogram,
    'labels': [tokenized_ids]
}


Each dataset entry now has the structure shown above.

This code prints the structure of the preprocessed dataset after feature extraction and tokenization, providing an overview of the dataset format. It also displays a sample data entry from the training set, showing the extracted audio features and corresponding tokenized labels. This helps verify that the preprocessing steps were correctly applied.

In [158]:
# Print dataset format after feature extraction
print(commonvoice_dataset)

# Print a sample entry from the dataset (first example from train set)
print(commonvoice_dataset["train"][0])


DatasetDict({
    train: Dataset({
        features: ['audio', 'sentence', 'input_features', 'labels'],
        num_rows: 7563
    })
    test: Dataset({
        features: ['audio', 'sentence', 'input_features', 'labels'],
        num_rows: 3337
    })
})





## Change it to
Dataset({
    features: ['input_features', 'labels'],
    num_rows: 6760
})

This snippet removes the original raw audio and transcription text columns from the dataset, keeping only the processed features and labels needed for training.

In [170]:
# Remove unnecessary columns
commonvoice_dataset = commonvoice_dataset.remove_columns(["audio", "sentence"])
#commonvoice_dataset = commonvoice_dataset.remove_columns(["transcription"])

# Print dataset structure
print(commonvoice_dataset)


DatasetDict({
    train: Dataset({
        features: ['input_features', 'labels'],
        num_rows: 7563
    })
    test: Dataset({
        features: ['input_features', 'labels'],
        num_rows: 3337
    })
})


### input_features → Tensor of shape (batch_size, 80, time_steps)

80 is the number of Mel frequency bins (Whisper feature size)

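time_steps is 3000 for every example: with its default settings, the Whisper feature extractor pads or truncates each clip to 30 seconds, so all inputs already share the same length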

### labels → Tensor of shape (batch_size, label_length)

The tokenized transcription sequences

Padded to the longest sequence in the batch

Padding tokens replaced with -100 (ignored during loss computation)

This class defines a custom data collator used during training to batch and pad variable-length audio input features and their corresponding tokenized labels. It ensures the audio features are padded correctly using the processor’s feature extractor, while the label sequences are padded separately with the tokenizer’s padding method. Padding tokens in labels are replaced with -100 to be ignored during loss computation. Additionally, if a beginning-of-sequence token is present at the start of labels, it is removed to avoid duplication during training.



In [176]:
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union


@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if the bos token was added in the previous tokenization step,
        # cut it here, as it is added again later anyway
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch
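
The label handling in the collator above can be illustrated without torch. The following is a minimal sketch using plain Python lists. The token ids `PAD_ID` and `BOS_ID` are hypothetical values chosen for the example, not the real Whisper tokenizer ids, and `pad_and_mask` is an illustrative helper, not part of any library.

```python
IGNORE_INDEX = -100  # label value that the loss computation skips
PAD_ID = 0           # hypothetical pad token id, for illustration only
BOS_ID = 1           # hypothetical beginning-of-sequence id, for illustration only


def pad_and_mask(label_seqs):
    """Pad to the longest sequence, mask padding with -100, drop a shared BOS."""
    max_len = max(len(seq) for seq in label_seqs)
    padded = [seq + [PAD_ID] * (max_len - len(seq)) for seq in label_seqs]
    attention = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in label_seqs]
    # Replace every padded position with IGNORE_INDEX, using the attention mask.
    masked = [
        [tok if keep else IGNORE_INDEX for tok, keep in zip(row, mask)]
        for row, mask in zip(padded, attention)
    ]
    # If every sequence starts with BOS, cut it (the model adds it again later).
    if all(row[0] == BOS_ID for row in masked):
        masked = [row[1:] for row in masked]
    return masked


batch = pad_and_mask([[1, 7, 8, 9], [1, 5]])
print(batch)
```

Masking by attention mask rather than by token id matters when the pad token shares an id with a real token such as end-of-sequence: only positions that were actually padded are ignored.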

This line initializes the custom data collator by passing the processor (which includes the feature extractor and tokenizer). The data_collator will be used during training to dynamically pad batches of audio features and labels, ensuring consistent input shapes and correct handling of padding tokens for the loss calculation.

In [179]:
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)

### Evaluation metrics

In [182]:
import evaluate

metric = evaluate.load("wer")

Using the latest cached version of the module from C:\Users\WORKSTATIONS\.cache\huggingface\modules\evaluate_modules\metrics\evaluate-metric--wer\85bee9e4216a78bb09b2d0d500f6af5c23da58f9210e661add540f5df6630fcd (last modified on Wed Mar 19 12:07:22 2025) since it couldn't be found locally at evaluate-metric--wer, or remotely on the Hugging Face Hub.


## Post-processing on the model

To reduce the model's memory footprint, we load it in quantized form. Loading in 8-bit means the weights use one quarter of the bits of float32, with minimal loss in performance; the code below goes one step further and loads the model in 4-bit. We then need to apply some post-processing steps to the quantized model to enable training. We do so by first freezing all the model layers, and then casting the layer norms and the output layer to float32 for training and model stability. Because the Whisper encoder uses convolutional layers, gradient checkpointing would otherwise disable gradient computation; to avoid this, we explicitly make the inputs trainable.

This code loads the pretrained Whisper model (whisper-large) from Hugging Face and moves it to the GPU (cuda) for faster computation.

In [185]:
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large").to('cuda')


In [121]:
pip install bitsandbytes


Collecting bitsandbytes
  Downloading bitsandbytes-0.45.4-py3-none-win_amd64.whl.metadata (5.1 kB)
Downloading bitsandbytes-0.45.4-py3-none-win_amd64.whl (75.4 MB)

In [125]:
pip install --upgrade peft


Note: you may need to restart the kernel to use updated packages.


In [133]:
pip install "accelerate>=0.26.0"


Note: you may need to restart the kernel to use updated packages.


In [139]:
import accelerate
print(accelerate.__version__)


1.0.1


This section demonstrates loading the Whisper model with 4-bit quantization using the bitsandbytes library to significantly reduce GPU memory usage during training or inference. The BitsAndBytesConfig configures the model to use 4-bit precision with optimized compute and quantization settings. After loading the quantized model, prepare_model_for_kbit_training from the peft library prepares it for efficient fine-tuning with low-bit precision, enabling faster experimentation on limited hardware resources.

In [186]:
from transformers import BitsAndBytesConfig
from peft import prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # Instead of 8-bit
    bnb_4bit_compute_dtype="float16",  # Ensure compatibility
    bnb_4bit_use_double_quant=True  # Optional: Helps reduce memory usage
)

model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-large",
    quantization_config=bnb_config,
    device_map="auto"
)

model = prepare_model_for_kbit_training(model)
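
To give a rough sense of what quantization saves, the sketch below estimates weight storage at different precisions. It assumes the base parameter count is the "all params" figure minus the LoRA "trainable params" figure, both printed by `print_trainable_parameters()` later in this document. It ignores quantization constants, layers kept in higher precision, activations, and optimizer state, so real usage will be higher.

```python
# Base model size: total parameters minus LoRA parameters, both taken from
# the print_trainable_parameters() output shown later in this document.
BASE_PARAMS = 1_559_033_600 - 15_728_640


def weight_gigabytes(num_params: int, bits_per_param: int) -> float:
    """Approximate weight storage in GB (10**9 bytes), ignoring all overhead."""
    return num_params * bits_per_param / 8 / 1e9


for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {weight_gigabytes(BASE_PARAMS, bits):.2f} GB")
```

Under these assumptions, the weights shrink from about 6.2 GB in float32 to under 1 GB in 4-bit.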


This snippet registers a forward hook on the model’s first convolutional layer (conv1 in the encoder) to ensure that the output tensor requires gradients. This is useful when you want to enable gradient computation for specific intermediate outputs during backpropagation, which can be important for certain training or fine-tuning strategies.

In [188]:
def make_inputs_require_grad(module, input, output):
    output.requires_grad_(True)

model.model.encoder.conv1.register_forward_hook(make_inputs_require_grad)

<torch.utils.hooks.RemovableHandle at 0x1bf2c814820>

## Apply Low-rank adapters (LoRA) to the model

In [48]:
pip install --upgrade bitsandbytes transformers peft accelerate


Note: you may need to restart the kernel to use updated packages.


This code applies LoRA (Low-Rank Adaptation) to the Whisper model for parameter-efficient fine-tuning. The LoraConfig sets the LoRA hyperparameters such as rank (r), scaling factor (lora_alpha), target modules (q_proj and v_proj layers), dropout, and bias handling. The get_peft_model function wraps the original model with LoRA adapters, enabling efficient fine-tuning by updating only a small subset of parameters. Finally, print_trainable_parameters() displays which model parameters will be updated during training.

This wraps your existing pretrained model (e.g., Whisper) with LoRA adapters and freezes the original model’s parameters.

✅ Only the small number of new parameters in LoRA layers are trainable.

In [189]:
from peft import LoraConfig, PeftModel, LoraModel, get_peft_model

config = LoraConfig(r=32, lora_alpha=64, target_modules=["q_proj", "v_proj"], lora_dropout=0.05, bias="none")

model = get_peft_model(model, config)
model.print_trainable_parameters()

trainable params: 15,728,640 || all params: 1,559,033,600 || trainable%: 1.0089


Only about 1% of the model's parameters are trainable, which is what makes this Parameter-Efficient Fine-Tuning (PEFT).
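
The trainable-parameter count reported above can be reproduced by hand. The sketch below assumes the architecture values from the whisper-large model config (hidden size 1280, 32 encoder layers, 32 decoder layers) and the standard LoRA layout of two low-rank matrices per adapted linear layer. Treat it as a consistency check on the printed figure, not as a description of PEFT internals.

```python
D_MODEL = 1280        # hidden size of whisper-large (assumed from the model config)
ENCODER_LAYERS = 32   # assumed from the model config
DECODER_LAYERS = 32   # assumed from the model config
RANK = 32             # r in the LoraConfig above

# Each encoder layer has one attention block (self-attention); each decoder
# layer has two (self-attention and cross-attention). Every attention block
# contributes one q_proj and one v_proj.
attention_blocks = ENCODER_LAYERS * 1 + DECODER_LAYERS * 2
adapted_modules = attention_blocks * 2

# LoRA on a d_in x d_out linear layer adds A (r x d_in) and B (d_out x r).
params_per_module = RANK * (D_MODEL + D_MODEL)
trainable = adapted_modules * params_per_module

total = 1_559_033_600  # "all params" from the output above
print(f"{trainable:,} trainable ({100 * trainable / total:.4f}%)")
```

Doubling `r` doubles the number of trainable parameters, which is the main knob for trading adapter capacity against memory.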

## Define the Training Configuration

In [70]:
!pip uninstall h5py -y
!pip install --no-cache-dir h5py


Found existing installation: h5py 3.11.0
Uninstalling h5py-3.11.0:
  Successfully uninstalled h5py-3.11.0
Collecting h5py
  Downloading h5py-3.11.0-cp38-cp38-win_amd64.whl.metadata (2.5 kB)
Downloading h5py-3.11.0-cp38-cp38-win_amd64.whl (3.0 MB)
Installing collected packages: h5py
Successfully installed h5py-3.11.0


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow-intel 2.16.1 requires flatbuffers>=23.5.26, but you have flatbuffers 1.12 which is incompatible.
tensorflow-intel 2.16.1 requires keras>=3.0.0, but you have keras 2.9.0 which is incompatible.
tensorflow-intel 2.16.1 requires protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.20.3, but you have protobuf 3.19.6 which is incompatible.
tensorflow-intel 2.16.1 requires tensorboard<2.17,>=2.16, but you have tensorboard 2.9.1 which is incompatible.


In [190]:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir=r"C:\Users\WORKSTATIONS\Desktop\BijoyashreeDas\WHISPER",  
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,  
    learning_rate=1e-3,
    warmup_steps=50,
    num_train_epochs=20,
    evaluation_strategy="steps",
    fp16=True,
    per_device_eval_batch_size=8,
    generation_max_length=128,
    logging_steps=100,
    #max_steps=100,  # only for testing purposes, remove this in final run
    remove_unused_columns=False,  
    label_names=["labels"],  
)




## Train the model and save the adapter weights

The callback below ensures that only the LoRA adapter weights (not the full model) are saved at each checkpoint, which keeps checkpointing lightweight and storage-efficient.

In [192]:
import os
from transformers import Seq2SeqTrainer, TrainerCallback, TrainerState, TrainerControl, TrainingArguments
from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR

# Define save path
save_path = "C:/Users/WORKSTATIONS/Desktop/BijoyashreeDas/WHISPER"
adapter_path = os.path.join(save_path, "lora_adapter")

# Ensure save directories exist
os.makedirs(save_path, exist_ok=True)
os.makedirs(adapter_path, exist_ok=True)

# Callback to save only LoRA adapter weights
class SavePeftModelCallback(TrainerCallback):
    def on_save(
        self,
        args: TrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        **kwargs,
    ):
        checkpoint_folder = os.path.join(args.output_dir, f"{PREFIX_CHECKPOINT_DIR}-{state.global_step}")

        peft_model_path = os.path.join(checkpoint_folder, "adapter_model")
        kwargs["model"].save_pretrained(peft_model_path)

        pytorch_model_path = os.path.join(checkpoint_folder, "pytorch_model.bin")
        if os.path.exists(pytorch_model_path):
            os.remove(pytorch_model_path)
        return control

# Trainer setup
trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=commonvoice_dataset["train"],
    eval_dataset=commonvoice_dataset["test"],
    data_collator=data_collator,
    tokenizer=processor.feature_extractor,
    callbacks=[SavePeftModelCallback],
)

# Disable caching for training
model.config.use_cache = False




  trainer = Seq2SeqTrainer(


In [193]:
# Train the model
trainer.train()



  return fn(*args, **kwargs)
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]


Step,Training Loss,Validation Loss
100,0.4691,0.413666
200,0.322,0.400084
300,0.3052,0.410437
400,0.2948,0.402908
500,0.3008,0.380991
600,0.3075,0.382256
700,0.278,0.394883
800,0.2873,0.391826
900,0.2803,0.399232
1000,0.261,0.383539



TrainOutput(global_step=18920, training_loss=1.3686554358827134, metrics={'train_runtime': 218087.6325, 'train_samples_per_second': 0.694, 'train_steps_per_second': 0.087, 'total_flos': 3.24576772890624e+20, 'train_loss': 1.3686554358827134, 'epoch': 20.0})

In [194]:
# Save the LoRA adapter on its own.
# Assuming `model` is the PEFT-wrapped model used for training, save_pretrained
# writes only the adapter weights and config, not the base Whisper weights.
model.save_pretrained(adapter_path)

print(f"✅ LoRA adapter saved separately at: {adapter_path}")

✅ LoRA adapter saved separately at: C:/Users/WORKSTATIONS/Desktop/BijoyashreeDas/WHISPER\lora_adapter




## Print Step, Training Loss, and Validation Loss (Cross-Entropy Loss)

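Each training step processes one batch, so the number of steps per epoch depends on the dataset size and the batch size. For example, with 1,000 training samples and batch_size=8:

1000/8 = 125 steps per epoch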
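The same arithmetic can be sketched as a small helper. This is a minimal illustration, assuming a single device, no gradient accumulation, and that a final partial batch still counts as a step; the function name `steps_per_epoch` is hypothetical and not part of the notebook.

```python
import math

def steps_per_epoch(num_samples, batch_size):
    """Number of batches in one pass over the data (a partial last batch counts)."""
    return math.ceil(num_samples / batch_size)

print(steps_per_epoch(1000, 8))        # 125
print(steps_per_epoch(7568, 8))        # 946
print(steps_per_epoch(7568, 8) * 20)   # 18920
```

With 7,568 samples and 20 epochs this gives 18,920 steps, which agrees with `global_step=18920` in the `TrainOutput` shown earlier.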



In [196]:
print("Step\tTraining Loss\tValidation Loss")
for log in trainer.state.log_history:
    step = log.get("step", "N/A")
    train_loss = log.get("loss", None)  # Training loss
    val_loss = log.get("eval_loss", None)  # Validation loss

    if step != "N/A":  # Only print if it's a valid step
        train_loss_str = f"{train_loss:.6f}" if train_loss is not None else "N/A"
        val_loss_str = f"{val_loss:.6f}" if val_loss is not None else "N/A"
        print(f"{step}\t{train_loss_str}\t{val_loss_str}")


Step	Training Loss	Validation Loss
100	0.469100	N/A
100	N/A	0.413666
200	0.322000	N/A
200	N/A	0.400084
300	0.305200	N/A
300	N/A	0.410437
400	0.294800	N/A
400	N/A	0.402908
500	0.300800	N/A
500	N/A	0.380991
600	0.307500	N/A
600	N/A	0.382256
700	0.278000	N/A
700	N/A	0.394883
800	0.287300	N/A
800	N/A	0.391826
900	0.280300	N/A
900	N/A	0.399232
1000	0.261000	N/A
1000	N/A	0.383539
1100	0.239000	N/A
1100	N/A	0.389126
1200	0.242700	N/A
1200	N/A	0.405511
1300	3.132300	N/A
1300	N/A	3.828397
1400	3.332800	N/A
1400	N/A	2.864469
1500	2.768500	N/A
1500	N/A	2.640012
1600	2.587000	N/A
1600	N/A	2.520831
1700	2.491100	N/A
1700	N/A	2.440840
1800	2.426300	N/A
1800	N/A	2.389422
1900	2.407000	N/A
1900	N/A	2.329651
2000	2.341800	N/A
2000	N/A	2.314102
2100	2.327100	N/A
2100	N/A	2.244779
2200	2.305800	N/A
2200	N/A	2.222084
2300	2.254900	N/A
2300	N/A	2.225258
2400	2.241300	N/A
2400	N/A	2.159856
2500	2.209200	N/A
2500	N/A	2.154295
2600	2.194600	N/A
2600	N/A	2.116607
2700	2.162200	N/A
2700	N/A	2.083102
2800	2.1281
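
The output above has two rows per step because the training loss and the validation loss are logged as separate entries. A possible way to combine them into one row per step is sketched below. It works on a plain list of dicts shaped like `trainer.state.log_history`; the helper name `merge_log_history` and the sample entries (copied from the printed values for steps 1200 and 1300) are only for illustration.

```python
from collections import defaultdict

def merge_log_history(log_history):
    """Combine the separate training and evaluation entries logged at each step."""
    rows = defaultdict(dict)
    for entry in log_history:
        step = entry.get("step")
        if step is None:
            continue
        if "loss" in entry:
            rows[step]["train_loss"] = entry["loss"]
        if "eval_loss" in entry:
            rows[step]["val_loss"] = entry["eval_loss"]
    return [(s, rows[s].get("train_loss"), rows[s].get("val_loss")) for s in sorted(rows)]

def fmt(value):
    """Format a loss value, or N/A when it was not logged at that step."""
    return "N/A" if value is None else f"{value:.6f}"

# Sample entries taken from the printed log above
sample = [
    {"step": 1200, "loss": 0.2427},
    {"step": 1200, "eval_loss": 0.405511},
    {"step": 1300, "loss": 3.1323},
    {"step": 1300, "eval_loss": 3.828397},
]

print("Step\tTraining Loss\tValidation Loss")
for step, train_loss, val_loss in merge_log_history(sample):
    print(f"{step}\t{fmt(train_loss)}\t{fmt(val_loss)}")
```

In the notebook, the call would be `merge_log_history(trainer.state.log_history)`. With one row per step, the jump in training loss between step 1200 (0.2427) and step 1300 (3.1323) is easier to spot.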

## Evaluation and Inference

The cells in this section do the following:

Load the base Whisper model (the pretrained openai/whisper-large checkpoint) onto the GPU.

Load the LoRA (Low-Rank Adaptation) adapter and apply it on top of the base model. LoRA is a lightweight fine-tuning method that trains a small set of added low-rank parameters while the base weights stay frozen.

Enable caching for faster generation.

Save the model with the adapter applied to a local directory.

Load the processor for the Whisper model and save it to the same directory. The processor bundles the audio feature extraction and tokenization logic needed for input/output processing.
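
To make the idea of an adapter "applied on top of the base model" concrete, here is a toy NumPy sketch of a single LoRA-style linear layer. It is an illustration only, not how PEFT is implemented internally; the sizes, rank `r`, and scaling `alpha` are hypothetical values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 4, 8   # toy sizes (hypothetical)

W = rng.normal(size=(d_out, d_in))     # frozen base weight
A = rng.normal(size=(r, d_in))         # trainable low-rank factor
B = rng.normal(size=(d_out, r))        # trainable low-rank factor
scale = alpha / r

x = rng.normal(size=(d_in,))

# Adapter applied on top of the base layer: base output plus low-rank update
y_adapter = W @ x + scale * (B @ (A @ x))

# The same update folded into a single weight matrix
W_merged = W + scale * (B @ A)
y_merged = W_merged @ x

print(np.allclose(y_adapter, y_merged))   # True
print(W.size, A.size + B.size)            # 4096 512
```

Both forms give the same output, and the two low-rank factors hold far fewer values than the full weight matrix, which is why the adapter can be stored separately from the base model.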



In [199]:
from peft import PeftModel
from transformers import WhisperForConditionalGeneration

# Define paths
base_model_path = "openai/whisper-large"  # Change if needed
adapter_path = "C:/Users/WORKSTATIONS/Desktop/BijoyashreeDas/WHISPER/lora_adapter"

# Load base Whisper model
base_model = WhisperForConditionalGeneration.from_pretrained(base_model_path).to("cuda")

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, adapter_path)

# Enable cache for inference
model.config.use_cache = True

print("✅ Model and LoRA adapter loaded successfully!")


✅ Model and LoRA adapter loaded successfully!


## Save final model

In [201]:
save_path = "C:/Users/WORKSTATIONS/Desktop/BijoyashreeDas/WHISPER/final_model"

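# Save the PEFT-wrapped model.
# Note: on a PeftModel, save_pretrained writes the adapter weights and config,
# not the base model weights.
model.save_pretrained(save_path)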

print(f"Model saved successfully at {save_path}")


Model saved successfully at C:/Users/WORKSTATIONS/Desktop/BijoyashreeDas/WHISPER/final_model




## Compute WER on Train and Test Sets with the Saved Fine-Tuned Model

In [203]:
from transformers import WhisperProcessor

# Reload the processor from the base model and save it
processor = WhisperProcessor.from_pretrained("openai/whisper-large")  # Change to your base model
processor.save_pretrained("C:/Users/WORKSTATIONS/Desktop/BijoyashreeDas/WHISPER/final_model")


[]

In [204]:
from transformers import WhisperForConditionalGeneration, WhisperProcessor
from torch.utils.data import DataLoader
import torch
import gc
import numpy as np
from tqdm import tqdm
import evaluate  # ✅ Use `evaluate` instead of `datasets.load_metric`

# Define paths
model_path = "C:/Users/WORKSTATIONS/Desktop/BijoyashreeDas/WHISPER/final_model"

# Load the model
model = WhisperForConditionalGeneration.from_pretrained(model_path).to("cuda")
processor = WhisperProcessor.from_pretrained(model_path)

# Set to eval mode
model.eval()

# Load WER metric
metric = evaluate.load("wer")  # ✅ Correct way to load the WER metric


Using the latest cached version of the module from C:\Users\WORKSTATIONS\.cache\huggingface\modules\evaluate_modules\metrics\evaluate-metric--wer\85bee9e4216a78bb09b2d0d500f6af5c23da58f9210e661add540f5df6630fcd (last modified on Wed Mar 19 12:07:22 2025) since it couldn't be found locally at evaluate-metric--wer, or remotely on the Hugging Face Hub.


In [205]:
def compute_wer(dataset, batch_size=8, max_new_tokens=255):
    """Generates transcriptions and computes WER (in percent) for the given dataset."""
    # data_collator is the collator defined earlier in the notebook
    dataloader = DataLoader(dataset, batch_size=batch_size, collate_fn=data_collator)
    predictions, references = [], []

    for batch in tqdm(dataloader, desc="Evaluating WER..."):
        with torch.no_grad():
            input_features = batch["input_features"].to("cuda")

            # Generate transcription
            generated_tokens = model.generate(input_features, max_new_tokens=max_new_tokens)

            # If the collator masks label padding with -100, restore the pad token
            # so the labels can be decoded (a no-op when no -100 is present)
            labels = batch["labels"].cpu().numpy()
            labels = np.where(labels != -100, labels, processor.tokenizer.pad_token_id)

            # Decode predictions and references
            decoded_preds = processor.batch_decode(generated_tokens, skip_special_tokens=True)
            decoded_labels = processor.batch_decode(labels, skip_special_tokens=True)

            predictions.extend(decoded_preds)
            references.extend(decoded_labels)

        # Free memory
        del generated_tokens, batch
        gc.collect()

    # Compute WER as a percentage
    wer = 100 * metric.compute(predictions=predictions, references=references)
    return wer


In [206]:
# commonvoice_dataset is the dataset prepared earlier, with "train" and "test" splits
train_wer = compute_wer(commonvoice_dataset["train"])
test_wer = compute_wer(commonvoice_dataset["test"])

print(f"Train WER: {train_wer:.2f}%")
print(f"Test WER: {test_wer:.2f}%")


Evaluating WER...: 100%|██████████| 946/946 [2:27:43<00:00,  9.37s/it]
Evaluating WER...: 100%|██████████| 418/418 [1:05:44<00:00,  9.44s/it]


Train WER: 103.68%
Test WER: 104.47%
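
WER is the number of word-level substitutions, deletions, and insertions divided by the number of words in the reference, so it is not capped at 100%: a prediction that contains many extra words can accumulate more errors than the reference has words. The sketch below is a toy single-pair implementation for illustration, not the code behind `evaluate.load("wer")`; the function name `word_error_rate` and the example strings are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)

print(word_error_rate("a b c", "a b c"))   # 0.0
print(word_error_rate("a b", "x y z w"))   # 2.0
```

In the second call the reference has two words but the prediction needs two substitutions and two insertions, giving a rate of 2.0 (200%).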


The iteration counts in the progress bars are numbers of mini-batches, not samples. For a dataset with 7,568 samples and batch_size=8:

7568/8 = 946 batches

Each iteration of the evaluation loop processes one of these 946 mini-batches.
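
The batch count also gives a rough runtime estimate when combined with the per-batch time from the progress bar. This is a small sketch under the assumption that every batch takes about the same time; the helper name `eval_estimate` is hypothetical.

```python
import math

def eval_estimate(num_samples, batch_size, seconds_per_batch):
    """Return the number of batches and an h:mm:ss estimate of total time."""
    batches = math.ceil(num_samples / batch_size)
    total = round(batches * seconds_per_batch)
    hours, rem = divmod(total, 3600)
    minutes, seconds = divmod(rem, 60)
    return batches, f"{hours}:{minutes:02d}:{seconds:02d}"

print(eval_estimate(7568, 8, 9.37))   # (946, '2:27:44')
```

The estimate is within a second of the 2:27:43 reported for the 946-batch run, the small gap coming from the rounded 9.37 s/it figure.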