# **Voice-to-Text Extraction of Nepali Recordings with Multiple Agent Detection**


**Dataset Link**: https://www.kaggle.com/datasets/ishworsubedii/nepali-speech-to-text-dataset

- Only 10 audio files from this dataset (2079-11-21_1.wav to 2079-11-21_10.wav) are used, for a quick demonstration.
- Download the dataset and extract it into a new folder named "Dataset" in the working directory (the same directory as the notebook).

**Library Imports**

In [1]:
import pandas as pd
from pyxlsb import open_workbook
import os
from pydub import AudioSegment
import whisper
from pyannote.audio.pipelines.speaker_diarization import SpeakerDiarization
from pyannote.core import Segment
import torch
from dotenv import load_dotenv
from jiwer import wer



**Data Preparation:**

This section loads the Nepali speech transcript dataset from an Excel Binary (.xlsb) file. The code extracts audio paths and corresponding transcriptions, converting them to a pandas DataFrame and saving as CSV for easier access in subsequent steps.

In [2]:
# Read .xlsb file
xlsb_file = "Dataset/Nepali Speech To Text Dataset/transcripts/audio transcript.xlsb"  

data = []
with open_workbook(xlsb_file) as wb:
    with wb.get_sheet(1) as sheet:
        for row in sheet.rows():
            values = [item.v for item in row]
            data.append(values)

# Skip the header row
data = data[1:]

# Convert to DataFrame
df = pd.DataFrame(data, columns=["Audio Path", "Transcription"])
df.to_csv("transcripts.csv", index=False)

print(df.head())  # Verify structure


                                          Audio Path  \
0  Nepali Speech To Text Dataset\audio_chunks\207...   
1  Nepali Speech To Text Dataset\audio_chunks\207...   
2  Nepali Speech To Text Dataset\audio_chunks\207...   
3  Nepali Speech To Text Dataset\audio_chunks\207...   
4  Nepali Speech To Text Dataset\audio_chunks\207...   

                                       Transcription  
0  प्रसाद विश्वकर्मा  सम्माननीय अध्यक्षमहोदय, आजक...  
1  अभ्यास गर्ने कुराको विषयमा त्यसमा उल्लेख छ । र...  
2   खम्बा बिद्रोह भएको कुरा चाहिँ हामीलाई थाहा छ ...  
3  अन्त्य भई शान्ति प्रक्रिया सुरु भएको सोह्र वर्...  
4  म जस्तै अरुलाई सम्झन्छु र चित्त बुझाउँछु । भाव...  


**Audio Preprocessing**

The preprocessing code converts the audio files to a standardized format suitable for speech recognition models. Each WAV file is resampled to 16 kHz and downmixed to a single (mono) channel, the input format that many ASR models expect. This gives the downstream models a consistent sample rate and channel layout.

In [3]:
audio_dir = "Dataset/Nepali Speech To Text Dataset/audio_chunks"
output_dir = "processed_audio/"

os.makedirs(output_dir, exist_ok=True)

for file in os.listdir(audio_dir):
    if file.endswith(".wav"):
        audio = AudioSegment.from_wav(os.path.join(audio_dir, file))
        audio = audio.set_frame_rate(16000).set_channels(1)
        audio.export(os.path.join(output_dir, file), format="wav")

print("Audio processing complete!")



Audio processing complete!
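
To confirm that the conversion produced the intended format, the WAV headers can be inspected directly. The following is a minimal sketch using Python's standard `wave` module; the helper names `wav_format` and `is_asr_ready` are hypothetical and not part of the original notebook.

```python
import wave

def wav_format(path):
    """Return (sample_rate_hz, channel_count) read from a WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate(), wav_file.getnchannels()

def is_asr_ready(path, rate=16000, channels=1):
    """True when the file already has the target sample rate and channel count."""
    return wav_format(path) == (rate, channels)
```

Calling `is_asr_ready` on each file in `processed_audio/` would flag any file that was not converted as expected.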


**Model Selection**  (Test Pre-trained Models)

Initial Whisper Test

These code blocks run the Whisper medium, large, and small models on the same Nepali audio file. The large model was selected because it produced the most accurate Nepali transcription of the three: the medium model's output degenerates into a run of repeated Bengali-script characters, and the small model's output misspells many more words than the large model's.

In [None]:
model = whisper.load_model("medium") 
audio_path = "processed_audio/2079-11-21_1.wav"

result = model.transcribe(audio_path, language="ne")
print(result["text"])




 বেবববববববববববববববববববববববববববববববববববববববববববববববববববববববববববববববববববববববববববববববববববববববববববববববববববববববববববব�ববববববববববব focal��ববববববববববববববব


In [None]:
model = whisper.load_model("large") 
audio_path = "processed_audio/2079-11-21_1.wav"

result = model.transcribe(audio_path, language="ne")
print(result["text"])


 प्रसाद विश्वकर्मा सम्माननी अदक्षि मोदे आज को नागरिक दैनिग मा योटा समाचार चेने आया को छा नेपाली सेना रो अम्रिकी सेना को संयुक्त तालिम राज्या


In [None]:
model = whisper.load_model("small") 
audio_path = "processed_audio/2079-11-21_1.wav"

result = model.transcribe(audio_path, language="ne")
print(result["text"])


 अजा ख़ाद बिस्चवा कर माः सम्मानु नि अदक्सि मोडे अजा कु नगरिक देनिग मा योडा समचार चेने आगो जा नेपाली सेना रामरिकी सेना को संविड्टा पाली मरगरिक
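
The medium model's output above is dominated by one repeated character. A failure of this kind can be flagged automatically before a model is used for batch transcription. The sketch below is a simple heuristic, not something the notebook does: the helper names and the 0.5 threshold are assumptions chosen for illustration.

```python
from collections import Counter

def dominant_char_ratio(text):
    """Share of non-space characters taken up by the single most frequent one."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return Counter(chars).most_common(1)[0][1] / len(chars)

def looks_degenerate(text, threshold=0.5):
    """Flag text in which one character makes up at least `threshold` of the output.

    The threshold is an illustrative assumption, not a tuned value.
    """
    return dominant_char_ratio(text) >= threshold
```

Ordinary sentences spread their characters far more evenly, so a transcript flagged by `looks_degenerate` is worth inspecting by hand.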


**Batch Transcription for All Files with Whisper**

This section processes all preprocessed audio files through the Whisper large model, specifically instructing it to use Nepali language ("ne") for transcription. The results are saved to a CSV file with filenames and their corresponding transcriptions, creating a complete dataset of machine-generated transcripts.

In [4]:
# Load Whisper Large model
model = whisper.load_model("large")

audio_dir = "processed_audio/"
transcriptions = []

for file in os.listdir(audio_dir):
    if file.endswith(".wav"):
        audio_path = os.path.join(audio_dir, file)
        result = model.transcribe(audio_path, language="ne")
        transcriptions.append([file, result["text"]])

# Save to CSV
df = pd.DataFrame(transcriptions, columns=["Audio File", "Transcription"])
df.to_csv("whisper_transcriptions.csv", index=False)

print("Transcription complete! Check whisper_transcriptions.csv")




Transcription complete! Check whisper_transcriptions.csv
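
`os.listdir` returns files in arbitrary order, and a plain string sort places `2079-11-21_10.wav` before `2079-11-21_2.wav`. If the rows of the CSV should follow the numeric order of the recordings, a natural sort key can be used. This is a small sketch; `natural_key` is a hypothetical helper, not part of the notebook.

```python
import re

def natural_key(name):
    """Split a filename into text and integer parts so that 2 sorts before 10."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]

files = ["2079-11-21_10.wav", "2079-11-21_2.wav", "2079-11-21_1.wav"]
print(sorted(files, key=natural_key))
# ['2079-11-21_1.wav', '2079-11-21_2.wav', '2079-11-21_10.wav']
```

In the loop above, this would amount to iterating over `sorted(os.listdir(audio_dir), key=natural_key)`.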


**Speaker Identification (Diarization)**

The diarization code identifies speakers using the pyannote.audio library. It processes each audio file to find the distinct speakers and the time segments in which each one talks, which makes it possible to distinguish between the different agents in the customer service calls. A Hugging Face access token, read from a `.env` file, is required to download the pre-trained pipeline.

In [5]:
# Load environment variables from .env file
load_dotenv()

# Retrieve the Hugging Face authentication token from the environment variables
auth_token = os.getenv("HF_AUTH_TOKEN")

# Load Pyannote Pretrained Model with authentication
diarization_model = SpeakerDiarization.from_pretrained("pyannote/speaker-diarization", use_auth_token=auth_token)

# Directory containing the preprocessed (16 kHz, mono) audio files
audio_dir = "processed_audio/"

# Process Each Audio File
for file in os.listdir(audio_dir):
    if file.endswith(".wav"):
        audio_path = os.path.join(audio_dir, file)
        
        # Perform diarization
        diarization = diarization_model(audio_path)

        print(f" {file} Speaker Segments:")
        for turn, _, speaker in diarization.itertracks(yield_label=True):
            print(f"{speaker}: {turn.start:.2f} sec - {turn.end:.2f} sec")




Model was trained with pyannote.audio 0.0.1, yours is 3.3.2. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.6.0+cpu. Bad things might happen unless you revert torch to 1.x.




 2079-11-21_1.wav Speaker Segments:
SPEAKER_01: 0.03 sec - 0.99 sec
SPEAKER_00: 13.35 sec - 15.05 sec
SPEAKER_02: 19.02 sec - 19.54 sec
SPEAKER_01: 19.54 sec - 24.36 sec
SPEAKER_01: 26.22 sec - 30.47 sec
 2079-11-21_10.wav Speaker Segments:
SPEAKER_01: 0.03 sec - 11.54 sec
SPEAKER_01: 12.37 sec - 12.99 sec
SPEAKER_00: 13.97 sec - 26.52 sec
SPEAKER_00: 27.77 sec - 30.47 sec
 2079-11-21_2.wav Speaker Segments:
SPEAKER_00: 0.03 sec - 30.47 sec
 2079-11-21_3.wav Speaker Segments:
SPEAKER_01: 0.03 sec - 17.07 sec
SPEAKER_02: 17.07 sec - 18.02 sec
SPEAKER_02: 19.49 sec - 20.84 sec
SPEAKER_00: 26.36 sec - 30.47 sec
 2079-11-21_4.wav Speaker Segments:
SPEAKER_00: 0.03 sec - 30.09 sec
SPEAKER_01: 28.82 sec - 29.58 sec
 2079-11-21_5.wav Speaker Segments:
SPEAKER_01: 0.03 sec - 3.02 sec
SPEAKER_01: 3.83 sec - 24.92 sec
SPEAKER_00: 3.96 sec - 4.91 sec
SPEAKER_00: 25.48 sec - 27.18 sec
 2079-11-21_6.wav Speaker Segments:
SPEAKER_00: 3.63 sec - 20.40 sec
SPEAKER_00: 20.99 sec - 30.07 sec
 2079-11-21
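
The speaker turns printed above can be combined with timed transcript segments to attribute text to speakers. Whisper's result includes a `"segments"` list whose entries carry `start`, `end`, and `text`. The sketch below assigns each segment to the speaker whose turns overlap it most. It works on plain tuples so that it stays independent of both libraries. The function names and the example transcript segments are hypothetical. Only the speaker turns are taken from the output above.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(text_segments, speaker_turns):
    """Label each text segment with the speaker whose turns overlap it most.

    text_segments: list of (start, end, text)
    speaker_turns: list of (start, end, speaker)
    Segments that overlap no turn are labeled None.
    """
    labeled = []
    for seg_start, seg_end, text in text_segments:
        totals = {}
        for turn_start, turn_end, speaker in speaker_turns:
            shared = overlap(seg_start, seg_end, turn_start, turn_end)
            if shared > 0:
                totals[speaker] = totals.get(speaker, 0.0) + shared
        best = max(totals, key=totals.get) if totals else None
        labeled.append((best, text))
    return labeled

# Speaker turns copied from the 2079-11-21_10.wav output above.
turns = [(0.03, 11.54, "SPEAKER_01"), (12.37, 12.99, "SPEAKER_01"),
         (13.97, 26.52, "SPEAKER_00"), (27.77, 30.47, "SPEAKER_00")]

# Hypothetical transcript segments, for illustration only.
segments = [(0.0, 10.0, "first"), (14.0, 25.0, "second"), (13.0, 13.5, "gap")]

print(assign_speakers(segments, turns))
# [('SPEAKER_01', 'first'), ('SPEAKER_00', 'second'), (None, 'gap')]
```

Picking the speaker with the largest overlap is one simple policy. Segments that straddle a speaker change may need to be split rather than assigned to a single speaker.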

**Evaluation**

This final section evaluates the Whisper model by calculating the Word Error Rate (WER) between the ground-truth transcriptions and the model's output. The average WER is high (1.04, see the output below). A WER above 1 means that the number of word errors exceeds the number of words in the reference. Several factors may contribute:

- Small dataset size (only 10 audio files)

- Complexity of the Nepali language and its dialects

- Potential differences in transcription style between the ground truth and the model output

- Limitations of ASR models for low-resource languages such as Nepali

- Possible presence of domain-specific terminology in customer service conversations

The evaluation provides a quantitative measure of model performance while acknowledging the difficulty of the task and the limitations of the dataset.
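
WER is defined as (S + D + I) / N, where S, D, and I are the numbers of substituted, deleted, and inserted words in the best alignment and N is the number of words in the reference. The sketch below computes this with a word-level edit distance to show how the value can exceed 1. It is an illustration of the metric only and splits on whitespace. The notebook itself uses `jiwer.wer`, whose text handling may differ in detail.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1] / len(ref)

# One substitution plus two insertions against a two-word reference.
print(word_error_rate("a b", "a x y z"))  # 1.5
```

Because insertions count as errors but do not increase N, a hypothesis that is much longer than the reference can push WER above 1.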

In [6]:
# Load Ground Truth and Predictions
df = pd.read_csv("whisper_transcriptions.csv")  # Transcribed texts
ground_truth = pd.read_csv("transcripts.csv")  # Original transcriptions

# Ensure correct matching
df = df.sort_values("Audio File")
ground_truth = ground_truth.sort_values("Audio Path")

# Calculate WER for each audio
wer_scores = []
for i in range(len(df)):
    pred_text = df.iloc[i]["Transcription"]
    true_text = ground_truth.iloc[i]["Transcription"]

    wer_score = wer(true_text, pred_text)
    wer_scores.append(wer_score)

# Average WER
avg_wer = sum(wer_scores) / len(wer_scores)
print(f"Average WER: {avg_wer:.2f}")


Average WER: 1.04
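
The loop above pairs predictions and references by position after sorting each table. That pairing is only correct if both tables contain exactly the same files in the same sorted order. If `transcripts.csv` holds more rows than the 10 processed files, the rows may be misaligned, which would inflate the WER. A more defensive approach is to join the two tables on the file name. The sketch below assumes that the last component of each `Audio Path` matches the corresponding `Audio File` name, extension included. The paths are truncated in the output shown earlier, so this has not been verified. `pair_by_filename` is a hypothetical helper.

```python
import ntpath

import pandas as pd

def pair_by_filename(predictions, ground_truth):
    """Join predictions to references on the audio file name.

    predictions:  DataFrame with columns "Audio File", "Transcription"
    ground_truth: DataFrame with columns "Audio Path", "Transcription"
    Assumes the base name of "Audio Path" equals "Audio File" (unverified).
    """
    preds = predictions.rename(columns={"Transcription": "Predicted"})
    truth = ground_truth.rename(columns={"Transcription": "Reference"}).copy()
    # ntpath.basename handles the Windows-style backslashes in the dataset paths.
    truth["Audio File"] = truth["Audio Path"].map(ntpath.basename)
    return preds.merge(truth[["Audio File", "Reference"]],
                       on="Audio File", how="inner")
```

Each row of the result then holds a matched pair, so WER can be computed per row as `wer(row["Reference"], row["Predicted"])`. Comparing the length of the result with the number of predictions shows whether any file failed to match.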
