# **Project Overview**


The aim behind the project is to make and demonstrate a **speech-based command understanding system** for surgical environments.  
The proposed pipeline integrates **speech recognition (Whisper)** with **natural language understanding (BERT)**  to predict the spoken surgical commands to the following categories:*Request Instrument*, *Adjust Device*, and *Request Information*.  

At this stage, experiments are conducted on a **custom-recorded dataset of 60 audio commands**, that I recorded to simulate the tone and delivery variations more realistically.

In the next phase, the system will be further evaluated on a **larger benchmark dataset (EndoVis MICCAI)** to assess its scalability and generalization capabilities.  


# ***First step: Setup***
Let’s start by connecting our Google Drive so we can access the dataset and save our results:





In [62]:
#Connect Google Drive
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [7]:
# To check that the project folder exists:
!ls "/content/drive/MyDrive/NLP_project"

Audio


In [5]:
!ls "/content/drive/MyDrive/NLP_project/Audio"


class1-Request-Instrument  class2-Adjust-Device  class3-Request-Info


In [6]:
!ls "/content/drive/MyDrive/NLP_project/Audio/class1-Request-Instrument"

1_Bring_me_clamp_fast_quiet.m4a     1_Need_retractor_fast_quiet.m4a
1_Bring_me_clamp_loud.m4a	    1_Need_retractor_loud.m4a
1_Give_me_forceps_fast_quiet.m4a    1_Need_suture_fast_quiet.m4a
1_Give_me_forceps_loud.m4a	    1_Need_suture_loud.m4a
1_Give_me_stapler_fast_quiet.m4a    1_Pass_needle_holder_fast_quiet.m4a
1_Give_me_stapler_loud.m4a	    1_Pass_needle_holder_loud.m4a
1_Hand_me_scalpel_fast_quiet.m4a    1_Pass_suction_tube_fast_quiet.m4a
1_Hand_me_scalpel_loud.m4a	    1_Pass_suction_tube_loud.m4a
1_Hand_over_syringe_fast_quiet.m4a  1_Prepare_scissors_fast_quiet.m4a
1_Hand_over_syringe_loud.m4a	    1_Prepare_scissors_loud.m4a


In [7]:
!ls "/content/drive/MyDrive/NLP_project/Audio/class2-Adjust-Device"

2_Adjust_microscope_focus_fast_quiet.m4a
2_Adjust_microscope_focus_loud.m4a
2_Increase_camera_brightness_fast_quiet.m4a
2_Increase_camera_brightness_loud.m4a
2_Increase_suction_power_fast_quiet.m4a
2_Increase_suction_power_loud.m4a
2_Lower_lighting_fast_quiet.m4a
2_Lower_lighting_loud.m4a
2_Move_arm_left_fast_quiet.m4a
2_Move_arm_left_loud.m4a
2_Reduce_table_height_fast_quiet.m4a
2_Reduce_table_height_loud.m4a
2_Set_ventilator_standby_fast_quiet.m4a
2_Set_ventilator_standby_loud.m4a
2_Stabilize_robotic_arm_fast_quiet.m4a
2_Stabilize_robotic_arm_loud.m4a
2_Turn_off_cauterizer_fast_quiet.m4a
2_Turn_off_cauterizer_loud.m4a
2_Zoom_in_endoscope_fast_quiet.m4a
2_Zoom_in_endoscope_loud.m4a


In [8]:
!ls "/content/drive/MyDrive/NLP_project/Audio/class3-Request-Info"

3_Camera_active_fast_quiet.m4a
3_Camera_active_loud.m4a
3_Display_anesthesia_level_fast_quiet.m4a
3_Display_anesthesia_level_loud.m4a
3_Oxygen_level_stable_fast_quiet.m4a
3_Oxygen_level_stable_loud.m4a
3_Patient_pulse_rate_fast_quiet.m4a
3_Patient_pulse_rate_loud.m4a
3_Pressure_normal_fast_quiet.m4a
3_Pressure_normal_loud.m4a
3_Recording_video_feed_fast_quiet.m4a
3_Recording_video_feed_loud.m4a
3_Show_blood_pressure_fast_quiet.m4a
3_Show_blood_pressure_loud.m4a
3_Suction_working_fast_quiet.m4a
3_Suction_working_loud.m4a
3_Temperature_reading_fast_quiet.m4a
3_Temperature_reading_loud.m4a
3_What_is_heart_rate_fast_quiet.m4a
3_What_is_heart_rate_loud.m4a


In [8]:
# Install dependencies
!pip install -q openai-whisper
!sudo apt update && sudo apt install -y ffmpeg

[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/803.2 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [91m━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━[0m [32m532.5/803.2 kB[0m [31m16.6 MB/s[0m eta [36m0:00:01[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m803.2/803.2 kB[0m [31m16.4 MB/s[0m eta [36m0:00:00[0m
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
  Building wheel for openai-whisper (pyproject.toml) ... [?25l[?25hdone
Get:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,632 B]
Get:2 https://cli.github.com/packages stable InRelease [3,917 B]
Hit:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:4 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:5 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Get

# ***Second step: Speech-to-Text Transcription (Whisper)***

let’s now load the **Whisper model** to automatically transcribe our recorded audio commands.  
Each audio file will be converted into text, and the transcriptions will later be used to train the BERT classifier.

In [9]:
# Prepare directories and CSV paths for Whisper transcription
import os
import csv
import whisper

# Defining base directory (where audio recordings are stored)
base_dir = "/content/drive/MyDrive/NLP_project/Audio"

# Define where to save the output transcriptions
output_csv = "/content/drive/MyDrive/NLP_project/audio_transcriptions.csv"


In [10]:
# Now let’s run the Whisper Model to automatically transcribe all the audio recordings
#This will create the dataset that contain each filename and its transcription and the corresponding label to be trained upon.

import whisper, os

model = whisper.load_model("small")

rows = [("filename", "transcription", "label")]

for folder in os.listdir(base_dir):
    folder_path = os.path.join(base_dir, folder)
    if not os.path.isdir(folder_path):
        continue

    label = folder.split("-")[-1].strip().replace(" ", "_").lower()

    for file in os.listdir(folder_path):
        if not file.lower().endswith((".wav", ".m4a")):
            continue
        audio_path = os.path.join(folder_path, file)
        result = model.transcribe(audio_path)
        text = result["text"].strip()
        rows.append((file, text, label))


100%|███████████████████████████████████████| 461M/461M [00:05<00:00, 96.5MiB/s]


In [11]:
#downloading the csv file so we can use it later for text classification
import csv

output_csv = "/content/drive/MyDrive/NLP_project/audio_transcriptions.csv"

with open(output_csv, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerows(rows)

print("CSV saved successfully at:", output_csv)


CSV saved successfully at: /content/drive/MyDrive/NLP_project/audio_transcriptions.csv


In [86]:
!pip install jiwer


Collecting jiwer
  Downloading jiwer-4.0.0-py3-none-any.whl.metadata (3.3 kB)
Collecting rapidfuzz>=3.9.7 (from jiwer)
  Downloading rapidfuzz-3.14.1-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (12 kB)
Downloading jiwer-4.0.0-py3-none-any.whl (23 kB)
Downloading rapidfuzz-3.14.1-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (3.2 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.2/3.2 MB[0m [31m46.5 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: rapidfuzz, jiwer
Successfully installed jiwer-4.0.0 rapidfuzz-3.14.1


In [118]:
# Now, I will Evaluate Whisper Transcription Accuracy on these recodrings, How much did Whisper model really transcript the Audio to text correctly?
# Here, I will compare Whisper's generated transcriptions with the correct (expected) commands.
# This evaluation helps measure how accurately Whisper recognized the recorded speech.

import pandas as pd
from jiwer import wer, cer, process_words

# Load the Whisper transcriptions
df = pd.read_csv("/content/drive/MyDrive/NLP_project/audio_transcriptions.csv")
df.columns = ["filename", "text", "label"]


ground_truth = {
    # Class 1: Request Instrument
    "1_Bring_me_clamp_loud.m4a": "bring me clump.",
    "1_Give_me_forceps_loud.m4a": "give me forceps.",
    "1_Give_me_stapler_loud.m4a": "give me stapler.",
    "1_Hand_me_scalpel_loud.m4a": "hand me scalpel.",
    "1_Hand_over_syringe_loud.m4a": "hand over syringe.",
    "1_Need_retractor_loud.m4a": "need retractor.",
    "1_Need_suture_loud.m4a": "need suture.",
    "1_Pass_needle_holder_loud.m4a": "pass needle holder.",
    "1_Pass_suction_tube_loud.m4a": "pass suction tube.",
    "1_Prepare_scissors_loud.m4a": "prepare scissors.",

    # Class 2: Adjust Device
    "2_Adjust_microscope_focus_loud.m4a": "adjust microscope focus.",
    "2_Increase_camera_brightness_loud.m4a": "increase camera brightness",
    "2_Increase_suction_power_loud.m4a": "increase suction power",
    "2_Lower_lighting_loud.m4a": "lower lighting",
    "2_Move_arm_left_loud.m4a": "move arm left.",
    "2_Reduce_table_height_loud.m4a": "Reduce table height.",
    "2_Set_ventilator_standby_loud.m4a": "Set ventilator standby.",
    "2_Stabilize_robotic_arm_loud.m4a": "stabilize robotic arm",
    "2_Turn_off_cauterizer_loud.m4a": "turn off cauterizer.",
    "2_Zoom_in_endoscope_loud.m4a": "zoom in endoscope",

    # Class 3: Request Information
    "3_Camera_active_loud.m4a": "Camera active",
    "3_Display_anesthesia_level_loud.m4a": "display anesthesia level.",
    "3_Oxygen_level_stable_loud.m4a": "oxygen level stable.",
    "3_Patient_pulse_rate_loud.m4a": "patient pulse rate.",
    "3_Pressure_normal_loud.m4a": "pressure normal.",
    "3_Recording_video_feed_loud.m4a": "recording video feed.",
    "3_Show_blood_pressure_loud.m4a": "show blood pressure",
    "3_Suction_working_loud.m4a": "suction working",
    "3_Temperature_reading_loud.m4a": "temperature reading.",
    "3_What_is_heart_rate_loud.m4a": "what is heart rate?",
}

# Create a list to store comparison results
results = []
for idx, row in df.iterrows():
    fname = row["filename"]
    pred_text = str(row["text"]).lower().strip()
    true_text = ground_truth.get(fname, None)
    if true_text:
        # Calculate Word Error Rate for this file
        error = wer(true_text, pred_text)
        correct = (error == 0)
        results.append((fname, true_text, pred_text, error, correct))

# Convert to DataFrame
eval_df = pd.DataFrame(results, columns=["filename","ground_truth","whisper_output","WER","Exact_match"])

# Summary statistics
avg_wer = eval_df["WER"].mean()
exact_acc = eval_df["Exact_match"].mean()

print(f"Whisper Evaluation Results on Noraml(loud) Recordings")
print(f"Average Word Error Rate (WER): {avg_wer:.2f}")
print(f"Exact Sentence Accuracy: {exact_acc*100:.1f}%")

# Display a few examples
eval_df.head(30)


Whisper Evaluation Results on Noraml(loud) Recordings
Average Word Error Rate (WER): 0.28
Exact Sentence Accuracy: 63.3%


Unnamed: 0,filename,ground_truth,whisper_output,WER,Exact_match
0,1_Bring_me_clamp_loud.m4a,bring me clump.,bring me clump.,0.0,True
1,1_Give_me_forceps_loud.m4a,give me forceps.,give me four subs.,0.666667,False
2,1_Give_me_stapler_loud.m4a,give me stapler.,give me stapler.,0.0,True
3,1_Hand_me_scalpel_loud.m4a,hand me scalpel.,hand miscalculable,0.666667,False
4,1_Hand_over_syringe_loud.m4a,hand over syringe.,hand over syringe.,0.0,True
5,1_Need_retractor_loud.m4a,need retractor.,need retractor.,0.0,True
6,1_Need_suture_loud.m4a,need suture.,need sucher,0.5,False
7,1_Pass_needle_holder_loud.m4a,pass needle holder.,pass needle holder.,0.0,True
8,1_Pass_suction_tube_loud.m4a,pass suction tube.,pass section tube.,0.333333,False
9,1_Prepare_scissors_loud.m4a,prepare scissors.,prepare scissors.,0.0,True


In [119]:
# Now, I will Evaluate Whisper Transcription Accuracy on these recodrings, How much did Whisper model really transcript the Audio to text correctly?
# Here, I will compare Whisper's generated transcriptions with the correct (expected) commands.
# This evaluation helps measure how accurately Whisper recognized the recorded speech.

import pandas as pd
from jiwer import wer, cer, process_words

# Load the Whisper transcriptions
df = pd.read_csv("/content/drive/MyDrive/NLP_project/audio_transcriptions.csv")
df.columns = ["filename", "text", "label"]


ground_truth = {
    # Class 1: Request Instrument
    "1_Bring_me_clamp_fast_quiet.m4a": "bring me clump.",
    "1_Give_me_forceps_fast_quiet.m4a": "give me forceps.",
    "1_Give_me_stapler_fast_quiet.m4a": "give me stapler.",
    "1_Hand_me_scalpel_fast_quiet.m4a": "hand me scalpel.",
    "1_Hand_over_syringe_fast_quiet.m4a": "hand over syringe!",
    "1_Need_retractor_fast_quiet.m4a": "need retractor.",
    "1_Need_suture_fast_quiet.m4a": "need suture.",
    "1_Pass_needle_holder_fast_quiet.m4a": "pass needle holder.",
    "1_Pass_suction_tube_fast_quiet.m4a": "pass suction tube.",
    "1_Prepare_scissors_fast_quiet.m4a": "prepare scissors.",

    # Class 2: Adjust Device
    "2_Adjust_microscope_focus_fast_quiet.m4a": "adjust microscope focus.",
    "2_Increase_camera_brightness_fast_quiet.m4a": "increase camera brightness",
    "2_Increase_suction_power_fast_quiet.m4a": "increase suction power",
    "2_Lower_lighting_fast_quiet.m4a": "lower lighting",
    "2_Move_arm_left_fast_quiet.m4a": "move arm left.",
    "2_Reduce_table_height_fast_quiet.m4a": "Reduce table height.",
    "2_Set_ventilator_standby_fast_quiet.m4a": "Set ventilator standby.",
    "2_Stabilize_robotic_fast_quiet.m4a": "stabilize robotic arm",
    "2_Turn_off_cauterizer_fast_quiet.m4a": "turn off cauterizer.",
    "2_Zoom_in_endoscope_fast_quiet.m4a": "zoom in endoscope",

    # Class 3: Request Information
    "3_Camera_active_fast_quiet.m4a": "Camera active",
    "3_Display_anesthesia_level_fast_quiet.m4a": "display anesthesia level.",
    "3_Oxygen_level_stable_fast_quiet.m4a": "oxygen level stable.",
    "3_Patient_pulse_rate_fast_quiet.m4a": "patient pulse rate.",
    "3_Pressure_normal_fast_quiet.m4a": "pressure normal.",
    "3_Recording_video_feed_fast_quiet.m4a": "recording video feed.",
    "3_Show_blood_pressure_fast_quiet.m4a": "show blood pressure",
    "3_Suction_working_fast_quiet.m4a": "suction working",
    "3_Temperature_reading_fast_quiet.m4a": "temperature reading.",
    "3_What_is_heart_rate_fast_quiet.m4a": "what is heart rate?",
}

# Create a list to store comparison results
results = []
for idx, row in df.iterrows():
    fname = row["filename"]
    pred_text = str(row["text"]).lower().strip()
    true_text = ground_truth.get(fname, None)
    if true_text:
        # Calculate Word Error Rate for this file
        error = wer(true_text, pred_text)
        correct = (error == 0)
        results.append((fname, true_text, pred_text, error, correct))

# Convert to DataFrame
eval_df = pd.DataFrame(results, columns=["filename","ground_truth","whisper_output","WER","Exact_match"])

# Summary statistics
avg_wer = eval_df["WER"].mean()
exact_acc = eval_df["Exact_match"].mean()

print(f"Whisper Evaluation Results on fast and queit Recordings")
print(f"Average Word Error Rate (WER): {avg_wer:.2f}")
print(f"Exact Sentence Accuracy: {exact_acc*100:.1f}%")

# Display a few examples
eval_df.head(30)


Whisper Evaluation Results on fast and queit Recordings
Average Word Error Rate (WER): 1.02
Exact Sentence Accuracy: 6.9%


Unnamed: 0,filename,ground_truth,whisper_output,WER,Exact_match
0,1_Bring_me_clamp_fast_quiet.m4a,bring me clump.,pek miklem!,1.0,False
1,1_Give_me_forceps_fast_quiet.m4a,give me forceps.,give me four steps.,0.666667,False
2,1_Give_me_stapler_fast_quiet.m4a,give me stapler.,give me a stab below.,1.0,False
3,1_Hand_me_scalpel_fast_quiet.m4a,hand me scalpel.,helmys kalpal,1.0,False
4,1_Hand_over_syringe_fast_quiet.m4a,hand over syringe!,hand over syringe!,0.0,True
5,1_Need_retractor_fast_quiet.m4a,need retractor.,needed vector.,1.0,False
6,1_Need_suture_fast_quiet.m4a,need suture.,ни цучр!,1.0,False
7,1_Pass_needle_holder_fast_quiet.m4a,pass needle holder.,pass needed holder.,0.333333,False
8,1_Pass_suction_tube_fast_quiet.m4a,pass suction tube.,"last section, cube.",1.0,False
9,1_Prepare_scissors_fast_quiet.m4a,prepare scissors.,journalism,1.0,False


# ***Third step: Text Classification with BERT***
Now, let’s prepare for training our **BERT model**
so let's start by installing the required NLP libraries, which are:

1.   **Transformers** → for BERT and tokenization  
2.   **Datasets** → for data handling
3. **Scikit-learn** → for encoding labels and evaluation


In [12]:
!pip install -q transformers datasets scikit-learn


In [13]:
#Then let me load the transcribed data set that I created using the Whisper model.
#In this spreadsheet I can see the file name for audio, the transcription, and the label for each command.
# to ensure that evrything is accurate, I will review the first few rows


import pandas as pd

df = pd.read_csv("/content/drive/MyDrive/NLP_project/audio_transcriptions.csv")
df.columns = ["filename", "text", "label"]
df.head()


Unnamed: 0,filename,text,label
0,1_Bring_me_clamp_loud.m4a,Bring me clump.,instrument
1,1_Give_me_forceps_loud.m4a,Give me four subs.,instrument
2,1_Give_me_stapler_loud.m4a,Give me stapler.,instrument
3,1_Hand_me_scalpel_loud.m4a,Hand miscalculable,instrument
4,1_Hand_over_syringe_loud.m4a,Hand over syringe.,instrument


In [14]:
#Now: Encoding the Labels
# I need to convert the text-based labels into numeric, before training BERT.
#This would help the model to process the target categories appropriately.
#I will be using **LabelEncoder** from scikit-learn to convert the labels to numerical IDs.

from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
df["label_id"] = encoder.fit_transform(df["label"])
print("Labels:", encoder.classes_)


Labels: ['device' 'info' 'instrument']


# ***Fourth step: Model Training and Evaluation***

Here, I will training the **BERT classifier** on all 60 transcribed commands —
Each command (both the "loud" and the "fast-quite" versions) will be considered as an independent example.
I will use **5-Fold Cross-Validation** to ensure how the model generalizes.


In [15]:
# I will calculate and display the average accuracy, precision, recall, and F1-score
#across all 5 folds to summarize the model’s performance using:

from sklearn.model_selection import KFold
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
import torch, numpy as np, pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Load full dataset of 60 recordings
df = pd.read_csv("/content/drive/MyDrive/NLP_project/audio_transcriptions.csv",
                 header=0, names=["filename","text","label"])

encoder = LabelEncoder()
df["label_id"] = encoder.fit_transform(df["label"])

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
kf = KFold(n_splits=5, shuffle=True, random_state=42)

def to_ds(texts, labels):
    enc = tokenizer(texts, truncation=True, padding=True, max_length=64)
    class DS(torch.utils.data.Dataset):
        def __init__(self, enc, labels): self.enc, self.labels = enc, labels
        def __getitem__(self, i):
            item = {k: torch.tensor(v[i]) for k,v in self.enc.items()}
            item["labels"] = torch.tensor(self.labels[i]); return item
        def __len__(self): return len(self.labels)
    return DS(enc, labels)

fold_metrics = []

for fold, (tr_idx, te_idx) in enumerate(kf.split(df), 1):
    print(f"\n🔹 Fold {fold}")
    tr_texts = df.iloc[tr_idx]["text"].tolist()
    te_texts  = df.iloc[te_idx]["text"].tolist()
    tr_labels = df.iloc[tr_idx]["label_id"].tolist()
    te_labels  = df.iloc[te_idx]["label_id"].tolist()

    train_ds = to_ds(tr_texts, tr_labels)
    test_ds  = to_ds(te_texts, te_labels)

    model = BertForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=len(encoder.classes_)
    )

    args = TrainingArguments(
        output_dir=f"/content/drive/MyDrive/NLP_project/fold_{fold}",
        num_train_epochs=3,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        learning_rate=2e-5,
        report_to="none"
    )

    trainer = Trainer(model=model, args=args, train_dataset=train_ds)
    trainer.train()

    preds = trainer.predict(test_ds)
    y_pred = np.argmax(preds.predictions, axis=-1)
    acc = accuracy_score(te_labels, y_pred)
    pr, rc, f1, _ = precision_recall_fscore_support(te_labels, y_pred,
                                                    average="macro", zero_division=0)
    print({"accuracy": acc, "precision": pr, "recall": rc, "f1": f1})
    fold_metrics.append((acc, pr, rc, f1))


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]


🔹 Fold 1


model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss


{'accuracy': 0.4166666666666667, 'precision': 0.25, 'recall': 0.5, 'f1': 0.3333333333333333}

🔹 Fold 2


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss


{'accuracy': 0.3333333333333333, 'precision': 0.14814814814814814, 'recall': 0.2222222222222222, 'f1': 0.17777777777777778}

🔹 Fold 3


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss


{'accuracy': 0.4166666666666667, 'precision': 0.3, 'recall': 0.5, 'f1': 0.35714285714285715}

🔹 Fold 4


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss


{'accuracy': 0.16666666666666666, 'precision': 0.05555555555555555, 'recall': 0.3333333333333333, 'f1': 0.09523809523809523}

🔹 Fold 5


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss


{'accuracy': 0.25, 'precision': 0.1619047619047619, 'recall': 0.25, 'f1': 0.19528619528619529}


In [16]:
# for overall accuracy:
import numpy as np

acc_avg = np.mean([m[0] for m in fold_metrics])
pr_avg  = np.mean([m[1] for m in fold_metrics])
rc_avg  = np.mean([m[2] for m in fold_metrics])
f1_avg  = np.mean([m[3] for m in fold_metrics])

print("5-Fold Average Metrics:")
print(f"Accuracy:  {acc_avg:.3f}")
print(f"Precision: {pr_avg:.3f}")
print(f"Recall:    {rc_avg:.3f}")
print(f"F1 Score:  {f1_avg:.3f}")


5-Fold Average Metrics:
Accuracy:  0.317
Precision: 0.183
Recall:    0.361
F1 Score:  0.232


# ***Fifth step: Audio Augmentation using Audacity***
Since I now know what my baseline performance is, I can do **audio data augmentation** to understand how much the Whisper model is resistant to variations acoustic

I will will modify the original records (i.e., by changing speed and pitch and adding faint background noise) using **Audacity**.
These improved recordings also replicate the following genuine conditions: quick speech, different voices, and background noise from the OR.


Now that I have my baseline performance, I can perform **audio data augmentation** to evaluate how robust the Whisper model is to acoustic variations   
Using **Audacity**, I will modify the original recordings slightly (e.g., by changing speed, pitch, and adding mild background noise).  
These augmented recordings simulate realistic conditions such as fast speech, different voices, and ambient operating room noise.


In [63]:
#To check that the augmented recoding with noise exists
!ls "/content/drive/MyDrive/NLP_project/Augmented-Audio-Noise"

class1-Request-Instrument-noise  class3-Request-Info-noise
class2-Adjust-Device-noise


In [64]:
# Here I did the noise on only the loud original recorded for each class, I added this noise using Audacity
!ls "/content/drive/MyDrive/NLP_project/Augmented-Audio-Noise/class1-Request-Instrument-noise"

1_Bring_me_clamp_loud_noise.wav     1_Need_retractor_loud_noise.wav
1_Give_me_forceps_loud_noise.wav    1_Need_suture_loud_noise.wav
1_Give_me_stapler_loud_noise.wav    1_Pass_needle_holder_loud_noise.wav
1_Hand_me_scalpel_loud_noise.wav    1_Pass_suction_tube_loud_noise.wav
1_Hand_over_syringe_loud_noise.wav  1_Prepare_scissors_loud_noise.wav


In [65]:
!ls "/content/drive/MyDrive/NLP_project/Augmented-Audio-Noise/class2-Adjust-Device-noise"

2_Adjust_microscope_focus_loud_noise.wav
2_Increase_camera_brightness_loud_noise.wav
2_Increase_suction_power_loud_noise.wav
2_Lower_lighting_loud_noise.wav
2_Move_arm_left_loud_noise.wav
2_Reduce_table_height_loud_noise.wav
2_Set_ventilator_standby_loud_noise.wav
2_Stabilize_robotic_arm_loud_noise.wav
2_Turn_off_cauterizer_loud_noise.wav
2_Zoom_in_endoscope_loud_noise.wav


In [66]:
!ls "/content/drive/MyDrive/NLP_project/Augmented-Audio-Noise/class3-Request-Info-noise"

3_Camera_active_loud_noise.wav
3_Display_anesthesia_level_loud_noise.wav
3_Oxygen_level_stable_loud_noise.wav
3_Patient_pulse_rate_loud_noise.wav
3_Pressure_normal_loud_noise.wav
3_Recording_video_feed_loud_noise.wav
3_Show_blood_pressure_loud_noise.wav
3_Suction_working_loud_noise.wav
3_Temperature_reading_loud_noise.wav
3_What_is_heart_rate_loud_noise.wav


In [67]:
# Prepare directories and CSV paths for Whisper transcription

import os
import csv
import whisper

# Define base directory (your augmented-noise dataset)
base_dir = "/content/drive/MyDrive/NLP_project/Augmented-Audio-Noise"

# Define where to save the output CSV file
output_csv = "/content/drive/MyDrive/NLP_project/audio_transcriptions_noise.csv"



In [68]:
# Now let’s run the Whisper Model to automatically transcribe all the Augmented(noise) recordings
# This will create the dataset that contain each filename and its transcription and the corresponding label to be trained upon.


rows = [("filename", "transcription", "label")]

for folder in os.listdir(base_dir):
    folder_path = os.path.join(base_dir, folder)
    if not os.path.isdir(folder_path):
        continue

    label = folder.split("-")[-1].strip().replace(" ", "_").lower()

    for file in os.listdir(folder_path):
        if not file.lower().endswith((".wav", ".m4a")):
            continue
        audio_path = os.path.join(folder_path, file)
        result = model.transcribe(audio_path)
        text = result["text"].strip()
        rows.append((file, text, label))


In [69]:
#downloading the csv file so we can use it later for text classification
import csv

output_csv = "/content/drive/MyDrive/NLP_project/audio_transcriptions_noise.csv"

with open(output_csv, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerows(rows)

print("CSV saved successfully at:", output_csv)


CSV saved successfully at: /content/drive/MyDrive/NLP_project/audio_transcriptions_noise.csv


In [78]:
#Then let me load the transcribed data set that I created using the Whisper model. here I did it for the second csv file(with noise)
#In this spreadsheet I can see the file name for audio, the transcription, and the label for each command.
# to ensure that evrything is accurate, I will review the first few rows


import pandas as pd

df = pd.read_csv("/content/drive/MyDrive/NLP_project/audio_transcriptions_noise.csv")
df.columns = ["filename", "text", "label"]
df.head()


Unnamed: 0,filename,text,label
0,2_Adjust_microscope_focus_loud_noise.wav,Adjust Microscope Focus.,noise
1,2_Stabilize_robotic_arm_loud_noise.wav,تبلايز ربوتك أرم,noise
2,2_Zoom_in_endoscope_loud_noise.wav,Zoom in in the scope.,noise
3,2_Increase_camera_brightness_loud_noise.wav,increase camera brightness,noise
4,2_Lower_lighting_loud_noise.wav,lower lighting,noise


In [116]:
# Now, I will Evaluate Whisper Transcription Accuracy on these augmented recodrings with noise,
# How much did Whisper model really transcript the Audio to text correctly?
# Here, I will compare Whisper's generated transcriptions with the correct (expected) commands.
# This evaluation helps measure how accurately Whisper recognized the recorded speech.

import pandas as pd
from jiwer import wer, cer, process_words

# Load the Whisper transcriptions
df = pd.read_csv("/content/drive/MyDrive/NLP_project/audio_transcriptions_noise.csv")
df.columns = ["filename", "text", "label"]


ground_truth = {
    # Class 1: Request Instrument
    "1_Bring_me_clamp_loud_noise.wav": "bring me clump.",
    "1_Give_me_forceps_loud_noise.wav": "give me forceps.",
    "1_Give_me_stapler_loud_noise.wav": "give me stapler.",
    "1_Hand_me_scalpel_loud_noise.wav": "hand me scalpel.",
    "1_Hand_over_syringe_loud_noise.wav": "hand over syringe.",
    "1_Need_retractor_loud_noise.wav": "need retractor",
    "1_Need_suture_loud_noise.wav": "need suture.",
    "1_Pass_needle_holder_loud_noise.wav": "pass needle holder.",
    "1_Pass_suction_tube_loud_noise.wav": "pass suction tube.",
    "1_Prepare_scissors_loud_noise.wav": "prepare scissors.",

    # Class 2: Adjust Device
    "2_Adjust_microscope_focus_loud_noise.wav": "adjust microscope focus.",
    "2_Increase_camera_brightness_loud_noise.wav": "increase camera brightness",
    "2_Increase_suction_power_loud_noise.wav": "increase suction power",
    "2_Lower_lighting_loud_noise.wav": "lower lighting",
    "2_Move_arm_left_loud_noise.wav": "move arm left.",
    "2_Reduce_table_height_loud_noise.wav": "Reduce table height.",
    "2_Set_ventilator_standby_loud_noise.wav": "Set ventilator standby.",
    "2_Stabilize_robotic_arm_loud_noise.wav": "stabilize robotic arm",
    "2_Turn_off_cauterizer_loud_noise.wav": "turn off cauterizer.",
    "2_Zoom_in_endoscope_loud_noise.wav": "zoom in endoscope",

    # Class 3: Request Information
    "3_Camera_active_loud.wav": "Camera active",
    "3_Display_anesthesia_level_loud_noise.wav": "display anesthesia level.",
    "3_Oxygen_level_stable_loud_noise.wav": "oxygen level stable",
    "3_Patient_pulse_rate_loud_noise.wav": "patient pulse rate.",
    "3_Pressure_normal_loud_noise.wav": "pressure normal.",
    "3_Recording_video_feed_loud_noise.wav": "recording video feed",
    "3_Show_blood_pressure_loud_noise.wav": "show blood pressure",
    "3_Suction_working_loud_noise.wav": "suction working",
    "3_Temperature_reading_loud_noise.wav": "temperature reading.",
    "3_What_is_heart_rate_loud_noise.wav": "what is heart rate?",
}

# Create a list to store comparison results
results = []
for idx, row in df.iterrows():
    fname = row["filename"]
    pred_text = str(row["text"]).lower().strip()
    true_text = ground_truth.get(fname, None)
    if true_text:
        # Calculate Word Error Rate for this file
        error = wer(true_text, pred_text)
        correct = (error == 0)
        results.append((fname, true_text, pred_text, error, correct))

# Convert to DataFrame
eval_df = pd.DataFrame(results, columns=["filename","ground_truth","whisper_output","WER","Exact_match"])

# Summary statistics
avg_wer = eval_df["WER"].mean()
exact_acc = eval_df["Exact_match"].mean()

print(f"Whisper Evaluation Results on augemented Recordings with voice")
print(f"Average Word Error Rate (WER): {avg_wer:.2f}")
print(f"Exact Sentence Accuracy: {exact_acc*100:.1f}%")

# Display a few examples
eval_df.head(30)


Whisper Evaluation Results on augemented Recordings with voice
Average Word Error Rate (WER): 0.39
Exact Sentence Accuracy: 48.3%


Unnamed: 0,filename,ground_truth,whisper_output,WER,Exact_match
0,2_Adjust_microscope_focus_loud_noise.wav,adjust microscope focus.,adjust microscope focus.,0.0,True
1,2_Stabilize_robotic_arm_loud_noise.wav,stabilize robotic arm,تبلايز ربوتك أرم,1.0,False
2,2_Zoom_in_endoscope_loud_noise.wav,zoom in endoscope,zoom in in the scope.,1.0,False
3,2_Increase_camera_brightness_loud_noise.wav,increase camera brightness,increase camera brightness,0.0,True
4,2_Lower_lighting_loud_noise.wav,lower lighting,lower lighting,0.0,True
5,2_Move_arm_left_loud_noise.wav,move arm left.,move arm left.,0.0,True
6,2_Increase_suction_power_loud_noise.wav,increase suction power,increase suction power,0.0,True
7,2_Turn_off_cauterizer_loud_noise.wav,turn off cauterizer.,10 of catariser,1.0,False
8,2_Reduce_table_height_loud_noise.wav,Reduce table height.,ردوز تيبل حيط,1.0,False
9,2_Set_ventilator_standby_loud_noise.wav,Set ventilator standby.,"sit, ventilator, standby.",0.666667,False


In [70]:
#To check that the augmented recoding with Reverb&Bass&Treble exists
!ls "/content/drive/MyDrive/NLP_project/Augmented-Audio-Reverb&Bass&Treble"

'class1-Request-Instrument-Reverb&Bass&Treble'
'class2-Adjust-Device-Reverb&Bass&Treble'
'class3-Request-Info-reverb&Bass&Treble'


In [71]:
# Here I did the noise on only the loud original recorded for each class, I added this effect(Reverb&Bass&Treble) using Audacity
!ls "/content/drive/MyDrive/NLP_project/Augmented-Audio-Reverb&Bass&Treble/class1-Request-Instrument-Reverb&Bass&Treble"

1_Bring_me_clamp_loud_reverb_bass_treble.wav
1_Give_me_forceps_loud_reverb_bass_treble.wav
1_Give_me_stapler_loud_reverb_bass_treble.wav
1_Hand_me_scalpel_loud_reverb_bass_treble.wav
1_Hand_over_syringe_loud_reverb_bass_treble.wav
1_Need_retractor_loud_reverb_bass_treble.wav
1_Need_suture_loud_reverb_bass_treble.wav
1_Pass_needle_holder_loud_reverb_bass_treble.wav
1_Pass_suction_tube_loud_reverb_bass_treble.wav
1_Prepare_scissors_loud_reverb_bass_treble.wav


In [72]:
!ls "/content/drive/MyDrive/NLP_project/Augmented-Audio-Reverb&Bass&Treble/class2-Adjust-Device-Reverb&Bass&Treble"

2_Adjust_microscope_focus_loud_reverb_bass_treble.wav
2_Increase_camera_brightness_loud_reverb_bass_treble.wav
2_Increase_suction_power_loud_reverb_bass_treble.wav
2_Lower_lighting_loud_reverb_bass_treble.wav
2_Move_arm_left_loud_reverb_bass_treble.wav
2_Reduce_table_height_loud_reverb_bass_treble.wav
2_Set_ventilator_standby_loud_reverb_bass_treble.wav
2_Stabilize_robotic_arm_loud_reverb_bass_treble.wav
2_Turn_off_cauterizer_loud_reverb_bass_treble.wav
2_Zoom_in_endoscope_loud_reverb_bass_treble.wav


In [73]:
!ls "/content/drive/MyDrive/NLP_project/Augmented-Audio-Reverb&Bass&Treble/class3-Request-Info-reverb&Bass&Treble"

3_Camera_active_loud_reverb_bass_treble.wav
3_Display_anesthesia_level_loud_reverb_bass_treble.wav
3_Oxygen_level_stable_loud_reverb_bass_treble.wav
3_Patient_pulse_rate_loud_reverb_bass_treble.wav
3_Pressure_normal_loud_reverb_bass_treble.wav
3_Recording_video_feed_loud_reverb_bass_treble.wav
3_Show_blood_pressure_loud_reverb_bass_treble.wav
3_Suction_working_loud_reverb_bass_treble.wav
3_Temperature_reading_loud_reverb_bass_treble.wav.wav
3_What_is_heart_rate_loud_reverb_bass_treble.wav


In [74]:
# Prepare directories and CSV paths for Whisper transcription

import os
import csv
import whisper

# Define base directory (your augmented-noise dataset)
base_dir = "/content/drive/MyDrive/NLP_project/Augmented-Audio-Reverb&Bass&Treble"

# Define where to save the output CSV file
output_csv = "/content/drive/MyDrive/NLP_project/audio_transcriptions_Reverb&Bass&Treble.csv"

In [76]:
# Now let’s run the Whisper Model to automatically transcribe all the Augmented(Reverb&Bass&Treble) recordings
# This will create the dataset that contain each filename and its transcription and the corresponding label to be trained upon.

rows = [("filename", "transcription", "label")]

for folder in os.listdir(base_dir):
    folder_path = os.path.join(base_dir, folder)
    if not os.path.isdir(folder_path):
        continue

    label = folder.split("-")[-1].strip().replace(" ", "_").lower()

    for file in os.listdir(folder_path):
        if not file.lower().endswith((".wav", ".m4a")):
            continue
        audio_path = os.path.join(folder_path, file)
        result = model.transcribe(audio_path)
        text = result["text"].strip()
        rows.append((file, text, label))


In [77]:
#downloading the csv file so we can use it later for text classification
import csv

output_csv = "/content/drive/MyDrive/NLP_project/audio_transcriptions_Reverb&Bass&Treble.csv"

with open(output_csv, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerows(rows)

print("CSV saved successfully at:", output_csv)

CSV saved successfully at: /content/drive/MyDrive/NLP_project/audio_transcriptions_Reverb&Bass&Treble.csv


In [79]:
#Then let me load the transcribed data set that I created using the Whisper model. here I did it for the second csv file(with Reverb&Bass&Treble effects)
#In this spreadsheet I can see the file name for audio, the transcription, and the label for each command.
# to ensure that evrything is accurate, I will review the first few rows


import pandas as pd

df = pd.read_csv("/content/drive/MyDrive/NLP_project/audio_transcriptions_Reverb&Bass&Treble.csv")
df.columns = ["filename", "text", "label"]
df.head()

Unnamed: 0,filename,text,label
0,2_Set_ventilator_standby_loud_reverb_bass_treb...,SIP Ventilated Stand By,reverb&bass&treble
1,2_Turn_off_cauterizer_loud_reverb_bass_treble.wav,10. Off Cut Riser,reverb&bass&treble
2,2_Stabilize_robotic_arm_loud_reverb_bass_trebl...,סטבילי סלבות עקעה,reverb&bass&treble
3,2_Reduce_table_height_loud_reverb_bass_treble.wav,Reduce state behind.,reverb&bass&treble
4,2_Increase_camera_brightness_loud_reverb_bass_...,Increase camera brightness.,reverb&bass&treble


In [117]:
# Now, I will Evaluate Whisper Transcription Accuracy on these augmented recodrings &reverb&bass&treble effects,
#How much did Whisper model really transcript the Audio to text correctly?
# Here, I will compare Whisper's generated transcriptions with the correct (expected) commands.
# This evaluation helps measure how accurately Whisper recognized the recorded speech.

import pandas as pd
from jiwer import wer, cer, process_words

# Load the Whisper transcriptions
df = pd.read_csv("/content/drive/MyDrive/NLP_project/audio_transcriptions_Reverb&Bass&Treble.csv")
df.columns = ["filename", "text", "label"]


ground_truth = {
    # Class 1: Request Instrument
    "1_Bring_me_clamp_loud_reverb_bass_treble.wav": "bring me clump.",
    "1_Give_me_forceps_loud_reverb_bass_treble.wav": "give me forceps.",
    "1_Give_me_stapler_loud_reverb_bass_treble.wav": "give me stapler.",
    "1_Hand_me_scalpel_loud_reverb_bass_treble.wav": "hand me scalpel.",
    "1_Hand_over_syringe_loud_reverb_bass_treble.wav": "hand over syringe",
    "1_Need_retractor_loud_reverb_bass_treble.wav": "need retractor.",
    "1_Need_suture_loud_reverb_bass_treble.wav": "need suture.",
    "1_Pass_needle_holder_loud_reverb_bass_treble.wav": "pass needle holder.",
    "1_Pass_suction_tube_loud_reverb_bass_treble.wav": "pass suction tube.",
    "1_Prepare_scissors_loud_reverb_bass_treble.wav": "prepare scissors.",

    # Class 2: Adjust Device
    "2_Adjust_microscope_focus_loud_reverb_bass_treble.wav": "adjust microscope focus.",
    "2_Increase_camera_brightness_loud_reverb_bass_treble.wav": "increase camera brightness.",
    "2_Increase_suction_power_loud_reverb_bass_treble.wav": "increase suction power.",
    "2_Lower_lighting_loud_reverb_bass_treble.wav": "lower lightingg",
    "2_Move_arm_left_loud_reverb_bass_treble.wav": "move arm left.",
    "2_Reduce_table_height_loud_reverb_bass_treble.wav": "Reduce table height.",
    "2_Set_ventilator_standby_loud_reverb_bass_treble.wav": "Set ventilator standby.",
    "2_Stabilize_robotic_arm_loud_reverb_bass_treble.wav": "stabilize robotic arm",
    "2_Turn_off_cauterizer_loud_reverb_bass_treble.wav": "turn off cauterizer.",
    "2_Zoom_in_endoscope_loud_reverb_bass_treble.wav": "zoom in endoscope",

    # Class 3: Request Information
    "3_Camera_active_loud.wav": "Camera active",
    "3_Display_anesthesia_level_loud_reverb_bass_treble.wav": "display anesthesia level.",
    "3_Oxygen_level_stable_loud_reverb_bass_treble.wav": "oxygen level stable.",
    "3_Patient_pulse_rate_loud_reverb_bass_treble.wav": "patient pulse rate.",
    "3_Pressure_normal_loud_reverb_bass_treble.wav": "pressure normal.",
    "3_Recording_video_feed_loud_reverb_bass_treble.wav": "recording video feed.",
    "3_Show_blood_pressure_loud_reverb_bass_treble.wav": "show blood pressure.",
    "3_Suction_working_loud_reverb_bass_treble.wav": "suction working",
    "3_Temperature_reading_loud_reverb_bass_treble.wav": "temperature reading.",
    "3_What_is_heart_rate_loud_reverb_bass_treble.wav": "what is heart rate?",
}

# Create a list to store comparison results
results = []
for idx, row in df.iterrows():
    fname = row["filename"].strip().replace(".m4a", ".wav")
    pred_text = str(row["text"]).lower().strip()
    true_text = ground_truth.get(fname, None)
    if true_text:
        # Calculate Word Error Rate for this file
        error = wer(true_text, pred_text)
        correct = (error == 0)
        results.append((fname, true_text, pred_text, error, correct))

# Convert to DataFrame
eval_df = pd.DataFrame(results, columns=["filename","ground_truth","whisper_output","WER","Exact_match"])

# Summary statistics
avg_wer = eval_df["WER"].mean()
exact_acc = eval_df["Exact_match"].mean()

print(f"Whisper Evaluation Results on Augmented Recordings (&reverb&bass&treble effects)")
print(f"Average Word Error Rate (WER): {avg_wer:.2f}")
print(f"Exact Sentence Accuracy: {exact_acc*100:.1f}%")

# Display a few examples
eval_df.head(30)


Whisper Evaluation Results on Augmented Recordings (&reverb&bass&treble effects)
Average Word Error Rate (WER): 0.57
Exact Sentence Accuracy: 35.7%


Unnamed: 0,filename,ground_truth,whisper_output,WER,Exact_match
0,2_Set_ventilator_standby_loud_reverb_bass_treb...,Set ventilator standby.,sip ventilated stand by,1.333333,False
1,2_Turn_off_cauterizer_loud_reverb_bass_treble.wav,turn off cauterizer.,10. off cut riser,1.0,False
2,2_Stabilize_robotic_arm_loud_reverb_bass_trebl...,stabilize robotic arm,סטבילי סלבות עקעה,1.0,False
3,2_Reduce_table_height_loud_reverb_bass_treble.wav,Reduce table height.,reduce state behind.,1.0,False
4,2_Increase_camera_brightness_loud_reverb_bass_...,increase camera brightness.,increase camera brightness.,0.0,True
5,2_Move_arm_left_loud_reverb_bass_treble.wav,move arm left.,move on left.,0.333333,False
6,2_Zoom_in_endoscope_loud_reverb_bass_treble.wav,zoom in endoscope,zoom in in the scope.,1.0,False
7,2_Lower_lighting_loud_reverb_bass_treble.wav,lower lightingg,119th,1.0,False
8,2_Increase_suction_power_loud_reverb_bass_treb...,increase suction power.,increase suction power.,0.0,True
9,2_Adjust_microscope_focus_loud_reverb_bass_tre...,adjust microscope focus.,adjust microscope focus.,0.0,True


In [110]:
import pandas as pd
df = pd.read_csv("/content/drive/MyDrive/NLP_project/audio_transcriptions_noise.csv")
print("Number of files in CSV:", len(df))
print("\nSample filenames:")
print(df["filename"].head(30).tolist())


Number of files in CSV: 30

Sample filenames:
['2_Adjust_microscope_focus_loud_noise.wav', '2_Stabilize_robotic_arm_loud_noise.wav', '2_Zoom_in_endoscope_loud_noise.wav', '2_Increase_camera_brightness_loud_noise.wav', '2_Lower_lighting_loud_noise.wav', '2_Move_arm_left_loud_noise.wav', '2_Increase_suction_power_loud_noise.wav', '2_Turn_off_cauterizer_loud_noise.wav', '2_Reduce_table_height_loud_noise.wav', '2_Set_ventilator_standby_loud_noise.wav', '3_Camera_active_loud_noise.wav', '3_Show_blood_pressure_loud_noise.wav', '3_Oxygen_level_stable_loud_noise.wav', '3_Temperature_reading_loud_noise.wav', '3_Recording_video_feed_loud_noise.wav', '3_Patient_pulse_rate_loud_noise.wav', '3_Pressure_normal_loud_noise.wav', '3_Suction_working_loud_noise.wav', '3_What_is_heart_rate_loud_noise.wav', '3_Display_anesthesia_level_loud_noise.wav', '1_Need_suture_loud_noise.wav', '1_Hand_me_scalpel_loud_noise.wav', '1_Give_me_stapler_loud_noise.wav', '1_Prepare_scissors_loud_noise.wav', '1_Give_me_force

# **IMPORTANT NOTE**

I checked the output of the csv files for both noise and Reverb&Bass&Treble effects, we can see from the outputs that without noises whisper did better job at transaction (Audio to text). more obsrevations will be written in the report.



# **Sixth step: Train BERT on the Combined Dataset**
Here I merged the three datasets — original recordings, noise-augmented, and reverb-enhanced — to check the *robustness of the Whisper transcriptions (Audio to Text)* and create a one full dataset for the BERT text classification stage.This enables the model to learn from diverse audio conditions, improving its generalization performance.

In [81]:
#combining these datasets together: 1. "audio_transcriptions.csv"  2. "audio_transcriptions_noise.csv" 3."audio_transcriptions_Reverb&Bass&Treble.csv"
import pandas as pd

# Load the three transcription CSVs
df_orig = pd.read_csv("/content/drive/MyDrive/NLP_project/audio_transcriptions.csv")
df_noise = pd.read_csv("/content/drive/MyDrive/NLP_project/audio_transcriptions_noise.csv")
df_reverb = pd.read_csv("/content/drive/MyDrive/NLP_project/audio_transcriptions_Reverb&Bass&Treble.csv")

# Standardize column names
df_orig.columns = ["filename", "text", "label"]
df_noise.columns = ["filename", "text", "label"]
df_reverb.columns = ["filename", "text", "label"]

# Merge all datasets together
df_all = pd.concat([df_orig, df_noise, df_reverb], ignore_index=True)

# Save the combined dataset
merged_path = "/content/drive/MyDrive/NLP_project/audio_transcriptions_all.csv"
df_all.to_csv(merged_path, index=False)

print(f"Combined dataset saved successfully!")
print(f"File path: {merged_path}")
print(f"Total samples: {len(df_all)}")

# Show preview
df_all.sample(10)

Combined dataset saved successfully!
File path: /content/drive/MyDrive/NLP_project/audio_transcriptions_all.csv
Total samples: 120


Unnamed: 0,filename,text,label
44,3_Pressure_normal_loud.m4a,برشة نورمل,info
47,3_Suction_working_loud.m4a,suction working,info
4,1_Hand_over_syringe_loud.m4a,Hand over syringe.,instrument
55,3_Recording_video_feed_fast_quiet.m4a,Коколь гли обязательно,info
26,2_Set_ventilator_standby_loud.m4a,"Sit, ventilator, standby.",device
64,2_Lower_lighting_loud_noise.wav,lower lighting,noise
73,3_Temperature_reading_loud_noise.wav,Temperature reading.,noise
10,1_Bring_me_clamp_fast_quiet.m4a,Pek miklem!,instrument
40,3_Camera_active_loud.m4a,Camular active.,info
107,1_Give_me_forceps_loud_reverb_bass_treble.wav,Give me four steps.,reverb&bass&treble


In [82]:
#Training BERT on this combined Dataset it was originally(60 samples) now (120 samples):
# This helps evaluate whether data augmentation improves robustness and classification accuracy.

from sklearn.model_selection import KFold
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
import torch, numpy as np, pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Load the merged dataset (120 recordings)
df = pd.read_csv("/content/drive/MyDrive/NLP_project/audio_transcriptions_all.csv",
                 header=0, names=["filename","text","label"])

# Encode labels
encoder = LabelEncoder()
df["label_id"] = encoder.fit_transform(df["label"])

# Initialize tokenizer and cross-validation
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Dataset helper
def to_ds(texts, labels):
    enc = tokenizer(texts, truncation=True, padding=True, max_length=64)
    class DS(torch.utils.data.Dataset):
        def __init__(self, enc, labels): self.enc, self.labels = enc, labels
        def __getitem__(self, i):
            item = {k: torch.tensor(v[i]) for k,v in self.enc.items()}
            item["labels"] = torch.tensor(self.labels[i]); return item
        def __len__(self): return len(self.labels)
    return DS(enc, labels)

# 5-fold training
fold_metrics = []

for fold, (tr_idx, te_idx) in enumerate(kf.split(df), 1):
    print(f"\n🔹 Fold {fold}")
    tr_texts = df.iloc[tr_idx]["text"].tolist()
    te_texts  = df.iloc[te_idx]["text"].tolist()
    tr_labels = df.iloc[tr_idx]["label_id"].tolist()
    te_labels  = df.iloc[te_idx]["label_id"].tolist()

    train_ds = to_ds(tr_texts, tr_labels)
    test_ds  = to_ds(te_texts, te_labels)

    model = BertForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=len(encoder.classes_)
    )

    args = TrainingArguments(
        output_dir=f"/content/drive/MyDrive/NLP_project/fold_{fold}_merged",
        num_train_epochs=3,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        learning_rate=2e-5,
        report_to="none"
    )

    trainer = Trainer(model=model, args=args, train_dataset=train_ds)
    trainer.train()

    preds = trainer.predict(test_ds)
    y_pred = np.argmax(preds.predictions, axis=-1)

    acc = accuracy_score(te_labels, y_pred)
    pr, rc, f1, _ = precision_recall_fscore_support(
        te_labels, y_pred, average="macro", zero_division=0
    )

    print({"accuracy": acc, "precision": pr, "recall": rc, "f1": f1})
    fold_metrics.append((acc, pr, rc, f1))





🔹 Fold 1


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss


{'accuracy': 0.20833333333333334, 'precision': 0.11157894736842104, 'recall': 0.20714285714285713, 'f1': 0.11884057971014492}

🔹 Fold 2


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss


{'accuracy': 0.16666666666666666, 'precision': 0.0711111111111111, 'recall': 0.2333333333333333, 'f1': 0.10877192982456138}

🔹 Fold 3


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss


{'accuracy': 0.08333333333333333, 'precision': 0.03333333333333333, 'recall': 0.06666666666666667, 'f1': 0.04444444444444444}

🔹 Fold 4


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss


{'accuracy': 0.20833333333333334, 'precision': 0.22000000000000003, 'recall': 0.13571428571428573, 'f1': 0.16292335115864526}

🔹 Fold 5


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss


{'accuracy': 0.125, 'precision': 0.031578947368421054, 'recall': 0.1, 'f1': 0.048}


In [84]:
# Calculate 5-fold averages
acc_avg = np.mean([m[0] for m in fold_metrics])
pr_avg  = np.mean([m[1] for m in fold_metrics])
rc_avg  = np.mean([m[2] for m in fold_metrics])
f1_avg  = np.mean([m[3] for m in fold_metrics])

print("\n 5-Fold Average Metrics on Combined Dataset:")
print(f"Accuracy: {acc_avg:.3f}")
print(f"Precision: {pr_avg:.3f}")
print(f"Recall:    {rc_avg:.3f}")
print(f"F1 Score:  {f1_avg:.3f}")


5-Fold Average Metrics on Combined Dataset:
Accuracy: 0.158
Precision: 0.094
Recall:    0.149
F1 Score:  0.097
