## Task 5a: Hotword Detection

This notebook contains the solution for hotword detection and saves the list of mp3 filenames in which hotwords were detected to ```detected.txt```.

Use ```cv-valid-dev.csv``` to access the transcriptions of the cv-valid-dev mp3 dataset produced by my fine-tuned model from task 4.

```Note: I assume that only the exact word form counts as a hotword.```

In [None]:
import pandas as pd
import numpy as np
import librosa
import soundfile as sf
import torch

from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from tqdm import tqdm

# Set Pandas display options to show full text
pd.set_option("display.max_colwidth", None)

As the ```cv-valid-dev.csv``` from task 2d contains the transcriptions from the original Wav2Vec2 model, I need to transcribe the audio again with my fine-tuned model to generate the file required for this task.

In [2]:
# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


In [None]:
# My Hugging Face Hub model name
MY_MODEL_NAME = "enlihhhhh/wav2vec2-large-960h-cv"

# Load the fine-tuned model and processor
processor = Wav2Vec2Processor.from_pretrained(MY_MODEL_NAME)
model = Wav2Vec2ForCTC.from_pretrained(MY_MODEL_NAME)

# Move model to GPU if available
model.to(device)

def transcribe_audio(audio_path):
    try:
        # Read the audio file
        audio, sample_rate = sf.read(f"../asr/data/{audio_path}")

        # Resample if not already 16kHz
        if sample_rate != 16000:
            audio = librosa.resample(audio, orig_sr=sample_rate, target_sr=16000)

        # Convert audio to tensor and move to same device as model
        input_values = processor(audio, return_tensors="pt", sampling_rate=16000).input_values.to(device)

        # Perform inference
        with torch.no_grad():
            logits = model(input_values).logits

        # Decode the logits
        predicted_ids = torch.argmax(logits, dim=-1)
        transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

        return transcription

    except Exception as e:
        print(f"Error transcribing {audio_path}: {str(e)}")
        return None
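
The decoding step above takes the arg-max token per frame and lets the processor turn those ids into text. To illustrate what greedy CTC decoding does conceptually, here is a minimal sketch on a toy example. The vocabulary, `BLANK_ID`, and the helper names are hypothetical and only for illustration; I assume the usual CTC convention of a blank token plus a `|` word delimiter, and the real mapping lives inside the model's tokenizer.

```python
import numpy as np

# Hypothetical toy vocabulary; index 0 plays the role of the CTC blank token.
VOCAB = ["<pad>", "|", "a", "b", "c"]
BLANK_ID = 0
WORD_DELIM = "|"


def greedy_ctc_decode(logits):
    """Turn a (time, vocab) logit matrix into text with greedy CTC rules."""
    ids = np.argmax(logits, axis=-1)  # most likely token per frame
    chars = []
    prev = None
    for i in ids:
        # Keep a token only if it differs from the previous frame
        # and is not the blank token.
        if i != prev and i != BLANK_ID:
            chars.append(VOCAB[i])
        prev = i
    # The word delimiter becomes a space in the final text.
    return "".join(chars).replace(WORD_DELIM, " ").strip()


# Fake one-hot "logits" for the frame sequence a a _ a b | | c _
frames = [2, 2, 0, 2, 3, 1, 1, 4, 0]
toy_logits = np.eye(len(VOCAB))[frames]

decoded = greedy_ctc_decode(toy_logits)
print(decoded)
```

The blank frame between the two runs of `a` is what allows a repeated letter to survive the collapsing step.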

In [4]:
# Assuming that your asr folder contains the common voice dataset
dev_common_voice = pd.read_csv("data/cv-valid-dev.csv")
dev_common_voice = dev_common_voice[['filename']]

# Call the transcribe function
tqdm.pandas(desc="Transcribing Audio Files")
dev_common_voice["generated_text"] = dev_common_voice["filename"].progress_apply(transcribe_audio)

Transcribing Audio Files: 100%|██████████| 4076/4076 [01:03<00:00, 63.98it/s]


In [5]:
# Lowercase the generated text
dev_common_voice["generated_text"] = dev_common_voice["generated_text"].str.lower()
dev_common_voice.head()

| | filename | generated_text |
|---|---|---|
| 0 | cv-valid-dev/sample-000000.mp3 | be careful with your propnastigations said the stranger |
| 1 | cv-valid-dev/sample-000001.mp3 | then why should they be surprised when they se born |
| 2 | cv-valid-dev/sample-000002.mp3 | a young arab also loaded down with bagage entered and greted the englishman |
| 3 | cv-valid-dev/sample-000003.mp3 | i felt that everything i owned would be destroyed |
| 4 | cv-valid-dev/sample-000004.mp3 | he moved about invisible but everyone could hear him |


Next, I define the hotwords to detect, ```be careful, destroy, stranger```, following the assumption stated at the top of this notebook.

Based on this assumption, the following cases are not classified as hotwords:
1. destroyed != destroy
2. becareful != be careful
3. strange != stranger

```Hence, we only match the exact words with the transcriptions.```
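
One way to enforce the exact-form rule is to wrap the hotwords in regex word boundaries (`\b`). The following is a minimal sketch on made-up rows, not the actual dataset. The filenames and sentences are hypothetical, and for comparison it also runs a plain alternation without boundaries.

```python
import re

import pandas as pd

hotwords = ["be careful", "destroy", "stranger"]

# \b requires a word boundary on both sides, so "destroyed" or
# "becareful" do not count as matches.
exact_pattern = r"\b(?:" + "|".join(re.escape(w) for w in hotwords) + r")\b"
# Plain alternation: matches the hotword anywhere, even inside a longer word.
loose_pattern = "|".join(hotwords)

# Made-up rows for illustration only
sample = pd.DataFrame({
    "filename": ["a.mp3", "b.mp3", "c.mp3", "d.mp3"],
    "generated_text": [
        "be careful said the stranger",
        "everything i owned would be destroyed",
        "becareful of the strange man",
        "they will destroy it",
    ],
})


def matching_files(df, pattern):
    """Return the filenames whose transcription matches the regex pattern."""
    mask = df["generated_text"].str.contains(pattern, case=False, na=False, regex=True)
    return df.loc[mask, "filename"].tolist()


exact = matching_files(sample, exact_pattern)
loose = matching_files(sample, loose_pattern)
print("exact:", exact)
print("loose:", loose)
```

On these rows, the row containing `destroyed` is flagged only by the pattern without boundaries.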

In [6]:
# Define the target hotwords
hotwords = ["be careful", "destroy", "stranger"]

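# Find the audio filenames whose transcription contains any of the hotwords.
# Note: str.contains does regex substring matching here, so a longer form
# such as "destroyed" also matches "destroy".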
target_files = dev_common_voice[dev_common_voice["generated_text"].str.contains("|".join(hotwords), case=False, na=False)]["filename"]

# Convert the target_files Series to a list
audio_files_list = target_files.tolist()
print("The audio files that contain the hotwords are:")
print(audio_files_list)

The audio files that contain the hotwords are:
['cv-valid-dev/sample-000000.mp3', 'cv-valid-dev/sample-000003.mp3', 'cv-valid-dev/sample-000089.mp3', 'cv-valid-dev/sample-000508.mp3', 'cv-valid-dev/sample-000674.mp3', 'cv-valid-dev/sample-001093.mp3', 'cv-valid-dev/sample-001101.mp3', 'cv-valid-dev/sample-001243.mp3', 'cv-valid-dev/sample-001501.mp3', 'cv-valid-dev/sample-001933.mp3', 'cv-valid-dev/sample-002405.mp3', 'cv-valid-dev/sample-002453.mp3', 'cv-valid-dev/sample-003065.mp3', 'cv-valid-dev/sample-003219.mp3', 'cv-valid-dev/sample-003808.mp3']


In [7]:
# Now, we save the filenames into a .txt file
with open("detected.txt", "w") as f:
    for file in audio_files_list:
        f.write(file + "\n")
print("The audio filenames have been saved to detected.txt file")

The audio filenames have been saved to detected.txt file
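
As a quick sanity check, the saved list can be read back and compared with the in-memory list. This is a minimal sketch that writes to a temporary directory so it does not touch the real ```detected.txt```; the two filenames are taken from the output above and the variable names are only illustrative.

```python
import os
import tempfile

# Two filenames from the detected list, used here as example input
detected = ["cv-valid-dev/sample-000000.mp3", "cv-valid-dev/sample-000089.mp3"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "detected.txt")

    # Write one filename per line, the same layout as detected.txt
    with open(path, "w") as f:
        for name in detected:
            f.write(name + "\n")

    # Read the file back, dropping newlines and any empty lines
    with open(path) as f:
        restored = [line.strip() for line in f if line.strip()]

print(restored == detected)
```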


In [8]:
# Save the transcribed results to a CSV file
dev_common_voice.to_csv("cv-valid-dev.csv", index=False)