# Voice Activity Detection

## Introduction

The directory `/tmp/vad-dataset` contains 900 WAV audio files, each sampled at 16 kHz with 16-bit resolution and lasting at most 1 second. Of these, 800 files are recordings of 8 distinct keywords (100 files per keyword), and the remaining 100 files are silent segments. The dataset is intended for evaluating and tuning a Voice Activity Detection (VAD) algorithm.

### Count the number of files

In [1]:
!ls /tmp/vad-dataset/ | wc -l

900


### List 5 files

In [2]:
!ls /tmp/vad-dataset | head -n5

down_0132a06d_nohash_4.wav
down_063d48cf_nohash_0.wav
down_0a9f9af7_nohash_0.wav
down_0ff728b5_nohash_4.wav
down_14872d06_nohash_0.wav
ls: write error: Broken pipe


## Spectrogram-based VAD

VAD is used to classify an audio file as either silence or non-silence. In this exercise, we will implement a basic VAD algorithm using spectrogram features.

The VAD algorithm consists of the following steps:

1. Calculate the spectrogram of the audio.
2. Convert the amplitude values to decibels (dB).
3. For each time frame of the spectrogram, compute the average amplitude across all frequency bins.
4. For each time frame, compute the relative amplitude with respect to the frame with minimum energy.
5. Identify frames where the relative amplitude exceeds a predefined threshold (`dBthres`) and mark these frames as non-silence.
6. Calculate the total duration, in seconds, of the non-silence segment.
7. If this duration exceeds a specified threshold (`duration_thres`), classify the entire audio file as non-silence; otherwise, classify it as silence.
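
The steps above can be sketched in plain NumPy. This is a minimal sketch under stated assumptions, not the implementation used in this notebook: `stft_magnitude` and `is_silence_sketch` are hypothetical names, the spectrogram is assumed to be a Hann-windowed STFT magnitude without padding, decibels are computed with `log10`, and a clip with no non-silent frames is given a duration of 0.

```python
import numpy as np


def stft_magnitude(audio, sampling_rate, frame_length_in_s, frame_step_in_s):
    """Magnitude spectrogram with shape (num_frames, num_frequency_bins).

    Assumes the clip is at least one frame long.
    """
    frame_length = int(round(frame_length_in_s * sampling_rate))
    frame_step = int(round(frame_step_in_s * sampling_rate))
    num_frames = 1 + (len(audio) - frame_length) // frame_step
    window = np.hanning(frame_length)
    frames = np.stack([
        audio[i * frame_step: i * frame_step + frame_length] * window
        for i in range(num_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=1))


def is_silence_sketch(audio, sampling_rate=16000, frame_length_in_s=0.04,
                      frame_step_in_s=0.01, db_thres=20.0, duration_thres=0.15):
    """Return True if the clip is classified as silence."""
    # Steps 1-2: spectrogram, then amplitude in dB (small offset avoids log(0))
    spec = stft_magnitude(audio, sampling_rate, frame_length_in_s, frame_step_in_s)
    db = 20.0 * np.log10(spec + 1e-6)
    # Step 3: average over frequency bins, one value per time frame
    energy = db.mean(axis=1)
    # Step 4: energy relative to the quietest frame
    rel_energy = energy - energy.min()
    # Step 5: frames above the dB threshold are non-silence
    num_active = int((rel_energy > db_thres).sum())
    # Step 6: duration covered by the active frames (assumed contiguous)
    if num_active > 0:
        duration = frame_length_in_s + frame_step_in_s * (num_active - 1)
    else:
        duration = 0.0
    # Step 7: compare with the duration threshold
    return duration <= duration_thres


# Synthetic check: quiet background noise versus a loud broadband burst
rng = np.random.default_rng(0)
quiet = 1e-3 * rng.standard_normal(16000)
burst = quiet.copy()
burst[4000:12000] += 0.3 * rng.standard_normal(8000)
print(is_silence_sketch(quiet), is_silence_sketch(burst))
```

The synthetic burst stands in for speech only because it is broadband; a pure tone raises just a few frequency bins, so averaging dB values over all bins would barely move.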

### Read a file from the VAD dataset

In [3]:
import tensorflow as tf
from reader import AudioReader
from preprocessing import Normalization

sampling_rate = 16000
audio_reader = AudioReader(tf.int16)
normalization = Normalization(tf.int16)

filename = '/tmp/vad-dataset/down_0132a06d_nohash_4.wav'
audio, label = audio_reader.get_audio_and_label(filename)
audio, label = normalization.normalize(audio, label)
audio = tf.squeeze(audio)

# Check that the audio is normalized to the range [-1, 1]
print("Min value of audio data:", tf.reduce_min(audio))
print("Max value of audio data:", tf.reduce_max(audio))
print(label.numpy().decode())


### Plot the Waveform

In [4]:
import pandas as pd
import numpy as np

plot = {
    'Time (s)': np.linspace(0, len(audio) / sampling_rate, len(audio)),
    'Amplitude': audio.numpy().squeeze(),
}
plot_df = pd.DataFrame(plot)

In [5]:
[Deepnote chart: line plot of Amplitude versus Time (s)]

### Compute the Spectrogram

In [6]:
from preprocessing import Spectrogram

sampling_rate = 16000
frame_length_in_s = 0.04
frame_step_in_s = 0.01
spec_processor = Spectrogram(sampling_rate, frame_length_in_s, frame_step_in_s)
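
The frame settings above determine how many time frames the spectrogram has. The snippet below is a small sketch of that arithmetic, assuming the `Spectrogram` class frames the signal without padding (not confirmed by this notebook); `num_frames` is a hypothetical helper.

```python
def num_frames(num_samples, sampling_rate, frame_length_in_s, frame_step_in_s):
    """Number of full frames that fit in the signal when no padding is applied."""
    frame_length = int(round(frame_length_in_s * sampling_rate))  # samples per frame
    frame_step = int(round(frame_step_in_s * sampling_rate))      # hop between frames
    if num_samples < frame_length:
        return 0
    return 1 + (num_samples - frame_length) // frame_step


# 1 s of audio at 16 kHz with 40 ms frames and a 10 ms step
print(num_frames(16000, 16000, 0.04, 0.01))  # 97
```

With 640-sample frames and a 160-sample step, a 1-second clip yields 97 frames, which matches the `shape=(97,)` of the energy tensor computed in the next cell.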

### Compute the Energy for each Time Frame

In [7]:
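spec = spec_processor.get_spectrogram(audio)
# Note: tf.math.log is the natural logarithm, so these values are ln(10) ≈ 2.3
# times larger than true decibels, which would use log10.
dB = 20 * tf.math.log(spec + 1.e-6)
# Average over frequency bins (axis=1) to get one value per time frame
energy = tf.math.reduce_mean(dB, axis=1)
# Express each frame relative to the quietest frame
min_energy = tf.math.reduce_min(energy)
rel_energy = energy - min_energy
rel_energy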

<tf.Tensor: shape=(97,), dtype=float32, numpy=
array([ 1.6747742 ,  1.3076019 ,  2.7266846 ,  1.0968323 ,  2.0650024 ,
        2.0477905 ,  1.5089569 ,  0.9965973 ,  1.4521179 ,  2.1545715 ,
        1.0388489 ,  1.4580688 ,  1.1069946 ,  0.        ,  0.50471497,
        0.09797668,  0.49243164,  0.96440125,  3.6753998 ,  2.6143646 ,
        2.635437  ,  2.0506744 ,  3.193344  ,  3.4331055 ,  0.9849701 ,
        0.6419678 ,  5.4634705 , 10.144775  , 13.935463  , 23.665413  ,
       37.49733   , 51.70729   , 62.861084  , 70.43498   , 77.470985  ,
       78.73361   , 81.4451    , 82.426605  , 82.65578   , 81.17423   ,
       79.84013   , 77.77415   , 75.01674   , 74.268326  , 73.356766  ,
       70.91118   , 71.53914   , 70.64348   , 69.86165   , 67.705795  ,
       69.24626   , 68.39715   , 67.555695  , 65.10646   , 65.08226   ,
       61.883507  , 60.018616  , 57.78985   , 58.92843   , 58.107246  ,
       56.445198  , 55.01168   , 53.29226   , 50.718597  , 46.318512  ,
       44.728935 

### Display Energy over Time

In [8]:
plot_data = {
    'Time (s)': np.linspace(0, len(audio) / sampling_rate, energy.shape[0]),
    'Rel. Energy (dB)': rel_energy,
}
plot_df = pd.DataFrame(plot_data)
plot_df

|     | Time (s) | Rel. Energy (dB) |
|-----|----------|------------------|
| 0   | 0.000000 | 1.674774         |
| 1   | 0.010417 | 1.307602         |
| 2   | 0.020833 | 2.726685         |
| 3   | 0.031250 | 1.096832         |
| 4   | 0.041667 | 2.065002         |
| ... | ...      | ...              |
| 92  | 0.958333 | 13.130341        |
| 93  | 0.968750 | 12.018768        |
| 94  | 0.979167 | 10.861053        |
| 95  | 0.989583 | 9.650497         |


In [9]:
[Deepnote chart: bar plot of Rel. Energy (dB) versus Time (s)]

### Get the non-silent / silent frames based on an energy threshold

In [10]:
non_silence = rel_energy > 20
non_silence

<tf.Tensor: shape=(97,), dtype=bool, numpy=
array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False])>
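
The energy in this notebook is computed as `20 * tf.math.log(...)`, and `tf.math.log` is the natural logarithm, so the threshold of 20 is not expressed in true decibels. The sketch below shows the conversion between the two scales; the helper names are hypothetical.

```python
import math

LN10 = math.log(10.0)  # 20 * ln(x) == LN10 * 20 * log10(x)


def notebook_units_to_db(value):
    """Convert a 20 * ln(x) value, as computed in the notebook, to true dB."""
    return value / LN10


def db_to_notebook_units(value_db):
    """Convert a true dB value, 20 * log10(x), to the notebook's 20 * ln(x) scale."""
    return value_db * LN10


print(f"{notebook_units_to_db(20.0):.2f} dB")  # 8.69 dB
```

A threshold of 20 on the notebook's scale therefore corresponds to roughly 8.7 dB. The classification results are unaffected as long as the threshold and the energy use the same scale.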

### Visualize the results

In [11]:
non_silence_numpy = non_silence.numpy() * 1


plot = {
    'Time (s)': np.linspace(0, len(audio) / sampling_rate, len(audio)),
    'Amplitude': audio.numpy().squeeze(),
    'Non-Silence': np.interp(
        np.linspace(0, len(non_silence_numpy), len(audio)), 
        np.arange(len(non_silence_numpy)),                 
        non_silence_numpy
    )
}
plot_df = pd.DataFrame(plot)

In [12]:
[Deepnote chart: Amplitude and Non-Silence versus Time (s), overlaid as two lines with independent y-axes]

### Compute the non-silence duration

In [13]:
non_silence_frames = tf.math.reduce_sum(tf.cast(non_silence, tf.float32))
non_silence_duration = frame_length_in_s + frame_step_in_s * (non_silence_frames - 1)
print(f'Speech duration {non_silence_duration.numpy():.2f} s')

Speech duration 0.53 s
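
The duration formula above treats the non-silent frames as one contiguous run: 50 frames give 0.04 + 0.01 × 49 = 0.53 s. The sketch below compares it with the time covered by the union of the active frames, which also handles masks that are split into several runs. Both function names are hypothetical and the comparison is only an illustration.

```python
def formula_duration(num_active, frame_length_in_s, frame_step_in_s):
    """Duration formula used in the cell above (assumes one contiguous run)."""
    return frame_length_in_s + frame_step_in_s * (num_active - 1)


def covered_duration(mask, frame_length_in_s, frame_step_in_s):
    """Total time covered by the union of all active frames."""
    total = 0.0
    start = end = None
    for i, active in enumerate(mask):
        if not active:
            continue
        frame_start = i * frame_step_in_s
        frame_end = frame_start + frame_length_in_s
        if end is not None and frame_start <= end:
            end = frame_end            # overlaps the current interval: extend it
        else:
            if end is not None:
                total += end - start   # close the previous interval
            start, end = frame_start, frame_end
    if end is not None:
        total += end - start
    return total


# One contiguous run of 50 frames, as in the mask shown earlier
contiguous = [False] * 29 + [True] * 50 + [False] * 18
print(formula_duration(50, 0.04, 0.01), covered_duration(contiguous, 0.04, 0.01))

# Two isolated frames far apart: the formula underestimates the covered time
split = [True] + [False] * 9 + [True]
print(formula_duration(2, 0.04, 0.01), covered_duration(split, 0.04, 0.01))
```

For a single run the two agree, so the simpler formula is sufficient whenever the detected speech is one uninterrupted segment.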


### Re-organize the VAD code in a Python Class

In [14]:
class VAD:
    """Spectrogram-based voice activity detector for a single audio clip."""

    def __init__(
        self,
        sampling_rate,
        frame_length_in_s,
        frame_step_in_s,
        dBthres,
        duration_thres,
    ):
        self.frame_length_in_s = frame_length_in_s
        self.frame_step_in_s = frame_step_in_s
        self.spec_processor = Spectrogram(
            sampling_rate, frame_length_in_s, frame_step_in_s,
        )
        self.dBthres = dBthres
        self.duration_thres = duration_thres

    def is_silence(self, audio):
        """Return 1 if the clip is classified as silence, 0 otherwise."""
        spectrogram = self.spec_processor.get_spectrogram(audio)

        # Log-scaled amplitude (natural log, as in the cells above), averaged over frequency bins
        dB = 20 * tf.math.log(spectrogram + 1.e-6)
        energy = tf.math.reduce_mean(dB, axis=1)
        min_energy = tf.reduce_min(energy)

        # Frames whose energy exceeds the quietest frame by more than dBthres
        rel_energy = energy - min_energy
        non_silence = rel_energy > self.dBthres
        non_silence_frames = tf.math.reduce_sum(tf.cast(non_silence, tf.float32))
        # Duration covered by the non-silent frames, assuming they are contiguous
        non_silence_duration = self.frame_length_in_s + self.frame_step_in_s * (non_silence_frames - 1)

        if non_silence_duration > self.duration_thres:
            return 0
        else:
            return 1

### Test the VAD class

In [15]:
# Pick some hyperparameter values to check that the class works
vad_processor = VAD(16000, 0.04, 0.01, 20, 0.4)


In [16]:
# check a random audio file to test the VAD class
audio = audio_reader.get_audio('/tmp/vad-dataset/down_14872d06_nohash_0.wav')
audio = tf.squeeze(audio)
audio = normalization.normalize_audio(audio)
vad_processor.is_silence(audio)


0

In [17]:
audio = audio_reader.get_audio('/tmp/vad-dataset/silence_027.wav')
audio = tf.squeeze(audio)
audio = normalization.normalize_audio(audio)
vad_processor.is_silence(audio)

1

### Find the optimal parameter values for the VAD class

In [18]:
# To improve accuracy, define a range of candidate values for each parameter.
# frame_step_in_s is expressed as a fraction of frame_length_in_s (frame_step_ratios).
vad_parameters = {
    'sampling_rate': 16000,
    'frame_length_in_s': [0.008, 0.032, 0.064],  # Frame lengths to test
    'frame_step_ratios': [0.25, 0.5, 0.75],      # Frame step ratios
    'dbtreshold': [10, 12, 15],                  # dB thresholds to test
    'duration': [0.15, 0.17, 0.2]                # Duration thresholds to test
}

In [19]:
from glob import glob
from itertools import product

# Load the list of dataset files
filenames = glob('/tmp/vad-dataset/*')

# Generate all combinations of parameters
param_combinations = list(product(
    vad_parameters['frame_length_in_s'],
    vad_parameters['frame_step_ratios'],
    vad_parameters['dbtreshold'],
    vad_parameters['duration']
))

# Store the results for each combination
results = []

# Iterate through the parameter combinations
for frame_length, frame_step_ratio, db_thresh, duration_thresh in param_combinations:
    # Derive the frame step from the frame length and the step ratio
    frame_step = frame_length * frame_step_ratio

    # Initialize VAD processor with the current parameters
    # sampling_rate is fixed
    vad_processor = VAD(
        vad_parameters['sampling_rate'], frame_length, frame_step, db_thresh, duration_thresh
    )

    # Evaluate accuracy for the current parameter set
    correct = 0
    for filename in filenames:
        audio, label = audio_reader.get_audio_and_label(filename)
        audio = tf.squeeze(audio)
        audio, label = normalization.normalize(audio, label)
        
        is_true_silence = label.numpy().decode() == 'silence'
        predicted_silence = vad_processor.is_silence(audio)

        if predicted_silence == is_true_silence:
            correct += 1

    accuracy = 100. * correct / len(filenames)
    results.append((frame_length, frame_step, db_thresh, duration_thresh, accuracy))

# Sort results by accuracy in descending order
sorted_results = sorted(results, key=lambda x: x[-1], reverse=True)

# Keep the top 5 results
top_5_results = sorted_results[:5]

# Print the top 5 results
if top_5_results:
    print("\nTop 5 Results (Sorted by Accuracy):")
    for frame_length, frame_step, db_thresh, duration_thresh, accuracy in top_5_results:
        print(f"Frame Length: {frame_length:.3f}s, Frame Step: {frame_step:.3f}s, "
              f"dB Threshold: {db_thresh}, Duration Threshold: {duration_thresh:.3f}s -> "
              f"Accuracy: {accuracy:.2f}%")
else:
    print("\nNo parameter combinations yielded high accuracy.")


Top 5 Results (Sorted by Accuracy):
Frame Length: 0.032s, Frame Step: 0.008s, dB Threshold: 10, Duration Threshold: 0.150s -> Accuracy: 97.89%
Frame Length: 0.032s, Frame Step: 0.016s, dB Threshold: 10, Duration Threshold: 0.150s -> Accuracy: 97.89%
Frame Length: 0.008s, Frame Step: 0.002s, dB Threshold: 15, Duration Threshold: 0.150s -> Accuracy: 97.78%
Frame Length: 0.008s, Frame Step: 0.004s, dB Threshold: 15, Duration Threshold: 0.150s -> Accuracy: 97.78%
Frame Length: 0.032s, Frame Step: 0.008s, dB Threshold: 10, Duration Threshold: 0.170s -> Accuracy: 97.67%
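
The search above re-reads every file for each of the 3 × 3 × 3 × 3 = 81 parameter combinations. One possible variation, sketched below, loads the labelled clips once and reuses them for every combination. The names `grid_search` and `make_classifier` are hypothetical, and the toy classifier stands in for the `VAD` class so that the example runs on its own.

```python
from itertools import product


def grid_search(samples, make_classifier, grid):
    """Evaluate every parameter combination on preloaded samples.

    samples: list of (audio, is_silence) pairs, loaded once up front.
    make_classifier: callable taking the parameters as keyword arguments and
        returning a function audio -> truthy value when the clip is silence.
    grid: dict mapping parameter names to lists of candidate values.
    Returns a list of (params, accuracy_in_percent), best first.
    """
    names = list(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        classify = make_classifier(**params)
        correct = sum(bool(classify(audio)) == truth for audio, truth in samples)
        results.append((params, 100.0 * correct / len(samples)))
    return sorted(results, key=lambda item: item[1], reverse=True)


# Toy data: a clip is "silence" when its peak amplitude is at most the threshold
toy_samples = [
    ([0.0, 0.01], True),
    ([0.5, -0.4], False),
    ([0.02, 0.0], True),
    ([0.2, 0.1], False),
]


def make_peak_classifier(threshold):
    return lambda audio: max(abs(x) for x in audio) <= threshold


ranking = grid_search(toy_samples, make_peak_classifier, {'threshold': [0.001, 0.05, 1.0]})
for params, accuracy in ranking:
    print(params, f"{accuracy:.2f}%")
```

With the notebook's classes, `samples` would hold the normalized audio tensors and their silence labels, and `make_classifier` would build a `VAD` instance and return its `is_silence` method.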

