# Voice Activity Detection (VAD) using Silero VAD

This notebook demonstrates how to perform Voice Activity Detection (VAD) using the `silero-vad` library. We will:
1. Load and preprocess an audio file.
2. Apply the VAD algorithm to detect speech segments.
3. Visualize and output the detected speech segments.


## Step 1: Install Requirements

In [1]:
# Install required packages
!pip install -q torch torchaudio numpy matplotlib silero-vad soundfile

## Step 2: Load Libraries and Discover GPU Resources

In [3]:
# Import necessary libraries
# Note: the speech-timestamp helper is exported as `get_speech_timestamps`;
# the older name `get_speech_ts` is no longer available in the package.
from silero_vad import get_speech_timestamps, VADIterator, collect_chunks
import soundfile as sf
import numpy as np
import matplotlib.pyplot as plt
import torch

# Check to see what GPU resources are available
def get_best_device():
    if torch.cuda.is_available():
        print("Using CUDA")
        return "cuda"
    elif torch.backends.mps.is_available():
        print("Using MPS")
        return "mps"
    else:
        print("Using CPU")
        return "cpu"
device = get_best_device()

## Step 3: Load the Audio File

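We start by loading an audio file using `soundfile`. Silero VAD expects mono audio sampled at 8 kHz or 16 kHz (integer multiples of 16 kHz are downsampled automatically), so check the file's channel count and sample rate before running detection.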

In [None]:
# Load the audio file
audio_filepath = "test.wav"
audio, sample_rate = sf.read(audio_filepath)

# Plot the audio waveform
plt.figure(figsize=(15, 5))
plt.plot(np.linspace(0, len(audio) / sample_rate, num=len(audio)), audio)
plt.title('Audio Waveform')
plt.xlabel('Time [s]')
plt.ylabel('Amplitude')
plt.show()
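
The cell above keeps `audio` exactly as `sf.read` returns it, which may be multi-channel and is float64 by default. A minimal preprocessing sketch follows, assuming the sample-rate constraint described above. The helper name `prepare_for_vad` is hypothetical and not part of `silero-vad`, and the stereo input is synthetic.

```python
import numpy as np

def prepare_for_vad(audio, sample_rate):
    """Return a mono float32 copy of `audio` (hypothetical helper, not part of silero-vad)."""
    samples = np.asarray(audio, dtype=np.float32)
    if samples.ndim == 2:
        # soundfile returns multi-channel audio as (frames, channels); average the channels
        samples = samples.mean(axis=1)
    elif samples.ndim != 1:
        raise ValueError(f"expected 1-D or 2-D audio, got shape {samples.shape}")
    # Assumption: the model accepts 8 kHz, 16 kHz, or an integer multiple of 16 kHz
    if sample_rate != 8000 and (sample_rate < 16000 or sample_rate % 16000 != 0):
        raise ValueError(f"unsupported sample rate: {sample_rate} Hz; resample first")
    return samples

# Synthetic stereo example: 1 second at 16 kHz, two channels
stereo = np.stack([np.ones(16000), np.zeros(16000)], axis=1)
mono = prepare_for_vad(stereo, 16000)
print(mono.shape, mono.dtype)
```

Averaging the channels is only one way to downmix; picking a single channel is an equally reasonable choice when one microphone is known to carry the speaker.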


## Step 4: Apply Silero VAD

Next, we load the Silero VAD model and apply it to the audio to detect speech segments. Each segment's start and end are reported as sample indices, which we convert to seconds.

Finally, we print the segments and summary statistics for the audio file.

In [None]:
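# Initialize the VAD model. The hub entry point takes no `device` argument,
# and this lightweight model runs on the CPU.
model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad', model='silero_vad')
get_speech_timestamps = utils[0]  # utils is a tuple; its first element is get_speech_timestamps

# Silero VAD expects a mono float32 tensor, so average the channels of multi-channel audio
mono_audio = audio.mean(axis=1) if audio.ndim > 1 else audio
audio_tensor = torch.from_numpy(mono_audio).float()

# Apply the VAD model to the audio (timestamps are returned as sample indices)
vad_segments = get_speech_timestamps(audio_tensor, model, sampling_rate=sample_rate)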

# Convert VAD segments to start and end times in seconds
speech_segments = [(segment['start'] / sample_rate, segment['end'] / sample_rate) for segment in vad_segments]

# Print the VAD segments
print("Detected speech segments (in seconds):")
for start, end in speech_segments:
    print(f"Start: {start:.2f}, End: {end:.2f}")

# Print VAD statistics: Number of speech segments, total duration of speech
# segments, and speech ratio
num_speech_segments = len(speech_segments)
total_duration = sum(end - start for start, end in speech_segments)
total_audio_length = len(audio) / sample_rate
speech_ratio = total_duration / total_audio_length
print(f"\nNumber of speech segments: {num_speech_segments}")
print(f"Total length of audio: {total_audio_length:.2f} seconds")
print(f"Total duration of speech segments: {total_duration:.2f} seconds")
print(f"Speech ratio: {speech_ratio:.2f}")
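
To make the arithmetic behind these statistics concrete, here is a small self-contained sketch that runs without the model. The function name `summarize_segments` and the example segments are hypothetical. The only assumption carried over from the cell above is that each segment is a dict with `start` and `end` keys holding sample indices.

```python
def summarize_segments(segments, sample_rate, num_samples):
    """Convert sample-index segments to seconds and compute summary statistics."""
    spans = [(seg["start"] / sample_rate, seg["end"] / sample_rate) for seg in segments]
    speech_seconds = sum(end - start for start, end in spans)
    total_seconds = num_samples / sample_rate
    # Guard against empty audio so the ratio never divides by zero
    ratio = speech_seconds / total_seconds if total_seconds else 0.0
    return spans, speech_seconds, total_seconds, ratio

# Hypothetical segments for a 10-second clip sampled at 16 kHz
example = [{"start": 16000, "end": 48000}, {"start": 80000, "end": 112000}]
spans, speech_seconds, total_seconds, ratio = summarize_segments(example, 16000, 160000)
print(spans)  # [(1.0, 3.0), (5.0, 7.0)]
print(f"{speech_seconds:.2f}s of speech in {total_seconds:.2f}s (ratio {ratio:.2f})")
```

A speech ratio near 0 or 1 is a quick sanity check: it often points to a silent file, a wrong sample rate, or audio that was not converted to mono.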


## Step 5: Visualize the Detected Speech Segments

We visualize the detected speech segments on the audio waveform to better understand where speech occurs.

In [None]:
# Plot the audio waveform with detected speech segments
plt.figure(figsize=(15, 5))
plt.plot(np.linspace(0, len(audio) / sample_rate, num=len(audio)), audio, label='Audio')
for start, end in speech_segments:
    plt.axvspan(start, end, color='red', alpha=0.5, label='Speech Segment' if start == speech_segments[0][0] else "")
plt.title('Audio Waveform with Detected Speech Segments')
plt.xlabel('Time [s]')
plt.ylabel('Amplitude')
plt.legend()
plt.show()

## Step 6: Clean Up GPU Resources

In [None]:
# Release cached GPU memory held by PyTorch for the active backend
if device == "cuda":
    torch.cuda.empty_cache()
elif device == "mps":
    torch.mps.empty_cache()

## Conclusion

In this notebook, we demonstrated how to use the `silero-vad` library to detect speech segments in an audio file. We loaded the audio, applied the VAD model, printed summary statistics, and visualized the detected speech segments on the waveform.