# Speaker Diarization using pyannote-audio

This notebook demonstrates how to perform speaker diarization using the `pyannote-audio` library. We will:
1. Load and preprocess an audio file.
2. Apply the pyannote-audio model to perform diarization.
3. Output the diarization results in txt, json, and srt formats.

## Explanation
Speaker diarization partitions an audio stream into homogeneous segments according to speaker identity, answering the question "who spoke when". It is useful for analyzing conversations, meetings, and other multi-speaker recordings.
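
As a rough illustration of that partition, a diarization result can be thought of as a list of time-stamped speaker turns. The sketch below uses made-up segments and a hypothetical helper named `talk_time`; it is not pyannote's own data structure, only a plain-Python picture of the idea.

```python
from collections import defaultdict

# Made-up example data: each turn is (start_seconds, end_seconds, speaker_label)
segments = [
    (0.0, 4.0, "SPEAKER_00"),
    (4.0, 9.5, "SPEAKER_01"),
    (9.5, 12.0, "SPEAKER_00"),
]

def talk_time(segments):
    """Sum the duration of every turn, grouped by speaker label."""
    totals = defaultdict(float)
    for start, end, speaker in segments:
        totals[speaker] += end - start
    return dict(totals)

print(talk_time(segments))  # {'SPEAKER_00': 6.5, 'SPEAKER_01': 5.5}
```

Each segment is "homogeneous" in the sense that it carries exactly one speaker label.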


## Step 1: Install Requirements

Install pyannote-audio and other necessary libraries.

In [None]:
# Install commands, each paired with a short description
commands = [
    ("PIP_ROOT_USER_ACTION=ignore pip install -q pyannote.audio", "Install pyannote.audio"),
    ("PIP_ROOT_USER_ACTION=ignore pip install -q soundfile", "Install soundfile"),
    ("PIP_ROOT_USER_ACTION=ignore pip install -q numpy", "Install numpy"),
    ("PIP_ROOT_USER_ACTION=ignore pip install -q matplotlib", "Install matplotlib"),
    ("PIP_ROOT_USER_ACTION=ignore pip install -q transformers", "Install transformers"),
    ("PIP_ROOT_USER_ACTION=ignore pip install -q -U git+https://github.com/speechbrain/speechbrain.git@develop", "Install speechbrain")
]

# Import the utils module which sets up the environment
from modules import utils
from modules import disable_warnings

# Use LogTools
log_tools = utils.LogTools()

# Execute
log_tools.command_state(commands)

## Step 2: Load Libraries and Discover GPU Resources

In [None]:
# Import necessary libraries
import torch
import numpy as np
import soundfile as sf
import matplotlib.pyplot as plt
from pyannote.audio import Pipeline
from pyannote.audio.pipelines.utils.hook import ProgressHook
from scipy.signal import resample
import IPython.display as ipd
import os

# Check to see what GPU resources are available
def get_best_device():
    if torch.cuda.is_available():
        print("Using CUDA")
        return torch.device("cuda")
    elif torch.backends.mps.is_available():
        print("Using MPS")
        return torch.device("mps")
    else:
        print("Using CPU")
        return torch.device("cpu")

device = get_best_device()

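# Read the Hugging Face access token from the environment
HF_HUB_TOKEN = os.getenv("HF_HUB_TOKEN")
if HF_HUB_TOKEN:
    print("HF_HUB_TOKEN found")
else:
    # Raise instead of calling exit(), which would shut down the notebook kernel
    raise RuntimeError("HF_HUB_TOKEN not found; set it in the environment before running this notebook")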

## Step 3: Load the Audio File

We start by loading an audio file with `soundfile`. The pipeline is given 16 kHz audio in this notebook, so the cell below resamples the file if its sample rate differs.

In [None]:
# Load the audio file
audio_filepath = "../../test_pcm.wav"
audio, sample_rate = sf.read(audio_filepath)

# Inspect the shape, type, dtype, and sample rate of the loaded audio
print(f"Audio shape: {audio.shape}")
print(f"Audio type: {type(audio)}")
print(f"Audio dtype: {audio.dtype}")
print(f"Audio sample rate: {sample_rate}")

# Resample to 16 kHz with scipy.signal.resample if the file uses a different rate
new_sample_rate = 16000
if sample_rate != new_sample_rate:
    # Resample audio
    print(f"Resampling audio file to {new_sample_rate}")
    num_samples = int(len(audio) * new_sample_rate / sample_rate)
    audio = resample(audio, num_samples)
    sample_rate = new_sample_rate
    print(f"New sample rate: {new_sample_rate}")

# Plot the audio waveform
plt.figure(figsize=(15, 5))
plt.plot(np.linspace(0, len(audio) / sample_rate, num=len(audio)), audio)
plt.title('Audio Waveform')
plt.xlabel('Time [s]')
plt.ylabel('Amplitude')
plt.show()

# Play the audio from memory (IPython.display was imported as ipd in Step 2)
ipd.Audio(audio, rate=sample_rate)
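
Before loading a file, it can help to check whether it needs any preprocessing at all. The sketch below reads only the WAV header with the standard-library `wave` module and flags files that are not already mono at 16 kHz. It assumes an uncompressed PCM WAV file; `describe_wav` and `needs_preprocessing` are hypothetical helper names, not part of `soundfile` or `pyannote-audio`.

```python
import wave

def describe_wav(path):
    """Return channel count, sample rate, and duration of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "duration_s": frames / rate,
        }

def needs_preprocessing(info, target_rate=16000):
    """True if the file is not already mono at the target sample rate."""
    return info["channels"] != 1 or info["sample_rate"] != target_rate
```

In this notebook it could be called as `needs_preprocessing(describe_wav(audio_filepath))` before the `sf.read` call.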

## Step 4: Apply pyannote-audio Diarization

Next, we apply the pyannote-audio model to the audio file to perform diarization.

In [None]:
# Load the pyannote speaker diarization pipeline
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1", use_auth_token=HF_HUB_TOKEN)

# Set the device for the pipeline
pipeline.to(device)

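# Convert the numpy array to the (channel, time) tensor that pyannote expects.
# soundfile returns (time,) for mono and (time, channels) for multi-channel audio.
samples = torch.tensor(audio, dtype=torch.float32)
samples = samples.unsqueeze(0) if samples.ndim == 1 else samples.T
waveform = {"waveform": samples.to(device), "sample_rate": sample_rate}

# Apply the pipeline to the audio data
with ProgressHook() as hook:
    diarization = pipeline(waveform, hook=hook)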

# Print the diarization result
print(diarization)
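
The result can then be written to txt, json, and srt files. The following is a minimal sketch of one way to do that, not the notebook's original export code: `srt_timestamp` and `save_segments` are hypothetical helpers, the output paths are placeholder names chosen to match the variables used in the cleanup cell, and the commented-out line assumes `diarization` is a `pyannote.core.Annotation` (as returned by pyannote.audio 3.x pipelines). Stand-in segments are used so the block runs on its own.

```python
import json

def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

def save_segments(segments, txt_filepath, json_filepath, srt_filepath):
    """Write (start, end, speaker) tuples to txt, json, and srt files."""
    with open(txt_filepath, "w", encoding="utf-8") as f:
        for start, end, speaker in segments:
            f.write(f"{start:.3f}\t{end:.3f}\t{speaker}\n")

    with open(json_filepath, "w", encoding="utf-8") as f:
        records = [{"start": s, "end": e, "speaker": spk} for s, e, spk in segments]
        json.dump(records, f, indent=2)

    with open(srt_filepath, "w", encoding="utf-8") as f:
        for index, (start, end, speaker) in enumerate(segments, start=1):
            f.write(f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{speaker}\n\n")

# In the notebook, the tuples could be taken from the pipeline output
# (assumption: `diarization` is a pyannote.core.Annotation):
# segments = [(turn.start, turn.end, speaker)
#             for turn, _, speaker in diarization.itertracks(yield_label=True)]
segments = [(0.0, 4.25, "SPEAKER_00"), (4.25, 9.5, "SPEAKER_01")]  # stand-in data

# Hypothetical output paths; only the variable names come from the cleanup cell
txt_filepath = "diarization.txt"
json_filepath = "diarization.json"
srt_filepath = "diarization.srt"
save_segments(segments, txt_filepath, json_filepath, srt_filepath)
```

The srt file labels each cue with the speaker name rather than a transcript, since diarization alone does not produce text.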

## Step 5: Free up Resources
*Remove any local files and free up GPU resources.*

In [None]:
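# Remove the output files, if earlier cells defined and created them
for name in ("txt_filepath", "json_filepath", "srt_filepath"):
    path = globals().get(name)
    if path and os.path.exists(path):
        os.remove(path)
        print(f"Deleted {path}")

# Release cached GPU memory (only meaningful when CUDA is in use)
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    print("GPU memory freed")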

## Conclusion

In this notebook, we demonstrated how to perform speaker diarization using the `pyannote-audio` library. We loaded and preprocessed the audio, applied the pyannote-audio model, and saved the diarization results in txt, json, and srt formats.