# Speech Diarization with Offline Spectral Clustering

## Table of Contents
* [Extract Short Speech Segments](#first-bullet)
* [Create a Compact Voice Vector for Each Segment](#second-bullet)

## Introduction

### Scope

#### What's included
This project investigates and implements a spectral clustering algorithm in the context of a speech diarization pipeline.

This project tests the overall effectiveness of the pipeline before and after reimplementing the spectral clustering algorithm. Furthermore, this project extends the test cases beyond the initial project pipeline.

#### What's not included
This project does not reimplement pieces of an existing speech diarization pipeline (pyannote.pipeline) beyond what's necessary to allow for replacing the spectral clustering portion. This project does not investigate the running time or space complexity of the overall pipeline.

### Clustering comparisons in literature

Offline versus online clustering algorithms.

K-means versus spectral clustering
* K-means performs poorly on non-Gaussian data
* Spectral clustering overcomes this by using a graph cut
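
The contrast in the bullets above can be illustrated with a small NumPy sketch on two concentric rings, a simple non-Gaussian shape. Everything here is an assumption made for illustration: the ring data, the Gaussian affinity with `sigma=0.5`, and the helper names `ring`, `kmeans_two`, `spectral_embedding`, and `agreement` are not part of the pipeline. K-means is run once on the raw points and once on the rows of the leading eigenvectors of the normalized graph Laplacian.

```python
import numpy as np

def ring(radius, n):
    """Return n evenly spaced 2-D points on a circle of the given radius."""
    angles = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    return np.column_stack([radius * np.cos(angles), radius * np.sin(angles)])

def kmeans_two(points, n_iter=50):
    """Lloyd's k-means for k=2 with a deterministic farthest-point start."""
    first = points[0]
    second = points[np.argmax(np.linalg.norm(points - first, axis=1))]
    centers = np.stack([first, second])
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        for k in range(2):
            if np.any(labels == k):  # keep the old center if a cluster empties
                centers[k] = points[labels == k].mean(axis=0)
    return labels

def spectral_embedding(points, sigma=0.5, n_components=2):
    """Rows of the smallest eigenvectors of the normalized graph Laplacian."""
    sq_dists = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=2)
    affinity = np.exp(-sq_dists / (2.0 * sigma ** 2))  # assumed Gaussian affinity
    inv_sqrt_deg = 1.0 / np.sqrt(affinity.sum(axis=1))
    laplacian = np.eye(len(points)) - inv_sqrt_deg[:, None] * affinity * inv_sqrt_deg[None, :]
    _, vectors = np.linalg.eigh(laplacian)  # eigenvalues in ascending order
    embedding = vectors[:, :n_components]
    return embedding / np.linalg.norm(embedding, axis=1, keepdims=True)

def agreement(labels, truth):
    """Fraction of points labeled consistently, up to swapping the two labels."""
    match = np.mean(labels == truth)
    return max(match, 1.0 - match)

points = np.vstack([ring(1.0, 40), ring(4.0, 40)])
truth = np.repeat([0, 1], 40)

kmeans_score = agreement(kmeans_two(points), truth)
spectral_score = agreement(kmeans_two(spectral_embedding(points)), truth)
print("k-means on raw points:", kmeans_score)
print("k-means on spectral embedding:", spectral_score)
```

K-means splits the plane with a straight boundary, so it cannot separate one ring from the other, while the graph-based embedding groups points by connectivity.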

### Diarization Pipeline

1. Speech Activity Detection (SAD)
1. Extract short speech segments
1. Create Compact Voice Vector: Embedding Extraction
1. Cluster the segments and assign to a speaker
1. (Optional) Resegmentation such as Variational Bayesian resegmentation

## Spectral Clustering Analysis

Build an affinity matrix from the voice vectors and eigen-decompose it.
Find the number of clusters using the maximum eigen-gap criterion, then run k-means on the voice vectors projected onto the leading eigenvectors.
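
As a rough illustration of the eigen-gap step, the sketch below builds a cosine affinity matrix, sorts its eigenvalues in descending order, and picks the cluster count at the largest ratio between consecutive eigenvalues. It is a simplified sketch under stated assumptions, not the clusterer used later in this notebook: the function name `estimate_num_speakers` is hypothetical, the raw cosine affinity is used without any refinement, and treating `stop_eigenvalue` as a floor on every eigenvalue considered is an assumption. The synthetic vectors stand in for voice vectors.

```python
import numpy as np

def estimate_num_speakers(vectors, stop_eigenvalue=0.01):
    """Estimate the cluster count with a max eigen-gap rule (simplified sketch)."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    affinity = unit @ unit.T  # cosine similarity between all pairs of vectors

    eigenvalues, eigenvectors = np.linalg.eigh(affinity)
    order = np.argsort(eigenvalues)[::-1]  # largest eigenvalue first
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

    # Assumption: ignore eigenvalues at or below the stop threshold so that
    # numerically zero eigenvalues cannot produce a spurious huge gap.
    kept = eigenvalues[eigenvalues > stop_eigenvalue]
    if len(kept) < 2:
        return 1, eigenvectors[:, :1]

    ratios = kept[:-1] / kept[1:]  # gap between consecutive eigenvalues
    k = int(np.argmax(ratios)) + 1
    return k, eigenvectors[:, :k]  # spectral embedding: one row per input vector

# Synthetic stand-in for voice vectors: three noisy directions in 8 dimensions.
rng = np.random.default_rng(0)
directions = np.eye(8)[:3]
vectors = np.vstack([d + 0.05 * rng.standard_normal((10, 8)) for d in directions])

k, embedding = estimate_num_speakers(vectors)
print(k, embedding.shape)
```

The rows of `embedding` are what k-means would then cluster into `k` groups.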

## Test Plan

### Spectral Clustering testing

#### Expected results

The reimplementation should perform better than the publicly available implementation, which differs from the authors' actual implementation.

The training, dev, and eval datasets are not available, and no tuning parameter suggestions are provided.

### Diarization Pipeline testing

Metric: diarization error rate (DER), computed with the pyannote.metrics library.
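
DER is the sum of missed speech, false alarm, and speaker confusion, divided by the total amount of reference speech. The sketch below computes a frame-level version in plain Python to make the three error terms concrete. It is an illustration only: the function name `frame_der` and the one-label-per-frame input are assumptions, and the brute-force search over speaker mappings is only practical for a handful of speakers. The pipeline tests themselves rely on pyannote.metrics, not on this function.

```python
from itertools import permutations

def frame_der(reference, hypothesis):
    """Frame-level DER. Each list holds one speaker label per frame (None = no speech)."""
    pairs = list(zip(reference, hypothesis))
    total = sum(r is not None for r, _ in pairs)
    if total == 0:
        raise ValueError("reference contains no speech frames")

    missed = sum(r is not None and h is None for r, h in pairs)
    false_alarm = sum(r is None and h is not None for r, h in pairs)
    both = [(r, h) for r, h in pairs if r is not None and h is not None]

    ref_speakers = sorted({r for r, _ in both})
    hyp_speakers = sorted({h for _, h in both})
    # Pad with None so every hypothesis speaker can be mapped (possibly to nobody).
    candidates = ref_speakers + [None] * max(0, len(hyp_speakers) - len(ref_speakers))

    # Try every one-to-one mapping of hypothesis speakers onto reference speakers
    # and keep the one that explains the most frames.
    best_correct = 0
    for perm in permutations(candidates, len(hyp_speakers)):
        mapping = dict(zip(hyp_speakers, perm))
        correct = sum(mapping[h] == r for r, h in both)
        best_correct = max(best_correct, correct)

    confusion = len(both) - best_correct
    return (missed + false_alarm + confusion) / total

reference = ["A", "A", "A", "B", "B", None, "B", "B"]
hypothesis = ["x", "x", "y", "y", "y", "y", None, "y"]
der = frame_der(reference, hypothesis)
print(round(der, 4))
```

In this toy example there is one missed frame, one false-alarm frame, and one confused frame against seven reference speech frames.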

Datasets
* CALLHOME
* CALLHOME American English
* 2003 NIST Rich Transcription (RT-03)
  * Telephone Speech Part


## Methods

Import a test file in the wav format. Convert from other formats as necessary, for example with ffmpeg (`ffmpeg -i input.mp3 output.wav`).
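
If several files need converting, the ffmpeg call can be wrapped in a small helper. This is a minimal sketch that assumes `ffmpeg` is installed and on the `PATH`; the helper names `ffmpeg_command` and `convert_to_wav` are hypothetical, and the `-y` flag (overwrite an existing output file) is an optional choice.

```python
import subprocess
from pathlib import Path

def ffmpeg_command(source, target_dir="data"):
    """Build the ffmpeg argument list that converts one audio file to wav."""
    source = Path(source)
    target = Path(target_dir) / (source.stem + ".wav")
    # -y overwrites the target if it already exists; -i names the input file.
    return ["ffmpeg", "-y", "-i", str(source), str(target)]

def convert_to_wav(source, target_dir="data"):
    """Run ffmpeg and return the path of the converted wav file."""
    command = ffmpeg_command(source, target_dir)
    subprocess.run(command, check=True)  # raises CalledProcessError on failure
    return command[-1]

# Example (requires ffmpeg and an existing input file):
# wav_path = convert_to_wav("recordings/130126_009.mp3")
```

The returned path can then be used as the `'audio'` entry of the test file dictionary.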


In [None]:
test_file = {'uri': 'filename', 'audio': 'data/130126_009.wav'}


### Speech Activity Detection (SAD)

Use a pre-trained speech activity detection model from pyannote.audio.

In [None]:
from pyannote.audio.labeling.extraction import SequenceLabeling
SAD_MODEL = ('lib/pyannote-audio/tutorials/models/speech_activity_detection/train/'
             'AMI.SpeakerDiarization.MixHeadset.train/weights/0280.pt')

sad = SequenceLabeling(model=SAD_MODEL)
sad_scores = sad(test_file)

### Extract Short Speech Segments <a class="anchor" id="first-bullet"></a>

In [None]:
from pyannote.audio.signal import Binarize
binarize = Binarize(offset=0.94, onset=0.70, log_scale=True)
speech = binarize.apply(sad_scores, dimension=1)


### Create a Compact Voice Vector for Each Segment <a class="anchor" id="second-bullet"></a>

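Use a pre-trained speaker change detection (SCD) model to find the points where the speaker changes, split the detected speech into speaker turns at those points, and keep only the turns longer than 2 seconds.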

In [None]:
from pyannote.audio.signal import Peak
from pyannote.core import Timeline
SCD_MODEL = ('lib/pyannote-audio/tutorials/models/speaker_change_detection/train/'
             'AMI.SpeakerDiarization.MixHeadset.train/weights/0870.pt')

scd = SequenceLabeling(model=SCD_MODEL)
scd_scores = scd(test_file)
peak = Peak(alpha=0.08, min_duration=0.40, log_scale=True)
partition = peak.apply(scd_scores, dimension=1)
speech_turns = partition.crop(speech)

long_turns = Timeline(segments=[s for s in speech_turns if s.duration > 2.])

Then, extract an embedded vector for each segment using a pre-trained classifier.

In [None]:
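import numpy as np
from pyannote.audio.embedding.extraction import SequenceEmbedding
EMB_MODEL = ('lib/pyannote-audio/tutorials/models/speaker_embedding/train/'
             'VoxCeleb.SpeakerVerification.VoxCeleb1.train/weights/2000.pt')
emb = SequenceEmbedding(model=EMB_MODEL, duration=1., step=0.5)
embeddings = emb(test_file)

# Average the embeddings inside each long speech turn into one vector per turn.
# NOTE: test_file['annotation'] must hold the reference annotation; the
# test_file dict defined earlier does not set it.
X, Y = [], []
for segment in long_turns:
    x = embeddings.crop(segment, mode='strict')
    X.append(np.mean(x, axis=0))
    # Reference speaker who talks the most during this segment
    y = test_file['annotation'].argmax(segment)
    Y.append(y)

X = np.vstack(X)
# Map the reference speaker labels to integer ids
_, y_true = np.unique(Y, return_inverse=True)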

### Apply Spectral Clustering Algorithm to Identify Speakers

In [None]:
import lib.SpectralCluster.spectralcluster as sc
clusterer = sc.spectral_clusterer.SpectralClusterer(
    p_percentile=0.2,
    gaussian_blur_sigma=0,
    stop_eigenvalue=0.01)

labels = clusterer.predict(X)
