# GA5 DLP  

`This guide will walk you through creating the Python notebook for Week 6's assignment on speaker diarization and identification. You will be building a pipeline that transcribes audio, extracts speaker embeddings, clusters these embeddings to identify potential speakers, and then attempts to label these speakers by comparing them to known speaker profiles.`

## I. Install necessary libraries

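Note: Version pins matter here. As the install log below shows, installing `pyannote.audio` replaces a pinned `speechbrain==0.5.16` with `speechbrain>=1.0.0`.  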

In [35]:
!pip install speechbrain==0.5.16
!pip install faster-whisper
!pip install pyannote.audio
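!pip install openai-whisper  # "import whisper" comes from openai-whisper; the PyPI package named "whisper" is an unrelated project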
!pip install ctranslate2==4.4.0
!pip install librosa
!pip install soundfile
!pip install scikit-learn  # the "sklearn" name on PyPI is a deprecated alias

Collecting speechbrain>=1.0.0 (from pyannote.audio)
  Using cached speechbrain-1.0.2-py3-none-any.whl.metadata (23 kB)
Using cached speechbrain-1.0.2-py3-none-any.whl (824 kB)
Installing collected packages: speechbrain
  Attempting uninstall: speechbrain
    Found existing installation: speechbrain 0.5.16
    Uninstalling speechbrain-0.5.16:
      Successfully uninstalled speechbrain-0.5.16
Successfully installed speechbrain-1.0.2


Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/pip/_internal/cli/base_command.py", line 179, in exc_logging_wrapper
    status = run_func(*args)
             ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pip/_internal/cli/req_command.py", line 67, in wrapper
    return func(self, options, args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pip/_internal/commands/install.py", line 447, in run
    conflicts = self._determine_conflicts(to_install)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pip/_internal/commands/install.py", line 578, in _determine_conflicts
    return check_install_conflicts(to_install)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pip/_internal/operations/check.py", line 101, in check_install_conflicts
    package_set, _ = create_package_set_from_installed()
              

In [4]:
# !pip install speechbrain==0.5.16

Collecting speechbrain==0.5.16
  Using cached speechbrain-0.5.16-py3-none-any.whl.metadata (23 kB)
Using cached speechbrain-0.5.16-py3-none-any.whl (630 kB)
Installing collected packages: speechbrain
  Attempting uninstall: speechbrain
    Found existing installation: speechbrain 1.0.2
    Uninstalling speechbrain-1.0.2:
      Successfully uninstalled speechbrain-1.0.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pyannote-audio 3.3.2 requires speechbrain>=1.0.0, but you have speechbrain 0.5.16 which is incompatible.
Successfully installed speechbrain-0.5.16
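
The two logs above show the pin being undone and then restored: installing `pyannote.audio` pulls in `speechbrain>=1.0.0`, and re-pinning 0.5.16 afterwards produces a dependency conflict. One way to see which versions actually ended up in the environment is to query the package metadata. The sketch below uses only the standard library; the helper name `installed_versions` is an illustrative assumption, not part of the assignment.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names as used on PyPI (these can differ from import names).
report = installed_versions(
    ["speechbrain", "pyannote.audio", "faster-whisper", "ctranslate2", "librosa"]
)
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

Running this right after the install cell makes it clear whether the `speechbrain` pin survived.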


## II. Import libraries  

In [179]:
import librosa
from faster_whisper import WhisperModel
import torch
import whisper
from pyannote.audio.pipelines.speaker_verification import PretrainedSpeakerEmbedding
from pyannote.audio import Audio
from pyannote.core import Segment
import speechbrain
import ctranslate2

import numpy as np
import pandas as pd

import datetime
import re
import time

import os
from pathlib import Path
import traceback

from sklearn.cluster import KMeans
# from sklearn.cluster import AgglomerativeClustering
from scipy.spatial.distance import cdist
from sklearn.metrics.pairwise import cosine_similarity

print(f"Imports completed successfully at {datetime.datetime.now()}")

Imports completed successfully at 2025-02-26 21:17:01.574921


In [180]:
print(torch.__version__)
# print(whisper.__version__)
print(speechbrain.__version__)
print(ctranslate2.__version__)
print(librosa.__version__)

2.5.1+cu124
1.0.2
4.4.0
0.10.2.post1


## III. Define helper functions  

In [181]:
def convert_time(secs):
  '''Convert a time in seconds into a human-readable h:mm:ss timedelta, rounded to the nearest second.'''
  return datetime.timedelta(seconds=round(secs))
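
As a quick illustration of what `convert_time` returns, the sketch below repeats the helper and prints a few values. The input times are made up for the example. Because the helper rounds to whole seconds, differences between converted times are only accurate to about a second.

```python
import datetime


def convert_time(secs):
    """Same helper as above: round to whole seconds and wrap in a timedelta."""
    return datetime.timedelta(seconds=round(secs))


print(convert_time(75.4))     # 0:01:15
print(convert_time(3725.6))   # 1:02:06

# Hypothetical start and end times: the sub-second detail is dropped
# before the subtraction, so the result is a whole number of seconds.
print((convert_time(36.4) - convert_time(0.3)).total_seconds())  # 36.0
```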

In [182]:
def transcribe_audio(audio_file, model_name="base", language="en", beam_size=5, best_of=5):
  '''This function handles the audio transcription process using the faster-whisper library.'''
  try:
    # Load the model (note: this is a faster-whisper model)
    model = WhisperModel(model_name, compute_type="int8")   # Optionally can specify device='cuda' to use GPU if available

    # # Note the start time
    # time_start = time.time()

    # # Get duration
    # audio_data, sample_rate = librosa.load(audio_file, sr=16000, mono=True)
    # # duration = librosa.get_duration(y=audio_data, sr=sample_rate)
    # duration = len(audio_data) / sample_rate

    # Transcribe the audio
    segments_raw, info = model.transcribe(audio_file,
                                      task="transcribe",
                                      language=language,
                                      beam_size=beam_size,
                                      best_of=best_of)

    # Format the output of transcription into a list of dictionaries
    # Each dictionary should represent a segment and contain start time, end time and (optional) text of the segment
    segments = []
    for segment in segments_raw:
      segments.append({
          "start": segment.start,
          "end": segment.end,
          "text": segment.text
      })

    return segments

  except Exception as e:
    print(f"An error occurred during transcription: {e}")
    traceback.print_exc()
    return None

In [183]:
def extract_segment_embedding(audio_file, segment, total_duration, embedding_model):
  '''This function extracts a speaker embedding for a specific segment of an audio file using a pre-trained speaker embedding model.'''
  try:
    # Crop the audio file to the segment
    start = segment['start']
    end = min(total_duration, segment["end"])
    clip = Segment(start, end)

    audio = Audio()
    waveform, sample_rate = audio.crop(audio_file, clip)

    # # Instantiate embedding model (part of pyannote library)    # Will be done in the final pipeline
    # embedding_model = PretrainedSpeakerEmbedding(
    #     "speechbrain/spkrec-ecapa-voxceleb",
    #     device=torch.device("cuda" if torch.cuda.is_available() else "cpu"))

    # Extract embedding
    # The speaker embedding model expects input with a batch dimension. Add a batch dimension to the waveform using 'waveform[None]'
    embedding = embedding_model(waveform[None])

    # Remove the batch dimension from the embedding using squeeze() to get a 1D embedding vector.
    embedding = embedding.squeeze()

    # Return the speaker embedding as a NumPy array (Use only if running on GPU!)
    # embedding = embedding.detach().cpu().numpy()

    return embedding

  except Exception as e:
    print(f"An error occurred during embedding extraction: {e}")
    traceback.print_exc()
    return None
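
The `waveform[None]` and `squeeze()` steps in the function above only reshape arrays. The sketch below shows the shape changes with NumPy placeholders; the zero-filled arrays stand in for real audio and a real model output, and the 2-second clip length is an arbitrary assumption.

```python
import numpy as np

# Hypothetical mono clip: 1 channel, 2 seconds at 16 kHz.
waveform = np.zeros((1, 32000), dtype=np.float32)

# Indexing with None adds a leading axis, like np.expand_dims(waveform, axis=0).
batched = waveform[None]
print(batched.shape)  # (1, 1, 32000): (batch, channel, samples)

# Stand-in for the model output: one 192-dimensional vector per batch item.
fake_embedding = np.zeros((1, 192), dtype=np.float32)
print(fake_embedding.squeeze().shape)  # (192,)
```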

In [184]:
def compute_segment_embeddings(audio_file, segments, embedding_model):
  '''This function iterates through all transcribed segments of an audio file and computes speaker embeddings for each segment.'''
  try:
    # Note the start time
    time_start = time.time()

    # Get total duration of the audio file
    audio_data, sample_rate = librosa.load(audio_file, sr=16000, mono=True)
    # total_duration = librosa.get_duration(y=audio_data, sr=sample_rate)
    total_duration = len(audio_data) / sample_rate

    # Compute one embedding per transcribed segment
    segment_embeddings = []
    for segment in segments:
      embedding = extract_segment_embedding(audio_file, segment, total_duration, embedding_model)
      segment_embeddings.append(embedding)

    # Stack all the embeddings into a 2D NumPy array where each row is an embedding vector.
    segment_embeddings = np.stack(segment_embeddings)
    print(f"Segment embeddings shape: {segment_embeddings.shape}")
    print(f"Time taken to compute segment embeddings: {time.time() - time_start}")

    # Return the 2D array of segment embeddings and the duration of the audio file.
    return segment_embeddings, total_duration

  except Exception as e:
    print(f"An error occurred during embedding computation: {e}")
    traceback.print_exc()
    return None

In [185]:
def cluster_embeddings(embeddings, n_clusters):
  '''This function clusters the computed segment embeddings using KMeans to group segments likely spoken by the same speaker.'''
  try:
    # Fit a KMeans model to embeddings
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    kmeans.fit_predict(embeddings)

    # Get labels and centroids
    labels = kmeans.labels_
    centroids = kmeans.cluster_centers_

    print(f"KMeans labels: {labels}")
    print(f"KMeans centroids shape: {centroids.shape}")
    return labels, centroids

  except Exception as e:
    print(f"An error occurred during clustering: {e}")
    traceback.print_exc()
    return None

In [186]:
def compute_cluster_averages(embeddings, labels, n_clusters):
  '''Calculate the average embedding for each cluster. This average embedding represents a cluster's speaker profile.'''
  try:
    # Calculate the average embedding for each cluster
    cluster_average_embeddings = {}
    for i in range(n_clusters):
      cluster_embeddings = embeddings[labels == i]
      cluster_average = np.mean(cluster_embeddings, axis=0)
      cluster_average_embeddings[i] = cluster_average

    # print(f"Number of clusters in cluster_average_embeddings: {len(cluster_average_embeddings)}")
    # print(f"Keys in cluster_average_embeddings: {cluster_average_embeddings.keys()}")
    # print(f"Shape of one cluster_average_embedding: {cluster_average_embeddings[0].shape}")
    return cluster_average_embeddings

  except Exception as e:
    print(f"An error occurred during computing cluster averages: {e}")
    traceback.print_exc()
    return None
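
The averaging step relies on boolean-mask indexing: `labels == i` selects the rows that belong to cluster `i`. A minimal sketch on toy 2-D vectors (not real embeddings) is shown below. Iterating over `np.unique(labels)` instead of `range(n_clusters)` is a small variation that skips any cluster ID with no members.

```python
import numpy as np

# Toy data: four 2-D "embeddings" and the cluster label of each row.
embeddings = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]])
labels = np.array([0, 0, 1, 1])

averages = {}
for cluster_id in np.unique(labels):
    mask = labels == cluster_id  # True for rows assigned to this cluster
    averages[int(cluster_id)] = embeddings[mask].mean(axis=0)

print(averages[0])  # [2. 0.]
print(averages[1])  # [0. 3.]
```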

In [187]:
def load_known_speaker_embeddings(known_speaker_files, embedding_model):
  '''To load audio files of known speakers, compute their speaker embeddings, and store them for speaker identification.'''
  try:
    known_speaker_embeddings = {}
    for speaker_file in known_speaker_files:
      # Extract a speaker label from the filename (e.g., if the file is "speaker_A.wav", the label is "speaker_A").
      speaker_label = os.path.splitext(os.path.basename(speaker_file))[0]

      # Load the audio file using librosa.load()
      audio_data, sample_rate = librosa.load(speaker_file, sr=16000, mono=True)
      duration = len(audio_data) / sample_rate

      # Create a segment object covering the entire duration of the loaded audio.
      # Crop the audio using this segment
      segment = Segment(0, duration)

      audio = Audio()
      waveform, sample_rate = audio.crop(speaker_file, segment)

      # Extract embedding
      embedding = embedding_model(waveform[None])
      embedding = embedding.squeeze()
      # embedding = embedding.detach().cpu().numpy()    # Use this line only if you are running code on GPU

      # Store the embedding in the dictionary with the corresponding speaker label as the key.
      known_speaker_embeddings[speaker_label] = embedding

    # Return a dictionary of known speaker embeddings where keys are speaker labels and values are their embedding vectors.
    return known_speaker_embeddings

  except Exception as e:
    print(f"An error occurred during loading known speaker embeddings: {e}")
    traceback.print_exc()
    return None

In [188]:
def assign_speaker_labels(cluster_avg_embeddings, known_speaker_embeddings, similarity_threshold=0.7):
  '''Compare average embeddings of clusters with embeddings of known speakers to assign speaker labels to clusters.'''
  try:
    # Initialize an empty dictionary to map cluster IDs to speaker labels and cosine similarity.
    speaker_labels = {}
    similarity_scores_per_cluster = {}

    # Iterate through each cluster ID
    for cluster_id, cluster_avg_embedding in cluster_avg_embeddings.items():
      # Default label is 'Unknown' until a known speaker clears the threshold
      speaker_labels[cluster_id] = ("Unknown", 0)
      # Similarity of this cluster to every known speaker, in dictionary order
      similarity_scores = []

      # Iterate through each known speaker embedding
      for speaker_label, speaker_embedding in known_speaker_embeddings.items():
        # Calculate the cosine similarity between the cluster average embedding and the speaker embedding.
        # Use cosine_similarity from sklearn.metrics.pairwise
        # print(f"Shape of cluster_avg_embedding: {cluster_avg_embedding.shape}")
        # print(f"Shape of speaker_embedding: {speaker_embedding.shape}")

        # Remember to reshape embeddings to 2D arrays before using cosine_similarity
        similarity = cosine_similarity(cluster_avg_embedding.reshape(1, -1), speaker_embedding.reshape(1, -1))
        # print(f"Similarity: {similarity}")
        similarity_scores.append(similarity)

        # If the similarity is above the threshold, assign the cluster to the speaker label
        if similarity >= similarity_threshold:
          print(f"Similarity: {similarity}")
          speaker_labels[cluster_id] = [speaker_label, similarity]

      similarity_scores_per_cluster[cluster_id] = similarity_scores
      print(f"Similarity scores for cluster {cluster_id}: {similarity_scores}")
    return speaker_labels, similarity_scores_per_cluster

  except Exception as e:
    print(f"An error occurred during assigning speaker labels: {e}")
    traceback.print_exc()
    return None
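
Two details of the function above are worth noting: each similarity is a 1×1 array rather than a plain number, and when more than one known speaker clears the threshold, the last one checked wins rather than the best one. The sketch below is one possible alternative, not the assignment's reference solution. It uses scalar scores and always picks the highest-scoring speaker. The names `cosine` and `best_match` and the toy 2-D vectors are assumptions made for illustration.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity of two 1-D vectors, returned as a Python float."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def best_match(cluster_embedding, known, threshold=0.7):
    """Return (label, score) for the most similar known speaker.

    The label falls back to "Unknown" when the best score is below threshold.
    """
    scores = {label: cosine(cluster_embedding, emb) for label, emb in known.items()}
    label = max(scores, key=scores.get)
    score = scores[label]
    return (label if score >= threshold else "Unknown", score)


# Toy 2-D vectors standing in for 192-dimensional embeddings.
known = {"speaker_A": np.array([1.0, 0.0]), "speaker_B": np.array([0.0, 1.0])}

print(best_match(np.array([0.9, 0.1]), known))   # close to speaker_A
print(best_match(np.array([-1.0, 0.5]), known))  # best score is below 0.7, so "Unknown"
```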

In [189]:
def run_pipeline(audio_file, embedding_model, known_speaker_embeddings, n_clusters=3, whisper_model_name="base", similarity_threshold=0.7):
  '''This function orchestrates the entire speaker diarization and identification pipeline, calling all the
  previously defined functions in the correct sequence.'''
  try:
    # Print a separator and a message indicating the processing of the current audio_file
    print("-" * 50)
    print(f"Processing audio file: {audio_file}")

    # Transcribe audio
    segments = transcribe_audio(audio_file, model_name=whisper_model_name)

    # Compute segment embeddings
    segment_embeddings, total_duration = compute_segment_embeddings(audio_file, segments, embedding_model)

    # Get cluster embeddings
    labels, centroids = cluster_embeddings(segment_embeddings, n_clusters)

    # Compute cluster averages
    cluster_avg_embeddings = compute_cluster_averages(segment_embeddings, labels, n_clusters)

    # Load known speaker embeddings
    # known_speaker_embeddings = load_known_speaker_embeddings(known_speaker_embeddings, embedding_model)

    # Assign speaker labels
    speaker_labels, similarity_scores_per_cluster = assign_speaker_labels(cluster_avg_embeddings, known_speaker_embeddings, similarity_threshold)
    print(f"The Speaker labels dictionary is: {speaker_labels}")

    # Annotate each segment with its cluster ID and assign speaker ID
    for i, segment in enumerate(segments):
      segment['cluster_id'] = labels[i]
      segment['speaker_id'] = speaker_labels[labels[i]][0]
      segment['similarity'] = speaker_labels[labels[i]][1]
      segment['max_similarity_score'] = max(similarity_scores_per_cluster[labels[i]])

    # Format the results into a Pandas DataFrame. The DataFrame should contain columns for "Start", "End", "Cluster", and "Speaker_ID".
    # Convert start and end times to hh:mm:ss format using convert_time.
    df = pd.DataFrame(segments)
    df['Start'] = df['start'].apply(convert_time)
    df['End'] = df['end'].apply(convert_time)
    df = df.drop(columns=['start', 'end'])
    print(df)

    # Group segments by cluster and speaker ID, keeping the first Start and last End of each group.
    # Note: this merges all segments of a speaker, not only consecutive ones.
    df_grouped = df.groupby(['cluster_id', 'speaker_id']).agg({'Start': 'first', 'End': 'last'}).reset_index()
    # Print the resulting DataFrame.
    print(df_grouped)

    # Print separator
    print("-" * 50)

    # Return the per-segment DataFrame, the number of segments transcribed, the cluster centroids, and the grouped DataFrame.
    return df, len(segments), centroids, df_grouped

  except Exception as e:
    print(f"An error occurred during the final pipeline execution: {e}")
    traceback.print_exc()
    return None
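
The `groupby(['cluster_id', 'speaker_id'])` call in the pipeline collapses every segment of a speaker into one row, even when another speaker talks in between. If turn-by-turn output is wanted, one option is to merge only adjacent segments with the same speaker. The sketch below does this with `itertools.groupby` on plain dictionaries. The helper name `merge_consecutive` and the segment values are hypothetical and are not the notebook's output.

```python
from itertools import groupby

# Hypothetical per-segment rows (times in seconds).
segments = [
    {"start": 0.0, "end": 4.0, "speaker_id": "speaker_B"},
    {"start": 4.0, "end": 9.5, "speaker_id": "speaker_C"},
    {"start": 9.5, "end": 12.0, "speaker_id": "speaker_C"},
    {"start": 12.0, "end": 15.0, "speaker_id": "speaker_B"},
]


def merge_consecutive(segments):
    """Merge adjacent segments that share a speaker_id into single turns."""
    turns = []
    # itertools.groupby only groups neighbouring items, which is what we want here.
    for speaker, run in groupby(segments, key=lambda s: s["speaker_id"]):
        run = list(run)
        turns.append(
            {"speaker_id": speaker, "start": run[0]["start"], "end": run[-1]["end"]}
        )
    return turns


for turn in merge_consecutive(segments):
    print(turn)
```

With these inputs, `speaker_B` appears as two separate turns instead of a single row spanning the whole recording.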

## IV. Loading models and data  

In [190]:
# Set device to CPU
device = torch.device('cpu')

In [191]:
speechbrain.__version__

'1.0.2'

In [192]:
# Load embedding model
embedding_model = PretrainedSpeakerEmbedding(
    "speechbrain/spkrec-ecapa-voxceleb",
    device=device)
embedding_model


<pyannote.audio.pipelines.speaker_verification.SpeechBrainPretrainedSpeakerEmbedding at 0x7eb870634290>

In [193]:
# Get a file from my Google drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [194]:
# Untar wavs.tar.gz
!tar -xvf /content/drive/MyDrive/DLP/GA/week_6/wavs.tar.gz

speaker_A.wav
speaker_B.wav
speaker_C.wav
speaker_D.wav
speaker_E.wav
sample.wav
sample_noisy.wav


In [195]:
# Get known speaker embeddings
known_speaker_files = ["speaker_A.wav", "speaker_B.wav", "speaker_C.wav", "speaker_D.wav", "speaker_E.wav"]

known_speaker_embeddings = load_known_speaker_embeddings(known_speaker_files, embedding_model)

len(known_speaker_embeddings), known_speaker_embeddings.keys(), known_speaker_embeddings['speaker_A'].shape

(5,
 dict_keys(['speaker_A', 'speaker_B', 'speaker_C', 'speaker_D', 'speaker_E']),
 (192,))

## V. Processing audio samples  

### Process 'clean' audio sample  

In [196]:
# Set the clean audio file
clean_audio = "sample.wav"
clean_audio

'sample.wav'

In [197]:
# Run the pipeline with clean_audio, embedding_model, known_speaker_embeddings, n_clusters=3, whisper_model_name="base", similarity_threshold=0.7

df_clean, n_segments_clean, centroids_clean, df_clean_grouped = run_pipeline(clean_audio,
                                                           embedding_model,
                                                           known_speaker_embeddings,
                                                           n_clusters=3,
                                                           whisper_model_name="base",
                                                           similarity_threshold=0.7)

print(f"Number of segments transcribed: {n_segments_clean}")
print(f"Cluster centroids: {centroids_clean.shape}")
print(df_clean)

--------------------------------------------------
Processing audio file: sample.wav
Segment embeddings shape: (6, 192)
Time taken to compute segment embeddings: 4.107895374298096
KMeans labels: [2 1 0 1 0 2]
KMeans centroids shape: (3, 192)
Similarity scores for cluster 0: [array([[0.68920267]], dtype=float32), array([[0.07233772]], dtype=float32), array([[0.07125822]], dtype=float32), array([[0.03393675]], dtype=float32), array([[0.21205282]], dtype=float32)]
Similarity: [[0.79727113]]
Similarity scores for cluster 1: [array([[0.17268004]], dtype=float32), array([[0.07001918]], dtype=float32), array([[0.79727113]], dtype=float32), array([[0.16717166]], dtype=float32), array([[0.10807592]], dtype=float32)]
Similarity: [[0.85755897]]
Similarity scores for cluster 2: [array([[-0.06704538]], dtype=float32), array([[0.85755897]], dtype=float32), array([[0.17290641]], dtype=float32), array([[0.25017476]], dtype=float32), array([[0.2543853]], dtype=float32)]
The Speaker labels dictionary is

# Q1  
`For the clean audio sample ("sample.wav"), the three clusters have best cosine similarity scores of a, b, and c. Calculate the average cosine similarity score across all clusters (round to three decimal places).`

In [198]:
np.mean([0.85755897, 0.79727113, 0.68920267])

0.7813442566666667

In [199]:
# Q1
avg_best_cos_sim_clean = np.unique(df_clean['max_similarity_score']).mean()
print(f"{avg_best_cos_sim_clean[0][0]: .3f}")

 0.781


In [200]:
np.unique(df_clean['max_similarity_score']).mean()

array([[0.78134423]], dtype=float32)

# Q3  

`If a similarity threshold of 0.7 is required for reliable speaker identification, how many clusters in the clean sample meet or exceed this threshold?`

In [201]:
df_clean[df_clean['max_similarity_score'] >= 0.7]['cluster_id'].nunique()

2

In [202]:
df_clean[df_clean['max_similarity_score'] >= 0.7][['cluster_id', 'max_similarity_score']]

| | cluster_id | max_similarity_score |
|---|---|---|
| 0 | 2 | [[0.85755897]] |
| 1 | 1 | [[0.79727113]] |
| 3 | 1 | [[0.79727113]] |
| 5 | 2 | [[0.85755897]] |


In [203]:
df_clean_grouped['speaker_id'].value_counts()

| speaker_id | count |
|---|---|
| Unknown | 1 |
| speaker_C | 1 |
| speaker_B | 1 |

# Q6  

`Using the diarization results for the clean sample, compute the total duration (in seconds) covered by all the transcribed segments.`

In [204]:
(max(df_clean['End']) - min(df_clean['Start'])).total_seconds()

36.0
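
The cell above measures the span from the earliest start to the latest end, using times already rounded to whole seconds. If "covered" is instead read as the sum of the individual segment durations, pauses between segments are excluded and the result can be smaller. Which reading the question intends is not stated here. The sketch below contrasts the two on made-up segment boundaries, not the notebook's data.

```python
# Hypothetical segment boundaries in seconds, with a pause between the first two.
segments = [(0.0, 10.2), (12.0, 20.5), (20.5, 36.0)]

# Span: earliest start to latest end, pauses included.
span = max(end for _, end in segments) - min(start for start, _ in segments)

# Spoken time: sum of per-segment durations, pauses excluded.
spoken = sum(end - start for start, end in segments)

print(f"span:   {span:.1f} s")    # span:   36.0 s
print(f"spoken: {spoken:.1f} s")  # spoken: 34.2 s
```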

# Q8  

`Examining the clustering results for the clean sample, which cluster shows the highest confidence (i.e. the highest cosine similarity), and what known speaker label is it matched with?`

In [205]:
best_scores = df_clean['max_similarity_score'].apply(lambda s: s.item())  # scalar score per row; avoids assuming row 0 holds the maximum
df_clean[best_scores == best_scores.max()][['cluster_id', 'speaker_id', 'max_similarity_score']]

| | cluster_id | speaker_id | max_similarity_score |
|---|---|---|---|
| 0 | 2 | speaker_B | [[0.85755897]] |
| 5 | 2 | speaker_B | [[0.85755897]] |


### Process 'noisy' audio sample  

In [206]:
# Set the noisy audio file
noisy_audio = "sample_noisy.wav"
noisy_audio

'sample_noisy.wav'

In [207]:
# Run the pipeline with noisy_audio, embedding_model, known_speaker_embeddings, n_clusters=3, whisper_model_name="base", similarity_threshold=0.7

df_noisy, n_segments_noisy, centroids_noisy, df_noisy_grouped = run_pipeline(noisy_audio,
                                                           embedding_model,
                                                           known_speaker_embeddings,
                                                           n_clusters=3,
                                                           whisper_model_name="base",
                                                           similarity_threshold=0.7)

print(f"Number of segments transcribed: {n_segments_noisy}")
print(f"Cluster centroids: {centroids_noisy.shape}")
print(df_noisy)

--------------------------------------------------
Processing audio file: sample_noisy.wav
Segment embeddings shape: (8, 192)
Time taken to compute segment embeddings: 3.767221689224243
KMeans labels: [2 1 0 1 1 0 0 2]
KMeans centroids shape: (3, 192)
Similarity scores for cluster 0: [array([[0.5812957]], dtype=float32), array([[0.0349585]], dtype=float32), array([[-0.06661911]], dtype=float32), array([[0.01783739]], dtype=float32), array([[0.07180018]], dtype=float32)]
Similarity scores for cluster 1: [array([[0.04861312]], dtype=float32), array([[0.1614371]], dtype=float32), array([[0.5451888]], dtype=float32), array([[0.15152594]], dtype=float32), array([[0.01892857]], dtype=float32)]
Similarity scores for cluster 2: [array([[-0.07635162]], dtype=float32), array([[0.6756178]], dtype=float32), array([[0.06061047]], dtype=float32), array([[0.13118726]], dtype=float32), array([[0.15789184]], dtype=float32)]
The Speaker labels dictionary is: {0: ('Unknown', 0), 1: ('Unknown', 0), 2: ('U

# Q2  
`For the noisy audio sample ("sample_noisy.wav"), the clusters have best similarity scores of x, y, z. Calculate the average cosine similarity score across these clusters (round to three decimal places).`

In [208]:
np.mean([0.5812957, 0.5451888, 0.6756178])

0.6007007666666667

In [209]:
# Q2
avg_best_cos_sim_noisy = np.unique(df_noisy['max_similarity_score']).mean()
print(f"{avg_best_cos_sim_noisy[0][0]: .3f}")

 0.601


# Q4  

`Using the same similarity threshold (0.7), how many clusters in the noisy sample would be considered reliably matched?`

In [210]:
df_noisy[df_noisy['max_similarity_score'] >= 0.7]['cluster_id'].nunique()

0

# Q5  

`The transcription produced m segments for "sample.wav" and n segments for "sample_noisy.wav". Calculate the percentage increase in the number of segments in the noisy sample relative to the clean sample (round to two decimal places).`  

In [211]:
df_clean.shape[0], df_noisy.shape[0]

(6, 8)

In [212]:
(df_noisy.shape[0] - df_clean.shape[0])/df_clean.shape[0] * 100

33.33333333333333

# Q7  

`If the similarity threshold were reduced from 0.7 to 0.65, which cluster in the noisy sample would now be assigned a known speaker label instead of "Unknown"?`

In [213]:
# Run the pipeline with noisy_audio with similarity threshold 0.65

df_noisy_65, n_segments_noisy_65, centroids_noisy_65, df_noisy_65_grouped = run_pipeline(noisy_audio,
                                                           embedding_model,
                                                           known_speaker_embeddings,
                                                           n_clusters=3,
                                                           whisper_model_name="base",
                                                           similarity_threshold=0.65)

print(f"Number of segments transcribed: {n_segments_noisy_65}")
print(f"Cluster centroids: {centroids_noisy_65.shape}")
print(df_noisy_65)

--------------------------------------------------
Processing audio file: sample_noisy.wav
Segment embeddings shape: (8, 192)
Time taken to compute segment embeddings: 3.769523859024048
KMeans labels: [2 1 0 1 1 0 0 2]
KMeans centroids shape: (3, 192)
Similarity scores for cluster 0: [array([[0.5812957]], dtype=float32), array([[0.0349585]], dtype=float32), array([[-0.06661911]], dtype=float32), array([[0.01783739]], dtype=float32), array([[0.07180018]], dtype=float32)]
Similarity scores for cluster 1: [array([[0.04861312]], dtype=float32), array([[0.1614371]], dtype=float32), array([[0.5451888]], dtype=float32), array([[0.15152594]], dtype=float32), array([[0.01892857]], dtype=float32)]
Similarity: [[0.6756178]]
Similarity scores for cluster 2: [array([[-0.07635162]], dtype=float32), array([[0.6756178]], dtype=float32), array([[0.06061047]], dtype=float32), array([[0.13118726]], dtype=float32), array([[0.15789184]], dtype=float32)]
The Speaker labels dictionary is: {0: ('Unknown', 0),

In [217]:
df_noisy_65[['cluster_id', 'speaker_id']].value_counts()

| cluster_id | speaker_id | count |
|---|---|---|
| 0 | Unknown | 3 |
| 1 | Unknown | 3 |
| 2 | speaker_B | 2 |
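
Re-running the whole pipeline is not strictly needed to answer Q4 and Q7, since the per-cluster best scores for the noisy sample are already printed in the logs above. The sketch below copies those three scores and checks them against both thresholds. The helper name `clusters_at_or_above` is an illustrative assumption.

```python
# Best score per cluster for sample_noisy.wav, copied from the logs above.
best_scores = {0: 0.5812957, 1: 0.5451888, 2: 0.6756178}


def clusters_at_or_above(scores, threshold):
    """Return the sorted cluster IDs whose best score meets the threshold."""
    return sorted(cid for cid, score in scores.items() if score >= threshold)


for threshold in (0.70, 0.65):
    print(threshold, clusters_at_or_above(best_scores, threshold))
# 0.7 []
# 0.65 [2]
```

This agrees with the table above: only cluster 2 clears 0.65, and its best match is `speaker_B`.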
