# Speaker Recognition
We try to achieve two things:


*   Given an audio recording of a conversation, get a diarized version of it, that is, a transcript of the form:  
\[timestamp\]: Speaker 1: ....  
\[timestamp\]: Speaker 2: ....
*   The transcript above does not tell us who the speakers are. We use pre-existing voice samples to work this out. This is done in two parts using the *SpeechBrain* speaker encoder, which produces an embedding for an audio sample:
  1. We embed the pre-existing samples we have.
  2. We use the timestamps obtained from the diarizer to cut out the fragments of the test audio in which a single speaker is speaking, embed each fragment, and combine the fragment embeddings into a duration-weighted average. We compare this average with the embeddings of the samples to match unknown speakers to known ones.  
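
The matching idea in part 2 can be sketched with toy vectors. This is a minimal illustration in plain Python, not the notebook's code: the three-dimensional vectors, the names `alice` and `bob`, and the helpers `cosine` and `best_match` are all made up for the example, and real speaker embeddings have many more dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def best_match(unknown, known):
    """Return (name, score) of the known embedding closest to `unknown`."""
    scores = {name: cosine(unknown, emb) for name, emb in known.items()}
    name = max(scores, key=scores.get)
    return name, scores[name]

# Hypothetical toy embeddings, for illustration only.
known = {"alice": [1.0, 0.0, 0.0], "bob": [0.0, 1.0, 0.0]}
unknown = [0.9, 0.1, 0.0]

name, score = best_match(unknown, known)
print(name, round(score, 3))
```

The unknown vector points almost in the same direction as `alice`, so she gets the highest score.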

Remark: the following code was written for Google Colab, so you may notice Colab-specific artefacts such as mounting Google Drive.



In [None]:
!pip install speechbrain pydub --quiet


In [None]:
import os
from pydub import AudioSegment
from speechbrain.inference import EncoderClassifier
import torch
import torch.nn.functional as F
import torchaudio
import numpy as np

DEVICE="cuda" if torch.cuda.is_available() else "cpu"



In [None]:
from google.colab import drive
drive.mount('/content/drive',force_remount=True)

Mounted at /content/drive


In [None]:
classifier=EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    run_opts={"device":DEVICE}
)

def embed_audiosegment(audio_seg,max_len_sec=5):
    samples = np.array(audio_seg.get_array_of_samples()).astype(np.float32)

    # Normalize to [-1, 1]
    samples /= np.iinfo(audio_seg.array_type).max

    # If stereo -> average channels
    if audio_seg.channels > 1:
        samples = samples.reshape((-1, audio_seg.channels))
        samples = samples.mean(axis=1)

    # Convert to torch tensor shape [1, time]
    waveform = torch.tensor(samples).unsqueeze(0)

    # Resample if needed
    if audio_seg.frame_rate != 16000:
        resampler = torchaudio.transforms.Resample(orig_freq=audio_seg.frame_rate, new_freq=16000)
        waveform = resampler(waveform)

    # Keep only the first max_len_sec seconds of the sample
    max_samples = max_len_sec * 16000
    waveform = waveform[:, :max_samples]

    # Relative length of each waveform in the batch (full audio = 1.0)
    lengths = torch.tensor([1.0])
    with torch.no_grad():
        # waveform is already a tensor; re-wrapping it in torch.tensor() triggers a copy warning
        speaker_embeddings = classifier.encode_batch(waveform, lengths)
        speaker_embeddings = F.normalize(speaker_embeddings, dim=2)
        speaker_embeddings = speaker_embeddings.squeeze().cpu()
    return speaker_embeddings
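
To make the normalization and stereo handling in `embed_audiosegment` concrete, here is a small stand-alone sketch in plain Python. It is not part of the pipeline: `to_mono_float` is a hypothetical helper and the four sample values are invented. It assumes interleaved 16-bit samples (left, right, left, right, ...), which is the layout the reshape in the function relies on.

```python
from array import array

def to_mono_float(samples, channels, max_value):
    """Scale interleaved integer PCM to [-1, 1] and average the channels."""
    scaled = [s / max_value for s in samples]
    # Interleaved layout: L0, R0, L1, R1, ... so each frame is `channels` long.
    frames = [scaled[i:i + channels] for i in range(0, len(scaled), channels)]
    return [sum(frame) / channels for frame in frames]

# Two made-up stereo frames of 16-bit audio: (32767, 0) and (-16384, 16384).
pcm = array("h", [32767, 0, -16384, 16384])
mono = to_mono_float(pcm, channels=2, max_value=32767)
print(mono)
```

In the second frame the two channels cancel out, which shows that averaging is a simple downmix rather than a loudness-preserving one.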



# Disclaimer
Use the audio sample I have provided at your own risk: it is a random podcast from my feed, and it is vulgar and political. To use your own examples:  
1. Store the audio samples of known people in audio_pipeline/samples/
2. Store the test audio in audio_pipeline/tests/  

and make the necessary changes to the paths in the code. It is better to use your own test examples so that we can check the robustness of the model.
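
If you add your own samples, one way to avoid editing the dictionary of known speakers by hand is to build it from the contents of the samples folder. The sketch below is only a suggestion: `collect_known_speakers` is a hypothetical helper, and it assumes every sample is a `.wav` file named after the speaker. The demo uses a temporary folder with empty placeholder files rather than real audio.

```python
import tempfile
from pathlib import Path

def collect_known_speakers(samples_dir):
    """Map each .wav file's stem (e.g. 'smitha') to its path as a string."""
    return {p.stem: str(p) for p in sorted(Path(samples_dir).glob("*.wav"))}

# Demo with empty placeholder files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    for speaker in ("anand", "smitha"):
        (Path(tmp) / f"{speaker}.wav").touch()
    demo = collect_known_speakers(tmp)
    print(list(demo))
```

In the notebook you would pass the path of your own samples folder on Drive and use the result in place of the hand-written dictionary.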

In [None]:
# Dictionary containing names with audio samples of known people.
known_speakers={
    'smitha':"/content/drive/MyDrive/AIML/project/audio_pipeline/samples/smitha.wav",
    'anand':"/content/drive/MyDrive/AIML/project/audio_pipeline/samples/anand.wav",
    'sushant':"/content/drive/MyDrive/AIML/project/audio_pipeline/samples/sushant.wav",
    'abhijit':"/content/drive/MyDrive/AIML/project/audio_pipeline/samples/abhijit.wav"
}
known_embeddings={}

for name,path in known_speakers.items():
  seg=AudioSegment.from_file(path)
  known_embeddings[name]=embed_audiosegment(seg)

print("Loaded known speakers:",list(known_embeddings.keys()))



Loaded known speakers: ['smitha', 'anand', 'sushant', 'abhijit']


In [None]:
!pip install assemblyai --quiet


In [None]:
import assemblyai as aai
aai.settings.api_key = "YOUR API KEY HERE"

In [None]:
# We are using Assembly AI to get the diarized version of our audio sample. See: https://www.assemblyai.com/
# Remember to change the path below if you are using your own test example.
audio_fn="/content/drive/MyDrive/AIML/project/audio_pipeline/tests/sample6.wav"
audio=AudioSegment.from_file(audio_fn)
config = aai.TranscriptionConfig(speaker_labels=True)

# Reuse a single Transcriber for both the upload and the transcription request.
transcriber = aai.Transcriber()
upload_url = transcriber.upload_file(audio_fn)
transcript = transcriber.transcribe(upload_url, config)
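
Before embedding anything, it can help to check how much audio the diarizer attributed to each label, since a label with only a second or two of speech is likely to give a noisier embedding. The sketch below is a hypothetical helper, not part of the notebook. It only assumes that each utterance has `speaker`, `start` and `end` attributes with times in milliseconds, which is how the rest of this notebook reads them, and it is demonstrated on made-up stand-in objects rather than a real transcript.

```python
from collections import defaultdict
from types import SimpleNamespace

def speaking_time_by_speaker(utterances):
    """Total speaking time in seconds per diarizer label (start/end in ms)."""
    totals = defaultdict(float)
    for utt in utterances:
        totals[utt.speaker] += (utt.end - utt.start) / 1000.0
    return dict(totals)

# Stand-in utterances with made-up values; in the notebook you would pass
# transcript.utterances instead.
fake = [
    SimpleNamespace(speaker="A", start=0, end=2000),
    SimpleNamespace(speaker="B", start=2000, end=2500),
    SimpleNamespace(speaker="A", start=2500, end=4000),
]
totals = speaking_time_by_speaker(fake)
print(totals)
```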

In [None]:
# The fragments of the test audio belonging to one speaker (cut out using the diarizer's timestamps) have different durations.
# We weight each fragment's embedding by its duration, so that shorter fragments contribute less.
# The function below takes a list of embeddings and a weight for each, and returns their normalized weighted average.

def weighted_mean_embeddings(embs,weights):
  if len(embs)==1:
    return embs[0]
  W=torch.tensor(weights,dtype=torch.float32).unsqueeze(1)
  M=torch.stack(embs,dim=0)
  avg=(W*M).sum(dim=0)/(W.sum()+1e-12)
  return avg/(avg.norm(p=2)+1e-12)

# cluster_segs maps each speaker label given by the diarizer (A, B, etc.) to the list of segments
# in which that speaker talks, together with each segment's duration in seconds.
cluster_segs={}
for utt in transcript.utterances:
  seg=audio[utt.start:utt.end]
  dur=max(0.001,(utt.end-utt.start)/1000.0)
  cluster_segs.setdefault(utt.speaker,[]).append((seg,dur))

# cluster_embeddings maps each speaker label to the embedding of that speaker's test audio, for example {"A": embedding of A}.
cluster_embeddings={}
for label,segs in cluster_segs.items():
  embs,weights=[],[]
  for seg,dur in segs:
    embs.append(embed_audiosegment(seg))
    weights.append(dur)
  cluster_embeddings[label]=weighted_mean_embeddings(embs,weights)

# UNKNOWN_THRESHOLD is a hyperparameter I have yet to tune. It is the minimum similarity score at which a diarizer cluster is matched to a known speaker.
# We compare two embeddings by their cosine similarity: dot(a, b) / (norm(a) * norm(b)).
UNKNOWN_THRESHOLD=0.55
label_to_name={}
for label,emb in cluster_embeddings.items():
  best_name,best_score=None,-1.0
  for name,kemb in known_embeddings.items():
    score=F.cosine_similarity(emb,kemb,dim=0).item()
    if score>best_score:
      best_name,best_score=name,score
  if best_score>=UNKNOWN_THRESHOLD:
    label_to_name[label]=best_name
  else:
    label_to_name[label]=f"Unknown_{label}"
  print(f"Cluster {label} → {label_to_name[label]} (cos={best_score:.3f})")



Cluster A → sushant (cos=0.671)
Cluster B → abhijit (cos=0.668)
Cluster C → smitha (cos=0.756)
Cluster D → anand (cos=0.706)
Cluster E → Unknown_E (cos=0.356)
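
As a worked example of the duration weighting used above, the sketch below redoes the arithmetic of `weighted_mean_embeddings` in plain Python on two toy vectors. The helper name `weighted_mean_unit`, the two-dimensional vectors and the durations are invented for illustration, and the small epsilon terms of the original function are left out for readability.

```python
import math

def weighted_mean_unit(vectors, weights):
    """Duration-weighted average of vectors, rescaled to unit length."""
    total = sum(weights)
    dim = len(vectors[0])
    avg = [sum(w * v[i] for v, w in zip(vectors, weights)) / total
           for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in avg))
    return [x / norm for x in avg]

# Two hypothetical unit embeddings: a 3-second fragment and a 1-second fragment.
vectors = [[1.0, 0.0], [0.0, 1.0]]
durations = [3.0, 1.0]
result = weighted_mean_unit(vectors, durations)
print([round(x, 3) for x in result])
```

The weighted average is (0.75, 0.25) before rescaling, so the longer fragment dominates the direction of the final embedding.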


In [None]:
# We are finally done! Print the transcript, substituting the names of the speakers we are confident about.
import textwrap
print("\n=== Named Diarization ===")
for utt in transcript.utterances:
    who = label_to_name.get(utt.speaker, f"Unknown_{utt.speaker}")
    print(textwrap.fill(f"[{utt.start/1000:.2f}-{utt.end/1000:.2f}] {who}: {utt.text}",width=100))


=== Named Diarization ===
[0.40-35.75] sushant: So let me, let me. The closest parallel to this, if at all parallels can be
drawn is in the 19th century when the British were fighting this menace not just in the Indian
subcontinent but they were fighting it in Africa and Sudan and other places. And what they did was.
You probably cannot do that today. But the kind of massacres that the British did at that point of
time, they eliminated them. They would put them in front of cannons and blow them. Today you will
have the entire world on you if we were to make an example of these jihadis like that.
[35.75-43.91] abhijit: Winston Churchill was using gas in Iraq. British army used gas in Iraq. The
British invented concentration camps in the Boer War and they used disease as a weapon.
[43.91-44.39] sushant: Exactly.
[44.39-45.23] abhijit: They slotted.
[45.23-47.27] sushant: They taught the Germans to build concentration.
[47.27-49.31] abhijit: Camps after Kitchener was killed in.
[49.95-51