<a href="https://colab.research.google.com/github/dxvsh/LearningPytorch/blob/main/Week6/DLP_Week6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## DLP Week 6

This week we'll be trying out Speaker Diarisation, i.e. the task of figuring out ***who spoke when*** in an audio clip.

Access the video used for diarisation in the lectures here: [Video Link](https://drive.google.com/file/d/1NFifPUnU2w6WZKyEXLyQjcn1lEXnupRt/view)

Access the audio used for diarisation in the lectures here: [Audio Link](https://drive.google.com/file/d/1sJUZe9320zQNON3o0dXV9UtmYkqsxlU8/view)

### Install necessary libraries

- `speechbrain` provides the pretrained model used for speaker embedding extraction
- `faster_whisper` (a faster reimplementation of Whisper) is used for breaking down the audio into chunks/segments and getting the text spoken in them

In [2]:
!pip install speechbrain==0.5.16 faster_whisper pyannote.audio whisper moviepy pandas pillow > /dev/null

## Necessary Imports

In [3]:
import librosa, traceback
from faster_whisper import WhisperModel
import torch
from pathlib import Path
import pandas as pd
import numpy as np
import re, time, os, datetime
from sklearn.cluster import AgglomerativeClustering
from pyannote.audio.pipelines.speaker_verification import PretrainedSpeakerEmbedding
from pyannote.audio import Audio
from pyannote.core import Segment
import speechbrain
from scipy.spatial.distance import cdist

## Download the audio into Colab


Download the audio file used in the lectures into Colab and specify the file path:

In [4]:
!gdown 1sJUZe9320zQNON3o0dXV9UtmYkqsxlU8

Downloading...
From: https://drive.google.com/uc?id=1sJUZe9320zQNON3o0dXV9UtmYkqsxlU8
To: /content/TEST-1.mp3
  0% 0.00/1.94M [00:00<?, ?B/s]100% 1.94M/1.94M [00:00<00:00, 181MB/s]


In [5]:
audio_file_path = '/content/TEST-1.mp3'

## .mp3 to .wav conversion

We need to convert our `mp3` audio into `wav` format first. Let's use `ffmpeg` for this, with the following output settings:

- Sample rate: 16 kHz
- Channels: 1 (mono)
- Audio codec: `pcm_s16le` (16-bit signed little-endian PCM)

In [6]:
# !ffmpeg -i "{audio_file_path}" "{audio_file_path[:-4]}.wav"
!ffmpeg -i "{audio_file_path}" -ar 16000 -ac 1 -c:a pcm_s16le "{audio_file_path[:-4]}.wav"

ffmpeg version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2000-2021 the FFmpeg developers
  built with gcc 11 (Ubuntu 11.2.0-19ubuntu1)
  configuration: --prefix=/usr --extra-version=0ubuntu0.22.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enab

We've got the `.wav` file now. Let's update the `audio_file_path` variable:

In [7]:
audio_file_path = '/content/TEST-1.wav'
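As a quick sanity check, Python's standard `wave` module can confirm that a converted file has the sample rate, channel count, and bit depth listed above. This is only a sketch: the helper name `describe_wav` is hypothetical, and the demo writes a short silent file to a temporary directory instead of reading `/content/TEST-1.wav`, so that it runs anywhere.

```python
import os
import tempfile
import wave

def describe_wav(path):
    """Return (sample_rate_hz, channels, bits_per_sample, duration_s) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        bits = wav.getsampwidth() * 8
        duration = wav.getnframes() / rate
    return rate, channels, bits, duration

# Demo on a synthetic file: 0.5 s of silence, 16 kHz, mono, 16-bit PCM.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)       # 2 bytes per sample = 16 bits, as with pcm_s16le
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 8000)  # 8000 frames at 16 kHz = 0.5 s

demo_info = describe_wav(demo_path)
print(demo_info)
```

In the notebook, `describe_wav(audio_file_path)` should report 16000 Hz, 1 channel, and 16 bits if the `ffmpeg` command ran as intended.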

## Speaker Diarization - Using Whisper Segments & Agglomerative Hierarchical Clustering

Here's the general **sequence of steps we'll follow** for diarisation:

1. Transcribe the audio using Whisper. We'll also **use Whisper to break the audio into separate chunks/segments**, which is the main thing we want from it. We'll assume that each of these segments is spoken by a single speaker.

2. Now that we have the different segments from the audio, we need to extract a speaker embedding for each segment using our speechbrain model.

3. Once the speaker embeddings have been extracted, we use **Agglomerative Clustering** to cluster the embedding vectors into the given number of clusters. For example, if you know that there are 2 speakers in the audio, then Agglomerative Clustering will group all the embeddings into 2 clusters. It starts with every embedding in its own cluster and keeps merging the closest clusters until only 2 are left. Note that the scikit-learn call used in this notebook keeps the default settings (Ward linkage on Euclidean distance), not cosine distance.
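
To make step 3 concrete, here is a minimal from-scratch sketch of agglomerative clustering on toy 2-D vectors. It is not scikit-learn's implementation: it uses average linkage with cosine distance purely as one simple choice of linkage and metric, and the function names (`cosine_distance`, `agglomerate`) and the toy vectors are made up for illustration.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def agglomerate(vectors, num_clusters):
    """Merge the two closest clusters (average linkage) until num_clusters remain."""
    clusters = [[i] for i in range(len(vectors))]  # start: one cluster per vector
    while len(clusters) > num_clusters:
        best = None  # (average distance, index a, index b) of the closest pair so far
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # average pairwise distance between the members of both clusters
                dists = [cosine_distance(vectors[i], vectors[j])
                         for i in clusters[a] for j in clusters[b]]
                avg = sum(dists) / len(dists)
                if best is None or avg < best[0]:
                    best = (avg, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])  # merge cluster b into cluster a
        del clusters[b]
    # turn the list of clusters into one label per input vector
    labels = [0] * len(vectors)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

# Three vectors pointing roughly along x, two roughly along y.
toy_embeddings = [(1.0, 0.1), (0.1, 1.0), (0.9, 0.2), (0.2, 0.9), (1.0, 0.0)]
toy_labels = agglomerate(toy_embeddings, num_clusters=2)
print(toy_labels)  # [0, 1, 0, 1, 0]
```

The label numbers themselves carry no meaning; only which vectors share a label matters, which is also why the speaker names produced later in this notebook are arbitrary.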



In [9]:
# Available Whisper models
WHISPER_MODELS = ["tiny", "base", "small", "medium", "large-v1", "large-v2"]

# Initialize the speaker embedding model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
embedding_model = PretrainedSpeakerEmbedding(
    "speechbrain/spkrec-ecapa-voxceleb",
    device=device
)

Let's also create some helper functions as we go along:

In [10]:
def convert_time(seconds):
    """Converts seconds to readable time format."""
    return datetime.timedelta(seconds=round(seconds))

# for example, it converts 125 sec to 0:02:05
print(convert_time(125))

0:02:05
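
One detail worth keeping in mind is that `convert_time` rounds to whole seconds before formatting, so sub-second boundaries are lost. The sketch below redefines the helper so that it is self-contained and runs it on a few values; the sample inputs are chosen for illustration, and the two `.5` cases show that Python's built-in `round` rounds halves to the nearest even integer.

```python
import datetime

def convert_time(seconds):
    """Same helper as above: round to whole seconds, then format as H:MM:SS."""
    return datetime.timedelta(seconds=round(seconds))

examples = [3.92, 26.8, 2.5, 3.5, 3725.4]
formatted = [str(convert_time(s)) for s in examples]
for seconds, text in zip(examples, formatted):
    print(f"{seconds:>7} s -> {text}")
# 3.92 -> 0:00:04, 26.8 -> 0:00:27, 2.5 -> 0:00:02, 3.5 -> 0:00:04, 3725.4 -> 1:02:05
```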


In [11]:
def load_audio(audio_file):
    """Load an audio file and return its duration (in seconds) and sample rate."""
    audio_data, sample_rate = librosa.load(audio_file, mono=True, sr=16000)
    duration = len(audio_data) / sample_rate
    return duration, sample_rate

load_audio(audio_file_path)

(120.95275, 16000)

In [12]:
def transcribe_audio(audio_file, whisper_model="base"):
    """Transcribe audio using Whisper and get time segments."""

    # Initialize Whisper model to use
    model = WhisperModel(whisper_model, compute_type="int8")

    # Get audio duration
    duration, _ = load_audio(audio_file)

    # Transcribe audio
    options = dict(language='en', beam_size=5, best_of=5)
    segments_raw, _ = model.transcribe(audio_file, task="transcribe", **options)

    # Convert to simplified format
    segments = [] # here we'll keep track of the different segments (the start and end time of each segment)
    for segment_chunk in segments_raw:
        chunk = {"start": segment_chunk.start, "end": segment_chunk.end}
        segments.append(chunk)

        # just printing for my reference how the transcription looks
        print(f"{segment_chunk.start} - {segment_chunk.end} :  {segment_chunk.text}")

    return segments, duration

In [14]:
segments, duration = transcribe_audio(audio_file_path, "base")

0.0 - 3.92 :   Let's talk about music. How often do you listen to music?
3.92 - 7.04 :   I think I listen to music mostly when I'm driving.
7.04 - 12.8 :   I think it puts me in such a good mood when I'm out there on a drive and I play my favorite
12.8 - 20.96 :   music. I'm usually into Afro music a lot, hip-hop and Afro and R&B, so I prefer listening to music when
20.96 - 26.8 :   I'm driving or sometimes when I'm working out at the gym, something like that.
26.8 - 31.12 :   Is music an important subject in schools in your country?
31.12 - 37.2 :   In schools in my country, it is because I'm from India, so in India,
37.2 - 43.6 :   music and dance and expressing our emotions as usually through music and dancing.
43.6 - 50.64 :   So in every school they teach classical music or they have a subject where there is something
50.64 - 54.24 :   about music usually, so I think it is important.
54.24 - 56.56 :   Do you ever go to live concerts?
56.56 - 62.96 :   Oh, I've been to three concer

Now, thanks to ***Whisper***, we've split the audio into segments in the way we want.

We just want the **start** and **end** timestamps of each segment (the actual text transcription isn't needed), so that we can cut out each segment's audio clip and extract an embedding for it.

In [15]:
segments

[{'start': 0.0, 'end': 3.92},
 {'start': 3.92, 'end': 7.04},
 {'start': 7.04, 'end': 12.8},
 {'start': 12.8, 'end': 20.96},
 {'start': 20.96, 'end': 26.8},
 {'start': 26.8, 'end': 31.12},
 {'start': 31.12, 'end': 37.2},
 {'start': 37.2, 'end': 43.6},
 {'start': 43.6, 'end': 50.64},
 {'start': 50.64, 'end': 54.24},
 {'start': 54.24, 'end': 56.56},
 {'start': 56.56, 'end': 62.96},
 {'start': 63.84, 'end': 68.32000000000001},
 {'start': 69.2, 'end': 73.6},
 {'start': 74.48, 'end': 80.64},
 {'start': 81.44, 'end': 86.32000000000001},
 {'start': 86.32000000000001, 'end': 94.08000000000001},
 {'start': 94.08000000000001, 'end': 98.32000000000001},
 {'start': 98.32000000000001, 'end': 104.48},
 {'start': 104.48, 'end': 109.12},
 {'start': 109.12, 'end': 112.48},
 {'start': 112.48, 'end': 116.32000000000001},
 {'start': 116.32000000000001, 'end': 121.04}]

In [21]:
len(segments)

23

Observe that our audio was split into 23 segments.
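
Since each segment is just a pair of timestamps, a few lines of plain Python can summarise them. The sketch below takes a hand-copied subset of the segments printed above and computes each segment's length and the gap before the next one; `summarise_segments` is a hypothetical helper, not part of the notebook.

```python
def summarise_segments(segments):
    """Return (durations, gaps) in seconds for a list of {'start', 'end'} dicts."""
    durations = [round(s["end"] - s["start"], 2) for s in segments]
    # gap between the end of one segment and the start of the next
    gaps = [round(nxt["start"] - cur["end"], 2)
            for cur, nxt in zip(segments, segments[1:])]
    return durations, gaps

sample_segments = [
    {"start": 54.24, "end": 56.56},
    {"start": 56.56, "end": 62.96},
    {"start": 63.84, "end": 68.32},
    {"start": 69.2, "end": 73.6},
]
durations, gaps = summarise_segments(sample_segments)
print("durations:", durations)
print("gaps:", gaps)
```

A non-zero gap marks a stretch of audio that Whisper did not assign to any segment, so no embedding is extracted for it.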

In [16]:
def create_embedding(audio_file, segment, duration, embedding_model):
    """Create speaker embedding for an audio segment."""
    audio = Audio()
    start = segment["start"]
    end = min(duration, segment["end"])

    clip = Segment(start, end)
    waveform, _ = audio.crop(audio_file, clip)

    # Creates speaker embeddings for a single segment using the speechbrain model
    return embedding_model(waveform[None])

Let's try creating the embedding for an audio segment and see what it looks like:

In [19]:
embedding = create_embedding(audio_file_path, {"start": 0, "end": 3.92}, 3.92, embedding_model)
embedding

array([[-40.25594   ,  -8.491479  ,  -4.9949446 , -26.65534   ,
          9.425489  , -15.619783  , -13.361435  ,  29.846088  ,
        -34.676872  ,  -0.57370985,  26.041285  ,   4.015476  ,
         26.022345  ,  -3.417672  ,   2.490573  , -32.552284  ,
        -19.57991   ,   2.5500743 ,   3.753426  ,  -7.033936  ,
         -1.5066743 ,   4.335468  ,  -5.859436  , -16.098492  ,
         25.152824  ,  24.570328  ,   7.2257648 ,   2.2420256 ,
        -15.459007  ,  -8.607247  ,  10.132063  ,   6.4293847 ,
         -8.238471  ,  27.347834  ,  28.328703  , -10.379413  ,
         27.164042  , -14.08712   ,  19.617268  , -17.926006  ,
         21.622137  , -25.019194  , -22.539604  , -21.980099  ,
         -3.8643553 , -11.392321  ,  13.730856  ,  -5.730925  ,
        -16.300035  , -34.68206   ,  22.299427  , -16.448433  ,
        -36.93051   ,  -0.21806128,  11.828444  ,  26.695143  ,
        -12.675824  ,  26.536583  , -31.337332  , -34.932266  ,
         -2.3158658 ,  -7.0365553 , -11.

In [20]:
embedding.shape

(1, 192)

We can see that each **speaker embedding vector** has 192 values.

We need to create an embedding for every segment we got from Whisper.

The function below extracts the **speaker embedding** for each segment.

Then it uses **Agglomerative Clustering** to cluster the embedding vectors into the given number of clusters.

For example, if we know there are 2 speakers in the audio, then we use Agglomerative Clustering to group all the embeddings into 2 clusters. Vectors that are close to each other get merged into one cluster, and the merging continues until we're left with the desired number of clusters. Since `AgglomerativeClustering` is called with its default settings here, closeness is measured with Ward linkage on Euclidean distance rather than cosine distance.
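
Cosine and Euclidean distance are two common ways to measure how close two embeddings are, and they can disagree when vectors differ in length. The sketch below uses made-up 3-D vectors rather than real 192-D embeddings; `l2_normalise` and the other helper names are illustrative assumptions. It also checks the identity ‖u − v‖² = 2(1 − cos θ), which holds once both vectors are scaled to unit length.

```python
import math

def l2_normalise(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_distance(u, v):
    """1 - cosine similarity, which ignores vector length."""
    u, v = l2_normalise(u), l2_normalise(v)
    return 1.0 - sum(a * b for a, b in zip(u, v))

def euclidean_distance(u, v):
    """Straight-line distance, which is sensitive to vector length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

a = [-40.0, -8.0, -5.0]
b = [-20.0, -4.0, -2.5]   # same direction as a, half the length
c = [30.0, -10.0, 5.0]    # a different direction

# Same direction: cosine distance is ~0 although the points are far apart.
print(cosine_distance(a, b), euclidean_distance(a, b))

# After unit-length scaling, squared Euclidean distance equals 2 * cosine distance.
lhs = euclidean_distance(l2_normalise(a), l2_normalise(c)) ** 2
rhs = 2 * cosine_distance(a, c)
print(lhs, rhs)
```

So if cosine-style behaviour is wanted from a Euclidean-based method, one option is to scale every embedding to unit length before clustering.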

In [22]:
def process_embeddings(audio_file, segments, duration, embedding_model, num_speakers=2):
    """Create and process embeddings for all segments."""

    # Create embeddings for each segment.
    # Each embedding has 192 values, so with, say, 23 segments
    # the array below has shape (23, 192).
    embeddings = np.zeros(shape=(len(segments), 192))
    for i, segment in enumerate(segments):
        embeddings[i] = create_embedding(audio_file, segment, duration, embedding_model)
    embeddings = np.nan_to_num(embeddings) # replace NaN values with 0 in the embeddings

    # Cluster embeddings to assign speakers
    # (scikit-learn defaults: Ward linkage on Euclidean distance)
    clustering = AgglomerativeClustering(num_speakers).fit(embeddings)
    labels = clustering.labels_

    # Add speaker labels to segments
    for i, segment in enumerate(segments):
        segment["speaker"] = f"SPEAKER{labels[i] + 1}"

    return segments

In [24]:
segments = process_embeddings(audio_file_path, segments, duration, embedding_model, num_speakers=2)
segments

[{'start': 0.0, 'end': 3.92, 'speaker': 'SPEAKER2'},
 {'start': 3.92, 'end': 7.04, 'speaker': 'SPEAKER1'},
 {'start': 7.04, 'end': 12.8, 'speaker': 'SPEAKER1'},
 {'start': 12.8, 'end': 20.96, 'speaker': 'SPEAKER1'},
 {'start': 20.96, 'end': 26.8, 'speaker': 'SPEAKER1'},
 {'start': 26.8, 'end': 31.12, 'speaker': 'SPEAKER2'},
 {'start': 31.12, 'end': 37.2, 'speaker': 'SPEAKER1'},
 {'start': 37.2, 'end': 43.6, 'speaker': 'SPEAKER1'},
 {'start': 43.6, 'end': 50.64, 'speaker': 'SPEAKER1'},
 {'start': 50.64, 'end': 54.24, 'speaker': 'SPEAKER1'},
 {'start': 54.24, 'end': 56.56, 'speaker': 'SPEAKER2'},
 {'start': 56.56, 'end': 62.96, 'speaker': 'SPEAKER1'},
 {'start': 63.84, 'end': 68.32000000000001, 'speaker': 'SPEAKER1'},
 {'start': 69.2, 'end': 73.6, 'speaker': 'SPEAKER1'},
 {'start': 74.48, 'end': 80.64, 'speaker': 'SPEAKER1'},
 {'start': 81.44, 'end': 86.32000000000001, 'speaker': 'SPEAKER2'},
 {'start': 86.32000000000001, 'end': 94.08000000000001, 'speaker': 'SPEAKER1'},
 {'start': 94.08

Agglomerative Clustering has found out which segments were spoken by which speakers.

Now we just need to clean up this output a little by merging consecutive segments that have no change of speaker into one section.

For example, observe that the 2nd to 5th segments are all spoken by `SPEAKER1`, so we can combine those segments into one.
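
The merging rule can be sketched independently of pandas with `itertools.groupby`, which groups consecutive items that share a key. The helper name `merge_consecutive` is hypothetical, and the short segment list is copied from the first rows of the output above.

```python
from itertools import groupby

def merge_consecutive(segments):
    """Collapse runs of adjacent segments with the same speaker into one entry."""
    merged = []
    for speaker, run in groupby(segments, key=lambda s: s["speaker"]):
        run = list(run)
        # a run starts where its first segment starts and ends where its last one ends
        merged.append({"start": run[0]["start"], "end": run[-1]["end"], "speaker": speaker})
    return merged

labelled = [
    {"start": 0.0, "end": 3.92, "speaker": "SPEAKER2"},
    {"start": 3.92, "end": 7.04, "speaker": "SPEAKER1"},
    {"start": 7.04, "end": 12.8, "speaker": "SPEAKER1"},
    {"start": 12.8, "end": 20.96, "speaker": "SPEAKER1"},
    {"start": 20.96, "end": 26.8, "speaker": "SPEAKER1"},
    {"start": 26.8, "end": 31.12, "speaker": "SPEAKER2"},
]
merged = merge_consecutive(labelled)
for row in merged:
    print(row)
```

The six labelled segments collapse into three turns, with the four `SPEAKER1` segments merged into a single span from 3.92 s to 26.8 s.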

In [25]:
def create_output_dataframe(segments):
    """Create a DataFrame with speaker segments."""
    output = {
        'Start': [],
        'End': [],
        'Speaker': []
    }

    for i, segment in enumerate(segments):
        # Add new entry when speaker changes or at the start
        if i == 0 or segments[i-1]["speaker"] != segment["speaker"]:
            output['Start'].append(str(convert_time(segment["start"])))
            output['Speaker'].append(segment['speaker'])
            if i != 0:
                output['End'].append(str(convert_time(segments[i-1]['end'])))

    # Add final end time
    output['End'].append(str(convert_time(segments[-1]['end'])))

    return pd.DataFrame(output)

In [26]:
create_output_dataframe(segments)

|   | Start | End | Speaker |
|---|---|---|---|
| 0 | 0:00:00 | 0:00:04 | SPEAKER2 |
| 1 | 0:00:04 | 0:00:27 | SPEAKER1 |
| 2 | 0:00:27 | 0:00:31 | SPEAKER2 |
| 3 | 0:00:31 | 0:00:54 | SPEAKER1 |
| 4 | 0:00:54 | 0:00:57 | SPEAKER2 |
| 5 | 0:00:57 | 0:01:21 | SPEAKER1 |
| 6 | 0:01:21 | 0:01:26 | SPEAKER2 |
| 7 | 0:01:26 | 0:02:01 | SPEAKER1 |


Finally, we've got everything in a clean DataFrame.

Let's create a function that does all of this in one go:

In [27]:
def process_audio(audio_file, whisper_model="base", num_speakers=2):
    """Process audio file for speaker diarization."""
    try:
        # Step 1: Transcribe audio and get segments
        print("Transcribing audio...")
        segments, duration = transcribe_audio(audio_file, whisper_model)
        print("------------------------")
        print(f"Audio duration: {convert_time(duration)}")
        print(f"Number of segments: {len(segments)}")

        # Step 2: Create speaker embeddings
        print("Creating speaker embeddings...")
        segments = process_embeddings(audio_file, segments, duration, embedding_model, num_speakers)

        # Step 3: Create output DataFrame
        print("Creating output DataFrame...")
        results_df = create_output_dataframe(segments)

        return results_df

    except Exception as e:
        print(f"Error processing audio: {str(e)}")
        traceback.print_exc()
        raise RuntimeError("Error processing audio file") from e

In [28]:
# Example usage:
audio_file = "/content/TEST-1.wav"
whisper_model = "base"
num_speakers = 2

print("Processing audio file...")
results_df = process_audio(audio_file, whisper_model, num_speakers)
print("Results:")
print(results_df)

Processing audio file...
Transcribing audio...
0.0 - 3.92 :   Let's talk about music. How often do you listen to music?
3.92 - 7.04 :   I think I listen to music mostly when I'm driving.
7.04 - 12.8 :   I think it puts me in such a good mood when I'm out there on a drive and I play my favorite
12.8 - 20.96 :   music. I'm usually into Afro music a lot, hip-hop and Afro and R&B, so I prefer listening to music when
20.96 - 26.8 :   I'm driving or sometimes when I'm working out at the gym, something like that.
26.8 - 31.12 :   Is music an important subject in schools in your country?
31.12 - 37.2 :   In schools in my country, it is because I'm from India, so in India,
37.2 - 43.6 :   music and dance and expressing our emotions as usually through music and dancing.
43.6 - 50.64 :   So in every school they teach classical music or they have a subject where there is something
50.64 - 54.24 :   about music usually, so I think it is important.
54.24 - 56.56 :   Do you ever go to live concerts?


In [31]:
# save the dataframe to CSV
save_path = '/content/TEST-1.csv'
results_df.to_csv(save_path)
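
Once saved, the results can also be consumed without pandas. The sketch below parses a hand-copied excerpt of the results table with the standard `csv` module, converts the `H:MM:SS` strings back to seconds, and totals the talk time per speaker. `to_seconds` is a hypothetical helper, and the excerpt is written without the index column that `to_csv` adds by default, as if the file had been saved with `index=False`.

```python
import csv
import io

def to_seconds(hms):
    """Convert an 'H:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds

# Excerpt of the results table (first three rows), without the index column.
csv_text = """Start,End,Speaker
0:00:00,0:00:04,SPEAKER2
0:00:04,0:00:27,SPEAKER1
0:00:27,0:00:31,SPEAKER2
"""

rows = [
    {"start": to_seconds(r["Start"]), "end": to_seconds(r["End"]), "speaker": r["Speaker"]}
    for r in csv.DictReader(io.StringIO(csv_text))
]

# Sum the length of every turn per speaker.
talk_time = {}
for r in rows:
    talk_time[r["speaker"]] = talk_time.get(r["speaker"], 0) + r["end"] - r["start"]
print(talk_time)  # {'SPEAKER2': 8, 'SPEAKER1': 23}
```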



---

### Extra: Overlaying the speaker labels on top of the video


In [44]:
from moviepy.editor import VideoFileClip, ImageClip, CompositeVideoClip
from PIL import Image, ImageDraw, ImageFont

Download the video file used in the lecture:

In [36]:
!gdown 1NFifPUnU2w6WZKyEXLyQjcn1lEXnupRt

Downloading...
From: https://drive.google.com/uc?id=1NFifPUnU2w6WZKyEXLyQjcn1lEXnupRt
To: /content/videoplayback_test1.mp4
  0% 0.00/8.87M [00:00<?, ?B/s] 53% 4.72M/8.87M [00:00<00:00, 32.3MB/s]100% 8.87M/8.87M [00:00<00:00, 54.3MB/s]


In [45]:
# load the video
video_path = '/content/videoplayback_test1.mp4'
video = VideoFileClip(video_path)

In [49]:
# function to create an image with text
def create_text_image(text, font_size=70, img_size=(640, 80), bg_color=(0, 0, 0), text_color=(255, 255, 255)):
    img = Image.new('RGB', img_size, color=bg_color)
    d = ImageDraw.Draw(img)
    try:
        font = ImageFont.truetype("arial.ttf", font_size)
    except IOError:
        font = ImageFont.load_default()
    text_width, text_height = d.textbbox((0, 0), text, font=font)[2:]
    position = ((img_size[0]-text_width)/2, (img_size[1]-text_height)/2)
    d.text(position, text, fill=text_color, font=font)
    return img

# overlay speaker labels
clips = [video]

for _, row in results_df.iterrows():
    start_time = pd.to_datetime(row['Start']).time()
    end_time = pd.to_datetime(row['End']).time()

    start_seconds = start_time.hour * 3600 + start_time.minute * 60 + start_time.second
    end_seconds = end_time.hour * 3600 + end_time.minute * 60 + end_time.second

    text_img = create_text_image(row['Speaker'])
    text_img_path = '/content/temp_text_img.png'
    text_img.save(text_img_path)

    txt_clip = ImageClip(text_img_path).set_position(('center', 'bottom')).set_start(start_seconds).set_duration(end_seconds - start_seconds)

    clips.append(txt_clip)

# combine all clips
final_video = CompositeVideoClip(clips)

# Save the modified video
final_video_path = '/content/videoplayback_label.mp4'
final_video.write_videofile(final_video_path, codec='libx264')


Moviepy - Building video /content/videoplayback_label.mp4.
MoviePy - Writing audio in videoplayback_labelTEMP_MPY_wvf_snd.mp3




MoviePy - Done.
Moviepy - Writing video /content/videoplayback_label.mp4





Moviepy - Done !
Moviepy - video ready /content/videoplayback_label.mp4


In [51]:
from IPython.display import HTML
from base64 import b64encode

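def show_video(final_video_path, video_width=1000):
    """Embed the video in the notebook as a base64 data URL."""
    # read-only binary mode is enough here; the context manager closes the file
    with open(final_video_path, "rb") as f:
        video_file = f.read()

    video_url = f"data:video/mp4;base64,{b64encode(video_file).decode()}"
    return HTML(f"""<video width={video_width} controls><source src="{video_url}"></video>""")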

show_video(final_video_path)

The diarisation works well on this clip!