<a href="https://colab.research.google.com/github/dxvsh/LearningPytorch/blob/main/Week6/DLP_GA6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## DLP GA6

### Install Necessary Libraries

In [None]:
!pip install speechbrain==0.5.16 faster_whisper pyannote.audio whisper moviepy ctranslate2==4.4.0 > /dev/null

### Necessary Imports

In [None]:
import librosa, traceback
from faster_whisper import WhisperModel
import torch
import whisper
from pathlib import Path
import pandas as pd
import numpy as np
import re, time, os, datetime
from sklearn.cluster import AgglomerativeClustering, KMeans
from pyannote.audio.pipelines.speaker_verification import PretrainedSpeakerEmbedding
from pyannote.audio import Audio
from pyannote.core import Segment
import speechbrain
from scipy.spatial.distance import cdist

**Prepare Audio File**

Upload the audio file **(Test 1.mp3)** into the environment. Convert the audio from MP3 to WAV format with a sample rate of 16 kHz and set it to mono. Use 16-bit little-endian PCM (Pulse Code Modulation) encoding. The converted file will be used for processing.

Click [here](https://drive.google.com/file/d/1sJUZe9320zQNON3o0dXV9UtmYkqsxlU8/view) to view the audio file.

**Speaker Diarization Steps**

1. Load the converted audio file with the librosa library at a sample rate of 16 kHz, in mono, so that it is ready for processing.

2. Initialize the Whisper model from faster whisper for transcription. Various model options are available, such as **tiny, base, small, medium, large-v1, and large-v2**. For this work, we will use base.

3. Load the pretrained speaker embedding model using **speechbrain/spkrec-ecapa-voxceleb**. This model will be used to extract speaker embeddings from the audio.

4. Transcribe the audio using the Whisper model from faster whisper, dividing it into time-based segments. To transcribe in English (`language='en'`), use the options `beam_size=5` and `best_of=5`.

5. For each audio segment, generate speaker embeddings and store them.

6. Use clustering (KMeans) to group the segments by speaker. Set the number of speakers to three.

7. Assign speaker labels to each segment and compute the distances between speaker clusters to differentiate them.




---

Download the given audio:

In [None]:
!gdown 1sJUZe9320zQNON3o0dXV9UtmYkqsxlU8

Downloading...
From: https://drive.google.com/uc?id=1sJUZe9320zQNON3o0dXV9UtmYkqsxlU8
To: /content/TEST-1.mp3
  0% 0.00/1.94M [00:00<?, ?B/s]100% 1.94M/1.94M [00:00<00:00, 57.2MB/s]


In [None]:
audio_file_path = '/content/TEST-1.mp3'

**Q1.** Find the number of samples of the original audio file when loaded as a numpy array.

> **Note**: The audio is loaded directly as given, without converting it to WAV first, because the question asks about the original audio.

> By default, `librosa` resamples the audio to `sr=22050` on load; set `sr=None` to preserve the original sampling rate.

In [None]:
# Load the audio file using librosa
audio, sr = librosa.load(audio_file_path, sr=None, mono=False)  # sr=None preserves original sample rate

print(audio.shape)

(2, 5334016)


The shape shows that the given audio has 2 channels (stereo), and each channel has **5334016** samples.

**Q2.** What is the sampling rate of the original audio?

> To check the original sampling rate, set `sr=None` when loading the audio with `librosa`; otherwise the audio is resampled to `sr=22050` and *that* is reported as the sampling rate.

In [None]:
# sampling rate of the original audio
sr

44100
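
As a quick sanity check, the sample count from Q1 and the sampling rate from Q2 together give the length of the clip. The sketch below uses only the two numbers printed above; the helper `estimate_resampled_samples` is a hypothetical name, and its results are estimates, since the exact length after resampling depends on how the resampler rounds.

```python
# Values reported above for TEST-1.mp3 (per channel).
n_samples = 5334016
orig_sr = 44100

# Duration in seconds = number of samples / samples per second.
duration_s = n_samples / orig_sr
print(f"duration: {duration_s:.3f} s")

def estimate_resampled_samples(n, sr_from, sr_to):
    """Estimate the per-channel sample count after resampling (hypothetical helper)."""
    return round(n * sr_to / sr_from)

# librosa's default rate, and the 16 kHz rate used for the WAV file.
print(estimate_resampled_samples(n_samples, orig_sr, 22050))
print(estimate_resampled_samples(n_samples, orig_sr, 16000))
```

The duration does not change with the sampling rate; only the number of samples used to represent it does.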

**Q3.** How many channels are present in the original audio?

**A.** We already saw that there are 2 channels in the original audio. But we can further check our answers using `ffprobe`, which ships with `ffmpeg`:

In [None]:
!ffprobe -i "{audio_file_path}" -show_streams -select_streams a:0 -v 0

[STREAM]
index=0
codec_name=mp3
codec_long_name=MP3 (MPEG audio layer 3)
profile=unknown
codec_type=audio
codec_tag_string=[0][0][0][0]
codec_tag=0x0000
sample_fmt=fltp
sample_rate=44100
channels=2
channel_layout=stereo
bits_per_sample=0
id=N/A
r_frame_rate=0/0
avg_frame_rate=0/0
time_base=1/14112000
start_pts=353600
start_time=0.025057
duration_ts=1707540480
duration=120.999184
bit_rate=128000
max_bit_rate=N/A
bits_per_raw_sample=N/A
nb_frames=N/A
nb_read_frames=N/A
nb_read_packets=N/A
DISPOSITION:default=0
DISPOSITION:dub=0
DISPOSITION:original=0
DISPOSITION:comment=0
DISPOSITION:lyrics=0
DISPOSITION:karaoke=0
DISPOSITION:forced=0
DISPOSITION:hearing_impaired=0
DISPOSITION:visual_impaired=0
DISPOSITION:clean_effects=0
DISPOSITION:attached_pic=0
DISPOSITION:timed_thumbnails=0
TAG:encoder=Lavc60.3.
[/STREAM]


We can see that our answers are correct as per `ffprobe` as well. According to the above output:

- sample_rate=44100
- channels=2
- channel_layout=stereo

which matches what we got.

**Q4.** What is the meaning of "le" in "pcm_s16le"?


- [ ] low endian
- [x] little endian
- [ ] local endian
- [ ] long endian
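
To make the byte order concrete, here is a small sketch using Python's standard `struct` module. It packs one signed 16-bit sample both ways; the sample value 258 is an arbitrary illustration, not taken from the audio file.

```python
import struct

sample = 258  # 0x0102 as a signed 16-bit integer

little = struct.pack("<h", sample)  # "<" = little-endian, "h" = signed 16-bit
big = struct.pack(">h", sample)     # ">" = big-endian, shown for comparison

print(little.hex())  # least significant byte (02) is stored first
print(big.hex())     # most significant byte (01) is stored first

# Decoding with the matching byte order recovers the original value.
decoded = struct.unpack("<h", little)[0]
print(decoded)
```

In a `pcm_s16le` stream, every sample is stored like `little` above: signed, 16 bits wide, low byte first.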


**Q5.** While loading the model using **WhisperModel** from the **faster_whisper** package, what argument is needed to quantize the model to 8-bit precision?


- [x] compute_type
- [ ] precision_type
- [ ] quantization_type
- [ ] download_type
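
For intuition about what 8-bit precision means, the sketch below shows a generic symmetric int8 quantization of a few floats. It is only an illustration under simple assumptions (one scale for the whole list, made-up weights, hypothetical function names); it is not meant to describe how the library quantizes weights internally when `compute_type="int8"` is passed.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floats from the int8 values."""
    return [q * scale for q in quantized]

weights = [0.50, -1.27, 0.03, 1.00]  # made-up example weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print(q)         # small integers that each fit in one byte
print(restored)  # close to, but not exactly, the original weights
```

The trade-off is a smaller, faster model in exchange for a small rounding error per weight.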


**Q6.** How many segments are returned by the Whisper model for the given audio?

First, convert the MP3 audio to WAV:

In [None]:
!ffmpeg -i "{audio_file_path}" -ar 16000 -ac 1 -c:a pcm_s16le "{audio_file_path[:-4]}.wav"

ffmpeg version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2000-2021 the FFmpeg developers
  built with gcc 11 (Ubuntu 11.2.0-19ubuntu1)
  configuration: --prefix=/usr --extra-version=0ubuntu0.22.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enab

In [None]:
# we've got the wav file now:

audio_file_path = '/content/TEST-1.wav'

In [None]:
# Available Whisper models
WHISPER_MODELS = ["tiny", "base", "small", "medium", "large-v1", "large-v2"]

# Initialize the speaker embedding model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
embedding_model = PretrainedSpeakerEmbedding(
    "speechbrain/spkrec-ecapa-voxceleb",
    device=device
)

In [None]:
def convert_time(seconds):
    """Converts seconds to readable time format."""
    return datetime.timedelta(seconds=round(seconds))

# for example, 125 seconds is displayed as 0:02:05
print(convert_time(125))

0:02:05


In [None]:
def load_audio(audio_file):
    """Load an audio file and return its duration (in seconds) and sample rate."""
    audio_data, sample_rate = librosa.load(audio_file, mono=True, sr=16000)
    duration = len(audio_data) / sample_rate
    return duration, sample_rate

load_audio(audio_file_path)

(120.95275, 16000)

In [None]:
def transcribe_audio(audio_file, whisper_model="base"):
    """Transcribe audio using Whisper and get time segments."""

    # Initialize Whisper model to use
    model = WhisperModel(whisper_model, compute_type="int8")

    # Get audio duration
    duration, _ = load_audio(audio_file)

    # Transcribe audio
    options = dict(language='en', beam_size=5, best_of=5)
    segments_raw, _ = model.transcribe(audio_file, task="transcribe", **options)

    # Convert to simplified format
    segments = [] # here we'll keep track of the different segments (the start and end time of each segment)
    for segment_chunk in segments_raw:
        chunk = {"start": segment_chunk.start, "end": segment_chunk.end}
        segments.append(chunk)

        # just printing for my reference how the transcription looks
        print(f"{segment_chunk.start} - {segment_chunk.end} :  {segment_chunk.text}")

    return segments, duration

In [None]:
segments, duration = transcribe_audio(audio_file_path, "base")

0.0 - 3.92 :   Let's talk about music. How often do you listen to music?
3.92 - 7.04 :   I think I listen to music mostly when I'm driving.
7.04 - 12.8 :   I think it puts me in such a good mood when I'm out there on a drive and I play my favorite
12.8 - 20.96 :   music. I'm usually into Afro music a lot, hip-hop and Afro and R&B, so I prefer listening to music when
20.96 - 26.8 :   I'm driving or sometimes when I'm working out at the gym, something like that.
26.8 - 31.12 :   Is music an important subject in schools in your country?
31.12 - 37.2 :   In schools in my country, it is because I'm from India, so in India,
37.2 - 43.6 :   music and dance and expressing our emotions as usually through music and dancing.
43.6 - 50.64 :   So in every school they teach classical music or they have a subject where there is something
50.64 - 54.24 :   about music usually, so I think it is important.
54.24 - 56.56 :   Do you ever go to live concerts?
56.56 - 62.96 :   Oh, I've been to three concer

In [None]:
segments

[{'start': 0.0, 'end': 3.92},
 {'start': 3.92, 'end': 7.04},
 {'start': 7.04, 'end': 12.8},
 {'start': 12.8, 'end': 20.96},
 {'start': 20.96, 'end': 26.8},
 {'start': 26.8, 'end': 31.12},
 {'start': 31.12, 'end': 37.2},
 {'start': 37.2, 'end': 43.6},
 {'start': 43.6, 'end': 50.64},
 {'start': 50.64, 'end': 54.24},
 {'start': 54.24, 'end': 56.56},
 {'start': 56.56, 'end': 62.96},
 {'start': 63.84, 'end': 68.32000000000001},
 {'start': 69.2, 'end': 73.6},
 {'start': 74.48, 'end': 80.64},
 {'start': 81.44, 'end': 86.32000000000001},
 {'start': 86.32000000000001, 'end': 94.08000000000001},
 {'start': 94.08000000000001, 'end': 98.32000000000001},
 {'start': 98.32000000000001, 'end': 104.48},
 {'start': 104.48, 'end': 109.12},
 {'start': 109.12, 'end': 112.48},
 {'start': 112.48, 'end': 116.32000000000001},
 {'start': 116.32000000000001, 'end': 121.04}]

We've obtained the proper segments thanks to whisper. Let's check how many segments it created.

In [None]:
len(segments)

23

So, Whisper broke the audio down into 23 segments. We now need to pass these segments to the speechbrain model to extract the speaker embeddings.

**Q7.** Use the **speechbrain/spkrec-ecapa-voxceleb** model for speaker embedding extraction. What is the dimension of the speaker embeddings?

In [None]:
def create_embedding(audio_file, segment, duration, embedding_model):
    """Create speaker embedding for an audio segment."""
    audio = Audio()
    start = segment["start"]
    end = min(duration, segment["end"])

    clip = Segment(start, end)
    waveform, _ = audio.crop(audio_file, clip)

    # Creates speaker embeddings for a single segment using the speechbrain model
    return embedding_model(waveform[None])

Let's try creating an embedding for a segment.

In [None]:
embedding = create_embedding(audio_file_path, {"start": 0, "end": 3.92}, 3.92, embedding_model)
embedding

array([[-40.25594   ,  -8.491479  ,  -4.9949446 , -26.65534   ,
          9.425489  , -15.619783  , -13.361435  ,  29.846088  ,
        -34.676872  ,  -0.57370985,  26.041285  ,   4.015476  ,
         26.022345  ,  -3.417672  ,   2.490573  , -32.552284  ,
        -19.57991   ,   2.5500743 ,   3.753426  ,  -7.033936  ,
         -1.5066743 ,   4.335468  ,  -5.859436  , -16.098492  ,
         25.152824  ,  24.570328  ,   7.2257648 ,   2.2420256 ,
        -15.459007  ,  -8.607247  ,  10.132063  ,   6.4293847 ,
         -8.238471  ,  27.347834  ,  28.328703  , -10.379413  ,
         27.164042  , -14.08712   ,  19.617268  , -17.926006  ,
         21.622137  , -25.019194  , -22.539604  , -21.980099  ,
         -3.8643553 , -11.392321  ,  13.730856  ,  -5.730925  ,
        -16.300035  , -34.68206   ,  22.299427  , -16.448433  ,
        -36.93051   ,  -0.21806128,  11.828444  ,  26.695143  ,
        -12.675824  ,  26.536583  , -31.337332  , -34.932266  ,
         -2.3158658 ,  -7.0365553 , -11.

In [None]:
embedding.shape

(1, 192)

Let's try it out for another segment:

In [None]:
embedding = create_embedding(audio_file_path, {'start': 20.96, 'end': 26.8}, 5.84, embedding_model)
embedding

array([[ -0.97149867,  21.238972  ,  28.899313  ,  17.766155  ,
        -13.545868  ,   9.704622  , -17.773632  ,   1.0380232 ,
        -11.902727  , -16.154942  ,  13.99373   ,  17.203987  ,
         16.769989  , -43.61952   ,  18.326788  ,  22.05579   ,
        -13.35937   ,   8.561718  , -29.26869   ,  20.302916  ,
         10.7621565 , -20.021696  ,   7.0849    , -35.765457  ,
          9.299911  ,   0.32360032, -15.321062  ,   2.0194566 ,
        -14.767116  ,   4.387259  ,  16.354015  ,  -5.409315  ,
         -4.6615305 , -10.783706  ,  15.87705   , -11.084269  ,
        -14.936838  ,   2.1695988 ,  -0.35953552, -11.899469  ,
        -10.252263  ,  27.308075  , -31.043213  ,  -8.750173  ,
          2.138279  ,  18.813904  ,   0.2733121 ,  14.902317  ,
         40.943127  , -13.979475  , -16.672977  , -18.296595  ,
         21.046278  ,  11.567109  ,  -1.9439539 , -34.326775  ,
         -8.685527  , -17.75976   , -23.779284  ,  20.677973  ,
         11.896134  ,  11.671444  ,  -1.

In [None]:
embedding.shape

(1, 192)

So we can see that the embedding vector is of size **192**.
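
Embeddings like these are typically compared with a similarity measure before or instead of clustering. The sketch below computes cosine similarity between two arrays of shape `(1, 192)`; it uses random vectors as stand-ins for real embeddings, and `cosine_similarity` is a hypothetical helper written here, not a function from pyannote or speechbrain.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embeddings of shape (1, D) or (D,)."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Random stand-ins with the same shape as the real embeddings.
rng = np.random.default_rng(0)
emb_a = rng.normal(size=(1, 192))
emb_b = rng.normal(size=(1, 192))

print(cosine_similarity(emb_a, emb_a))  # identical vectors give 1.0 (up to rounding)
print(cosine_similarity(emb_a, emb_b))  # unrelated random vectors give a value near 0
```

With real embeddings, segments from the same speaker would be expected to score higher than segments from different speakers.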

**Q8.** How many languages are supported by the Whisper model?

**A.** 100

**Q9.** Apply K-means clustering on the speaker embeddings with `n_clusters=3`. Which two clusters are closest?


- [x] 2,3
- [ ] 1,2
- [ ] 1,3


In [None]:
from sklearn.metrics.pairwise import euclidean_distances

In [None]:
def process_embeddings(audio_file, segments, duration, embedding_model, num_speakers):
    """Create and process embeddings for all segments."""

    # Create embeddings for each segment

    # Pre-allocate an array of the required size for storing the embeddings.
    # Each embedding has 192 dimensions, so 23 segments give an array of shape 23 x 192.
    embeddings = np.zeros(shape=(len(segments), 192))
    for i, segment in enumerate(segments):
        embeddings[i] = create_embedding(audio_file, segment, duration, embedding_model)
    embeddings = np.nan_to_num(embeddings) # replace NaN values with 0 in the embeddings

    # Cluster embeddings to assign speakers
    clustering = KMeans(n_clusters=num_speakers, random_state=42).fit(embeddings)
    labels = clustering.labels_
    cluster_centers = clustering.cluster_centers_

    # Add speaker labels to segments
    for i, segment in enumerate(segments):
        segment["speaker"] = f"SPEAKER{labels[i] + 1}"

    # return both the embedding segments and cluster centers (needed for further processing)
    return segments, cluster_centers


In [None]:
segments, cluster_centers = process_embeddings(audio_file_path, segments, duration, embedding_model, num_speakers=3)
segments

[{'start': 0.0, 'end': 3.92, 'speaker': 'SPEAKER3'},
 {'start': 3.92, 'end': 7.04, 'speaker': 'SPEAKER1'},
 {'start': 7.04, 'end': 12.8, 'speaker': 'SPEAKER1'},
 {'start': 12.8, 'end': 20.96, 'speaker': 'SPEAKER1'},
 {'start': 20.96, 'end': 26.8, 'speaker': 'SPEAKER1'},
 {'start': 26.8, 'end': 31.12, 'speaker': 'SPEAKER2'},
 {'start': 31.12, 'end': 37.2, 'speaker': 'SPEAKER1'},
 {'start': 37.2, 'end': 43.6, 'speaker': 'SPEAKER1'},
 {'start': 43.6, 'end': 50.64, 'speaker': 'SPEAKER1'},
 {'start': 50.64, 'end': 54.24, 'speaker': 'SPEAKER1'},
 {'start': 54.24, 'end': 56.56, 'speaker': 'SPEAKER2'},
 {'start': 56.56, 'end': 62.96, 'speaker': 'SPEAKER1'},
 {'start': 63.84, 'end': 68.32000000000001, 'speaker': 'SPEAKER1'},
 {'start': 69.2, 'end': 73.6, 'speaker': 'SPEAKER1'},
 {'start': 74.48, 'end': 80.64, 'speaker': 'SPEAKER1'},
 {'start': 81.44, 'end': 86.32000000000001, 'speaker': 'SPEAKER2'},
 {'start': 86.32000000000001, 'end': 94.08000000000001, 'speaker': 'SPEAKER1'},
 {'start': 94.08

Now we just need to clean up this output a little by merging consecutive segments that have the same speaker into one section.

In [None]:
def create_output_dataframe(segments):
    """Create a DataFrame with speaker segments."""
    output = {
        'Start': [],
        'End': [],
        'Speaker': []
    }

    for i, segment in enumerate(segments):
        # Add new entry when speaker changes or at the start
        if i == 0 or segments[i-1]["speaker"] != segment["speaker"]:
            output['Start'].append(str(convert_time(segment["start"])))
            output['Speaker'].append(segment['speaker'])
            if i != 0:
                output['End'].append(str(convert_time(segments[i-1]['end'])))

    # Add final end time
    output['End'].append(str(convert_time(segments[-1]['end'])))

    return pd.DataFrame(output)

In [None]:
create_output_dataframe(segments)

| | Start | End | Speaker |
|---|---|---|---|
| 0 | 0:00:00 | 0:00:04 | SPEAKER3 |
| 1 | 0:00:04 | 0:00:27 | SPEAKER1 |
| 2 | 0:00:27 | 0:00:31 | SPEAKER2 |
| 3 | 0:00:31 | 0:00:54 | SPEAKER1 |
| 4 | 0:00:54 | 0:00:57 | SPEAKER2 |
| 5 | 0:00:57 | 0:01:21 | SPEAKER1 |
| 6 | 0:01:21 | 0:01:26 | SPEAKER2 |
| 7 | 0:01:26 | 0:02:01 | SPEAKER1 |
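
The same merging idea can be sketched without pandas, using `itertools.groupby` to collapse runs of consecutive segments that share a speaker. This is an alternative illustration, not the function used above; `merge_consecutive` is a hypothetical name, and the example input is the first six labeled segments from the clustering output.

```python
from itertools import groupby

def merge_consecutive(segments):
    """Merge runs of consecutive segments that have the same speaker."""
    merged = []
    for speaker, run in groupby(segments, key=lambda s: s["speaker"]):
        run = list(run)
        merged.append({
            "start": run[0]["start"],  # start of the first segment in the run
            "end": run[-1]["end"],     # end of the last segment in the run
            "speaker": speaker,
        })
    return merged

example = [
    {"start": 0.0, "end": 3.92, "speaker": "SPEAKER3"},
    {"start": 3.92, "end": 7.04, "speaker": "SPEAKER1"},
    {"start": 7.04, "end": 12.8, "speaker": "SPEAKER1"},
    {"start": 12.8, "end": 20.96, "speaker": "SPEAKER1"},
    {"start": 20.96, "end": 26.8, "speaker": "SPEAKER1"},
    {"start": 26.8, "end": 31.12, "speaker": "SPEAKER2"},
]

for row in merge_consecutive(example):
    print(row)
```

Keeping the times as floats until display avoids rounding each boundary before the runs are merged.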


In [None]:
# We also have the centroids of all the clusters in the variable "cluster_centers"
# we can calculate the euclidean distance between all the 3 cluster centers to find
# which two are the closest to each other

c1 = cluster_centers[0]
c2 = cluster_centers[1]
c3 = cluster_centers[2]

euclidean_distances([c1, c2, c3])

array([[  0.        , 319.397904  , 342.37442039],
       [319.397904  ,   0.        , 196.5199781 ],
       [342.37442039, 196.5199781 ,   0.        ]])

From the above matrix, we can see that the Euclidean distance between the centers of clusters 2 and 3 (about 196.52) is the lowest.
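
Reading the smallest off-diagonal entry by eye works for three clusters, but it can also be done programmatically. The sketch below copies the distance matrix printed above, masks the zero diagonal, and reports the closest pair using 1-based cluster numbers; the variable names are illustrative.

```python
import numpy as np

# Pairwise Euclidean distances between the three cluster centers (from the output above).
dist = np.array([
    [0.0, 319.397904, 342.37442039],
    [319.397904, 0.0, 196.5199781],
    [342.37442039, 196.5199781, 0.0],
])

# Ignore the diagonal, where each center has distance 0 to itself.
masked = dist.copy()
np.fill_diagonal(masked, np.inf)

# Locate the smallest remaining entry and convert to 1-based cluster numbers.
i, j = np.unravel_index(np.argmin(masked), masked.shape)
closest = tuple(sorted((int(i) + 1, int(j) + 1)))
print(closest, dist[i, j])
```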

**Q10.** What is the cluster number assigned for the first 4 seconds of the audio when `n_clusters=3` and when `n_clusters=2`?


- [x] 3,2
- [ ] 1,2


We can see that when we use n_clusters = 3, the first 4 seconds get assigned to Cluster 3/Speaker 3.

Let's try out the same when n_clusters = 2


In [None]:
# get the segments using whisper
segments, duration = transcribe_audio(audio_file_path, "base")

# get the extracted embedding segments using speech brain and cluster them
segments, cluster_centers = process_embeddings(audio_file_path, segments, duration, embedding_model, num_speakers=2)

# clean up the output and see the results of clustering
create_output_dataframe(segments)

0.0 - 3.92 :   Let's talk about music. How often do you listen to music?
3.92 - 7.04 :   I think I listen to music mostly when I'm driving.
7.04 - 12.8 :   I think it puts me in such a good mood when I'm out there on a drive and I play my favorite
12.8 - 20.96 :   music. I'm usually into Afro music a lot, hip-hop and Afro and R&B, so I prefer listening to music when
20.96 - 26.8 :   I'm driving or sometimes when I'm working out at the gym, something like that.
26.8 - 31.12 :   Is music an important subject in schools in your country?
31.12 - 37.2 :   In schools in my country, it is because I'm from India, so in India,
37.2 - 43.6 :   music and dance and expressing our emotions as usually through music and dancing.
43.6 - 50.64 :   So in every school they teach classical music or they have a subject where there is something
50.64 - 54.24 :   about music usually, so I think it is important.
54.24 - 56.56 :   Do you ever go to live concerts?
56.56 - 62.96 :   Oh, I've been to three concer

| | Start | End | Speaker |
|---|---|---|---|
| 0 | 0:00:00 | 0:00:04 | SPEAKER2 |
| 1 | 0:00:04 | 0:00:27 | SPEAKER1 |
| 2 | 0:00:27 | 0:00:31 | SPEAKER2 |
| 3 | 0:00:31 | 0:00:54 | SPEAKER1 |
| 4 | 0:00:54 | 0:00:57 | SPEAKER2 |
| 5 | 0:00:57 | 0:01:21 | SPEAKER1 |
| 6 | 0:01:21 | 0:01:26 | SPEAKER2 |
| 7 | 0:01:26 | 0:02:01 | SPEAKER1 |


Now, when we use n_clusters = 2, the first 4 seconds get assigned to speaker 2.
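
One caveat when comparing cluster numbers across runs: K-means cluster indices are arbitrary identifiers, so the number a speaker receives can depend on the initialization (fixed here by `random_state=42`). The sketch below shows one possible way to make two labelings comparable by renumbering clusters in order of first appearance; `relabel_by_first_appearance` is a hypothetical helper and the label lists are made up.

```python
def relabel_by_first_appearance(labels):
    """Renumber cluster labels as 1, 2, 3, ... in order of first appearance."""
    mapping = {}
    relabeled = []
    for label in labels:
        if label not in mapping:
            mapping[label] = len(mapping) + 1
        relabeled.append(mapping[label])
    return relabeled

# Two made-up labelings that describe the same grouping with different indices.
run_a = [2, 0, 0, 1, 0]
run_b = [1, 2, 2, 0, 2]

print(relabel_by_first_appearance(run_a))
print(relabel_by_first_appearance(run_b))
```

After relabeling, both runs print the same sequence, which shows that they group the segments identically even though the raw indices differ.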