# 3. Clustering audio segments for speaker identification
In our 3<sup>rd</sup> step, the task is to cluster the `<hash>_vocals.wav` segments generated in step 2. Because `<hash>_accompaniments.wav` has already been separated out, it no longer acts as noise, so the model can pick up each speaker's speech features more easily.

## Why clustering?
Subtitles don't tell us who is speaking, yet we need to know the speaker of each sentence. Clustering helps us identify the speaker for each (`sentence`, `<hash>_vocals`) pair. Knowing the speaker lets the 4<sup>th</sup> step generate TTS for that particular speaker.

## Clustering on audio...never heard before?
Clustering raw audio directly is impractical because of its high dimensionality.

What we can do instead is reduce the dimensionality by creating **embeddings**: vectors that represent the underlying speech input.

We can take a model pretrained on a different task, such as deciding whether two clips come from the same speaker, and apply transfer learning: reuse the trained model's hidden layers to generate a vector that serves as the embedding.
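To make the "same speaker or not" idea concrete, here is a minimal sketch of how two embeddings are commonly compared: cosine similarity checked against a threshold. The toy vectors, the `same_speaker` helper, and the `0.75` threshold are illustrative assumptions, not values taken from the pretrained model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def same_speaker(a, b, threshold=0.75):
    """Hypothetical decision rule: similar enough means same speaker."""
    return cosine_similarity(a, b) >= threshold


# Toy 4-dimensional "embeddings"; the real ones in this notebook have 512 values.
speaker_a_clip1 = [0.9, 0.1, 0.0, 0.2]
speaker_a_clip2 = [0.8, 0.2, 0.1, 0.2]
speaker_b_clip1 = [0.0, 0.9, 0.4, 0.0]

print(same_speaker(speaker_a_clip1, speaker_a_clip2))  # clips that point the same way
print(same_speaker(speaker_a_clip1, speaker_b_clip1))  # clips that point apart
```

A suitable threshold depends on the model and the recordings, so it would need tuning on real data.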

We can then cluster these embeddings and assign a unique ID to each cluster, treating each cluster as one particular speaker.
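This section does not name a clustering algorithm, so the following is only a sketch of the idea using a deliberately simple greedy scheme: each embedding joins the first cluster whose representative is similar enough, otherwise it starts a new cluster. The `cluster_embeddings` helper, the toy vectors, and the `0.75` threshold are assumptions made for illustration; a real pipeline would more likely rely on a library implementation such as k-means or agglomerative clustering.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cluster_embeddings(embeddings, threshold=0.75):
    """Greedy clustering sketch (hypothetical helper).

    Returns one integer speaker ID per embedding, in input order.
    """
    representatives = []  # first embedding seen for each cluster
    labels = []
    for emb in embeddings:
        for speaker_id, rep in enumerate(representatives):
            if cosine_similarity(emb, rep) >= threshold:
                labels.append(speaker_id)
                break
        else:
            # No existing cluster is close enough: open a new one.
            representatives.append(emb)
            labels.append(len(representatives) - 1)
    return labels


# Toy 4-dimensional embeddings standing in for the real 512-dimensional ones.
embeddings = [
    [0.9, 0.1, 0.0, 0.2],
    [0.0, 0.9, 0.4, 0.0],
    [0.8, 0.2, 0.1, 0.2],
    [0.1, 0.8, 0.5, 0.1],
]
speaker_ids = cluster_embeddings(embeddings)
print(speaker_ids)  # [0, 1, 0, 1]
```

The greedy scheme depends on input order and on the threshold, which is why it is shown here only to illustrate how cluster IDs map to speakers.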


In [4]:
# Don't run here; may be required in a production environment
# !apt install ffmpeg

In [None]:
!pip install --quiet spleeter
!pip install --quiet pysrt

In [None]:
!pip uninstall -y --quiet tensorflow
!pip install --upgrade --quiet tensorflow

In [None]:
!pip uninstall -y --quiet keras
!pip install --upgrade --quiet keras

In [1]:
from deepdub_sentence import DeepdubSentence
from deepdub_audio import DeepdubAudio
from IPython.display import Audio
from IPython.display import display

In [2]:
IN_COLAB = 'google.colab' in str(get_ipython())

if IN_COLAB:
  from google.colab import drive

  drive.mount('/content/gdrive')

In [3]:
DRAMA = '/tale_of_nine_tailed'

if IN_COLAB:
  BASE_DIR = '/content/gdrive/MyDrive/Colab Project/Deepdub'
  OUTPUT_DIR = '/content/output_dir' + DRAMA
  MODEL_PATH = '/content/gdrive/MyDrive/Colab Project/Deepdub/ResCNN_triplet_training_checkpoint_265.h5'
else:
  BASE_DIR = 'T:/pycharm_repo/Working_dir/Deepdub'
  OUTPUT_DIR = './output_dir' + DRAMA
  MODEL_PATH = './pretrained_models/ResCNN_triplet_training_checkpoint_265.h5'

AUDIO_OUTPUT_DIR = OUTPUT_DIR + '/audio_segments'
SAMPLE_DIR = BASE_DIR + DRAMA
SUBTITLE_DIR = SAMPLE_DIR + '/subtitles'
EP = 1

slice_from = "10_33"
slice_to = "11_00"

In [4]:
import os

# Portable replacement for `!mkdir -p`, which errors out under the Windows shell.
os.makedirs(OUTPUT_DIR, exist_ok=True)


In [4]:
def to_sec(min_sec):
  """
  Convert a string formatted as `min_sec` or `h_min_sec` to an int of total seconds.
  """
  total = 0
  for part in min_sec.split("_"):
    # Each field is worth 60x the field to its right (hours, minutes, seconds).
    total = total * 60 + int(part)
  return total
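As a quick sanity check of the slice window defined earlier, the arithmetic behind these underscore-separated timestamps can be sketched as follows. `timestamp_to_seconds` is a hypothetical stand-alone helper written for this example and is not part of the Deepdub classes.

```python
def timestamp_to_seconds(stamp):
    """Convert 'min_sec' or 'h_min_sec' to total seconds (hypothetical helper)."""
    seconds = 0
    for part in stamp.split("_"):
        # Shift the running total up one unit (x60), then add the next field.
        seconds = seconds * 60 + int(part)
    return seconds


slice_from = "10_33"
slice_to = "11_00"

clip_length = timestamp_to_seconds(slice_to) - timestamp_to_seconds(slice_from)
print(clip_length)  # 27 seconds between 10:33 and 11:00
```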

In [5]:
deep_s = DeepdubSentence(project_name=DRAMA,
                         subtitle_path=SUBTITLE_DIR + f'/ep{EP}_eng.srt',
                         slice_from=slice_from,
                         slice_to=slice_to,
                         shift={"seconds": 3})
sentence_df = deep_s.get_sentences()

deep_a = DeepdubAudio(project_name=DRAMA,
                      sentence_df=sentence_df,
                      audio_path=f'{OUTPUT_DIR}/clip{EP}.wav')
deep_a.create_audio_segments()

Moviepy - Running:
>>> "+ " ".join(cmd)
Moviepy - Command successful
Moviepy - Running:
>>> "+ " ".join(cmd)
Moviepy - Command successful
Moviepy - Running:
>>> "+ " ".join(cmd)
Moviepy - Command successful
Moviepy - Running:
>>> "+ " ".join(cmd)
Moviepy - Command successful
Moviepy - Running:
>>> "+ " ".join(cmd)
Moviepy - Command successful
Moviepy - Running:
>>> "+ " ".join(cmd)
Moviepy - Command successful
Moviepy - Running:
>>> "+ " ".join(cmd)
Moviepy - Command successful
Moviepy - Running:
>>> "+ " ".join(cmd)
Moviepy - Command successful
Moviepy - Running:
>>> "+ " ".join(cmd)
Moviepy - Command successful
Moviepy - Running:
>>> "+ " ".join(cmd)
Moviepy - Command successful


In [7]:
deep_a.audio_df

| | start | end | hash | path |
|---|---|---|---|---|
| 0 | 1970-01-01 00:00:00.000 | 1900-01-01 00:00:17.373 | 18080109877677407453 | 18080109877677407453.wav |
| 1 | 1900-01-01 00:00:17.373 | 1900-01-01 00:00:19.342 | 11991633108791627229 | 11991633108791627229_gen.wav |
| 2 | 1900-01-01 00:00:19.342 | 1900-01-01 00:00:21.543 | 11763586976482689099 | 11763586976482689099_gen.wav |
| 3 | 1900-01-01 00:00:21.543 | 1900-01-01 00:00:21.743 | 8592613452553988599 | 8592613452553988599.wav |
| 4 | 1900-01-01 00:00:21.743 | 1900-01-01 00:00:23.712 | 15623740944573960892 | 15623740944573960892_gen.wav |
| 5 | 1900-01-01 00:00:23.712 | 1900-01-01 00:00:24.383 | 969872868272006427 | 969872868272006427.wav |
| 6 | 1900-01-01 00:00:24.383 | 1900-01-01 00:00:25.482 | 3767497895488922469 | 3767497895488922469_gen.wav |
| 7 | 1900-01-01 00:00:25.482 | 1900-01-01 00:00:25.812 | 17176160047264815362 | 17176160047264815362.wav |
| 8 | 1900-01-01 00:00:25.812 | 1900-01-01 00:00:27.653 | 2555627041485743295 | 2555627041485743295_gen.wav |
| 9 | 1900-01-01 00:00:27.653 | 1900-01-01 00:00:29.930 | 17587883856843243571 | 17587883856843243571.wav |


In [11]:
deep_a.extract_vocal_and_accompaniments()

INFO:tensorflow:Apply unet for vocals_spectrogram
Instructions for updating:
Colocations handled automatically by placer.
INFO:tensorflow:Apply unet for accompaniment_spectrogram
INFO:tensorflow:Restoring parameters from pretrained_models\2stems\model


In [8]:
sentence_df

| | start | end | sentence | hash |
|---|---|---|---|---|
| 0 | 1900-01-01 00:00:17.373 | 1900-01-01 00:00:19.342 | Why is it raining? I didn't see this on the we... | 11991633108791627229 |
| 1 | 1900-01-01 00:00:19.342 | 1900-01-01 00:00:21.543 | Gosh, I know. I got my hair done today. | 11763586976482689099 |
| 2 | 1900-01-01 00:00:21.743 | 1900-01-01 00:00:23.712 | - It's all wet. - My goodness, you're right. | 15623740944573960892 |
| 3 | 1900-01-01 00:00:24.383 | 1900-01-01 00:00:25.482 | Your hair is all wet. | 3767497895488922469 |
| 4 | 1900-01-01 00:00:25.812 | 1900-01-01 00:00:27.653 | That's because a fox is getting married today. | 2555627041485743295 |


In [None]:
import random
import numpy as np
from deep_speaker.audio import read_mfcc
from deep_speaker.batcher import sample_from_mfcc
from deep_speaker.constants import SAMPLE_RATE, NUM_FRAMES
from deep_speaker.conv_models import DeepSpeakerModel

np.random.seed(123)
random.seed(123)

model = DeepSpeakerModel()
model.m.load_weights(MODEL_PATH, by_name=True)


def get_embedding(h):
  """Call the model to get the embeddings of shape (1, 512) for each file.
  """
  mfcc = sample_from_mfcc(read_mfcc(f'{AUDIO_OUTPUT_DIR}/{h}_vocals.wav', SAMPLE_RATE), NUM_FRAMES)
  embedding = model.m.predict(np.expand_dims(mfcc, axis=0))
  return embedding


# Compute one embedding per sentence hash; Series.apply avoids the deprecated DataFrame.applymap.
sentence_df["embedding"] = sentence_df["hash"].apply(get_embedding)
sentence_df

In [None]:
sentence_df.info()