# 3. Clustering audio segments for speaker identification
In our 3<sup>rd</sup> step, the task is to cluster the `<hash>_vocals.wav` files generated in step 2. Because the accompaniment has been separated out into `<hash>_accompaniments.wav`, it no longer acts as noise, so the model can more easily identify each speaker's speech features.

## Why clustering?
Subtitles don't tell us who is speaking, so for every sentence we still need to answer "Who is speaking this sentence?". Clustering helps us identify the speaker for each (`sentence`, `<hash>_vocals.wav`) pair. Knowing the speaker lets the 4<sup>th</sup> step generate TTS for that particular speaker.

## Clustering on audio... never heard of it before?
Clustering raw audio directly is hard and impractical because of its high dimensionality.

What we can do instead is reduce the dimensionality by creating **embeddings**: vectors that represent the underlying speech input.
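To get a feel for the size gap, here is a rough back-of-the-envelope sketch. It assumes a 2-second segment (an illustrative value), the 44100 Hz stereo format that ffmpeg reports for `clip6.wav` in this notebook, and the 512-dimensional embedding produced by the model used later. The helper name `raw_dimensions` is made up for this example.

```python
# Sample rate and channel count as reported by ffmpeg for clip6.wav in this notebook.
SAMPLE_RATE_HZ = 44_100
CHANNELS = 2
# Size of the embedding vector produced by the speaker model used later.
EMBEDDING_DIM = 512


def raw_dimensions(duration_sec, sample_rate=SAMPLE_RATE_HZ, channels=CHANNELS):
    """Number of raw sample values in a clip of the given length."""
    return int(duration_sec * sample_rate * channels)


# A 2-second segment is an assumed, illustrative length.
raw_dims = raw_dimensions(2.0)
print(f"raw samples: {raw_dims}, embedding size: {EMBEDDING_DIM}")
print(f"reduction factor: {raw_dims / EMBEDDING_DIM:.1f}x")
```

Raw segments also differ in length from sentence to sentence, whereas every embedding has the same fixed size, which is what most clustering algorithms expect.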

We can take a model pretrained on a different task, such as deciding whether two utterances come from the same speaker, and apply transfer learning: reuse the trained model's hidden layers to generate a vector that serves as the embedding.

We can then cluster these embeddings and assign a unique ID to each cluster, treating each cluster as one particular speaker.
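As a minimal sketch of the idea, the snippet below groups toy 3-dimensional vectors with a simple greedy "leader" clustering based on cosine similarity. This is a stand-in chosen for brevity, not necessarily the algorithm used in this project; the threshold of 0.75 and the toy vectors are assumptions made up for illustration, and real embeddings would have 512 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def leader_cluster(embeddings, threshold=0.75):
    """Give each embedding the ID of the first cluster leader it resembles,
    or start a new cluster (a new speaker ID) if none is similar enough."""
    leaders = []
    labels = []
    for emb in embeddings:
        for speaker_id, leader in enumerate(leaders):
            if cosine_similarity(emb, leader) >= threshold:
                labels.append(speaker_id)
                break
        else:
            # No existing leader matched: this embedding starts a new cluster.
            leaders.append(emb)
            labels.append(len(leaders) - 1)
    return labels


# Toy stand-ins for speaker embeddings (hypothetical values).
toy_embeddings = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 1.0, 0.1],
    [0.1, 0.9, 0.0],
    [1.0, 0.0, 0.1],
]
speaker_ids = leader_cluster(toy_embeddings)
print(speaker_ids)
```

Each label plays the role of a speaker ID. In practice a library implementation such as agglomerative clustering or k-means may be a better fit, since the greedy approach depends on the order of the segments.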


In [4]:
# Don't run here; may be required in a production environment
# !apt install ffmpeg

In [None]:
!pip install --quiet spleeter
!pip install --quiet pysrt

In [None]:
!pip uninstall --quiet -y tensorflow
!pip install --upgrade --quiet tensorflow

In [None]:
!pip uninstall --quiet -y keras
!pip install --upgrade --quiet keras

In [1]:
from deepdub_sentence import DeepdubSentence
from deepdub_audio import DeepdubAudio
from IPython.display import Audio
from IPython.display import display

In [2]:
IN_COLAB = 'google.colab' in str(get_ipython())

if IN_COLAB:
  from google.colab import drive

  drive.mount('/content/gdrive')

In [6]:
DRAMA = 'hotel_del_luna'
EP = 6

if IN_COLAB:
  BASE_DIR = '/content/gdrive/MyDrive/Colab Project/Deepdub'
  OUTPUT_DIR = f'/content/gdrive/MyDrive/Colab Project/Deepdub/output_dir/{DRAMA}_{EP}'
  MODEL_PATH = '/content/gdrive/MyDrive/Colab Project/Deepdub/ResCNN_triplet_training_checkpoint_265.h5'
else:
  BASE_DIR = 'T:/pycharm_repo/Working_dir/Deepdub'
  OUTPUT_DIR = f'./output_dir/{DRAMA}_{EP}'
  MODEL_PATH = './pretrained_models/ResCNN_triplet_training_checkpoint_265.h5'

AUDIO_OUTPUT_DIR = f'{OUTPUT_DIR}/audio_segments'
SAMPLE_DIR = f'{BASE_DIR}/{DRAMA}'
SUBTITLE_DIR = f'{SAMPLE_DIR}/subtitles'

slice_from = "06_26"
slice_to = "8_50"

In [4]:
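# `!mkdir -p $OUTPUT_DIR` failed on Windows with "The syntax of the command is incorrect.",
# so create the directory from Python, which works on every platform.
import os
os.makedirs(OUTPUT_DIR, exist_ok=True)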


In [7]:
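def to_sec(min_sec):
  """
  Convert a string formatted as `min_sec` or `h_min_sec` to an int of total seconds.
  """
  parts = [int(p) for p in min_sec.split("_")]
  # Weight the fields from right to left: seconds, minutes, then hours.
  return sum(p * 60 ** i for i, p in enumerate(reversed(parts)))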
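As a quick usage sketch, the snippet below applies the same conversion to the `slice_from` and `slice_to` values set earlier. It uses a standalone helper named `to_seconds` (a name chosen for this example so that the block runs on its own) and also checks an `h_min_sec` style input.

```python
def to_seconds(stamp):
    """Convert 'min_sec' or 'h_min_sec' to total seconds (standalone helper for this example)."""
    parts = [int(p) for p in stamp.split("_")]
    # Rightmost field is seconds, then minutes, then hours.
    return sum(value * 60 ** power for power, value in enumerate(reversed(parts)))


start_sec = to_seconds("06_26")   # slice_from in this notebook
end_sec = to_seconds("8_50")      # slice_to in this notebook
print(start_sec, end_sec, end_sec - start_sec)

# An hour field is weighted by 3600.
print(to_seconds("1_02_03"))
```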

In [8]:
deep_s = DeepdubSentence(project_name=DRAMA,
                         subtitle_path=SUBTITLE_DIR + f'/ep{EP}_eng.srt',
                         slice_from=slice_from,
                         slice_to=slice_to)
sentence_df = deep_s.get_sentences()

deep_a = DeepdubAudio(project_name=DRAMA,
                      sentence_df=sentence_df,
                      audio_path=f'{OUTPUT_DIR}/clip{EP}.wav')
deep_a.create_audio_segments()

Moviepy - Running:
>>> "+ " ".join(cmd)
Moviepy - Command successful

OSError: ffmpeg version 4.2.2 Copyright (c) 2000-2019 the FFmpeg developers
  built with gcc 9.2.1 (GCC) 20200122
  configuration: --enable-gpl --enable-version3 --enable-sdl2 --enable-fontconfig --enable-gnutls --enable-iconv --enable-libass --enable-libdav1d --enable-libbluray --enable-libfreetype --enable-libmp3lame --enable-libopencore-amrnb --enable-libopencore-amrwb --enable-libopenjpeg --enable-libopus --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libtheora --enable-libtwolame --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxml2 --enable-libzimg --enable-lzma --enable-zlib --enable-gmp --enable-libvidstab --enable-libvorbis --enable-libvo-amrwbenc --enable-libmysofa --enable-libspeex --enable-libxvid --enable-libaom --enable-libmfx --enable-amf --enable-ffnvcodec --enable-cuvid --enable-d3d11va --enable-nvenc --enable-nvdec --enable-dxva2 --enable-avisynth --enable-libopenmpt
  libavutil      56. 31.100 / 56. 31.100
  libavcodec     58. 54.100 / 58. 54.100
  libavformat    58. 29.100 / 58. 29.100
  libavdevice    58.  8.100 / 58.  8.100
  libavfilter     7. 57.100 /  7. 57.100
  libswscale      5.  5.100 /  5.  5.100
  libswresample   3.  5.100 /  3.  5.100
  libpostproc    55.  5.100 / 55.  5.100
Guessed Channel Layout for Input Stream #0.0 : stereo
Input #0, wav, from './output_dir/hotel_del_luna_6/clip6.wav':
  Metadata:
    title           : Ep6
    encoder         : Lavf58.29.100
  Duration: 00:02:26.03, bitrate: 1411 kb/s
    Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 44100 Hz, stereo, s16, 1411 kb/s
Stream mapping:
  Stream #0:0 -> #0:0 (pcm_s16le (native) -> pcm_s16le (native))
Press [q] to stop, [?] for help
[trim for output stream 0:0 @ 000002804d7320c0] Value -57459000.000000 for parameter 'durationi' out of range [0 - 9.22337e+18]
[trim for output stream 0:0 @ 000002804d7320c0] Error configuring the atrim filterInput pad "default" with type audio of the filter instance "out_0_0" of abuffersink not connected to any source
Error reinitializing filters!
Failed to inject frame into filter network: Invalid argument
Error while processing the decoded data for stream #0:0
Conversion failed!


In [9]:
deep_a.audio_df

| | start | end | hash | path |
|---|---|---|---|---|
| 0 | 1970-01-01 00:00:00.000 | 1900-01-01 00:00:10.753 | 13268144354830886837 | 13268144354830886837.wav |
| 1 | 1900-01-01 00:00:10.753 | 1900-01-01 00:00:12.323 | 14497962861368294049 | 14497962861368294049_gen.wav |
| 2 | 1900-01-01 00:00:12.323 | 1900-01-01 00:00:13.282 | 9169325943668414471 | 9169325943668414471.wav |
| 3 | 1900-01-01 00:00:13.282 | 1900-01-01 00:00:14.753 | 18048373628256788383 | 18048373628256788383_gen.wav |
| 4 | 1900-01-01 00:00:14.753 | 1900-01-01 00:00:15.452 | 8433474818089108733 | 8433474818089108733.wav |
| ... | ... | ... | ... | ... |
| 61 | 1900-01-01 00:02:09.573 | 1900-01-01 00:02:17.272 | 929360901860661087 | 929360901860661087.wav |
| 62 | 1900-01-01 00:02:17.272 | 1900-01-01 00:02:18.543 | 8803771367536902491 | 8803771367536902491_gen.wav |
| 63 | 1900-01-01 00:02:18.543 | 1900-01-01 00:02:20.642 | 694919335453589821 | 694919335453589821.wav |
| 64 | 1900-01-01 00:02:20.642 | 1900-01-01 00:02:22.613 | 15778045738322715580 | 15778045738322715580_gen.wav |
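Row 0 in the output above has a `start` dated 1970-01-01 while its `end` is dated 1900-01-01, so a plain `end - start` for that row is negative. Whether this is what triggers the ffmpeg trim error shown earlier is not confirmed here. The following is only a minimal sketch of a sanity check that flags such rows before any trimming is attempted. It uses the standard library rather than the notebook's DataFrame, and the three rows are copied from the output above.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# (start, end) pairs copied from the first three rows of the output above.
rows = [
    ("1970-01-01 00:00:00.000", "1900-01-01 00:00:10.753"),
    ("1900-01-01 00:00:10.753", "1900-01-01 00:00:12.323"),
    ("1900-01-01 00:00:12.323", "1900-01-01 00:00:13.282"),
]


def duration_seconds(start, end):
    """Length of a segment in seconds; negative if end precedes start."""
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds()


# Collect the indices of rows whose duration is zero or negative.
bad_rows = [i for i, (start, end) in enumerate(rows) if duration_seconds(start, end) <= 0]
print(bad_rows)
```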


In [11]:
deep_a.extract_vocal_and_accompaniments()

INFO:tensorflow:Apply unet for vocals_spectrogram
Instructions for updating:
Colocations handled automatically by placer.
INFO:tensorflow:Apply unet for accompaniment_spectrogram
INFO:tensorflow:Restoring parameters from pretrained_models\2stems\model


In [8]:
sentence_df

| | start | end | sentence | hash |
|---|---|---|---|---|
| 0 | 1900-01-01 00:00:17.373 | 1900-01-01 00:00:19.342 | Why is it raining? I didn't see this on the we... | 11991633108791627229 |
| 1 | 1900-01-01 00:00:19.342 | 1900-01-01 00:00:21.543 | Gosh, I know. I got my hair done today. | 11763586976482689099 |
| 2 | 1900-01-01 00:00:21.743 | 1900-01-01 00:00:23.712 | - It's all wet. - My goodness, you're right. | 15623740944573960892 |
| 3 | 1900-01-01 00:00:24.383 | 1900-01-01 00:00:25.482 | Your hair is all wet. | 3767497895488922469 |
| 4 | 1900-01-01 00:00:25.812 | 1900-01-01 00:00:27.653 | That's because a fox is getting married today. | 2555627041485743295 |


In [None]:
import random
import numpy as np
from deep_speaker.audio import read_mfcc
from deep_speaker.batcher import sample_from_mfcc
from deep_speaker.constants import SAMPLE_RATE, NUM_FRAMES
from deep_speaker.conv_models import DeepSpeakerModel

np.random.seed(123)
random.seed(123)

model = DeepSpeakerModel()
model.m.load_weights(MODEL_PATH, by_name=True)


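def get_embedding(h):
  """Return the speaker embedding of shape (1, 512) for the vocals file with hash `h`."""
  # Read MFCC features from the separated vocals and sample a fixed number of frames.
  mfcc = sample_from_mfcc(read_mfcc(f'{AUDIO_OUTPUT_DIR}/{h}_vocals.wav', SAMPLE_RATE), NUM_FRAMES)
  # Add a batch dimension before calling the model.
  embedding = model.m.predict(np.expand_dims(mfcc, axis=0))
  return embedding


# Series.map applies the function to each hash; DataFrame.applymap is deprecated in recent pandas.
sentence_df["embedding"] = sentence_df["hash"].map(get_embedding)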
sentence_df

In [None]:
sentence_df.info()