# 2.1 Extract Background Sound Effects/accompaniment and Vocals
Background sound effects would act as noise in our next step, clustering.

Our task is to extract the sound effects/accompaniment (`<hash>_accompaniment.mp3`) as well as the speech, or vocals (`<hash>_vocals.mp3`). We will use the speech in the 3<sup>rd</sup> step, generate new speech in the 4<sup>th</sup> (`<hash>_gen_vocals.mp3`), and then add the accompaniment back to create the final audio (`<hash>_gen.mp3`).

For this task we will use the `sentence_df` generated by the `DeepdubSentence` class. It contains information only about sentences; everything else is background sound with no speech, so we can easily concatenate those parts back later.

In the context of music production, this kind of separation is sometimes referred to as *unmixing* or *demixing*.

> **Note**
>
> **Currently the split audio, i.e., the accompaniment and the speech/vocals, is shorter than the original by about 20 to 30 milliseconds (compare the durations printed further down).**
>
> This may or may not be a problem, depending on the length of the video.
>
> A possible solution is to add padding when merging `<hash>_gen_vocals.mp3` and `<hash>_accompaniment.mp3`, so that the result keeps the length of `<hash>.mp3`.
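
The padding idea from the note can be sketched as follows. This is a minimal illustration on toy NumPy arrays rather than decoded MP3 data. `pad_to_length` is a hypothetical helper, and padding at the end assumes the missing samples were lost from the tail of each clip, which has not been verified here.

```python
import numpy as np


def pad_to_length(samples, target_len):
    """Append zeros (silence) until `samples` has `target_len` entries.

    If the input is already longer, it is trimmed instead.
    """
    samples = np.asarray(samples, dtype=float)
    missing = target_len - len(samples)
    if missing <= 0:
        return samples[:target_len]
    return np.concatenate([samples, np.zeros(missing)])


# Toy stand-ins for decoded mono audio (assumed values, not real clips).
original_len = 10                  # length of <hash>.mp3 in samples
gen_vocals = np.full(8, 0.5)       # slightly shorter than the original
accompaniment = np.full(7, 0.25)   # also shorter

# Pad both stems to the original length, then mix them by summing.
merged = (pad_to_length(gen_vocals, original_len)
          + pad_to_length(accompaniment, original_len))
```

With real audio the same idea applies to the decoded sample arrays before re-encoding, so that every regenerated clip lines up with its original segment.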

In [16]:
from deepdub_sentence import DeepdubSentence
from deepdub_audio import DeepdubAudio
from pyffmpeg import FFprobe
from IPython.display import Audio
from IPython.display import display

In [2]:
BASE_DIR = 'T:/pycharm_repo/Working_dir/Deepdub'
DRAMA = '/tale_of_nine_tailed'

OUTPUT_DIR = './output_dir' + DRAMA
AUDIO_OUTPUT_DIR = OUTPUT_DIR + '/audio_segments'
SAMPLE_DIR = BASE_DIR + DRAMA
SUBTITLE_DIR = SAMPLE_DIR + '/subtitles/'
EP = 1

slice_from = "10_33"
slice_to = "11_00"

In [3]:
def to_sec(min_sec):
  """
  Convert a string formatted as `min_sec` or `h_min_sec` to an int of total seconds.
  """
  total = 0
  # Fold the fields left to right so an optional hours field is counted too
  for part in min_sec.split("_"):
    total = total * 60 + int(part)
  return total

In [4]:
deep_s = DeepdubSentence(subtitle_path=SUBTITLE_DIR + f'ep{EP}_eng.srt',
                         slice_from=slice_from,
                         slice_to=slice_to)
sentence_df = deep_s.get_sentences()

deep_a = DeepdubAudio(project_name=DRAMA,
                      sentence_df=sentence_df,
                      audio_path=f'{OUTPUT_DIR}/clip{EP}.mp3')
deep_a.create_audio_segments()

sentence_df

Moviepy - Running:
>>> "+ " ".join(cmd)
Moviepy - Command successful


| Unnamed: 0 | start | end | sentence | hash |
|---|---|---|---|---|
| 0 | 1900-01-01 00:00:14.373 | 1900-01-01 00:00:16.342 | Why is it raining? I didn't see this on the we... | 11264188807342635219 |
| 1 | 1900-01-01 00:00:16.342 | 1900-01-01 00:00:18.543 | Gosh, I know. I got my hair done today. | 6644551431160355997 |
| 2 | 1900-01-01 00:00:18.743 | 1900-01-01 00:00:20.712 | - It's all wet. - My goodness, you're right. | 6924691847184957568 |
| 3 | 1900-01-01 00:00:21.383 | 1900-01-01 00:00:22.482 | Your hair is all wet. | 17775764441792456539 |
| 4 | 1900-01-01 00:00:22.812 | 1900-01-01 00:00:24.653 | That's because a fox is getting married today. | 5738569255861897594 |
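
Because `sentence_df` lists only spoken sentences, the stretches of pure background sound are the gaps between consecutive rows. Here is a minimal sketch of recovering them, using the start and end times of the first three rows above. `background_gaps` is a hypothetical helper for illustration, not part of `DeepdubSentence`.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# (start, end) of the first three rows of the table above.
sentences = [
    ("1900-01-01 00:00:14.373", "1900-01-01 00:00:16.342"),
    ("1900-01-01 00:00:16.342", "1900-01-01 00:00:18.543"),
    ("1900-01-01 00:00:18.743", "1900-01-01 00:00:20.712"),
]


def background_gaps(rows):
    """Return (gap_start, gap_end) pairs where no sentence is spoken."""
    parsed = [(datetime.strptime(s, FMT), datetime.strptime(e, FMT))
              for s, e in rows]
    gaps = []
    # Compare each sentence's end with the next sentence's start.
    for (_, prev_end), (next_start, _) in zip(parsed, parsed[1:]):
        if next_start > prev_end:
            gaps.append((prev_end, next_start))
    return gaps


gaps = background_gaps(sentences)
gap_seconds = [(end - start).total_seconds() for start, end in gaps]
```

Rows 0 and 1 are back to back, so the only gap found here lies between rows 1 and 2.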


If we want to extract vocal content from a mix, we should first **somehow expose the structure of human speech**. Luckily, the [Short-Time Fourier Transform (STFT)](https://en.wikipedia.org/wiki/Short-time_Fourier_transform) can do this job. An STFT converts a signal from the time domain into a time-frequency representation by applying a Fourier transform to short, overlapping windows of the signal.

A great [video from 3Blue1Brown](https://www.youtube.com/watch?v=spUNpyF58BY) is also worth watching.

This [article](https://towardsdatascience.com/audio-ai-isolating-vocals-from-stereo-music-using-convolutional-neural-networks-210532383785) also explains what we need to do.
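
To make the STFT idea concrete, here is a small NumPy sketch. It is not how Spleeter computes its spectrograms internally; the frame length, hop size, sample rate, and test tones are arbitrary choices for illustration. The signal switches from a 440 Hz tone to a 1000 Hz tone halfway through, and the STFT shows which frequency dominates in each time frame.

```python
import numpy as np


def stft(signal, frame_len=256, hop=128):
    """Return a (frames, bins) complex array: one FFT per windowed frame."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)


sample_rate = 8000
t = np.arange(sample_rate) / sample_rate  # one second of audio

# 440 Hz for the first half second, 1000 Hz for the second half.
signal = np.where(t < 0.5,
                  np.sin(2 * np.pi * 440 * t),
                  np.sin(2 * np.pi * 1000 * t))

spec = np.abs(stft(signal))               # magnitude spectrogram
freqs = np.fft.rfftfreq(256, d=1 / sample_rate)

first_peak = freqs[spec[0].argmax()]      # dominant frequency, first frame
last_peak = freqs[spec[-1].argmax()]      # dominant frequency, last frame
```

A plain Fourier transform of the whole signal would report both tones without saying when each one occurs; the per-frame peaks recover that timing, up to the frequency resolution of one bin.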

### To split the audio into vocals and accompaniment I'll be using [**Spleeter**](https://github.com/deezer/spleeter) 
This works far better than I expected. In the future I might train a model myself, but for now quality matters most.

> **What does it do, and how does it work?**
>
> Audio is a time-domain signal, meaning that at any given time you get only an amplitude: the combined amplitude of multiple frequencies. What these *individual* frequencies are is hard to tell unless you use a Fourier transform.
>
> A Fourier transform converts a time-domain signal into a frequency-domain signal, effectively identifying which frequencies make up the sound. This can be represented as a [spectrogram](https://en.wikipedia.org/wiki/Spectrogram), the spectrum of frequencies of a signal as it varies with time.
>
> We can use this spectrogram (or sonograph) to identify which parts make up human speech and which don't.
>
> The model (a supervised one, trained on pairs of vocals and accompaniment in the [MUSDB](https://sigsep.github.io/datasets/musdb.html#musdb18-compressed-stems) dataset) then takes this spectrogram and generates a mask over it. Applying the mask to our spectrogram gives the vocal spectrogram, and applying its complement gives the accompaniment spectrogram.
>
> These generated spectrograms can be converted back to time-domain signals using the inverse Fourier transform, giving us the vocal and accompaniment audio.
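
The transform, mask, and inverse-transform pipeline described above can be sketched in NumPy. This is a toy illustration under loud assumptions, not Spleeter's implementation: the "vocals" and "accompaniment" are two pure tones, and the mask is a hand-made frequency threshold, whereas the real model predicts its mask with a neural network. The `stft` and `istft` helpers are hypothetical minimal versions written for this example.

```python
import numpy as np

FRAME, HOP = 512, 256
WINDOW = np.hanning(FRAME)


def stft(x):
    """One windowed FFT per frame; returns a (frames, bins) complex array."""
    n_frames = 1 + (len(x) - FRAME) // HOP
    frames = np.stack([x[i * HOP: i * HOP + FRAME] * WINDOW
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)


def istft(spec, length):
    """Overlap-add inverse, normalised by the summed squared window."""
    out = np.zeros(length)
    norm = np.zeros(length)
    for i, frame in enumerate(np.fft.irfft(spec, n=FRAME, axis=1)):
        start = i * HOP
        out[start: start + FRAME] += frame * WINDOW
        norm[start: start + FRAME] += WINDOW ** 2
    return out / np.maximum(norm, 1e-8)  # guard against division by zero


sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
speech_like = np.sin(2 * np.pi * 200 * t)              # stand-in for vocals
background_like = 0.5 * np.sin(2 * np.pi * 2000 * t)   # stand-in for accompaniment
mix = speech_like + background_like

spec = stft(mix)
freqs = np.fft.rfftfreq(FRAME, d=1 / sample_rate)

# Hand-made mask: keep everything below 1 kHz as "vocals" (an assumption
# that only works for this toy mix).
vocal_mask = freqs < 1000

vocals_est = istft(spec * vocal_mask, len(mix))
accomp_est = istft(spec * ~vocal_mask, len(mix))
```

Because the two masks are complements of each other, the two estimates add back up to the original mix wherever frames overlap fully; only the first and last few hundred samples, which this simple framing covers poorly, deviate.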

In [18]:
from spleeter.separator import Separator
from spleeter.audio import Codec


separator = Separator('spleeter:2stems')

for sentence_audio in (
  f'{AUDIO_OUTPUT_DIR}/{str(row.hash)}.mp3' for i, row in sentence_df.iterrows()):
  separator.separate_to_file(audio_descriptor=sentence_audio,
                             destination=AUDIO_OUTPUT_DIR,
                             synchronous=False,
                             codec=Codec.MP3,
                             filename_format='{filename}_{instrument}.{codec}',
                             bitrate='320k')
separator.join()

INFO:tensorflow:Apply unet for vocals_spectrogram
INFO:tensorflow:Apply unet for accompaniment_spectrogram
INFO:tensorflow:Restoring parameters from pretrained_models\2stems\model


In [19]:
for sentence in (
  f'{AUDIO_OUTPUT_DIR}/{str(row.hash)}' for i, row in sentence_df.iterrows()):
  file_org = f'{sentence}.mp3'
  file_vocals = f'{sentence}_vocals.mp3'
  file_accom = f'{sentence}_accompaniment.mp3'

  print(f'id: {sentence.split("/")[-1]}')
  print(f'Original duration: {FFprobe(file_org).duration}')
  print(f'Vocals duration: {FFprobe(file_vocals).duration}')
  print(f'Accompaniment duration: {FFprobe(file_accom).duration}')
  print("\n")

id: 11264188807342635219
Original duration: 00:00:02.01
Vocals duration: 00:00:01.99
Accompaniment duration: 00:00:01.99


id: 6644551431160355997
Original duration: 00:00:02.25
Vocals duration: 00:00:02.22
Accompaniment duration: 00:00:02.22


id: 6924691847184957568
Original duration: 00:00:02.01
Vocals duration: 00:00:01.99
Accompaniment duration: 00:00:01.99


id: 17775764441792456539
Original duration: 00:00:01.15
Vocals duration: 00:00:01.12
Accompaniment duration: 00:00:01.12


id: 5738569255861897594
Original duration: 00:00:01.88
Vocals duration: 00:00:01.85
Accompaniment duration: 00:00:01.85




In [17]:
for sentence in (
  f'{AUDIO_OUTPUT_DIR}/{str(row.hash)}' for i, row in sentence_df.iterrows()):
  file_org = f'{sentence}.mp3'
  file_vocals = f'{sentence}_vocals.mp3'
  file_accom = f'{sentence}_accompaniment.mp3'
  
  print(f'id: {sentence.split("/")[-1]}')
  display(Audio(file_org))
  display(Audio(file_vocals))
  display(Audio(file_accom))

id: 11264188807342635219


id: 6644551431160355997


id: 6924691847184957568


id: 17775764441792456539


id: 5738569255861897594
