# 2 Source separation for audio segments
## Extract Background Sound Effects/accompaniment and Vocals
Background sound effects act as noise for our next step, clustering.

Our task is to extract the sound effects/accompaniment (`<hash>_accompaniment.wav`) as well as the speech, or vocals (`<hash>_vocals.wav`). We will use the speech in the 3<sup>rd</sup> step, generate new speech in the 4<sup>th</sup> (`<hash>_gen_vocals.wav`), and then add the accompaniment back to create `<hash>_gen.wav`.
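
The final mixing step can be sketched with plain Python lists of samples. This is only an illustration, not the project's implementation: `fit_length` and `mix_to_length` are hypothetical helpers, the samples are assumed to be mono floats in [-1, 1] at the same sample rate, and a track that is too short is zero-padded (a track that is too long is truncated) to the length of the original segment.

```python
def fit_length(samples, length):
    """Zero-pad or truncate a list of samples to exactly `length` items."""
    return (samples + [0.0] * max(0, length - len(samples)))[:length]


def mix_to_length(gen_vocals, accompaniment, length):
    """Sum two mono tracks sample by sample, clipping the result to [-1, 1]."""
    vocals = fit_length(gen_vocals, length)
    accomp = fit_length(accompaniment, length)
    return [max(-1.0, min(1.0, v + a)) for v, a in zip(vocals, accomp)]


# The generated speech is one sample short; the accompaniment is one sample long.
mixed = mix_to_length([0.5, 0.25, -0.5], [0.25, 0.25, -0.75, 0.5, 0.1], length=4)
print(mixed)  # [0.75, 0.5, -1.0, 0.5]
```

Fixing the output length to that of the original segment keeps the regenerated segments aligned when they are later concatenated.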

For this task we will use the `sentence_df` generated by the `DeepdubSentence` class. It contains information only about sentences; everything else is background sound where no sentence is spoken, so those gaps can simply be concatenated back later.
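
A rough sketch of how sentence intervals and the gaps between them can be interleaved into one timeline is shown here. `build_timeline` is a hypothetical helper, not part of `DeepdubSentence` or `DeepdubAudio`; the sentence times are the first three rows of the sample clip, and the 25-second total duration is an assumption made for the example.

```python
def build_timeline(sentences, total_duration):
    """Interleave sentence intervals with the gaps between them.

    `sentences` is a sorted list of (start, end) pairs in seconds.
    Returns (start, end, is_sentence) tuples covering [0, total_duration].
    """
    timeline = []
    cursor = 0.0
    for start, end in sentences:
        if start > cursor:
            timeline.append((cursor, start, False))  # gap: background only
        timeline.append((start, end, True))          # spoken sentence
        cursor = end
    if cursor < total_duration:
        timeline.append((cursor, total_duration, False))  # trailing gap
    return timeline


segments = build_timeline(
    [(17.373, 19.342), (19.342, 21.543), (21.743, 23.712)], 25.0)
for segment in segments:
    print(segment)
```

Only the segments flagged as sentences need source separation; the gap segments can be copied through unchanged.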

In the context of music production, source separation is sometimes referred to as *unmixing* or *demixing*.

> ~**Note**~ <span style="color: green;">**RESOLVED by moving from mp3 to wav format (a lossless format)**</span>
>
> ~**Currently the split audio, i.e., accompaniment and speech/vocals, differs in length from the original by around 2-3 microseconds.**~
>
> ~This may or may not be a problem, depending on the length of the video.~
>
> ~A solution might be to add padding when merging `<hash>_gen_vocals.mp3` and `<hash>_accompaniment.mp3`, keeping the length of `<hash>.mp3`.~

> **But this creates <a href="#new-problem">a small problem that previously went unnoticed</a>.**

In [1]:
from deepdub_sentence import DeepdubSentence
from deepdub_audio import DeepdubAudio
from moviepy.editor import AudioFileClip
from IPython.display import Audio
from IPython.display import display

In [2]:
BASE_DIR = 'T:/pycharm_repo/Working_dir/Deepdub'
DRAMA = '/tale_of_nine_tailed'

OUTPUT_DIR = './output_dir' + DRAMA
AUDIO_OUTPUT_DIR = OUTPUT_DIR + '/audio_segments'
SAMPLE_DIR = BASE_DIR + DRAMA
SUBTITLE_DIR = SAMPLE_DIR + '/subtitles'
EP = 1

slice_from = "10_33"
slice_to = "11_00"

In [3]:
def to_sec(min_sec):
  """
  Convert a string formatted as `min_sec` or `h_min_sec` to an int of total seconds.
  """
  total = 0
  # Each part is worth 60x the part after it, so hours are counted as well.
  for part in min_sec.split("_"):
    total = total * 60 + int(part)
  return total

In [28]:
deep_s = DeepdubSentence(project_name=DRAMA,
                         subtitle_path=SUBTITLE_DIR + f'/ep{EP}_eng.srt',
                         slice_from=slice_from,
                         slice_to=slice_to,
                         shift={"seconds": 3})
sentence_df = deep_s.get_sentences()

deep_a = DeepdubAudio(project_name=DRAMA,
                      sentence_df=sentence_df,
                      audio_path=f'{OUTPUT_DIR}/clip{EP}.wav')
deep_a.create_audio_segments()

sentence_df

Moviepy - Running:
>>> "+ " ".join(cmd)
Moviepy - Command successful


| | start | end | sentence | hash |
|---|---|---|---|---|
| 0 | 1900-01-01 00:00:17.373 | 1900-01-01 00:00:19.342 | Why is it raining? I didn't see this on the we... | 11991633108791627229 |
| 1 | 1900-01-01 00:00:19.342 | 1900-01-01 00:00:21.543 | Gosh, I know. I got my hair done today. | 11763586976482689099 |
| 2 | 1900-01-01 00:00:21.743 | 1900-01-01 00:00:23.712 | - It's all wet. - My goodness, you're right. | 15623740944573960892 |
| 3 | 1900-01-01 00:00:24.383 | 1900-01-01 00:00:25.482 | Your hair is all wet. | 3767497895488922469 |
| 4 | 1900-01-01 00:00:25.812 | 1900-01-01 00:00:27.653 | That's because a fox is getting married today. | 2555627041485743295 |


In [29]:
deep_a.audio_df

| | start | end | hash | path |
|---|---|---|---|---|
| 0 | 1970-01-01 00:00:00.000 | 1900-01-01 00:00:17.373 | 18080109877677407453 | 18080109877677407453.wav |
| 1 | 1900-01-01 00:00:17.373 | 1900-01-01 00:00:19.342 | 11991633108791627229 | 11991633108791627229_gen.wav |
| 2 | 1900-01-01 00:00:19.342 | 1900-01-01 00:00:21.543 | 11763586976482689099 | 11763586976482689099_gen.wav |
| 3 | 1900-01-01 00:00:21.543 | 1900-01-01 00:00:21.743 | 8592613452553988599 | 8592613452553988599.wav |
| 4 | 1900-01-01 00:00:21.743 | 1900-01-01 00:00:23.712 | 15623740944573960892 | 15623740944573960892_gen.wav |
| 5 | 1900-01-01 00:00:23.712 | 1900-01-01 00:00:24.383 | 969872868272006427 | 969872868272006427.wav |
| 6 | 1900-01-01 00:00:24.383 | 1900-01-01 00:00:25.482 | 3767497895488922469 | 3767497895488922469_gen.wav |
| 7 | 1900-01-01 00:00:25.482 | 1900-01-01 00:00:25.812 | 17176160047264815362 | 17176160047264815362.wav |
| 8 | 1900-01-01 00:00:25.812 | 1900-01-01 00:00:27.653 | 2555627041485743295 | 2555627041485743295_gen.wav |
| 9 | 1900-01-01 00:00:27.653 | 1900-01-01 00:00:29.930 | 17587883856843243571 | 17587883856843243571.wav |


If we want to extract vocal content from a mix, we should first **somehow expose the structure of human speech**. Luckily, the [Short-Time Fourier Transform (STFT)](https://en.wikipedia.org/wiki/Short-time_Fourier_transform) can do this job. An STFT converts a signal from the time domain into a time-frequency representation by applying a Fourier transform to short, overlapping windows of the signal.
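
To make the idea concrete, here is a minimal STFT sketch using only the standard library. It is an illustration under stated assumptions, not what Spleeter runs internally: `stft` is a hypothetical helper, the DFT is the naive O(N²) form (real libraries use an FFT), and the 8 kHz sample rate and 1 kHz test tone are toy values chosen so the tone lands exactly on one frequency bin.

```python
import cmath
import math


def stft(signal, frame_size, hop):
    """Return one magnitude spectrum (list of floats) per windowed frame."""
    # A Hann window reduces spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size)
              for n in range(frame_size)]
    spectra = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_size)]
        # Naive DFT, keeping only the non-negative frequency bins.
        spectrum = [
            abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                    for n in range(frame_size)))
            for k in range(frame_size // 2 + 1)
        ]
        spectra.append(spectrum)
    return spectra


sample_rate = 8000   # Hz, assumed for the toy signal
frame_size = 64
tone_hz = 1000       # falls exactly on bin 1000 / (8000 / 64) = 8
signal = [math.sin(2 * math.pi * tone_hz * n / sample_rate) for n in range(256)]

spectra = stft(signal, frame_size, hop=32)
peak_bins = [max(range(len(s)), key=s.__getitem__) for s in spectra]
print(len(spectra), peak_bins)
```

Each row of `spectra` is one column of a spectrogram: the time axis comes from the frame index, the frequency axis from the bin index.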

A great [video from 3Blue1Brown](https://www.youtube.com/watch?v=spUNpyF58BY) is also worth watching.

This [article](https://towardsdatascience.com/audio-ai-isolating-vocals-from-stereo-music-using-convolutional-neural-networks-210532383785) also explains what we need to do.

### To split the audio into vocals and accompaniment I'll be using [**Spleeter**](https://github.com/deezer/spleeter)
This works far better than I expected, likely because it was [trained on a private dataset that can't be released for copyright reasons](https://archives.ismir.net/ismir2019/latebreaking/000036.pdf) and **took a full week to train on a single GPU**.

In the future I might train one myself.

> **What does it do, and how does it work?**
>
> Audio is a time-domain signal: at any given instant you get only an amplitude, which is the combined amplitude of many frequencies. Which *individual* frequencies are present is hard to tell unless you use a Fourier transform.
>
> A Fourier transform converts a time-domain signal into a frequency-domain signal, effectively identifying which frequencies make up the sound. Applied to short successive windows, it yields a [spectrogram](https://en.wikipedia.org/wiki/Spectrogram): the spectrum of frequencies of a signal as it varies with time.
>
> We can use this spectrogram (or sonograph) to identify which parts make up human speech and which don't.
>
> The model is supervised: U-Nets pretrained on pairs of mix spectrograms and isolated source spectrograms from a private dataset, with an $L_1$-norm training loss between the masked input mix spectrograms and the source target spectrograms. It takes the mix spectrogram and estimates a mask for each source. Applying the vocals mask to the mix spectrogram gives the vocal spectrogram, and the accompaniment is obtained the same way from its own mask. The masks are soft (values between 0 and 1) rather than strictly boolean, and each source has its own U-Net, as the log output of the separation cell shows.
>
> The resulting spectrograms can be converted back to time-domain signals using the inverse Fourier transform, giving us the vocal and accompaniment audio.
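
The masking step can be sketched on a single spectrogram frame. The sketch uses a ratio mask, where each value lies between 0 and 1 rather than being a hard on/off decision. Everything here is illustrative: `ratio_masks` and `apply_mask` are hypothetical helpers, the magnitudes are made up, and the two "estimates" stand in for what the vocals and accompaniment networks would predict. It is not Spleeter's actual code.

```python
def ratio_masks(vocal_est, accomp_est, eps=1e-10):
    """Per-bin soft masks in [0, 1] from two magnitude estimates."""
    vocal_mask = [v / (v + a + eps) for v, a in zip(vocal_est, accomp_est)]
    accomp_mask = [a / (v + a + eps) for v, a in zip(vocal_est, accomp_est)]
    return vocal_mask, accomp_mask


def apply_mask(mask, mixture):
    """Scale each mixture bin by its mask value."""
    return [m * x for m, x in zip(mask, mixture)]


# One spectrogram frame with four frequency bins (made-up magnitudes).
mixture = [1.0, 4.0, 2.0, 8.0]
vocal_est = [0.0, 3.0, 1.0, 2.0]     # what a vocals model might predict
accomp_est = [1.0, 1.0, 1.0, 6.0]    # what an accompaniment model might predict

vocal_mask, accomp_mask = ratio_masks(vocal_est, accomp_est)
vocals = apply_mask(vocal_mask, mixture)
accompaniment = apply_mask(accomp_mask, mixture)
print([round(v, 3) for v in vocals])
print([round(a, 3) for a in accompaniment])
```

Because the two masks sum to (almost exactly) one in every bin, the separated frames add back up to the mixture, which is why the vocals and accompaniment can later be recombined without losing energy.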

In [30]:
from spleeter.separator import Separator
from spleeter.audio import Codec


separator = Separator('spleeter:2stems')

for sentence_audio in (
  f'{AUDIO_OUTPUT_DIR}/{str(row.hash)}.wav' for i, row in sentence_df.iterrows()):
  separator.separate_to_file(audio_descriptor=sentence_audio,
                             destination=AUDIO_OUTPUT_DIR,
                             synchronous=False,
                             codec=Codec.WAV,
                             filename_format='{filename}_{instrument}.{codec}')
separator.join()

INFO:tensorflow:Apply unet for vocals_spectrogram
INFO:tensorflow:Apply unet for accompaniment_spectrogram
INFO:tensorflow:Restoring parameters from pretrained_models\2stems\model


In [31]:
sentence_df.end - sentence_df.start

0   0 days 00:00:01.969000
1   0 days 00:00:02.201000
2   0 days 00:00:01.969000
3   0 days 00:00:01.099000
4   0 days 00:00:01.841000
dtype: timedelta64[ns]

In [32]:
for sentence in (
  f'{AUDIO_OUTPUT_DIR}/{str(row.hash)}'
  for i, row in sentence_df.reset_index().iterrows()):
  file_org = f'{sentence}.wav'
  file_vocals = f'{sentence}_vocals.wav'
  file_accom = f'{sentence}_accompaniment.wav'

  print(f'id: {sentence.split("/")[-1]}')
  print(f'Original duration: {AudioFileClip(file_org).duration}')
  print(f'Vocals duration: {AudioFileClip(file_vocals).duration}')
  print(f'Accompaniment duration: {AudioFileClip(file_accom).duration}')
  print("\n")

id: 11991633108791627229
Original duration: 1.97
Vocals duration: 1.97
Accompaniment duration: 1.97


id: 11763586976482689099
Original duration: 2.2
Vocals duration: 2.2
Accompaniment duration: 2.2


id: 15623740944573960892
Original duration: 1.97
Vocals duration: 1.97
Accompaniment duration: 1.97


id: 3767497895488922469
Original duration: 1.1
Vocals duration: 1.1
Accompaniment duration: 1.1


id: 2555627041485743295
Original duration: 1.84
Vocals duration: 1.84
Accompaniment duration: 1.84
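
The durations printed above are rounded, so a stricter check is to compare frame counts directly. The following is a sketch using Python's standard `wave` module rather than moviepy; `wav_duration` and `write_silence` are hypothetical helpers, and the silent files written to a temporary directory only stand in for a real segment and its two stems.

```python
import os
import tempfile
import wave


def wav_duration(path):
    """Duration of a WAV file in seconds, from its frame count and rate."""
    with wave.open(path, 'rb') as wav:
        return wav.getnframes() / wav.getframerate()


def write_silence(path, n_frames, sample_rate=44100):
    """Write a mono 16-bit WAV of silence (stand-in for a real segment)."""
    with wave.open(path, 'wb') as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b'\x00\x00' * n_frames)


with tempfile.TemporaryDirectory() as tmp:
    durations = {}
    for suffix in ('', '_vocals', '_accompaniment'):
        path = os.path.join(tmp, f'segment{suffix}.wav')
        write_silence(path, n_frames=44100 * 2)
        durations[suffix or 'original'] = wav_duration(path)

# All three files should match the original to within one sample.
tolerance = 1 / 44100
lengths_match = all(abs(d - durations['original']) <= tolerance
                    for d in durations.values())
print(durations, lengths_match)
```

With real segments, the same `wav_duration` call could be pointed at `<hash>.wav`, `<hash>_vocals.wav`, and `<hash>_accompaniment.wav`.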




In [33]:
for sentence in (
  f'{AUDIO_OUTPUT_DIR}/{str(row.hash)}' for i, row in sentence_df.iterrows()):
  file_org = f'{sentence}.wav'
  file_vocals = f'{sentence}_vocals.wav'
  file_accom = f'{sentence}_accompaniment.wav'

  print(f'id: {sentence.split("/")[-1]}')
  display(Audio(file_org))
  display(Audio(file_vocals))
  display(Audio(file_accom))

id: 11991633108791627229


id: 11763586976482689099


id: 15623740944573960892


id: 3767497895488922469


id: 2555627041485743295


# A new problem<a id="new-problem"></a> 
We depend completely on subtitles to identify where sentences are spoken, and we can't assume that the subtitles will always align perfectly with what is spoken.

This problem must be addressed.

We should provide an option to shift the subtitles slightly forward or backward when they don't match what is actually spoken on screen.

> <span style="color:green;"> **FIXED** </span>
>
> Added a `shift` parameter to the `__init__` method of `DeepdubSentence` to shift the subtitles.
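
One way such a shift could work is sketched below. This is a guess at the mechanism, not the actual `DeepdubSentence` code: it assumes the `shift` dictionary (for example `{"seconds": 3}`) is passed straight to `datetime.timedelta` as keyword arguments, and `shift_interval` is a hypothetical helper that also clamps the result so a backward shift never moves a subtitle before the start of the clip.

```python
from datetime import datetime, timedelta


def shift_interval(start, end, shift, floor):
    """Shift a subtitle interval by `shift`, never moving before `floor`."""
    delta = timedelta(**shift)
    return max(floor, start + delta), max(floor, end + delta)


origin = datetime(1900, 1, 1)  # zero point, matching the timestamps in the tables
start = origin + timedelta(seconds=14, milliseconds=373)  # toy values
end = origin + timedelta(seconds=16, milliseconds=342)

new_start, new_end = shift_interval(start, end, {"seconds": 3}, floor=origin)
print(new_start.time(), new_end.time())  # 00:00:17.373000 00:00:19.342000
```

A negative value such as `{"seconds": -3}` shifts the subtitles earlier instead.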