
# Extract vocals from a song



## 0. How to use
1. Run sections 1 - 3 to install the necessary dependencies and load the model.
2. Edit and run 4A to download a song.
3. Edit and run 4B to set the input and output files.
4. Run the sections after 4B to extract the vocals.

To extract vocals from multiple songs:
1. Repeat steps 2 - 4 for each song you download from YouTube.
2. Repeat steps 3 - 4 for each file you have already uploaded to Google Colab.

## 1. Origin of the code

The code is adapted from the torchaudio Hybrid Demucs tutorial:
https://pytorch.org/audio/main/tutorials/hybrid_demucs_tutorial.html



## 2. Preparation

First, we install the necessary dependencies. The first requirements are
``torch`` and ``torchaudio``.




In [1]:
import torch
import torchaudio

print(torch.__version__)
print(torchaudio.__version__)

1.13.1+cu116
0.13.1+cu116


In addition to ``torchaudio``, ``mir_eval`` is required to perform
signal-to-distortion ratio (SDR) calculations. To install ``mir_eval``
please use ``pip3 install mir_eval``.
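For orientation, SDR in its simplest form is the ratio of the reference signal's energy to the energy of the estimation error, expressed in decibels. The sketch below computes that plain energy-ratio form in pure Python. The helper name ``simple_sdr`` is hypothetical, and this is not what ``mir_eval`` computes internally: its BSS Eval metrics use a more involved decomposition of the error.

```python
import math


def simple_sdr(reference, estimate):
    """Plain energy-ratio SDR in dB (hypothetical helper, not mir_eval's BSS Eval)."""
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if error_energy == 0:
        return math.inf  # perfect reconstruction
    if signal_energy == 0:
        return -math.inf  # silent reference, non-silent estimate
    return 10 * math.log10(signal_energy / error_energy)


# An estimate that is uniformly 10% too quiet scores about 20 dB.
reference = [1.0, 0.0, -1.0, 0.0]
estimate = [0.9, 0.0, -0.9, 0.0]
print(f"{simple_sdr(reference, estimate):.1f} dB")
```

A higher value means the estimate is closer to the reference.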




In [None]:
!pip3 uninstall -y torch torchvision torchaudio
!pip3 install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cpu
!pip3 install mir_eval

In [4]:
from IPython.display import Audio
from torchaudio.utils import download_asset
import matplotlib.pyplot as plt

try:
    from torchaudio.pipelines import HDEMUCS_HIGH_MUSDB_PLUS
    from mir_eval import separation

except ModuleNotFoundError:
    try:
        import google.colab

        print(
            """
            To enable running this notebook in Google Colab, install nightly
            torch and torchaudio builds by adding the following code block to the top
            of the notebook before running it:
            !pip3 uninstall -y torch torchvision torchaudio
            !pip3 install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cpu
            !pip3 install mir_eval
            """
        )
    except ModuleNotFoundError:
        pass
    raise

## 3. Construct the pipeline

Pre-trained model weights and related pipeline components are bundled as
``torchaudio.pipelines.HDEMUCS_HIGH_MUSDB_PLUS``. This is a
``torchaudio.models.HDemucs`` model trained on
[MUSDB18-HQ](https://zenodo.org/record/3338373) and additional
internal training data.
This specific model is suited for higher sample rates, around 44.1 kHz,
and has an nfft value of 4096 with a depth of 6 in the model implementation.



In [86]:
bundle = HDEMUCS_HIGH_MUSDB_PLUS

model = bundle.get_model()

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model.to(device)

sample_rate = bundle.sample_rate
# sample_rate = 48000

print(f"Sample rate: {sample_rate}")

Sample rate: 44100


## 4A. Download a song from YouTube for testing
Run the cells below again for each additional song you want to download.

Make sure the audio channel is stereo.

In [107]:
# Set up youtube-dl: install it only if it is not already available
try:
    import youtube_dl
except ModuleNotFoundError:
    !pip3 install youtube-dl

import youtube_dl
import google


In [121]:
#download one song


url = "https://www.youtube.com/watch?v=CinYtExTp5o"
!youtube-dl --ignore-errors --format bestaudio --extract-audio --audio-format wav --audio-quality 0 --output "music_1.%(ext)s" {url}
# !youtube-dl --ignore-errors --format bestaudio --extract-audio --audio-format wav --audio-quality 0 --output "./%(title)s.%(ext)s" {url}

[youtube] CinYtExTp5o: Downloading webpage
[download] Destination: music_1.webm
[download] 100% of 3.55MiB in 00:59
[ffmpeg] Destination: music_1.wav
Deleting original file music_1.webm (pass -k to keep)


In [None]:
#download playlist

# url = "https://www.youtube.com/playlist?list=PL4wE0_pD37K-NBALqjTGOMhcemYEdzCbB"
# !youtube-dl --ignore-errors --format bestaudio --extract-audio --audio-format wav --audio-quality 0 --output "./downloads/%(title)s.%(ext)s" --yes-playlist {url}

## 4B. Set up the input and output files
Run the cells below again for each song you want to process.

Make sure the audio is stereo.

To process a large number of songs, wrap the steps after 4B in your own loop.
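One way to prepare such a loop is to build the list of input and output paths first. The sketch below assumes the songs are ``.wav`` files in a single folder and names each output after its input with a ``_vocal`` suffix, as in ``music_1_vocal.wav``. The helper ``vocal_jobs`` and the folder layout are illustrative assumptions, not part of the original notebook.

```python
from pathlib import Path


def vocal_jobs(folder="."):
    """Return (input_file, output_file) pairs for every .wav file in `folder`.

    Hypothetical helper: skips files that already look like extracted vocals.
    """
    jobs = []
    for wav in sorted(Path(folder).glob("*.wav")):
        if wav.stem.endswith("_vocal"):
            continue  # output of an earlier run, not a song to process
        output = wav.with_name(f"{wav.stem}_vocal.wav")
        jobs.append((str(wav), str(output)))
    return jobs


for input_file, output_file in vocal_jobs("."):
    # Run the code from sections 5 and 6 here for each pair.
    print(f"{input_file} -> {output_file}")
```

Building the list up front makes it easy to review which files will be processed before starting a long run.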

In [110]:
input_file = "music_1.wav"
output_file = "music_1_vocal.wav"

## 5. Configure the application function

Because ``HDemucs`` is a large and memory-consuming model, it is
very difficult to have sufficient memory to apply the model to
an entire song at once. To work around this limitation, we
obtain the separated sources of a full song by
chunking the song into smaller segments, running each segment through the
model, and then reassembling the results.

When doing this, it is important to ensure some
overlap between the chunks to account for artifacts at the
edges. Due to the nature of the model, the edges sometimes contain
inaccurate or undesired sounds.

We provide a sample implementation of chunking and reassembly below. This
implementation overlaps adjacent chunks, by 0.1 seconds with the default
``overlap=0.1`` (the overlap is ``overlap * sample_rate`` frames), and applies
a linear fade-in and fade-out across each overlap. We add the faded
segments together to keep the volume constant throughout.
This reduces the artifacts by using less of the edges of the
model outputs.
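To see why linear fades keep the volume constant, consider the weights applied inside one overlap region: the outgoing chunk fades from 1 to 0 while the incoming chunk fades from 0 to 1, so the two weights sum to 1 at every frame. The sketch below checks this in plain Python with a hypothetical 5-frame overlap. It illustrates the idea only and does not reproduce the exact sample positions used by ``torchaudio.transforms.Fade``.

```python
def linear_fade_weights(n_frames):
    """Return (fade_out, fade_in) weights for an overlap of n_frames (n_frames >= 2)."""
    fade_in = [i / (n_frames - 1) for i in range(n_frames)]
    fade_out = [1.0 - w for w in fade_in]
    return fade_out, fade_in


fade_out, fade_in = linear_fade_weights(5)

# A constant signal of amplitude 0.5 is present in both overlapping chunks.
tail_of_previous_chunk = [0.5] * 5
head_of_next_chunk = [0.5] * 5

# Weighted sum of the two chunks inside the overlap region.
blended = [
    a * w_out + b * w_in
    for a, w_out, b, w_in in zip(
        tail_of_previous_chunk, fade_out, head_of_next_chunk, fade_in
    )
]
print(blended)  # every value stays at 0.5
```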

<img src="https://download.pytorch.org/torchaudio/tutorial-assets/HDemucs_Drawing.jpg">



In [111]:
from torchaudio.transforms import Fade


def separate_sources(
        model,
        mix,
        segment=10.,
        overlap=0.1,
        device=None,
):
    """
    Apply the model to a mixture chunk by chunk, fading the chunk edges and
    summing the chunks into a full-length output.

    Relies on the notebook-level ``sample_rate`` variable.

    Args:
        model: separation model that exposes ``sources`` and ``forward``
        mix (torch.Tensor): mixture of shape (batch, channels, length)
        segment (float): segment length in seconds
        overlap (float): extends each chunk by this fraction of ``segment``;
            the fade length is ``overlap * sample_rate`` frames
        device (torch.device, str, or None): device on which the output
            tensor is allocated; defaults to `mix.device`.
    """
    if device is None:
        device = mix.device
    else:
        device = torch.device(device)

    batch, channels, length = mix.shape

    chunk_len = int(sample_rate * segment * (1 + overlap))
    start = 0
    end = chunk_len
    overlap_frames = overlap * sample_rate
    fade = Fade(fade_in_len=0, fade_out_len=int(overlap_frames), fade_shape='linear')

    final = torch.zeros(batch, len(model.sources), channels, length, device=device)

    while start < length - overlap_frames:
        chunk = mix[:, :, start:end]
        with torch.no_grad():
            out = model.forward(chunk)
        out = fade(out)
        final[:, :, :, start:end] += out
        if start == 0:
            fade.fade_in_len = int(overlap_frames)
            start += int(chunk_len - overlap_frames)
        else:
            start += chunk_len
        end += chunk_len
        if end >= length:
            fade.fade_out_len = 0
    return final


def plot_spectrogram(stft, title="Spectrogram"):
    magnitude = stft.abs()
    spectrogram = 20 * torch.log10(magnitude + 1e-8).numpy()
    figure, axis = plt.subplots(1, 1)
    img = axis.imshow(spectrogram, cmap="viridis", vmin=-60, vmax=0, origin="lower", aspect="auto")
    figure.suptitle(title)
    plt.colorbar(img, ax=axis)
    plt.show()

## 6. Run Model

Finally, we run the model and save the separated vocals to ``output_file``.

The original tutorial uses A Classic Education by NightOwl from
MedleyDB (Creative Commons BY-NC-SA 4.0) as its test song. That song is also
part of the [MUSDB18-HQ](https://zenodo.org/record/3338373) dataset, within
the ``train`` sources. This notebook separates ``input_file`` from section 4B instead.

To test with a different song, change ``input_file`` and ``output_file``
in 4B, along with the ``segment`` and ``overlap`` parameters below.




In [112]:
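# Load the input file chosen in 4B. The original tutorial downloads a sample song instead:
# SAMPLE_SONG = download_asset("tutorial-assets/hdemucs_mix.wav")
SAMPLE_SONG = input_file
waveform, file_sample_rate = torchaudio.load(SAMPLE_SONG)  # replace SAMPLE_SONG with desired path for different song

# The model expects audio at bundle.sample_rate (44.1 kHz), so resample files
# recorded at another rate (for example 48 kHz audio downloaded in 4A).
sample_rate = bundle.sample_rate
if file_sample_rate != sample_rate:
    waveform = torchaudio.functional.resample(waveform, file_sample_rate, sample_rate)
waveform = waveform.to(device)
mixture = waveform

# Reuse the input file's encoding settings when saving the result
info = torchaudio.info(SAMPLE_SONG)
encoding = info.encoding
bits_per_sample = info.bits_per_sample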

# parameters
segment: int = 10
overlap = 0.1

print("Separating track")

ref = waveform.mean(0)
waveform = (waveform - ref.mean()) / ref.std()  # normalization

sources = separate_sources(
    model,
    waveform[None],
    device=device,
    segment=segment,
    overlap=overlap,
)[0]
sources = sources * ref.std() + ref.mean()

sources_list = model.sources
sources = list(sources)

audios = dict(zip(sources_list, sources))

Separating track


In [113]:
import os
def inspect_file(path):
    print("-" * 10)
    print("Source:", path)
    print("-" * 10)
    print(f" - File size: {os.path.getsize(path)} bytes")
    print(f" - {torchaudio.info(path)}")
    print()

In [114]:
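path = output_file
# torchaudio.save(path, waveform, sample_rate)
# Move the tensor to the CPU first: torchaudio.save does not accept CUDA
# tensors, which is the likely cause of the RuntimeError recorded below
# for the original run.
torchaudio.save(
    path,
    audios["vocals"].cpu(),
    sample_rate=sample_rate,
    encoding=encoding,
    bits_per_sample=bits_per_sample,
)
inspect_file(path)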

RuntimeError: ignored

In [122]:
inspect_file("music_1.wav")

----------
Source: music_1.wav
----------
 - File size: 40989058 bytes
 - AudioMetaData(sample_rate=48000, num_frames=10247245, num_channels=2, bits_per_sample=16, encoding=PCM_S)
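
The metadata above can be sanity-checked with a little arithmetic. The sketch below uses only the values printed above to derive the track duration and the size of the raw audio data. Attributing the small remainder to the WAV header is an assumption about the file layout, not something the output states. Note that this file is sampled at 48 kHz, while the model bundle reports 44.1 kHz.

```python
# Values copied from the inspect_file output above.
file_sample_rate = 48000
num_frames = 10247245
num_channels = 2
bits_per_sample = 16
file_size = 40989058  # bytes

# Duration is the number of frames divided by frames per second.
duration_s = num_frames / file_sample_rate

# Raw PCM payload: one sample per channel per frame, bits converted to bytes.
audio_bytes = num_frames * num_channels * bits_per_sample // 8
non_audio_bytes = file_size - audio_bytes  # presumably the WAV header

print(f"duration: {duration_s:.1f} s")
print(f"audio data: {audio_bytes} bytes, remainder: {non_audio_bytes} bytes")
```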

