# MVSEP MDX23 Music Separation Notebook

This notebook wraps the `MVSEP-MDX23-music-separation-model` repo so it can run the inference pipeline end-to-end inside Jupyter.

I would encourage you to run this on the CMS desktops: without GPU processing the software is impractically slow, and the package is not configured for Apple Silicon.

If you really want to run it locally, a workaround is as follows:
- Start by using Rosetta to open a terminal session in Intel mode.
- Next, use `conda` to create a new environment (I recommend Python 3.10).
- Comment out `onnxruntime-gpu` in `requirements.txt`, then install the rest with `pip install -r requirements.txt`.
- Finally, run `conda install onnxruntime -c conda-forge` to install the CPU build of ONNX Runtime in its place.
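
The `requirements.txt` edit in the third step can also be scripted. The sketch below is a minimal, hypothetical helper (`comment_out_requirement` and `patch_requirements` are not part of the repo) that comments out a named package while leaving every other line untouched. It assumes one requirement per line.

```python
import re
from pathlib import Path


def comment_out_requirement(text: str, package: str) -> str:
    """Return `text` with any line that requires `package` commented out.

    Hypothetical helper, not part of the repo. Assumes one requirement per line.
    """
    # The negative lookahead stops "onnxruntime" from also matching
    # "onnxruntime-gpu": the name must be followed by a version specifier,
    # whitespace, or the end of the line.
    pattern = re.compile(rf"^\s*{re.escape(package)}(?![A-Za-z0-9._-])", re.IGNORECASE)
    lines = []
    for line in text.splitlines():
        lines.append(f"# {line}" if pattern.match(line) else line)
    # Preserve the trailing newline if the input had one.
    return "\n".join(lines) + ("\n" if text.endswith("\n") else "")


def patch_requirements(path: str = "requirements.txt", package: str = "onnxruntime-gpu") -> None:
    """Rewrite the requirements file in place with `package` commented out."""
    req = Path(path)
    req.write_text(comment_out_requirement(req.read_text(), package))
```

Calling `patch_requirements()` from the repo root before `pip install -r requirements.txt` would replace the manual edit. Lines that are already commented out are left as they are, so running it twice is harmless.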


In [None]:
# assuming we are on CMS PCs
%pip install -r requirements.txt --quiet


# 2. Configure Inference

The CLI script `inference.py` accepts the following arguments:
- `--input_audio`: one or more WAV/FLAC/MP3 files to separate (required)
- `--output_folder`: directory where stems will be stored (required)
- `--cpu`: force CPU instead of GPU
- `--overlap_large` / `--overlap_small`: overlap ratios for the Demucs/MDX models, from 0.0 to 1.0 (defaults: 0.6 and 0.5 respectively). Higher values are slower but give higher quality.
- `--single_onnx`: use only one ONNX model
- `--chunk_size`: how much audio the GPU processes at a time; lower it to reduce GPU memory use
- `--large_gpu`: keep all models on the GPU (needs >11 GB free)
- `--use_kim_model_1`: use the first version of the Kim model, as used in the contest
- `--only_vocals`: separate only into vocals and everything else
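
It can help to sanity-check these values before launching a long run. Below is a minimal sketch of such a check. `validate_config` is a hypothetical helper (not part of the repo), and it enforces only the constraints stated above: at least one input file, an output folder, and overlap ratios within 0.0 to 1.0. The positive-integer check on `chunk_size` is an added assumption.

```python
def validate_config(config: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the config looks sane."""
    problems = []

    # --input_audio and --output_folder are the two required arguments.
    if not config.get("input_audio"):
        problems.append("input_audio must list at least one file")
    if not config.get("output_folder"):
        problems.append("output_folder must be set")

    # Overlap ratios are documented as ranging from 0.0 to 1.0.
    for key in ("overlap_large", "overlap_small"):
        value = config.get(key)
        if not isinstance(value, (int, float)) or not 0.0 <= value <= 1.0:
            problems.append(f"{key} must be a number between 0.0 and 1.0, got {value!r}")

    # Assumption: a chunk size only makes sense as a positive integer.
    chunk = config.get("chunk_size")
    if not isinstance(chunk, int) or chunk <= 0:
        problems.append(f"chunk_size must be a positive integer, got {chunk!r}")

    return problems
```

Running `validate_config(CONFIG)` right after defining the config and printing any returned messages gives quick feedback before the models are loaded.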



In [None]:
from pathlib import Path

CONFIG = {
    "input_audio": [
        "/path/to/audio.wav",
        # "/path/to/your/audio2.wav",
    ],
    "output_folder": "results",
    "cpu": False,
    "overlap_large": 0.6,
    "overlap_small": 0.5,
    "single_onnx": False,
    "chunk_size": 1000000,
    "large_gpu": False,
    "use_kim_model_1": False,
    "only_vocals": False,
}

Path(CONFIG["output_folder"]).mkdir(parents=True, exist_ok=True)
CONFIG


# 3. Run Inference

The cell below converts the config into the exact CLI arguments used by `inference.py`. It processes files one at a time and prints each command before running it, so you can see which track is in progress.
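
The same mapping can also be sketched as a pure function, which makes it easy to inspect the command for a track without launching anything. `build_command` is a hypothetical helper name. The flag names are the ones listed in the configuration section, and the sketch assumes the boolean options are bare switches that are either present or absent.

```python
import sys

BOOLEAN_FLAGS = ("cpu", "single_onnx", "large_gpu", "use_kim_model_1", "only_vocals")
VALUE_FLAGS = ("overlap_large", "overlap_small", "chunk_size")


def build_command(config: dict, audio_path: str, script: str = "inference.py") -> list[str]:
    """Build the argument list for one track without running it."""
    # sys.executable is the interpreter running this notebook kernel.
    cmd = [sys.executable, script, "--output_folder", str(config["output_folder"])]
    # Boolean options are passed as bare switches, and only when enabled.
    cmd += [f"--{name}" for name in BOOLEAN_FLAGS if config.get(name)]
    # Valued options are passed as "--name value" pairs.
    for name in VALUE_FLAGS:
        if name in config:
            cmd += [f"--{name}", str(config[name])]
    cmd += ["--input_audio", str(audio_path)]
    return cmd
```

Printing `build_command(CONFIG, CONFIG["input_audio"][0])` shows what would be executed for the first track.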



In [None]:
import shlex
import subprocess
import sys

SCRIPT = Path("inference.py")
if not SCRIPT.exists():
    raise FileNotFoundError("inference.py was not found. Make sure you're in the repo root.")

base_args = [
    sys.executable,  # use the notebook kernel's interpreter so the installed requirements are found
    str(SCRIPT),
    "--output_folder",
    CONFIG["output_folder"],
]

optional_flags = {
    "--cpu": CONFIG["cpu"],
    "--single_onnx": CONFIG["single_onnx"],
    "--large_gpu": CONFIG["large_gpu"],
    "--use_kim_model_1": CONFIG["use_kim_model_1"],
    "--only_vocals": CONFIG["only_vocals"],
}

for flag, enabled in optional_flags.items():
    if enabled:
        base_args.append(flag)

base_args.extend(
    [
        "--overlap_large",
        str(CONFIG["overlap_large"]),
        "--overlap_small",
        str(CONFIG["overlap_small"]),
        "--chunk_size",
        str(CONFIG["chunk_size"]),
    ]
)

for audio_path in CONFIG["input_audio"]:
    resolved_audio = Path(audio_path).expanduser().resolve()
    if not resolved_audio.exists():
        raise FileNotFoundError(f"Input audio not found: {resolved_audio}")

    cmd = base_args + ["--input_audio", str(resolved_audio)]
    print("\nRunning:", " ".join(shlex.quote(part) for part in cmd))
    subprocess.run(cmd, check=True)



# 4. Inspect Outputs

List the generated WAV files and optionally preview them (Colab/Jupyter can stream audio widgets). This step helps confirm the bass/drums/other/vocals stems landed where expected.
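
To make that confirmation explicit, the sketch below reports which stems are missing for a given track. It assumes outputs are named `<track>_<stem>.wav`. That matches the `_vocals` naming used in the visualisation step but is an assumption for the other stems. `missing_stems` is a hypothetical helper, so adjust `expected` if your files are named differently.

```python
from pathlib import Path


def missing_stems(output_dir, track_name, expected=("bass", "drums", "other", "vocals")):
    """Return the stems from `expected` that have no WAV file in `output_dir`.

    Assumes files are named "<track_name>_<stem>.wav" (an assumption, not confirmed by the repo).
    """
    output_dir = Path(output_dir)
    return [
        stem
        for stem in expected
        if not (output_dir / f"{track_name}_{stem}.wav").exists()
    ]
```

For the first configured track this would be called as `missing_stems(CONFIG["output_folder"], Path(CONFIG["input_audio"][0]).stem)`; an empty list means every expected stem was found.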



In [None]:
from IPython.display import Audio, display

result_paths = sorted(Path(CONFIG["output_folder"]).glob("*.wav"))
if not result_paths:
    print("No WAV files found yet. Run the inference step first.")
else:
    for path in result_paths:
        print(path)

    # Preview the last file in the sorted list; change the index to hear a different stem.
    display(Audio(filename=str(result_paths[-1])))


# 5. Visualisation

We can also compare how much of the vocal energy has been isolated by looking at log-magnitude spectrograms of the original mixture and the separated vocal stem.
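
A single number can complement the plots. The sketch below computes the ratio of stem energy to mixture energy in decibels, taking the sum of squared samples as the energy measure. `energy_ratio_db` is a hypothetical helper. It accepts any 1-D sequence of samples (for example the mono arrays returned by `librosa.load`) and assumes both signals cover the same duration.

```python
import math


def energy_ratio_db(stem, mixture, eps=1e-12):
    """Return 10*log10(E_stem / E_mixture), where E is the sum of squared samples.

    0 dB means the stem carries as much energy as the mixture; more negative
    values mean the stem holds a smaller share. `eps` guards against log(0).
    """
    stem_energy = sum(float(x) ** 2 for x in stem)
    mixture_energy = sum(float(x) ** 2 for x in mixture)
    return 10.0 * math.log10((stem_energy + eps) / (mixture_energy + eps))
```

The pure-Python loop keeps the sketch dependency-free; for full-length tracks a vectorised NumPy sum over the squared samples would be considerably faster.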

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import librosa
import librosa.display

if not CONFIG["input_audio"]:
    raise ValueError("CONFIG['input_audio'] is empty. Add at least one file to plot spectrograms.")

mixture_path = Path(CONFIG["input_audio"][0]).expanduser()
if not mixture_path.exists():
    raise FileNotFoundError(f"Original audio not found: {mixture_path}")

mixture_stem = mixture_path.stem
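possible_suffixes = [".wav"]
# Also try the input's own extension, skipping it when it is already ".wav".
if mixture_path.suffix and mixture_path.suffix not in possible_suffixes:
    possible_suffixes.append(mixture_path.suffix)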

output_dir = Path(CONFIG["output_folder"]).expanduser()
vocal_path = None
for suffix in possible_suffixes:
    candidate = output_dir / f"{mixture_stem}_vocals{suffix}"
    if candidate.exists():
        vocal_path = candidate
        break

if vocal_path is None:
    checked = ", ".join(str(output_dir / f"{mixture_stem}_vocals{suffix}") for suffix in possible_suffixes)
    raise FileNotFoundError(
        "No vocal stem found. Checked: " + checked + ". Run inference first."
    )

print("Comparing spectrograms for:")
print("  Mixture:", mixture_path)
print("  Vocals:", vocal_path)


def compute_magnitude(path):
    # Load at the file's native sample rate, downmixed to mono.
    audio, sr = librosa.load(path, sr=None, mono=True)
    return np.abs(librosa.stft(audio, n_fft=2048, hop_length=512)), sr

mixture_mag, mix_sr = compute_magnitude(mixture_path)
vocal_mag, vocal_sr = compute_magnitude(vocal_path)

# Reference both spectrograms to the mixture's peak so their dB levels are directly comparable.
ref = mixture_mag.max()
mixture_spec = librosa.amplitude_to_db(mixture_mag, ref=ref)
vocal_spec = librosa.amplitude_to_db(vocal_mag, ref=ref)

fig, axes = plt.subplots(1, 2, figsize=(14, 5), sharey=True, constrained_layout=True)
for ax, spec, sr, title in [
    (axes[0], mixture_spec, mix_sr, "Original Mixture"),
    (axes[1], vocal_spec, vocal_sr, "Separated Vocals"),
]:
    img = librosa.display.specshow(
        spec,
        sr=sr,
        hop_length=512,
        x_axis="time",
        y_axis="log",
        cmap="magma",
        vmin=-80,  # fixed colour limits so the single colorbar describes both panels
        vmax=0,
        ax=ax,
    )
    ax.set_title(title)

fig.colorbar(img, ax=axes, format="%+2.0f dB", shrink=0.8, location="right")
plt.show()
