# Zeca Afonso

In this notebook we train a voice conversion model for Zeca Afonso.

The notebook works in three parts:
1. Preparing the training dataset
2. Training the voice conversion model
3. Inference with the voice conversion model

## 1. Preparing the training dataset

### 1.1 Extract voice from Zeca Afonso's discography

The first step is to extract the voice from the songs. We will use the [`demucs`](https://github.com/facebookresearch/demucs) Hybrid Transformer model. 

In [18]:
import subprocess
from pathlib import Path

import tqdm

discography = Path("../dataset/zeca/discography")

# Separate the vocals of every .opus and .mp3 track with demucs.
# --two-stems=vocals writes a "vocals" and a "no_vocals" file per song.
for pattern in ("*.opus", "*.mp3"):
    for song in tqdm.tqdm(discography.rglob(pattern)):
        command = f"demucs --two-stems=vocals \"{song}\" --out \"../dataset/zeca/discography_vocals\" --filename \"{song.stem}_{{stem}}.{{ext}}\""
        # The result is kept in `run` so its return code and stderr can be inspected.
        run = subprocess.run(command, shell=True, capture_output=True)

97it [5:11:00, 192.38s/it]
31it [1:44:48, 202.85s/it]
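The command above is built as a single shell string, so a song title containing a double quote or a `$` would break the quoting. A minimal sketch of the same invocation as an argument list follows; it uses only the demucs flags already shown in this notebook, and the helper names `build_demucs_command` and `separate` are hypothetical, not part of demucs.

```python
import subprocess
from pathlib import Path


def build_demucs_command(song: Path, out_dir: Path) -> list[str]:
    """Return the demucs call used in this notebook as an argument list."""
    return [
        "demucs",
        "--two-stems=vocals",
        str(song),
        "--out", str(out_dir),
        # {stem} and {ext} are left as literal placeholders for demucs to fill in.
        "--filename", f"{song.stem}_{{stem}}.{{ext}}",
    ]


def separate(song: Path, out_dir: Path) -> subprocess.CompletedProcess:
    """Run demucs on one song without going through the shell."""
    # No shell=True: each list element is passed to demucs verbatim.
    return subprocess.run(
        build_demucs_command(song, out_dir), capture_output=True, text=True
    )
```

Checking `separate(...).returncode` after each call would make a failed separation visible instead of passing silently.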


### 1.2 Splitting vocal files into clips of at most ~10 s

To train the voice conversion model, we need to split the vocal files into short snippets of at most around 10 seconds.

First, let's see how much audio we have before splitting and silence removal.

In [24]:
import librosa

# Sum the duration of the vocal stems only; the "no_vocals" stems are skipped.
duration = 0.0
for song in tqdm.tqdm(Path("../dataset/zeca/discography_vocals").rglob("*.wav")):
    if "no_vocals" not in song.stem:
        duration += librosa.get_duration(path=song)
print(f"Total duration: {duration} seconds / ({duration/60} minutes) / ({duration/3600} hours)")

256it [00:00, 45743.70it/s]

Total duration: 22833.76160997731 seconds / (380.56269349962184 minutes) / (6.342711558327031 hours)
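The raw float output is hard to read at a glance. As a small convenience, the sketch below formats a duration in seconds as `H:MM:SS`; `format_duration` is a hypothetical helper, not something provided by `librosa`.

```python
def format_duration(seconds: float) -> str:
    """Format a duration in seconds as H:MM:SS, rounded to the nearest second."""
    total = int(round(seconds))
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


print(format_duration(22833.76160997731))  # 6:20:34
```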





Now let's use [`audio-slicer`](https://github.com/flutydeer/audio-slicer) to split the files and remove the silent parts.

In [31]:
# Slice every vocal stem into shorter clips, dropping silence along the way.
for song in tqdm.tqdm(Path("../dataset/zeca/discography_vocals").rglob("*.wav")):
    if "no_vocals" not in song.stem:
        command = f"python /home/andre/Repos/audio-slicer/slicer2.py --out ../dataset/zeca/discography_raw \"{song.absolute()}\""
        run = subprocess.run(command, shell=True, capture_output=True)

256it [02:58,  1.44it/s]


In [32]:
import librosa

# Collect the duration of every sliced clip to measure what is left after silence removal.
durations = []
for song in tqdm.tqdm(Path("../dataset/zeca/discography_raw").rglob("*.wav")):
    if "no_vocals" not in song.stem:
        durations.append(librosa.get_duration(path=song))
duration = sum(durations)
print(f"Total duration: {duration} seconds / ({duration/60} minutes) / ({duration/3600} hours)")

# Share of clips that still exceed the 10-second target.
print(f"Percentage of clips longer than 10 seconds: {sum(d > 10 for d in durations) / len(durations) * 100}%")

1889it [00:00, 25705.96it/s]

Total duration: 15934.58 seconds / (265.5763333333333 minutes) / (4.426272222222222 hours)
Percentage of clips longer than 10 seconds: 19.957649550026467%
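About one clip in five is still longer than 10 seconds after slicing. One possible way to bring those under the limit, sketched below, is to cut each long clip into equal-length chunks. This is an assumption for illustration and not what `audio-slicer` does: it cuts at fixed time offsets instead of at silences, so a chunk boundary may fall in the middle of a sung note. The name `chunk_bounds` is hypothetical.

```python
import math


def chunk_bounds(duration: float, max_len: float = 10.0) -> list[tuple[float, float]]:
    """Split [0, duration] into the fewest equal chunks no longer than max_len.

    Returns a list of (start, end) offsets in seconds.
    """
    if duration <= 0:
        return []
    n_chunks = math.ceil(duration / max_len)
    step = duration / n_chunks
    # min() guards the last end point against floating-point overshoot.
    return [(i * step, min((i + 1) * step, duration)) for i in range(n_chunks)]


# A 25-second clip becomes three chunks of roughly 8.33 seconds each.
for start, end in chunk_bounds(25.0):
    print(f"{start:.2f} -> {end:.2f}")
```

The offsets could then be passed to an audio loader that accepts an offset and a duration in order to write each chunk to its own file.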





In [33]:
! cd .. && svc pre-resample

Preprocessing:  47%|███████████▏            | 880/1889 [00:05<00:01, 507.66it/s]
[11:04:08] INFO     [11:04:08] Skip dataset_raw/zeca/singing/10 Canção longe_vocals_0.wav because it is too short.    preprocess_resample.py:71
Preprocessing: 100%|███████████████████████| 1889/1889 [00:07<00:00, 262.58it/s]


In [34]:
! cd .. && svc pre-config

/home/andre/Repos/so-vits-svc-fork/.venv/lib/python3.12/site-packages/so_vits_svc_fork/preprocessing/preprocess_flist_config.py:41: argument 'filename' has been renamed to 'path' in version 0.10.0.
        This alias will be removed in version 1.0.
  if get_duration(filename=path) < 0.3:

In [35]:
! cd .. && svc pre-hubert

[11:06:57] INFO     [11:06:57] n_jobs automatically set to 1, memory: None MiB    preprocess_hubert_f0.py:142
Some weights of HubertModel were not initialized from the model checkpoint at lengyue233/content-vec-best and are newly initialized: ['encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'encoder.pos_conv_embed.conv.parametrizations.weight.original1']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  0%|                       

In [None]:
! cd .. && svc train -t