# Fine-tuning FreeVC

Hi!

This notebook takes you through the steps to **fine-tune** the **Voice Conversion** model **FreeVC**. It is meant to be beginner-friendly, sparing you (and myself, the author of this notebook) most of the details of FreeVC's incredibly complicated architecture. For those looking for a deeper dive into the concepts, models and techniques on which FreeVC is built, there are some links to further reading.

> FreeVC:
>- [Paper](https://arxiv.org/abs/2210.15418)
>- [Demo](https://olawod.github.io/FreeVC-demo/)
>- [Code](https://github.com/OlaWod/FreeVC)





In [33]:
# TODO remove before shipping
# ! python initialize.py --leave_chunks
# DO_NOT_CHUNK = True
# DO_NOT_CHUNK = False

# TODO
- testing
    - what & how?

## Theory

### What is Voice Conversion?
- TODO: go through the basics

### FreeVC: Architecture

#### VITS
A significant part of FreeVC's architecture is based on [**VITS**](https://github.com/jaywalnut310/vits), an **end-to-end TTS** (text-to-speech) model. 
- TODO: further reading


#### WavLM

## Implementation



### Installation/Prerequisites
- WavLM
- HiFiGAN
- venv:
    - `pip install -r requirements.txt`
    - TODO: ensure the following packages are included in the requirements
        - `protobuf<=3.20.3`
        - `six==1.16.0`
        - `matplotlib`
        - `numpy<=1.22.4`
- ffmpeg: needed to export audio as FLAC
    - `sudo apt update && sudo apt upgrade`, then `sudo apt install ffmpeg`
    - `ffmpeg -version` to check the installation
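
The `ffmpeg -version` check can also be run from inside the notebook. The following is a minimal sketch using only the standard library; the helper name `ffmpeg_available` is my own and not part of FreeVC.

```python
import shutil
import subprocess

def ffmpeg_available() -> bool:
    """Return True if an `ffmpeg` executable is on PATH and runs successfully."""
    path = shutil.which("ffmpeg")
    if path is None:
        return False
    try:
        result = subprocess.run([path, "-version"], capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0

print("ffmpeg found" if ffmpeg_available() else "ffmpeg missing: install it before exporting FLAC files")
```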



In [None]:
# TODO

### Audio Preparation

To fine-tune the pre-trained model, we of course need some training data to adapt the model to the target speaker. By _some_ training data, I mean a whole lot of it.

We want to make sure we make the most of our data. Therefore, we'll do some simple preprocessing on it.

#### Chop it up

The **base model** (i.e. FreeVC's pretrained model that we're fine-tuning) is trained on the [VCTK corpus](https://datashare.ed.ac.uk/handle/10283/3443). The audio in this corpus is stored in many small files (3.4 seconds on average) rather than one large file.

If your fine-tuning data is already in similarly sized chunks, you can **skip this step**. Otherwise, run the following cells on your file(s). They automatically detect silent passages, for example between sentences, and split your audio into adequately sized chunks at those points.

You can modify the following parameters to control the chunking:
- `MIN_SILENCE_LEN` (milliseconds): Defines the minimum length of silence required to split the audio at that point.
- `SILENCE_THRESHOLD` (dBFS): Defines what counts as silent and what does not; anything louder than the set threshold will count as not silent. 
- `MIN_CHUNK_LEN` (seconds): Any chunk shorter than this value will be discarded and NOT saved.
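
To get a feel for `SILENCE_THRESHOLD`: dBFS is a logarithmic scale relative to the loudest representable signal (0 dBFS). The sketch below converts a threshold into a linear amplitude ratio, assuming the usual 20·log10 amplitude convention. The function name is only for illustration.

```python
def dbfs_to_amplitude_ratio(dbfs: float) -> float:
    """Convert a dBFS level to a linear amplitude ratio (1.0 = full scale)."""
    return 10 ** (dbfs / 20)

# Compare a few candidate thresholds; lower values treat only quieter audio as silence
for threshold in (-20, -40, -60):
    print(f"{threshold} dBFS -> {dbfs_to_amplitude_ratio(threshold):.4f} of full scale")
```

With the default of -40 dBFS, anything below roughly 1% of full-scale amplitude counts as silent. If your recording has noticeable background noise, a higher (less negative) threshold may be needed.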

In [32]:
# modified code, originally from:
#   https://www.codespeedy.com/split-audio-files-using-silence-detection-in-python/
#   retrieved on 2024-08-23
import os
from pydub import AudioSegment
from pydub.silence import split_on_silence

def chunk_audio(filelist: list[str], silence_len=800, silence_thr=-40, chunklen=0):
    """Split each WAV file at silent passages and export the chunks as FLAC files."""
    os.makedirs("./chunks", exist_ok=True)
    count = 0    # number of exported chunks
    length = 0.  # total duration of exported chunks (seconds)
    index = 0    # running chunk index across all input files, so later files don't overwrite earlier chunks
    for file in filelist:
        sound = AudioSegment.from_wav(file)
        # split wherever the audio stays below silence_thr (dBFS) for at least silence_len (ms)
        audio_chunks = split_on_silence(sound, min_silence_len=silence_len, silence_thresh=silence_thr)
        for chunk in audio_chunks:
            output_file = "./chunks/chunk{0}.flac".format(index)
            if chunk.duration_seconds >= chunklen:
                print("Exporting file", output_file)
                chunk.export(output_file, format="flac")
                length += chunk.duration_seconds
                count += 1
            else:
                print("Skipping Chunk {0}: Too short (< {1} seconds)".format(index, chunklen))
            index += 1
    # avoid a division by zero when every chunk was too short
    if count:
        print("\nAverage length of saved chunks: {0} Seconds".format(round(length / count, 2)))
    else:
        print("\nNo chunks were saved.")
    


In [34]:
# VARIABLES
MIN_SILENCE_LEN = 800
SILENCE_THRESHOLD = -40
MIN_CHUNK_LEN = 1.5

# USER TODO: list of files to chunk
filelist = ["./LN_AUDIOFILES/brn1/bernie_filibuster_pt1_5min.wav"]

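# TODO remove before shipping
# DO_NOT_CHUNK is a development switch; treat it as False if it was never defined
if not globals().get("DO_NOT_CHUNK", False):
    chunk_audio(filelist, silence_len=MIN_SILENCE_LEN, silence_thr=SILENCE_THRESHOLD, chunklen=MIN_CHUNK_LEN)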


Skipping Chunk 0: Too short (< 1.5 seconds)
Skipping Chunk 1: Too short (< 1.5 seconds)
Exporting file ./chunks/chunk2.flac
Exporting file ./chunks/chunk3.flac
Exporting file ./chunks/chunk4.flac
Exporting file ./chunks/chunk5.flac
Exporting file ./chunks/chunk6.flac
Exporting file ./chunks/chunk7.flac
Exporting file ./chunks/chunk8.flac
Exporting file ./chunks/chunk9.flac
Skipping Chunk 10: Too short (< 1.5 seconds)
Exporting file ./chunks/chunk11.flac
Exporting file ./chunks/chunk12.flac
Exporting file ./chunks/chunk13.flac
Exporting file ./chunks/chunk14.flac
Exporting file ./chunks/chunk15.flac
Exporting file ./chunks/chunk16.flac
Exporting file ./chunks/chunk17.flac
Exporting file ./chunks/chunk18.flac
Skipping Chunk 19: Too short (< 1.5 seconds)
Skipping Chunk 20: Too short (< 1.5 seconds)
Exporting file ./chunks/chunk21.flac
Skipping Chunk 22: Too short (< 1.5 seconds)
Exporting file ./chunks/chunk23.flac
Exporting file ./chunks/chunk24.flac
Skipping Chunk 25: Too short (< 1.5 s

### Preprocessing

At this point, we have some audio data in appropriately sized chunks. We now need to run some very particular preprocessing steps on it, so that the model receives it in the right format.

#### Storage Format

FreeVC expects our fine-tuning data to be in the same format as its [original training data](https://datashare.ed.ac.uk/handle/10283/3443). Therefore, we need to rename some files and move them to the right places before we run any preprocessing.

You'll need to assign some **4-character** ID to your speaker - pick one that makes sense to you. If it's longer or shorter than 4 characters, this won't work.

In [2]:
# TODO: pick your Speaker ID
SPEAKER_ID = "brn1"
assert len(SPEAKER_ID) == 4

# DIRECTORY NAMES
CHUNKS = "./chunks/"
FLACS = "./dataset/flac/"
DATA_PATH = f'{FLACS}{SPEAKER_ID}/'
DATA16K = "dataset/finetuning-16k"
DATA22K = "dataset/finetuning-22k"

The following cell will copy your audio, under new names, into a directory with the right structure:

`<some_dir>/<sp_id>/<sp_id-filename>_mic2.flac`

In [42]:
import os
import shutil

os.makedirs(DATA_PATH, exist_ok=True)

# sort the listing so the numbering is reproducible between runs
for i, file in enumerate(sorted(os.listdir(CHUNKS))):
    # rename & copy files into the specific format necessary
    new_filename = f'{SPEAKER_ID}-{i}_mic2.flac'
    shutil.copy(os.path.join(CHUNKS, file), os.path.join(DATA_PATH, new_filename))
print("Moved and renamed your training files.\nGreat Success!! Very Nice!")
# os.listdir(DATA_PATH)


Moved and renamed your training files.
Great Success!! Very Nice!


#### Downsampling

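Downsamples the audio and writes the results to two directories: the 16 kHz version goes to `DATA16K`, the second version to `DATA22K`. `downsample.py` accepts the following arguments:
- `--sr1`: sampling rate
- `--sr2`: sampling rate
- `--in_dir`: path to source dir
- `--out_dir1`: path to target dir
- `--out_dir2`: path to target dir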


In [27]:
! python downsample.py --in_dir $FLACS --out_dir1 $DATA16K --out_dir2 $DATA22K
! ln -s $DATA16K DUMMY

#### Data Splitting

Next, our fine-tuning data needs to be split into training, test and validation sets.

FreeVC's original splitting script uses 2 chunks from each speaker for validation, 10 chunks for testing and the rest for training. With an average of around 400 chunks per speaker, this amounts to an average test split of 2.5% and a validation split of 0.5%. To me, these seem like overly small test and validation portions.

I therefore modified the preprocessing script: instead of a constant 10 and 2 samples, the test and validation sets now take a relative 5% and 1% of the data, respectively.

_(As to whether this improves or worsens performance, I have no empirical evidence for either and I do not intend to gather it.)_
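
The relative split can be sketched as follows. This illustrates the idea only and is not the code in the modified `preprocess_flist.py`: the function name, the shuffling with a fixed seed, the file names and the minimum of one sample per held-out set are assumptions.

```python
import random

def split_filelist(files, test_frac=0.05, val_frac=0.01, seed=1234):
    """Split a list of files into train/val/test using relative portions."""
    files = list(files)
    random.Random(seed).shuffle(files)  # fixed seed keeps the split reproducible
    # keep at least one sample in each held-out set, even for tiny datasets
    n_test = max(round(len(files) * test_frac), 1)
    n_val = max(round(len(files) * val_frac), 1)
    test = files[:n_test]
    val = files[n_test:n_test + n_val]
    train = files[n_test + n_val:]
    return train, val, test

# Hypothetical file names, just to show the resulting set sizes
train, val, test = split_filelist([f"chunk{i}.wav" for i in range(52)])
print(len(train), len(val), len(test))
```

For very small datasets the minimum of one sample per set dominates, so the effective portions end up larger than 5% and 1%.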

In [7]:
val_file="./filelists/finetune-val.txt"
test_file="./filelists/finetune-test.txt"
train_file="./filelists/finetune-train.txt"
sr_wavs = f"./dataset/sr/wav"
# ! python preprocess_flist.py --train_list $train_file --test_list $test_file --val_list $val_file --source_dir $sr_wavs
! python preprocess_flist.py --train_list $train_file --test_list $test_file --val_list $val_file --source_dir $DATA16K
! rm DUMMY
! ln -s $DATA16K DUMMY

100%|███████████████████████████████████████████| 1/1 [00:00<00:00, 5178.15it/s]
Writing ./filelists/finetune-train.txt
100%|███████████████████████████████████████| 48/48 [00:00<00:00, 182196.01it/s]
Writing ./filelists/finetune-val.txt
100%|██████████████████████████████████████████| 1/1 [00:00<00:00, 14873.42it/s]
Writing ./filelists/finetune-test.txt
100%|██████████████████████████████████████████| 3/3 [00:00<00:00, 35444.82it/s]


In [16]:
# %%bash -s "$train_file" "$test_file" "$val_file" "$DATA16K"
# echo $1
# python preprocess_flist.py --train_list $1 --test_list $2 --val_list $3 --source_dir $4

$train_file


#### Speaker Encoder (pretrained)

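FreeVC conditions the converted speech on a speaker embedding, a vector that summarises the voice characteristics of the target speaker. The following cells extract these embeddings from the 16 kHz audio using a pretrained speaker encoder.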

In [7]:
data_root="./dataset"

./data


In [20]:
! CUDA_VISIBLE_DEVICES=0 python preprocess_spk.py --in_dir $DATA16K --out_dir_root $data_root

$DATA16K
this is $data_root


#### Data Augmentation

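To make the most of our data, FreeVC uses Spectrogram Resize (SR) based data augmentation: the mel-spectrogram of each utterance is stretched or squeezed along the frequency axis and then turned back into a waveform with the HiFi-GAN vocoder. This distorts the speaker characteristics while leaving the spoken content intact, which helps the model separate the two.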

In [29]:
HIFIGAN_CFG = "hifigan/config.json"
WAV_DIR = "dataset/sr/wav"
SSL_DIR = "dataset/sr/wavlm"
! CUDA_VISIBLE_DEVICES=0 python preprocess_sr.py --in_dir $DATA22K --wav_dir $WAV_DIR --ssl_dir $SSL_DIR --config $HIFIGAN_CFG --min 68 --max 92 --sr 16000
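
To illustrate what a vertical Spectrogram Resize does, here is a minimal NumPy sketch. It is not the code in `preprocess_sr.py`: the function name is hypothetical, linear interpolation and padding with the minimum value are simplifications, and reading `--min 68 --max 92` as target heights for an 80-bin mel-spectrogram is an assumption on my part.

```python
import numpy as np

def resize_mel_vertically(mel: np.ndarray, new_bins: int) -> np.ndarray:
    """Stretch or squeeze a (n_bins, n_frames) spectrogram along the frequency
    axis, then crop or pad it back to the original number of bins."""
    n_bins, n_frames = mel.shape
    # positions on the original bin axis that each resized bin samples from
    src_pos = np.linspace(0, n_bins - 1, new_bins)
    resized = np.stack(
        [np.interp(src_pos, np.arange(n_bins), mel[:, t]) for t in range(n_frames)],
        axis=1,
    )
    if new_bins >= n_bins:
        # stretched: drop the bins that no longer fit at the top
        return resized[:n_bins]
    # squeezed: fill the missing top bins with the lowest value in the input
    pad = np.full((n_bins - new_bins, n_frames), mel.min())
    return np.concatenate([resized, pad], axis=0)

mel = np.random.rand(80, 100)  # stand-in for a real 80-bin mel-spectrogram
print(resize_mel_vertically(mel, 68).shape, resize_mel_vertically(mel, 92).shape)
```

Both calls return an array of the original shape, so the augmented spectrograms can be fed to the same vocoder as the unmodified ones.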

### Finetuning

The training hyperparameters are set in a JSON file located in the `configs/` directory. For fine-tuning, we'll use the file `freevc-finetune.json`.
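
If you'd rather adjust a few hyperparameters from Python than edit the file by hand, something like the following sketch could work. It assumes the config has a top-level `train` section with keys such as `epochs` and `batch_size`, as seen in the training log further down. The helper name and the example values are placeholders.

```python
import json

def update_train_config(path, **overrides):
    """Load a JSON config, override entries of its 'train' section, and save it."""
    with open(path, "r", encoding="utf-8") as f:
        config = json.load(f)
    # refuse unknown keys so that typos don't silently add new settings
    unknown = set(overrides) - set(config["train"])
    if unknown:
        raise KeyError(f"Not in the 'train' section: {sorted(unknown)}")
    config["train"].update(overrides)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)
    return config["train"]

# Example (uncomment to apply; the values are placeholders):
# update_train_config("configs/freevc-finetune.json", epochs=2, batch_size=1)
```

Note that this rewrites the file in place, so keep a copy of the original config if you want to get back to it.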

In [8]:
# fine-tune FreeVC: use config 'configs/freevc-finetune.json' and model name 'freevc-finetune';
# -d and -g point to the pretrained checkpoints to start from
# ! CUDA_VISIBLE_DEVICES=0 python finetune.py --config configs/freevc-finetune.json --model freevc-finetune
# MODEL_NAME = f'freevc_finetune-{SPEAKER_ID}'
# note: MODEL_NAME is only echoed here; the command below hardcodes -m freevc-finetune
MODEL_NAME = 'freevc_finetune'
! echo $MODEL_NAME
! CUDA_VISIBLE_DEVICES=0 python finetune.py -c configs/freevc-finetune.json -m freevc-finetune -d ./checkpoints/D-freevc.pth -g ./checkpoints/freevc.pth --force_new

freevc_finetune
INFO:freevc-finetune:{'train': {'log_interval': 10, 'eval_interval': 10, 'seed': 1234, 'epochs': 2, 'learning_rate': 0.0002, 'betas': [0.8, 0.99], 'eps': 1e-09, 'batch_size': 1, 'fp16_run': False, 'lr_decay': 0.999875, 'segment_size': 8960, 'init_lr_ratio': 1, 'warmup_epochs': 0, 'c_mel': 45, 'c_kl': 1.0, 'use_sr': True, 'max_speclen': 128, 'port': '8001'}, 'data': {'training_files': 'filelists/finetune-train.txt', 'validation_files': 'filelists/finetune-val.txt', 'max_wav_value': 32768.0, 'sampling_rate': 16000, 'filter_length': 1280, 'hop_length': 320, 'win_length': 1280, 'n_mel_channels': 80, 'mel_fmin': 0.0, 'mel_fmax': None}, 'model': {'inter_channels': 192, 'hidden_channels': 192, 'filter_channels': 768, 'n_heads': 2, 'n_layers': 6, 'kernel_size': 3, 'p_dropout': 0.1, 'resblock': '1', 'resblock_kernel_sizes': [3, 7, 11], 'resblock_dilation_sizes': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'upsample_rates': [10, 8, 2, 2], 'upsample_initial_channel': 512, 'upsample_kernel_

In [4]:
# ! CUDA_VISIBLE_DEVICES=0 python convert.py --hpfile configs/freevc-finetune.json --ptfile checkpoints/freevc-finetune.pth --txtpath convert.txt --outdir outputs/freevc-finetune
# ! CUDA_VISIBLE_DEVICES=0 python convert.py --hpfile configs/freevc.json --ptfile checkpoints/freevc.pth --txtpath convert.txt --outdir outputs/freevc-base
! CUDA_VISIBLE_DEVICES=0 python convert.py --hpfile configs/freevc-finetune.json --ptfile logs/freevc-finetune/G_40.pth --txtpath convert.txt --outdir outputs/freevc-finetune

Loading model...
Loading checkpoint...
logs/freevc-finetune/G_40.pth
INFO:root:Loaded checkpoint 'logs/freevc-finetune/G_40.pth' (iteration 1)
Loading WavLM for content...
INFO:wavlm.WavLM:WavLM Config: {'extractor_mode': 'layer_norm', 'encoder_layers': 24, 'encoder_embed_dim': 1024, 'encoder_ffn_embed_dim': 4096, 'encoder_attention_heads': 16, 'activation_fn': 'gelu', 'layer_norm_first': True, 'conv_feature_layers': '[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2', 'conv_bias': False, 'feature_grad_mult': 1.0, 'normalize': True, 'dropout': 0.0, 'attention_dropout': 0.0, 'activation_dropout': 0.0, 'encoder_layerdrop': 0.0, 'dropout_input': 0.0, 'dropout_features': 0.0, 'mask_length': 10, 'mask_prob': 0.8, 'mask_selection': 'static', 'mask_other': 0.0, 'no_mask_overlap': False, 'mask_min_space': 1, 'mask_channel_length': 10, 'mask_channel_prob': 0.0, 'mask_channel_selection': 'static', 'mask_channel_other': 0.0, 'no_mask_channel_overlap': False, 'mask_channel_min_space': 1, 'conv_pos'

In [None]:
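import glob
import IPython.display as ipd

# USER TODO: path to a reference recording of the target speaker (left empty on purpose)
ref_1 = ""
if ref_1:
    print("Ref 1")
    ipd.display(ipd.Audio(filename=ref_1))

# Play back the converted files from the --outdir used above
# (assumes convert.py writes .wav files there)
for n, path in enumerate(sorted(glob.glob("outputs/freevc-finetune/*.wav")), start=1):
    print(f"Example {n}: {path}")
    ipd.display(ipd.Audio(filename=path))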


In [None]:
# DELETE LATER, JUST STATS

import os

def test_split(total_len):
    """Print the train/test/val sizes that a 5% / 1% split yields for total_len samples."""
    assert total_len >= 10, "need at least 10 samples for a meaningful split"
    n_test = max(round(total_len*0.05), 1)
    n_val = max(round(total_len*0.01), 1)
    n_train = total_len-(n_test+n_val)
    print(total_len, ":\t",n_train,", ",n_test,", ", n_val)

vctk_path = os.path.abspath("../../../../../../mnt/c/Users/mhess/Downloads/VCTK-Corpus-0.92/wav48_silence_trimmed")
# average number of chunks per speaker in VCTK
counter = 0
num_chunks = 0
for el in os.listdir(vctk_path):
    combined_path = os.path.join(vctk_path,el)
    # print(f'{el}:\t{os.path.isdir(combined_path)}')
    if os.path.isdir(combined_path):
        counter += 1
        # the file count is halved, presumably because each utterance is stored once per microphone
        num_chunks += len(os.listdir(combined_path))/2

print(f'AVG chunks per speaker: {round(num_chunks/counter, 2)}')