# Fine-tuning FreeVC


Hi!

This notebook takes you through the steps to **fine-tune** the **Voice Conversion** model **FreeVC**. It is meant to be beginner-friendly, sparing you (and myself, the author of this notebook) most of the details of FreeVC's incredibly complicated architecture. For those looking for a deeper dive into some of the concepts, models and techniques on which FreeVC is built, there will be some links to further reading.

**FreeVC: [Paper](https://arxiv.org/abs/2210.15418) | [Demo](https://olawod.github.io/FreeVC-demo/) | [Code](https://github.com/OlaWod/FreeVC)**

## Theory

### What is Voice Conversion?

Voice conversion is probably best explained using an example: There are two people, Alice and Bob. Alice wants to impersonate Bob, and she has a recording of Bob's voice. Alice then makes recordings of herself, saying things Bob would never say, such as "give lots of money to Alice". Using the recording she has of Bob, she *converts* her own recordings so that they sound as if Bob is saying all these ridiculous things.

In this scenario, Alice is the **source** and Bob is the **target voice**. The distinction may seem a bit arbitrary at first, since we need both a recording of Alice *and* Bob, but since Bob's voice is where we want to end up, this distinction makes sense.

#### Intuition
Without going into technical details, let's ask why this works *intuitively*.

When we listen to a voice, our brains more or less automatically process two different things:
- *Who is speaking?* (let's call that **identity** or **speaker information**)
- *What is being said?* (let's call that **content**)

Determining the identity comes down to factors that are physically inherent to the speaker's voice (mainly timbre and pitch) as well as factors that are more under the speaker's direct control, such as accent and rhythm.

The content is largely independent of the features that make up identity. Communication works because people with different voices are able to produce the same phonemes. The same sequence of vowels and consonants - be it a word, sentence or speech - means the same thing across different speakers.

The identity and content information on a *signal* (a spoken utterance) seem to be independent of each other, to a certain degree. The core idea for voice conversion therefore is the following:
1. Given a source signal (by speaker A), strip it of all its speaker information, while preserving content.
2. Extract target speaker information (speaker B).
3. Insert target speaker information into the stripped source signal.

<details>

**<summary>Difference to Voice Cloning</summary>**

How does voice conversion differ from voice cloning?

Fundamentally, voice cloning - the extracting of speaker features and using them to generate speech - falls in the realm of text-to-speech, whilst voice conversion is speech-to-speech and entirely textless. There are also some significant practical differences: For instance, sophisticated voice cloning models also preserve properties such as accent and rhythm, whereas voice conversion does not. The stripped signal in voice conversion is much more of a rigid template than the text input used in voice cloning. The advantage of this is that it allows the user more direct control over the rhythm, accent, pitch contour and so on, simply by having the desired patterns in the source signal.

</details>

### FreeVC: Architecture

A significant part of FreeVC's architecture is based on [**VITS**](https://github.com/jaywalnut310/vits), an **end-to-end text-to-speech** model. VITS is a popular model because its output sounds very human for a TTS system. A core piece of VITS is the [Conditional Variational Autoencoder](https://theaiacademy.blogspot.com/2020/05/understanding-conditional-variational.html), a type of autoencoder more suitable for d
 However, as a TTS system it uses text as the input, whereas voice conversion is a speech-to-speech task. Roughly summarized and glossing over many technical details, FreeVC modifies the architecture of VITS in two ways: 

Firstly, it replaces the text encoder with an encoder capable of handling speech. The encoder itself consists of multiple pieces: A pretrained model that transforms the raw waveform into a vector (WavLM), a *bottleneck extractor* that reduces the dimensionality of the obtained vector and hopefully sieves out the information not needed, and lastly a normalizing flow (read more on flow [here](https://towardsdatascience.com/introduction-to-normalizing-flows-d002af262a4b)).

Secondly, it adds a pretrained **speaker encoder** to the model architecture. A mel-spectrogram of the audio sample of the target speaker is given as input to the speaker encoder. It extracts the relevant speaker features, and feeds the resulting *speaker embedding* into both the flow module and the decoder.

Finally, the decoder takes the encoded source audio and the speaker embedding, and creates the output waveform from it. As in VITS, FreeVC also uses [HiFi-GAN V1](https://github.com/jik876/hifi-gan) as its decoder.

Additionally, FreeVC uses a discriminator to incorporate [adversarial learning](https://developers.google.com/machine-learning/gan/) as well as data augmentation during training of the model.

## Implementation

### Setup Venv & install requirements

#### Ensure Prerequisites

Firstly, make sure that Python 3.9 and FFmpeg are installed.

In [1]:
# Check for Python 3.9
! python3.9 -V
# Check for FFmpeg
! ffmpeg -version

Python 3.9.20
ffmpeg version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2000-2021 the FFmpeg developers
built with gcc 11 (Ubuntu 11.2.0-19ubuntu1)
configuration: --prefix=/usr --extra-version=0ubuntu0.22.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-lib

<details>

**<summary>Install Python 3.9 & FFmpeg</summary>**

If checking with the above commands gives you an error, follow these steps:

##### Python 3.9
```bash
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt-get update && sudo apt upgrade
sudo apt-get install python3.9
```

##### FFmpeg
```bash
sudo apt update
sudo apt install ffmpeg
```

</details>
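
The same check can also be done from Python, which is handy if you prefer a plain found/missing answer over reading version strings. This is only a minimal sketch: the helper name `check_prerequisites` is made up for this example and is not part of the repository.

```python
import shutil

def check_prerequisites(commands=("python3.9", "ffmpeg")):
    """Map each command name to True if it is found on the PATH, else False."""
    return {cmd: shutil.which(cmd) is not None for cmd in commands}

for cmd, found in check_prerequisites().items():
    print(f"{cmd}: {'found' if found else 'MISSING'}")
```

Note that this only tells you whether the commands exist, not whether their versions are the ones this notebook was written for.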

##### Venv & Packages

Next, as is standard procedure, we want to install the required modules inside a Virtual Environment (or venv). Because different projects have different dependencies, we want to keep them from interfering with each other, therefore we install the required dependencies in an isolated environment.

In [None]:
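# create venv
! python3.9 -m venv .venv-freevc
# each `!` line runs in its own subshell, so `source .venv-freevc/bin/activate` would not
# carry over to the next line; call the venv's own pip and python directly instead
! .venv-freevc/bin/pip install -r requirements.txt
# add the venv to the registry of jupyter kernels, allowing us to select it as the kernel of this notebook
# (assumes ipykernel is installed in the venv, e.g. via requirements.txt)
! .venv-freevc/bin/python -m ipykernel install --name=.venv-freevc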

##### WavLM

&#x2757; **Download** the `WavLM Large` model found on [this page](https://github.com/microsoft/unilm/tree/master/wavlm).
> `Pre-Trained Models` > WavLM Large `Google Drive`

Next, move the downloaded file into the `wavlm/` folder.

You'll want to end up with the following folder structure:
> ```ascii
> FreeVC-finetune/
> ├─ ...
> ├─ wavlm/
> │  ├─ modules.py
> │  ├─ WavLM-Large.pt
> │  ├─ WavLM-Large.pt.txt
> │  ├─ WavLM.py
> │  ├─ __init__.py
> ├─ ...
> ```

##### HiFi-GAN

&#x2757; **Download** the HiFi-GAN model `VCTK_V1` found on [this page](https://github.com/jik876/hifi-gan?tab=readme-ov-file). 
> `Download Pretrained Models` > Google Drive: `VCTK_V1` > Download `generator_v1`

Next, move the downloaded file into the `hifigan/` folder.

You'll want to end up with the following folder structure:
> ```ascii
> FreeVC-finetune/
> ├─ ...
> ├─ hifigan/
> │  ├─ __init__.py
> │  ├─ config.json
> │  ├─ generator_v1
> │  ├─ generator_v1.txt
> │  ├─ models.py
> ├─ ...
> ```

In [3]:
# make sure the user followed the previous two steps correctly
import os
file = "wavlm/WavLM-Large.pt"
assert os.path.exists(file), f"{file} is missing.\nMake sure you downloaded the WavLM-Large model and put it in the correct directory"
file = "hifigan/generator_v1"
assert os.path.exists(file), f"{file} is missing.\nMake sure you downloaded the HiFi-Gan model (VCTK_V1) and put it in the correct directory"

# already creating a folder required for the next step if it doesn't exist yet
if not os.path.isdir("./checkpoints"):
    os.mkdir("./checkpoints")

##### FreeVC checkpoints

&#x2757; To fine-tune a base model, we'll also need to **download** the checkpoints of the base model. They should be available under [this link](https://1drv.ms/u/s!AnvukVnlQ3ZTx1rjrOZ2abCwuBAh?e=UlhRR5). If the link no longer works, check the README in the [original FreeVC repository](https://github.com/OlaWod/FreeVC) and try the link provided in the paragraph that reads "_We also provide the pretrained models_".

This link should take you to a OneDrive folder that contains the checkpoint. We won't need all of the files there, only two of them:
- `D-freevc.pth`
- `freevc.pth`

We can also ignore the folder `24kHz` entirely, as we're only working with the 16kHz models and audio.

Move the two checkpoints to the `checkpoints/` folder.

You'll want to end up with the following folder structure:
> ```ascii
> FreeVC-finetune/
> ├─ ...
> ├─ checkpoints/
> │  ├─ D-freevc.pth
> │  ├─ freevc.pth
> ├─ ...
> ```

><details>
>
>**<summary> &#x2753; Why two checkpoints?</summary>**
>
>FreeVC trains two different models because it is, in part, a generative adversarial network (GAN). `freevc.pth` is the generator net, the part of the model that actually generates the voice-converted audio. `D-freevc.pth` is the checkpoint of the discriminator, the net that tries to classify audio as either 'natural' (i.e. a real recording of a human, not generated by the model we are training) or 'generated'. It takes both generated and natural audio as inputs for its training. When classifying generated audio, the discriminator's performance feeds into the loss function of the generator. Intuitively, the generator is penalized whenever the discriminator recognizes its output as generated, so its weights are pushed towards producing audio that 'fools' the discriminator. Thus, if the generator gets better, the discriminator is forced to improve to keep up with it, and vice versa. Since these are essentially two different models, we'll need to load them separately.
>
>For inference (= just converting audio, no training), we only require the generator.
>
></details>
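
The tug-of-war between generator and discriminator can be made concrete with a toy calculation. The sketch below assumes the least-squares GAN objective used by HiFi-GAN, where the discriminator outputs a score near 1 for 'natural' and near 0 for 'generated' audio. FreeVC's actual training combines this with further loss terms, and the function names here are illustrative only.

```python
def discriminator_loss(real_scores, fake_scores):
    """Least-squares loss: push scores for real audio to 1 and for generated audio to 0."""
    real_term = sum((1.0 - r) ** 2 for r in real_scores) / len(real_scores)
    fake_term = sum(f ** 2 for f in fake_scores) / len(fake_scores)
    return real_term + fake_term

def generator_adversarial_loss(fake_scores):
    """The generator wants the discriminator to score its output as real (1)."""
    return sum((1.0 - f) ** 2 for f in fake_scores) / len(fake_scores)

# Discriminator is fooled (scores generated audio close to 1): small generator loss.
print(generator_adversarial_loss([0.9, 0.8]))
# Discriminator is not fooled (scores close to 0): large generator loss.
print(generator_adversarial_loss([0.1, 0.2]))
```

The two losses pull in opposite directions on the same scores, which is why both networks have to be stored and loaded during training.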


In [4]:
# Make sure the checkpoints are downloaded and stored in the right directories.
import os
files = ["./checkpoints/D-freevc.pth", "./checkpoints/freevc.pth"]
for file in files:
    assert os.path.exists(file), f"{file} is missing.\nMake sure you downloaded the FreeVC checkpoints and put them in the checkpoints/ directory"


### Audio Preparation

To finetune the pre-trained model, we of course need some training data to adapt the model to the target speaker. By _some_ training data, I mean a whole lot of it.

We want to make sure we make the most of our data. Therefore, we'll do some simple preprocessing on it.

#### Chop it up

The **base model** (i.e. FreeVC's pretrained model that we're finetuning) is trained on the [VCTK corpus](https://datashare.ed.ac.uk/handle/10283/3443). The audio in this corpus is stored in many small files (3.4 seconds on average) rather than one large file.

If your finetuning data is already in similarly sized chunks, you can **skip this step**. Otherwise, run the following cells on your file(s). They automatically detect silent passages (for example between sentences) and split your audio at those points into adequately sized chunks.

You can modify the following parameters to control the chunking:
- `MIN_SILENCE_LEN` (milliseconds): Defines the minimal length of silence necessary to split the audio at that point
- `SILENCE_THRESHOLD` (dBFS): Defines what counts as silent and what does not; anything louder than the set threshold will count as not silent. 
- `MIN_CHUNK_LEN` (seconds): Any chunk shorter than this value will be discarded and NOT saved.
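
To get a feeling for `SILENCE_THRESHOLD`: dBFS measures loudness relative to the loudest value the sample format can represent, so 0 dBFS is full scale and quieter audio has negative values. Below is a rough sketch of that calculation on raw 16-bit samples, assuming an RMS-based level. The helper `dbfs` is written for this example and is not a pydub function.

```python
import math

MAX_AMPLITUDE_16BIT = 32768  # largest magnitude of a signed 16-bit sample

def dbfs(samples):
    """RMS level of integer samples in dB relative to 16-bit full scale."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0:
        return float("-inf")  # digital silence
    return 20 * math.log10(rms / MAX_AMPLITUDE_16BIT)

SILENCE_THRESHOLD = -40

loud = [16384, -16384] * 100   # half of full scale
quiet = [100, -100] * 100      # very soft background noise

for name, window in [("loud", loud), ("quiet", quiet)]:
    level = dbfs(window)
    verdict = "silent" if level < SILENCE_THRESHOLD else "not silent"
    print(name, round(level, 1), verdict)
```

If your recording has audible background noise, a threshold of -40 may never be reached; raising it towards 0 makes the detection more lenient.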

In [22]:
# modified code, originally from:
#   https://www.codespeedy.com/split-audio-files-using-silence-detection-in-python/
#   retrieved on 2024-08-23
import os
import copy
from pydub import AudioSegment
from pydub.silence import split_on_silence
from typing import Union

def chunk_audio(filelist: list[str], silence_len=800,silence_thr=-40, chunklen: float=0., training_len: int=300, out_path="./chunks", quiet: bool=False):
    """
    Input
    ---
        filelist: list of files (relative paths required) to be used as training data\\
        silence_len: define how long a silent period needs to be for a split to occur (milliseconds)\\
        silence_thr: set the intensity threshold, values below it will be counted as silence (dBFS)\\
        chunklen: required minimum length of a chunk for it to be saved (seconds)\\
        training_len: intended length of the training data (seconds)\\
        out_path: where to store the resulting audio chunks\\
        quiet: whether or not to print any status messages\\
    """
    # necessary to avoid outside-scope filelist being emptied inside current function
    filelist = copy.copy(filelist)
    count = 0
    length = 0.
    if not os.path.exists(out_path):
        os.makedirs(out_path)
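    while length < training_len and filelist:
        # iterate over a snapshot: removing items from the list being looped over skips entries
        for file in list(filelist):
            # stop loading further files once enough training audio has been collected
            if length >= training_len:
                break
            filelist.remove(file)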
            # load file (may take long for large files)
            sound = AudioSegment.from_wav(file)
            # split the audio wherever a sufficiently long silent passage occurs
            audio_chunks = split_on_silence(sound, min_silence_len=silence_len, silence_thresh=silence_thr)
            for pre_chunk in audio_chunks:
                # halve overly long chunks, then save each piece as a FLAC file
                cut_chunks = _cut_chunks(pre_chunk)
                for chunk in cut_chunks:
                    if chunk.duration_seconds >= chunklen:
                        output_file = "{0}/chunk{1}.flac".format(out_path, count+1)
                        # if the current chunk will exceed the intended length of the training data,
                        # cut it in order to exactly reach the training length
                        if length+chunk.duration_seconds >= training_len:
                            overlength = (length+chunk.duration_seconds)-training_len
                            overlength_ms = round(overlength*1000)
                            chunk = chunk[:-overlength_ms]
                        if chunk.duration_seconds == 0: continue
                        length += chunk.duration_seconds
                        count += 1
                        chunk.export(output_file, format="flac")
                        # skip printing if quiet-flag is set (exists mostly for not cluttering the testing)
                        if quiet: continue
                        print("Exported file", output_file, "({0})".format(len(chunk)))
                    else:
                        if quiet: continue
                        print("Skipping Chunk: Too short (< {0} seconds)".format(chunklen))
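    if not quiet:
        # guard against a division by zero in case no chunk was long enough to be saved
        average = round(length / count, 2) if count else 0.0
        print("\nAverage length of saved chunks: {0} Seconds".format(average))
        print("\nTotal length of saved chunks: {0} Seconds".format(round(length,2)))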


def _cut_chunks(chunk: AudioSegment):
    out_list = []
    if chunk.duration_seconds >= 4:
        total_len = len(chunk)
        half_len = total_len // 2
        new_chunks = [chunk[:half_len], chunk[half_len:]]
        for c in new_chunks:
            out_list.extend(_cut_chunks(c))
        return out_list
    else:
        return [chunk]


In [10]:
# Unittest of function chunk_audio(), with setup/teardown fixtures
#   Testing is limited to a single set of parameters.
import unittest
import os
import shutil
from pydub import AudioSegment

class TestChunkAudio(unittest.TestCase):

    @classmethod
    def setUpClass(cls):
        cls.out_path = "./test/resources/temp"
        os.mkdir(cls.out_path)
        chunk_audio(filelist=["./test/resources/test_audio_12s.wav"], silence_len=800, silence_thr=-40, chunklen=1.5, training_len=6, out_path=cls.out_path, quiet=True)

    def test_number_of_files(self):
        # check number of generated files against expected number
        self.assertEqual(len(os.listdir(self.out_path)), 3)

    def test_total_length(self):
        # sum length of chunks...
        total_len = sum([AudioSegment.from_file(f'{self.out_path}/{file}', format="flac").duration_seconds for file in os.listdir(self.out_path)])
        # ...check against expected length. Because we're dealing with floating point numbers, we may not get perfectly round numbers, thus we only check near-equality.
        self.assertAlmostEqual(total_len,6.0, 4)

    @classmethod
    def tearDownClass(cls):
        shutil.rmtree(cls.out_path)

res = unittest.main(argv=[''], verbosity=3, exit=False)



test_number_of_files (__main__.TestChunkAudio) ... ok
test_total_length (__main__.TestChunkAudio) ... ok

----------------------------------------------------------------------
Ran 2 tests in 2.466s

OK


In [50]:
# VARIABLES
MIN_SILENCE_LEN = 800
SILENCE_THRESHOLD = -40
MIN_CHUNK_LEN = 1.5

# Example run: chunk a single file, keeping only 6 seconds of usable audio
filelist = ["./LN_AUDIOFILES/brn1/bernie_filibuster_pt1_5min.wav"]

chunk_audio(filelist, silence_len=MIN_SILENCE_LEN, silence_thr=SILENCE_THRESHOLD, chunklen=MIN_CHUNK_LEN, training_len=6, out_path="./z_testchunks")


Skipping Chunk: Too short (< 1.5 seconds)
Skipping Chunk: Too short (< 1.5 seconds)
Exported file ./z_testchunks/chunk1.flac (1964)
Exported file ./z_testchunks/chunk2.flac (4036)
Skipping Chunk: Too short (< 1.5 seconds)
Skipping Chunk: Too short (< 1.5 seconds)
Skipping Chunk: Too short (< 1.5 seconds)
Skipping Chunk: Too short (< 1.5 seconds)
Skipping Chunk: Too short (< 1.5 seconds)
Skipping Chunk: Too short (< 1.5 seconds)
Skipping Chunk: Too short (< 1.5 seconds)
Skipping Chunk: Too short (< 1.5 seconds)
Skipping Chunk: Too short (< 1.5 seconds)
Skipping Chunk: Too short (< 1.5 seconds)
Skipping Chunk: Too short (< 1.5 seconds)
Skipping Chunk: Too short (< 1.5 seconds)
Skipping Chunk: Too short (< 1.5 seconds)
Skipping Chunk: Too short (< 1.5 seconds)
Skipping Chunk: Too short (< 1.5 seconds)

Average length of saved chunks: 3.0 Seconds

Total length of saved chunks: 6.0 Seconds


&#x270f;&#xfe0f; Insert the relative path to the audio-files of your target speaker into `filelist`:

In [9]:
# VARIABLES
MIN_SILENCE_LEN = 800
SILENCE_THRESHOLD = -40
MIN_CHUNK_LEN = 1.5

# USER TODO: list of files to use for training
filelist = [""]

for file in filelist:
    assert os.path.exists(file), f"{file} is not a valid file path. Make sure you are using each file's relative path (from FreeVC-finetune)"

chunk_audio(filelist, silence_len=MIN_SILENCE_LEN, silence_thr=SILENCE_THRESHOLD, chunklen=MIN_CHUNK_LEN)

AssertionError:  is not a valid file path. Make sure you are using each file's relative path (from FreeVC-finetune)

### Preprocessing

At this point, we have some audio data in appropriately sized chunks. We now need to run some very particular preprocessing steps on it, so that the model receives it in the right format.

#### Storage Format

FreeVC expects our fine-tuning data to be in the same format as its original training data - the [VCTK-dataset](https://datashare.ed.ac.uk/handle/10283/3443). Therefore, we need to rename some files and move them to the right places before we run any preprocessing.

You'll need to assign some **4-character** ID to your speaker - pick one that makes sense to you. If it's longer or shorter than 4 characters, this won't work.

&#x270f;&#xfe0f; Add the ID of your choice under `SPEAKER_ID`:

In [4]:
# USER TODO:  pick your Speaker ID
SPEAKER_ID = "brn1"

assert len(SPEAKER_ID) == 4

# DIRECTORY NAMES
CHUNKS = "./chunks/"
FLACS = "./dataset/flac/"
REL_DATA_PATH = f'{FLACS}{SPEAKER_ID}/'
DATA16K = "dataset/finetuning-16k"
DATA22K = "dataset/finetuning-22k"

The following cell copies your audio chunks into a directory with the required structure, renaming them along the way:

`<some_dir>/<sp_id>/<sp_id-filename>_mic2.flac`

In [42]:
import os
import shutil
if not os.path.exists(REL_DATA_PATH):
    os.makedirs(REL_DATA_PATH)

for i, file in enumerate(sorted(os.listdir(CHUNKS))):
    # copy each chunk into the speaker directory under the required file name
    new_filename = f'{SPEAKER_ID}-{i}_mic2.flac'
    source_path = os.path.join(CHUNKS, file)
    shutil.copy(source_path, os.path.join(REL_DATA_PATH, new_filename))
print("Copied and renamed your training files.\nGreat Success!! Very Nice!")


#### Downsampling

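`downsample.py` resamples the audio and writes one copy per target sampling rate. Here, the copy in `DATA16K` is the 16kHz audio the model is trained on, and the copy in `DATA22K` is used later for the data augmentation. The script accepts the following arguments:
- `--sr1`: sampling rate of the first copy
- `--sr2`: sampling rate of the second copy
- `--in_dir`: path to the source directory
- `--out_dir1`: path to the target directory of the first copy
- `--out_dir2`: path to the target directory of the second copy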


In [27]:
! python downsample.py --in_dir $FLACS --out_dir1 $DATA16K --out_dir2 $DATA22K
! ln -s $DATA16K DUMMY

#### Data Splitting

Next, our fine-tuning data will need to be split into a training, test and validation set.

The original splitting-script of FreeVC uses 2 chunks from each speaker for validation, 10 chunks for testing and the rest for training. With an average of around 400 chunks per speaker, this is an average test-split of 2.5%, and validation-split of 0.5%. To me, this seems like an overly small test and validation portion.

Therefore, I modified the preprocessing script: instead of a constant 10 and 2 samples, the test and validation sets now take a relative share of 5% and 1% of the chunks, respectively.

_(As to whether this improves or worsens performance, I have no empirical evidence for either and I do not intend to gather it.)_
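
As a rough illustration of the relative split described above, here is a small sketch. It is not the actual code of `preprocess_flist.py`: the function name, the shuffling, and the rule of keeping at least one file in each held-out set are assumptions made for this example.

```python
import random

def split_files(files, test_frac=0.05, val_frac=0.01, seed=0):
    """Shuffle files and split them into train/test/validation lists by proportion."""
    files = list(files)
    random.Random(seed).shuffle(files)
    n_test = max(1, round(len(files) * test_frac))
    n_val = max(1, round(len(files) * val_frac))
    test = files[:n_test]
    val = files[n_test:n_test + n_val]
    train = files[n_test + n_val:]
    return train, test, val

chunks = [f"brn1-{i}_mic2.flac" for i in range(400)]
train, test, val = split_files(chunks)
print(len(train), len(test), len(val))
```

With a relative split, the held-out sets grow with the amount of data, whereas the constant split always holds out 12 chunks regardless of how much audio you have.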

In [7]:
val_file="./filelists/finetune-val.txt"
test_file="./filelists/finetune-test.txt"
train_file="./filelists/finetune-train.txt"
! python preprocess_flist.py --train_list $train_file --test_list $test_file --val_list $val_file --source_dir $DATA16K
! rm DUMMY
! ln -s $DATA16K DUMMY

#### Speaker Encoder (pretrained)

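In this step, the pretrained speaker encoder (see the architecture section above) is run over the 16kHz audio of the target speaker. The resulting speaker embeddings are stored under `DATA_ROOT`, so that they are available during training.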

In [7]:
# declare variable
DATA_ROOT="./dataset"

In [20]:
! CUDA_VISIBLE_DEVICES=0 python preprocess_spk.py --in_dir $DATA16K --out_dir_root $DATA_ROOT

#### Data Augmentation

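To make the most of our data, FreeVC uses a data augmentation technique called **spectrogram resize (SR)**. Roughly speaking, the mel-spectrogram of each training utterance is stretched or squeezed along the frequency axis (the `--min` and `--max` arguments below set the range of resized heights), turned back into audio with the HiFi-GAN vocoder, and then passed through WavLM. The distortion changes how the speaker sounds while leaving the content intact, which helps the model learn to ignore speaker information in the source signal.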

In [29]:
# declare variable
HIFIGAN_CFG = "hifigan/config.json"
WAV_DIR = "dataset/sr/wav"
SSL_DIR = "dataset/sr/wavlm"

# Perform data augmentation (spectrogram resize)
! CUDA_VISIBLE_DEVICES=0 python preprocess_sr.py --in_dir $DATA22K --wav_dir $WAV_DIR --ssl_dir $SSL_DIR --config $HIFIGAN_CFG --min 68 --max 92 --sr 16000

### Finetuning

The hyperparameters of training are set in a JSON file, located in the `/configs/` directory. For finetuning, we'll use the file `freevc-finetune.json`. 
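
Such a config can be inspected or adjusted with a few lines of Python. The sketch below works on an in-memory stand-in rather than the real file. The keys `train.epochs`, `data.training_files` and `data.validation_files` are the ones this notebook modifies later on, while the initial values shown are invented placeholders, not the real defaults.

```python
import json

# stand-in for the contents of a config file (placeholder values)
raw = """
{
    "train": {"epochs": 1},
    "data": {"training_files": "TRAIN_LIST.txt", "validation_files": "VAL_LIST.txt"}
}
"""

config = json.loads(raw)
config["train"]["epochs"] = 5  # change a hyperparameter
config["data"]["training_files"] = "./filelists/finetune-train.txt"
config["data"]["validation_files"] = "./filelists/finetune-val.txt"

# json.dumps produces the text you would write back to the config file
print(json.dumps(config, indent=4))
```

Editing a copy of the config rather than the original keeps a known-good baseline around for later runs.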

In [8]:
# train freevc: use config 'configs/freevc-finetune.json' and store the run under MODEL_NAME
MODEL_NAME = 'freevc_finetune'
! echo $MODEL_NAME
! CUDA_VISIBLE_DEVICES=0 python finetune.py -c configs/freevc-finetune.json -m $MODEL_NAME -d ./checkpoints/D-freevc.pth -g ./checkpoints/freevc.pth --force_new

#### Generating Output

We're almost there.

As a last step, we'll need to define the audio recording(s) that we actually want to convert to our target speakers. To do so, all we need to do is edit the file `convert.txt` - let's call it the **task file**. 

The structure of the task file is simple: 
- Each row corresponds to a single *task*, i.e. one source audio being converted to a target speaker.
- There are three columns for each row, separated by a single pipe symbol (`|`). The first column defines the name of the task, the second column contains the path to the source file, and the third column contains the path to an audio file of the target speaker.

This could look as follows: (Alice is the source, Bob the target speaker.)

```txt
alice2bob_1|PATH/TO/ALICE_1.wav|PATH/TO/BOB.wav
alice2bob_2|PATH/TO/ALICE_2.wav|PATH/TO/BOB.wav
alice2bob_3|PATH/TO/ALICE_3.wav|PATH/TO/BOB.wav
...
```

- `alice2bob_X` is simply the name of the conversion - this will mainly be used to name the output file and in the logs.
- `PATH/TO/ALICE_X.wav` is the path to the source files - recordings of Alice, which we want to convert to Bob's voice
- `PATH/TO/BOB.wav` is the path to the audiofile of Bob's voice - the target. Note that this can be the same for various different source files.
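
Because the format is just pipe-separated text, a task file is easy to sanity-check before starting a conversion. Here is a small sketch of such a check. `parse_task_lines` is a hypothetical helper for this notebook, not something `convert.py` provides.

```python
def parse_task_lines(lines):
    """Parse 'name|source|target' rows into tuples, skipping blank lines."""
    tasks = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        parts = line.split("|")
        if len(parts) != 3:
            raise ValueError(f"line {number}: expected 3 columns, got {len(parts)}")
        tasks.append(tuple(parts))
    return tasks

example = [
    "alice2bob_1|PATH/TO/ALICE_1.wav|PATH/TO/BOB.wav",
    "alice2bob_2|PATH/TO/ALICE_2.wav|PATH/TO/BOB.wav",
]
for name, source, target in parse_task_lines(example):
    print(f"{name}: {source} -> {target}")
```

To check a real task file, pass the open file object instead of the example list, since iterating over a file yields its lines.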

> **Tip &#x1F4A1;**
>
> Within the task file, we do **not** need to adhere to specific filenaming (such as 4-character speaker ids) 

Each user's task file will look different. Thus, this is something you'll have to do yourself. To help you however, there's a simple function (`fill_task_file()`) to potentially make things a bit easier and faster. Note that it requires you to have all your source files in the same directory, and the name will be kept relatively simple. 



In [2]:
import os

def fill_task_file(basename: str, source: str, target: str, taskfile: str="convert.txt") -> str:
    """
    Inputs:
    ---
        basename: Name to be used as the base for naming each conversion. The name of the nth task will be 'basename_n'.\\
        source: path to the DIRECTORY containing the source files.\\
        target: path to the FILE containing the target speaker.\\
        taskfile: allows the user to name their task file something other than 'convert.txt'.\\
    
    Output:
    ---
        taskfile
    """
    # opening in "w" mode clears any previous content of the task file
    with open(taskfile, "w", encoding="utf-8") as f:
        for i, file in enumerate(sorted(os.listdir(source)), start=1):
            # one task per line: name | path to source file | path to target file
            f.write(f"{basename}_{i}|{os.path.join(source, file)}|{target}\n")
    return taskfile


&#x270F;&#xFE0F;
Enter the desired conversions into `convert.txt`. You can use the function `fill_task_file()` to do this in a quick and simple way.

In [None]:
# USER TODO: adjust parameters
converttxt = fill_task_file("alice2bob", source="PATH/TO/DIR/ALICE", target="PATH/TO/FILE/BOB")

In [4]:
from util_finetune import get_max_checkpoint

checkpoint = get_max_checkpoint(MODEL_NAME)

# convert the audio to the target speaker, according to the convert.txt
! CUDA_VISIBLE_DEVICES=0 python convert.py --hpfile configs/freevc-finetune.json --ptfile $checkpoint --txtpath convert.txt --outdir outputs/freevc-finetune

In [None]:
# TODO
#   Contrast finetuned with base model 

### Running multiple fine-tuning experiments

Perhaps you now want to experiment with different settings. The function `wrap_experiments()` takes two lists as input arguments (among others):
- `training_amounts` : a list of different amounts of training data (in seconds)
- `different_epochs` : a list with varying number of epochs

When the function is run, each different training-amount is paired with each number of epochs, and a model is fine-tuned according to these two parameters (whilst the remaining parameters stay the same for all combinations).

The experiment wrapper makes sure the necessary pre-processing steps are run only once for each training amount. After pre-processing all the audio for a specific training amount, it fine-tunes separate models on the same data but with different numbers of epochs. Additionally, it provides an option to **keep the training data stored** even after a run is finished by setting the `keep_stored` flag to `True`. This way, in case of subsequent fine-tuning runs with the same amount of training data, the pre-processing steps can be skipped. Instead, the data is simply moved from its depot into the active directories used during training.

It finishes each fine-tuning run by also generating audio -- again using the 'tasks' in `convert.txt`.

Model checkpoints and generated outputs are named according to the following format: <br>`freevc-ft-<SPEAKER_ID>-<TRAINING_AMOUNT>s-<EPOCHS>ep`
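
The pairing of training amounts and epochs, together with the resulting names, can be previewed without training anything. The sketch below only mirrors the naming format above. `planned_runs` is an illustrative helper and not part of `wrap_experiments()`.

```python
from itertools import product

def planned_runs(speaker_id, training_amounts, different_epochs):
    """Return the model name for every (training amount, epochs) combination."""
    return [
        f"freevc-ft-{speaker_id}-{amount}s-{epochs}ep"
        for amount, epochs in product(training_amounts, different_epochs)
    ]

for name in planned_runs("brn1", [30, 300], [2, 10]):
    print(name)
```

The number of runs is the product of the two list lengths, so it pays to preview the plan before starting a long experiment.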


<details>

**<summary>Example</summary>**

Let's say you wanna compare fine-tuning on 30 seconds and 5 minutes of training data, as well as 2 and 10 epochs.

The experiment wrapper trains a total of $4$ ($2\times 2$) fine-tuned models, in all possible combinations: 
- 30 seconds, 2 epochs
- 30 seconds, 10 epochs
- 300 seconds, 2 epochs 
- 300 seconds, 10 epochs 

It would first generate and pre-process the 30-second training data, create a fine-tuned model on 2 and 10 epochs respectively, and then repeat the process for the 300-second training data.

The two input lists in this example would look as follows:
```python
amounts = [30, 300]
epochs = [2, 10]

wrap_experiments(training_amounts=amounts, different_epochs=epochs, ...)
```

</details>

In [21]:
import json
import os
import shutil
from util_finetune import get_max_checkpoint

# function used to wrap a single experiment, wraps several audio processing functions, training and conversion
def train_and_generate(audio_len: int=300, epochs: int=5, new_config_pars: dict={}, speaker_id: str="", force_train: bool=True):
    """
    Fine-tune a single model with a certain amount of training data on a certain number of epochs.
    
    Inputs
    ---
        audio_len: length of audio used for training in seconds. If the value exceeds the maximum possible value of the training data, it defaults to the maximum possible value\\
        epochs: number of epochs used for finetuning\\
        new_config_pars: dict containing some specific parameters for training, mainly the required training files at this point\\
        speaker_id: 4-character speaker ID.\\
        force_train: if True, fine-tune from the base model even if a model with this name already exists; if False, an existing model is reused and only the conversion is run\\
    Returns
    ---
        the name of the fine-tuned model, which is also used to name its checkpoint and output directories
    """
    # check that the following variables are indeed declared, by asserting that they aren't the default values.
    #   You may ask, "why have default values then, if you don't actually want those values?".
    #   Well, because I want to have them in that position and I have some default values declared before, 
    #   and python doesn't let me have any parameters without default values later.
    #   Is it "nice" programming style? Probably not. Do I care? Not enough. Does it matter? Not really. Is this comment getting way too long? Yes.
    assert speaker_id != ""
    assert new_config_pars != {}

    # make changes to a copy of the finetune-config, leaving the original untouched
    config_file = shutil.copy("./configs/freevc-finetune.json", "./configs/freevc-finetune-exp.json")

    MODEL_NAME = f"freevc-ft-{speaker_id}-{audio_len}s-{epochs}ep"

    if not os.path.exists(f'./checkpoints/{MODEL_NAME}') or force_train:
        with open(config_file, "r", encoding="utf-8") as cfg:
            config = json.load(cfg)
        # modify config file (set epochs and filelists)
        config["train"]["epochs"] = epochs
        config["train"]["eval_interval"] = epochs
        config["train"]["log_interval"] = epochs
        config["data"]["training_files"] = new_config_pars["training_files"]
        config["data"]["validation_files"] = new_config_pars["validation_files"]
        
        with open(config_file, "w", encoding="utf-8") as cfg:
            json.dump(config, cfg, indent=4)

        ! CUDA_VISIBLE_DEVICES=0 python finetune.py -c configs/freevc-finetune-exp.json -m $MODEL_NAME -d ./checkpoints/D-freevc.pth -g ./checkpoints/freevc.pth --force_new

    checkpoint = get_max_checkpoint(modelname=MODEL_NAME)
    # TODO: convert.txt
    converttxt = "convert.txt"
    ! CUDA_VISIBLE_DEVICES=0 python convert.py --hpfile configs/freevc-finetune-exp.json --ptfile $checkpoint --txtpath $converttxt --outdir outputs/$MODEL_NAME

    return MODEL_NAME
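
The config edits inside `train_and_generate` can also be written as a small function that works on a copy of the loaded config, which makes them easy to check without starting a training run. This is only a minimal sketch: the helper name `make_experiment_config` is hypothetical, and the toy `base` dict merely stands in for the contents of `configs/freevc-finetune.json`. The keys it sets are the same ones the function above sets.

```python
import copy

def make_experiment_config(base_config, epochs, training_files, validation_files):
    """Return a modified copy of base_config; the input dict is left untouched."""
    config = copy.deepcopy(base_config)
    # same assignments as in train_and_generate
    config["train"]["epochs"] = epochs
    config["train"]["eval_interval"] = epochs
    config["train"]["log_interval"] = epochs
    config["data"]["training_files"] = training_files
    config["data"]["validation_files"] = validation_files
    return config

# toy stand-in for the real config file (assumption: only the keys used above)
base = {
    "train": {"epochs": 1, "eval_interval": 1, "log_interval": 1},
    "data": {"training_files": "", "validation_files": ""},
}
exp = make_experiment_config(
    base,
    epochs=20,
    training_files="./filelists/finetune-60sec-train.txt",
    validation_files="./filelists/finetune-60sec-val.txt",
)
print(exp["train"]["epochs"], base["train"]["epochs"])  # 20 1
```

Because the function returns a new dict, the result can be passed to `json.dump` exactly as in the cell above.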

In [20]:
import shutil
import os
from util_finetune import move_all_files

def wrap_experiments(training_amounts: list[int], different_epochs: list[int], training_data: list[str], speaker_id: str, keep_stored: bool=True, force_train: bool=True):
    """
    Inputs
    ---
        **training_amounts**: list of different amounts of training data, i.e. the length of the audio training material in seconds\\
        **different_epochs**: list of different numbers of epochs to fine-tune the model for\\
        **training_data**: list of paths of all files to be considered for training\\
        **speaker_id**: 4-character ID of the speaker\\
        **keep_stored**: if True, the data generated for each training amount is not deleted but moved to `./depot/`, so it does not have to be regenerated in future experiments with the same amount\\
        **force_train**: if True, fine-tune even if a model for this configuration already exists; if False, skip training for existing models and only convert\\
    """
    # speaker_id is needed to name/identify training data 
    assert speaker_id != ""
    assert os.listdir("./")
    
    # list of models (strings of identifying model names) that will be the output of the experiment wrapper
    finetuned_models = []

    if not os.path.exists("./dataset/"):
        os.mkdir("./dataset")
    # idea: iterate through audio lengths first (all of which share the same training data).
    #   allows us to keep the training data whilst experimenting with different number of epochs
    for tl in training_amounts:
        try:
            # consistent naming scheme including speaker & amount of training data, used to keep training files stored (in ./depot/...)
            depot_path = f"./depot/{speaker_id}/{str(tl)}_sec"

            # adjust paths ONLY IF NECESSARY
            DATA16K =  "./dataset/finetuning-16k"
            CHUNKS =  "./chunks/"
            DATA_ROOT =  "./dataset"
            
            training_amt = f'{tl}sec'
            val_file= f"./filelists/finetune-{training_amt}-val.txt"
            test_file =  f"./filelists/finetune-{training_amt}-test.txt"
            train_file =  f"./filelists/finetune-{training_amt}-train.txt"
            
            # create directories if they do not exist yet
            for dir in [CHUNKS, "./filelists"]:
                if not os.path.exists(dir):
                    os.mkdir(dir)

            if os.path.exists(depot_path) and os.listdir(depot_path):
                print("path exists, moving files from depot to active directory...")
                # "recover" existing data from the depot, avoid recreating data that already exists
                #   data (./dataset)
                move_all_files(src=f'{depot_path}/dataset', dst="./dataset")
                #   chunks
                move_all_files(src=f'{depot_path}/chunks', dst=CHUNKS)
                #   filelists
                move_all_files(src=f'{depot_path}/filelists', dst="./filelists")
                #   create DUMMY-link
                ! ln -s $DATA16K DUMMY

            # OR call audio pre-processing for complete training amount
            else:
                print(f'\nPROCESSING AUDIO:\n\t Amount of Training Data: {tl}s')
                # create chunks
                chunk_audio(filelist=training_data, training_len=tl)

                # declare variables necessary for DOWNSAMPLING
                FLACS =  "./dataset/flac/"
                REL_DATA_PATH =  f'{FLACS}{speaker_id}/'
                DATA22K =  "./dataset/finetuning-22k"

                # create directories where necessary
                for dir in [FLACS, REL_DATA_PATH, DATA22K]:
                    if not os.path.exists(dir):
                        os.mkdir(dir)
                
                for i,file in enumerate(os.listdir(CHUNKS)):
                    # rename & copy files to the specific format necessary
                    new_filename = f'{speaker_id}-{i}_mic2.flac'
                    chunk_path = os.path.join(CHUNKS, file)
                    shutil.copy(chunk_path, os.path.join(REL_DATA_PATH, new_filename))
                # downsampling operation
                print("Moved and renamed your training files.\nGreat Success!! Very Nice!")
                ! python downsample.py --in_dir $FLACS --out_dir1 $DATA16K --out_dir2 $DATA22K


                # create filelists
                ! python preprocess_flist.py --train_list $train_file --test_list $test_file --val_list $val_file --source_dir $DATA16K
                ! ln -s $DATA16K DUMMY

                # declare variables necessary for SPEAKER ENCODING
                ! CUDA_VISIBLE_DEVICES=0 python preprocess_spk.py --in_dir $DATA16K --out_dir_root $DATA_ROOT --num_workers 2

                # declare variables necessary for DATA AUGMENTATION
                HIFIGAN_CFG =  "hifigan/config.json"
                WAV_DIR =  "dataset/sr/wav"
                SSL_DIR =  "dataset/sr/wavlm"
                # Perform data augmentation (spectrogram resize)
                ! CUDA_VISIBLE_DEVICES=0 python preprocess_sr.py --in_dir $DATA22K --wav_dir $WAV_DIR --ssl_dir $SSL_DIR --config $HIFIGAN_CFG --min 68 --max 92 --sr 16000

            # make sure filelists actually exist
            for file in [val_file, train_file, test_file]:
                assert os.path.exists(file), f"file {file} does not exist."

            # ### EPOCH LOOP ####
            # finetune the model with different numbers of epochs, but using the same amount of training data 
            for eps in different_epochs:
                # adjust config
                add_to_config= {"training_files": train_file, "validation_files": val_file}
                
                # call training functions
                ft_model = train_and_generate(audio_len=tl, epochs=eps, new_config_pars=add_to_config, speaker_id=speaker_id, force_train=force_train)
                # record the model for every (training amount, epochs) combination
                finetuned_models.append(ft_model)
            # ### #### #### ####
        finally:
            if keep_stored:
                # keep files and simply move them to the depot
                if not os.path.exists(depot_path):
                    os.makedirs(depot_path)

                move_all_files(src="./dataset", dst=f'{depot_path}/dataset')
                #  chunks
                move_all_files(src="./chunks", dst=f'{depot_path}/chunks')
                #  filelists
                move_all_files(src="./filelists", dst=f'{depot_path}/filelists')
                #  DUMMY
                ! rm DUMMY
            else:
                # DELETE existing chunks (./chunks) and data (./dataset) for the current training amount
                #  data (./dataset)
                shutil.rmtree("./dataset")
                #  chunks
                shutil.rmtree(CHUNKS)
                #  filelists
                for file in [val_file, train_file, test_file]:
                    os.remove(file)
                # DUMMY (symlink)
                ! rm DUMMY
    # return a list of modelnames of the finetuned models, so they can be used e.g. to compare generated output
    return finetuned_models
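
The `ln -s` and `rm` calls for the `DUMMY` link above fail (or print an error) when the link already exists or is already gone. As a hedged alternative, here is a minimal sketch that does the same job with the standard library. The helper names `refresh_symlink` and `remove_symlink` are hypothetical and not part of FreeVC or `util_finetune`; the sketch assumes a platform where creating symlinks is permitted.

```python
import os

def refresh_symlink(target, link_name="DUMMY"):
    """Point link_name at target, replacing an existing symlink of that name."""
    if os.path.islink(link_name):
        # an old link (even a dangling one) is simply replaced
        os.unlink(link_name)
    elif os.path.lexists(link_name):
        # never delete a real file or directory by accident
        raise FileExistsError(f"{link_name} exists and is not a symlink")
    os.symlink(target, link_name)

def remove_symlink(link_name="DUMMY"):
    """Remove link_name if it is a symlink; do nothing if it does not exist."""
    if os.path.islink(link_name):
        os.unlink(link_name)
```

With these, `refresh_symlink("./dataset/finetuning-16k")` would stand in for `! ln -s $DATA16K DUMMY`, and `remove_symlink()` for `! rm DUMMY`.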

In [3]:
# [6, 60] sec [2, 5] ep: TIME=
training_amts= [12,60,300]
# training_amts= [4,300]
epochs = [1,2,5,10]
# training_amts= [4]
# epochs = [4]
# USER TODO
SPEAKER_ID = "brn1"
assert len(SPEAKER_ID) == 4

# alternative inputs, kept for reference:
# training_data = ["./LN_AUDIOFILES/brn1/bernie_filibuster_pt1_5min.wav"]
# training_data = ["bernie_filibuster_22sec.wav"]
training_data = ["./LN_AUDIOFILES/brn1/bernie_filibuster_pt1.wav"]
models = wrap_experiments(training_amts, epochs, training_data=training_data, speaker_id=SPEAKER_ID, force_train=True)

In [23]:
# [6, 60] sec [2, 5] ep: TIME=
training_amts= [60]
# training_amts= [4,300]
epochs = [20, 40]
# USER TODO
SPEAKER_ID = "brn1"
assert len(SPEAKER_ID) == 4
training_data = ["./LN_AUDIOFILES/brn1/bernie_filibuster_pt1.wav"]
models = wrap_experiments(training_amts, epochs, training_data=training_data, speaker_id=SPEAKER_ID, force_train=True)


PROCESSING AUDIO:
	 Amount of Training Data: 60s
1.4449886621315193
Exported file ./chunks/chunk1.flac (1445)
2.213968253968254
Exported file ./chunks/chunk2.flac (769)
4.177981859410431
Exported file ./chunks/chunk3.flac (1964)
6.492970521541951
Exported file ./chunks/chunk4.flac (2315)
8.807981859410432
Exported file ./chunks/chunk5.flac (2315)
10.534988662131521
Exported file ./chunks/chunk6.flac (1727)
12.473990929705218
Exported file ./chunks/chunk7.flac (1939)
15.254988662131522
Exported file ./chunks/chunk8.flac (2781)
18.03997732426304
Exported file ./chunks/chunk9.flac (2785)
19.960975056689346
Exported file ./chunks/chunk10.flac (1921)
23.472970521541953
Exported file ./chunks/chunk11.flac (3512)
24.37995464852608
Exported file ./chunks/chunk12.flac (907)
27.583945578231297
Exported file ./chunks/chunk13.flac (3204)
30.787936507936514
Exported file ./chunks/chunk14.flac (3204)
32.91591836734695
Exported file ./chunks/chunk15.flac (2128)
35.04392290249434
Exported file ./chun

In [4]:
import os
from util_finetune import get_max_checkpoint

def convert_wrapper(models: list[str], task_file="convert.txt", include_base_model: bool=True, output_path: str="./outputs"):
    """
    Wrapper function for generating audio with multiple models.\\
        Uses each model inside the list `models` to do the voice conversions outlined by the task-file.\\
        To also do conversions using the base-model (FreeVC pre-finetuning), set the `include_base_model` flag to `True`. Do NOT include the base-model in the model-list.
    """
    assert os.path.isfile(task_file), "Please pass a valid task_file as an argument."
    # the base model is handled via `include_base_model`; drop it here without modifying the caller's list
    models = [m for m in models if m != "freevc.pth"]

    if include_base_model:
        checkpoint = "./checkpoints/freevc.pth"
        ! CUDA_VISIBLE_DEVICES=0 python convert.py --hpfile configs/freevc-finetune-exp.json --ptfile $checkpoint --txtpath $task_file --outdir $output_path/freevc-base

    for model in models:
        # use the latest checkpoint of each fine-tuned model
        checkpoint = get_max_checkpoint(modelname=model)
        ! CUDA_VISIBLE_DEVICES=0 python convert.py --hpfile configs/freevc-finetune-exp.json --ptfile $checkpoint --txtpath $task_file --outdir $output_path/$model
        print(f"Converted all tasks in {task_file} using the model '{model}'")
    
    return models
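
`get_max_checkpoint` comes from `util_finetune` and its implementation is not shown in this notebook. To illustrate the idea of picking the most recent checkpoint, here is a minimal sketch under one stated assumption: that generator checkpoints are stored as `G_<step>.pth` files inside the model's directory. The function name `latest_generator_checkpoint` is hypothetical, and the real helper may work differently.

```python
import re
from pathlib import Path

def latest_generator_checkpoint(model_dir):
    """Return the G_<step>.pth file with the largest step in model_dir, or None."""
    best_step, best_path = -1, None
    for path in Path(model_dir).glob("G_*.pth"):
        match = re.fullmatch(r"G_(\d+)\.pth", path.name)
        if match is None:
            continue  # ignore files such as G_best.pth
        step = int(match.group(1))
        # compare steps numerically: as strings, "900" would sort after "2000"
        if step > best_step:
            best_step, best_path = step, path
    return best_path
```

Comparing the step numbers as integers matters because a plain string sort would rank `G_900.pth` above `G_2000.pth`.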

In [9]:
# # do additional conversions with the existing models
# models = []
# speaker_id = "brn1"
# for tl in training_amts:
#     for ep in epochs:
#         models.append(f"freevc-ft-{speaker_id}-{tl}s-{ep}ep")

# out_path = "./outputs"
# models = convert_wrapper(models=models, output_path=out_path)

In [19]:
# Display converted audio examples
import IPython.display as ipd

sr = 16000
# where the converted audiofiles are stored 
out_path = "./outputs"

print("SOURCE AUDIO")
ipd.display(ipd.Audio('resources/audio/EXAMPLE_SOURCE.wav', rate=sr))
print("FreeVC base")
ipd.display(ipd.Audio(f'{out_path}/freevc-base/EXAMPLE.wav', rate=sr))
for mdl in models:
    audiofile = f'{out_path}/{mdl}/EXAMPLE.wav'
    print(mdl)
    ipd.display(ipd.Audio(audiofile, rate=sr))

SOURCE AUDIO


FreeVC base


freevc-ft-brn1-12s-1ep


freevc-ft-brn1-12s-2ep


freevc-ft-brn1-12s-5ep


freevc-ft-brn1-12s-10ep


freevc-ft-brn1-60s-1ep


freevc-ft-brn1-60s-2ep


freevc-ft-brn1-60s-5ep


freevc-ft-brn1-60s-10ep


freevc-ft-brn1-300s-1ep


freevc-ft-brn1-300s-2ep


freevc-ft-brn1-300s-5ep


freevc-ft-brn1-300s-10ep


### Evaluation

The evaluation of the various models is limited to my personal judgement of outputs generated from a single source speaker, a single target speaker and a single example sentence. Based on that, fine-tuning FreeVC with custom data (such as a snippet of Bernie Sanders' [famous filibuster](https://commons.wikimedia.org/wiki/File:Bernie_Sanders_-_full_2010-12-10_filibuster.webm), as I did) does not significantly improve the output quality.

The amount of training data has only a minimal effect on the output quality, although to be fair, I've only experimented with relatively small amounts (5 minutes at most). Varying the number of epochs has a much larger effect. The challenge here is to find the _sweet spot_ for the number of epochs: too few, and the output doesn't sound much closer to the target's voice than the base model's output does. Too many, and the naturalness of the output starts to decline drastically: it sounds more like the target, but it also sounds somewhat robotic and has some strange-sounding noise artifacts.

With experiments at this small scale, it's hard to make a recommendation, but I'll try anyway. I expect some value between 10 and 25 epochs to give the best compromise between naturalness and resemblance to the target voice. Feel free to experiment outside of that range and prove me wrong!

In terms of data amount, I believe 30-60 seconds may be enough.
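
To make that recommendation a bit more concrete, here is a minimal sketch of how a follow-up grid over those ranges could be enumerated, using the model naming scheme from `train_and_generate`. The helper `model_names` is hypothetical and the grid values are just one possible choice within the ranges above, not settings I have tested.

```python
from itertools import product

def model_names(speaker_id, training_amounts, epoch_counts):
    """List model names for every (training amount, epochs) combination."""
    return [
        f"freevc-ft-{speaker_id}-{tl}s-{ep}ep"
        for tl, ep in product(training_amounts, epoch_counts)
    ]

# illustrative grid: 30-60 seconds of audio, 10-25 epochs
grid = model_names("brn1", [30, 60], [10, 15, 20, 25])
print(len(grid))   # 8
print(grid[0])     # freevc-ft-brn1-30s-10ep
```

The resulting list has the same form as the `models` list returned by `wrap_experiments`, so it could be passed to `convert_wrapper` once the corresponding models have been trained.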

### DELETE ALL CELLS BELOW

In [22]:
# MOVE ACTIVE TO DEPOT
import os
from util_finetune import move_all_files

depot_path = "./depot/brn1/300_sec"
if not os.path.exists(depot_path):
    os.makedirs(depot_path)

move_all_files(src="./dataset", dst=f'{depot_path}/dataset')
#  chunks
move_all_files(src="./chunks", dst=f'{depot_path}/chunks')
#  filelists
move_all_files(src="./filelists", dst=f'{depot_path}/filelists')
#  DUMMY
! rm DUMMY

In [21]:
# MOVE DEPOT TO ACTIVE
import os
from util_finetune import move_all_files

depot_path = "./depot/brn1/300_sec"
# if not os.path.exists(depot_path):
#     os.makedirs(depot_path)

DATA16K = "./dataset/finetuning-16k"

move_all_files(src=f'{depot_path}/dataset', dst="./dataset")
#  chunks
move_all_files(src=f'{depot_path}/chunks', dst="./chunks")
#  filelists
move_all_files(src=f'{depot_path}/filelists', dst="./filelists")
#  DUMMY
! ln -s $DATA16K DUMMY

In [11]:


import os
import shutil
if os.path.exists("./filelists") and not os.listdir("./filelists"):
    os.rmdir("./filelists")

if os.path.exists("./dataset") and not os.listdir("./dataset"):
    os.rmdir("./dataset")
    # os.mkdir("./dataset")

os.path.exists("./dataset")

# if os.path.exists("./depot/brn1/2_sec"):
#     shutil.rmtree("./depot/brn1/2_sec")
# if os.path.exists("./depot/brn1/4_sec"):
#     shutil.rmtree("./depot/brn1/4_sec")
# if os.path.exists("./depot/brn1/6_sec"):
#     shutil.rmtree("./depot/brn1/6_sec")
if os.path.exists("./depot/brn1/60_sec"):
    shutil.rmtree("./depot/brn1/60_sec")
! rm DUMMY

rm: cannot remove 'DUMMY': No such file or directory


In [12]:
# ! ln -s $DATA16K DUMMY
! echo $DATA16K




In [None]:
import IPython.display as ipd
import IPython
ref_1 = ""
print("Ref 1")
IPython.display.display(ipd.Audio(ref_1.numpy(), rate=sr))
print("Example 1")
IPython.display.display(ipd.Audio(example_1.numpy(), rate=sr))
print("Example 2")
IPython.display.display(ipd.Audio(example_2.numpy(), rate=sr))


In [19]:
import os
for dir in os.listdir("./depot/brn1/"):
    print(dir)
    if not dir.startswith("_"):
        secs = dir.split("_")[0]
        print(secs)
        pathhh = f"./depot/brn1/{dir}/filelists"
        for file in os.listdir(pathhh):
            # e.g. finetune-val.txt -> finetune-60sec-val.txt
            new_fn = f'{file.split("-")[0]}-{secs}sec-{file.rsplit("-", 1)[-1]}'
            os.rename(f'{pathhh}/{file}', f'{pathhh}/{new_fn}')

_60_sec
12_sec
12
4_sec
4
_4_sec
60_sec
60
300_sec
300


In [20]:
# test directory moving
import os
import shutil

def populate_dir(dir:str, num_dummyfiles:int):
    for i in range(num_dummyfiles):
        with open(f'{dir}/{i}.txt', "w") as f:
            f.write("x")

dirname = "justfortest"
os.mkdir(dirname)
depot = f'{dirname}/depot'
os.mkdir(depot)
dataset_dir = f'{depot}/dataset'
os.mkdir(dataset_dir)
os.mkdir(f'{dataset_dir}/a')
os.mkdir(f'{dataset_dir}/b')
populate_dir(f'{dataset_dir}/a',2)
populate_dir(f'{dataset_dir}/b',2)
populate_dir(dataset_dir, 10)
main = f'{dirname}/main'
os.mkdir(main)


In [None]:
# DELETE LATER, JUST STATS

import os
def test_split(total_len):
    n_test = max(round(total_len*0.05), 1)
    n_val = max(round(total_len*0.01), 1)
    n_train = total_len-(n_test+n_val)

    assert total_len == n_test+n_val+n_train
    print(total_len, ":\t",n_train,", ",n_test,", ", n_val)
    assert total_len>=10, "message something"

dir = os.path.expanduser("~")
os.walk(dir)
vctk_path = os.path.abspath("../../../../../../mnt/c/Users/mhess/Downloads/VCTK-Corpus-0.92/wav48_silence_trimmed")
dirlist = os.listdir(vctk_path)
# TODO: get avg number of chunks/speaker in vctk
counter = 0
num_chunks = 0
for el in dirlist:
    combined_path = os.path.join(vctk_path,el)
    # print(f'{el}:\t{os.path.isdir(combined_path)}')
    if os.path.isdir(combined_path):
        counter += 1
        num_chunks += (len(os.listdir(combined_path)))/2

print(f'AVG chunks per speaker: {round(num_chunks/counter, 2)}')

In [23]:
# ? DELETE LATER

val_file= "./filelists/finetune-val.txt"
test_file =  "./filelists/finetune-test.txt"
train_file =  "./filelists/finetune-train.txt"
src_dir = "./depot/brn1/4_sec/dataset/finetuning-16k/"
# create filelists
! python preprocess_flist.py --train_list $train_file --test_list $test_file --val_list $val_file --source_dir $src_dir

  0%|                                                     | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/notmyyka/UZH/pp_vc/FreeVC/preprocess_flist.py", line 28, in <module>
    assert total_len>=10, "message something"
AssertionError: message something


In [5]:
total_len = 4

n_test = max(round(total_len*0.05), 1)
# num val set
n_val = max(round(total_len*0.01), 1)
# num train set
n_train = total_len-(n_test+n_val)
# just making sure
assert total_len == n_test+n_val+n_train
n_test

1