This repo contains the training code for the deep neural pitch extractor for Voice Conversion (VC) and TTS used in StarGANv2-VC and StyleTTS. It is the F0 network in StarGANv2-VC and the pitch extractor in StyleTTS.
- Python >= 3.7
- Clone this repository:
git clone https://github.com/yl4579/PitchExtractor.git
cd PitchExtractor
- Install python requirements:
pip install SoundFile torchaudio torch pyyaml click matplotlib librosa pyworld
- Prepare your own dataset and put train_list.txt and val_list.txt in the Data folder (see Training section for more details).
python train.py --config_path ./Configs/config.yml
Please specify the training and validation data in the config.yml file. The data list format needs to be filename.wav|anything; see train_list.txt for an example (a subset of VCTK). Note that you can put anything after the filename because the training labels are generated on the fly.
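The list format above can be produced with a short script. The following is only a minimal sketch: the function name write_data_lists, the 10% validation split, and the placeholder 0 after the | separator are assumptions made for illustration, not part of this repo.

```python
import random
from pathlib import Path


def write_data_lists(wav_dir, out_dir, val_fraction=0.1, seed=0):
    """Split the .wav files under wav_dir into train_list.txt and val_list.txt."""
    wav_paths = sorted(str(p) for p in Path(wav_dir).rglob("*.wav"))
    random.Random(seed).shuffle(wav_paths)  # fixed seed keeps the split reproducible

    # Hold out at least one file for validation when any files exist.
    n_val = max(1, int(len(wav_paths) * val_fraction)) if wav_paths else 0
    splits = {
        "val_list.txt": wav_paths[:n_val],
        "train_list.txt": wav_paths[n_val:],
    }

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, paths in splits.items():
        with open(out_dir / name, "w", encoding="utf-8") as f:
            for path in paths:
                # The text after "|" is a placeholder; it is not used as a label.
                f.write(f"{path}|0\n")
    return len(splits["train_list.txt"]), len(splits["val_list.txt"])
```

For example, write_data_lists("path/to/wavs", "Data") would write both list files into the Data folder. Whether the paths should be absolute or relative depends on where you launch training from.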
Checkpoints and Tensorboard logs will be saved at log_dir. To speed up training, you may want to make batch_size as large as your GPU RAM can take.
Since both harvest and dio are relatively slow, we save the computed F0 ground truth for later use. meldataset.py writes the computed F0 curve to a _f0.npy file for each .wav file. This requires write permission in your data folder.
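The caching behavior described above can be sketched as follows. This is an illustration, not the code in meldataset.py: the helper name load_or_compute_f0, the compute_f0 callback, and the exact cache file name (the .wav path with _f0.npy appended) are assumptions.

```python
import os

import numpy as np


def load_or_compute_f0(wav_path, compute_f0):
    """Return the F0 curve for wav_path, reusing a cached .npy file if present.

    compute_f0 is a hypothetical callable that takes the path and returns
    a 1-D sequence of F0 values (one per frame).
    """
    cache_path = wav_path + "_f0.npy"  # assumed naming; check meldataset.py
    if os.path.isfile(cache_path):
        return np.load(cache_path)

    f0 = np.asarray(compute_f0(wav_path), dtype=np.float64)
    np.save(cache_path, f0)  # this is the step that needs write permission
    return f0
```

With this pattern the slow extraction runs once per file, and later epochs or runs only pay for a file read. If you change the extraction settings, the cached files need to be deleted so they are recomputed.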
In meldataset.py, the F0 curves are computed using PyWorld, one with harvest and another with dio. Both methods are acoustic-based and are unstable under certain conditions. harvest is faster but fails more than dio, so we first try harvest. When harvest fails (determined by the number of frames with non-zero values), the ground truth F0 labels are computed with dio instead. If dio fails, the computed F0 will contain NaN, which is replaced with 0. This is supposed to occur only occasionally and should not affect training, because these samples are treated as noise by the neural network and deep learning models are known to even benefit from slightly noisy datasets. However, if many of your samples have this problem (say > 5%), please remove them from the training set so that the model does not learn from the failed samples.
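The fallback logic above can be sketched roughly as below. This is not the exact code in meldataset.py: the names extract_f0 and failure_rate, the threshold MIN_VOICED_FRAMES = 5, and the 5 ms frame period are assumptions. The two estimators can be passed in explicitly; when omitted, pyworld.harvest and pyworld.dio are used.

```python
import numpy as np

MIN_VOICED_FRAMES = 5  # assumed threshold for "harvest failed"


def extract_f0(wave, sr, harvest=None, dio=None, frame_period=5.0):
    """Try harvest first, fall back to dio, and replace NaN with 0.

    Returns (f0, had_nan), where had_nan marks a sample on which the
    fallback also failed.
    """
    if harvest is None or dio is None:
        import pyworld as pw  # imported lazily so stand-ins can be injected
        harvest = harvest or pw.harvest
        dio = dio or pw.dio

    x = np.ascontiguousarray(wave, dtype=np.float64)  # PyWorld expects float64
    f0, _ = harvest(x, sr, frame_period=frame_period)

    # Too few non-zero (voiced) frames is treated as a harvest failure.
    if np.count_nonzero(f0) < MIN_VOICED_FRAMES:
        f0, _ = dio(x, sr, frame_period=frame_period)

    had_nan = bool(np.isnan(f0).any())
    f0 = np.where(np.isnan(f0), 0.0, f0)
    return f0, had_nan


def failure_rate(flags):
    """Fraction of samples whose F0 extraction failed."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0
```

Collecting the had_nan flag over the whole training list and passing it to failure_rate gives a quick way to check whether the share of failed samples is above the 5% mentioned above.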
Data augmentation is not included in this code. For better voice conversion results, please add your own data augmentation in meldataset.py with audiomentations.
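As a starting point, a minimal augmentation sketch with audiomentations could look like the following. The helper names build_augmenter and augment_wave, the noise amplitudes, and the clipping to [-1, 1] are assumptions; where exactly to call this inside meldataset.py is left to you.

```python
import numpy as np


def build_augmenter():
    """Build a small audiomentations pipeline (additive Gaussian noise only)."""
    from audiomentations import AddGaussianNoise, Compose

    return Compose([
        AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.5),
    ])


def augment_wave(wave, sr, augmenter):
    """Apply the augmenter to a mono waveform and keep it in [-1, 1]."""
    samples = np.asarray(wave, dtype=np.float32)  # audiomentations works on float32
    augmented = augmenter(samples=samples, sample_rate=sr)
    return np.clip(augmented, -1.0, 1.0)
```

The sketch sticks to additive noise on purpose: transforms that change pitch or duration would likely no longer match F0 curves cached from the original audio, so those would need the labels to be recomputed.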