AutoConvert

This repository contains code for the Seminar Audio Processing and Indexing 2021 final project at Leiden University. As a part of this project, we investigate voice style transfer systems. We aim to create an easy-to-use conversion program utilising the AutoVC voice conversion model.

Some audio samples are posted here.

Interface

We implemented an easy-to-use tool that generates audio samples on demand, either from input .wav files or from samples recorded directly via a microphone.

The tool can be started by running:

python record_convert.py --model_path ./path/to/melgan.ckpt --target_embedding_path ./path/to/target_emb.npy --source_embedding_path ./path/to/source_emb.npy

Where:

  • model_path - The location of the trained AutoVC model checkpoint (trained on MelGAN spectrograms)
  • target_embedding_path - The initial location of the target embedding .npy file; the target embedding can be changed dynamically from the interface
  • source_embedding_path - The location of the source embedding .npy file; this must be known beforehand
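The embedding files are plain NumPy arrays saved with np.save. A minimal sketch for loading and creating them; the 256-dimensional size is an assumption based on the AutoVC speaker encoder, so verify it against your own checkpoint:

import numpy as np

# Load an existing speaker embedding; the shape depends on the speaker encoder
# (assumed here to be a 256-dimensional vector, as in AutoVC).
source_emb = np.load("./path/to/source_emb.npy")
print(source_emb.shape, source_emb.dtype)

# A random embedding can serve as a stand-in target, similar to the
# "Randomize Target Embedding" button in the interface.
target_emb = np.random.randn(256).astype(np.float32)
np.save("./path/to/target_emb.npy", target_emb)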

Running this command results in the following menu:

A typical conversion process consists of:

  • Recording an audio sample
  • Converting it (source .wav → source spect → target spect → vocoder → target .wav)
  • Playing it

The results can be saved using the save buttons. The target embedding can be loaded dynamically using Load Target, or a random embedding can be generated using the Randomize Target Embedding button.
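For reference, the conversion chain above maps onto code roughly as follows. The autovc_convert and vocode functions are hypothetical placeholders for this repository's actual model and vocoder calls, and the sample rate and mel parameters are illustrative only:

import numpy as np
import librosa
import soundfile as sf

# Hypothetical placeholders; replace with the repository's real model/vocoder calls.
def autovc_convert(source_mel, source_emb, target_emb):
    raise NotImplementedError("run the trained AutoVC model here")

def vocode(target_mel):
    raise NotImplementedError("run the chosen vocoder (Griffin-Lim / WaveNet / MelGAN) here")

wav, sr = librosa.load("recording.wav", sr=16000)              # source .wav
source_mel = librosa.feature.melspectrogram(y=wav, sr=sr)      # source spect
source_emb = np.load("./path/to/source_emb.npy")
target_emb = np.load("./path/to/target_emb.npy")

target_mel = autovc_convert(source_mel, source_emb, target_emb)  # target spect
converted = vocode(target_mel)                                   # vocoder output
sf.write("converted.wav", converted, sr)                         # target .wav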

Live Conversion

The live-converter can be started using:

python record_convert.py --model_path ./path/to/melgan.ckpt --target_embedding_path ./path/to/target_emb.npy --source_embedding_path ./path/to/source_emb.npy

The arguments are the same as for the interface script described above.

This script first fills a wav-buffer, which is converted on the fly into a target-spectrogram buffer. This buffer is then converted back into an audio sample, which is played back as it becomes available.

Because both librosa and the vocoders work better on larger sample sizes, a buffer is built up before live conversion is attempted; this is why there is a delay of a couple of seconds before the output can be heard.

Conversion runs in real time on a Ryzen 3800X CPU.
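A minimal sketch of the buffering idea, using the sounddevice library (an assumption; the repository's script may use a different audio backend, and it overlaps recording and playback rather than alternating them as done here for brevity):

import sounddevice as sd

SR = 16000
BUFFER_SECONDS = 3  # larger buffers give librosa and the vocoder more context to work with

def convert_chunk(wav_chunk):
    # Hypothetical placeholder: mel extraction -> AutoVC -> vocoder, as sketched above.
    return wav_chunk

while True:
    # Fill a wav-buffer from the microphone, convert it, then play the result.
    recording = sd.rec(int(BUFFER_SECONDS * SR), samplerate=SR, channels=1, dtype="float32")
    sd.wait()
    converted = convert_chunk(recording[:, 0])
    sd.play(converted, SR)
    sd.wait()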

Installation

Install dependencies using:

pip install -r requirements.txt

Install PyTorch using the command found here

Usage

Conversion

To convert audio files, download the pretrained network weights using the instructions here. Next, place speaker audio files in the input folder using the following structure:

input
+-- speaker1
|   +-- audio1
|   |   ...
|    ...
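A quick way to check that the input folder matches this structure (a convenience sketch, not part of the repository):

from pathlib import Path

# Print every speaker folder and the audio files inside it.
for speaker_dir in sorted(Path("input").iterdir()):
    if speaker_dir.is_dir():
        audio_files = sorted(p.name for p in speaker_dir.glob("*.wav"))
        print(speaker_dir.name, audio_files)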

Run the following command to convert a specific source audio file to sound like a target speaker.

python convert.py --source speaker1 --target speaker2 --source_wav audio1

Using the --vocoder {"griffin", "wavenet", "melgan"} flag, the vocoder of the framework can be set to any of the following (an example command is shown after the list):

  • WaveNet: The default WaveNet vocoder used by the AutoVC authors. This vocoder achieves good quality but incurs a high inference cost.
  • Griffin-Lim: A fast vocoder with a loss of audio quality.
  • MelGAN: A fast vocoder with decent audio quality. The pretrained model on VCTK is downloadable here. As this vocoder uses a different Mel-spectrogram format, use the retrained AutoVC model downloadable here, by using the --model_path <path> flag.
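For example, converting with the MelGAN vocoder and a matching retrained AutoVC checkpoint (the checkpoint path is a placeholder):

python convert.py --source speaker1 --target speaker2 --source_wav audio1 --vocoder melgan --model_path ./path/to/melgan_autovc.ckpt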

Training

To train the AutoVC model, use the following command:

python train.py --input_dir <path_to_data>

Where <path_to_data> points to a folder in the structure described above. Training can be continued by using the --model_path <path_to_model> flag where <path_to_model> points to an AutoVC checkpoint.
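For example, continuing training from an existing checkpoint (both paths are placeholders):

python train.py --input_dir ./input --model_path ./path/to/autovc_checkpoint.ckpt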

Metadata format

Conversion

Input data is first converted to an intermediary metadata.pkl file, which is used for conversion. It has the following structure:

metadata.pkl
{
    "source" : {
        "speaker1" : {
            "emb" : <speaker_embedding []>
            "utterances" : {
                "utterance1" : [ <part1 []>, ... , <partn []> ]
                ...
            }
        }
        ...
    }
    "target" : {
        "speaker1" : {
            "emb" : <speaker_embedding []>
        }
        ...
    }
}
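The file can be inspected with Python's pickle module; a minimal sketch following the structure above (the .shape call assumes the embeddings are NumPy arrays):

import pickle

with open("metadata.pkl", "rb") as f:
    metadata = pickle.load(f)

# Walk the source speakers, their embeddings, and their utterance parts.
for name, speaker in metadata["source"].items():
    print(name, "embedding shape:", speaker["emb"].shape)
    for utterance, parts in speaker["utterances"].items():
        print("   ", utterance, "->", len(parts), "parts")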

Training

For training, we follow the metadata format used by AutoVC. The format is as follows:

train.pkl
[
    ["speaker_name", <speaker_embedding []>, "utterance_file_path1", ... , "utterance_file_pathn"],
    ...

]
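A minimal sketch for writing such a file; the embedding and the utterance paths are placeholders (in practice the speaker encoder produces the embedding and the preprocessing step produces the utterance files):

import pickle
import numpy as np

# One entry per speaker: name, speaker embedding, then the utterance file paths.
train_metadata = [
    ["speaker1",
     np.random.randn(256).astype(np.float32),  # placeholder embedding
     "speaker1/utterance1.npy",                # placeholder utterance paths
     "speaker1/utterance2.npy"],
]

with open("train.pkl", "wb") as f:
    pickle.dump(train_metadata, f)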

Progress

  • Implement easy conversion using audio files
  • Split audio files into ~2 second parts for processing by AutoVC
    • Investigate audio scramble
  • Fix slow WaveNet vocoder
  • Train on larger samples
  • Train with more speakers
