This repository contains code for the Seminar Audio Processing and Indexing 2021 final project at Leiden University. As a part of this project, we investigate voice style transfer systems. We aim to create an easy-to-use conversion program utilising the AutoVC voice conversion model.

Some audio samples are posted here.


We implemented an easy-to-use tool that generates audio samples on demand, either from .wav files or from samples recorded directly via a microphone.

The tool can be started by running

python --model_path ./path/to/melgan.ckpt --target_embedding_path ./path/to/target_emb.npy --source_embedding_path ./path/to/source_emb.npy


model_path - The location of the trained AutoVC model checkpoint (trained on MelGAN spectrograms)

target_embedding_path - The initial location of target_embedding.npy; the target embedding can be changed dynamically

source_embedding_path - The location of source_embedding.npy; this must be known beforehand
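As an illustration of how these three arguments could be read, here is a minimal sketch using `argparse`. The function name `build_parser` is hypothetical, and marking every argument as required is an assumption; the actual script may define defaults.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the three arguments described above (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Voice conversion tool (sketch)")
    # Assumption: all three arguments are required; the real script may set defaults.
    parser.add_argument("--model_path", required=True,
                        help="AutoVC checkpoint trained on MelGAN spectrograms")
    parser.add_argument("--target_embedding_path", required=True,
                        help="initial target speaker embedding (.npy)")
    parser.add_argument("--source_embedding_path", required=True,
                        help="source speaker embedding (.npy)")
    return parser


# Parse an explicit argument list instead of sys.argv to keep the example self-contained.
example_args = build_parser().parse_args([
    "--model_path", "./path/to/melgan.ckpt",
    "--target_embedding_path", "./path/to/target_emb.npy",
    "--source_embedding_path", "./path/to/source_emb.npy",
])
print(example_args.model_path)
```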

Running the sample results in the following menu:

A typical conversion process consists of:

  • Recording an audio sample
  • Converting it ( source .wav → source spect → target spect → Vocoder → target .wav )
  • Playing it

The results can be saved using the save buttons. The target embedding can be changed dynamically, either by loading one with Load Target or by generating a random one with the Randomize Target Embedding button.
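The two ways of setting a target embedding could look roughly like the sketch below. Both function names are hypothetical, and the embedding size of 256 and the unit-length normalisation are assumptions, not details taken from this repository.

```python
import numpy as np

EMBEDDING_DIM = 256  # assumption: length of a speaker embedding vector


def load_embedding(path: str) -> np.ndarray:
    """Load a speaker embedding stored as a .npy file (hypothetical helper)."""
    return np.load(path).astype(np.float32)


def random_embedding(dim: int = EMBEDDING_DIM, seed=None) -> np.ndarray:
    """Draw a random vector and scale it to unit length (normalisation is an assumption)."""
    rng = np.random.default_rng(seed)
    vec = rng.standard_normal(dim).astype(np.float32)
    return vec / np.linalg.norm(vec)
```

A random embedding does not correspond to a real speaker, so the converted voice may sound unnatural.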

Live Conversion

The live-converter can be started using:

python --model_path ./path/to/melgan.ckpt --target_embedding_path ./path/to/target_emb.npy --source_embedding_path ./path/to/source_emb.npy

The arguments are the same as for the aforementioned interface script.

This script first fills a wav buffer, which is continuously converted into a target-spectrogram buffer. The target-spectrogram buffer is then converted back to audio, which is played back as it becomes available.

Because both librosa and the vocoders work better on larger sample sizes, a buffer is built up before live conversion is attempted. This is why there is a delay of a couple of seconds before the output can be heard.

Conversion runs in real time on a Ryzen 3800X CPU.
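The buffering idea can be sketched as a small class that collects incoming chunks and only releases a block once enough samples are available. The class name `ChunkBuffer` and the threshold are hypothetical; this is not the repository's implementation.

```python
import numpy as np


class ChunkBuffer:
    """Collect small audio chunks and release them as one larger block (hypothetical)."""

    def __init__(self, min_samples: int):
        self.min_samples = min_samples  # samples needed before a block is released
        self._chunks = []
        self._count = 0

    def push(self, chunk: np.ndarray):
        """Add a chunk; return the buffered block once it is large enough, else None."""
        self._chunks.append(chunk)
        self._count += len(chunk)
        if self._count < self.min_samples:
            return None
        block = np.concatenate(self._chunks)
        self._chunks, self._count = [], 0  # start collecting the next block
        return block


# Assumption: 16 kHz audio and a 2 second threshold, chosen only for illustration.
buffer = ChunkBuffer(min_samples=2 * 16000)
```

Each block returned by `push` would then go through the conversion and vocoder steps, which is where the initial delay comes from.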


Install dependencies using:

pip install -r requirements.txt

Install PyTorch using the command found here



To convert audio files, download the pretrained network weights using the instructions here. Next, place speaker audio files in the input folder using the following structure:

+-- speaker1
|   +-- audio1
|   |   ...
|    ...

Run the following command to convert a specific source audio file to sound like a target speaker.

python --source speaker1 --target speaker2 --source_wav audio1

Using the --vocoder {"griffin", "wavenet", "melgan"} flag, the vocoder of the framework can be set to any of the following:

  • WaveNet: The default WaveNet vocoder used by the AutoVC authors. This vocoder achieves good quality at a high inference cost.
  • Griffin-Lim: A fast vocoder with a loss of audio quality.
  • MelGAN: A fast vocoder with decent audio quality. The pretrained model on VCTK is downloadable here. As this vocoder uses a different Mel-spectrogram format, use the retrained AutoVC model downloadable here and pass it with the --model_path <path> flag.
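A flag restricted to these three values can be expressed with the `choices` option of `argparse`. This is a sketch only: the function name is hypothetical, and no default is set here because the default used by the actual script is not stated.

```python
import argparse

VOCODERS = ("griffin", "wavenet", "melgan")


def build_vocoder_parser() -> argparse.ArgumentParser:
    """Parser that accepts only the three vocoder names listed above (hypothetical)."""
    parser = argparse.ArgumentParser(description="Vocoder selection (sketch)")
    # argparse rejects any value that is not in VOCODERS and exits with an error.
    parser.add_argument("--vocoder", choices=VOCODERS)
    return parser


chosen = build_vocoder_parser().parse_args(["--vocoder", "melgan"]).vocoder
print(chosen)
```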


To train the AutoVC model, use the following command:

python --input_dir <path_to_data>

Here, <path_to_data> points to a folder with the structure described above. Training can be continued by using the --model_path <path_to_model> flag, where <path_to_model> points to an AutoVC checkpoint.

Metadata format


Input audio is first converted to an intermediary metadata.pkl file, which is then used for conversion. It has the following structure:

{
    "source" : {
        "speaker1" : {
            "emb" : <speaker_embedding []>
            "utterances" : {
                "utterance1" : [ <part1 []>, ... , <partn []> ]
            }
        }
    }
    "target" : {
        "speaker1" : {
            "emb" : <speaker_embedding []>
        }
    }
}

For training, we follow the metadata format used by AutoVC. The format is as follows:

    ["speaker_name", <speaker_embedding []>, "utterance_file_path1", ... , "utterance_file_pathn"],
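One entry in this format could be assembled as in the sketch below. The helper `make_entry`, the embedding size, and the file paths are all hypothetical placeholders.

```python
import numpy as np


def make_entry(speaker_name: str, embedding: np.ndarray, utterance_paths: list) -> list:
    """Build one entry: name, embedding, then every utterance file path."""
    return [speaker_name, embedding, *utterance_paths]


# Placeholder values for illustration only.
training_metadata = [
    make_entry("speaker1", np.zeros(256, dtype=np.float32),
               ["speaker1/utterance1.npy", "speaker1/utterance2.npy"]),
]
print(len(training_metadata[0]))
```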



  • Implement easy conversion using audio files
  • Split audio files into ~2 second parts for processing by AutoVC
    • Investigate audio scramble
  • Fix slow WaveNet vocoder
  • Train on larger samples
  • Train with more speakers
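Splitting audio into parts of about 2 seconds, as mentioned in the list above, can be sketched with plain NumPy slicing. The function name and the 16 kHz sample rate in the example are assumptions; the last part is simply left shorter when the signal length is not a multiple of the part length.

```python
import numpy as np


def split_audio(samples: np.ndarray, sample_rate: int, part_seconds: float = 2.0) -> list:
    """Split a 1-D signal into consecutive parts of at most part_seconds each."""
    part_len = int(sample_rate * part_seconds)
    return [samples[i:i + part_len] for i in range(0, len(samples), part_len)]


# Example: 5 seconds of silence at an assumed 16 kHz sample rate.
parts = split_audio(np.zeros(5 * 16000, dtype=np.float32), sample_rate=16000)
print([len(p) for p in parts])
```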

