```shell
pip install -e .
```
Demiurge is a tripartite neural network architecture devised to generate and sequence audio waveforms (Donahue et al. 2019). The architecture combines a synthesis engine based on a UNAGAN + melGAN model with a custom transformer-based sequencer. The diagram below explains the relation between the different elements.
Audio generation and sequencing work as follows:

- Modified versions of melGAN (a vocoder: a fully convolutional non-autoregressive feed-forward adversarial network) and UNAGAN (an auto-regressive unconditional sound-generating boundary-equilibrium GAN) first process `.wav` audio files from an original database (`RECORDED AUDIO DB`) to produce GAN-generated `.wav` sound files, which are compiled into a new database (`RAW GENERATED AUDIO DB`).
- The descriptor model in the neural sequencer extracts a series of Mel-frequency cepstral coefficient (`MFCC`) strings (`.json`) from the audio files in the `PREDICTOR DB`, while the predictor, a time-series prediction model, generates projected descriptor sequences based on that data.
- As the predicted descriptors are just statistical values and need to be converted back to audio, a query engine matches the predicted descriptors based on the `PREDICTOR DB` with those extracted from the `RAW GENERATED AUDIO DB`. The model then replaces the matched descriptors with the predicted ones, using the audio references from the `RAW GENERATED AUDIO DB`, and merges the resultant sound sequences into an output `.wav` audio file.
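The descriptor extraction in step 2 can be sketched in plain NumPy. The frame sizes, filterbank, and coefficient counts below are illustrative assumptions, not Demiurge's actual settings; a library such as librosa would normally be used for production-grade MFCCs.

```python
import numpy as np

def mfcc(signal, sr=22050, n_fft=1024, hop=256, n_mels=20, n_mfcc=13):
    # Frame the signal and take the magnitude spectrum of each window.
    frames = [signal[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(signal) - n_fft, hop)]
    mag = np.abs(np.fft.rfft(frames, axis=1))        # (T, n_fft//2 + 1)

    # Simplified triangular filterbank, evenly spaced on the mel scale.
    def hz_to_mel(f): return 2595 * np.log10(1 + f / 700)
    def mel_to_hz(m): return 700 * (10 ** (m / 2595) - 1)
    mel_pts = mel_to_hz(np.linspace(0, hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    logmel = np.log(mag @ fb.T + 1e-8)               # (T, n_mels)
    # DCT-II over the mel axis keeps the first n_mfcc cepstral coefficients.
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_mfcc), (2 * n + 1) / (2 * n_mels)))
    return logmel @ dct.T                            # (T, n_mfcc)
```

Each audio file thus becomes a time series of short descriptor vectors, which is what the predictor models operate on.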
Please bear in mind that our model uses WandB to track and monitor training.
The chart below explains the GAN-based sound synthesis process. For ideal results, the melGAN and UNAGAN audio databases should be the same: cross-feeding between different databases generates unpredictable (although sometimes musically interesting) results. Please record the `wandb_run_id`s for the final sound generation process.
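One simple way to keep the run IDs at hand for the generation step is a small JSON config file; the file name, keys, and placeholder IDs below are illustrative assumptions, not part of the Demiurge codebase.

```python
import json

# Placeholder ids -- substitute the run ids WandB reports for your own runs.
run_ids = {"melgan_run_id": "abc123", "unagan_run_id": "def456"}

with open("wandb_runs.json", "w") as f:
    json.dump(run_ids, f, indent=2)

# Later, the generation notebook can read the ids back:
with open("wandb_runs.json") as f:
    loaded = json.load(f)
```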
melGAN (Kumar et al. 2019) is a fully convolutional non-autoregressive feed-forward adversarial network that uses mel-spectrograms as a lower-resolution audio representation that can be both efficiently computed from and inverted back to raw audio. An average melGAN run on Google Colab using a single V100 GPU may need a week to produce satisfactory results; the results obtained using a multi-GPU approach with parallel data vary. To train the model please use the following notebook.
UNAGAN (Liu et al. 2019) is an auto-regressive unconditional sound-generating boundary-equilibrium GAN (Berthelot et al. 2017) that takes variable-length sequences of noise vectors to produce variable-length mel-spectrograms. The original UNAGAN model was later revised by Liu et al. at Academia Sinica to improve the resultant audio quality by introducing a hierarchical architecture and cycle regularization in the generator to avoid mode collapse. The model produces satisfactory results after 2 days of training on a single V100 GPU; the results obtained using a multi-GPU approach with parallel data vary. To train the model please use the following notebook.
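The key interface property — a variable-length noise sequence in, a mel-spectrogram of matching length out — can be illustrated with a toy generator. The weights are untrained and random, and the dimensions are illustrative assumptions (this does not model UNAGAN's hierarchical architecture), so the sketch shows only the shape contract.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_generator(noise, n_mels=80, hidden=64):
    """Map a variable-length sequence of noise vectors (T, z_dim) to a
    mel-spectrogram (T, n_mels) with random, untrained weights."""
    z_dim = noise.shape[1]
    w1 = rng.standard_normal((z_dim, hidden)) * 0.1
    w2 = rng.standard_normal((hidden, n_mels)) * 0.1
    return np.tanh(noise @ w1) @ w2

short = toy_generator(rng.standard_normal((40, 16)))    # 40 output frames
long = toy_generator(rng.standard_normal((200, 16)))    # 200 output frames
```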
After training melGAN and UNAGAN, you will have to use UNAGAN generate to output `.wav` audio files. Please set the `melgan_run_id` and `unagan_run_id` created in the previous training steps. The output `.wav` files will be saved to the `output_dir` specified in the notebook. To generate the audio files please use the following notebook.
The sequencer model combines an `MFCC` descriptor extraction model with a descriptor predictor generator and query and playback engines that generate `.wav` audio files out of those `MFCC` `.json` files. The diagram below explains the relation between the different elements of the prediction-transformer-query-playback workflow.
As outlined above, the descriptor model plays a crucial role in the prediction workflow. You may use pretrained descriptor data by selecting a `wandb_run_id` from the descriptor model, or train your own model using this notebook, following the instructions found there, to generate `MFCC` `.json` files.
Four different time-series predictors were implemented as training options. The LSTM and transformer encoder-only models perform one-step prediction, while the LSTM encoder-decoder and full transformer models can predict descriptor sequences of a specified length.
- LSTM (Hochreiter et al. 1997)
- LSTM encoder-decoder model (Cho et al. 2014)
- Transformer encoder-only model
- Transformer model (Vaswani et al. 2017)
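The practical difference between the two families: a one-step model must be rolled out autoregressively, feeding each prediction back in to get the next, whereas the encoder-decoder and full transformer models emit a whole forecast at once. A toy NumPy sketch of the rollout; the moving-average "model" is a stand-in for illustration, not one of the four predictors.

```python
import numpy as np

def rollout(step_fn, seed_seq, horizon):
    """Autoregressive rollout of a one-step predictor: each predicted
    descriptor frame is appended and fed back in to get the next."""
    seq = list(seed_seq)
    for _ in range(horizon):
        seq.append(step_fn(np.array(seq)))
    return np.array(seq[len(seed_seq):])

# Toy one-step "model": predict the mean of the last 3 descriptor frames.
step = lambda s: s[-3:].mean(axis=0)
seed = np.arange(12, dtype=float).reshape(4, 3)   # 4 frames of 3-dim descriptors
pred = rollout(step, seed, horizon=5)             # 5 predicted frames
```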
Once you train the model, record the `wandb_run_id` and paste it in the prediction notebook. Then provide paths to the `RAW GENERATED AUDIO DB` and `PREDICTION DB` databases and run the notebook to generate new descriptors. The descriptors generated from the `PREDICTION DB` will be used as the input of the neural sequencer to predict subsequent descriptors, which will be converted into `.wav` audio files using the query and playback engines (see below). To train the model please use the following notebook.
You may alternatively train the descriptor model using a database containing files in `.wav` format by running:

```shell
python desc/train_function.py --selected_model <1 of 4 models above> --audio_db_dir <path to database> --window_size <input sequence length> --forecast_size <output sequence length>
```
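For reference, the command line above implies an argument parser along these lines. This is a hypothetical reconstruction inferred only from the flags shown, not the actual `desc/train_function.py` source; in particular, the model numbering is an assumption.

```python
import argparse

def build_parser():
    # Hypothetical sketch of the training script's CLI.
    p = argparse.ArgumentParser(description="Train a descriptor predictor")
    p.add_argument("--selected_model", type=int, choices=[1, 2, 3, 4],
                   help="1=LSTM, 2=LSTM encoder-decoder, "
                        "3=transformer encoder-only, 4=transformer")
    p.add_argument("--audio_db_dir", help="path to the .wav database")
    p.add_argument("--window_size", type=int, help="input sequence length")
    p.add_argument("--forecast_size", type=int, help="output sequence length")
    return p

args = build_parser().parse_args(
    "--selected_model 1 --audio_db_dir db --window_size 32 --forecast_size 8".split())
```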
This is the workflow of the query and playback engines, which translate `MFCC` `.json` files into `.wav` audio files. This workflow partially overlaps with the instructions provided above on the descriptor predictor model.
- The descriptor model processes the `PREDICTION DB` database (see diagram above) to generate descriptor input sequences and saves them in `DESCRIPTOR DB II`. It then predicts subsequent descriptor strings based on that data.
- The model processes the audio database into `DESCRIPTOR DB I` and links each descriptor to an `ID reference` connected to the specific audio segment.
- The query function replaces the new predicted descriptors generated by the descriptor model with the closest match, based on a distance function, found in `DESCRIPTOR DB I`.
- The model combines and merges the segments referenced by the replaced descriptors from the query function into a new `.wav` audio file.
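The query step is essentially a nearest-neighbour lookup over descriptor vectors. A minimal sketch, assuming Euclidean distance (the actual distance function and database layout may differ):

```python
import numpy as np

def query(predicted, db_descriptors, db_ids):
    """For each predicted descriptor, return the ID reference of the
    closest DESCRIPTOR DB I entry under Euclidean distance."""
    # (P, 1, D) - (1, N, D) broadcasts to a (P, N) distance matrix.
    d = np.linalg.norm(predicted[:, None, :] - db_descriptors[None, :, :], axis=-1)
    return [db_ids[i] for i in d.argmin(axis=1)]

db = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
ids = ["seg_a", "seg_b", "seg_c"]
matches = query(np.array([[0.9, 1.2], [4.8, 5.1]]), db, ids)  # -> ['seg_b', 'seg_c']
```

The playback engine would then fetch the audio segments behind the returned IDs and concatenate them into the output `.wav` file.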
To train the model please use the following notebook.