Important
This is an unofficial implementation of the paper High Fidelity Neural Audio Compression in PyTorch.
The LibriTTS960h 24kHz encodec checkpoint and discriminator checkpoint are released at https://huggingface.co/zkniu/encodec-pytorch/tree/main
I hope we can get together to do something meaningful and rebuild encodec in this repo.
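The checkpoints live in a regular Hugging Face model repository, so one simple way to fetch them is a plain git clone (this assumes git-lfs is installed; check the repository page for the exact checkpoint file names):
# fetch the released checkpoints (git-lfs is needed for the large files)
git lfs install
git clone https://huggingface.co/zkniu/encodec-pytorch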
This repository is based on encodec and EnCodec_Trainer.
Based on the EnCodec_Trainer, I have made the following changes:
- support multi-gpu training.
- support AMP training (you need to reduce the learning rate and scale the VQ epsilon from 1e-5 to 1e-3; see issue 8 for the reason)
- AMP training currently doesn't work reliably, so don't use it
- support hydra configuration management.
- align the loss functions and hyperparameters.
- support warmup scheduler in training.
- support the test script to test the model.
- support tensorboard to monitor the training process.
- support 48kHz and stereo models, thanks @leoauri in #22.
- support slurm training, thanks @leoauri in #22.
- support loss balancer, thanks @leoauri in #22.
- You can find all the training scripts in the scripts folder.
The code is tested on the following environment:
- Python 3.9
- PyTorch 2.0.0 / PyTorch 1.13
- GeForce RTX 3090 x 4 / V100-16G x 8 / A40 x 3 / A100 x 1
To run the code, install the dependencies listed in requirements.txt.
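For example, in a fresh Python 3.9 virtual environment (the commands below are illustrative; adapt them to your own setup):
# create and activate an isolated environment, then install the dependencies
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt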
I use LibriSpeech as the training dataset and use datasets/generate_train_file.py to generate the train csv which is used during training. You can check datasets/generate_train_file.py and customAudioDataset.py to understand how to prepare your own dataset. You can also use ln -s to link the dataset into the datasets folder, as sketched below.
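A minimal preparation could look like the following; the corpus path is a placeholder, and the exact arguments of generate_train_file.py should be checked in the script itself:
# link your audio corpus into the datasets folder (example path)
ln -s /path/to/LibriSpeech ./datasets/LibriSpeech
# build the train csv consumed by customAudioDataset.py
# (check datasets/generate_train_file.py for the arguments it actually expects)
python datasets/generate_train_file.py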
I provide a Dockerfile to build a Docker image with all the necessary dependencies.
- Building the image
docker build -t encodec:v1 .
- Using the image
# CPU running
docker run encodec:v1 <command> # you can add some parameters, such as -tid
# GPU running
docker run --gpus=all encodec:v1 <command>
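For training it is often convenient to keep a container running with the repository mounted; a possible invocation (container name and mount paths are illustrative) is:
# detached interactive container with all GPUs and the repo mounted at /workspace
docker run --gpus=all -tid \
    -v $(pwd):/workspace -w /workspace \
    --name encodec_train encodec:v1 bash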
You can use the following command to train the model on multiple GPUs:
CUDA_VISIBLE_DEVICES=0,1,2,3 python train_multi_gpu.py \
distributed.torch_distributed_debug=False \
distributed.find_unused_parameters=True \
distributed.world_size=4 \
common.save_interval=2 \
common.test_interval=2 \
common.max_epoch=100 \
datasets.tensor_cut=100000 \
datasets.batch_size=8 \
datasets.train_csv_path=YOUR TRAIN DATA.csv \
lr_scheduler.warmup_epoch=20 \
optimization.lr=5e-5 \
optimization.disc_lr=5e-5
Note:
- if you set a small datasets.tensor_cut, you can set a larger datasets.batch_size to speed up the training process.
- when you are training on your own dataset, I suggest choosing audio of moderate length, because if you train encodec with a 1-second tensor_cut on a small dataset, the model doesn't perform well.
- if you encounter a bug like RuntimeError(f"Mismatch in number of params: ours is {len(params)}, at least one worker has a different one."), you can use a smaller datasets.tensor_cut to solve this problem.
- if your torch version is lower than 1.8, you need to check the default value of torch.stft(return_complex) in audio_to_mel.py.
- if you encounter a bug in multi-gpu training, you can try to set distributed.torch_distributed_debug=True to get more information about the problem.
- the single-gpu training method is similar to the multi-gpu one; you only need to add the distributed.data_parallel=False parameter to the command, like this:
  python train_multi_gpu.py distributed.data_parallel=False \
      common.save_interval=5 \
      common.max_epoch=100 \
      datasets.tensor_cut=72000 \
      datasets.batch_size=4 \
      datasets.train_csv_path=YOUR TRAIN DATA.csv \
      lr_scheduler.warmup_epoch=10 \
      optimization.lr=5e-5 \
      optimization.disc_lr=5e-5
- the loss does not converge to zero, but the model can still be used to compress and decompress audio. You can use compression.sh to test your model at every log_interval epoch.
- the original paper's dataset is larger than 17000h, but I only use LibriTTS960h to train the model, so the model is not good enough. If you want to train a better model, you can use a larger dataset.
- The code is not well tested, so there may be some bugs. If you encounter any problems, you can open an issue or contact me by email.
- When I added AMP training, I found the RVQ loss was always nan, so I used an L2 norm to normalize quantize and x before the commitment loss, like the code below -> actually, it's still unstable:
  quantize = F.normalize(quantize)
  commit_loss = F.mse_loss(quantize.detach(), x)
- When you try to use AMP training, you need to reduce the learning rate and scale the VQ epsilon from 1e-5 to 1e-3; see issue 8 for the reason.
- I suggest focusing on the generator loss; the commitment loss may not converge. You can also check objective metrics such as PESQ and STOI.
Usage will depend on your cluster setup, but see scripts/train.sbatch
for an example. This uses a container with the dependencies installed. Run sbatch scripts/train.sbatch
from the repository root to use it.
I have added a shell script to compress and decompress audio at different bandwidths; you can use compression.sh to test your model.
The script can be used as follows:
sh compression.sh INPUT_WAV_FILE [MODEL_NAME] [CHECKPOINT]
- INPUT_WAV_FILE is the wav file you want to test
- MODEL_NAME is the model name, default is encodec_24khz; it also supports encodec_48khz, my_encodec, and encodec_bw
- CHECKPOINT is the checkpoint path; when your MODEL_NAME is my_encodec, you can point it to your own checkpoint
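For example, to test a checkpoint you trained yourself (the wav and checkpoint paths are placeholders):
# compress and decompress the input at different bandwidths with your own checkpoint
sh compression.sh test_24k.wav my_encodec checkpoints/epoch100.pt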
If you want to test the model at a specific bandwidth, you can use the following command:
python main.py -r -b [bandwidth] -f [INPUT_FILE] [OUTPUT_WAV_FILE] -m [MODEL_NAME] -c [CHECKPOINT]
main.py comes from the original encodec repository; you can use the -h flag to check the help information.
Thanks to the following repositories:
- encodec
- EnCodec_Trainer
- melgan-neurips: audio_to_mel.py
The code uses the same license as the original encodec repository (see LICENSE).