Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt

Yongqi Wang*, Ruofan Hu*, Rongjie Huang, Zhiqing Hong, Ruiqi Li, Wenrui Liu, Fuming You, Tao Jin, Zhou Zhao

Zhejiang University


This is the PyTorch implementation of Prompt-Singer (NAACL'24), a singing-voice-synthesis model that offers controllability over speaker gender, volume, and vocal range via natural language prompts. We provide code and checkpoints for the finetuned FLAN-T5 version.

Note

This implementation differs slightly from the version we conducted experiments with in the paper. Here, we use a 16kHz SoundStream to extract acoustic units from the transformer's training data (downsampled to 16kHz). The vocoder is still trained to generate 24kHz audio.

The correctness of this open-source version has not been fully verified. Feel free to create an issue if you find any problems.

Dependencies

We recommend the following environment:

  • PyTorch version >= 2.1.0

  • Python version >= 3.8

  • For training new models, you'll also need an NVIDIA GPU and NCCL

  • For faster training, install NVIDIA's apex library:

    git clone https://github.com/NVIDIA/apex
    cd apex
    pip install -v --no-cache-dir --global-option="--cpp_ext" \
      --global-option="--cuda_ext" --global-option="--deprecated_fused_adam" \
      --global-option="--xentropy" --global-option="--fast_multihead_attn" ./
  • If you don't have fairseq installed, install this fairseq-based repo and develop locally:

    pip install --editable ./

    If you have already installed fairseq, you can run Prompt-Singer with your existing installation by specifying the path to the research directory (see the inference and training sections).

Checkpoints

We provide checkpoints for each stage of the finetuned FLAN-T5-large version of the model on Hugging Face. To download the checkpoints, simply run the following Python script:

from huggingface_hub import snapshot_download

# Downloads all checkpoint files from the Hugging Face repo and returns the local directory path.
downloaded_path = snapshot_download(repo_id="Cyanbox/Prompt-Singer")

This Hugging Face repo contains checkpoints of the SoundStream codec, the finetuned FLAN-T5, the transformer backbone, and the unit vocoder.

Inference

Acoustic token inference

We provide some TSV files in the infer_tsv directory for testing, together with the configuration and dictionary files needed. The TSV files contain the input information needed for controllable SVS: the file names correspond to the respective control-attribute categories, and the prompt sentences in these files match those attributes. Before inference, you need to modify text_encoder_version and audio_tokenizer_ckpt_path in config.yaml to the paths on your own machine.

Switch to the Prompt-Singer root directory, modify the relevant parameters and paths, then run the following command to generate acoustic units:

python research/PromptSinger/generate.py  infer_tsv \
 --task t2a_sing_t5_config_task \
 --path <PATH_TO_CKPT_DIR>/prompt-singer-flant5-large-finetuned/checkpoint_last.pt \
 --gen-subset <PREFIX_NAME_OF_TEST_TSV> \
 --batch-size 1 --max-tokens 10000 \
 --max-source-positions 10000 --max-target-positions 10000 \
 --max-len-a 1 --max-len-b 0 \
 --results-path <PATH_TO_OUTPUT_DIR> --user-dir research \
 --fp16  --num-workers 0 

The output file will be <PATH_TO_OUTPUT_DIR>/generate-<PREFIX_NAME_OF_TEST_TSV>.txt. Note that the model is trained to generate audio up to a maximum length of 13 seconds; input exceeding this length will be truncated.

Waveform generation

After getting the output txt file from the previous step, switch to the wave_generation directory and run the following command for audio generation:

python infer.py --input_code_file <PATH_TO_OUTPUT_DIR>/generate-<PREFIX_NAME_OF_TEST_TSV>.txt \
 --output_dir <PATH_TO_AUDIO_OUTPUT_DIR> \
 --checkpoint_file <PATH_TO_CKPT_DIR>/vocoder_24k/g_00885000

The generated audio will be in <PATH_TO_AUDIO_OUTPUT_DIR>/syn.

Train with your own datasets

Due to copyright and other issues, we do not provide the original SVS datasets here. If you use the same training data as us, we provide the processed manifests in this link, which include duplicated phonemes, rescaled and averaged f0, and gender labels. Below, we also describe how to construct the training data and give the training command.

Data Composition

We provide an example of the data composition format in the `data` directory. The training data should consist of the following parts:

  • config.yaml: configuration file.
    • word_dict: filename of the keyword pickle file.
    • sentence_dict: filename of the sentence template pickle file.
    • text_encoder_version: path to the local FLAN-T5 repo.
    • audio_tokenizer_ckpt_path: path to the codec checkpoint.
    • dict_length: keep it the same as the length of dict_uni.txt.
    • batch_max_frames: maximum length of audio data for training / inference.
    • keep other parameters unchanged.
  • dict_uni.txt: dictionary file for model input.
  • English_word.pkl: pickle file containing the keywords we use.
  • English_sen.pkl: pickle file containing the sentence templates we use.
  • train.tsv and valid.tsv: data manifests for training and evaluation (a format-check sketch follows this list).
    • item_name: name of data item.

    • task: keep it text_to_acoustic_sing.

    • audio_path: path to the wav file.

    • gender: gender label of the item; should be one of [male, female].

    • phone: duplicated phonemes by the durations.

    • txt: original lyrics, not used in training.

    • f0_rescale: relative value of f0 with the average of its voiced part being 222.

      The f0_rescale in the manifests we provide above has a voiced-part mean of 222, which is slightly different from the 230 reported in the paper. However, small changes in this value will not affect model performance as long as it is consistent across all data items.

    • f0_avg: average of the voiced part of f0.

    • n_frames: length of the acoustic unit sequence; needs to be pre-calculated.

    • pitch: vocal range label of the item; should be one of [high, low].
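
As a quick format check, the following sketch (not part of the repo) verifies that a manifest TSV contains the columns listed above and valid categorical values. It assumes the manifests are plain tab-separated files with a header row; the file path is hypothetical.

import csv

REQUIRED_COLUMNS = [
    "item_name", "task", "audio_path", "gender", "phone", "txt",
    "f0_rescale", "f0_avg", "n_frames", "pitch",
]

def check_manifest(path):
    # Verify the header and the categorical fields of each row.
    with open(path, newline="") as f:
        reader = csv.DictReader(f, delimiter="\t")
        missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
        if missing:
            raise ValueError(f"{path} is missing columns: {missing}")
        for row in reader:
            assert row["task"] == "text_to_acoustic_sing"
            assert row["gender"] in ("male", "female")
            assert row["pitch"] in ("high", "low")

check_manifest("data/train.tsv")  # hypothetical path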

The processing methods of phonemes and f0 are provided below.

Get aligned (duplicated) phonemes

Most SVS datasets (including M4Singer, OpenSinger, Opencpop, and PopCS, which we use) provide phoneme-level alignment in TextGrid format. For speech data, you can obtain TextGrids through MFA (Montreal Forced Aligner). Once you have the TextGrids, use data_scripts/duplicate_phone.py to generate duplicated phonemes. The command is:

python duplicate_phone.py --input-tsv <PATH_TO_INPUT_TSV> --output-tsv <PATH_TO_OUTPUT_TSV> --textgrid-dir <PATH_TO_TEXTGRID_ROOT>

where the input TSV should contain item_name and n_frames, and the TextGrid files should be named consistently with item_name.
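
For reference, duplication simply repeats each phoneme once per acoustic frame it spans. Below is a minimal sketch under two assumptions of mine: the phoneme intervals (in seconds) have already been parsed from the TextGrid, and the frame rate is 50 frames per second (a placeholder; use the rate of your acoustic tokenizer).

def duplicate_phones(intervals, frame_rate=50.0):
    # intervals: list of (phone, start_sec, end_sec) tuples parsed from a TextGrid.
    # frame_rate: acoustic frames per second (placeholder value).
    phones = []
    for phone, start, end in intervals:
        n_frames = max(round((end - start) * frame_rate), 1)
        phones.extend([phone] * n_frames)
    return " ".join(phones)

# "n" spans 0.10 s -> 5 frames, "i" spans 0.30 s -> 15 frames at 50 fps.
print(duplicate_phones([("n", 0.0, 0.1), ("i", 0.1, 0.4)]))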

F0 extraction

Use data_scripts/extract_f0.py to extract continuous f0 values with the command:

python extract_f0.py --input-tsv <PATH_TO_INPUT_TSV> --output-dir <PATH_TO_OUTPUT_DIR>

The input TSV file should contain audio_path and item_name.
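
As an illustration only (the actual extractor used by extract_f0.py may differ), the sketch below obtains a frame-level f0 contour with librosa's pYIN and linearly interpolates over unvoiced frames to make it continuous; the sample rate and hop length are assumptions.

import librosa
import numpy as np

def extract_continuous_f0(wav_path, sr=16000, hop_length=320):
    # Load audio and estimate f0 with pYIN; unvoiced frames come back as NaN.
    y, _ = librosa.load(wav_path, sr=sr)
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"),
        sr=sr, hop_length=hop_length,
    )
    # Linearly interpolate NaN (unvoiced) frames between voiced neighbours.
    voiced_idx = np.where(voiced_flag)[0]
    f0_cont = np.interp(np.arange(len(f0)), voiced_idx, f0[voiced_idx])
    return f0_cont, voiced_flag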

F0 rescale and averaging

After finishing f0 extraction, use data_scripts/rescale_average_f0.py to calculate the average and rescaled f0. The command is

python rescale_average_f0.py --input-dir <PATH_TO_INPUT_DIR> --output-tsv <PATH_TO_OUTPUT_TSV>

The average f0 value can further be used to obtain the pitch label according to singer gender.
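
Conceptually, the rescaling computes the mean f0 over voiced frames (f0_avg) and scales the whole contour so this mean becomes 222 (f0_rescale). Below is a minimal sketch of that logic; the per-gender thresholds used for the pitch label are hypothetical placeholders, not the values from the paper.

import numpy as np

TARGET_MEAN = 222.0  # voiced-part mean used in the provided manifests

def rescale_f0(f0, voiced):
    # f0: frame-level contour; voiced: boolean mask of voiced frames.
    f0_avg = float(f0[voiced].mean())
    f0_rescale = f0 * (TARGET_MEAN / f0_avg)
    return f0_avg, f0_rescale

# Hypothetical per-gender thresholds separating high and low vocal ranges.
PITCH_THRESHOLDS = {"male": 130.0, "female": 260.0}

def pitch_label(f0_avg, gender):
    return "high" if f0_avg >= PITCH_THRESHOLDS[gender] else "low"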

Training

Use the following command for training:

fairseq-train --task t2a_sing_t5_config_task \
  <PATH_TO_DATA_DIR> \
  --num-workers 0 \
  --save-dir <PATH_TO_SAVE_DIR> \
  --tensorboard-logdir <PATH_TO_SAVE_DIR> \
  --arch acoustic_lm_global300M_noprefix  \
  --no-epoch-checkpoints \
  --criterion acoustic_language_modeling_cross_entropy \
  --optimizer adam --adam-betas '(0.9, 0.95)' --weight-decay 0.01 --clip-norm 0.0 \
  --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-07 \
  --tokens-per-sample 8000 --sample-break-mode none \
  --update-freq 16 \
  --fp16 \
  --max-tokens 8000 \
  --max-update 500000 \
  --n-ctx 15000 \
  --user-dir <PATH_TO_PROMPT_SINGER>/research

Acknowledgements

Part of the code is borrowed from the following repos. We would like to thank the authors of these repos for their contributions.

Citations

If you find this code useful in your research, please cite our work:

@inproceedings{wang2024prompt,
  title={Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt},
  author={Wang, Yongqi and Hu, Ruofan and Huang, Rongjie and Hong, Zhiqing and Li, Ruiqi and Liu, Wenrui and You, Fuming and Jin, Tao and Zhao, Zhou},
  booktitle={Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)},
  pages={4780--4794},
  year={2024}
}

Disclaimer

Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's singing without his/her consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.
