Detail TTS

The model introduces three methods intended to make it a best practice for autoregressive (AR) TTS:

Fake discretization. Although RVQ is used, the actual training employs continuous features; I call this fake discretization.
All-in-one model. GPT, diffusion, VQ-VAE, GAN and flow VAE components are combined in a single model, with one training run and one inference pass.
Prefix plus prompt. Both a prefixed speaker embedding and a prompt are used, to benefit from both VALL-E-style inference and Tortoise-style training.
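
The first point can be illustrated with a small sketch. This is one possible reading of "fake discretization", not the code used in this repository: a residual vector quantizer yields both discrete indices and a continuous reconstruction (the sum of the selected codebook entries), and the continuous vector is what gets passed downstream. The names `nearest` and `rvq_encode`, the random codebooks, and all sizes are assumptions made for illustration.

```python
import random


def nearest(codebook, vec):
    """Return (index, entry) of the codebook entry closest to vec (squared L2)."""
    best = min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )
    return best, codebook[best]


def rvq_encode(codebooks, vec):
    """Residual vector quantization of a single feature vector.

    Returns the discrete indices (one per stage) and the continuous
    reconstruction, i.e. the sum of the selected codebook entries.
    """
    residual = list(vec)
    quantized = [0.0] * len(vec)
    indices = []
    for codebook in codebooks:
        idx, entry = nearest(codebook, residual)
        indices.append(idx)
        quantized = [q + e for q, e in zip(quantized, entry)]
        residual = [r - e for r, e in zip(residual, entry)]
    return indices, quantized


random.seed(0)
dim, stages, size = 4, 3, 8  # hypothetical sizes
# Hypothetical random codebooks; a real model would learn these.
codebooks = [
    [[random.gauss(0, 1.0 / (s + 1)) for _ in range(dim)] for _ in range(size)]
    for s in range(stages)
]
feature = [random.gauss(0, 1) for _ in range(dim)]

indices, quantized = rvq_encode(codebooks, feature)
# A token-based pipeline would pass `indices` downstream; under the reading
# sketched here, the continuous vector `quantized` is passed on instead.
print(indices)
print(quantized)
```

In this sketch the indices are still available (for example for inspection), but nothing downstream has to consume them as tokens.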

The samples below were generated after training the model on 10,000 hours of very noisy data. The model can easily be scaled up with large amounts of low-quality data.

prompt 0

prompt00.mov

generated 0

prompt01.mov

prompt 1

prompt10.mov

generated 1

prompt12.mov

prompt 2

prompt20.mov

generated 2

prompt21.mov

Inference

See api.py for inference usage.

Dataset preparation

In the script, change the path to the directory that contains the audio files, then run:

python prepare/0_vad_asr_save_to_jsonl.py
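
The script name suggests that it runs VAD and ASR over the audio and saves the result as JSONL, that is, one JSON object per line. The actual schema is not documented here, so the sketch below only shows the general shape of writing and reading such a file. The field names `path` and `text`, the file name `manifest.jsonl`, and the helper functions are hypothetical.

```python
import json
import os
import tempfile

# Hypothetical records; the real script may use different field names.
records = [
    {"path": "clips/0001.wav", "text": "hello world"},
    {"path": "clips/0002.wav", "text": "detail tts"},
]


def write_jsonl(path, rows):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


manifest = os.path.join(tempfile.mkdtemp(), "manifest.jsonl")
write_jsonl(manifest, records)
loaded = read_jsonl(manifest)
print(len(loaded), "entries")
```

Reading the generated file this way is a quick check that each line parses and that transcripts look plausible before starting a long training run.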

Train and Fine Tune

accelerate launch train.py

For fine-tuning, change the load path of the pretrained model.

Acknowledgements

The VQ and VITS components are from GSV.

The diffusion and GPT components are from Tortoise.
