
A repository for trainable multi-speaker TTS.


A video example with TTS voice-over using several speakers:

Watch the video

Brief

We started from this implementation: https://github.com/ming024/FastSpeech2

However, we have made several changes, so the code is not identical.

For example:

  • We use masking for input grapheme tokens during training (see the sketch after this list);
  • CWT was implemented as in the original paper, but we did not observe any improvements. The final model was trained without CWT; you can still enable it when training on your own data via the use_cwt flag in the config;
  • Data preprocessing is slightly different, especially in language-specific parts.
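
The repository's exact masking scheme is not documented here, so below is a minimal sketch of the idea in PyTorch; mask_prob, mask_id, and pad_id are assumed illustrative values, not the project's real settings.

    import torch

    def mask_grapheme_tokens(tokens, mask_id, mask_prob=0.1, pad_id=0):
        # Never mask padding positions.
        can_mask = tokens.ne(pad_id)
        # An independent Bernoulli draw per position decides what gets masked.
        to_mask = torch.rand(tokens.shape).lt(mask_prob) & can_mask
        # Replace the selected grapheme IDs with the mask token ID.
        return tokens.masked_fill(to_mask, mask_id)

    # Usage: a batch of grapheme token IDs, with 0 used as padding.
    batch = torch.tensor([[5, 12, 7, 3, 0, 0],
                          [9, 4, 11, 2, 8, 0]])
    masked = mask_grapheme_tokens(batch, mask_id=1)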

Dataset:

The Russian dataset was borrowed from https://github.com/vlomme/Multi-Tacotron-Voice-Cloning. We did not use all the speakers and filtered them based on length and recording quality. Only 65 speakers were used in the end. You can check all the examples in 'examples'.

MFA:

MFA was trained from scratch after preprocessing the text with russian_g2p. Using MFA might not be straightforward, so we refer to this manual: https://github.com/ivanvovk/DurIAN#6-how-to-align-your-own-data
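
As an illustration, transcribing text with russian_g2p before alignment might look like the sketch below. The Transcription interface follows the russian_g2p README, so verify it against the version you install.

    # Minimal sketch: phonemic transcription with russian_g2p before MFA
    # training. The Transcription API follows the russian_g2p README;
    # verify it against the version you install.
    from russian_g2p.Transcription import Transcription

    transcriptor = Transcription()
    # Returns one phoneme sequence per input phrase.
    print(transcriptor.transcribe(['привет мир']))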

Usage

  1. We use russian_g2p, so you will need to install it first:

    git clone https://github.com/nsu-ai/russian_g2p.git
    cd russian_g2p
    pip3 install -r requirements.txt
    pip install .

  2. Then install this repository's requirements.txt.

  3. Download weights: https://drive.google.com/drive/folders/1dX7ELe9C9-ja_liYrgph3Uu5Z5EMljjh?usp=sharing

    • Move the HiFi-GAN and FastSpeech2 weights into 'pretrained';

    • Check that the paths in the config match (see the sketch after this list);

    • tts.weights_path - path to the pretrained FastSpeech2 model;

    • add speakers_json (speaker names) to the same folder as the model weights; it should already be there for the pretrained model;

    • add stats_json (raw-data pitch and energy stats) to the same folder as the model weights;

    • hifi.weights_path - path to the pretrained HiFi-GAN.
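
A minimal sketch of sanity-checking this setup before inference; the YAML config layout and the exact file names (speakers_json, stats_json) are assumptions here, so adjust them to the repository's actual config and files.

    # Minimal sanity check for the paths above. The YAML layout and the
    # speakers_json / stats_json file names are assumptions; adjust to
    # the repository's actual config and files.
    import json
    from pathlib import Path

    import yaml

    config = yaml.safe_load(open('config.yaml'))

    fs2 = Path(config['tts']['weights_path'])
    hifi = Path(config['hifi']['weights_path'])
    assert fs2.exists(), f'missing FastSpeech2 weights: {fs2}'
    assert hifi.exists(), f'missing HiFi-GAN weights: {hifi}'

    # Speaker names and pitch/energy stats sit next to the model weights.
    speakers = json.load(open(fs2.parent / 'speakers_json'))
    stats = json.load(open(fs2.parent / 'stats_json'))
    print(f'{len(speakers)} speakers; stats keys: {list(stats)}')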

  4. If all of the above is set, check the notebook "examples.ipynb".

Training your own model

  1. Assuming you preprocessed the data with the MFA aligner, your folder structure should be the following:
data
├── speaker_one
│   ├── record_1.TextGrid  # generated by MFA
│   ├── record_1.wav      
│   └── record_1.lab       # just a text file with a text string
│       
└── speaker_two
    ├── ...
    └── ...
  2. Once the data is organized and the path to it is set in the config ('raw_path'), run prepare_data.py.

  3. prepare_data.py will generate more files, such as energy and pitch, into the folder set by 'preprocessed_path'.

  4. Finally, set the path to a lexicon dict: words and their transliterations generated by russian_g2p. If you do not use russian_g2p, your dictionary will be different. An example can be found in the 'pretrained' folder (see the sketch below).
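
To make the layout from step 1 and the lexicon format concrete, here is a minimal sketch; the phoneme strings are illustrative placeholders, not real russian_g2p output, and the 'WORD PH1 PH2 ...' line format is an assumption based on common MFA lexicons.

    # Minimal sketch: verify the data layout from step 1 and write a toy
    # lexicon in the common MFA style (word followed by space-separated
    # phonemes). The phoneme strings below are illustrative placeholders,
    # not real russian_g2p output.
    from pathlib import Path

    data_root = Path('data')
    for wav in data_root.glob('*/*.wav'):
        # Every record needs a matching MFA alignment and transcript.
        for suffix in ('.TextGrid', '.lab'):
            sibling = wav.with_suffix(suffix)
            if not sibling.exists():
                print(f'missing {sibling}')

    # Toy lexicon: one 'WORD PH1 PH2 ...' entry per line.
    lexicon = {'привет': 'P RJ I VJ E T', 'мир': 'M I R'}
    with open('lexicon.txt', 'w', encoding='utf-8') as f:
        for word, phones in lexicon.items():
            f.write(f'{word} {phones}\n')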

Have Fun!
