You can build an environment with uv.
If you don't have uv installed, please find the installation instructions for your OS here.
uv will create a virtual environment and install all dependencies specified in `pyproject.toml`. To build the environment with uv, run:

For CPU-only:

```
uv sync --extra cpu
```

For GPU:

```
uv sync --extra gpu
```

We used data of 10 English speakers from the ESD dataset. To download all .wav and .txt files, along with the .TextGrid files created using MFA, run:

```
uv run --extra cpu download_data.py
```

To train a model we need precomputed duration, energy, pitch and eGeMAPS features. From the src directory run:

```
uv run --extra cpu -m src.preprocess.preprocess
```

This is how your app folder should look:
```
.
└── data
    ├── data
    │   └── ssw_esd
    ├── emospeech.cpkt
    ├── phones.json
    ├── preprocessed
    │   ├── duration
    │   ├── egemap
    │   ├── energy
    │   ├── mel
    │   ├── phones.json
    │   ├── pitch
    │   ├── stats.json
    │   ├── test.txt
    │   ├── train.txt
    │   ├── trimmed_wav
    │   └── val.txt
    ├── test_ids.txt
    ├── val_ids.txt
    └── vocoder_checkpoint.pt
```
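A quick sanity check can confirm your folder matches the tree above before training. The entries below are taken from the tree; the helper itself is just an illustrative sketch, not part of the repo:

```python
from pathlib import Path

# Directories and files expected under app/data, per the tree above.
EXPECTED = [
    "data/ssw_esd",
    "preprocessed/duration",
    "preprocessed/egemap",
    "preprocessed/energy",
    "preprocessed/mel",
    "preprocessed/pitch",
    "preprocessed/trimmed_wav",
    "preprocessed/stats.json",
    "preprocessed/phones.json",
    "preprocessed/train.txt",
    "preprocessed/val.txt",
    "preprocessed/test.txt",
]

def missing_entries(data_root: str) -> list[str]:
    """Return the expected entries that are absent under `data_root`."""
    root = Path(data_root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]
```

Running `missing_entries("app/data")` after preprocessing should return an empty list.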
- Configure arguments in `config/config.py`.
- Run:

```
uv run --extra cpu -m src.scripts.train
```

Testing is implemented on the testing subset of the ESD dataset. To synthesize audio and compute neural MOS (NISQA TTS):

- Configure arguments in `config/config.py` under the `Inference` section.
- Run:

```
uv run --extra cpu -m src.scripts.test
```

You can find NISQA TTS scores for the original, reconstructed and generated audio in `test.log`.
EmoSpeech is trained on phoneme sequences. Supported phones can be found in `app/data/preprocessed/phones.json`. This repository was created for academic research and doesn't support automatic grapheme-to-phoneme conversion. However, if you would like to synthesize an arbitrary sentence with emotion conditioning, you can:
- Build the docker image with MFA:

  ```
  docker build -t mfa .
  ```

- Run the docker container, mounting `app/data` to `/data` in the container:

  ```
  docker run -it -v ./app/data:/data mfa
  ```

- Create a `graphemes.txt` file in `app/data` with the sentence you want to synthesize, for example:

  ```
  echo "Your sentence to synthesize goes here." > app/data/graphemes.txt
  ```
- Generate the phoneme sequence from the graphemes with MFA:

  1. Follow the installation guide.
  2. Download the English g2p model:

     ```
     mfa model download g2p english_us_arpa
     ```

  3. Generate `phoneme.txt` from `graphemes.txt`:

     ```
     mfa g2p graphemes.txt english_us_arpa phoneme.txt
     ```
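If you want to pass the result via `-sq` instead of `-pf`, the per-word g2p output has to be flattened into one phoneme string. A minimal sketch, assuming each non-empty line of the output pairs a word with its space-separated phones, tab-delimited as in an MFA pronunciation dictionary (adjust if your `phoneme.txt` is laid out differently):

```python
def phoneme_file_to_sequence(path: str) -> str:
    """Flatten an MFA g2p output file into one space-separated phoneme
    string suitable for the -sq argument.

    Assumes each non-empty line is `word<TAB>phone phone ...`; this layout
    is an assumption, not guaranteed by the repo.
    """
    phones: list[str] = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split("\t")
            if len(parts) == 2:
                phones.extend(parts[1].split())
    return " ".join(phones)
```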
Run `uv run --extra cpu -m src.scripts.inference`, specifying the arguments:
| Argument | Meaning | Possible values | Default value |
|---|---|---|---|
| `-sq` | Phoneme sequence to synthesize | See `data/phones.json` for supported phones. | Not set; required if `-pf` is not set. |
| `-pf` | File with the phoneme sequence to synthesize | e.g. `app/data/phoneme.txt`. | Not set; required if `-sq` is not set. |
| `-emo` | Id of the desired voice emotion | 0: neutral, 1: angry, 2: happy, 3: sad, 4: surprise. | 1 |
| `-sp` | Id of the speaker voice | From 1 to 10, corresponding to 0011 ... 0020 in the original ESD notation. | 5 |
| `-p` | Path where the synthesized audio is saved | Any path with a .wav extension. | `generation_from_phoneme_sequence.wav` |
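The id conventions in the table can be captured in a small helper. The mapping values come straight from the table above; the names themselves are only illustrative:

```python
# Emotion ids accepted by -emo, per the table above.
EMOTIONS = {0: "neutral", 1: "angry", 2: "happy", 3: "sad", 4: "surprise"}

def esd_speaker_code(sp: int) -> str:
    """Map a -sp id (1..10) to the original ESD speaker code (0011..0020)."""
    if not 1 <= sp <= 10:
        raise ValueError(f"speaker id must be in 1..10, got {sp}")
    return f"{sp + 10:04d}"
```

For instance, `esd_speaker_code(5)` (the default) corresponds to ESD speaker `0015`.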
For example:

```
uv run --extra cpu -m src.scripts.inference -sq "S P IY2 K ER1 F AY1 V T AO1 K IH0 NG W IH0 TH AE1 NG G R IY0 IH0 M OW0 SH AH0 N"
```

or

```
uv run --extra cpu -m src.scripts.inference -pf app/data/phoneme.txt
```

If the result file is not synthesized, check `inference.log` for OOV phones.
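OOV phones can also be caught before running inference by checking the sequence against `app/data/preprocessed/phones.json`. A sketch, assuming the file is a JSON object (or list) whose keys or items are the supported phone symbols; the exact JSON shape in the repo may differ:

```python
import json

def find_oov_phones(sequence: str, phones_json: str) -> list[str]:
    """Return phones from a space-separated sequence not listed in phones.json.

    Assumes phones.json is a JSON object keyed by phone symbol, or a JSON
    list of symbols; adjust if the repo stores phones differently.
    """
    with open(phones_json, encoding="utf-8") as f:
        data = json.load(f)
    supported = set(data if isinstance(data, list) else data.keys())
    return [p for p in sequence.split() if p not in supported]
```

An empty return value means every phone in the sequence is supported.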
- FastSpeech 2 - PyTorch Implementation
- iSTFTNet: Fast and Lightweight Mel-spectrogram Vocoder Incorporating Inverse Short-time Fourier Transform
- Publicly Available Emotional Speech Dataset (ESD) for Speech Synthesis and Voice Conversion
- NISQA: Speech Quality and Naturalness Assessment
- Montreal Forced Aligner Models
- Modified VocGAN
- AdaSpeech