
EmoSpeech: Guiding FastSpeech2 Towards Emotional Text to Speech


How to run

Build env

You can build the environment with uv. If you don't have uv installed, find the installation instructions for your OS in the uv documentation.

uv will create a virtual environment and install all dependencies specified in pyproject.toml. To build the env with uv, run:

For CPU-only:

uv sync --extra cpu

For GPU:

uv sync --extra gpu

Download and preprocess data

We use data from the 10 English speakers of the ESD dataset. To download all .wav and .txt files, along with the .TextGrid files created using MFA:

uv run --extra cpu download_data.py

To train a model we need precomputed duration, energy, pitch, and eGeMAPS features. From the repository root, run:

uv run --extra cpu -m src.preprocess.preprocess

Your app folder should look like this:

.
└── data
    ├── data
    │   └── ssw_esd
    ├── emospeech.cpkt
    ├── phones.json
    ├── preprocessed
    │   ├── duration
    │   ├── egemap
    │   ├── energy
    │   ├── mel
    │   ├── phones.json
    │   ├── pitch
    │   ├── stats.json
    │   ├── test.txt
    │   ├── train.txt
    │   ├── trimmed_wav
    │   └── val.txt
    ├── test_ids.txt
    ├── val_ids.txt
    └── vocoder_checkpoint.pt
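Before training, you can sanity-check that the layout above is in place. A minimal sketch; the `check_layout` helper and its path list are illustrative, not part of the repo:

```python
from pathlib import Path

# Paths expected under the app folder, per the tree above.
EXPECTED = [
    "data/data/ssw_esd",
    "data/emospeech.cpkt",
    "data/preprocessed/duration",
    "data/preprocessed/egemap",
    "data/preprocessed/energy",
    "data/preprocessed/mel",
    "data/preprocessed/phones.json",
    "data/preprocessed/pitch",
    "data/preprocessed/stats.json",
    "data/preprocessed/train.txt",
    "data/preprocessed/val.txt",
    "data/preprocessed/test.txt",
    "data/preprocessed/trimmed_wav",
    "data/vocoder_checkpoint.pt",
]


def check_layout(app_dir: str) -> list[str]:
    """Return the expected paths that are missing under app_dir."""
    root = Path(app_dir)
    return [rel for rel in EXPECTED if not (root / rel).exists()]
```

Calling `check_layout("app")` returns an empty list when everything from the tree is present, and the missing relative paths otherwise.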

Training

  1. Configure arguments in config/config.py.
  2. Run:
uv run --extra cpu -m src.scripts.train

Testing

Testing runs on the test subset of the ESD dataset. To synthesize audio and compute a neural MOS estimate (NISQA-TTS):

  1. Configure arguments in config/config.py under Inference section.
  2. Run:
uv run --extra cpu -m src.scripts.test

You can find NISQA-TTS scores for the original, reconstructed, and generated audio in test.log.

Inference

EmoSpeech is trained on phoneme sequences. The supported phones are listed in app/data/preprocessed/phones.json. This repository was created for academic research and doesn't support automatic grapheme-to-phoneme conversion. However, if you would like to synthesize an arbitrary sentence with emotion conditioning, you can use one of the two options below.

Option 1: Using my custom Docker image with MFA

  1. Build the Docker image with MFA:
docker build -t mfa .
  2. Run the Docker container, mounting app/data to /data in the container:
docker run -it -v ./app/data:/data mfa
  3. Create a graphemes.txt file in app/data with the sentence you want to synthesize, for example:
echo "Your sentence to synthesize goes here." > app/data/graphemes.txt

Option 2: Following the MFA install guide

To generate a phoneme sequence from graphemes with a local MFA installation:

  1. Follow the MFA installation guide.

  2. Download the English g2p model: mfa model download g2p english_us_arpa

  3. Generate phoneme.txt from graphemes.txt: mfa g2p graphemes.txt english_us_arpa phoneme.txt
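If you need the phonemes as a single flat string (the format the -sq argument takes), the per-word output of mfa g2p can be joined. A sketch assuming mfa g2p writes one word per line as word<TAB>phone-sequence; check your MFA version's actual output format, and note this helper is not part of the repo:

```python
from pathlib import Path


def flatten_g2p(path: str) -> str:
    """Join per-word 'word<TAB>phones' lines from an assumed mfa g2p
    output file into one space-separated phoneme sequence."""
    phones: list[str] = []
    for line in Path(path).read_text().splitlines():
        if not line.strip():
            continue
        # The last tab-separated field holds the phone string for the word.
        phones.extend(line.split("\t")[-1].split())
    return " ".join(phones)
```

For example, `flatten_g2p("app/data/phoneme.txt")` would yield a string you can pass to -sq directly.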

Launch inference

Run uv run --extra cpu -m src.scripts.inference, specifying arguments:

| Argument | Meaning | Possible values | Default |
| --- | --- | --- | --- |
| -sq | Phoneme sequence to synthesize | Phones listed in data/phones.json | Not set; required if -pf is not set |
| -pf | File with the phoneme sequence to synthesize | e.g. app/data/phoneme.txt | Not set; required if -sq is not set |
| -emo | Id of the desired voice emotion | 0: neutral, 1: angry, 2: happy, 3: sad, 4: surprise | 1 |
| -sp | Id of the speaker voice | 1 to 10, corresponding to 0011 ... 0020 in the original ESD notation | 5 |
| -p | Path where the synthesized audio is saved | Any path with a .wav extension | generation_from_phoneme_sequence.wav |

For example

uv run --extra cpu -m src.scripts.inference -sq "S P IY2 K ER1 F AY1 V  T AO1 K IH0 NG W IH0 TH AE1 NG G R IY0 IH0 M OW0 SH AH0 N"
uv run --extra cpu -m src.scripts.inference -pf app/data/phoneme.txt

If the result file is not synthesized, check inference.log for out-of-vocabulary (OOV) phones.
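You can also catch OOV phones before running inference by diffing your sequence against phones.json. A sketch; the schema assumption (a list of phones or a phone-to-id mapping) is mine, so check the actual file:

```python
import json
from pathlib import Path


def find_oov(sequence: str, phones_json: str) -> list[str]:
    """Return phones in `sequence` that are absent from phones.json.

    Handles phones.json stored either as a list of phones or as a
    phone -> id mapping (assumed schemas; the real one may differ).
    """
    data = json.loads(Path(phones_json).read_text())
    inventory = set(data if isinstance(data, list) else data.keys())
    return [p for p in sequence.split() if p not in inventory]
```

For example, `find_oov("S P IY2 K ER1", "app/data/preprocessed/phones.json")` returns an empty list when every phone is supported.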

References

  1. FastSpeech 2 - PyTorch Implementation
  2. iSTFTNet: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating Inverse Short-Time Fourier Transform
  3. Publicly Available Emotional Speech Dataset (ESD) for Speech Synthesis and Voice Conversion
  4. NISQA: Speech Quality and Naturalness Assessment
  5. Montreal Forced Aligner Models
  6. Modified VocGAN
  7. AdaSpeech
