You can build an environment with uv.
If you don't have uv installed, please find the installation instructions for your OS here.
uv will create a virtual environment and install all dependencies specified in `pyproject.toml`. To build the environment with uv, run:

For CPU-only:

```
uv sync --extra cpu
```

For GPU:

```
uv sync --extra gpu
```

We used data of 10 English speakers from the ESD dataset. To download all .wav and .txt files, along with the .TextGrid files created using MFA, run:

```
uv run --extra cpu download_data.py
```

To train a model we need precomputed duration, energy, pitch and eGeMAPS features. From the src directory run:

```
uv run --extra cpu -m src.preprocess.preprocess
```

This is how your app folder should look:
```
.
└── data
    ├── data
    │   └── ssw_esd
    ├── emospeech.cpkt
    ├── phones.json
    ├── preprocessed
    │   ├── duration
    │   ├── egemap
    │   ├── energy
    │   ├── mel
    │   ├── phones.json
    │   ├── pitch
    │   ├── stats.json
    │   ├── test.txt
    │   ├── train.txt
    │   ├── trimmed_wav
    │   └── val.txt
    ├── test_ids.txt
    ├── val_ids.txt
    └── vocoder_checkpoint.pt
```
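A quick sanity check can confirm your folder matches the tree above before training. The entries below are taken from the tree; the helper itself is just an illustrative sketch, not part of the repo:

```python
from pathlib import Path

# Directories and files expected under app/data, per the tree above.
EXPECTED = [
    "data/ssw_esd",
    "preprocessed/duration",
    "preprocessed/egemap",
    "preprocessed/energy",
    "preprocessed/mel",
    "preprocessed/pitch",
    "preprocessed/trimmed_wav",
    "preprocessed/stats.json",
    "preprocessed/phones.json",
    "preprocessed/train.txt",
    "preprocessed/val.txt",
    "preprocessed/test.txt",
]

def missing_entries(data_root: str) -> list[str]:
    """Return the expected entries that are absent under `data_root`."""
    root = Path(data_root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]
```

Running `missing_entries("app/data")` after preprocessing should return an empty list.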
- Configure arguments in `config/config.py`.
- Run:

```
uv run --extra cpu -m src.scripts.train
```

Testing is implemented on the testing subset of the ESD dataset. To synthesize audio and compute neural MOS (NISQA TTS):

- Configure arguments in `config/config.py` under the `Inference` section.
- Run:

```
uv run --extra cpu -m src.scripts.test
```

You can find NISQA TTS scores for the original, reconstructed and generated audio in `test.log`.
EmoSpeech is trained on phoneme sequences. Supported phones can be found in `app/data/preprocessed/phones.json`. This repository was created for academic research and doesn't support automatic grapheme-to-phoneme conversion. However, if you would like to synthesize an arbitrary sentence with emotion conditioning, you can:
- Build the docker image with MFA:

  ```
  docker build -t mfa .
  ```

- Run the docker container, mounting `app/data` to `/data` in the container:

  ```
  docker run -it -v ./app/data:/data mfa
  ```

- Create a `graphemes.txt` file in `app/data` with the sentence you want to synthesize, for example:

  ```
  echo "Your sentence to synthesize goes here." > app/data/graphemes.txt
  ```
- Generate the phoneme sequence from the graphemes with MFA:

  1. Follow the installation guide.
  2. Download the English g2p model:

     ```
     mfa model download g2p english_us_arpa
     ```

  3. Generate `phoneme.txt` from `graphemes.txt`:

     ```
     mfa g2p graphemes.txt english_us_arpa phoneme.txt
     ```
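If you want to pass the result via `-sq` instead of `-pf`, the per-word g2p output has to be flattened into one phoneme string. A minimal sketch, assuming each non-empty line of the output pairs a word with its space-separated phones, tab-delimited as in an MFA pronunciation dictionary (adjust if your `phoneme.txt` is laid out differently):

```python
def phoneme_file_to_sequence(path: str) -> str:
    """Flatten an MFA g2p output file into one space-separated phoneme
    string suitable for the -sq argument.

    Assumes each non-empty line is `word<TAB>phone phone ...`; this layout
    is an assumption, not guaranteed by the repo.
    """
    phones: list[str] = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split("\t")
            if len(parts) == 2:
                phones.extend(parts[1].split())
    return " ".join(phones)
```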
Run `uv run --extra cpu -m src.scripts.inference`, specifying the arguments:
| Argument | Meaning | Possible values | Default value |
|---|---|---|---|
| `-sq` | Phoneme sequence to synthesize | See `data/phones.json` for supported phones. | Not set; required if `-pf` is not set. |
| `-pf` | File with the phoneme sequence to synthesize | e.g. `app/data/phoneme.txt`. | Not set; required if `-sq` is not set. |
| `-emo` | Id of the desired voice emotion | 0: neutral, 1: angry, 2: happy, 3: sad, 4: surprise. | 1 |
| `-sp` | Id of the speaker voice | From 1 to 10, corresponding to 0011 ... 0020 in the original ESD notation. | 5 |
| `-p` | Path where the synthesized audio is saved | Any path with a .wav extension. | `generation_from_phoneme_sequence.wav` |
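The id conventions in the table can be captured in a small helper. The mapping values come straight from the table above; the names themselves are only illustrative:

```python
# Emotion ids accepted by -emo, per the table above.
EMOTIONS = {0: "neutral", 1: "angry", 2: "happy", 3: "sad", 4: "surprise"}

def esd_speaker_code(sp: int) -> str:
    """Map a -sp id (1..10) to the original ESD speaker code (0011..0020)."""
    if not 1 <= sp <= 10:
        raise ValueError(f"speaker id must be in 1..10, got {sp}")
    return f"{sp + 10:04d}"
```

For instance, `esd_speaker_code(5)` (the default) corresponds to ESD speaker `0015`.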
For example:

```
uv run --extra cpu -m src.scripts.inference -sq "S P IY2 K ER1 F AY1 V T AO1 K IH0 NG W IH0 TH AE1 NG G R IY0 IH0 M OW0 SH AH0 N"
```

or

```
uv run --extra cpu -m src.scripts.inference -pf app/data/phoneme.txt
```

If the result file is not synthesized, check `inference.log` for OOV phones.
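OOV phones can also be caught before running inference by checking the sequence against `app/data/preprocessed/phones.json`. A sketch, assuming the file is a JSON object (or list) whose keys or items are the supported phone symbols; the exact JSON shape in the repo may differ:

```python
import json

def find_oov_phones(sequence: str, phones_json: str) -> list[str]:
    """Return phones from a space-separated sequence not listed in phones.json.

    Assumes phones.json is a JSON object keyed by phone symbol, or a JSON
    list of symbols; adjust if the repo stores phones differently.
    """
    with open(phones_json, encoding="utf-8") as f:
        data = json.load(f)
    supported = set(data if isinstance(data, list) else data.keys())
    return [p for p in sequence.split() if p not in supported]
```

An empty return value means every phone in the sequence is supported.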
- FastSpeech 2 - PyTorch Implementation
- iSTFTNet: Fast and Lightweight Mel-spectrogram Vocoder Incorporating Inverse Short-time Fourier Transform
- Publicly Available Emotional Speech Dataset (ESD) for Speech Synthesis and Voice Conversion
- NISQA: Speech Quality and Naturalness Assessment
- Montreal Forced Aligner Models
- Modified VocGAN
- AdaSpeech