
Ossian + DNN demo

Ossian is a collection of Python code for building text-to-speech (TTS) systems, with an emphasis on easing research into building TTS systems with minimal expert supervision. Work on it started with funding from the EU FP7 project Simple4All, and this repository contains a version which is considerably more up-to-date than the one previously available. In particular, the original version of the toolkit relied on HTS to perform acoustic modelling. Although it is still possible to use HTS, the toolkit now supports the use of neural nets trained with the Merlin toolkit as duration and acoustic models. All comments and feedback about ways to improve it are very welcome.

Some Chinese documentation and summaries can be found here: Chinese Ossian Doc.

Dependencies

Perl 5 is required.

Python 2.7 is required.

Use the pip package installer -- within a Python virtualenv if desired -- to install the necessary packages:

pip install numpy
pip install scipy
pip install configobj
pip install scikit-learn
pip install regex
pip install lxml
pip install argparse

We will use the Merlin toolkit to train neural networks, creating the following dependencies:

pip install bandmat 
pip install theano
pip install matplotlib

We will use sox to process speech data:

apt-get install sox

Getting the tools

Clone the Ossian github repository as follows:

git clone https://github.com/candlewill/Ossian.git

This will create a directory called ./Ossian; the following discussion assumes that an environment variable $OSSIAN is set to point to this directory.
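
For example, a minimal sketch assuming you cloned into your current working directory:

cd Ossian
export OSSIAN=$(pwd)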

Install from scratch

Ossian relies on the Hidden Markov Model Toolkit (HTK) and HMM-based Speech Synthesis System (HTS) for alignment and (optionally) acoustic modelling -- here are some notes on obtaining and compiling the necessary tools. To get a copy of the HTK source code it is necessary to register on the HTK website to obtain a username and password. It is here assumed that these have been obtained and the environment variables $HTK_USERNAME and $HTK_PASSWORD point to them.
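
For example, with placeholder values standing in for your actual credentials:

export HTK_USERNAME=your_htk_username
export HTK_PASSWORD=your_htk_password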

Running the following script will download and install the necessary tools (including Merlin):

./scripts/setup_tools.sh $HTK_USERNAME $HTK_PASSWORD

The script ./scripts/setup_tools.sh does the following:

  • Clones the Merlin repo to $OSSIAN/tools/merlin and resets its head to 8aed278.
  • Enters the merlin/tools/WORLD/ folder, builds it, then copies analysis and synth into $OSSIAN/tools/bin/:
    cd $OSSIAN/tools/merlin/tools/WORLD/
    make -f makefile
    make -f makefile analysis
    make -f makefile synth
    mkdir -p $OSSIAN/tools/bin/
    cp $OSSIAN/tools/merlin/tools/WORLD/build/{analysis,synth} $OSSIAN/tools/bin/
  • Downloads HTK, HDecode, and HTS, applies the HTS patch, then builds HTK and installs it to $OSSIAN/tools/.
  • Downloads hts-engine and installs it to $OSSIAN/tools/.
  • Downloads SPTK and installs it to $OSSIAN/tools/.
  • Installs the g2p-r1668-r3 and corenlp-python packages if you change the values of SEQUITUR and STANFORD from 0 to 1 (see the sketch after this list).
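
For instance, to enable those optional packages, edit the variables in the script before running it -- a minimal sketch, assuming the variables sit near the top of scripts/setup_tools.sh (check your copy for their exact location):

SEQUITUR=1   # also install the g2p-r1668-r3 package
STANFORD=1   # also install the corenlp-python package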

As all the tools are installed into the $OSSIAN/tools/ directory, $OSSIAN/tools/bin will contain all the binaries used by Ossian.

Install from pre-built

If you have installed the above-mentioned tools manually and don't want to install from scratch, you can create soft links to tell Ossian where you have installed them.

# 1 Manually clone the merlin repo
# 2 Download WORLD, HTK, HDecode, HTS, HTS-engine, and SPTK; build and install them.
# 3 Copy all of the binaries into one folder, e.g. bin.

# 4 Where is your merlin dir
export merlin_dir=/home/dl80/heyunchao/Programs/Ossian/tools/merlin
# 5 Where is the bin directory including all the binaries
export bin_dir=/home/dl80/heyunchao/Programs/Ossian/tools/bin

# 6 Create soft links in your Ossian/tools directory
cd /home/dl80/heyunchao/Programs/MyOssian_Github/tools
ln -s $merlin_dir merlin
ln -s $bin_dir bin

If you don't want to build from scratch, we provide a pre-built binary collection here: Ossian_required_bin.tar. Download it and move the binaries into the $bin_dir directory.
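
A minimal unpacking sketch, assuming the archive contains the binaries at its top level:

mkdir -p $bin_dir
tar xvf Ossian_required_bin.tar -C $bin_dir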

Acquire some data

Ossian expects its training data to be in the directories:

 ./corpus/<OSSIAN_LANG>/speakers/<DATA_NAME>/txt/*.txt
 ./corpus/<OSSIAN_LANG>/speakers/<DATA_NAME>/wav/*.wav

Text and wave files should be numbered consistently with each other. <OSSIAN_LANG> and <DATA_NAME> are both arbitrary strings, but it is sensible to choose meaningful ones.
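
For example, a hypothetical Romanian dataset called my_data might look like this; the file names are illustrative, only the consistent pairing matters:

./corpus/rm/speakers/my_data/txt/utt_0001.txt
./corpus/rm/speakers/my_data/txt/utt_0002.txt
./corpus/rm/speakers/my_data/wav/utt_0001.wav
./corpus/rm/speakers/my_data/wav/utt_0002.wav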

Download and unpack this toy (Romanian) corpus for some guidance:

cd $OSSIAN
wget -O romanian_toy_demo_corpus_for_ossian.tar "https://www.dropbox.com/s/uaz1ue2dked8fan/romanian_toy_demo_corpus_for_ossian.tar?dl=0"
tar xvf romanian_toy_demo_corpus_for_ossian.tar

This will create the following directory structures:

./corpus/rm/speakers/rss_toy_demo/
./corpus/rm/text_corpora/wikipedia_10K_words/

Let's start by building some voices on this tiny dataset. The results will sound bad, but if you can get it to speak, no matter how badly, the tools are working and you can retrain on more data of your own choosing. Below are instructions on how to train HTS-based and neural network based voices on this data.

You can download the 1-hour sets of data in various languages that we prepared here: http://tundra.simple4all.org/ssw8data.html

DNN-based voice using a naive recipe

Ossian trains voices according to a given 'recipe' -- the recipe specifies a sequence of processes which are applied to an utterance to turn it from text into speech, and is given in a file called $OSSIAN/recipes/<RECIPE>.cfg (where <RECIPE> is the name of the specific recipe you are using). We will start with a recipe called naive_01_nn. If you want to add components to the synthesiser, the best way to start is to copy the file for an existing recipe to a new name and modify it, as sketched below.
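
For example, a minimal sketch of deriving a new recipe (my_recipe is a hypothetical name):

cp $OSSIAN/recipes/naive_01_nn.cfg $OSSIAN/recipes/my_recipe.cfg
# edit $OSSIAN/recipes/my_recipe.cfg as desired, then train with:
python ./scripts/train.py -s <DATA_NAME> -l <OSSIAN_LANG> my_recipe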

The recipe naive_01_nn is a language independent recipe which naively uses letters as acoustic modelling units. It will work reasonably for languages with sensible orthographies (e.g. Romanian) and less well for e.g. English.

Ossian will put all files generated during training on the data <DATA_NAME> in language <OSSIAN_LANG> according to recipe <RECIPE> in a directory called:

 $OSSIAN/train/<OSSIAN_LANG>/speakers/<DATA_NAME>/<RECIPE>/

When it has successfully trained a voice, the components needed at synthesis time are copied to:

 $OSSIAN/voices/<OSSIAN_LANG>/<DATA_NAME>/<RECIPE>/

Assuming that we want to start by training a voice from scratch, we might want to check that these locations do not already exist for our combination of data/language/recipe:

rm -r $OSSIAN/train/rm/speakers/rss_toy_demo/naive_01_nn/ $OSSIAN/voices/rm/rss_toy_demo/naive_01_nn/

Then to train, do this:

cd $OSSIAN
python ./scripts/train.py -s rss_toy_demo -l rm naive_01_nn

As various messages printed during training will inform you, training of the neural networks themselves, which will be used for duration and acoustic modelling, is not directly supported within Ossian. The above command prepares the data and configs needed to train the duration and acoustic networks, but the Merlin toolkit needs to be called separately to actually train the models. The NNs it produces then need to be converted back to a format suitable for Ossian. This is a little messy, but better integration between Ossian and Merlin is an ongoing area of development.

Here's how to do this -- these same instructions will have been printed when you called ./scripts/train.py above. First, train the duration model:

cd $OSSIAN
export THEANO_FLAGS=""; python ./tools/merlin/src/run_merlin.py $OSSIAN/train/rm/speakers/rss_toy_demo/naive_01_nn/processors/duration_predictor/config.cfg

For this toy data, training on the CPU like this will be quick. Alternatively, to use a GPU for training, do:

./scripts/util/submit.sh ./tools/merlin/src/run_merlin.py $OSSIAN/train/rm/speakers/rss_toy_demo/naive_01_nn/processors/duration_predictor/config.cfg

If training went OK, then you can export the trained model to a better format for Ossian. The basic problem is that the NN-TTS tools store the model as a Python pickle file -- if this is made on a GPU machine, it can only be used on a GPU machine. This script converts it to a more flexible format understood by Ossian -- call it with the same config file you used for training and the name of a directory where the new format should be put:

python ./scripts/util/store_merlin_model.py $OSSIAN/train/rm/speakers/rss_toy_demo/naive_01_nn/processors/duration_predictor/config.cfg $OSSIAN/voices/rm/rss_toy_demo/naive_01_nn/processors/duration_predictor

When training the duration model, there will be many warnings saying WARNING: no silence found! -- these are not a problem and can be ignored.

Similarly for the acoustic model:

cd $OSSIAN
export THEANO_FLAGS=""; python ./tools/merlin/src/run_merlin.py $OSSIAN/train/rm/speakers/rss_toy_demo/naive_01_nn/processors/acoustic_predictor/config.cfg

Or:

./scripts/util/submit.sh ./tools/merlin/src/run_merlin.py $OSSIAN/train/rm/speakers/rss_toy_demo/naive_01_nn/processors/acoustic_predictor/config.cfg

Then:

python ./scripts/util/store_merlin_model.py $OSSIAN/train/rm/speakers/rss_toy_demo/naive_01_nn/processors/acoustic_predictor/config.cfg $OSSIAN/voices/rm/rss_toy_demo/naive_01_nn/processors/acoustic_predictor

If training went OK, you can synthesise speech. There is an example Romanian sentence in $OSSIAN/test/txt/romanian.txt -- we will synthesise a wave file for it in $OSSIAN/test/wav/romanian_toy_naive.wav like this:

mkdir -p $OSSIAN/test/wav/

python ./scripts/speak.py -l rm -s rss_toy_demo -o ./test/wav/romanian_toy_naive.wav naive_01_nn ./test/txt/romanian.txt

You can find the audio for this sentence here for comparison (it was not used in training).

The configuration files used for duration and acoustic model training will work as-is for the toy data set, but when you move to other data sets, you will want to experiment with editing them to get better performance. In particular, you will want to increase training_epochs to train voices on larger amounts of data; this could be set to e.g. 30 for the acoustic model and e.g. 100 for the duration model. You will also want to experiment with learning_rate, batch_size, and network architecture (hidden_layer_size, hidden_layer_type), as in the sketch below. Currently, Ossian only supports feed-forward networks.
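
For example, these settings live in the [Architecture] section of the Merlin config files prepared under processors/duration_predictor and processors/acoustic_predictor; the values below are purely illustrative, and the exact key names may vary between Merlin versions:

[Architecture]
training_epochs: 30
batch_size: 256
learning_rate: 0.002
hidden_layer_size: [1024, 1024, 1024, 1024, 1024, 1024]
hidden_layer_type: ['TANH', 'TANH', 'TANH', 'TANH', 'TANH', 'TANH']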

Synthesis

The command to synthesize a new wave file given text as input is:

python ./scripts/speak.py -l $OSSIAN_LANG -s $DATA_NAME -o ./test/wav/${OSSIAN_LANG}_${DATA_NAME}_test.wav $RECIPE ./test/txt/test.txt

where ./test/wav/${OSSIAN_LANG}_${DATA_NAME}_test.wav is the synthesized wave file and ./test/txt/test.txt is the input text.
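
For instance, with the toy voice trained above (assuming you have put a test sentence in ./test/txt/test.txt):

export OSSIAN_LANG=rm
export DATA_NAME=rss_toy_demo
export RECIPE=naive_01_nn
python ./scripts/speak.py -l $OSSIAN_LANG -s $DATA_NAME -o ./test/wav/${OSSIAN_LANG}_${DATA_NAME}_test.wav $RECIPE ./test/txt/test.txt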

The complete usage of speak.py is:

usage: speak.py [-h] -s SPEAKER -l LANG [-o OUTPUT] [-t STAGE] [-play] [-lab]
                [-bin CUSTOM_BINDIR]
                config [files [files ...]]

positional arguments:
  config              configuration to use: naive, semi-naive, gold, as
                      defined in <ROOT>/recipes/<config> -directory
  files               text files to speak, reading from stdin by default

optional arguments:
  -h, --help          show this help message and exit
  -s SPEAKER          the name of the speaker: <ROOT>/corpus/<LANG>/<SPEAKER>
  -l LANG             the language of the speaker: <ROOT>/corpus/<LANG>
  -o OUTPUT           output audio here
  -t STAGE            defines the current usage stage (definitions of stages
                      should be found in <config>/recipe.cfg)
  -play               play audio after synthesis
  -lab                make label file as well as wave in output location
  -bin CUSTOM_BINDIR

If you want to export your pre-trained model, you should pack up the following files:

  1. voices/
  2. train/cn/speakers/king_cn_corpus/naive_01_nn.cn/questions_dur.hed.cont
  3. train/cn/speakers/king_cn_corpus/naive_01_nn.cn/questions_dur.hed
  4. train/cn/speakers/king_cn_corpus/naive_01_nn.cn/questions_dnn.hed.cont

Then, after putting them in the right directories, someone else can use your model to synthesize speech from given text. A minimal packaging sketch follows (the archive name is hypothetical):
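
cd $OSSIAN
tar czvf my_pretrained_voice.tar.gz \
    voices/ \
    train/cn/speakers/king_cn_corpus/naive_01_nn.cn/questions_dur.hed.cont \
    train/cn/speakers/king_cn_corpus/naive_01_nn.cn/questions_dur.hed \
    train/cn/speakers/king_cn_corpus/naive_01_nn.cn/questions_dnn.hed.cont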

Pre-trained Model

Here we provide a simple pre-trained model for Chinese TTS. As the model is trained on a small internal corpus for testing, the quality of the synthesized voice is not very good.

Simple Pre-trained Chinese Model: Ossian_cn_pretrained_model.tar.gz

Some samples generated from this model can be found here: Ossian_Chinese_samples.zip

Latest Merlin Repo

If you want to use the latest merlin repo, that is now possible. However, when exporting the model, some 'file does not exist' errors may occur. You can manually copy the corresponding files to the right folder to deal with this. The files exist after training, but not in the expected directory; you can use find . -name '*.dat' to locate them.

Here is an example:

# Acoustic model
cp ./train/cn/speakers/toy_cn_corpus/naive_01_nn.cn/dnn_training_ACOUST/inter_module/norm_info__mgc_lf0_vuv_bap_187_MVN.dat /root/Ossian/train/cn/speakers/toy_cn_corpus/naive_01_nn.cn//cmp//norm_info_mgc_lf0_vuv_bap_187_MVN.dat

cp ./train/cn/speakers/toy_cn_corpus/naive_01_nn.cn/dnn_training_ACOUST/inter_module/label_norm_HTS_3491.dat /root/Ossian/train/cn/speakers/toy_cn_corpus/naive_01_nn.cn//cmp//label_norm_HTS_3491.dat

cp ./train/cn/speakers/toy_cn_corpus/naive_01_nn.cn/dnn_training_ACOUST/nnets_model/feed_forward_6_tanh.model /root/Ossian/train/cn/speakers/toy_cn_corpus/naive_01_nn.cn//dnn_training_ACOUST//nnets_model/DNN_TANH_TANH_TANH_TANH_TANH_TANH_LINEAR__mgc_lf0_vuv_bap_0_6_1024_1024_1024_1024_1024_1024_3491.187.train.243.0.002000.rnn.model

# Duration model
cp ./train/cn/speakers/toy_cn_corpus/naive_01_nn.cn/dnn_training_DUR/inter_module/norm_info__dur_5_MVN.dat /root/Ossian/train/cn/speakers/toy_cn_corpus/naive_01_nn.cn///norm_info_dur_5_MVN.dat

cp ./train/cn/speakers/toy_cn_corpus/naive_01_nn.cn/dnn_training_DUR/inter_module/label_norm_HTS_3482.dat /root/Ossian/train/cn/speakers/toy_cn_corpus/naive_01_nn.cn/
...

Other recipes

We have used many other recipes with Ossian which will be documented here when cleaned up enough to be useful to others. These give the ability to add more knowledge to the voices built, in the form of lexicons, letter-to-sound rules etc., and to integrate existing trained components where they are available for the target language. Some of them can be found here:

  1. Chinese Text-to-Speech recipe

Announcement

This project is based on CSTR-Edinburgh/Ossian. All copyright belongs to the original project.

Yunchao He

yunchaohe@gmail.com