Add Talromur2 recipe #4680

Merged: 43 commits, Nov 8, 2022 (changes shown from 37 commits)

Commits
07d151c
add talromur to db.sh
G-Thor Dec 7, 2021
8152202
Merge branch 'master' of https://github.com/cadia-lvl/espnet
G-Thor Dec 7, 2021
2a1a407
Integrate icelandic g2p
G-Thor Feb 23, 2022
2f2ed4b
Add talromur recipe
G-Thor Feb 23, 2022
299e061
Add progress bar to phonemization
G-Thor Feb 23, 2022
6348585
Remove unused variables from talromur recipe
G-Thor Feb 23, 2022
16f47f7
Merge branch 'master' of https://github.com/espnet/espnet into pull_u…
G-Thor Feb 23, 2022
b467140
reformatted with black
G-Thor Feb 23, 2022
7752729
Set cluster settings to default, add name to TODO
G-Thor Feb 23, 2022
49a284e
Fix python formatting
G-Thor Feb 23, 2022
87b7b25
Add data_download.sh for talromur recipe.
G-Thor Apr 26, 2022
67e9836
Added multi-speaker VITS training recipe
G-Thor Apr 26, 2022
6320f0b
Updated ice-g2p integration to reflect recent changes
G-Thor Apr 26, 2022
759cec0
add talromur to db.sh
G-Thor Apr 26, 2022
d7a3946
Integrate icelandic g2p
G-Thor Feb 23, 2022
2f27743
Add talromur recipe
G-Thor Feb 23, 2022
5009416
Add progress bar to phonemization
G-Thor Feb 23, 2022
c6b4db3
Remove unused variables from talromur recipe
G-Thor Feb 23, 2022
2f2c8db
reformatted with black
G-Thor Feb 23, 2022
bf449bd
Set cluster settings to default, add name to TODO
G-Thor Feb 23, 2022
3aae2f7
Fix python formatting
G-Thor Feb 23, 2022
bcead94
Add data_download.sh for talromur recipe.
G-Thor Apr 26, 2022
ab374fa
Added multi-speaker VITS training recipe
G-Thor Apr 26, 2022
a3658ad
Updated ice-g2p integration to reflect recent changes
G-Thor Apr 26, 2022
cd7078d
Fix merge conflicts in db.sh
G-Thor Apr 26, 2022
512b857
refactor ice-g2p integration
G-Thor May 5, 2022
49d846a
initial commit. Not complete yet
G-Thor May 9, 2022
bbb8111
add conf files and implement data scripts
G-Thor May 25, 2022
c61baeb
Implement talromur2 recipe data preprocessing scripts
G-Thor Sep 30, 2022
1ccd7e5
Merge branch 'add_talromur2' of github.com:cadia-lvl/espnet into talr…
G-Thor Sep 30, 2022
540433c
Revert changes irrelevant to recipe
G-Thor Sep 30, 2022
8b1f890
Remove unused conf files
G-Thor Sep 30, 2022
d22126c
Write README
G-Thor Sep 30, 2022
9989814
Add Talromur 2 entry in egs2/README.md
G-Thor Sep 30, 2022
0df138e
Fix linter error E501
G-Thor Sep 30, 2022
f7d6f33
Merge branch 'master' into talromur2
G-Thor Oct 4, 2022
0649fee
apply black to new script
G-Thor Oct 4, 2022
161ab4c
remove setup.sh
G-Thor Oct 5, 2022
163e887
Apply changes from PR review
G-Thor Nov 7, 2022
6a62e9b
Update model download URL to correct version
G-Thor Nov 7, 2022
500ac0d
Make ice-g2p installer use official PyPI version
G-Thor Nov 7, 2022
37cb63b
Merge branch 'master' into talromur2
G-Thor Nov 7, 2022
e52cfd8
Fix shellcheck errors and warnings
G-Thor Nov 7, 2022
4 changes: 4 additions & 0 deletions .gitignore
@@ -83,3 +83,7 @@ tools/anaconda
tools/ice-g2p
tools/fairseq
tools/._*
tools/anaconda
tools/ice-g2p*
tools/fairseq*
tools/featbin*
Member commented: Please add line break

1 change: 1 addition & 0 deletions egs2/README.md
@@ -103,6 +103,7 @@ See: https://espnet.github.io/espnet/espnet2_tutorial.html#recipes-using-espnet2
| swbd_da | NXT Switchboard Annotations | SLU | ENG | https://catalog.ldc.upenn.edu/LDC2009T26 | |
| swbd_sentiment | Speech Sentiment Annotations | SLU | ENG | https://catalog.ldc.upenn.edu/LDC2020T14 | |
| talromur | Talromur: A large Icelandic TTS corpus | TTS | ISL | https://repository.clarin.is/repository/xmlui/handle/20.500.12537/104, https://aclanthology.org/2021.nodalida-main.50.pdf | |
| talromur2 | Talromur 2: Icelandic multi-speaker TTS corpus | TTS | ISL | https://repository.clarin.is/repository/xmlui/handle/20.500.12537/167 | |
| tedlium2 | TED-LIUM corpus release 2 | ASR | ENG | https://www.openslr.org/19/, http://www.lrec-conf.org/proceedings/lrec2014/pdf/1104_Paper.pdf | |
| tedx_spanish_openslr67 | TEDx Spanish Corpus | ASR | SPA | https://www.openslr.org/67/ | |
| thchs30 | A Free Chinese Speech Corpus Released by CSLT@Tsinghua University | TTS | CMN | https://www.openslr.org/18/ | |
1 change: 1 addition & 0 deletions egs2/TEMPLATE/asr1/db.sh
@@ -156,6 +156,7 @@ VOXFORGE=downloads
VOXPOPULI=downloads
HARPERVALLEY=downloads
TALROMUR=downloads
TALROMUR2=downloads
DCASE=
TEDX_SPANISH=downloads

51 changes: 51 additions & 0 deletions egs2/talromur2/tts1/README.md
@@ -0,0 +1,51 @@
# Talrómur 2 recipe

This is a recipe for the [Talrómur 2 corpus](https://repository.clarin.is/repository/xmlui/handle/20.500.12537/167), an Icelandic speech corpus intended for multi-speaker TTS development. It contains studio recordings of speech from 40 Icelandic speakers, 20 male and 20 female.

## Corpus
The corpus contains 56,225 studio-recorded single-sentence audio clips. Each speaker contributes between 929 and 1,879 clips.

| Male voices | Female voices |
|---|---|
| s124, s176, s178, s181, s188 | s146, s180, s186, s208, s209 |
| s206, s220, s225, s234, s235 | s214, s215, s221, s264, s268 |
| s157, s162, s216, s222, s223 | s169, s185, s187, s200, s226 |
| s231, s236, s240, s250, s273 | s228, s247, s251, s256, s258 |

Since this corpus is intended for multi-speaker and speaker-adaptive TTS, no attempt is made to create single-speaker TTS models from the individual voices in the corpus.

A more detailed description of the corpus can be found in its [README file (download link)](https://repository.clarin.is/repository/xmlui/bitstream/handle/20.500.12537/167/README.md) and in [the official repository](https://repository.clarin.is/repository/xmlui/handle/20.500.12537/167).

## Dependencies
This recipe relies on the [Ice-G2P](https://github.com/grammatek/ice-g2p) package for g2p transcriptions. Install it by running the installation script `installers/install_ice_g2p.sh` from the `tools` directory, e.g. from this directory:
```
cd ../../../tools
installers/install_ice_g2p.sh
```
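
After installation, a quick way to verify that Ice-G2P works is to call it from Python. The following is a minimal sketch assuming the package's documented `Transcriber` API, not code from this recipe:
```
# Minimal sanity check for the Ice-G2P installation (assumed API).
from ice_g2p.transcriber import Transcriber

g2p = Transcriber()
# Transcribe a short Icelandic phrase into a phoneme string.
print(g2p.transcribe("góðan daginn"))
```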

## Usage
The recipe provides two training scripts: `train_multi_speaker_tacotron2.sh` and `train_multi_speaker_fastspeech2.sh`. With the current implementation, training a FastSpeech 2 model requires a teacher model for phoneme durations, so you must first train a Tacotron 2 model before you can train a FastSpeech 2 model.

First download the data by running the following:
```
. ./db.sh
local/data_download.sh ${TALROMUR2}
```
Now, to train a Tacotron 2 model, simply run `./train_multi_speaker_tacotron2.sh`.
Once the Tacotron 2 model has been trained, run `./train_multi_speaker_fastspeech2.sh` to obtain a FastSpeech 2 model.


---
## Pretrained models
A fully trained x-vector-based Tacotron 2 model has been uploaded to HuggingFace: [Link](https://huggingface.co/espnet/talromur2_xvector_tacotron2).

This Tacotron 2 model was trained with slightly modified parameters for the ice-g2p `Transcriber` object, defined within `IsG2P` in `espnet2/text/phoneme_tokenizer.py`:

`word_sep=".", syllab_symbol=""` was used instead of the defaults `word_sep=",", syllab_symbol="."`. When using the model for inference, the same nonstandard parameters must be used when producing the phonemized inputs.
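
As an illustration, inference preprocessing might look like the following. This is a hedged sketch, not code from the recipe: it assumes the `Transcriber` keyword arguments described above and ESPnet's `Text2Speech.from_pretrained` loader, and the x-vector file is a placeholder:
```
# Hedged sketch: phonemize with the model's nonstandard g2p settings,
# then synthesize. The x-vector path is a placeholder, not recipe code.
import numpy as np
from ice_g2p.transcriber import Transcriber
from espnet2.bin.tts_inference import Text2Speech

g2p = Transcriber(word_sep=".", syllab_symbol="")  # match training-time settings
phonemes = g2p.transcribe("góðan daginn")

tts = Text2Speech.from_pretrained("espnet/talromur2_xvector_tacotron2")
spembs = np.load("my_speaker_xvector.npy")  # placeholder speaker embedding
wav = tts(phonemes, spembs=spembs)["wav"]
```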

---

## Acknowledgments
This project was funded by the Language Technology Programme for Icelandic 2019-2023. The programme, which is managed and coordinated by Almannarómur, is funded by the Icelandic Ministry of Education, Science and Culture.
110 changes: 110 additions & 0 deletions egs2/talromur2/tts1/cmd.sh
@@ -0,0 +1,110 @@
# ====== About run.pl, queue.pl, slurm.pl, and ssh.pl ======
# Usage: <cmd>.pl [options] JOB=1:<nj> <log> <command...>
# e.g.
# run.pl --mem 4G JOB=1:10 echo.JOB.log echo JOB
#
# Options:
# --time <time>: Limit the maximum time to execute.
# --mem <mem>: Limit the maximum memory usage.
# --max-jobs-run <njob>: Limit the number of parallel jobs. This is ignored for non-array jobs.
# --num-threads <nthreads>: Specify the number of CPU cores.
# --gpu <ngpu>: Specify the number of GPU devices.
# --config: Change the configuration file from default.
#
# "JOB=1:10" is used for "array jobs" and it can control the number of parallel jobs.
# The left string of "=", i.e. "JOB", is replaced by <N>(Nth job) in the command and the log file name,
# e.g. "echo JOB" is changed to "echo 3" for the 3rd job and "echo 8" for 8th job respectively.
# Note that the number must start with a positive number, so you can't use "JOB=0:10" for example.
#
# run.pl, queue.pl, slurm.pl, and ssh.pl have a unified interface that does not depend on the backend.
# These options are mapped to backend-specific options, as configured by
# "conf/queue.conf" and "conf/slurm.conf" by default.
# If jobs fail, your configuration might be wrong for your environment.
#
#
# The official documentation for run.pl, queue.pl, slurm.pl, and ssh.pl:
# "Parallelization in Kaldi": http://kaldi-asr.org/doc/queue.html
# =========================================================


# Select the backend used by run.sh from "local", "stdout", "sge", "slurm", or "ssh"
cmd_backend='local'

# Local machine, without any Job scheduling system
if [ "${cmd_backend}" = local ]; then

# Used for all other jobs
export train_cmd="run.pl"
# Used for "*_train.py": "--gpu" is appended optionally by run.sh
export cuda_cmd="run.pl"
# Used for "*_recog.py"
export decode_cmd="run.pl"

# Local machine logging to stdout and log file, without any Job scheduling system
elif [ "${cmd_backend}" = stdout ]; then

# Used for all other jobs
export train_cmd="stdout.pl"
# Used for "*_train.py": "--gpu" is appended optionally by run.sh
export cuda_cmd="stdout.pl"
# Used for "*_recog.py"
export decode_cmd="stdout.pl"


# "qsub" (Sun Grid Engine, or derivation of it)
elif [ "${cmd_backend}" = sge ]; then
# The default setting is written in conf/queue.conf.
# You must change "-q g.q" for the "queue" for your environment.
# To know the "queue" names, type "qhost -q"
# Note that to use "--gpu *", you have to setup "complex_value" for the system scheduler.

export train_cmd="queue.pl"
export cuda_cmd="queue.pl"
export decode_cmd="queue.pl"


# "qsub" (Torque/PBS.)
elif [ "${cmd_backend}" = pbs ]; then
# The default setting is written in conf/pbs.conf.

export train_cmd="pbs.pl"
export cuda_cmd="pbs.pl"
export decode_cmd="pbs.pl"


# "sbatch" (Slurm)
elif [ "${cmd_backend}" = slurm ]; then
# The default setting is written in conf/slurm.conf.
# You must change "-p cpu" and "-p gpu" for the "partition" for your environment.
# To know the "partion" names, type "sinfo".
# You can use "--gpu * " by default for slurm and it is interpreted as "--gres gpu:*"
# The devices are allocated exclusively using "${CUDA_VISIBLE_DEVICES}".

export train_cmd="slurm.pl"
export cuda_cmd="slurm.pl"
export decode_cmd="slurm.pl"

elif [ "${cmd_backend}" = ssh ]; then
# You have to create ".queue/machines" to specify the host to execute jobs.
# e.g. .queue/machines
# host1
# host2
# host3
# Assuming you can log in to them without a password, i.e. you have set up SSH keys.

export train_cmd="ssh.pl"
export cuda_cmd="ssh.pl"
export decode_cmd="ssh.pl"

# This is an example of specifying several unique options in the JHU CLSP cluster setup.
# Users can modify/add their own command options according to their cluster environments.
elif [ "${cmd_backend}" = jhu ]; then

export train_cmd="queue.pl --mem 2G"
export cuda_cmd="queue-freegpu.pl --mem 2G --gpu 1 --config conf/queue.conf"
export decode_cmd="queue.pl --mem 4G"

else
echo "$0: Error: Unknown cmd_backend=${cmd_backend}" 1>&2
return 1
fi
15 changes: 15 additions & 0 deletions egs2/talromur2/tts1/conf/decode.yaml
@@ -0,0 +1,15 @@
# This configuration is the basic decoding setting for Tacotron 2.
# It can also be applied to Transformer. If you encounter problems
# such as deletions or repetitions, it is worth trying
# `use_att_constraint: true` to make the generation more stable.
# Note that attention constraint is not supported in Transformer.

##########################################################
# DECODING SETTING #
##########################################################
threshold: 0.5 # threshold to stop the generation
maxlenratio: 10.0 # maximum length of generated samples = input length * maxlenratio
minlenratio: 0.0 # minimum length of generated samples = input length * minlenratio
use_att_constraint: false # Whether to use attention constraint, which is introduced in Deep Voice 3
backward_window: 1 # Backward window size in the attention constraint
forward_window: 3 # Forward window size in the attention constraint
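
For reference, these decoding options correspond to constructor arguments of ESPnet's `Text2Speech` inference API, so they can also be overridden from Python. A hedged sketch with placeholder paths (argument names assumed to mirror the YAML keys):
```
# Hedged sketch: overriding the decode settings above from Python.
# Paths are placeholders; argument names assumed to mirror the YAML keys.
from espnet2.bin.tts_inference import Text2Speech

tts = Text2Speech(
    train_config="exp/tts_train/config.yaml",           # placeholder
    model_file="exp/tts_train/train.loss.best.pth",     # placeholder
    threshold=0.5,
    maxlenratio=10.0,
    minlenratio=0.0,
    use_att_constraint=False,
    backward_window=1,
    forward_window=3,
)
```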
7 changes: 7 additions & 0 deletions egs2/talromur2/tts1/conf/mfcc.conf
@@ -0,0 +1,7 @@
--sample-frequency=16000
--frame-length=25 # the default is 25
--low-freq=20 # the default.
--high-freq=7600 # the default is zero, meaning use the Nyquist frequency (8k in this case).
--num-mel-bins=30
--num-ceps=30
--snip-edges=false
11 changes: 11 additions & 0 deletions egs2/talromur2/tts1/conf/pbs.conf
@@ -0,0 +1,11 @@
# Default configuration
command qsub -V -v PATH -S /bin/bash
option name=* -N $0
option mem=* -l mem=$0
option mem=0 # Do not add anything to qsub_opts
option num_threads=* -l ncpus=$0
option num_threads=1 # Do not add anything to qsub_opts
option num_nodes=* -l nodes=$0:ppn=1
default gpu=0
option gpu=0
option gpu=* -l ngpus=$0
12 changes: 12 additions & 0 deletions egs2/talromur2/tts1/conf/queue.conf
@@ -0,0 +1,12 @@
# Default configuration
command qsub -v PATH -cwd -S /bin/bash -j y -l arch=*64*
option name=* -N $0
option mem=* -l mem_free=$0,ram_free=$0
option mem=0 # Do not add anything to qsub_opts
option num_threads=* -pe smp $0
option num_threads=1 # Do not add anything to qsub_opts
option max_jobs_run=* -tc $0
option num_nodes=* -pe mpi $0 # You must set this PE as allocation_rule=1
default gpu=0
option gpu=0
option gpu=* -l gpu=$0 -q g.q
14 changes: 14 additions & 0 deletions egs2/talromur2/tts1/conf/slurm.conf
@@ -0,0 +1,14 @@
# Default configuration
command sbatch --export=PATH
option name=* --job-name $0
option time=* --time $0
option mem=* --mem-per-cpu $0
option mem=0
option num_threads=* --cpus-per-task $0
option num_threads=1 --cpus-per-task 1
option num_nodes=* --nodes $0
default gpu=0
option gpu=0 -p cpu
option gpu=* -p gpu --gres=gpu:$0 -c $0 # Recommended: allocate at least as many CPUs as GPUs
# note: the --max-jobs-run option is supported as a special case
# by slurm.pl and you don't have to handle it in the config file.
10 changes: 10 additions & 0 deletions egs2/talromur2/tts1/conf/tuning/decode_fastspeech.yaml
@@ -0,0 +1,10 @@
# This configuration is the decoding setting for FastSpeech or FastSpeech2.

##########################################################
# DECODING SETTING #
##########################################################
speed_control_alpha: 1 # alpha to control the speed of generated speech
# alpha > 1 makes speech slower; alpha < 1 makes it faster
use_teacher_forcing: false # whether to use teacher forcing
# if true, ground-truth durations are used
# (+ pitch & energy for FastSpeech2)
15 changes: 15 additions & 0 deletions egs2/talromur2/tts1/conf/tuning/decode_tacotron2.yaml
@@ -0,0 +1,15 @@
# This configuration is the basic decoding setting for Tacotron 2.
# It can also be applied to Transformer. If you encounter problems
# such as deletions or repetitions, it is worth trying
# `use_att_constraint: true` to make the generation more stable.
# Note that attention constraint is not supported in Transformer.

##########################################################
# DECODING SETTING #
##########################################################
threshold: 0.5 # threshold to stop the generation
maxlenratio: 10.0 # maximum length of generated samples = input length * maxlenratio
minlenratio: 0.0 # minimum length of generated samples = input length * minlenratio
use_att_constraint: false # Whether to use attention constraint, which is introduced in Deep Voice 3
backward_window: 1 # Backward window size in the attention constraint
forward_window: 3 # Forward window size in the attention constraint
10 changes: 10 additions & 0 deletions egs2/talromur2/tts1/conf/tuning/decode_vits.yaml
@@ -0,0 +1,10 @@
# This configuration is the decoding setting for VITS.

##########################################################
# DECODING SETTING #
##########################################################
noise_scale: 0.667 # noise scale parameter for the flow in VITS.
noise_scale_dur: 0.8 # noise scale parameter for the stochastic duration predictor in VITS.
speed_control_alpha: 1 # alpha to control the speed of generated speech.
# alpha > 1 makes speech slower; alpha < 1 makes it faster.
use_teacher_forcing: false # whether to use teacher forcing.