[SVS] Add new recipes #5158

Merged
merged 13 commits into from
May 12, 2023
3 changes: 3 additions & 0 deletions egs2/README.md
Original file line number Diff line number Diff line change
@@ -85,6 +85,7 @@ See: https://espnet.github.io/espnet/espnet2_tutorial.html#recipes-using-espnet2
| lrs2 | The Oxford-BBC Lip Reading Sentences 2 (LRS2) Dataset | Lipreading/ASR | ENG | https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs2.html | |
| lrs3 | The Oxford-BBC Lip Reading Sentences 3 (LRS3) Dataset | ASR | ENG | https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs3.html | |
| lt_slurp_spatialized | Spatialized Libri-Trans and Spatialized SLURP (LT-S and SLURP-S), Enhancement for Translation and Understanding Dataset | SE/ST/SLU | ENG | | |
| m4singer | Multi-Style, Multi-Singer and Musical Score Provided Mandarin Singing Corpus | SVS | CMN | https://drive.google.com/file/d/1xC37E59EWRRFFLdG3aJkVqwtLDgtFNqW/view?usp=share_link | |
| magicdata | MAGICDATA Mandarin Chinese Read Speech Corpus | ASR | ENG | https://www.openslr.org/68/ | |
| media | MEDIA speech database for French | SLU/Entity Classifi. | FRA | https://catalogue.elra.info/en-us/repository/browse/ELRA-S0272/ | |
| mediaspeech | MediaSpeech: Multilanguage ASR Benchmark and Dataset | ASR | FRA | https://www.openslr.org/108/ | |
@@ -104,9 +105,11 @@ See: https://espnet.github.io/espnet/espnet2_tutorial.html#recipes-using-espnet2
| must_c_v2 | https://ict.fbk.eu/must-c/ | ASR/MT/ST | ENG->DEU | https://ict.fbk.eu/must-c/ | |
| nsc | National Speech Corpus | ASR | ENG-SG | https://www.imda.gov.sg/programme-listing/digital-services-lab/national-speech-corpus | |
| ofuton_p_utagoe_db | Ofuton_p_utagoe Singing voice synthesis corpus | SVS | JPN | https://sites.google.com/view/oftn-utagoedb/%E3%83%9B%E3%83%BC%E3%83%A0 | |
| oniku_kurumi_utagoe_db | Oniku Singing voice synthesis corpus | SVS | JPN | http://onikuru.info/db-download/ | |
| open_li110 | Corpus combination with 110 languages | Multilingual ASR | 100+ languages | | |
| open_li52 | Corpus combination with 52 languages(Commonvocie + voxforge) | Multilingual ASR | 52 languages | | |
| opencpop | Opencpop: Mandarin singing voice synthesis corpus | SVS | CMN | https://wenet.org.cn/opencpop/ | |
| pjs | Phoneme-balanced Japanese Singing-voice corpus | SVS | JPN | https://sites.google.com/site/shinnosuketakamichi/research-topics/pjs_corpus | |
| polyphone_swiss_french | Swiss French Polyphone corpus | ASR | FRA | http://catalog.elra.info/en-us/repository/browse/ELRA-S0030_02 | |
| portmedia_dom | PortMedia French corpus | SLU/Entity Classifi. | FRA | https://catalogue.elra.info/en-us/repository/browse/ELRA-S0371/ | |
| portmedia_lang | PortMedia Italian corpus | SLU/Entity Classifi. | ITA | https://catalogue.elra.info/en-us/repository/browse/ELRA-S0371/ | |
2 changes: 2 additions & 0 deletions egs2/TEMPLATE/asr1/db.sh
@@ -172,6 +172,8 @@ CATSLU=downloads
ELRA_E0024=
ELRA_S0272=
ELRA_S0371=
M4SINGER=
ONIKU=

# For only CMU TIR environment
if [[ "$(hostname)" == tir* ]]; then
2 changes: 1 addition & 1 deletion egs2/TEMPLATE/asr1/pyscripts/utils/check_align.py
@@ -45,7 +45,7 @@ def compare(key, score, label):
  phns = customed_dic[syb]
  score[i].append("_".join(phns))
  for p in phns:
-     if index >= len(labels):
+     if index >= len(label):
          raise ValueError("Syllables are longer than phones in {}".format(key))
      elif label[index][2] == p:
          index += 1
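The one-line change in this hunk makes the bounds check use the function's own `label` parameter instead of the name `labels`. The following is a minimal sketch of that check, not the actual `check_align.py`. It assumes a simplified, hypothetical data layout: `label` is a list of `(start, end, phone)` tuples, and `syllable_phones` stands in for the per-syllable `customed_dic` lookups.

```python
def count_aligned_phones(key, syllable_phones, label):
    """Match the phones of each syllable against the label entries in order.

    `label` is assumed to be a list of (start, end, phone) tuples and
    `syllable_phones` a list of phone lists, one per syllable. Both are
    hypothetical stand-ins for the recipe's real data structures.
    """
    index = 0
    for phns in syllable_phones:
        for p in phns:
            # Bounds check against the per-utterance parameter `label`;
            # the removed line checked `labels` instead.
            if index >= len(label):
                raise ValueError(
                    "Syllables are longer than phones in {}".format(key)
                )
            elif label[index][2] == p:
                index += 1
    return index


if __name__ == "__main__":
    demo_label = [(0.0, 0.1, "n"), (0.1, 0.3, "i"), (0.3, 0.5, "h"), (0.5, 0.7, "ao")]
    print(count_aligned_phones("utt1", [["n", "i"], ["h", "ao"]], demo_label))
```

With one more syllable than the label can cover, the same call raises the `ValueError` shown in the diff.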
2 changes: 1 addition & 1 deletion egs2/TEMPLATE/asr1/scripts/audio/format_score_scp.sh
@@ -82,7 +82,7 @@ if [ -n "${segments}" ]; then
  nj=$((nj<nutt?nj:nutt))

  ${cmd} "JOB=1:${nj}" "${logdir}/format_score_scp.JOB.log" \
-     pyscripts/audio/format_score_scp.py \
+     pyscripts/utils/format_score_scp.py \
      ${opts} \
      "--segment=${logdir}/segments.JOB" \
      "${scp}" "${outdir}/format_score.JOB"
110 changes: 110 additions & 0 deletions egs2/m4singer/svs1/cmd.sh
@@ -0,0 +1,110 @@
# ====== About run.pl, queue.pl, slurm.pl, and ssh.pl ======
# Usage: <cmd>.pl [options] JOB=1:<nj> <log> <command...>
# e.g.
# run.pl --mem 4G JOB=1:10 echo.JOB.log echo JOB
#
# Options:
# --time <time>: Limit the maximum time to execute.
# --mem <mem>: Limit the maximum memory usage.
# --max-jobs-run <njob>: Limit the number of parallel jobs. This is ignored for non-array jobs.
# --num-threads <nthreads>: Specify the number of CPU cores.
# --gpu <ngpu>: Specify the number of GPU devices.
# --config: Change the configuration file from default.
#
# "JOB=1:10" is used for "array jobs" and it can control the number of parallel jobs.
# The string to the left of "=", i.e. "JOB", is replaced by <N> (the Nth job) in the command and the log file name,
# e.g. "echo JOB" becomes "echo 3" for the 3rd job and "echo 8" for the 8th job.
# Note that the job index must start from a positive number, so you can't use "JOB=0:10", for example.
#
# run.pl, queue.pl, slurm.pl, and ssh.pl have a unified interface that does not depend on the backend.
# These options are mapped to backend-specific options,
# which are configured by "conf/queue.conf" and "conf/slurm.conf" by default.
# If jobs fail, your configuration might be wrong for your environment.
#
#
# The official documentation for run.pl, queue.pl, slurm.pl, and ssh.pl:
# "Parallelization in Kaldi": http://kaldi-asr.org/doc/queue.html
# =========================================================~


# Select the backend used by run.sh from "local", "stdout", "sge", "pbs", "slurm", "ssh", or "jhu"
cmd_backend='local'

# Local machine, without any Job scheduling system
if [ "${cmd_backend}" = local ]; then

# The other usage
export train_cmd="run.pl"
# Used for "*_train.py": "--gpu" is appended optionally by run.sh
export cuda_cmd="run.pl"
# Used for "*_recog.py"
export decode_cmd="run.pl"

# Local machine logging to stdout and log file, without any Job scheduling system
elif [ "${cmd_backend}" = stdout ]; then

# The other usage
export train_cmd="stdout.pl"
# Used for "*_train.py": "--gpu" is appended optionally by run.sh
export cuda_cmd="stdout.pl"
# Used for "*_recog.py"
export decode_cmd="stdout.pl"


# "qsub" (Sun Grid Engine, or derivation of it)
elif [ "${cmd_backend}" = sge ]; then
# The default setting is written in conf/queue.conf.
# You must change "-q g.q" for the "queue" for your environment.
# To know the "queue" names, type "qhost -q"
# Note that to use "--gpu *", you have to set up "complex_value" for the system scheduler.

export train_cmd="queue.pl"
export cuda_cmd="queue.pl"
export decode_cmd="queue.pl"


# "qsub" (Torque/PBS.)
elif [ "${cmd_backend}" = pbs ]; then
# The default setting is written in conf/pbs.conf.

export train_cmd="pbs.pl"
export cuda_cmd="pbs.pl"
export decode_cmd="pbs.pl"


# "sbatch" (Slurm)
elif [ "${cmd_backend}" = slurm ]; then
# The default setting is written in conf/slurm.conf.
# You must change "-p cpu" and "-p gpu" for the "partition" for your environment.
# To list the "partition" names, type "sinfo".
# You can use "--gpu * " by default for slurm and it is interpreted as "--gres gpu:*"
# The devices are allocated exclusively using "${CUDA_VISIBLE_DEVICES}".

export train_cmd="slurm.pl"
export cuda_cmd="slurm.pl"
export decode_cmd="slurm.pl"

elif [ "${cmd_backend}" = ssh ]; then
# You have to create ".queue/machines" to specify the host to execute jobs.
# e.g. .queue/machines
# host1
# host2
# host3
# This assumes you can log in to them without a password, i.e. you have to set up SSH keys.

export train_cmd="ssh.pl"
export cuda_cmd="ssh.pl"
export decode_cmd="ssh.pl"

# This is an example of specifying several unique options in the JHU CLSP cluster setup.
# Users can modify/add their own command options according to their cluster environments.
elif [ "${cmd_backend}" = jhu ]; then

export train_cmd="queue.pl --mem 2G"
export cuda_cmd="queue-freegpu.pl --mem 2G --gpu 1 --config conf/queue.conf"
export decode_cmd="queue.pl --mem 4G"

else
echo "$0: Error: Unknown cmd_backend=${cmd_backend}" 1>&2
return 1
fi
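The header comments in `cmd.sh` describe how an array-job spec such as `JOB=1:10` is expanded: the name to the left of `=` is replaced by the job index in both the command and the log file name. The following is a minimal sketch of that substitution as described in those comments. It is not the real `run.pl` logic, and the function name `expand_array_job` is hypothetical.

```python
def expand_array_job(job_spec, log_template, command):
    """Expand a spec like 'JOB=1:3' into one (log_path, argv) pair per job.

    Illustrative only: this mirrors the behaviour described in the cmd.sh
    comments and does not reproduce the actual Perl implementation.
    """
    name, _, index_range = job_spec.partition("=")
    start_text, _, end_text = index_range.partition(":")
    start, end = int(start_text), int(end_text)
    if start < 1:
        # The comments state that the index must start from a positive number.
        raise ValueError("job index must start from a positive number")
    jobs = []
    for n in range(start, end + 1):
        # Replace the job name with the index in the log name and every argument.
        log_path = log_template.replace(name, str(n))
        argv = [arg.replace(name, str(n)) for arg in command]
        jobs.append((log_path, argv))
    return jobs


if __name__ == "__main__":
    for log_path, argv in expand_array_job("JOB=1:3", "echo.JOB.log", ["echo", "JOB"]):
        print(log_path, argv)
```

Under this sketch, `JOB=0:10` is rejected, matching the restriction noted in the comments.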
1 change: 1 addition & 0 deletions egs2/m4singer/svs1/conf/decode.yaml
1 change: 1 addition & 0 deletions egs2/m4singer/svs1/conf/train.yaml
10 changes: 10 additions & 0 deletions egs2/m4singer/svs1/conf/tuning/decode_rnn.yaml
@@ -0,0 +1,10 @@
# This configuration is the decoding setting for FastSpeech or FastSpeech2.

##########################################################
# DECODING SETTING #
##########################################################
# speed_control_alpha: 1 # alpha to control the speed of generated speech
# 1 < alpha makes slower and 1 > alpha makes faster
use_teacher_forcing: false # whether to use teacher forcing
# if true, we use groundtruth of durations
# (+ pitch & energy for FastSpeech2)
75 changes: 75 additions & 0 deletions egs2/m4singer/svs1/conf/tuning/train_naive_rnn.yaml
@@ -0,0 +1,75 @@


##########################################################
# SVS MODEL SETTING #
##########################################################
svs: naive_rnn # model architecture
svs_conf: # keyword arguments for the selected model
    midi_dim: 129 # midi dimension (note number + silence)
    embed_dim: 512 # char or phn embedding dimension
    eprenet_conv_layers: 0 # prenet (from bytesing) conv layers
    eprenet_conv_chans: 256 # prenet (from bytesing) conv channels numbers
    eprenet_conv_filts: 3 # prenet (from bytesing) conv filters size
    elayers: 3 # number of lstm layers in encoder
    eunits: 512 # number of lstm units
    ebidirectional: True # if bidirectional in encoder
    midi_embed_integration_type: add # how to integrate midi information
    dlayers: 5 # number of lstm layers in decoder
    dunits: 1024 # number of lstm units in decoder
    dbidirectional: True # if bidirectional in decoder
    postnet_layers: 5 # number of layers in postnet
    postnet_chans: 512 # number of channels in postnet
    postnet_filts: 5 # filter size of postnet layer
    use_batch_norm: true # whether to use batch normalization in postnet
    reduction_factor: 1 # reduction factor
    eprenet_dropout_rate: 0.2 # prenet dropout rate
    edropout_rate: 0.1 # encoder dropout rate
    ddropout_rate: 0.1 # decoder dropout rate
    postnet_dropout_rate: 0.5 # postnet dropout rate
    init_type: pytorch # parameter initialization
    use_masking: true # whether to apply masking for padded part in loss calculation
    loss_type: L1

# extra module for additional inputs
pitch_extract: dio # pitch extractor type
pitch_extract_conf:
    use_token_averaged_f0: false
pitch_normalize: global_mvn # normalizer for the pitch feature


##########################################################
# OPTIMIZER SETTING #
##########################################################
optim: adam # optimizer type
optim_conf: # keyword arguments for selected optimizer
    lr: 1.0e-03 # learning rate
    eps: 1.0e-06 # epsilon
    weight_decay: 0.0 # weight decay coefficient

##########################################################
# OTHER TRAINING SETTING #
##########################################################
# num_iters_per_epoch: 200 # number of iterations per epoch
max_epoch: 500 # number of epochs
grad_clip: 1.0 # gradient clipping norm
grad_noise: false # whether to use gradient noise injection
accum_grad: 1 # gradient accumulation

batch_type: sorted
batch_size: 16

sort_in_batch: descending # how to sort data in making batch
sort_batch: descending # how to sort created batches
num_workers: 8 # number of workers of data loader
train_dtype: float32 # dtype in training
log_interval: null # log interval in iterations
keep_nbest_models: 2 # number of models to keep
num_att_plot: 3 # number of attention figures to be saved in every check
seed: 0 # random seed number
best_model_criterion:
-   - valid
    - loss
    - min
-   - train
    - loss
    - min
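The `best_model_criterion` and `keep_nbest_models` settings above read as "rank epochs by validation loss and by training loss, lower is better, and keep the two best for each". The following sketch illustrates that reading under stated assumptions. The `history` layout and both function names are hypothetical, and the trainer's real bookkeeping is not shown in this diff.

```python
def select_nbest_epochs(history, criterion, nbest):
    """Return the `nbest` best epoch numbers for one (phase, metric, mode) triple.

    `history` is assumed to map epoch -> {phase: {metric: value}}.
    """
    phase, metric, mode = criterion
    scored = [(stats[phase][metric], epoch) for epoch, stats in history.items()]
    # "min" ranks small values first; "max" ranks large values first.
    scored.sort(reverse=(mode == "max"))
    return [epoch for _, epoch in scored[:nbest]]


def epochs_to_keep(history, criteria, nbest):
    """Union of the n-best epochs over all criteria, in ascending order."""
    keep = set()
    for criterion in criteria:
        keep.update(select_nbest_epochs(history, criterion, nbest))
    return sorted(keep)


if __name__ == "__main__":
    history = {
        1: {"valid": {"loss": 0.9}, "train": {"loss": 1.0}},
        2: {"valid": {"loss": 0.5}, "train": {"loss": 0.8}},
        3: {"valid": {"loss": 0.6}, "train": {"loss": 0.4}},
        4: {"valid": {"loss": 0.7}, "train": {"loss": 0.3}},
    }
    criteria = [("valid", "loss", "min"), ("train", "loss", "min")]
    print(epochs_to_keep(history, criteria, nbest=2))
```

In this toy history, the checkpoint from epoch 1 is the only one that neither criterion retains.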
75 changes: 75 additions & 0 deletions egs2/m4singer/svs1/conf/tuning/train_naive_rnn_dp.yaml
@@ -0,0 +1,75 @@


##########################################################
# SVS MODEL SETTING #
##########################################################
svs: naive_rnn_dp # model architecture
svs_conf: # keyword arguments for the selected model
    midi_dim: 129 # midi dimension (note number + silence)
    embed_dim: 512 # char or phn embedding dimension
    tempo_dim: 500
    eprenet_conv_layers: 0 # prenet (from bytesing) conv layers
    eprenet_conv_chans: 256 # prenet (from bytesing) conv channels numbers
    eprenet_conv_filts: 3 # prenet (from bytesing) conv filters size
    elayers: 3 # number of lstm layers in encoder
    eunits: 256 # number of lstm units
    ebidirectional: True # if bidirectional in encoder
    midi_embed_integration_type: add # how to integrate midi information
    dlayers: 2 # number of lstm layers in decoder
    dunits: 256 # number of lstm units in decoder
    dbidirectional: True # if bidirectional in decoder
    postnet_layers: 5 # number of layers in postnet
    postnet_chans: 512 # number of channels in postnet
    postnet_filts: 5 # filter size of postnet layer
    use_batch_norm: true # whether to use batch normalization in postnet
    reduction_factor: 1 # reduction factor
    eprenet_dropout_rate: 0.2 # prenet dropout rate
    edropout_rate: 0.1 # encoder dropout rate
    ddropout_rate: 0.1 # decoder dropout rate
    postnet_dropout_rate: 0.5 # postnet dropout rate
    init_type: pytorch # parameter initialization
    use_masking: true # whether to apply masking for padded part in loss calculation

# extra module for additional inputs
pitch_extract: dio # pitch extractor type
pitch_extract_conf:
    use_token_averaged_f0: false
pitch_normalize: global_mvn # normalizer for the pitch feature


##########################################################
# OPTIMIZER SETTING #
##########################################################
optim: adam # optimizer type
optim_conf: # keyword arguments for selected optimizer
    lr: 1.0e-03 # learning rate
    eps: 1.0e-06 # epsilon
    weight_decay: 0.0 # weight decay coefficient

##########################################################
# OTHER TRAINING SETTING #
##########################################################
# num_iters_per_epoch: 200 # number of iterations per epoch
max_epoch: 500 # number of epochs
grad_clip: 1.0 # gradient clipping norm
grad_noise: false # whether to use gradient noise injection
accum_grad: 1 # gradient accumulation

batch_type: sorted
batch_size: 16

sort_in_batch: descending # how to sort data in making batch
sort_batch: descending # how to sort created batches
num_workers: 8 # number of workers of data loader
train_dtype: float32 # dtype in training
log_interval: null # log interval in iterations
keep_nbest_models: 2 # number of models to keep
num_att_plot: 3 # number of attention figures to be saved in every check
seed: 0 # random seed number
best_model_criterion:
-   - valid
    - loss
    - min
-   - train
    - loss
    - min
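The batching options in this config (`batch_type: sorted`, `batch_size: 16`, `sort_in_batch`, `sort_batch`) describe grouping utterances of similar length into fixed-size batches and then ordering both the items and the batches. The following is a simplified illustration of that idea, assuming a hypothetical `lengths` mapping from utterance ID to length. It is a sketch, not the sampler ESPnet actually uses.

```python
def make_sorted_batches(lengths, batch_size,
                        sort_in_batch="descending", sort_batch="descending"):
    """Group utterance IDs into length-sorted batches of at most `batch_size`.

    `lengths` is assumed to map utterance ID -> length (e.g. number of frames).
    """
    for option in (sort_in_batch, sort_batch):
        if option not in ("ascending", "descending"):
            raise ValueError("expected 'ascending' or 'descending'")
    # Sort all utterances by length (shortest first) and cut into chunks.
    keys = sorted(lengths, key=lambda k: lengths[k])
    batches = [keys[i:i + batch_size] for i in range(0, len(keys), batch_size)]
    if sort_in_batch == "descending":
        # Longest utterance first inside each batch.
        batches = [batch[::-1] for batch in batches]
    if sort_batch == "descending":
        # Batches holding the longest utterances come first.
        batches.reverse()
    return batches


if __name__ == "__main__":
    demo_lengths = {"a": 5, "b": 2, "c": 9, "d": 7, "e": 1}
    print(make_sorted_batches(demo_lengths, batch_size=2))
```

Keeping similar lengths together reduces padding within a batch, which is the usual motivation for this kind of batching.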