qasr update #4642

Merged (11 commits, Sep 24, 2022)
17 changes: 17 additions & 0 deletions egs2/qasr_tts/tts1/README.MD
@@ -0,0 +1,17 @@
# QASR-TTS RECIPE

- Our goal is to build a character-based TTS system using semi-supervised data selection in a low-resource scenario. We propose two methodologies: the first trains a non-autoregressive (non-AR) model from scratch on a very small amount of data; the second finetunes an autoregressive (AR) model on top of a pre-trained model.

- Step 1: Prepare the data

- Step 2: Download a pretrained model

- Step 3: Replace the token list with the pretrained model's one

- Step 4: Finetune the pre-trained model on our 1-hour dataset, excluding the embedding layer since we are finetuning on a different language

- Step 5: Use the finetuned model as a teacher model to train the non-AR FastSpeech2 model

- Step 6: Train a Parallel WaveGAN vocoder to produce better waveform samples (see the sketch below)
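
The whole pipeline of Steps 1-5 is scripted in `adapt.sh`. A minimal sketch of the intended invocation, assuming the first argument is the download URL of the pretrained model:

```sh
# Runs data preparation, downloads the pretrained AR model, swaps in its
# token list, finetunes the teacher, and extracts durations for FastSpeech2.
# <pretrained_model_url> is a placeholder; point it at the release of
# kan-bayashi/ljspeech_tts_train_transformer_raw_char_tacotron_train.loss.ave.
./adapt.sh <pretrained_model_url>
```

Step 6 (the vocoder) is typically trained with the separate kan-bayashi/ParallelWaveGAN toolkit rather than inside this recipe.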
32 changes: 32 additions & 0 deletions egs2/qasr_tts/tts1/RESULTS.MD
@@ -0,0 +1,32 @@
- Environments

- python version: `3.8.12`

- espnet version: `espnet 0.10.7a1`

- chainer version: `chainer 6.0.0`

- pytorch version: `pytorch 1.10.0`



- Model files

- training config file (teacher model): `./conf/tuning/finetune_transformer.yaml`

- training config file (student model): `./conf/tuning/train_conformer_fastspeech2.yaml`



- Results: CER/WER are measured with our pretrained ASR model trained on MGB2 data (see the `egs/mgb2` recipe)

- FastSpeech2 (Transformer teacher) w/ Parallel WaveGAN (reduction factor R = 1)

- CER: 3.9

- WER: 9.13

- MOS

- intelligibility: 4.4 ± 0.06
- naturalness: 4.2 ± 0.06
32 changes: 32 additions & 0 deletions egs2/qasr_tts/tts1/adapt.sh
@@ -0,0 +1,32 @@
#!/usr/bin/env bash
# Copyright 2021 Massa Baali
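# Usage: ./adapt.sh <pretrained_model_url>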

pretrained_model=$1
# From data preparation to statistics calculation
./run.sh --stage 1 --stop_stage 5

# Download the pretrained model (kan-bayashi/ljspeech_tts_train_transformer_raw_char_tacotron_train.loss.ave)
wget "${pretrained_model}"

# Replace the token list with the pretrained model's one
# ("pretrained_model_dir" below is a placeholder for the directory where the model was unpacked)
pyscripts/utils/make_token_list_from_config.py pretrained_model_dir/exp/ljspeech_tts_train_transformer_raw_char_tacotron/config.yaml
# tokens.txt is created in the model directory
mv dump/token_list/ljspeech_tts_train_transformer_raw_char_tacotron/tokens.{txt,txt.bak}
ln -s pretrained_model_dir/exp/ljspeech_tts_train_transformer_raw_char_tacotron/tokens.txt dump/token_list/new_model

# Train the model
# --init_param format: <path>:<src_key>:<dst_key>:<excluded_keys>;
# the token embedding (tts.enc.embed) is excluded since the language differs.
./run.sh --stage 6 --train_config conf/tuning/finetune_transformer.yaml \
    --train_args "--init_param pretrained_model_dir/exp/tts_train_transformer_raw_char_tacotron/train.loss.ave_5best.pth:::tts.enc.embed" \
    --tag finetune_pretrained_transformers

# Now the trained model above will be used as a teacher model for the Non-AR model FastSpeech2
# Prepare durations file
./run.sh --stage 7 --tts_exp exp/tts_finetune_pretrained_transformers \
--inference_args "--use_teacher_forcing true" \
--test_sets "tr_no_dev dev eval1"

# Since fastspeech2 requires extra feature calculation, run from stage 5.
./run.sh --stage 5 \
--train_config conf/tuning/train_conformer_fastspeech2.yaml \
--teacher_dumpdir exp/tts_finetune_pretrained_transformers/decode_use_teacher_forcingtrue_train.loss.ave \
--tts_stats_dir exp/tts_finetune_pretrained_transformers/decode_use_teacher_forcingtrue_train.loss.ave/stats \
--write_collected_feats true
110 changes: 110 additions & 0 deletions egs2/qasr_tts/tts1/cmd.sh
@@ -0,0 +1,110 @@
# ====== About run.pl, queue.pl, slurm.pl, and ssh.pl ======
# Usage: <cmd>.pl [options] JOB=1:<nj> <log> <command...>
# e.g.
# run.pl --mem 4G JOB=1:10 echo.JOB.log echo JOB
#
# Options:
# --time <time>: Limit the maximum time to execute.
# --mem <mem>: Limit the maximum memory usage.
#   --max-jobs-run <njob>: Limit the number of parallel jobs. This is ignored for non-array jobs.
#   --num-threads <nthreads>: Specify the number of CPU cores per job.
# --gpu <ngpu>: Specify the number of GPU devices.
# --config: Change the configuration file from default.
#
# "JOB=1:10" is used for "array jobs" and it can control the number of parallel jobs.
# The left string of "=", i.e. "JOB", is replaced by <N>(Nth job) in the command and the log file name,
# e.g. "echo JOB" is changed to "echo 3" for the 3rd job and "echo 8" for 8th job respectively.
# Note that the number must start with a positive number, so you can't use "JOB=0:10" for example.
#
# run.pl, queue.pl, slurm.pl, and ssh.pl have a unified interface that does not depend on the backend.
# These options are mapped to backend-specific options; the mapping is
# configured by "conf/queue.conf" and "conf/slurm.conf" by default.
# If jobs fail, your configuration might be wrong for your environment.
#
#
# The official documentation for run.pl, queue.pl, slurm.pl, and ssh.pl:
# "Parallelization in Kaldi": http://kaldi-asr.org/doc/queue.html
# =========================================================
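
# Illustrative example (hypothetical log path and command): after sourcing this
# file, a single-GPU job can be launched with
#   ${cuda_cmd} --gpu 1 exp/tts_train/train.log <training command>
# With cmd_backend='local' this runs on the current machine via run.pl; with
# cmd_backend='slurm' the same line is submitted through sbatch according to
# the mapping in conf/slurm.conf.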


# Select the backend used by run.sh from "local", "stdout", "sge", "slurm", or "ssh"
cmd_backend='local'

# Local machine, without any Job scheduling system
if [ "${cmd_backend}" = local ]; then

# Used for jobs other than the GPU training and decoding below (e.g. feature extraction)
export train_cmd="run.pl"
# Used for "*_train.py": "--gpu" is appended optionally by run.sh
export cuda_cmd="run.pl"
# Used for "*_recog.py"
export decode_cmd="run.pl"

# Local machine logging to stdout and log file, without any Job scheduling system
elif [ "${cmd_backend}" = stdout ]; then

# Used for jobs other than the GPU training and decoding below (e.g. feature extraction)
export train_cmd="stdout.pl"
# Used for "*_train.py": "--gpu" is appended optionally by run.sh
export cuda_cmd="stdout.pl"
# Used for "*_recog.py"
export decode_cmd="stdout.pl"


# "qsub" (Sun Grid Engine, or derivation of it)
elif [ "${cmd_backend}" = sge ]; then
# The default setting is written in conf/queue.conf.
# You must change "-q g.q" for the "queue" for your environment.
# To know the "queue" names, type "qhost -q"
# Note that to use "--gpu *", you have to setup "complex_value" for the system scheduler.

export train_cmd="queue.pl"
export cuda_cmd="queue.pl"
export decode_cmd="queue.pl"


# "qsub" (Torque/PBS.)
elif [ "${cmd_backend}" = pbs ]; then
# The default setting is written in conf/pbs.conf.

export train_cmd="pbs.pl"
export cuda_cmd="pbs.pl"
export decode_cmd="pbs.pl"


# "sbatch" (Slurm)
elif [ "${cmd_backend}" = slurm ]; then
# The default setting is written in conf/slurm.conf.
# You must change "-p cpu" and "-p gpu" for the "partition" for your environment.
# To know the "partion" names, type "sinfo".
# You can use "--gpu * " by default for slurm and it is interpreted as "--gres gpu:*"
# The devices are allocated exclusively using "${CUDA_VISIBLE_DEVICES}".

export train_cmd="slurm.pl"
export cuda_cmd="slurm.pl"
export decode_cmd="slurm.pl"

elif [ "${cmd_backend}" = ssh ]; then
# You have to create ".queue/machines" to specify the hosts on which to execute jobs.
# e.g. .queue/machines
#   host1
#   host2
#   host3
# It is assumed that you can log in to them without a password, i.e. you have to set up SSH keys.

export train_cmd="ssh.pl"
export cuda_cmd="ssh.pl"
export decode_cmd="ssh.pl"

# This is an example of specifying several unique options in the JHU CLSP cluster setup.
# Users can modify/add their own command options according to their cluster environments.
elif [ "${cmd_backend}" = jhu ]; then

export train_cmd="queue.pl --mem 2G"
export cuda_cmd="queue-freegpu.pl --mem 2G --gpu 1 --config conf/queue.conf"
export decode_cmd="queue.pl --mem 4G"

else
echo "$0: Error: Unknown cmd_backend=${cmd_backend}" 1>&2
return 1
fi
1 change: 1 addition & 0 deletions egs2/qasr_tts/tts1/conf/decode.yaml
11 changes: 11 additions & 0 deletions egs2/qasr_tts/tts1/conf/pbs.conf
@@ -0,0 +1,11 @@
# Default configuration
command qsub -V -v PATH -S /bin/bash
option name=* -N $0
option mem=* -l mem=$0
option mem=0 # Do not add anything to qsub_opts
option num_threads=* -l ncpus=$0
option num_threads=1 # Do not add anything to qsub_opts
option num_nodes=* -l nodes=$0:ppn=1
default gpu=0
option gpu=0
option gpu=* -l ngpus=$0
12 changes: 12 additions & 0 deletions egs2/qasr_tts/tts1/conf/queue.conf
@@ -0,0 +1,12 @@
# Default configuration
command qsub -v PATH -cwd -S /bin/bash -j y -l arch=*64*
option name=* -N $0
option mem=* -l mem_free=$0,ram_free=$0
option mem=0 # Do not add anything to qsub_opts
option num_threads=* -pe smp $0
option num_threads=1 # Do not add anything to qsub_opts
option max_jobs_run=* -tc $0
option num_nodes=* -pe mpi $0 # You must set this PE as allocation_rule=1
default gpu=0
option gpu=0
option gpu=* -l gpu=$0 -q g.q
14 changes: 14 additions & 0 deletions egs2/qasr_tts/tts1/conf/slurm.conf
@@ -0,0 +1,14 @@
# Default configuration
command sbatch --export=PATH
option name=* --job-name $0
option time=* --time $0
option mem=* --mem-per-cpu $0
option mem=0
option num_threads=* --cpus-per-task $0
option num_threads=1 --cpus-per-task 1
option num_nodes=* --nodes $0
default gpu=0
option gpu=0 -p cpu
option gpu=* -p gpu --gres=gpu:$0 -c $0  # Recommend allocating at least as many CPUs as GPUs
# note: the --max-jobs-run option is supported as a special case
# by slurm.pl and you don't have to handle it in the config file.
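# Illustrative example (assumed invocation): with the rules above,
#   slurm.pl --mem 4G --gpu 1 JOB=1:2 exp/log/train.JOB.log <command>
# is submitted roughly as
#   sbatch --export=PATH --mem-per-cpu 4G -p gpu --gres=gpu:1 -c 1 ...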
1 change: 1 addition & 0 deletions egs2/qasr_tts/tts1/conf/train.yaml
10 changes: 10 additions & 0 deletions egs2/qasr_tts/tts1/conf/tuning/decode_fastspeech.yaml
@@ -0,0 +1,10 @@
# This configuration is the decoding setting for FastSpeech or FastSpeech2.

##########################################################
# DECODING SETTING #
##########################################################
speed_control_alpha: 1     # alpha to control the speed of generated speech
                           # (alpha > 1 slows the speech down; alpha < 1 speeds it up)
use_teacher_forcing: false # whether to use teacher forcing
# if true, we use groundtruth of durations
# (+ pitch & energy for FastSpeech2)
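
# Illustrative usage (the flag plumbing is assumed; check tts.sh in this
# recipe): to slow generated speech down by ~20%, override alpha at decode
# time instead of editing this file, e.g.
#   ./run.sh --stage 7 --inference_args "--speed_control_alpha 1.2"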
16 changes: 16 additions & 0 deletions egs2/qasr_tts/tts1/conf/tuning/decode_tacotron2.yaml
@@ -0,0 +1,16 @@
# This configuration is the basic decoding setting for Tacotron 2.
# It can also be applied to Transformer. If you encounter problems
# such as deletions or repetitions, it is worthwhile to try
# `use_att_constraint: true` to make the generation more stable.
# Note that the attention constraint is not supported in Transformer.

##########################################################
# DECODING SETTING #
##########################################################
threshold: 0.5 # threshold to stop the generation
maxlenratio: 10.0 # maximum length of generated samples = input length * maxlenratio
minlenratio: 0.0 # minimum length of generated samples = input length * minlenratio
use_att_constraint: true # whether to use the attention constraint introduced in Deep Voice 3
backward_window: 1 # backward window size in the attention constraint
forward_window: 3 # forward window size in the attention constraint
use_teacher_forcing: false # whether to use teacher forcing
78 changes: 78 additions & 0 deletions egs2/qasr_tts/tts1/conf/tuning/finetune_tacotron2.yaml
@@ -0,0 +1,78 @@
# This configuration is for ESPnet2 to finetune Tacotron 2. Compared to the
# original paper, this configuration additionally uses the guided attention
# loss to accelerate the learning of the diagonal attention. It requires
# only a single GPU with 12 GB of memory and takes ~1 day to finish
# training on a Titan V.

##########################################################
# TTS MODEL SETTING #
##########################################################
tts: tacotron2 # model architecture
tts_conf: # keyword arguments for the selected model
embed_dim: 512 # char or phn embedding dimension
elayers: 1 # number of blstm layers in encoder
eunits: 512 # number of blstm units
econv_layers: 3 # number of convolutional layers in encoder
econv_chans: 512 # number of channels in convolutional layer
econv_filts: 5 # filter size of convolutional layer
atype: location # attention function type
adim: 512 # attention dimension
aconv_chans: 32 # number of channels in convolutional layer of attention
aconv_filts: 15 # filter size of convolutional layer of attention
cumulate_att_w: true # whether to cumulate attention weight
dlayers: 2 # number of lstm layers in decoder
dunits: 1024 # number of lstm units in decoder
prenet_layers: 2 # number of layers in prenet
prenet_units: 256 # number of units in prenet
postnet_layers: 5 # number of layers in postnet
postnet_chans: 512 # number of channels in postnet
postnet_filts: 5 # filter size of postnet layer
output_activation: null # activation function for the final output
use_batch_norm: true # whether to use batch normalization in encoder
use_concate: true # whether to concatenate encoder embedding with decoder outputs
use_residual: false # whether to use residual connection in encoder
dropout_rate: 0.5 # dropout rate
zoneout_rate: 0.1 # zoneout rate
reduction_factor: 1 # reduction factor
spk_embed_dim: null # speaker embedding dimension
use_masking: true # whether to apply masking for padded part in loss calculation
bce_pos_weight: 5.0 # weight of positive sample in binary cross entropy calculation
use_guided_attn_loss: true # whether to use guided attention loss
guided_attn_loss_sigma: 0.4 # sigma of guided attention loss
guided_attn_loss_lambda: 1.0 # strength of guided attention loss

##########################################################
# OPTIMIZER SETTING #
##########################################################
optim: adam # optimizer type
optim_conf: # keyword arguments for selected optimizer
lr: 1.0e-04 # learning rate
eps: 1.0e-06 # epsilon
weight_decay: 0.0 # weight decay coefficient

##########################################################
# OTHER TRAINING SETTING #
##########################################################
num_iters_per_epoch: 200 # number of iters per epoch
max_epoch: 100 # number of epochs
grad_clip: 1.0 # gradient clipping norm
grad_noise: false # whether to use gradient noise injection
accum_grad: 1 # gradient accumulation
# batch_bins: 1000000 # batch bins (for feats_type=fbank)
batch_bins: 3750000 # batch bins (for feats_type=raw, *= n_shift / n_mels)
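# e.g. assuming n_shift: 300 and n_mels: 80 (values typical of 24 kHz ESPnet
# feature configs, not verified for this recipe): 1000000 * 300 / 80 = 3750000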
batch_type: numel # how to make batch
sort_in_batch: descending # how to sort data in making batch
sort_batch: descending # how to sort created batches
num_workers: 1 # number of workers of data loader
train_dtype: float32 # dtype in training
log_interval: null # log interval in iterations
keep_nbest_models: 5 # number of models to keep
num_att_plot: 3 # number of attention figures to be saved in every check
seed: 0 # random seed number
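# each entry below is [dataset, metric, min/max], used to rank and keep the n-best checkpoints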
best_model_criterion:
- - valid
- loss
- min
- - train
- loss
- min