[GAN SVS] Add VISinger2, UHifiGAN, Avocodo #5123

Merged: 65 commits, May 23, 2023
09a6e59 add uhifigan (jerryuhoo, Mar 18, 2023)
0f4f19c update parameters and improve compatibility (jerryuhoo, Mar 19, 2023)
a948b13 add avocodo and improve code structure (jerryuhoo, Mar 21, 2023)
db7869a fix avocodo inference (jerryuhoo, Mar 22, 2023)
29e98bf fix compatibility of different vocoders (jerryuhoo, Mar 23, 2023)
e67d66e add visinger2 vocoder (jerryuhoo, Mar 23, 2023)
7c0d234 fix visinger2 vocoder bug (jerryuhoo, Mar 24, 2023)
e9d4994 add visinger2 vocoder discriminator (jerryuhoo, Mar 25, 2023)
b11d74a add teacher forcing inference in visinger (jerryuhoo, Mar 25, 2023)
cfed0fe Fix teacher forcing SVS bug in last commit. (jerryuhoo, Mar 25, 2023)
fd06cd1 Refactor length regulator and fix VISinger bug. (jerryuhoo, Mar 26, 2023)
3cf9852 remove decoder_input_pitch in the last commit (jerryuhoo, Mar 26, 2023)
e0a7118 visinger2 generator draft (jerryuhoo, Mar 26, 2023)
f072f02 add uhifigan avocodo mfd vocoder (jerryuhoo, Mar 27, 2023)
815bdf9 fix visinger2 inference (jerryuhoo, Mar 27, 2023)
308a129 fix uhifigan-avocodo inference (jerryuhoo, Mar 27, 2023)
bd4204d add pisinger draft (jerryuhoo, Mar 27, 2023)
ad7c2cd update pisinger (jerryuhoo, Mar 28, 2023)
3f0b78d fix pisinger inference (jerryuhoo, Mar 28, 2023)
1e8489b Merge branch 'espnet:master' into uhifigan (jerryuhoo, Apr 9, 2023)
05c8545 Merge branch 'espnet:master' into uhifigan (jerryuhoo, Apr 9, 2023)
5c1102b fix loading ying feature (jerryuhoo, Apr 9, 2023)
6e3fd96 update visinger (jerryuhoo, Apr 11, 2023)
e06bb2f fix visinger2 vocoder unit test (jerryuhoo, Apr 16, 2023)
b2b0ea1 Refactor test data into function (jerryuhoo, Apr 16, 2023)
39549aa add unit test for avocodo (jerryuhoo, Apr 17, 2023)
b283814 add unit test for ddsp and uhifigan (jerryuhoo, Apr 17, 2023)
74dc72d update VISinger 2 (jerryuhoo, Apr 17, 2023)
9fea2ae add unit test for flow and phoneme (jerryuhoo, Apr 17, 2023)
f1c3552 Sort imports using isort (jerryuhoo, Apr 17, 2023)
12c62c4 Merge branch 'master' into uhifigan (jerryuhoo, Apr 17, 2023)
0750c0c [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot], Apr 17, 2023)
d103699 fix CI errors (jerryuhoo, Apr 17, 2023)
b8b0b86 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot], Apr 17, 2023)
9f54d1a fix parameter names (jerryuhoo, Apr 17, 2023)
ea07410 Merge branch 'master' into uhifigan (jerryuhoo, Apr 23, 2023)
e3adcb7 Update configs for VISinger 1 and VISinger 2 (jerryuhoo, Apr 23, 2023)
f950d25 Add slur into VISinger model (jerryuhoo, Apr 23, 2023)
6f1b0f8 fix inference (jerryuhoo, Apr 23, 2023)
af14c6d Add slur to unit test (jerryuhoo, Apr 23, 2023)
03553c4 Update comments for gan_svs modules (jerryuhoo, Apr 28, 2023)
5ad654d [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot], Apr 28, 2023)
5caf097 fix MFD parameters (jerryuhoo, Apr 28, 2023)
6762164 update config name and remove unused variables (jerryuhoo, Apr 29, 2023)
99130df update visinger configs (jerryuhoo, May 6, 2023)
3c0c592 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot], May 6, 2023)
63d3530 fix visinger 2 mfd bug (jerryuhoo, May 15, 2023)
1b45bdc fix hifigan weight norm bug (jerryuhoo, May 15, 2023)
264b67e improve uhifigan sine signal expand option (jerryuhoo, May 15, 2023)
627bdf4 clean gan_svs code (jerryuhoo, May 15, 2023)
231f660 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot], May 15, 2023)
2c86f85 fix svs f0 bug (jerryuhoo, May 20, 2023)
2aee662 code clean (jerryuhoo, May 21, 2023)
e378a16 update gan_svs configs (jerryuhoo, May 21, 2023)
fe8adad fix multi-frequency discriminator sample rate bug (jerryuhoo, May 22, 2023)
4e2b95a fix import (ftshijt, May 22, 2023)
6fa5fba fix comment (ftshijt, May 22, 2023)
3e7164a fix comment (ftshijt, May 22, 2023)
5899bcc Add TODOs (jerryuhoo, May 22, 2023)
4d5e56c Unified segment size in avocodo config (jerryuhoo, May 22, 2023)
154f1dc [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot], May 22, 2023)
73f980c update gan_svs config (jerryuhoo, May 22, 2023)
989836f update comments for gan_svs (jerryuhoo, May 22, 2023)
bf3a5aa Merge branch 'master' of https://github.com/ftshijt/espnet into uhifigan (ftshijt, May 23, 2023)
912cfe9 Merge branch 'uhifigan' of https://github.com/jerryuhoo/espnet into u… (ftshijt, May 23, 2023)
23 changes: 22 additions & 1 deletion egs2/TEMPLATE/svs1/svs.sh
@@ -65,9 +65,10 @@ n_shift=256 # The number of shift points.
 win_length=null # Window length.
 score_feats_extract=frame_score_feats # The type of music score feats (frame_score_feats or syllable_score_feats)
 pitch_extract=None
+ying_extract=None
 # Only used for the model using pitch features (e.g. FastSpeech2)
 f0min=80 # Minimum f0 for pitch extraction.
-f0max=400 # Maximum f0 for pitch extraction.
+f0max=800 # Maximum f0 for pitch extraction.
 
 oov="<unk>" # Out of vocabulary symbol.
 blank="<blank>" # CTC blank symbol.
@@ -527,6 +528,9 @@ if ! "${skip_train}"; then
 _opts+="--pitch_extract_conf hop_length=${n_shift} "
 _opts+="--pitch_extract_conf f0max=${f0max} "
 _opts+="--pitch_extract_conf f0min=${f0min} "
+_opts+="--ying_extract ${ying_extract} "
+_opts+="--ying_extract_conf fs=${fs} "
+_opts+="--ying_extract_conf w_step=${n_shift} "
 _opts+="--energy_extract_conf fs=${fs} "
 _opts+="--energy_extract_conf n_fft=${n_fft} "
 _opts+="--energy_extract_conf hop_length=${n_shift} "
@@ -669,6 +673,7 @@ if ! "${skip_train}"; then
 _opts+="--feats_extract_conf hop_length=${n_shift} "
 _opts+="--feats_extract_conf win_length=${win_length} "
 _opts+="--pitch_extract ${pitch_extract} "
+_opts+="--ying_extract ${ying_extract} "
 if [ "${feats_extract}" = fbank ]; then
 _opts+="--feats_extract_conf fs=${fs} "
 _opts+="--feats_extract_conf fmin=${fmin} "
@@ -682,6 +687,10 @@ if ! "${skip_train}"; then
 _opts+="--pitch_extract_conf f0max=${f0max} "
 _opts+="--pitch_extract_conf f0min=${f0min} "
 fi
+if [ "${ying_extract}" = ying ]; then
+    _opts+="--ying_extract_conf fs=${fs} "
+    _opts+="--ying_extract_conf w_step=${n_shift} "
+fi
 
 if [ "${num_splits}" -gt 1 ]; then
 # If you meet a memory error when parsing text files, this option may help you.
@@ -800,6 +809,14 @@ if ! "${skip_train}"; then
 _opts+="--train_data_path_and_name_and_type ${_train_collect_dir}/${_scp},feats,${_type} "
 _opts+="--valid_data_path_and_name_and_type ${_valid_collect_dir}/${_scp},feats,${_type} "
 fi
+if [ -e "${svs_stats_dir}/train/collect_feats/ying.scp" ]; then
+    _scp=ying.scp
+    _type=npy
+    _train_collect_dir=${svs_stats_dir}/train/collect_feats
+    _valid_collect_dir=${svs_stats_dir}/valid/collect_feats
+    _opts+="--train_data_path_and_name_and_type ${_train_collect_dir}/${_scp},ying,${_type} "
+    _opts+="--valid_data_path_and_name_and_type ${_valid_collect_dir}/${_scp},ying,${_type} "
+fi
 
 # Check extra statistics
 if [ -e "${svs_stats_dir}/train/pitch_stats.npz" ]; then
@@ -817,6 +834,10 @@ if ! "${skip_train}"; then
 _opts+="--energy_extract_conf win_length=${win_length} "
 _opts+="--energy_normalize_conf stats_file=${svs_stats_dir}/train/energy_stats.npz "
 fi
+if [ -e "${svs_stats_dir}/train/ying_stats.npz" ]; then
+    _opts+="--ying_extract_conf fs=${fs} "
+    _opts+="--ying_extract_conf w_step=${n_shift} "
+fi
 
 
 # Add X-vector to the inputs if needed
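All of the svs.sh hunks above follow one pattern: conditionally append `--ying_extract` / `--ying_extract_conf` flags to the accumulated option string, reusing `fs` and `n_shift` from the feature front-end. A rough Python paraphrase of that logic (the function name and the `exp/svs_stats` layout are illustrative, not part of the script):

```python
from pathlib import Path


def build_ying_opts(ying_extract: str, fs: int, n_shift: int, stats_dir: str) -> list:
    """Mirror the shell logic: add ying options only when the extractor is enabled."""
    opts = []
    if ying_extract == "ying":
        # Same two conf flags the script appends in each branch.
        opts.append(f"--ying_extract_conf fs={fs}")
        opts.append(f"--ying_extract_conf w_step={n_shift}")
    # The script also reuses collected ying features when the stats stage
    # produced them (the collect_feats/ying.scp existence check).
    scp = Path(stats_dir) / "train" / "collect_feats" / "ying.scp"
    if scp.exists():
        opts.append(f"--train_data_path_and_name_and_type {scp},ying,npy")
    return opts


opts = build_ying_opts("ying", fs=44100, n_shift=512, stats_dir="exp/svs_stats")
```

Note that `w_step` is deliberately tied to `n_shift`, so the ying frame rate matches the other frame-level features.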
@@ -13,12 +13,13 @@
 svs: vits
 svs_conf:
     # generator related
-    generator_type: vits_generator
+    generator_type: visinger
+    vocoder_generator_type: hifigan # hifigan, avocodo, uhifigan, visinger2
    generator_params:
         hidden_channels: 192
         spks: -1
         global_channels: -1
-        segment_size: 32
+        segment_size: 20
         text_encoder_attention_heads: 2
         text_encoder_ffn_expand: 4
         text_encoder_blocks: 6
@@ -40,26 +41,27 @@ svs_conf:
         text_encoder_conformer_kernel_size: -1
         decoder_kernel_size: 7
         decoder_channels: 512
-        decoder_upsample_scales: [8, 8, 2, 2]
-        decoder_upsample_kernel_sizes: [16, 16, 4, 4]
+        decoder_upsample_scales: [8, 8, 4, 2]
+        decoder_upsample_kernel_sizes: [16, 16, 8, 4]
         decoder_resblock_kernel_sizes: [3, 7, 11]
         decoder_resblock_dilations: [[1, 3, 5], [1, 3, 5], [1, 3, 5]]
         use_weight_norm_in_decoder: true
-        posterior_encoder_kernel_size: 5
-        posterior_encoder_layers: 16
+        posterior_encoder_kernel_size: 3
+        posterior_encoder_layers: 8
         posterior_encoder_stacks: 1
         posterior_encoder_base_dilation: 1
         posterior_encoder_dropout_rate: 0.0
         use_weight_norm_in_posterior_encoder: true
-        flow_flows: 4
+        flow_flows: -1 # 4
         flow_kernel_size: 5
         flow_base_dilation: 1
         flow_layers: 4
         flow_dropout_rate: 0.0
         use_weight_norm_in_flow: true
         use_only_mean_in_flow: true
+        use_phoneme_predictor: false
     # discriminator related
-    discriminator_type: hifigan_multi_scale_multi_period_discriminator
+    discriminator_type: visinger2 # avocodo, hifigan_multi_scale_multi_period_discriminator, visinger2, avocodo_plus
     discriminator_params:
         scales: 1
         scale_downsample_pooling: "AvgPool1d"
@@ -73,9 +75,9 @@ svs_conf:
             kernel_sizes: [15, 41, 5, 3]
             channels: 128
             max_downsample_channels: 1024
-            max_groups: 16
+            max_groups: 256
             bias: True
-            downsample_scales: [2, 2, 4, 4, 1]
+            downsample_scales: [4, 4, 4, 4]
             nonlinear_activation: "LeakyReLU"
             nonlinear_activation_params:
                 negative_slope: 0.1
@@ -96,6 +98,14 @@ svs_conf:
                 negative_slope: 0.1
             use_weight_norm: True
             use_spectral_norm: False
+    multi_freq_disc_params:
+        hop_length_factors: [2.5, 5, 7.5, 10, 12.5, 15]
+        hidden_channels: [256, 256, 256, 256, 256]
+        domain: "double"
+        mel_scale: True
+        divisors: [32, 16, 8, 4, 2, 1, 1]
+        strides: [1, 2, 1, 2, 1, 2, 1]
+
     # loss function related
     generator_adv_loss_params:
         average_by_discriminators: false # whether to average loss value by #discriminators
@@ -108,31 +118,34 @@ svs_conf:
        average_by_layers: false # whether to average loss value by #layers of each discriminator
        include_final_outputs: true # whether to include final outputs for loss calculation
     mel_loss_params:
-        fs: 22050 # must be the same as the training data
-        n_fft: 1024 # fft points
-        hop_length: 256 # hop size
-        win_length: null # window length
+        fs: 44100 # must be the same as the training data
+        n_fft: 2048 # fft points
+        hop_length: 512 # hop size
+        win_length: 2048 # window length
         window: hann # window type
         n_mels: 80 # number of Mel basis
         fmin: 0 # minimum frequency for Mel basis
-        fmax: null # maximum frequency for Mel basis
+        fmax: 22050 # maximum frequency for Mel basis
         log_base: null # null represent natural log
     lambda_adv: 1.0 # loss scaling coefficient for adversarial loss
     lambda_mel: 45.0 # loss scaling coefficient for Mel loss
     lambda_feat_match: 2.0 # loss scaling coefficient for feat match loss
     lambda_dur: 0.1 # loss scaling coefficient for duration loss
-    lambda_pitch: 1.0 # loss scaling coefficient for pitch loss
+    lambda_pitch: 10.0 # loss scaling coefficient for pitch loss
     lambda_phoneme: 1.0 # loss scaling coefficient for ctc loss
     lambda_kl: 1.0 # loss scaling coefficient for KL divergence loss
     # others
-    sampling_rate: 22050 # needed in the inference for saving wav
+    sampling_rate: 44100 # needed in the inference for saving wav
     cache_generator_outputs: true # whether to cache generator outputs in the training
 
 # extra module for additional inputs
-pitch_extract: dio # pitch extractor type
+pitch_extract: dio   # pitch extractor type
 pitch_extract_conf:
     use_token_averaged_f0: false
-pitch_normalize: global_mvn # normalizer for the pitch feature
+    use_log_f0: false
+pitch_normalize: None # normalizer for the pitch feature
+
+# ying_extract: ying
 
 ##########################################################
 #             OPTIMIZER & SCHEDULER SETTING              #
@@ -146,7 +159,7 @@ optim_conf:
     weight_decay: 0.0
 scheduler: exponentiallr
 scheduler_conf:
-    gamma: 0.999875
+    gamma: 0.998
 # optimizer setting for discriminator
 optim2: adamw
 optim2_conf:
@@ -156,17 +169,17 @@ optim2_conf:
     weight_decay: 0.0
 scheduler2: exponentiallr
 scheduler2_conf:
-    gamma: 0.999875
+    gamma: 0.998
 generator_first: false # whether to start updating generator first
 
 ##########################################################
 #                OTHER TRAINING SETTING                  #
 ##########################################################
 num_iters_per_epoch: 1000 # number of iterations per epoch
-max_epoch: 600 # number of epochs
+max_epoch: 500 # number of epochs
 accum_grad: 1 # gradient accumulation
-batch_bins: 500000 # batch bins (feats_type=raw)
-batch_type: numel # how to make batch
+batch_size: 8 # batch size
+batch_type: sorted # how to make batch
 grad_clip: -1 # gradient clipping norm
 grad_noise: false # whether to use gradient noise injection
 sort_in_batch: descending # how to sort data in making batch
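The config changes move the recipe from a 22.05 kHz to a 44.1 kHz setup, and the updated numbers hang together. A small sanity-check sketch (values copied from the config above; treating the exponential LR scheduler as stepped once per epoch is an assumption, not something the config states):

```python
import math

# Values from the updated VISinger config in this PR (44.1 kHz setup).
fs = 44100
hop_length = 512
fmax = 22050
decoder_upsample_scales = [8, 8, 4, 2]
gamma = 0.998
max_epoch = 500

# The decoder expands one latent frame into one hop of waveform samples,
# so the product of the upsample scales must equal the hop length.
assert math.prod(decoder_upsample_scales) == hop_length

# fmax for the Mel loss is the Nyquist frequency of 44.1 kHz audio.
assert fmax == fs // 2

frames_per_second = fs / hop_length   # ~86 Mel frames per second
final_lr_factor = gamma ** max_epoch  # overall decay under the per-epoch assumption
print(round(frames_per_second, 1), round(final_lr_factor, 3))
```

This also shows why `decoder_upsample_scales` had to change from `[8, 8, 2, 2]` (product 256, matching the old `hop_length: 256`) to `[8, 8, 4, 2]` (product 512) alongside the hop-size change.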