
Commit

Merge pull request #5365 from Jungjee/spk_sampler
ESPnet-SPK: sampler
mergify[bot] committed Aug 7, 2023
2 parents 1409d89 + a66805a commit ac8b312
Showing 16 changed files with 635 additions and 80 deletions.
1 change: 1 addition & 0 deletions ci/test_integration_espnet2.sh
@@ -272,6 +272,7 @@ gen_dummy_coverage
echo "==== [ESPnet2] SPK ==="
./run.sh --ngpu 0 --stage 0 --stop-stage 4 --feats-type "raw" --python "${python}" --spk-args "--num_workers 0"
./run.sh --ngpu 0 --stage 4 --stop-stage 4 --feats-type "raw" --python "${python}" --spk_config conf/train_rawnet3_dataaug_debug.yaml --spk-args "--num_workers 0"
./run.sh --ngpu 0 --stage 4 --stop-stage 4 --feats-type "raw" --python "${python}" --spk_config conf/train_rawnet3_sampler.yaml --spk-args "--num_workers 0"
# Remove generated files in order to reduce the disk usage
rm -rf exp dump data
cd "${cwd}"
27 changes: 25 additions & 2 deletions doc/espnet2_training_option.md
@@ -237,7 +237,7 @@ The behavior for batch-size during multi-GPU training is **different from that o
We adopt a variable mini-batch size that takes the dimensions of the input features into account
to make the best use of the GPU memory.

There are 5 types:
There are 6 types:

|batch_type|Option to change batch-size|Variable batch-size|Requirement|
|---|---|---|---|
@@ -246,6 +246,7 @@ There are 5 types:
|folded|--batch_size|Yes|Length information of features|
|length|--batch_bins|Yes|Length information of features|
|numel|--batch_bins|Yes|Shape information of features|
|catbel|--batch_size|No|-|

Note that **--batch_size is ignored if --batch_type=length or --batch_type=numel**.

@@ -405,7 +406,6 @@ i.e. `bins = sum(numel(feat) for feats in batch for feat in feats)`,
where `numel` returns the product over all dimensions of the shape of each feature:
`shape[0] * shape[1] * ...`


```bash
python -m espnet.bin.asr_train \
--batch_bins 200000 --batch_type numel \
@@ -419,6 +419,29 @@ python -m espnet.bin.asr_train \
--valid_shape_file "valid_shape2.txt"
```
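As an illustrative sketch (not ESPnet's actual code), the bin counting behind `--batch_type numel` can be written in Python, with a hypothetical `count_bins` helper and shape tuples standing in for feature arrays:

```python
from math import prod

def count_bins(batch):
    """Total number of elements over every feature of every sample.

    `batch` is a list of samples; each sample is a list of feature shapes
    (tuples), standing in for the actual feature arrays.
    """
    return sum(prod(shape) for feats in batch for shape in feats)

# Two samples, each with a (frames, dim) feature and a (tokens,) target:
batch = [[(120, 80), (12,)], [(98, 80), (9,)]]
print(count_bins(batch))  # 120*80 + 12 + 98*80 + 9 = 17461
```

Batches are then built so that this total stays below `--batch_bins`, which keeps GPU memory use roughly constant even when sequence lengths vary.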


### `--batch_type catbel`

This batch type is designed for classification tasks.
It guarantees that, within each mini-batch, all samples belong to different classes.
`--batch_size` determines the mini-batch size.
This batch type is not compatible with the default `sequence` iterator type;
it is designed to be used with the `category` iterator type.
Therefore, instead of giving `--batch_type catbel` explicitly, it is recommended
to give `--iterator_type category`, which automatically sets `batch_type` to `catbel`.
It is also important to use a preprocessor that adjusts the sample duration so that
mini-batches can be constructed. One example is `espnet2/train/preprocessor/SpkPreprocessor`.


```bash
python -m espnet.bin.spk_train \
--batch_size 256 --iterator_type category \
--train_data_path_and_name_and_type "train.scp,feats,npy" \
--valid_data_path_and_name_and_type "valid.scp,feats,npy" \
--train_shape_file "train_shape.txt" \
--valid_shape_file "valid_shape.txt"
```
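For illustration, the class-balanced idea behind `catbel` can be sketched with a hypothetical `catbel_batches` helper (this is not the actual ESPnet sampler): each mini-batch draws at most one utterance per class.

```python
import random

def catbel_batches(category2utt, batch_size, seed=0):
    """Yield mini-batches whose samples all belong to distinct classes.

    category2utt maps a class label (e.g. a speaker ID) to its utterances.
    """
    rng = random.Random(seed)
    # Draw one utterance per class, then shuffle the class order.
    picks = [(cat, rng.choice(utts)) for cat, utts in category2utt.items()]
    rng.shuffle(picks)
    # Emit full batches only; each batch holds batch_size distinct classes.
    for i in range(0, len(picks) - batch_size + 1, batch_size):
        yield [utt for _, utt in picks[i:i + batch_size]]

c2u = {"spk1": ["u1", "u2"], "spk2": ["u3"], "spk3": ["u4", "u5"], "spk4": ["u6"]}
for batch in catbel_batches(c2u, batch_size=2):
    print(batch)  # each batch holds utterances from two different speakers
```

In the real trainer, the class-to-utterance mapping comes from the `category2utt` file prepared by the recipe.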

## Gradient accumulating
There are several ways to deal with larger model architectures than the capacity of your GPU device memory during training.

104 changes: 104 additions & 0 deletions egs2/TEMPLATE/spk1/README.md
@@ -0,0 +1,104 @@
# ESPnet2 Spk1 Recipe TEMPLATE

This is a template of the Spk1 recipe for ESPnet2.
It follows d-vector style training/inference for speaker verification.
In other words, it trains a DNN as a closed-set speaker classifier.
After training, the classification head is removed, and the last hidden layer
(or sometimes another layer) is used as a speaker representation (i.e., a
speaker embedding) to represent diverse open-set speakers.
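The d-vector scheme above can be sketched with a toy model (hypothetical layer sizes and helper names, not the actual ESPnet network):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy weights: 40-dim input features, 16-dim embedding, 100 training speakers.
W_hid = rng.standard_normal((40, 16))
W_cls = rng.standard_normal((16, 100))  # classification head

def forward_train(x):
    """Training path: closed-set classification over the known speakers."""
    emb = np.tanh(x @ W_hid)  # last hidden layer
    return emb @ W_cls        # logits over the 100 training speakers

def extract_embedding(x):
    """Inference path: drop the head, keep the hidden activation (d-vector)."""
    return np.tanh(x @ W_hid)

feat = rng.standard_normal(40)
print(extract_embedding(feat).shape)  # (16,)
```

At verification time, embeddings of two utterances are compared (e.g. with cosine similarity), so speakers unseen during training can still be handled.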

## Table of Contents

* [ESPnet2 Spk1 Recipe TEMPLATE](#espnet2-spk1-recipe-template)
  * [Table of Contents](#table-of-contents)
  * [Recipe flow](#recipe-flow)
    * [1\. Data preparation](#1-data-preparation)
    * [2\. Wav format](#2-wav-format)
    * [3\. Spk statistics collection](#3-spk-statistics-collection)
    * [4\. Spk training](#4-spk-training)
  * [How to run](#how-to-run)
    * [VoxCeleb Training](#voxceleb-training)
  * [Related works](#related-works)

## Recipe flow

Spk1 recipe consists of 4 stages.

### 1. Data preparation

Data preparation stage.

#### ESPnet format:

It calls `local/data.sh` to create Kaldi-style data directories in `data/` for the training, validation, and evaluation sets. This is the same as in `asr1` tasks.

See also:
- [About Kaldi-style data directory](https://github.com/espnet/espnet/tree/master/egs2/TEMPLATE#about-kaldi-style-data-directory)

### 2. Wav format

Format the wave files in `wav.scp` to a single format (wav / flac / kaldi_ark).

### 3. Spk statistics collection

Statistics calculation stage.
It collects the shape information of input and output texts for Spk training.
Currently, this stage is close to a no-op, because all utterances are set to
equal duration in the training phase.
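A shape file maps each utterance ID to its feature shape. As a hedged sketch, assuming the `<utt_id> <dim1>,<dim2>,...` format used in other ESPnet2 tasks, a minimal reader might look like:

```python
def read_shape_file(lines):
    """Parse '<utt_id> <dim1>,<dim2>,...' lines into {utt_id: shape tuple}."""
    shapes = {}
    for line in lines:
        utt_id, shape_str = line.split(maxsplit=1)
        shapes[utt_id] = tuple(int(d) for d in shape_str.split(","))
    return shapes

# With fixed-duration training (3 s at 16 kHz), every raw-audio entry
# would carry the same shape, which is why this stage is nearly a no-op:
print(read_shape_file(["utt1 48000", "utt2 48000"]))
```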

### 4. Spk training

Spk model training stage.
You can change the training setting via `--spk_config` and `--spk_args` options.

See also:
- [Change the configuration for training](https://espnet.github.io/espnet/espnet2_training_option.html)
- [Distributed training](https://espnet.github.io/espnet/espnet2_distributed.html)

## How to run

### VoxCeleb Training
Here, we show the procedure to run the recipe using `egs2/voxceleb/spk1`.

Move to the recipe directory.
```sh
$ cd egs2/voxceleb/spk1
```

Modify the `VOXCELEB1` and `VOXCELEB2` variables in `db.sh` if you want to change the download directory.
```sh
$ vim db.sh
```

Modify `cmd.sh` and `conf/*.conf` if you want to use the job scheduler.
See details in [using a job scheduling system](https://espnet.github.io/espnet/parallelization.html).
```sh
$ vim cmd.sh
```

Run `run.sh`, which conducts all of the stages explained above.
```sh
$ ./run.sh
```

## Related works
```
@inproceedings{jung2022pushing,
  title={Pushing the limits of raw waveform speaker recognition},
  author={Jung, Jee-weon and Kim, You Jin and Heo, Hee-Soo and Lee, Bong-Jin and Kwon, Youngki and Chung, Joon Son},
  year={2022},
  booktitle={Proc. INTERSPEECH}
}
```
20 changes: 20 additions & 0 deletions egs2/TEMPLATE/spk1/spk.sh
@@ -191,6 +191,14 @@ if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
if [ "${skip_train}" = false ]; then
utils/copy_data_dir.sh --validate_opts --non-print data/"${train_set}" "${data_feats}/${train_set}"

# copy extra files that are not covered by copy_data_dir.sh
# category2utt will be used by the data sampler
cp data/"${train_set}/spk2utt" "${data_feats}/${train_set}/category2utt"
for x in music noise speech; do
cp data/musan_${x}.scp ${data_feats}/musan_${x}.scp
done
cp data/rirs.scp ${data_feats}/rirs.scp

# shellcheck disable=SC2086
scripts/audio/format_wav_scp.sh --nj "${nj}" --cmd "${train_cmd}" \
--audio-format "${audio_format}" --fs "${fs}" \
@@ -210,6 +218,10 @@ if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# Train can be either multi-column data or not, but valid/test always require multi-column trial
for dset in ${_dsets}; do
utils/copy_data_dir.sh --validate_opts --non-print data/"${dset}" "${data_feats}/${dset}"

# copy extra files that are not covered by copy_data_dir.sh
# category2utt will be used by the data sampler
cp data/"${dset}/spk2utt" "${data_feats}/${dset}/category2utt"
cp data/${dset}/trial_label "${data_feats}/${dset}"

# shellcheck disable=SC2086
@@ -234,6 +246,12 @@ if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
elif [ "${feats_type}" = raw_copy ]; then
if [ "${skip_train}" = false ]; then
utils/copy_data_dir.sh --validate_opts --non-print data/"${train_set}" "${data_feats}/${train_set}"
# category2utt will be used by the data sampler
cp data/"${train_set}/spk2utt" "${data_feats}/${train_set}/category2utt"
for x in music noise speech; do
cp data/musan_${x}.scp ${data_feats}/musan_${x}.scp
done
cp data/rirs.scp ${data_feats}/rirs.scp

echo "${feats_type}" > "${data_feats}/${train_set}/feats_type"
if "${multi_columns_output_wav_scp}"; then
@@ -248,6 +266,8 @@ if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
for dset in ${_dsets}; do
utils/copy_data_dir.sh --validate_opts --non-print data/"${dset}" "${data_feats}/${dset}"
cp data/${dset}/trial_label "${data_feats}/${dset}"
cp data/${dset}/trial.scp "${data_feats}/${dset}"
cp data/${dset}/trial2.scp "${data_feats}/${dset}"

echo "${feats_type}" > "${data_feats}/${dset}/feats_type"
echo "multi_${audio_format}" > "${data_feats}/${dset}/audio_format"
Expand Down
6 changes: 3 additions & 3 deletions egs2/mini_an4/spk1/conf/train_rawnet3_dataaug_debug.yaml
@@ -20,11 +20,11 @@ preprocessor_conf:
target_duration: 3.0 # seconds
sample_rate: 16000
num_eval: 1
rir_scp: data/train_nodev/rirs.scp
rir_scp: dump/raw/rirs.scp
rir_apply_prob: 1.0
noise_info:
- [0.4, "data/train_nodev/noises.scp", [1, 1], [0, 10]]
- [0.5, "data/train_nodev/noises.scp", [1, 2], [10, 20]]
- [0.4, "dump/raw/musan_music.scp", [1, 1], [0, 10]]
- [0.5, "dump/raw/musan_speech.scp", [1, 2], [10, 20]]
noise_apply_prob: 1.0
short_noise_thres: 0.5

43 changes: 43 additions & 0 deletions egs2/mini_an4/spk1/conf/train_rawnet3_sampler.yaml
@@ -0,0 +1,43 @@
# This is a debug config for CI
frontend: raw

encoder: rawnet3
encoder_conf:
model_scale: 4
ndim: 16

pooling: chn_attn_stat
pooling_conf:
input_size: 24

projector: rawnet3
projector_conf:
input_size: 48
output_size: 8

preprocessor: spk
preprocessor_conf:
target_duration: 3.0 # seconds
sample_rate: 16000
num_eval: 2
rir_apply_prob: 0.0
noise_apply_prob: 0.0

model_conf:
extract_feats_in_collect_stats: false

loss: aamsoftmax
loss_conf:
nout: 8
nclasses: 10
margin: 0.3
scale: 15

optim: adam
num_att_plot: 0

max_epoch: 1
num_iters_per_epoch: 1
iterator_type: category
valid_iterator_type: sequence
batch_size: 2
6 changes: 4 additions & 2 deletions egs2/mini_an4/spk1/local/data.sh
@@ -83,8 +83,10 @@ EOF
python local/make_trial.py data/${x}/wav.scp data/${x}
done

find downloads/noise/ -iname "*.wav" | awk '{print "noise" NR " " $1}' > data/${train_set}/noises.scp
find downloads/rirs/ -iname "*.wav" | awk '{print "rir" NR " " $1}' > data/${train_set}/rirs.scp
# mini_an4 is a debug recipe: the same dummy noise files stand in for all three MUSAN categories
find downloads/noise/ -iname "*.wav" | awk '{print "noise" NR " " $1}' > data/musan_music.scp
find downloads/noise/ -iname "*.wav" | awk '{print "noise" NR " " $1}' > data/musan_noise.scp
find downloads/noise/ -iname "*.wav" | awk '{print "noise" NR " " $1}' > data/musan_speech.scp
find downloads/rirs/ -iname "*.wav" | awk '{print "rir" NR " " $1}' > data/rirs.scp
fi


2 changes: 1 addition & 1 deletion egs2/voxceleb/spk1/conf/train_RawNet3.yaml
47 changes: 0 additions & 47 deletions egs2/voxceleb/spk1/conf/tuning/train_RawNet3.yaml

This file was deleted.
