Merge pull request #5365 from Jungjee/spk_sampler

ESPnet-SPK: sampler
espnet · Aug 7, 2023 · ac8b312 · ac8b312
2 parents 1409d89 + a66805a
commit ac8b312

Showing 16 changed files with 635 additions and 80 deletions.
diff --git a/ci/test_integration_espnet2.sh b/ci/test_integration_espnet2.sh
@@ -272,6 +272,7 @@ gen_dummy_coverage
 echo "==== [ESPnet2] SPK ==="
 ./run.sh --ngpu 0 --stage 0 --stop-stage 4 --feats-type "raw" --python "${python}" --spk-args "--num_workers 0"
 ./run.sh --ngpu 0 --stage 4 --stop-stage 4 --feats-type "raw" --python "${python}" --spk_config conf/train_rawnet3_dataaug_debug.yaml --spk-args "--num_workers 0"
+./run.sh --ngpu 0 --stage 4 --stop-stage 4 --feats-type "raw" --python "${python}" --spk_config conf/train_rawnet3_sampler.yaml --spk-args "--num_workers 0"
 # Remove generated files in order to reduce the disk usage
 rm -rf exp dump data
 cd "${cwd}"

diff --git a/doc/espnet2_training_option.md b/doc/espnet2_training_option.md
@@ -237,7 +237,7 @@ The behavior for batch-size during multi-GPU training is **different from that o
 We adopt variable mini-batch size with considering the dimension of the input features
 to make the best use of the GPU memory.
 
-There are 5 types:
+There are 6 types:
 
 |batch_type|Option to change batch-size|Variable batch-size|Requirement|
 |---|---|---|---|
@@ -246,6 +246,7 @@ There are 5 types:
 |folded|--batch_size|Yes|Length information of features|
 |length|--batch_bins|Yes|Length information of features|
 |numel|--batch_bins|Yes|Shape information of features|
+|catbel|--batch_size|No|-|
 
 Note that **--batch_size is ignored if --batch_type=length or --batch_type=numel**.
 
@@ -405,7 +406,6 @@ i.e. `bins = sum(numel(feat) for feats in batch for feat in feats)`,
 where `numel` returns the product of all dimensions in the shape of each feature;
 `shape[0] * shape[1] * ...`
 
-
 ```bash
 python -m espnet.bin.asr_train \
   --batch_bins 200000 --batch_type numel \
@@ -419,6 +419,29 @@ python -m espnet.bin.asr_train \
   --valid_shape_file "valid_shape2.txt"
 ```
 
+
+### `--batch_type catbel`
+
+This batch type targets classification tasks.
+It guarantees that all samples within a mini-batch belong to different classes.
+`--batch_size` determines the mini-batch size.
+This batch type is not meant to be used with the default `sequence` iterator_type;
+it is designed to be used with the `category` iterator_type.
+Therefore, instead of explicitly giving `--batch_type catbel`, it is recommended
+to give `--iterator_type category`, which automatically sets `batch_type` to `catbel`.
+It is also important to use a preprocessor that adjusts the sample duration to enable
+mini-batch construction. One example would be `espnet2/train/preprocessor/SpkPreprocessor`.
+
+
+```bash
+python -m espnet.bin.spk_train \
+  --batch_size 256 --iterator_type category \
+  --train_data_path_and_name_and_type "train.scp,feats,npy" \
+  --valid_data_path_and_name_and_type  "valid.scp,feats,npy" \
+  --train_shape_file "train_shape.txt" \
+  --valid_shape_file "valid_shape.txt"
+```
+
 ## Gradient accumulating
 There are several ways to deal with larger model architectures than the capacity of your GPU device memory during training.
 

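The guarantee described for `catbel` (every sample in a mini-batch comes from a different class) can be illustrated with a minimal Python sketch. This is not the sampler added by this pull request: the function name `make_category_balanced_batches`, its arguments, and the policy of dropping leftover utterances are assumptions made for illustration only.

```python
import random
from typing import Dict, List


def make_category_balanced_batches(
    category2utt: Dict[str, List[str]],
    batch_size: int,
    seed: int = 0,
) -> List[List[str]]:
    """Hypothetical helper: build mini-batches whose utterances all come
    from distinct categories (e.g., speakers)."""
    if batch_size > len(category2utt):
        raise ValueError("batch_size cannot exceed the number of categories")
    rng = random.Random(seed)
    # Shuffle a private copy of each category's utterance list.
    pools = {c: rng.sample(u, len(u)) for c, u in category2utt.items()}
    batches = []
    while True:
        # Categories that still have unused utterances.
        available = [c for c, u in pools.items() if u]
        if len(available) < batch_size:
            # Assumption: leftover utterances are simply dropped.
            break
        # Pick batch_size distinct categories, then one utterance from each.
        chosen = rng.sample(available, batch_size)
        batches.append([pools[c].pop() for c in chosen])
    return batches


if __name__ == "__main__":
    demo = {"spkA": ["a1", "a2"], "spkB": ["b1", "b2"], "spkC": ["c1", "c2"]}
    for batch in make_category_balanced_batches(demo, batch_size=2):
        print(batch)
```

Because one utterance is drawn per chosen category, `batch_size` can never exceed the number of categories, which is why the sketch rejects that case up front.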
diff --git a/egs2/TEMPLATE/spk1/README.md b/egs2/TEMPLATE/spk1/README.md
@@ -0,0 +1,104 @@
+# ESPnet2 Spk1 Recipe TEMPLATE
+
+This is a template of Spk1 recipe for ESPnet2.
+It follows d-vector style training/inference for speaker verification.
+In other words, it trains a DNN as a closed-set speaker classifier.
+After training, the classification head is removed. The last hidden layer
+(or sometimes another layer) is used as a speaker representation (i.e.,
+speaker embedding) to represent diverse open-set speakers.
+
+## Table of Contents
+
+* [ESPnet2 Spk1 Recipe TEMPLATE](#espnet2-spk1-recipe-template)
+  * [Table of Contents](#table-of-contents)
+  * [Recipe flow](#recipe-flow)
+    * [1\. Data preparation](#1-data-preparation)
+    * [2\. Wav format](#2-wav-format)
+    * [3\. Spk statistics collection](#3-spk-statistics-collection)
+    * [4\. Spk training](#4-spk-training)
+  * [How to run](#how-to-run)
+    * [VoxCeleb Training](#voxceleb-training)
+  * [Related works](#related-works)
+
+## Recipe flow
+
+Spk1 recipe consists of 4 stages.
+
+### 1. Data preparation
+
+Data preparation stage.
+
+#### ESPnet format:
+
+It calls `local/data.sh` to create Kaldi-style data directories in `data/` for training, validation, and evaluation sets. It's the same as `asr1` tasks.
+
+See also:
+- [About Kaldi-style data directory](https://github.com/espnet/espnet/tree/master/egs2/TEMPLATE#about-kaldi-style-data-directory)
+
+### 2. Wav format
+
+Format the wave files in `wav.scp` to a single format (wav / flac / kaldi_ark).
+
+### 3. Spk statistics collection
+
+Statistics calculation stage.
+It collects the shape information of input and output texts for Spk training.
+Currently, it's close to a dummy because we set all utterances to have equal
+duration in the training phase.
+
+### 4. Spk training
+
+Spk model training stage.
+You can change the training setting via `--spk_config` and `--spk_args` options.
+
+See also:
+- [Change the configuration for training](https://espnet.github.io/espnet/espnet2_training_option.html)
+- [Distributed training](https://espnet.github.io/espnet/espnet2_distributed.html)
+
+## How to run
+
+### VoxCeleb Training
+Here, we show the procedure to run the recipe using `egs2/voxceleb/spk1`.
+
+Move to the recipe directory.
+```sh
+$ cd egs2/voxceleb/spk1
+```
+
+Modify `VOXCELEB1`, `VOXCELEB2` variables in `db.sh` if you want to change the download directory.
+```sh
+$ vim db.sh
+```
+
+Modify `cmd.sh` and `conf/*.conf` if you want to use a job scheduler.
+See the details in [using job scheduling system](https://espnet.github.io/espnet/parallelization.html).
+```sh
+$ vim cmd.sh
+```
+
+Run `run.sh`, which conducts all of the stages explained above.
+```sh
+$ ./run.sh
+```
+
+## Related works
+```
+@INPROCEEDINGS{jung2022pushing,
+  title={Pushing the limits of raw waveform speaker recognition},
+  author={Jung, Jee-weon and Kim, You Jin and Heo, Hee-Soo and Lee, Bong-Jin and Kwon, Youngki and Chung, Joon Son},
+  year={2022},
+  booktitle={Proc. INTERSPEECH}
+}
+```
diff --git a/egs2/TEMPLATE/spk1/spk.sh b/egs2/TEMPLATE/spk1/spk.sh
@@ -191,6 +191,14 @@ if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
         if [ "${skip_train}" = false ]; then
             utils/copy_data_dir.sh --validate_opts --non-print data/"${train_set}" "${data_feats}/${train_set}"
 
+            # copy extra files that are not covered by copy_data_dir.sh
+            # category2utt will be used by the data sampler
+            cp data/"${train_set}/spk2utt" "${data_feats}/${train_set}/category2utt"
+            for x in music noise speech; do
+                cp data/musan_${x}.scp ${data_feats}/musan_${x}.scp
+            done
+            cp data/rirs.scp ${data_feats}/rirs.scp
+
             # shellcheck disable=SC2086
             scripts/audio/format_wav_scp.sh --nj "${nj}" --cmd "${train_cmd}" \
                 --audio-format "${audio_format}" --fs "${fs}" \
@@ -210,6 +218,10 @@ if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
         # Train can be either multi-column data or not, but valid/test always require multi-column trial
         for dset in ${_dsets}; do
             utils/copy_data_dir.sh --validate_opts --non-print data/"${dset}" "${data_feats}/${dset}"
+
+            # copy extra files that are not covered by copy_data_dir.sh
+            # category2utt will be used by the data sampler
+            cp data/"${train_set}/spk2utt" "${data_feats}/${train_set}/category2utt"
             cp data/${dset}/trial_label "${data_feats}/${dset}"
 
             # shellcheck disable=SC2086
@@ -234,6 +246,12 @@ if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
     elif [ "${feats_type}" = raw_copy ]; then
         if [ "${skip_train}" = false ]; then
             utils/copy_data_dir.sh --validate_opts --non-print data/"${train_set}" "${data_feats}/${train_set}"
+            # category2utt will be used by the data sampler
+            cp data/"${train_set}/spk2utt" "${data_feats}/${train_set}/category2utt"
+            for x in music noise speech; do
+                cp data/musan_${x}.scp ${data_feats}/musan_${x}.scp
+            done
+            cp data/rirs.scp ${data_feats}/rirs.scp
 
             echo "${feats_type}" > "${data_feats}/${train_set}/feats_type"
             if "${multi_columns_output_wav_scp}"; then
@@ -248,6 +266,8 @@ if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
         for dset in ${_dsets}; do
             utils/copy_data_dir.sh --validate_opts --non-print data/"${dset}" "${data_feats}/${dset}"
             cp data/${dset}/trial_label "${data_feats}/${dset}"
+            cp data/${dset}/trial.scp "${data_feats}/${dset}"
+            cp data/${dset}/trial2.scp "${data_feats}/${dset}"
 
             echo "${feats_type}" > "${data_feats}/${dset}/feats_type"
             echo "multi_${audio_format}" > "${data_feats}/${dset}/audio_format"

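Since `category2utt` is a plain copy of `spk2utt`, it follows the Kaldi convention of one category per line followed by its utterance IDs. A minimal sketch of reading such a file into a mapping is shown below; the helper name `parse_category2utt` is hypothetical and is not taken from the ESPnet code base.

```python
from typing import Dict, Iterable, List


def parse_category2utt(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Hypothetical helper: parse lines of the form
    '<category> <utt1> <utt2> ...' into {category: [utt, ...]}."""
    mapping: Dict[str, List[str]] = {}
    for line in lines:
        fields = line.split()
        if not fields:
            # Skip blank lines.
            continue
        category, utts = fields[0], fields[1:]
        # Merge repeated categories instead of overwriting them.
        mapping.setdefault(category, []).extend(utts)
    return mapping


if __name__ == "__main__":
    example = ["spkA uttA1 uttA2", "spkB uttB1"]
    print(parse_category2utt(example))
```

In practice the lines would come from an opened file, e.g. `parse_category2utt(open(path, encoding="utf-8"))`, where `path` points to the copied `category2utt`.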
diff --git a/egs2/mini_an4/spk1/conf/train_rawnet3_dataaug_debug.yaml b/egs2/mini_an4/spk1/conf/train_rawnet3_dataaug_debug.yaml
@@ -20,11 +20,11 @@ preprocessor_conf:
   target_duration: 3.0  # seconds
   sample_rate: 16000
   num_eval: 1
-  rir_scp: data/train_nodev/rirs.scp
+  rir_scp: dump/raw/rirs.scp
   rir_apply_prob: 1.0
   noise_info:
-    - [0.4, "data/train_nodev/noises.scp", [1, 1], [0, 10]]
-    - [0.5, "data/train_nodev/noises.scp", [1, 2], [10, 20]]
+    - [0.4, "dump/raw/musan_music.scp", [1, 1], [0, 10]]
+    - [0.5, "dump/raw/musan_speech.scp", [1, 2], [10, 20]]
   noise_apply_prob: 1.0
   short_noise_thres: 0.5
 

diff --git a/egs2/mini_an4/spk1/conf/train_rawnet3_sampler.yaml b/egs2/mini_an4/spk1/conf/train_rawnet3_sampler.yaml
@@ -0,0 +1,43 @@
+# This is a debug config for CI
+frontend: raw
+
+encoder: rawnet3
+encoder_conf:
+  model_scale: 4
+  ndim: 16
+
+pooling: chn_attn_stat
+pooling_conf:
+  input_size: 24
+
+projector: rawnet3
+projector_conf:
+  input_size: 48
+  output_size: 8
+
+preprocessor: spk
+preprocessor_conf:
+  target_duration: 3.0  # seconds
+  sample_rate: 16000
+  num_eval: 2
+  rir_apply_prob: 0.0
+  noise_apply_prob: 0.0
+
+model_conf:
+  extract_feats_in_collect_stats: false
+
+loss: aamsoftmax
+loss_conf:
+  nout: 8
+  nclasses: 10
+  margin: 0.3
+  scale: 15
+
+optim: adam
+num_att_plot: 0
+
+max_epoch: 1
+num_iters_per_epoch: 1
+iterator_type: category
+valid_iterator_type: sequence
+batch_size: 2
diff --git a/egs2/mini_an4/spk1/local/data.sh b/egs2/mini_an4/spk1/local/data.sh
@@ -83,8 +83,10 @@ EOF
         python local/make_trial.py data/${x}/wav.scp data/${x}
     done
 
-    find downloads/noise/ -iname "*.wav" | awk '{print "noise" NR " " $1}' > data/${train_set}/noises.scp
-    find downloads/rirs/ -iname "*.wav" | awk '{print "rir" NR " " $1}' > data/${train_set}/rirs.scp
+    find downloads/noise/ -iname "*.wav" | awk '{print "noise" NR " " $1}' > data/musan_music.scp
+    find downloads/noise/ -iname "*.wav" | awk '{print "noise" NR " " $1}' > data/musan_noise.scp
+    find downloads/noise/ -iname "*.wav" | awk '{print "noise" NR " " $1}' > data/musan_speech.scp
+    find downloads/rirs/ -iname "*.wav" | awk '{print "rir" NR " " $1}' > data/rirs.scp
 fi
 
 

diff --git a/egs2/voxceleb/spk1/conf/train_RawNet3.yaml b/egs2/voxceleb/spk1/conf/train_RawNet3.yaml
@@ -1 +1 @@
-tuning/train_RawNet3_sgdr_bs.yaml
+./tuning/train_RawNet3_sgdr_bs512_da_sampler_adam.yaml
diff --git a/egs2/voxceleb/spk1/conf/tuning/train_RawNet3.yaml b/egs2/voxceleb/spk1/conf/tuning/train_RawNet3.yaml