Resolve conflicts

espnet · Aug 8, 2023 · 3a82677
2 parents 3bc80ac + ac8b312

Showing 53 changed files with 1,377 additions and 156 deletions.
diff --git a/ci/test_integration_espnet2.sh b/ci/test_integration_espnet2.sh
@@ -280,6 +280,7 @@ gen_dummy_coverage
 echo "==== [ESPnet2] SPK ==="
 ./run.sh --ngpu 0 --stage 0 --stop-stage 4 --feats-type "raw" --python "${python}" --spk-args "--num_workers 0"
 ./run.sh --ngpu 0 --stage 4 --stop-stage 4 --feats-type "raw" --python "${python}" --spk_config conf/train_rawnet3_dataaug_debug.yaml --spk-args "--num_workers 0"
+./run.sh --ngpu 0 --stage 4 --stop-stage 4 --feats-type "raw" --python "${python}" --spk_config conf/train_rawnet3_sampler.yaml --spk-args "--num_workers 0"
 # Remove generated files in order to reduce the disk usage
 rm -rf exp dump data
 cd "${cwd}"

diff --git a/doc/espnet2_training_option.md b/doc/espnet2_training_option.md
@@ -237,7 +237,7 @@ The behavior for batch-size during multi-GPU training is **different from that o
 We adopt variable mini-batch size with considering the dimension of the input features
 to make the best use of the GPU memory.
 
-There are 5 types:
+There are 6 types:
 
 |batch_type|Option to change batch-size|Variable batch-size|Requirement|
 |---|---|---|---|
@@ -246,6 +246,7 @@ There are 5 types:
 |folded|--batch_size|Yes|Length information of features|
 |length|--batch_bins|Yes|Length information of features|
 |numel|--batch_bins|Yes|Shape information of features|
+|catbel|--batch_size|No|-|
 
 Note that **--batch_size is ignored if --batch_type=length or --batch_type=numel**.
 
@@ -405,7 +406,6 @@ i.e. `bins = sum(numel(feat) for feats in batch for feat in feats)`,
 where `numel` returns the infinite product of the shape of each feature;
 `shape[0] * shape[1] * ...`
 
-
 ```bash
 python -m espnet.bin.asr_train \
   --batch_bins 200000 --batch_type numel \
@@ -419,6 +419,29 @@ python -m espnet.bin.asr_train \
   --valid_shape_file "valid_shape2.txt"
 ```
 
+
+### `--batch_type catbel`
+
+This batch type targets classification tasks.
+It guarantees that all samples within a mini-batch belong to different classes.
+`--batch_size` determines the mini-batch size.
+This batch type does not work with the default `sequence` iterator_type;
+it is designed to be used with the `category` iterator_type.
+Therefore, instead of explicitly giving `--batch_type catbel`, we recommend
+giving `--iterator_type category`, which automatically sets `batch_type` to `catbel`.
+It is also important to use a preprocessor that adjusts the sample duration to enable
+mini-batch construction. One example is `espnet2/train/preprocessor/SpkPreprocessor`.
+
+
+```bash
+python -m espnet2.bin.spk_train \
+  --batch_size 256 --iterator_type category \
+  --train_data_path_and_name_and_type "train.scp,feats,npy" \
+  --valid_data_path_and_name_and_type  "valid.scp,feats,npy" \
+  --train_shape_file "train_shape.txt" \
+  --valid_shape_file "valid_shape.txt"
+```
+
 ## Gradient accumulating
 There are several ways to deal with larger model architectures than the capacity of your GPU device memory during training.
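
The guarantee stated for `catbel` in the hunk above (every sample in a mini-batch comes from a different class, with `--batch_size` fixing the batch size) can be illustrated with a small standalone sketch. This is not the ESPnet sampler: the function name `category_balanced_batches`, the `utt2category` mapping, and the choice to drop leftovers that cannot fill a full batch are all assumptions made for illustration.

```python
import random
from collections import defaultdict
from typing import Dict, List


def category_balanced_batches(
    utt2category: Dict[str, str], batch_size: int, seed: int = 0
) -> List[List[str]]:
    """Group utterance keys so that no batch holds two keys of the same category.

    Hypothetical helper for illustration only; leftovers that cannot fill a
    complete batch of distinct categories are dropped.
    """
    rng = random.Random(seed)

    # Bucket utterance keys by category and shuffle inside each bucket.
    buckets = defaultdict(list)
    for key, category in sorted(utt2category.items()):
        buckets[category].append(key)
    for keys in buckets.values():
        rng.shuffle(keys)

    batches = []
    while True:
        # Only categories that still have unused utterances can contribute.
        available = [c for c in sorted(buckets) if buckets[c]]
        if len(available) < batch_size:
            break  # not enough distinct categories left for a full batch
        chosen = rng.sample(available, batch_size)
        batches.append([buckets[c].pop() for c in chosen])
    return batches


if __name__ == "__main__":
    # 20 utterances spread evenly over 5 speakers (toy data).
    toy = {f"utt{i}": f"spk{i % 5}" for i in range(20)}
    for batch in category_balanced_batches(toy, batch_size=4):
        print(batch)
```

Note that this scheme needs at least `batch_size` distinct categories in the data; otherwise no batch can be formed.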
 

diff --git a/egs/dipco/asr1/local/download_data.sh b/egs/dipco/asr1/local/download_data.sh
@@ -9,10 +9,11 @@ if [ ! -e DiPCo ]; then
   echo "$0: downloading DIPCo data (it won't re-download if it was already downloaded.)"
   # the following command won't re-get it if it's already there
   # because of the --continue switch.
-  wget --continue https://s3.amazonaws.com/dipco/DiPCo.tgz || exit 1
-  tar xf "DiPCo.tgz"
+  git clone https://huggingface.co/datasets/huckiyang/DiPCo
+  # Remove .git to reduce data space.
+  rm -rf DiPCo/.git
 else
-  echo "$0: not downloading or un-tarring TEDLIUM_release2 because it already exists."
+  echo "$0: not downloading DiPCo because it already exists."
 fi
 
 

diff --git a/egs2/README.md b/egs2/README.md
@@ -53,7 +53,8 @@ See: https://espnet.github.io/espnet/espnet2_tutorial.html#recipes-using-espnet2
 | fsc_unseen              | Fluent Speech Commands Dataset MASE Eval Unseen splits                                                                           | SLU                     | ENG                   | https://github.com/maseEval/mase                                                                             |              |
 | gigaspeech              | GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio                                          | ASR                     | ENG                   | https://github.com/SpeechColab/GigaSpeech                                                                    |              |
 | googlei18n_lowresource  | Googlei18n crowdsource project                                                                                                   | TTS                     | ENG                   | https://github.com/mirumee/google-i18n-address (most in openslr as separate entries)                         |              |
-| grabo                   | Grabo dataset                                                                                                                    | SLU                     | ENG + NLD             | https://www.esat.kuleuven.be/psi/spraak/downloads/                                                           |              |
+| grabo                   | Grabo dataset                                                                                                                    | SLU                     | ENG + NLD             | https://www.esat.kuleuven.be/psi/spraak/downloads/                                                           |              |
+| gramvaani                   | GramVaani ASR Challenge 2022                                                                                                                     | ASR                     | HI             | https://sites.google.com/view/gramvaaniasrchallenge/dataset                                      |              |
 | harpervalley             | HarperValleyBank: A Domain-Specific Spoken Dialog Corpus                                                                            | SLU                     | ENG                   | https://github.com/cricketclub/gridspace-stanford-harper-valley                                                       |              |
 | hkust                   | HKUST/MTS: A very large scale Mandarin telephone speech corpus                                                                   | ASR                     | CMN                   | https://catalog.ldc.upenn.edu/LDC2005S15                                                                     |              |
 | how2                    | How2: A Large-scale Dataset for Multimodal Language Understanding                                                                | ASR/MT/ST               | ENG->POR              | https://github.com/srvk/how2-dataset                                                                         |              |

diff --git a/egs2/TEMPLATE/asr1/asr.sh b/egs2/TEMPLATE/asr1/asr.sh
@@ -951,12 +951,18 @@ if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ] && ! [[ " ${skip_stages} " =~ [
             log "Error: not supported SOT training for whisper token_list"
             exit 2
         fi
+
+        _opts=""
+        if [ "${token_type}" = "whisper_multilingual" ]; then
+            _opts+=" --language ${lang}"
+        fi
+
         # The first symbol in token_list must be "<blank>" and the last must be also sos/eos:
         # 0 is reserved for CTC-blank for ASR and also used as ignore-index in the other task
         echo ${token_list}
         ${python} -m espnet2.bin.whisper_export_vocabulary  \
             --whisper_model "${token_type}" \
-            --output "${token_list}"
+            --output "${token_list}" ${_opts}
     elif [ "${token_type}" = hugging_face ]; then
         log "Stage 5: Generate hugging_face token_list from ${hugging_face_model_name_or_path}"
 

diff --git a/egs2/TEMPLATE/asr1/db.sh b/egs2/TEMPLATE/asr1/db.sh
@@ -188,6 +188,7 @@ ITAKO=
 NATSUME=
 KIRITAN=
 NAMINE=
+GRAMVAANI=downloads
 
 # For only CMU TIR environment
 if [[ "$(hostname)" == tir* ]]; then

diff --git a/egs2/TEMPLATE/asr1/pyscripts/utils/evaluate_whisper_inference.py b/egs2/TEMPLATE/asr1/pyscripts/utils/evaluate_whisper_inference.py
@@ -7,14 +7,11 @@
 from pathlib import Path
 from typing import Any, Dict, List, Optional, Sequence, Tuple, Union
 
-import numpy as np
 import torch
-import torch.quantization
 import whisper
-from typeguard import check_argument_types, check_return_type
+from typeguard import check_argument_types
 
 from espnet2.fileio.datadir_writer import DatadirWriter
-from espnet2.torch_utils.device_funcs import to_device
 from espnet2.torch_utils.set_all_random_seed import set_all_random_seed
 from espnet2.utils import config_argparse
 from espnet2.utils.nested_dict_action import NestedDictAction
@@ -28,12 +25,14 @@ class Speech2Text:
     def __init__(
         self,
         model_tag: str = "base",
+        model_dir: Optional[str] = "./models",
         device: str = "cpu",
     ):
         assert check_argument_types()
 
-        self.model = whisper.load_model(model_tag).to(device)
-        self.device = device
+        self.model = whisper.load_model(
+            name=model_tag, download_root=model_dir, device=device
+        )
 
     @torch.no_grad()
     def __call__(self, speech: str, **decode_options) -> Optional[str]:
@@ -62,6 +61,7 @@ def inference(
     data_path_and_name_and_type: str,
     key_file: Optional[str],
     model_tag: Optional[str],
+    model_dir: Optional[str],
     allow_variable_data_keys: bool,
     decode_options: Dict,
 ):
@@ -85,6 +85,7 @@ def inference(
     # 2. Build speech2text
     speech2text = Speech2Text(
         model_tag=model_tag,
+        model_dir=model_dir,
         device=device,
     )
 
@@ -152,8 +153,23 @@ def get_parser():
     group.add_argument(
         "--model_tag",
         type=str,
-        help="Pretrained model tag. If specify this option, *_train_config and "
-        "*_file will be overwritten",
+        default="base",
+        choices=[
+            "base.en",
+            "base",
+            "small.en",
+            "small",
+            "medium.en",
+            "medium",
+            "large",
+        ],
+        help="Model tag of the released whisper models.",
+    )
+    group.add_argument(
+        "--model_dir",
+        type=str_or_none,
+        default="./models",
+        help="The directory to download whisper models.",
     )
 
     group = parser.add_argument_group("Decoding options related")

diff --git a/egs2/TEMPLATE/asr1/scripts/utils/evaluate_asr.sh b/egs2/TEMPLATE/asr1/scripts/utils/evaluate_asr.sh
@@ -29,6 +29,7 @@ SECONDS=0
 stage=1
 stop_stage=2
 nj=8
+inference_nj=8
 gpu_inference=false
 fs=16000
 
@@ -37,11 +38,13 @@ model_tag=""
 asr_model_file=""
 lm_file=""
 whisper_tag=""
+whisper_dir=""
 
 # Inference option related configuration
 inference_config=""
 inference_args=""
-decode_options="{task: transcribe}"
+## change the language id according to your dataset
+decode_options="{task: transcribe, language: en, beam_size: 1}"
 
 # Scoring related configuration
 bpemodel=""
@@ -58,6 +61,7 @@ Options:
     --stage          # Processes starts from the specified stage (default="${stage}").
     --stop_stage     # Processes is stopped at the specified stage (default="${stop_stage}").
     --nj             # Number of parallel jobs (default="${nj}").
+    --inference_nj   # Number of parallel jobs in inference (default="${inference_nj}").
     --gpu_inference  # Whether to use gpu in the inference (default="${gpu_inference}").
     --fs             # Sampling rate for ASR model inputs (default="${fs}").
 
@@ -67,6 +71,7 @@ Options:
     --asr_model_file  # ASR model file path in local (default="${asr_model_file}").
     --lm_file         # LM model file path in local (default="${lm_file}").
     --whisper_tag     # Whisper model tag for evaluation with Whisper (default="${whisper_tag}").
+    --whisper_dir     # Directory to download Whisper models to (default="${whisper_dir}").
 
     # Inference related configuration
     --inference_config  # ASR inference configuration file (default="${inference_config}").
@@ -86,13 +91,14 @@ Examples:
     $0 --model_tag <model_tag> wav.scp asr_outputs
 
     # Use pretrained model and perform inference and scoring
-    $0 --model_tag <model_tag> --stop-stage 2 --gt_text /path/to/text wav.scp asr_results
+    $0 --model_tag <model_tag> --stop-stage 3 --gt_text /path/to/text wav.scp asr_results
 
     # Use local model and perform inference and scoring
-    $0 --asr_model_file /path/to/model.pth --stop-stage 2 --gt_text /path/to/text wav.scp asr_results
+    $0 --asr_model_file /path/to/model.pth --stop-stage 3 --gt_text /path/to/text wav.scp asr_results
 
     # Use whisper model and perform inference and scoring
-    $0 --whisper_tag small --stop-stage 2 --gt_text /path/to/text wav.scp asr_results
+    $0 --whisper_tag small --whisper_dir /path/to/download --decode_options "{task: transcribe, language: en}" \
+        --stop-stage 3 --gt_text /path/to/text wav.scp asr_results
 
 EOF
 )
@@ -128,6 +134,7 @@ if ${gpu_inference}; then
     # shellcheck disable=SC2154
     _cmd="${cuda_cmd}"
     _ngpu=1
+    inference_nj=1
 else
     # shellcheck disable=SC2154
     _cmd="${decode_cmd}"
@@ -173,7 +180,7 @@ if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
     # 1. Split the key file
     key_file=${wavscp}
     split_scps=""
-    _nj=$(min "${nj}" "$(wc -l < "${key_file}")")
+    _nj=$(min "${inference_nj}" "$(wc -l < "${key_file}")")
     for n in $(seq "${_nj}"); do
         split_scps+=" ${logdir}/keys.${n}.scp"
     done
@@ -184,13 +191,17 @@ if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
     log "Decoding started... log: '${logdir}/asr_inference.*.log'"
 
     if [ -n "${whisper_tag}" ]; then
+        if [ -z "${whisper_dir}" ]; then
+            whisper_dir=${outdir}/models
+        fi
         # shellcheck disable=SC2046,SC2086
         ${_cmd} --gpu "${_ngpu}" JOB=1:"${_nj}" "${logdir}"/asr_inference.JOB.log \
             python3 pyscripts/utils/evaluate_whisper_inference.py \
                 --ngpu "${_ngpu}" \
                 --data_path_and_name_and_type "${wavscp}" \
                 --key_file "${logdir}"/keys.JOB.scp \
-                --model_tag "${whisper_tag}" \
+                --model_tag "${whisper_tag}" \
+                --model_dir "${whisper_dir}" \
                 --output_dir "${logdir}"/output.JOB \
                 --decode_options "${decode_options}" || { cat $(grep -l -i error "${logdir}"/asr_inference.*.log) ; exit 1; }
     else

diff --git a/egs2/TEMPLATE/spk1/README.md b/egs2/TEMPLATE/spk1/README.md
@@ -0,0 +1,104 @@
+# ESPnet2 Spk1 Recipe TEMPLATE
+
+This is a template of the Spk1 recipe for ESPnet2.
+It follows d-vector style training/inference for speaker verification.
+In other words, it trains a DNN as a closed-set speaker classifier.
+After training, the classification head is removed. The last hidden layer
+(or sometimes another layer) is used as a speaker representation (i.e.,
+speaker embedding) to represent diverse open-set speakers.
+
+## Table of Contents
+
+* [ESPnet2 Spk1 Recipe TEMPLATE](#espnet2-spk1-recipe-template)
+  * [Table of Contents](#table-of-contents)
+  * [Recipe flow](#recipe-flow)
+    * [1\. Data preparation](#1-data-preparation)
+    * [2\. Wav format](#2-wav-format)
+    * [3\. Spk statistics collection](#3-spk-statistics-collection)
+    * [4\. Spk training](#4-spk-training)
+  * [How to run](#how-to-run)
+    * [VoxCeleb Training](#voxceleb-training)
+  * [Related works](#related-works)
+
+## Recipe flow
+
+Spk1 recipe consists of 4 stages.
+
+### 1. Data preparation
+
+Data preparation stage.
+
+#### ESPnet format:
+
+It calls `local/data.sh` to create Kaldi-style data directories in `data/` for training, validation, and evaluation sets. It's the same as `asr1` tasks.
+
+See also:
+- [About Kaldi-style data directory](https://github.com/espnet/espnet/tree/master/egs2/TEMPLATE#about-kaldi-style-data-directory)
+
+### 2. Wav format
+
+Format the wave files in `wav.scp` to a single format (wav / flac / kaldi_ark).
+
+### 3. Spk statistics collection
+
+Statistics calculation stage.
+It collects the shape information of input and output texts for Spk training.
+Currently, it's close to a dummy because we set all utterances to have equal
+duration in the training phase.
+
+### 4. Spk training
+
+Spk model training stage.
+You can change the training setting via `--spk_config` and `--spk_args` options.
+
+See also:
+- [Change the configuration for training](https://espnet.github.io/espnet/espnet2_training_option.html)
+- [Distributed training](https://espnet.github.io/espnet/espnet2_distributed.html)
+
+## How to run
+
+### VoxCeleb Training
+Here, we show the procedure to run the recipe using `egs2/voxceleb/spk1`.
+
+Move to the recipe directory.
+```sh
+$ cd egs2/voxceleb/spk1
+```
+
+Modify `VOXCELEB1`, `VOXCELEB2` variables in `db.sh` if you want to change the download directory.
+```sh
+$ vim db.sh
+```
+
+Modify `cmd.sh` and `conf/*.conf` if you want to use the job scheduler.
+See the details in [using job scheduling system](https://espnet.github.io/espnet/parallelization.html).
+```sh
+$ vim cmd.sh
+```
+
+Run `run.sh`, which conducts all of the stages explained above.
+```sh
+$ ./run.sh
+```
+
+## Related works
+```
+@INPROCEEDINGS{jung2022pushing,
+  title={Pushing the limits of raw waveform speaker recognition},
+  author={Jung, Jee-weon and Kim, You Jin and Heo, Hee-Soo and Lee, Bong-Jin and Kwon, Youngki and Chung, Joon Son},
+  year={2022},
+  booktitle={Proc. INTERSPEECH}
+}
+```
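
The d-vector flow described at the top of the Spk1 README (train a closed-set classifier, drop the classification head, keep a hidden layer as the speaker embedding) can be sketched with a toy model. Everything here is hypothetical and is not ESPnet code: the `TinySpeakerNet` class, its hand-written weights, and the use of cosine similarity to compare two embeddings are assumptions chosen only to make the idea concrete.

```python
import math
from typing import List, Optional, Sequence


class TinySpeakerNet:
    """Toy stand-in for a speaker classifier: one hidden layer plus a head."""

    def __init__(
        self,
        w_hidden: List[List[float]],
        w_head: Optional[List[List[float]]],
    ):
        self.w_hidden = w_hidden  # shape: [hidden_dim][input_dim]
        self.w_head = w_head  # shape: [num_train_speakers][hidden_dim]

    def embed(self, x: Sequence[float]) -> List[float]:
        # Output of the last hidden layer (ReLU); kept as the speaker embedding.
        return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in self.w_hidden]

    def classify(self, x: Sequence[float]) -> List[float]:
        # Closed-set logits over training speakers; only needed during training.
        if self.w_head is None:
            raise RuntimeError("classification head has been removed")
        h = self.embed(x)
        return [sum(w * v for w, v in zip(row, h)) for row in self.w_head]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embeddings (assumed scoring function)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


if __name__ == "__main__":
    net = TinySpeakerNet(w_hidden=[[1.0, 0.0], [0.0, 1.0]], w_head=[[1.0, 1.0]])
    print(net.classify([1.0, 2.0]))  # training-time view: speaker logits
    net.w_head = None  # after training, the head is discarded
    # Verification-time view: compare embeddings of two (possibly unseen) speakers.
    print(cosine(net.embed([1.0, 2.0]), net.embed([2.0, 4.0])))
```

Because only `embed` is used after training, speakers that never appeared in the training set can still be compared.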