Resolve conflicts
Emrys365 committed Aug 8, 2023
2 parents 3bc80ac + ac8b312 commit 3a82677
Showing 53 changed files with 1,377 additions and 156 deletions.
1 change: 1 addition & 0 deletions ci/test_integration_espnet2.sh
@@ -280,6 +280,7 @@ gen_dummy_coverage
echo "==== [ESPnet2] SPK ==="
./run.sh --ngpu 0 --stage 0 --stop-stage 4 --feats-type "raw" --python "${python}" --spk-args "--num_workers 0"
./run.sh --ngpu 0 --stage 4 --stop-stage 4 --feats-type "raw" --python "${python}" --spk_config conf/train_rawnet3_dataaug_debug.yaml --spk-args "--num_workers 0"
./run.sh --ngpu 0 --stage 4 --stop-stage 4 --feats-type "raw" --python "${python}" --spk_config conf/train_rawnet3_sampler.yaml --spk-args "--num_workers 0"
# Remove generated files in order to reduce the disk usage
rm -rf exp dump data
cd "${cwd}"
27 changes: 25 additions & 2 deletions doc/espnet2_training_option.md
@@ -237,7 +237,7 @@ The behavior for batch-size during multi-GPU training is **different from that o
We adopt a variable mini-batch size that takes the dimension of the input features into account
to make the best use of the GPU memory.

There are 5 types:
There are 6 types:

|batch_type|Option to change batch-size|Variable batch-size|Requirement|
|---|---|---|---|
@@ -246,6 +246,7 @@ There are 5 types:
|folded|--batch_size|Yes|Length information of features|
|length|--batch_bins|Yes|Length information of features|
|numel|--batch_bins|Yes|Shape information of features|
|catbel|--batch_size|No|-|

Note that **--batch_size is ignored if --batch_type=length or --batch_type=numel**.

@@ -405,7 +406,6 @@ i.e. `bins = sum(numel(feat) for feats in batch for feat in feats)`,
where `numel` returns the product of the shape of each feature;
`shape[0] * shape[1] * ...`


```bash
python -m espnet.bin.asr_train \
--batch_bins 200000 --batch_type numel \
@@ -419,6 +419,29 @@ python -m espnet.bin.asr_train \
--valid_shape_file "valid_shape2.txt"
```
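As a rough illustration of the `numel` bin count defined above, the following sketch sums the element counts of every feature in a batch. It works on plain shape tuples rather than real tensors, and the helper names (`numel`, `batch_bins`) and the example shapes are assumptions made for illustration, not ESPnet code.

```python
from math import prod


def numel(shape):
    """Return the number of elements of a feature with the given shape."""
    return prod(shape)


def batch_bins(batch):
    """Sum numel over every feature of every sample in the batch."""
    return sum(numel(shape) for feats in batch for shape in feats)


# Hypothetical batch: each sample has a speech feature (frames, dim)
# and a text feature (tokens,).
batch = [
    [(100, 80), (12,)],
    [(250, 80), (30,)],
]
print(batch_bins(batch))
```

With `--batch_type numel`, a batch like this one would be compared against `--batch_bins` when deciding how many samples fit in a mini-batch.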


### `--batch_type catbel`

This batch type targets classification tasks.
It guarantees that within each mini-batch, all samples belong to different classes.
`--batch_size` is used to determine the mini-batch size.
This batch type does not work with the default `sequence` iterator_type.
It is instead designed to be used with the `category` iterator_type.
Therefore, instead of explicitly giving `--batch_type catbel`, it is recommended
to give `--iterator_type category`, which will automatically set `batch_type` to `catbel`.
It is also important to use a preprocessor that adjusts the sample duration to enable
mini-batch construction. One example would be `espnet2/train/preprocessor/SpkPreprocessor`.


```bash
python -m espnet.bin.spk_train \
--batch_size 256 --iterator_type category \
--train_data_path_and_name_and_type "train.scp,feats,npy" \
--valid_data_path_and_name_and_type "valid.scp,feats,npy" \
--train_shape_file "train_shape.txt" \
--valid_shape_file "valid_shape.txt"
```
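The guarantee described above (no class appears twice within a mini-batch) can be sketched as follows. This is a simplified illustration under stated assumptions, not the sampler shipped in ESPnet: the function name `category_balanced_batches`, the `utt2category` mapping, and the policy of keeping smaller trailing batches are all hypothetical.

```python
import random
from collections import defaultdict


def category_balanced_batches(utt2category, batch_size, seed=0):
    """Group utterance IDs so that each mini-batch holds distinct classes."""
    rng = random.Random(seed)

    # Bucket the utterances by class and shuffle inside each bucket.
    by_cat = defaultdict(list)
    for utt, cat in utt2category.items():
        by_cat[cat].append(utt)
    for utts in by_cat.values():
        rng.shuffle(utts)

    batches = []
    while by_cat:
        # Pick up to batch_size distinct classes, then one utterance from each.
        cats = list(by_cat)
        rng.shuffle(cats)
        chosen = cats[:batch_size]
        batches.append([by_cat[c].pop() for c in chosen])
        # Drop classes that have run out of utterances.
        for c in chosen:
            if not by_cat[c]:
                del by_cat[c]
    return batches


# Hypothetical data: 12 utterances spread over 4 speakers.
utt2category = {f"utt{i}": f"spk{i % 4}" for i in range(12)}
for batch in category_balanced_batches(utt2category, batch_size=3):
    print(batch)
```

Because each batch draws at most one utterance per class, the last batches may be smaller than `batch_size` once most classes are exhausted; a real sampler may handle that case differently.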

## Gradient accumulating
There are several ways to deal with larger model architectures than the capacity of your GPU device memory during training.

7 changes: 4 additions & 3 deletions egs/dipco/asr1/local/download_data.sh
@@ -9,10 +9,11 @@ if [ ! -e DiPCo ]; then
echo "$0: downloading DIPCo data (it won't re-download if it was already downloaded.)"
# the following command won't re-get it if it's already there
# because of the --continue switch.
wget --continue https://s3.amazonaws.com/dipco/DiPCo.tgz || exit 1
tar xf "DiPCo.tgz"
git clone https://huggingface.co/datasets/huckiyang/DiPCo
# Remove .git to reduce data space.
rm -rf DiPCo/.git
else
echo "$0: not downloading or un-tarring TEDLIUM_release2 because it already exists."
echo "$0: not downloading or un-tarring DIPCo because it already exists."
fi


3 changes: 2 additions & 1 deletion egs2/README.md
@@ -53,7 +53,8 @@ See: https://espnet.github.io/espnet/espnet2_tutorial.html#recipes-using-espnet2
| fsc_unseen | Fluent Speech Commands Dataset MASE Eval Unseen splits | SLU | ENG | https://github.com/maseEval/mase | |
| gigaspeech | GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio | ASR | ENG | https://github.com/SpeechColab/GigaSpeech | |
| googlei18n_lowresource | Googlei18n crowdsource project | TTS | ENG | https://github.com/mirumee/google-i18n-address (most in openslr as separate entries) | |
| grabo | Grabo dataset | SLU | ENG + NLD | https://www.esat.kuleuven.be/psi/spraak/downloads/ | |
| grabo | Grabo dataset | SLU | ENG + NLD | https://www.esat.kuleuven.be/psi/spraak/downloads/
| gramvaani | GramVaani ASR Challenge 2022 | ASR | HI | https://sites.google.com/view/gramvaaniasrchallenge/dataset | |
| harpervalley | HarperValleyBank: A Domain-Specific Spoken Dialog Corpus | SLU | ENG | https://github.com/cricketclub/gridspace-stanford-harper-valley | |
| hkust | HKUST/MTS: A very large scale Mandarin telephone speech corpus | ASR | CMN | https://catalog.ldc.upenn.edu/LDC2005S15 | |
| how2 | How2: A Large-scale Dataset for Multimodal Language Understanding | ASR/MT/ST | ENG->POR | https://github.com/srvk/how2-dataset | |
8 changes: 7 additions & 1 deletion egs2/TEMPLATE/asr1/asr.sh
@@ -951,12 +951,18 @@ if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ] && ! [[ " ${skip_stages} " =~ [
log "Error: not supported SOT training for whisper token_list"
exit 2
fi

_opts=""
if [ "${token_type}" = "whisper_multilingual" ]; then
_opts+=" --language ${lang}"
fi

# The first symbol in token_list must be "<blank>" and the last must be also sos/eos:
# 0 is reserved for CTC-blank for ASR and also used as ignore-index in the other task
echo ${token_list}
${python} -m espnet2.bin.whisper_export_vocabulary \
--whisper_model "${token_type}" \
--output "${token_list}"
--output "${token_list}" ${_opts}
elif [ "${token_type}" = hugging_face ]; then
log "Stage 5: Generate hugging_face token_list from ${hugging_face_model_name_or_path}"

1 change: 1 addition & 0 deletions egs2/TEMPLATE/asr1/db.sh
100644 → 100755
@@ -188,6 +188,7 @@ ITAKO=
NATSUME=
KIRITAN=
NAMINE=
GRAMVAANI=downloads

# For only CMU TIR environment
if [[ "$(hostname)" == tir* ]]; then
32 changes: 24 additions & 8 deletions egs2/TEMPLATE/asr1/pyscripts/utils/evaluate_whisper_inference.py
@@ -7,14 +7,11 @@
from pathlib import Path
from typing import Any, Dict, List, Optional, Sequence, Tuple, Union

import numpy as np
import torch
import torch.quantization
import whisper
from typeguard import check_argument_types, check_return_type
from typeguard import check_argument_types

from espnet2.fileio.datadir_writer import DatadirWriter
from espnet2.torch_utils.device_funcs import to_device
from espnet2.torch_utils.set_all_random_seed import set_all_random_seed
from espnet2.utils import config_argparse
from espnet2.utils.nested_dict_action import NestedDictAction
@@ -28,12 +25,14 @@ class Speech2Text:
def __init__(
self,
model_tag: str = "base",
model_dir: str = "./models",
device: str = "cpu",
):
assert check_argument_types()

self.model = whisper.load_model(model_tag).to(device)
self.device = device
self.model = whisper.load_model(
name=model_tag, download_root=model_dir, device=device
)

@torch.no_grad()
def __call__(self, speech: str, **decode_options) -> Optional[str]:
@@ -62,6 +61,7 @@ def inference(
data_path_and_name_and_type: str,
key_file: Optional[str],
model_tag: Optional[str],
model_dir: Optional[str],
allow_variable_data_keys: bool,
decode_options: Dict,
):
@@ -85,6 +85,7 @@ def inference(
# 2. Build speech2text
speech2text = Speech2Text(
model_tag=model_tag,
model_dir=model_dir,
device=device,
)

@@ -152,8 +153,23 @@ def get_parser():
group.add_argument(
"--model_tag",
type=str,
help="Pretrained model tag. If specify this option, *_train_config and "
"*_file will be overwritten",
default="base",
choices=[
"base.en",
"base",
"small.en",
"small",
"medium.en",
"medium",
"large",
],
help="Model tag of the released whisper models.",
)
group.add_argument(
"--model_dir",
type=str_or_none,
default="./models",
help="The directory to download whisper models.",
)

group = parser.add_argument_group("Decoding options related")
23 changes: 17 additions & 6 deletions egs2/TEMPLATE/asr1/scripts/utils/evaluate_asr.sh
@@ -29,6 +29,7 @@ SECONDS=0
stage=1
stop_stage=2
nj=8
inference_nj=8
gpu_inference=false
fs=16000

@@ -37,11 +38,13 @@ model_tag=""
asr_model_file=""
lm_file=""
whisper_tag=""
whisper_dir=""

# Inference option related configuration
inference_config=""
inference_args=""
decode_options="{task: transcribe}"
## change the language id according to your dataset
decode_options="{task: transcribe, language: en, beam_size: 1}"

# Scoring related configuration
bpemodel=""
@@ -58,6 +61,7 @@ Options:
--stage # Processing starts from the specified stage (default="${stage}").
--stop_stage # Processing stops at the specified stage (default="${stop_stage}").
--nj # Number of parallel jobs (default="${nj}").
--inference_nj # Number of parallel jobs in inference (default="${inference_nj}").
--gpu_inference # Whether to use gpu in the inference (default="${gpu_inference}").
--fs # Sampling rate for ASR model inputs (default="${fs}").
@@ -67,6 +71,7 @@ Options:
--asr_model_file # ASR model file path in local (default="${asr_model_file}").
--lm_file # LM model file path in local (default="${lm_file}").
--whisper_tag # Whisper model tag for evaluation with Whisper (default="${whisper_tag}").
--whisper_dir # Whisper model directory to download (default="${whisper_dir}").
# Inference related configuration
--inference_config # ASR inference configuration file (default="${inference_config}").
@@ -86,13 +91,14 @@ Examples:
$0 --model_tag <model_tag> wav.scp asr_outputs
# Use pretrained model and perform inference and scoring
$0 --model_tag <model_tag> --stop-stage 2 --gt_text /path/to/text wav.scp asr_results
$0 --model_tag <model_tag> --stop-stage 3 --gt_text /path/to/text wav.scp asr_results
# Use local model and perform inference and scoring
$0 --asr_model_file /path/to/model.pth --stop-stage 2 --gt_text /path/to/text wav.scp asr_results
$0 --asr_model_file /path/to/model.pth --stop-stage 3 --gt_text /path/to/text wav.scp asr_results
# Use whisper model and perform inference and scoring
$0 --whisper_tag small --stop-stage 2 --gt_text /path/to/text wav.scp asr_results
$0 --whisper_tag small --whisper_dir /path/to/download --decode_options "{task: transcribe, language: en}" \
--stop-stage 3 --gt_text /path/to/text wav.scp asr_results
EOF
)
@@ -128,6 +134,7 @@ if ${gpu_inference}; then
# shellcheck disable=SC2154
_cmd="${cuda_cmd}"
_ngpu=1
inference_nj=1
else
# shellcheck disable=SC2154
_cmd="${decode_cmd}"
@@ -173,7 +180,7 @@ if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# 1. Split the key file
key_file=${wavscp}
split_scps=""
_nj=$(min "${nj}" "$(wc -l < "${key_file}")")
_nj=$(min "${inference_nj}" "$(wc -l < "${key_file}")")
for n in $(seq "${_nj}"); do
split_scps+=" ${logdir}/keys.${n}.scp"
done
@@ -184,13 +191,17 @@ if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
log "Decoding started... log: '${logdir}/asr_inference.*.log'"

if [ -n "${whisper_tag}" ]; then
if [ -z "${whisper_dir}" ]; then
whisper_dir=${outdir}/models
fi
# shellcheck disable=SC2046,SC2086
${_cmd} --gpu "${_ngpu}" JOB=1:"${_nj}" "${logdir}"/asr_inference.JOB.log \
python3 pyscripts/utils/evaluate_whisper_inference.py \
--ngpu "${_ngpu}" \
--data_path_and_name_and_type "${wavscp}" \
--key_file "${logdir}"/keys.JOB.scp \
--model_tag "${whisper_tag}" \
--model_tag ${whisper_tag} \
--model_dir ${whisper_dir} \
--output_dir "${logdir}"/output.JOB \
--decode_options "${decode_options}" || { cat $(grep -l -i error "${logdir}"/asr_inference.*.log) ; exit 1; }
else
104 changes: 104 additions & 0 deletions egs2/TEMPLATE/spk1/README.md
@@ -0,0 +1,104 @@
# ESPnet2 Spk1 Recipe TEMPLATE

This is a template of Spk1 recipe for ESPnet2.
It follows d-vector style training/inference for speaker verification.
In other words, it trains a DNN as a closed set speaker classifier.
After training, the classification head is removed. The last hidden layer
(or sometimes another layer) is used as a speaker representation (i.e.,
speaker embedding) to represent diverse open set speakers.
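
As a hedged illustration of how such embeddings are commonly used for verification, the sketch below compares two embeddings with cosine similarity and applies a threshold. The scoring method, the threshold value, and the names `cosine_similarity` and `same_speaker` are assumptions for illustration; this template only covers stages up to training and does not prescribe a scoring method.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def same_speaker(emb_enroll, emb_test, threshold=0.5):
    """Accept the trial if the similarity reaches the (assumed) threshold."""
    return cosine_similarity(emb_enroll, emb_test) >= threshold


# Toy 3-dimensional embeddings; real speaker embeddings are much larger.
enroll = [0.9, 0.1, 0.2]
test_close = [0.8, 0.2, 0.1]
test_far = [-0.7, 0.6, 0.1]
print(same_speaker(enroll, test_close), same_speaker(enroll, test_far))
```

Because the embedding extractor is trained as a closed-set classifier but scored by similarity, it can be applied to speakers that never appeared in the training set.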

## Table of Contents

* [ESPnet2 Spk1 Recipe TEMPLATE](#espnet2-spk1-recipe-template)
  * [Table of Contents](#table-of-contents)
  * [Recipe flow](#recipe-flow)
    * [1\. Data preparation](#1-data-preparation)
    * [2\. Wav format](#2-wav-format)
    * [3\. Spk statistics collection](#3-spk-statistics-collection)
    * [4\. Spk training](#4-spk-training)
  * [How to run](#how-to-run)
    * [VoxCeleb Training](#voxceleb-training)
  * [Related works](#related-works)

## Recipe flow

Spk1 recipe consists of 4 stages.

### 1. Data preparation

Data preparation stage.

#### ESPnet format:

It calls `local/data.sh` to create Kaldi-style data directories in `data/` for training, validation, and evaluation sets. It's the same as `asr1` tasks.

See also:
- [About Kaldi-style data directory](https://github.com/espnet/espnet/tree/master/egs2/TEMPLATE#about-kaldi-style-data-directory)

### 2. Wav format

Format the wave files in `wav.scp` to a single format (wav / flac / kaldi_ark).

### 3. Spk statistics collection

Statistics calculation stage.
It collects the shape information of input and output texts for Spk training.
Currently, this stage is close to a no-op because all utterances are set to an equal
duration in the training phase.

### 4. Spk training

Spk model training stage.
You can change the training setting via `--spk_config` and `--spk_args` options.

See also:
- [Change the configuration for training](https://espnet.github.io/espnet/espnet2_training_option.html)
- [Distributed training](https://espnet.github.io/espnet/espnet2_distributed.html)

## How to run

### VoxCeleb Training
Here, we show the procedure to run the recipe using `egs2/voxceleb/spk1`.

Move to the recipe directory.
```sh
$ cd egs2/voxceleb/spk1
```

Modify `VOXCELEB1`, `VOXCELEB2` variables in `db.sh` if you want to change the download directory.
```sh
$ vim db.sh
```

Modify `cmd.sh` and `conf/*.conf` if you want to use the job scheduler.
See the details in [using job scheduling system](https://espnet.github.io/espnet/parallelization.html).
```sh
$ vim cmd.sh
```

Run `run.sh`, which conducts all of the stages explained above.
```sh
$ ./run.sh
```

## Related works
```
@INPROCEEDINGS{jung2022pushing,
title={Pushing the limits of raw waveform speaker recognition},
author={Jung, Jee-weon and Kim, You Jin and Heo, Hee-Soo and Lee, Bong-Jin and Kwon, Youngki and Chung, Joon Son},
year={2022},
booktitle={Proc. INTERSPEECH}
}
```
