Merge pull request #5641 from popcornell/c8dasr
CHiME-8 DASR recipe based on CHiME-7 DASR baseline
sw005320 committed Feb 16, 2024
2 parents 332fdc1 + 3ad06d5 commit 7ab5e42
Showing 63 changed files with 5,027 additions and 0 deletions.
85 changes: 85 additions & 0 deletions egs2/chime8_task1/HELP.md
@@ -0,0 +1,85 @@
## <a id="common_issues"> Common Issues </a>

⚠️ If you use `run.pl`, check the GSS logs while it is running and ensure no other processes are using the GPUs.

1. `AssertionError: Torch not compiled with CUDA enabled` <br> PyTorch was installed without CUDA support. <br>
Please reinstall PyTorch with CUDA support as explained on the [PyTorch website](https://pytorch.org/).
2. `ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: 'YOUR_PATH/espnet/tools/venv/lib/python3.9/site-packages/numpy-1.23.5.dist-info/METADATA'`. The numpy installation got corrupted for some reason.
Remove the `site-packages/numpy-` folder manually and reinstall numpy 1.23.5 with pip.
3. `FileNotFoundError: [Errno 2] No such file or directory: 'PATH2YOURESPNET/espnet/tools/venv/bin/sox'` during CHiME-6 generation from CHiME-5 (the `correct_signals_for_clock_drift.py` script): reinstall sox via `conda install -c conda-forge sox`.
4. `ModuleNotFoundError: No module named 's3prl'` s3prl did not install for some reason; run `YOUR_ESPNET_ROOT/tools/installers/install_s3prl.sh`.
5. `Command 'gss' not found` gss did not install for some reason; run `YOUR_ESPNET_ROOT/tools/installers/install_gss.sh`.
6. `wav-reverberate command not found` you need to install Kaldi: go to `YOUR_ESPNET_ROOT/tools/kaldi` and follow the instructions
in `INSTALL`.
7. `WARNING [enhancer.py:245] Out of memory error while processing the batch` you ran out of GPU memory (OOM) while running GSS.
Try lowering the `gss_max_batch_dur` parameter, or `context-duration` in `local/run_gss.sh`
(the latter may degrade results); see `local/run_gss.sh` for more info. It could also be that your GPUs are set to shared mode
and all jobs land on the same GPU; set them to exclusive mode (see the sketch after this list).
8. **Much worse WER than the baseline** while using `run.pl`: **check the GSS results and logs; it is likely that enhancement failed.**
**GSS currently does not work well if you use multi-GPU inference and your GPUs are in shared mode.** Run `nvidia-smi -i X -c 3` for each GPU index `X`, as in the sketch below.
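
The last two items amount to putting every GPU in EXCLUSIVE_PROCESS compute mode. A minimal sketch, assuming `nvidia-smi` is available and you have the required privileges:

```bash
# Set every visible GPU to EXCLUSIVE_PROCESS compute mode (-c 3),
# so each GPU accepts at most one CUDA process at a time.
for i in $(nvidia-smi --query-gpu=index --format=csv,noheader); do
  sudo nvidia-smi -i "$i" -c 3
done
```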

## Number of Utterances for Each Dataset in this Recipe
### Training Set
#### all
- kaldi/train_all_ihm: 175403
- kaldi/train_all_ihm_rvb: 701612
- kaldi/train_all_mdm_ihm: 2150180
- kaldi/train_all_mdm_ihm_rvb: 2851792
- kaldi/train_all_mdm_ihm_rvb_gss: 2914483 (used for training here)
#### chime6

- kaldi/chime6/train/mdm: 1403340
- kaldi/chime6/train/gss: 62691
- kaldi/chime6/train/ihm: 118234
#### mixer6

- kaldi/mixer6/train/mdm: 571437
- kaldi/mixer6/train/ihm: 57169

### Development Set
#### all
- kaldi/dev_ihm_all; kaldi/dev_all_gss: 25121
#### chime6
- kaldi/chime6/dev/gss (used for validation here); kaldi/chime6/dev/ihm: 6644
#### dipco
- kaldi/dipco/dev/gss; kaldi/dipco/dev/ihm: 3673
#### mixer6
- kaldi/mixer6/dev/gss; kaldi/mixer6/dev/ihm: 14804

## Memory Consumption (Useful for SLURM etc.)

Figures kindly reported by Christoph Boeddeker, who ran this baseline code
on the Paderborn Center for Parallel Computing cluster (which uses SLURM).
They should be useful to anyone who uses job schedulers and clusters
where resources are assigned strictly (e.g. the job is killed if it exceeds the
requested memory).

Used as default:
- train: 3G mem
- cuda: 4G mem (1 GPU)
- decode: 4G mem

`scripts/audio/format_wav_scp.sh`:
- Some spikes in the 15 to 17 GB range.

`${python} -m espnet2.bin.${asr_task}_inference${inference_bin_tag}`:
- A few spikes in the 9 to 11 GB range.
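
Based on these figures, memory can be requested per job type through the scheduler wrappers. A sketch assuming a standard `slurm.pl` setup with `conf/slurm.conf` (names and headroom values are assumptions, adjust to your `cmd.sh`):

```bash
# Sketch: per-job-type memory requests for SLURM via slurm.pl.
export train_cmd="slurm.pl --mem 3G"
export cuda_cmd="slurm.pl --mem 4G --gpu 1"
export decode_cmd="slurm.pl --mem 4G"
# scripts/audio/format_wav_scp.sh spikes to ~17 GB, so give those jobs
# extra headroom, e.g. "slurm.pl --mem 20G".
```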


## Using your own Speech Separation Front-End with the pre-trained ASR model

Some suggestions from Naoyuki Kamo (see https://github.com/espnet/espnet/pull/4999). <br>
There are two possible approaches:

1. Start from the output of the baseline GSS enhancement.
   1. (if you are more familiar with Kaldi) copy `data/kaldi/{chime6,dipco,mixer6}/{dev,eval}/gss` using `utils/copy_data_dir.sh` and then substitute the
   file paths in each `wav.scp` manifest with the ones produced by your approach, as in the sketch at the end of this section.
   2. (if you are more familiar with lhotse) copy the `data/lhotse/{chime6,dipco,mixer6}/{dev,eval}/gss` lhotse manifests and then
   replace the recording manifests with the paths to your own recordings.
2. Directly create your own Kaldi or lhotse manifests for ASR decoding, following
either the "style" of the baseline GSS ones or of those belonging to the close-talk mics.

To evaluate the newly enhanced data, e.g. `kaldi/chime6/dev/my_enhanced`, you need to include it in `asr_tt_set` in `run.sh` or
from the command line: `run.sh --stage 3 --asr-tt-set "kaldi/chime6/dev/gss" --decode-train dev --use-pretrained popcornell/chime7_task1_asr1_baseline --asr-dprep-stage 4`.
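
A minimal sketch of the Kaldi-style route (the `my_enhanced` name and the audio paths are illustrative):

```bash
# Clone the baseline GSS data dir, then point wav.scp at your own audio.
src=data/kaldi/chime6/dev/gss
dst=data/kaldi/chime6/dev/my_enhanced
utils/copy_data_dir.sh "$src" "$dst"
# wav.scp maps <utterance-id> -> <audio path>: keep the ids, swap the paths.
awk '{print $1, "/path/to/my_enhanced_audio/" $1 ".wav"}' "$src/wav.scp" > "$dst/wav.scp"
utils/validate_data_dir.sh --no-feats "$dst"  # sanity check
```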
35 changes: 35 additions & 0 deletions egs2/chime8_task1/INSTALL.md
@@ -0,0 +1,35 @@
### <a id="espnet_installation"> Installation </a>

Firstly, clone ESPNet. <br/>
```bash
git clone https://github.com/espnet/espnet/
```
Next, ESPNet must be installed; go to `espnet/tools`. <br />
This will install a new environment called espnet with Python 3.9.2:
```bash
cd espnet/tools
./setup_anaconda.sh venv "" 3.9.2
```
Activate this new environment.
```bash
source ./venv/bin/activate
```
Then install ESPNet with PyTorch 1.13.1; be sure to specify the correct **cudatoolkit** version for your system.
```bash
make TH_VERSION=1.13.1 CUDA_VERSION=11.6
```
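Optionally, run a quick sanity check that PyTorch was built with CUDA support (this catches issue 1 in [HELP.md](HELP.md) early):
```bash
# Should print e.g. "1.13.1 True"; "False" means no CUDA support.
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```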
If you plan to train the ASR model, you will need to compile Kaldi; otherwise you can
skip this step. Go to the `kaldi` directory and follow the instructions in `INSTALL`.
```bash
cd kaldi
cat INSTALL
```
Finally, go to this recipe's **asr1 folder** and install the other baseline-required packages (e.g. lhotse) using this script:
```bash
cd ../egs2/chime8_task1/asr1
./local/install_dependencies.sh
```
You should be good to go!

⚠️ If you encounter any problem, have a look at [HELP.md](HELP.md). <br>
Or reach out to us; see [README.md](./README.md).
218 changes: 218 additions & 0 deletions egs2/chime8_task1/README.md
@@ -0,0 +1,218 @@
# CHiME-8 DASR (CHiME-8 Task 1)

### Distant Automatic Speech Transcription with Multiple Devices in Diverse Scenarios
[![Twitter](https://img.shields.io/twitter/url/https/twitter.com/chimechallenge.svg?style=social&label=Follow%20%40chimechallenge)](https://twitter.com/chimechallenge)
[![Slack][slack-badge]][slack-invite]
---


#### 📢 If you want to participate, see the [official challenge website](https://www.chimechallenge.org/current/task1/index) for registration.


### <a id="reach_us">Any Question/Problem ? Reach us !</a>

If you are considering participating, or just want to learn more, please join the <a href="https://groups.google.com/g/chime5/">CHiME Google Group</a>. <br>
We also have a [CHiME Slack Workspace][slack-invite]; join the `chime-8-dasr` channel there, or contact us directly.<br>
We also have a [Troubleshooting page](./HELP.md).


## DASR Data Download and Generation

Data generation is handled here using [chime-utils](https://github.com/chimechallenge/chime-utils). <br>
If you are **only interested in obtaining the data** you should use [chime-utils](https://github.com/chimechallenge/chime-utils) directly. <br>

Data downloading and generation is done automatically in stage 0 of this recipe; you can skip it if you already have the data. <br>
Note that Mixer 6 Speech has to be obtained via LDC. See [official challenge website](https://www.chimechallenge.org/current/task1/data). <br>
CHiME-6, DiPCo and NOTSOFAR1 will be downloaded automatically.
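
If you use [chime-utils](https://github.com/chimechallenge/chime-utils) standalone, generation might look like the following (a sketch; verify the exact arguments with `chime-utils dgen --help`):

```bash
# Download CHiME-6, DiPCo and NOTSOFAR1 and generate all CHiME-8 DASR data
# (Mixer 6 must already be obtained from LDC and unpacked).
chime-utils dgen dasr /your/path/to/download /your/path/to/mixer6_root \
    /path/to/chime8_dasr --part "train,dev"
```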
## System Description

<img src="https://www.chimechallenge.org/challenges/chime7/task1/images/baseline.png" width="450" height="120" />

The system here is effectively the same as the one used for the CHiME-7 DASR Challenge (except for some minor changes). <br>
It is described in detail in the [CHiME-7 DASR paper](https://arxiv.org/abs/2306.13734) and the [website of the previous challenge](https://www.chimechallenge.org/challenges/chime7/task1/baseline). <br>
The system consists of:
1. a diarization component based on the [Pyannote diarization pipeline 2.0](https://huggingface.co/pyannote/speaker-diarization)
   - this is in the `diar_asr1` folder
2. envelope-variance channel selection [4] + guided source separation [2] + a WavLM-based ASR model [1]
   - this is in the `asr1` folder


#### <a id="whatisnew">What is new compared to CHiME-7 DASR Baseline ? </a>

- GSS is now much more memory-efficient, see https://github.com/desh2608/gss/pull/39 (many thanks to Christoph Boeddeker).
- We raised the clustering threshold for the pre-trained Pyannote EEND segmentation model and raised the maximum number of speakers to 8 to handle NOTSOFAR1.
- Some bugs have been fixed.

## 📊 Results

As explained on the [official challenge website](https://www.chimechallenge.org/current/task1/index), this year
systems will be ranked according to the macro tcpWER [5] across the 4 scenarios (5 s collar). <br>
The 4 scenarios featured this year are very diverse ([see the website for statistics](https://www.chimechallenge.org/current/task1/index)), and this diversity
significantly complicates speaker counting.
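
For reference, computing the ranking metric with [MeetEval](https://github.com/fgnt/meeteval) [5] looks roughly like this (a sketch; the STM file names are placeholders):

```bash
# Time-constrained per-speaker WER (tcpWER) with a 5 s collar.
meeteval-wer tcpwer -r reference.stm -h hypothesis.stm --collar 5
```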


```bash
+-----+--------------+--------------+----------+----------+--------------+-------------+-----------------+------------------+------------------+------------------+
| | session_id | error_rate | errors | length | insertions | deletions | substitutions | missed_speaker | falarm_speaker | scored_speaker |
|-----+--------------+--------------+----------+----------+--------------+-------------+-----------------+------------------+------------------+------------------|
| dev | chime6 | 0.825381 | 52070 | 63086 | 12747 | 29466 | 9857 | 0 | 5 | 8 |
| dev | mixer6 | 0.287729 | 26621 | 92521 | 4882 | 8809 | 12930 | 0 | 24 | 70 |
| dev | dipco | 0.674161 | 11574 | 17168 | 3066 | 5563 | 2945 | 0 | 2 | 8 |
| dev | notsofar1 | 0.508768 | 90660 | 178195 | 14872 | 55195 | 20593 | 105 | 7 | 592 |
+-----+--------------+--------------+----------+----------+--------------+-------------+-----------------+------------------+------------------+------------------+
###############################################################################
### Macro-Averaged tcpWER for across all Scenario (Ranking Metric) ############
###############################################################################
+-----+--------------+
| | error_rate |
|-----+--------------|
| dev | 0.57401 |
+-----+--------------+
```

## Reproducing the Baseline

⚠️ **GSS currently does not work well if you use multi-GPU inference and your GPUs are in shared mode.** <br>
If you use `run.pl`, please set your GPUs to EXCLUSIVE_PROCESS with `nvidia-smi -i X -c 3`, where `X` is the GPU index.

### Inference-only

If you want to perform inference with the pre-trained models:
- ASR ([HF repo](https://huggingface.co/popcornell/chime7_task1_asr1_baseline))
- Pyannote Segmentation ([HF repo](https://huggingface.co/popcornell/chime7_task1_asr1_baseline))


By default, the scripts hereafter will perform inference on the dev set of all 4 scenarios: CHiME-6, DiPCo, Mixer 6 and NOTSOFAR1. <br>
To restrict inference, e.g. to only CHiME-6 and DiPCo, you can pass these options:

```bash
--gss-dsets "chime6_dev,dipco_dev" --asr-tt-set "kaldi/chime6/dev/gss kaldi/dipco/dev/gss"
```

#### Full-System (Diarization+GSS+ASR)


Go to `diar_asr1`:
```bash
cd diar_asr1
```
If you have already generated the data via [chime-utils](https://github.com/chimechallenge/chime-utils) and the data is in `/path/to/chime8_dasr`:
```bash
./run.sh --chime8-root /path/to/chime8_dasr --stage 1 --ngpu YOUR_NUMBER_OF_GPUs \
--use-pretrained popcornell/chime7_task1_asr1_baseline \
--run-on dev
```
If you have not generated the data yet:
CHiME-6, DiPCo and NOTSOFAR1 will be downloaded automatically; ensure you have ~1 TB of free space in a path of your choice `/your/path/to/download`. <br>
Mixer 6 Speech has to be obtained via LDC and unpacked in a directory of your choice `/your/path/to/mixer6_root`. <br>
Data will be generated in `/your/path/to/chime8_dasr`; again, choose the most convenient location for you.


```bash
./run.sh --chime8-root /path/to/chime8_dasr \
--download-dir /your/path/to/download \
--mixer6-root /your/path/to/mixer6_root \
--stage 0 --ngpu YOUR_NUMBER_OF_GPUs \
--use-pretrained popcornell/chime7_task1_asr1_baseline \
--run-on dev
```

You can use the `--stage` and `--gss-asr-stage` args to resume inference from any step.

#### GSS+ASR only with Oracle Diarization (or your own diarizer's output)

We also provide a GSS + ASR only script, to be used with oracle diarization
or with your own diarizer's output if you wish to work only on diarization. <br>
We assume here that you have already generated the data, and we start from stage 1.

If you want to use oracle diarization, go to `asr1`:

```bash
cd asr1

./run.sh --chime8-root /path/to/chime8_dasr --stage 1 --ngpu YOUR_NUMBER_OF_GPUs \
--use-pretrained popcornell/chime7_task1_asr1_baseline \
--run-on dev
```

If you want to use your custom diarization, go to `diar_asr1`:
```bash
cd diar_asr1

./run.sh --chime8-root /path/to/chime8_dasr --stage 3 --ngpu YOUR_NUMBER_OF_GPUs \
--use-pretrained popcornell/chime7_task1_asr1_baseline \
--run-on dev --diarization-dir /path/to/your/diarization/output
```

It is assumed that your diarizer produces JSON manifests (in the same format as the CHiME-8 DASR annotation; see the [data page]())
and that these manifests are in `/path/to/your/diarization/output`. <br>
`/path/to/your/diarization/output` should have this structure (the `.rttm` files can be ignored):

```
├── chime6
│   └── dev
│   ├── S02.json
│   ├── S02.rttm
│   ├── S09.json
│   └── S09.rttm
├── dipco
│   └── dev
│   ├── S28.json
│   ├── S28.rttm
│   ├── S29.json
│   └── S29.rttm
├── mixer6
......
├── notsofar1
```
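
Each per-session JSON is expected to contain a list of segments in the CHiME-style annotation format. A hedged sketch (the field names are assumptions based on CHiME-6-style annotation; the data page is authoritative):

```bash
# What one per-session diarization manifest might look like:
cat /path/to/your/diarization/output/chime6/dev/S02.json
# [
#   {"session_id": "S02", "speaker": "P05", "start_time": "10.21", "end_time": "12.43"},
#   {"session_id": "S02", "speaker": "P06", "start_time": "12.80", "end_time": "15.10"},
#   ...
# ]
```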


### Training the ASR model

We assume here you have already generated the data and start from stage 1.
If you want to retrain the ASR model, go to `asr1` and choose a name for the new model:

```bash
cd asr1

./run.sh --chime8-root /path/to/chime8_dasr --stage 1 --ngpu YOUR_NUMBER_OF_GPUs \
--run-on train --asr-tag YOUR_NEW_ASR_NAME
```

You can use the `--stage`, `--asr-stage` and `--asr-dprep-stage` args to resume training from any step.

### Fine-Tuning the Pyannote Segmentation Model

We assume here you have already generated the data and start from stage 1.
If you want to fine-tune the segmentation model, go to `diar_asr1` and choose a name for the new model:

```bash
cd diar_asr1

./run.sh --chime8-root /path/to/chime8_dasr --stage 1 --ngpu YOUR_NUMBER_OF_GPUs \
--pyan-ft 1
```

Note that the data preparation for the fine-tuning is done in `diar_asr1/local/pyannote_dprep.py`,
and you also have to set up `diar_asr1/local/database.yml` properly to use your own data. <br>
See the [pyannote documentation](https://github.com/pyannote/pyannote-audio/blob/develop/tutorials/training_a_model.ipynb) for more info.
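
For orientation, a `database.yml` might look like the following sketch (the database/protocol names and paths are assumptions; the pyannote.database documentation defines the actual schema):

```bash
# Sketch: write a minimal pyannote.database config (names/paths are placeholders).
cat > diar_asr1/local/database.yml << 'EOF'
Databases:
  MyCHiME8: /path/to/audio/{uri}.wav
Protocols:
  MyCHiME8:
    SpeakerDiarization:
      finetune:
        train:
          uri: lists/train.lst
          annotation: rttms/train.rttm
          annotated: uems/train.uem
EOF
```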

You can use the `--stage` and `--gss-asr-stage` args to resume the run from any step.

## Acknowledgements

We would like to thank Naoyuki Kamo for his invaluable help, and Christoph Boeddeker for
reporting many bugs, for the memory consumption figures, and for his feedback on the evaluation script.


## <a id="reference"> 6. References </a>

[1] Chang, X., Maekaku, T., Fujita, Y., & Watanabe, S. (2022). End-to-end integration of speech recognition, speech enhancement, and self-supervised learning representation. <https://arxiv.org/abs/2204.00540> <br>
[2] Boeddeker, C., Heitkaemper, J., Schmalenstroeer, J., Drude, L., Heymann, J., & Haeb-Umbach, R. (2018, September). Front-end processing for the CHiME-5 dinner party scenario. In CHiME5 Workshop, Hyderabad, India (Vol. 1). <br>
[3] Kim, S., Hori, T., & Watanabe, S. (2017, March). Joint CTC-attention based end-to-end speech recognition using multi-task learning. Proc. of ICASSP (pp. 4835-4839). IEEE. <br>
[4] Wolf, M., & Nadeu, C. (2014). Channel selection measures for multi-microphone speech recognition. Speech Communication, 57, 170-180. <br>
[5] von Neumann, T., Boeddeker, C., Delcroix, M., & Haeb-Umbach, R. (2023). MeetEval: A Toolkit for Computation of Word Error Rates for Meeting Transcription Systems. arXiv preprint arXiv:2307.11394. <br>


[slack-badge]: https://img.shields.io/badge/slack-chat-green.svg?logo=slack
[slack-invite]: https://join.slack.com/t/chime-fey5388/shared_invite/zt-1oha0gedv-JEUr1mSztR7~iK9AxM4HOA
[twitter]: https://twitter.com/chimechallenge
1 change: 1 addition & 0 deletions egs2/chime8_task1/asr1/asr.sh
