Merge pull request #5641 from popcornell/c8dasr
CHiME-8 DASR recipe based on CHiME-7 DASR baseline
sw005320 committed Feb 16, 2024
2 parents 332fdc1 + 3ad06d5 commit 7ab5e42
Showing 63 changed files with 5,027 additions and 0 deletions.
85 changes: 85 additions & 0 deletions egs2/chime8_task1/HELP.md
@@ -0,0 +1,85 @@
## <a id="common_issues"> Common Issues </a>

⚠️ If you use `run.pl`, check the GSS logs while it is running and ensure no other processes are using the GPUs.

1. `AssertionError: Torch not compiled with CUDA enabled` <br> PyTorch was installed without CUDA support. <br>
Please reinstall PyTorch with CUDA support as explained on the [PyTorch website](https://pytorch.org/).
2. `ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: 'YOUR_PATH/espnet/tools/venv/lib/python3.9/site-packages/numpy-1.23.5.dist-info/METADATA'`. The numpy installation got corrupted for some reason.
Remove the `site-packages/numpy-` folder manually and reinstall numpy 1.23.5 with pip.
3. `FileNotFoundError: [Errno 2] No such file or directory: 'PATH2YOURESPNET/espnet/tools/venv/bin/sox'` during CHiME-6 generation from CHiME-5 (the `correct_signals_for_clock_drift.py` script): reinstall sox via `conda install -c conda-forge sox`.
4. `ModuleNotFoundError: No module named 's3prl'` s3prl did not install for some reason; run `YOUR_ESPNET_ROOT/tools/installers/install_s3prl.sh`.
5. `Command 'gss' not found` gss did not install for some reason; run `YOUR_ESPNET_ROOT/tools/installers/install_gss.sh`.
6. `wav-reverberate command not found` you need to install Kaldi: go to `YOUR_ESPNET_ROOT/tools/kaldi` and follow the instructions
in `INSTALL`.
7. `WARNING [enhancer.py:245] Out of memory error while processing the batch` you ran out of GPU memory (OOM) while running GSS.
Try lowering the `gss_max_batch_dur` parameter, or `context-duration` in `local/run_gss.sh`
(the latter may degrade results); see `local/run_gss.sh` for more info. It could also be that your GPUs are set to shared mode
and all jobs land on the same GPU; set them to exclusive mode (see the sketch after this list).
8. **Much worse WER than the baseline** while using `run.pl`: **check the GSS results and logs; it is likely that enhancement failed.**
**GSS currently does not work well if you use multi-GPU inference and your GPUs are in shared mode.** Run `nvidia-smi -i X -c 3` for each GPU index `X`, as in the sketch below.
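
The last two items amount to putting every GPU in EXCLUSIVE_PROCESS compute mode. A minimal sketch, assuming `nvidia-smi` is available and you have the required privileges:

```bash
# Set every visible GPU to EXCLUSIVE_PROCESS compute mode (-c 3),
# so each GPU accepts at most one CUDA process at a time.
for i in $(nvidia-smi --query-gpu=index --format=csv,noheader); do
  sudo nvidia-smi -i "$i" -c 3
done
```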

## Number of Utterances for Each Dataset in this Recipe
### Training Set
#### all
- kaldi/train_all_ihm: 175403
- kaldi/train_all_ihm_rvb: 701612
- kaldi/train_all_mdm_ihm: 2150180
- kaldi/train_all_mdm_ihm_rvb: 2851792
- kaldi/train_all_mdm_ihm_rvb_gss: 2914483 (used for training here)
#### chime6

- kaldi/chime6/train/mdm: 1403340
- kaldi/chime6/train/gss: 62691
- kaldi/chime6/train/ihm: 118234
#### mixer6

- kaldi/mixer6/train/mdm: 571437
- kaldi/mixer6/train/ihm: 57169

### Development Set
#### all
- kaldi/dev_ihm_all; kaldi/dev_all_gss: 25121
#### chime6
- kaldi/chime6/dev/gss (used for validation here); kaldi/chime6/dev/ihm: 6644
#### dipco
- kaldi/dipco/dev/gss; kaldi/dipco/dev/ihm: 3673
#### mixer6
- kaldi/mixer6/dev/gss; kaldi/mixer6/dev/ihm: 14804

## Memory Consumption (Useful for SLURM etc.)

Figures kindly reported by Christoph Boeddeker, who ran this baseline code
on the Paderborn Center for Parallel Computing cluster (which uses SLURM).
They should be useful to anyone who uses job schedulers and clusters
where resources are assigned strictly (e.g. the job is killed if it exceeds the
requested memory).

Used as default:
- train: 3G mem
- cuda: 4G mem (1 GPU)
- decode: 4G mem

`scripts/audio/format_wav_scp.sh`:
- Some spikes in the 15 to 17 GB range.

`${python} -m espnet2.bin.${asr_task}_inference${inference_bin_tag}`:
- A few spikes in the 9 to 11 GB range.
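
Based on these figures, memory can be requested per job type through the scheduler wrappers. A sketch assuming a standard `slurm.pl` setup with `conf/slurm.conf` (names and headroom values are assumptions, adjust to your `cmd.sh`):

```bash
# Sketch: per-job-type memory requests for SLURM via slurm.pl.
export train_cmd="slurm.pl --mem 3G"
export cuda_cmd="slurm.pl --mem 4G --gpu 1"
export decode_cmd="slurm.pl --mem 4G"
# scripts/audio/format_wav_scp.sh spikes to ~17 GB, so give those jobs
# extra headroom, e.g. "slurm.pl --mem 20G".
```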


## Using your own Speech Separation Front-End with the pre-trained ASR model

Some suggestions from Naoyuki Kamo (see https://github.com/espnet/espnet/pull/4999). <br>
There are two possible approaches:

1. Start from the output of the baseline GSS enhancement.
   1. (if you are more familiar with Kaldi) copy `data/kaldi/{chime6,dipco,mixer6}/{dev,eval}/gss` using `utils/copy_data_dir.sh` and then substitute the
   file paths in each `wav.scp` manifest with the ones produced by your approach, as in the sketch at the end of this section.
   2. (if you are more familiar with lhotse) copy the `data/lhotse/{chime6,dipco,mixer6}/{dev,eval}/gss` lhotse manifests and then
   replace the recording manifests with the paths to your own recordings.
2. Directly create your own Kaldi or lhotse manifests for ASR decoding, following
either the "style" of the baseline GSS ones or of those belonging to the close-talk mics.

To evaluate the newly enhanced data, e.g. `kaldi/chime6/dev/my_enhanced`, you need to include it in `asr_tt_set` in `run.sh` or
from the command line: `run.sh --stage 3 --asr-tt-set "kaldi/chime6/dev/gss" --decode-train dev --use-pretrained popcornell/chime7_task1_asr1_baseline --asr-dprep-stage 4`.
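
A minimal sketch of the Kaldi-style route (the `my_enhanced` name and the audio paths are illustrative):

```bash
# Clone the baseline GSS data dir, then point wav.scp at your own audio.
src=data/kaldi/chime6/dev/gss
dst=data/kaldi/chime6/dev/my_enhanced
utils/copy_data_dir.sh "$src" "$dst"
# wav.scp maps <utterance-id> -> <audio path>: keep the ids, swap the paths.
awk '{print $1, "/path/to/my_enhanced_audio/" $1 ".wav"}' "$src/wav.scp" > "$dst/wav.scp"
utils/validate_data_dir.sh --no-feats "$dst"  # sanity check
```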
35 changes: 35 additions & 0 deletions egs2/chime8_task1/INSTALL.md
@@ -0,0 +1,35 @@
### <a id="espnet_installation"> Installation </a>

Firstly, clone ESPNet. <br/>
```bash
git clone https://github.com/espnet/espnet/
```
Next, ESPNet must be installed; go to `espnet/tools`. <br />
This will install a new environment called espnet with Python 3.9.2:
```bash
cd espnet/tools
./setup_anaconda.sh venv "" 3.9.2
```
Activate this new environment.
```bash
source ./venv/bin/activate
```
Then install ESPNet with PyTorch 1.13.1; be sure to specify the correct **cudatoolkit** version for your system.
```bash
make TH_VERSION=1.13.1 CUDA_VERSION=11.6
```
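Optionally, run a quick sanity check that PyTorch was built with CUDA support (this catches issue 1 in [HELP.md](HELP.md) early):
```bash
# Should print e.g. "1.13.1 True"; "False" means no CUDA support.
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```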
If you plan to train the ASR model, you will need to compile Kaldi; otherwise you can
skip this step. Go to the `kaldi` directory and follow the instructions in `INSTALL`.
```bash
cd kaldi
cat INSTALL
```
Finally, go to this recipe's **asr1 folder** and install the other baseline-required packages (e.g. lhotse) using this script:
```bash
cd ../egs2/chime8_task1/asr1
./local/install_dependencies.sh
```
You should be good to go!

⚠️ If you encounter any problem, have a look at [HELP.md](HELP.md). <br>
Or reach out to us; see [README.md](./README.md).
218 changes: 218 additions & 0 deletions egs2/chime8_task1/README.md
@@ -0,0 +1,218 @@
# CHiME-8 DASR (CHiME-8 Task 1)

### Distant Automatic Speech Transcription with Multiple Devices in Diverse Scenarios
[![Twitter](https://img.shields.io/twitter/url/https/twitter.com/chimechallenge.svg?style=social&label=Follow%20%40chimechallenge)](https://twitter.com/chimechallenge)
[![Slack][slack-badge]][slack-invite]
---


#### 📢 If you want to participate, see the [official challenge website](https://www.chimechallenge.org/current/task1/index) for registration.


### <a id="reach_us">Any Question/Problem ? Reach us !</a>

If you are considering participating, or just want to learn more, please join the <a href="https://groups.google.com/g/chime5/">CHiME Google Group</a>. <br>
We also have a [CHiME Slack Workspace][slack-invite]; join the `chime-8-dasr` channel there, or contact us directly.<br>
We also have a [Troubleshooting page](./HELP.md).


## DASR Data Download and Generation

Data generation is handled here using [chime-utils](https://github.com/chimechallenge/chime-utils). <br>
If you are **only interested in obtaining the data** you should use [chime-utils](https://github.com/chimechallenge/chime-utils) directly. <br>

Data downloading and generation is done automatically in stage 0 of this recipe; you can skip it if you already have the data. <br>
Note that Mixer 6 Speech has to be obtained via LDC. See [official challenge website](https://www.chimechallenge.org/current/task1/data). <br>
CHiME-6, DiPCo and NOTSOFAR1 will be downloaded automatically.
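
If you use [chime-utils](https://github.com/chimechallenge/chime-utils) standalone, generation might look like the following (a sketch; verify the exact arguments with `chime-utils dgen --help`):

```bash
# Download CHiME-6, DiPCo and NOTSOFAR1 and generate all CHiME-8 DASR data
# (Mixer 6 must already be obtained from LDC and unpacked).
chime-utils dgen dasr /your/path/to/download /your/path/to/mixer6_root \
    /path/to/chime8_dasr --part "train,dev"
```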
## System Description

<img src="https://www.chimechallenge.org/challenges/chime7/task1/images/baseline.png" width="450" height="120" />

The system here is effectively the same as the one used for the CHiME-7 DASR Challenge (except for some minor changes). <br>
It is described in detail in the [CHiME-7 DASR paper](https://arxiv.org/abs/2306.13734) and the [website of the previous challenge](https://www.chimechallenge.org/challenges/chime7/task1/baseline). <br>
The system consists of:
1. a diarization component based on the [Pyannote diarization pipeline 2.0](https://huggingface.co/pyannote/speaker-diarization)
   - this is in the `diar_asr1` folder
2. envelope-variance channel selection [4] + guided source separation [2] + a WavLM-based ASR model [1]
   - this is in the `asr1` folder


#### <a id="whatisnew">What is new compared to CHiME-7 DASR Baseline ? </a>

- GSS is now much more memory-efficient, see https://github.com/desh2608/gss/pull/39 (many thanks to Christoph Boeddeker).
- We raised the clustering threshold for the pre-trained Pyannote EEND segmentation model and raised the maximum number of speakers to 8 to handle NOTSOFAR1.
- Some bugs have been fixed.

## 📊 Results

As explained on the [official challenge website](https://www.chimechallenge.org/current/task1/index), this year
systems will be ranked according to the macro tcpWER [5] across the 4 scenarios (5 s collar). <br>
The 4 scenarios featured this year are very diverse ([see the website for statistics](https://www.chimechallenge.org/current/task1/index)), and this diversity
significantly complicates speaker counting.
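
For reference, computing the ranking metric with [MeetEval](https://github.com/fgnt/meeteval) [5] looks roughly like this (a sketch; the STM file names are placeholders):

```bash
# Time-constrained per-speaker WER (tcpWER) with a 5 s collar.
meeteval-wer tcpwer -r reference.stm -h hypothesis.stm --collar 5
```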


```bash
+-----+--------------+--------------+----------+----------+--------------+-------------+-----------------+------------------+------------------+------------------+
| | session_id | error_rate | errors | length | insertions | deletions | substitutions | missed_speaker | falarm_speaker | scored_speaker |
|-----+--------------+--------------+----------+----------+--------------+-------------+-----------------+------------------+------------------+------------------|
| dev | chime6 | 0.825381 | 52070 | 63086 | 12747 | 29466 | 9857 | 0 | 5 | 8 |
| dev | mixer6 | 0.287729 | 26621 | 92521 | 4882 | 8809 | 12930 | 0 | 24 | 70 |
| dev | dipco | 0.674161 | 11574 | 17168 | 3066 | 5563 | 2945 | 0 | 2 | 8 |
| dev | notsofar1 | 0.508768 | 90660 | 178195 | 14872 | 55195 | 20593 | 105 | 7 | 592 |
+-----+--------------+--------------+----------+----------+--------------+-------------+-----------------+------------------+------------------+------------------+
###############################################################################
### Macro-Averaged tcpWER for across all Scenario (Ranking Metric) ############
###############################################################################
+-----+--------------+
| | error_rate |
|-----+--------------|
| dev | 0.57401 |
+-----+--------------+
```

## Reproducing the Baseline

⚠️ **GSS currently does not work well if you use multi-GPU inference and your GPUs are in shared mode.** <br>
If you use `run.pl`, please set your GPUs to EXCLUSIVE_PROCESS with `nvidia-smi -i X -c 3`, where `X` is the GPU index.

### Inference-only

If you want to perform inference with the pre-trained models:
- ASR ([HF repo](https://huggingface.co/popcornell/chime7_task1_asr1_baseline))
- Pyannote Segmentation ([HF repo](https://huggingface.co/popcornell/chime7_task1_asr1_baseline))


By default, the scripts hereafter will perform inference on the dev set of all 4 scenarios: CHiME-6, DiPCo, Mixer 6 and NOTSOFAR1. <br>
To restrict inference, e.g. to only CHiME-6 and DiPCo, you can pass these options:

```bash
--gss-dsets "chime6_dev,dipco_dev" --asr-tt-set "kaldi/chime6/dev/gss kaldi/dipco/dev/gss"
```

#### Full-System (Diarization+GSS+ASR)


Go to `diar_asr1`:
```bash
cd diar_asr1
```
If you have already generated the data via [chime-utils](https://github.com/chimechallenge/chime-utils) and the data is in `/path/to/chime8_dasr`:
```bash
./run.sh --chime8-root /path/to/chime8_dasr --stage 1 --ngpu YOUR_NUMBER_OF_GPUs \
--use-pretrained popcornell/chime7_task1_asr1_baseline \
--run-on dev
```
If you have not generated the data yet:
CHiME-6, DiPCo and NOTSOFAR1 will be downloaded automatically; ensure you have ~1 TB of free space in a path of your choice `/your/path/to/download`. <br>
Mixer 6 Speech has to be obtained via LDC and unpacked in a directory of your choice `/your/path/to/mixer6_root`. <br>
Data will be generated in `/your/path/to/chime8_dasr`; again, choose the most convenient location for you.


```bash
./run.sh --chime8-root /path/to/chime8_dasr \
--download-dir /your/path/to/download \
--mixer6-root /your/path/to/mixer6_root \
--stage 0 --ngpu YOUR_NUMBER_OF_GPUs \
--use-pretrained popcornell/chime7_task1_asr1_baseline \
--run-on dev
```

You can use the `--stage` and `--gss-asr-stage` args to resume inference from any step.

#### GSS+ASR only with Oracle Diarization (or your own diarizer's output)

We also provide a GSS + ASR only script, to be used with oracle diarization
or with your own diarizer's output if you wish to work only on diarization. <br>
We assume here that you have already generated the data, and we start from stage 1.

If you want to use oracle diarization, go to `asr1`:

```bash
cd asr1

./run.sh --chime8-root /path/to/chime8_dasr --stage 1 --ngpu YOUR_NUMBER_OF_GPUs \
--use-pretrained popcornell/chime7_task1_asr1_baseline \
--run-on dev
```

If you want to use your custom diarization, go to `diar_asr1`:
```bash
cd diar_asr1

./run.sh --chime8-root /path/to/chime8_dasr --stage 3 --ngpu YOUR_NUMBER_OF_GPUs \
--use-pretrained popcornell/chime7_task1_asr1_baseline \
--run-on dev --diarization-dir /path/to/your/diarization/output
```

It is assumed that your diarizer produces JSON manifests (in the same format as the CHiME-8 DASR annotation; see the [data page]())
and that these manifests are in `/path/to/your/diarization/output`. <br>
`/path/to/your/diarization/output` should have this structure (the `.rttm` files can be ignored):

```
├── chime6
│   └── dev
│   ├── S02.json
│   ├── S02.rttm
│   ├── S09.json
│   └── S09.rttm
├── dipco
│   └── dev
│   ├── S28.json
│   ├── S28.rttm
│   ├── S29.json
│   └── S29.rttm
├── mixer6
......
├── notsofar1
```
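
Each per-session JSON is expected to contain a list of segments in the CHiME-style annotation format. A hedged sketch (the field names are assumptions based on CHiME-6-style annotation; the data page is authoritative):

```bash
# What one per-session diarization manifest might look like:
cat /path/to/your/diarization/output/chime6/dev/S02.json
# [
#   {"session_id": "S02", "speaker": "P05", "start_time": "10.21", "end_time": "12.43"},
#   {"session_id": "S02", "speaker": "P06", "start_time": "12.80", "end_time": "15.10"},
#   ...
# ]
```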


### Training the ASR model

We assume here you have already generated the data and start from stage 1.
If you want to retrain the ASR model, go to `asr1` and choose a name for the new model:

```bash
cd asr1

./run.sh --chime8-root /path/to/chime8_dasr --stage 1 --ngpu YOUR_NUMBER_OF_GPUs \
--run-on train --asr-tag YOUR_NEW_ASR_NAME
```

You can use the `--stage`, `--asr-stage` and `--asr-dprep-stage` args to resume training from any step.

### Fine-Tuning the Pyannote Segmentation Model

We assume here you have already generated the data and start from stage 1.
If you want to fine-tune the segmentation model, go to `diar_asr1` and choose a name for the new model:

```bash
cd diar_asr1

./run.sh --chime8-root /path/to/chime8_dasr --stage 1 --ngpu YOUR_NUMBER_OF_GPUs \
--pyan-ft 1
```

Note that the data preparation for the fine-tuning is done in `diar_asr1/local/pyannote_dprep.py`,
and you also have to set up `diar_asr1/local/database.yml` properly to use your own data. <br>
See the [pyannote documentation](https://github.com/pyannote/pyannote-audio/blob/develop/tutorials/training_a_model.ipynb) for more info.
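
For orientation, a `database.yml` might look like the following sketch (the database/protocol names and paths are assumptions; the pyannote.database documentation defines the actual schema):

```bash
# Sketch: write a minimal pyannote.database config (names/paths are placeholders).
cat > diar_asr1/local/database.yml << 'EOF'
Databases:
  MyCHiME8: /path/to/audio/{uri}.wav
Protocols:
  MyCHiME8:
    SpeakerDiarization:
      finetune:
        train:
          uri: lists/train.lst
          annotation: rttms/train.rttm
          annotated: uems/train.uem
EOF
```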

You can use the `--stage` and `--gss-asr-stage` args to resume the run from any step.

## Acknowledgements

We would like to thank Naoyuki Kamo for his invaluable help, and Christoph Boeddeker for
reporting many bugs, for the memory consumption figures, and for his feedback on the evaluation script.


## <a id="reference"> 6. References </a>

[1] Chang, X., Maekaku, T., Fujita, Y., & Watanabe, S. (2022). End-to-end integration of speech recognition, speech enhancement, and self-supervised learning representation. <https://arxiv.org/abs/2204.00540> <br>
[2] Boeddeker, C., Heitkaemper, J., Schmalenstroeer, J., Drude, L., Heymann, J., & Haeb-Umbach, R. (2018, September). Front-end processing for the CHiME-5 dinner party scenario. In CHiME5 Workshop, Hyderabad, India (Vol. 1). <br>
[3] Kim, S., Hori, T., & Watanabe, S. (2017, March). Joint CTC-attention based end-to-end speech recognition using multi-task learning. Proc. of ICASSP (pp. 4835-4839). IEEE. <br>
[4] Wolf, M., & Nadeu, C. (2014). Channel selection measures for multi-microphone speech recognition. Speech Communication, 57, 170-180. <br>
[5] von Neumann, T., Boeddeker, C., Delcroix, M., & Haeb-Umbach, R. (2023). MeetEval: A Toolkit for Computation of Word Error Rates for Meeting Transcription Systems. arXiv preprint arXiv:2307.11394. <br>


[slack-badge]: https://img.shields.io/badge/slack-chat-green.svg?logo=slack
[slack-invite]: https://join.slack.com/t/chime-fey5388/shared_invite/zt-1oha0gedv-JEUr1mSztR7~iK9AxM4HOA
[twitter]: https://twitter.com/chimechallenge
1 change: 1 addition & 0 deletions egs2/chime8_task1/asr1/asr.sh
