Merge pull request #4934 from popcornell/chime7task1
Fixes + Channel Selection for CHiME-7 Task
mergify[bot] committed Feb 14, 2023
2 parents a24d72a + 5179f7a commit 5d4615f
Showing 14 changed files with 9,292 additions and 151 deletions.
Empty file modified: egs/chime6/asr1/local/distant_audio_list (mode 100644 → 100755)
225 changes: 178 additions & 47 deletions egs2/chime7_task1/asr1/README.md
[![Twitter](https://img.shields.io/twitter/url/https/twitter.com/chimechallenge.svg?style=social&label=Follow%20%40chimechallenge)](https://twitter.com/chimechallenge)
[![Slack][slack-badge]][slack-invite]


---

#### If you want to participate, please fill in this [Google form](https://forms.gle/vbk4gpF77hP5LgKM8) (one contact person per team only)
### Sections
1. <a href="#description">Short Description </a>
2. <a href="#data_creation">Data Download and Creation</a>

<img src="https://www.chimechallenge.org/current/task1/images/task_overview.png" width="450" height="230" />

This CHiME-7 Challenge Task inherits directly from the previous [CHiME-6 Challenge](https://chimechallenge.github.io/chime6/).
Its focus is on distant automatic speech transcription and segmentation with multiple recording devices.

The goal of each participant is to devise an automated system that can tackle this problem, and is able to generalize across different array topologies and different application scenarios: meetings, dinner parties and interviews.
Participants may also exploit commonly used open-source datasets (e.g. LibriSpeech).

If you are considering participating or just want to learn more, please join the <a href="https://groups.google.com/g/chime5/">CHiME Google Group</a>. <br>
We also have a [CHiME Slack Workspace][slack-invite].<br>
Follow us on [Twitter][Twitter]; we will also use it to make announcements. <br>



## <a id="installation">2. Installation </a>
First, clone ESPNet: `git clone https://github.com/espnet/espnet/`
Next, ESPNet must be installed; go to espnet/tools. <br />
This will set up a new Python 3.9.2 environment under `tools/venv`.
```bash
cd espnet/tools
./setup_anaconda.sh venv "" 3.9.2
```
Activate this new environment.
```bash
source ./venv/bin/activate
```
Then install ESPNet with PyTorch 1.13.1; be sure to set the correct version for **cudatoolkit**.
```bash
make TH_VERSION=1.13.1 CUDA_VERSION=11.6
```
Go to tools/kaldi and follow the instructions in INSTALL for installing Kaldi.
```bash
cd ./tools/kaldi
```
Finally, go to this recipe folder and install the other required baseline packages (e.g. lhotse) using this script:
```bash
cd ../egs2/chime7_task1/asr1
./local/install_dependencies.sh
```
You should be good to go!
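
As an optional sanity check (assuming the environment created above is activated), you can verify that PyTorch was installed with CUDA support:
```bash
# Should print the torch version and "True" on a GPU machine.
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```
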
The CHiME-7 DASR dataset is composed of three corpora; these are:
1. CHiME-6 Challenge [1]
2. Amazon Alexa Dinner Party Corpus (DiPCo) [2]
3. LDC Mixer 6 Speech (here we use a new re-annotated version) [3]

Unfortunately, only DiPCo can be downloaded automatically; the others must be
downloaded manually and then processed via our scripts here. <br>
See the [Data page](https://www.chimechallenge.org/current/task1/data) for instructions on obtaining them.

To generate the data you need to have manually downloaded and unpacked Mixer 6 Speech
and the CHiME-5 dataset, following the instructions on the [Data page](https://www.chimechallenge.org/current/task1/data).


**Stage 0** of `run.sh` here handles CHiME-7 DASR dataset creation and calls `local/gen_task1_data.sh`. <br>
Note that DiPCo will be downloaded and extracted automatically. <br>
To **ONLY** generate the data you will need to run:

```bash
./run.sh --chime5-root YOUR_PATH_TO_CHiME5 --dipco-root PATH_WHERE_DOWNLOAD_DIPCO \
--mixer6-root YOUR_PATH_TO_MIXER6 --chime6-path PATH_WHERE_STORE_CHiME6 --stage 0 --stop-stage 0
```
If you already have the CHiME-6 data, you can use that without re-creating it from CHiME-5.
```bash
./run.sh --chime6-root YOUR_PATH_TO_CHiME6 --dipco-root PATH_WHERE_DOWNLOAD_DIPCO \
--mixer6-root YOUR_PATH_TO_MIXER6 --stage 0 --stop-stage 0
```
**If you want to run the recipe from data prep to ASR training and decoding**, remove the stop-stage flag instead.
Please also take a look at arguments such as `gss_max_batch_dur`
and `asr_batch_size`, because you may want to adjust these based on your hardware.
```bash
./run.sh --chime6-root YOUR_PATH_TO_CHiME6 --dipco-root PATH_WHERE_DOWNLOAD_DIPCO \
--mixer6-root YOUR_PATH_TO_MIXER6 --stage 0 --ngpu YOUR_NUMBER_OF_GPUs
```


**We also provide a pre-trained model**; you can run inference only on the development set
using:
```bash
./run.sh --chime6-root YOUR_PATH_TO_CHiME6 --dipco-root PATH_WHERE_DOWNLOAD_DIPCO \
--mixer6-root YOUR_PATH_TO_MIXER6 --stage 0 --ngpu YOUR_NUMBER_OF_GPUs \
--use-pretrained popcornell/chime7_task1_asr1_baseline \
--decode_only 1
```
You should be able to replicate our results detailed in Section 3.1.2 with the
top 80% envelope variance automatic channel selection (a bit more than 34% WER on CHiME-6). <br>
Note that results may differ a little because GSS inference is not deterministic.
### <a id="data_description">2.2 Quick Data Overview</a>
The generated dataset folder after running the script should look like this:
```
chime7_task1
└── mixer6
├── audio
│   ├── dev
│   ├── train_call
│   └── train_intv
├── transcriptions
│   ├── dev
│   ├── train_call
│   └── train_intv
└── transcriptions_scoring
├── dev
├── train_call
└── train_intv
```
**NOTE**: eval directories for chime6 are empty at this stage. <br>
Eval data (which is actually blind only for Mixer6) can be generated by passing the argument
`--gen-eval 1` (note that you will need the Mixer6 eval set, which will be released [later](https://www.chimechallenge.org/current/dates)).
<br>
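
For illustration, such an eval-generation call might look like this (paths are placeholders, and the other flags are the same as in the commands above):
```bash
./run.sh --chime6-root YOUR_PATH_TO_CHiME6 --dipco-root PATH_WHERE_DOWNLOAD_DIPCO \
--mixer6-root YOUR_PATH_TO_MIXER6 --stage 0 --stop-stage 0 --gen-eval 1
```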

---
To find out if the data has been generated correctly, you can run this
script from this directory. <br>
If it runs successfully, your MD5 checksums are correct. <br>
The reference checksums are stored in this repo in `local/chime7_dasr_md5.json`.
```bash
python ./local/check_data_gen.py -c PATH_TO_CHIME7TASK_DATA
```
When the evaluation data is released, please run the check again:
```bash
python ./local/check_data_gen.py -c PATH_TO_CHIME7TASK_DATA -e 1
```
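
For reference, here is a minimal sketch of the kind of check `local/check_data_gen.py` performs. It assumes `local/chime7_dasr_md5.json` simply maps relative file paths to MD5 digests, which may not match the actual script or file format:
```python
# Hedged illustration of a checksum verification, NOT the actual
# implementation of local/check_data_gen.py: hash every generated file and
# compare it with the reference checksums shipped in the repo.
import hashlib
import json
from pathlib import Path


def md5_of(path: Path) -> str:
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def check(data_root: str, ref_json: str = "local/chime7_dasr_md5.json") -> None:
    # Assumed format: {"relative/path/to/file.wav": "md5 digest", ...}
    reference = json.load(open(ref_json))
    for rel_path, expected in reference.items():
        got = md5_of(Path(data_root) / rel_path)
        assert got == expected, f"Checksum mismatch for {rel_path}: {got} != {expected}"
    print("All checksums match.")
```
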

Additional data description is available on the [Data page](https://www.chimechallenge.org/current/task1/data).

## <a id="baseline">3. Baseline System</a>

The baseline system in this recipe is similar to the `egs2/chime6` one, which
itself is inherited directly from the CHiME-6 Challenge Kaldi recipe for Track 1 [s5_track1](https://github.com/kaldi-asr/kaldi/tree/master/egs/chime6/s5_track1). <br>

It is composed of two modules (with an optional channel selection module):
1. Guided Source Separation (GSS) [5]; here we employ the GPU-based version (much faster) from [Desh Raj](https://github.com/desh2608/gss).
2. End-to-end ASR model based on [4], which is a transformer encoder/decoder model trained <br>
with joint CTC/attention [6]. It uses WavLM [7] as a feature extractor.
3. Optional automatic channel selection based on the Envelope Variance measure (EV) [8]; a minimal sketch of this measure is given right after this list.
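
The following is only an illustrative sketch of EV-style channel ranking; the STFT settings, sub-band grouping, and normalization below are our own simplifications and not necessarily those used by the baseline or by [8]:
```python
# Hedged sketch of Envelope Variance (EV) channel selection, NOT the baseline
# implementation: rank channels by the variance of their sub-band log-energy
# envelopes and keep only the top fraction (e.g. 80%) for GSS.
import numpy as np
import scipy.signal


def envelope_variance(wav: np.ndarray, sr: int, n_bands: int = 21) -> float:
    """Return a simple EV score for one channel (higher usually means less smearing)."""
    _, _, spec = scipy.signal.stft(wav, fs=sr, nperseg=400, noverlap=240)
    energy = np.abs(spec) ** 2
    scores = []
    # Group frequency bins into coarse sub-bands (a stand-in for a mel filterbank).
    for band in np.array_split(energy, n_bands, axis=0):
        env = np.log(band.sum(axis=0) + 1e-10)  # sub-band log envelope over time
        scores.append(np.var(env))  # variance is invariant to per-channel gain offsets
    return float(np.mean(scores))


def select_channels(multichannel: np.ndarray, sr: int, keep_frac: float = 0.8) -> np.ndarray:
    """multichannel: (n_channels, n_samples). Return indices of the retained channels."""
    ev = np.array([envelope_variance(ch, sr) for ch in multichannel])
    n_keep = max(1, int(round(keep_frac * len(ev))))
    return np.argsort(ev)[::-1][:n_keep]
```
In this sketch, GSS would then be run only on the channels whose indices are returned by `select_channels`.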

### 3.1 Results

#### 3.1.1 Main Track [Work in Progress]
The main track baseline is unfortunately still work in progress; we hope to finish it
in the next few weeks. <br>
It will be based on a TS-VAD diarization model which leverages pre-trained self-supervised
representations. <br>
We apologize for the inconvenience. <br>
The diarization output will be fed to this recipe's GSS+ASR pipeline.

#### 3.1.2 Sub-Track 1: Oracle Diarization + ASR

Pretrained model: [popcornell/chime7_task1_asr1_baseline](https://huggingface.co/popcornell/chime7_task1_asr1_baseline) <br>
Detailed decoding results (insertions, deletions, etc.) are available in the model's Hugging Face repository;
see [channel selection log top 25%](https://huggingface.co/popcornell/chime7_task1_asr1_baseline/blob/main/decoding_channel_selection.txt), [channel selection log top 80%](https://huggingface.co/popcornell/chime7_task1_asr1_baseline/blob/main/decoding_channel_selection_80.txt).

Here we report the results obtained using channel selection (the top 25% and the top 80% of all channels,
respectively) prior to performing GSS and decoding with the baseline pre-trained
ASR model.

<table>
<thead>
<tr>
<th>Dataset</th>
<th>split</th>
<th>front-end</th>
<th>SA-WER (%)</th>
<th>macro SA-WER (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td>CHiME-6</td>
<td rowspan="3">dev</td>
<td>GSS (EV top 25%)</td>
<td>44.5 </td>
<td rowspan="3">37.4<br></td>
</tr>
<tr>
<td>DiPCo</td>
<td>GSS (EV top 25%)</td>
<td>40.1</td>
</tr>
<tr>
<td>Mixer-6</td>
<td>GSS (EV top 25%)</td>
<td>27.7</td>
</tr>
</tbody>
</table>


<table>
<thead>
<tr>
<th>Dataset</th>
<th>split</th>
<th>front-end</th>
<th>SA-WER (%)</th>
<th>macro SA-WER (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td>CHiME-6</td>
<td rowspan="3">dev</td>
<td>GSS (EV top 80%)</td>
<td>34.5</td>
<td rowspan="3">30.8</td>
</tr>
<tr>
<td>DiPCo</td>
<td>GSS (EV top 80%)</td>
<td>36.8</td>
</tr>
<tr>
<td>Mixer-6</td>
<td>GSS (EV top 80%)</td>
<td>21.2</td>
</tr>
</tbody>
</table>

Such a baseline system would rank fourth on the dev set based on the rules of the past CHiME-6 Challenge
Track 1 (unconstrained LM).
The system with EV selection of the top 80% of microphones will also be used as the baseline for evaluation.


Here we present results when only the outer microphones are used on CHiME-6 and DiPCo, while Mixer 6 uses all microphones. <br>
**However, such a system would be against the rules set for this Challenge, since manual microphone selection based on the domain amounts to domain identification.**
It is nevertheless worth reporting these results. <br>
Here, "all outer" for DiPCo means we used channels `2,5,9,12,16,19,23,26,30,33` (the diametrically opposite mics on each array).
<table>
<thead>
<tr>
<th>Dataset</th>
<th>split</th>
<th>front-end</th>
<th>SA-WER (%)</th>
<th>macro SA-WER (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td>CHiME-6</td>
<td rowspan="3">dev</td>
<td>GSS (all outer)</td>
<td>35.5</td>
<td rowspan="3">31.8</td>
</tr>
<tr>
<td>DiPCo</td>
<td>GSS (all outer)</td>
<td>39.3</td>
</tr>
<tr>
<td>Mixer-6</td>
<td>GSS (all)</td>
<td>20.6</td>
</tr>
</tbody>
</table>

As can be seen, using all microphones appears to work better for Mixer 6 Speech, but this is not true for CHiME-6 and DiPCo, as shown by the channel
selection results. <br>
One of our hopes is that participants will devise new techniques to improve in this regard and come up with
better channel selection/fusion strategies.

## <a id="eval_script"> 4. Evaluation Script </a>

Evaluation will be performed as described on the [Task Main Page](https://www.chimechallenge.org/current/task1/index) and requires joint diarization and transcription:
SA-WER is computed for each speaker, where the hypotheses are re-ordered according to the best reordering defined by the diarization error rate.
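
As a rough illustration only (the official scoring derives the speaker mapping from the diarization error rate, not from WER itself, and handles details we skip here), per-speaker WER with an assignment-based speaker mapping could be sketched as follows:
```python
# Hedged sketch of a speaker-attributed WER: map hypothesis speakers to
# reference speakers, then average the per-reference-speaker WER.
# The official CHiME-7 scoring derives the mapping from DER instead.
import numpy as np
from scipy.optimize import linear_sum_assignment


def wer(ref: str, hyp: str) -> float:
    """Word error rate via a plain word-level edit distance."""
    r, h = ref.split(), hyp.split()
    d = np.zeros((len(r) + 1, len(h) + 1), dtype=int)
    d[:, 0] = np.arange(len(r) + 1)
    d[0, :] = np.arange(len(h) + 1)
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i, j] = min(d[i - 1, j - 1] + (r[i - 1] != h[j - 1]),  # substitution
                          d[i - 1, j] + 1,                           # deletion
                          d[i, j - 1] + 1)                           # insertion
    return d[len(r), len(h)] / max(len(r), 1)


def sa_wer(ref_by_spk: dict, hyp_by_spk: dict) -> float:
    """ref_by_spk / hyp_by_spk: speaker id -> concatenated transcript."""
    ref_ids, hyp_ids = list(ref_by_spk), list(hyp_by_spk)
    cost = np.array([[wer(ref_by_spk[r], hyp_by_spk[h]) for h in hyp_ids]
                     for r in ref_ids])
    rows, cols = linear_sum_assignment(cost)  # best speaker mapping
    best = {ref_ids[i]: hyp_ids[j] for i, j in zip(rows, cols)}
    return float(np.mean([wer(ref_by_spk[r], hyp_by_spk.get(best.get(r), ""))
                          for r in ref_ids]))
```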


## <a id="common_issues"> 5. Common Issues </a>

1. `AssertionError: Torch not compiled with CUDA enabled` <br> For some reason you installed PyTorch without CUDA support. <br>
Please install PyTorch with CUDA support as explained on the [PyTorch website](https://pytorch.org/).
2. `ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: 'YOUR_PATH/espnet/tools/venv/lib/python3.9/site-packages/numpy-1.23.5.dist-info/METADATA'`. This is due to the numpy installation getting corrupted for some reason.
You can remove the numpy folders under `site-packages` manually and try to reinstall numpy 1.23.5 with pip, for example as sketched below.
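
One possible way to apply this fix (the venv path is a placeholder and must be adapted to your installation):
```bash
# Remove the corrupted numpy leftovers and reinstall the pinned version.
rm -rf YOUR_PATH/espnet/tools/venv/lib/python3.9/site-packages/numpy*
pip install numpy==1.23.5
```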

## Acknowledgements



We would like to thank Dr. Naoyuki Kamo for his invaluable help.

## <a id="reference"> 6. References </a>

[4] Chang, X., Maekaku, T., Fujita, Y., & Watanabe, S. (2022). End-to-end integration of speech recognition, speech enhancement, and self-supervised learning representation. <https://arxiv.org/abs/2204.00540> <br>
[5] Boeddeker, C., Heitkaemper, J., Schmalenstroeer, J., Drude, L., Heymann, J., & Haeb-Umbach, R. (2018, September). Front-end processing for the CHiME-5 dinner party scenario. In CHiME5 Workshop, Hyderabad, India (Vol. 1). <br>
[6] Kim, S., Hori, T., & Watanabe, S. (2017, March). Joint CTC-attention based end-to-end speech recognition using multi-task learning. Proc. of ICASSP (pp. 4835-4839). IEEE. <br>
[7] Chen, S., Wang, C., Chen, Z., Wu, Y., Liu, S., Chen, Z., ... & Wei, F. (2022). Wavlm: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing, 16(6), 1505-1518. <br>
[8] Wolf, M., & Nadeu, C. (2014). Channel selection measures for multi-microphone speech recognition. Speech Communication, 57, 170-180.

[slack-badge]: https://img.shields.io/badge/slack-chat-green.svg?logo=slack
[slack-invite]: https://join.slack.com/t/chime-fey5388/shared_invite/zt-1oha0gedv-JEUr1mSztR7~iK9AxM4HOA
