
Fixes + Channel Selection for CHiME-7 Task #4934

Merged
merged 39 commits into from
Feb 14, 2023
Changes from 34 commits
69fe371
addressing Taejin pointed out issues
popcornell Feb 10, 2023
5bf3c5a
addressing Taejin pointed out issues
popcornell Feb 10, 2023
644e58e
fixed md5sum check on original chime6 script
popcornell Feb 10, 2023
dc770b4
Merge branch 'master' of https://github.com/espnet/espnet
popcornell Feb 10, 2023
445e3ae
adding channel selection
popcornell Feb 11, 2023
b217ed7
revert
popcornell Feb 11, 2023
2518f81
revert
popcornell Feb 11, 2023
721a64a
revert
popcornell Feb 11, 2023
c4a58b2
added skip stages to asr dprep
popcornell Feb 11, 2023
f57c1d5
added flag to generate evaluation
popcornell Feb 11, 2023
447bd9d
addes contain function to data.sh
popcornell Feb 11, 2023
1f106f7
minor changes to run.sh
popcornell Feb 11, 2023
0d84fcb
with pretrained
popcornell Feb 11, 2023
263c36c
data.sh, skipping for decoding only
popcornell Feb 11, 2023
92dedbd
soundfile much faster than torchaudio
popcornell Feb 11, 2023
133dcd6
revised channel selection
popcornell Feb 11, 2023
a594319
applied linters
popcornell Feb 11, 2023
d888a38
applied linters
popcornell Feb 11, 2023
e8dc4d3
added jiwer and conda prefix
popcornell Feb 12, 2023
dd91c90
added dr kamo suggestion
popcornell Feb 12, 2023
17efdb2
changed stage
popcornell Feb 12, 2023
a531c03
better default
popcornell Feb 12, 2023
a412cc4
readme changed instructions
popcornell Feb 12, 2023
d99720d
gss2lhotse changed
popcornell Feb 12, 2023
df99724
Merge branch 'master' into chime7task1
popcornell Feb 12, 2023
ea808d1
prevent exiting on data.sh
popcornell Feb 13, 2023
4180cf7
sox is appended after
popcornell Feb 13, 2023
31292ce
data prep is needed
popcornell Feb 13, 2023
d73ddda
addressed LDC path issues with train calls and mixer6
popcornell Feb 13, 2023
930d388
changed error display
popcornell Feb 13, 2023
ebe8db9
some comments changed
popcornell Feb 13, 2023
270fad9
default is 80% mics channel selection
popcornell Feb 13, 2023
cfbb957
Merge branch 'chime7task1' of https://github.com/popcornell/espnet
popcornell Feb 13, 2023
c1abe1b
applied black
popcornell Feb 13, 2023
f28cca8
applied black
popcornell Feb 13, 2023
e87df34
added registration link to README.md
popcornell Feb 13, 2023
a49870a
added details about evaluation script
popcornell Feb 13, 2023
00308e1
added details about non determinism in GSS inference
popcornell Feb 13, 2023
5179f7a
Merge branch 'master' into chime7task1
popcornell Feb 14, 2023
Empty file modified egs/chime6/asr1/local/distant_audio_list
100644 → 100755
Empty file.
214 changes: 167 additions & 47 deletions egs2/chime7_task1/asr1/README.md
@@ -35,7 +35,8 @@ Participants can possibly exploit commonly used open-source datasets (e.g. Libri

If you are considering participating or just want to learn more then please join the <a href="https://groups.google.com/g/chime5/">CHiME Google Group</a>. <br>
We have also a [CHiME Slack Workspace][slack-invite].<br>
Follow us on [Twitter][Twitter], we will also use that to make announcements. <br>
**If you want to participate please fill this [Google form](https://forms.gle/vbk4gpF77hP5LgKM8) (one registration per team).**


## <a id="installation">2. Installation </a>
@@ -55,25 +56,20 @@ git clone https://github.com/espnet/espnet/
Next, ESPnet must be installed; go to espnet/tools. <br />
This will install a new environment called espnet with Python 3.9.2.
```bash
cd espnet/tools
./setup_anaconda.sh venv "" 3.9.2
```
Activate this new environment.
```bash
source ./venv/bin/activate
```
Then install ESPnet with PyTorch 1.13.1; be sure to specify the correct version for **cudatoolkit**.
```bash
make TH_VERSION=1.13.1 CUDA_VERSION=11.6
```
Go to tools/kaldi and follow the instructions in INSTALL for installing Kaldi.
```bash
cd ./tools/kaldi
```
Finally, go to this recipe folder and install the other required baseline packages (e.g. lhotse) using this script:
```bash
cd ../egs2/chime7_task1/asr1
./local/install_dependencies.sh
```
You should be good to go!
@@ -94,7 +90,6 @@ these are:
2. Amazon Alexa Dinner Party Corpus (DiPCO) [2]
3. LDC Mixer 6 Speech (here we use a new re-annotated version) [3]


Unfortunately, only DiPCo can be downloaded automatically; the others must be
downloaded manually and then processed via our scripts here. <br>
See [Data page](https://www.chimechallenge.org/current/task1/data) for
@@ -105,28 +100,33 @@ To generate the data you need to have downloaded and unpacked manually Mixer 6 S
and the CHiME-5 dataset as obtained from instructions here [Data page](https://www.chimechallenge.org/current/task1/data).


**Stage 0** of `run.sh` here handles CHiME-7 DASR dataset creation and calls `local/gen_task1_data.sh`. <br>
Note that DiPCo will be downloaded and extracted automatically. <br>
To **ONLY** generate the data you will need to run:

```bash
./run.sh --chime5-root YOUR_PATH_TO_CHiME5 --dipco-root PATH_WHERE_DOWNLOAD_DIPCO \
--mixer6-root YOUR_PATH_TO_MIXER6 --chime6-path PATH_WHERE_STORE_CHiME6 --stage 0 --stop-stage 0
```
If you already have the CHiME-6 data, you can use it directly without re-creating it from CHiME-5.
```bash
./run.sh --chime6-root YOUR_PATH_TO_CHiME6 --dipco-root PATH_WHERE_DOWNLOAD_DIPCO \
--mixer6-root YOUR_PATH_TO_MIXER6 --stage 0 --stop-stage 0
```
**If you want to run the recipe from data prep to ASR training and decoding**, remove the stop-stage flag instead.
Please take a look at arguments such as `gss_max_batch_dur`
and `asr_batch_size`, as you may want to adjust these based on your hardware.
```bash
./run.sh --chime6-root YOUR_PATH_TO_CHiME6 --dipco-root PATH_WHERE_DOWNLOAD_DIPCO \
--mixer6-root YOUR_PATH_TO_MIXER6 --stage 0 --ngpu YOUR_NUMBER_OF_GPUs
```
**We also provide a pre-trained model**; you can run inference only on the development set
using:
```bash
./run.sh --chime6-root YOUR_PATH_TO_CHiME6 --dipco-root PATH_WHERE_DOWNLOAD_DIPCO \
--mixer6-root YOUR_PATH_TO_MIXER6 --stage 0 --ngpu YOUR_NUMBER_OF_GPUs \
--use-pretrained popcornell/chime7_task1_asr1_baseline \
--decode_only 1
```


### <a id="data_description">2.2 Quick Data Overview</a>
The generated dataset folder after running the script should look like this:
```
@@ -158,69 +158,189 @@ chime7_task1
└── mixer6
├── audio
│   ├── dev
│   ├── train_call
│   └── train_intv
├── transcriptions
│   ├── dev
│   ├── train_call
│   └── train_intv
└── transcriptions_scoring
├── dev
├── train_call
└── train_intv
```
**NOTE**: eval directories for chime6 are empty at this stage. <br>
Eval data (which is actually blind only for Mixer 6) can be generated by passing the argument
`--gen-eval 1` (note that you will need the Mixer 6 eval set, which will be released [later](https://www.chimechallenge.org/current/dates)).
<br>

---
To find out if the data has been generated correctly, you can run this
script from this directory. <br>
If it runs successfully, your MD5 checksums are correct. <br>
The reference checksums are stored in this repo in `local/chime7_dasr_md5.json`.
```bash
python ./local/check_data_gen.py -c PATH_TO_CHIME7TASK_DATA
```
When evaluation data is released, please re-check again:
```bash
python ./local/check_data_gen.py -c PATH_TO_CHIME7TASK_DATA -e 1
```

Additional data description is available in [Data page](https://www.chimechallenge.org/current/task1/data)

## <a id="baseline">3. Baseline System</a>

The baseline system in this recipe is similar to the `egs2/chime6` one, which
itself is inherited directly from the CHiME-6 Challenge Kaldi recipe for Track 1 [s5_track1](https://github.com/kaldi-asr/kaldi/tree/master/egs/chime6/s5_track1). <br>

It is composed of two main modules, plus an optional channel selection module:
1. Guided Source Separation (GSS) [5]; here we employ the GPU-based version (much faster) from [Desh Raj](https://github.com/desh2608/gss).
2. An end-to-end ASR model based on [4], which is a transformer encoder/decoder model trained <br>
with joint CTC/attention [6]. It uses WavLM [7] as a feature extractor.
3. Optional automatic channel selection based on the Envelope Variance Measure (EV) [8].
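The EV-based selection can be sketched roughly as follows. This is a simplified illustration, not the recipe's actual implementation: the function names and framing parameters are made up here, and a plain log-STFT envelope is used, whereas the measure in [8] operates on mel sub-band envelopes with cross-channel normalization.

```python
import numpy as np


def envelope_variance(wavs, n_fft=400, hop=200, eps=1e-8):
    """Score each channel by the variance of its log sub-band envelopes.

    wavs: (n_channels, n_samples) array; assumes n_samples >= n_fft.
    A livelier envelope (higher variance) suggests a less degraded channel.
    """
    scores = []
    window = np.hanning(n_fft)
    for x in wavs:
        # simple framed magnitude STFT
        n_frames = 1 + (len(x) - n_fft) // hop
        frames = np.stack([x[i * hop : i * hop + n_fft] for i in range(n_frames)])
        spec = np.abs(np.fft.rfft(frames * window, axis=-1))  # (frames, bins)
        env = np.log(spec + eps)  # log sub-band envelopes
        # variance over time, averaged over frequency bands
        scores.append(np.mean(np.var(env, axis=0)))
    return np.asarray(scores)


def select_channels(wavs, fraction=0.8):
    """Return indices of the top `fraction` of channels by EV score."""
    scores = envelope_variance(wavs)
    k = max(1, int(round(fraction * len(wavs))))
    return np.argsort(scores)[::-1][:k]
```

With `fraction=0.8` this mirrors the top-80% default used by the baseline; the selected channels are then passed to GSS.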

### 3.1 Results

#### 3.1.1 Main Track [Work in Progress]
The main track baseline is unfortunately still a work in progress; we hope to finish it
in the next few weeks. <br>
It will be based on a TS-VAD diarization model which leverages pre-trained self-supervised
representations. <br>
We apologize for the inconvenience. <br>
The diarization output will be fed to this recipe's GSS+ASR pipeline.

#### 3.1.2 Sub-Track 1: Oracle Diarization + ASR

Dataset | **microphone** | **SA-WER** |
--------|--------|------------|
CHiME-6 dev | GSS-all | TBA |
DiPCo dev | GSS-all| TBA |
Mixer6 Speech dev | GSS-all | TBA |



Pretrained model: [popcornell/chime7_task1_asr1_baseline](https://huggingface.co/popcornell/chime7_task1_asr1_baseline) <br>
Detailed decoding results (insertions, deletions etc.) are available in the model's Hugging Face repository:
see the [channel selection log, top 25%](https://huggingface.co/popcornell/chime7_task1_asr1_baseline/blob/main/decoding_channel_selection.txt) and the [channel selection log, top 80%](https://huggingface.co/popcornell/chime7_task1_asr1_baseline/blob/main/decoding_channel_selection_80.txt).

Here we report the results obtained using channel selection (top 25% and top 80% of
all channels, respectively) prior to performing GSS and decoding with the baseline pre-trained
ASR model.

<table>
<thead>
<tr>
<th>Dataset</th>
<th>split</th>
<th>front-end</th>
<th>SA-WER (%)</th>
<th>macro SA-WER (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td>CHiME-6</td>
<td rowspan="3">dev</td>
<td>GSS (EV top 25%)</td>
<td>44.5 </td>
<td rowspan="3">37.4<br></td>
</tr>
<tr>
<td>DiPCo</td>
<td>GSS (EV top 25%)</td>
<td>40.1</td>
</tr>
<tr>
<td>Mixer-6</td>
<td>GSS (EV top 25%)</td>
<td>27.7</td>
</tr>
</tbody>
</table>


<table>
<thead>
<tr>
<th>Dataset</th>
<th>split</th>
<th>front-end</th>
<th>SA-WER (%)</th>
<th>macro SA-WER (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td>CHiME-6</td>
<td rowspan="3">dev</td>
<td>GSS (EV top 80%)</td>
<td>34.5</td>
<td rowspan="3">30.8</td>
</tr>
<tr>
<td>DiPCo</td>
<td>GSS (EV top 80%)</td>
<td>36.8</td>
</tr>
<tr>
<td>Mixer-6</td>
<td>GSS (EV top 80%)</td>
<td>21.2</td>
</tr>
</tbody>
</table>

Such a baseline system would rank fourth on the dev set based on the rules of the past CHiME-6 Challenge
for Track 1 (unconstrained LM).
The system with EV selection of the top 80% mics will also be used as the baseline for evaluation.
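For concreteness, the macro SA-WER used for ranking is simply the unweighted mean of the per-scenario SA-WERs. A minimal sketch (`macro_sa_wer` is an illustrative helper, not part of the recipe; the values are the "EV top 80%" dev results from the table above):

```python
# Macro SA-WER: unweighted mean over scenarios (CHiME-6, DiPCo, Mixer 6).
def macro_sa_wer(per_scenario):
    return sum(per_scenario.values()) / len(per_scenario)


dev_sa_wer = {"chime6": 34.5, "dipco": 36.8, "mixer6": 21.2}
print(round(macro_sa_wer(dev_sa_wer), 1))  # -> 30.8
```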


Here we present results when only the outer microphones are used on CHiME-6 and DiPCo, while Mixer 6 uses all microphones. <br>
**However, such a system would be against the rules set for this Challenge, since
manual microphone selection based on the domain amounts to domain identification.**
It is nevertheless worth reporting these results. <br>
For DiPCo, "all outer" means we used channels `2,5,9,12,16,19,23,26,30,33` (the diametrically opposite mics on each array).
<table>
<thead>
<tr>
<th>Dataset</th>
<th>split</th>
<th>front-end</th>
<th>SA-WER (%)</th>
<th>macro SA-WER (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td>CHiME-6</td>
<td rowspan="3">dev</td>
<td>GSS (all outer)</td>
<td>35.5</td>
<td rowspan="3">31.8</td>
</tr>
<tr>
<td>DiPCo</td>
<td>GSS (all outer)</td>
<td>39.3</td>
</tr>
<tr>
<td>Mixer-6</td>
<td>GSS (all)</td>
<td>20.6</td>
</tr>
</tbody>
</table>

As seen, using all microphones appears to be better for Mixer 6 Speech, but this is not true for CHiME-6 and DiPCo,
as shown by the channel selection results. <br>
One of our hopes is that participants can devise new techniques in this regard,
with better channel selection/fusion strategies.

## <a id="eval_script">4. Evaluation Script [Work in Progress]</a>

## <a id="common_issues"> 5. Common Issues </a>

1. `AssertionError: Torch not compiled with CUDA enabled` <br> For some reason you installed PyTorch without CUDA support. <br>
Please install PyTorch with CUDA support as explained on the [PyTorch website](https://pytorch.org/).
2. `ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: 'YOUR_PATH/espnet/tools/venv/lib/python3.9/site-packages/numpy-1.23.5.dist-info/METADATA'`. This is due to the numpy installation getting corrupted for some reason.
You can remove the `site-packages/numpy-` folder manually and try to reinstall numpy 1.23.5 with pip.

## Acknowledgements


[sox]:
[google_group]:
[gpu_gss]:
[gss]:

We would like to thank Dr. Naoyuki Kamo for his precious help.
> **Collaborator** (review comment): Thank you for mention me, but sorry, I don't have Ph.D. :->

> **Contributor (author)**: Oh I am sorry. I can remove it. Honestly you deserve an honorary one ;)

## <a id="reference"> 6. References </a>

@@ -230,8 +350,8 @@ Mixer6 Speech dev | GSS-all | TBA |
[4] Chang, X., Maekaku, T., Fujita, Y., & Watanabe, S. (2022). End-to-end integration of speech recognition, speech enhancement, and self-supervised learning representation. <https://arxiv.org/abs/2204.00540> <br>
[5] Boeddeker, C., Heitkaemper, J., Schmalenstroeer, J., Drude, L., Heymann, J., & Haeb-Umbach, R. (2018, September). Front-end processing for the CHiME-5 dinner party scenario. In CHiME5 Workshop, Hyderabad, India (Vol. 1). <br>
[6] Kim, S., Hori, T., & Watanabe, S. (2017, March). Joint CTC-attention based end-to-end speech recognition using multi-task learning. Proc. of ICASSP (pp. 4835-4839). IEEE. <br>
[7] Chen, S., Wang, C., Chen, Z., Wu, Y., Liu, S., Chen, Z., ... & Wei, F. (2022). Wavlm: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing, 16(6), 1505-1518. <br>
[8] Wolf, M., & Nadeu, C. (2014). Channel selection measures for multi-microphone speech recognition. Speech Communication, 57, 170-180.

[slack-badge]: https://img.shields.io/badge/slack-chat-green.svg?logo=slack
[slack-invite]: https://join.slack.com/t/chime-fey5388/shared_invite/zt-1oha0gedv-JEUr1mSztR7~iK9AxM4HOA
77 changes: 77 additions & 0 deletions egs2/chime7_task1/asr1/local/check_data_gen.py
@@ -0,0 +1,77 @@
import argparse
import glob
import hashlib
import json
import os
from pathlib import Path

import tqdm


def md5_file(fname):
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()


def glob_check(root_folder, has_eval=False, input_json=None):
    all_files = []
    for ext in [".json", ".uem", ".wav", ".flac"]:
        all_files.extend(
            glob.glob(os.path.join(root_folder, "**/*{}".format(ext)), recursive=True)
        )

    for f in tqdm.tqdm(all_files):
        # skip eval files unless eval data has been generated
        if not has_eval and Path(f).parent.name == "eval":
            continue
        digest = md5_file(f)
        if not input_json[str(Path(f).relative_to(root_folder))] == digest:
            print(
                "MD5 Checksum for {} is not the same. "
                "Data has not been generated correctly. "
                "You can retry to generate it or re-download it. "
                "If this does not work, please reach us.".format(
                    str(Path(f).relative_to(root_folder))
                )
            )


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Compute MD5 hash for each file recursively to check "
        "if the data generation and download was successful or not."
    )
    parser.add_argument(
        "-c",
        "--chime7dasr_root",
        type=str,
        metavar="STR",
        dest="chime7_root",
        help="Path to chime7dasr dataset main directory. "
        "It should contain chime6, dipco and mixer6 as sub-folders.",
    )
    parser.add_argument(
        "-e",
        "--has_eval",
        required=False,
        type=int,
        default=0,
        dest="has_eval",
        help="Whether to check also for evaluation (released later).",
    )
    parser.add_argument(
        "-i",
        "--input_json",
        type=str,
        default="local/chime7_dasr_md5.json",
        dest="input_json",
        required=False,
        help="Input JSON file to check against, containing md5 checksums for each file.",
    )
    args = parser.parse_args()
    with open(args.input_json, "r") as f:
        checksum_json = json.load(f)

    glob_check(args.chime7_root, bool(args.has_eval), checksum_json)