Tools and reference meeting transcription pipeline for the LibriWASN data set (preprint, Zenodo data set link).
The LibriWASN data set consists of recordings of the same audio signals that were also played back to record the LibriCSS data set. The data was recorded by nine different devices (five smartphones with a single recording channel each and four microphone arrays), resulting in 29 audio channels in total. Note that the sampling clocks of the different devices are not synchronized, so that there exists a sampling rate offset (SRO) between the recordings of different devices.
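To get a feeling for what such an SRO means in practice, the following rough sketch (not part of the package) computes how far two unsynchronized devices drift apart over a session. The SRO value and session length are made-up numbers for illustration, not measured properties of LibriWASN:

# Illustrative only: accumulated drift between two unsynchronized devices.
# The SRO value below is an assumed example value, not a measured one.
fs = 16000               # nominal sampling rate in Hz
sro_ppm = 50             # assumed sampling rate offset in parts per million
session_length_s = 600   # assumed session length in seconds (10 minutes)

drift = sro_ppm * 1e-6 * fs * session_length_s
print(f'Accumulated drift after {session_length_s} s: {drift:.0f} samples')
# -> 480 samples, i.e., the recordings slowly run out of sync if the SRO is not compensated.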
The data set and auxiliary materials are available on Zenodo. The following auxiliary materials are provided:
- Pictures of the recording setups
- Speaker and microphone position information
- Ground-truth diarization information of who speaks when
Clone the repository:
git clone https://github.com/fgnt/libriwasn.git
Install package:
pip install -e ./libriwasn
If you want to use the provided automatic speech recognition (ASR) system, please install the required packages:
pip install -e ./libriwasn[asr]
In order to calculate the concatenated minimum-Permutation Word Error Rate (cpWER), we utilize the meeteval package. This can be installed in the following way:
pip install cython
git clone https://github.com/fgnt/meeteval
pip install -e ./meeteval[cli]
To download the LibriWASN data set, we provide the two options stated below. Note that LibriCSS is additionally downloaded because it is used as a reference in the experiments and its transcriptions are also used for the LibriWASN data.
To download the data to your desired directory, e.g., /your/database/path/, run the following command:
python -m libriwasn.database.download -db /your/database/path/
Alternatively, use the provided download script:
- Download the file DownloadLibriWASN.sh to the path where the data should be stored, e.g., /your/database/path/.
- Adjust the permissions for execution: chmod u+x DownloadLibriWASN.sh
- Execute ./DownloadLibriWASN.sh from a shell. This will download all files, check their sanity via md5sum and extract them to /your/database/path/.
The downloaded data has the following structure relative to your desired database path:
├── LibriWASN
│   ├── aux_files  # Additional information about the LibriWASN data set
│   │   ├── LibirWASN200_Picture.png
│   │   ├── ...
│   │   └── Positions800.pdf
│   ├── libriwasn_200  # The LibriWASN^200 data set
│   │   ├── 0L  # Different overlap conditions
│   │   │   ├── <session0 folder>
│   │   │   │   ├── record
│   │   │   │   │   ├── Nexus_ntlab_asn_ID_0.wav  # (Multi-channel) audio signal(s) recorded by the different devices
│   │   │   │   │   ├── ...
│   │   │   │   │   └── Xiaomi_ntlab_asn_ID_0.wav
│   │   │   │   └── vad
│   │   │   │       └── speaker.csv  # Ground-truth diarization information
│   │   │   ├── ...
│   │   │   └── <session9 folder>
│   │   ├── ...
│   │   └── OV40
│   └── libriwasn_800  # The LibriWASN^800 data set
│       ├── 0L
│       │   ├── <session0 folder>
│       │   │   └── record
│       │   │       ├── Nexus_ntlab_asn_ID_0.wav
│       │   │       ├── ...
│       │   │       └── Xiaomi_ntlab_asn_ID_0.wav
│       │   ├── ...
│       │   └── <session9 folder>
│       ├── ...
│       └── OV40
└── LibriCSS  # The LibriCSS data set
    ├── 0L
    │   ├── <session0 folder>
    │   │   ├── clean
    │   │   │   ├── each_spk.wav  # Signals which were played back to record LibriCSS and LibriWASN
    │   │   │   └── mix.wav
    │   │   ├── record
    │   │   │   └── raw_recording.wav  # Multi-channel signal recorded by the microphone array
    │   │   └── transcription
    │   │       ├── meeting_info.txt  # Transcription used for LibriCSS and LibriWASN
    │   │       └── segments
    │   ├── ...
    │   └── <session9 folder>
    ├── ...
    ├── OV40
    ├── all_res.json
    ├── readme.txt
    └── segment_libricss.py
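For a first look at the data, a recording of a single device can, for example, be loaded with the soundfile package. This is only a usage sketch; the session folder name below is a placeholder that has to be replaced by an actual session folder from your download:

import soundfile as sf

# Placeholder path: replace <session0 folder> by an actual session folder name.
wav_path = ('/your/database/path/LibriWASN/libriwasn_200/0L/'
            '<session0 folder>/record/Nexus_ntlab_asn_ID_0.wav')

# signal has shape (num_samples,) for the single-channel smartphone recordings
# and (num_samples, num_channels) for the microphone arrays.
signal, sample_rate = sf.read(wav_path)
print(signal.shape, sample_rate)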
To run the reference system you first have to create a json file for the database:
python -m libriwasn.database.create_json -db /your/database/path/
The generated json file has the following structure:
{
    "datasets": {
        "libricss": {
            example_id: {
                "audio_path": {
                    "clean_observation": ...,  # clean speech mixture
                    "observation": ...,  # recorded multi-channel signal
                    "played_signals": ...  # clean signal per speaker
                },
                "num_samples": {
                    "clean_observation": ...,
                    "observation": ...,
                    "original_source": [..., ...],
                    "played_signals": ...
                },
                "onset": {  # Onset of utterance in samples
                    "original_source": [..., ...]
                },
                "overlap_condition": ...,  # 0L, 0S, OV10, OV20, OV30 or OV40
                "session": ...,  # session0, session1, ... or session9
                "source_id": [..., ...],
                "speaker_id": [..., ...],
                "transcription": [..., ...],
            }
        },
        "libriwasn200": {
            "audio_path": {
                "clean_observation": ...,
                "observation": {
                    "Nexus": ...,
                    "Pixel6a": ...,
                    "Pixel6b": ...,
                    "Pixel7": ...,
                    "Soundcard": ...,
                    "Xiaomi": ...,
                    "asnupb2": ...,
                    "asnupb4": ...,
                    "asnupb7": ...
                },
                "played_signals": ...
            },
            ...
        },
        "libriwasn800": {
            "audio_path": {
                "clean_observation": ...,
                "observation": {
                    "Nexus": ...,
                    "Pixel6a": ...,
                    "Pixel6b": ...,
                    "Pixel7": ...,
                    "Soundcard": ...,
                    "Xiaomi": ...,
                    "asnupb2": ...,
                    "asnupb4": ...,
                    "asnupb7": ...
                },
                "played_signals": ...
            },
            ...
        },
    }
}
Note that the onsets of the utterances as well as their lengths (num_samples) fit the recording of the Soundcard. Due to sampling rate offsets (SROs) and sampling time offsets (STOs) of the other devices w.r.t. the Soundcard, both quantities will not perfectly fit the recordings of the other devices.
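The json file can also be used directly without any helper classes. A minimal sketch, assuming that the LibriWASN entries are organized per example in the same way as the libricss entries shown above:

import json

with open('/your/database/path/libriwasn.json') as fid:
    database = json.load(fid)

# Iterate over all examples of LibriWASN^200 and print some meta information.
for example_id, example in database['datasets']['libriwasn200'].items():
    print(example_id, example['overlap_condition'], example['session'])
    # Paths to the (multi-channel) recordings of the individual devices:
    for device, path in example['audio_path']['observation'].items():
        print('   ', device, path)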
Create a segmental time mark (STM) file for the reference transcription:
python -m libriwasn.database.write_ref_stm --json_path /your/database/path/libriwasn.json
This STM file is used when calculating the concatenated minimum-Permutation Word Error Rate (cpWER).
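If you want to inspect the reference transcriptions, the STM file can simply be read line by line; each line describes one segment (NIST STM convention, roughly: file id, channel, speaker id, begin time, end time, transcript). A small sketch:

# Print the first few lines of the reference STM file for inspection.
with open('/your/database/path/ref_transcription.stm') as fid:
    for line_number, line in enumerate(fid):
        print(line.rstrip())
        if line_number >= 4:
            break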
In the following, the usage of the reference system is explained. Note that the corresponding Python scripts also serve as examples of how to use the database and the different parts of the reference system (mask estimation, beamforming, ...).
To run experiments with the reference system, change your working directory to your desired experiment directory, e.g., /your/libriwasn_experiment/path/. Afterwards, run, for example, the following command to separate the speakers' signals using Sys-2 of the experimental section of the paper on LibriWASN200:
python -m libriwasn.reference_system.separate_sources with sys2_libriwasn200 db_json=/your/database/path/libriwasn.json
This script will write the separated signals of the speakers to the directory /your/libriwasn_experiment/path/separated_sources/sys2_libriwasn200/ and create a json file (/your/libriwasn_experiment/path/separated_sources/sys2_libriwasn200/per_utt.json) which contains all information necessary to evaluate the meeting transcription performance. Note that the path of the json file is also printed to the console when the script has finished.
As a next step, the separated signals can be transcribed using the created json file as input:
python -m libriwasn.reference_system.transcribe --json_path /your/libriwasn_experiment/path/separated_sources/sys2_libriwasn200/per_utt.json
This will create an STM file containing the transcription hypotheses: /your/libriwasn_experiment/path/separated_sources/sys2_libriwasn200/stm/hyp.stm. Again, the path of the STM file is printed to the console when the script has finished.
Finally, the cpWER can be calculated:
meeteval-wer cpwer -h /your/libriwasn_experiment/path/separated_sources/sys2_libriwasn200/stm/hyp.stm -r /your/database/path/ref_transcription.stm
This command produces a json file with the average cpWER (/your/libriwasn_experiment/path/separated_sources/sys2_libriwasn200/stm/hyp_cpwer.json) and another file with a more detailed breakdown per session (/your/libriwasn_experiment/path/separated_sources/sys2_libriwasn200/stm/hyp_cpwer_per_reco.json). Finally, run the following command to get the cpWER per overlap condition:
python -m libriwasn.reference_system.wer_per_overlap_condition --json_path /your/libriwasn_experiment/path/separated_sources/sys2_libriwasn200/stm/hyp_cpwer_per_reco.json
To generate the audio data for the different experiments described in the paper, run one of the following commands (a sketch for running several of them in a row is given after the list):
Clean:
python -m libriwasn.reference_system.segment_meetings with clean db_json=/your/database/path/libriwasn.json
LibriCSS Sys-1:
python -m libriwasn.reference_system.segment_meetings with libricss db_json=/your/database/path/libriwasn.json
LibriCSS Sys-2:
python -m libriwasn.reference_system.separate_sources with sys2_libriwasn200 db_json=/your/database/path/libriwasn.json
LibriWASN200 Sys-1:
python -m libriwasn.reference_system.segment_meetings with libriwasn200 db_json=/your/database/path/libriwasn.json
LibriWASN200 Sys-2:
python -m libriwasn.reference_system.separate_sources with sys2_libriwasn200 db_json=/your/database/path/libriwasn.json
LibriWASN200 Sys-3:
python -m libriwasn.reference_system.separate_sources with sys3_libriwasn200 db_json=/your/database/path/libriwasn.json
LibriWASN200 Sys-4:
python -m libriwasn.reference_system.separate_sources with sys4_libriwasn200 db_json=/your/database/path/libriwasn.json
LibriWASN800 Sys-1:
python -m libriwasn.reference_system.segment_meetings with libriwasn800 db_json=/your/database/path/libriwasn.json
LibriWASN800 Sys-2:
python -m libriwasn.reference_system.separate_sources with sys2_libriwasn800 db_json=/your/database/path/libriwasn.json
LibriWASN800 Sys-3:
python -m libriwasn.reference_system.separate_sources with sys3_libriwasn800 db_json=/your/database/path/libriwasn.json
LibriWASN800 Sys-4:
python -m libriwasn.reference_system.separate_sources with sys4_libriwasn800 db_json=/your/database/path/libriwasn.json
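If you want to run several of the configurations listed above in a row, a simple sequential loop is sufficient. The following sketch just wraps the commands from above; the selection of configurations is only an example:

import subprocess

db_json = '/your/database/path/libriwasn.json'
configs = ['sys2_libriwasn200', 'sys3_libriwasn200', 'sys4_libriwasn200']
for config in configs:
    # Equivalent to the separate_sources commands listed above.
    subprocess.run(
        ['python', '-m', 'libriwasn.reference_system.separate_sources',
         'with', config, f'db_json={db_json}'],
        check=True,
    )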
Most scripts of the reference system can be parallelized (starting <num_processes> processes):
mpiexec -np <num_processes> python -m libriwasn.reference_system.segment_meetings ...
mpiexec -np <num_processes> python -m libriwasn.reference_system.separate_sources ...
mpiexec -np <num_processes> python -m libriwasn.reference_system.transcribe ...
To speed up the transcription system GPU-based decoding can be enabled:
python -m libriwasn.reference_system.transcribe --enable_gpu=True ...
Note that parallelization via MPI (mentioned above) is not supported for GPU-based decoding.
Minor changes were made to some parts of the code compared to the version used for the paper. This might lead to small differences in the resulting cpWERs compared to the values reported in the paper. However, these differences are not significant.
In the paper, we used Kaldi alignments to segment the clean LibriSpeech utterances in the experiment "Clean (played back LibriSpeech utterances)" if they were too long for the automatic speech recognition system (see the comment in the code). Here, we use the segmentation based on an energy-based voice activity detection (VAD), which is also used for all other systems. We also fixed a small error in the VAD-based segmentation w.r.t. the code used for the paper.
The diarization information provided via Zenodo tends to underestimate the speakers' activities. Therefore, we here adapt the utterance boundaries (see onset and num_samples in the database json file) provided by the LibriCSS data set to the LibriWASN data set in order to obtain the activities for the experiments with an oracle segmentation.
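For the oracle-segmentation experiments, per-speaker activity intervals can be derived from the onset and num_samples entries of the database json. The following is only an illustration of that idea based on the json structure shown above, not the exact code of the reference system:

import json
from collections import defaultdict

with open('/your/database/path/libriwasn.json') as fid:
    database = json.load(fid)

# Collect per-speaker activity intervals (in samples) for one LibriCSS example.
example = next(iter(database['datasets']['libricss'].values()))
activities = defaultdict(list)
for speaker, onset, num_samples in zip(
        example['speaker_id'],
        example['onset']['original_source'],
        example['num_samples']['original_source'],
):
    activities[speaker].append((onset, onset + num_samples))

for speaker, intervals in activities.items():
    print(speaker, intervals[:3])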
Moreover, we observed abrupt changes, partly in the order of 1000 samples, in the time differences of arrival (TDOAs) between the LibriCSS recordings and the clean speech mixture. These large TDOAs cannot be explained by the time differences of flight (TDOFs) and therefore may result from technical issues during the recording. Another effect of these large TDOAs is that the utterance boundaries used for the oracle segmentation experiment on LibriCSS might not always perfectly fit the LibriCSS recordings. This might result in a slightly higher cpWER compared to the LibriWASN data sets, which do not show this effect.
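The offsets mentioned above can, for example, be made visible by cross-correlating one channel of a LibriCSS recording with the corresponding clean speech mixture. The following is only a rough sketch of that idea (the paths are placeholders and a single global shift is estimated instead of per-utterance TDOAs):

import numpy as np
import soundfile as sf
from scipy.signal import correlate

# Placeholder paths: replace <session0 folder> by an actual session folder name.
recording, fs = sf.read('/your/database/path/LibriCSS/0L/<session0 folder>/'
                        'record/raw_recording.wav')
clean_mix, _ = sf.read('/your/database/path/LibriCSS/0L/<session0 folder>/'
                       'clean/mix.wav')

# Use the first channel of the array recording and a short excerpt
# to keep the correlation cheap.
excerpt = slice(0, 30 * fs)
rec = recording[excerpt, 0]
ref = clean_mix[excerpt]

# Lag (in samples) at which the recording aligns best with the clean mixture.
corr = correlate(rec, ref, mode='full')
lag = np.argmax(np.abs(corr)) - (len(ref) - 1)
print(f'Estimated shift: {lag} samples')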
If you use the LibriWASN data set or this code, please cite the following paper:
@InProceedings{SchGbHaeb2023,
  Title     = {LibriWASN: A Data Set for Meeting Separation, Diarization, and Recognition with Asynchronous Recording Devices},
  Author    = {Joerg Schmalenstroeer and Tobias Gburrek and Reinhold Haeb-Umbach},
  Booktitle = {ITG conference on Speech Communication (ITG 2023)},
  Year      = {2023},
  Month     = {Sep},
}