
add centralized data preparation for OWSM #5478

Merged: 31 commits into espnet:master on Dec 5, 2023
Conversation

@jctian98 (Contributor) commented on Oct 17, 2023

What?

This PR responds to issue #5469 by providing data.sh and related files for the OWSM recipes from v1 to v3.
The main modifications are under the <espnet_path>/egs2/owsm_v* directories.

User Guidance for Data Preparation (copied from README.md)

(1) Please work progressively from v1 to v3: you need to prepare the data for v1, v2, and v3 in order to obtain the full v3 data. To start the data preparation, run: bash local/data.sh --VERSION v1 # or v2, v3
(2) Please revise db.sh for all datasets before running local/data.sh. Some datasets cannot be downloaded and untarred automatically due to license issues; users should handle these themselves.
(3) Due to the large volume of data, we are not confident that the scripts will run smoothly for every dataset. Please raise an issue if you believe there is a bug.
(4) This script only prepares data for the train and valid subsets. Test data should be prepared separately, following the conventional ESPnet2 format.
(5) Even though we provide this centralized data preparation script and combine all datasets in it, we strongly recommend NOT using the merged train_v* and valid_v* sets for feature extraction. Instead, run stages 2-4 for each dataset separately and combine all datasets under the dump/raw directory. This will allow you to handle all datasets simultaneously, and inspection and debugging will also be easier. This is exactly what we did in our experiments.
(6) The detailed data list is in local/data.sh. Also see: https://arxiv.org/pdf/2309.13876.pdf

List of datasets

V1: Aishell, CoVoST2, GigaSpeech, LibriSpeech, MuST-C, SPGISpeech, TEDLIUM3
V2: all in V1, GigaST, Multilingual Librispeech, WenetSpeech
V3: all in V2, AIDATATANG, AMI, Babel, CommonVoice, Fisher (SwitchBoard), Fisher Callhome Spanish, FLEURS, Googlei18n, KsponSpeech, MagicData, ReazonSpeech, Russian Open STT, VCTK, VoxForge, VoxPopuli, WSJ

TODO list (future PRs will link to this PR for continuity)

(1) Extend to v4: We intend to collect more data with community efforts.
(2) Unified data collecting and processing policy for multilingual speech data (see more in discussion below).

Discussion: unified data collecting and processing policy

While working on this PR, we found that the following problems remain unsolved at this moment. We categorize them into three groups, and we intend to solve or alleviate them with a unified policy and make the solution a public tool or script in ESPnet.

Pre-processing during data preparation

(1) Wrong language-id: some language-ids in the original datasets are incorrect. E.g., English utterances appear in non-English corpora and are then labeled with that corpus's non-English language-id (seen in OpenSLR 32, 35, 52). These errors are at the utterance level. We currently don't have a good solution other than removing these datasets entirely.
(2) Language-id inclusion: some languages can be considered subsets of other languages. E.g., Mandarin and Chinese-TW can both be considered Chinese; languages with different dialects also have their own language-ids. However, since each utterance has exactly one language-id, the current data setup cannot handle this perfectly. We mainly keep them as-is.
(3) Special symbols: some datasets have transcriptions that contain special symbols like [Breath], [Laughter], etc. We intend to remove these special symbols, but for each new dataset we need to find all of them manually.
(4) Unified transcription processing: we need a consistent text-processing policy for all raw transcriptions, covering upper/lower case, wide characters, illegal characters, spaces, digit normalization, etc. We currently keep the raw transcriptions mostly as-is (except that transcriptions entirely in upper case are converted to lower case). A rough sketch of such a normalization step is given below.
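
A minimal Python sketch of the kind of normalization discussed in (3) and (4); the special-symbol inventory and the rules here are illustrative assumptions, not the final policy:

```python
import re

# Illustrative inventory only; in practice the special symbols have to be
# collected manually for each new dataset, as noted in point (3).
SPECIAL_SYMBOLS = re.compile(r"\[(?:Breath|Laughter|Noise|Cough)\]", re.IGNORECASE)

def normalize_transcript(text: str) -> str:
    """Rough sketch: strip bracketed special symbols, lower-case transcripts
    that are entirely upper case, and collapse whitespace (incl. tabs and
    wide spaces) into single ASCII spaces."""
    text = SPECIAL_SYMBOLS.sub(" ", text)
    if text.isupper():  # only all-upper-case text is lower-cased, as in point (4)
        text = text.lower()
    return " ".join(text.split())  # split() handles any Unicode whitespace

print(normalize_transcript("HELLO WORLD [Breath]\u3000OK"))  # -> "hello world ok"
```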

Data Cleaning with extra models (force-alignment, VAD, etc)

(1) Meaningless speech: some speech examples are very spontaneous and contain almost no meaningful content. These examples are usually much longer than expected and harm time-stamp prediction. This issue is observed in Babel and MagicData. A potential solution is to use force-alignment to clip them.
(2) Long-form misalignment: the current setup derives each long-form segment as the span from the start_time of its first utterance to the end_time of its last utterance (see the sketch below). However, not all audio between these two time-stamps is transcribed (e.g., some pieces in the middle are too noisy or meaningless and were discarded in the original dataset), so text and audio are sometimes not well aligned. We don't have a good policy here yet.
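
For concreteness, a tiny illustrative sketch of how the long-form boundaries described in (2) are derived; the (utt_id, start, end) layout is assumed to follow a Kaldi-style segments file:

```python
def longform_segment(utterances):
    """utterances: list of (utt_id, start_time, end_time) for one recording.
    The long-form segment spans from the first utterance's start_time to the
    last utterance's end_time; untranscribed audio in between is still
    included, which is exactly the misalignment discussed above."""
    utterances = sorted(utterances, key=lambda u: u[1])
    return utterances[0][1], max(u[2] for u in utterances)

print(longform_segment([("utt2", 12.3, 15.0), ("utt1", 3.1, 7.8)]))  # (3.1, 15.0)
```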

Postprocessing during scoring

(1) Test-time transcription: although the current OWSM model can output text with upper/lower case and punctuation, all evaluation is conducted on normalized text without punctuation (see the sketch below). This is common in ASR research, but it may be sub-optimal for OWSM-like models.
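
A minimal sketch of the kind of scoring-time normalization referred to above (assumed here to be lower-casing plus ASCII punctuation removal; the normalizer actually used in the recipes may differ):

```python
import string

def score_normalize(text: str) -> str:
    """Lower-case the text and drop punctuation before computing WER/CER."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())

print(score_normalize("Hello, world!"))  # -> "hello world"
```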

@mergify mergify bot added the ESPnet2 label Oct 17, 2023
codecov bot commented on Oct 17, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (4610653) 76.54% compared to head (ee00c6c) 76.54%.
Report is 60 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #5478   +/-   ##
=======================================
  Coverage   76.54%   76.54%           
=======================================
  Files         720      720           
  Lines       66599    66602    +3     
=======================================
+ Hits        50975    50978    +3     
  Misses      15624    15624           
Flag Coverage Δ
test_configuration_espnet2 ∅ <ø> (∅)
test_integration_espnet1 62.92% <ø> (ø)
test_integration_espnet2 50.10% <ø> (+<0.01%) ⬆️
test_python_espnet1 19.08% <ø> (+<0.01%) ⬆️
test_python_espnet2 52.38% <ø> (-0.01%) ⬇️
test_utils 22.15% <ø> (ø)

Flags with carried forward coverage won't be shown.


@pyf98 (Collaborator) commented on Oct 17, 2023

Thanks! Looking forward to v3

train_sets="data/GigaST/XL.en-* \
data/MLS/train.* \
data/WenetSpeech/L"
# question (jinchuan): why don't include GigaST-dev?
Collaborator replied:

I remember GigaST does not have DEV

Contributor (Author) replied:

Great. I'll remove this question

@sw005320 sw005320 added this to the v.202312 milestone Oct 17, 2023
@sw005320 (Contributor):

How did you deal with the wide characters, including the space symbols?
Did you keep them as they are?

@jctian98 (Contributor, Author):

A summary so far:

(1) There is a shared ./local/data.sh for all egs2/mixed_v* recipes, which provides the combined dataset for v1, v2, and v3. E.g., for v1, run bash local/data.sh --VERSION v1
(2) The script should be used progressively: the user should run v1 through v3 in order to get the full v3 data.
(3) For v3, the Babel dataset is currently absent. Dan was responsible for it; I have contacted him and will update once I get the script.
(4) For each dataset, there is a ./local/prepare_<dataset>.sh to handle its preparation. Scripts for v1 and v2 were done by @pyf98; scripts for v3 are newly added in this PR. Specifically, most datasets in v3 are processed by (1) preparing them in the original ESPnet/Kaldi format with the existing egs2/<dataset>/asr1/local/data.sh scripts and (2) transforming them into the OWSM data format with the script ./local/kaldi_to_whisper.py
(5) We have conducted some tests on our scripts, especially for all datasets included in v3. However, due to the large volume of data, some of these scripts are only partially tested (e.g., only run on the dev set).
(6) The original data.sh scripts for some tasks are not smooth (e.g., swbd, fisher_callhome_spanish), and users also need to obtain the data sources for some datasets themselves, so the whole process cannot be expected to be entirely smooth. We can make revisions as we receive users' feedback.
(7) v1 and v2 adopt the original Whisper language IDs. Since we cover more languages than Whisper, the language IDs in v3 are changed to the ISO 639-3 format (a small mapping sketch is shown below).
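
As a rough illustration of (7): the actual conversion table lives in the recipe scripts and covers all v3 languages; the hand-written mapping below is only a partial example.

```python
# Partial, illustrative mapping from Whisper-style two-letter codes
# (ISO 639-1) to the ISO 639-3 codes used in v3.
ISO_639_1_TO_3 = {
    "en": "eng", "zh": "zho", "ja": "jpn", "ko": "kor",
    "de": "deu", "fr": "fra", "es": "spa", "ru": "rus",
}

def to_iso639_3(lang_id: str) -> str:
    """Map a Whisper-style language ID (e.g. 'en') to ISO 639-3 (e.g. 'eng')."""
    return ISO_639_1_TO_3[lang_id]

print(to_iso639_3("zh"))  # -> "zho"
```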

TODO:
(1) double check the scripts and solve the CI issues
(2) update babel script

Answers to the questions above:
(1) We try to keep the original text data as-is, so we haven't done any special handling of wide characters.
(2) string.split() and " ".join() are used extensively, so multiple spaces and \t may be replaced by a single space (a quick check is shown below).
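
For reference, a quick illustrative check of how str.split() without arguments handles wide spaces and tabs:

```python
# str.split() with no argument splits on any Unicode whitespace, including
# the full-width space U+3000 and tabs, so re-joining with " " collapses
# them into single ASCII spaces.
text = "你好\u3000世界\thello   world"
print(text.split())            # ['你好', '世界', 'hello', 'world']
print(" ".join(text.split()))  # '你好 世界 hello world'
```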

@sw005320 (Contributor):

Can you add some info to https://github.com/espnet/espnet/blob/master/egs2/mixed_v3/s2t1/README.md?

> (1) We try to keep the original text data as-is, so we haven't done any special handling of wide characters.

This is risky, as each corpus has a different annotation policy (e.g., punctuation, special characters like noise tags).
Please make sure to make it consistent by taking a look at the preprocessed data for each corpus.

> (2) string.split() and " ".join() are used extensively, so multiple spaces and \t may be replaced by a single space.

What happened to the wide-character space, then? Can string.split() deal with the wide-character space?

@pyf98 (Collaborator) commented on Oct 18, 2023 via email

@sw005320 (Contributor):

> Some corpora indeed contain special white spaces. That's why I applied string.split. I think it works. After applying it, I did not see any warning or errors about space characters. But please correct me if I am wrong.

I think you're right.
string.split() seems to work on various kinds of spaces, including the wide-character space.

@jctian98 (Contributor, Author):

I think it's better to first discuss how to do the text normalization (punctuation, wide characters, multiple spaces, etc.). We also need to take care of some edge cases for audio. Once we fix the policy, we can apply it to all datasets.

I think we can write the README file after we finish the scripts, to avoid further revisions.

@mergify mergify bot added the README label Oct 23, 2023
mergify bot commented on Oct 25, 2023

This pull request is now in conflict :(

@mergify mergify bot added the conflicts label Oct 25, 2023
@kan-bayashi kan-bayashi modified the milestones: v.202310, v.202312 Oct 25, 2023
mergify bot commented on Nov 1, 2023

This pull request is now in conflict :(

@mergify mergify bot added the conflicts label Nov 1, 2023
@mergify mergify bot removed the conflicts label Nov 10, 2023
@sw005320 (Contributor):

Can you add some TODOs and discussion points here?
Since this PR includes the directory change, and to promote the OWSM activities, I want to merge it in its current state.
Then, please prepare the follow-up PRs with a copy of some of the TODOs and discussion points.

@jctian98 (Contributor, Author) commented on Nov 11, 2023

> Can you add some TODOs and discussion points here?

It's at the top of this PR. Please review it.

@mergify mergify bot added the ESPnet1 label Nov 26, 2023
@sw005320 (Contributor):

Please let me know if this PR is ready to be merged.

@jctian98 (Contributor, Author):

After the WSJ case is resolved, as discussed on Slack, I think this PR is ready to be merged.

@sw005320 sw005320 merged commit a45a53c into espnet:master Dec 5, 2023
27 checks passed
@sw005320 (Contributor) commented on Dec 5, 2023

Thanks a lot, @jctian98!
I'm looking forward to the results of v3.1 and the next iteration with v4.

@jctian98 jctian98 deleted the owsm_data branch May 17, 2024 22:54