Add Whisper SOT recipe for Librimix #5371

Merged
merged 35 commits into espnet:master on Sep 28, 2023

Conversation

LiChenda (Contributor):

What?

This PR adds the Whisper SOT-style multi-talker ASR recipe for the Librimix dataset.

Why?

To add the multi-talker Whisper recipe.

@mergify mergify bot added the ESPnet2 label Jul 25, 2023
@sw005320 sw005320 added New Features ASR Automatic speech recognition labels Jul 25, 2023
@sw005320 sw005320 added this to the v.202307 milestone Jul 25, 2023
@simpleoier (Collaborator) left a comment:

Thanks! It looks good to me in general.

--train_set "${train_set}" \
--valid_set "${valid_set}" \
--test_sets "${test_sets}" \
--lm_train_text "data/${train_set}/text_spk1 data/${train_set}/text_spk2 data/local/other_text/text" \
Collaborator:

Can we simply use ${train_set}/text, which already contains the <sc> tokens in the text?

Contributor Author (LiChenda):

LM is not used in this recipe. I'll remove it.

--valid_set "${valid_set}" \
--test_sets "${test_sets}" \
--lm_train_text "data/${train_set}/text_spk1 data/${train_set}/text_spk2 data/local/other_text/text" \
--bpe_train_text "data/${train_set}/text_spk1 data/${train_set}/text_spk2" "$@"
Collaborator:

Ditto. (I'm not sure about the LM, but for BPE it should be fine to use the text file.)

@@ -347,7 +348,7 @@ def __init__(
if bpemodel not in ["whisper_en", "whisper_multilingual"]:
converter = TokenIDConverter(token_list=token_list)
else:
-converter = OpenAIWhisperTokenIDConverter(model_type=bpemodel)
+converter = OpenAIWhisperTokenIDConverter(model_type=bpemodel, sot=sot_asr)
Collaborator:

Is it necessary to check whether <sc> is in the token list in this case? Or has that already been done in the token_id_converter class?

Contributor Author @LiChenda (Jul 26, 2023):

If sot_asr is True, the OpenAIWhisperTokenIDConverter will add the <sc> token in its init function.
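For illustration, here is a minimal sketch of that behavior; it is not the actual OpenAIWhisperTokenIDConverter implementation, and the class and attribute names are simplified stand-ins:

# Illustrative sketch only: a converter that registers the speaker-change
# token in its vocabulary at init time when SOT mode is enabled.
from typing import List

class ToySotTokenIDConverter:
    def __init__(self, base_vocab: List[str], sot: bool = False, sc_token: str = "<sc>"):
        self.token_list = list(base_vocab)
        if sot and sc_token not in self.token_list:
            # <sc> is added once here, so callers do not need an extra check.
            self.token_list.append(sc_token)
        self.token2id = {tok: i for i, tok in enumerate(self.token_list)}

    def tokens2ids(self, tokens: List[str]) -> List[int]:
        return [self.token2id[tok] for tok in tokens]

# Usage: the <sc> id exists only because sot=True was passed at construction.
conv = ToySotTokenIDConverter(["hello", "world"], sot=True)
print(conv.tokens2ids(["hello", "<sc>"]))  # -> [0, 2]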

Collaborator:

In the inference part, we can retrieve the value of sot_asr from the asr_train_args, which eliminates the need for duplicate params in both train_args and decode_args, making them independent of each other.
In most cases, I think decode_args should only contain the params specific to inference, like beam_size, lm_weight, etc., without including those related to model or tokenizer initialization.
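As an illustration of that split (the attribute and function names below are hypothetical, not ESPnet's actual ones), the inference entry point can read SOT settings from the saved training config while decode options stay limited to search parameters:

# Illustrative only: model/tokenizer settings come from the training args,
# while decode-time options carry only inference-specific knobs.
from types import SimpleNamespace

asr_train_args = SimpleNamespace(sot_asr=True, bpemodel="whisper_multilingual")

def build_inference_config(asr_train_args, beam_size: int = 5, lm_weight: float = 0.0):
    sot_asr = getattr(asr_train_args, "sot_asr", False)  # taken from training config, not decode args
    return {"sot": sot_asr, "beam_size": beam_size, "lm_weight": lm_weight}

print(build_inference_config(asr_train_args, beam_size=10))
# -> {'sot': True, 'beam_size': 10, 'lm_weight': 0.0}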

Contributor Author (LiChenda):

Thanks for the suggestion! I updated asr_inference.py, and now it gets sot_asr from asr_train_args.

@LiChenda LiChenda marked this pull request as ready for review July 27, 2023 12:23
@mergify mergify bot added the README label Jul 27, 2023
LiChenda and others added 2 commits July 27, 2023 20:23
Co-authored-by: Wangyou Zhang <C0me_On@163.com>

codecov bot commented Jul 27, 2023

Codecov Report

Merging #5371 (18903d4) into master (8a8709e) will increase coverage by 0.00%.
Report is 1 commit behind head on master.
The diff coverage is 82.69%.

@@           Coverage Diff           @@
##           master    #5371   +/-   ##
=======================================
  Coverage   77.17%   77.17%           
=======================================
  Files         684      684           
  Lines       62643    62686   +43     
=======================================
+ Hits        48343    48380   +37     
- Misses      14300    14306    +6     
Flag Coverage Δ
test_configuration_espnet2 ∅ <ø> (∅)
test_integration_espnet1 65.73% <ø> (ø)
test_integration_espnet2 49.07% <20.00%> (-0.03%) ⬇️
test_python_espnet1 19.94% <0.00%> (-0.02%) ⬇️
test_python_espnet2 52.30% <82.69%> (+0.01%) ⬆️
test_utils 23.10% <ø> (ø)

Flags with carried forward coverage won't be shown.

Files Coverage Δ
espnet2/asr/decoder/whisper_decoder.py 94.66% <100.00%> (+1.56%) ⬆️
espnet2/asr/encoder/whisper_encoder.py 80.45% <100.00%> (ø)
espnet2/bin/whisper_export_vocabulary.py 92.59% <100.00%> (+0.75%) ⬆️
espnet2/text/whisper_token_id_converter.py 87.87% <100.00%> (+2.69%) ⬆️
espnet2/text/whisper_tokenizer.py 88.23% <100.00%> (+2.52%) ⬆️
espnet2/text/build_tokenizer.py 78.37% <0.00%> (ø)
espnet2/bin/asr_inference.py 86.98% <0.00%> (-0.68%) ⬇️
espnet2/train/preprocessor.py 77.53% <16.66%> (-0.39%) ⬇️

... and 1 file with indirect coverage changes


for i in range(full_vocab_size - vocab_size):
fout.write("()" + "\n")

if sot_asr:
full_vocab_size += 1
fout.write("<sc>" + "\n")
Collaborator:

Is it possible to import custom tokens from a file, like additional_vocab? This would allow us to add any tokens as needed, rather than only adding a token for SOT training.

Contributor Author (LiChenda):

Whisper's vocabulary is not imported from a file but loaded from the whisper PyPI package. So, here I add the special token in the code.

Contributor:

I think it would be better to have at least a way to specify which special token is used as the speaker change. The hard-coded token symbol may confuse the user.

I guess this would depend on other places, so if it is difficult to make this speaker-change token a variable, we could at least add a comment in the source code and README, in an appropriate place, noting that it is a reserved token.

Contributor Author (LiChenda):

Thanks for your suggestions! Now it's configurable.
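As a rough sketch of what a configurable version could look like (this is not the actual espnet2/bin/whisper_export_vocabulary.py; the function and parameter names are hypothetical), the speaker-change symbol and any extra tokens are passed in rather than hard-coded:

# Illustrative only: pad the exported vocabulary to the model's full size and
# append a configurable speaker-change token plus optional extra tokens.
from typing import List, Optional

def export_vocabulary(
    tokens: List[str],
    full_vocab_size: int,
    out_path: str,
    sot_asr: bool = False,
    sc_token: str = "<sc>",
    additional_vocab: Optional[List[str]] = None,
):
    with open(out_path, "w", encoding="utf-8") as fout:
        for tok in tokens:
            fout.write(tok + "\n")
        # Placeholder entries up to the model's full vocabulary size.
        for _ in range(full_vocab_size - len(tokens)):
            fout.write("()" + "\n")
        if sot_asr:
            fout.write(sc_token + "\n")
        for tok in additional_vocab or []:
            fout.write(tok + "\n")

export_vocabulary(["a", "b", "c"], full_vocab_size=5, out_path="tokens.txt", sot_asr=True)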

@kan-bayashi kan-bayashi modified the milestones: v.202307, v.202312 Aug 3, 2023

mergify bot commented Aug 3, 2023

This pull request is now in conflict :(

@mergify mergify bot added the conflicts label Aug 3, 2023
@mergify mergify bot removed the conflicts label Aug 9, 2023
@sw005320 sw005320 changed the title [WIP] Add Whisper SOT recipe for Librimix Add Whisper SOT recipe for Librimix Aug 9, 2023
@sw005320 (Contributor):

Thanks, @LiChenda!
@pengchengguo, can you review it again?
Do we need to take care of the other part for the symbol?

@sw005320 (Contributor):

@pengchengguo, this is a reminder. Can you review this PR again?

@pengchengguo (Collaborator) left a comment:

I only have a few suggestions, and the rest looks good to me.

@@ -188,7 +188,7 @@ if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
done

paste -d "" \
-<(<data/${dset}/text_spk1 awk '{$0=$0" <sc>"; print($0)}') \
+<(<data/${dset}/text_spk1 awk '{$0=$0" <sc> "; print($0)}') \
Collaborator:

Xuankai and I have considered whether a space should be added after "<sc>". We don't know if it matters or how the original paper does it. Do you have any comments?

Contributor Author (LiChenda):

I think there should be no essential difference, but the second version makes the text look more natural. My uploaded pre-trained model was trained with a space after "<sc>". Do you have any technical concerns about that space?

for i in range(full_vocab_size - vocab_size):
fout.write("()" + "\n")

if sot_asr:
Collaborator:

The exported token.txt file includes 50257 normal tokens, 1501 "()" placeholders (which should be timestamps), and 1 <sc> token.
During training, the tokenizer adds an additional 1501 timestamp tokens and 1 <sc> token, as implemented in whisper_token_id_converter.py and whisper_tokenizer.py.
Although the token.txt file will not be used in practice, it is better to change the "()" placeholders to real timestamps and make the file consistent with the training process.
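For illustration only (the sizes are taken from this comment, not from the actual export script), the layout being requested lists real timestamp strings instead of placeholders, so the file matches what the tokenizer uses during training:

# Illustrative only: 50257 base tokens, 1501 real timestamp tokens, then <sc>.
base_vocab_size = 50257
timestamps = [f"<|{i * 30 / 1500:.2f}|>" for i in range(1501)]  # <|0.00|> ... <|30.00|>
token_list = [f"<token_{i}>" for i in range(base_vocab_size)] + timestamps + ["<sc>"]
assert len(token_list) == 50257 + 1501 + 1  # 51759 entries in total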

Contributor Author (LiChenda):

Thanks for your comments; I also feel that would be better. Now updated.

@sw005320 (Contributor) left a comment:

Minor comments

@@ -61,6 +68,16 @@ def __init__(self, model_type: str, language: str = "en"):
else:
raise ValueError("tokenizer unsupported:", model_type)

self.tokenizer = copy.deepcopy(self.tokenizer)
timestamps = [f"<|{i*30/1500:.2f}|>" for i in range(0, 1501)]
Contributor:

I'm assuming that 30 comes from Whisper's 30-second segment length and 1500 comes from the 20 ms shift.
It's okay to embed such numbers, but it would be great to add some comments (logging?) and also make them variables inside the function.
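A hedged sketch of that suggestion (the variable names are illustrative, not the actual ESPnet code):

# Illustrative only: name the magic numbers so their origin is documented.
SEGMENT_LENGTH_SEC = 30      # Whisper processes audio in 30-second segments
NUM_TIMESTAMP_STEPS = 1500   # 30 s / 0.02 s frame shift = 1500 steps

timestamps = [
    f"<|{i * SEGMENT_LENGTH_SEC / NUM_TIMESTAMP_STEPS:.2f}|>"
    for i in range(NUM_TIMESTAMP_STEPS + 1)  # include both <|0.00|> and <|30.00|>
]
assert timestamps[0] == "<|0.00|>" and timestamps[-1] == "<|30.00|>"
assert len(timestamps) == 1501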

Contributor Author (LiChenda):

Thank you for pointing it out! I updated the code according to your comments.

@sw005320 sw005320 merged commit 522fb13 into espnet:master Sep 28, 2023
25 checks passed
Labels
ASR Automatic speech recognition ESPnet2 New Features README

6 participants