Dear authors,
Thanks for the great work. I have two questions about the paper.
In Section 4.1 about the experimental setup, it's written: For both AVSR and AVST, we use an English AV-HuBERT large pre-trained model [3], which is trained on the combination of LRS3-TED [8] and the English portion of VoxCeleb2 [27]. We follow [3] for fine-tuning hyper-parameters, except that we fine-tune our bilingual models to 30K updates and our multilingual AVSR model to 90K updates.
I would like to ask: how many warmup_steps, hold_steps, and decay_steps did you use, and how many freeze_finetune_updates did you set? The original configuration file for the large model uses 60k updates, so these hyperparameters may need to be adjusted if max_updates is reduced to 30k.
My second question is about punctuation removal and lowercasing before calculating WER. I also noticed some special tokens in the dictionary, e.g. the music token ♪. Which tokens did you remove, and how?
I'm looking forward to your reply. Thank you in advance :)
Best regards,
Zhengyang
For your reference, the following are the answers to your questions:
how many warmup_steps?
10,000 steps
how many hold_steps?
always 0
how many decay_steps?
20,000 steps
And how many freeze_finetune_updates did you set?
The non-English models used 4,000 steps out of 30,000 total updates, and the English model used 24,000 steps out of 90,000 total updates.
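For anyone adapting the config, here is a minimal sketch of how these values could be expressed as overrides on the large-model fine-tuning YAML. The field names follow fairseq's tri_stage LR scheduler and the AV-HuBERT fine-tuning configs; the exact nesting and all remaining fields should be taken from the original configuration file, so treat this as an illustration rather than the authors' exact setup.

```yaml
# Sketch of the relevant overrides for a 30K-update bilingual fine-tuning run.
# Only the fields discussed above are shown; everything else comes from the
# original large-model fine-tuning config.
optimization:
  max_update: 30000        # 90,000 for the 90K-update run

lr_scheduler:
  _name: tri_stage
  warmup_steps: 10000      # linear warm-up for the first 10K updates
  hold_steps: 0            # no constant-LR phase
  decay_steps: 20000       # decay over the remaining 20K updates

model:
  freeze_finetune_updates: 4000   # 24,000 for the 90K-update English model
```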
My second question is about punctuation removal and lowercasing before calculating WER. I also noticed some special tokens in the dictionary, e.g. the music token ♪. Which tokens did you remove, and how?
Yes, we used Fairseq's WerScorer, which removes punctuation and lowercases the text.
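For readers who want to check the scoring outside fairseq, here is a minimal Python sketch of the same normalization idea: lowercase everything, strip punctuation and symbol characters (which also drops tokens like ♪), then compute word-level edit distance. It approximates what WerScorer does rather than reproducing the exact fairseq code, and the function names are just for illustration.

```python
# Approximate the normalization applied before WER: lowercase, drop
# punctuation/symbol characters, then compute word-level Levenshtein distance.
import unicodedata


def normalize(text: str) -> str:
    text = text.lower()
    # Drop every character whose Unicode category is punctuation (P*) or
    # symbol (S*); this removes ',', '.', '!' as well as tokens like '♪'.
    kept = [c for c in text if unicodedata.category(c)[0] not in ("P", "S")]
    return " ".join("".join(kept).split())


def word_error_rate(ref: str, hyp: str) -> float:
    r, h = normalize(ref).split(), normalize(hyp).split()
    # Standard single-row Levenshtein DP over words.
    d = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        prev, d[0] = d[0], i
        for j, hw in enumerate(h, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (rw != hw))
    return d[len(h)] / max(len(r), 1)


print(word_error_rate("Hello, world! ♪", "hello world"))  # -> 0.0
```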
I hope I answered all of your questions. I'm gonna close this for now, but feel free to re-open it when needed.