
Questions about hyper-parameters and token post-processing #10

Closed
joyolee opened this issue May 2, 2023 · 1 comment

joyolee commented May 2, 2023

Dear authors,
thanks for the great work. I have two questions about the paper.
In Section 4.1 about the experimental setup, it's written:
For both AVSR and AVST, we use an English AV-HuBERT large pre-trained model [3], which is trained on the combination of LRS3-TED [8] and the English portion of VoxCeleb2 [27]. We follow [3] for fine-tuning hyper-parameters, except that we fine-tune our bilingual models to 30K updates and our multilingual AVSR model to 90K updates.

I would like to ask: how many warmup_steps, hold_steps, and decay_steps did you use, and how many freeze_finetune_updates did you set? The original configuration file for the large model has 60k updates, so these hyper-parameters may need to be changed if max_updates is reduced to 30k.

The second question is about punctuation removal and lowercasing before calculating WER. I also observed some special tokens in the dictionary, e.g. the music token ♪. Which tokens did you remove, and how?

I'm looking forward to your reply and thank you in advance :)

Best regards,
Zhengyang

@Anwarvic added the question label on May 3, 2023.
@Anwarvic (Contributor) commented:
Hi @joyolee.

Sorry for the late reply!

My team and I recently added the training and decoding scripts, so feel free to check them. All hyper-parameters used for fine-tuning can be found in this YAML configuration file; however, we had to change a few parameters, as shown in the training script.

For your reference, here are the answers to your questions:

how many warmup_steps?

10,000 steps

how many hold_steps?

always 0

how many decay_steps?

20,000 steps

And how many freeze_finetune_updates did you set?

Non-English models used 4,000 steps out of 30,000 total steps, and the English model used 24,000 steps out of 90,000 total steps.
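To make the schedule concrete, here is a minimal sketch of a tri-stage learning-rate schedule with those numbers (warmup 10,000 + hold 0 + decay 20,000 = the 30,000-update bilingual fine-tuning budget). The peak learning rate and the init/final scale factors below are placeholder values chosen only for illustration, not the ones from the released config:

```python
import math

# Placeholder values for illustration only -- not taken from the paper or config.
PEAK_LR = 1e-3
INIT_LR_SCALE = 0.01
FINAL_LR_SCALE = 0.05

WARMUP_STEPS = 10_000
HOLD_STEPS = 0
DECAY_STEPS = 20_000


def tri_stage_lr(step: int) -> float:
    """Learning rate at a given update under a tri-stage schedule:
    linear warm-up, optional hold at the peak, then exponential decay."""
    init_lr = PEAK_LR * INIT_LR_SCALE
    final_lr = PEAK_LR * FINAL_LR_SCALE

    if step < WARMUP_STEPS:
        # Linear ramp from init_lr up to the peak.
        return init_lr + (PEAK_LR - init_lr) * step / WARMUP_STEPS
    step -= WARMUP_STEPS

    if step < HOLD_STEPS:
        # Stay at the peak (skipped here since hold_steps = 0).
        return PEAK_LR
    step -= HOLD_STEPS

    if step < DECAY_STEPS:
        # Exponential decay from the peak down to final_lr over decay_steps.
        decay_factor = -math.log(FINAL_LR_SCALE) / DECAY_STEPS
        return PEAK_LR * math.exp(-decay_factor * step)

    # After the decay phase, keep the final learning rate.
    return final_lr


if __name__ == "__main__":
    for s in (0, 5_000, 10_000, 20_000, 30_000):
        print(s, f"{tri_stage_lr(s):.2e}")
```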

The second question is about punctuation removal and lowercasing before calculating WER. I also observed some special tokens in the dictionary, e.g. the music token ♪. Which tokens did you remove, and how?

Yes, we used Fairseq's WerScorer, which removes punctuation and lower-cases the text.
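For context, here is a minimal sketch of that kind of post-processing before scoring: lower-case, strip punctuation, then compute a word-level edit distance. This is only an illustration, not Fairseq's actual WerScorer implementation (its tokenizer and normalisation options may differ in detail); note that ♪ belongs to a Unicode symbol category rather than a punctuation category, so a punctuation-only filter like the one below would keep it:

```python
import unicodedata


def normalize(text: str) -> str:
    """Lower-case the text and drop punctuation characters (Unicode category P*).

    Illustrative normalisation only; the real WerScorer applies its own
    tokenizer and optional punctuation removal / lower-casing."""
    text = text.lower()
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))


def wer(ref: str, hyp: str) -> float:
    """Word error rate via word-level Levenshtein distance on normalised text."""
    r, h = normalize(ref).split(), normalize(hyp).split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / max(len(r), 1)


print(wer("Hello, world!", "hello world"))  # 0.0 after normalisation
```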

I hope I answered all of your questions. I'm going to close this for now, but feel free to re-open it when needed.
