[INPUT] Text (or Speech) Length of Blaser 2.0 #22

Open
foreveronehundred opened this issue Mar 28, 2024 · 1 comment

Comments

@foreveronehundred

As far as I can tell, Blaser 2.0 places no hard limit on text (or speech) length for translation quality estimation. However, I doubt the estimate stays accurate when the text (or speech) is very long.

So, what text length and speech length (of source, reference, and hypothesis) do you recommend?

@avidale
Contributor

avidale commented Apr 16, 2024

SONAR was trained as a sentence encoder, so it expects the source, reference, and hypothesis to be single sentences.

Speech (or transcribed text without punctuation) is not always easy to segment into sentences; as a rule of thumb, keep each segment to 30 seconds at most.
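As an illustration of the guidance above, here is a minimal pre-processing sketch, not part of SONAR or BLASER. It naively splits punctuated text into sentences and flags speech segments longer than the 30-second rule of thumb. The helper names (`split_into_sentences`, `is_speech_short_enough`) and the regex-based splitting are assumptions made for this example; a proper sentence segmenter would handle abbreviations and other edge cases better.

```python
import re

# Rule of thumb from the comment above; a guideline, not a hard limit.
MAX_SPEECH_SECONDS = 30.0


def split_into_sentences(text: str) -> list[str]:
    """Naively split punctuated text after '.', '!' or '?' followed by whitespace.

    Hypothetical helper: it will wrongly split abbreviations such as "Dr. Smith".
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def speech_duration_seconds(num_samples: int, sample_rate: int) -> float:
    """Duration of a mono waveform given its sample count and sample rate."""
    return num_samples / sample_rate


def is_speech_short_enough(
    num_samples: int,
    sample_rate: int,
    max_seconds: float = MAX_SPEECH_SECONDS,
) -> bool:
    """Return True if the segment fits within the rule-of-thumb duration."""
    return speech_duration_seconds(num_samples, sample_rate) <= max_seconds


if __name__ == "__main__":
    for sentence in split_into_sentences("Hello world. How are you? Fine!"):
        print(sentence)
    # A 31-second clip at 16 kHz exceeds the rule of thumb.
    print(is_speech_short_enough(16_000 * 31, 16_000))
```

Note that splitting the source, reference, and hypothesis independently can yield different sentence counts on each side, so the resulting sentences would still need to be aligned before being scored as triples.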

Also, please check out the discussions of input lengths in the Seamless Communication repo (https://github.com/search?q=repo%3Afacebookresearch%2Fseamless_communication+length&type=issues), because Seamless and SONAR models were trained on similar tasks and similar data.
