# Bug in SeamlessM4T-Large: t2tt when target is 'yue' #64
Hello, how do I solve this problem?

---

Try:

```shell
pip install git+https://github.com/facebookresearch/seamless_communication
```

if you are in a Jupyter notebook or Colab.

---

Where can I see the abbreviations for the languages, similar to `eng`?

---

ValueError: …

---

In the paper and in the code in this repo 😀 Basically, they are ISO 639-3 codes.
You can also check the list at https://github.com/facebookresearch/seamless_communication/tree/main/scripts/m4t/predict#supported-languages
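For a quick local sanity check before calling `predict`, a code can be pattern-matched against the expected shape. The regex below is my own guess at that shape (three lowercase ISO 639-3 letters, optionally followed by an ISO 15924 script tag such as `_Hant`); it is not validation logic that the repo itself ships:

```python
import re

# Assumed shape of SeamlessM4T language codes: an ISO 639-3 code
# (three lowercase letters), optionally followed by an ISO 15924
# script suffix such as "_Hant". This pattern is an assumption,
# not the repo's own validation logic.
M4T_CODE = re.compile(r"[a-z]{3}(_[A-Z][a-z]{3})?")

def looks_like_m4t_code(code: str) -> bool:
    """Cheap sanity check before passing a code to Translator.predict."""
    return bool(M4T_CODE.fullmatch(code))

print(looks_like_m4t_code("eng"))       # True
print(looks_like_m4t_code("cmn_Hant"))  # True
print(looks_like_m4t_code("zh-TW"))     # False: not an ISO 639-3 code
```

Note that matching the shape does not mean the model supports the language; the README list is still the authoritative reference.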
There is some inconsistency between this and the YAML files for the medium and large models:

```python
translated_text, _, _ = translator_medium.predict(to_translate, "t2tt", 'zho_Hant', src_lang='eng')
print(translated_text)
translated_text, _, _ = translator_medium.predict(to_translate, "t2tt", 'yue', src_lang='eng')
print(translated_text)
```

Results:

Some glyphs in …

---
That's correct, the list in the README is for the large model only. The medium model's tokenizer actually supports more languages (it's the same tokenizer as NLLB-200, with the languages listed in https://github.com/facebookresearch/fairseq/blob/nllb/examples/nllb/modeling/scripts/flores200/langs.txt).

---
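To check whether a given code is in that NLLB-200 list, the downloaded `langs.txt` can be parsed directly. The helper below assumes the file is a comma- and/or newline-separated list of FLORES-200 codes (an assumption about the file format; the sample string is illustrative only, not copied from the actual file):

```python
def parse_langs(raw: str) -> set[str]:
    """Parse a langs list that may be comma- and/or newline-separated.

    Format is an assumption; inspect the real langs.txt to confirm.
    """
    return {code.strip() for code in raw.replace("\n", ",").split(",") if code.strip()}

# Illustrative excerpt only; download the real langs.txt from the
# fairseq nllb branch and pass its contents here instead.
sample = "eng_Latn,yue_Hant,zho_Hans,zho_Hant"
langs = parse_langs(sample)
print("yue_Hant" in langs)  # True (for this sample)
```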
Nope, something must be wrong. From what is returned by …

---
Thanks @freedomtan for your observations. The major differences between the medium and large models are: …

---
@elbayadm, to summarise what I know: I guess you do get better chrF++ scores, but something at the character/glyph level is wrong.

---
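chrF++ is computed over character (and word) n-grams, so it is sensitive to exactly the Simplified-vs-Traditional glyph mismatch under discussion: every differing glyph costs n-gram overlap. Below is a minimal pure-Python sketch of the character-n-gram part only (my own simplification, not true chrF++, which also uses word n-grams; use sacrebleu for real scoring):

```python
from collections import Counter

def _char_ngrams(text: str, n: int) -> Counter:
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf_sketch(hyp: str, ref: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Average character-n-gram F-beta over n = 1..max_n.

    A simplified stand-in for chrF; not the full chrF++ metric.
    """
    scores = []
    for n in range(1, max_n + 1):
        h, r = _char_ngrams(hyp, n), _char_ngrams(ref, n)
        if not h or not r:
            continue  # string too short for this n-gram order
        overlap = sum((h & r).values())
        prec = overlap / sum(h.values())
        rec = overlap / sum(r.values())
        if prec + rec == 0:
            scores.append(0.0)
            continue
        b2 = beta * beta
        scores.append((1 + b2) * prec * rec / (b2 * prec + rec))
    return 100.0 * sum(scores) / len(scores) if scores else 0.0

# Identical strings score 100; a Simplified hypothesis scored against a
# Traditional reference loses overlap on every glyph that differs.
print(chrf_sketch("太陽唔會喺地平線上升", "太陽唔會喺地平線上升"))  # 100.0
print(chrf_sketch("太阳不会在地平线上升", "太陽唔會喺地平線上升"))
```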
@freedomtan

```python
import torch
from seamless_communication.models.inference import Translator

translator_medium = Translator("seamlessM4T_medium", "vocoder_36langs", torch.device("cuda:0"), torch.float16)
translator_large = Translator("seamlessM4T_large", "vocoder_36langs", torch.device("cuda:0"), torch.float16)

message_to_translate = "If you visit the Arctic or Antarctic areas in the winter you will experience the polar night, which means that the sun doesn't rise above the horizon."

translated_text, _, _ = translator_medium.predict(message_to_translate, "t2tt", 'cmn_Hant', src_lang='eng')
print(f'from medium model: {translated_text}')
translated_text, _, _ = translator_large.predict(message_to_translate, "t2tt", 'cmn_Hant', src_lang='eng')
print(f'from large model: {translated_text}')
```

from medium model: 如果你喺冬天去訪北極或者南極, 你會經歷極夜, 意思係太陽唔會喺地平線上升.

And:

```python
translated_text, _, _ = translator_medium.predict(message_to_translate, "t2tt", 'yue', src_lang='eng')
print(f'from medium model: {translated_text}')
translated_text, _, _ = translator_large.predict(message_to_translate, "t2tt", 'yue', src_lang='eng')
print(f'from large model: {translated_text}')
```

from medium model: 如果你喺冬天去北極或者南極, 你會發現北極嘅夜晚, 即係話太陽唔會喺地平線上升.

AFAICT:

- … is True even for the FLORES example.
- This one looks like it is correctly translated into Cantonese.

---
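The two failure modes here (wrong script, wrong language) can be spot-checked separately with a crude character-set heuristic over the outputs. The character lists below are hand-picked by me and far from exhaustive; a real check should use a proper tool such as OpenCC or the hanzidentifier package:

```python
# Hand-picked characters whose Simplified and Traditional forms differ,
# plus common written-Cantonese particles. Illustrative lists only,
# not exhaustive and not from any official source.
SIMPLIFIED_ONLY  = set("后发现么说这会对长门经历阳")
TRADITIONAL_ONLY = set("後發現麼說這會對長門經歷陽")
CANTONESE_MARKERS = set("喺唔嘅咗啲")

def script_guess(text: str) -> str:
    """Guess Simplified vs Traditional by counting marker characters."""
    s = sum(c in SIMPLIFIED_ONLY for c in text)
    t = sum(c in TRADITIONAL_ONLY for c in text)
    if s > t:
        return "Simplified"
    if t > s:
        return "Traditional"
    return "unclear"

def looks_cantonese(text: str) -> bool:
    """True if any common written-Cantonese particle appears."""
    return any(c in CANTONESE_MARKERS for c in text)

out = "如果你喺冬天去北極或者南極, 你會發現北極嘅夜晚, 即係話太陽唔會喺地平線上升."
print(script_guess(out))     # Traditional
print(looks_cantonese(out))  # True
```

Running this over the large model's `yue` output should flag the reported bug as `("Simplified", False)` if the issue description is accurate.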
@elbayadm Thanks for taking the time to check this issue.

Yes, it turned out that the …

```python
to_translate_1 = "The forces of Syria's president, Bashar al-Assad, fight back."
to_translate_2 = "The forces of Syria's president, Bashar al-Assad, fight back soon."

translated_text, _, _ = translator_large.predict(to_translate_1, "t2tt", 'yue', src_lang='eng')
print(f'from large model 1: {translated_text}')
translated_text, _, _ = translator_large.predict(to_translate_2, "t2tt", 'yue', src_lang='eng')
print(f'from large model 2: {translated_text}')
```

The result of …

---
When the `seamlessM4T_large` model is used for `t2tt` with `tgt_lang='yue', src_lang='eng'`, the returned results are in Mandarin with Simplified Han glyphs (the expected results are in Cantonese with Traditional Han glyphs). The results …