Language abbreviation error #27

Closed
lovishmadaan opened this issue Jan 27, 2019 · 2 comments

Comments

@lovishmadaan

I tried to get embeddings for Hindi using ./embed.sh ${LASER}/data/tatoeba/v1/tatoeba.hin-eng.hin hi check.raw, but I got the following output on my console:

 - Encoder: loading ${LASER}/models/bilstm.93langs.2018-12-26.pt
 - Tokenizer:  in language hi
WARNING: No known abbreviations for language 'hi', attempting fall-back to English version...
 - fast BPE: processing tok
 - Encoder: bpe to check.raw
 - Encoder: 1000 sentences in 17s

I also tried other abbreviations such as hn, hin, and hindi, but none of them worked.

Also, if you can provide the mapping from each language to its abbreviation, that would be helpful.

@hoschwenk
Contributor

This is actually not an error but just a warning from the Moses tokenizer.
We do not use a specific tokenization for Hindi and simply apply the default English tokenizer (which may be useful for numbers etc.).
"hi" is the correct two-letter ISO code for Hindi. Your output should be correct.

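To illustrate the fallback described above, here is a minimal Python sketch of the logic, not the actual Moses code. The function name resolve_abbreviation_language and the set LANGS_WITH_ABBREVIATIONS are hypothetical, and the set is only a small illustrative subset; the real tokenizer decides this by looking for a per-language abbreviation file. The warning text is copied from the console output in the original report.

```python
# Hypothetical, illustrative subset of languages that ship with their own
# abbreviation (nonbreaking prefix) rules. This is an assumption for the
# sketch, not the full list used by the Moses tokenizer.
LANGS_WITH_ABBREVIATIONS = {"en", "de", "fr", "es"}


def resolve_abbreviation_language(lang, known=LANGS_WITH_ABBREVIATIONS):
    """Return (language_actually_used, warning_or_None).

    If no abbreviation rules exist for `lang`, fall back to English and
    report a warning instead of failing.
    """
    if lang in known:
        return lang, None
    warning = (
        f"WARNING: No known abbreviations for language '{lang}', "
        "attempting fall-back to English version..."
    )
    return "en", warning


if __name__ == "__main__":
    for code in ("en", "hi"):
        used, warning = resolve_abbreviation_language(code)
        if warning:
            print(warning)
        print(f"{code}: tokenizing with '{used}' abbreviation rules")
```

The point of the sketch is that the fallback only changes which abbreviation rules the tokenizer applies. The language code itself is still accepted and processing continues, which matches the run above finishing with 1000 sentences encoded.
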
@dilanSachi

Hi, I get the same response when trying to generate embeddings for Sinhala. At first I thought the abbreviation I used was incorrect. Now that I see this, it means that using "sin" as the abbreviation for Sinhala is correct. Am I right?

This issue was closed.