Language abbreviation error #27

lovishmadaan · 2019-01-27T12:50:50Z

I tried to get the embeddings for Hindi using ./embed.sh ${LASER}/data/tatoeba/v1/tatoeba.hin-eng.hin hi check.raw, but I got the following output on my console:

 - Encoder: loading ${LASER}/models/bilstm.93langs.2018-12-26.pt
 - Tokenizer:  in language hi
WARNING: No known abbreviations for language 'hi', attempting fall-back to English version...
 - fast BPE: processing tok
 - Encoder: bpe to check.raw
 - Encoder: 1000 sentences in 17s

I also tried different abbreviations like hn, hin, hindi, etc., but none of them worked.

Also, if you can provide the mappings for all the languages to their respective abbreviations, that would be helpful.

hoschwenk · 2019-01-27T19:15:51Z

This is actually not an error, just a warning from the Moses tokenizer.
We do not use a specific tokenization for Hindi and simply apply the default English tokenizer (which may be useful for numbers etc.).
"hi" is the correct ISO2 code for Hindi. Your output should be correct.
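
To illustrate the fall-back described above, here is a minimal sketch in Python. It is not the Moses tokenizer's actual code: the function name resolve_abbreviation_lang and the KNOWN_ABBREVIATION_LANGS set are hypothetical, and the set is only an illustrative subset. The point is that an unrecognized code such as "hi" produces a warning and English rules, not a failure.

```python
# Hypothetical sketch of the fall-back behaviour; not taken from Moses or LASER.
# Illustrative subset of languages assumed to ship an abbreviation list.
KNOWN_ABBREVIATION_LANGS = {"en", "de", "fr"}


def resolve_abbreviation_lang(lang, known=KNOWN_ABBREVIATION_LANGS, fallback="en"):
    """Return (language whose abbreviation rules are used, warning or None)."""
    if lang in known:
        return lang, None
    # Same wording as the warning shown in the console output above.
    warning = (
        f"WARNING: No known abbreviations for language '{lang}', "
        "attempting fall-back to English version..."
    )
    return fallback, warning


if __name__ == "__main__":
    used, warning = resolve_abbreviation_lang("hi")
    if warning:
        print(warning)
    print(f"abbreviation rules used: {used}")
```

Under these assumptions, tokenization continues with English rules and the embedding step still runs, which matches the console output in the original report.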

dilanSachi · 2020-03-24T15:20:33Z

Hi, I get the same response when trying to generate embeddings for the Sinhala language. At first I thought the abbreviation I used was incorrect. Now that I see this, does it mean that using "sin" as the abbreviation for Sinhala is correct? Am I right?

hoschwenk closed this as completed Jan 27, 2019
