I tried to get the embeddings for Hindi using ./embed.sh ${LASER}/data/tatoeba/v1/tatoeba.hin-eng.hin hi check.raw, but I got the following output on my console:
- Encoder: loading ${LASER}/models/bilstm.93langs.2018-12-26.pt
- Tokenizer: in language hi
WARNING: No known abbreviations for language 'hi', attempting fall-back to English version...
- fast BPE: processing tok
- Encoder: bpe to check.raw
- Encoder: 1000 sentences in 17s
I also tried other abbreviations such as hn, hin, and hindi, but none of them worked.
Also, if you can provide the mapping from each language to its abbreviation, that would be helpful.
This is actually not an error but just a warning from the Moses tokenizer.
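For readers wondering where the warning comes from, here is a minimal Python sketch of this kind of fall-back. It is an illustration only, not the Moses code: the function name `pick_prefix_file`, the `prefix_dir` argument, and the `nonbreaking_prefix.<lang>` file naming are assumptions made for the example.

```python
import os
import sys


def pick_prefix_file(lang, prefix_dir, fallback="en"):
    """Return (path, used_fallback) for the abbreviation list to load.

    Hypothetical helper: looks for a per-language abbreviation file and
    reuses the English one when none exists for `lang`.
    """
    candidate = os.path.join(prefix_dir, "nonbreaking_prefix." + lang)
    if os.path.isfile(candidate):
        return candidate, False

    # No list for this language: warn, then continue with the English list.
    # Tokenization still runs; only the abbreviation handling is generic.
    sys.stderr.write(
        "WARNING: No known abbreviations for language '%s', "
        "attempting fall-back to English version...\n" % lang
    )
    return os.path.join(prefix_dir, "nonbreaking_prefix." + fallback), True
```

The point of the sketch is that the warning path still returns a usable file, so processing continues instead of stopping.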
We do not use a specific tokenization for Hindi and simply apply the default English tokenizer (which may be useful for numbers etc.).
"hi" is the correct ISO 639-1 (two-letter) code for Hindi. Your output should be correct.
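One way to confirm that the run produced usable output is to check the size of check.raw. The sketch below assumes the file holds raw float32 vectors of dimension 1024, as described in the LASER README; `count_embeddings` and `read_embedding` are hypothetical helper names, and only the Python standard library is used.

```python
import os
from array import array

DIM = 1024           # assumed embedding dimension
BYTES_PER_FLOAT = 4  # float32


def count_embeddings(path, dim=DIM):
    """Return the number of vectors in a raw float32 embedding file."""
    size = os.path.getsize(path)
    row_bytes = dim * BYTES_PER_FLOAT
    if size % row_bytes != 0:
        raise ValueError(
            "file size %d is not a multiple of %d bytes per vector"
            % (size, row_bytes)
        )
    return size // row_bytes


def read_embedding(path, index, dim=DIM):
    """Read the vector at position `index` (0-based) as a list of floats."""
    vec = array("f")
    with open(path, "rb") as fh:
        fh.seek(index * dim * BYTES_PER_FLOAT)
        vec.fromfile(fh, dim)  # raises EOFError if the file is too short
    return vec.tolist()


if __name__ == "__main__" and os.path.exists("check.raw"):
    print(count_embeddings("check.raw"), "embeddings found")
```

For the run above, which reports 1000 sentences, `count_embeddings("check.raw")` should return 1000 if the float32, 1024-dimension assumption holds.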
Hi, I get the same warning when trying to generate embeddings for Sinhala. At first I thought the abbreviation I used was incorrect. Now that I see this, does it mean that using "sin" as the abbreviation for Sinhala is correct? Am I right?