This repository has been archived by the owner on Oct 31, 2023. It is now read-only.
My training language pair is zh-en, and as we all know, they are unrelated.
According to my understanding, I did the following things:
First, apply BPE to the two languages separately.
Second, run fastText on each language separately to get the embeddings zh.vec and en.vec.
Third, train MUSE on zh.vec and en.vec to get vectors-zh.txt, vectors-en.txt, and vectors-zh-en.txt. vectors-zh.txt and vectors-en.txt are aligned by MUSE; vectors-zh-en.txt is formed by concatenating vectors-zh.txt and vectors-en.txt.
Is there any problem with what I did above?
Should I put the two language embeddings together when running the code? Which of the following input arguments should I choose, 1) or 2)?
I would suggest trying the following: concatenate the En and Zh corpora (in equal proportion), shuffle the output, and train fastText on that. This provides higher-quality cross-lingual embeddings when the languages are related. Here they are not related, but you may still have some English words in your Chinese corpus, which may help the alignment. Then you can use 2).
Otherwise, if the above does not work, go back to 1): train your embeddings separately and align them with MUSE, as you did.
It's not easy to say which of these two is best; it is really empirical and will depend on the vocabulary overlap between your languages (which is not necessarily negligible, even for en-zh).
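The corpus-mixing suggestion above can be sketched as follows. `mix_corpora` is a hypothetical helper, not part of any codebase; the real pipeline would write the mixed lines to a file and train fastText on it.

```python
import random

def mix_corpora(lines_a, lines_b, seed=0):
    """Combine two monolingual corpora in equal proportion and shuffle.

    Truncates the larger corpus so both sides contribute the same
    number of lines, then shuffles the union so fastText sees the two
    languages interleaved rather than as two contiguous blocks.
    """
    n = min(len(lines_a), len(lines_b))
    mixed = lines_a[:n] + lines_b[:n]
    rng = random.Random(seed)  # fixed seed for a reproducible shuffle
    rng.shuffle(mixed)
    return mixed
```

Training fastText on the resulting file yields a single joint embedding file covering both vocabularies, i.e. the input for option 2).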
1) `--share_lang_emb False --share_output_emb False --pretrained_emb 'vectors-zh.txt,vectors-en.txt'`
2) `--share_lang_emb True --share_output_emb True --pretrained_emb 'vectors-zh-en.txt'`