This repository has been archived by the owner on Oct 31, 2023. It is now read-only.
My training language pair is zh-en, and as we all know, they are unrelated.
According to my understanding, I did the following things:
First, apply BPE to the two languages separately.
Second, run fastText on each language separately to get the embeddings zh.vec and en.vec.
Third, train MUSE on zh.vec and en.vec to get vectors-zh.txt, vectors-en.txt, and vectors-zh-en.txt. vectors-zh.txt and vectors-en.txt are aligned by MUSE; vectors-zh-en.txt is formed by concatenating vectors-zh.txt and vectors-en.txt.
Is there any problem with what I did above?
Should I put the two language embeddings together when running the code? Which of the following input arguments should I choose, 1) or 2)?
I would suggest trying the following: concatenate the En and Zh corpora (in equal proportion), shuffle the output, and train fastText on that. This provides higher-quality cross-lingual embeddings when the languages are related. Here they are not related, but you may still have some English words in your Chinese corpus, which may help the alignment. Then you can use 2).
Otherwise, if the above does not work, go back to 1): train your embeddings separately and align them with MUSE, as you did.
It's not easy to say which of these two is best; it is really empirical and will depend on the vocabulary overlap between your languages (which is not necessarily negligible, even for en-zh).
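The corpus-mixing suggestion above can be sketched as follows. `mix_corpora` is a hypothetical helper, not part of any codebase; the real pipeline would write the mixed lines to a file and train fastText on it.

```python
import random

def mix_corpora(lines_a, lines_b, seed=0):
    """Combine two monolingual corpora in equal proportion and shuffle.

    Truncates the larger corpus so both sides contribute the same
    number of lines, then shuffles the union so fastText sees the two
    languages interleaved rather than as two contiguous blocks.
    """
    n = min(len(lines_a), len(lines_b))
    mixed = lines_a[:n] + lines_b[:n]
    rng = random.Random(seed)  # fixed seed for a reproducible shuffle
    rng.shuffle(mixed)
    return mixed
```

Training fastText on the resulting file yields a single joint embedding file covering both vocabularies, i.e. the input for option 2).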
1) `--share_lang_emb False --share_output_emb False --pretrained_emb 'vectors-zh.txt,vectors-en.txt'`
2) `--share_lang_emb True --share_output_emb True --pretrained_emb 'vectors-zh-en.txt'`