This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

What should I do if the languages are unrelated #44

Closed
socaty opened this issue Nov 25, 2018 · 2 comments

Comments

@socaty

socaty commented Nov 25, 2018

Hi @glample

My training languages are zh-en, which, as we all know, are unrelated.
As I understand it, the procedure is as follows, and this is what I did:

  1. First, apply BPE to the two languages separately.
  2. Second, run fastText on the two languages separately to get the embeddings zh.vec and en.vec.
  3. Third, train MUSE on zh.vec and en.vec to get vectors-zh.txt, vectors-en.txt, and vectors-zh-en.txt. vectors-zh.txt and vectors-en.txt are aligned by MUSE. vectors-zh-en.txt is formed by joining vectors-zh.txt and vectors-en.txt together.

Is there any problem with the steps above?
Should I put the two language embeddings together when running the code? Which of the following sets of input arguments should I choose, 1 or 2?

  1. --share_lang_emb False --share_output_emb False --pretrained_emb 'vectors-zh.txt,vectors-en.txt'
  2. --share_lang_emb True --share_output_emb True --pretrained_emb 'vectors-zh-en.txt'
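
For reference, the joining step in item 3 could look roughly like the sketch below. It assumes both files use the word2vec/fastText text layout (a `count dim` header line, then one `word v1 v2 ...` entry per line); `merge_vec_lines` is a hypothetical helper name, not part of MUSE or of this repository, and keeping the first vector for a token that appears in both files is an arbitrary choice.

```python
def merge_vec_lines(lines_a, lines_b):
    """Merge two embedding files given as lists of text lines.

    Assumed layout (not confirmed by the thread): a "count dim" header,
    followed by one "word v1 v2 ..." entry per line.
    """
    merged = {}
    dim = None
    for lines in (lines_a, lines_b):
        file_dim = int(lines[0].split()[1])
        if dim is None:
            dim = file_dim
        elif file_dim != dim:
            raise ValueError("embedding dimensions differ")
        for line in lines[1:]:
            line = line.rstrip("\n")
            if not line:
                continue
            word = line.split(" ", 1)[0]
            # Keep the first vector seen when a token occurs in both files.
            merged.setdefault(word, line)
    # Rebuild the header so the count matches the merged vocabulary.
    return [f"{len(merged)} {dim}"] + list(merged.values())
```

Reading `vectors-zh.txt` and `vectors-en.txt` with `open(..., encoding="utf-8").readlines()` and writing the returned lines joined by newlines would produce a file in the same layout as `vectors-zh-en.txt`.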
@glample
Contributor

glample commented Nov 25, 2018

I would suggest trying this: concatenate the En and Zh corpora (in equal proportion), shuffle the result, and train fastText on it. This gives higher-quality cross-lingual embeddings when the languages are related. Here they are not related, but you may have some English words in your Chinese corpus, which may help the alignment. Then you can use option 2).

If that does not work, go back to option 1): train your embeddings separately and align them with MUSE, as you did.

It's not easy to say which of the two is best; it is really empirical and depends on the overlap between your languages (which is not necessarily negligible, even for en-zh).
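
The concatenate-and-shuffle suggestion can be sketched as follows. This is a minimal illustration, not code from the repository: `mix_corpora` is a hypothetical name, and "equal proportion" is interpreted here as an equal number of sentences from each language (balancing by token count would be another reasonable reading).

```python
import random


def mix_corpora(lines_en, lines_zh, seed=0):
    """Return a shuffled mix with the same number of lines per language.

    Assumption: one sentence per line, already tokenized / BPE-encoded.
    """
    rng = random.Random(seed)
    # Downsample the larger corpus so both languages contribute equally.
    n = min(len(lines_en), len(lines_zh))
    mixed = rng.sample(lines_en, n) + rng.sample(lines_zh, n)
    # Shuffle so the two languages are interleaved in the training file.
    rng.shuffle(mixed)
    return mixed
```

The mixed lines can then be written to a single text file and passed to fastText, for example with its `skipgram` command and the `-input` / `-output` options; the resulting single embedding file corresponds to option 2).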

@socaty
Author

socaty commented Nov 27, 2018

Thanks a lot for your response!

@socaty socaty closed this as completed Nov 27, 2018