### MUSE and fastText Colab setup

This notebook provides an environment for training unsupervised MUSE models. Run the setup cells, upload the Europarl corpora you want to align, and run your training code in the appropriate cells at the bottom.

NOTE: We suggest running this notebook in GPU mode (Runtime --> Change Runtime Type --> GPU). fastText is CPU-only, but MUSE is very slow on CPU.

Total time to train everything should be about 1.5 hours.

#### Submission instructions

Use this training code to complete the MUSE-related questions for Lab 1. Submit this Colab notebook alongside your completed lab1.ipynb.

In [None]:
import torch
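
Before starting the longer training cells, it can help to confirm that the GPU runtime is actually active. The following is a minimal sketch, not part of the original lab: `describe_device` is a hypothetical helper name, and it only relies on `torch.cuda.is_available()` and `torch.cuda.get_device_name()`.

```python
def describe_device():
    """Return a short description of the accelerator PyTorch can see."""
    try:
        import torch
    except ImportError:
        # Colab ships with PyTorch, but guard the import for other environments.
        return "PyTorch is not installed"
    if torch.cuda.is_available():
        # Device 0 is the single GPU that a Colab runtime exposes.
        return "GPU: " + torch.cuda.get_device_name(0)
    return "CPU only"

device_summary = describe_device()
print(device_summary)
```

If this prints "CPU only", switch the runtime type before running the MUSE cell.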

In [None]:
!git clone https://github.com/facebookresearch/MUSE.git

Download the evaluation data for MUSE

#### [NOTE: We ran into some issues getting MUSE to use the evaluation data correctly. You should be able to skip this cell, with no impact on training quality, if you follow the note about commenting out a line in the evaluator.py file.]

Takes a few minutes to download

In [None]:

!cd ./MUSE/data/; chmod +x get_evaluation.sh
!cd ./MUSE/; ./data/get_evaluation.sh

In [None]:
# get fastText

! wget https://github.com/facebookresearch/fastText/archive/v0.9.2.zip
! unzip v0.9.2.zip
! cd fastText-0.9.2; make


In [None]:
#get europarl fr_en
!wget https://www.statmt.org/europarl/v7/fr-en.tgz

In [None]:
#unpack
!tar -xvzf fr-en.tgz

#### Run your training code:

To run the training scripts, prefix each command with a ! so that it runs as a Linux shell command.

With the Europarl corpora we will first build the fastText embeddings (see below). Once the embeddings are created, we will feed them to MUSE.

fastText does not need a GPU, and each run takes about 30 minutes. If you are concerned about your GPU quota, you can run fastText in a CPU-only notebook, save the files it creates (MUSE/en.vec and MUSE/fr.vec), and then move them to a GPU notebook to run MUSE.


In [None]:
### FastText Here 
!./fastText-0.9.2/fasttext skipgram -input europarl-v7.fr-en.en -output MUSE/en 
!./fastText-0.9.2/fasttext skipgram -input europarl-v7.fr-en.fr -output MUSE/fr
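
Each `.vec` file that fastText writes is plain text: the first line holds the vocabulary size and the vector dimension, and every later line holds a word followed by its vector components. The sketch below, which is an illustration rather than part of the lab code, reads that header so you can check that the dimension matches the `--emb_dim 100` passed to MUSE (100 is fastText's default `-dim`). `read_vec_header` is a hypothetical helper, and the demo file is hand-written so the cell runs without the real embeddings.

```python
import os
import tempfile

def read_vec_header(path):
    """Return (vocab_size, dim) from the first line of a fastText-style .vec file."""
    with open(path, "r", encoding="utf-8") as f:
        vocab_size, dim = f.readline().split()
    return int(vocab_size), int(dim)

# Demo on a tiny hand-written file in the same layout as a real .vec file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.vec")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("2 3\nthe 0.1 0.2 0.3\nvote 0.4 0.5 0.6\n")
    demo_header = read_vec_header(demo_path)

print(demo_header)
```

On the real files you would call `read_vec_header("MUSE/en.vec")` and `read_vec_header("MUSE/fr.vec")` once training has finished.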


In [None]:
#https://github.com/facebookresearch/MUSE/blob/master/demo.ipynb
import io
import numpy as np

def load_vec(emb_path, nmax=50000):
    vectors = []
    word2id = {}
    with io.open(emb_path, 'r', encoding='utf-8', newline='\n', errors='ignore') as f:
        next(f)
        for i, line in enumerate(f):
            word, vect = line.rstrip().split(' ', 1)
            vect = np.fromstring(vect, sep=' ')
            assert word not in word2id, 'word found twice'
            vectors.append(vect)
            word2id[word] = len(word2id)
            if len(word2id) == nmax:
                break
    id2word = {v: k for k, v in word2id.items()}
    embeddings = np.vstack(vectors)
    return embeddings, id2word, word2id

# Modified from the MUSE demo to return the results as a list
def get_nn(word, src_emb, src_id2word, tgt_emb, tgt_id2word, K=5):
    print("Nearest neighbors of \"%s\":" % word)
    word2id = {v: k for k, v in src_id2word.items()}
    word_emb = src_emb[word2id[word]]
    # cosine similarity between the query vector and every target vector
    scores = (tgt_emb / np.linalg.norm(tgt_emb, 2, 1)[:, None]).dot(word_emb / np.linalg.norm(word_emb))
    # indices of the K highest scores, best first
    k_best = scores.argsort()[-K:][::-1]
    result = []
    for idx in k_best:
        result.append((scores[idx], tgt_id2word[idx]))
    return result
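
`get_nn` ranks target words by cosine similarity to the query vector. As a rough illustration of that ranking, here is a dependency-free sketch on a made-up two-dimensional vocabulary; `cosine`, `top_k`, and `toy_vocab` are hypothetical names, and the real function does the same computation with vectorized NumPy operations.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query, vocab, k=2):
    """Return the k (score, word) pairs with the highest cosine similarity."""
    scored = [(cosine(query, vec), word) for word, vec in vocab.items()]
    return sorted(scored, reverse=True)[:k]

# Made-up toy vectors, not real embeddings.
toy_vocab = {
    "vote": [1.0, 0.0],
    "ballot": [0.9, 0.1],
    "fish": [0.0, 1.0],
}
neighbours = top_k(toy_vocab["vote"], toy_vocab, k=2)
print(neighbours)
```

When the source and target embeddings are the same space, the query word comes back as its own best match with a score of 1, so a monolingual lookup needs K of at least 2 to return a different word.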

In [None]:
# load english and french word embeddings
MUSE_PATH = "MUSE"
en_embeddings, en_id2word, en_word2id = load_vec(MUSE_PATH + "/en.vec", nmax=50000)
fr_embeddings, fr_id2word, fr_word2id = load_vec(MUSE_PATH + "/fr.vec", nmax=50000)

You can use the get_nn function as follows (K is the number of results; feel free to increase it). Do this for the English words (Minutes, minutes, vote) and the French words (vous, intervienne, accord).

In [None]:
print('most similar word to Minutes is %s'%get_nn('Minutes', en_embeddings, en_id2word, en_embeddings, en_id2word, K=2))

## TO COMPLETE*** GET REST OF WORDS

In [None]:
# FAISS is a library for fast similarity search that MUSE can use to speed up nearest-neighbor lookups; this is how you can install and import it.
!apt install libomp-dev
!python -m pip install --upgrade faiss faiss-gpu
import faiss

Now we are going to run MUSE.
Note: We found an issue with running the evaluation parts of the training. To get around it, comment out line 217 in /MUSE/src/evaluation/evaluator.py:
 self.word_translation(to_log)

### Training should take around 30 minutes on GPU, so plan accordingly.

In [None]:
### MUSE Here
%cd MUSE
!python unsupervised.py --src_lang fr --tgt_lang en --src_emb fr.vec --tgt_emb en.vec --n_refinement 5 --emb_dim 100 --dis_most_frequent 0



In [None]:
en_embeddings, en_id2word, en_word2id = load_vec("path/to/en_vectors.txt", nmax=50000)
fr_embeddings, fr_id2word, fr_word2id = load_vec("path/to/fr_vectors.txt", nmax=50000)

### TO COMPLETE*** Get the nearest neighbors (get_nn) of 'disaster', 'vote', 'excessively', and any other words that you want to compare
