# EA Assignment 08 - MultiLingual WordEmbeddings
__Authored by: Álvaro Bartolomé del Canto (alvarobartt @ GitHub)__

---

<img src="https://media-exp1.licdn.com/dms/image/C561BAQFjp6F5hjzDhg/company-background_10000/0?e=2159024400&v=beta&t=OfpXJFCHCqdhcTu7Ud-lediwihm0cANad1Kc_8JcMpA">

This Jupyter Notebook is part of the Future Work, so as to test how good do trained Word Embeddings work, which in this case we will be testing the [facebookresearch/MUSE](https://github.com/facebookresearch/MUSE) ones. We will use the multilingual fastText Wikipedia supervised word embeddings for Spanish and English, aligned in a single vector space.

## Loading Word Embeddings

__Reproducibility Warning__: you will not find the trained fastText word embeddings from MUSE, since they have been included in the .gitignore file since they are too big for GitHub, so uploading them is not possible due to the established space quotas.

Anyway, you can just download them using the following script from the `research/` directory:

```
mkdir data
cd data/
curl -Lo wiki.multi.en.vec https://dl.fbaipublicfiles.com/arrival/vectors/wiki.multi.en.vec
curl -Lo wiki.multi.es.vec https://dl.fbaipublicfiles.com/arrival/vectors/wiki.multi.es.vec
curl -Lo wiki.multi.fr.vec https://dl.fbaipublicfiles.com/arrival/vectors/wiki.multi.fr.vec
```

__Note__: we will be using some functions provided in the multilingual demo available at: https://github.com/facebookresearch/MUSE/blob/master/demo.ipynb

In [64]:
import io
import numpy as np

In [118]:
def load_vec(emb_path, nmax=2000000):
    vectors = []
    word2id = {}
    with io.open(emb_path, 'r', encoding='utf-8', newline='\n', errors='ignore') as f:
        next(f)
        for i, line in enumerate(f):
            word, vect = line.rstrip().split(' ', 1)
            vect = np.fromstring(vect, sep=' ')
            assert word not in word2id, 'word found twice'
            vectors.append(vect)
            word2id[word] = len(word2id)
            if len(word2id) == nmax:
                break
    id2word = {v: k for k, v in word2id.items()}
    embeddings = np.vstack(vectors)
    return embeddings, id2word, word2id

In [119]:
def get_nn(word, src_emb, src_id2word, tgt_emb, tgt_id2word):
    word2id = {v: k for k, v in src_id2word.items()}
    word_emb = src_emb[word2id[word]]
    scores = (tgt_emb / np.linalg.norm(tgt_emb, 2, 1)[:, None]).dot(word_emb / np.linalg.norm(word_emb))
    scores = scores.argsort()[::-1]
    return tgt_id2word[scores[0]]

In [120]:
en_path = 'data/wiki.multi.en.vec'
es_path = 'data/wiki.multi.es.vec'
fr_path = 'data/wiki.multi.fr.vec'

In [121]:
en_embeddings, en_id2word, en_word2id = load_vec(en_path)
es_embeddings, es_id2word, es_word2id = load_vec(es_path)
fr_embeddings, fr_id2word, fr_word2id = load_vec(es_path)

In [122]:
es_word = 'futbolista'
best = get_nn(es_word, es_embeddings, es_id2word, en_embeddings, en_id2word)
best

'footballer'

---

## Loading PreProcessed Data

__Reproducibility Warning__: you will not find the `PreProcessedDocuments.jsonl` file when cloning the repository from GitHub, since it has been included in the .gitignore file due to the GitHub quotas when uploading big files. So on, if you want to reproduce this Jupyter Notebook, please refer to `02 - Data Preprocessing.ipynb` where the NLP preprocessing pipeline is explained and this file is generated.

In [95]:
import json

data = list()

with open('PreProcessedDocuments.jsonl', 'r') as f:
    for line in f.readlines():
        data.append(json.loads(line))

In [96]:
import pandas as pd

data = pd.DataFrame(data)
data.head()

Unnamed: 0,lang,context,preprocessed_text
0,en,wikipedia,watchmen twelve issue comic book limited serie...
1,en,wikipedia,citigroup center formerly citicorp center tall...
2,en,wikipedia,birth_place death_date death_place party conse...
3,en,wikipedia,marbod maroboduus born died king marcomanni no...
4,en,wikipedia,sylvester medal bronze medal awarded every yea...


In [97]:
with open('resources/id2context.json', 'r') as f:
    ID2CONTEXT = json.load(f)
ID2CONTEXT

{'3': 'wikipedia', '1': 'conference_papers', '0': 'apr', '2': 'pan11'}

In [98]:
CONTEXT2ID = {value: int(key) for key, value in ID2CONTEXT.items()}
CONTEXT2ID

{'wikipedia': 3, 'conference_papers': 1, 'apr': 0, 'pan11': 2}

In [99]:
data['tokenized_text'] = data['preprocessed_text'].str.split(' ')

---

## MUSE Embedding Vector Average

In [126]:
def doc2vector(doc, vec_map):
    vector = np.zeros((300,), dtype=np.float64)
    for token in doc:
        if token.lower() in vec_map:
            vector += vec_map[token.lower()]
    vector /= len(doc)
    return np.array(vector)

In [127]:
en = data[data['lang'] == 'en']

X_en, y_en = list(), list()

for index, row in en.iterrows():
    try:
        X_en.append(doc2vector(doc=row['tokenized_text'], vec_map=en_word2id))
        y_en.append(CONTEXT2ID[row['context']])
    except Exception as e:
        continue

In [128]:
es = data[data['lang'] == 'es']

X_es, y_es = list(), list()

for index, row in es.iterrows():
    try:
        X_es.append(doc2vector(doc=row['tokenized_text'], vec_map=es_word2id))
        y_es.append(CONTEXT2ID[row['context']])
    except:
        continue

In [129]:
fr = data[data['lang'] == 'fr']

X_fr, y_fr = list(), list()

for index, row in fr.iterrows():
    try:
        X_fr.append(doc2vector(doc=row['tokenized_text'], vec_map=fr_word2id))
        y_fr.append(CONTEXT2ID[row['context']])
    except:
        continue

In [130]:
X = np.asarray(X_en + X_es + X_fr)
y = np.asarray(y_en + y_es + y_fr)

In [131]:
X.shape, y.shape

((23011, 300), (23011,))

In [133]:
from sklearn.model_selection import StratifiedShuffleSplit

train_test = StratifiedShuffleSplit(n_splits=5, test_size=.2)

for train_index, test_index in train_test.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

---

## Model Training

In [134]:
from sklearn.svm import LinearSVC

model = LinearSVC()
model.fit(X_train, y_train)



LinearSVC()

In [135]:
model.score(X_test, y_test)

0.5904844666521833

---

## Language Detection

In order to use the trained MUSE multilingual word embeddings, we will need to identify the language of the text, so we will be using the Python library `langdetect` which seems to work pretty fine. Anyway, so as to check that it is consisten enought, we will just discard the detected languages that either do not match the known language or the ones that are none of the available languages: English (en), French (fr) and Spanish (es).

In [15]:
from langdetect import detect

In [16]:
detect('i would love to work at ea')

'en'

In [17]:
detect('me encantaria trabajar en ea')

'es'

In [18]:
detect('je adorerais travailler chez ea')

'fr'