In [1]:
import pandas as pd
import numpy as np

from gensim.models import KeyedVectors
import pickle

from library import w2v_modeling as wm, utils_wt as uwt

In [2]:
wm.build_w2v_model()

Start building Word2Vec models...
The length of the english to indonesia training dictionary is 5000
The length of the english to indonesia test dictionary is 1500
Models have been built and stored into model directory!


In [2]:
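# Load the pickled embedding subsets; the context managers close the files afterwards
with open("model/en_embeddings.p", "rb") as f:
    en_embeddings_subset = pickle.load(f)
with open("model/id_embeddings.p", "rb") as f:
    id_embeddings_subset = pickle.load(f)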

In [3]:
en_id_train = uwt.get_dict_en_id('RAW/train_test/en-id.train.txt')
print('The length of the English to Indonesian training dictionary is', len(en_id_train))
en_id_test = uwt.get_dict_en_id('RAW/train_test/en-id.test.txt')
print('The length of the English to Indonesian test dictionary is', len(en_id_test))

The length of the English to Indonesian training dictionary is 5000
The length of the English to Indonesian test dictionary is 1500


In [4]:
X_train, Y_train = uwt.get_matrices(en_id_train, id_embeddings_subset, en_embeddings_subset)

In [5]:
R_train = uwt.align_embeddings(X_train, Y_train, train_steps=400, learning_rate=0.95)

loss at iteration 0 is: 989.8128
loss at iteration 25 is: 87.0141
loss at iteration 50 is: 30.7166
loss at iteration 75 is: 19.0784
loss at iteration 100 is: 15.7499
loss at iteration 125 is: 14.5854
loss at iteration 150 is: 14.1179
loss at iteration 175 is: 13.9113
loss at iteration 200 is: 13.8134
loss at iteration 225 is: 13.7644
loss at iteration 250 is: 13.7387
loss at iteration 275 is: 13.7249
loss at iteration 300 is: 13.7171
loss at iteration 325 is: 13.7126
loss at iteration 350 is: 13.7100
loss at iteration 375 is: 13.7084


In [6]:
X_test, Y_test = uwt.get_matrices(en_id_test, id_embeddings_subset, en_embeddings_subset)

---
# MODEL ACCURACY

$$\text{accuracy}=\frac{\#(\text{correct predictions})}{\#(\text{total predictions})}$$
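As a rough illustration of this formula, the sketch below counts a prediction as correct when the row of `Y` most similar (by cosine similarity) to a transformed row of `X @ R` has the same row index. The function name `accuracy_sketch` and the toy matrices are made up for illustration; whether `uwt.test_vocabulary` computes accuracy in exactly this way is an assumption.

```python
import numpy as np

def accuracy_sketch(X, Y, R):
    """Fraction of rows of X @ R whose most similar row in Y has the same index."""
    pred = X @ R
    # Normalize rows so that a dot product equals the cosine similarity.
    pred_n = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    Y_n = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    sims = pred_n @ Y_n.T                    # sims[i, j] = cos(pred_i, Y_j)
    predicted_idx = np.argmax(sims, axis=1)  # 1-nearest neighbor per row
    correct = np.sum(predicted_idx == np.arange(len(Y)))
    return correct / len(Y)

# Toy data (hypothetical): the third row is mapped closer to row 0 than to row 2.
X_toy = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]])
Y_toy = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]])
R_toy = np.eye(3)

acc_toy = accuracy_sketch(X_toy, Y_toy, R_toy)
print("toy accuracy: {:.3f}".format(acc_toy))  # 2 of 3 rows are matched correctly
```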

In [7]:
acc = uwt.test_vocabulary(X_test, Y_test, R_train)
print("accuracy on test set is {:.3f}%".format(acc*100))

accuracy on test set is 47.869%


### The model translates English words into Indonesian with almost 48% accuracy, using only basic linear algebra: it learns a linear mapping from the English embedding space to the Indonesian one.

---
---
# FORMULAS EXPLANATION

## 1.1 Generate embedding and transform matrices

- The `get_matrices` function takes the loaded data and returns the matrices `X` and `Y`.
- Each row of `X` is the word embedding of an English word, and the same row of `Y` is the word embedding of the Indonesian translation of that word.
- The `en_id` dictionary is used to ensure that the $i$th row of `X` corresponds to the $i$th row of `Y`.
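The row alignment described above can be sketched as follows. This is a minimal illustration, not the code in `utils_wt`: it assumes `en_id` is a plain dict mapping an English word to its Indonesian translation and that both embedding objects are dict-like mappings from a word to a 1-D NumPy vector. The function name `get_matrices_sketch` and the toy vectors are hypothetical; the argument order mirrors the call in the notebook.

```python
import numpy as np

def get_matrices_sketch(en_id, id_embeddings, en_embeddings):
    """Stack aligned embedding rows for every word pair present in both embedding sets."""
    X_rows, Y_rows = [], []
    for en_word, id_word in en_id.items():
        # Skip pairs for which either word has no embedding.
        if en_word in en_embeddings and id_word in id_embeddings:
            X_rows.append(en_embeddings[en_word])
            Y_rows.append(id_embeddings[id_word])
    return np.vstack(X_rows), np.vstack(Y_rows)

# Toy embeddings, made up for illustration only.
en_emb = {"cat": np.array([1.0, 0.0]), "dog": np.array([0.0, 1.0])}
id_emb = {"kucing": np.array([0.9, 0.1]), "anjing": np.array([0.1, 0.9])}
pairs = {"cat": "kucing", "dog": "anjing", "bird": "burung"}  # "bird" has no embedding

X_demo, Y_demo = get_matrices_sketch(pairs, id_emb, en_emb)
print(X_demo.shape, Y_demo.shape)  # (2, 2) (2, 2)
```

Dropping pairs with a missing embedding keeps `X` and `Y` the same height, which the loss and gradient computations rely on.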

## 1.2 Compute the loss

* The loss function is the squared Frobenius norm of the difference between the matrix $\mathbf{Y}$ and its approximation $\mathbf{XR}$, divided by the number of training examples $m$.
* Its formula is: $$ L(X, Y, R)=\frac{1}{m}\sum_{i=1}^{m} \sum_{j=1}^{n}\left( a_{i j} \right)^{2}$$ where $a_{i j}$ is the value in the $i$th row and $j$th column of the matrix $\mathbf{XR}-\mathbf{Y}$, and $n$ is its number of columns (the embedding dimension).
* This formula is implemented in the `compute_loss()` function, in three steps:
* Compute the approximation of `Y` by matrix-multiplying `X` and `R`
* Compute the difference `XR - Y`
* Compute the squared Frobenius norm of the difference and divide it by $m$
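The three steps can be sketched in a few lines of NumPy. The name `compute_loss_sketch` and the toy matrices are hypothetical; the actual `compute_loss()` in `utils_wt` may differ in detail.

```python
import numpy as np

def compute_loss_sketch(X, Y, R):
    """Squared Frobenius norm of XR - Y, divided by the number of rows m."""
    m = X.shape[0]
    diff = X @ R - Y              # steps 1 and 2: approximation and difference
    return np.sum(diff ** 2) / m  # step 3: squared Frobenius norm over m

# Toy example: XR - Y = [[0, -1], [0, -1]], so the loss is (1 + 1) / 2.
X_demo = np.array([[1.0, 0.0], [0.0, 1.0]])
Y_demo = np.array([[1.0, 1.0], [0.0, 2.0]])
R_demo = np.eye(2)

loss_demo = compute_loss_sketch(X_demo, Y_demo, R_demo)
print(loss_demo)  # 1.0
```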

## 1.3 Compute the gradient of the loss with respect to the transformation matrix R

* The formula for the gradient of the loss function $L(X,Y,R)$ is: $$\frac{d}{dR}L(X,Y,R)=\frac{d}{dR}\Big(\frac{1}{m}\| X R -Y\|_{F}^{2}\Big) = \frac{2}{m}X^{T} (X R - Y)$$
* Here $m$ is the number of training examples (the number of rows in $X$).
* The gradient is a matrix that encodes how much a small change in each entry of `R` changes the loss.
* The gradient points in the direction in which the loss increases fastest, so `R` is moved in the opposite direction to reduce the loss.
* The `compute_gradient()` function uses this formula to calculate the gradient of the loss with respect to the transformation matrix `R`.
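A minimal sketch of the gradient formula, together with a finite-difference check on one entry of `R`, is shown below. The function names and the random toy data are hypothetical and only illustrate the formula; they are not taken from `utils_wt`.

```python
import numpy as np

def compute_loss_sketch(X, Y, R):
    """Loss (1/m) * ||XR - Y||_F^2."""
    m = X.shape[0]
    return np.sum((X @ R - Y) ** 2) / m

def compute_gradient_sketch(X, Y, R):
    """Gradient of the loss with respect to R: (2/m) * X^T (XR - Y)."""
    m = X.shape[0]
    return (2.0 / m) * (X.T @ (X @ R - Y))

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5, 3))
Y_demo = rng.normal(size=(5, 3))
R_demo = rng.normal(size=(3, 3))

grad = compute_gradient_sketch(X_demo, Y_demo, R_demo)

# Central finite difference on the single entry R[0, 1] as a sanity check.
eps = 1e-6
R_plus, R_minus = R_demo.copy(), R_demo.copy()
R_plus[0, 1] += eps
R_minus[0, 1] -= eps
numeric = (compute_loss_sketch(X_demo, Y_demo, R_plus)
           - compute_loss_sketch(X_demo, Y_demo, R_minus)) / (2 * eps)

print(grad[0, 1], numeric)  # the two values should agree closely
```

The gradient has the same shape as `R`, so it can be subtracted from `R` entry by entry during gradient descent.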

## 1.4 Find the optimal R with Gradient Descent Algorithm

* [Gradient descent](https://ml-cheatsheet.readthedocs.io/en/latest/gradient_descent.html) is an iterative algorithm for finding an optimum (here, a minimum) of a function.
* As mentioned earlier, the gradient of the loss with respect to the matrix encodes how much a tiny change in each coordinate of that matrix changes the loss.
* Gradient descent uses that information to update the matrix `R` iteratively until the loss stops decreasing.
* At each step, calculate the gradient $g$ of the loss with respect to the matrix $R$.
* Then update $R$ with the formula: $$R_{\text{new}}= R_{\text{old}}-\alpha g$$
* Here $\alpha$ is the learning rate, which is a scalar.
* The learning rate, or "step size", $\alpha$ is a coefficient that determines how much $R$ changes in each step.
* If we change $R$ too much, we can overshoot the optimum by taking too large a step.
* If we make only small changes to $R$, we need many steps to reach the optimum.
* The value of $\alpha$ is chosen depending on the problem.

Using the training set, the transformation matrix $\mathbf{R}$ can be found by calling the function `align_embeddings()`; the notebook above calls it with `train_steps=400` and `learning_rate=0.95`.
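The update rule can be sketched as a short loop. This is an illustration under stated assumptions, not the code behind `align_embeddings()`: the random initialization of `R`, the `seed` parameter, the function name `align_embeddings_sketch`, and the synthetic data are all hypothetical.

```python
import numpy as np

def align_embeddings_sketch(X, Y, train_steps=100, learning_rate=0.1, seed=0):
    """Gradient descent on (1/m) * ||XR - Y||_F^2, starting from a random R."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    R = rng.normal(size=(X.shape[1], Y.shape[1]))  # assumed random start
    for _ in range(train_steps):
        gradient = (2.0 / m) * (X.T @ (X @ R - Y))
        R = R - learning_rate * gradient           # R_new = R_old - alpha * g
    return R

# Synthetic problem: Y is an exact linear transformation of X.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 3))
R_true = rng.normal(size=(3, 3))
Y_demo = X_demo @ R_true

R_learned = align_embeddings_sketch(X_demo, Y_demo, train_steps=500, learning_rate=0.1)
final_loss = np.sum((X_demo @ R_learned - Y_demo) ** 2) / X_demo.shape[0]
print(final_loss)  # close to zero on this noiseless toy problem
```

On real embeddings the relationship is only approximately linear, so the loss levels off at a positive value, as in the training log above.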

## 1.5 Test the translation using _K-Nearest Neighbors Algorithm_ with _Cosine Similarity_

* Since we approximate the translation from English to Indonesian embeddings by a linear transformation matrix $\mathbf{R}$, most of the time we won't get the exact embedding of an Indonesian word when we transform the embedding $\mathbf{e}$ of a particular English word into the Indonesian embedding space.
* This is where $k$-NN becomes useful. Using $1$-NN with $\mathbf{eR}$ as input, we can search for the embedding $\mathbf{f}$ (a row of the matrix $\mathbf{Y}$) that is closest to the transformed vector $\mathbf{eR}$.
* This search is implemented in the `nearest_neighbors()` function
<br><br><br>
The cosine similarity between vectors $u$ and $v$ is the cosine of the angle between them.
The formula is $$\cos(u,v)=\frac{u\cdot v}{\left\|u\right\|\left\|v\right\|}$$
* $\cos(u,v)$ is $1$ when $u$ and $v$ point in the same direction
* $\cos(u,v)$ is $-1$ when they point in exactly opposite directions
* $\cos(u,v)$ is $0$ when the vectors are orthogonal (perpendicular) to each other
* This formula is implemented in the `cosine_similarity()` function
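The cosine similarity formula and the nearest-neighbor search can be sketched together as below. The names `cosine_similarity_sketch` and `nearest_neighbor_sketch`, the parameter `k`, and the toy vectors are hypothetical; the signatures of the real `cosine_similarity()` and `nearest_neighbors()` functions are not shown in this notebook and may differ.

```python
import numpy as np

def cosine_similarity_sketch(u, v):
    """Cosine of the angle between vectors u and v."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def nearest_neighbor_sketch(v, candidates, k=1):
    """Indices of the k rows of `candidates` most similar to v, most similar first."""
    sims = np.array([cosine_similarity_sketch(v, row) for row in candidates])
    return np.argsort(-sims)[:k]

# The three special cases of the cosine similarity.
same = cosine_similarity_sketch(np.array([1.0, 0.0]), np.array([2.0, 0.0]))
opposite = cosine_similarity_sketch(np.array([1.0, 0.0]), np.array([-1.0, 0.0]))
orthogonal = cosine_similarity_sketch(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
print(same, opposite, orthogonal)

# 1-NN search: the transformed vector eR is closest in angle to row 0 of Y.
Y_demo = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
eR_demo = np.array([0.9, 0.2])
best = nearest_neighbor_sketch(eR_demo, Y_demo, k=1)
print(best)  # [0]
```

Because cosine similarity ignores vector length, the search compares directions only, so a transformed vector does not need to match the norm of the target embedding.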